CS4642 - Data Mining & Information
Retrieval
Paper Based on KDDCup 2014 Submission
Group Members:
100227D - Jayaweera W.J.A.I.U.
100470N - Sajeewa G.K.M.C
100476M - Sampath P.L.B.
100612E - Wijewardane M.M.D.T.K.
Group Number : 13
Final Group Rank : 76
Description of Data
In this competition, five data files were made available to competitors: donations (information about the donations to each project; provided only for projects in the training set), essays (project text posted by the teachers; provided for both the training and test sets), projects (information about each project; provided for both the training and test sets), resources (information about the resources requested for each project; provided for both the training and test sets) and outcomes (information about the outcomes of projects in the training set). Before starting the knowledge discovery process, the provided data were analyzed.
First of all, the number of records in each file was counted to get an idea of the amount of data available. The projects file has 664098 records, the essays file has 664098 records, the outcomes file has 619326 records, the resources file has 3667217 records and the donations file has 3097989 records. Our next task was to identify the criterion used to differentiate test data from training data. After reading the competition details we realized that projects posted on or after 2014-01-01 belong to the test set and projects posted before 2014-01-01 belong to the training set. Accordingly, 619326 projects are available for the training set and the remaining 44772 projects form the test set. For each project in the training set, the project description, essay data, resources requested, donations provided and outcome are given. For each project in the test set, only the project description, essay data and resources requested are given.
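The sketch below (in Python, using pandas) shows one way this split can be reproduced. The file and column names (projects.csv, outcomes.csv, date_posted, projectid) follow the competition's data dictionary; the code is illustrative rather than our original script.

```python
import pandas as pd

# Load the project metadata and the training outcomes.
projects = pd.read_csv("projects.csv", parse_dates=["date_posted"])
outcomes = pd.read_csv("outcomes.csv")

# Projects posted on or after 2014-01-01 form the test set; the rest are training data.
cutoff = pd.Timestamp("2014-01-01")
train_projects = projects[projects["date_posted"] < cutoff]
test_projects = projects[projects["date_posted"] >= cutoff]

# Attach the known outcomes (including 'is_exciting') to the training projects.
train = train_projects.merge(outcomes, on="projectid", how="left")
print(len(train_projects), len(test_projects))  # expected: 619326 and 44772
```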
Data Imbalance Problem
After gaining a brief understanding of the provided data we started to analyze the training set. When we plotted the projects' posted dates against the “is_exciting” attribute, we realized that there are no exciting projects before April 2010; the graph was completely skewed to the right.
This leads to a data imbalance problem, as the number of exciting projects is very small compared to the number of non-exciting projects (exciting - 5.9274%). The histogram of exciting and non-exciting projects was as follows.
In the competition forum there was an explanation for this: the organization might not have kept track of some of the requirements needed to decide ‘is_exciting’ for projects before 2010. We therefore suspected that the classification given in the outcomes file for pre-2010 projects may not be correct and decided to use a down-sampling technique to handle the imbalanced data (removing projects posted before 2010). It is true that valuable information may be lost when projects are removed, but the accuracy gained by removing that data outweighed the loss of information. We were therefore able to obtain higher accuracy by down-sampling the given data, and all the classifiers we used performed better after removing projects posted before 2010.
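A minimal sketch of this down-sampling step, continuing from the train data frame built above (the 2010-01-01 cutoff is the one described in the text):

```python
# Drop training projects posted before 2010, whose outcome labels may be unreliable.
train = train[train["date_posted"] >= pd.Timestamp("2010-01-01")]

# Check the resulting class balance of the target attribute.
print(train["is_exciting"].value_counts(normalize=True))
```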
Preprocessing Data
First we analyzed the characteristics of the data using statistical measures. Using the data frame describe method we calculated the number of records, mean, standard deviation, minimum value, maximum value and the quartile values for each attribute. Given below is the statistical summary of two attributes.
These statistical measures gave us an idea about the distribution of the attributes.
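For example, such a summary can be obtained directly from pandas (the two column names shown here are assumed from the projects data dictionary):

```python
# count, mean, std, min, quartiles and max for two numeric attributes
print(train[["total_price_excluding_optional_support", "students_reached"]].describe())
```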
Filling Missing Values
Initially we used the pad method (propagate the last valid observation forward) to fill missing values for all attributes. We then realized that higher accuracy can be achieved by selecting a filling method based on the type of each attribute. To do that we first calculated the percentage of missing values per attribute, which was as follows.
The highest percentage of missing values was for the secondary focus subject and secondary focus area, because some projects have only a primary focus area and primary focus subject. We decided to fill missing secondary values with their respective primary values. We used linear interpolation for numeric attributes and the pad method for the other attributes. Later, when tuning the classifiers, we changed the method from pad to backfill (use the next valid observation) as it gave higher accuracy than pad.
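A sketch of this attribute-specific filling strategy is given below. The column names are taken from the projects data dictionary and the fillna(method=...) form is the classic pandas API, so treat it as illustrative rather than our exact code.

```python
# Percentage of missing values per attribute.
missing_pct = train.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False).head())

# Secondary focus fields fall back to their primary counterparts.
train["secondary_focus_subject"] = train["secondary_focus_subject"].fillna(
    train["primary_focus_subject"])
train["secondary_focus_area"] = train["secondary_focus_area"].fillna(
    train["primary_focus_area"])

# Numeric attributes: linear interpolation; other attributes: backfill
# (initially pad / forward fill was used instead).
train["students_reached"] = train["students_reached"].interpolate(method="linear")
train["school_metro"] = train["school_metro"].fillna(method="bfill")
```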
Removing Outliers
When we analyzed the data, outliers were detected in some of the attributes. We used scatter plots to identify them. There were outliers in the cost-related attributes, and we replaced them with the mean value of the attribute. Given below is the outlier analysis of the cost attribute.
The value marked with the red circle can be considered an outlier, as it is far larger than the other values. These outliers caused a lot of problems when we discretized the data. To identify outliers in the resources data, we used the interquartile range as the measurement.
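A minimal sketch of this interquartile-range treatment, applied to an assumed unit-price column of the resources file (the 1.5 × IQR cutoff and the replace-with-mean rule follow the description above):

```python
import pandas as pd

def replace_outliers_with_mean(series):
    # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are replaced with the
    # mean of the remaining (non-outlier) values.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = series.between(lower, upper)
    return series.where(inliers, series[inliers].mean())

resources = pd.read_csv("resources.csv")
resources["item_unit_price"] = replace_outliers_with_mean(resources["item_unit_price"])
```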
Label Encoding
We did not use all the attributes for prediction. We focused more on repetitive features, as they help the classifier make predictions. Most of these repetitive features/attributes have string values rather than numerical values, and the available classifiers do not accept string values for features. So we used a label encoder to transform those string values into integer values between 0 and n-1, n being the number of distinct values a feature can take.
But classifiers expect continuous input and may interpret the categories as being ordered, which is not desired. To turn the categorical features into features that can be used with scikit-learn classifiers we used one-hot encoding. The encoder transformed each categorical feature with k possible values into k binary features, with only one active per sample. This improved the performance of the classifiers to a great extent. For example, the SGD classifier obtained an ROC score of about 0.55 without one-hot encoding and about 0.59 with it.
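The sketch below shows the two-step encoding with scikit-learn. The column names are illustrative, and in recent scikit-learn versions OneHotEncoder can consume string columns directly, making the explicit label-encoding step optional.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

categorical_cols = ["school_state", "primary_focus_subject", "resource_type"]

# 1. Map each string category to an integer in [0, n-1].
encoded = np.column_stack([
    LabelEncoder().fit_transform(train[col].astype(str)) for col in categorical_cols
])

# 2. Expand each integer-coded feature with k values into k binary features,
#    so the classifier does not treat the categories as ordered.
X_categorical = OneHotEncoder().fit_transform(encoded)
```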
Continuous Value Discretization
Project attributes such as school longitude, school latitude, zip code and total cost cannot be used directly for prediction as their values are unlikely to repeat. But this information cannot simply be discarded, as it may help the classifiers make decisions. To make these attributes more repetitive we used discretization: we put the continuous values into bins and used the bin index as the attribute value. For example, we discretized longitude and latitude, divided the projects into five regions (bins) and used the region id instead of the raw longitude and latitude. The discretization results for the total cost attribute are as follows.
We applied the same approach to the cost-related attributes, the item count per project, the total price of items per project, the number of projects per teacher, etc.
This improved the repetitiveness of the attributes to a great extent, and more useful information was uncovered for the classifier to use.
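A sketch of this binning with pandas is shown below; the bin counts and the choice of equal-width (cut) versus equal-frequency (qcut) bins are illustrative assumptions.

```python
import pandas as pd

# Five latitude/longitude bins act as coarse region ids.
train["lat_region"] = pd.cut(train["school_latitude"], bins=5, labels=False)
train["lon_region"] = pd.cut(train["school_longitude"], bins=5, labels=False)

# Cost-related attributes binned into equal-frequency buckets.
train["total_price_bin"] = pd.qcut(
    train["total_price_excluding_optional_support"],
    q=10, labels=False, duplicates="drop")
```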
Attribute Construction
Some of the features given in the data files cannot be used directly for various reasons (most often because they are highly non-repetitive). We used some of these features to construct new features by combining multiple features or transforming one into another. Given below is the list of derived attributes; a sketch of how they can be computed follows the list.
1. Month - the posted date of each project is given but is not repetitive. We derived a month attribute from the posted date and used it for prediction.
2. Essay length - for each project the corresponding essay is given but cannot be used directly for prediction. We therefore calculated the length of each essay after removing extra spaces within the essay text and used it as an attribute.
3. Need statement length - calculated in the same way from the need statement text.
4. Projects per teacher - we calculated the number of projects per teacher by grouping the projects by ‘teacher_acctid’ and used it as an attribute.
5. Total items per project - we calculated the total number of items requested for each project from the details provided in the resources file and used it as an attribute.
6. Cost of total items per project - we calculated the total cost of the items requested for each project from the resources file and used it as an attribute.
Several other derived attributes, such as the full date and the short description length, were considered but did not yield a significant performance improvement.
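The sketch below shows how these derived attributes can be computed with pandas. The essays and resources data frames and the column names (essay, need_statement, item_quantity, item_unit_price) follow the data dictionary; the exact aggregation code is illustrative.

```python
import pandas as pd

essays = pd.read_csv("essays.csv")

# 1. Month of posting.
train["month_posted"] = train["date_posted"].dt.month

# 2. & 3. Essay and need-statement lengths after collapsing extra whitespace.
def text_length(text):
    return len(" ".join(str(text).split()))

essays["essay_length"] = essays["essay"].apply(text_length)
essays["need_statement_length"] = essays["need_statement"].apply(text_length)

# 4. Number of projects posted by each teacher.
train["projects_per_teacher"] = train.groupby("teacher_acctid")["projectid"].transform("count")

# 5. & 6. Item count and total item cost per project from the resources file.
resources["item_cost"] = resources["item_quantity"] * resources["item_unit_price"]
per_project = resources.groupby("projectid").agg(
    total_items=("item_quantity", "sum"),
    total_item_cost=("item_cost", "sum")).reset_index()
train = train.merge(per_project, on="projectid", how="left")
```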
Model Selection and Evaluation
We used three classifiers during the project: first a decision tree classifier, then logistic regression and finally an SGD (stochastic gradient descent) classifier. We started with the tree classifier as it was easy to use. To evaluate the performance of the classifiers we initially used cross-validation, but later realized that the competition uses the ROC AUC (area under the ROC curve) score for evaluation, so we used ROC scores to evaluate the classifiers as well. As we had several candidate classifiers, we read several articles about their usage and learned that decision trees normally do not perform well under a data imbalance problem, so logistic regression was used instead.
Logistic regression performed well on the given data and achieved an ROC score of about 0.61. To improve the accuracy further we used the SGD classifier (logistic regression with SGD training). On one hand it is more efficient than plain logistic regression, so predictions can be made in less time; on the other hand it achieved higher accuracy than the logistic regression classifier. With the default parameters of the SGD classifier we achieved an ROC score of about 0.635. To tune the SGD classifier (i.e., to find the best values for its parameters) we performed a grid search and found optimum values for the number of iterations, penalty, shuffle and alpha parameters. Using those values we were able to improve the accuracy to an ROC score of about 0.64.
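A sketch of this tuning step is shown below. Here X and y stand for the encoded feature matrix and the ‘is_exciting’ labels, and the grid values are illustrative rather than the exact ones we searched (older scikit-learn versions spell the loss "log" and the iteration parameter n_iter).

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3],
    "penalty": ["l1", "l2"],
    "max_iter": [5, 20, 50],
    "shuffle": [True, False],
}

# Logistic regression trained with stochastic gradient descent,
# tuned by grid search on ROC AUC.
sgd = SGDClassifier(loss="log_loss", random_state=0)
search = GridSearchCV(sgd, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```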
Ensemble Methods
We tried a boosting algorithm to improve the performance of the classifier. Among the available methods we used AdaBoost (AdaBoostClassifier). The implementation provided by the scikit-learn library only supported the decision tree classifier and the SGD classifier as base estimators, so we were not able to use logistic regression directly. Instead we tried the SGD classifier with the boosting algorithm, but the accuracy increased only by an insignificant amount.
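A sketch of this boosting attempt is given below; X, y and X_test are the training features, labels and test features as before. The SAMME variant is assumed because it only needs class predictions from the base learner, and base_estimator is the older scikit-learn parameter name (renamed estimator in recent releases).

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier

# SGD-trained logistic regression as the boosted base learner.
base = SGDClassifier(loss="log_loss", alpha=1e-4, random_state=0)
boosted = AdaBoostClassifier(base_estimator=base, n_estimators=50, algorithm="SAMME")
boosted.fit(X, y)

# Decision scores can be used directly for the ROC AUC ranking metric.
scores = boosted.decision_function(X_test)
```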
Further Improvements
The essays file contains a huge amount of text data, but apart from the essay length it was not used in the predictions. We tried to extract essay features using TfidfVectorizer, but this was not successful due to memory constraints. As an alternative we tried hashing methods, but they reduced the accuracy. We think the accuracy of the classifier could improve further if some features from the essay data were included in the training data. The use of ensemble methods is also likely to improve the accuracy of the predictions.
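A sketch of the memory-bounded alternative we experimented with is shown below. HashingVectorizer avoids holding a vocabulary in memory, unlike TfidfVectorizer; the feature count shown is an illustrative choice.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2 ** 16, stop_words="english",
                               alternate_sign=False)
X_essays = vectorizer.transform(essays["essay"].fillna(""))
```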
Support Libraries Used
We used the ‘Pandas’ data analysis library to generate data frames from the provided comma-separated values files, so that they could be used with the other data analysis and modeling tools we employed. We also used functions provided with the ‘Pandas’ library for generating bins, in order to discretize the attributes with less repetitive values, and for merging data frames from several data sources.
We used the ‘NumPy’ extension library to generate multidimensional arrays from ‘Pandas’ data frames and data series, to make it easy to access certain ranges of data (e.g., separating the indices of the training set from the test set) and to compute properties of the data such as the median and quartiles. Functions provided with the ‘NumPy’ library were also useful when combining derived attributes with existing attributes.
The ‘Scikit-learn’ machine learning library was used to integrate data analysis, preprocessing, classification, regression and modeling tools into our implementation. From the various tools provided with ‘Scikit-learn’ we used preprocessing tools such as ‘Label Encoder’, ‘One Hot Encoder’ and ‘Standard Scaler’ and the text feature extraction tools, classification tools such as ‘Decision Tree Classifier’, ‘SGD Classifier’ and ‘Logistic Regression’, model selection and evaluation tools such as ‘Grid Search’, ensemble tools such as ‘AdaBoost Classifier’ and metrics such as ‘roc_auc_score’ to compute the area under the curve (AUC) from prediction scores, as mentioned above.
