Simpler 
Machine Learning 
with SKLL 1.0 
Dan Blanchard 
Educational Testing Service 
dblanchard@ets.org 
PyData NYC 2014
Survived Perished 
• first class, female, 1 sibling, 35 years old 
• third class, female, 2 siblings, 18 years old 
• second class, male, 0 siblings, 50 years old 
Can we predict survival from data?
SciKit-Learn Laboratory
SKLL 
It's where the learning happens
Learning to Predict Survival 
1. Split up given training set: train (80%) and dev (20%) 
$ ./make_titanic_example_data.py 
Loading train.csv... done 
Writing titanic/train/socioeconomic.csv...done 
Writing titanic/train/family.csv...done 
Writing titanic/train/vitals.csv...done 
Writing titanic/train/misc.csv...done 
Writing titanic/train+dev/socioeconomic.csv...done 
Writing titanic/train+dev/family.csv...done 
Writing titanic/train+dev/vitals.csv...done 
Writing titanic/train+dev/misc.csv...done 
Writing titanic/dev/socioeconomic.csv...done 
Writing titanic/dev/family.csv...done 
Writing titanic/dev/vitals.csv...done 
Writing titanic/dev/misc.csv...done 
Loading test.csv... done 
Writing titanic/test/socioeconomic.csv...done 
Writing titanic/test/family.csv...done 
Writing titanic/test/vitals.csv...done 
Writing titanic/test/misc.csv...done
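The 80/20 train/dev split in step 1 can be sketched in plain Python. This is a hypothetical stand-in for part of what make_titanic_example_data.py does (the real script also writes the four feature CSVs per split); the seed value is made up:

```python
import random

def train_dev_split(rows, train_frac=0.8, seed=123456789):
    """Shuffle rows deterministically and split into train/dev portions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# The Kaggle Titanic training set has 891 passengers
passengers = list(range(891))
train, dev = train_dev_split(passengers)
print(len(train), len(dev))  # 712 179
```

Note the sizes match the "train (712)" and "dev (179)" counts reported by SKLL later in the results.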
Learning to Predict Survival 
2. Pick classifiers to try: 
1. Decision Tree 
2. Naive Bayes 
3. Random Forest 
4. Support Vector Machine (SVM)
Learning to Predict Survival 
3. Create configuration file for SKLL 
[General] 
experiment_name = Titanic_Evaluate_Untuned 
task = evaluate 
[Input] 
train_directory = train 
test_directory = dev 
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] 
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", 
"MultinomialNB"] 
id_col = PassengerId 
label_col = Survived 
[Output] 
results = output 
models = output
Learning to Predict Survival 
3. Create configuration file for SKLL, annotated: 
• train_directory: directory with feature files for training the learner 
• test_directory: directory with feature files for evaluating performance 
• family.csv: # of siblings, spouses, parents, children 
• misc.csv: departure port 
• socioeconomic.csv: fare & passenger class 
• vitals.csv: sex & age 
• results: directory to store evaluation results 
• models: directory to store trained models
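Each feature file is an ordinary CSV keyed by id_col, with one column per feature. A minimal sketch of what a file like family.csv might contain (SibSp and Parch are the Kaggle Titanic column names for siblings/spouses and parents/children; treat the exact layout as an assumption and check the generated files):

```python
import csv
import io

# Hypothetical miniature family.csv: one row per passenger,
# keyed by PassengerId, with the Survived label included
raw = io.StringIO(
    "PassengerId,SibSp,Parch,Survived\n"
    "1,1,0,0\n"
    "2,1,0,1\n"
)
rows = list(csv.DictReader(raw))
print(rows[0]["SibSp"])  # prints 1 (as a string)
```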
Learning to Predict Survival 
4. Run the configuration file with run_experiment 
$ run_experiment evaluate.cfg 
Loading train/family.csv... done 
Loading train/misc.csv... done 
Loading train/socioeconomic.csv... done 
Loading train/vitals.csv... done 
Loading dev/family.csv... done 
Loading dev/misc.csv... done 
Loading dev/socioeconomic.csv... done 
Loading dev/vitals.csv... done 
Loading train/family.csv... done 
Loading train/misc.csv... done 
Loading train/socioeconomic.csv... done 
Loading train/vitals.csv... done 
Loading dev/family.csv... done 
...
Learning to Predict Survival 
5. Examine results 
Experiment Name: Titanic_Evaluate_Untuned 
SKLL Version: 1.0.0 
Training Set: train (712) 
Test Set: dev (179) 
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"] 
Learner: RandomForestClassifier 
Scikit-learn Version: 0.15.2 
Total Time: 0:00:02.065403 
+-------+------+------+-----------+--------+-----------+ 
| | 0.0 | 1.0 | Precision | Recall | F-measure | 
+-------+------+------+-----------+--------+-----------+ 
| 0.000 | [96] | 19 | 0.865 | 0.835 | 0.850 | 
+-------+------+------+-----------+--------+-----------+ 
| 1.000 | 15 | [49] | 0.721 | 0.766 | 0.742 | 
+-------+------+------+-----------+--------+-----------+ 
(row = reference; column = predicted) 
Accuracy = 0.8100558659217877
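The reported accuracy follows directly from the confusion matrix above: the bracketed diagonal cells are the correct predictions.

```python
# Confusion matrix from the RandomForestClassifier results above
# (rows = reference, columns = predicted)
matrix = [[96, 19],
          [15, 49]]
correct = matrix[0][0] + matrix[1][1]    # 96 + 49 = 145 correct
total = sum(sum(row) for row in matrix)  # 179 dev examples
print(correct / total)  # 0.8100558659217877
```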
Aggregate Evaluation Results 
Learner                   Dev. Accuracy 
RandomForestClassifier    0.8101 
DecisionTreeClassifier    0.7989 
SVC                       0.7709 
MultinomialNB             0.7095
Can we do better than default hyperparameters? 
[General] 
experiment_name = Titanic_Evaluate 
task = evaluate 
[Input] 
train_directory = train 
test_directory = dev 
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] 
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", 
"MultinomialNB"] 
id_col = PassengerId 
label_col = Survived 
[Tuning] ; tuning the learner 
grid_search = true 
objective = accuracy 
[Output] 
results = output
Tuned Evaluation Results 
Learner                   Untuned Accuracy   Tuned Accuracy 
RandomForestClassifier    0.8101             0.8380 
DecisionTreeClassifier    0.7989             0.7989 
SVC                       0.7709             0.8156 
MultinomialNB             0.7095             0.7095
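Setting grid_search = true makes SKLL tune each learner's hyperparameters via scikit-learn's grid search. Conceptually, it evaluates every combination in a small parameter grid and keeps the setting that maximizes the objective. A toy sketch of that idea (the scoring function and grid values are made up for illustration; SKLL uses real cross-validated scores):

```python
from itertools import product

# Toy stand-in for "train with these settings, score on held-out data"
def dev_accuracy(C, gamma):
    return 0.77 + 0.04 * (C == 10) + 0.005 * (gamma == 0.1)

grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

# Exhaustively try every combination and keep the best
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=lambda params: dev_accuracy(**params),
)
print(best)  # {'C': 10, 'gamma': 0.1}
```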
Using All Available Data 
Use training and dev to generate predictions on test 
[General] 
experiment_name = Titanic_Predict 
task = predict 
[Input] 
train_directory = train+dev 
test_directory = test 
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] 
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", 
"MultinomialNB"] 
id_col = PassengerId 
label_col = Survived 
[Tuning] 
grid_search = true 
objective = accuracy 
[Output] 
results = output
Test Set Accuracy 
                          Train only         Train + Dev 
Learner                   Untuned  Tuned     Untuned  Tuned 
RandomForestClassifier    0.727    0.756     0.746    0.780 
DecisionTreeClassifier    0.703    0.742     0.670    0.742 
SVC                       0.608    0.679     0.612    0.679 
MultinomialNB             0.627    0.627     0.622    0.622
Advanced SKLL Features 
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data 
• Parameter grids for all supported scikit-learn learners 
• Custom learners 
• Parallelize experiments on DRMAA clusters via GridMap 
• Ablation experiments 
• Collapse/rename classes from config file 
• Feature scaling 
• Rescale predictions to be closer to observed data 
• Command-line tools for joining, filtering, and converting feature files 
• Python API
Currently Supported Learners 
Classifiers only: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes 
Regressors only: Elastic Net, Lasso, Linear Regression 
Both classifiers & regressors: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
Contributors 
• Nitin Madnani 
• Mike Heilman 
• Nils Murrugarra Llerena 
• Aoife Cahill 
• Diane Napolitano 
• Keelan Evanini 
• Ben Leong
References 
• Dataset: kaggle.com/c/titanic-gettingStarted 
• SKLL GitHub: github.com/EducationalTestingService/skll 
• SKLL Docs: skll.readthedocs.org 
• Titanic configs and data splitting script in examples dir on GitHub 
@dsblanch 
dan-blanchard
Bonus Slides
SKLL API 
from skll import Learner, Reader 
 
# Load training examples 
train_examples = Reader.for_path('myexamples.megam').read() 
 
# Train a linear SVM 
learner = Learner('LinearSVC') 
learner.train(train_examples) 
 
# Load test examples and evaluate; evaluate() returns the confusion 
# matrix, overall accuracy on the test set, precision/recall/F-score 
# for each class, the tuned model parameters, and the objective 
# function score on the test set 
test_examples = Reader.for_path('test.tsv').read() 
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples) 
 
# Generate predictions from trained model 
predictions = learner.predict(test_examples) 
 
# Perform 10-fold cross-validation with a radial SVM; 
# cross_validate() returns per-fold evaluation results and 
# per-fold training set objective scores 
learner = Learner('SVC') 
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)
SKLL API 
import numpy as np 
from os.path import join 
from skll import FeatureSet, NDJWriter, Writer 
 
# Create some training examples 
# (num_train_examples and _my_dir assumed defined elsewhere) 
labels = [] 
ids = [] 
features = [] 
for i in range(num_train_examples): 
    labels.append("dog" if i % 2 == 0 else "cat") 
    ids.append("{}{}".format(labels[-1], i)) 
    features.append({"f1": np.random.randint(1, 4), 
                     "f2": np.random.randint(1, 4)}) 
feat_set = FeatureSet('training', ids, labels=labels, features=features) 
 
# Write them to a file 
train_path = join(_my_dir, 'train', 'test_summary.jsonlines') 
Writer.for_path(train_path, feat_set).write() 
# Or, equivalently 
NDJWriter(train_path, feat_set).write()
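The .jsonlines/.ndj files SKLL reads and writes are plain newline-delimited JSON: one example per line, with an id, a label, and a feature dict. The exact key names below ("id", "y", "x") are my recollection of SKLL's format; treat them as an assumption and confirm against the SKLL docs.

```python
import json

# One example in the assumed jsonlines layout: id, label "y",
# and feature dictionary "x"
line = json.dumps({"id": "EXAMPLE_0", "y": "dog",
                   "x": {"f1": 2, "f2": 3}})

# Each line round-trips through the standard json module
example = json.loads(line)
print(example["y"], example["x"]["f1"])  # dog 2
```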

Simpler Machine Learning with SKLL 1.0

  • 1. Simpler Machine Learning with SKLL 1.0 Dan Blanchard Educational Testing Service dblanchard@ets.org PyData NYC 2014
  • 2.
  • 3.
  • 4.
  • 6. Survived Perished first class, female, 1 sibling, 35 years old third class, female, 2 siblings, 18 years old second class, male, 0 siblings, 50 years old
  • 7. Survived Perished first class, female, 1 sibling, 35 years old third class, female, 2 siblings, 18 years old second class, male, 0 siblings, 50 years old Can we predict survival from data?
  • 9. SKLL It's where the learning happens
  • 10. Learning to Predict Survival 1. Split up given training set: train (80%) and dev (20%) $ ./make_titanic_example_data.py Loading train.csv... done Writing titanic/train/socioeconomic.csv...done Writing titanic/train/family.csv...done Writing titanic/train/vitals.csv...done Writing titanic/train/misc.csv...done Writing titanic/train+dev/socioeconomic.csv...done Writing titanic/train+dev/family.csv...done Writing titanic/train+dev/vitals.csv...done Writing titanic/train+dev/misc.csv...done Writing titanic/dev/socioeconomic.csv...done Writing titanic/dev/family.csv...done Writing titanic/dev/vitals.csv...done Writing titanic/dev/misc.csv...done Loading test.csv... done Writing titanic/test/socioeconomic.csv...done Writing titanic/test/family.csv...done Writing titanic/test/vitals.csv...done Writing titanic/test/misc.csv...done
  • 11. Learning to Predict Survival 2. Pick classifiers to try: 1. Decision Tree 2. Naive Bayes 3. Random forest 4. Support Vector Machine (SVM)
  • 12. Learning to Predict Survival 3. Create configuration file for SKLL [General] experiment_name = Titanic_Evaluate_Untuned task = evaluate [Input] train_directory = train test_directory = dev featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"] id_col = PassengerId label_col = Survived [Output] results = output models = output
  • 13. Learning to Predict Survival 3. Create configuration file for SKLL [General] experiment_name = Titanic_Evaluate_Untuned task = evaluate [Input] train_directory = train test_directory = dev featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"] id_col = PassengerId label_col = Survived [Output] results = output models = output directory with feature files for training learner
  • 14. Learning to Predict Survival 3. Create configuration file for SKLL [General] experiment_name = Titanic_Evaluate_Untuned task = evaluate [Input] train_directory = train test_directory = dev featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"] id_col = PassengerId label_col = Survived [Output] results = output models = output directory with feature files for evaluating performance
  • 15. Learning to Predict Survival
3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Output]
results = output
models = output

  • 16-22. (Same configuration, annotated.) The callouts on these slides label the pieces of the config: family.csv holds the # of siblings, spouses, parents, children; misc.csv the departure port; socioeconomic.csv the fare & passenger class; vitals.csv the sex & age. Under [Output], results is the directory to store evaluation results and models is the directory to store trained models.
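In the featuresets setting, each inner list names feature files that SKLL merges into one feature set, and each inner list runs as its own job, so feature subsets can be compared from a single config. A sketch of that usage (the second, reduced feature set here is illustrative, not from the talk):

```ini
[Input]
train_directory = train
test_directory = dev
; each inner list below is one experiment; the second drops vitals.csv
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"],
               ["family.csv", "misc.csv", "socioeconomic.csv"]]
learners = ["RandomForestClassifier"]
```

The ablation option listed later in the talk automates this kind of leave-one-file-out comparison.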
  • 23. Learning to Predict Survival
4. Run the configuration file with run_experiment

$ run_experiment evaluate.cfg
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
Loading dev/misc.csv... done
Loading dev/socioeconomic.csv... done
Loading dev/vitals.csv... done
...
  • 24. Learning to Predict Survival
5. Examine results

Experiment Name: Titanic_Evaluate_Untuned
SKLL Version: 1.0.0
Training Set: train (712)
Test Set: dev (179)
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Scikit-learn Version: 0.15.2
Total Time: 0:00:02.065403

+-------+------+------+-----------+--------+-----------+
|       |  0.0 |  1.0 | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [96] |   19 |     0.865 |  0.835 |     0.850 |
+-------+------+------+-----------+--------+-----------+
| 1.000 |   15 | [49] |     0.721 |  0.766 |     0.742 |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8100558659217877
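The per-class metrics in the report follow directly from the confusion matrix; a quick stdlib check of the arithmetic, with the cell values taken from the slide:

```python
# Confusion matrix from the report: rows = reference, columns = predicted,
# class 0 (perished) and class 1 (survived) on the 179-passenger dev set
matrix = [[96, 19],
          [15, 49]]

def metrics(cm, cls):
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(len(cm))) - tp  # predicted cls, wrong
    fn = sum(cm[cls]) - tp                             # actual cls, missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (matrix[0][0] + matrix[1][1]) / sum(map(sum, matrix))
print(round(accuracy, 4))                          # 0.8101, as reported
print([round(x, 3) for x in metrics(matrix, 0)])   # [0.865, 0.835, 0.85]
print([round(x, 3) for x in metrics(matrix, 1)])   # [0.721, 0.766, 0.742]
```

The bracketed diagonal cells in the report are the correct predictions; accuracy is their sum over the test-set size, (96 + 49) / 179.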
  • 25. Aggregate Evaluation Results

Dev. Accuracy   Learner
0.8101          RandomForestClassifier
0.7989          DecisionTreeClassifier
0.7709          SVC
0.7095          MultinomialNB
  • 26. Tuning learners
Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
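Setting grid_search = true has SKLL tune each learner's hyperparameters by grid search against the chosen objective (via scikit-learn under the hood). The core idea, sketched in plain Python with a made-up parameter grid and scoring function:

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Exhaustively try every parameter combination; keep the best score."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

def toy_score(C, gamma):
    # Stand-in for "train with these params, measure dev-set accuracy";
    # this toy objective peaks at C=1.0, gamma=0.1
    return 1.0 - abs(C - 1.0) - abs(gamma - 0.1)

best, score = grid_search(toy_score, {"C": [0.1, 1.0, 10.0],
                                      "gamma": [0.01, 0.1, 1.0]})
print(best)   # {'C': 1.0, 'gamma': 0.1}
```

In real use the scoring function is a cross-validated fit-and-evaluate, which is why grid search multiplies training time by the size of the grid.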
  • 27. Tuned Evaluation Results

Untuned Accuracy   Tuned Accuracy   Learner
0.8101             0.8380           RandomForestClassifier
0.7989             0.7989           DecisionTreeClassifier
0.7709             0.8156           SVC
0.7095             0.7095           MultinomialNB
  • 28. Using All Available Data
Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
  • 29. Test Set Accuracy

Train only         Train + Dev
Untuned   Tuned    Untuned   Tuned    Learner
0.727     0.756    0.746     0.780    RandomForestClassifier
0.703     0.742    0.670     0.742    DecisionTreeClassifier
0.608     0.679    0.612     0.679    SVC
0.627     0.627    0.622     0.622    MultinomialNB
  • 30. Advanced SKLL Features
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
• Parameter grids for all supported scikit-learn learners
• Custom learners
• Parallelize experiments on DRMAA clusters via GridMap
• Ablation experiments
• Collapse/rename classes from config file
• Feature scaling
• Rescale predictions to be closer to observed data
• Command-line tools for joining, filtering, and converting feature files
• Python API
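One of the options above, feature scaling, standardizes each feature column before training. SKLL implements this on top of scikit-learn, but the transformation amounts to this stdlib sketch:

```python
from statistics import mean, pstdev

def standardize(column):
    """Center a feature column to mean 0 and scale it to unit variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to all zeros
        return [0.0] * len(column)
    return [(x - mu) / sigma for x in column]

ages = [35.0, 18.0, 50.0, 29.0]
scaled = standardize(ages)
# After scaling: mean of `scaled` is ~0 and its population std is ~1,
# so features measured in years and features measured in dollars
# end up on comparable scales
```

This matters most for scale-sensitive learners such as SVMs and k-nearest neighbors; tree ensembles are largely indifferent to it.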
  • 31. Currently Supported Learners

Classifiers only: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes
Regressors only: Elastic Net, Lasso, Linear
Both classifiers & regressors: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
  • 32. Contributors
• Nitin Madnani
• Mike Heilman
• Nils Murrugarra Llerena
• Aoife Cahill
• Diane Napolitano
• Keelan Evanini
• Ben Leong
  • 33. References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data-splitting script in the examples dir on GitHub
@dsblanch · dan-blanchard
  • 35-40. SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)
# evaluate() returns: the confusion matrix, overall accuracy on the test set,
# precision/recall/f-score for each class, the tuned model parameters, and the
# objective function score on the test set
  • 41-45. SKLL API (continued)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)
# cross_validate() returns the per-fold evaluation results and the per-fold
# training-set objective scores
  • 46. SKLL API

import numpy as np
from os.path import join
from skll import FeatureSet, NDJWriter, Writer

# Create some training examples
labels = []
ids = []
features = []
for i in range(num_train_examples):
    labels.append("dog" if i % 2 == 0 else "cat")
    ids.append("{}{}".format(labels[-1], i))
    features.append({"f1": np.random.randint(1, 4),
                     "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)

# Write them to a file
train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or: NDJWriter(train_path, feat_set).write()
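The .jsonlines / .ndj format that NDJWriter produces is newline-delimited JSON: one object per line, which makes the files easy to stream, split, and concatenate. A stdlib sketch of that layout; the field names "id", "y", and "x" here are illustrative assumptions, not necessarily SKLL's exact on-disk schema:

```python
import json
import os
import tempfile

# Two hand-written examples mirroring the dog/cat features above
examples = [{"id": "dog0", "y": "dog", "x": {"f1": 2, "f2": 3}},
            {"id": "cat1", "y": "cat", "x": {"f1": 1, "f2": 2}}]

path = os.path.join(tempfile.gettempdir(), "train_sketch.jsonlines")

# Write: one JSON object per line
with open(path, "w") as out:
    for example in examples:
        out.write(json.dumps(example) + "\n")

# Read it back the same way
with open(path) as lines:
    loaded = [json.loads(line) for line in lines]

print(len(loaded))       # 2
print(loaded[0]["y"])    # dog
```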