Simpler Machine Learning with SKLL
Dan Blanchard
Educational Testing Service
dblanchard@ets.org
PyData NYC 2013
Survived: first class, female, 1 sibling, 35 years old
Perished: third class, female, 2 siblings, 18 years old

second class, male, 0 siblings, 50 years old
Can we predict survival from data?
SciKit-Learn Laboratory
SKLL
It's where the learning happens.
Learning to Predict Survival
1. Split up given training set: train (80%) and dev (20%)
$ ./make_titanic_example_data.py
Creating titanic/train directory
Creating titanic/dev directory
Creating titanic/test directory
Loading train.csv............done
Loading test.csv........done
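The slides do not show what the splitting script does internally. As a minimal sketch of an 80/20 train/dev split, assuming the examples are already loaded as a list of rows, something like the following would work. The function name `split_train_dev`, the seed, and the toy rows are hypothetical and not taken from the script.

```python
import random


def split_train_dev(rows, dev_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, dev) lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy stand-in for the rows of train.csv (hypothetical data).
rows = [{"PassengerId": i} for i in range(1, 11)]
train, dev = split_train_dev(rows)
print(len(train), len(dev))
```

Shuffling before splitting avoids a dev set that only contains the last rows of the file, which matters if the file is sorted in some way.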
Learning to Predict Survival
2. Pick classifiers to try:
   1. Random forest
   2. Support Vector Machine (SVM)
   3. Naive Bayes
Learning to Predict Survival
3. Create configuration file for SKLL
[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
# directory with feature files for training learner
train_location = train
# directory with feature files for evaluating performance
test_location = dev
# family.csv: number of siblings, spouses, parents, children
# misc.csv: departure port
# socioeconomic.csv: fare & passenger class
# vitals.csv: sex & age
featuresets = [["family.csv", "misc.csv",
    "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
# directory to store evaluation results
results = output
# directory to store trained models
models = output
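The configuration file is INI-style, and its list-valued fields look like JSON. As a rough sketch of how such a file could be read with the standard library, the snippet below parses it with `configparser` and `json`, then enumerates one job per feature set and learner pair. Whether SKLL parses the file this way internally is an assumption, and the `jobs` list is purely illustrative.

```python
import configparser
import json

CONFIG_TEXT = """
[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv",
    "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
results = output
models = output
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# The list-valued fields are valid JSON, so json.loads can decode them.
featuresets = json.loads(parser["Input"]["featuresets"])
learners = json.loads(parser["Input"]["learners"])

# One (feature set, learner) pair per job; illustrative only.
jobs = [(tuple(fs), learner) for fs in featuresets for learner in learners]
for files, learner in jobs:
    print(learner, "on", ", ".join(files))
```

Note that the continuation line of `featuresets` is indented; `configparser` needs that indentation to treat it as part of the same value.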
Learning to Predict Survival
4. Run the configuration file with run_experiment
$ run_experiment evaluate.cfg
Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
Loading dev/misc.csv.....done
Loading dev/socioeconomic.csv.....done
Loading dev/vitals.csv.....done
Loading train/family.csv...........done
Loading train/misc.csv...........done
Loading train/socioeconomic.csv...........done
Loading train/vitals.csv...........done
Loading dev/family.csv.....done
...
Learning to Predict Survival
5. Examine results
Experiment Name: Titanic_Evaluate
Training Set: train
Test Set: dev
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Task: evaluate

+-------+------+------+-----------+--------+-----------+
|       |  0.0 |  1.0 | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
+-------+------+------+-----------+--------+-----------+
| 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8212290502793296
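To make the connection between the confusion matrix and the reported scores explicit, here is a small sketch that recomputes precision, recall, F-measure, and accuracy from the four counts above. The helper names are hypothetical, and this is the textbook calculation rather than SKLL's own code. The values should agree with the table up to rounding of the last digit.

```python
# Confusion matrix from the results above: rows = reference, columns = predicted.
conf_matrix = [[97, 18],
               [14, 50]]


def class_metrics(matrix, k):
    """Return (precision, recall, f_measure) for class index k."""
    true_pos = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column total
    reference_k = sum(matrix[k])                 # row total
    precision = true_pos / predicted_k if predicted_k else 0.0
    recall = true_pos / reference_k if reference_k else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def accuracy(matrix):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for k in range(len(conf_matrix)):
    p, r, f = class_metrics(conf_matrix, k)
    print("class {}: precision={:.4f} recall={:.4f} f={:.4f}".format(k, p, r, f))
print("accuracy={:.4f}".format(accuracy(conf_matrix)))
```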
Aggregate Evaluation Results
Learner                  Dev. Accuracy
RandomForestClassifier   0.821
SVC                      0.771
MultinomialNB            0.709
Tuning learner
• Can we do better than default hyperparameters?
[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv",
    "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
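Grid search, in general terms, means trying every combination of hyperparameter values and keeping the one that scores best on the objective. The sketch below illustrates the idea with a toy one-feature threshold classifier and invented data; the classifier, the grid, and the data are all hypothetical and unrelated to the grids SKLL ships with. For brevity it scores each setting on a single toy list, whereas a real grid search would score on held-out folds.

```python
from itertools import product


def accuracy(predictions, labels):
    """Fraction of predictions that equal the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def threshold_classifier(threshold, positive_above):
    """Return a predict function that labels x by comparing it to threshold."""
    def predict(x):
        return int((x > threshold) == positive_above)
    return predict


# Invented toy data: ages and whether the passenger survived.
ages = [4, 8, 15, 22, 30, 41, 55, 63]
survived = [1, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical hyperparameter grid.
param_grid = {"threshold": [10, 20, 40], "positive_above": [True, False]}

best_score, best_params = -1.0, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    predict = threshold_classifier(**params)
    score = accuracy([predict(a) for a in ages], survived)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```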
Tuned Evaluation Results
Learner                  Untuned Accuracy   Tuned Accuracy
RandomForestClassifier   0.821              0.849
SVC                      0.771              0.737
MultinomialNB            0.709              0.709
Using All Available Data
• Use training and dev to generate predictions on test
[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_location = train+dev
test_location = test
featuresets = [["family.csv", "misc.csv",
    "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Test Set Performance
Learner                  Untuned Accuracy   Tuned Accuracy   Untuned Accuracy   Tuned Accuracy
                         (Train only)       (Train only)     (Train + Dev)      (Train + Dev)
RandomForestClassifier   0.732              0.746            0.746              0.756
SVC                      0.608              0.617            0.612              0.641
MultinomialNB            0.627              0.623            0.622              0.622
Advanced SKLL Features
• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API
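As an illustration of one item in the list above, an ablation experiment reruns the same learner with parts of the feature set left out to see how much each part contributes. The sketch below only generates the reduced feature sets, assuming ablation means dropping whole feature files; the function name `ablated_featuresets` is hypothetical and this is not SKLL's implementation.

```python
from itertools import combinations

feature_files = ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]


def ablated_featuresets(files, num_removed=1):
    """Return every feature set obtained by leaving out num_removed files."""
    keep = len(files) - num_removed
    return [list(subset) for subset in combinations(files, keep)]


for featureset in ablated_featuresets(feature_files):
    left_out = sorted(set(feature_files) - set(featureset))
    print("without", left_out, "->", featureset)
```

Comparing the accuracy of each reduced set against the full set gives a rough sense of which feature files matter most.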
Currently Supported Learners
Classifiers                      Regressors
Linear Support Vector Machine    Elastic Net
Logistic Regression              Lasso
Multinomial Naive Bayes          Linear

Classifiers and regressors
Decision Tree
Gradient Boosting
Random Forest
Support Vector Machine
Coming Soon
Classifiers

Regressors
AdaBoost
K-Nearest Neighbors

Stochastic Gradient Descent
Acknowledgements
• Mike Heilman
• Nitin Madnani
• Aoife Cahill
References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub
@Dan_S_Blanchard
dan-blanchard
Bonus Slides
Cross-validation
[General]
experiment_name = Titanic_CV
task = cross_validate

[Input]
train_location = train+dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv",
    "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
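For readers new to cross-validation, the idea is to split the examples into k folds, hold each fold out once for testing, and train on the rest. A minimal sketch of generating the index splits, using only the standard library, might look like this. The function name `k_fold_splits` is hypothetical, and the sketch does not shuffle or stratify by class, which a real implementation may well do.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [list(range(start, n_examples, k)) for start in range(k)]
    for held_out, test_indices in enumerate(folds):
        train_indices = sorted(
            idx
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for idx in fold
        )
        yield train_indices, test_indices


splits = list(k_fold_splits(n_examples=20, k=10))
for train_indices, test_indices in splits[:2]:
    print(len(train_indices), "train /", test_indices, "test")
```

Averaging the per-fold scores gives a single figure like the "Avg. CV Accuracy" column.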
Cross-validation Results
Learner                  Avg. CV Accuracy
RandomForestClassifier   0.815
SVC                      0.717
MultinomialNB            0.681
SKLL API
from skll import Learner, load_examples

# Load training examples
train_examples = load_examples('myexamples.megam')

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate. evaluate() returns:
#   conf_matrix   confusion matrix
#   prf_dict      precision, recall, f-score for each class
#   model_params  tuned model parameters
#   obj_score     objective function score on test set
test_examples = load_examples('test.tsv')
(conf_matrix, accuracy, prf_dict, model_params,
 obj_score) = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM. cross_validate() returns:
#   fold_result_list    per-fold evaluation results
#   grid_search_scores  per-fold training set obj. scores
learner = Learner('SVC')
(fold_result_list,
 grid_search_scores) = learner.cross_validate(train_examples)
SKLL API
import numpy as np
import os
from skll import write_feature_file

# Assumed values; the slide does not show how these are defined
num_train_examples = 100
_my_dir = os.path.abspath(os.path.dirname(__file__))

# Create some training examples
classes = []
ids = []
features = []
for i in range(num_train_examples):
    # Alternate labels and give each example a unique ID such as "dog0"
    y = "dog" if i % 2 == 0 else "cat"
    ex_id = "{}{}".format(y, i)
    # Three random integer features, each in the range 1 to 3
    x = {"f1": np.random.randint(1, 4),
         "f2": np.random.randint(1, 4),
         "f3": np.random.randint(1, 4)}
    classes.append(y)
    ids.append(ex_id)
    features.append(x)

# Write them to a file
train_path = os.path.join(_my_dir, 'train',
                          'test_summary.jsonlines')
write_feature_file(train_path, ids, classes, features)
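For orientation, a .jsonlines file holds one JSON object per line. The sketch below writes such lines using only the standard library. The key names "id", "y", and "x" are an assumption about the layout SKLL expects, and `write_jsonlines` is a hypothetical stand-in rather than the library's `write_feature_file`, so check the SKLL documentation before relying on the exact format.

```python
import io
import json


def write_jsonlines(handle, ids, classes, features):
    """Write one JSON object per example to an open text handle."""
    for ex_id, label, feats in zip(ids, classes, features):
        # Key names are assumed, not confirmed by the slides.
        record = {"id": ex_id, "y": label, "x": feats}
        handle.write(json.dumps(record, sort_keys=True) + "\n")


# Two hand-written examples in the same shape as the loop above produces.
ids = ["dog0", "cat1"]
classes = ["dog", "cat"]
features = [{"f1": 1, "f2": 3, "f3": 2}, {"f1": 2, "f2": 2, "f3": 1}]

buffer = io.StringIO()
write_jsonlines(buffer, ids, classes, features)
lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```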

 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

Simpler Machine Learning with SKLL

  • 1. Simpler Machine Learning with SKLL Dan Blanchard Educational Testing Service dblanchard@ets.org 
 PyData NYC 2013
  • 7-9. Survived: first class, female, 1 sibling, 35 years old. Perished: third class, female, 2 siblings, 18 years old. Second class, male, 0 siblings, 50 years old. Can we predict survival from data?
  • 11-13. SKLL (SciKit-Learn Laboratory). It's where the learning happens.
  • 14-15. Learning to Predict Survival 1. Split up given training set: train (80%) and dev (20%)
        $ ./make_titanic_example_data.py
        Creating titanic/train directory
        Creating titanic/dev directory
        Creating titanic/test directory
        Loading train.csv............done
        Loading test.csv........done
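The slides do not show what make_titanic_example_data.py does internally. As a rough illustration of an 80/20 train/dev split, here is a minimal sketch; the function name `split_train_dev`, the shuffling step, and the fixed seed are assumptions made for this example, not the script's actual behavior.

```python
import random


def split_train_dev(rows, train_fraction=0.8, seed=0):
    """Shuffle rows reproducibly, then cut them into train and dev lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the rows of train.csv
rows = list(range(10))
train, dev = split_train_dev(rows)
print(len(train), len(dev))
```

Fixing the seed keeps the split stable between runs, so the dev set results stay comparable across experiments.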
  • 16. Learning to Predict Survival 2. Pick classifiers to try: (1) Random forest, (2) Support Vector Machine (SVM), (3) Naive Bayes
  • 17-29. Learning to Predict Survival 3. Create configuration file for SKLL (slide callouts shown as comments)
        [General]
        experiment_name = Titanic_Evaluate
        task = evaluate

        [Input]
        # directory with feature files for training learner
        train_location = train
        # directory with feature files for evaluating performance
        test_location = dev
        # family.csv: # of siblings, spouses, parents, children
        # misc.csv: departure port
        # socioeconomic.csv: passenger class & fare
        # vitals.csv: sex & age
        featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
        learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
        label_col = Survived

        [Output]
        # directory to store evaluation results
        results = output
        # directory to store trained models
        models = output
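The configuration on these slides is an INI-style file whose list-valued options (featuresets, learners) are written as JSON-like lists. A minimal sketch of generating the same file from Python with the standard library follows; the helper names `build_evaluate_config` and `config_to_text` are made up for this example, and nothing here is part of SKLL itself.

```python
import configparser
import io
import json


def build_evaluate_config():
    """Assemble the evaluate configuration shown on the slides."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["General"] = {
        "experiment_name": "Titanic_Evaluate",
        "task": "evaluate",
    }
    config["Input"] = {
        "train_location": "train",
        "test_location": "dev",
        "featuresets": json.dumps(
            [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
        ),
        "learners": json.dumps(["RandomForestClassifier", "SVC", "MultinomialNB"]),
        "label_col": "Survived",
    }
    config["Output"] = {"results": "output", "models": "output"}
    return config


def config_to_text(config):
    """Serialize the configuration to the text that would go in the .cfg file."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(config_to_text(build_evaluate_config()))
```

Building the file programmatically can help when the same experiment is repeated with different learners or feature files.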
  • 30. Learning to Predict Survival 4. Run the configuration file with run_experiment
        $ run_experiment evaluate.cfg
        Loading train/family.csv...........done
        Loading train/misc.csv...........done
        Loading train/socioeconomic.csv...........done
        Loading train/vitals.csv...........done
        Loading dev/family.csv.....done
        Loading dev/misc.csv.....done
        Loading dev/socioeconomic.csv.....done
        Loading dev/vitals.csv.....done
        Loading train/family.csv...........done
        Loading train/misc.csv...........done
        Loading train/socioeconomic.csv...........done
        Loading train/vitals.csv...........done
        Loading dev/family.csv.....done
        ...
  • 31. Learning to Predict Survival 5. Examine results
        Experiment Name: Titanic_Evaluate
        Training Set: train
        Test Set: dev
        Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
        Learner: RandomForestClassifier
        Task: evaluate

        +-------+------+------+-----------+--------+-----------+
        |       | 0.0  | 1.0  | Precision | Recall | F-measure |
        +-------+------+------+-----------+--------+-----------+
        | 0.000 | [97] | 18   | 0.874     | 0.843  | 0.858     |
        +-------+------+------+-----------+--------+-----------+
        | 1.000 | 14   | [50] | 0.735     | 0.781  | 0.758     |
        +-------+------+------+-----------+--------+-----------+
        (row = reference; column = predicted)

        Accuracy = 0.8212290502793296
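The precision, recall, F-measure, and accuracy figures in the results table follow directly from the four counts in the confusion matrix. The sketch below recomputes them with plain Python; the function names are chosen for this example and are not SKLL functions.

```python
def per_class_metrics(matrix, labels):
    """Compute (precision, recall, F-measure) per class.

    matrix[i][j] is the number of examples whose reference label is
    labels[i] and whose predicted label is labels[j].
    """
    results = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted_total = sum(row[k] for row in matrix)  # column total
        reference_total = sum(matrix[k])                 # row total
        precision = true_positives / predicted_total
        recall = true_positives / reference_total
        f_measure = 2 * precision * recall / (precision + recall)
        results[label] = (precision, recall, f_measure)
    return results


def accuracy(matrix):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Counts taken from the RandomForestClassifier table on the slide
matrix = [[97, 18], [14, 50]]
for label, (p, r, f) in per_class_metrics(matrix, ["0.0", "1.0"]).items():
    print(label, p, r, f)
print(accuracy(matrix))
```

For example, precision for class 0.0 is 97 / (97 + 14) and accuracy is (97 + 50) / 179, which agree with the slide's values to the printed precision.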
  • 33-34. Tuning learner • Can we do better than default hyperparameters?
        [General]
        experiment_name = Titanic_Evaluate
        task = evaluate

        [Input]
        train_location = train
        test_location = dev
        featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
        learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
        label_col = Survived

        [Tuning]
        grid_search = true
        objective = accuracy

        [Output]
        results = output
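The slides do not show how SKLL carries out the search enabled by grid_search = true. As a general illustration of the idea, the sketch below tries every combination in a parameter grid and keeps the one with the best accuracy. The threshold "model", the tiny dataset, and the grid values are all invented for this example; a real grid search would also score each setting on held-out folds after training rather than on a single fixed list.

```python
from itertools import product


def predict(features, feature_index, threshold):
    """Toy rule: predict 1 when the chosen feature exceeds the threshold."""
    return 1 if features[feature_index] > threshold else 0


def accuracy(examples, **params):
    """Fraction of examples the toy rule labels correctly."""
    correct = sum(predict(x, **params) == y for x, y in examples)
    return correct / len(examples)


def grid_search(examples, param_grid):
    """Score every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = accuracy(examples, **params)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


# Invented (age, fare) -> survived examples, for illustration only
held_out = [((22, 7.0), 0), ((38, 71.0), 1), ((26, 8.0), 0),
            ((35, 53.0), 1), ((54, 52.0), 1), ((2, 21.0), 0)]
param_grid = {"feature_index": [0, 1], "threshold": [10, 30, 50]}
best_params, best_score = grid_search(held_out, param_grid)
print(best_params, best_score)
```

The objective here is accuracy, matching the objective = accuracy line in the configuration; swapping in another scoring function changes which setting wins.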
  • 38-39. Using All Available Data • Use training and dev to generate predictions on test
        [General]
        experiment_name = Titanic_Predict
        task = predict

        [Input]
        train_location = train+dev
        test_location = test
        featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
        learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
        label_col = Survived

        [Tuning]
        grid_search = true
        objective = accuracy

        [Output]
        results = output
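  • 40. Test Set Performance
        +------------------------+------------------+----------------+------------------+----------------+
        | Learner                | Untuned Accuracy | Tuned Accuracy | Untuned Accuracy | Tuned Accuracy |
        |                        | (Train only)     | (Train only)   | (Train + Dev)    | (Train + Dev)  |
        +------------------------+------------------+----------------+------------------+----------------+
        | RandomForestClassifier | 0.732            | 0.746          | 0.746            | 0.756          |
        +------------------------+------------------+----------------+------------------+----------------+
        | SVC                    | 0.608            | 0.617          | 0.612            | 0.641          |
        +------------------------+------------------+----------------+------------------+----------------+
        | MultinomialNB          | 0.627            | 0.623          | 0.622            | 0.622          |
        +------------------------+------------------+----------------+------------------+----------------+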
  • 42-49. Advanced SKLL Features
        • Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
        • Parameter grids for all supported classifiers/regressors
        • Parallelize experiments on DRMAA clusters
        • Ablation experiments
        • Collapse/rename classes from config file
        • Rescale predictions to be closer to observed data
        • Feature scaling
        • Python API
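The slides name ablation experiments without showing how they are set up. One common reading is to rerun the experiment with one group of features left out at a time and compare against the full set. The sketch below only builds those feature-set lists for the four Titanic feature files; the function name is made up here, and SKLL's own ablation support may work differently.

```python
from itertools import combinations


def leave_one_out_featuresets(feature_files):
    """Return the full feature set followed by each set with one file removed."""
    full = list(feature_files)
    ablated = [list(subset) for subset in combinations(full, len(full) - 1)]
    return [full] + ablated


featuresets = leave_one_out_featuresets(
    ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
)
for featureset in featuresets:
    print(featureset)
```

A large drop in accuracy when one file is removed suggests that its features carry information the others do not.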
  • 50. Currently Supported Learners
        • Classifiers: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes
        • Regressors: Elastic Net, Lasso, Linear
        • Classifiers and regressors: Decision Tree, Gradient Boosting, Random Forest, Support Vector Machine
  • 52. Acknowledgements • Mike Heilman • Nitin Madnani • Aoife Cahill
  • 53. References
        • Dataset: kaggle.com/c/titanic-gettingStarted
        • SKLL GitHub: github.com/EducationalTestingService/skll
        • SKLL Docs: skll.readthedocs.org
        • Titanic configs and data splitting script in examples dir on GitHub
        @Dan_S_Blanchard / dan-blanchard
  • 55. Cross-validation
        [General]
        experiment_name = Titanic_CV
        task = cross_validate

        [Input]
        train_location = train+dev
        featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
        learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
        label_col = Survived

        [Tuning]
        grid_search = true
        objective = accuracy

        [Output]
        results = output
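With task = cross_validate there is no separate test directory: the training data is divided into folds, and each fold takes a turn as the evaluation set. A minimal sketch of that partitioning is below. It is unshuffled and unstratified, the function name is invented for this example, and it is not how SKLL necessarily builds its folds; the default of 10 folds mirrors the "10-fold" wording used in the API slides.

```python
def k_fold_indices(num_examples, num_folds=10):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(num_examples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [num_examples // num_folds] * num_folds
    for i in range(num_examples % num_folds):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(25, num_folds=5):
    print(len(train), len(test))
```

Every example lands in exactly one test fold, so the per-fold scores together cover the whole training set.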
  • 58-71. SKLL API (slide callouts shown as comments)
        from skll import Learner, load_examples

        # Load training examples
        train_examples = load_examples('myexamples.megam')

        # Train a linear SVM
        learner = Learner('LinearSVC')
        learner.train(train_examples)

        # Load test examples and evaluate
        #   conf_matrix:  confusion matrix
        #   prf_dict:     precision, recall, f-score for each class
        #   model_params: tuned model parameters
        #   obj_score:    objective function score on test set
        test_examples = load_examples('test.tsv')
        (conf_matrix, accuracy, prf_dict, model_params,
         obj_score) = learner.evaluate(test_examples)

        # Generate predictions from trained model
        predictions = learner.predict(test_examples)

        # Perform 10-fold cross-validation with a radial SVM
        #   fold_result_list:   per-fold evaluation results
        #   grid_search_scores: per-fold training set obj. scores
        learner = Learner('SVC')
        (fold_result_list,
         grid_search_scores) = learner.cross_validate(train_examples)
  • 72. SKLL API
        import os

        import numpy as np
        from skll import write_feature_file

        # Assumed values: the slide uses these two names without defining them
        num_train_examples = 100
        _my_dir = os.path.abspath(os.path.dirname(__file__))

        # Create some training examples
        classes = []
        ids = []
        features = []
        for i in range(num_train_examples):
            # Alternate the class label between "dog" and "cat"
            y = "dog" if i % 2 == 0 else "cat"
            ex_id = "{}{}".format(y, i)
            # Three random integer features, each in the range 1..3
            x = {"f1": np.random.randint(1, 4),
                 "f2": np.random.randint(1, 4),
                 "f3": np.random.randint(1, 4)}
            classes.append(y)
            ids.append(ex_id)
            features.append(x)

        # Write them to a file
        train_path = os.path.join(_my_dir, 'train', 'test_summary.jsonlines')
        write_feature_file(train_path, ids, classes, features)