Valencian Summer School in Machine Learning
4th edition
September 13–14, 2018
Practical workshops: API and reuse
Mercè Martín
Outline
1 Preparing the environment
2 Data wrangling
3 Feature engineering
4 Model tuning
5 Predictions integration
6 Workflows
#VSSML18 Practical workshops: API and reuse September 13–14, 2018 3 / 22
Register in BigML
Install the client tools
You can use virtual environments (recommended)
mkvirtualenv vssml18
and install BigMLer and the Python bindings
pip install bigmler
Set your credentials
They can be exported as environment variables
export BIGML_USERNAME=[username]
export BIGML_API_KEY=[api_key]
For Windows users
setx BIGML_USERNAME [username]
setx BIGML_API_KEY [api_key]
The username and API key can be found in your account information
section
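Before running the workshop notebooks, it can help to confirm that both variables are visible to Python. The following is a minimal sketch using only the standard library; `read_bigml_credentials` is a hypothetical helper written for this illustration, not part of BigMLer or the Python bindings, and the example values are placeholders.

```python
import os


def read_bigml_credentials(environ=None):
    """Return (username, api_key), or raise if either variable is missing.

    Hypothetical helper: it only inspects the environment and never
    contacts BigML.
    """
    environ = os.environ if environ is None else environ
    names = ("BIGML_USERNAME", "BIGML_API_KEY")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return environ["BIGML_USERNAME"], environ["BIGML_API_KEY"]


# Example with a stand-in environment (placeholder values, not real credentials).
example = read_bigml_credentials(
    {"BIGML_USERNAME": "alice", "BIGML_API_KEY": "0123abcd"}
)
```

Calling `read_bigml_credentials()` with no argument checks the real environment of the current process.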
Download the reference repo
https://github.com/mmerce/notebooks
and use the vssml18 folder
or link it through mybinder.org
Data dictionary
Defining the types of fields
Models process data according to its type
Numeric: ordered unbounded sequence
Categorical: unordered enumeration
Datetime: day, month, year, etc.
Text: full text or composed type (bag of words)
Items: composed type (list of elements separated by a token)
The data dictionary must be set carefully so that the model interprets
your data correctly
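To make the effect of the declared type concrete, here is a small sketch of how the same raw string could be read under each of the five types. The function `interpret`, its default separator and its default datetime format are assumptions made for this illustration; they do not describe how BigML parses fields internally.

```python
from collections import Counter
from datetime import datetime


def interpret(value, field_type, separator=",", datetime_format="%Y-%m-%d"):
    """Illustrative reading of a raw string under a declared field type."""
    if field_type == "numeric":
        return float(value)                      # ordered, supports arithmetic
    if field_type == "categorical":
        return value                             # opaque label, no ordering
    if field_type == "datetime":
        parsed = datetime.strptime(value, datetime_format)
        return {"year": parsed.year, "month": parsed.month, "day": parsed.day}
    if field_type == "text":
        return Counter(value.lower().split())    # bag of words
    if field_type == "items":
        return [item.strip() for item in value.split(separator)]
    raise ValueError("unknown field type: " + field_type)


as_number = interpret("42", "numeric")           # 42.0
as_category = interpret("42", "categorical")     # "42"
as_items = interpret("red, green", "items")      # ["red", "green"]
```

The value "42" is a quantity under one declaration and a label under the other, which is why a wrong entry in the data dictionary changes what the model can learn from the field.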
Missing tokens
Missing values: meaningful or replaceable
The absence of a value can be
Meaningful: either the model can treat a missing value as a new
category, or you need to build a new predicate and feed it
to the model
Replaceable: if the model cannot deal with missing values, you may be
able to fill in a sensible value: mean, zero, min, etc.
When the percentage of training instances having missing values is
small and the amount of data is enough, we can simply discard these
instances while training.
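The three options above (a predicate for meaningful missings, a fill-in value for replaceable ones, and discarding rows) can be sketched in plain Python. The rows, field names and helper functions below are made up for the illustration.

```python
from statistics import mean

# Toy rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 51, "income": None},
    {"age": 29, "income": 61000},
]


def add_missing_flag(rows, field):
    """Meaningful missings: expose the absence as a new boolean predicate."""
    return [dict(row, **{field + "_is_missing": row[field] is None})
            for row in rows]


def fill_with_mean(rows, field):
    """Replaceable missings: substitute the mean of the observed values."""
    fill = mean(row[field] for row in rows if row[field] is not None)
    return [dict(row, **{field: fill if row[field] is None else row[field]})
            for row in rows]


def drop_incomplete(rows):
    """Discard every row that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


flagged = add_missing_flag(rows, "income")
filled = fill_with_mean(rows, "age")
complete = drop_incomplete(rows)
```

Each helper returns new rows and leaves the input untouched, so the strategies can be compared on the same data.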
Errors
Fixing errors
Errors can be detected automatically when the values in the field are
not compatible with its type
Datetime: the contents of the field cannot be parsed with the
declared datetime format
Numeric: the contents of the field are not a number
However, other errors can still pass the type coherence test.
Errors need to be addressed and, as with missing values, either the
erroneous value is replaced by a sensible alternative or the row is
discarded.
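A type coherence check of this kind can be sketched as follows. `find_type_errors` and its default datetime format are assumptions for the illustration; the last example shows a value that passes the check and is still wrong for the problem (a negative age).

```python
from datetime import datetime


def find_type_errors(values, field_type, datetime_format="%Y-%m-%d"):
    """Return the indices of values that cannot be parsed as field_type."""
    parsers = {
        "numeric": float,
        "datetime": lambda value: datetime.strptime(value, datetime_format),
    }
    parse = parsers[field_type]
    errors = []
    for index, value in enumerate(values):
        try:
            parse(value)
        except (ValueError, TypeError):
            errors.append(index)
    return errors


numeric_errors = find_type_errors(["1.5", "abc", "3"], "numeric")
datetime_errors = find_type_errors(["2018-09-13", "13/09/2018"], "datetime")
# "-5" is a valid number, so a negative age is not reported here:
# catching it needs a domain rule on top of the type check.
age_errors = find_type_errors(["-5", "40"], "numeric")
```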
Feature selection
Non-preferred fields
Features can be excluded from model analysis because their values are
Constant: the field contains the same value in all instances
Unique: the field contains a different value per instance
Highly sparse: only a very low percentage of instances have non-missing
values in the field
Redundant: the field is correlated with another one
Unrelated: the field contains reference information or is totally
irrelevant to the problem to solve
Supervised selection
In supervised problems, the relevant features can be preselected according
to some importance or evaluation metric
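The first three exclusion criteria can be checked mechanically. The sketch below is one possible way to do it; the function name, the sparsity threshold and the sample rows are assumptions for the illustration, and redundant or unrelated fields are left out because they need correlation measures or domain knowledge.

```python
def non_preferred_fields(rows, max_missing_ratio=0.9):
    """Map field name -> reason for the constant, unique and sparse cases."""
    if not rows:
        return {}
    reasons = {}
    total = len(rows)
    for field in rows[0]:
        present = [row[field] for row in rows if row[field] is not None]
        distinct = set(present)
        if 1 - len(present) / total > max_missing_ratio:
            reasons[field] = "highly sparse"
        elif len(distinct) == 1:
            reasons[field] = "constant"
        elif len(distinct) == total:
            reasons[field] = "unique"
    return reasons


rows = [
    {"id": 1, "country": "ES", "notes": None, "age": 30},
    {"id": 2, "country": "ES", "notes": None, "age": 41},
    {"id": 3, "country": "ES", "notes": "vip", "age": 30},
    {"id": 4, "country": "ES", "notes": None, "age": 25},
]
flagged = non_preferred_fields(rows, max_missing_ratio=0.5)
```

The "unique" test is meant for identifiers and labels. A continuous numeric measurement can also have a different value in every row, so a real check would apply it only to non-numeric fields.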
Feature generation
Transforming datasets
New features can be computed from the existing ones and added to
the training datasets to improve the model performance.
Combinations: combining existing features with operations like
subtractions or ratios
Predicates: adding new information to the dataset by providing
predicates on the fields, like odd and even
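As a sketch of both kinds of generated feature, the function below adds a subtraction, a ratio and an even/odd predicate to a row. The field names (`income`, `expenses`, `household_size`) are hypothetical and chosen only for the example.

```python
def add_features(row):
    """Return a copy of the row extended with derived features."""
    new_row = dict(row)
    # Combinations: a subtraction and a ratio of existing fields.
    new_row["balance"] = row["income"] - row["expenses"]
    new_row["expense_ratio"] = (
        row["expenses"] / row["income"] if row["income"] else None
    )
    # Predicate: a boolean statement about an existing field.
    new_row["household_size_is_even"] = row["household_size"] % 2 == 0
    return new_row


extended = add_features({"income": 2000, "expenses": 500, "household_size": 3})
```

The ratio is left as missing when the denominator is zero, so the new field does not introduce errors of its own.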
Automating model configuration
Optimizing models
Models can be tuned by adjusting their configurations to better fit our
data. Examples of automatic optimizations are
Optimized: automatic search for the best configuration per model type
OptiML: automatic search for the best type of model and
configuration according to an evaluation metric
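The idea of searching for the best model and configuration according to a metric can be illustrated with a toy example. This is not how BigML's optimization works internally: the exhaustive loop, the threshold models and the accuracy metric below are stand-ins chosen to keep the sketch short.

```python
def accuracy(model, data):
    """Fraction of (input, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def threshold_model(threshold):
    """Toy model type: predicts True when the input reaches the threshold."""
    return lambda x: x >= threshold


def search_best(candidates, validation, metric=accuracy):
    """Score every candidate on held-out data and return the best one."""
    scores = {name: metric(model, validation)
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


validation = [(1, False), (2, False), (3, True), (4, True), (5, True)]

# Several configurations of one model type, plus a second, trivial model type.
candidates = {"threshold=%d" % t: threshold_model(t) for t in (2, 3, 4)}
candidates["always_true"] = lambda x: True

best_name, best_score = search_best(candidates, validation)
```

Changing the `metric` argument changes which candidate wins, which is why the evaluation metric is part of the search specification.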
Local vs. remote predictions
Depending on the requirements
Single: usually for sparse or distributed requests for immediate
predictions
Batch: for cumulative or periodic requests for predictions
Depending on the integration level
Remote: usually used for batch predictions, when the scalability
and parallelism of the server justify the latency of the
call
Local: for offline settings or low-latency predictions
Automating the entire solution
An ML solution is rarely given by a single model
The solution to a Machine Learning problem is usually a sequence of
steps that involve different models and transformations: a workflow. A
workflow has to be stored in a programmable way so that it can be
Traceable: to describe the steps that led to the solution
Repeatable: to allow repetition with different or cumulative data
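A minimal sketch of a workflow stored in a programmable way: an ordered list of named steps that records which steps ran (traceable) and can be applied again to new data (repeatable). The `Workflow` class and the three steps are hypothetical and only illustrate the two properties.

```python
class Workflow:
    """Ordered sequence of named steps applied to the data in turn."""

    def __init__(self, steps):
        self.steps = list(steps)  # (name, function) pairs

    def run(self, data):
        trace = []
        for name, step in self.steps:
            data = step(data)
            trace.append(name)  # record the step that produced this result
        return data, trace


workflow = Workflow([
    ("drop_missing", lambda values: [v for v in values if v is not None]),
    ("to_ratio", lambda values: [v / max(values) for v in values]),
    ("average", lambda values: sum(values) / len(values)),
])

# The same stored workflow is applied to two different inputs.
first_result, first_trace = workflow.run([2, None, 4])
second_result, second_trace = workflow.run([1, 1, None, 2])
```

Because the steps live in data rather than in ad hoc commands, the trace documents how a result was obtained and the run can be repeated when new data arrives.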
Questions?