6. Install the client tools
You can use a virtual environment (recommended)
mkvirtualenv vssml18
and install BigMLer and the Python bindings
pip install bigmler
#VSSML18 Practical workshops: API and reuse September 13–14, 2018 6 / 22
7. Set your credentials
They can be exported as environment variables
export BIGML_USERNAME=[username]
export BIGML_API_KEY=[api_key]
For Windows users
setx BIGML_USERNAME [username]
setx BIGML_API_KEY [api_key]
The username and API key can be found in the account information
section
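Once exported, the credentials can be read back from the environment with the Python standard library; the BigML bindings pick up the same variables automatically when a connection is created with no arguments. A minimal sketch — get_credentials is a hypothetical helper and demo_user/demo_key are stand-in values, not real credentials:

```python
import os

# Stand-in values; in practice these come from the exports above.
os.environ["BIGML_USERNAME"] = "demo_user"
os.environ["BIGML_API_KEY"] = "demo_key"

def get_credentials():
    """Read BigML credentials from the environment, failing early if unset."""
    username = os.environ.get("BIGML_USERNAME")
    api_key = os.environ.get("BIGML_API_KEY")
    if not (username and api_key):
        raise RuntimeError("BIGML_USERNAME and BIGML_API_KEY must be set")
    return username, api_key

# The Python bindings read the same variables automatically:
#   from bigml.api import BigML
#   api = BigML()   # uses BIGML_USERNAME / BIGML_API_KEY
print(get_credentials())
```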
8. Download the reference repo
https://github.com/mmerce/notebooks
and use the vssml18 folder
or open it through mybinder.org
9. Outline
1 Preparing the environment
2 Data wrangling
3 Feature engineering
4 Model tuning
5 Predictions integration
6 Workflows
10. Data dictionary
Defining the types of fields
Models process each field according to its type
Numeric: an ordered, unbounded sequence of values
Categorical: an unordered enumeration of values
Datetime: day, month, year, etc.
Text: full text, a composed type handled as a bag of words
Items: a composed type, a list of elements separated by a token
The data dictionary must be set carefully for the model to interpret
your data correctly
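A sketch of the kind of type inference a platform performs when building the data dictionary — infer_optype is a hypothetical helper with deliberately simplified heuristics (the real ones are richer), and the max_categories cutoff is an arbitrary choice:

```python
def infer_optype(values, max_categories=8):
    """Guess a field's optype from its raw string values (simplified)."""
    non_missing = [v for v in values if v not in ("", None)]
    try:
        for v in non_missing:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        pass
    # Few distinct values -> categorical; otherwise treat as free text.
    if len(set(non_missing)) <= max_categories:
        return "categorical"
    return "text"

print(infer_optype(["1", "2", "3.5"]))        # all parse as numbers
print(infer_optype(["red", "blue", "red"]))   # a small enumeration
```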
11. Missing tokens
Missings: meaningful or replaceable
The absence of a value can be
Meaningful: either the model can treat a missing value as a new
category, or you need to build a new predicate and feed it
to the model
Replaceable: if the model cannot deal with missing values, you can
often fill in a sensible value: mean, zero, min, etc.
When the percentage of training instances with missing values is
small and enough data remains, these instances can simply be
discarded during training.
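The replaceable case can be sketched in plain Python — fill_missing is a hypothetical helper, not part of any library, covering the mean/zero/min strategies mentioned above:

```python
def fill_missing(column, strategy="mean"):
    """Replace None entries in a numeric column with a sensible value."""
    present = [v for v in column if v is not None]
    if strategy == "mean":
        fill = sum(present) / len(present)
    elif strategy == "zero":
        fill = 0
    elif strategy == "min":
        fill = min(present)
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return [fill if v is None else v for v in column]

print(fill_missing([1.0, None, 3.0]))          # mean of 1 and 3
print(fill_missing([1.0, None, 3.0], "zero"))
```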
12. Errors
Fixing errors
Errors can be detected automatically when the values in a field are
not compatible with its type
Datetime: the contents of the field cannot be parsed with the
declared datetime format
Numeric: the contents of the field are not a number
However, some errors can still pass the type coherence test (e.g. a
mistyped but still valid number).
Errors need to be addressed and, as with missing values, either the
value is replaced by a sensible alternative or the row is discarded.
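Type-incompatible values can be found by attempting the declared type's parse on each value; a sketch for numeric fields, where find_type_errors is a hypothetical helper:

```python
def find_type_errors(values, parse=float):
    """Return the indices of values that fail the declared type's parser."""
    errors = []
    for i, v in enumerate(values):
        try:
            parse(v)
        except (TypeError, ValueError):
            errors.append(i)
    return errors

rows = ["3.14", "n/a", "42", "7,5"]
print(find_type_errors(rows))  # "n/a" and "7,5" cannot be parsed as floats
```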
14. Feature selection
Non-preferred fields
Features can be excluded from model analysis because their values are
Constant: the field contains a single value across all instances
Unique: the field contains a different value for every instance
Highly sparse: only a very low percentage of instances have non-missing
values in the field
Redundant: the field is correlated with another one
Unrelated: the field contains reference information or is entirely
irrelevant to the problem being solved
Supervised selection
In supervised problems, the relevant features can be preselected
according to some importance or evaluation metric
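The constant, unique, and highly sparse rules above amount to simple per-column checks; a sketch where non_preferred is a hypothetical helper and the 5% sparseness threshold is an arbitrary choice:

```python
def non_preferred(columns, sparse_threshold=0.05):
    """Flag columns that are constant, unique per row, or highly sparse."""
    flagged = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) < sparse_threshold * len(values):
            flagged[name] = "highly sparse"
        elif len(set(present)) == 1:
            flagged[name] = "constant"
        elif len(set(present)) == len(values):
            flagged[name] = "unique"
    return flagged

data = {
    "id":    [1, 2, 3, 4],           # unique -> reference information
    "const": ["x", "x", "x", "x"],   # constant
    "age":   [33, 41, 29, 33],       # a normal feature, kept
}
print(non_preferred(data))
```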
15. Feature generation
Transforming datasets
New features can be computed from the existing ones and added to
the training datasets to improve model performance.
Combinations: combining existing features with operations such as
subtractions or ratios
Predicates: adding new information to the dataset by providing
predicates on the fields, such as whether a value is odd or even
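Both kinds of generated features can be sketched over rows of plain dicts; the field names and the add_features helper are illustrative, not from any library:

```python
def add_features(rows):
    """Add a ratio combination and an even/odd predicate to each row."""
    out = []
    for row in rows:
        new = dict(row)
        # Combination: ratio of two existing numeric fields.
        new["debt_to_income"] = row["debt"] / row["income"]
        # Predicate: a boolean test on an existing field.
        new["age_is_even"] = row["age"] % 2 == 0
        out.append(new)
    return out

rows = [{"debt": 50.0, "income": 200.0, "age": 34}]
print(add_features(rows))
```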
17. Automating model configuration
Optimizing models
Models can be tuned by adjusting their configurations to better fit our
data. Examples of automatic optimizations are
Optimized: automatic search for the best configuration for a given
model type
OptiML: automatic search for the best type of model and
configuration according to an evaluation metric
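These searches run server-side on the platform, but the underlying idea is a search over candidate configurations scored by an evaluation metric. A local toy sketch with hypothetical names and a made-up scoring function:

```python
def best_configuration(candidates, evaluate):
    """Return the candidate configuration with the highest evaluation score."""
    return max(candidates, key=evaluate)

# Toy stand-in metric: pretend performance peaks at depth 8.
candidates = [{"max_depth": d} for d in (2, 4, 8, 16)]
score = lambda cfg: -abs(cfg["max_depth"] - 8)
print(best_configuration(candidates, score))
```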
19. Local vs. remote predictions
Depending on the requirements
Single: usually for sparse or distributed requests that need
immediate predictions
Batch: for cumulative or periodic requests for predictions
Depending on the integration level
Remote: usually used for batch predictions, when the scalability
and parallelism of the server justify the latency of the
call
Local: for offline settings or low-latency predictions
21. Automating the entire solution
An ML solution is rarely a single model
The solution to a Machine Learning problem is usually a sequence of
steps involving different models and transformations: a workflow. A
workflow has to be stored in a programmable way so that it is
Traceable: describing the steps that led to the solution
Repeatable: allowing repetition with different or cumulative data
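Stored programmatically, a workflow is an ordered list of named steps whose execution can be logged (traceable) and re-run on new data (repeatable). Everything below is an illustrative sketch, not platform code:

```python
def run_workflow(steps, data, log=None):
    """Run named steps in order, recording each one for traceability."""
    for name, step in steps:
        data = step(data)
        if log is not None:
            log.append(name)
    return data

steps = [
    ("clean",  lambda rows: [r for r in rows if r is not None]),
    ("square", lambda rows: [r * r for r in rows]),
]
trace = []
print(run_workflow(steps, [1, None, 3], log=trace))
print(trace)  # the recorded steps make the run traceable
```

Re-running the same steps list on a fresh batch of data gives the repeatability the slide describes.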