6. Install the client tools
You can use a virtual environment (recommended)
mkvirtualenv vssml18
and install BigMLer and the Python bindings
pip install bigmler
#VSSML18 Practical workshops: API and reuse September 13–14, 2018 6 / 22
7. Set your credentials
They can be exported as environment variables
export BIGML_USERNAME=[username]
export BIGML_API_KEY=[api_key]
For Windows users
setx BIGML_USERNAME [username]
setx BIGML_API_KEY [api_key]
The username and API key can be found in the account information
section
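Once exported, the credentials can be read back from the environment with the Python standard library; the BigML bindings pick up the same variables automatically when a connection is created with no arguments. A minimal sketch — get_credentials is a hypothetical helper and demo_user/demo_key are stand-in values, not real credentials:

```python
import os

# Stand-in values; in practice these come from the exports above.
os.environ["BIGML_USERNAME"] = "demo_user"
os.environ["BIGML_API_KEY"] = "demo_key"

def get_credentials():
    """Read BigML credentials from the environment, failing early if unset."""
    username = os.environ.get("BIGML_USERNAME")
    api_key = os.environ.get("BIGML_API_KEY")
    if not (username and api_key):
        raise RuntimeError("BIGML_USERNAME and BIGML_API_KEY must be set")
    return username, api_key

# The Python bindings read the same variables automatically:
#   from bigml.api import BigML
#   api = BigML()   # uses BIGML_USERNAME / BIGML_API_KEY
print(get_credentials())
```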
8. Download the reference repo
https://github.com/mmerce/notebooks
and use the vssml18 folder
or open it through mybinder.org
9. Outline
1 Preparing the environment
2 Data wrangling
3 Feature engineering
4 Model tuning
5 Predictions integration
6 Workflows
10. Data dictionary
Defining the types of fields
Models process each field according to its type
Numeric: an ordered, unbounded sequence of values
Categorical: an unordered enumeration of values
Datetime: day, month, year, etc.
Text: full text, a composed type handled as a bag of words
Items: a composed type, a list of elements separated by a token
The data dictionary must be set carefully for the model to interpret
your data correctly
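A sketch of the kind of type inference a platform performs when building the data dictionary — infer_optype is a hypothetical helper with deliberately simplified heuristics (the real ones are richer), and the max_categories cutoff is an arbitrary choice:

```python
def infer_optype(values, max_categories=8):
    """Guess a field's optype from its raw string values (simplified)."""
    non_missing = [v for v in values if v not in ("", None)]
    try:
        for v in non_missing:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        pass
    # Few distinct values -> categorical; otherwise treat as free text.
    if len(set(non_missing)) <= max_categories:
        return "categorical"
    return "text"

print(infer_optype(["1", "2", "3.5"]))        # all parse as numbers
print(infer_optype(["red", "blue", "red"]))   # a small enumeration
```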
11. Missing tokens
Missings: meaningful or replaceable
The absence of a value can be
Meaningful: either the model can treat a missing value as a new
category, or you need to build a new predicate and feed it
to the model
Replaceable: if the model cannot deal with missing values, you can
often fill in a sensible value: mean, zero, min, etc.
When the percentage of training instances with missing values is
small and enough data remains, these instances can simply be
discarded during training.
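The replaceable case can be sketched in plain Python — fill_missing is a hypothetical helper, not part of any library, covering the mean/zero/min strategies mentioned above:

```python
def fill_missing(column, strategy="mean"):
    """Replace None entries in a numeric column with a sensible value."""
    present = [v for v in column if v is not None]
    if strategy == "mean":
        fill = sum(present) / len(present)
    elif strategy == "zero":
        fill = 0
    elif strategy == "min":
        fill = min(present)
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return [fill if v is None else v for v in column]

print(fill_missing([1.0, None, 3.0]))          # mean of 1 and 3
print(fill_missing([1.0, None, 3.0], "zero"))
```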
12. Errors
Fixing errors
Errors can be detected automatically when the values in a field are
not compatible with its type
Datetime: the contents of the field cannot be parsed with the
declared datetime format
Numeric: the contents of the field are not a number
However, some errors can still pass the type coherence test (e.g. a
mistyped but still valid number).
Errors need to be addressed and, as with missing values, either the
value is replaced by a sensible alternative or the row is discarded.
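Type-incompatible values can be found by attempting the declared type's parse on each value; a sketch for numeric fields, where find_type_errors is a hypothetical helper:

```python
def find_type_errors(values, parse=float):
    """Return the indices of values that fail the declared type's parser."""
    errors = []
    for i, v in enumerate(values):
        try:
            parse(v)
        except (TypeError, ValueError):
            errors.append(i)
    return errors

rows = ["3.14", "n/a", "42", "7,5"]
print(find_type_errors(rows))  # "n/a" and "7,5" cannot be parsed as floats
```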
14. Feature selection
Non-preferred fields
Features can be excluded from model analysis because their values are
Constant: the field contains a single value across all instances
Unique: the field contains a different value for every instance
Highly sparse: only a very low percentage of instances have non-missing
values in the field
Redundant: the field is correlated with another one
Unrelated: the field contains reference information or is entirely
irrelevant to the problem being solved
Supervised selection
In supervised problems, the relevant features can be preselected
according to some importance or evaluation metric
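The constant, unique, and highly sparse rules above amount to simple per-column checks; a sketch where non_preferred is a hypothetical helper and the 5% sparseness threshold is an arbitrary choice:

```python
def non_preferred(columns, sparse_threshold=0.05):
    """Flag columns that are constant, unique per row, or highly sparse."""
    flagged = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) < sparse_threshold * len(values):
            flagged[name] = "highly sparse"
        elif len(set(present)) == 1:
            flagged[name] = "constant"
        elif len(set(present)) == len(values):
            flagged[name] = "unique"
    return flagged

data = {
    "id":    [1, 2, 3, 4],           # unique -> reference information
    "const": ["x", "x", "x", "x"],   # constant
    "age":   [33, 41, 29, 33],       # a normal feature, kept
}
print(non_preferred(data))
```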
15. Feature generation
Transforming datasets
New features can be computed from the existing ones and added to
the training datasets to improve model performance.
Combinations: combining existing features with operations such as
subtractions or ratios
Predicates: adding new information to the dataset by providing
predicates on the fields, such as whether a value is odd or even
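Both kinds of generated features can be sketched over rows of plain dicts; the field names and the add_features helper are illustrative, not from any library:

```python
def add_features(rows):
    """Add a ratio combination and an even/odd predicate to each row."""
    out = []
    for row in rows:
        new = dict(row)
        # Combination: ratio of two existing numeric fields.
        new["debt_to_income"] = row["debt"] / row["income"]
        # Predicate: a boolean test on an existing field.
        new["age_is_even"] = row["age"] % 2 == 0
        out.append(new)
    return out

rows = [{"debt": 50.0, "income": 200.0, "age": 34}]
print(add_features(rows))
```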
17. Automating model configuration
Optimizing models
Models can be tuned by adjusting their configurations to better fit our
data. Examples of automatic optimizations are
Optimized: automatic search for the best configuration for a given
model type
OptiML: automatic search for the best type of model and
configuration according to an evaluation metric
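These searches run server-side on the platform, but the underlying idea is a search over candidate configurations scored by an evaluation metric. A local toy sketch with hypothetical names and a made-up scoring function:

```python
def best_configuration(candidates, evaluate):
    """Return the candidate configuration with the highest evaluation score."""
    return max(candidates, key=evaluate)

# Toy stand-in metric: pretend performance peaks at depth 8.
candidates = [{"max_depth": d} for d in (2, 4, 8, 16)]
score = lambda cfg: -abs(cfg["max_depth"] - 8)
print(best_configuration(candidates, score))
```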
19. Local vs. remote predictions
Depending on the requirements
Single: usually for sparse or distributed requests that need
immediate predictions
Batch: for cumulative or periodic requests for predictions
Depending on the integration level
Remote: usually used for batch predictions, when the scalability
and parallelism of the server justify the latency of the
call
Local: for offline settings or low-latency predictions
21. Automating the entire solution
An ML solution is rarely a single model
The solution to a Machine Learning problem is usually a sequence of
steps involving different models and transformations: a workflow. A
workflow has to be stored in a programmable way so that it is
Traceable: describing the steps that led to the solution
Repeatable: allowing repetition with different or cumulative data
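Stored programmatically, a workflow is an ordered list of named steps whose execution can be logged (traceable) and re-run on new data (repeatable). Everything below is an illustrative sketch, not platform code:

```python
def run_workflow(steps, data, log=None):
    """Run named steps in order, recording each one for traceability."""
    for name, step in steps:
        data = step(data)
        if log is not None:
            log.append(name)
    return data

steps = [
    ("clean",  lambda rows: [r for r in rows if r is not None]),
    ("square", lambda rows: [r * r for r in rows]),
]
trace = []
print(run_workflow(steps, [1, None, 3], log=trace))
print(trace)  # the recorded steps make the run traceable
```

Re-running the same steps list on a fresh batch of data gives the repeatability the slide describes.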