PRODUCTIONALIZING ML:
REAL EXPERIENCE
Ihor Bobak
Data Scientist, EPAM Systems
SCOPE, ARCHITECTURE,
CONSIDERATIONS
3
PROJECT INFO
Customer:
A Canadian company that provides different fleet management services.
E.g. it runs a call center that handles all the maintenance and repairs of vehicles (acts as a “proxy”
between a client and service providers).
Use case:
A fleet owner contacts the agent to ask for assistance with the maintenance. The agent contacts
nearby service providers, gets offers, selects the supplier, negotiates the price for each line of the
maintenance order.
Problem:
a) Price negotiation takes up agents' time
b) Agents need to remember the details on cars/makes/models/spare parts to properly validate the
price.
Solution:
Price Prediction Web Service (based on ML) which predicts the maintenance price based on the
information about the vehicle, type of service, client location, etc.
4
DATA SCIENCE SCOPE
Data Extraction
Destination: parquet files on HDFS,
scope: last 2 years, output: 28 mln. rows
Data Transformation
Filtering, joining, new fields/expressions,
destination: parquet files on HDFS, output: 5 mln. rows
ML Pipeline
label encoding, one hot encoding, vector assembling, training
XGBoost models, performance metrics.
Data Sources
Sybase, Cassandra, others
Typical scope of a data scientist's work. We built two models:
• a classification model answering the question "is the price for the repair relevant or not?"
• a regression model (the customer's choice) answering the question "what is the recommended price for this maintenance item?"
Results: 2x speed-up of agents' work; 1.10x decrease in savings (due to false negatives)
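Referring to the "ML Pipeline" step above, a minimal PySpark sketch of what the "label encoding, one hot encoding, vector assembling" part can look like on Spark 2.3 (column names here are assumptions, and the XGBoost training on the assembled vectors is outside this snippet):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler

cat_cols = ["make", "model", "supplier_city"]          # illustrative categorical features
num_cols = ["quantity", "odometer_reading"]            # illustrative numeric features

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in cat_cols]                          # label encoding
encoder = OneHotEncoderEstimator(inputCols=[c + "_idx" for c in cat_cols],
                                 outputCols=[c + "_ohe" for c in cat_cols])  # one hot encoding
assembler = VectorAssembler(inputCols=[c + "_ohe" for c in cat_cols] + num_cols,
                            outputCol="features")       # vector assembling

pipeline = Pipeline(stages=indexers + [encoder, assembler])
# features_df = pipeline.fit(train_df).transform(train_df)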
5
SOLUTION ARCHITECTURE
Client
application
(scoring service
consumer)
Scoring Web
Service
(SOAP+REST)
Stack: Java, Spring
Data Extraction
Destination: parquet files on HDFS,
scope: last 2 years, output: 28 mln. rows
Data Transformation
Filtering, joining, new fields/expressions,
destination: parquet files on HDFS, output: 5 mln. rows
ML Pipeline
label encoding, one hot encoding, vector assembling, training
XGBoost models, performance metrics.
Data Sources
Sybase, Cassandra, others
Uploading Results
Upload models, training metrics, lookup tables, labels for
category variables. Destination: S3, output: 150MB zip archive
HTTP (SOAP)
requests for
price prediction
Downloading
new models
Scheduled
run of
training
process
Administrator
HTTP (REST)
management
commands
Model Storage
(HDFS/S3)
models, variables, labels,
lookup tables
6
TECHNOLOGY STACK
Training part:
• Jupyter notebook
• Spark 2.3.1
• Pandas, Numpy, Scikit-learn and other libraries
• XGBoost
Scoring part (web service):
• Java
• Spring Boot
• xgboost-predictor-java library
https://github.com/komiya-atsushi/xgboost-predictor-java
• Lots of other open source Java libraries
7
TASKS IN DIFFERENT ENVIRONMENTS
Exploration/Development Production
Training Scoring
Environment:
• Python, Jupyter Notebook
• Spark/Scikit/XGBoost, etc.
Tasks:
• Get input data
• Rename fields
• Check values
• Modify field values
• Add new fields
• Filter rows
• Join other tables
• ML Pipeline tasks:
• Label encoding
• One hot encoding
• Train/test split
• Model training
• Metrics calculation
Environment:
• Java, Spring Boot, etc.
Tasks/Challenges:
We need to do the same things in Java:
“rename fields”, “check values”, “modify
field values”, “add fields”, “filter rows”,
“join other tables” and some of the
"ML pipeline tasks" ("label encoding",
“one hot encoding”, “scoring by model”)
Challenges:
a) How to represent data?
b) What libraries to use for transformation?
c) What libraries to use for ML pipeline
tasks?
d) What libraries to use for scoring?
Environment:
• the same as exploration/development
Goal:
re-use the same code as much as possible.
Other tasks that we need to do here:
• Scheduled running of the whole training
cycle
• Uploading of results to some storage
(S3/HDFS)
• Alerting if metrics are below the
expectations
• Alerting if errors occurred during
training
8
SCORING LIBRARY
9
Spring Boot Web Service
Price Prediction Service Library
SCORING WEB SERVICE ARCHITECTURE
REST
Controller
SOAP
Endpoint
HTTP
Request
Transformers
Vectorization
(Label Encoder,
One-hot
encoder)
Scoring
(ML
models)
HTTP
Response
Payload
Logging
ML Data
Serialized Models + vectorization info (variables, labels, data types).
Lookup tables (for enriching the feature records).
The scoring web service is a pure Java solution using the artifacts ("ML Data") output by the Python training code.
*Many other things (payload shipment on S3/HDFS, updater of the ML Data files, management interface) are not shown here and will be shown later.
10
PRICE PREDICTION SERVICE LIBRARY
Price Prediction Service Library
Transformers
Vectorization
(Label Encoder,
One-hot
encoder)
Scoring
ML Data
Serialized Models + vectorization info (variables, labels, data types).
Lookup tables (for enriching the feature records).
Input Output
Input: maintenance order
• Order: VIN, country, supplier_id
• Line: repair code, ATA category, quantity.
Note: NO FEATURES HERE
The scoring web service needs to do the same
things as the notebook, but on smaller data (5-10
lines per order):
a) It filters records (e.g. “remove Mexico data”)
b) It adds columns (=features), e.g. VIN =>
make, model, engine size, etc. Often done by
doing a lookup (=join to other table);
c) It generates features, e.g. ata_key =
ata_category + "_" + ata_subcategory.
Output: price prediction for every line.
11
NOTEBOOK DATA DUMPS
Training notebook dumps lots of things: models, lookup tables, data for the integration tests
Aggregations
Root data
and lookups
Variables
configuration
ML models
and test
dataset
12
BASE CLASSES
VIN Supplier_ID Repair code ATA cat. ATA subcat Qty
1HGBH41JXMN109186 123456 REP 74 001003 1
[Same columns] Make Model Fuel Type … Supplier City
[Same columns] BMW X5 petrol … San Francisco
FeatureRecord - container for
features
FeatureRecordGroupedSet –
grouping by any field (in our
case – by order id)
Transformers – enrich FR with features:
• LookupTransformer – adds new columns
• ApplicableTransformer – stops processing
if some field is not in the lookup table
• OilGroupTransformer – groups similar
records
• Etc. (many others exist)
Make_BMW Make_Ford … Quantity … Prediction
1.0 0 … 1 … $55
MultiModel:
vectorizes the feature record and
does prediction
13
LOOKUP TRANSFORMER
Purpose: enrich the feature record with real features by making a lookup. Backed by:
• InMemoryIndexedDataFrame – a fast in-memory lookup
• IndexedDataFrameReader – reader of the df.csv.gz + df.schema pair of files
.schema file:
.csv.gz file example:
ata_ctgy_cd,ata_sub_ctgy_cd,ata_cd_long_desc,english_cd_long_desc,cd_stat_ind
17,001100,NEW TIRE RADIAL STEEL BELTED,NEW TIRE RADIAL STEEL BELTED,A
17,003001,USED TIRE,USED TIRE,A
10,02,010045,MIRROR SPOT,MIRROR SPOT,A
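A minimal Python sketch of the same lookup idea (the real classes are Java; the .schema format here is assumed to be JSON with column metadata, which may differ from the actual file):

import json
import pandas as pd

def read_indexed_frame(csv_gz_path, schema_path, key_cols):
    with open(schema_path) as f:
        schema = json.load(f)                 # assumed: column names/types/keys
    df = pd.read_csv(csv_gz_path, compression="gzip", dtype=str)
    return df.set_index(key_cols), schema     # indexed frame = fast in-memory lookup

lookup, _ = read_indexed_frame("df.csv.gz", "df.schema",
                               ["ata_ctgy_cd", "ata_sub_ctgy_cd"])
row = lookup.loc[("17", "001100")]            # -> NEW TIRE RADIAL STEEL BELTED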
14
FULL CHAIN TRANSFORMER
Full chain transformer
combines all the atomic
transformations into one chain.
Running transform() gives the
same result as if the same
records were passed through
notebook’s PySpark ETL code.
15
OTHER CLASSES
MultiModel does two things: vectorization of the feature record (into a sparse vector of
doubles) and prediction. It encapsulates many XGBoost models (a separate model for every
ATA code – the subject of repair).
It uses biz.k11i.xgboost (https://github.com/komiya-atsushi/xgboost-predictor-java):
MultiModelReader is a class to load the global configuration and all the models from
the config provider.
ConfigProvider is an abstraction which allows reading resources from one place.
Currently there is just one implementation – ZipConfigProvider (which reads everything
from a single zip file).
PricePredictionService – a class which combines FullChainTransformer and MultiModel.
16
CONFIGURATION
Configuration structure (contents of the zip file):
• agg
• agg.json - aggregations configuration file
• <set of folders for each ata_key>
• <set of pairs .csv.gz+.schema for each aggregation>,
e.g. agg_vin_model.csv.gz, agg_vin_model.schema
• config
• global.json – configuration file that describes all models, their variables, and
possible labels for cat. variables
• models
• <for every ata_key: .bin, .config and .txt file> - serialized models by the
XGBoost
• lookup
• <pairs of .csv.gz + .schema files for lookups>
ZipConfigProvider reads this zip file (size ≈ 200MB) produced by the notebook.
Uncompressed size in the Java structures that we chose: 1.8 GB.
17
UNIT & INTEGRATION
TESTING
18
UNIT VS. INTEGRATION TESTING
Unit test | Integration test
Results depend only on Java code | Results also depend on external systems/data
Easy to write and verify | Setup of an integration test might be complicated
A single class/unit is tested in isolation | One or more components are tested together
All dependencies are mocked if needed | No mocking is used (or only unrelated components are mocked)
The test verifies only the implementation of the code | The test verifies the implementation of individual components and their interconnection behavior when used together
A unit test uses only JUnit/TestNG and a mocking framework | An integration test can use real containers and real DBs as well as special integration testing frameworks (e.g. Arquillian or DbUnit)
Mostly used by developers | Integration tests are also useful to QA, DevOps, Help Desk
A failed unit test is always a regression (if the business has not changed) | A failed integration test can also mean that the code is still correct but the environment has changed
Unit tests in an Enterprise application should last about 5 minutes | Integration tests in an Enterprise application can last for hours
19
INTEGRATION TESTING
Goal: to ensure that the web service is doing EXACTLY THE SAME THINGS as the training notebook does.
Training notebook outputs:
• mldata_20180807_093948.zip (200MB) - scoring configuration (ML models, lookups, variable configuration, etc.)
• mldata_test_20180807_093948.zip (600 MB) – scoring configuration + IT data:
What do we check:
• Take the input dataset (VIN, supplier_id, country, odometer reading, ATA category/subcategory, repair code, parts
quantity).
• Pass through FullChainTransformer and check if “Features by Python” = “Features by Java”
• Get predictions using MultiModel and check if “Prediction by Python” = “Prediction by Java”
Input Data
(all test dataset but JUST INPUT features)
VIN, supplier_id, ATA code, odom. reading, qnty
1M records
Test Data
with ALL features and predictions
All features (make, model, etc) + predictions
400K records
20
INTEGRATION TESTING
Maven life cycle phases:
1. validate - checks if the project is correct and all information is available
2. compile - compiles source code in binary artifacts
3. test - executes the tests
4. package - takes the compiled code and packages it, for example into a JAR
5. integration-test - takes the packaged result and executes additional tests which require the packaging*
6. verify - performs checks to confirm the package is valid
7. install - installs the result of the package phase into the local Maven repository
8. deploy - deploys the package to a target, e.g. a remote repository
Example of how to run:
mvn clean install -Dmldata=/path/to/mldata_test_20181010_150000.zip
21
CHECKED HASH MAP AS FEATURE CONTAINER
CheckedHashMap – a subclass of HashMap with explicit operations.
Goal: avoid errors, increase control.
Overridden operations:
• put(): fails if the key already exists in the map
• get(): fails if the key doesn’t exist in the map
• remove(): fails if the key doesn’t exist in the map
New Operations:
• overwrite(): the key must exist, otherwise it will fail
• overwriteIfExists(): overwrites if the key exists, otherwise does nothing
• putIfNotExists(): doesn’t fail, works only if the key doesn’t exist
• putOrOverwrite(): no matter if the key exists or not, puts or overwrites it there
Left as is:
• getOrDefault(): if the key doesn't exist, it will return a default
All operations do NOT allow null keys!
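A Python sketch of the same semantics (the real CheckedHashMap is a Java HashMap subclass; method names mirror the slide):

class CheckedDict(dict):
    def put(self, key, value):
        if key is None:
            raise ValueError("null keys are not allowed")
        if key in self:
            raise KeyError("put(): key %r already exists" % (key,))
        dict.__setitem__(self, key, value)

    def get(self, key):
        if key not in self:
            raise KeyError("get(): key %r doesn't exist" % (key,))
        return dict.__getitem__(self, key)

    def overwrite(self, key, value):
        if key not in self:
            raise KeyError("overwrite(): key %r doesn't exist" % (key,))
        dict.__setitem__(self, key, value)

    def put_or_overwrite(self, key, value):
        if key is None:
            raise ValueError("null keys are not allowed")
        dict.__setitem__(self, key, value)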
22
SERIALIZATION IN HEX
Double and float values inside CSV/LIBSVM files are written like this:
• 3.5/0000000000000c40 - for double values
• 5.1016541/c040a340 – for float values
At the Python side:
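A minimal Python sketch of how such value/hex pairs can be produced and parsed with the struct module (little-endian IEEE-754 bytes, matching the examples above):

import struct

def double_to_hex(x):
    return struct.pack("<d", x).hex()        # 3.5 -> "0000000000000c40"

def hex_to_double(h):
    return struct.unpack("<d", bytes.fromhex(h))[0]

def float_to_hex(x):
    return struct.pack("<f", x).hex()        # 5.1016541 -> "c040a340" (per the slide)

assert double_to_hex(3.5) == "0000000000000c40"
assert hex_to_double("0000000000000c40") == 3.5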
23
SERIALIZATION IN HEX
At the Java side:
• HexParser class with helper static
methods to parse hex values – see the
code.
• Using SuperCSV library – to read the
CSV files.
• Made a class ParseDoubleHex extending
SuperCSV’s CellProcessor leveraging
HexParser to get values out of
3.5/0000000000000c40 .
• The same for float type.
24
PRETTY PRINTING OF ERRORS
Advice: do a “pretty print” of error data. If you don’t do it, fixing bugs will be hard.
25
PRETTY PRINTING OF ERRORS
An example of how easy it is to fix errors when we have pretty printing:
26
MEMORY OPTIMIZATION FOR IT
Before the re-design of integration tests: 11 GB.
After the re-design (added partitioning): 3.8 GB (three times less).
27
SCORING WEB SERVICE
28
Citing “Hidden Technical Debt in Machine Learning Systems”:
It may be surprising to the academic community to know that only a tiny fraction of the
code in many ML systems is actually devoted to learning or prediction:
TECHNICAL DEBT
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
29
Spring Boot Web Service
Price Prediction
Service Library
SCORING WEB SERVICE ARCHITECTURE
REST
Controller
SOAP
Endpoint
HTTP
Request
HTTP
Response
Payload
Logging
ML Data
(called a “config”)
Config
Updater
Payload
Shipping
Management
REST Controller
Client
application
(scoring service
consumer)
Model Storage
(HDFS/S3)
models, variables,
labels, lookup tables
HTTP
Request
Payload Storage
(HDFS/S3)
Daily zip files of all the
payloads
Payloads
Scheduler
(Spring)
VIN Decoder
(external REST service)
30
PRICE PREDICTION CONTROLLER/ENDPOINT
Endpoints:
• /rest – for REST (RestController class)
• /soap – for SOAP (SoapEndpoint class)
• /manage – for management REST requests
Overrides of default behavior:
• for REST: override exception resolver, JSON message
converter (fails on unknown properties)
• for SOAP: proper handling of InvalidXmlException
and SoapMessageCreation – returning "500 Internal
Server Error" instead of "400 Bad Request".
• For both: async uploads of payloads (derived
MessageDispatcherServlet and DispatcherServlet)
31
PROCESSING ALGORITHM
The algorithm is common for REST controller and SOAP endpoint:
• validate the input data
• get the instance of the PricePredictionService from the manager
• perform prediction
• check the status. If it is OK – return the result, otherwise log the error and return the error.
32
REQUEST/RESPONSE IN REST
33
REQUEST/RESPONSE IN SOAP
34
MANAGEMENT CONTROLLER
The web UI is rendered
by Swagger + Springfox
Caveat #1:
it cannot generate good
examples for the
properties of type
Map<Long, SomeClass>
Caveat #2:
It doesn’t cover SOAP
/swagger-ui.html - automatically generated endpoint for help on the REST methods
35
MANAGEMENT CONTROLLER
/manage/info – renders detailed information about the configuration and 10 previous operations
36
MEMORY CLEANING
MemoryCleaner: utility class for cleaning memory.
Used on SWITCH operation.
• uses jlibs-core (https://santhosh-tekuri.github.io/jlibs/): RuntimeUtil.gc() guarantees
that garbage collection happens, unlike System.gc()
• runs a thread that retries freeing up RAM every T seconds, as long as:
• the previous attempt freed up less than X MB
• we have made no more than N attempts
Reason: old REST requests may still be running and hold
a reference to the old instance of the
PricePredictionService.
37
CODE QUALITY: SONARQUBE
38
SONARQUBE
SonarQube is a tool for continuous inspection of code quality.
It helps to find potential bugs, supports code review, checks unit test coverage, etc.
Supports 20 programming languages.
39
Advice:
try to cover all the code
with unit tests
SONARQUBE: CODE COVERAGE
This bug was found after covering these lines with unit tests.
40
SONARQUBE
SonarQube shows
places in the code
where bugs may
occur.
41
SONARQUBE: TRUE POSITIVES
Reference explaining why we need the interrupt:
https://stackoverflow.com/questions/4906799/why-invoke-thread-currentthread-interrupt-in-a-catch-interruptexception-block
42
SONARQUBE: BAD ADVICE
The worst
suggestion of
SonarQube I’ve
ever seen.
43
STATISTICS
Part Lines
Total lines 18539
Source code lines 13139 (71%)
Comment lines 2786 (15%)
Blank lines 2614 (14%)
44
TRAINING PART
(THE “NOTEBOOK”)
45
RUNNING NOTEBOOKS OFFLINE
Goal: to re-use the code in exploration/development and in production
How it works:
nbrun.py notebook.ipynb -o out -k "pyspark 2.3.1" -e dev -i 3 -z results_{}.zip -t 180 -m 5
-o = output folder
-k = kernel name
-e = environment
-i = the cell where to insert the environment load
-z = the zip file pattern (to put the ipynb + *.py files after the run)
-t = timeout for the kernel to start
-m = maximum number of times to try to start the kernel
Algorithm:
• read the .ipynb file (just the code: omitting the output which may be there)
• insert some cells into position “-i”: a cell with profile name and override of the print function
• create the output ("-o") folder, put all the modules there, and make it the working folder
• start the kernel, run the cells, output results to the output folder
• zip the contents of the output folder (the notebook with executed output + .py files)
46
RUNNING NOTEBOOKS OFFLINE
How nbrun works:
• Nbformat – https://github.com/jupyter/nbformat
for reading/writing ipynb, inserting/editing cells
• Nbconvert – https://github.com/jupyter/nbconvert
for running the notebook with a specified kernel and getting the output
Applied tricks:
• Override nbconvert.preprocessors.ExecutePreprocessor:
• preprocess_cell: to measure execution time and add
cell.metadata["ExecuteTime"] = {"end_time": time_end, "start_time": time_start}
• run_cell: to log the execution start/end and result just into console output of nbrun.py
• preprocess: to fix bugs with shutting down the kernel process when we couldn't connect to it, to change the timeout for kernel start, and to retry starting if something failed (which happens quite often with "heavy" kernels like PySpark).
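A minimal sketch of these nbformat/nbconvert mechanics (paths, kernel name and cell position are illustrative; the real nbrun.py also overrides run_cell and preprocess):

import time
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

class TimedExecutePreprocessor(ExecutePreprocessor):
    # record per-cell execution time in cell.metadata["ExecuteTime"]
    def preprocess_cell(self, cell, resources, index):
        start = time.time()
        cell, resources = super(TimedExecutePreprocessor, self).preprocess_cell(cell, resources, index)
        cell.metadata["ExecuteTime"] = {"start_time": start, "end_time": time.time()}
        return cell, resources

nb = nbformat.read("notebook.ipynb", as_version=4)
nb.cells.insert(3, nbformat.v4.new_code_cell("ENV_NAME = 'dev'"))   # the inserted override cell
ep = TimedExecutePreprocessor(timeout=180, kernel_name="pyspark 2.3.1")
ep.preprocess(nb, {"metadata": {"path": "out"}})                     # "out" folder must exist
nbformat.write(nb, "out/notebook.ipynb")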
47
ENVIRONMENT REPLACEMENT
This environment will be
loaded at development
time
Here nbrun.py will insert
a cell overriding the
ENV_NAME
This code
will dynamically load the
env_${ENV_NAME}.py
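A sketch of what the dynamic loading could look like (module and variable names are assumptions):

import importlib

ENV_NAME = "dev"   # set here at development time; nbrun.py inserts a cell that overrides it

env = importlib.import_module("env_" + ENV_NAME)   # loads env_dev.py, env_prod.py, ...
globals().update({k: v for k, v in vars(env).items() if not k.startswith("_")})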
48
PATCHED PRINT FUNCTION
Notebook’s output
System’s output
(nbrun.py)
The print() function is overridden and inserted by nbrun.py.
Goal: to allow having the output in two places – the notebook and stdout of nbrun.py
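A sketch of what the inserted override could look like (an assumption, since the slide shows only screenshots): writing to sys.__stdout__ inside the kernel ends up on the console of the parent nbrun.py process, while the normal print still goes into the notebook cell output.

import builtins
import sys

_original_print = builtins.print

def tee_print(*args, **kwargs):
    _original_print(*args, **kwargs)                        # notebook cell output
    kwargs.pop("file", None)
    _original_print(*args, file=sys.__stdout__, **kwargs)   # stdout of nbrun.py
    sys.__stdout__.flush()

builtins.print = tee_print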
49
ENVIRONMENT PARAMETERS
Environment parameters:
• FILESYSTEM, DB_NAME: place where we will store temporary tables.
Supports: s3, hdfs, cassandra, local, FiloDB
• HDFS/LOCAL/S3 parameters (depending on the type)
• Spark unpersisting parameters: a mode which forces Spark to unpersist the dataframes
(related to a Spark 1.4 bug which caused cascaded unpersisting)
• Upload parameters: S3/HDFS parameters of where to upload the results of the training
• Metrics limits: upper limits for ML metrics (if met, then the new models will be uploaded)
• Datasets: location, table names, SQL statements of how to get the source data
• Thresholds for category variables (e.g. “train for top 1000 makes, ignore the others”)
• TOP_N: how many models to train
• VIN decoder parameters: how to decode the VINs in case they're absent from the lookup
table
50
NOTEBOOK STRUCTURE
The notebook's code is split into sections:
• Environment loading (if run offline, this is done by
nbrun.py)
• Loading modules, initializing shared variables
• Data extraction
• Data transformations (filtering, joining, new features)
• Model training
• Building model metrics and analysis
• Dumping of artifacts (models, lookup, etc.), zipping results
and metrics
Each section reads data from the previous section's results and
saves its own results.
Each section can be switched off during development (to save
execution time).
51
MODULES
Functions were moved into modules.
Reasons:
• Easy to debug
• Easy to see errors
• Easy to do unit tests
• PyCharm capabilities of
code navigation and type hints
Reference: https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html
52
SHARED VARIABLES
Common variables are shared between modules.
Example of how to share variables:
1) Create a module shared.py with variables with
the same names as in the notebook
2) Call "init" at the beginning:
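A sketch of such a shared.py (variable names are illustrative):

# shared.py - module-level variables shared by the notebook and the helper modules
spark = None
sqlContext = None
ENV_NAME = None

def init(a_spark, a_sql_context, a_env_name):
    # called once at the top of the notebook so that the imported modules
    # see the same Spark session and environment settings
    global spark, sqlContext, ENV_NAME
    spark = a_spark
    sqlContext = a_sql_context
    ENV_NAME = a_env_name

In the notebook: import shared; shared.init(spark, sqlContext, ENV_NAME).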
53
MEMORY PROFILING OF THE NOTEBOOK
Memory profiling and usage of "del" at the end of sections:
a simple way is to write a decorator. After that, for "heavy",
frequently used functions, do this:
@profile
def your_func():
...
you will see how much memory the
notebook’s kernel consumed before and after
the function call.
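A minimal sketch of such a decorator (assuming psutil is available on the driver; the real decorator may log more detail):

import functools
import os
import psutil

def profile(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        proc = psutil.Process(os.getpid())
        before = proc.memory_info().rss / 1024 ** 2   # RSS in MB before the call
        result = func(*args, **kwargs)
        after = proc.memory_info().rss / 1024 ** 2    # RSS in MB after the call
        print("%s: %.0f MB -> %.0f MB" % (func.__name__, before, after))
        return result
    return wrapper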
54
MEMORY PROFILING OF THE NOTEBOOK
Each section ends with a check for unreleased Pandas and Spark dataframes.
Results:
• decreased the notebook's memory from 12 GB to 4 GB
• removed all cached Spark dataframes from the cluster memory (= decreased the demand for cluster
resources).
55
SPARK
56
HIGH LEVEL API
Goals:
• to simplify the usage of Spark dataframes/SQL
• implicit caching defaults for common operations
• to minimize errors.
Commonly used functions:
• load_df(db_name, table_name, ...), save_df(df, db_name, table_name, ...)
• change_df(a_select_cols, a_drop, a_replace, a_rename, a_add, a_distinct, a_order_by, a_drop_end, a_filter_df,
a_filter_columns, a_filter_not_df, a_filter_not_columns, a_where)
• join_df, group_by
• filter_by_where, filter_by_df, filter_by_not_df, filter_by_threshold, filter_duplicates
Example: new_df = change_df(df, a_add={"new_col": "case when col > 0.0 then 1 else 0 end"})
instead of new_df = df.withColumn("new_col", F.when(df["col"] > 0.0, 1).otherwise(0))
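A hypothetical, heavily simplified version of such a wrapper (the real change_df supports many more parameters, as listed above):

from pyspark.sql import functions as F

def change_df(df, a_select_cols=None, a_add=None, a_where=None):
    if a_add:
        for col_name, sql_expr in a_add.items():
            df = df.withColumn(col_name, F.expr(sql_expr))   # SQL expression given as a string
    if a_where:
        df = df.where(a_where)
    if a_select_cols:
        df = df.select(*a_select_cols)
    return df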
57
CREATEDATAFRAME MONKEY PATCH
Problem:
the sqlContext.createDataFrame() function doesn't have a numSlices parameter (which is present in
sc.parallelize() and defines the number of partitions). This is true up to Spark 2.3.1.
Why it is important:
to control the number of partitions when converting a Pandas dataframe into a Spark dataframe.
Solution: patch three functions (code is in the sparkdf.py):
• SparkSession._createFromLocal = _createFromLocalMonkeyPatch
• SparkSession.createDataFrame = createDataFrameMonkeyPatch_session
• SQLContext.createDataFrame = createDataFrameMonkeyPatch_sqlcontext
In all of the overrides, add numSlices and pass it through to sc.parallelize().
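A simplified sketch of the idea (this is not the project's exact sparkdf.py; only the public createDataFrame is wrapped here, and the pandas conversion is naive):

import pandas as pd
from pyspark.sql import SparkSession

_orig_createDataFrame = SparkSession.createDataFrame

def createDataFrame_with_numSlices(self, data, schema=None, numSlices=None, **kwargs):
    if numSlices is not None:
        if isinstance(data, pd.DataFrame):
            if schema is None:
                schema = list(data.columns)
            data = [tuple(row) for row in data.astype(object).values.tolist()]
        # parallelize explicitly so that the number of partitions is under our control
        data = self.sparkContext.parallelize(data, numSlices)
    return _orig_createDataFrame(self, data, schema=schema, **kwargs)

SparkSession.createDataFrame = createDataFrame_with_numSlices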
58
NOTEBOOK KERNEL PARAMETERS
Marked are those parameters which we strongly advise applying for a standalone Spark cluster:
{
"display_name": "pyspark cluster - ibobak - 3e 3c",
"language": "python",
"argv": ["/opt/conda/envs/py27/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
"env": {
"SPARK_HOME":"/opt/spark",
"PYTHONPATH":"/opt/spark/python/lib/py4j-0.10.4-src.zip:/opt/spark/python",
"PYTHONSTARTUP":"/opt/spark/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS":" --packages com.databricks:spark-avro_2.11:3.2.0
--driver-memory 5G --executor-memory 10G --num-executors 3 --executor-cores 3 --total-executor-cores 9
--master spark://10.4.12.36:7077
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=1024m
--conf spark.driver.extraJavaOptions="-Xss16m" --conf spark.executor.extraJavaOptions="-Xss16m"
--conf spark.cassandra.output.consistency.level=ALL --conf spark.cassandra.input.consistency.level=ALL pyspark-shell"
}
}
All three params num-executors, executor-cores and total-executor-cores must be specified (otherwise the number of cores
will be unpredictable). The serialization parameters are strongly advised to speed up dataframe caching. -Xss16m is advised
to avoid stack overflow errors. The Cassandra parameters are needed when you write to Cassandra and don't want records
to be lost after saving and re-loading them.
59
AUTOMATION
60
JENKINS JOBS
Jenkins jobs to run different parts of the flow:
• ml_build_scoreapi: builds the price prediction web service (Java), runs the unit tests, does SonarQube analysis,
uploads the jar to the artifactory.
• ml_config_scoreapi: takes the new config files from Git and puts them on the server (dev/qa/prod), restarts the
service.
• ml_deploy_scoreapi: takes the jar file from the artifactory, puts it on the server (dev/qa/prod), and restarts
the web service.
• ml_deploy_training: takes the notebooks from Git and puts them to the working folder of the training server.
• ml_run_training: runs on the training server these steps:
• Data extraction into parquet files.
• Offline running of the notebooks (using nbrun.py)
• Integration tests
• Checking ML metrics
• Uploading the training results on S3
• Issuing /manage/syncswitch HTTP GET-request on the working instance of the scoring web service.
61
CONTACTS
Ihor Bobak
E-mail: Ihor_Bobak@epam.com
Skype: ibobak
Linkedin: https://www.linkedin.com/in/ibobak
Mais conteúdo relacionado

Mais procurados

[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft Azure[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft AzureKorkrid Akepanidtaworn
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Databricks
 
201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019Mark Tabladillo
 
Continuous Deployment for Deep Learning
Continuous Deployment for Deep LearningContinuous Deployment for Deep Learning
Continuous Deployment for Deep LearningDatabricks
 
ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14andrejusb
 
Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)Steve Lange
 
Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019serge luca
 
Democratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builderDemocratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builderVenkatarangan Thirumalai
 
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflowContinuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflowDatabricks
 
Melbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power AutomateMelbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power AutomateAndre Margono
 
Models in Minutes using AutoML
Models in Minutes using AutoMLModels in Minutes using AutoML
Models in Minutes using AutoMLBill Liu
 
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model ServingDAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model Servingamesar0
 
Tech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet PreviewTech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet PreviewMichal Juhas
 
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...PAPIs.io
 
Building workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration MondayBuilding workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration MondayBizTalk360
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AIVikasBisoi
 
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)serge luca
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro sessionAvinash Patil
 

Mais procurados (20)

[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft Azure[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft Azure
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
 
201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019
 
Continuous Deployment for Deep Learning
Continuous Deployment for Deep LearningContinuous Deployment for Deep Learning
Continuous Deployment for Deep Learning
 
ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14
 
Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)
 
Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019
 
Democratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builderDemocratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builder
 
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflowContinuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
 
Melbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power AutomateMelbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power Automate
 
Models in Minutes using AutoML
Models in Minutes using AutoMLModels in Minutes using AutoML
Models in Minutes using AutoML
 
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model ServingDAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
 
Tech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet PreviewTech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet Preview
 
Wwf
WwfWwf
Wwf
 
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
 
Building workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration MondayBuilding workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration Monday
 
Power Apps for developers
Power Apps for developersPower Apps for developers
Power Apps for developers
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
 
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro session
 

Semelhante a Productionalizing ML : Real Experience

Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineJan Wiegelmann
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...Piyush Kumar
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning DevelopmentMatei Zaharia
 
Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerCisco Canada
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Animesh Singh
 
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMEconfluent
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilSunita Shrivastava
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Overhauling a database engine in 2 months
Overhauling a database engine in 2 monthsOverhauling a database engine in 2 months
Overhauling a database engine in 2 monthsMax Neunhöffer
 
AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)Igor Talevski
 
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...Amazon Web Services
 

Semelhante a Productionalizing ML : Real Experience (20)

Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / Pipeline
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning Development
 
Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data center
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
 
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
 
Informatica slides
Informatica slidesInformatica slides
Informatica slides
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Overhauling a database engine in 2 months
Overhauling a database engine in 2 monthsOverhauling a database engine in 2 months
Overhauling a database engine in 2 months
 
AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)
 
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
 

Último

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Productionalizing ML : Real Experience

  • 1. PRODUCTIONALIZING ML: REAL EXPERIENCE Ihor Bobak Data Scientist, EPAM Systems
  • 3. 3 PROJECT INFO Customer: A Canadian company that provides different fleet management services. E.g. it runs a call center that handles all the maintenance and repairs of vehicles (acts as a “proxy” between a client and service providers). Use case: A fleet owner contacts the agent to ask for assistance with the maintenance. The agent contacts nearby service providers, gets offers, selects the supplier, negotiates the price for each line of the maintenance order. Problem: a) Price negotiation takes agent’s time b) Agents need to remember the details on cars/makes/models/spare parts to properly validate the price. Solution: Price Prediction Web Service (based on ML) which predicts the maintenance price based on the information about the vehicle, type of service, client location, etc.
  • 4. 4 DATA SCIENCE SCOPE Data Extraction Destination: parquet files on HDFS, scope: 2 last years, output: 28 mln. rows Data Transformation Filtering, joining, new fields/expressions, destination: parquet files on HDFS, output: 5 mln. rows ML Pipeline label encoding, one hot encoding, vector assembling, training XGBoost models, performance metrics. Data Sources Sybase, Cassandra, others Typical scope of data scientist’s work: We made two models: • classification model answering the question “is the price for the repair relevant or not”? • regression model (customer’s choice) answering the question “what is the recommended price for this maintenance item?” 2 times boost in agents time 1.10 times decrease in savings (due to FN)
  • 5. 5 SOLUTION ARCHITECTURE Client application (scoring service consumer) Scoring Web Service (SOAP+REST) Stack: Java, Spring Data Extraction Destination: parquet files on HDFS, scope: 2 last years, output: 28 mln. rows Data Transformation Filtering, joining, new fields/expressions, destination: parquet files on HDFS, output: 5 mln. rows ML Pipeline label encoding, one hot encoding, vector assembling, training XGBoost models, performance metrics. Data Sources Sybase, Cassandra, others Uploading Results Upload models, training metrics, lookup tables, labels for category variables. Destination: S3, output: 150MB zip archive HTTP (SOAP) requests for price prediction Downloading new models Scheduled run of training process Administrator HTTP (REST) management commands Model Storage (HDFS/S3) models, variables, labels, lookup tables
  • 6. 6 TECHNOLOGY STACK Training part: • Jupyter notebook • Spark 2.3.1 • Pandas, Numpy, Scikit-learn and other libraries • XGBoost Scoring part (web service): • Java • Spring Boot • xgboost-predictor-java library https://github.com/komiya-atsushi/xgboost-predictor-java • Lots of other open source Java libraries
  • 7. 7 TASKS AT DIFFERENT ENVIRONMENTS Exploration/Development Production Training ScoringEnvironment: • Python, Jupyter Notebook • Spark/Scikit/XGBoost, etc. Tasks: • Get input data • Rename fields • Check values • Modify field values • Add new fields • Filter rows • Join other tables • ML Pipeline tasks: • Label encoding • One hot encoding • Train/test split • Model training • Metrics calculation Environment: • Java, Spring Boot, etc. Tasks/Challenges: We need to do the same things on Java: “rename fields”, “check values”, “modify field values”, “add fields”, “filter rows”, “join other tables” and some of the “ML pipepine tasks” (“label encoding”, “one hot encoding”, “scoring by model”) Challenges: a) How to represent data? b) What libraries to use for transformation? c) What libraries to use for ML pipeline tasks? d) What libraries to use for scoring? Environment: • the same as exploration/development Goal: re-use the same code as much as possible. Other tasks that we need to do here: • Scheduled running of the whole training cycle • Uploading of results to some storage (S3/HDFS) • Alerting if metrics are below the expectations • Alerting if errors occurred during training
  • 9. 9 Spring Boot Web Service Price Prediction Service Library SCORING WEB SERVICE ARCHITECTURE REST Controller SOAP Endpoint HTTP Request Transformers Vectorization (Label Encoder, One-hot encoder) Scoring (ML models) HTTP Response Payload Logging ML Data Serialized Models + vectorization info (variables, labels, data types). Lookup tables (for enriching the feature records). The scoring web service is a purely Java solution using the artifacts (“ML Data”) output by the Python’s training code. *Many other things (payload shipment on S3/HDFS, updater of the ML Data files, management interface) are not shown here and will be shown later.
  • 10. 10 PRICE PREDICTION SERVICE LIBRARY Price Prediction Service Library Transformers Vectorization (Label Encoder, One-hot encoder) Scoring ML Data Serialized Models + vectorization info (variables, labels, data types). Lookup tables (for enriching the feature records). Input Output Input: maintenance order • Order: VIN, country, supplier_id • Line: repair code, ATA category, quantity. Note: NO FEATURES HERE Scoring web service needs to do the same things as notebook, but on smaller data (5-10 lines per order): a) It filters records (e.g. “remove Mexico data”) b) It adds columns (=features), e.g. VIN => make, model, engine size, etc. Often done by doing a lookup (=join to other table); c) It generates features, e.g. ata_key = aga_category + “_” + ata_subcategory. Output: price prediction for every line.
  • 11. 11 NOTEBOOK DATA DUMPS Training notebook dumps lots of things: models, lookup tables, data for the integration tests Aggregations Root data and lookups Variables configuration ML models and test dataset
  • 12. 12 BASE CLASSES VIN Supplier_ID Repair code ATA cat. ATA subcat Qty 1HGBH41JXM N109186 123456 REP 74 001003 1 [Same columns] Make Model Fuel Type … Supplier City [Same columns] BMW X5 petrol … San Francisco FeatureRecord - container for features FeatureRecordGroupedSet – grouping by any field (in our case – by order id) Transformers – enrich FR with features: • LookupTransformer – adds new columns • ApplicableTransformer – stops processing if some field is not in the lookup table • OilGroupTransformer – groups similar records • Etc. (many others exist) Make_BMW Make_Ford … Quantity … Prediction 1.0 0 … 1 … $55 MultiModel: vectorizes the feature record and does prediction
  • 13. 13 LOOKUP TRANSFORMER Purpose: enrich the feature record with real features by making a lookup. Backed with: • InMemoryIndexedDataFrame – a fast in-memory lookup • IndexedDataFrameReader – reader of the df.csv.gz + df.schema pair of files. .schema file: (contents shown on the slide). .csv.gz file example:
ata_ctgy_cd,ata_sub_ctgy_cd,ata_cd_long_desc,english_cd_long_desc,cd_stat_ind
17,001100,NEW TIRE RADIAL STEEL BELTED,NEW TIRE RADIAL STEEL BELTED,A
17,003001,USED TIRE,USED TIRE,A
10,02,010045,MIRROR SPOT,MIRROR SPOT,A
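A hedged sketch of how such a lookup-backed transformer could look, building on the FeatureRecord sketch above; the composite-key map is an assumption about what InMemoryIndexedDataFrame does internally, not a copy of the real class:

import java.util.HashMap;
import java.util.Map;

// Sketch: an in-memory "indexed dataframe" keyed by the lookup column value.
class InMemoryIndexedDataFrame {
    private final Map<String, Map<String, Object>> rowsByKey = new HashMap<>();

    void addRow(String key, Map<String, Object> row) { rowsByKey.put(key, row); }
    Map<String, Object> find(String key) { return rowsByKey.get(key); }
}

// Sketch: enrich a FeatureRecord by joining it to the lookup table on one key column.
class LookupTransformer implements Transformer {
    private final InMemoryIndexedDataFrame lookup;
    private final String keyColumn;

    LookupTransformer(InMemoryIndexedDataFrame lookup, String keyColumn) {
        this.lookup = lookup;
        this.keyColumn = keyColumn;
    }

    @Override
    public FeatureRecord transform(FeatureRecord record) {
        if (record.isStopped()) return record;                    // rule from the editor's notes
        String key = String.valueOf(record.features.get(keyColumn));
        Map<String, Object> row = lookup.find(key);
        if (row != null) record.features.putAll(row);             // add the looked-up columns as new features
        return record;                                            // the real code returns a copy, not the same instance
    }
}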
  • 14. 14 FULL CHAIN TRANSFORMER The full chain transformer combines all the atomic transformations into one chain. Running transform() gives the same result as if the same records were passed through the notebook’s PySpark ETL code. A minimal sketch of the chaining is given below.
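As an illustration, the chaining itself can be as simple as the following sketch (assumed structure, building on the Transformer interface above):

import java.util.List;

// Sketch: apply all atomic transformers in order, mirroring the notebook's ETL sequence.
class FullChainTransformer implements Transformer {
    private final List<Transformer> steps;

    FullChainTransformer(List<Transformer> steps) { this.steps = steps; }

    @Override
    public FeatureRecord transform(FeatureRecord record) {
        FeatureRecord current = record;
        for (Transformer step : steps) {
            if (current.isStopped()) break;       // stopped records are returned as-is
            current = step.transform(current);
        }
        return current;
    }
}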
  • 15. 15 OTHER CLASSES MultiModel does two things: vectorization of the feature record (into a sparse vector of doubles) and prediction. It encapsulates many XGBoost models (a separate model for every ATA code – a subject of repair). It uses biz.k11i.xgboost (https://github.com/komiya-atsushi/xgboost-predictor-java); see the sketch below. MultiModelReader is a class to load the global configuration and all the models from the config provider. ConfigProvider is an abstraction which allows reading resources from one place. Currently there is just one implementation – ZipConfigProvider (to read everything from a single zip file). PricePredictionService – a class which combines FullChainTransformer and MultiModel.
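A minimal sketch of scoring with xgboost-predictor-java. It assumes the library's Predictor/FVec API and a per-ata_key map of models; the real MultiModel also performs the vectorization itself, which is omitted here:

import biz.k11i.xgboost.Predictor;
import biz.k11i.xgboost.util.FVec;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch: one serialized XGBoost model per ata_key (the subject of repair).
class MultiModelSketch {
    private final Map<String, Predictor> modelsByAtaKey = new HashMap<>();

    void loadModel(String ataKey, String modelPath) throws IOException {
        try (FileInputStream in = new FileInputStream(modelPath)) {
            modelsByAtaKey.put(ataKey, new Predictor(in));
        }
    }

    // 'sparseVector' is the vectorized feature record: feature index -> value.
    double predict(String ataKey, Map<Integer, Double> sparseVector) {
        Predictor predictor = modelsByAtaKey.get(ataKey);
        FVec features = FVec.Transformer.fromMap(sparseVector);
        return predictor.predict(features)[0];
    }
}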
  • 16. 16 CONFIGURATION Configuration structure (contents of the zip file):
• agg
  • agg.json – aggregations configuration file
  • <a folder for each ata_key>
    • <a pair of .csv.gz + .schema files for each aggregation>, e.g. agg_vin_model.csv.gz, agg_vin_model.schema
• config
  • global.json – configuration file that describes all models, their variables, and the possible labels for categorical variables
• models
  • <for every ata_key: a .bin, .config and .txt file> – models serialized by XGBoost
• lookup
  • <pairs of .csv.gz + .schema files for lookups>
ZipConfigProvider reads this zip file (size ≈ 200 MB) produced by the notebook; a simplified sketch follows. Uncompressed size in the Java structures that we chose: 1.8 GB.
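A hedged sketch of what a zip-backed config provider could look like using plain java.util.zip; the real ZipConfigProvider may differ:

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch: read named resources (models, lookups, config/global.json) from a single mldata zip file.
class ZipConfigProviderSketch implements AutoCloseable {
    private final ZipFile zip;

    ZipConfigProviderSketch(String zipPath) throws IOException { this.zip = new ZipFile(zipPath); }

    InputStream open(String entryName) throws IOException {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null) throw new IOException("Missing config entry: " + entryName);
        return zip.getInputStream(entry);
    }

    @Override
    public void close() throws IOException { zip.close(); }
}

Usage would be along the lines of provider.open("config/global.json") or provider.open("models/" + ataKey + ".bin").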
  • 18. 18 UNIT VS. INTEGRATION TESTING
• Unit test: results depend only on Java code. Integration test: results also depend on external systems/data.
• Unit test: easy to write and verify. Integration test: setup might be complicated.
• Unit test: a single class/unit is tested in isolation. Integration test: one or more components are tested.
• Unit test: all dependencies are mocked if needed. Integration test: no mocking is used (or only unrelated components are mocked).
• Unit test: verifies only the implementation of the code. Integration test: verifies the implementation of individual components and their interconnection behavior when they are used together.
• Unit test: uses only JUnit/TestNG and a mocking framework. Integration test: can use real containers and real DBs as well as special integration testing frameworks (e.g. Arquillian or DbUnit).
• Unit test: mostly used by developers. Integration test: also useful to QA, DevOps, Help Desk.
• Unit test: a failure is always a regression (if the business has not changed). Integration test: a failure can also mean that the code is still correct but the environment has changed.
• Unit tests in an enterprise application should last about 5 minutes. Integration tests in an enterprise application can last for hours.
  • 19. 19 INTEGRATION TESTING Goal: to ensure that the web service is doing EXACTLY THE SAME THINGS as the training notebook does. Training notebook outputs: • mldata_20180807_093948.zip (200 MB) – scoring configuration (ML models, lookups, variable configuration, etc.) • mldata_test_20180807_093948.zip (600 MB) – scoring configuration + integration-test data: the input data (the whole test dataset with JUST INPUT features: VIN, supplier_id, ATA code, odometer reading, quantity; 1M records) and the test data with ALL features and predictions (all features – make, model, etc. – plus predictions; 400K records). What do we check: • Take the input dataset (VIN, supplier_id, country, odometer reading, ATA category/subcategory, repair code, parts quantity). • Pass it through FullChainTransformer and check if “Features by Python” = “Features by Java”. • Get predictions using MultiModel and check if “Prediction by Python” = “Prediction by Java”. A simplified sketch of these checks is given below.
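A simplified JUnit-style sketch of these checks, building on the FeatureRecord sketch above; the comparison tolerance is an assumption for illustration:

import static org.junit.Assert.assertEquals;

import java.util.Map;

// Sketch: for every row of the test dataset, the Java pipeline must reproduce the features
// computed by the Python notebook, and the predictions must match within a small tolerance.
class PythonVsJavaChecks {
    static void checkFeatures(Map<String, Object> featuresByPython, FeatureRecord enrichedByJava) {
        for (Map.Entry<String, Object> expected : featuresByPython.entrySet()) {
            assertEquals("Feature mismatch: " + expected.getKey(),
                    expected.getValue(), enrichedByJava.features.get(expected.getKey()));
        }
    }

    static void checkPrediction(double predictionByPython, double predictionByJava) {
        assertEquals(predictionByPython, predictionByJava, 1e-6);
    }
}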
  • 20. 20 INTEGRATION TESTING Maven life cycle phases: 1. validate – checks that the project is correct and all information is available 2. compile – compiles the source code into binary artifacts 3. test – executes the tests 4. package – takes the compiled code and packages it (for example, into a jar) 5. integration-test – takes the packaged result and executes additional tests which require the packaging 6. verify – performs checks that the package is valid 7. install – installs the result of the package phase into the local Maven repository 8. deploy – deploys the package to a target, e.g. a remote repository. Example of how to run: mvn clean install -Dmldata=/path/to/mldata_test_20181010_150000.zip
  • 21. 21 CHECKED HASH MAP AS FEATURE CONTAINER CheckedHashMap – an override of HashMap, but with explicit operations. Goal: avoid errors, increase control. Overridden operations: • put(): fails if the key already exists in the map • get(): fails if the key doesn’t exist in the map • remove(): fails if the key doesn’t exist in the map New operations: • overwrite(): the key must exist, otherwise it will fail • overwriteIfExists(): overwrites if the key exists, otherwise does nothing • putIfNotExists(): doesn’t fail, works only if the key doesn’t exist • putOrOverwrite(): no matter if the key exists or not, puts or overwrites it there Left as is: • getOrDefault(): if the key doesn’t exist, it will return a default All operations do NOT allow null keys! A minimal sketch is given below.
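A minimal sketch of the idea (assumed generics and exception types; the real class has more operations):

import java.util.HashMap;

// Sketch: a HashMap that fails fast instead of silently returning null or silently overwriting values.
class CheckedHashMap<K, V> extends HashMap<K, V> {
    @Override
    public V put(K key, V value) {
        if (key == null) throw new IllegalArgumentException("Null keys are not allowed");
        if (containsKey(key)) throw new IllegalStateException("Key already exists: " + key);
        return super.put(key, value);
    }

    @Override
    public V get(Object key) {
        if (!containsKey(key)) throw new IllegalStateException("Key does not exist: " + key);
        return super.get(key);
    }

    public V overwrite(K key, V value) {
        if (!containsKey(key)) throw new IllegalStateException("Key does not exist: " + key);
        return super.put(key, value);
    }

    public V putOrOverwrite(K key, V value) {
        if (key == null) throw new IllegalArgumentException("Null keys are not allowed");
        return super.put(key, value);
    }
}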
  • 22. 22 SERIALIZATION IN HEX Double and float values inside CSV/LIBSVM files are written like this: • 3.5/0000000000000c40 – for double values • 5.1016541/c040a340 – for float values At the Python side, the hex part is written out from the raw IEEE-754 bits of the value (the code is shown on the slide).
  • 23. 23 SERIALIZATION IN HEX At the Java side: • A HexParser class with helper static methods to parse hex values – see the sketch below. • The SuperCSV library is used to read the CSV files. • A class ParseDoubleHex extends SuperCSV’s CellProcessor and leverages HexParser to get values out of 3.5/0000000000000c40. • The same is done for the float type.
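The core of the decoding is only a few lines. The sketch below assumes the hex part is the little-endian byte order of the IEEE-754 bits, which is consistent with the examples on the previous slide (the real HexParser/ParseDoubleHex classes may be organized differently):

// Sketch: values like "3.5/0000000000000c40" carry the exact IEEE-754 bits (little-endian hex)
// next to the human-readable number; parsing the hex avoids any precision loss.
class HexParserSketch {
    static double parseDoubleHex(String valueWithHex) {
        String hex = valueWithHex.substring(valueWithHex.indexOf('/') + 1);
        long bits = Long.reverseBytes(Long.parseUnsignedLong(hex, 16));
        return Double.longBitsToDouble(bits);
    }

    static float parseFloatHex(String valueWithHex) {
        String hex = valueWithHex.substring(valueWithHex.indexOf('/') + 1);
        int bits = Integer.reverseBytes(Integer.parseUnsignedInt(hex, 16));
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(parseDoubleHex("3.5/0000000000000c40"));  // prints 3.5
        System.out.println(parseFloatHex("5.1016541/c040a340"));     // prints ~5.1016541
    }
}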
  • 24. 24 PRETTY PRINTING OF ERRORS Advice: do a “pretty print” of error data. If you don’t do it, fixing bugs will be hard.
  • 25. 25 PRETTY PRINTING OF ERRORS Example of how easy it is to fix errors when we have a pretty print (a sketch of such a printer is given below):
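A sketch of what such a pretty print could look like: dump the expected and actual features in sorted order so that a diff tool (WinMerge, per the editor's notes) shows the mismatch at a glance. The exact formatting here is an assumption:

import java.util.Map;
import java.util.TreeMap;

// Sketch: print all features sorted by name, so expected-vs-actual logs line up in a diff tool.
class FeaturePrettyPrinter {
    static String prettyPrint(String title, Map<String, Object> features) {
        StringBuilder sb = new StringBuilder(title).append(System.lineSeparator());
        for (Map.Entry<String, Object> e : new TreeMap<>(features).entrySet()) {
            sb.append(String.format("  %-40s = %s%n", e.getKey(), e.getValue()));
        }
        return sb.toString();
    }
}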
  • 26. 26 MEMORY OPTIMIZATION FOR IT Before the re-design of the integration tests: 11 GB. After the re-design (added partitioning): 3.8 GB (three times less).
  • 28. 28 TECHNICAL DEBT Citing “Hidden Technical Debt in Machine Learning Systems” (https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems): “It may be surprising to the academic community to know that only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction.”
  • 29. 29 SCORING WEB SERVICE ARCHITECTURE Spring Boot Web Service wrapping the Price Prediction Service Library, plus: REST Controller, SOAP Endpoint, Payload Logging, ML Data (called a “config”), Config Updater, Payload Shipping, Management REST Controller, Scheduler (Spring). External parts: Client application (scoring service consumer) sending HTTP requests, Model Storage (HDFS/S3: models, variables, labels, lookup tables), Payload Storage (HDFS/S3: daily zip files of all the payloads), VIN Decoder (external REST service).
  • 30. 30 PRICE PREDICTION CONTROLLER/ENDPOINT Endpoints: • /rest – for REST (RestController class) • /soap – for SOAP (SoapEndpoint class) • /manage – for management REST requests Overrides of default behavior: • For REST: override the exception resolver and the JSON message converter (fails on unknown properties). • For SOAP: proper handling of InvalidXmlException and SoapMessageCreationException – doing “500 Internal Server Error” instead of “400 Bad Request”. • For both: async uploads of payloads (derived MessageDispatcherServlet and DispatcherServlet).
  • 31. 31 PROCESSING ALGORITHM The algorithm is common for the REST controller and the SOAP endpoint: • validate the input data • get the instance of the PricePredictionService from the manager • perform the prediction • check the status: if it is OK – return the result, otherwise log the error and return the error response. A simplified sketch is given below.
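A hedged sketch of that common flow. The names safePredict() and getErrorDescriptionForServerLog() come from the editor's notes; the surrounding types are simplified stand-ins, not the real request/response classes:

// Sketch of the flow shared by the REST controller and the SOAP endpoint.
class PredictionFlowSketch {
    interface PredictionRequest { }
    interface PredictionResponse {
        boolean isOk();
        String getErrorDescriptionForServerLog();   // logged on the server side only, never returned to the client
    }
    interface PricePredictionService {
        PredictionResponse safePredict(PredictionRequest request);  // never throws; errors are inside the response
    }

    private final PricePredictionService service;   // in the real service this instance comes from a manager

    PredictionFlowSketch(PricePredictionService service) { this.service = service; }

    PredictionResponse handle(PredictionRequest request) {
        if (request == null) throw new IllegalArgumentException("Empty request");  // 1. validate the input
        PredictionResponse response = service.safePredict(request);                // 2-3. get the service, predict
        if (!response.isOk()) {
            System.err.println(response.getErrorDescriptionForServerLog());        // 4. log the error details
        }
        return response;                                                            // OK result or error payload
    }
}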
  • 34. 34 MANAGEMENT CONTROLLER /swagger-ui.html – automatically generated endpoint for help on the REST methods. The web UI is rendered by Swagger + Springfox. Caveat #1: it cannot generate good examples for properties of type Map<Long, SomeClass>. Caveat #2: it doesn’t cover SOAP.
  • 35. 35 MANAGEMENT CONTROLLER /manage/info – renders detailed information about the configuration and 10 previous operations
  • 36. 36 MEMORY CLEANING MemoryCleaner: a utility class for cleaning memory. Used on the SWITCH operation. • Uses jlibs-core (https://santhosh-tekuri.github.io/jlibs/): RuntimeUtil.gc() guarantees garbage collection to happen, contrary to System.gc(). • Runs a thread that retries freeing up RAM every T seconds while: • at the previous attempt we freed up less than X MB • we have made no more than N attempts Reason: old REST requests may still be running and hold a reference to the old instance of the PricePredictionService. A sketch of the retry loop is given below.
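A sketch of the retry idea. Here a plain daemon thread and System.gc() stand in for the jlibs RuntimeUtil.gc() call, and the thresholds are illustrative:

// Sketch: after a config SWITCH, keep retrying GC until enough memory has been reclaimed
// or the attempt limit is reached (old requests may still hold the previous config in memory).
class MemoryCleanerSketch {
    static void cleanAsync(long periodMillis, long minFreedBytes, int maxAttempts) {
        Thread cleaner = new Thread(() -> {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                long before = usedBytes();
                System.gc();                          // the real code uses jlibs RuntimeUtil.gc()
                long freed = before - usedBytes();
                if (freed >= minFreedBytes) break;    // enough memory reclaimed - stop retrying
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();   // restore the flag and stop (see the SonarQube slide)
                    break;
                }
            }
        }, "memory-cleaner");
        cleaner.setDaemon(true);
        cleaner.start();
    }

    private static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }
}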
  • 38. 38 SONARQUBE SonarQube is a tool for continuous inspection of code quality. It helps find potential bugs, performs automated code review, checks unit tests and coverage, etc. It supports 20+ programming languages.
  • 39. 39 SONARQUBE: CODE COVERAGE Advice: try to cover all the code with unit tests. The bug shown on the slide was found after covering those lines with unit tests.
  • 40. 40 SONARQUBE SonarQube shows places in the code where bugs may occur.
  • 41. 41 SONARQUBE: TRUE POSITIVES Reference explaining why we need the interrupt: https://stackoverflow.com/questions/4906799/why-invoke-thread-currentthread-interrupt-in-a-catch-interruptexception-block
  • 42. 42 SONARQUBE: BAD ADVICE The worst suggestion from SonarQube I’ve ever seen.
  • 43. 43 STATISTICS Total lines: 18539. Source code lines: 13139 (71%). Comment lines: 2786 (15%). Blank lines: 2614 (14%).
  • 45. 45 RUNNING NOTEBOOKS OFFLINE Goal: to re-use the code in exploration/development and in production. How it works: nbrun.py notebook.ipynb -o out -k "pyspark 2.3.1" -e dev -i 3 -z results_{}.zip -t 180 -m 5 -o = output folder -k = kernel name -e = environment -i = the cell where to insert the environment load -z = the zip file pattern (to put the ipynb + *.py files into after the run) -t = timeout for the kernel to start -m = maximum number of times to try to start the kernel Algorithm: • read the .ipynb file (just the code, omitting any output which may be there) • insert some cells at position “-i”: a cell with the profile name and an override of the print function • create the output folder, put all modules there and make this folder the working one • start the kernel, run the cells, output results to the output folder • zip the contents of the output folder (the notebook with executed content + py files)
  • 46. 46 RUNNING NOTEBOOKS OFFLINE How nbrun works: • nbformat – https://github.com/jupyter/nbformat for reading/writing ipynb, inserting/editing cells • nbconvert – https://github.com/jupyter/nbconvert for running the notebook with a specified kernel and getting the output Applied tricks: • Override nbconvert.preprocessors.ExecutePreprocessor: • preprocess_cell: to measure execution time and add cell.metadata["ExecuteTime"] = {"end_time": time_end, "start_time": time_start} • run_cell: to log the execution start/end and the result to the console output of nbrun.py • preprocess: to fix the bugs with shutting down the kernel process in case we couldn’t connect to it, to change the timeout for the kernel start, and to retry starting if something failed (which happens quite often with “heavy” kernels like PySpark).
  • 47. 47 ENVIRONMENT REPLACEMENT This environment will be loaded at development time. Here nbrun.py will insert a cell overriding the ENV_NAME. This code will dynamically load env_${ENV_NAME}.py.
  • 48. 48 PATCHED PRINT FUNCTION The print() function is overridden, and the override is inserted by nbrun.py. Goal: to have the output in two places – the notebook’s output and the system’s output (stdout of nbrun.py).
  • 49. 49 ENVIRONMENT PARAMETERS Environment parameters: • FILESYSTEM, DB_NAME: place where we will store temporary tables. Supports: s3, hdfs, cassandra, local, FiloDB • HDFS/LOCAL/S3 parameters (depending on the type) • Spark unpersisting parameters: mode which will force the Spark to unpersist the dataframes (related to a bug with Spark 1.4 which caused cascaded unpersisting) • Upload parameters: S3/HDFS parameters of where to upload the results of the training • Metrics limits: upper limits for ML metrics (if met, then the new models will be uploaded) • Datasets: location, table names, SQL statements of how to get the source data • Thresholds for category variables (e.g. “train for top 1000 makes, ignore the others”) • TOP_N: how many models to train • VIN decoder parameters: how to decode the VINs in the case if they’re absent in the lookup table
  • 50. 50 NOTEBOOK STRUCTURE The notebook’s code is split into sections: • Environment loading (if run offline – done by nbrun.py) • Loading modules, initializing shared variables • Data extraction • Data transformations (filtering, joining, new features) • Model training • Building model metrics and analysis • Dumping of artifacts (models, lookups, etc.), zipping results and metrics Each section reads data from the previous section’s results and saves its own results. Each section can be switched off during development (to save execution time).
  • 51. 51 MODULES Functions were moved into modules. Reasons: • Easy to debug • Easy to see errors • Easy to write unit tests • PyCharm capabilities of code navigation and type hints Reference: https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html
  • 52. 52 SHARED VARIABLES Common variables are shared between modules. Example of how to share variables: 1) Create a module shared.py with variables with the same names as in the notebook 2) Call “init” at the beginning.
  • 53. 53 MEMORY PROFILING OF THE NOTEBOOK Memory profiling and usage of “del” at the end of sections. A simple way is to make some decorators; after that, for “heavy” frequently used functions, do this: @profile def your_func(): ... You will see how much memory the notebook’s kernel consumed before and after the function call.
  • 54. 54 MEMORY PROFILING OF THE NOTEBOOK Each section ends with checking unreleased Pandas and Spark dataframes. Results: • decreased memory of the notebook from 12GB to 4GB • removed all cached Spark dataframes from the cluster memory (=decreased the demands for cluster resources).
  • 56. 56 HIGH LEVEL API Goals: • to simplify usage of Spark dataframes/SQL • implicit caching defaults for common operations • to minimize errors. Commonly used functions: • load_df(db_name, table_name, ...), save_df(df, db_name, table_name, ...) • change_df(a_select_cols, a_drop, a_replace, a_rename, a_add, a_distinct, a_order_by, a_drop_end, a_filter_df, a_filter_columns, a_filter_not_df, a_filter_not_columns, a_where) • join_df, group_by • filter_by_where, filter_by_df, filter_by_not_df, filter_by_threshold, filter_duplicates Example: new_df = change_df(df, a_add={"new_col": "case when col > 0.0 then 1 else 0 end"}) instead of new_df = df.withColumn("new_col", F.when(df["col"] > 0.0, 1).otherwise(0))
  • 57. 57 CREATEDATAFRAME MONKEY PATCH Problem: the sqlContext.createDataFrame() function doesn’t have a numSlices parameter (which is present in sc.parallelize() and defines the number of partitions). This is true up to 2.3.1. Why it is important: to control the number of partitions when converting a Pandas dataframe into a Spark dataframe. Solution: patch three functions (the code is in sparkdf.py): • SparkSession._createFromLocal = _createFromLocalMonkeyPatch • SparkSession.createDataFrame = createDataFrameMonkeyPatch_session • SQLContext.createDataFrame = createDataFrameMonkeyPatch_sqlcontext In all of the overrides, add numSlices and pass it through to sc.parallelize().
  • 58. 58 NOTEBOOK KERNEL PARAMETERS Marked are those parameters which we strongly advise applying for a standalone Spark cluster: { "display_name": "pyspark cluster - ibobak - 3e 3c", "language": "python", "argv": ["/opt/conda/envs/py27/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"], "env": { "SPARK_HOME":"/opt/spark", "PYTHONPATH":"/opt/spark/python/lib/py4j-0.10.4-src.zip:/opt/spark/python", "PYTHONSTARTUP":"/opt/spark/python/pyspark/shell.py", "PYSPARK_SUBMIT_ARGS":" --packages com.databricks:spark-avro_2.11:3.2.0 --driver-memory 5G --executor-memory 10G --num-executors 3 --executor-cores 3 --total-executor-cores 9 --master spark://10.4.12.36:7077 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=1024m --conf spark.driver.extraJavaOptions="-Xss16m" --conf spark.executor.extraJavaOptions="-Xss16m" --conf spark.cassandra.output.consistency.level=ALL --conf spark.cassandra.input.consistency.level=ALL pyspark-shell" } } All three params num-executors, executor-cores and total-executor-cores must be specified (otherwise the number of cores will be unpredictable). The serialization parameters are strongly advised to speed up dataframe caching. -Xss16m is advised to avoid StackOverflowError. The Cassandra consistency parameters are needed when you write to Cassandra and don’t want records to be lost after saving and re-loading them.
  • 60. 60 JENKINS JOBS Jenkins jobs to run different parts of the flow: • ml_build_scoreapi: builds the price prediction web service (Java), runs the unit tests, does SonarQube analysis, uploads the jar to the artifactory. • ml_config_scoreapi: takes the new config files from Git, puts them on the server (dev/qa/prod) and restarts the service. • ml_deploy_scoreapi: takes the jar file from the artifactory, puts it on the server (dev/qa/prod) and restarts the web service. • ml_deploy_training: takes the notebooks from Git and puts them into the working folder of the training server. • ml_run_training: runs these steps on the training server: • Data extraction into parquet files • Offline running of the notebooks (using nbrun.py) • Integration tests • Checking ML metrics • Uploading the training results to S3 • Issuing a /manage/syncswitch HTTP GET request on the working instance of the scoring web service.
  • 61. 61 CONTACTS Ihor Bobak E-mail: Ihor_Bobak@epam.com Skype: ibobak Linkedin: https://www.linkedin.com/in/ibobak

Editor's Notes

  1. The customer selected the regression model based on business demands and a non-risk approach: the maintenance order should be manually reviewed anyway due to checking against policies (e.g. “does the client have the right to wash the car?”).
  2. XGBoost actually replaced Spark’s GBT and RF after we discovered that on 5 million rows it is much better to train locally on multiple cores instead of doing this with Spark.
  3. We looked at MLeap in December, but faced a set of problems: no XGBoost support at that time. At the scoring part we need to transform just a couple of rows: much of the transformation logic either vanishes or changes, so no “copy paste” of the training notebook’s code is possible (actually, that wouldn’t happen anyway: the notebook has PySpark code, while we need Java).
  4. There are two modules (jars): price-prediction-service.jar (the library) does all the price prediction work; price-prediction-web.jar (the web service) wraps the library into a Spring Boot microservice and provides a management interface, ML data version switching/syncing, payload shipments, VIN decoder testing, etc.
  5. After all the dumps have finished, it creates two zip files: mldata_YYYYMMDD_hhmmss.zip and mldata_test_YYYYMMDD_hhmmss.zip (the extended one, for integration tests).
  6. FeatureRecord: features (CheckedHashMap<String, Object>) – for holding feature names and values; stopInfo (Map<String, String>) – for marking the record as “stopped”; updateInfo (Map<String, String>) – for outputting information that “the prediction was made for an updated value of some feature”; predictionInfo (Map<String, String>) – for outputting the price prediction (potentially we can put several predictions there). All transformers should work like this: if the record is stopped, return it immediately without any processing; if the record is “alive”, create a copy of it, apply the transformation, then return the transformed record.
  7. We considered many implementations: MySQL, SQLite, Memcached, but settled on this simple one for a single reason: in our case it fits into 2 GB of RAM. If it didn’t, we would do something more complicated.
  8. ATA = American Trucking Association Do not use XGBoost4J: slow, buggy, platform dependent (uses JNI)
  9. It is often a temptation to work in the training notebook with a dataset which already contains some of the features, like “make”, “model”, etc. The rule is this: if the scoring web service doesn’t get this feature from the client, then your notebook MUST have a part of the ETL which takes this feature from some lookup, some external source, or anything else.
  10. Such explicit control over feature management allowed us to avoid errors, especially because they were easily detected during integration tests. The concept is as follows: if you did not put anything in as a feature, then please DO NOT TRY TO GET IT. But Java’s HashMap, TreeMap and others simply return “null” if you do map.get(“Feature_X”) while Feature_X simply doesn’t exist because no one ever put it there.
  11. Often in testing code we see just “assertEquals(expected, actual)”. OK, it will assert. But how much time will you spend searching for the reason? In our case, we put the features into a TreeMap and printed them out to the logs, so that after running all the tests we got detailed output and could fix the problem in 10 minutes.
  12. The freeware comparison tool is called WinMerge.
  13. The second picture is intentionally reduced in height to show how much less memory it started to use.
  14. There is much more work to do than you thought at the beginning.
  15. Orange blocks are those pieces of functionality that are helper stuff: we could live without them, but life with them becomes easier.
  16. The prediction method is called “safePredict()” instead of just “predict()”. That means that we do not raise any exception from there: the error information will be encapsulated in the response. This pattern allows the controller/endpoint not to care about ML stuff: it is not their business what exception may happen inside the price prediction service. The PredictionResponse has getErrorDescriptionForServerLog(), which will be logged for further analysis on the server side, but this won’t go out of the service to the external world.
  17. This is a sample of a prediction request and response in JSON format. As you may see, not every order line contains a prediction: some of them may contain a “stop” message, others may contain an additional “update” message which notifies that “we’ve made the prediction for this item forcibly setting quantity = 1”, so no matter what value was there – 1 or not 1 – we predicted as if it was “1”.
  18. Serialization/deserialization into XML and JSON is done using the same classes – PredictionRequest/PredictionResponse. Fields are annotated both with JAXB and Jackson annotations.
  19. Swagger is good enough but NOT the best tool for documenting. A huge problem is properties of type Map<anything, SomeClass>: it cannot automatically generate good examples in the UI that would allow testing how the service works. To enable this Swagger functionality, having a @Configuration class RestMvcConf extends WebMvcConfigurationSupport, we had to @Override protected void configureDefaultServletHandling(DefaultServletHandlerConfigurer configurer) { configurer.enable(); } otherwise the default request handler for /** was not enabled.
  20. A simple approach which we use everywhere is to return Map<String, Object> from the /info methods: it converts to JSON automatically and we don’t have to care about types.
  21. SonarQube (formerly Sonar)[1] is an open source platform developed by SonarSource for continuous inspection of code quality to perform automatic reviews with static analysis of code to detect bugs, code smells and security vulnerabilities on 20+ programming languages. SonarQube offers reports on duplicated code, coding standards, unit tests, code coverage, code complexity, comments, bugs, and security vulnerabilities.[2][3]
  22. Why do we need interrupts: when you catch the InterruptedException and swallow it, you essentially prevent any higher level methods/thread groups from noticing the interrupt, which may cause problems. By calling Thread.currentThread().interrupt(), you set the interrupt flag of the thread, so higher level interrupt handlers will notice it and can handle it appropriately.
  23. The override of the print function is needed so that everything the notebook’s cells print goes both to the notebook and to the stdout of nbrun.py.
  24. env_dev.py, env_prod.py, etc. are standard Python files containing variable assignments like VIN_DECODER_ENABLED=True, VIN_DECODER_TIMEOUT_SECONDS=60, VIN_DECODER_URL="https://servername:port/vinData", etc.