PRODUCTIONALIZING ML:
REAL EXPERIENCE
Ihor Bobak
Data Scientist, EPAM Systems
SCOPE, ARCHITECTURE,
CONSIDERATIONS
3
PROJECT INFO
Customer:
A Canadian company that provides different fleet management services.
E.g. it runs a call center that handles all the maintenance and repairs of vehicles (acts as a “proxy”
between a client and service providers).
Use case:
A fleet owner contacts the agent to ask for assistance with the maintenance. The agent contacts
nearby service providers, gets offers, selects the supplier, negotiates the price for each line of the
maintenance order.
Problem:
a) Price negotiation takes up agents' time
b) Agents need to remember the details on cars/makes/models/spare parts to properly validate the
price.
Solution:
Price Prediction Web Service (based on ML) which predicts the maintenance price based on the
information about the vehicle, type of service, client location, etc.
4
DATA SCIENCE SCOPE
Data Extraction
Destination: parquet files on HDFS,
scope: last 2 years, output: 28 mln. rows
Data Transformation
Filtering, joining, new fields/expressions,
destination: parquet files on HDFS, output: 5 mln. rows
ML Pipeline
label encoding, one hot encoding, vector assembling, training
XGBoost models, performance metrics.
Data Sources
Sybase, Cassandra, others
Typical scope of a data scientist's work. We built two models:
• a classification model answering the question "is the price for the repair relevant or not?"
• a regression model (the customer's choice) answering the question "what is the recommended price for this maintenance item?"
Results: 2x speed-up of agents' work; 1.10x decrease in savings (due to false negatives)
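Referring to the "ML Pipeline" step above, a minimal PySpark sketch of what the "label encoding, one hot encoding, vector assembling" part can look like on Spark 2.3 (column names here are assumptions, and the XGBoost training on the assembled vectors is outside this snippet):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler

cat_cols = ["make", "model", "supplier_city"]          # illustrative categorical features
num_cols = ["quantity", "odometer_reading"]            # illustrative numeric features

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in cat_cols]                          # label encoding
encoder = OneHotEncoderEstimator(inputCols=[c + "_idx" for c in cat_cols],
                                 outputCols=[c + "_ohe" for c in cat_cols])  # one hot encoding
assembler = VectorAssembler(inputCols=[c + "_ohe" for c in cat_cols] + num_cols,
                            outputCol="features")       # vector assembling

pipeline = Pipeline(stages=indexers + [encoder, assembler])
# features_df = pipeline.fit(train_df).transform(train_df)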
5
SOLUTION ARCHITECTURE
Client
application
(scoring service
consumer)
Scoring Web
Service
(SOAP+REST)
Stack: Java, Spring
Data Extraction
Destination: parquet files on HDFS,
scope: last 2 years, output: 28 mln. rows
Data Transformation
Filtering, joining, new fields/expressions,
destination: parquet files on HDFS, output: 5 mln. rows
ML Pipeline
label encoding, one hot encoding, vector assembling, training
XGBoost models, performance metrics.
Data Sources
Sybase, Cassandra, others
Uploading Results
Upload models, training metrics, lookup tables, labels for
category variables. Destination: S3, output: 150MB zip archive
HTTP (SOAP)
requests for
price prediction
Downloading
new models
Scheduled
run of
training
process
Administrator
HTTP (REST)
management
commands
Model Storage
(HDFS/S3)
models, variables, labels,
lookup tables
6
TECHNOLOGY STACK
Training part:
• Jupyter notebook
• Spark 2.3.1
• Pandas, Numpy, Scikit-learn and other libraries
• XGBoost
Scoring part (web service):
• Java
• Spring Boot
• xgboost-predictor-java library
https://github.com/komiya-atsushi/xgboost-predictor-java
• Lots of other open source Java libraries
7
TASKS IN DIFFERENT ENVIRONMENTS
Exploration/Development Production
Training Scoring
Environment:
• Python, Jupyter Notebook
• Spark/Scikit/XGBoost, etc.
Tasks:
• Get input data
• Rename fields
• Check values
• Modify field values
• Add new fields
• Filter rows
• Join other tables
• ML Pipeline tasks:
• Label encoding
• One hot encoding
• Train/test split
• Model training
• Metrics calculation
Environment:
• Java, Spring Boot, etc.
Tasks/Challenges:
We need to do the same things in Java:
“rename fields”, “check values”, “modify
field values”, “add fields”, “filter rows”,
“join other tables” and some of the
"ML pipeline tasks" ("label encoding",
“one hot encoding”, “scoring by model”)
Challenges:
a) How to represent data?
b) What libraries to use for transformation?
c) What libraries to use for ML pipeline
tasks?
d) What libraries to use for scoring?
Environment:
• the same as exploration/development
Goal:
re-use the same code as much as possible.
Other tasks that we need to do here:
• Scheduled running of the whole training
cycle
• Uploading of results to some storage
(S3/HDFS)
• Alerting if metrics are below the
expectations
• Alerting if errors occurred during
training
8
SCORING LIBRARY
9
Spring Boot Web Service
Price Prediction Service Library
SCORING WEB SERVICE ARCHITECTURE
REST
Controller
SOAP
Endpoint
HTTP
Request
Transformers
Vectorization
(Label Encoder,
One-hot
encoder)
Scoring
(ML
models)
HTTP
Response
Payload
Logging
ML Data
Serialized Models + vectorization info (variables, labels, data types).
Lookup tables (for enriching the feature records).
The scoring web service is a pure Java solution using the artifacts ("ML Data") output by the Python training code.
*Many other things (payload shipment on S3/HDFS, updater of the ML Data files, management interface) are not shown here and will be shown later.
10
PRICE PREDICTION SERVICE LIBRARY
Price Prediction Service Library
Transformers
Vectorization
(Label Encoder,
One-hot
encoder)
Scoring
ML Data
Serialized Models + vectorization info (variables, labels, data types).
Lookup tables (for enriching the feature records).
Input Output
Input: maintenance order
• Order: VIN, country, supplier_id
• Line: repair code, ATA category, quantity.
Note: NO FEATURES HERE
The scoring web service needs to do the same
things as the notebook, but on smaller data (5-10
lines per order):
a) It filters records (e.g. “remove Mexico data”)
b) It adds columns (=features), e.g. VIN =>
make, model, engine size, etc. Often done by
doing a lookup (=join to other table);
c) It generates features, e.g. ata_key =
ata_category + "_" + ata_subcategory.
Output: price prediction for every line.
11
NOTEBOOK DATA DUMPS
Training notebook dumps lots of things: models, lookup tables, data for the integration tests
Aggregations
Root data
and lookups
Variables
configuration
ML models
and test
dataset
12
BASE CLASSES
VIN Supplier_ID Repair code ATA cat. ATA subcat Qty
1HGBH41JXMN109186 123456 REP 74 001003 1
[Same columns] Make Model Fuel Type … Supplier City
[Same columns] BMW X5 petrol … San Francisco
FeatureRecord - container for
features
FeatureRecordGroupedSet –
grouping by any field (in our
case – by order id)
Transformers – enrich FR with features:
• LookupTransformer – adds new columns
• ApplicableTransformer – stops processing
if some field is not in the lookup table
• OilGroupTransformer – groups similar
records
• Etc. (many others exist)
Make_BMW Make_Ford … Quantity … Prediction
1.0 0 … 1 … $55
MultiModel:
vectorizes the feature record and
does prediction
13
LOOKUP TRANSFORMER
Purpose: enrich the feature record with real features by making a lookup. Backed by:
• InMemoryIndexedDataFrame – a fast in-memory lookup
• IndexedDataFrameReader – reader of the df.csv.gz + df.schema pair of files
.schema file:
.csv.gz file example:
ata_ctgy_cd,ata_sub_ctgy_cd,ata_cd_long_desc,english_cd_long_desc,cd_stat_ind
17,001100,NEW TIRE RADIAL STEEL BELTED,NEW TIRE RADIAL STEEL BELTED,A
17,003001,USED TIRE,USED TIRE,A
10,02,010045,MIRROR SPOT,MIRROR SPOT,A
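A minimal Python sketch of the same lookup idea (the real classes are Java; the .schema format here is assumed to be JSON with column metadata, which may differ from the actual file):

import json
import pandas as pd

def read_indexed_frame(csv_gz_path, schema_path, key_cols):
    with open(schema_path) as f:
        schema = json.load(f)                 # assumed: column names/types/keys
    df = pd.read_csv(csv_gz_path, compression="gzip", dtype=str)
    return df.set_index(key_cols), schema     # indexed frame = fast in-memory lookup

lookup, _ = read_indexed_frame("df.csv.gz", "df.schema",
                               ["ata_ctgy_cd", "ata_sub_ctgy_cd"])
row = lookup.loc[("17", "001100")]            # -> NEW TIRE RADIAL STEEL BELTED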
14
FULL CHAIN TRANSFORMER
Full chain transformer
combines all the atomic
transformations into one chain.
Running transform() gives the
same result as if the same
records were passed through
notebook’s PySpark ETL code.
15
OTHER CLASSES
MultiModel does two things: vectorization of the feature record (into a sparse vector of
doubles) and prediction. It encapsulates many XGBoost models (a separate model for every
ATA code – the subject of repair).
It uses biz.k11i.xgboost (https://github.com/komiya-atsushi/xgboost-predictor-java):
MultiModelReader is a class to load the global configuration and all the models from
the config provider.
ConfigProvider is an abstraction which allows reading resources from one place.
Currently there is just one implementation – ZipConfigProvider (which reads everything
from a single zip file).
PricePredictionService – a class which combines FullChainTransformer and MultiModel.
16
CONFIGURATION
Configuration structure (contents of the zip file):
• agg
• agg.json - aggregations configuration file
• <set of folders for each ata_key>
• <set of pairs .csv.gz+.schema for each aggregation>,
e.g. agg_vin_model.csv.gz, agg_vin_model.schema
• config
• global.json – configuration file that describes all models, their variables, and
possible labels for cat. variables
• models
• <for every ata_key: .bin, .config and .txt file> - serialized models by the
XGBoost
• lookup
• <pairs of .csv.gz + .schema files for lookups>
ZipConfigProvider reads this zip file (size ≈ 200MB) produced by the notebook.
Uncompressed size in the Java structures that we chose: 1.8 GB.
17
UNIT & INTEGRATION
TESTING
18
UNIT VS. INTEGRATION TESTING
Unit test | Integration test
Results depend only on Java code | Results also depend on external systems/data
Easy to write and verify | Setup of an integration test might be complicated
A single class/unit is tested in isolation | One or more components are tested together
All dependencies are mocked if needed | No mocking is used (or only unrelated components are mocked)
The test verifies only the implementation of the code | The test verifies the implementation of individual components and their interconnection behavior when used together
A unit test uses only JUnit/TestNG and a mocking framework | An integration test can use real containers and real DBs as well as special integration testing frameworks (e.g. Arquillian or DbUnit)
Mostly used by developers | Integration tests are also useful to QA, DevOps, Help Desk
A failed unit test is always a regression (if the business has not changed) | A failed integration test can also mean that the code is still correct but the environment has changed
Unit tests in an Enterprise application should last about 5 minutes | Integration tests in an Enterprise application can last for hours
19
INTEGRATION TESTING
Goal: to ensure that the web service is doing EXACTLY THE SAME THINGS as the training notebook does.
Training notebook outputs:
• mldata_20180807_093948.zip (200MB) - scoring configuration (ML models, lookups, variable configuration, etc.)
• mldata_test_20180807_093948.zip (600 MB) – scoring configuration + IT data:
What do we check:
• Take the input dataset (VIN, supplier_id, country, odometer reading, ATA category/subcategory, repair code, parts
quantity).
• Pass through FullChainTransformer and check if “Features by Python” = “Features by Java”
• Get predictions using MultiModel and check if “Prediction by Python” = “Prediction by Java”
Input Data
(all test dataset but JUST INPUT features)
VIN, supplier_id, ATA code, odom. reading, qnty
1M records
Test Data
with ALL features and predictions
All features (make, model, etc) + predictions
400K records
20
INTEGRATION TESTING
Maven life cycle phases:
1. validate - checks if the project is correct and all information is available
2. compile - compiles source code in binary artifacts
3. test - executes the tests
4. package - takes the compiled code and packages it, for example into a JAR
5. integration-test - takes the packaged result and executes additional tests which require the packaging*
6. verify - performs checks to confirm the package is valid
7. install - installs the result of the package phase into the local Maven repository
8. deploy - deploys the package to a target, e.g. a remote repository
Example of how to run:
mvn clean install -Dmldata=/path/to/mldata_test_20181010_150000.zip
21
CHECKED HASH MAP AS FEATURE CONTAINER
CheckedHashMap – a subclass of HashMap with explicit operations.
Goal: avoid errors, increase control.
Overridden operations:
• put(): fails if the key already exists in the map
• get(): fails if the key doesn’t exist in the map
• remove(): fails if the key doesn’t exist in the map
New Operations:
• overwrite(): the key must exist, otherwise it will fail
• overwriteIfExists(): overwrites if the key exists, otherwise does nothing
• putIfNotExists(): doesn’t fail, works only if the key doesn’t exist
• putOrOverwrite(): no matter if the key exists or not, puts or overwrites it there
Left as is:
• getOrDefault(): if the key doesn't exist, it will return a default
All operations do NOT allow null keys!
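A Python sketch of the same semantics (the real CheckedHashMap is a Java HashMap subclass; method names mirror the slide):

class CheckedDict(dict):
    def put(self, key, value):
        if key is None:
            raise ValueError("null keys are not allowed")
        if key in self:
            raise KeyError("put(): key %r already exists" % (key,))
        dict.__setitem__(self, key, value)

    def get(self, key):
        if key not in self:
            raise KeyError("get(): key %r doesn't exist" % (key,))
        return dict.__getitem__(self, key)

    def overwrite(self, key, value):
        if key not in self:
            raise KeyError("overwrite(): key %r doesn't exist" % (key,))
        dict.__setitem__(self, key, value)

    def put_or_overwrite(self, key, value):
        if key is None:
            raise ValueError("null keys are not allowed")
        dict.__setitem__(self, key, value)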
22
SERIALIZATION IN HEX
Double and float values inside CSV/LIBSVM files are written like this:
• 3.5/0000000000000c40 - for double values
• 5.1016541/c040a340 – for float values
At the Python side:
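A minimal Python sketch of how such value/hex pairs can be produced and parsed with the struct module (little-endian IEEE-754 bytes, matching the examples above):

import struct

def double_to_hex(x):
    return struct.pack("<d", x).hex()        # 3.5 -> "0000000000000c40"

def hex_to_double(h):
    return struct.unpack("<d", bytes.fromhex(h))[0]

def float_to_hex(x):
    return struct.pack("<f", x).hex()        # 5.1016541 -> "c040a340" (per the slide)

assert double_to_hex(3.5) == "0000000000000c40"
assert hex_to_double("0000000000000c40") == 3.5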
23
SERIALIZATION IN HEX
At the Java side:
• HexParser class with helper static
methods to parse hex values – see the
code.
• Using SuperCSV library – to read the
CSV files.
• Made a class ParseDoubleHex extending
SuperCSV’s CellProcessor leveraging
HexParser to get values out of
3.5/0000000000000c40 .
• The same for float type.
24
PRETTY PRINTING OF ERRORS
Advice: do a “pretty print” of error data. If you don’t do it, fixing bugs will be hard.
25
PRETTY PRINTING OF ERRORS
An example of how easy it is to fix errors when we have pretty printing:
26
MEMORY OPTIMIZATION FOR IT
Before the re-design of integration tests: 11 GB.
After the re-design (added partitioning): 3.8 GB (three times less).
27
SCORING WEB SERVICE
28
Citing “Hidden Technical Debt in Machine Learning Systems”:
It may be surprising to the academic community to know that only a tiny fraction of the
code in many ML systems is actually devoted to learning or prediction:
TECHNICAL DEBT
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
29
Spring Boot Web Service
Price Prediction
Service Library
SCORING WEB SERVICE ARCHITECTURE
REST
Controller
SOAP
Endpoint
HTTP
Request
HTTP
Response
Payload
Logging
ML Data
(called a “config”)
Config
Updater
Payload
Shipping
Management
REST Controller
Client
application
(scoring service
consumer)
Model Storage
(HDFS/S3)
models, variables,
labels, lookup tables
HTTP
Request
Payload Storage
(HDFS/S3)
Daily zip files of all the
payloads
Payloads
Scheduler
(Spring)
VIN Decoder
(external REST service)
30
PRICE PREDICTION CONTROLLER/ENDPOINT
Endpoints:
• /rest – for REST (RestController class)
• /soap – for SOAP (SoapEndpoint class)
• /manage – for management REST requests
Overrides of default behavior:
• for REST: override exception resolver, JSON message
converter (fails on unknown properties)
• for SOAP: proper handling of InvalidXmlException
and SoapMessageCreation – returning "500 Internal
Server Error" instead of "400 Bad Request".
• For both: async uploads of payloads (derived
MessageDispatcherServlet and DispatcherServlet)
31
PROCESSING ALGORITHM
The algorithm is common for REST controller and SOAP endpoint:
• validate the input data
• get the instance of the PricePredictionService from the manager
• perform prediction
• check the status. If it is OK – return the result, otherwise log the error and return the error.
32
REQUEST/RESPONSE IN REST
33
REQUEST/RESPONSE IN SOAP
34
MANAGEMENT CONTROLLER
The web UI is rendered
by Swagger + Springfox
Caveat #1:
it cannot generate good
examples for the
properties of type
Map<Long, SomeClass>
Caveat #2:
It doesn’t cover SOAP
/swagger-ui.html - automatically generated endpoint for help on the REST methods
35
MANAGEMENT CONTROLLER
/manage/info – renders detailed information about the configuration and 10 previous operations
36
MEMORY CLEANING
MemoryCleaner: utility class for cleaning memory.
Used on SWITCH operation.
• uses jlibs-core (https://santhosh-tekuri.github.io/jlibs/): RuntimeUtil.gc() guarantees
that garbage collection happens, unlike System.gc()
• runs a thread that retries freeing up RAM every T seconds, as long as:
• the previous attempt freed up less than X MB
• we have made no more than N attempts
Reason: old REST requests may still be running and hold
a reference to the old instance of the
PricePredictionService.
37
CODE QUALITY: SONARQUBE
38
SONARQUBE
SonarQube is a tool for continuous inspection of code quality.
It helps to find potential bugs, supports code review, checks unit test coverage, etc.
Supports 20 programming languages.
39
Advice:
try to cover all the code
with unit tests
SONARQUBE: CODE COVERAGE
This bug was found after covering these lines with unit tests.
40
SONARQUBE
SonarQube shows
places in the code
where bugs may
occur.
41
SONARQUBE: TRUE POSITIVES
Reference explaining why we need the interrupt:
https://stackoverflow.com/questions/4906799/why-invoke-thread-currentthread-interrupt-in-a-catch-interruptexception-block
42
SONARQUBE: BAD ADVICE
The worst
suggestion of
SonarQube I’ve
ever seen.
43
STATISTICS
Part Lines
Total lines 18539
Source code lines 13139 (71%)
Comment lines 2786 (15%)
Blank lines 2614 (14%)
44
TRAINING PART
(THE “NOTEBOOK”)
45
RUNNING NOTEBOOKS OFFLINE
Goal: to re-use the code in exploration/development and in production
How it works:
nbrun.py notebook.ipynb -o out -k "pyspark 2.3.1" -e dev -i 3 -z results_{}.zip -t 180 -m 5
-o = output folder
-k = kernel name
-e = environment
-i = the cell where to insert the environment load
-z = the zip file pattern (to put the ipynb + *.py files after the run)
-t = timeout for the kernel to start
-m = maximum number of times to try to start the kernel
Algorithm:
• read the .ipynb file (just the code: omitting the output which may be there)
• insert some cells into position “-i”: a cell with profile name and override of the print function
• create the output ("-o") folder, put all the modules there, and make it the working folder
• start the kernel, run the cells, output results to the output folder
• zip the contents of the output folder (the notebook with executed output + .py files)
46
RUNNING NOTEBOOKS OFFLINE
How nbrun works:
• Nbformat – https://github.com/jupyter/nbformat
for reading/writing ipynb, inserting/editing cells
• Nbconvert – https://github.com/jupyter/nbconvert
for running the notebook with a specified kernel and getting the output
Applied tricks:
• Override nbconvert.preprocessors.ExecutePreprocessor:
• preprocess_cell: to measure execution time and add
cell.metadata["ExecuteTime"] = {"end_time": time_end, "start_time": time_start}
• run_cell: to log the execution start/end and result just into console output of nbrun.py
• preprocess: to fix bugs with shutting down the kernel process when we couldn't connect to it, to change the timeout for kernel start, and to retry starting if something failed (which happens quite often with "heavy" kernels like PySpark).
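A minimal sketch of these nbformat/nbconvert mechanics (paths, kernel name and cell position are illustrative; the real nbrun.py also overrides run_cell and preprocess):

import time
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

class TimedExecutePreprocessor(ExecutePreprocessor):
    # record per-cell execution time in cell.metadata["ExecuteTime"]
    def preprocess_cell(self, cell, resources, index):
        start = time.time()
        cell, resources = super(TimedExecutePreprocessor, self).preprocess_cell(cell, resources, index)
        cell.metadata["ExecuteTime"] = {"start_time": start, "end_time": time.time()}
        return cell, resources

nb = nbformat.read("notebook.ipynb", as_version=4)
nb.cells.insert(3, nbformat.v4.new_code_cell("ENV_NAME = 'dev'"))   # the inserted override cell
ep = TimedExecutePreprocessor(timeout=180, kernel_name="pyspark 2.3.1")
ep.preprocess(nb, {"metadata": {"path": "out"}})                     # "out" folder must exist
nbformat.write(nb, "out/notebook.ipynb")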
47
ENVIRONMENT REPLACEMENT
This environment will be
loaded at development
time
Here nbrun.py will insert
a cell overriding the
ENV_NAME
This code
will dynamically load the
env_${ENV_NAME}.py
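A sketch of what the dynamic loading could look like (module and variable names are assumptions):

import importlib

ENV_NAME = "dev"   # set here at development time; nbrun.py inserts a cell that overrides it

env = importlib.import_module("env_" + ENV_NAME)   # loads env_dev.py, env_prod.py, ...
globals().update({k: v for k, v in vars(env).items() if not k.startswith("_")})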
48
PATCHED PRINT FUNCTION
Notebook’s output
System’s output
(nbrun.py)
The print() function is overridden and inserted by nbrun.py.
Goal: to allow having the output in two places – the notebook and stdout of nbrun.py
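A sketch of what the inserted override could look like (an assumption, since the slide shows only screenshots): writing to sys.__stdout__ inside the kernel ends up on the console of the parent nbrun.py process, while the normal print still goes into the notebook cell output.

import builtins
import sys

_original_print = builtins.print

def tee_print(*args, **kwargs):
    _original_print(*args, **kwargs)                        # notebook cell output
    kwargs.pop("file", None)
    _original_print(*args, file=sys.__stdout__, **kwargs)   # stdout of nbrun.py
    sys.__stdout__.flush()

builtins.print = tee_print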
49
ENVIRONMENT PARAMETERS
Environment parameters:
• FILESYSTEM, DB_NAME: place where we will store temporary tables.
Supports: s3, hdfs, cassandra, local, FiloDB
• HDFS/LOCAL/S3 parameters (depending on the type)
• Spark unpersisting parameters: a mode which forces Spark to unpersist the dataframes
(related to a Spark 1.4 bug which caused cascaded unpersisting)
• Upload parameters: S3/HDFS parameters of where to upload the results of the training
• Metrics limits: upper limits for ML metrics (if met, then the new models will be uploaded)
• Datasets: location, table names, SQL statements of how to get the source data
• Thresholds for category variables (e.g. “train for top 1000 makes, ignore the others”)
• TOP_N: how many models to train
• VIN decoder parameters: how to decode the VINs in case they're absent from the lookup
table
50
NOTEBOOK STRUCTURE
The notebook's code is split into sections:
• Environment loading (if run offline, this is done by
nbrun.py)
• Loading modules, initializing shared variables
• Data extraction
• Data transformations (filtering, joining, new features)
• Model training
• Building model metrics and analysis
• Dumping of artifacts (models, lookup, etc.), zipping results
and metrics
Each section reads data from the previous section's results and
saves its own results.
Each section can be switched off during development (to save
execution time).
51
MODULES
Functions were moved into modules.
Reasons:
• Easy to debug
• Easy to see errors
• Easy to do unit tests
• PyCharm capabilities of
code navigation and type hints
Reference: https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html
52
SHARED VARIABLES
Common variables are shared between modules.
Example of how to share variables:
1) Create a module shared.py with variables with
the same names as in the notebook
2) Call "init" at the beginning:
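A sketch of such a shared.py (variable names are illustrative):

# shared.py - module-level variables shared by the notebook and the helper modules
spark = None
sqlContext = None
ENV_NAME = None

def init(a_spark, a_sql_context, a_env_name):
    # called once at the top of the notebook so that the imported modules
    # see the same Spark session and environment settings
    global spark, sqlContext, ENV_NAME
    spark = a_spark
    sqlContext = a_sql_context
    ENV_NAME = a_env_name

In the notebook: import shared; shared.init(spark, sqlContext, ENV_NAME).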
53
MEMORY PROFILING OF THE NOTEBOOK
Memory profiling and usage of "del" at the end of sections:
a simple way is to write a decorator. After that, for "heavy",
frequently used functions, do this:
@profile
def your_func():
...
you will see how much memory the
notebook’s kernel consumed before and after
the function call.
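A minimal sketch of such a decorator (assuming psutil is available on the driver; the real decorator may log more detail):

import functools
import os
import psutil

def profile(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        proc = psutil.Process(os.getpid())
        before = proc.memory_info().rss / 1024 ** 2   # RSS in MB before the call
        result = func(*args, **kwargs)
        after = proc.memory_info().rss / 1024 ** 2    # RSS in MB after the call
        print("%s: %.0f MB -> %.0f MB" % (func.__name__, before, after))
        return result
    return wrapper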
54
MEMORY PROFILING OF THE NOTEBOOK
Each section ends with a check for unreleased Pandas and Spark dataframes.
Results:
• decreased the notebook's memory from 12 GB to 4 GB
• removed all cached Spark dataframes from the cluster memory (= decreased the demand for cluster
resources).
55
SPARK
56
HIGH LEVEL API
Goals:
• to simplify the usage of Spark dataframes/SQL
• implicit caching defaults for common operations
• to minimize errors.
Commonly used functions:
• load_df(db_name, table_name, ...), save_df(df, db_name, table_name, ...)
• change_df(a_select_cols, a_drop, a_replace, a_rename, a_add, a_distinct, a_order_by, a_drop_end, a_filter_df,
a_filter_columns, a_filter_not_df, a_filter_not_columns, a_where)
• join_df, group_by
• filter_by_where, filter_by_df, filter_by_not_df, filter_by_threshold, filter_duplicates
Example: new_df = change_df(df, a_add={"new_col": "case when col > 0.0 then 1 else 0 end"})
instead of new_df = df.withColumn("new_col", F.when(df["col"] > 0.0, 1).otherwise(0))
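A hypothetical, heavily simplified version of such a wrapper (the real change_df supports many more parameters, as listed above):

from pyspark.sql import functions as F

def change_df(df, a_select_cols=None, a_add=None, a_where=None):
    if a_add:
        for col_name, sql_expr in a_add.items():
            df = df.withColumn(col_name, F.expr(sql_expr))   # SQL expression given as a string
    if a_where:
        df = df.where(a_where)
    if a_select_cols:
        df = df.select(*a_select_cols)
    return df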
57
CREATEDATAFRAME MONKEY PATCH
Problem:
the sqlContext.createDataFrame() function doesn't have a numSlices parameter (which is present in
sc.parallelize() and defines the number of partitions). This is true up to Spark 2.3.1.
Why it is important:
to control the number of partitions when converting a Pandas dataframe into a Spark dataframe.
Solution: patch three functions (code is in the sparkdf.py):
• SparkSession._createFromLocal = _createFromLocalMonkeyPatch
• SparkSession.createDataFrame = createDataFrameMonkeyPatch_session
• SQLContext.createDataFrame = createDataFrameMonkeyPatch_sqlcontext
In all of the overrides, add numSlices and pass it through to sc.parallelize().
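A simplified sketch of the idea (this is not the project's exact sparkdf.py; only the public createDataFrame is wrapped here, and the pandas conversion is naive):

import pandas as pd
from pyspark.sql import SparkSession

_orig_createDataFrame = SparkSession.createDataFrame

def createDataFrame_with_numSlices(self, data, schema=None, numSlices=None, **kwargs):
    if numSlices is not None:
        if isinstance(data, pd.DataFrame):
            if schema is None:
                schema = list(data.columns)
            data = [tuple(row) for row in data.astype(object).values.tolist()]
        # parallelize explicitly so that the number of partitions is under our control
        data = self.sparkContext.parallelize(data, numSlices)
    return _orig_createDataFrame(self, data, schema=schema, **kwargs)

SparkSession.createDataFrame = createDataFrame_with_numSlices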
58
NOTEBOOK KERNEL PARAMETERS
Marked are those parameters which we strongly advise applying for a standalone Spark cluster:
{
"display_name": "pyspark cluster - ibobak - 3e 3c",
"language": "python",
"argv": ["/opt/conda/envs/py27/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
"env": {
"SPARK_HOME":"/opt/spark",
"PYTHONPATH":"/opt/spark/python/lib/py4j-0.10.4-src.zip:/opt/spark/python",
"PYTHONSTARTUP":"/opt/spark/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS":" --packages com.databricks:spark-avro_2.11:3.2.0
--driver-memory 5G --executor-memory 10G --num-executors 3 --executor-cores 3 --total-executor-cores 9
--master spark://10.4.12.36:7077
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=1024m
--conf spark.driver.extraJavaOptions="-Xss16m" --conf spark.executor.extraJavaOptions="-Xss16m"
--conf spark.cassandra.output.consistency.level=ALL --conf spark.cassandra.input.consistency.level=ALL pyspark-shell"
}
}
All three params num-executors, executor-cores and total-executor-cores must be specified (otherwise the number of cores
will be unpredictable). The serialization parameters are strongly advised to speed up dataframe caching. -Xss16m is advised
to avoid stack overflow errors. The Cassandra parameters are needed when you write to Cassandra and don't want records
to be lost after saving and re-loading them.
59
AUTOMATION
60
JENKINS JOBS
Jenkins jobs to run different parts of the flow:
• ml_build_scoreapi: builds the price prediction web service (Java), runs the unit tests, does SonarQube analysis,
uploads the jar to the artifactory.
• ml_config_scoreapi: takes the new config files from Git and puts them on the server (dev/qa/prod), restarts the
service.
• ml_deploy_scoreapi: takes the jar file from the artifactory, puts it on the server (dev/qa/prod), and restarts
the web service.
• ml_deploy_training: takes the notebooks from Git and puts them to the working folder of the training server.
• ml_run_training: runs on the training server these steps:
• Data extraction into parquet files.
• Offline running of the notebooks (using nbrun.py)
• Integration tests
• Checking ML metrics
• Uploading the training results on S3
• Issuing /manage/syncswitch HTTP GET-request on the working instance of the scoring web service.
61
CONTACTS
Ihor Bobak
E-mail: Ihor_Bobak@epam.com
Skype: ibobak
Linkedin: https://www.linkedin.com/in/ibobak
Mais conteúdo relacionado

Mais procurados

[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft Azure[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft AzureKorkrid Akepanidtaworn
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Databricks
 
201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019Mark Tabladillo
 
Continuous Deployment for Deep Learning
Continuous Deployment for Deep LearningContinuous Deployment for Deep Learning
Continuous Deployment for Deep LearningDatabricks
 
ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14andrejusb
 
Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)Steve Lange
 
Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019serge luca
 
Democratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builderDemocratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builderVenkatarangan Thirumalai
 
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflowContinuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflowDatabricks
 
Melbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power AutomateMelbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power AutomateAndre Margono
 
Models in Minutes using AutoML
Models in Minutes using AutoMLModels in Minutes using AutoML
Models in Minutes using AutoMLBill Liu
 
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model ServingDAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model Servingamesar0
 
Tech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet PreviewTech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet PreviewMichal Juhas
 
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...PAPIs.io
 
Building workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration MondayBuilding workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration MondayBizTalk360
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AIVikasBisoi
 
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)serge luca
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro sessionAvinash Patil
 

Mais procurados (20)

[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft Azure[AI] ML Operationalization with Microsoft Azure
[AI] ML Operationalization with Microsoft Azure
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
 
201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019
 
Continuous Deployment for Deep Learning
Continuous Deployment for Deep LearningContinuous Deployment for Deep Learning
Continuous Deployment for Deep Learning
 
ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14ADF Mythbusters UKOUG'14
ADF Mythbusters UKOUG'14
 
Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)Whats New In 2010 (Msdn & Visual Studio)
Whats New In 2010 (Msdn & Visual Studio)
 
Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019Doctor Flow- Best practices Microsoft flow - Techorama 2019
Doctor Flow- Best practices Microsoft flow - Techorama 2019
 
Democratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builderDemocratize development with Microsoft Power Apps and AI builder
Democratize development with Microsoft Power Apps and AI builder
 
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflowContinuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
Continuous Delivery of ML-Enabled Pipelines on Databricks using MLflow
 
Melbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power AutomateMelbourne UG Presentation - UI Flow for Power Automate
Melbourne UG Presentation - UI Flow for Power Automate
 
Models in Minutes using AutoML
Models in Minutes using AutoMLModels in Minutes using AutoML
Models in Minutes using AutoML
 
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model ServingDAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
 
Tech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet PreviewTech Mind Maps - Booklet Preview
Tech Mind Maps - Booklet Preview
 
Wwf
WwfWwf
Wwf
 
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
 
Building workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration MondayBuilding workflow solution with Microsoft Azure and Cloud | Integration Monday
Building workflow solution with Microsoft Azure and Cloud | Integration Monday
 
Power Apps for developers
Power Apps for developersPower Apps for developers
Power Apps for developers
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
 
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
Microsoft flow best practices SharePoint Saturday Bremen 2019 (Germany)
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro session
 

Semelhante a Productionalizing ML : Real Experience

Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineJan Wiegelmann
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...Piyush Kumar
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning DevelopmentMatei Zaharia
 
Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerCisco Canada
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Animesh Singh
 
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMEconfluent
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilSunita Shrivastava
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Overhauling a database engine in 2 months
Overhauling a database engine in 2 monthsOverhauling a database engine in 2 months
Overhauling a database engine in 2 monthsMax Neunhöffer
 
AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)Igor Talevski
 
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...Amazon Web Services
 

Semelhante a Productionalizing ML : Real Experience (20)

Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / Pipeline
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serv...
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning Development
 
Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data center
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
 
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
 
Informatica slides
Informatica slidesInformatica slides
Informatica slides
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Overhauling a database engine in 2 months
Overhauling a database engine in 2 monthsOverhauling a database engine in 2 months
Overhauling a database engine in 2 months
 
AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)
 
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
 

Último

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Productionalizing ML : Real Experience

  • 1. PRODUCTIONALIZING ML: REAL EXPERIENCE Ihor Bobak Data Scientist, EPAM Systems
  • 3. 3 PROJECT INFO Customer: A Canadian company that provides different fleet management services. E.g. it runs a call center that handles all the maintenance and repairs of vehicles (acts as a “proxy” between a client and service providers). Use case: A fleet owner contacts the agent to ask for assistance with the maintenance. The agent contacts nearby service providers, gets offers, selects the supplier, negotiates the price for each line of the maintenance order. Problem: a) Price negotiation takes agent’s time b) Agents need to remember the details on cars/makes/models/spare parts to properly validate the price. Solution: Price Prediction Web Service (based on ML) which predicts the maintenance price based on the information about the vehicle, type of service, client location, etc.
  • 4. 4 DATA SCIENCE SCOPE Data Extraction Destination: parquet files on HDFS, scope: 2 last years, output: 28 mln. rows Data Transformation Filtering, joining, new fields/expressions, destination: parquet files on HDFS, output: 5 mln. rows ML Pipeline label encoding, one hot encoding, vector assembling, training XGBoost models, performance metrics. Data Sources Sybase, Cassandra, others Typical scope of data scientist’s work: We made two models: • classification model answering the question “is the price for the repair relevant or not”? • regression model (customer’s choice) answering the question “what is the recommended price for this maintenance item?” 2 times boost in agents time 1.10 times decrease in savings (due to FN)
  • 5. 5 SOLUTION ARCHITECTURE Client application (scoring service consumer) Scoring Web Service (SOAP+REST) Stack: Java, Spring Data Extraction Destination: parquet files on HDFS, scope: 2 last years, output: 28 mln. rows Data Transformation Filtering, joining, new fields/expressions, destination: parquet files on HDFS, output: 5 mln. rows ML Pipeline label encoding, one hot encoding, vector assembling, training XGBoost models, performance metrics. Data Sources Sybase, Cassandra, others Uploading Results Upload models, training metrics, lookup tables, labels for category variables. Destination: S3, output: 150MB zip archive HTTP (SOAP) requests for price prediction Downloading new models Scheduled run of training process Administrator HTTP (REST) management commands Model Storage (HDFS/S3) models, variables, labels, lookup tables
  • 6. 6 TECHNOLOGY STACK Training part: • Jupyter notebook • Spark 2.3.1 • Pandas, Numpy, Scikit-learn and other libraries • XGBoost Scoring part (web service): • Java • Spring Boot • xgboost-predictor-java library https://github.com/komiya-atsushi/xgboost-predictor-java • Lots of other open source Java libraries
  • 7. 7 TASKS AT DIFFERENT ENVIRONMENTS Exploration/Development Production Training ScoringEnvironment: • Python, Jupyter Notebook • Spark/Scikit/XGBoost, etc. Tasks: • Get input data • Rename fields • Check values • Modify field values • Add new fields • Filter rows • Join other tables • ML Pipeline tasks: • Label encoding • One hot encoding • Train/test split • Model training • Metrics calculation Environment: • Java, Spring Boot, etc. Tasks/Challenges: We need to do the same things on Java: “rename fields”, “check values”, “modify field values”, “add fields”, “filter rows”, “join other tables” and some of the “ML pipepine tasks” (“label encoding”, “one hot encoding”, “scoring by model”) Challenges: a) How to represent data? b) What libraries to use for transformation? c) What libraries to use for ML pipeline tasks? d) What libraries to use for scoring? Environment: • the same as exploration/development Goal: re-use the same code as much as possible. Other tasks that we need to do here: • Scheduled running of the whole training cycle • Uploading of results to some storage (S3/HDFS) • Alerting if metrics are below the expectations • Alerting if errors occurred during training
  • 9. 9 Spring Boot Web Service Price Prediction Service Library SCORING WEB SERVICE ARCHITECTURE REST Controller SOAP Endpoint HTTP Request Transformers Vectorization (Label Encoder, One-hot encoder) Scoring (ML models) HTTP Response Payload Logging ML Data Serialized Models + vectorization info (variables, labels, data types). Lookup tables (for enriching the feature records). The scoring web service is a purely Java solution using the artifacts (“ML Data”) output by the Python’s training code. *Many other things (payload shipment on S3/HDFS, updater of the ML Data files, management interface) are not shown here and will be shown later.
  • 10. 10 PRICE PREDICTION SERVICE LIBRARY Price Prediction Service Library Transformers Vectorization (Label Encoder, One-hot encoder) Scoring ML Data Serialized Models + vectorization info (variables, labels, data types). Lookup tables (for enriching the feature records). Input Output Input: maintenance order • Order: VIN, country, supplier_id • Line: repair code, ATA category, quantity. Note: NO FEATURES HERE Scoring web service needs to do the same things as notebook, but on smaller data (5-10 lines per order): a) It filters records (e.g. “remove Mexico data”) b) It adds columns (=features), e.g. VIN => make, model, engine size, etc. Often done by doing a lookup (=join to other table); c) It generates features, e.g. ata_key = aga_category + “_” + ata_subcategory. Output: price prediction for every line.
  • 11. 11 NOTEBOOK DATA DUMPS Training notebook dumps lots of things: models, lookup tables, data for the integration tests Aggregations Root data and lookups Variables configuration ML models and test dataset
  • 12. 12 BASE CLASSES VIN Supplier_ID Repair code ATA cat. ATA subcat Qty 1HGBH41JXM N109186 123456 REP 74 001003 1 [Same columns] Make Model Fuel Type … Supplier City [Same columns] BMW X5 petrol … San Francisco FeatureRecord - container for features FeatureRecordGroupedSet – grouping by any field (in our case – by order id) Transformers – enrich FR with features: • LookupTransformer – adds new columns • ApplicableTransformer – stops processing if some field is not in the lookup table • OilGroupTransformer – groups similar records • Etc. (many others exist) Make_BMW Make_Ford … Quantity … Prediction 1.0 0 … 1 … $55 MultiModel: vectorizes the feature record and does prediction
  • 13. 13 LOOKUP TRANSFORMER Purpose: enrich the feature record with real features by making a lookup. Backed with: • InMemoryIndexedDataFrame – a fast in-memory lookup • IndexedDataFrameReader – reader of the df.csv.gz + df.schema pair of files. .schema file: (contents shown on the slide). .csv.gz file example:
ata_ctgy_cd,ata_sub_ctgy_cd,ata_cd_long_desc,english_cd_long_desc,cd_stat_ind
17,001100,NEW TIRE RADIAL STEEL BELTED,NEW TIRE RADIAL STEEL BELTED,A
17,003001,USED TIRE,USED TIRE,A
10,02,010045,MIRROR SPOT,MIRROR SPOT,A
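A hedged sketch of how such a lookup-backed transformer could look, building on the FeatureRecord sketch above; the composite-key map is an assumption about what InMemoryIndexedDataFrame does internally, not a copy of the real class:

import java.util.HashMap;
import java.util.Map;

// Sketch: an in-memory "indexed dataframe" keyed by the lookup column value.
class InMemoryIndexedDataFrame {
    private final Map<String, Map<String, Object>> rowsByKey = new HashMap<>();

    void addRow(String key, Map<String, Object> row) { rowsByKey.put(key, row); }
    Map<String, Object> find(String key) { return rowsByKey.get(key); }
}

// Sketch: enrich a FeatureRecord by joining it to the lookup table on one key column.
class LookupTransformer implements Transformer {
    private final InMemoryIndexedDataFrame lookup;
    private final String keyColumn;

    LookupTransformer(InMemoryIndexedDataFrame lookup, String keyColumn) {
        this.lookup = lookup;
        this.keyColumn = keyColumn;
    }

    @Override
    public FeatureRecord transform(FeatureRecord record) {
        if (record.isStopped()) return record;                    // rule from the editor's notes
        String key = String.valueOf(record.features.get(keyColumn));
        Map<String, Object> row = lookup.find(key);
        if (row != null) record.features.putAll(row);             // add the looked-up columns as new features
        return record;                                            // the real code returns a copy, not the same instance
    }
}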
  • 14. 14 FULL CHAIN TRANSFORMER The full chain transformer combines all the atomic transformations into one chain. Running transform() gives the same result as if the same records were passed through the notebook’s PySpark ETL code. A minimal sketch of the chaining is given below.
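As an illustration, the chaining itself can be as simple as the following sketch (assumed structure, building on the Transformer interface above):

import java.util.List;

// Sketch: apply all atomic transformers in order, mirroring the notebook's ETL sequence.
class FullChainTransformer implements Transformer {
    private final List<Transformer> steps;

    FullChainTransformer(List<Transformer> steps) { this.steps = steps; }

    @Override
    public FeatureRecord transform(FeatureRecord record) {
        FeatureRecord current = record;
        for (Transformer step : steps) {
            if (current.isStopped()) break;       // stopped records are returned as-is
            current = step.transform(current);
        }
        return current;
    }
}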
  • 15. 15 OTHER CLASSES MultiModel does two things: vectorization of the feature record (into a sparse vector of doubles) and prediction. It encapsulates many XGBoost models (a separate model for every ATA code – a subject of repair). It uses biz.k11i.xgboost (https://github.com/komiya-atsushi/xgboost-predictor-java); see the sketch below. MultiModelReader is a class to load the global configuration and all the models from the config provider. ConfigProvider is an abstraction which allows reading resources from one place. Currently there is just one implementation – ZipConfigProvider (to read everything from a single zip file). PricePredictionService – a class which combines FullChainTransformer and MultiModel.
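A minimal sketch of scoring with xgboost-predictor-java. It assumes the library's Predictor/FVec API and a per-ata_key map of models; the real MultiModel also performs the vectorization itself, which is omitted here:

import biz.k11i.xgboost.Predictor;
import biz.k11i.xgboost.util.FVec;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch: one serialized XGBoost model per ata_key (the subject of repair).
class MultiModelSketch {
    private final Map<String, Predictor> modelsByAtaKey = new HashMap<>();

    void loadModel(String ataKey, String modelPath) throws IOException {
        try (FileInputStream in = new FileInputStream(modelPath)) {
            modelsByAtaKey.put(ataKey, new Predictor(in));
        }
    }

    // 'sparseVector' is the vectorized feature record: feature index -> value.
    double predict(String ataKey, Map<Integer, Double> sparseVector) {
        Predictor predictor = modelsByAtaKey.get(ataKey);
        FVec features = FVec.Transformer.fromMap(sparseVector);
        return predictor.predict(features)[0];
    }
}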
  • 16. 16 CONFIGURATION Configuration structure (contents of the zip file):
• agg
  • agg.json – aggregations configuration file
  • <a folder for each ata_key>
    • <a pair of .csv.gz + .schema files for each aggregation>, e.g. agg_vin_model.csv.gz, agg_vin_model.schema
• config
  • global.json – configuration file that describes all models, their variables, and the possible labels for categorical variables
• models
  • <for every ata_key: a .bin, .config and .txt file> – models serialized by XGBoost
• lookup
  • <pairs of .csv.gz + .schema files for lookups>
ZipConfigProvider reads this zip file (size ≈ 200 MB) produced by the notebook; a simplified sketch follows. Uncompressed size in the Java structures that we chose: 1.8 GB.
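A hedged sketch of what a zip-backed config provider could look like using plain java.util.zip; the real ZipConfigProvider may differ:

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch: read named resources (models, lookups, config/global.json) from a single mldata zip file.
class ZipConfigProviderSketch implements AutoCloseable {
    private final ZipFile zip;

    ZipConfigProviderSketch(String zipPath) throws IOException { this.zip = new ZipFile(zipPath); }

    InputStream open(String entryName) throws IOException {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null) throw new IOException("Missing config entry: " + entryName);
        return zip.getInputStream(entry);
    }

    @Override
    public void close() throws IOException { zip.close(); }
}

Usage would be along the lines of provider.open("config/global.json") or provider.open("models/" + ataKey + ".bin").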
  • 18. 18 UNIT VS. INTEGRATION TESTING
• Unit test: results depend only on Java code. Integration test: results also depend on external systems/data.
• Unit test: easy to write and verify. Integration test: setup might be complicated.
• Unit test: a single class/unit is tested in isolation. Integration test: one or more components are tested.
• Unit test: all dependencies are mocked if needed. Integration test: no mocking is used (or only unrelated components are mocked).
• Unit test: verifies only the implementation of the code. Integration test: verifies the implementation of individual components and their interconnection behavior when they are used together.
• Unit test: uses only JUnit/TestNG and a mocking framework. Integration test: can use real containers and real DBs as well as special integration testing frameworks (e.g. Arquillian or DbUnit).
• Unit test: mostly used by developers. Integration test: also useful to QA, DevOps, Help Desk.
• Unit test: a failure is always a regression (if the business has not changed). Integration test: a failure can also mean that the code is still correct but the environment has changed.
• Unit tests in an enterprise application should last about 5 minutes. Integration tests in an enterprise application can last for hours.
  • 19. 19 INTEGRATION TESTING Goal: to ensure that the web service is doing EXACTLY THE SAME THINGS as the training notebook does. Training notebook outputs: • mldata_20180807_093948.zip (200 MB) – scoring configuration (ML models, lookups, variable configuration, etc.) • mldata_test_20180807_093948.zip (600 MB) – scoring configuration + integration-test data: the input data (the whole test dataset with JUST INPUT features: VIN, supplier_id, ATA code, odometer reading, quantity; 1M records) and the test data with ALL features and predictions (all features – make, model, etc. – plus predictions; 400K records). What do we check: • Take the input dataset (VIN, supplier_id, country, odometer reading, ATA category/subcategory, repair code, parts quantity). • Pass it through FullChainTransformer and check if “Features by Python” = “Features by Java”. • Get predictions using MultiModel and check if “Prediction by Python” = “Prediction by Java”. A simplified sketch of these checks is given below.
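A simplified JUnit-style sketch of these checks, building on the FeatureRecord sketch above; the comparison tolerance is an assumption for illustration:

import static org.junit.Assert.assertEquals;

import java.util.Map;

// Sketch: for every row of the test dataset, the Java pipeline must reproduce the features
// computed by the Python notebook, and the predictions must match within a small tolerance.
class PythonVsJavaChecks {
    static void checkFeatures(Map<String, Object> featuresByPython, FeatureRecord enrichedByJava) {
        for (Map.Entry<String, Object> expected : featuresByPython.entrySet()) {
            assertEquals("Feature mismatch: " + expected.getKey(),
                    expected.getValue(), enrichedByJava.features.get(expected.getKey()));
        }
    }

    static void checkPrediction(double predictionByPython, double predictionByJava) {
        assertEquals(predictionByPython, predictionByJava, 1e-6);
    }
}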
  • 20. 20 INTEGRATION TESTING Maven life cycle phases: 1. validate – checks that the project is correct and all information is available 2. compile – compiles the source code into binary artifacts 3. test – executes the tests 4. package – takes the compiled code and packages it (for example, into a jar) 5. integration-test – takes the packaged result and executes additional tests which require the packaging 6. verify – performs checks that the package is valid 7. install – installs the result of the package phase into the local Maven repository 8. deploy – deploys the package to a target, e.g. a remote repository. Example of how to run: mvn clean install -Dmldata=/path/to/mldata_test_20181010_150000.zip
  • 21. 21 CHECKED HASH MAP AS FEATURE CONTAINER CheckedHashMap – an override of HashMap, but with explicit operations. Goal: avoid errors, increase control. Overridden operations: • put(): fails if the key already exists in the map • get(): fails if the key doesn’t exist in the map • remove(): fails if the key doesn’t exist in the map New operations: • overwrite(): the key must exist, otherwise it will fail • overwriteIfExists(): overwrites if the key exists, otherwise does nothing • putIfNotExists(): doesn’t fail, works only if the key doesn’t exist • putOrOverwrite(): no matter if the key exists or not, puts or overwrites it there Left as is: • getOrDefault(): if the key doesn’t exist, it will return a default All operations do NOT allow null keys! A minimal sketch is given below.
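A minimal sketch of the idea (assumed generics and exception types; the real class has more operations):

import java.util.HashMap;

// Sketch: a HashMap that fails fast instead of silently returning null or silently overwriting values.
class CheckedHashMap<K, V> extends HashMap<K, V> {
    @Override
    public V put(K key, V value) {
        if (key == null) throw new IllegalArgumentException("Null keys are not allowed");
        if (containsKey(key)) throw new IllegalStateException("Key already exists: " + key);
        return super.put(key, value);
    }

    @Override
    public V get(Object key) {
        if (!containsKey(key)) throw new IllegalStateException("Key does not exist: " + key);
        return super.get(key);
    }

    public V overwrite(K key, V value) {
        if (!containsKey(key)) throw new IllegalStateException("Key does not exist: " + key);
        return super.put(key, value);
    }

    public V putOrOverwrite(K key, V value) {
        if (key == null) throw new IllegalArgumentException("Null keys are not allowed");
        return super.put(key, value);
    }
}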
  • 22. 22 SERIALIZATION IN HEX Double and float values inside CSV/LIBSVM files are written like this: • 3.5/0000000000000c40 – for double values • 5.1016541/c040a340 – for float values At the Python side, the hex part is written out from the raw IEEE-754 bits of the value (the code is shown on the slide).
  • 23. 23 SERIALIZATION IN HEX At the Java side: • A HexParser class with helper static methods to parse hex values – see the sketch below. • The SuperCSV library is used to read the CSV files. • A class ParseDoubleHex extends SuperCSV’s CellProcessor and leverages HexParser to get values out of 3.5/0000000000000c40. • The same is done for the float type.
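The core of the decoding is only a few lines. The sketch below assumes the hex part is the little-endian byte order of the IEEE-754 bits, which is consistent with the examples on the previous slide (the real HexParser/ParseDoubleHex classes may be organized differently):

// Sketch: values like "3.5/0000000000000c40" carry the exact IEEE-754 bits (little-endian hex)
// next to the human-readable number; parsing the hex avoids any precision loss.
class HexParserSketch {
    static double parseDoubleHex(String valueWithHex) {
        String hex = valueWithHex.substring(valueWithHex.indexOf('/') + 1);
        long bits = Long.reverseBytes(Long.parseUnsignedLong(hex, 16));
        return Double.longBitsToDouble(bits);
    }

    static float parseFloatHex(String valueWithHex) {
        String hex = valueWithHex.substring(valueWithHex.indexOf('/') + 1);
        int bits = Integer.reverseBytes(Integer.parseUnsignedInt(hex, 16));
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(parseDoubleHex("3.5/0000000000000c40"));  // prints 3.5
        System.out.println(parseFloatHex("5.1016541/c040a340"));     // prints ~5.1016541
    }
}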
  • 24. 24 PRETTY PRINTING OF ERRORS Advice: do a “pretty print” of error data. If you don’t do it, fixing bugs will be hard.
  • 25. 25 PRETTY PRINTING OF ERRORS Example of how easy it is to fix errors when we have a pretty print (a sketch of such a printer is given below):
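A sketch of what such a pretty print could look like: dump the expected and actual features in sorted order so that a diff tool (WinMerge, per the editor's notes) shows the mismatch at a glance. The exact formatting here is an assumption:

import java.util.Map;
import java.util.TreeMap;

// Sketch: print all features sorted by name, so expected-vs-actual logs line up in a diff tool.
class FeaturePrettyPrinter {
    static String prettyPrint(String title, Map<String, Object> features) {
        StringBuilder sb = new StringBuilder(title).append(System.lineSeparator());
        for (Map.Entry<String, Object> e : new TreeMap<>(features).entrySet()) {
            sb.append(String.format("  %-40s = %s%n", e.getKey(), e.getValue()));
        }
        return sb.toString();
    }
}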
  • 26. 26 MEMORY OPTIMIZATION FOR IT Before the re-design of the integration tests: 11 GB. After the re-design (added partitioning): 3.8 GB (three times less).
  • 28. 28 TECHNICAL DEBT Citing “Hidden Technical Debt in Machine Learning Systems” (https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems): “It may be surprising to the academic community to know that only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction.”
  • 29. 29 SCORING WEB SERVICE ARCHITECTURE Spring Boot Web Service wrapping the Price Prediction Service Library, plus: REST Controller, SOAP Endpoint, Payload Logging, ML Data (called a “config”), Config Updater, Payload Shipping, Management REST Controller, Scheduler (Spring). External parts: Client application (scoring service consumer) sending HTTP requests, Model Storage (HDFS/S3: models, variables, labels, lookup tables), Payload Storage (HDFS/S3: daily zip files of all the payloads), VIN Decoder (external REST service).
  • 30. 30 PRICE PREDICTION CONTROLLER/ENDPOINT Endpoints: • /rest – for REST (RestController class) • /soap – for SOAP (SoapEndpoint class) • /manage – for management REST requests Overrides of default behavior: • For REST: override the exception resolver and the JSON message converter (fails on unknown properties). • For SOAP: proper handling of InvalidXmlException and SoapMessageCreationException – doing “500 Internal Server Error” instead of “400 Bad Request”. • For both: async uploads of payloads (derived MessageDispatcherServlet and DispatcherServlet).
  • 31. 31 PROCESSING ALGORITHM The algorithm is common for the REST controller and the SOAP endpoint: • validate the input data • get the instance of the PricePredictionService from the manager • perform the prediction • check the status: if it is OK – return the result, otherwise log the error and return the error response. A simplified sketch is given below.
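A hedged sketch of that common flow. The names safePredict() and getErrorDescriptionForServerLog() come from the editor's notes; the surrounding types are simplified stand-ins, not the real request/response classes:

// Sketch of the flow shared by the REST controller and the SOAP endpoint.
class PredictionFlowSketch {
    interface PredictionRequest { }
    interface PredictionResponse {
        boolean isOk();
        String getErrorDescriptionForServerLog();   // logged on the server side only, never returned to the client
    }
    interface PricePredictionService {
        PredictionResponse safePredict(PredictionRequest request);  // never throws; errors are inside the response
    }

    private final PricePredictionService service;   // in the real service this instance comes from a manager

    PredictionFlowSketch(PricePredictionService service) { this.service = service; }

    PredictionResponse handle(PredictionRequest request) {
        if (request == null) throw new IllegalArgumentException("Empty request");  // 1. validate the input
        PredictionResponse response = service.safePredict(request);                // 2-3. get the service, predict
        if (!response.isOk()) {
            System.err.println(response.getErrorDescriptionForServerLog());        // 4. log the error details
        }
        return response;                                                            // OK result or error payload
    }
}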
  • 34. 34 MANAGEMENT CONTROLLER /swagger-ui.html – automatically generated endpoint for help on the REST methods. The web UI is rendered by Swagger + Springfox. Caveat #1: it cannot generate good examples for properties of type Map<Long, SomeClass>. Caveat #2: it doesn’t cover SOAP.
  • 35. 35 MANAGEMENT CONTROLLER /manage/info – renders detailed information about the configuration and 10 previous operations
  • 36. 36 MEMORY CLEANING MemoryCleaner: a utility class for cleaning memory. Used on the SWITCH operation. • Uses jlibs-core (https://santhosh-tekuri.github.io/jlibs/): RuntimeUtil.gc() guarantees garbage collection to happen, contrary to System.gc(). • Runs a thread that retries freeing up RAM every T seconds while: • at the previous attempt we freed up less than X MB • we have made no more than N attempts Reason: old REST requests may still be running and hold a reference to the old instance of the PricePredictionService. A sketch of the retry loop is given below.
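A sketch of the retry idea. Here a plain daemon thread and System.gc() stand in for the jlibs RuntimeUtil.gc() call, and the thresholds are illustrative:

// Sketch: after a config SWITCH, keep retrying GC until enough memory has been reclaimed
// or the attempt limit is reached (old requests may still hold the previous config in memory).
class MemoryCleanerSketch {
    static void cleanAsync(long periodMillis, long minFreedBytes, int maxAttempts) {
        Thread cleaner = new Thread(() -> {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                long before = usedBytes();
                System.gc();                          // the real code uses jlibs RuntimeUtil.gc()
                long freed = before - usedBytes();
                if (freed >= minFreedBytes) break;    // enough memory reclaimed - stop retrying
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();   // restore the flag and stop (see the SonarQube slide)
                    break;
                }
            }
        }, "memory-cleaner");
        cleaner.setDaemon(true);
        cleaner.start();
    }

    private static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }
}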
  • 38. 38 SONARQUBE SonarQube is a tool for continuous inspection of code quality. It helps find potential bugs, performs automated code review, checks unit tests and coverage, etc. It supports 20+ programming languages.
  • 39. 39 SONARQUBE: CODE COVERAGE Advice: try to cover all the code with unit tests. The bug shown on the slide was found after covering those lines with unit tests.
  • 40. 40 SONARQUBE SonarQube shows places in the code where bugs may occur.
  • 41. 41 SONARQUBE: TRUE POSITIVES Reference explaining why we need the interrupt: https://stackoverflow.com/questions/4906799/why-invoke-thread-currentthread-interrupt-in-a-catch-interruptexception-block
  • 42. 42 SONARQUBE: BAD ADVICE The worst suggestion from SonarQube I’ve ever seen.
  • 43. 43 STATISTICS Total lines: 18539. Source code lines: 13139 (71%). Comment lines: 2786 (15%). Blank lines: 2614 (14%).
  • 45. 45 RUNNING NOTEBOOKS OFFLINE Goal: to re-use the code in exploration/development and in production. How it works: nbrun.py notebook.ipynb -o out -k "pyspark 2.3.1" -e dev -i 3 -z results_{}.zip -t 180 -m 5 -o = output folder -k = kernel name -e = environment -i = the cell where to insert the environment load -z = the zip file pattern (to put the ipynb + *.py files into after the run) -t = timeout for the kernel to start -m = maximum number of times to try to start the kernel Algorithm: • read the .ipynb file (just the code, omitting any output which may be there) • insert some cells at position “-i”: a cell with the profile name and an override of the print function • create the output folder, put all modules there and make this folder the working one • start the kernel, run the cells, output results to the output folder • zip the contents of the output folder (the notebook with executed content + py files)
  • 46. 46 RUNNING NOTEBOOKS OFFLINE How nbrun works: • nbformat – https://github.com/jupyter/nbformat for reading/writing ipynb, inserting/editing cells • nbconvert – https://github.com/jupyter/nbconvert for running the notebook with a specified kernel and getting the output Applied tricks: • Override nbconvert.preprocessors.ExecutePreprocessor: • preprocess_cell: to measure execution time and add cell.metadata["ExecuteTime"] = {"end_time": time_end, "start_time": time_start} • run_cell: to log the execution start/end and the result to the console output of nbrun.py • preprocess: to fix the bugs with shutting down the kernel process in case we couldn’t connect to it, to change the timeout for the kernel start, and to retry starting if something failed (which happens quite often with “heavy” kernels like PySpark).
  • 47. 47 ENVIRONMENT REPLACEMENT This environment will be loaded at development time. Here nbrun.py will insert a cell overriding the ENV_NAME. This code will dynamically load env_${ENV_NAME}.py.
  • 48. 48 PATCHED PRINT FUNCTION The print() function is overridden, and the override is inserted by nbrun.py. Goal: to have the output in two places – the notebook’s output and the system’s output (stdout of nbrun.py).
  • 49. 49 ENVIRONMENT PARAMETERS Environment parameters: • FILESYSTEM, DB_NAME: place where we will store temporary tables. Supports: s3, hdfs, cassandra, local, FiloDB • HDFS/LOCAL/S3 parameters (depending on the type) • Spark unpersisting parameters: mode which will force the Spark to unpersist the dataframes (related to a bug with Spark 1.4 which caused cascaded unpersisting) • Upload parameters: S3/HDFS parameters of where to upload the results of the training • Metrics limits: upper limits for ML metrics (if met, then the new models will be uploaded) • Datasets: location, table names, SQL statements of how to get the source data • Thresholds for category variables (e.g. “train for top 1000 makes, ignore the others”) • TOP_N: how many models to train • VIN decoder parameters: how to decode the VINs in the case if they’re absent in the lookup table
  • 50. 50 NOTEBOOK STRUCTURE The notebook’s code is split into sections: • Environment loading (if run offline – done by nbrun.py) • Loading modules, initializing shared variables • Data extraction • Data transformations (filtering, joining, new features) • Model training • Building model metrics and analysis • Dumping of artifacts (models, lookups, etc.), zipping results and metrics Each section reads data from the previous section’s results and saves its own results. Each section can be switched off during development (to save execution time).
  • 51. 51 MODULES Functions were moved into modules. Reasons: • Easy to debug • Easy to see errors • Easy to write unit tests • PyCharm capabilities of code navigation and type hints Reference: https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html
  • 52. 52 SHARED VARIABLES Common variables are shared between modules. Example of how to share variables: 1) Create a module shared.py with variables with the same names as in the notebook 2) Call “init” at the beginning.
  • 53. 53 MEMORY PROFILING OF THE NOTEBOOK Memory profiling and usage of “del” at the end of sections. A simple way is to make some decorators; after that, for “heavy” frequently used functions, do this: @profile def your_func(): ... You will see how much memory the notebook’s kernel consumed before and after the function call.
  • 54. 54 MEMORY PROFILING OF THE NOTEBOOK Each section ends with checking unreleased Pandas and Spark dataframes. Results: • decreased memory of the notebook from 12GB to 4GB • removed all cached Spark dataframes from the cluster memory (=decreased the demands for cluster resources).
  • 56. 56 HIGH LEVEL API Goals: • to simplify usage of Spark dataframes/SQL • implicit caching defaults for common operations • to minimize errors. Commonly used functions: • load_df(db_name, table_name, ...), save_df(df, db_name, table_name, ...) • change_df(a_select_cols, a_drop, a_replace, a_rename, a_add, a_distinct, a_order_by, a_drop_end, a_filter_df, a_filter_columns, a_filter_not_df, a_filter_not_columns, a_where) • join_df, group_by • filter_by_where, filter_by_df, filter_by_not_df, filter_by_threshold, filter_duplicates Example: new_df = change_df(df, a_add={"new_col": "case when col > 0.0 then 1 else 0 end"}) instead of new_df = df.withColumn("new_col", F.when(df["col"] > 0.0, 1).otherwise(0))
  • 57. 57 CREATEDATAFRAME MONKEY PATCH Problem: the sqlContext.createDataFrame() function doesn’t have a numSlices parameter (which is present in sc.parallelize() and defines the number of partitions). This is true up to 2.3.1. Why it is important: to control the number of partitions when converting a Pandas dataframe into a Spark dataframe. Solution: patch three functions (the code is in sparkdf.py): • SparkSession._createFromLocal = _createFromLocalMonkeyPatch • SparkSession.createDataFrame = createDataFrameMonkeyPatch_session • SQLContext.createDataFrame = createDataFrameMonkeyPatch_sqlcontext In all of the overrides, add numSlices and pass it through to sc.parallelize().
  • 58. 58 NOTEBOOK KERNEL PARAMETERS Marked are those parameters which we strongly advise applying for a standalone Spark cluster: { "display_name": "pyspark cluster - ibobak - 3e 3c", "language": "python", "argv": ["/opt/conda/envs/py27/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"], "env": { "SPARK_HOME":"/opt/spark", "PYTHONPATH":"/opt/spark/python/lib/py4j-0.10.4-src.zip:/opt/spark/python", "PYTHONSTARTUP":"/opt/spark/python/pyspark/shell.py", "PYSPARK_SUBMIT_ARGS":" --packages com.databricks:spark-avro_2.11:3.2.0 --driver-memory 5G --executor-memory 10G --num-executors 3 --executor-cores 3 --total-executor-cores 9 --master spark://10.4.12.36:7077 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=1024m --conf spark.driver.extraJavaOptions="-Xss16m" --conf spark.executor.extraJavaOptions="-Xss16m" --conf spark.cassandra.output.consistency.level=ALL --conf spark.cassandra.input.consistency.level=ALL pyspark-shell" } } All three params num-executors, executor-cores and total-executor-cores must be specified (otherwise the number of cores will be unpredictable). The serialization parameters are strongly advised to speed up dataframe caching. -Xss16m is advised to avoid StackOverflowError. The Cassandra consistency parameters are needed when you write to Cassandra and don’t want records to be lost after saving and re-loading them.
  • 60. 60 JENKINS JOBS Jenkins jobs to run different parts of the flow: • ml_build_scoreapi: builds the price prediction web service (Java), runs the unit tests, does SonarQube analysis, uploads the jar to the artifactory. • ml_config_scoreapi: takes the new config files from Git, puts them on the server (dev/qa/prod) and restarts the service. • ml_deploy_scoreapi: takes the jar file from the artifactory, puts it on the server (dev/qa/prod) and restarts the web service. • ml_deploy_training: takes the notebooks from Git and puts them into the working folder of the training server. • ml_run_training: runs these steps on the training server: • Data extraction into parquet files • Offline running of the notebooks (using nbrun.py) • Integration tests • Checking ML metrics • Uploading the training results to S3 • Issuing a /manage/syncswitch HTTP GET request on the working instance of the scoring web service.
  • 61. 61 CONTACTS Ihor Bobak E-mail: Ihor_Bobak@epam.com Skype: ibobak Linkedin: https://www.linkedin.com/in/ibobak

Editor's Notes

  1. The customer selected the regression model based on business demands and a non-risk approach: the maintenance order should be manually reviewed anyway due to checking against policies (e.g. “does the client have the right to wash the car?”).
  2. XGBoost actually replaced Spark’s GBT and RF after we discovered that on 5 million rows it is much better to train locally on multiple cores instead of doing this with Spark.
  3. We looked at MLeap in December, but faced a set of problems: no XGBoost support at that time. At the scoring part we need to transform just a couple of rows: much of the transformation logic either vanishes or changes, so no “copy paste” of the training notebook’s code is possible (actually, that wouldn’t happen anyway: the notebook has PySpark code, while we need Java).
  4. There are two modules (jars): price-prediction-service.jar (the library) does all the price prediction work; price-prediction-web.jar (the web service) wraps the library into a Spring Boot microservice and provides a management interface, ML data version switching/syncing, payload shipments, VIN decoder testing, etc.
  5. After all the dumps have finished, it creates two zip files: mldata_YYYYMMDD_hhmmss.zip and mldata_test_YYYYMMDD_hhmmss.zip (the extended one, for integration tests).
  6. FeatureRecord: features (CheckedHashMap<String, Object>) – for holding feature names and values; stopInfo (Map<String, String>) – for marking the record as “stopped”; updateInfo (Map<String, String>) – for outputting information that “the prediction was made for an updated value of some feature”; predictionInfo (Map<String, String>) – for outputting the price prediction (potentially we can put several predictions there). All transformers should work like this: if the record is stopped, return it immediately without any processing; if the record is “alive”, create a copy of it, apply the transformation, then return the transformed record.
  7. We considered many implementations: MySQL, SQLite, Memcached, but settled on this simple one for a single reason: in our case it fits into 2 GB of RAM. If it didn’t, we would do something more complicated.
  8. ATA = American Trucking Association Do not use XGBoost4J: slow, buggy, platform dependent (uses JNI)
  9. It is often a temptation to work in the training notebook with a dataset which already contains some of the features, like “make”, “model”, etc. The rule is this: if the scoring web service doesn’t get this feature from the client, then your notebook MUST have a part of the ETL which takes this feature from some lookup, some external source, or anything else.
  10. Such explicit control over feature management allowed us to avoid errors, especially because they were easily detected during integration tests. The concept is as follows: if you did not put anything in as a feature, then please DO NOT TRY TO GET IT. But Java’s HashMap, TreeMap and others simply return “null” if you do map.get(“Feature_X”) while Feature_X simply doesn’t exist because no one ever put it there.
  11. Often in testing code we see just “assertEquals(expected, actual)”. OK, it will assert. But how much time will you spend searching for the reason? In our case, we put the features into a TreeMap and printed them out to the logs, so that after running all the tests we got detailed output and could fix the problem in 10 minutes.
  12. The freeware comparison tool is called WinMerge.
  13. The second picture is intentionally reduced in height to show how much less memory it started to use.
  14. There is much more work to do than you thought at the beginning.
  15. Orange blocks are those pieces of functionality that are helper stuff: we could live without them, but life with them becomes easier.
  16. The prediction method is called “safePredict()” instead of just “predict()”. That means that we do not raise any exception from there: the error information will be encapsulated in the response. This pattern allows the controller/endpoint not to care about ML stuff: it is not their business what exception may happen inside the price prediction service. The PredictionResponse has getErrorDescriptionForServerLog(), which will be logged for further analysis on the server side, but this won’t go out of the service to the external world.
  17. This is a sample of a prediction request and response in JSON format. As you may see, not every order line contains a prediction: some of them may contain a “stop” message, others may contain an additional “update” message which notifies that “we’ve made the prediction for this item forcibly setting quantity = 1”, so no matter what value was there – 1 or not 1 – we predicted as if it was “1”.
  18. Serialization/deserialization into XML and JSON is done using the same classes – PredictionRequest/PredictionResponse. Fields are annotated both with JAXB and Jackson annotations.
  19. Swagger is good enough but NOT the best tool for documenting. A huge problem is properties of type Map<anything, SomeClass>: it cannot automatically generate good examples in the UI that would allow testing how the service works. To enable this Swagger functionality, having a @Configuration class RestMvcConf extends WebMvcConfigurationSupport, we had to @Override protected void configureDefaultServletHandling(DefaultServletHandlerConfigurer configurer) { configurer.enable(); } otherwise the default request handler for /** was not enabled.
  20. A simple approach which we use everywhere is to return Map<String, Object> from the /info methods: it converts to JSON automatically and we don’t have to care about types.
  21. SonarQube (formerly Sonar)[1] is an open source platform developed by SonarSource for continuous inspection of code quality to perform automatic reviews with static analysis of code to detect bugs, code smells and security vulnerabilities on 20+ programming languages. SonarQube offers reports on duplicated code, coding standards, unit tests, code coverage, code complexity, comments, bugs, and security vulnerabilities.[2][3]
  22. Why do we need interrupts: when you catch the InterruptedException and swallow it, you essentially prevent any higher level methods/thread groups from noticing the interrupt, which may cause problems. By calling Thread.currentThread().interrupt(), you set the interrupt flag of the thread, so higher level interrupt handlers will notice it and can handle it appropriately.
  23. The override of the print function is needed so that everything the notebook’s cells print goes both to the notebook and to the stdout of nbrun.py.
  24. env_dev.py, env_prod.py, etc. are standard Python files containing variable assignments like VIN_DECODER_ENABLED=True, VIN_DECODER_TIMEOUT_SECONDS=60, VIN_DECODER_URL="https://servername:port/vinData", etc.