There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and for analytic models and workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
1. Crossing the Analytic Chasm and Getting the Models You Develop Deployed
Robert L. Grossman
University of Chicago and
Analytic Strategy Partners LLC
August 20, 2018
CMI Workshop, KDD 2018, London
Why it is Important to Understand the Differences Between Deploying Analytic Models and Developing Analytic Models
2. 1. Overview of Developing vs Deploying Analytic Models*
*This section is adapted from: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
3. The Analytic Diamond*
[Diagram: the analytic diamond, whose facets are analytic strategy, governance, security & compliance; analytic modeling; analytic operations; and analytic infrastructure.]
*Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
4. The Analytic Chasm
There are platforms and tools for managing and processing big data (e.g. Hadoop and Spark) and for building analytic models (e.g. R and SAS), but fewer options for deploying analytics into operations or for embedding analytics into products and services.
[Diagram: data scientists developing analytic models & algorithms on one side of the chasm; enterprise IT deploying analytics into products, services and operations on the other; analytic infrastructure beneath.]
5. [Figure: effort over time for an analytics project, in three phases. Get the data: set up the infrastructure, put in place the compliance and security, etc. Build a model: analyzing & modeling the data. Deploy the model: deploying the solution with the model in a manner that has an impact on the organization.]
6. The Five Main Approaches (E3RW)
1. Embed analytics in databases
2. Export models and deploy them by importing into scoring engines
3. Encapsulate models using containers (and virtual machines)
4. Read a table of parameters
5. Wrap algorithm code or an analytic system (and perhaps create a service)
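Approach 4, reading a table of parameters, is the lightest-weight of the five: the deployed code hard-wires the model form and refreshes only its coefficients. A minimal sketch in Python, assuming a hypothetical CSV parameter table and a fixed logistic-regression model form (the feature names and coefficients are illustrative, not from the talk):

```python
# Sketch of approach 4, "read a table of parameters": the deployed code
# fixes the model form (here a logistic regression) and reads only the
# coefficients from a parameter table, so models can be updated without
# pushing new code.
import csv, io, math

PARAM_TABLE = """feature,coefficient
intercept,-1.5
age,0.04
balance,0.0002
"""

def load_coefficients(text):
    """Parse a feature -> coefficient table from CSV text."""
    return {row["feature"]: float(row["coefficient"])
            for row in csv.DictReader(io.StringIO(text))}

def score(record, coef):
    """Apply the fixed logistic-regression form to one record."""
    z = coef["intercept"] + sum(coef[k] * v for k, v in record.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

coef = load_coefficients(PARAM_TABLE)
print(score({"age": 35, "balance": 1200.0}, coef))
```

Updating the deployed model then amounts to replacing the parameter table, which is why this approach only works when a single model form with different parameters suffices.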
7. [Decision tree for choosing among the five approaches. The questions: Can you push code to deploy models? Do you have a single model with different parameters? Do you require workflows or custom models? Does a database have the analytic functionality required*? Are there stringent enterprise controls on models? Depending on the answers, the outcomes are: embed the analytics in a database; code the model & read parameters; code the models and encapsulate with containers or VMs; use a PMML analytic engine; use a PFA analytic engine; or use an analytic engine.]
*Assumes that a UDF is pushed to the database; otherwise embedding analytics into databases is on the other side of the tree.
8. 2. Scoring Engines
Typical use cases: regulated environments, healthtech, high-availability applications, applications requiring long-term reproducibility, etc.
9. [Diagram: the life cycle of a model.* ModelDev (data scientists, analytic modeling): select the analytic problem & approach; get and clean the data; exploratory data analysis; build the model in a dev/modeling environment. AnalyticOps (enterprise IT, analytic operations): deploy the model; initial deployment; scale up the deployment; use a champion-challenger methodology to improve the model; retire the model and deploy the improved model. Performance data flows back from analytic operations to analytic modeling.]
*Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
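The champion-challenger step of the life cycle can be sketched in a few lines: a small, fixed share of scoring traffic is routed to the challenger model so its performance can be compared against the champion before the champion is retired. The models, the routing mechanism, and the 10% split below are illustrative, not from the talk:

```python
# Sketch of champion-challenger routing: most requests go to the current
# production model (the champion); a small share goes to the candidate
# replacement (the challenger) so outcomes can be compared in production.
import random

def champion(record):
    return "champion-score"

def challenger(record):
    return "challenger-score"

def route(record, rng, challenger_share=0.10):
    """Send roughly challenger_share of traffic to the challenger."""
    model = challenger if rng.random() < challenger_share else champion
    return model(record)

rng = random.Random(0)  # seeded for a reproducible demonstration
counts = {"champion-score": 0, "challenger-score": 0}
for _ in range(1000):
    counts[route({}, rng)] += 1
print(counts)  # roughly a 900 / 100 split
```

In practice the routing decision would also be logged with the model version, so that the performance data flowing back to the modelers can be attributed to the right model.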
10. Differences Between the Modeling and Deployment Environments
• Typically, modelers use specialized languages such as SAS, SPSS or R.
• Usually, the developers responsible for products and services use languages such as Java, Python, C++, etc.
• This can result in significant delays in moving the model from the modeling environment to the deployment environment.
11. [Diagram: the analytic diamond revisited. A model* producer (analytic models & workflows) exports a model*; a model* consumer — aka a scoring engine or analytic engine — imports it into analytic operations: analytics in products, services, and internal operations, running on the analytic infrastructure. Key questions: how quick are updates of model parameters? of new features? of new pre- & post-processing?]
*Model here also includes analytic workflows.
12. What is a Scoring Engine?
• A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.
• A model interchange format is a format that supports the exporting of a model by one application and the importing of that model by another application.
• Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
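To make the idea of a model interchange format concrete, here is a minimal sketch of a toy linear-regression model in PMML. The field names and coefficients are illustrative, and the sketch omits optional metadata; consult the DMG's PMML specification for the normative schema:

```xml
<!-- Illustrative PMML sketch: a one-feature linear regression, y = 1 + 2x.
     A modeling tool would export a document like this; a scoring engine
     would import it and apply the model to operational data. -->
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the document carries the data dictionary, mining schema and parameters together, the producing and consuming applications need agree only on the format, not on a modeling language.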
13. A Brief History of the DMG
[Timeline: the CCSR and the DMG founded; PMML v0.7, v0.9, v1.0, v1.1, v2.0, v2.1, v3.0, v3.1, v3.2, v4.0, v4.1 and v4.2.1 released; the Portable Format for Analytics (PFA) introduced; PMML v4.3 released; PFA support begins; membership drive.]
14. 3. Case Study: Deploying Analytics Using a Scoring Engine
15. [Cartoon: three colleagues disagree about modeling languages.]
Alice, Data Scientist: "I write all my models in R, why don't you do the same?"
Bob, Data Scientist: "I write all my models in scikit-learn, why don't you do the same?"
Joe, IT: "Would you mind writing all your models in Java?"
16. Deploying Analytic Models: PMML & PFA
[Diagram: a model* producer exports a model*; a model* consumer imports it.]
• PMML is an XML language for describing analytic models.
• PFA is a JSON language for describing analytic models and workflows.
• Arbitrary models and workflows can be expressed in PFA.
• The not-for-profit Data Mining Group (DMG) develops the PMML and PFA standards.
*Model here also includes analytic workflows.
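A PFA document is just JSON: it declares input and output types and an action to execute. The sketch below shows the shape of a tiny PFA document (the classic "add 100" example from the PFA literature) together with a toy interpreter for it. The interpreter handles only a single "+" expression and is in no way a real scoring engine, which would implement the full PFA specification:

```python
# Sketch of the scoring-engine idea for PFA: a model producer exports a
# JSON document; a model consumer (the scoring engine) imports and
# executes it. This toy interpreter evaluates only one "+" expression.
import json

pfa_doc = json.loads("""
{
  "input": "double",
  "output": "double",
  "action": [ {"+": ["input", 100]} ]
}
""")

def score(doc, value):
    """Evaluate the single expression in doc["action"] against one input."""
    (op, args), = doc["action"][0].items()
    resolved = [value if a == "input" else a for a in args]
    if op == "+":
        return sum(resolved)
    raise NotImplementedError(op)

print(score(pfa_doc, 3.0))  # -> 103.0
```

The point of the exercise: once an application embeds an engine that can interpret such documents, updating the model is as quick as shipping a new JSON file.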
17. How a startup used PFA-compliant scoring engines:
• A 20+ person data science group develops models in R, Python, scikit-learn and MATLAB.
• All the data scientists export their models in the Portable Format for Analytics (PFA).
• The company's product imports models in PFA and runs them on their customers' data as required.
[Diagram: the company's data scientists build models and export PFA; the company's services embed an analytic engine that can interpret PFA, turning widget records into widget scores.]
18. 4. Case Study: Scaling Bioinformatics Pipelines for the Genomic Data Commons by Encapsulating Analytics in Docker Containers
19. NCI Genomic Data Commons*
• The GDC was launched in 2016 with over 4 PB of data.
• Used by 1,500-3,000+ users per day and over 100,000 researchers each year.
• Based upon an open source software stack that can be used to build other data commons.
*Source: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
20. AnalyticOps for the Genomic Data Commons
• TCGA dataset: 1.54 PB, consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites; 2.5+ PB of cancer genomics data in all.
• Bionimbus data commons technology running multiple community-developed variant calling pipelines: over 12,000 cores and 10 PB of raw storage in 18+ racks, running for months.
21. GDC Pipelines Are Complex and Are Mostly Written by Others
[Figure: an example of one of the pipelines run by the GDC.]
Source: Center for Data Intensive Science, University of Chicago.
22. Computations for a Single Genome Can Take Over a Week
Source: Center for Data Intensive Science, University of Chicago.
23. DevOps
• Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code").
• The requirement to reduce the time to deploy code created tools for continuous integration and testing.
24. ModelDev / AnalyticOps
• Use virtualization/containers, infrastructure automation and scale-out to support large-scale analytics.
• Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
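Encapsulation with containers (approach 3) can be sketched as a minimal Dockerfile. The scoring script, model file and pinned dependencies below are hypothetical; the point is that the container fixes the language runtime, libraries and model artifact together, so the same image runs identically in development and production:

```dockerfile
# Illustrative sketch of approach 3: encapsulate a model and everything
# it needs (runtime, libraries, model artifact) in one container image.
FROM python:3.10-slim
WORKDIR /app
# Pin the model's dependencies for long-term reproducibility.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Ship the scoring code and the serialized model together.
COPY score.py model.pkl ./
ENTRYPOINT ["python", "score.py"]
```

A container-orchestration system such as Kubernetes can then schedule many copies of such an image, which is what makes this approach attractive for scale-out.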
25. GDC Pipeline Automation System (GPAS)
• Bioinformatics pipelines are written using the Common Workflow Language (CWL).
• CWL uses DAGs to describe workflows, with each node a program.
• We developed a pipeline automation system (GPAS) to execute CWL pipelines within the GDC.
• We used Docker containers and Kubernetes to automate software deployment and simplify scale-out.
• Our main work was the development of the pipelines and automating data submission, QC, exception handling and monitoring.
Source: Center for Data Intensive Science, University of Chicago.
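A CWL pipeline node is a small YAML document describing one program in the DAG. The sketch below is a hypothetical single step (a `samtools flagstat` call), not an actual GDC pipeline; systems like GPAS chain such steps into full workflows:

```yaml
# Illustrative CWL sketch: one CommandLineTool node of a workflow DAG.
# A workflow engine wires the outputs of nodes like this into the
# inputs of the next node.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, flagstat]
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
outputs:
  stats:
    type: stdout
stdout: flagstat.txt
```

Because each node declares its command, inputs and outputs explicitly, the same pipeline can be executed on a laptop or scaled out across a cluster under Kubernetes.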
26. Ten Factors Affecting AnalyticOps
• Model quality (confusion matrix)
• Data quality (six dimensions)
• Lack of ground truth
• Software errors
• Sufficient monitoring of workflows
• Scheduling inefficiencies
• The ability to accurately predict problem jobs
• Bottlenecks, stragglers, hot spots, etc.
• Analytic configuration problems
• System failures
• Human errors
27. New Effort: Portable Format for Biomedical Data (PFB)
• Based upon our experience with the GDC and data commons for other biomedical applications, we are developing a portable format so we can version and serialize biomedical data.
• This includes the serialization of the data dictionary, pointers to third-party ontologies, the data model, and all of the data, except for "large objects," such as BAM files and image files.
• We track "large files" as digital objects with immutable GUIDs.
• In practice, the large objects are often 1000x larger than the rest of the data.
• Talk to us if you would like to get involved.
29. E3RW Recap
Approaches (E3RW):
1. Embed analytics in databases
2. Export models and deploy them by importing into scoring engines
3. Encapsulate models using containers (and virtual machines)
4. Read a table of parameters
5. Wrap algorithm code or an analytic system (and perhaps create a service)
Techniques:
• Use languages for analytics, such as PMML and PFA, and analytic engines
• Use languages for workflows, such as CWL, and workflow engines
• Use containers and container-orchestration systems, such as Docker and Kubernetes, for automating software deployment and scale-out
30. Five Best Practices When Deploying Models
1. Mature analytic organizations have an environment to automate the testing and deployment of analytic models.
2. Don't think just about deploying analytic models; make sure that you have a process for deploying analytic workflows.
3. Focus not just on reducing Type 1 and Type 2 errors, but also on data input errors, data quality errors, software errors, systems errors and human errors. People only remember that the model didn't work, not whose fault it was.
4. Track the value obtained by the deployed analytic model, even if it is not your explicit responsibility.
5. It is often easier to increase the value of a deployed model by improving the pre- and post-processing than by chasing smaller improvements in the lift curve.
31. Five Common Mistakes When Deploying Models
1. Not understanding all the subtle differences between the data supplied to train the model and the actual run-time data the model sees.
2. Thinking that the features are fixed and that all you will need to do is update the parameters.
3. Thinking the model is done and not realizing how much work is required to keep all the pre- and post-processing up to date.
4. Not checking in production to see if the inputs to the model drift slowly over time.
5. Not checking that the model will keep running despite missing values, garbage values, etc. (even for values that should never be missing in the first place).
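Mistake 4 above is cheap to guard against. A minimal sketch of a production input-drift check: compare a feature's mean in recent scoring traffic against the baseline recorded at training time. The baseline values, the feature and the three-sigma threshold are illustrative:

```python
# Sketch of a production input-drift check: flag a feature whose recent
# mean has moved far from its training-time mean, relative to the
# standard error of the recent sample.
import statistics

TRAIN_MEAN, TRAIN_STDEV = 50.0, 10.0   # recorded when the model shipped

def drifted(recent_values, n_sigma=3.0):
    """True if the recent mean is implausibly far from the training mean."""
    m = statistics.mean(recent_values)
    stderr = TRAIN_STDEV / len(recent_values) ** 0.5
    return abs(m - TRAIN_MEAN) > n_sigma * stderr

print(drifted([49.0, 52.0, 48.5, 51.0]))  # stable traffic
print(drifted([80.0, 82.0, 79.5, 81.0]))  # shifted upstream feed
```

A real deployment would run such checks per feature on a schedule and alert rather than print, but even this simple comparison catches the slow upstream shifts that silently degrade deployed models.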
32. Summary
1. Deploying analytic models is a core technical competency.
2. A discipline of AnalyticOps is emerging, defining best practices for running analytics at scale.
3. Building an analytic model is just the first step in its life cycle, which includes deployment, integration into a value chain, improvement, and replacement.
4. The Portable Format for Analytics (PFA) is a model interchange format for building analytic models and workflows in one environment and deploying them in another. Analytic engines can be used to execute PFA.
5. Analytic containers are a good way of encapsulating everything needed to deploy analytic models and analytic workflows into production.