Crossing the Analytic Chasm and Getting
the Models You Develop Deployed
Robert L. Grossman
University of Chicago and
Analytic Strategy Partners LLC
August 20, 2018
CMI Workshop, KDD 2018, London
Why it is Important to Understand the Differences Between
Deploying Analytic Models and Developing Analytic Models.
1. Overview of Developing vs Deploying Analytic Models*
*This section is adapted from: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
The Analytic Diamond*
[Figure: a diamond with four corners: analytic strategy, governance, security & compliance; analytic modeling; analytic operations; analytic infrastructure.]
*Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
The Analytic Chasm
There are platforms and tools for managing and processing big data
(e.g. Hadoop and Spark) and for building analytic models (e.g. R and
SAS), but fewer options for deploying analytics into operations or for
embedding analytics into products and services.
[Figure: the analytic chasm. Labels: data scientists developing analytic models & algorithms; analytic infrastructure; enterprise IT deploying analytics into products, services and operations; deploying analytics.]
[Figure: effort versus time across three phases.
Get the data: get the data, set up the infrastructure, put in place the compliance and security, etc.
Build a model: analyzing & modeling the data.
Deploy the model: deploying the solution with the model in a manner that has an impact on the organization.]
The Five Main Approaches (E3RW)
1. Embed analytics in databases
2. Export models and deploy them by importing into
Scoring Engines
3. Encapsulate models using containers (and virtual
machines)
4. Read a table of parameters
5. Wrap algo code or analytic system (and perhaps
create a service)
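Approach 4 (read a table of parameters) can be sketched as follows. This is a minimal illustration, not the author's implementation: the model form (a logistic model), the column names, and the coefficient values are all hypothetical. The point is that the scoring code is written once and a model update only changes the table.

```python
import csv
import io
import math

# Hypothetical parameter table: one row per segment, same model form throughout.
PARAMETER_TABLE = """segment,intercept,w_amount,w_age
retail,-1.0,0.002,0.01
wholesale,-2.0,0.001,0.03
"""

def load_parameters(table_text):
    """Parse the CSV table into {segment: {parameter name: float}}."""
    reader = csv.DictReader(io.StringIO(table_text))
    return {
        row["segment"]: {k: float(v) for k, v in row.items() if k != "segment"}
        for row in reader
    }

def score(params, amount, age):
    """Logistic model coded once; only the parameters vary between segments."""
    z = params["intercept"] + params["w_amount"] * amount + params["w_age"] * age
    return 1.0 / (1.0 + math.exp(-z))

parameters = load_parameters(PARAMETER_TABLE)
retail_score = score(parameters["retail"], amount=500.0, age=0.0)
print(round(retail_score, 3))
```

Updating the deployed model then means replacing the table, which works only as long as the features and the model form stay fixed.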
Approaches (E3RW)
[Figure: decision tree for choosing among the approaches; each question has a No branch and a Yes branch.
Questions: Can you push code to deploy models? Do you have a single model with different parameters? Do you require workflows or custom models? Does a database have the analytic functionality required*? Are there stringent enterprise controls on models?
Outcomes: Embed the analytics in a database. Code the models and encapsulate with containers or VMs. Use a PFA Analytic Engine. Use a PMML Analytic Engine. Use an Analytic Engine. Code the model & read parameters.]
*Assumes that a UDF is pushed to the database, otherwise
embedding analytics into databases is on the other side of the tree.
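As a rough sketch of what pushing a UDF to a database can look like, the example below registers a Python scoring function with an in-memory SQLite database through the standard library `sqlite3` module. SQLite is an assumption chosen for illustration, and the table, column names, and coefficients are hypothetical.

```python
import math
import sqlite3

def churn_score(tenure_months, monthly_spend):
    """Hypothetical logistic model with illustrative, hard-coded coefficients."""
    z = 0.5 - 0.1 * tenure_months + 0.02 * monthly_spend
    return 1.0 / (1.0 + math.exp(-z))

conn = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL function (the UDF).
conn.create_function("churn_score", 2, churn_score)

conn.execute(
    "CREATE TABLE customers (id INTEGER, tenure_months REAL, monthly_spend REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 5.0, 0.0), (2, 36.0, 40.0)],
)

# Scoring now happens inside the query, next to the data.
rows = conn.execute(
    "SELECT id, churn_score(tenure_months, monthly_spend) "
    "FROM customers ORDER BY id"
).fetchall()
for customer_id, score in rows:
    print(customer_id, round(score, 3))
```

A production database would use its own UDF mechanism; the pattern of registering the model once and calling it from SQL is what carries over.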
2. Scoring Engines
Typical use cases: regulated environments, healthtech,
high availability applications, applications requiring long
term reproducibility, etc.
Life cycle of a model
[Figure: model life cycle spanning analytic modeling (ModelDev, data scientists) and analytic operations (AnalyticOps, enterprise IT), with "Perf. data" flowing between them. Boxes shown: select analytic problem & approach; get and clean the data; exploratory data analysis; build model in dev/modeling environment; deploy model; initial deployment; scale up deployment; use champion-challenger methodology to improve model; retire model and deploy improved model.]
*Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
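The champion-challenger step in the life cycle can be sketched as below. This is one simple reading of the methodology, assuming both models are scored on the same performance data and the challenger replaces the champion only if it wins by a margin; the models, the data, and the `min_gain` threshold are hypothetical.

```python
def accuracy(model, records):
    """Fraction of (feature, label) records the model classifies correctly."""
    correct = sum(1 for feature, label in records if model(feature) == label)
    return correct / len(records)

def pick_champion(current, candidate, records, min_gain=0.01):
    """Keep the current champion unless the candidate beats it by min_gain."""
    current_acc = accuracy(current, records)
    candidate_acc = accuracy(candidate, records)
    if candidate_acc - current_acc >= min_gain:
        return "challenger", candidate_acc
    return "champion", current_acc

# Hypothetical models: simple threshold rules on a single feature.
def champion_model(x):
    return int(x > 0.5)

def challenger_model(x):
    return int(x > 0.3)

# Hypothetical performance data from operations: (feature, observed label).
performance_data = [(0.2, 0), (0.4, 1), (0.6, 1), (0.8, 1), (0.1, 0), (0.35, 1)]

winner, winner_accuracy = pick_champion(
    champion_model, challenger_model, performance_data
)
print(winner, round(winner_accuracy, 3))
```

In practice the comparison metric would be whatever measures business value for the deployed model, not necessarily accuracy.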
Deployment
Differences Between the Modeling and
Deployment Environments
• Typically modelers use specialized languages such as SAS, SPSS or R.
• Usually, developers responsible for products and services use languages
such as Java, Python, C++, etc.
• This can result in significant delays moving the model from the modeling
environment to the deployment environment.
Deploying models & workflows
[Figure: the Analytic Diamond. On the analytic models & workflows side, a model* producer exports a model*. On the analytic operations side, a model* consumer (aka scoring engine or analytic engine) imports the model* and delivers analytics in products, services, and internal operations. Analytic infrastructure sits beneath both.]
*Model here also includes analytic workflows.
How quick are updates of:
• Model parameters?
• New features?
• New pre- & post-processing?
What is a Scoring Engine?
• A scoring engine is a component that is integrated into
products or enterprise IT that deploys analytic models in
operational workflows for products and services.
• A Model Interchange Format is a format that supports the
exporting of a model by one application and the
importing of a model by another application.
• Model Interchange Formats include the Predictive Model
Markup Language (PMML), the Portable Format for
Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow
applications to update models as quickly as reading a
model interchange format file.
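The idea that a scoring engine is integrated once and then updated by reading a file can be sketched with a toy engine. The JSON layout below is a made-up in-house format for linear models, used only for illustration; it is neither PMML nor PFA, and the class and field names are hypothetical.

```python
import json

class ScoringEngine:
    """Toy engine for a made-up JSON format describing linear models."""

    def __init__(self):
        self._model = None

    def load(self, model_text):
        """Import a model; called whenever a new model file arrives."""
        model = json.loads(model_text)
        if model.get("type") != "linear":
            raise ValueError("unsupported model type: %r" % model.get("type"))
        self._model = model

    def score(self, record):
        """Score one record (a dict mapping feature name to number)."""
        if self._model is None:
            raise RuntimeError("no model loaded")
        total = self._model["intercept"]
        for name, weight in self._model["weights"].items():
            total += weight * record[name]
        return total

# Two versions of a model, as a producer might export them (illustrative values).
MODEL_V1 = '{"type": "linear", "intercept": 1.0, "weights": {"x": 2.0}}'
MODEL_V2 = '{"type": "linear", "intercept": 0.0, "weights": {"x": 3.0, "y": 1.0}}'

engine = ScoringEngine()
engine.load(MODEL_V1)
score_v1 = engine.score({"x": 2.0, "y": 5.0})  # 1.0 + 2.0 * 2.0
engine.load(MODEL_V2)  # model update: new file, no code change
score_v2 = engine.score({"x": 2.0, "y": 5.0})  # 0.0 + 3.0 * 2.0 + 1.0 * 5.0
print(score_v1, score_v2)
```

Note that the second model adds a feature without any change to the engine, which is the property the update questions on the earlier slide are probing.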
A Brief History of the DMG
[Timeline figure; dates are not present in the extracted text. Milestones shown: DMG founded; CCSR founded; PMML v0.7, v0.9, v1.0, v1.1, v2.0, v2.1, v3.0, v3.1, v3.2, v4.0, v4.1, v4.2.1 and v4.3 released; Portable Format for Analytics (PFA) introduced; "support begins"; membership drive.]
3. Case Study: Deploying Analytics Using a
Scoring Engine
Joe, IT: Would you mind writing all your models in Java?
Alice, Data Scientist: I write all my models in R, why don’t you do the same?
Bob, Data Scientist: I write all my models in scikit-learn, why don’t you do the same?
Deploying analytic models
[Figure: a model* producer exports a model*; a model* consumer imports it.]
PMML & PFA
• PMML is an XML
language for describing
analytic models
• PFA is a JSON language
for describing analytic
models and workflows
• Arbitrary models and
workflows can be
expressed in PFA.
The Not-For-Profit Data
Mining Group (DMG)
develops the PMML and
PFA standards
*Model here also includes analytic workflows.
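To make "PFA is a JSON language" concrete, the sketch below builds a very small PFA-style document in Python and serializes it. The structure (top-level `input`, `output`, and `action` fields, with `+` as a function call) follows my understanding of the PFA specification and should be checked against the DMG documentation before use. Executing the document requires a PFA analytic engine, which is not shown here.

```python
import json

# Minimal PFA-style document: take a double as input, add 100, emit a double.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [
        {"+": ["input", 100]}
    ],
}

# The serialized text is what a model producer would export
# and a model consumer (analytic engine) would import.
pfa_text = json.dumps(pfa_document, indent=2)
print(pfa_text)
```

Because the whole model is data, it can be versioned, diffed, and reviewed like any other JSON file.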
How a startup used PFA-compliant scoring engines:
• 20+ person data science group developing models in R, Python,
Scikit-learn and MATLAB.
• All the data scientists export their models in the Portable
Format for Analytics (PFA).
• The company’s product imports models in PFA and runs them on
its customers’ data as required.
[Figure: the company’s data scientists build models and export PFA; the company’s services embed an analytic engine that can interpret PFA, import the models, and turn widget records into widget scores.]
4. Case Study: Scaling Bioinformatics Pipelines
for the Genomic Data Commons by
Encapsulating Analytics in Docker Containers
NCI Genomic Data Commons*
• The GDC was
launched in 2016 with
over 4 PB of data.
• Used by 1,500-3,000+
users per day and over
100,000 researchers
each year.
• Based upon an open
source software stack
that can be used to
build other data
commons.
*Source: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer
genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
AnalyticOps for the Genomic Data Commons
2.5+ PB of cancer genomics data + Bionimbus data commons technology running multiple community-developed variant calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months.
TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.
GDC Pipelines Are Complex
and are Mostly Written by Others
Source: Center for Data Intensive Science, University of Chicago.
This is an example of one of the pipelines run by the GDC.
Computations for a Single
Genome Can Take Over a Week
Source: Center for Data Intensive Science, University of Chicago.
Dev Ops
• Virtualization and the requirement for massive scale-out
spawned infrastructure automation (“infrastructure as
code”).
• The requirement to reduce the time to deploy code
led to tools for continuous integration and testing.
ModelDev AnalyticOps
• Use virtualization / containers, infrastructure
automation and scale out to support large scale
analytics.
• Requirement: reduce the time and cost to do high
quality analytics over large amounts of data.
GDC Pipeline Automation System (GPAS)
• Bioinformatics pipelines are written using the Common Workflow
Language (CWL)
• CWL uses DAGs to describe workflows, with each node a program
• We developed a pipeline automation system (GPAS) to execute CWL
pipelines with the GDC
• We used Docker Containers and Kubernetes for automating the
software deployment and simplifying the scale out
• Our main work was developing the pipelines and automating the
processing of submitted data, QC, exception handling, and
monitoring.
Source: Center for Data Intensive Science, University of Chicago.
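The statement that a workflow is a DAG with a program at each node can be sketched with the standard library `graphlib` module (Python 3.9+). This is not CWL and not GPAS; the step names and the `run_step` stand-in are hypothetical, and a real system would launch a container for each node.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a step, each value the steps it depends on.
pipeline = {
    "align_reads": set(),
    "sort_alignments": {"align_reads"},
    "call_variants": {"sort_alignments"},
    "quality_metrics": {"sort_alignments"},
    "annotate_variants": {"call_variants"},
}

def run_step(name):
    """Stand-in for launching the program (for example, a container) at a node."""
    return "ran " + name

# static_order() yields each step only after all of its dependencies.
execution_order = list(TopologicalSorter(pipeline).static_order())
log = [run_step(step) for step in execution_order]
print(execution_order)
```

Steps with no dependency between them, such as `call_variants` and `quality_metrics` here, are the ones an orchestrator can schedule in parallel when scaling out.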
Ten Factors Affecting AnalyticOps
• Model quality (confusion matrix)
• Data quality (six dimensions)
• Lack of ground truth
• Software errors
• Sufficient monitoring of workflows
• Scheduling inefficiencies
• The ability to accurately predict problem jobs
• Bottlenecks, stragglers, hot spots, etc.
• Analytic configuration problems
• System failures
• Human errors
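The first factor, model quality measured with a confusion matrix, can be sketched as follows for a binary classifier. The labels are made-up illustrative data, and precision and recall are shown only as two common summaries derived from the matrix.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for binary labels 0 and 1."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # actual 1, predicted 1
        "fn": counts[(1, 0)],  # actual 1, predicted 0
        "fp": counts[(0, 1)],  # actual 0, predicted 1
        "tn": counts[(0, 0)],  # actual 0, predicted 0
    }

# Hypothetical ground truth and model output for eight records.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, precision, recall)
```

Computing this in production depends on ground truth arriving at all, which is why lack of ground truth appears as a separate factor in the list.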
New Effort: Portable Format for Biomedical Data (PFB)
• Based upon our experience with the GDC and data commons for
other biomedical applications, we are developing a portable format
so we can version and serialize biomedical data.
• This includes the serialization of the data dictionary, pointers to third-party
ontologies, the data model, and all of the data, except for “large
objects,” such as BAM files and image files.
• We track “large files” as digital objects with immutable GUIDs.
• In practice, the large objects are often 1000x larger than the rest of
the data.
• Talk to us if you would like to get involved.
5. Summary
E3RW Recap
Approaches (E3RW)
1. Embed analytics in databases
2. Export models and deploy them by
importing into Scoring Engines
3. Encapsulate models using
containers (and virtual machines)
4. Read a table of parameters
5. Wrap algo code or analytic system
(and perhaps create a service)
Techniques
• Use languages for analytics, such as
PMML and PFA & analytic engines
• Use languages for workflows, such
as CWL & workflow engines
• Use containers and container-orchestration
systems for automating software deployment
and scale out, such as Docker &
Kubernetes
Five Best Practices When Deploying Models
1. Mature analytic organizations have an environment to
automate testing and deployment of analytic models.
2. Don’t think just about deploying analytic models, but
make sure that you have a process for deploying analytic
workflows.
3. Focus not just on reducing Type 1 and Type 2 errors, but
also data input errors, data quality errors, software errors,
systems errors and human errors. People only remember
that the model didn’t work, not whose fault it was.
4. Track value obtained by the deployed analytic model, even
if it is not your explicit responsibility.
5. It is often easier to increase the value of a deployed model
by improving the pre- and post-processing than by chasing
smaller improvements in the lift curve.
Five Common Mistakes When Deploying Models
1. Not understanding all the subtle differences between the
supplied data used to train the model and the actual
run time data the model sees.
2. Thinking that the features are fixed and all that you will need
to do is update the parameters.
3. Thinking the model is done and not realizing how much work
is required to keep all the pre- and post-processing
up to date.
4. Not checking in production to see if the inputs to the models
drift slowly over time.
5. Not checking that the model will keep running despite
missing values, garbage values, etc. (even values that should
never be missing in the first place).
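Mistakes 4 and 5 suggest two production checks, sketched below for a single numeric input. This is one simple approach under stated assumptions, not a prescribed method: the data, the helper names, and the alert threshold of three standard errors are hypothetical, and real monitoring would track many features over time windows.

```python
import math
import statistics

def clean_value(raw, default):
    """Return raw as a finite float, or default for missing or garbage input."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    return value if math.isfinite(value) else default

def drift_z_score(baseline, recent):
    """How many standard errors the recent mean sits from the baseline mean."""
    standard_error = statistics.stdev(baseline) / math.sqrt(len(recent))
    return (statistics.mean(recent) - statistics.mean(baseline)) / standard_error

# Hypothetical values of one input feature: training time versus production.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
raw_recent = ["12.1", None, "11.8", "n/a", "12.4", "nan", "12.0"]

# Mistake 5: scoring should survive bad inputs, here by falling back to a default.
safe_input = clean_value("n/a", statistics.mean(baseline))

# Mistake 4: compare valid production inputs against the training baseline.
parsed = [clean_value(v, None) for v in raw_recent]
recent = [v for v in parsed if v is not None]
n_bad = len(raw_recent) - len(recent)
z = drift_z_score(baseline, recent)
drifted = abs(z) > 3.0  # illustrative alert threshold
print(n_bad, round(z, 2), drifted)
```

Counting the bad inputs separately matters because a rising rate of missing or garbage values is itself a form of drift worth alerting on.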
Summary
1. Deploying analytic models is a core technical competency.
2. A discipline of AnalyticOps is emerging that defines best practices for
running analytics at scale.
3. Building an analytic model is just the first step in its life cycle,
which includes deployment, integration into a value chain,
improvement, and replacement.
4. The Portable Format for Analytics (PFA) is a model interchange
format for building analytic models and workflows in one
environment and deploying them in another one. Analytic
engines can be used to execute PFA.
5. Analytic containers are a good way of encapsulating everything
needed to deploy analytic models and analytic workflows into
production.
Questions?
rgrossman.com
@bobgrossman

Mais conteúdo relacionado

Mais procurados

Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
 
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)Tom Plasterer
 
FAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeFAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeTom Plasterer
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSMicah Altman
 
A Big Picture in Research Data Management
A Big Picture in Research Data ManagementA Big Picture in Research Data Management
A Big Picture in Research Data ManagementCarole Goble
 
FAIR Data Knowledge Graphs
FAIR Data Knowledge GraphsFAIR Data Knowledge Graphs
FAIR Data Knowledge GraphsTom Plasterer
 
DataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE
 
DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE
 
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Tom Plasterer
 
DataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Geoffrey Fox
 
Linked Data for Biopharma
Linked Data for BiopharmaLinked Data for Biopharma
Linked Data for BiopharmaTom Plasterer
 
DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE
 
BioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative AdvantageBioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative AdvantageTom Plasterer
 
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...datascienceiqss
 
Dataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataDataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataTom Plasterer
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 

Mais procurados (20)

Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
 
FAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeFAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to Practice
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALSBROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
BROWN BAG TALK WITH MICAH ALTMAN INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS
 
A Big Picture in Research Data Management
A Big Picture in Research Data ManagementA Big Picture in Research Data Management
A Big Picture in Research Data Management
 
V3 i35
V3 i35V3 i35
V3 i35
 
FAIR Data Knowledge Graphs
FAIR Data Knowledge GraphsFAIR Data Knowledge Graphs
FAIR Data Knowledge Graphs
 
DataONE Education Module 08: Data Citation
DataONE Education Module 08: Data CitationDataONE Education Module 08: Data Citation
DataONE Education Module 08: Data Citation
 
DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: Metadata
 
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
 
DataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data Sharing
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management Planning
 
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ...
 
Linked Data for Biopharma
Linked Data for BiopharmaLinked Data for Biopharma
Linked Data for Biopharma
 
DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?
 
BioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative AdvantageBioPharma and FAIR Data, a Collaborative Advantage
BioPharma and FAIR Data, a Collaborative Advantage
 
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
 
Dataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* DataDataset Catalogs as a Foundation for FAIR* Data
Dataset Catalogs as a Foundation for FAIR* Data
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 

Semelhante a Getting Models Deployed

AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...DataWorks Summit
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Kun Le
 
Building Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated DiscoveryBuilding Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated Discoveryadamkraut
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Ali Alkan
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Daniel Zivkovic
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionFlorian Wilhelm
 
DevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleDevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleatSistemas
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the tradeFangda Wang
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists CCG
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...Robert Grossman
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowJan Kirenz
 

Semelhante a Getting Models Deployed (20)

AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Msr2021 tutorial-di penta
Msr2021 tutorial-di pentaMsr2021 tutorial-di penta
Msr2021 tutorial-di penta
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Automated Analytics at Scale
Automated Analytics at ScaleAutomated Analytics at Scale
Automated Analytics at Scale
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
 
Building Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated DiscoveryBuilding Data Ecosystems for Accelerated Discovery
Building Data Ecosystems for Accelerated Discovery
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
 
DevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleDevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-Oracle
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
 

Mais de Robert Grossman

AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016Robert Grossman
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Robert Grossman
 
Getting Models Deployed

  • 1. Crossing the Analytic Chasm and Getting the Models You Develop Deployed Robert L. Grossman University of Chicago and Analytic Strategy Partners LLC August 20, 2018 CMI Workshop, KDD 2018, London Why it is Important to Understand the Differences Between Deploying Analytic Models and Developing Analytic Models.
  • 2. 1. Overview of Developing vs Deploying Analytic Models* *This section is adapted from: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
  • 3. The Analytic Diamond* Analytic strategy, governance, security & compliance. Analytic modeling Analytic operations Analytic Infrastructure *Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear.
  • 4. The Analytic Chasm There are platforms and tools for managing and processing big data (e.g. Hadoop and Spark), for building analytic models (e.g. R and SAS), but fewer options for deploying analytics into operations or for embedding analytics into products and services. Data scientists developing analytic models & algorithms Analytic infrastructure Enterprise IT deploying analytics into products, services and operations Deploying analytics
  • 5. Get the data, set up the infrastructure, put in place the compliance and security, etc. Analyzing & modeling the data Deploying the solution with the model in a manner that has an impact on the organization Time Effort Get the data Build a model Deploy the model
  • 6. The Five Main Approaches (E3RW) 1. Embed analytics in databases 2. Export models and deploy them by importing into Scoring Engines 3. Encapsulate models using containers (and virtual machines) 4. Read a table of parameters 5. Wrap algo code or analytic system (and perhaps create a service) Approaches (E3RW)
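As an illustration of approach 4 ("Read a table of parameters"), here is a minimal sketch in which the scoring code is deployed once and only a parameter table changes. The CSV layout, the column names (`segment`, `slope`, `intercept`) and the linear form of the model are assumptions made for the example; they do not come from the slides.

```python
import csv
import io


def load_parameters(table_text):
    """Parse a CSV parameter table into {segment: (slope, intercept)}."""
    reader = csv.DictReader(io.StringIO(table_text))
    return {
        row["segment"]: (float(row["slope"]), float(row["intercept"]))
        for row in reader
    }


def score(params, segment, x):
    """Apply the single, fixed model form using the segment's parameters."""
    slope, intercept = params[segment]
    return slope * x + intercept


# Hypothetical parameter table; in practice it might live in a file or database.
TABLE = "segment,slope,intercept\nretail,2.0,1.0\nwholesale,0.5,0.0\n"
params = load_parameters(TABLE)
retail_score = score(params, "retail", 3.0)
```

Updating the model then means replacing the table, with no code push, which is the trade-off this approach makes: it only works while the form of the model stays fixed.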
  • 7. Can you push code to deploy models? No Yes Do you have a single model with different parameters? No Yes Do you require workflows or custom models? No Yes No Yes Does a database have the analytic functionality required*? Embed the analytics in a database No Yes Are there stringent enterprise controls on models? Code the models and encapsulate with containers or VMs Use a PFA Analytic Engine Use a PMML Analytic Engine Use an Analytic Engine Code the model & read parameters *Assumes that a UDF is pushed to the database, otherwise embedding analytics into databases is on the other side of the tree.
  • 8. 2. Scoring Engines Typical use cases: regulated environments, healthtech, high availability applications, applications requiring long term reproducibility, etc.
  • 9. Exploratory Data Analysis Get and clean the data Build model in dev/modeling environment Initial deployment Use champion-challenger methodology to improve model Analytic modeling Analytic operations Deploy model Retire model and deploy improved model Select analytic problem & approach Scale up deployment ModelDev AnalyticOps Perf. data Data Scientists Enterprise IT Life cycle of a model *Source: Robert L. Grossman, The Strategy and Practice of Analytics, to appear. Deployment
  • 10. Differences Between the Modeling and Deployment Environments • Typically modelers use specialized languages such as SAS, SPSS or R. • Usually, developers responsible for products and services use languages such as Java, Python, C++, etc. • This can result in significant delays moving the model from the modeling environment to the deployment environment.
  • 11. Analytic Diamond Analytic models & workflows Analytic operations Deploying models & workflows Model* Consumer Model* Producer Analytic Infrastructure Analytics in products, services, and internal operations. *Model here also includes analytic workflows. How quick are updates of: • Model parameters? • New features? • New pre- & post- processing? Export model* Import model* aka Scoring Engine or Analytic Engine
  • 12. What is a Scoring Engine? • A scoring engine is a component that is integrated into products or enterprise IT that deploys analytic models in operational workflows for products and services. • A Model Interchange Format is a format that supports the exporting of a model by one application and the importing of a model by another application. • Model Interchange Formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats. • Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
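The "integrated once, updated by reading a file" idea can be sketched as follows. This is a toy engine with a made-up JSON interchange format (the keys `type`, `intercept` and `coefficients` are assumptions for illustration); it is not PMML or PFA and does not use any real scoring-engine API.

```python
import json
import math


def export_model(intercept, coefficients):
    """Producer side: serialize a logistic regression to the toy format."""
    return json.dumps({
        "type": "logistic_regression",
        "intercept": intercept,
        "coefficients": coefficients,
    })


class ScoringEngine:
    """Consumer side: integrated once, models swapped by loading a document."""

    def __init__(self):
        self._model = None

    def load_model(self, text):
        """Import a model; reject model types this engine cannot interpret."""
        spec = json.loads(text)
        if spec.get("type") != "logistic_regression":
            raise ValueError(f"unsupported model type: {spec.get('type')!r}")
        self._model = spec

    def score(self, record):
        """Return the predicted probability for one input record."""
        if self._model is None:
            raise RuntimeError("no model loaded")
        z = self._model["intercept"] + sum(
            coef * record[name]
            for name, coef in self._model["coefficients"].items()
        )
        return 1.0 / (1.0 + math.exp(-z))


engine = ScoringEngine()
engine.load_model(export_model(0.0, {"x": 1.0}))
first = engine.score({"x": 0.0})
# Updating the model is just another load; the integration code is unchanged.
engine.load_model(export_model(2.0, {"x": 1.0}))
second = engine.score({"x": 0.0})
```

The application code that calls `score` never changes when the model does, which is the property the slide attributes to scoring engines.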
  • 13. A Brief History of the DMG Founded PMML v0.7 released CCSR Founded PMML v0.9 released PMML v1.0 released PMML v1.1 released PMML v2.0 released PMML v3.0 released PMML v2.1 released PMML v3.1 released PMML v3.2 released PMML v4.0 released PMML v4.1 released PMML v4.2.1 released Portable Format for Analytics (PFA) Introduced PMML v4.3 released support begins Membership Drive
  • 14. 3. Case Study: Deploying Analytics Using a Scoring Engine
  • 15. Would you mind writing all your models in Java? Alice, Data Scientist Bob, Data Scientist Joe, IT I write all my models in R, why don’t you do the same? I write all my models in scikit-learn, why don’t you do the same?
  • 16. Deploying analytic models Model* Consumer Model* Producer Export model* Import model* PMML & PFA • PMML is an XML language for describing analytic models • PFA is a JSON language for describing analytic models and workflows • Arbitrary models and workflows can be expressed in PFA. The Not-For-Profit Data Mining Group (DMG) develops the PMML and PFA standards *Model here also includes analytic workflows.
  • 17. How a startup used PFA-compliant scoring engines: • 20+ person data science group developing models in R, Python, scikit-learn and MATLAB. • All the data scientists export their models in the Portable Format for Analytics (PFA). • The company’s product imports models in PFA and runs on their customers’ data as required. Export PFA Import PFA Widget records Widget scores Company’s data scientists build models Company’s services embed an analytic engine that can interpret PFA
  • 18. 4. Case Study: Scaling Bioinformatics Pipelines for the Genomic Data Commons by Encapsulating Analytics in Docker Containers
  • 19. NCI Genomic Data Commons* • The GDC was launched in 2016 with over 4 PB of data. • Used by 1,500 to 3,000+ users per day and over 100,000 researchers each year. • Based upon an open source software stack that can be used to build other data commons. *Source: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
  • 20. TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites. 2.5+ PB of cancer genomics data + Bionimbus data commons technology running multiple community developed variant calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months. AnalyticOps for the Genomic Data Commons
  • 21. GDC Pipelines Are Complex and are Mostly Written by Others Source: Center for Data Intensive Science, University of Chicago. This is an example of one of the pipelines run by the GDC.
  • 22. Computations for a Single Genome Can Take Over a Week Source: Center for Data Intensive Science, University of Chicago.
  • 23. Dev Ops • Virtualization and the requirement for massive scale out spawned infrastructure automation (“infrastructure as code”). • The requirement to reduce the time to deploy code created tools for continuous integration and testing.
  • 24. ModelDev AnalyticOps • Use virtualization / containers, infrastructure automation and scale out to support large scale analytics. • Requirement: reduce the time and cost to do high quality analytics over large amounts of data.
  • 25. GDC Pipeline Automation System (GPAS) • Bioinformatics pipelines are written using the Common Workflow Language (CWL) • CWL uses DAGs to describe workflows, with each node a program • We developed a pipeline automation system (GPAS) to execute CWL pipelines with the GDC • We used Docker Containers and Kubernetes for automating the software deployment and simplifying the scale out • Our main work was the development of the pipelines, automating the processing of submitting data, QC, exception handling and monitoring. Source: Center for Data Intensive Science, University of Chicago.
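To make "CWL uses DAGs to describe workflows, with each node a program" concrete, here is a minimal sketch of executing a DAG of steps in dependency order. It is not CWL and not GPAS: the step names, the `run_pipeline` helper and the use of plain Python callables in place of containerized programs are all assumptions for illustration. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run each step after its upstream steps; return {step name: output}.

    steps:        {name: callable taking a dict of upstream outputs}
    dependencies: {name: set of upstream step names}
    """
    # Include every step, even those with no dependencies and no dependents.
    graph = {name: dependencies.get(name, set()) for name in steps}
    results = {}
    # static_order() yields predecessors first and raises CycleError on cycles.
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = steps[name](inputs)
    return results


# Hypothetical four-step pipeline; each lambda stands in for a program.
steps = {
    "align": lambda inputs: "aligned.bam",
    "call_variants": lambda inputs: f"variants from {inputs['align']}",
    "qc": lambda inputs: f"qc report for {inputs['align']}",
    "annotate": lambda inputs: f"annotated {inputs['call_variants']}",
}
dependencies = {
    "call_variants": {"align"},
    "qc": {"align"},
    "annotate": {"call_variants"},
}
results = run_pipeline(steps, dependencies)
```

In a production system each node would launch a container rather than call a function, and independent nodes (here `qc` and `call_variants`) could run in parallel; the sketch runs them sequentially for simplicity.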
  • 26. Ten Factors Affecting AnalyticOps • Model quality (confusion matrix) • Data quality (six dimensions) • Lack of ground truth • Software errors • Sufficient monitoring of workflows • Scheduling inefficiencies • The ability to accurately predict problem jobs • Bottlenecks, stragglers, hot spots, etc. • Analytic configuration problems • System failures • Human errors
  • 27. New Effort: Portable Format for Biomedical Data (PFB) • Based upon our experience with the GDC and data commons for other biomedical applications, we are developing a portable format so we can version and serialize biomedical data. • This includes the serialization of the data dictionary, pointers to third party ontologies, the data model, and all of the data, except for “large objects,” such as BAM files and image files. • We track “large files” as digital objects with immutable GUIDs. • In practice, the large objects are often 1000x larger than the rest of the data. • Talk to us if you would like to get involved.
  • 29. E3RW Recap 1. Embed analytics in databases 2. Export models and deploy them by importing into Scoring Engines 3. Encapsulate models using containers (and virtual machines) 4. Read a table of parameters 5. Wrap algo code or analytic system (and perhaps create a service) Approaches (E3RW) • Use languages for analytics, such as PMML and PFA & analytic engines • Use languages for workflows, such as CWL & workflow engines • Use containers and container-orchestration systems for automating software deployment and scale out, such as Docker & Kubernetes Techniques
  • 30. Five Best Practices When Deploying Models 1. Mature analytic organizations have an environment to automate testing and deployment of analytic models. 2. Don’t think just about deploying analytic models, but make sure that you have a process for deploying analytic workflows. 3. Focus not just on reducing Type 1 and Type 2 errors, but also data input errors, data quality errors, software errors, systems errors and human errors. People only remember that the model didn’t work, not whose fault it was. 4. Track value obtained by the deployed analytic model, even if it is not your explicit responsibility. 5. It is often easier to increase the value of a deployed model by improving the pre- and post-processing than by chasing smaller improvements in the lift curve.
  • 31. Five Common Mistakes When Deploying Models 1. Not understanding all the subtle differences between the supplied run time data used to train the model and the actual run time data the model sees. 2. Thinking that the features are fixed and all that you will need to do is update the parameters. 3. Thinking the model is done and not realizing how much work is required to keep all the pre- and post-processing up to date. 4. Not checking in production to see if the inputs to the models drift slowly over time. 5. Not checking that the model will keep running despite missing values, garbage values, etc. (even values that should never be missing in the first place).
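Mistakes 4 and 5 above can be guarded against with small checks around the model. The sketch below shows one possible shape: an input guard that substitutes a default for missing or garbage values, and a crude drift measure that compares the mean of recent inputs with the training mean. The function names, the default-substitution policy and the choice of a mean-shift statistic are assumptions for illustration, not recommendations from the slides.

```python
import math
from statistics import fmean


def clean_value(raw, default, low, high):
    """Return raw as a float, or default if it is missing, unparsable,
    NaN, infinite, or outside the plausible range [low, high]."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    if math.isnan(value) or not (low <= value <= high):
        return default
    return value


def mean_shift(recent, train_mean, train_std):
    """Distance of the recent input mean from the training mean,
    measured in training standard deviations."""
    if not recent or train_std <= 0:
        raise ValueError("need recent values and a positive train_std")
    return abs(fmean(recent) - train_mean) / train_std


# Hypothetical feature with a plausible range of 0 to 100.
cleaned = [clean_value(v, 50.0, 0.0, 100.0) for v in [12.5, None, "n/a", "nan", 1e9]]
shift = mean_shift([12.0, 12.0, 12.0], train_mean=10.0, train_std=2.0)
```

A production check would typically also log how often the default is substituted, since a rising substitution rate is itself a data quality signal.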
  • 32. Summary 1. Deploying analytic models is a core technical competency. 2. A discipline of AnalyticOps is emerging defining best practices for running analytics at scale. 3. Building an analytic model is just the first step in its life cycle, which includes deployment, integration into a value chain, improvement, and replacement. 4. The Portable Format for Analytics (PFA) is a model interchange format for building analytic models and workflows in one environment and deploying them in another one. Analytic engines can be used to execute PFA. 5. Analytic containers are a good way of encapsulating everything needed to deploy analytic models and analytic workflows into production.