With the growing number of data-driven organizations, new approaches are needed to drive innovation in scaling and implementing data science. We will discuss how data and data science platforms take advantage of what we are calling DataOps. We will share background on this approach and how it supports putting data science models into production. We will provide best practices and a roadmap for implementing these techniques to become a leader in machine learning and data science. More: http://info.mapr.com/WB_Implementing-DataOps-BestPractices_Global_DG_17.11.07_RegistrationPage.html
Best Practices: Implementing DataOps with a Data Science Platform
1. Learn more at datascience.com | Empower Your Data Scientists
November 7, 2017
Agenda
• Evolving data science landscape
• Data growth and impacts
• Defining DataOps
• DataOps vs. DevOps
• Best practices in applying DataOps
• Q&A
Crystal Valentine
VP Technology Strategy
MapR
cvalentine@mapr.com
William Merchan
CSO
DataScience.com
william@datascience.com
EVOLVING LANDSCAPE
DOING DATA SCIENCE HAS GROWN IN COMPLEXITY
Environments: Windows, OSX, Cloud, On Prem, Laptops, Remote, Security, AWS, Google, Azure
Tools: Notebooks (Jupyter, R Studio, Zeppelin), Languages (Python, Scala, R, SAS), Libraries
Sharing & Collaboration: ?, Results, Models, Chat, Email, .ppt, Code, Email, Shared Drives
Deployments: Monitoring, Support, Logging Style A, Logging Style B, Tools (PMML, Flask)
Lineage and Repeatability: ?
Data: Data Lake, Database, Data Inventory, Data Tools (Spark, Pig, Hive), ETL, Cron
Users
DATA SCIENCE TRENDS: GROWING TEAMS & OPEN SOURCE AS THE NEW STANDARD
2017: 2,350,000 data science and analytics job listings*
*Source: Kaggle 2017 data science trend report, Burning Glass Quant Crunch Report, Microsoft Revolutions Blog 2017
DATA SCIENCE PLATFORMS ARE AN EMERGING CATEGORY BRINGING TOGETHER ESSENTIAL ELEMENTS FOR DATA SCIENCE SCALING
CLOUD PROVIDERS
ETL & DATA ENGINEERING
VERTICAL APPLICATIONS
BI & VISUALIZATION TOOLS
SECURITY
INFRASTRUCTURE
LIBRARIES
TOOLS
DATA PLATFORMS
DATA SCIENCE PLATFORMS
DATA GROWTH
DATA IS THE LEVERAGE POINT FOR COMPETITIVE ADVANTAGE
DATA VOLUMES GROWING FASTER THAN MOORE’S LAW
Source: McKinsey Global Institute
1987: 3 Exabytes of Data
2010: 1.2 Zettabytes of Data
2020: 44 Zettabytes of Data
Data Diversity: Emails, Call Detail Records, Clickstream, CSV Data, Documents, PDF, Billing Data, Meta Data, JSON, Network Data, Mobile Data, XML, Product Catalog, Medical Records, Text Files, Video, Text Messages, Merchant Listings, Sensor Data, Server Logs, Set Top Box, Social Media, Audio
THE VALUE OF DATA
[Figure: two charts, "Legacy Value Model" and "Next-Gen Value Model". Axes: Size and $. Curves: Value, Cost, Net Value. Each chart marks an optimum (OPT).]
WE HAVE PASSED AN INFLECTION POINT
INVESTMENT IN NEXT-GEN VS. LEGACY TECHNOLOGIES FOR DATA
[Chart: $ (millions). Series: Legacy technology investment, Next-Gen technology investment, Total $ growth of IT market.]
90% of data is on next-gen technology by 2020
Next-gen consists of cloud, big data, software and hardware related expenses
Source: IDC, Gartner; Analysis & Estimates: MapR
DATAOPS
DATAOPS: AN AGILE METHODOLOGY FOR DATA-DRIVEN ORGANIZATIONS
Axioms:
1. Data is central to disruptive enterprise applications
a. Lightweight, stateless functions do not represent the majority of workloads
2. Data science and machine learning are an important paradigm
a. Scientists become active users -- no longer just application developers
b. Iterative workflow with different data usage patterns
3. Data volumes continue to grow
4. Moving data is a performance bottleneck
DataOps Goals:
• Continuous model deployment
• Promote repeatability
• Promote productivity -- focus on core competencies
• Promote agility
• Promote self-service
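Axiom 4 above can be made concrete with back-of-the-envelope arithmetic. The sketch below estimates how long a bulk copy takes at a sustained network rate. The function name `transfer_days`, the 1 PB dataset size, and the 10 Gb/s link are illustrative assumptions, not figures from the talk, and protocol overhead is ignored.

```python
def transfer_days(dataset_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to copy a dataset over a link at a sustained rate.

    Ignores protocol overhead, retries, and contention, so real copies
    would take longer than this estimate.
    """
    seconds = dataset_bytes * 8 / link_bits_per_sec
    return seconds / 86_400  # seconds per day


# Illustrative assumptions, not numbers from the slides.
PETABYTE = 10**15        # decimal petabyte, in bytes
TEN_GBPS = 10 * 10**9    # 10 gigabits per second

print(f"{transfer_days(PETABYTE, TEN_GBPS):.1f} days")  # prints "9.3 days"
```

Under these assumptions a single petabyte takes more than a week to move, which is one way to read the case for running compute where the data already lives.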
COMPARING DEVOPS AND DATAOPS: WHAT’S DIFFERENT OR THE SAME?
[Diagram labels: DevOps, DataOps]
Roles: Developers & Architects, Data Engineers, Data Scientists, Security & Governance, Operations
CONTINUOUS MODEL DEPLOYMENT
• Data Engineering
• Model Development
• Model Management
• Model Deployment
• Model Monitoring & Rescoring
Key Building Blocks for Agility:
1) Unified data platform
2) Data governance
3) Self-service data and compute access
4) Multitenancy and resource management
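The five stages named on this slide can be sketched as one repeatable cycle. This is a minimal illustration under stated assumptions, not the speakers' implementation: the one-parameter model, the `Registry` class, `run_cycle`, and the `max_error` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple

Row = Tuple[float, float]  # (feature x, target y)


@dataclass(frozen=True)
class ModelVersion:
    version: int
    slope: float  # hypothetical one-parameter model: y ~= slope * x


def engineer(raw: List[Row]) -> List[Row]:
    """Data engineering: drop rows that cannot be used for fitting."""
    return [(x, y) for x, y in raw if x != 0]


def develop(rows: List[Row], version: int) -> ModelVersion:
    """Model development: fit the slope as the mean of y / x."""
    return ModelVersion(version, mean(y / x for x, y in rows))


def monitor(model: ModelVersion, rows: List[Row]) -> float:
    """Model monitoring: mean absolute error of the model on fresh data."""
    return mean(abs(y - model.slope * x) for x, y in rows)


class Registry:
    """Model management: keeps every version; one of them is deployed."""

    def __init__(self) -> None:
        self.versions: List[ModelVersion] = []
        self.deployed: Optional[ModelVersion] = None

    def register(self, model: ModelVersion) -> None:
        self.versions.append(model)

    def deploy(self, model: ModelVersion) -> None:
        self.deployed = model


def run_cycle(registry: Registry, fresh_raw: List[Row], max_error: float) -> ModelVersion:
    """One pass: keep the deployed model if it still scores well, else refit and redeploy."""
    rows = engineer(fresh_raw)
    current = registry.deployed
    if current is not None and monitor(current, rows) <= max_error:
        return current
    model = develop(rows, version=len(registry.versions) + 1)
    registry.register(model)
    registry.deploy(model)
    return model


registry = Registry()
run_cycle(registry, [(1.0, 2.0), (2.0, 4.0)], max_error=0.5)  # first fit, version 1
run_cycle(registry, [(1.0, 2.1), (2.0, 3.9)], max_error=0.5)  # error 0.1, version 1 kept
run_cycle(registry, [(1.0, 3.0), (2.0, 6.0)], max_error=0.5)  # error 1.5, refit as version 2
print(registry.deployed)
```

Keeping every version in the registry is what lets a deployment be rolled back or audited later, which ties the cycle to the repeatability goal.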
BEST PRACTICES
INDUSTRY LEADING DATA SCIENCE ORGANIZATIONS ADOPTING DATAOPS
• Versioning
• Platform approach
• Team makeup and organization
• Self service
DataOps Platform Checklist
• Unified platform for all data -- historical and real-time production
• Multitenancy and resource utilization
• Single security and access model for governance and self-service access
• Enterprise-grade for mission-critical applications and open source tools
• Run compute on the data platform -- leverage data locality
Thank you!
NEW DATAOPS APPROACH FOR DATA SCIENCE TEAMS
DataOps