3. More Data, More Insights
Data is abundant, diverse & shared freely
As is how we store, process and analyze it
Streaming, Machine Learning, BI, ETL, Modeling
4. More Results
Top Cancer Research Institutions: Working to Cure Cancer
Rocket Science
Thorn: Destroying Human Trafficking Networks
5. Organization & Culture: Sobering Statistics
“Only 27% of the big data projects are regarded as successful”
“Only 8% of the big data projects are regarded as VERY successful”
Only 13% of organizations have achieved full-scale production for their Big Data implementations
Source: Capgemini 2014
“Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive Analytics program in place, while 80% said they planned on implementing such a program within five years”
Source: Dataversity 2015 Survey
6. The Data Scientist is not one person
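[Venn diagram. Circles: Hacking Skills; Math and Statistical Knowledge; Substantive Expertise. Overlaps: Machine Learning (Hacking Skills + Math and Statistical Knowledge); Traditional Research (Math and Statistical Knowledge + Substantive Expertise); Danger Zone (Hacking Skills + Substantive Expertise); Data Science (center). Additional label: Curiosity.]
Source: Drew Conway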
7. The Data Scientist does not stand alone
Data Engineer/ETL Engineer
Executive Sponsor
Data Steward/SME (Subject Matter Expert)
Data Scientist
+ Product Owner, app developer, program manager, DevOps, etc.
8. The Data Scientist does not sit in a centralized org
Other - 37%
CIO or IT Function - 18%
CMO - 11%
CFO - 9%
Chief Analytics Officer - 7%
CRO / Risk - 7%
VP Strategic Planning - 5%
VP Sales - 3%
Chief Data Officer - 3%
VP Customer Service - 3%
Source: Gartner 2016
11. Importance of Process
Data Science != Software Engineering
But we can learn a lot from it, especially about process.
After all, failing to plan is planning to fail.
Data Acquisition
1. Data Flow Architecture
2. Data Schema Architecture
2. Feature Extraction
3. Data Flow Implementation
4. Data Flow Validation
Data Science
1. Data Problem Formulation
2. Acquire Data Sources
3. Data exploration
4. Create analytics dataset
5. Modeling & Descriptive Analysis
6. Model evaluation and tuning
7. Model Deployment
12. Four Pillars of the Team Data Science Process
1. Standard Project Lifecycle
2. Standardized Document Templates, Project Structure
3. Shared, Distributed Resources
4. Productivity Tools, Shared Utilities
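A standardized project structure can be scaffolded with a short script. The following is a minimal sketch in Python; the folder names and the `scaffold_project` helper are hypothetical placeholders chosen for illustration, not the template that the Team Data Science Process itself provides.

```python
from pathlib import Path
import tempfile

# Hypothetical folder layout; a team would substitute its own agreed template.
STANDARD_LAYOUT = [
    "docs/project",
    "docs/data_report",
    "docs/model_report",
    "code/data_acquisition",
    "code/modeling",
    "code/deployment",
    "sample_data",
]


def scaffold_project(root):
    """Create the standard folders under root and return the created paths."""
    root = Path(root)
    created = []
    for rel in STANDARD_LAYOUT:
        folder = root / rel
        folder.mkdir(parents=True, exist_ok=True)
        # A placeholder README keeps otherwise empty folders visible in Git.
        (folder / "README.md").write_text(f"# {rel}\n", encoding="utf-8")
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold_project(Path(tmp) / "demo_project"):
            print(folder.relative_to(tmp))
```

Keeping the layout in one shared list means every new project starts from the same structure, which is the point of the pillar: team members can find documents and code in the same place across projects.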
13. Team Data Science Process at Microsoft
• Data science virtual machines (DSVMs) as the fundamental development platform in the cloud
• Use Visual Studio Team Services (VSTS)
• Work item tracking and scrum planning
• Git repositories
• Shared data science utilities in Git repository
• Use cloud-based Azure resources as needed
14. Data Engineering – ready for ML?
The better the raw materials, the better the product.
• Question is sharp. E.g. Predict whether component X will fail in the next Y days; clear path of action with answer
• Data measures what they care about. E.g. Identifiers at the level they are predicting
• Data is connected. E.g. Machine information linkable to usage information
• Data is accurate. E.g. Failures are really failures, human labels on root causes; domain knowledge translated into process
• A lot of data. E.g. Will be difficult to predict failure accurately with few examples
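Some of these readiness criteria can be checked mechanically before any modeling starts. The sketch below, in plain Python, covers three of them for the component-failure example: connected, accurate, and a lot of data. The record fields (`machine_id`, `root_cause`), the `readiness_report` helper, and the `min_failures` threshold are all assumptions made for illustration.

```python
def readiness_report(machines, usage, failures, min_failures=30):
    """Return simple readiness indicators for a failure-prediction dataset.

    machines: list of dicts with a "machine_id" key
    usage:    list of dicts with a "machine_id" key
    failures: list of dicts with "machine_id" and "root_cause" (may be None)
    """
    machine_ids = {m["machine_id"] for m in machines}
    usage_ids = {u["machine_id"] for u in usage}

    # Data is connected: share of machines that have linkable usage records.
    linked = len(machine_ids & usage_ids) / len(machine_ids) if machine_ids else 0.0

    # Data is accurate: share of failures that carry a human-labelled root cause.
    labelled = [f for f in failures if f.get("root_cause")]
    labelled_share = len(labelled) / len(failures) if failures else 0.0

    # A lot of data: are there enough failure examples to learn from?
    enough = len(failures) >= min_failures

    return {
        "linked_share": linked,
        "labelled_share": labelled_share,
        "enough_failures": enough,
    }


if __name__ == "__main__":
    machines = [{"machine_id": i} for i in range(4)]
    usage = [{"machine_id": i} for i in range(3)]
    failures = [
        {"machine_id": 0, "root_cause": "bearing wear"},
        {"machine_id": 1, "root_cause": None},
    ]
    print(readiness_report(machines, usage, failures))
```

Whether the question is sharp and whether the data measures what the business cares about remain judgment calls that need a subject matter expert; the script only covers the criteria that reduce to counts.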
15. A Bit more on Data Engineering
Gartner estimates that poor quality of data costs an average organization $13.5 million per year, and yet data governance problems, which all organizations suffer from, are worsening.
How do Data Scientists spend their time?
Cleaning & organizing data - 60%
Collecting data sets - 19%
Mining data for patterns - 9%
Refining algorithms - 4%
Building training sets - 3%
Other - 5%
Source: CrowdFlower
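As a quick arithmetic check on the CrowdFlower figures, the short Python sketch below sums the shares and adds up the two categories that deal with getting data into usable shape. Grouping "cleaning & organizing" and "collecting" together as data preparation is an assumption made here for illustration.

```python
# Shares of data scientists' time in percent, as listed on the slide (source: CrowdFlower).
time_split = {
    "Cleaning & organizing data": 60,
    "Collecting data sets": 19,
    "Mining data for patterns": 9,
    "Refining algorithms": 4,
    "Building training sets": 3,
    "Other": 5,
}

# The shares should account for all of the time.
total = sum(time_split.values())

# Assumed grouping: cleaning plus collecting counts as data preparation.
data_prep = time_split["Cleaning & organizing data"] + time_split["Collecting data sets"]

print(f"Total: {total}%")                 # Total: 100%
print(f"Data preparation: {data_prep}%")  # Data preparation: 79%
```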
16. A Bit more on Data Engineering
Data Ingestion (Kafka, Navigator, Search)
Cloudera enables users to build real-time, end-to-end data pipelines to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without security gaps.
Data Processing (Spark, Hive)
Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skillsets.
17. Data Engineering/Science/Analyst Tools
[Bar charts: Cloudera Certified Partners, 2015 vs. 2016, in three panels: Data Engineering; Data Science/Analytics; Data Analyst / BI]
18. Flexible deployments: Cloud enabled
Cloudera Director
Easy Administration
• Dynamic cluster lifecycle management
• Single pane of glass: multi-cluster view
• Consumption based billing and metering
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at scale
Flexible Deployments
• No cloud vendor lock-in: open plugin framework for IaaS platforms
• Scaling of provisioned clusters
• Spot instance provisioning
19. Cortana Intelligence Suite on Azure cloud platform
[Architecture diagram. Flow: Data to Intelligence to Action]
Data Sources: Apps; Sensors and devices; Data
Information Management: Data Factory; Data Catalog; Event Hubs
Big Data Stores: Data Lake Store; SQL Data Warehouse
Machine Learning and Analytics: Machine Learning; Data Lake Analytics; HDInsight (Hadoop and Spark); Stream Analytics
Intelligence: Cognitive Services; Bot Framework; Cortana
Dashboards & Visualizations: Power BI
Action: People; Automated Systems; Apps (Web, Mobile, Bots)
20. Careful checking and cleaning of data
Leverage the power of the cloud
More Data = More results!
Create a data-driven culture & DS processes
Use the right tool for the job
21. Resources
• Microsoft’s “Team Data Science Process” GitHub: http://aka.ms/tdsp
• Productivity utilities repository: https://github.com/Azure/Azure-TDSP-Utilities
• Sign up for a free VSTS account: http://www.visualstudio.com
• Complete Cloudera resource library: https://www.cloudera.com/resources.html
• Coursera Data Science: http://www.coursera.org