Breeding Data Scientists
• Danielle Dean, PhD Senior Data Scientist Lead, Microsoft
• Amy O’Connor Business Value Enablement, Cloudera
Five changes in the world of the Data Scientist
• More Data, Insights, Results
• Organization & Culture
• Productivity Tools
• Data Engineering
• Cloud Enabled
More Data, More Insights
Data is abundant, diverse & shared freely
As is how we store, process and analyze it
Streaming, Machine Learning, BI, ETL, Modeling
More Results
Top Cancer Research Institutions: Working to Cure Cancer
Rocket Science
Thorn: Destroying Human Trafficking Networks
Organization & Culture: Sobering Statistics
“Only 27% of the big data projects are regarded as successful”
“Only 8% of the big data projects are regarded as VERY successful”
Only 13% of organizations have achieved full-scale production for their Big Data implementations
Source: CapGemini 2014
“Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive Analytics program in place, while 80% said they planned on implementing such a program within five years”
Source: Dataversity 2015 Survey
The Data Scientist is not one person
Curiosity
[Venn diagram: Hacking Skills, Math and Statistical Knowledge, and Substantive Expertise, with overlaps labeled Machine Learning, Traditional Research, Danger Zone, and Data Science]
Source: Drew Conway
The Data Scientist does not stand alone
Data Engineer/ETL Engineer
Executive Sponsor
Data Steward/SME
Subject Matter Expert
Data Scientist
+ Product Owner, app developer, program manager, DevOps, etc.
The Data Scientist does not sit in a centralized org
Other - 37%
CIO or IT Function - 18%
CMO - 11%
CFO - 9%
Chief Analytics Officer - 7%
CRO / Risk - 7%
VP Strategic Planning - 5%
VP Sales - 3%
Chief Data Officer - 3%
VP Customer Service - 3%
Source: Gartner 2016
“How do I become a Data Scientist?”
Importance of Process
Data Science != Software Engineering
But, we can learn a lot, especially on processes
after all…Failing to plan is planning to fail
Data Acquisition
1. Data Flow Architecture
2. Data Schema Architecture
2. Feature Extraction
3. Data Flow Implementation
4. Data Flow Validation
Data Science
1. Data Problem Formulation
2. Acquire Data Sources
3. Data exploration
4. Create analytics dataset
5. Modeling & Descriptive Analysis
6. Model evaluation and tuning
7. Model Deployment
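One way to make the seven Data Science steps on this slide explicit in a project is to encode them as an ordered type, so that tooling can tell which step comes next. The sketch below is only an illustration: the names `Stage` and `next_stage` are hypothetical and do not come from the talk or from any Microsoft or Cloudera library.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The seven Data Science steps listed on the slide, in order."""
    DATA_PROBLEM_FORMULATION = 1
    ACQUIRE_DATA_SOURCES = 2
    DATA_EXPLORATION = 3
    CREATE_ANALYTICS_DATASET = 4
    MODELING_AND_DESCRIPTIVE_ANALYSIS = 5
    MODEL_EVALUATION_AND_TUNING = 6
    MODEL_DEPLOYMENT = 7


def next_stage(current):
    """Return the stage after `current`, or None once deployment is reached."""
    if current is Stage.MODEL_DEPLOYMENT:
        return None
    return Stage(current + 1)


if __name__ == "__main__":
    # Walk the lifecycle from the first step to the last.
    stage = Stage.DATA_PROBLEM_FORMULATION
    while stage is not None:
        print(stage.value, stage.name)
        stage = next_stage(stage)
```

In practice the steps are iterative rather than strictly linear, so a real tracker would probably allow moving back to an earlier stage as well.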
Four Pillars of the Team Data Science Process
1. Standard Project Lifecycle
2. Standardized Document Templates, Project Structure
3. Shared, Distributed Resources
4. Productivity Tools, Shared Utilities
Team Data Science Process at Microsoft
• Data science virtual machines (DSVMs) as the fundamental development platform on cloud
• Use Visual Studio Team Services (VSTS)
• Work item tracking and scrum planning
• Git repositories
• Shared data science utilities in Git repository
• Use cloud-based Azure resources as needed
Data Engineering – ready for ML?
The better the raw materials, the better the product.
• Question is sharp. E.g. Predict whether component X will fail in the next Y days; clear path of action with answer
• Data measures what they care about. E.g. Identifiers at the level they are predicting
• Data is connected. E.g. Machine information linkable to usage information
• Data is accurate. E.g. Failures are really failures, human labels on root causes; domain knowledge translated into process
• A lot of data. E.g. Will be difficult to predict failure accurately with few examples
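Some of the readiness criteria on this slide can be checked mechanically before any modeling starts. The following is a minimal sketch for the component-failure example, assuming records are plain dictionaries; the field names `machine_id` and `failed`, the function name `readiness_report`, and the `min_failures` threshold are all assumptions made for illustration and are not taken from the talk.

```python
def readiness_report(machine_rows, usage_rows, min_failures=30):
    """Summarize three readiness checks for a failure-prediction dataset.

    machine_rows: dicts with a "machine_id" and a "failed" label
                  (True, False, or None when the label is missing).
    usage_rows:   dicts with a "machine_id" (hypothetical usage records).
    """
    usage_ids = {row["machine_id"] for row in usage_rows}
    total = len(machine_rows)

    # "Data is connected": machine records that can be linked to usage records.
    linked = sum(1 for row in machine_rows if row["machine_id"] in usage_ids)
    # "Data is accurate": here only a proxy, the share of rows that carry a label.
    labeled = sum(1 for row in machine_rows if row.get("failed") is not None)
    # "A lot of data": count of positive (failure) examples.
    failures = sum(1 for row in machine_rows if row.get("failed") is True)

    return {
        "linked_share": linked / total if total else 0.0,
        "labeled_share": labeled / total if total else 0.0,
        "failure_examples": failures,
        "enough_failures": failures >= min_failures,
    }


if __name__ == "__main__":
    machines = [
        {"machine_id": "m1", "failed": True},
        {"machine_id": "m2", "failed": False},
        {"machine_id": "m3", "failed": None},
        {"machine_id": "m4", "failed": True},
    ]
    usage = [{"machine_id": "m1"}, {"machine_id": "m2"}, {"machine_id": "m4"}]
    print(readiness_report(machines, usage, min_failures=2))
```

A check like this cannot tell whether "failures are really failures"; that still depends on human labels and domain knowledge, as the slide notes.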
A Bit more on Data Engineering
How do Data Scientists spend their time?
Cleaning & organizing data - 60%
Collecting data sets - 19%
Mining data for patterns - 9%
Refining algorithms - 4%
Building training sets - 3%
Other - 5%
Source: CrowdFlower
Gartner estimates that poor quality of data costs an average organization $13.5 million per year, and yet data governance problems, which all organizations suffer from, are worsening.
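The arithmetic behind the CrowdFlower breakdown is worth spelling out: collecting plus cleaning and organizing data accounts for 79% of the reported time. The short sketch below reproduces that sum from the slide's figures; the variable names and the choice to group only those two categories as "data preparation" are assumptions made for illustration.

```python
# Percentages as listed on the slide (source: CrowdFlower).
time_share = {
    "Cleaning & organizing data": 60,
    "Collecting data sets": 19,
    "Mining data for patterns": 9,
    "Refining algorithms": 4,
    "Building training sets": 3,
    "Other": 5,
}

# Sanity check: the categories should cover all of the reported time.
total = sum(time_share.values())

# Grouping assumed for this example: collecting + cleaning = data preparation.
data_preparation = (
    time_share["Cleaning & organizing data"] + time_share["Collecting data sets"]
)

if __name__ == "__main__":
    print(f"total: {total}%")                        # total: 100%
    print(f"data preparation: {data_preparation}%")  # data preparation: 79%
```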
A Bit more on Data Engineering
Data Ingestion (Kafka, Navigator, Search)
Cloudera enables users to build real-time, end-to-end data pipelines in order to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without being presented with gaps in security.
Data Processing (Spark, Hive)
Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skillsets.
Data Engineering/Science/Analyst Tools
Cloudera Certified Partners
[Charts: Cloudera Certified Partners, 2015 vs. 2016, in three panels: Data Engineering; Data Science/Analytics; Data Analyst / BI]
Flexible deployments: Cloud enabled
Easy Administration
• Dynamic cluster lifecycle management
• Single pane of glass: multi-cluster view
• Consumption based billing and metering
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at scale
Flexible Deployments
• No cloud vendor lock-in: open plugin framework for IaaS platforms
• Scaling of provisioned clusters
• Spot instance provisioning
Cloudera Director
Cortana Intelligence Suite on Azure cloud platform
Flow: Data, Intelligence, Action
Data Sources: Apps; Sensors and devices
Information Management: Data Factory, Data Catalog, Event Hubs
Big Data Stores: Data Lake Store, SQL Data Warehouse
Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
Intelligence: Cognitive Services, Bot Framework, Cortana
Dashboards & Visualizations: Power BI
Action: People; Automated Systems; Apps (Web, Mobile, Bots)
Careful checking and cleaning of data
Leverage the power of the cloud
More Data = More results!
Create a data driven culture & DS processes
Use the right tool for the job
Resources
• Microsoft’s “Team Data Science Process” GitHub: http://aka.ms/tdsp
• Productive utilities repository: https://github.com/Azure/Azure-TDSP-Utilities
• Sign up for a free VSTS account: http://www.visualstudio.com
• Complete Cloudera resource library: https://www.cloudera.com/resources.html
• Coursera Data Science: http://www.coursera.org

Breed data scientists_ A Presentation.pptx
