https://www.datasciencetech.institute/
Data Science:
Past, Present, and Future
Gregory Piatetsky-Shapiro
KDnuggets
2© KDnuggets 2016
La Science des données:
passé, présent et futur
Predicting Behavior –
Key to Survival
Better prediction – better intelligence
“Predictions”: Astrology
My May 26 Horoscope:
So what if things aren't
completely wonderful in your
life right now? Just keep your
hopes high, and your fingers
crossed. … Being with the
people who make you feel good
about yourself will help keep
your thoughts bright, so get
together with your closest
friend as soon as you can.
www.astrology.com/horoscope/daily/aries.html
“Predictions”: Turkish Coffee Grounds
If a big chunk of the coffee
grounds falls down on the saucer
then it is taken as the first positive
sign of your reading. “Trouble and
worries are leaving you”.
Pundits’ “Predictions”
• Nate Silver FiveThirtyEight.com prediction for
Trump winning Republican nomination:
• Aug 2015: 2%
• Sep 2015: 5%
• Nov 2015: 6%
• Jan 2016: 12%
• May 2016: 99%
Desire to Predict – Deep Human Trait
Lessons:
• People are hard-wired to see patterns
• People want predictions
• Human intuition does not work well on large-scale data or for understanding probability
• A good story is essential to a convincing prediction (whether true or false)
Data Science
Data-Driven, Scientific
approach to prediction
and data analysis
Outline
• Intro, Data Science History and Terms
• 10 Real-World Data Science Lessons
• Data Science Now: Polls & Trends
• Data Science Roles
• Data Science Job Trends
• Data Science Future
What do we call it?
• Statistics
• Data Mining
• Knowledge Discovery in Data (KDD)
• Predictive Analytics
• Data Analytics
• Data Science
• …?
Core Idea: Finding Useful Patterns in Data
Pre-history (1800-2008): Statistics
From Google Ngram Viewer, English-language books, case-insensitive search.
Other languages need to be considered for the full picture.
• “Statistics” is the biggest term in the 20th century
• “Analytics” is used increasingly through the 20th century
• “Data mining” appears in the late 1990s
French Books, 1800-2008
Statistiques vs Mathematiques
“Data Mining” Surges in 1996
Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal
[Chart: Google N-grams for “Analytics” and “Data Mining”; case-insensitive search, smoothing 1]
Earliest use of “data mining”: 1962
Source: Google Books
After eliminating many “following data. Mining cost is” examples, which refer to the mining of minerals, and books from “1958” that have a CD attached (errors in book year), the earliest “data mining” reference I found is from 1962.
Very Recent History
Using Google Trends
Google Trends, 2005-2016:
After 2006, Analytics > Data Mining
Global – all regions
>50% of “Analytics” searches are for
“Google Analytics”
Google Analytics introduced,
Dec 2005
Google Trends, 2005-2016
[Chart: Google Trends curves for data science, “analytics - Google”, big data, and data mining; x-axis marks 2010, 2012, 2014]
Google Trends, 2005-2016
2012: Analytics down, Big Data up
Google Trends, 2005-2016
2013: Data Science grows
Google Trends:
Machine Learning, Data Science,
Deep Learning
Google Trends: Machine Learning
Machine Learning ~ “Machine Learning”
Google Trends: Data Science
[Data Science] != “Data Science”
Lesson: Examine assumptions carefully
Regional Interest in
“Data Science” in 2015
Google Trends
Note: search for “Data Science” is
different from [Data Science]
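The difference between the two query forms can be illustrated with a toy matcher. This is only a sketch, assuming that the bracketed form [Data Science] stands for an unquoted query that matches any search containing both words, while the quoted form matches the exact phrase. The function names and sample searches are hypothetical, not real Google Trends data.

```python
def matches_terms(query: str, search: str) -> bool:
    """Unquoted query: every query word appears somewhere in the search."""
    words = search.lower().split()
    return all(term in words for term in query.lower().split())


def matches_phrase(query: str, search: str) -> bool:
    """Quoted query: the query words appear adjacent and in order."""
    q = query.lower().split()
    words = search.lower().split()
    return any(words[i:i + len(q)] == q for i in range(len(words) - len(q) + 1))


# Hypothetical search strings, made up for illustration.
searches = [
    "data science jobs",
    "computer science data structures",
    "science fair data table",
]
term_hits = [s for s in searches if matches_terms("data science", s)]
phrase_hits = [s for s in searches if matches_phrase("data science", s)]
print(len(term_hits), len(phrase_hits))
```

The unquoted form counts searches that have little to do with the field, which is why the two trend curves can look different.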
KDnuggets Audience by Region, Q1 2016
Data Science History
• < 1900 - Statistics
• 1960s - “Data Mining” = bad activity, data “dredging”
• 1990 - “Data Mining” is good, surges in 1996
• 2003 - “Data Mining” peaks (bad in press, invasion of privacy?), slowly declines, but still popular
• 2006 - Google Analytics
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data
• 2014 - Data Science
• 2015 - Deep Learning
• 2018 - ??
10 Real-World Lessons
from the Art & Practice
of Data Science &
Data Mining
Lesson 1: It is an Iterative, Circular Process
Waterfall model does NOT work for Data Science
CRISP-DM: Iterative, Circular Process
See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
Data Mining Process – CRISP-DM, 1998
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
Academic Data Science Process
See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
Harvard, 2013
Machine Learning Workflow, MS Azure
See
www.kdnuggets.com/2016/04/developers-need-know-about-machine-learning.html
blogs.msdn.microsoft.com/continuous_learning/2014/11/15/end-to-end-predictive-model-in-azureml-using-linear-regression/
Lesson 2: Data Engineering Takes the Bulk of Time
• Building Machine Learning/Predictive Models is the key (and most fun) part, but only a small part of the whole process
• 60-80% (?) of the time is spent on Data Preparation/Engineering
Competitions are different
March Machine Learning Mania 2016, Winner's Interview: 1st Place, Miguel Alomar
http://blog.kaggle.com/2016/05/10/march-machine-learning-mania-2016-winners-interview-1st-place-miguel-alomar/
https://twitter.com/kdnuggets/status/730417186167263232
How the #MachineLearning @Kaggle winner spent time:
• 35% reading forums
• 25% building models
• 25% evaluating results
• 15% data preparation
Lesson 3: Question Assumptions
Problem: Deciles not uniform
Decile 1 is too rare, Decile 0 too frequent? Why?
[Chart: distribution of Measurement deciles (* not actual data)]
Mass Spectrometry
Mass spectrometry (MS) is an analytical technique that ionizes chemical species and sorts the ions based on their mass-to-charge ratio.
It can produce a large number (~20,000) of m/z values for a sample.
Goal: find biomarkers for disease, test, condition
Question Assumptions
Instead of Measurement deciles, examine actual ranges, including 0:
• Nothing between 1 and 14
• Value 0 is too frequent
Why?
[Chart: distribution of actual Measurement values (* not actual data)]
Someone added a rule to round raw measurement values below 15 to zero.
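A check like the one described above can be automated. The sketch below uses made-up measurements and hypothetical helper names (`value_gaps`, `most_common_share`, and the `min_gap` threshold are assumptions, not part of the original analysis). It looks for wide empty ranges between observed values and for a single value that dominates the data.

```python
from collections import Counter


def value_gaps(values, min_gap=10):
    """Return (low, high) pairs of adjacent distinct values separated by a wide empty range."""
    distinct = sorted(set(values))
    return [(a, b) for a, b in zip(distinct, distinct[1:]) if b - a >= min_gap]


def most_common_share(values):
    """Return the most frequent value and the fraction of records that carry it."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Made-up measurements that mimic the slide: small readings were rounded to zero upstream.
raw = [3, 7, 12, 15, 16, 18, 22, 25, 31, 40, 2, 9]
recorded = [0 if v < 15 else v for v in raw]

print(value_gaps(recorded))         # reports the empty range right above zero
print(most_common_share(recorded))  # zero is the dominant value
```

Binning into deciles hides both symptoms, which is why looking at the actual value ranges is the more revealing view.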
“The best data scientists have one thing in common: unbelievable curiosity.”
DJ Patil, first US Chief Data Scientist, April 2016
http://www.sciencefriday.com/articles/10-questions-for-the-nations-first-chief-data-scientist
Lesson 4: Focus on the Right Metric - Actionable
• Consumer: Churn may depend on age, region, usage, and rate plan. Rate plan is the easiest to change.
• Uplift Modeling in Marketing and Politics: focus on persuadables
Right Metric: Uplift Modeling
Don’t model whether the consumer will buy.
Model whether the consumer will buy in response to an offer.
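The idea can be sketched as a per-segment difference in response rates between customers who received the offer and a control group who did not. This is a minimal illustration on made-up data, assuming a randomized control group. The segment names, record layout, and helper functions are hypothetical, and real uplift models are usually more elaborate.

```python
def response_rate(records):
    """Fraction of records with bought == True (0.0 for an empty group)."""
    return sum(r["bought"] for r in records) / len(records) if records else 0.0


def uplift_by_segment(records):
    """Uplift = P(buy | offer) - P(buy | no offer), computed per segment."""
    result = {}
    for seg in sorted({r["segment"] for r in records}):
        treated = [r for r in records if r["segment"] == seg and r["offer"]]
        control = [r for r in records if r["segment"] == seg and not r["offer"]]
        result[seg] = response_rate(treated) - response_rate(control)
    return result


def make(segment, offer, buyers, total):
    """Build `total` hypothetical customers, the first `buyers` of whom bought."""
    return [{"segment": segment, "offer": offer, "bought": i < buyers} for i in range(total)]


# Made-up campaign data: "loyal" customers buy anyway, "persuadable" ones respond to the offer.
data = (
    make("loyal", True, 80, 100) + make("loyal", False, 80, 100)
    + make("persuadable", True, 30, 100) + make("persuadable", False, 10, 100)
)
uplift = uplift_by_segment(data)
print(uplift)
```

A plain response model would rank the loyal segment first because it buys most often; the uplift view ranks the persuadable segment first because that is where the offer changes behavior.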
Right Metric: Uplift Modeling
From Eric Siegel's presentation at PAW, 2011
Used in the Obama 2012 campaign:
www.thefiscaltimes.com/Articles/2013/01/21/The-Real-Story-Behind-Obamas-Election-Victory
Lesson 5: Be a Fox, not a Hedgehog
Read Isaiah Berlin's 1953 essay, The Hedgehog and the Fox
A fox knows many things, but a hedgehog knows one important thing.
Lesson 5: Modeling
No Free Lunch Theorem – no method is universally the best (Wolpert)
In Kaggle competitions, there are 2 ways to win (Anthony Goldbloom, 2016):
• Handcrafted feature engineering
• Or Deep Learning Neural Networks
www.kdnuggets.com/2016/01/anthony-goldbloom-secret-winning-kaggle-competitions.html
• XGBoost – winning method in many recent Kaggle competitions
• Ensemble methods
For Structured Data (Sebastian Raschka)
• SVM (Support Vector Machines) for smaller data
• Random Forests – more data, more automated
www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html
Unstructured:
• Deep Learning
Lesson 6: Avoid Overfitting
http://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
Many examples at http://tylervigen.com/spurious-correlations
Avoid Overfitting
“Irreproducible” results are a BIG problem in social sciences and medicine:
John P. A. Ioannidis's famous paper, Why Most Published Research Findings Are False (PLoS Medicine, 2005).
Due to
• Small samples
• Testing too many hypotheses
• Confirmation bias (explicit or implicit)
• Poor training
How to Avoid Overfitting
• If it is too good to be true, it probably is
• Find the simplest possible hypothesis
• Adjusting the False Discovery Rate
• Randomization Testing
• Nested cross-validation (train, test, holdout)
• Regularization (adding a penalty for complexity)
www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
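One of the techniques listed above, randomization testing, can be sketched in a few lines. This is an illustrative version for a difference in group means, not the exact procedure from the linked article. The function name, sample data, number of permutations, and seed are all assumptions.

```python
import random


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided randomization test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null hypothesis group membership is arbitrary.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up samples: one pair with a real shift, one pair with none.
shifted = permutation_p_value([11, 12, 13, 14, 15], [1, 2, 3, 4, 5])
identical = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(shifted, identical)
```

If a pattern is found about as often in shuffled data as in the real data, it is probably an artifact of searching rather than a real effect.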
Lesson 7: Tell a story
• Combine facts into a story
• Combine visual and text presentation
• Explanation gives credibility
• Dynamic / Interactive
• Examples: KEFIR, Google Analytics, Quill
KEFIR (KEy FInding Reporter), 1994
• Overview report
www.kdnuggets.com/data_mining_course/kefir/overview.htm
• Inpatient admissions
www.kdnuggets.com/data_mining_course/kefir/s2.htm
Quill report for KDnuggets
• Sessions Stay Flat, But Way Higher Than 12-Month Weekly Average
• Sessions remained flat compared to the prior week. The 121,040
sessions, however, were above your 85,105-session weekly average
for the year. Your site's total pageviews stayed flat last week at
206,124, while pages per session grew less than a percent to 1.7.
That's equal to your weekly average for the year.
• Among all your pages, Analytics, Data Mining, and Data Science had
both the highest bounce rate (43%) and the most pageviews (8,734)
last week.
The Fortune Teller (La Diseuse de bonne aventure), Caravaggio, 1595 (Louvre)
Beware of fortune tellers!
Lesson 8: Limits to Predicting Human Behavior?
• Inherent randomness, complexity in human behavior
• Individual predictions have limited accuracy (but can still be better than random and very useful for consumer analytics)
• Aggregate predictions (e.g., who will win the election) are more accurate, because individual randomness cancels out
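A small simulation illustrates why aggregate predictions can be accurate even when individual ones are not. This is a sketch under stated assumptions: voters are independent, the true share of 52% is invented, and the poll is a simple random sample. None of these numbers come from the talk.

```python
import random

rng = random.Random(42)
true_share = 0.52      # assumed share of voters who prefer candidate A
n_voters = 10_000

# Each simulated voter independently prefers A with probability true_share.
votes = [rng.random() < true_share for _ in range(n_voters)]

# Without further information, the best guess for any single voter is "A".
individual_accuracy = sum(votes) / n_voters

# Aggregate prediction: estimate A's overall share from a random poll of 1,000 voters.
poll = rng.sample(votes, 1000)
estimated_share = sum(poll) / len(poll)
actual_share = sum(votes) / n_voters
aggregate_error = abs(estimated_share - actual_share)

print(f"individual accuracy: {individual_accuracy:.2f}")
print(f"aggregate error:     {aggregate_error:.3f}")
```

The guess for a single voter is right only slightly more often than a coin flip, while the estimate of the overall share is typically off by only a few percentage points, because the individual errors average out.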
Example: Netflix Prize, 2006
• The most advanced algorithms were only a few percent better than basic algorithms
See Gregory Piatetsky, “Big Data: Hype & Reality”, Harvard Business
Review 2012, https://hbr.org/2012/10/big-data-hype-and-reality/
Direct Marketing Lift: Random and Model-sorted Lists
[Chart: CPH (cumulative percent of hits) vs. percent of list, for Random and Model-sorted lists]
5% of the random list has 5% of hits.
5% of the model-score ranked list has 21% of hits.
Lift(5%) = 21%/5% = 4.2
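The lift arithmetic above can be reproduced on a small scored list. The sketch below builds made-up data so that 21 of 100 hits fall in the top 5% of a model-ranked list of 1,000 prospects. The function names and the data layout are assumptions for illustration.

```python
def cumulative_pct_hits(scored, pct):
    """Percent of all hits captured in the top `pct` percent of a score-ranked list."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = round(len(ranked) * pct / 100)
    total_hits = sum(hit for _, hit in ranked)
    return 100 * sum(hit for _, hit in ranked[:top_n]) / total_hits


def lift(scored, pct):
    """Lift(pct) = cumulative percent of hits / percent of list."""
    return cumulative_pct_hits(scored, pct) / pct


# Made-up list of 1,000 prospects as (score, hit) pairs with 100 hits in total:
# 21 hits among the 50 highest scores, the other 79 placed just below them.
scored = []
for rank in range(1000):
    score = 1000 - rank                      # higher score = ranked earlier
    hit = 1 if (rank < 21 or 50 <= rank < 129) else 0
    scored.append((score, hit))

print(cumulative_pct_hits(scored, 5))  # percent of hits in the top 5%
print(lift(scored, 5))                 # lift at 5% of the list
```

A randomly ordered list would capture about 5% of hits in its first 5%, giving a lift near 1, which is the baseline the model is compared against.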
Most lift curves are surprisingly similar: a limit to human predictability?
• Study of lift curves in banking, telecom
• Best lift curves are similar
• Special point T = target percentage: Lift(T) ~ sqrt(1/T)
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999.
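The rule of thumb Lift(T) ~ sqrt(1/T) is easy to tabulate. The snippet below evaluates it for a few target percentages chosen only for illustration. It is an approximation for good models as described above, not a guarantee for any particular campaign.

```python
import math


def estimated_lift(target_fraction):
    """Rule-of-thumb lift at list depth T, where T is the target rate as a fraction."""
    return math.sqrt(1 / target_fraction)


# Illustrative target rates: 1%, 4%, and 25%.
for t in (0.01, 0.04, 0.25):
    print(f"T = {t:.0%}: estimated lift ~ {estimated_lift(t):.1f}")
```

The rarer the target, the higher the lift a good model can be expected to reach at that depth, but the growth is only with the square root of 1/T.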
[Chart: actual lift(T) and estimated lift(T) vs. 100*T%]
More recent data is more predictive!
• Real-time behavior data more predictive than
historical, demographic data
• Ad retargeting
Lesson 9: Deployment & Maintenance
• Netflix Prize winning algorithm not deployed
• Technical debt of Machine Learning
– (Google: research.google.com/pubs/pub43146.html)
“… the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.”
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Modeling in Real World vs Kaggle
• ROI of extra accuracy vs cost of maintenance
• Is the model explainable? (legal, acceptance reasons)
• Does the model discriminate on the basis of race, gender, …?
• Netflix Prize algorithm, which won $1M, was not implemented
• In the real world, simpler is usually better
Deployment: Test and Monitor
• Monitor assumptions
– Do fields have the same value distributions?
• Detect when the model is no longer valid and needs rebuilding
• Automatic model re-build
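One possible way to check whether a field still has the same value distribution is the Population Stability Index (PSI). This is a technique swapped in here for illustration; the talk does not name a specific method. The field values, the floor constant, and the function name below are assumptions.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, floor=1e-6):
    """PSI over the categories of a field: sum((a - e) * ln(a / e))."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in set(e_counts) | set(a_counts):
        # Floor the shares so a category missing on one side does not divide by zero.
        e = max(e_counts[category] / len(expected), floor)
        a = max(a_counts[category] / len(actual), floor)
        psi += (a - e) * math.log(a / e)
    return psi


# Made-up values of one field at training time and in production.
training = ["basic"] * 60 + ["plus"] * 30 + ["premium"] * 10
same_mix = ["basic"] * 120 + ["plus"] * 60 + ["premium"] * 20
shifted = ["basic"] * 30 + ["plus"] * 30 + ["premium"] * 40

print(population_stability_index(training, same_mix))
print(population_stability_index(training, shifted))
```

A PSI near zero means the mix is unchanged. A commonly quoted rule of thumb treats values above roughly 0.25 as a shift large enough to consider rebuilding the model, though the right threshold depends on the application.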
Lesson 10: Don’t just predict, optimize
• Prediction is usually just one part of making a decision
• Consider cost, frequency, latency, human behavior, etc.
• Goal: Optimization
• From Data Science to Decision Science
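A simple example of moving from prediction to decision is to combine a predicted response probability with the value of a response and the cost of a contact. The sketch below uses hypothetical scores and campaign economics; the function names and all numbers are made up for illustration.

```python
def expected_profit(p_response, value_per_response, cost_per_contact):
    """Expected profit of contacting one customer."""
    return p_response * value_per_response - cost_per_contact


def choose_contacts(scores, value_per_response, cost_per_contact):
    """Contact only customers whose expected profit is positive."""
    return [cid for cid, p in scores.items()
            if expected_profit(p, value_per_response, cost_per_contact) > 0]


# Hypothetical model scores (probability of responding) and campaign economics.
scores = {"c1": 0.30, "c2": 0.08, "c3": 0.02, "c4": 0.12}
selected = choose_contacts(scores, value_per_response=50.0, cost_per_contact=5.0)
print(selected)
```

With these numbers the break-even probability is 5/50 = 10%, so the decision threshold comes from the economics of the campaign, not from the model alone.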
Privacy in the age of Big Data
• Privacy laws are much stricter in Europe
• Individual privacy vs. benefits for all (e.g., aggregated health-care data)
• Image and face recognition (e.g., Facebook)
• Very hard to keep privacy with so many digital breadcrumbs
• Privacy vs. security (e.g., FBI vs. Apple)
• Politicians are behind the technology curve; researchers should help society and politicians make informed decisions
When Is It Ethical to Analyze a Particular Dataset?
Data Ethics Golden Rule
Don’t do with someone else’s data what you don’t want done with your data.
Data Science Now
What, Where, How
KDnuggets Polls Findings
www.KDnuggets.com/polls/
www.kdnuggets.com/2016/01/poll-analytics-data-mining-data-science-applied-2015.html
Where did you apply Analytics, Data Mining, Data Science?
Avg. number of industries: 2.7
Most popular:
- CRM
- Finance
- Banking
- Health Care
- Science
- e-commerce
Highest growth in:
- Games, 121%
- Entertainment/Music, 74%
- Social Good/Non-profit, 68%
- Finance, 42%
- Education, 30%
Data Types Analyzed/Mined
www.kdnuggets.com/polls/2014/data-types-sources-analyzed.html
Most popular:
- Table data
- Time series
- Text
- itemsets/transactions
Fastest growing:
- music/audio
- JSON
Largest Dataset Analyzed?
www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
Largest Dataset Analyzed?
Python swallowed an Elephant?
Antoine de Saint-Exupery
Largest Dataset Analyzed?
Big Data Miners: an elite group.
www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
Median is in the 11-100 GB range, a slight increase.
Largest Dataset Analyzed by Region
Big Data Miners (terabytes and petabytes): 10-25%
4 Main Languages of Data Science
www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
4 Main Languages of Data Science, 2
R vs Python
http://www.kdnuggets.com/2015/07/poll-primary-analytics-language-r-python.html
Surprising stability: 88% of R users stayed with R and 91% of Python users stayed with Python.
The share of primary R and Python users is up, while the share of Other or None is down.
Data Science Roles
Data Science Roles
• Data Analyst
• (Big) Data Engineer
• Data Scientist
• Machine Learning Researcher
• Data Science Manager/Director
• Chief Data Officer
• Company Founder
Data Science Venn Diagram, 2010
Drew Conway, 2010
LinkedIn Data Skills
LinkedIn has 334,000 Titles with “Data”
• Data Analyst 60,273
• Data Scientist 12,680
• Database Analyst 4,357
• Business Data Analyst 1,709
• Senior Data Scientist 1,691
• Sr. Data Analyst 1,131
Thanks to Lutz Finger, Director of Analytics at LinkedIn, for this custom study
LinkedIn: 4 Groups of Skills
Skills of people with “Data” in the title, grouped into dedicated clusters using the similarity of members with similar skills.
Database Management and Software
• Access Database BTEQ Cubes Data Center Data Modeling Database Admin Database Administration Database
Design Databases DB2 Embedded SQL FastExport FastLoad MDX Memcached Microsoft SQL Server MLOAD
MongoDB Multiload MySQL NoSQL OA Framework Oracle Oracle Developer Suite Oracle Discoverer Oracle
Enterprise Manager Oracle PL/SQL Development Oracle RAC Oracle SQL Developer Performance Tuning
PhpMyAdmin PL/SQL PostgreSQL RDBMS Redis Relational Databases Replication RMAN SQL SQL Server
Management Studio SQL*Plus SQL400 SQLite Stored Procedures Sybase T-SQL Teradata Toad TPT TPUMP
Machine Learning
• Computational Linguistics Data Visualization Information Retrieval Machine Learning Natural Language Processing
Research Design Sentiment Analysis Structural Bioinformatics Text Mining
Mathematics
• Algebra Applied Mathematics Calculus Differential Equations Fortran Geometry Image Analysis LabVIEW Linear
Algebra Maple Mathematica Mathematical Modeling Mathematics Matlab Monte Carlo Simulation Numerical
Analysis Numerical Simulation Operations Research Partial Differential Equations Pre-Calculus Scientific Computing
Simulations Trigonometry
Statistical Analysis and Data Mining
• A/B Testing Analytics ANOVA Business Analytics Cluster Analysis Data Analysis Data Mining Decision Trees Design
of Experiments Economic Modeling Experimental Design Factor Analysis Google Analytics JMP Linear Regression
Logistic Regression Marketing Analytics Minitab Pattern Recognition Predictive Analytics Predictive Modeling
Primary Research Questionnaire Design Questionnaires R Sampling SAS SAS Programming SDTM Secondary
Research SPSS Statistical Consulting Statistical Data Analysis Statistical Modeling Statistical Programming Statistics
Survey Research Survival Analysis Time Series Analysis Web Analytics
LinkedIn Skills
Number of skills relating to Data | Number of LinkedIn members
1 | 9,708,214
2 | 3,870,376
3 | 2,065,318
4 | 1,097,849
5 | 576,310
6 | 305,266
7 | 169,351
8 | 98,284
9 | 60,419
10 | 37,689
Data Science Skills, Updated
[Diagram labels: Database/Coding Skills; Domain/Business Expertise]
Data Analyst/BI Analyst
Glassdoor, Apr 2016: US average salary $60-70,000; 13,000 positions
[Diagram labels: Data Analyst; Database/Coding Skills]
Data Engineer
Glassdoor, Apr 2016: US salary $95,500; 40,296 jobs
France (Ingénieur … Data): 5K jobs
[Diagram labels: Data Engineer; Domain/Business Expertise]
Machine Learning Researcher
[Diagram labels: ML Researcher; Database/Coding Skills; Domain/Business Expertise]
“Unicorn” Data Scientist
Glassdoor, Apr 2016:
US salary: $113,400; jobs: 2,572
France: €43,500; jobs: 180
[Diagram labels: “Unicorn” Data Scientist; Database/Coding Skills; Domain/Business Expertise]
Data Science Manager/Director
[Diagram labels: Data Science Leader; Database/Coding Skills; Domain/Business Expertise; People Management Skills]
Company Founder
[Diagram labels: Founder; Database/Coding Skills; Domain/Business Expertise; People Management Skills + Vision]
Data Career Progression
[Diagram: career paths among BI/Data Analyst, Data Engineer, Data Scientist, Machine Learning Researcher, Data Science Manager/Director, Chief Data Officer, Chief Scientist, and Company Founder/CEO]
DATA SCIENCE JOB TRENDS
Shortage of Data Scientists?
• McKinsey (2011): shortage by 2018 in the US
– 140,000-190,000 people with deep analytical skills
– 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions.
Source: www.mckinsey.com/mgi/publications/big_data/
Data Scientist – Sexiest Job of the 21st Century?
• Thomas H. Davenport and D.J. Patil (Harvard Business Review, 2012)
“Data Scientist” - leading job trend
“Data Scientist” jobs have grown 1,700% from 2012 to 2016
Top 5 Tech Job Trends in 2016: Data Scientist, DevOps, Puppet, PaaS, Hadoop
Indeed.com/jobtrends
Attention to Detail:
[Data Scientist] != “Data Scientist”
Indeed.com/jobtrends
Data Scientist
“Data Scientist” = “data scientist”
“Data Scientist” vs Statistician
[Chart: Indeed.com job trends for “Data Scientist” vs. Statistician]
Data Scientist jobs on KDnuggets
[Chart: % Data Scientist jobs on KDnuggets, 2010 to 2015 (y-axis 0% to 40%), including Senior, Junior, Principal, Chief DS, …]
LinkedIn 25 Hot Skills
[Image: LinkedIn hot skills lists for 2015 and 2014]
Data Science Future
Big Data
• Next Industrial Revolution
• Data Science is the Engine of Big Data
Doing Old Things Better
Application areas
– Direct marketing/Customer modeling
– Recommendations
– Fraud detection
– Security/Intelligence
– Healthcare
– …
• Competition will level companies
Big Data Enables New Things !
• Google – first big success of big data
• Social networks (Facebook, Twitter, LinkedIn, …): success depends on network size, i.e. big data
• Big Data in Health-care
– image analysis, diagnosis,
– Personalized medicine
• Recommendations - Netflix streaming
New services, products, platforms
• Image recognition: Facebook uses it to decide what to show users
• Face recognition: security
• Location-based services: Tinder
• Big Data to power AI and Machine Learning
– Imagine Google DeepMind, IBM Watson, Siri in 2020?
Gartner Hype Cycle, 2012
[Chart: Gartner Hype Cycle; label: Big Data]
Gartner Hype Cycle, 2013
[Chart: Gartner Hype Cycle; label: Big Data]
Gartner Hype Cycle, 2014
[Chart: Gartner Hype Cycle; labels: Big Data, Data Science]
See http://diggdata.in/ which has 4 years of Gartner Hype Cycle
Gartner Hype Cycle, 2015
[Chart: Gartner Hype Cycle; labels: Big Data, Citizen Data Science, Machine Learning]
www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html
“Citizen” Data Science
This is Bob, our new Citizen Data Scientist.
He previously worked as a citizen dentist
and a citizen pilot.
Golden Age of Data Science, Machine Learning
• Amazing New Tools
• Very Complex Algorithms are very easy to use
• scikit-learn, IPython notebooks, etc.
• One-click deployment of TensorFlow on AWS with GPU
Data Science Automated ?
[Diagram labels: Expert Human Ability; Current Computer Ability]
Data Science Automated By 2025?
KDnuggets Poll in 2015:
51% of voters expect Data Science Automation to happen in 10 years or less -
www.kdnuggets.com/2015/05/data-scientists-automated-2025.html
Data Science Automation
“I remember when only a Deep Learning supercomputer could beat me in a Data Science competition”
Data Science Automation
KDnuggets: Software: Automated Data Science:
• AutoDiscovery from ButlerScientifics
• Automatic Business Modeler from Algolytics
• Automatic Statistician project
• DataRobot
• DMWay
• ForecastThis DSX
• FeatureLab
• Loom Systems
• machineJS: Automated machine learning
• Quill from Narrative Science
• SAP Predictive Analytics
• Savvy from Yseop
• Skytree Machine Learning Software
• Tree-based Pipeline Optimization Tool (TPOT)
Data Science Automation
• New tools make Data Scientists more productive
• Make data results more widely available
• Automate lower-level Data Science tasks
“Soft” Data Science Skills: Harder to Automate
• Curiosity
• Intuition
• Business Knowledge
• Selecting a good metric
• Posing the right question
• Presentation Skills
Data Science – still a great profession
Questions?
KDnuggets: Analytics, Big Data, Data Science
• Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html
• Email to editor1@kdnuggets.com
• Twitter: @kdnuggets
• facebook.com/kdnuggets
• LinkedIn group: KDnuggets

More Related Content

What's hot

Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data ScienceAndrew Gardner
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Data science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyData science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyPeter Kua
 
2015 data-science-salary-survey
2015 data-science-salary-survey2015 data-science-salary-survey
2015 data-science-salary-surveyAdam Rabinovitch
 
What is a Data Scientist
What is a Data Scientist What is a Data Scientist
What is a Data Scientist Experian_US
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsHugo Bowne-Anderson
 
Data Scientist: The Sexiest Job in the 21st Century
Data Scientist: The Sexiest Job in the 21st CenturyData Scientist: The Sexiest Job in the 21st Century
Data Scientist: The Sexiest Job in the 21st CenturyLyn Fenex
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist? HackerEarth
 
Data Science 101
Data Science 101Data Science 101
Data Science 101odsc
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...
How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...
How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...Galvanize
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Data Science For Social Scientists Workshop
Data Science For Social Scientists WorkshopData Science For Social Scientists Workshop
Data Science For Social Scientists WorkshopIan Hopkinson
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science processMathieu d'Aquin
 
Data_Scientist_Position_Description
Data_Scientist_Position_DescriptionData_Scientist_Position_Description
Data_Scientist_Position_DescriptionSuman Banerjee
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?Gregory Piatetsky-Shapiro
 

What's hot (20)

Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Data science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyData science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi Periasamy
 
2015 data-science-salary-survey
2015 data-science-salary-survey2015 data-science-salary-survey
2015 data-science-salary-survey
 
What is a Data Scientist
What is a Data Scientist What is a Data Scientist
What is a Data Scientist
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientists
 
Data Scientist: The Sexiest Job in the 21st Century
Data Scientist: The Sexiest Job in the 21st CenturyData Scientist: The Sexiest Job in the 21st Century
Data Scientist: The Sexiest Job in the 21st Century
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist?
 
Data Science 101
Data Science 101Data Science 101
Data Science 101
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...
How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...
How to Become a Data Scientist – By Ryan Orban, VP of Operations and Expansio...
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Data Science For Social Scientists Workshop
Data Science For Social Scientists WorkshopData Science For Social Scientists Workshop
Data Science For Social Scientists Workshop
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science process
 
Data_Scientist_Position_Description
Data_Scientist_Position_DescriptionData_Scientist_Position_Description
Data_Scientist_Position_Description
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?
 

Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro

  • 2. Data Science: Past, Present, and Future. Gregory Piatetsky-Shapiro, KDnuggets. © KDnuggets 2016. (French title: La Science des données: passé, présent et futur)
  • 3. Predicting Behavior – Key to Survival. Better prediction – better intelligence
  • 4. “Predictions”: Astrology. My May 26 Horoscope: So what if things aren't completely wonderful in your life right now? Just keep your hopes high, and your fingers crossed. … Being with the people who make you feel good about yourself will help keep your thoughts bright, so get together with your closest friend as soon as you can. www.astrology.com/horoscope/daily/aries.html
  • 5. “Predictions”: Turkish Coffee Grinds. If a big chunk of the coffee grounds falls down on the saucer then it is taken as the first positive sign of your reading. “Trouble and worries are leaving you”.
  • 6. Pundits' “Predictions” • Nate Silver FiveThirtyEight.com prediction for Trump winning Republican nomination: • Aug 2015: 2% • Sep 2015: 5% • Nov 2015: 6% • Jan 2016: 12% • May 2016: 99%
  • 7. Desire to Predict – Deep Human Trait. Lessons: • People are hard-wired to see patterns • People want predictions • Human intuition does not work well on large-scale data or for understanding probability • A good story is essential to a convincing prediction (whether true or false)
  • 8. Data Science: Data-Driven, Scientific approach to prediction and data analysis
  • 9. Outline • Intro, Data Science History and Terms • 10 Real-World Data Science Lessons • Data Science Now: Polls & Trends • Data Science Roles • Data Science Job Trends • Data Science Future
  • 10. What do we call it? • Statistics • Data Mining • Knowledge Discovery in Data (KDD) • Predictive Analytics • Data Analytics • Data Science • …? Core Idea: Finding Useful Patterns in Data
  • 11. Pre-history (1800-2008): Statistics. From Google Ngram Viewer, English-language books, case-insensitive search. Other languages need to be considered for the full picture. “Statistics” is the biggest term in the 20th century; “analytics” is used increasingly through the 20th century; “data mining” appears in the late 1990s.
  • 12. French Books, 1800-2008: Statistiques vs Mathematiques
  • 13. “Data Mining” Surges in 1996. Chart terms: Analytics, Data Mining. Annotations: KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal; Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Google N-grams, case-insensitive search, smoothing 1
  • 14. Earliest use of “data mining”: 1962. Source: Google Books. After eliminating many “following data. Mining cost is ” examples, which refer to mining of minerals, and books from “1958” that have a CD attached (errors in book year), the earliest “data mining” reference I found is from 1962.
  • 15. Very Recent History Using Google Trends
  • 16. Google Trends, 2005-2016: After 2006, Analytics > Data Mining. Global – all regions
  • 17. >50% of “Analytics” searches are for “Google Analytics”. Google Analytics introduced, Dec 2005
  • 18. Google Trends, 2005-2016. Chart terms: data science, analytics - Google, big data, data mining
  • 19. Google Trends, 2005-2016. 2012: Analytics down, Big Data up
  • 20. Google Trends, 2005-2016. 2013: Data Science grows
  • 21. Google Trends: Machine Learning, Data Science, Deep Learning (2009-2015)
  • 22. Google Trends: Machine Learning. Machine Learning ~ “Machine Learning”
  • 23. Google Trends: Data Science (2009-2015). [Data Science] != “Data Science”. Lesson: Examine assumptions carefully
  • 24. Regional Interest in “Data Science” in 2015 (Google Trends). Note: search for “Data Science” is different from [Data Science]
  • 25. KDnuggets Audience by Region, Q1 2016
  • 26. Data Science History • < 1900 - Statistics • 1960s - Data Mining = bad activity, data “dredging” • 1990 - “Data Mining” is good, surges in 1996 • 2003 - “Data Mining” peaks (bad in press, invasion of privacy?), slowly declines, but still popular • 2006 - Google Analytics • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data • 2014 - Data Science • 2015 - Deep Learning • 2018 - ??
  • 27. 10 Real-World Lessons from the Art & Practice of Data Science & Data Mining
  • 28. Lesson 1: It is an Iterative, Circular Process. Waterfall model does NOT work for Data Science
  • 29. CRISP-DM: Iterative, Circular Process. Data Mining Process – CRISP-DM, 1998: 1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Modeling 5. Evaluation 6. Deployment. See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
  • 30. Academic Data Science Process (Harvard, 2013). See www.kdnuggets.com/2016/03/data-science-process-rediscovered.html
  • 31. Machine Learning Workflow, MS Azure. See www.kdnuggets.com/2016/04/developers-need-know-about-machine-learning.html and blogs.msdn.microsoft.com/continuous_learning/2014/11/15/end-to-end-predictive-model-in-azureml-using-linear-regression/
  • 32. Lesson 2: Data Engineering Takes The Bulk of Time • Building Machine Learning/Predictive Models is the key (and most fun) part, but only a small part of the whole process • 60-80% (?) of the time is spent on Data Preparation/Engineering
  • 33. Competitions are different. March Machine Learning Mania 2016, Winner's Interview: 1st Place, Miguel Alomar. How #MachineLearning @Kaggle winner spent time: 35% read forums, 25% build models, 25% evaluate results, 15% data preparation. https://twitter.com/kdnuggets/status/730417186167263232 http://blog.kaggle.com/2016/05/10/march-machine-learning-mania-2016-winners-interview-1st-place-miguel-alomar/
  • 34. Lesson 3: Question Assumptions. Problem: the Measurement deciles are not uniform. Decile 1 is too rare, Decile 0 too frequent? Why? (* Not actual data)
  • 35. Mass Spectrometry. Mass spectrometry (MS) is an analytical technique that ionizes chemical species and sorts the ions based on their mass-to-charge ratio. It can produce a large number (~20,000) of m/z values for a sample. Goal: find biomarkers for disease, test, condition
  • 36. Question Assumptions. Instead of Measurement deciles, examine actual ranges, including 0. Nothing between 1 and 14. Value 0 is too frequent. Why? (* Not actual data)
  • 37. Question Assumptions. Instead of Measurement deciles, examine actual ranges, including 0. Nothing between 1 and 14. Value 0 is too frequent. Why? Someone added a rule to round raw measurement values below 15 to zero. (* Not actual data)
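The check described in slides 36 and 37 (look at the actual values rather than at deciles) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements are made up, the helper names `value_gaps` and `most_common_share` are hypothetical, and the rounding rule is re-created here just to produce data with the same symptom.

```python
from collections import Counter

def value_gaps(values, min_gap=10):
    """Return (low, high) pairs of adjacent distinct values at least min_gap apart."""
    distinct = sorted(set(values))
    return [(a, b) for a, b in zip(distinct, distinct[1:]) if b - a >= min_gap]

def most_common_share(values):
    """Return the most frequent value and the fraction of records holding it."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# Hypothetical raw readings (not actual data).
raw = [3, 7, 12, 14, 15, 17, 20, 9, 24, 28, 2, 16, 31, 11, 19, 35]
# Re-creation of the hidden rule: raw values below 15 are stored as 0.
stored = [0 if v < 15 else v for v in raw]

gaps = value_gaps(stored)                          # empty stretches in the value range
top_value, top_share = most_common_share(stored)   # suspiciously frequent value
print(gaps, top_value, top_share)
```

A gap that starts at 0 together with an unusually large share of zeros is the kind of pattern that should prompt a question about how the data was recorded.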
  • 38. “The best data scientists have one thing in common – unbelievable curiosity.” DJ Patil, first US Chief Data Scientist, April 2016. http://www.sciencefriday.com/articles/10-questions-for-the-nations-first-chief-data-scientist
  • 39. Lesson 4: Focus on the Right Metric - Actionable • Consumer: Churn may depend on age, region, usage, and rate plan. The rate plan is the easiest to change. • Uplift Modeling in Marketing and Politics: focus on persuadables
  • 40. Right Metric: Uplift Modeling. Don’t model whether the consumer will buy. Model whether the consumer will buy in response to an offer
  • 41. Right Metric: Uplift Modeling. From Eric Siegel's presentation at PAW, 2011. In the Obama 2012 campaign: www.thefiscaltimes.com/Articles/2013/01/21/The-Real-Story-Behind-Obamas-Election-Victory
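A minimal sketch of the uplift idea, assuming a randomized test in which some consumers received the offer (treated) and some did not (control). Uplift for a segment is taken here as the difference between the two response rates. The segment names and all outcome data are hypothetical, and real uplift models are considerably more involved than this comparison.

```python
def response_rate(outcomes):
    """Fraction of consumers who bought (1 = bought, 0 = did not)."""
    return sum(outcomes) / len(outcomes)

def uplift(treated, control):
    """Extra purchase rate attributable to the offer in one segment."""
    return response_rate(treated) - response_rate(control)

# Hypothetical segments with made-up outcomes.
segments = {
    "already_buying": {"treated": [1, 1, 1, 1, 0], "control": [1, 1, 1, 1, 0]},
    "persuadable":    {"treated": [1, 1, 1, 0, 0], "control": [1, 0, 0, 0, 0]},
    "not_buying":     {"treated": [0, 0, 0, 0, 0], "control": [0, 0, 0, 0, 0]},
}

uplift_by_segment = {
    name: uplift(data["treated"], data["control"]) for name, data in segments.items()
}
best_segment = max(uplift_by_segment, key=uplift_by_segment.get)
print(uplift_by_segment, best_segment)
```

Note that the "already_buying" segment has the highest purchase rate but zero uplift, so a model that only predicts who will buy would target the wrong group.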
  • 42. Lesson 5: Be a Fox, not a Hedgehog. Read Isaiah Berlin's 1953 essay, The Hedgehog and the Fox. A fox knows many things, but a hedgehog knows one important thing.
  • 43. Lesson 5: Modeling. No Free Lunch Theorem – no method is universally the best (Wolpert). In Kaggle competitions, there are 2 ways to win (Anthony Goldbloom, 2016): • Handcrafted feature engineering • Or Deep Learning Neural Networks www.kdnuggets.com/2016/01/anthony-goldbloom-secret-winning-kaggle-competitions.html • XGBoost – winning method in many recent Kaggle competitions • Ensemble methods. For Structured Data (Sebastian Raschka): • SVM (Support Vector Machines) for smaller data • Random Forests – more data, more automated www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html For Unstructured Data: • Deep Learning
  • 44. Lesson 6: Avoid Overfitting. http://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html Many examples at http://tylervigen.com/spurious-correlations
  • 45. Avoid Overfitting. “Irreproducible” results are a BIG problem in social sciences and medicine: see John P. A. Ioannidis's famous paper, Why Most Published Research Findings Are False (PLoS Medicine, 2005). Due to • Small samples • Testing too many hypotheses • Confirmation bias (explicit or implicit) • Poor training
  • 46. How to Avoid Overfitting • If it is too good to be true, it probably is • Find the simplest possible hypothesis • Adjust the False Discovery Rate • Randomization Testing • Nested cross-validation (train, test, holdout) • Regularization (adding a penalty for complexity) www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
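One of the remedies listed on slide 46, adjusting the False Discovery Rate, can be illustrated with the Benjamini-Hochberg procedure. The slides do not name a specific method, so the choice of Benjamini-Hochberg is an assumption, as are the p-values, which are invented for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return sorted indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its stepped threshold.
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values from testing ten hypotheses on the same data set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

naive = [i for i, p in enumerate(p_values) if p < 0.05]  # no correction
adjusted = benjamini_hochberg(p_values, q=0.05)          # FDR-controlled
print(len(naive), len(adjusted))
```

With these made-up numbers the uncorrected threshold accepts five findings, while the adjusted procedure keeps only the two strongest, which is the effect of accounting for "testing too many hypotheses".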
  • 47. Lesson 7: Tell a story • Combine facts into a story • Combine visual and text presentation • Explanation gives credibility • Dynamic / Interactive • Examples: Kefir, Google Analytics, Quill
  • 48. KEFIR (KEy FInding Reporter), 1994 • Overview report www.kdnuggets.com/data_mining_course/kefir/overview.htm • Inpatient admissions www.kdnuggets.com/data_mining_course/kefir/s2.htm
  • 49. Quill report for KDnuggets • Sessions Stay Flat, But Way Higher Than 12-Month Weekly Average • Sessions remained flat compared to the prior week. The 121,040 sessions, however, were above your 85,105-session weekly average for the year. Your site's total pageviews stayed flat last week at 206,124, while pages per session grew less than a percent to 1.7. That's equal to your weekly average for the year. • Among all your pages, Analytics, Data Mining, and Data Science had both the highest bounce rate (43%) and the most pageviews (8,734) last week.
  • 50. La Diseuse de bonne aventure (The Fortune Teller), Caravaggio, 1595 (Louvre). Beware of fortune tellers!
  • 51. Lesson 8: Limits to Predicting Human Behavior? • Inherent randomness, complexity in human behavior • Individual predictions have limited accuracy (but can still be better than random and very useful for consumer analytics) • Aggregate predictions (e.g. who will win the election) are more accurate, because individual randomness cancels out
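
The last bullet above can be checked with a small Python calculation. The sketch assumes, purely for illustration, that every voter independently picks candidate A with the same probability p = 0.55; real voters are neither independent nor identical, and the function name is hypothetical.

```python
import math


def majority_probability(n_voters, p):
    """P(more than half of n_voters pick A) when each picks A independently with prob. p."""
    total = 0.0
    for k in range(n_voters // 2 + 1, n_voters + 1):
        # Binomial term computed in log space to avoid overflow for large n.
        log_term = (math.lgamma(n_voters + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_voters - k + 1)
                    + k * math.log(p) + (n_voters - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total


# Predicting "A" for a single voter is right 55% of the time.
single = majority_probability(1, 0.55)
# Predicting "A wins" among 1001 such voters is right far more often.
aggregate = majority_probability(1001, 0.55)
```

Under these assumptions the individual prediction is barely better than a coin flip, while the aggregate one is nearly certain, which is the sense in which individual randomness cancels out.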
  • 52. Example: Netflix Prize, 2006 • The most advanced algorithms were only a few percent better than basic algorithms. See Gregory Piatetsky, “Big Data: Hype & Reality”, Harvard Business Review 2012, https://hbr.org/2012/10/big-data-hype-and-reality/
  • 53. Direct Marketing Lift: Random and Model-sorted Lists [Chart: CPH (Cumulative Pct Hits) vs. Pct list, for Random and Model-sorted lists] 5% of random list have 5% of hits; 5% of model-score ranked list have 21% of hits. Lift(5%) = 21%/5% = 4.2
  • 54. Most lift curves are surprisingly similar: a limit to human predictability? Study of lift curves in banking, telecom. Best lift curves are similar. Special point T = Target percentage: Lift(T) ~ sqrt(1/T). G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, in Proceedings of KDD-99 Conference, ACM Press, 1999. [Chart: Actual lift(T) and Est. lift(T) vs. 100*T%]
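
The lift arithmetic on the two slides above can be reproduced with a short Python sketch. The function names and the synthetic list are hypothetical; the list is built only so that the top 5% holds 21% of the hits, matching the slide's numbers. The sketch reads T as the fraction of targets in the population, which is an interpretation of the slide.

```python
import math


def lift_at(scores, labels, fraction):
    """Lift in the top `fraction` of a list ranked by model score (highest first)."""
    n = len(scores)
    k = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:k])
    pct_hits = hits_in_top / sum(labels)   # cumulative share of all hits
    pct_list = k / n                       # share of the list contacted
    return pct_hits / pct_list


def estimated_lift(target_fraction):
    """Rule of thumb from the slide: Lift(T) ~ sqrt(1/T)."""
    return math.sqrt(1.0 / target_fraction)


# Synthetic list of 2000 prospects with 100 hits in total:
# 21 hits among the 100 best-scored prospects, 79 among the rest.
labels = [1] * 21 + [0] * 79 + [1] * 79 + [0] * 1821
scores = list(range(2000, 0, -1))          # already sorted best to worst

observed = lift_at(scores, labels, 0.05)   # 21% / 5%
rule_of_thumb = estimated_lift(0.05)       # sqrt(20)
```

For this list the observed lift is 4.2 and the rule of thumb gives about 4.47.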
  • 55. More recent data is more predictive! • Real-time behavior data more predictive than historical, demographic data • Ad retargeting
  • 56. Lesson 9: Deployment & Maintenance • Netflix Prize winning algorithm not deployed: “… the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.” http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html • Technical debt of Machine Learning (Google, research.google.com/pubs/pub43146.html)
  • 57. Modeling in Real World vs Kaggle • ROI of extra accuracy vs cost of maintenance • Is model explainable? (legal, acceptance reasons) • Does model discriminate on basis of race, gender, …? • Netflix Prize algorithm which won $1M – not implemented • In real-world, simpler is usually better
  • 58. Deployment: Test and Monitor • Monitor assumptions – do fields have the same value distributions? • Detect when model is no longer valid, needs rebuilding • Automatic model re-build
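
One common way to check whether a field still has the same value distribution is the Population Stability Index (PSI). The slide does not name a method, so PSI is a swapped-in technique here. The Python sketch below works on categorical (or pre-binned) values; the function name, the sample data, and the 0.25 alert threshold (a widely quoted rule of thumb) are assumptions.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between the training-time values and the values seen in production."""
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in set(expected) | set(actual):
        # Floor the shares at eps so unseen categories do not produce log(0).
        e = max(expected_counts[category] / len(expected), eps)
        a = max(actual_counts[category] / len(actual), eps)
        psi += (a - e) * math.log(a / e)
    return psi


training_values = ["web"] * 50 + ["store"] * 50   # hypothetical field at training time
same_mix = ["web"] * 50 + ["store"] * 50
shifted_mix = ["web"] * 90 + ["store"] * 10

psi_stable = population_stability_index(training_values, same_mix)
psi_shifted = population_stability_index(training_values, shifted_mix)
needs_rebuild = psi_shifted > 0.25   # rule-of-thumb threshold, an assumption
```

A monitoring job could compute this per input field on a schedule and trigger the model rebuild mentioned on the slide when the index crosses the chosen threshold.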
  • 59. Lesson 10: Don’t just predict, optimize • Prediction is usually just one part of making a decision • Consider cost, frequency, latency, human behavior, etc • Goal: Optimization • From Data Science to Decision Science
  • 60. Privacy in the age of Big Data • Privacy laws much stricter in Europe • Individual Privacy vs Benefits for all (e.g. aggregated health-care data) • Image and Face recognition (e.g. Facebook) • Very hard to keep privacy with so many digital breadcrumbs • Privacy vs Security (e.g. FBI vs Apple) • Politicians are behind the technology curve – researchers should help society and politicians make informed decisions
  • 61. When Is It Ethical To Analyze A Particular Dataset?
  • 62. Data Ethics Golden Rule: Don’t do with someone else’s data what you don’t want done with your data
  • 63. Data Science Now: What, Where, How. KDnuggets Polls Findings www.KDnuggets.com/polls/
  • 64. Where did you apply Analytics, Data Mining, Data Science? www.kdnuggets.com/2016/01/poll-analytics-data-mining-data-science-applied-2015.html Avg. Number of Industries: 2.7. Most Popular: CRM, Finance, Banking, Health Care, Science, e-commerce. Highest growth in: Games, 121%; Entertainment / Music, 74%; Social Good/Non-profit, 68%; Finance, 42%; Education, 30%
  • 65. Data Types Analyzed/Mined www.kdnuggets.com/polls/2014/data-types-sources-analyzed.html Most popular: table data, time series, text, itemsets/transactions. Fastest growing: music/audio, JSON
  • 66. Largest Dataset Analyzed? www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html
  • 67. Largest Dataset Analyzed? Python swallowed an Elephant? Antoine de Saint-Exupery
  • 68. Largest Dataset Analyzed? Big Data Miners – elite group. www.kdnuggets.com/2015/08/largest-dataset-analyzed-more-gigabytes-petabytes.html Median in 11-100 GB range, slight increase.
  • 69. Largest Dataset Analyzed by Region. Big Data Miners: TeraBytes and Petabytes 10-25%
  • 70. 4 Main Languages of Data Science www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
  • 71. 4 Main Languages of Data Science, 2
  • 72. R vs Python http://www.kdnuggets.com/2015/07/poll-primary-analytics-language-r-python.html Surprising Stability: 88% of R users stayed with R and 91% of Python users stayed with Python. % of primary R, Python users up, while % Other or None down.
  • 73. Data Science Roles
  • 74. Data Science Roles • Data Analyst • (Big) Data Engineer • Data Scientist • Machine Learning Researcher • Data Science Manager/Director • Chief Data Officer • Company Founder
  • 75. Data Science Venn Diagram (Drew Conway, 2010)
  • 76. LinkedIn Data Skills. LinkedIn has 334,000 Titles with “Data” • Data Analyst 60,273 • Data Scientist 12,680 • Database Analyst 4,357 • Business Data Analyst 1,709 • Senior Data Scientist 1,691 • Sr. Data Analyst 1,131 Thanks to Lutz Finger, Director of Analytics at LinkedIn, for this custom study
  • 77. LinkedIn: 4 Groups of Skills. Skills of people with “Data” in the title grouped into dedicated clusters, using similarity of members with similar skills. • Database Management and Software: Access Database, BTEQ, Cubes, Data Center, Data Modeling, Database Admin, Database Administration, Database Design, Databases, DB2, Embedded SQL, FastExport, FastLoad, MDX, Memcached, Microsoft SQL Server, MLOAD, MongoDB, Multiload, MySQL, NoSQL, OA Framework, Oracle, Oracle Developer Suite, Oracle Discoverer, Oracle Enterprise Manager, Oracle PL/SQL Development, Oracle RAC, Oracle SQL Developer, Performance Tuning, PhpMyAdmin, PL/SQL, PostgreSQL, RDBMS, Redis, Relational Databases, Replication, RMAN, SQL, SQL Server Management Studio, SQL*Plus, SQL400, SQLite, Stored Procedures, Sybase, T-SQL, Teradata, Toad, TPT, TPUMP • Machine Learning: Computational Linguistics, Data Visualization, Information Retrieval, Machine Learning, Natural Language Processing, Research Design, Sentiment Analysis, Structural Bioinformatics, Text Mining • Mathematics: Algebra, Applied Mathematics, Calculus, Differential Equations, Fortran, Geometry, Image Analysis, LabVIEW, Linear Algebra, Maple, Mathematica, Mathematical Modeling, Mathematics, Matlab, Monte Carlo Simulation, Numerical Analysis, Numerical Simulation, Operations Research, Partial Differential Equations, Pre-Calculus, Scientific Computing, Simulations, Trigonometry • Statistical Analysis and Data Mining: A/B Testing, Analytics, ANOVA, Business Analytics, Cluster Analysis, Data Analysis, Data Mining, Decision Trees, Design of Experiments, Economic Modeling, Experimental Design, Factor Analysis, Google Analytics, JMP, Linear Regression, Logistic Regression, Marketing Analytics, Minitab, Pattern Recognition, Predictive Analytics, Predictive Modeling, Primary Research, Questionnaire Design, Questionnaires, R, Sampling, SAS, SAS Programming, SDTM, Secondary Research, SPSS, Statistical Consulting, Statistical Data Analysis, Statistical Modeling, Statistical Programming, Statistics, Survey Research, Survival Analysis, Time Series Analysis, Web Analytics
  • 78. LinkedIn Skills. Number of skills relating to Data: number of LinkedIn members • 1: 9,708,214 • 2: 3,870,376 • 3: 2,065,318 • 4: 1,097,849 • 5: 576,310 • 6: 305,266 • 7: 169,351 • 8: 98,284 • 9: 60,419 • 10: 37,689
  • 79. Data Science Skills, Updated [Diagram labels: Database, Coding Skills; Domain/Business Expertise]
  • 80. Data Analyst/BI Analyst [Diagram labels: Database, Coding Skills; Domain/Business Expertise] Glassdoor, Apr 2016: US Avg Salary: $60-70,000; Positions: 13,000
  • 81. Data Engineer [Diagram labels: Database, Coding Skills; Domain/Business Expertise] Glassdoor, Apr 2016: US Salary: $95,500; Jobs: 40,296. Ingénieur … Data, France: 5K Jobs
  • 82. Machine Learning Researcher [Diagram labels: Database, Coding Skills; Domain/Business Expertise]
  • 83. “Unicorn” Data Scientist [Diagram labels: Database, Coding Skills; Domain/Business Expertise] Glassdoor, Apr 2016: US Salary: $113,400; Jobs: 2572. France: €43,500; Jobs: 180
  • 84. Data Science Manager/Director (Data Science Leader) [Diagram labels: Database, Coding Skills; Domain/Business Expertise; People Management Skills]
  • 85. Company Founder [Diagram labels: Database, Coding Skills; Domain/Business Expertise; People Management Skills + Vision]
  • 86. Data Career Progression [Diagram labels: BI/Data Analyst; Data Engineer; Data Scientist; Machine Learning Researcher; Data Science Manager/Director; Company Founder/CEO; Chief Data Officer; Chief Scientist]
  • 87. DATA SCIENCE JOB TRENDS
  • 88. Shortage of Data Scientists? • McKinsey (2011): shortage by 2018 in US – 140-190,000 people with deep analytical skills – 1.5 M managers/analysts with the know-how to use the analysis of big data to make effective decisions. Source: www.mckinsey.com/mgi/publications/big_data/
  • 89. Data Scientist – Sexiest Job of the 21st Century? • Thomas H. Davenport and D.J. Patil (Harvard Business Review, 2012)
  • 90. “Data Scientist” - leading job trend. “Data Scientist” job has grown 1,700% from 2012 to 2016. Top 5 Tech Job Trends in 2016: Data Scientist, Devops, Puppet, PaaS, Hadoop? Indeed.com/jobtrends
  • 91. Attention to Detail: [Data Scientist] != “Data Scientist”. Indeed.com/jobtrends [Chart legend: Data Scientist; “Data Scientist” = “data scientist”]
  • 92. “Data Scientist” vs Statistician. Indeed.com job trends [Chart legend: “Data Scientist”; Statistician]
  • 93. Data Scientist jobs on KDnuggets [Chart: % Data Scientist jobs on KDnuggets, 2010-2015, y-axis 0% to 40%] Including Senior, Junior, Principal, Chief DS, …
  • 94. LinkedIn 25 Hot Skills: 2015, 2014
  • 96. Big Data • Next Industrial Revolution • Data Science is the Engine of Big Data
  • 97. Doing Old Things Better. Application areas – Direct marketing/Customer modeling – Recommendations – Fraud detection – Security/Intelligence – Healthcare – … • Competition will level companies
  • 98. Big Data Enables New Things! • Google – first big success of big data • Social networks (Facebook, Twitter, LinkedIn, …) success depends on network size, i.e. big data • Big Data in Health-care – image analysis, diagnosis – Personalized medicine • Recommendations - Netflix streaming
  • 99. New services, products, platforms • Image recognition – FB uses to decide what to show users • Face recognition - security • Location-based services – Tinder • Big Data to Power AI and Machine Learning – Imagine Google DeepMind, IBM Watson, Siri in 2020?
  • 100. Gartner Hype Cycle, 2012 [Chart label: Big Data]
  • 101. Gartner Hype Cycle, 2013 [Chart label: Big Data]
  • 102. Gartner Hype Cycle, 2014 [Chart labels: Big Data; Data Science] See http://diggdata.in/ which has 4 years of Gartner Hype Cycle
  • 103. Gartner Hype Cycle, 2015 www.kdnuggets.com/2015/08/gartner-2015-hype-cycle-big-data-is-out-machine-learning-is-in.html [Chart labels: Big Data; Citizen Data Science; Machine Learning]
  • 104. “Citizen” Data Science. This is Bob, our new Citizen Data Scientist. He previously worked as a citizen dentist and a citizen pilot.
  • 105. Golden Age of Data Science, Machine Learning • Amazing New Tools • Very Complex Algorithms are very easy to use • scikit-learn, iPython notebooks, etc • One-Click deployment of TensorFlow on AWS with GPU
  • 106. Data Science Automated? [Chart labels: Expert Human Ability; Current Computer Ability]
  • 107. Data Science Automated? [Chart label: Expert Human Ability]
  • 108. Data Science Automated By 2025? KDnuggets Poll in 2015: 51% of voters expect Data Science Automation to happen in 10 years or less - www.kdnuggets.com/2015/05/data-scientists-automated-2025.html
  • 109. Data Science Automation. I remember when only a Deep Learning supercomputer could beat me in a Data Science competition
  • 110. Data Science Automation. KDnuggets: Software: Automated Data Science: • AutoDiscovery from ButlerScientifics • Automatic Business Modeler from Algolytics • Automatic Statistician project • DataRobot • DMWay • ForecastThis DSX • FeatureLab • Loom Systems • machineJS: Automated machine learning • Quill from Narrative Science • SAP Predictive Analytics • Savvy from Yseop • Skytree Machine Learning Software • Tree-based Pipeline Optimization Tool (TPOT)
  • 111. Data Science Automation • New tools make Data Scientists more productive • Make data results more widely available • Automate lower-level Data Science tasks
  • 112. “Soft” Data Science Skills Harder to Automate • Curiosity • Intuition • Business Knowledge • Selecting a good metric • Posing the right question • Presentation Skills. Data Science – still a great profession
  • 113. Questions? KDnuggets: Analytics, Big Data, Data Science • Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html • Email to editor1@kdnuggets.com • Twitter: @kdnuggets • facebook.com/kdnuggets • LinkedIn group: KDnuggets

Editor's Notes

  1. Churn: the best algorithms for predicting churn have a lift of 5-7, i.e. 5-7 times better than random. Behavioral advertising: 2-3% CTR, 10 times better than random.
  2. The future is bright for Big Data, but use caution when evaluating claims.