3. Out of all Data Science Projects…
Only this many succeed.
Gartner says 80% of analytics insights will not deliver business outcomes through 2022, and 80% of AI projects will “remain alchemy, run by wizards” through 2020.
6. Why Data Science Projects Fail?
Initiation
• Inadequate research
• Starting with the wrong questions
• Not addressing the root cause
• Initiating a data science project because of a blog
• Trying to take on too large a first project
People and Process
• Lack of diverse Subject Matter Experts
• Lacking an experienced data science leader
• Limited business understanding
• Lack of a standardized data science process
• Failing to communicate the value of data science
Solution Design
• The solutions are too complex
• Forming conclusions before data scientists start
• Poorly designed models
• Failing to provide actionable insights
• Using technologies because they are cool
Data Access
• Lack of access to data
• Using faulty / bad data
• Having a data scientist build their own ETLs
• Relying on Excel as the main data storage
• Big data silos or vendor-owned data!
Data Fallacies
• Simpson’s Paradox
• Setting wrong performance measures
• McNamara Fallacy
• Overfitting
• Data Dredging
7. Availability of right data
More Data Beats A Cleverer Algorithm!
You need 10 times as many examples as degrees of freedom in the model.
What model should I use?
How much training data should I gather?
How much data is required for algorithms?
IT DEPENDS!
• More features may help compensate for having less data.
• Correctly tagged data may be more useful than a large untagged dataset.
Small Data problems
• Overfitting becomes much harder to avoid.
• Outliers become much more dangerous.
• Noise becomes a real issue!
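The "10 times as many examples as degrees of freedom" rule of thumb can be turned into a quick sanity check before data collection starts. The sketch below is only an illustration: the function names are hypothetical, and counting one degree of freedom per coefficient (plus the intercept) is an assumption that holds for a plain linear model but not for every model family.

```python
def linear_model_dof(n_features: int, fit_intercept: bool = True) -> int:
    """Degrees of freedom of a plain linear model: one per coefficient (assumption)."""
    return n_features + (1 if fit_intercept else 0)


def min_training_examples(degrees_of_freedom: int, factor: int = 10) -> int:
    """Rule-of-thumb lower bound: `factor` examples per degree of freedom."""
    if degrees_of_freedom < 1:
        raise ValueError("degrees_of_freedom must be at least 1")
    return factor * degrees_of_freedom


# A linear model with 20 features and an intercept has 21 degrees of freedom.
needed = min_training_examples(linear_model_dof(20))
print(f"Gather at least {needed} labelled examples")
```

Treat the result as a floor, not a target: as the slide says, the real answer is "it depends" on noise, label quality, and the model chosen.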
8. Setting The Right Performance Measure
Regression
o MSPE
o MSAE
o R-squared
o Adjusted R-squared
Classification
o Precision-Recall
o ROC-AUC
o Accuracy
o Log-Loss
Unsupervised Models / Others
o Rand Index
o Mutual Information
o CV Error
o Heuristic methods to find K
o BLEU Score (NLP)
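Why the choice of measure matters is easiest to see on imbalanced data. The following is a minimal sketch with made-up labels (5 positives out of 100): a model that always predicts the negative class scores high on accuracy while detecting nothing. Returning 0.0 when a ratio is undefined is a convention chosen here for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0  # 0.0 when nothing was flagged


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0  # 0.0 when there are no positives


# Made-up data: 5 positives, 95 negatives, and a model that never flags anything.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

print("accuracy :", accuracy(y_true, always_negative))
print("precision:", precision(y_true, always_negative))
print("recall   :", recall(y_true, always_negative))
```

Here accuracy looks excellent while recall is zero, which is why a classification project should fix its measure before modelling starts.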
9. Classification of legal documents using Data Science
Themes: Availability of right data, Right Performance Measure
Objective
• Automated classification of legal documents from 3000+ government web pages
Data Used
• Unstructured data
• Training data: more than 90000 manually classified documents
Result
• 95% overall accuracy
• More than 85% recall
Algorithms And Technologies
• Naïve Bayes
• SVM
• Rocchio
• Technologies: Python and Redis
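To make one of the listed algorithms concrete, here is a toy sketch of a Rocchio (nearest-centroid) text classifier. It is not the project's implementation: the whitespace tokenizer, the raw term-frequency weighting, and the four training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def train_rocchio(labeled_docs):
    """Return one centroid (average term-frequency vector) per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(tokenize(text))
        counts[label] += 1
    return {
        label: {term: freq / counts[label] for term, freq in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = Counter(tokenize(text))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training sentences, standing in for manually classified documents.
training = [
    ("the court issued a judgment in the appeal", "judgment"),
    ("appeal dismissed by the court judgment", "judgment"),
    ("the ministry issued a notification on tax rates", "notification"),
    ("notification of new tax rules by the ministry", "notification"),
]
centroids = train_rocchio(training)
print(classify("court judgment on the appeal", centroids))
print(classify("ministry notification on tax", centroids))
```

A real pipeline would typically add TF-IDF weighting and stop-word removal; the point here is only that each class is summarized by one averaged vector.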
10. Milking the bull – When data science is going to fail!
Beware of data issues that can lead to failure
• Images and text with very few features
• Inaccurately tagged data
• Low quantity data
• Absence of actual predictors
• Time series data without features
11. Overfitting
The model does not categorize the data correctly because of too much detail and noise.
A statistical model is said to be overfitted when it fits its training data too closely. Instead of learning only the underlying pattern, it also learns from the noise and inaccurate data entries, so it performs poorly on new data.
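Overfitting can be shown in a few lines by comparing a model that memorizes its training data with a simple one. In this sketch the data-generating process (y = 2x plus Gaussian noise), the sample sizes, and the seed are all made up; the "memorizer" is a 1-nearest-neighbour lookup chosen because it reproduces every training point exactly, noise included.

```python
import random


def make_data(n, rng, noise=1.0):
    """Sample points from y = 2x + Gaussian noise (a made-up process)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, noise)) for x in xs]


def fit_line(data):
    """Ordinary least-squares fit of a straight line."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """1-nearest-neighbour: predict the y of the closest training x."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
train = make_data(200, rng)
test = make_data(1000, rng)

line = fit_line(train)
memorizer = fit_memorizer(train)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name:9s} train MSE = {mse(model, train):.3f}  test MSE = {mse(model, test):.3f}")
```

The memorizer's training error is zero because it has learned the noise as well as the trend; its error on unseen points is where that shows up.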
12. Production forecasting for manufacturers
Themes: Milking The Bull, Overfitting
Objective
• Predicting daily production for a manufacturing company
Data Used
• 6 months of hourly production numbers
• Monthly targets not available
• Labor details unreliable
• Machine details not available
Result
• Patterns found in subsets of data not generalizing
• No useful prediction
Algorithms And Technologies
• R
• ARIMA, Linear regression, CatBoost, and Random Forests with feature engineering
13. How companies use hammers to kill mosquitos!
Sexy technologies don’t guarantee success; appropriate technologies do!
14. Data must be clean or there should be a way to clean data.
Better Data > Efficient Algorithms
Domain understanding
Data Owner interactions
Acceptance of missing values
Data Profiling automation
• Remove Unwanted observations
• Fix Structural Errors
• Filter Unwanted Outliers
• Handle Missing Data
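The four cleaning steps above can be sketched as one small function. Everything specific here is an assumption for illustration: the record layout (`id`, `region`, `amount`), the valid range used to filter outliers, and median imputation as the way to handle missing values. In practice those choices come from the domain understanding and data-owner conversations the slide mentions.

```python
from statistics import median


def clean(records, valid_range=(0.0, 1000.0)):
    """Apply the four cleaning steps to a list of {'id', 'region', 'amount'} dicts."""
    # 1. Remove unwanted observations: drop repeated ids, keeping the first.
    seen, unique = set(), []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            unique.append(dict(rec))  # copy so the input is left untouched

    # 2. Fix structural errors: normalise inconsistent category labels.
    for rec in unique:
        rec["region"] = rec["region"].strip().lower()

    # 3. Filter unwanted outliers: drop amounts outside the agreed valid range.
    low, high = valid_range
    kept = [r for r in unique if r["amount"] is None or low <= r["amount"] <= high]

    # 4. Handle missing data: flag the gap, then impute the median of known values.
    known = [r["amount"] for r in kept if r["amount"] is not None]
    fill = median(known) if known else None
    for rec in kept:
        rec["amount_missing"] = rec["amount"] is None
        if rec["amount_missing"]:
            rec["amount"] = fill
    return kept


# Made-up records showing one instance of each problem.
raw = [
    {"id": 1, "region": " North", "amount": 120.0},
    {"id": 1, "region": " North", "amount": 120.0},   # duplicate
    {"id": 2, "region": "NORTH", "amount": None},     # missing value
    {"id": 3, "region": "south ", "amount": 80.0},    # inconsistent label
    {"id": 4, "region": "South", "amount": 99999.0},  # outlier
]
cleaned = clean(raw)
for rec in cleaned:
    print(rec)
```

Keeping the `amount_missing` flag alongside the imputed value preserves the fact that the value was missing, which is itself sometimes predictive.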
15. Customer Churn Prediction For Specialty Insurance
Themes: Appropriate Technologies, Clean Data
Objective
• Predict likelihood of churn for each policy
• Get maximum churn detection rate while keeping false alarms lower than 20%
Data Used
• Data understanding with domain experts
• Data cleaning and exploration for insights and predictor identification
Result
• Able to achieve 75% churn detection rate while keeping the false alarm rate less than 20%
• Timely prediction of churn helping in taking retention action
Algorithms And Technologies
• R
• Random Forest, Logistic Regression, Gradient Boosting and CatBoost
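An objective of the form "maximize detection rate while keeping false alarms below 20%" amounts to choosing a score threshold under a constraint. The sketch below shows one way to do that, assuming the false alarm rate means the share of non-churning policies that get flagged (the slide does not define it). The scores and labels are made up and are not the project's data.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (detection_rate, false_alarm_rate) when flagging score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1 and flagged:
            tp += 1
        elif label == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    detection = tp / (tp + fn) if tp + fn else 0.0
    false_alarm = fp / (fp + tn) if fp + tn else 0.0
    return detection, false_alarm


def pick_threshold(scores, labels, max_false_alarm=0.20):
    """Highest-detection threshold whose false alarm rate stays below the cap."""
    best = None
    for threshold in sorted(set(scores), reverse=True):
        detection, false_alarm = rates_at_threshold(scores, labels, threshold)
        if false_alarm < max_false_alarm and (best is None or detection > best[1]):
            best = (threshold, detection, false_alarm)
    return best  # None if no threshold satisfies the cap


# Made-up churn scores (1 = policy churned).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

threshold, detection, false_alarm = pick_threshold(scores, labels)
print(f"threshold={threshold}, detection={detection:.2f}, false alarms={false_alarm:.2f}")
```

The threshold should be chosen on a validation set rather than on the data used to train the model, otherwise the reported rates will be optimistic.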
16. Simpson’s Paradox
A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.
Example: Admission to UC Berkeley
• In the aggregate figures, the admission process seems significantly biased against women.
• But examined department by department, the apparent bias largely disappears, and the corrected data show a small but statistically significant bias in favor of women.
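The reversal is easier to believe with numbers in hand. The figures below are invented for illustration and are not the Berkeley data: women have the higher admission rate in each department, yet the lower rate overall, because most women applied to the department that admits few applicants.

```python
# Hypothetical (applied, admitted) counts per department and group.
applications = {
    "Dept A": {"men": (100, 80), "women": (20, 18)},   # easy to get into
    "Dept B": {"men": (20, 4), "women": (100, 30)},    # hard to get into
}


def department_rate(dept, group):
    """Admission rate for one group within one department."""
    applied, admitted = applications[dept][group]
    return admitted / applied


def overall_rate(group):
    """Admission rate for one group with all departments pooled."""
    applied = sum(counts[group][0] for counts in applications.values())
    admitted = sum(counts[group][1] for counts in applications.values())
    return admitted / applied


for dept in applications:
    print(dept, "men:", department_rate(dept, "men"),
          "women:", department_rate(dept, "women"))
print("Overall men:", overall_rate("men"), "women:", overall_rate("women"))
```

The lesson for a project is to check whether a grouping variable (here, department) changes the conclusion before acting on an aggregate figure.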
17. McNamara Fallacy
Making a decision based solely on quantitative observations and ignoring all others.
• Measure whatever can be easily measured.
• Disregard that which cannot be measured easily.
• Presume that which cannot be measured easily is not important.
Example: Web Traffic Measurement
• Let us assume a company has developed a new e-commerce website.
• After the new website launched, site visits are up 50% and newsletter subscriptions are up 25%.
But what if the percentage of people who never open their emails, or who unsubscribe immediately, has increased?
18. Biotech Innovation Efficiency Analytics
Themes: Simpson’s Paradox, McNamara Fallacy, Right Performance Measure, Domain Understanding
Objective
• Analyze the potential of early-stage biotech firms by analyzing their innovation efficiency.
• Find the statistical correlation of a company’s performance with its innovation efficiency.
Data Used
• Clinical trials
• Press releases
• Stock details
• Patents
• Publications
• Company financials
Result
• Statistical correlation between financial performance and innovation efficiency for a certain category of companies
Algorithms And Technologies
• Random Forest
• Decision Tree
• Neural Networks
• Regression