Ellicium Solutions - Making Data Science Work

Making Data Science Work
By
Dr. Kuldeep Deshpande
Saumitra Modak

10x increase in data science jobs!

Out of all Data Science Projects…
Only this many succeed.
Gartner says 80% of analytics insights will not deliver business outcomes through 2022, and 80% of AI projects will “remain alchemy, run by wizards” through 2020.

It is!
Data Science = Business Sense

Why Do Data Science Projects Fail?

Initiation
• Inadequate research
• Starting with the wrong questions
• Not addressing the root cause
• Initiating a data science project because of a blog post
• Trying to take on too large a first project

People and Process
• Lack of diverse Subject Matter Experts
• Lacking an experienced data science leader
• Limited business understanding
• Lack of a standardized data science process
• Failing to communicate the value of data science

Solution Design
• The solutions are too complex
• Forming conclusions before data scientists start
• Poorly designed models
• Failing to provide actionable insights
• Using technologies because they are cool

Data Access
• Lack of access to data
• Using faulty / bad data
• Having data scientists build their own ETLs
• Relying on Excel as the main data storage
• Big data silos or vendor-owned data!

Data Fallacies
• Simpson’s Paradox
• Setting wrong performance measures
• McNamara Fallacy
• Overfitting
• Data Dredging

Availability of right data
More Data Beats A Cleverer Algorithm!
You need 10 times as many examples as degrees of freedom in
the model.
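
A minimal sketch of the rule of thumb above, assuming a plain linear model whose degrees of freedom are its coefficients plus an intercept. The function names and the 20-feature example are illustrative assumptions, not part of the original deck.

```python
def linear_model_dof(n_features: int, intercept: bool = True) -> int:
    """Degrees of freedom of a plain linear model (assumption: one per coefficient)."""
    return n_features + (1 if intercept else 0)


def min_examples(degrees_of_freedom: int, multiplier: int = 10) -> int:
    """Rule-of-thumb lower bound on training examples: multiplier x degrees of freedom."""
    if degrees_of_freedom < 0:
        raise ValueError("degrees_of_freedom must be non-negative")
    return multiplier * degrees_of_freedom


# Hypothetical example: a linear model with 20 input features.
needed = min_examples(linear_model_dof(20))
print(f"Aim for at least {needed} examples")
```

This is only a starting estimate; as the slide goes on to say, the real answer depends on the problem.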
What model should I use?
How much training data should I gather?

How much data is required for algorithms?
IT DEPENDS!
• More features may help compensate for a smaller dataset.
• A correctly tagged dataset may be more useful than a large untagged one.

Small Data problems
• Overfitting becomes much harder to avoid.
• Outliers become much more dangerous.
• Noise becomes a real issue!

Setting The Right Performance Measure

Regression
o MSPE
o MSAE
o R Square
o Adjusted R Square

Classification
o Precision-Recall
o ROC-AUC
o Accuracy
o Log-Loss

Unsupervised Models
o Rand Index
o Mutual Information

Others
o CV Error
o Heuristic methods to find K
o BLEU Score (NLP)
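
To illustrate why the choice among the classification measures above matters, here is a small sketch that computes accuracy, precision and recall by hand. The data is hypothetical (5 positives in 100 cases) and the function name is an assumption made for this example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A useless "model" that always predicts the negative class.
always_negative = [0] * 100

metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

On this toy data the always-negative model scores 95% accuracy while finding none of the positives, which is why accuracy alone can be the wrong measure for imbalanced problems.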

Classification of legal documents using Data Science
Availability of right data; Right Performance Measure

Objective
• Automated classification of legal documents from 3000+ government web pages

Data Used
• Unstructured data
• Training data: more than 90000 manually classified

Result
• 95% overall accuracy
• More than 85% recall

Algorithms And Technologies
• Naïve Bayes
• SVM
• Rocchio
• Technologies: Python and Redis

Milking the bull – When data science is going to fail!
Beware of data issues that can lead to failure
• Images and text with very few features
• Inaccurately tagged data
• Low quantity data
• Absence of actual predictors
• Time series data without features

Overfitting
A statistical model is said to be overfitted when it fits its training data too closely.
When a model has more flexibility than the available data can support, it starts learning from the noise and inaccurate data entries.
The model then does not categorize new data correctly, because it has absorbed too much detail and noise.
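
Overfitting shows up as a gap between error on the training data and error on unseen data. The sketch below is an illustration under assumed conditions: synthetic noisy linear data, a simple least-squares line, and a deliberately over-flexible "memorizer" that returns the label of the nearest training point. None of these names or numbers come from the original deck.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Synthetic data: y = 2x + 1 plus Gaussian noise (an assumption for illustration)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + random.gauss(0, 3.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Over-flexible model: predicts the label of the nearest training point, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(30)
test_x, test_y = make_data(200)

for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    print(name, "train MSE:", round(mse(model, train_x, train_y), 2),
          "test MSE:", round(mse(model, test_x, test_y), 2))
```

The memorizer reproduces its training data perfectly, so its training error is zero, yet it still makes errors on the held-out set; comparing the two numbers per model is the point of the exercise.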

Production forecasting for manufacturers
Milking The Bull; Overfitting

Objective
• Predicting daily production for a manufacturing company

Data Used
• 6 months of hourly production numbers
• Monthly targets not available
• Labor details unreliable
• Machine details not available

Result
• Patterns found in subsets of data not generalizing
• No useful prediction

Algorithms And Technologies
• R
• ARIMA, Linear Regression, CatBoost, and Random Forests with feature engineering

How companies use hammers to kill mosquitos!
Sexy technologies don’t guarantee success, appropriate technologies do!

Data must be clean or there should be a way to clean data.
Better Data > Efficient Algorithms
Domain understanding
Data Owner interactions
Acceptance of missing values
Data Profiling automation
• Remove Unwanted observations
• Fix Structural Errors
• Filter Unwanted Outliers
• Handle Missing Data
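
The four cleaning steps above can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical record layout (`id`, `city`, `age`), a made-up valid age range, and median imputation as the missing-data strategy; a real project would choose these with the data owners.

```python
from statistics import median


def clean(records, valid_age=(0, 120)):
    """Apply the four cleaning steps to a list of dict records (hypothetical schema)."""
    seen = set()
    cleaned = []
    for rec in records:
        # Fix structural errors: normalise inconsistent spacing and capitalisation.
        city = rec["city"].strip().title()
        age = rec["age"]

        # Remove unwanted observations: drop exact duplicates after normalisation.
        key = (rec["id"], city, age)
        if key in seen:
            continue
        seen.add(key)

        # Filter unwanted outliers: drop ages outside the plausible range.
        if age is not None and not (valid_age[0] <= age <= valid_age[1]):
            continue

        cleaned.append({"id": rec["id"], "city": city, "age": age})

    # Handle missing data: impute missing ages with the median of the known ages.
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill = median(known) if known else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill
    return cleaned


raw = [
    {"id": 1, "city": " pune", "age": 34},
    {"id": 1, "city": "Pune", "age": 34},      # duplicate once the city is normalised
    {"id": 2, "city": "MUMBAI", "age": None},  # missing value
    {"id": 3, "city": "Mumbai", "age": 460},   # implausible outlier
    {"id": 4, "city": "pune ", "age": 40},
]

for row in clean(raw):
    print(row)
```

Whether an outlier is "unwanted" and how a gap should be filled are domain decisions, which is why the slide pairs cleaning with domain understanding and data owner interactions.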

Customer Churn Prediction For Specialty Insurance
Appropriate Technologies; Clean Data

Objective
• Predict likelihood of churn for each policy
• Get maximum churn detection rate while keeping false alarms lower than 20%

Approach
• Data understanding with domain experts
• Data cleaning and exploration for insights and predictor identification

Result
• Able to achieve 75% churn detection rate while keeping the false alarm rate less than 20%
• Timely prediction of churn helping in taking retention action

Algorithms And Technologies
• R
• Random Forest, Logistic Regression, Gradient Boosting and CatBoost
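
The objective of this case study, maximum detection rate under a cap on false alarms, amounts to choosing a score threshold. The sketch below is one way to do that, assuming "false alarm rate" means the false positive rate (false alarms divided by all non-churners); the deck does not define it. The scores and labels are toy values, not the project's data, and the function names are hypothetical.

```python
def rates(y_true, scores, threshold):
    """Detection rate (recall) and false alarm rate at a given score threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_churn = score >= threshold
        if predicted_churn and label:
            tp += 1
        elif predicted_churn and not label:
            fp += 1
        elif not predicted_churn and label:
            fn += 1
        else:
            tn += 1
    detection = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return detection, false_alarm


def best_threshold(y_true, scores, max_false_alarm=0.20):
    """Threshold with the highest detection rate whose false alarm rate stays below the cap."""
    best = None
    for thr in sorted(set(scores)):
        detection, false_alarm = rates(y_true, scores, thr)
        if false_alarm < max_false_alarm and (best is None or detection > best[1]):
            best = (thr, detection, false_alarm)
    return best  # None if no threshold satisfies the cap


# Toy data: 1 = policy churned, 0 = policy retained; scores are model outputs.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.3, 0.7, 0.5, 0.4, 0.2, 0.1, 0.05]

print(best_threshold(y_true, scores))
```

In practice the threshold would be chosen on a validation set and then checked on held-out data, since a threshold tuned on the same data it is evaluated on can itself overfit.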

Simpson’s Paradox
A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.
Admission to UC Berkeley
In the aggregate figures, the admission process seems significantly biased against women.
But examined department by department, that bias largely disappears, and a number of departments turn out to be biased against men.
McNamara Fallacy
Making a decision based solely on quantitative observations and ignoring all others.
• Measure whatever can be easily measured.
• Disregard that which cannot be measured easily.
• Presume that which cannot be measured easily is not important.
Web Traffic Measurement
• Let us assume a company has developed a new e-commerce website.
• After the new website launched, site visits are up 50% and newsletter subscriptions are up 25%.
But what if the percentage of people who never open their emails, or who unsubscribe immediately, has increased?

Biotech Innovation Efficiency Analytics
Simpson’s Paradox; McNamara Fallacy; Right Performance Measure; Domain Understanding

Objective
• Analyze the potential of early-stage biotech firms by analyzing their innovation efficiency.
• Find the statistical correlation of a company’s performance with its innovation efficiency.

Data Used
• Clinical trials
• Press releases
• Stock details
• Patents
• Publications
• Company financials

Result
• Statistical correlation between financial performance and innovation efficiency for a certain category of companies

Algorithms And Technologies
• Random Forest
• Decision Tree
• Neural Networks
• Regression
