The last decade saw advances in compute power combine with an avalanche of open source software development, resulting in a revolution in machine learning and scalable analytics. “Data science” and “data product” are now household terms. This led to a new job description, the Data Scientist, which quickly became one of the most significant, exciting, and misunderstood jobs of the 21st century. One part statistician, one part computer scientist, and one part domain expert, data scientists seem poised to become the most pivotal value creators of the information age. And yet, danger (supposedly) lies ahead: human decisions are increasingly outsourced to algorithms of questionable ethical design; we’re putting everything on the blockchain; and perhaps most disturbingly, data science salaries are dropping precipitously as new graduates and Machine Learning as a Service (MLaaS) offerings flood the market. As we move into a future where predictive analytics is no longer a differentiator but instead a core business function, will data scientists proliferate or be automated out of a job?
In this talk, one humble data scientist attempts to cut through the hype to present an alternate vision of what data science is and can become. If not the “Sexiest Job of the 21st Century,” as the Harvard Business Review once quipped, what is it like to be a workaday data scientist? What problems are we solving? How do we integrate with mature engineering teams? How do we engage with clients and product owners? How do we deploy non-deterministic models in production? In particular, we’ll examine the critical integration points, technological and otherwise, that we are currently tackling and that will ultimately determine our success, and our viability, over the next 10 years.
6. A search for integration points
math & language
statistics & programming
visualization & machine learning
research & agile
systems & applications
???
9. “Pay for data scientists has rocketed from $125,000 to $150,000 two years ago to upwards of $225,000 these days – even for those straight out of school.” Ali Behnam (2013)
“Stephen Purpura, the co-founder and CEO of Seattle-based software company Context Relevant, recently lost out on one job candidate with a PhD and seven years of work experience. He was dying to land the guy. As he puts it, “These people are almost like unicorns.” But out of the blue, Microsoft came knocking with an offer of $650,000 in annual salary and guaranteed bonuses. “We can’t compete with that kind of offer,” says Purpura.” (2013)
According to Glassdoor data, the median salary for data scientists in the United States is $120,931. By contrast, a business analyst can expect to make around $70,170 and a data analyst around $65,470. (2018)
Pay scales for data scientists
11. But ...
“Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time, when many more people will have the title ‘data scientist’ on their business cards.”
Davenport and Patil
Data Scientist: The Sexiest Job of the 21st Century
12. But ...
“After rising rapidly in 2015 and 2016, median base salaries at all job levels changed by a single-digit percentage point or not at all from March 2017 to March 2018.”
15. 1936: Turing first theorizes the Turing Machine
1957: Frank Rosenblatt invents the Perceptron
1973: James Lighthill publishes the Lighthill Report, discrediting AI
1974-1993 (approx): AI Winter
1995: Cortes & Vapnik invent the SVM
1997: Deep Blue beats Garry Kasparov
1998: CNNs can recognize handwritten digits
2000’s: GPUs become commercially available
2015: TensorFlow becomes open source
????: Robots take over, run for your life
AI Timeline
16. A recent KDnuggets poll, “Data Scientists Automated and Unemployed by 2025?”, found that the majority of respondents thought expert-level data science will be automated by 2025.
Does anyone understand what we actually do?
Now ...
18. ● Personalized news classification
● Detecting political lean in documents
● Entity extraction, recognition and resolution
● Supply chain inference from unstructured text
● Natural language interfaces for relational data
● Automated metadata tagging
● Plagiarism detection
Adventures in Applied NLP
19. The Natural Language Toolkit
import nltk

# Build an NLTK Text object from the Gutenberg corpus (requires nltk.download('gutenberg'))
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

# These methods print their results directly, so no print() wrapper is needed
moby.similar("ahab")                         # words that appear in similar contexts
moby.common_contexts(["ahab", "starbuck"])   # contexts shared by both words
moby.concordance("monstrous", 55, lines=10)  # concordance view, 55-character width
20. Gensim + Wikipedia
import bz2
import gensim

# Load the id-to-word dictionary built from the Wikipedia dump
id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')

# Instantiate an iterator over the corpus (which is ~24.14 GB on disk after compression!)
mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))

# Fit a Latent Semantic Analysis model with 400 topics ...
lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)

# ... then print the 10 most significant ones
lsa.print_topics(10)
21. From each doc, extract html, identify paras/sents/words, tag with part-of-speech
[Pipeline diagram: Raw Corpus → HTML → Paras → Sents → Tokens → Tags]
corpus = [('How', 'WRB'),
          ('long', 'RB'),
          ('will', 'MD'),
          ('this', 'DT'),
          ('go', 'VB'),
          ('on', 'IN'),
          ('?', '.'),
          ...]
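A minimal sketch of that preprocessing step, assuming BeautifulSoup for HTML stripping and NLTK's default sentence tokenizer, word tokenizer, and part-of-speech tagger (paragraph segmentation omitted for brevity):
import nltk
from bs4 import BeautifulSoup

def preprocess(html):
    # Strip markup, then segment into sentences, tokenize, and tag each token
    text = BeautifulSoup(html, "html.parser").get_text()
    return [
        nltk.pos_tag(nltk.word_tokenize(sent))   # [('How', 'WRB'), ('long', 'RB'), ...]
        for sent in nltk.sent_tokenize(text)
    ]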
28. from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

# 12-fold cross validation; keep the best mean score across the candidates
kfold = KFold(n_splits=12)
max([
    cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
Try them all!
29. - Search is difficult, particularly in high-dimensional space.
- Even with techniques like genetic algorithms or particle swarm optimization, there is no guarantee of a solution.
- As the search space gets larger, the amount of time required increases exponentially.
Except
32. [Architecture diagram: a Data Management Layer holding Raw Data, an Instance Database, and Model Storage; Model Selection Triples combining Feature Engineering, Algorithm Selection (Model Family), and Hyperparameter Tuning (Model Form).]
33. - Extends the Scikit-Learn API.
- Enhances the model selection process.
- Tools for feature visualization, visual diagnostics, and visual steering.
- Not a replacement for other visualization libraries.
Yellowbrick
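As a sketch of the underlying pattern, every Yellowbrick visualizer behaves like a scikit-learn estimator: instantiate, fit, score or transform, then show (Yellowbrick 1.x; earlier releases use poof() instead of show()):
from sklearn.datasets import load_wine
from yellowbrick.features import Rank1D

X, y = load_wine(return_X_y=True)

visualizer = Rank1D(algorithm="shapiro")  # rank each feature by its Shapiro-Wilk score
visualizer.fit(X, y)                      # compute the ranking
visualizer.transform(X)                   # draw the bar chart
visualizer.show()                         # finalize and render the figure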
34. ● Based on a spring-tension minimization algorithm.
● Features are spaced equally on a unit circle; instances are dropped into the circle.
● Features pull instances towards their position on the circle in proportion to their normalized numerical value for that instance.
● Classification coloring based on labels in the data.
Radial Visualization
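A minimal RadViz sketch, assuming Yellowbrick 1.x and the bundled iris data:
from sklearn.datasets import load_iris
from yellowbrick.features import RadViz

X, y = load_iris(return_X_y=True)

viz = RadViz(classes=["setosa", "versicolor", "virginica"])
viz.fit(X, y)      # place features evenly on the unit circle
viz.transform(X)   # drop instances inside, pulled by their normalized feature values
viz.show()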
35. ● Visualize clusters in data.
● Points represented as connected line segments.
● Each vertical line represents one attribute (x-axis units not meaningful).
● One set of connected line segments represents one instance.
● Points that tend to cluster will appear closer together.
Parallel Coordinates
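A minimal parallel coordinates sketch; sample and shuffle are optional arguments that keep the plot legible on larger datasets (Yellowbrick 1.x):
from sklearn.datasets import load_iris
from yellowbrick.features import ParallelCoordinates

X, y = load_iris(return_X_y=True)

viz = ParallelCoordinates(sample=0.5, shuffle=True, normalize="standard")
viz.fit_transform(X, y)   # each instance drawn as one set of connected segments
viz.show()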
36. ● Feature engineering requires understanding of the relationships between features
● Visualize pairwise relationships as a heatmap
● Pearson shows us strong correlations, potential collinearity
● Covariance helps us understand the sequence of relationships
Rank2D
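A minimal Rank2D sketch; switching the algorithm between "pearson" and "covariance" swaps the metric behind the heatmap (Yellowbrick 1.x):
from sklearn.datasets import load_wine
from yellowbrick.features import Rank2D

X, y = load_wine(return_X_y=True)

viz = Rank2D(algorithm="pearson")   # or algorithm="covariance"
viz.fit(X, y)
viz.transform(X)   # draws the pairwise ranking heatmap
viz.show()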
37. ● Embed instances described by many dimensions into 2.
● Look for latent structures in the data, noise, separability.
● Is it possible to create a decision space in the data?
● Unlike PCA or SVD, manifolds use nearest neighbors, can capture non-linear structures.
Manifolds
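A minimal manifold embedding sketch; "tsne" is one of several manifold learners the visualizer can wrap (Yellowbrick 1.x):
from sklearn.datasets import load_digits
from yellowbrick.features import Manifold

X, y = load_digits(return_X_y=True)

viz = Manifold(manifold="tsne")   # non-linear embedding of 64 pixel features into 2D
viz.fit_transform(X, y)
viz.show()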
38. ● Uses PCA to decompose high dimensional data into two or three dimensions.
● Each instance plotted in a scatter plot.
● Projected dataset can be analyzed along axes of principal variation.
● Can be interpreted to determine if spherical distance metrics can be utilized.
PCA Projection
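A minimal PCA projection sketch (note this class is named PCADecomposition in pre-1.0 Yellowbrick releases):
from sklearn.datasets import load_breast_cancer
from yellowbrick.features import PCA

X, y = load_breast_cancer(return_X_y=True)

viz = PCA(scale=True)     # standardize, then project the 30 features into 2 dimensions
viz.fit_transform(X, y)
viz.show()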
39. ● Need to select the minimum required features to produce a valid model.
● The more features a model contains, the more complex it is (sparse data, errors due to variance).
● This visualizer ranks and plots the underlying impact of features relative to each other.
Feature Importances
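A minimal feature importances sketch; the visualizer reads feature_importances_ (or coef_) from the wrapped estimator after fitting (Yellowbrick 1.x, where the class lives in yellowbrick.model_selection):
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.model_selection import FeatureImportances

X, y = load_wine(return_X_y=True)

viz = FeatureImportances(RandomForestClassifier(n_estimators=100))
viz.fit(X, y)    # fits the forest, then plots relative importances as a bar chart
viz.show()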
40. ● Recursive feature elimination fits a model and removes the weakest feature(s) until the specified number is reached.
● Features are ranked by the internal model’s coef_ or feature_importances_.
● Attempts to eliminate dependencies and collinearity that may exist in the model.
Recursive Feature Elimination
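A minimal recursive feature elimination sketch, using cross-validation to pick how many features to keep (Yellowbrick 1.x):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from yellowbrick.model_selection import RFECV

X, y = load_breast_cancer(return_X_y=True)

viz = RFECV(LogisticRegression(max_iter=5000), cv=5)
viz.fit(X, y)    # drops the weakest features one step at a time, scoring each subset
viz.show()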
41. Precision: of those labelled edible, how many actually were?
Recall: how many of the poisonous ones did our model find?
Is it better to have false positives here, or here?
Classification Heatmap
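A minimal classification heatmap sketch; the slide's example is a mushroom (edible vs. poisonous) classifier, so the breast cancer dataset here is only a stand-in (Yellowbrick 1.x):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ClassificationReport

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

viz = ClassificationReport(GaussianNB(), classes=["malignant", "benign"], support=True)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)   # per-class precision, recall, and F1 as a heatmap
viz.show()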
42. ● Visualize the tradeoff between a classifier's sensitivity (how well it finds true positives) and specificity (how well it avoids false positives).
● Usually for binary classification.
● Can also visualize multiclass classification with per-class or one-vs-rest strategies.
ROC-AUC
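A minimal ROC-AUC sketch for a binary classifier (Yellowbrick 1.x):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

viz = ROCAUC(LogisticRegression(max_iter=5000))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)   # draws the ROC curve(s) and reports AUC
viz.show()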
43. Do I care about certain classes more than others?
I have a lot of classes; how does my model perform on each?
Confusion Matrix
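A minimal confusion matrix sketch on a ten-class problem (Yellowbrick 1.x):
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ConfusionMatrix

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

viz = ConfusionMatrix(LogisticRegression(max_iter=5000))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)   # heatmap of predicted vs. actual class counts
viz.show()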
46. Higher silhouette scores mean denser, better-separated clusters.
The elbow shows the best value of k… or suggests a different algorithm.
Hyperparameter Tuning
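A minimal elbow and silhouette sketch for choosing k, on synthetic blobs (Yellowbrick 1.x):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

X, _ = make_blobs(n_samples=1000, centers=6, random_state=42)

# Elbow: score a range of k values and mark the bend in the curve
elbow = KElbowVisualizer(KMeans(), k=(2, 12))
elbow.fit(X)
elbow.show()

# Silhouette: inspect density and separation for one candidate k
silhouette = SilhouetteVisualizer(KMeans(n_clusters=6))
silhouette.fit(X)
silhouette.show()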
49. Data science in production is hard
[Architecture diagram spanning Interaction, Data, Storage, and Computation layers across a Build phase and a Deploy phase: Ingestion, Wrangling, Normalization, Computational Data Store, Feature Analysis, Cross Validation, Model Builds, Model Selection & Monitoring, an API, and user Feedback.]
50. Building data science teams
● No two data scientists are alike.
● Some blend of: programming, math, machine learning, and data engineering/wrangling.
● Evaluate creative problem solving, not rote memorization.
● Think beyond just data scientists:
○ Business development specialists
○ Product owners and project managers
○ Communications
55. “Data products are self-adapting, broadly applicable economic engines that derive their value from data and generate more data by influencing human behavior or by making inferences or predictions upon new data.”
- Benjamin Bengfort
The Age of the Data Product
56. Decision Science: inform strategy and key decisions by analysis of business metrics.
Data Products: improve product performance via automatic decision making.
58. “We will begin to see more smaller-scale analytics developed to subtly improve the user experience of everyday applications. [These] will rely not (or not exclusively) on massive datasets [or algorithmic innovations], but on custom, domain-specific datasets geared to specific use cases.”