The last decade saw advances in compute power combine with an avalanche of open source software development, resulting in a revolution in machine learning and scalable analytics. “Data science” and “data product” are now household terms. This led to a new job description, the Data Scientist, which quickly became one of the most significant, exciting, and misunderstood jobs of the 21st century. One part statistician, one part computer scientist, and one part domain expert, data scientists seem poised to become the most pivotal value creators of the information age. And yet, danger (supposedly) lies ahead: human decisions are increasingly outsourced to algorithms of questionable ethical design; we’re putting everything on the blockchain; and perhaps most disturbingly, data science salaries are dropping precipitously as new graduates and Machine Learning as a Service (MLaaS) offerings flood the market. As we move into a future where predictive analytics is no longer a differentiator but instead a core business function, will data scientists proliferate or be automated out of a job?
In this talk, one humble data scientist attempts to cut through the hype to present an alternate vision of what data science is and can become. If not the “Sexiest Job of the 21st Century,” as the Harvard Business Review once quipped, what is it like to be a workaday data scientist? What problems are we solving? How do we integrate with mature engineering teams? How do we engage with clients and product owners? How do we deploy non-deterministic models in production? In particular, we’ll examine the critical integration points, technological and otherwise, that we are currently tackling and that will ultimately determine our success, and our viability, over the next 10 years.
6. A search for integration points
math & language
statistics & programming
visualization & machine learning
research & agile
systems & applications
???
9. “Pay for data scientists has rocketed from $125,000 to $150,000 two years ago to upwards of $225,000 these days – even for those straight out of school.” Ali Behnam (2013)
“Stephen Purpura, the co-founder and CEO of Seattle-based software company Context Relevant, recently lost out on one job candidate with a PhD and seven years of work experience. He was dying to land the guy. As he puts it, “These people are almost like unicorns.” But out of the blue, Microsoft came knocking with an offer of $650,000 in annual salary and guaranteed bonuses. “We can’t compete with that kind of offer,” says Purpura.” (2013)
According to Glassdoor data, the median salary for data scientists in the United States is $120,931. By contrast, a business analyst can expect to make around $70,170 and a data analyst around $65,470. (2018)
Pay scales for data scientists
11. But ...
“Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time, when many more people will have the title ‘data scientist’ on their business cards.”
Davenport and Patil
Data Scientist: The Sexiest Job of the 21st Century
12. But ...
“After rising rapidly in 2015 and 2016, median base salaries at all job levels changed by a single-digit percentage point or not at all from March 2017 to March 2018.”
15. 1936: Turing first theorizes the Turing Machine
1957: Frank Rosenblatt invents the Perceptron
1973: James Lighthill publishes the Lighthill Report, discrediting AI
1974-1993 (approx): AI Winter
1995: Cortes & Vapnik invent the SVM
1997: Deep Blue beats Garry Kasparov
1998: CNNs can recognize handwritten digits
2000’s: GPUs become commercially available
2015: TensorFlow becomes open source
????: Robots take over, run for your life
AI Timeline
16. A recent KDnuggets poll, “Data Scientists Automated and Unemployed by 2025?”, found that the majority of respondents thought expert-level data science will be automated by 2025.
Does anyone understand what we actually do?
Now ...
18. ● Personalized news classification
● Detecting political lean in documents
● Entity extraction, recognition and resolution
● Supply chain inference from unstructured text
● Natural language interfaces for relational data
● Automated metadata tagging
● Plagiarism detection
Adventures in Applied NLP
19. The Natural Language Toolkit
import nltk

# Build an NLTK Text object from the Gutenberg corpus (requires nltk.download('gutenberg'))
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

# These methods print their results directly, so no print() wrapper is needed
moby.similar("ahab")                         # words that appear in similar contexts
moby.common_contexts(["ahab", "starbuck"])   # contexts shared by both words
moby.concordance("monstrous", 55, lines=10)  # concordance view, 55-character width
20. Gensim + Wikipedia
import bz2
import gensim

# Load the id-to-word dictionary built from the Wikipedia dump
id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')

# Instantiate an iterator over the corpus (which is ~24.14 GB on disk after compression!)
mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))

# Fit a Latent Semantic Analysis model with 400 topics ...
lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)

# ... then print the 10 most significant ones
lsa.print_topics(10)
21. From each doc, extract html, identify paras/sents/words, tag with part-of-speech
[Pipeline diagram: Raw Corpus → HTML → Paras → Sents → Tokens → Tags]
corpus = [('How', 'WRB'),
          ('long', 'RB'),
          ('will', 'MD'),
          ('this', 'DT'),
          ('go', 'VB'),
          ('on', 'IN'),
          ('?', '.'),
          ...]
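A minimal sketch of that preprocessing step, assuming BeautifulSoup for HTML stripping and NLTK's default sentence tokenizer, word tokenizer, and part-of-speech tagger (paragraph segmentation omitted for brevity):
import nltk
from bs4 import BeautifulSoup

def preprocess(html):
    # Strip markup, then segment into sentences, tokenize, and tag each token
    text = BeautifulSoup(html, "html.parser").get_text()
    return [
        nltk.pos_tag(nltk.word_tokenize(sent))   # [('How', 'WRB'), ('long', 'RB'), ...]
        for sent in nltk.sent_tokenize(text)
    ]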
28. from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

# 12-fold cross validation; keep the best mean score across the candidates
kfold = KFold(n_splits=12)
max([
    cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
Try them all!
29. - Search is difficult, particularly in high-dimensional space.
- Even with techniques like genetic algorithms or particle swarm optimization, there is no guarantee of a solution.
- As the search space gets larger, the amount of time required increases exponentially.
Except
32. [Architecture diagram: a Data Management Layer holding Raw Data, an Instance Database, and Model Storage; Model Selection Triples combining Feature Engineering, Algorithm Selection (Model Family), and Hyperparameter Tuning (Model Form).]
33. - Extends the Scikit-Learn API.
- Enhances the model selection process.
- Tools for feature visualization, visual diagnostics, and visual steering.
- Not a replacement for other visualization libraries.
Yellowbrick
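As a sketch of the underlying pattern, every Yellowbrick visualizer behaves like a scikit-learn estimator: instantiate, fit, score or transform, then show (Yellowbrick 1.x; earlier releases use poof() instead of show()):
from sklearn.datasets import load_wine
from yellowbrick.features import Rank1D

X, y = load_wine(return_X_y=True)

visualizer = Rank1D(algorithm="shapiro")  # rank each feature by its Shapiro-Wilk score
visualizer.fit(X, y)                      # compute the ranking
visualizer.transform(X)                   # draw the bar chart
visualizer.show()                         # finalize and render the figure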
34. ● Based on a spring-tension minimization algorithm.
● Features are spaced equally on a unit circle; instances are dropped into the circle.
● Features pull instances towards their position on the circle in proportion to their normalized numerical value for that instance.
● Classification coloring based on labels in the data.
Radial Visualization
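A minimal RadViz sketch, assuming Yellowbrick 1.x and the bundled iris data:
from sklearn.datasets import load_iris
from yellowbrick.features import RadViz

X, y = load_iris(return_X_y=True)

viz = RadViz(classes=["setosa", "versicolor", "virginica"])
viz.fit(X, y)      # place features evenly on the unit circle
viz.transform(X)   # drop instances inside, pulled by their normalized feature values
viz.show()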
35. ● Visualize clusters in data.
● Points represented as connected line segments.
● Each vertical line represents one attribute (x-axis units not meaningful).
● One set of connected line segments represents one instance.
● Points that tend to cluster will appear closer together.
Parallel Coordinates
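A minimal parallel coordinates sketch; sample and shuffle are optional arguments that keep the plot legible on larger datasets (Yellowbrick 1.x):
from sklearn.datasets import load_iris
from yellowbrick.features import ParallelCoordinates

X, y = load_iris(return_X_y=True)

viz = ParallelCoordinates(sample=0.5, shuffle=True, normalize="standard")
viz.fit_transform(X, y)   # each instance drawn as one set of connected segments
viz.show()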
36. ● Feature engineering requires understanding of the relationships between features
● Visualize pairwise relationships as a heatmap
● Pearson shows us strong correlations, potential collinearity
● Covariance helps us understand the sequence of relationships
Rank2D
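A minimal Rank2D sketch; switching the algorithm between "pearson" and "covariance" swaps the metric behind the heatmap (Yellowbrick 1.x):
from sklearn.datasets import load_wine
from yellowbrick.features import Rank2D

X, y = load_wine(return_X_y=True)

viz = Rank2D(algorithm="pearson")   # or algorithm="covariance"
viz.fit(X, y)
viz.transform(X)   # draws the pairwise ranking heatmap
viz.show()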
37. ● Embed instances described by many dimensions into 2.
● Look for latent structures in the data, noise, separability.
● Is it possible to create a decision space in the data?
● Unlike PCA or SVD, manifolds use nearest neighbors, can capture non-linear structures.
Manifolds
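A minimal manifold embedding sketch; "tsne" is one of several manifold learners the visualizer can wrap (Yellowbrick 1.x):
from sklearn.datasets import load_digits
from yellowbrick.features import Manifold

X, y = load_digits(return_X_y=True)

viz = Manifold(manifold="tsne")   # non-linear embedding of 64 pixel features into 2D
viz.fit_transform(X, y)
viz.show()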
38. ● Uses PCA to decompose high dimensional data into two or three dimensions.
● Each instance plotted in a scatter plot.
● Projected dataset can be analyzed along axes of principal variation.
● Can be interpreted to determine if spherical distance metrics can be utilized.
PCA Projection
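A minimal PCA projection sketch (note this class is named PCADecomposition in pre-1.0 Yellowbrick releases):
from sklearn.datasets import load_breast_cancer
from yellowbrick.features import PCA

X, y = load_breast_cancer(return_X_y=True)

viz = PCA(scale=True)     # standardize, then project the 30 features into 2 dimensions
viz.fit_transform(X, y)
viz.show()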
39. ● Need to select the minimum required features to produce a valid model.
● The more features a model contains, the more complex it is (sparse data, errors due to variance).
● This visualizer ranks and plots the underlying impact of features relative to each other.
Feature Importances
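A minimal feature importances sketch; the visualizer reads feature_importances_ (or coef_) from the wrapped estimator after fitting (Yellowbrick 1.x, where the class lives in yellowbrick.model_selection):
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.model_selection import FeatureImportances

X, y = load_wine(return_X_y=True)

viz = FeatureImportances(RandomForestClassifier(n_estimators=100))
viz.fit(X, y)    # fits the forest, then plots relative importances as a bar chart
viz.show()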
40. ● Recursive feature elimination fits a model and removes the weakest feature(s) until the specified number is reached.
● Features are ranked by the internal model’s coef_ or feature_importances_.
● Attempts to eliminate dependencies and collinearity that may exist in the model.
Recursive Feature Elimination
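A minimal recursive feature elimination sketch, using cross-validation to pick how many features to keep (Yellowbrick 1.x):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from yellowbrick.model_selection import RFECV

X, y = load_breast_cancer(return_X_y=True)

viz = RFECV(LogisticRegression(max_iter=5000), cv=5)
viz.fit(X, y)    # drops the weakest features one step at a time, scoring each subset
viz.show()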
41. Precision: of those labelled edible, how many actually were?
Recall: how many of the poisonous ones did our model find?
Is it better to have false positives here, or here?
Classification Heatmap
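A minimal classification heatmap sketch; the slide's example is a mushroom (edible vs. poisonous) classifier, so the breast cancer dataset here is only a stand-in (Yellowbrick 1.x):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ClassificationReport

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

viz = ClassificationReport(GaussianNB(), classes=["malignant", "benign"], support=True)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)   # per-class precision, recall, and F1 as a heatmap
viz.show()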
42. ● Visualize the tradeoff between a classifier's sensitivity (how well it finds true positives) and specificity (how well it avoids false positives).
● Usually for binary classification.
● Can also visualize multiclass classification with per-class or one-vs-rest strategies.
ROC-AUC
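A minimal ROC-AUC sketch for a binary classifier (Yellowbrick 1.x):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

viz = ROCAUC(LogisticRegression(max_iter=5000))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)   # draws the ROC curve(s) and reports AUC
viz.show()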
43. Do I care about certain classes more than others?
I have a lot of classes; how does my model perform on each?
Confusion Matrix
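A minimal confusion matrix sketch on a ten-class problem (Yellowbrick 1.x):
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ConfusionMatrix

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

viz = ConfusionMatrix(LogisticRegression(max_iter=5000))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)   # heatmap of predicted vs. actual class counts
viz.show()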
46. Higher silhouette scores mean denser, better-separated clusters.
The elbow shows the best value of k… or suggests a different algorithm.
Hyperparameter Tuning
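A minimal elbow and silhouette sketch for choosing k, on synthetic blobs (Yellowbrick 1.x):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

X, _ = make_blobs(n_samples=1000, centers=6, random_state=42)

# Elbow: score a range of k values and mark the bend in the curve
elbow = KElbowVisualizer(KMeans(), k=(2, 12))
elbow.fit(X)
elbow.show()

# Silhouette: inspect density and separation for one candidate k
silhouette = SilhouetteVisualizer(KMeans(n_clusters=6))
silhouette.fit(X)
silhouette.show()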
49. Data science in production is hard
[Architecture diagram spanning Interaction, Data, Storage, and Computation layers across a Build phase and a Deploy phase: Ingestion, Wrangling, Normalization, Computational Data Store, Feature Analysis, Cross Validation, Model Builds, Model Selection & Monitoring, an API, and user Feedback.]
50. Building data science teams
● No two data scientists are alike.
● Some blend of: programming, math, machine learning, and data engineering/wrangling.
● Evaluate creative problem solving, not rote memorization.
● Think beyond just data scientists:
○ Business development specialists
○ Product owners and project managers
○ Communications
55. “Data products are self-adapting, broadly applicable economic engines that derive their value from data and generate more data by influencing human behavior or by making inferences or predictions upon new data.”
- Benjamin Bengfort
The Age of the Data Product
56. Decision Science: inform strategy and key decisions by analysis of business metrics.
Data Products: improve product performance via automatic decision making.
58. “We will begin to see more smaller-scale analytics developed to subtly improve the user experience of everyday applications. [These] will rely not (or not exclusively) on massive datasets [or algorithmic innovations], but on custom, domain-specific datasets geared to specific use cases.”