The Incredible Disappearing
Data Scientist
Critical Integration Points for the Next 10 Years
Big Mountain Data and Dev
2018
Dr. Rebecca Bilbro
Head of Data Science, ICX Media
Co-creator, Scikit-Yellowbrick
Faculty, Georgetown Univ.
@rebeccabilbro
Data Science: A Story in Six Acts
● The Beginning
● The Hype
● The Work
● The Tools
● The Team
● The Future
The Beginning
A search for integration points
math & language
statistics & programming
visualization & machine learning
research & agile
systems & applications
???
The Hype
October 2012
“Pay for data scientists has rocketed from $125,000 to $150,000 two years ago to upwards
of $225,000 these days – even for those straight out of school.” Ali Behnam (2013)
“Stephen Purpura, the co-founder and CEO of Seattle-based software company Context
Relevant, recently lost out on one job candidate who had a PhD and seven years of
work experience. He was dying to land the guy. As he puts it, ‘These people are almost like
unicorns.’ But out of the blue, Microsoft came knocking with an offer of $650,000 in annual
salary and guaranteed bonuses. ‘We can’t compete with that kind of offer,’ says Purpura.”
(2013)
According to Glassdoor data, the median salary for data scientists in the United States is
$120,931. By contrast, a business analyst can expect to make around $70,170 and a data
analyst around $65,470. (2018)
Pay scales for data scientists
The high water mark
But ...
“Data scientists’ most basic, universal
skill is the ability to write code. This may
be less true in five years’ time, when
many more people will have the title
‘data scientist’ on their business cards.”
Davenport and Patil
Data Scientist: The Sexiest Job of the 21st Century
But ...
“After rising rapidly in 2015 and 2016,
median base salaries at all job levels
changed by a single-digit percentage
point or not at all from March 2017 to
March 2018.”
And ...
1936: Turing first theorizes the Turing Machine
1957: Frank Rosenblatt invents the Perceptron
1973: James Lighthill publishes the Lighthill Report, discrediting AI
1974-1993 (approx): AI Winter
1995: Cortes & Vapnik publish the soft-margin SVM
1997: Deep Blue beats Garry Kasparov
1998: CNNs can recognize handwritten digits
2000s: GPUs become commercially available
2015: TensorFlow becomes open source
????: Robots take over; run for your life
AI Timeline
A recent KDnuggets poll, “Data Scientists Automated and Unemployed by 2025?”, found that
the majority of respondents thought that expert-level data science will be automated by 2025.
Does anyone understand
what we actually do?
Now ...
The Work
● Personalized news classification
● Detecting political lean in documents
● Entity extraction, recognition and resolution
● Supply chain inference from unstructured text
● Natural language interfaces for relational data
● Automated metadata tagging
● Plagiarism detection
Adventures in Applied NLP
The Natural Language Toolkit
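import nltk
# Requires the Gutenberg corpus: nltk.download('gutenberg')
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
# These Text methods print their results and return None, so no print() is needed
moby.similar("ahab")
moby.common_contexts(["ahab", "starbuck"])
moby.concordance("monstrous", 55, lines=10)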
Gensim + Wikipedia
import bz2
import gensim
# Load id to word dictionary
id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')
# Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!)
mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))
# Run latent semantic analysis with 400 topics, then print 10 of them
lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
lsa.print_topics(10)
From each doc, extract HTML, identify paras/sents/words, tag with part-of-speech
[Diagram: Raw Corpus -> HTML -> Paras -> Sents -> Tokens -> Tags]
corpus = [('How', 'WRB'),
          ('long', 'RB'),
          ('will', 'MD'),
          ('this', 'DT'),
          ('go', 'VB'),
          ('on', 'IN'),
          ('?', '.'),
          ...
]
Streaming Corpus Preprocessing
CorpusReader for streaming access, preprocessing, and saving the tokenized version
[Diagram: Raw Corpus -> HTML -> Paras -> Sents -> Tokens -> Tags -> Tokenized Corpus]
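The streaming idea above can be sketched with plain generators. This is only an illustration, not the CorpusReader from the talk: the function names (`paras`, `sents`, `tokens`, `preprocess`) and the regex-based splitting are assumptions, part-of-speech tagging is left out, and a real reader would use an HTML parser and NLTK tokenizers instead.

```python
import re
from html import unescape

def paras(html):
    """Yield the text of each <p> element with inner tags stripped."""
    for match in re.finditer(r"<p[^>]*>(.*?)</p>", html, flags=re.S | re.I):
        text = re.sub(r"<[^>]+>", "", match.group(1))
        yield unescape(text).strip()

def sents(paragraph):
    """Naively split a paragraph into sentences after ., ! or ?"""
    for sent in re.split(r"(?<=[.!?])\s+", paragraph):
        if sent:
            yield sent

def tokens(sentence):
    """Split a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def preprocess(docs):
    """Stream documents one at a time as paragraphs of tokenized sentences."""
    for html in docs:
        yield [[tokens(s) for s in sents(p)] for p in paras(html)]

# Hypothetical one-document corpus, invented for illustration
docs = ["<html><body><p>How long will this go on? Not long.</p></body></html>"]
tokenized = list(preprocess(docs))
```

Because `preprocess` is a generator, only one document is held in memory at a time; the `list(...)` call here is just for inspecting the toy result.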
[Diagram: Data Loader -> Text Normalization -> Text Vectorization -> Feature Transformation -> Estimator]
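A chain like this maps naturally onto a scikit-learn `Pipeline`. The sketch below is one possible reading, not the pipeline from the talk: the toy documents, the labels, and the choice of `TfidfVectorizer`, `TruncatedSVD`, and `LogisticRegression` for each stage are all assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy documents and labels, invented for illustration only
docs = [
    "the team won the match",
    "the senate passed the bill",
    "the striker scored a goal",
    "the president vetoed the law",
    "fans cheered the team goal",
    "voters elected the senate",
]
labels = ["sports", "politics", "sports", "politics", "sports", "politics"]

model = Pipeline([
    # Normalization (lowercasing, stop words) and vectorization in one step
    ("vectorize", TfidfVectorizer(lowercase=True, stop_words="english")),
    # Feature transformation: project the sparse TF-IDF matrix to 2 dimensions
    ("reduce", TruncatedSVD(n_components=2, random_state=0)),
    # Estimator
    ("classify", LogisticRegression()),
])
model.fit(docs, labels)
predictions = model.predict(docs)
```

Keeping every stage inside one `Pipeline` object means cross-validation refits the vectorizer on each training fold, which avoids leaking vocabulary from held-out documents.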
Feature Union Pipeline
[Diagram components: Data Loader; Text Normalization; Document Features; Text Extraction; Summary Vectorization; Article Vectorization; Concept Features; Metadata Features; Dict Vectorizer; Estimator]
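A feature union of text features and metadata features might look roughly like the following in scikit-learn. The record layout (`"text"` and `"meta"` keys), the helper names `get_text` and `get_meta`, and the toy data are hypothetical; the talk's actual feature extractors are not shown in the slides.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def get_text(records):
    """Select the raw text of each record (hypothetical record layout)."""
    return [r["text"] for r in records]

def get_meta(records):
    """Select the metadata dict of each record (hypothetical record layout)."""
    return [r["meta"] for r in records]

# Toy records and labels, invented for illustration only
records = [
    {"text": "team wins championship game", "meta": {"source": "blog", "n_words": 4}},
    {"text": "senate votes on new bill", "meta": {"source": "wire", "n_words": 5}},
    {"text": "coach praises young striker", "meta": {"source": "blog", "n_words": 4}},
    {"text": "president signs budget law", "meta": {"source": "wire", "n_words": 4}},
]
labels = ["sports", "politics", "sports", "politics"]

features = FeatureUnion([
    # Document features: TF-IDF over the text field
    ("document", Pipeline([
        ("select", FunctionTransformer(get_text)),
        ("tfidf", TfidfVectorizer()),
    ])),
    # Metadata features: one-hot strings and pass-through numbers
    ("metadata", Pipeline([
        ("select", FunctionTransformer(get_meta)),
        ("dict", DictVectorizer()),
    ])),
])

model = Pipeline([("features", features), ("classify", LogisticRegression())])
model.fit(records, labels)
combined = features.fit_transform(records)
```

Each branch sees the same input records and produces its own matrix; `FeatureUnion` concatenates the matrices column-wise before the estimator.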
The Tools
[Diagram stages: Ingestion; Data Munging and Wrangling; Computation and Analysis; Modeling and Application; Visual Analysis]
The Pipeline (in theory)
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import KFold, cross_val_score
classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]
# X and y are assumed to be a feature matrix and target vector defined elsewhere
kfold = KFold(n_splits=12)
# Highest mean cross-validated score across all candidate models
max([
    cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
Try them all!
- Search is difficult, particularly in
high-dimensional space.
- Even with techniques like genetic
algorithms or particle swarm
optimization, there is no guarantee of
a solution.
- As the search space gets larger, the
amount of time increases
exponentially.
Except
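To make the growth concrete, the number of candidates in an exhaustive search is the product of the number of values tried for each hyperparameter. The grid below is hypothetical and chosen only to illustrate the arithmetic; the 12 folds match the `KFold` setting used earlier.

```python
from itertools import product

# Hypothetical hyperparameter grid, chosen only to illustrate the arithmetic
grid = {
    "C": [0.01, 0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.001, 0.01, 0.1],
}

def count_candidates(grid):
    """Multiply the number of values per hyperparameter."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total

n_candidates = count_candidates(grid)   # 4 * 3 * 3
n_fits = n_candidates * 12              # one fit per candidate per fold

# Enumerating the combinations explicitly gives the same count
all_combinations = list(product(*grid.values()))
```

Adding one more hyperparameter with five values would multiply both numbers by five, which is why exhaustive search stops being practical quickly.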
Instead, let’s steer
[Diagram: Feature Analysis, Algorithm Selection, Hyperparameter Tuning]
Model Selection Triples
[Diagram components: Data Management Layer; Raw Data; Instance Database; Model Storage; Feature Engineering; Algorithm Selection; Hyperparameter Tuning; Model Family; Model Form]
- Extends the scikit-learn API.
- Enhances the model selection
process.
- Tools for feature visualization,
visual diagnostics, and visual
steering.
- Not a replacement for other
visualization libraries.
Yellowbrick
● Based on spring tension
minimization algorithm.
● Features equally spaced on a unit
circle, instances dropped into circle.
● Features pull instances towards their
position on the circle in proportion
to their normalized numerical value
for that instance.
● Classification coloring based on
labels in data.
Radial Visualization
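The projection described above can be written in a few lines of NumPy. This is a sketch of the general RadViz idea, not Yellowbrick's implementation; the function name `radviz_project` and the toy matrix are assumptions.

```python
import numpy as np

def radviz_project(X):
    """Project rows of X to 2D points inside the unit circle."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    # Scale each feature to [0, 1] so the pulls are comparable
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    W = (X - X.min(axis=0)) / span
    # One anchor per feature, equally spaced on the unit circle
    angles = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each instance rests at the weighted average of the anchors
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    return (W @ anchors) / totals

# Toy data: three instances dominated by one feature each, one balanced instance
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
points = radviz_project(X)
```

An instance dominated by a single feature lands on that feature's anchor, while an instance with equal normalized values on every feature sits at the center.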
● Visualize clusters in data.
● Points represented as connected
line segments.
● Each vertical line represents one
attribute (x-axis units not
meaningful).
● One set of connected line segments
represents one instance.
● Points that tend to cluster will
appear closer together.
Parallel Coordinates
● Feature engineering requires
understanding of the relationships
between features
● Visualize pairwise relationships as a
heatmap
● Pearson shows us strong
correlations, potential collinearity
● Covariance helps us understand the
sequence of relationships
Rank2D
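The matrices behind such a heatmap can be computed directly with NumPy. This is a sketch on synthetic data (the three features and their relationship are invented), showing only the numbers a pairwise ranking would color, not the plot itself.

```python
import numpy as np

# Synthetic features, invented for illustration
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                       # independent of the others
X = np.column_stack([x1, x2, x3])

# Pairwise Pearson correlation and covariance; columns are features
pearson = np.corrcoef(X, rowvar=False)
covariance = np.cov(X, rowvar=False)
```

The entry for `x1` and `x2` is close to 1, flagging potential collinearity, while entries involving `x3` stay near 0.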
● Embed instances described
by many dimensions into 2.
● Look for latent structures in
the data, noise, separability.
● Is it possible to create a
decision space in the data?
● Unlike PCA or SVD, manifolds
use nearest neighbors, can
capture non-linear structures.
Manifolds
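As one example of a neighbor-based embedding, scikit-learn's `Isomap` can flatten a curved 3D surface into 2D. The choice of Isomap, the S-curve dataset, and `n_neighbors=10` are assumptions for illustration; other manifold methods follow the same `fit_transform` pattern.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Synthetic 3D points lying on a curved 2D sheet
X, position = make_s_curve(n_samples=300, random_state=0)

# Build a nearest-neighbor graph and embed it in 2 dimensions
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```

A linear projection would squash the curve onto itself, whereas the neighbor graph lets the embedding follow the surface.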
● Uses PCA to decompose high
dimensional data into two or three
dimensions
● Each instance plotted in a scatter
plot.
● Projected dataset can be analyzed
along axes of principal variation
● Can be interpreted to determine if
spherical distance metrics can be
utilized.
PCA Projection
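A minimal sketch of such a projection with scikit-learn, assuming the bundled iris dataset as a stand-in and standardizing features first (a common but not mandatory choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize, then keep the two directions of greatest variance
projector = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = projector.fit_transform(X)

# Share of total variance captured by each retained component
ratio = projector.named_steps["pca"].explained_variance_ratio_
```

The rows of `X_2d` are the scatter-plot coordinates; `ratio` indicates how much of the original variation the two axes retain.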
● Need to select the minimum
required features to produce a valid
model.
● The more features a model
contains, the more complex it is
(sparse data, errors due to
variance).
● This visualizer ranks and plots
underlying impact of features
relative to each other.
Feature Importances
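The ranking behind such a plot can be sketched with a random forest's `feature_importances_`. The synthetic dataset (two informative features placed first, four noise features) and the scaling to percent of the strongest feature are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False, features 0 and 1 are the informative ones
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Express each importance relative to the strongest feature (in percent)
relative = 100.0 * forest.feature_importances_ / forest.feature_importances_.max()

# Feature indices from most to least important
ranking = np.argsort(relative)[::-1]
```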
● Recursive feature elimination fits a
model and removes the weakest
feature(s) until the specified
number is reached.
● Features are ranked by internal
model’s coef_ or
feature_importances_
● Attempts to eliminate
dependencies and collinearity that
may exist in the model.
Recursive Feature Elimination
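scikit-learn exposes this procedure as `sklearn.feature_selection.RFE`. The sketch below assumes a logistic regression as the internal model (ranked by `coef_`) and a synthetic dataset; both are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which 2 are informative
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

# Drop the weakest feature one at a time until 2 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
selector.fit(X, y)

kept = selector.support_      # boolean mask of the retained features
ranks = selector.ranking_     # 1 = retained; larger = eliminated earlier
```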
Precision: of those labelled edible, how many actually were?
Recall: how many of the poisonous ones did our model find?
Is it better to have false positives here or here?
Classification Heatmap
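The two questions can be answered by counting. The labels below are invented toy data, and the helper names `precision` and `recall` are just for this sketch.

```python
def precision(y_true, y_pred, label):
    """Of the instances predicted as `label`, the fraction that truly are."""
    truths = [t for t, p in zip(y_true, y_pred) if p == label]
    return sum(t == label for t in truths) / len(truths) if truths else 0.0

def recall(y_true, y_pred, label):
    """Of the instances that truly are `label`, the fraction the model found."""
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds) / len(preds) if preds else 0.0

# Toy labels, invented for illustration
y_true = ["edible", "edible", "edible", "poisonous",
          "poisonous", "poisonous", "poisonous", "edible"]
y_pred = ["edible", "edible", "poisonous", "poisonous",
          "poisonous", "edible", "poisonous", "edible"]

edible_precision = precision(y_true, y_pred, "edible")      # 3 of 4 predicted edible
poisonous_recall = recall(y_true, y_pred, "poisonous")      # 3 of 4 actual poisonous
```

Here one poisonous mushroom was labelled edible, which lowers both numbers; which error matters more depends on the cost of each mistake.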
● Visualize tradeoff between
classifier's sensitivity (how
well it finds true positives)
and specificity (how well it
avoids false positives)
● Usually for binary
classification.
● Can also visualize
multiclass classification
with per-class or
one-vs-rest strategies.
ROC-AUC
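For the binary case, the curve and the area under it can be computed with `sklearn.metrics`. The four scores below are toy values chosen so the result is easy to check by hand.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy binary labels and classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Probability that a random positive outscores a random negative
auc = roc_auc_score(y_true, scores)
```

Of the four positive/negative pairs, the positive scores higher in three, so the area is 0.75.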
Do I care about certain classes more than others?
I have a lot of classes; how does my model perform on each?
Confusion Matrix
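A small sketch of how per-class performance falls out of the matrix, using `sklearn.metrics.confusion_matrix` on invented labels:

```python
from sklearn.metrics import confusion_matrix

# Toy multiclass labels, invented for illustration
labels = ["bird", "cat", "dog"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

# Rows are true classes, columns are predicted classes, in `labels` order
matrix = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class recall: correct predictions divided by the row total
per_class_recall = matrix.diagonal() / matrix.sum(axis=1)
```

Reading along a row shows where a class's instances went; here every dog was found, but half the birds and half the cats were mislabelled.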
Similar to confusion
matrix, but often more
interpretable!
Class Prediction Error
Visualize the distribution of error to diagnose heteroscedasticity
Prediction Error and Residuals
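Heteroscedasticity shows up as residuals whose spread changes along the x-axis. The sketch below builds synthetic data whose noise grows with `x` (an assumption made so the effect is visible) and compares residual spread in the lower and upper halves.

```python
import numpy as np

# Synthetic data: noise scale grows with x, so errors are heteroscedastic
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 400)
y = 3.0 * x + rng.normal(scale=0.5 * x)

# Ordinary least-squares line and its residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare residual spread for small and large x
low_spread = residuals[x < 5.5].std()
high_spread = residuals[x >= 5.5].std()
```

In a residuals plot this appears as a funnel shape; a roughly constant spread would suggest the constant-variance assumption is reasonable.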
Higher silhouette scores mean denser, more separate clusters
The elbow shows the best value of k… Or suggests a different algorithm
Hyperparameter Tuning
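Both diagnostics can be computed by fitting k-means over a range of k. The blob centers and spread below are invented so that three clusters are clearly separated; real data is rarely this clean.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters
X, _ = make_blobs(
    n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.8, random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                          # elbow curve values
    silhouettes[k] = silhouette_score(X, km.labels_)   # cluster separation

# k with the highest average silhouette score
best_k = max(silhouettes, key=silhouettes.get)
```

Plotting `inertias` against k gives the elbow curve; when no clear elbow or silhouette peak appears, that is a hint that k-means may not suit the data.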
The Team
Hot Take
Data science in production is hard
[Diagram layers: Computation; Storage; Data; Interaction]
[Diagram components: Ingestion; Wrangling; Normalization; Computational Data Store; Feature Analysis; Model Builds; Cross Validation; Model Selection & Monitoring; API; Feedback]
[Diagram phases: Build Phase; Deploy Phase]
Building data science teams
● No two data scientists are alike.
● Some blend of: programming, math, machine learning, and
data engineering/wrangling.
● Evaluate creative problem solving, not rote memorization.
● Think beyond just data scientists:
○ Business development specialists
○ Product owners and project managers
○ Communications
The Future
1. Data Science is Disappearing
Remember when
people used to
write “word
processing skills”
on their
resumes?
2. Move Upstream
“Data products are self-adapting,
broadly applicable economic
engines that derive their value from
data and generate more data by
influencing human behavior or by
making inferences or predictions
upon new data.”
- Benjamin Bengfort
The Age of the Data Product
Decision Science
Inform strategy and key
decisions by analysis of
business metrics.
Data Products
Improve product
performance via automatic
decision making.
3. Solve Specific Problems
“We will begin to see more smaller-scale
analytics developed to subtly improve the
user experience of everyday applications.
[These] will rely not (or not exclusively)
on massive datasets [or algorithmic
innovations], but on custom,
domain-specific datasets geared to
specific use cases.”
Thank you

Mais conteúdo relacionado

Mais procurados

Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringDataRobot
 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBigML, Inc
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZCharles Vestur
 
Interpretable machine learning : Methods for understanding complex models
Interpretable machine learning : Methods for understanding complex modelsInterpretable machine learning : Methods for understanding complex models
Interpretable machine learning : Methods for understanding complex modelsManojit Nandi
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsDarius Barušauskas
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine LearningJeff Tanner
 
Practical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.ai
Practical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.aiPractical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.ai
Practical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.aiSri Ambati
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Hayim Makabee
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningJames Ward
 
Machine learning
Machine learningMachine learning
Machine learningeonx_32
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .pptbutest
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle CompetitionsDataRobot
 
VSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionVSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionBigML, Inc
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introductionAnas Jamil
 
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIMEUnified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIMEDatabricks
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Hayim Makabee
 
BSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic ModelingBSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic ModelingBigML, Inc
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learningSri Ambati
 
Machine Learning Overview
Machine Learning OverviewMachine Learning Overview
Machine Learning OverviewMykhailo Koval
 

Mais procurados (20)

Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature Engineering
 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly Detection
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to Z
 
Interpretable machine learning : Methods for understanding complex models
Interpretable machine learning : Methods for understanding complex modelsInterpretable machine learning : Methods for understanding complex models
Interpretable machine learning : Methods for understanding complex models
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine Learning
 
Practical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.ai
Practical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.aiPractical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.ai
Practical Tips for Interpreting Machine Learning Models - Patrick Hall, H2O.ai
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .ppt
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
 
VSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly DetectionVSSML16 L3. Clusters and Anomaly Detection
VSSML16 L3. Clusters and Anomaly Detection
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introduction
 
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIMEUnified Approach to Interpret Machine Learning Model: SHAP + LIME
Unified Approach to Interpret Machine Learning Model: SHAP + LIME
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
 
BSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic ModelingBSSML16 L4. Association Discovery and Topic Modeling
BSSML16 L4. Association Discovery and Topic Modeling
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learning
 
Machine Learning Overview
Machine Learning OverviewMachine Learning Overview
Machine Learning Overview
 

Semelhante a The Incredible Disappearing Data Scientist

Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedKrishnaram Kenthapadi
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learningPramit Choudhary
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...Egyptian Engineers Association
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Data Science as a Career and Intro to R
Data Science as a Career and Intro to RData Science as a Career and Intro to R
Data Science as a Career and Intro to RAnshik Bansal
 
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...Francesca Lazzeri, PhD
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesCodePolitan
 
Keepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivosKeepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivosKeepler Data Tech
 
AI In Actuarial Science
AI In Actuarial ScienceAI In Actuarial Science
AI In Actuarial ScienceAudrey Britton
 
Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...
Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...
Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...polochau
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureIvo Andreev
 
Explainability and bias in AI
Explainability and bias in AIExplainability and bias in AI
Explainability and bias in AIBill Liu
 
3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine Learning3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine LearningNeo4j
 
Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...
Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...
Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...Analytics India Magazine
 
Myth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan Lipps
Myth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan LippsMyth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan Lipps
Myth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan LippsApplitools
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsFrancesca Lazzeri, PhD
 
Data Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world ChallengesData Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world ChallengesYuchen Zhao
 

Semelhante a The Incredible Disappearing Data Scientist (20)

Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons Learned
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learning
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
C3 w5
C3 w5C3 w5
C3 w5
 
Data Science as a Career and Intro to R
Data Science as a Career and Intro to RData Science as a Career and Intro to R
Data Science as a Career and Intro to R
 
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
Spark + AI Summit - The Importance of Model Fairness and Interpretability in ...
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & Opportunities
 
Keepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivosKeepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivos
 
AI In Actuarial Science
AI In Actuarial ScienceAI In Actuarial Science
AI In Actuarial Science
 
Ml masterclass
Ml masterclassMl masterclass
Ml masterclass
 
Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...
Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...
Human-Centered AI: Scalable, Interactive Tools for Interpretation and Attribu...
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
 
Explainability and bias in AI
Explainability and bias in AIExplainability and bias in AI
Explainability and bias in AI
 
3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine Learning3. Relationships Matter: Using Connected Data for Better Machine Learning
3. Relationships Matter: Using Connected Data for Better Machine Learning
 
Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...
Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...
Deciphering AI - Unlocking the Black Box of AIML with State-of-the-Art Techno...
 
Myth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan Lipps
Myth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan LippsMyth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan Lipps
Myth vs Reality: Understanding AI/ML for QA Automation - w/ Jonathan Lipps
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 
recent.pptx
recent.pptxrecent.pptx
recent.pptx
 
Data Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world ChallengesData Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world Challenges
 

Mais de Rebecca Bilbro

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionRebecca Bilbro
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Rebecca Bilbro
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine LearningRebecca Bilbro
 
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyAnti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyRebecca Bilbro
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusRebecca Bilbro
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningRebecca Bilbro
 
EuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scaleEuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scaleRebecca Bilbro
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Rebecca Bilbro
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with YellowbrickRebecca Bilbro
 
Escaping the Black Box
Escaping the Black BoxEscaping the Black Box
Escaping the Black BoxRebecca Bilbro
 
Data Intelligence 2017 - Building a Gigaword Corpus
Data Intelligence 2017 - Building a Gigaword CorpusData Intelligence 2017 - Building a Gigaword Corpus
Data Intelligence 2017 - Building a Gigaword CorpusRebecca Bilbro
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Rebecca Bilbro
 
Yellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformersYellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformersRebecca Bilbro
 
Visualizing the model selection process
Visualizing the model selection processVisualizing the model selection process
Visualizing the model selection processRebecca Bilbro
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday PeopleRebecca Bilbro
 
Commerce Data Usability Project
Commerce Data Usability ProjectCommerce Data Usability Project
Commerce Data Usability ProjectRebecca Bilbro
 

Mais de Rebecca Bilbro (19)

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in Production
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)


The Incredible Disappearing Data Scientist

  • 1. The Incredible Disappearing Data Scientist Critical Integration Points for the Next 10 Years Big Mountain Data and Dev 2018
  • 2. Dr. Rebecca Bilbro Head of Data Science, ICX Media Co-creator, Scikit-Yellowbrick Faculty, Georgetown Univ. @rebeccabilbro
  • 3. Data Science: A Story in Six Acts ● The Beginning ● The Hype ● The Work ● The Tools ● The Team ● The Future
  • 6. A search for integration points: math & language; statistics & programming; visualization & machine learning; research & agile; systems & applications; ???
  • 9. Pay scales for data scientists
      “Pay for data scientists has rocketed from $125,000 to $150,000 two years ago to upwards of $225,000 these days – even for those straight out of school.” Ali Behnam (2013)
      “Stephen Purpura, the co-founder and CEO of Seattle-based software company Context Relevant, recently lost out on one job candidate who had a PhD and seven years of work experience. He was dying to land the guy. As he puts it, “These people are almost like unicorns.” But out of the blue, Microsoft came knocking with an offer of $650,000 in annual salary and guaranteed bonuses. “We can’t compete with that kind of offer,” says Purpura.” (2013)
      According to Glassdoor data, the median salary for data scientists in the United States is $120,931. By contrast, a business analyst can expect to make around $70,170 and a data analyst around $65,470. (2018)
  • 11. But ... “Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time, when many more people will have the title ‘data scientist’ on their business cards.” Davenport and Patil, “Data Scientist: The Sexiest Job of the 21st Century”
  • 12. But ... “After rising rapidly in 2015 and 2016, median base salaries at all job levels changed by a single-digit percentage point or not at all from March 2017 to March 2018.”
  • 15. AI Timeline
      1936: Turing first theorizes the Turing Machine
      1957: Frank Rosenblatt invents the Perceptron
      1973: James Lighthill publishes the Lighthill Report, discrediting AI
      1974-1993 (approx.): AI Winter
      1995: Cortes & Vapnik publish the soft-margin SVM
      1997: Deep Blue beats Garry Kasparov
      1998: CNNs can recognize handwritten digits
      2000s: GPUs become commercially available
      2015: TensorFlow becomes open source
      ????: Robots take over; run for your life
  • 16. Now ... A recent KDnuggets poll, “Data Scientists Automated and Unemployed by 2025?”, found that the majority of respondents thought expert-level data science will be automated by 2025. Does anyone understand what we actually do?
  • 18. Adventures in Applied NLP ● Personalized news classification ● Detecting political lean in documents ● Entity extraction, recognition and resolution ● Supply chain inference from unstructured text ● Natural language interfaces for relational data ● Automated metadata tagging ● Plagiarism detection
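  • 19. The Natural Language Toolkit
      import nltk

      # Requires the Gutenberg corpus: nltk.download('gutenberg')
      moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

      # These Text methods print their results directly and return None,
      # so they do not need to be wrapped in print().
      moby.similar("ahab")
      moby.common_contexts(["ahab", "starbuck"])
      moby.concordance("monstrous", 55, lines=10)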
  • 20. Gensim + Wikipedia
      import bz2
      import gensim

      # Load id to word dictionary
      id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')

      # Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!)
      mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))

      # Do Latent Semantic Analysis with 400 topics and print 10 prominent ones
      lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
      lsa.print_topics(10)
  • 21. From each doc, extract HTML, identify paras/sents/words, and tag with part-of-speech. Raw Corpus → HTML → Paras → Sents → Tokens → Tags. corpus = [('How', 'WRB'), ('long', 'RB'), ('will', 'MD'), ('this', 'DT'), ('go', 'VB'), ('on', 'IN'), ('?', '.'), ... ]
  • 22. Streaming Corpus Preprocessing: a CorpusReader for streaming access, preprocessing, and saving the tokenized version. Raw Corpus → HTML → Paras → Sents → Tokens → Tags → Tokenized Corpus
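The streaming idea on slide 22 can be sketched roughly as follows, assuming one HTML file per document on disk. `StreamingCorpusReader` and its regex helpers are hypothetical stand-ins written with the standard library only; this is not NLTK's `CorpusReader` API, part-of-speech tagging is left out, and a real pipeline would likely swap in a proper sentence tokenizer and tagger.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class _ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.paras = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._buf).strip()
            if text:
                self.paras.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


class StreamingCorpusReader:
    """Hypothetical reader: loads one document at a time, never the whole corpus."""

    def __init__(self, root):
        self.root = Path(root)

    def fileids(self):
        return sorted(p.name for p in self.root.glob("*.html"))

    def paras(self, fileid):
        parser = _ParagraphExtractor()
        parser.feed((self.root / fileid).read_text(encoding="utf-8"))
        yield from parser.paras

    def sents(self, fileid):
        for para in self.paras(fileid):
            # Naive split: whitespace that follows ., ! or ?
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if sent:
                    yield sent

    def tokens(self, fileid):
        for sent in self.sents(fileid):
            # Runs of word characters, or single punctuation marks.
            yield re.findall(r"\w+|[^\w\s]", sent)

    def docs(self):
        """Lazily yield (fileid, tokenized sentences) for each document."""
        for fileid in self.fileids():
            yield fileid, list(self.tokens(fileid))
```

Because every method is a generator, a caller can iterate over `reader.docs()` and write each tokenized document back to disk without holding more than one document in memory.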
  • 23. [Diagrams] Simple pipeline: Data Loader → Text Normalization → Text Vectorization → Feature Transformation → Estimator. Feature union pipeline (labels): Data Loader, Text Normalization, Document Features, Text Extraction, Summary Vectorization, Article Vectorization, Concept Features, Metadata Features, Dict Vectorizer, Estimator
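A feature-union pipeline of the kind labelled on slide 23 might look like the sketch below in scikit-learn. The toy documents, the `get_text` and `get_metadata` helpers, and the choice of `LogisticRegression` are assumptions made for illustration; the slide does not say which transformers or estimator the author used.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy documents: each has article text plus one metadata field.
docs = [
    {"text": "the senate passed the budget bill", "source": "wire"},
    {"text": "the team won the championship game", "source": "blog"},
    {"text": "congress debates the new tax bill", "source": "wire"},
    {"text": "the striker scored in the final game", "source": "blog"},
]
labels = ["politics", "sports", "politics", "sports"]


def get_text(rows):
    """Pull the raw article text out of each document."""
    return [row["text"] for row in rows]


def get_metadata(rows):
    """Pull the metadata fields out of each document as dicts."""
    return [{"source": row["source"]} for row in rows]


model = Pipeline([
    # Two feature branches are computed side by side and concatenated.
    ("features", FeatureUnion([
        ("article", Pipeline([
            ("extract", FunctionTransformer(get_text)),
            ("tfidf", TfidfVectorizer()),
        ])),
        ("metadata", Pipeline([
            ("extract", FunctionTransformer(get_metadata)),
            ("dict", DictVectorizer()),
        ])),
    ])),
    ("estimator", LogisticRegression()),
])

model.fit(docs, labels)
predictions = model.predict(docs)
```

Wrapping every step in one `Pipeline` object means the whole chain can be cross-validated or serialized as a single estimator.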
  • 27. The Pipeline (in theory): Ingestion → Data Munging and Wrangling → Computation and Analysis → Modeling and Application → Visual Analysis
  • 28. Try them all!
      from sklearn.svm import SVC
      from sklearn.naive_bayes import GaussianNB
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
      from sklearn.model_selection import KFold, cross_val_score

      classifiers = [
          KNeighborsClassifier(5),
          SVC(kernel="linear", C=0.025),
          RandomForestClassifier(max_depth=5),
          AdaBoostClassifier(),
          GaussianNB(),
      ]

      # X and y are the feature matrix and target vector, defined elsewhere
      kfold = KFold(n_splits=12)
      # Best mean cross-validated score across all candidate models
      max(cross_val_score(model, X, y, cv=kfold).mean() for model in classifiers)
  • 29. Except: - Search is difficult, particularly in high-dimensional space. - Even with techniques like genetic algorithms or particle swarm optimization, there is no guarantee of a solution. - As the search space gets larger, the amount of time increases exponentially.
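The exponential growth mentioned on slide 29 can be made concrete with a little arithmetic. The grid below is a hypothetical example, not one taken from the talk; the fold count of 12 matches the `KFold` setting on slide 28.

```python
from itertools import product
from math import prod

# Hypothetical hyperparameter grid for a single model family.
grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [0.001, 0.01, 0.1, 1],
    "class_weight": [None, "balanced"],
}

# Number of candidate settings is the product of the list lengths: 5 * 3 * 4 * 2.
n_candidates = prod(len(values) for values in grid.values())

# Each candidate is fit once per cross-validation fold.
n_folds = 12
n_fits = n_candidates * n_folds

# Enumerating the grid explicitly gives the same count.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]


def grid_size(n_params, values_per_param):
    """Candidates in a full grid with the same number of values per parameter."""
    return values_per_param ** n_params
```

With these numbers a four-parameter grid already needs 1,440 model fits, and `grid_size(10, 5)` shows that ten parameters with five values each would need nearly ten million candidates before cross-validation is even counted.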
  • 32. [Diagram labels] Data Management Layer, Raw Data, Feature Engineering, Hyperparameter Tuning, Algorithm Selection, Model Selection, Triples, Instance Database, Model Storage, Model Family, Model Form
  • 33. Yellowbrick - Extends the Scikit-Learn API. - Enhances the model selection process. - Tools for feature visualization, visual diagnostics, and visual steering. - Not a replacement for other visualization libraries.
  • 34. Radial Visualization ● Based on a spring tension minimization algorithm. ● Features are equally spaced on a unit circle; instances are dropped into the circle. ● Features pull instances towards their position on the circle in proportion to their normalized numerical value for that instance. ● Classification coloring based on labels in data.
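The spring-tension description on slide 34 corresponds to a weighted average of anchor points. The sketch below is a standard-library illustration of that idea under the usual RadViz formulation; the function names are hypothetical and this is not Yellowbrick's implementation.

```python
import math


def radviz_anchors(n_features):
    """Place one anchor per feature, equally spaced on the unit circle."""
    return [
        (math.cos(2 * math.pi * j / n_features), math.sin(2 * math.pi * j / n_features))
        for j in range(n_features)
    ]


def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # Constant columns get a span of 1 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)] for row in rows]


def radviz_positions(rows):
    """Each instance rests where the pulls of its features balance out."""
    anchors = radviz_anchors(len(rows[0]))
    positions = []
    for row in min_max_normalize(rows):
        total = sum(row)
        if total == 0:
            # No feature pulls on this instance, so it sits at the center.
            positions.append((0.0, 0.0))
            continue
        x = sum(w * ax for w, (ax, _) in zip(row, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(row, anchors)) / total
        positions.append((x, y))
    return positions


rows = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
positions = radviz_positions(rows)
```

An instance dominated by one feature lands on that feature's anchor, while an instance with equal values on every feature is pulled equally in all directions and ends up at the center.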
  • 35. Parallel Coordinates ● Visualize clusters in data. ● Points represented as connected line segments. ● Each vertical line represents one attribute (x-axis units not meaningful). ● One set of connected line segments represents one instance. ● Points that tend to cluster will appear closer together.
  • 36. Rank2D ● Feature engineering requires understanding of the relationships between features. ● Visualize pairwise relationships as a heatmap. ● Pearson shows us strong correlations, potential collinearity. ● Covariance helps us understand the sequence of relationships.
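The pairwise ranking behind slide 36 can be illustrated with a small Pearson correlation matrix. This is a standard-library sketch with hypothetical column names and helper functions, not the Yellowbrick `Rank2D` code; a heatmap would simply color each cell of the resulting matrix.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (zero variance would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def pairwise_pearson(columns):
    """Score every ordered pair of features, keyed by (name_a, name_b)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Hypothetical feature columns.
columns = {
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],
    "reversed_x": [4, 3, 2, 1],
}
matrix = pairwise_pearson(columns)
```

Values near 1 or -1, as with `double_x` and `reversed_x` here, flag features that are close to collinear and may be redundant in a model.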
  • 37. Manifolds ● Embed instances described by many dimensions into two. ● Look for latent structures in the data, noise, separability. ● Is it possible to create a decision space in the data? ● Unlike PCA or SVD, manifolds use nearest neighbors and can capture non-linear structures.
  • 38. PCA Projection ● Uses PCA to decompose high-dimensional data into two or three dimensions. ● Each instance is plotted in a scatter plot. ● The projected dataset can be analyzed along axes of principal variation. ● Can be interpreted to determine if spherical distance metrics can be utilized.
  • 39. Feature Importances ● Need to select the minimum required features to produce a valid model. ● The more features a model contains, the more complex it is (sparse data, errors due to variance). ● This visualizer ranks and plots the underlying impact of features relative to each other.
  • 40. Recursive Feature Elimination ● Recursive feature elimination fits a model and removes the weakest feature(s) until the specified number is reached. ● Features are ranked by the internal model’s coef_ or feature_importances_. ● Attempts to eliminate dependencies and collinearity that may exist in the model.
  • 41. Classification Heatmap. Precision: of those labelled edible, how many actually were? Is it better to have false positives here or here? Recall: how many of the poisonous ones did our model find?
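The two questions on slide 41 translate directly into counts. The mushroom labels below are made-up illustration data, and `precision` and `recall` are small hypothetical helpers rather than scikit-learn's functions.

```python
def precision(y_true, y_pred, label):
    """Of the instances predicted as `label`, the fraction that truly are."""
    truths_for_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    return sum(t == label for t in truths_for_predicted) / len(truths_for_predicted)


def recall(y_true, y_pred, label):
    """Of the instances that truly are `label`, the fraction the model found."""
    preds_for_actual = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds_for_actual) / len(preds_for_actual)


# Hypothetical labels: four edible and six poisonous mushrooms.
y_true = ["edible"] * 4 + ["poisonous"] * 6
y_pred = ["edible", "edible", "edible", "poisonous",
          "poisonous", "poisonous", "poisonous", "poisonous",
          "edible", "edible"]

edible_precision = precision(y_true, y_pred, "edible")      # 3 of 5 predicted edible
poisonous_recall = recall(y_true, y_pred, "poisonous")      # 4 of 6 poisonous found
```

In this toy example two poisonous mushrooms were labelled edible, which is the costly kind of error for a forager, so low precision on the edible class matters more than a missed edible mushroom.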
  • 42. ROC-AUC ● Visualize the tradeoff between a classifier's sensitivity (how well it finds true positives) and specificity (how well it avoids false positives). ● Usually for binary classification. ● Can also visualize multiclass classification with per-class or one-vs-rest strategies.
  • 43. Confusion Matrix: Do I care about certain classes more than others? I have a lot of classes; how does my model perform on each?
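To make the per-class question on slide 43 concrete, a confusion matrix can be tallied with a counter. The class names and predictions are hypothetical, and the helpers are a standard-library sketch rather than the scikit-learn or Yellowbrick versions.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix, labels):
    """Diagonal count divided by row total, for each class that has instances."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, matrix))
        if sum(row)
    }


# Hypothetical three-class document classification results.
labels = ["news", "sports", "tech"]
y_true = ["news", "news", "news", "sports", "sports", "tech", "tech", "tech", "tech"]
y_pred = ["news", "news", "tech", "sports", "sports", "tech", "tech", "news", "tech"]

matrix = confusion_matrix(y_true, y_pred, labels)
recalls = per_class_recall(matrix, labels)
```

Reading across a row shows where a class's instances end up; here `news` and `tech` are confused with each other while `sports` is never misclassified.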
  • 44. Class Prediction Error: Similar to confusion matrix, but often more interpretable!
  • 45. Prediction Error and Residuals: Visualize the distribution of error to diagnose heteroscedasticity.
  • 46. Hyperparameter Tuning: Higher silhouette scores mean denser, more separate clusters. The elbow shows the best value of k… or suggests a different algorithm.
  • 49. Data science in production is hard. [Diagram labels] Computation, Storage, Data, Interaction, Computational Data Store, Feature Analysis, Model Builds, Model Selection & Monitoring, Normalization, Ingestion, Feedback, Wrangling, API, Cross Validation, Build Phase, Deploy Phase
  • 50. Building data science teams ● No two data scientists are alike. ● Some blend of: programming, math, machine learning, and data engineering/wrangling. ● Evaluate creative problem solving, not rote memorization. ● Think beyond just data scientists: ○ Business development specialists ○ Product owners and project managers ○ Communications
  • 52. 1. Data Science is Disappearing
  • 53. Remember when people used to write “word processing skills” on their resumes?
  • 55. The Age of the Data Product: “Data products are self-adapting, broadly applicable economic engines that derive their value from data and generate more data by influencing human behavior or by making inferences or predictions upon new data.” - Benjamin Bengfort
  • 56. Decision Science: Inform strategy and key decisions by analysis of business metrics. Data Products: Improve product performance via automatic decision making.
  • 57. 3. Solve Specific Problems
  • 58. “We will begin to see more smaller-scale analytics developed to subtly improve the user experience of everyday applications. [These] will rely not (or not exclusively) on massive datasets [or algorithmic innovations], but on custom, domain-specific datasets geared to specific use cases.”