Machine learning @ NYT
Dae Il Kim - daeil.kim@nytimes.com
Overview
● Assisting Great Journalism: The Story of Faulty Takata Airbags
○ Using Logistic Regression to help uncover suspicious comments
● Extracting insights from big data - A Bayesian perspective
○ BNPy: A fully pythonic framework for Bayesian Nonparametric Models
○ Refinery: A Locally Deployable Web App for Scalable Topic Modeling
● Using ML to help with news-related, non-journalistic problems
○ Single Copy - Using ML to effectively predict the number of papers to print
○ Subscribers - Retention and Audience Acquisition
○ Recommendations - Using collaborative topic models for recommendations
Part 1: The Story of Faulty Takata Airbags
Complaints data from NHTSA
The Data
The data contains 33,204 comments, 2,219 of
which were painstakingly labeled as suspicious (by
Hiroko Tabuchi).
A Machine Learning Approach
Develop a classifier that predicts whether a
comment is suspicious or not.
The algorithm learns from the dataset
which features are representative of a suspicious
comment.
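As a rough illustration of the approach named in the overview (logistic regression), here is a minimal sketch trained with plain stochastic gradient descent. The toy data, the learning rate, and the function names are assumptions made for this example; the slides do not say how the actual model was fit.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and a bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_proba(x, w, b):
    """Probability that a feature vector belongs to the suspicious class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy data (hypothetical): feature 0 shows up in suspicious comments,
# feature 1 in normal ones. Labels: 1 = suspicious, 0 = not suspicious.
X = [[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_regression(X, y)
```

The sign and size of each learned weight indicate which features push a comment toward the suspicious class, which is the sense in which the model "learns which features are representative".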
The Machine Learning Approach
A sample comment. We will preprocess this data for the algorithm
- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) -
LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE
DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL
SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK,
FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB
TOKENIZE
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break the comment into individual words
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break the comment into bigrams (every pair of adjacent words)
FILTER
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments
DATA IS READY FOR TRAINING!
The data now consists of 33,204 examples with 56,191 features.
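The tokenize and filter steps can be sketched in a few lines of standard-library Python. This is only an illustration: the lowercasing, the regular expression, and the function names are assumptions, not details taken from the slides.

```python
from collections import Counter
import re

def tokenize(comment):
    """Split a comment into single words plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", comment.lower())  # lowercasing is an assumption
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def build_vocabulary(comments, min_comments=5):
    """Keep tokens that appear in at least `min_comments` distinct comments."""
    doc_freq = Counter()
    for comment in comments:
        doc_freq.update(set(tokenize(comment)))  # count each token once per comment
    return sorted(tok for tok, n in doc_freq.items() if n >= min_comments)

def featurize(comment, vocabulary):
    """Return the word-frequency vector of one comment over the vocabulary."""
    counts = Counter(tokenize(comment))
    return [counts[tok] for tok in vocabulary]

# Tiny made-up corpus: one phrase repeated 5 times, another seen only once.
comments = ["AIR BAGS DID NOT DEPLOY"] * 5 + ["KEY WOULD NOT TURN"]
vocab = build_vocabulary(comments, min_comments=5)
vector = featurize("air bags did not deploy, air bags failed", vocab)
```

Tokens seen in only one comment ("key", "would", "turn") are dropped, so each comment becomes a fixed-length vector of counts over the surviving words and bigrams.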
Cross-Validation
[Figure: a matrix with one row per CommentID and one column per feature (i.e., word frequency), next to a column of labels (S = Suspicious, NS = Not Suspicious).]
Training set: take a subset of the data for training.
Test set: after training, test on this dataset to obtain accuracy measures.
How did we do?
Experiment Setup
We hold out 25% of both the
suspicious and not suspicious
comments for testing and train on
the rest. We do this 5 times, creating
random splits and retraining the
model with these splits.
Performance!
We obtain a very high AUC (~0.97) on
our test sets.
Check what we missed
These comments are potentially
worth checking twice.
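The experiment setup (hold out 25% of each class, repeat 5 times, report AUC) can be sketched as follows. The helper names, the fixed seed, and the toy label counts are assumptions for illustration, and the pairwise AUC is written for clarity rather than speed.

```python
import random

def stratified_split(labels, test_fraction, rng):
    """Hold out `test_fraction` of each class; return (train_idx, test_idx)."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)       # fixed seed so the example is repeatable
labels = [1] * 8 + [0] * 32  # toy labels: 1 = suspicious, 0 = not suspicious
splits = [stratified_split(labels, 0.25, rng) for _ in range(5)]
```

In each of the 5 splits a model would be trained on `train_idx`, scored on `test_idx`, and `auc` applied to the test scores. Splitting per class keeps the rare suspicious comments represented in every test set.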
The most predictive words / features
Predictive of a
suspicious comment
Predictive of a
normal comment.
After training the model,
we applied it to
the full dataset.
We looked for
comments that Hiroko
didn’t label as
suspicious but the
algorithm did, and followed
up on them (374 of ~33K total).
Result: 7 new cases
in which a passenger
was injured were
discovered among
the comments she
missed.
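The follow-up step, finding comments that the model flags but that carry no suspicious label, can be sketched as below. The 0.5 threshold and the function name are hypothetical; the slides do not state which cutoff produced the 374 candidates.

```python
def flagged_but_unlabeled(scores, labels, threshold=0.5):
    """Indices of comments scored as suspicious but not labeled suspicious,
    ordered from the highest model score to the lowest."""
    candidates = [
        i for i, (score, label) in enumerate(zip(scores, labels))
        if label == 0 and score >= threshold
    ]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)

# Made-up model scores and labels (1 = labeled suspicious, 0 = not labeled).
scores = [0.90, 0.20, 0.70, 0.95, 0.10]
labels = [1, 0, 0, 0, 0]
to_review = flagged_but_unlabeled(scores, labels)
```

Sorting by score puts the comments the model is most confident about at the top of the list for a reporter to read first.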
Part 2: Extracting Interpretable Insights from Big Data
Understanding Documents using Topic Models
There are reasons to believe that the
genetics of an organism are likely to
shift due to the extreme changes in our
climate. To protect them, our politicians
must pass environmental legislation
that can protect our future species from
becoming extinct…
Decompose documents as a probability distribution over “topic” indices.
[Figure: bar chart on a 0-to-1 probability axis over the topics “Politics”, “Climate Change”, and “Genetics”.]
Topics in turn represent probability distributions over the unique words in your vocabulary.
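To make the two levels concrete, the sketch below mixes per-topic word distributions using a document's topic proportions, giving the probability of each word in that document. All numbers and the tiny vocabulary are invented for illustration; only the three topic names come from the slide.

```python
# Document-level distribution over topics (made-up proportions).
theta = {"Politics": 0.3, "Climate Change": 0.4, "Genetics": 0.3}

# Topic-level distributions over words (made-up probabilities).
beta = {
    "Politics": {"politicians": 0.5, "legislation": 0.5},
    "Climate Change": {"climate": 0.6, "legislation": 0.2, "species": 0.2},
    "Genetics": {"genetics": 0.7, "species": 0.3},
}

def word_distribution(theta, beta):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    probs = {}
    for topic, weight in theta.items():
        for word, p in beta[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs

word_probs = word_distribution(theta, beta)
```

A word such as "legislation" gets probability mass from more than one topic, which is how a single document can blend several themes.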
Topic Models: A Graphical Model Perspective
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)
Blei et al., 2001
[Figure: topic proportions on a 0-to-1 axis for “Politics”, “Climate Change”, and “Genetics”.]
dna: 2, obama: 1, state: 1, gene: 2,
climate: 3, government: 1, drug: 2,
pollution: 3
Bayes' Theorem
Posterior distribution: the probability of our
new model given the data.
Likelihood: given our model,
how likely is this data?
Prior: our belief about the world. In terms of
LDA, our modeling assumptions / priors.
Normalization constant: we need this
for valid probabilities, and it makes this
problem a lot harder.
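The four annotations on this slide label the terms of Bayes' theorem. Written out, with \theta standing for the model's unknowns and D for the data (both symbols are chosen here for illustration):

```latex
\underbrace{p(\theta \mid D)}_{\text{posterior}}
  = \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}} \;
          \overbrace{p(\theta)}^{\text{prior}}}
         {\underbrace{p(D)}_{\text{normalization constant}}},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
```

The integral over every possible \theta is what makes the normalization constant expensive to compute.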
Posterior Inference in LDA
GOAL: Obtain the posterior,
which means that we need to
calculate an intractable normalization term.
For LDA, this is the posterior
over the latent variables: how
much of topic k a document contains (θ)
and the topic-word assignments z.
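For reference, the standard form of the LDA posterior for a single document is sketched below. The symbols w (the observed words), \alpha (the Dirichlet prior on \theta), and N (the number of words in the document) are assumptions carried over from the usual LDA notation, since the slide's own equation is not available here.

```latex
p(\theta, z \mid w, \alpha, \beta)
  = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta
```

The denominator couples \theta and \beta inside the sum over topic assignments, so it has no closed form and has to be approximated.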
Scalable Learning & Inference in Topic Models
Analyze a subset of your total documents before updating.
Update θ, z, and β after analyzing
each mini-batch of documents.
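One common way to update after each mini-batch is the blending step used in stochastic variational inference, sketched below for the topic-word statistics. The slide does not say which scheme is used, so the step-size schedule, the parameter names, and the omission of the prior term are all assumptions; the per-document step that produces the mini-batch counts is left out.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** (-kappa) for update number t."""
    return (tau0 + t) ** (-kappa)

def minibatch_update(global_counts, batch_counts, batch_size, corpus_size, rho):
    """Blend the global topic-word statistics with a mini-batch estimate.

    The mini-batch counts are scaled up as if the whole corpus looked like
    this batch, then mixed with the old statistics using weight `rho`.
    (A prior term is omitted here to keep the sketch short.)
    """
    scale = corpus_size / batch_size
    return [
        [(1 - rho) * g + rho * scale * b for g, b in zip(g_row, b_row)]
        for g_row, b_row in zip(global_counts, batch_counts)
    ]

# Two topics, two vocabulary words, made-up counts.
global_counts = [[1.0, 1.0], [1.0, 1.0]]
batch_counts = [[2.0, 0.0], [0.0, 2.0]]
updated = minibatch_update(global_counts, batch_counts,
                           batch_size=2, corpus_size=10, rho=0.5)
```

Because the step size shrinks over time, early mini-batches move the global statistics a lot and later ones only fine-tune them.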
Please check out BNPy (Bayesian Nonparametric Python)
Open source and supports a large set of powerful Bayesian nonparametric models. Actively
maintained and highly scalable code.
git clone https://bitbucket.org/michaelchughes/bnpy-dev/
Refinery: An open source web-app for large document analyses
Daeil Kim @ New York Times
Founder of Refinery
daeil.kim@nytimes.com
Ben Swanson @ MIT Media Lab
Co-Founder of Refinery
dujiaozhu@gmail.com
Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org
Installing Refinery
3 Simple Steps to get Refinery running
1) Command → git clone https://github.com/daeilkim/refinery.git
2) Go to the root folder. Command → vagrant up
3) Open a browser and go to → 11.11.11.11:8080
Install these first! [Figure: logos of the prerequisite tools; the commands above need at least Git and Vagrant.]
A Typical Refinery Pipeline
Step 1: Upload documents
Step 2: Extract Topics from a Topic
Model
Step 3: Find a subset of documents with
topics of interest.
Step 4: Discover Interesting Phrases
A Quick Refinery Demo
Extracting NYT
articles matching the
keyword “obama” in
2013.
What themes / topics defined the Obama administration during 2013?
Future Directions: Better tools for Investigative Reporting
Step 1: Collecting & Scraping Data
Step 2: Filtering & Cleaning Data
Step 3: Extracting Insights
Great tools like DocumentCloud take care of steps 1 & 2.
Refinery focuses on extracting insights from relatively clean data.
Enterprise stories might be completed in a fraction of the time.
Part 3: Using ML to help the bottom line (non-news-related endeavors)
Training predictive models
for each part of this funnel
We’re interested in developing a meaningful, loyal relationship with our readers. Can we discover
covariates that indicate better ways to build and maintain that relationship with our audience?
Starbucks Single Copy
Using machine learning to predict the
number of actual copies we should sell
to Starbucks outlets across the nation.
Understanding international audiences
Part of expanding the New York Times internationally
will be leveraging algorithms based on topic models to help
understand reading patterns and behaviors.
Making better recommendations
Given how people read the news and some of their
demographic info, can we make better
recommendations for articles?
Even better, if they haven’t read anything, what kind
of recommendations can we make given just their
metadata?
[Figure: example reader metadata (Age: 32, State: NY, Job: Student), with the labels "read" and "recommend".]
Attract first-time users with relevant content.

ML at NYT: Assisting Journalism & Business with Logistic Regression, Topic Models

  • 1. Machine learning @ NYT Dae Il Kim - daeil.kim@nytimes.com
  • 2. Overview ● Assisting Great Journalism: The Story of Faulty Takata Airbags ○ Using Logistic Regression to help uncover suspicious comments ● Extracting insights from big data - A Bayesian perspective ○ BNPy: A fully pythonic framework for Bayesian Nonparametric Models ○ Refinery: A Locally Deployable Web App for Scalable Topic Modeling ● Using ML to help news-related non journalistic problems ○ Single Copy - Using ML to effectively predict the number of papers to print ○ Subscribers - Retention and Audience Acquisition ○ Recommendations - Using collaborative topic models for recommendations
  • 3. Part 1: The Story of Faulty Takata Airbags
  • 4. Complaints data from NHTSA complaints The Data Data contains 33,204 comments with 2219 of these painstakingly labeled as being suspicious (by Hiroko Tabuchi). A Machine Learning Approach Develop a prediction algorithm that can predict whether a comment was either suspicious or not. The algorithm will then learn from the dataset which features are representative of a suspicious comment.
  • 5. The Machine Learning Approach A sample comment. We will preprocess this data for the algorithm - NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) - LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK, FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB (NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break this into individual words (NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break this into bigrams (every two word combinations) TOKENIZE FILTER (NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in less than 5 comments (NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in less than 5 comments DATA IS READY FOR TRAINING! The data now consists of 33,204 examples with 56,191 features
  • 6. Cross-Validation. [Figure: a table with one row per CommentID, feature columns (i.e. word frequencies), and a label per row (S = Suspicious, NS = Not Suspicious).] Training set: take a subset of the data for training. Test set: after training, test on this dataset to obtain accuracy measures.
  • 7. How did we do? Experiment Setup: We hold out 25% of both the suspicious and not-suspicious comments for testing and train on the rest. We do this 5 times, creating random splits and retraining the model on each split. Performance: We obtain a very high AUC (~0.97) on our test sets. Check what we missed: These comments are potentially worth checking twice.
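The experiment setup above (hold out 25% of each class, repeat 5 times, report AUC) can be sketched as follows. This is an illustrative sketch only: `fit` and `score` are hypothetical callables standing in for whatever classifier is used (the talk uses logistic regression, which is not implemented here), and the AUC is computed with a simple pairwise comparison that is fine for small examples but slow on large datasets.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), holding out the same fraction of each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_holdout_auc(X, y, fit, score, n_repeats=5, test_fraction=0.25):
    """Retrain on a fresh random split each time and collect the test AUCs."""
    aucs = []
    for seed in range(n_repeats):
        train, test = stratified_split(y, test_fraction, seed)
        model = fit([X[i] for i in train], [y[i] for i in train])
        test_scores = [score(model, X[i]) for i in test]
        aucs.append(auc([y[i] for i in test], test_scores))
    return aucs
```

Stratifying the split matters here because only a small share of the comments are labeled suspicious; a purely random split could leave very few positives in the test set.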
  • 8. The most predictive words / features. [Figure: words predictive of a suspicious comment; words predictive of a normal comment.] After training the model, we applied it to the full dataset and looked for comments that Hiroko didn’t label as suspicious but the algorithm did, so we could follow up on them (374 / 33K total). Result: 7 new cases where a passenger was injured were discovered among the comments she missed.
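Both ideas on this slide, ranking features by their learned weight and surfacing comments the model flags but the human labeler did not, are easy to express in code. The sketch below assumes a linear model with one weight per feature and a score per comment; the function names and the 0.5 threshold are assumptions for illustration, not details from the talk.

```python
def top_features(vocabulary, weights, k=10):
    """Return (most suspicious-leaning, most normal-leaning) features by weight."""
    ranked = sorted(zip(weights, vocabulary), reverse=True)
    suspicious = [feature for _, feature in ranked[:k]]
    normal = [feature for _, feature in ranked[::-1][:k]]
    return suspicious, normal


def flag_for_review(comment_ids, labels, scores, threshold=0.5):
    """Comments labeled not suspicious (0) that the model scores at or above threshold."""
    flagged = [
        (score, cid)
        for cid, label, score in zip(comment_ids, labels, scores)
        if label == 0 and score >= threshold
    ]
    # Highest-scoring disagreements first, so reviewers see the strongest cases.
    return [cid for _, cid in sorted(flagged, reverse=True)]
```

The list returned by `flag_for_review` is a reading queue for a reporter rather than a verdict: each flagged comment still needs a human check.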
  • 9. Part 2: Extracting Interpretable Insights from Big Data
  • 10. Understanding Documents using Topic Models. Example document: “There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…” Decompose documents as a probability distribution over “topic” indices. [Figure: topic proportions on a 0 to 1 axis for “Politics”, “Climate Change”, and “Genetics”.] Topics in turn represent probability distributions over the unique words in your vocabulary.
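The two layers described on this slide (a document is a distribution over topics, and each topic is a distribution over words) combine into a probability for each word in the document. The sketch below uses toy numbers invented purely for illustration; they are not taken from the talk or from any fitted model.

```python
# Toy numbers, invented for illustration only.
doc_topics = {"politics": 0.2, "climate change": 0.5, "genetics": 0.3}
topic_words = {
    "politics": {"government": 0.6, "state": 0.4},
    "climate change": {"climate": 0.5, "pollution": 0.5},
    "genetics": {"dna": 0.5, "gene": 0.5},
}


def word_probability(word, doc_topics, topic_words):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


vocabulary = sorted({w for dist in topic_words.values() for w in dist})
probs = {w: word_probability(w, doc_topics, topic_words) for w in vocabulary}
```

Because both layers are valid probability distributions, the values in `probs` sum to 1 over the vocabulary.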
  • 11. Topic Models: A Graphical Model Perspective. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001. [Figure: topic proportions on a 0 to 1 axis for “Politics”, “Climate Change”, and “Genetics”.] Example word counts: dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3
  • 12. Bayes Theorem. Prior: our belief about the world; in terms of LDA, our modeling assumptions / priors. Normalization constant: makes this problem a lot harder, but we need it for valid probabilities. Likelihood: given our model, how likely is this data? Posterior distribution: the probability of our model given the data.
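The equation these labels annotate did not survive in the transcript. For reference, Bayes' theorem in its standard form is written below; the symbols $\mathcal{M}$ (model or latent variables) and $\mathcal{D}$ (data) are chosen here for illustration and may differ from the slide's notation.

```latex
\underbrace{p(\mathcal{M} \mid \mathcal{D})}_{\text{posterior}}
  = \frac{\overbrace{p(\mathcal{D} \mid \mathcal{M})}^{\text{likelihood}}\;
          \overbrace{p(\mathcal{M})}^{\text{prior}}}
         {\underbrace{p(\mathcal{D})}_{\text{normalization constant}}},
\qquad
p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathcal{M})\, p(\mathcal{M})\, d\mathcal{M}
```

The integral in the normalization constant ranges over every possible setting of the model, which is why it is usually the hard part.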
  • 13. Posterior Inference in LDA. GOAL: Obtain this posterior, which requires calculating an intractable term (the normalization constant). For LDA, this is the posterior over the latent variables: how much of topic k a document contains (θ) and the topic assignment of each word (z). LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001
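The slide's equation is not preserved in the transcript. As a reference point, the per-document posterior in the standard LDA formulation (Blei, Ng, and Jordan) is given below, with $w$ the observed words, $\alpha$ the Dirichlet prior on topic proportions, and $\beta$ the topic-word distributions. The notation on the original slide may differ.

```latex
p(\theta, z \mid w, \alpha, \beta)
  = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},
\qquad
p(w \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\; d\theta
```

The denominator couples $\theta$ and $\beta$ inside the sum over topic assignments, so it has no closed form; this is why approximate methods such as variational inference or sampling are used.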
  • 14. Scalable Learning & Inference in Topic Models. LDA: Latent Dirichlet Allocation (Bayesian Topic Model), Blei et al., 2001. Analyze a subset of your total documents before updating. Update θ, z, and β after analyzing each mini-batch of documents.
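The slide does not say which update rule is used. One common scheme for mini-batch topic models is stochastic variational inference, where global statistics are blended with each mini-batch's estimate using a decaying step size. The sketch below shows only that outer loop structure; the step-size schedule and its parameters `tau0` and `kappa` are assumptions, and the per-document inference of θ and z is deliberately left out.

```python
def minibatches(documents, batch_size):
    """Yield consecutive slices of the document list."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying learning rate: later mini-batches move the global state less."""
    return (tau0 + t) ** (-kappa)


def blend(global_stats, batch_estimate, rho):
    """Move the global statistics a fraction rho toward the mini-batch estimate."""
    return [(1 - rho) * g + rho * b for g, b in zip(global_stats, batch_estimate)]
```

In a full implementation, each mini-batch would first infer θ and z for its documents, turn those into an estimate of the topic-word statistics, and then call `blend` to update the global topics.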
  • 15. Please check out BNPy (Bayesian Nonparametric Python) Open source and supports a large set of powerful Bayesian nonparametric models. Actively maintained and highly scalable code. git clone https://bitbucket.org/michaelchughes/bnpy-dev/
  • 16. Refinery: An open source web-app for large document analyses Daeil Kim @ New York Times Founder of Refinery daeil.kim@nytimes.com Ben Swanson @ MIT Media Lab Co-Founder of Refinery dujiaozhu@gmail.com Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org
  • 17. Installing Refinery: 3 simple steps to get Refinery running. [Figure: prerequisites to install first.] 1) Command → git clone https://github.com/daeilkim/refinery.git 2) Go to the root folder. Command → vagrant up 3) Open a browser and go to → 11.11.11.11:8080
  • 18. A Typical Refinery Pipeline Step 1: Upload documents Step 2: Extract Topics from a Topic Model Step 3: Find a subset of documents with topics of interest. Step 4: Discover Interesting Phrases
  • 19. A Quick Refinery Demo Extracting NYT articles from keyword “obama” in 2013. What themes / topics defined the Obama administration during 2013?
  • 20. Future Directions: Better tools for Investigative Reporting. Steps: (1) Collecting & Scraping Data, (2) Filtering & Cleaning Data, (3) Extracting Insights. Great tools like DocumentCloud take care of steps 1 & 2. Refinery focuses on extracting insights from relatively clean data. Enterprise stories might be completed in a fraction of the time.
  • 21. Part 3: Using ML to help in the bottom line
  • 22. Part 3: Using ML to help in non-news related endeavors. Training predictive models for each part of this funnel. We’re interested in developing a meaningful, loyal relationship with our readers. Can we discover covariates that indicate better ways to obtain and maintain that relationship with our audience?
  • 23. Starbucks Single Copy: Using machine learning to predict the number of actual copies we should sell to Starbucks outlets across the nation.
  • 24. Understanding international audiences. Expanding the New York Times internationally will depend in part on leveraging algorithms based on topic models to help understand reading patterns and behaviors.
  • 25. Making better recommendations. Given how people read the news and some of their demographic info, can we make better recommendations for articles? Even better, if they haven’t read anything, what kind of recommendations can we make given just their metadata? [Figure: a user profile (Age: 32, State: NY, Job: Student) connected to articles by “read” and “recommend” links.] Attract first-time users with relevant content.