ML at NYT: Assisting Journalism & Business with Logistic Regression, Topic Models

Machine learning @ NYT
Dae Il Kim - daeil.kim@nytimes.com

Overview
● Assisting Great Journalism: The Story of Faulty Takata Airbags
○ Using Logistic Regression to help uncover suspicious comments
● Extracting insights from big data - A Bayesian perspective
○ BNPy: A fully pythonic framework for Bayesian Nonparametric Models
○ Refinery: A Locally Deployable Web App for Scalable Topic Modeling
● Using ML to help with news-related, non-journalistic problems
○ Single Copy - Using ML to effectively predict the number of papers to print
○ Subscribers - Retention and Audience Acquisition
○ Recommendations - Using collaborative topic models for recommendations

Part 1: The Story of Faulty Takata Airbags

Complaints data from NHTSA
The Data
Data contains 33,204 comments, with 2,219 of
these painstakingly labeled as being suspicious (by
Hiroko Tabuchi).
A Machine Learning Approach
Develop a prediction algorithm that can predict
whether a comment is suspicious or not.
The algorithm will then learn from the dataset
which features are representative of a suspicious
comment.
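As a rough illustration of how such a classifier learns which features matter, here is a minimal pure-Python logistic regression trained by batch gradient descent. The feature names, toy counts, learning rate, epoch count, and L2 penalty are all illustrative assumptions, not the actual NYT pipeline.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, lr=0.1, epochs=500, l2=0.01):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Average gradient plus a small L2 penalty on the weights.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features: counts of three phrases in each comment.
features = ["air bag", "did not deploy", "oil change"]
X = [[2, 1, 0], [1, 1, 0], [3, 2, 0], [0, 0, 2], [0, 0, 1], [1, 0, 3]]
y = [1, 1, 1, 0, 0, 0]  # 1 = suspicious, 0 = not suspicious

w, b = train_logreg(X, y)
```

After training, a positive weight marks a feature that pushes a comment toward "suspicious" and a negative weight one that pushes it toward "not suspicious", which is what makes the model's learned features easy to inspect.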

The Machine Learning Approach
A sample comment. We will preprocess this data for the algorithm
- NEW TOYOTA CAMRY LE PURCHASED JANUARY 2004 - ON FEBRUARY 25TH KEY WOULD NOT TURN (TOOK 10 - 15 MINUTES TO START IT) -
LATER WHILE PARKING, THE CAR THE STEERING LOCKED TURNING THE CAR TO THE RIGHT - THE CAR ACCELERATED AND SURGED DESPITE
DEPRESSING THE BRAKE (SAME AS ODI PEO4021) - THOUGH THE CAR BROKE A METAL FLAG POLE, DAMAGED A RETAINING WALL, AND FELL
SEVEN FEET INTO A MAJOR STREET, THE AIR BAGS DID NOT DEPLOY - CAR IS SEVERELY DAMAGED: WHEELS, TIRES, FRONT END, GAS TANK,
FRONT AXLE - DRIVER HAS A SWOLLEN AND SORE KNEE ALONG WITH SIGNIFICANT SOFT TISSUE INJURIES INCLUDING BACK PAIN *SC *JB
TOKENIZE
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Break this into individual words
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Break this into bigrams (every pair of adjacent words)
FILTER
(NEW), (TOYOTA), (CAMRY), (LE), (PURCHASED), (JANUARY), (2004), ... → Remove tokens that appear in fewer than 5 comments
(NEW TOYOTA), (TOYOTA CAMRY), (CAMRY LE), (LE PURCHASED), ... → Remove bigrams that appear in fewer than 5 comments
DATA IS READY FOR TRAINING!
The data now consists of 33,204 examples with 56,191 features
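The tokenize and filter steps can be sketched in a few lines of standard-library Python. The lowercasing, the regular expression used to split words, and the toy comments are assumptions made for illustration; only the "words plus bigrams, keep terms that appear in at least 5 comments" rule comes from the slide.

```python
import re
from collections import Counter

def tokenize(comment):
    # Lowercase and keep alphanumeric runs as unigram tokens.
    return re.findall(r"[a-z0-9]+", comment.lower())

def unigrams_and_bigrams(comment):
    words = tokenize(comment)
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def build_vocabulary(comments, min_comments=5):
    # Document frequency: the number of comments containing each term.
    doc_freq = Counter()
    for comment in comments:
        doc_freq.update(set(unigrams_and_bigrams(comment)))
    kept = sorted(t for t, n in doc_freq.items() if n >= min_comments)
    return {term: idx for idx, term in enumerate(kept)}

def vectorize(comment, vocab):
    # One row of the feature matrix: term counts in vocabulary order.
    counts = Counter(t for t in unigrams_and_bigrams(comment) if t in vocab)
    row = [0] * len(vocab)
    for term, n in counts.items():
        row[vocab[term]] = n
    return row

# Toy corpus: one phrase seen in 5 comments, another in only 2.
comments = ["AIR BAGS DID NOT DEPLOY"] * 5 + ["NEW TOYOTA CAMRY LE"] * 2
vocab = build_vocabulary(comments, min_comments=5)
row = vectorize("THE AIR BAGS DID NOT DEPLOY, AIR BAGS FAILED", vocab)
```

Here the terms from the rarer phrase are dropped, so `vocab` holds 9 terms (5 words and 4 bigrams) and `row` counts only those.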

Cross-Validation
Features (i.e., word frequencies), one row per CommentID:
0 0 0 3 1 0 2 0...
1 0 0 0 2 0 1 1...
...
1 1 5 1 2 0 0 1...
Labels, one per comment (S = Suspicious, NS = Not Suspicious):
S, NS, S, S, NS, NS, NS, NS, NS
Training set: take a subset of the data for training.
Test set: after training, test on this held-out data to obtain accuracy measures.
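A minimal sketch of splitting labeled comments into training and test sets, holding out the same fraction of each class so that the rare suspicious comments appear in both. The function name, seeds, and toy label counts are assumptions; the 25% fraction and the five random splits mirror the experiment setup described in this deck.

```python
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label][:]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 8 suspicious (S) and 40 not suspicious (NS) comments.
labels = ["S"] * 8 + ["NS"] * 40

# Five random splits, one per seed; a model would be retrained on each.
splits = [stratified_split(labels, 0.25, seed=s) for s in range(5)]
train_idx, test_idx = splits[0]
```

Each split holds out 2 suspicious and 10 non-suspicious comments, so every test set keeps the class ratio of the full data.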

How did we do?
Experiment Setup
We hold out 25% of both the
suspicious and not suspicious
comments for testing and train on
the rest. We do this 5 times, creating
random splits and retraining the
model with these splits.
Performance!
We obtain a very high AUC (~.97) on
our test sets.
Check what we missed
These comments are potentially
worth checking twice.
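For reference, AUC can be computed directly as the probability that a randomly chosen suspicious comment receives a higher score than a randomly chosen non-suspicious one. The sketch below does that with made-up scores, and also lists labeled-suspicious comments that fall below a threshold, which is one possible reading of "what we missed". The 0.5 threshold and the helper names are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def missed_positives(labels, scores, threshold=0.5):
    # Indices of labeled-suspicious comments the model scored below threshold.
    return [i for i, (y, s) in enumerate(zip(labels, scores))
            if y == 1 and s < threshold]

# Made-up test-set labels (1 = suspicious) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05, 0.6]

auc = roc_auc(y_true, y_score)
missed = missed_positives(y_true, y_score)
```

The pairwise loop is quadratic in the number of comments, which is fine for a sketch; a rank-based formula gives the same value on larger test sets.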

The most predictive words / features
Predictive of a
suspicious comment
Predictive of a
normal comment.
After training the model,
we applied it to
the full dataset.
We followed up on
comments that Hiroko
didn’t label as being
suspicious but the
algorithm did
(374 of ~33K total).
Result: 7 new cases
where a passenger
was injured were
discovered from
those comments she
missed.
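The follow-up step amounts to a simple filter: keep comments that the human did not label suspicious but that the model scores highly, and read the strongest candidates first. This is a sketch with hypothetical IDs, scores, and a 0.5 threshold; the deck does not say which cutoff was used.

```python
def flag_for_review(comment_ids, human_labels, model_scores, threshold=0.5):
    """Comments the model scores as suspicious but the human did not label so."""
    flagged = [
        (cid, score)
        for cid, label, score in zip(comment_ids, human_labels, model_scores)
        if label == 0 and score >= threshold
    ]
    # Highest-scoring first, so a reporter reads the strongest candidates first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical comment IDs, human labels (1 = suspicious), and model scores.
ids = ["c1", "c2", "c3", "c4", "c5"]
human = [1, 0, 0, 0, 1]
scores = [0.95, 0.10, 0.85, 0.60, 0.40]

queue = flag_for_review(ids, human, scores)
```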

Part 2: Extracting Interpretable Insights from Big Data

Understanding Documents using Topic Models
There are reasons to believe that the
genetics of an organism are likely to
shift due to the extreme changes in our
climate. To protect them, our politicians
must pass environmental legislation
that can protect our future species from
becoming extinct…
Decompose documents as a probability distribution over “topic” indices.
[Figure: the example document's topic probabilities (axis 0 to 1) over “Politics”, “Climate Change”, and “Genetics”]
Topics in turn represent probability distributions over the unique words in your vocabulary.
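The two layers of distributions can be made concrete with a toy example: a document is a distribution over topics, each topic is a distribution over words, and the probability of seeing a word in the document is the topic-weighted mixture. All numbers and the four-word vocabulary below are invented for illustration.

```python
# Document-level distribution over topics (sums to 1).
theta = {"Politics": 0.2, "Climate Change": 0.5, "Genetics": 0.3}

# Each topic is a distribution over the vocabulary (each row sums to 1).
beta = {
    "Politics": {"government": 0.5, "legislation": 0.4, "climate": 0.1, "gene": 0.0},
    "Climate Change": {"government": 0.1, "legislation": 0.2, "climate": 0.7, "gene": 0.0},
    "Genetics": {"government": 0.0, "legislation": 0.1, "climate": 0.1, "gene": 0.8},
}

def word_probability(word, theta, beta):
    # Mixture over topics: sum_k p(topic k | doc) * p(word | topic k).
    return sum(theta[k] * beta[k].get(word, 0.0) for k in theta)

vocabulary = ["government", "legislation", "climate", "gene"]
p_climate = word_probability("climate", theta, beta)
total = sum(word_probability(w, theta, beta) for w in vocabulary)
```

Because both layers are proper probability distributions, the word probabilities for the document also sum to 1.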

Topic Models: A Graphical Model Perspective
LDA: Latent Dirichlet Allocation (Bayesian Topic Model)
Blei et al., 2001
[Figure: topic probabilities (axis 0 to 1) over “Politics”, “Climate Change”, and “Genetics”]
Example document word counts: dna: 2, obama: 1, state: 1, gene: 2, climate: 3, government: 1, drug: 2, pollution: 3

Bayes Theorem
Prior belief about the world. In terms of
LDA, our modeling assumptions / priors.
Normalization constant makes this
problem a lot harder. We need this
for valid probabilities.
Likelihood. Given our model,
how likely is this data?
Posterior distribution. Probability of our
new model given the data.
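The four annotations above label the terms of Bayes' theorem. Written out, with $M$ standing for the model (its latent variables and parameters) and $D$ for the observed data (both symbols are chosen here for illustration):

```latex
\underbrace{p(M \mid D)}_{\text{posterior}}
  = \frac{\overbrace{p(D \mid M)}^{\text{likelihood}}\;
          \overbrace{p(M)}^{\text{prior}}}
         {\underbrace{p(D)}_{\text{normalization constant}}},
\qquad
p(D) = \int p(D \mid M)\, p(M)\, dM
```

The integral over every possible model in $p(D)$ is what makes the normalization constant hard to compute.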

Posterior Inference in LDA
GOAL: Obtain this posterior
which means that we need to
calculate this intractable term:
For LDA, this represents the posterior
over latent variables representing how
much a document contains of topic k (θ)
and topic word assignments z.
Blei et al., 2001
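For a single document with words $\mathbf{w} = (w_1, \dots, w_N)$, the posterior and its intractable denominator can be written as follows, using the standard LDA notation of Blei et al. Here $\alpha$ (the Dirichlet prior on $\theta$) and $\beta$ (the topic-word distributions) are symbols assumed from that formulation rather than spelled out on this slide.

```latex
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}
         {p(\mathbf{w} \mid \alpha, \beta)},
\qquad
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta
```

The denominator couples $\theta$ and $\beta$ inside the sum over topic assignments, so it has no closed form and must be approximated.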

Scalable Learning & Inference in Topic Models
Blei et al., 2001
Analyze a subset of your total documents before updating.
Update θ, z, and β after analyzing
each mini-batch of documents.
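The slide does not say which update rule is used. One common scheme, assumed here, is the online variational Bayes update for LDA (Hoffman, Blei and Bach), in which the global topic-word parameters are blended with a mini-batch estimate using a decaying step size. The sketch below shows only that global blending step; the per-batch expected counts would normally come from a local inference pass over the batch, and here they are made-up inputs. Function names and default values are hypothetical.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    # Decaying step size rho_t = (tau0 + t)^(-kappa).
    return (tau0 + t) ** (-kappa)

def update_topic_word_params(lam, batch_counts, batch_size, corpus_size,
                             t, eta=0.01, tau0=1.0, kappa=0.7):
    """Blend current topic-word parameters with a mini-batch estimate.

    lam[k][v]          current parameter for topic k and word v
    batch_counts[k][v] expected count of word v assigned to topic k in the
                       batch (assumed to be computed elsewhere)
    """
    rho = step_size(t, tau0, kappa)
    scale = corpus_size / batch_size  # pretend the whole corpus looks like this batch
    return [
        [(1.0 - rho) * lam_kv + rho * (eta + scale * n_kv)
         for lam_kv, n_kv in zip(lam_k, n_k)]
        for lam_k, n_k in zip(lam, batch_counts)
    ]

# One topic, two words, a batch of 10 documents out of a 100-document corpus.
lam = [[1.0, 1.0]]
batch_counts = [[2.0, 0.0]]
new_lam = update_topic_word_params(lam, batch_counts, batch_size=10,
                                   corpus_size=100, t=0)
```

With `t=0` and `tau0=1.0` the step size is 1, so the first update replaces the initial parameters entirely; later batches get smaller step sizes and move the parameters less.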

Please check out BNPy (Bayesian Nonparametric Python)
Open source and supports a large set of powerful Bayesian nonparametric models. Actively
maintained and highly scalable code.
git clone https://bitbucket.org/michaelchughes/bnpy-dev/

Refinery: An open source web-app for large document analyses
Daeil Kim @ New York Times
Founder of Refinery
daeil.kim@nytimes.com
Ben Swanson @ MIT Media Lab
Co-Founder of Refinery
dujiaozhu@gmail.com
Refinery is a 2014 Knight Prototype Fund winner. Check it out at: http://docrefinery.org

Installing Refinery
3 Simple Steps to get Refinery running
1) Command → git clone https://github.com/daeilkim/refinery.git
2) Go to the root folder. Command → vagrant up
3) Open browser and go to → 11.11.11.11:8080
Install these first!

A Typical Refinery Pipeline
Step 1: Upload documents
Step 2: Extract Topics from a Topic Model
Step 3: Find a subset of documents with topics of interest.
Step 4: Discover Interesting Phrases

A Quick Refinery Demo
Extracting NYT articles from 2013 that match the keyword “obama”.
What themes / topics defined the Obama administration during 2013?

Future Directions: Better tools for Investigative Reporting
Pipeline: (1) Collecting & Scraping Data → (2) Filtering & Cleaning Data → (3) Extracting Insights
Great tools like DocumentCloud take care of steps 1 & 2.
Refinery focuses on extracting insights from relatively clean data.
Enterprise stories might be completed in a fraction of the time.

Part 3: Using ML to help in the bottom line

Part 3: Using ML to help in non-news related endeavors
Training predictive models
for each part of this funnel
We’re interested in developing a meaningful loyal relationship with our readers. Can we discover
covariates that indicate better ways to obtain and maintain that relationship with our audience?

Starbucks Single Copy
Using machine learning to predict the
number of actual copies we should sell
to Starbucks outlets across the nation.

Understanding international audiences
Part of our ability to expand the New York Times internationally
will be to leverage algorithms based on topic models to help
understand reading patterns and behaviors.

Making better recommendations
Given how people read the news and some of their
demographic info, can we make better
recommendations for articles?
Even better, if they haven’t read anything, what kind
of recommendations can we make given just their
metadata?
Example reader metadata: Age: 32, State: NY, Job: Student
[Figure labels: read, recommend]
Attract first-time users with relevant content
