2. “Stories will emerge from stacks of financial disclosure forms,
court records, legislative hearings, officials' calendars or meeting
notes, and regulators' email messages that no one today has
time or money to mine. With a suite of reporting tools, a
journalist will be able to scan, transcribe, analyze, and visualize
the patterns in these documents.”
- Cohen, Hamilton, Turner, Computational Journalism, 2011
5. Computational Journalism: Definitions
“Broadly defined, it can involve changing how stories are
discovered, presented, aggregated, monetized, and archived.
Computation can advance journalism by drawing on innovations
in topic detection, video analysis, personalization, aggregation,
visualization, and sensemaking.”
- Cohen, Hamilton, Turner, Computational Journalism, 2011
6. Journalism & Technology: Big Data, Personalization & Automation
Shailesh Prakash, The Washington Post
8. We are now living in a world where algorithms, and the data that feed
them, adjudicate a large array of decisions in our lives: not just search
engines and personalized online news systems, but educational
evaluations, the operation of markets and political campaigns, the design
of urban public spaces, and even how social services like welfare and
public safety are managed.
…
Journalists are beginning to adapt their traditional watchdogging and
accountability functions to this new wellspring of power in society. They are
investigating algorithms in order to characterize their power and delineate
their mistakes and biases.
- Nick Diakopoulos, Algorithmic Accountability, 2015
9. Websites Vary Prices, Deals Based on Users' Information
Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012
13. Administration
Assignments
Some assignments require programming, but your writing counts for more than your code!
Final project
Code, story, or research
Course blog
http://compjournalism.com
Grading
40% assignments
40% final project
20% class participation
14. This class
• Introduction
• High dimensional data
• Text analysis in journalism
• The Document Vector Space model
• The Overview document mining platform
16. Vector representation of objects
x = [x1, x2, x3, …, xN]ᵀ
Fundamental representation for (almost) all data mining, clustering,
machine learning, visualization, NLP, etc. algorithms.
17. Interpreting High Dimensional Data
UK House of Lords voting record, 2000-2012.
N = 1043 lords by M = 1630 votes
2 = aye, 4 = nay, -9 = didn't vote
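A minimal sketch of what this representation looks like in code; the rows and values below are invented, following the slide's coding, not the real data set.

import numpy as np

# Toy version of the Lords voting matrix: each row is one lord,
# each column one division. The real data is 1043 x 1630.
# Coding from the slide: 2 = aye, 4 = nay, -9 = didn't vote.
votes = np.array([
    [2,  2, 4, -9],   # hypothetical lord 1
    [4,  4, 2,  2],   # hypothetical lord 2
    [2, -9, 4,  4],   # hypothetical lord 3
])
print(votes.shape)    # (3, 4): N lords by M votes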
18. Dimensionality reduction
Problem: vector space is high-dimensional. Up to thousands of
dimensions. The screen is two-dimensional.
We have to go from
x ∈ R^N
to much lower-dimensional points
y ∈ R^K, where K ≪ N.
Probably K = 2 or K = 3.
20. Which direction should we look from?
Principal components analysis: find a linear projection that
preserves greatest variance
Take first K eigenvectors of covariance matrix corresponding to
largest eigenvalues. This gives a K-dimensional sub-space for
projection.
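A sketch of this recipe in numpy (the data here is random, just to show the shapes; scikit-learn's PCA implements the same idea via the SVD):

import numpy as np

def pca_project(X, K=2):
    """Project rows of X onto the K directions of greatest variance."""
    Xc = X - X.mean(axis=0)             # center each dimension
    C = np.cov(Xc, rowvar=False)        # N x N covariance matrix
    evals, evecs = np.linalg.eigh(C)    # eigh: for symmetric matrices
    top = np.argsort(evals)[::-1][:K]   # indices of K largest eigenvalues
    P = evecs[:, top]                   # N x K projection matrix
    return Xc @ P                       # M x K low-dimensional points

Y = pca_project(np.random.rand(100, 50), K=2)
print(Y.shape)   # (100, 2): ready to plot on a 2-D screen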
21. PCA on House of Lords data
23. Classification and Clustering
Classification is arguably one of the most central and
generic of all our conceptual exercises. It is the foundation
not only for conceptualization, language, and speech, but
also for mathematics, statistics, and data analysis in
general.
Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to
Classification Techniques
24. Distance metric
d(x, y) ≥ 0
- distance is never negative
d(x, x) = 0
- “reflexivity”: zero distance to self
d(x, y) = d(y, x)
- “symmetry”: x to y same as y to x
d(x, z) ≤ d(x, y) + d(y, z)
- “triangle inequality”: going direct is shorter
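These four axioms are easy to check in code for a concrete metric; a sketch using the familiar Euclidean distance:

import numpy as np

def euclidean(x, y):
    """Euclidean distance: one metric satisfying all four axioms above."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

a, b, c = np.array([0, 0]), np.array([3, 4]), np.array([6, 0])
assert euclidean(a, b) >= 0                                   # non-negative
assert euclidean(a, a) == 0                                   # reflexivity
assert euclidean(a, b) == euclidean(b, a)                     # symmetry
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)   # triangle inequality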
25. Distance matrix
Data matrix for M objects of N dimensions: row i holds the N
coordinates of object xi.
X =
[ x1,1  x1,2  …  x1,N ]
[ x2,1  x2,2  …  x2,N ]
[  ⋮     ⋮    ⋱   ⋮   ]
[ xM,1  xM,2  …  xM,N ]
Distance matrix: the M × M symmetric matrix of all pairwise distances,
Dij = Dji = d(xi, xj), with zeros on the diagonal.
D =
[ d1,1  d1,2  …  d1,M ]
[ d2,1  d2,2  …  d2,M ]
[  ⋮     ⋮    ⋱   ⋮   ]
[ dM,1  dM,2  …  dM,M ]
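In practice the distance matrix is rarely built by hand; a sketch using scipy, assuming Euclidean distance and random data:

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 3)                  # M=5 objects, N=3 dimensions
D = squareform(pdist(X, metric='euclidean'))
print(D.shape)                            # (5, 5): D[i, j] = d(xi, xj)
print(np.allclose(D, D.T))                # True: symmetric, zero diagonal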
26. Different clustering algorithms
• Partitioning
o keep adjusting clusters until convergence
o e.g. K-means (see the sketch after this list)
o Also LDA and many Bayesian models, from a certain perspective
• Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches
• Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
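As a sketch of the partitioning family, K-means via scikit-learn on invented data (the cluster count is an arbitrary choice here):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)               # 200 objects, 10 dimensions
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])                    # cluster id for each object
print(km.cluster_centers_.shape)          # (5, 10): one centroid per cluster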
29. Voting clusters with parties
LDem XB Lab LDem XB Lab XB Lab Con XB
1 2 2 1 3 2 2 2 1 4
Con Con LDem Con Con Con LDem Lab Con LDem
1 1 1 1 1 1 5 2 1 1
Lab Lab Con Lab XB XB Lab XB Lab Con
2 2 1 2 3 2 2 4 2 1
Lab XB Lab Con XB XB LDem Lab XB Lab
2 3 2 1 3 1 1 2 1 2
Con Con Lab Con XB Lab Lab Con XB XB
1 5 2 1 4 2 2 1 2 1
Con XB Con Con XB Con Lab XB LDem Con
1 4 1 1 4 1 2 2 1 5
Con Con Con Lab Bp XB Lab Lab Lab LDem
1 1 1 2 3 3 2 2 2 5
Lab XB Con Lab Con XB Con Con XB XB
2 3 1 2 1 4 1 1 4 4
Con Con Lab Con Con XB Lab Lab Lab Con
1 1 2 1 1 2 2 2 2 1
Lab LDem Lab Con Lab Lab Con XB Lab Con
2 1 2 1 2 2 1 3 2 1
Con Lab XB Con XB XB XB Lab Lab Lab
1 2 2 1 2 3 4 2 2 2
30. No unique “right” clustering
Different distance metrics and clustering algorithms give different
results.
Should we sort incident reports by location, time, actor, event type,
author, cost, casualties…?
There is only context-specific categorization.
And the computer doesn’t understand your context.
32. Clustering Algorithm
Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of points.
Visualization
Input: data points (feature vectors).
Output: a picture of the points.
33. Linear projections (like PCA)
Projects in a straight line to the
closest point on the “screen.”
y = Px
where P is a K by N matrix.
Projection from 2 to 1 dimensions
34. Nonlinear projections
Still going from high-
dimensional x to low-
dimensional y, but now
y = f(x)
for some function f(), not
linear. So, may not preserve
relative distances, angles,
etc.
Fish-eye projection from 3 to 2 dimensions
35. Multidimensional scaling
Idea: try to preserve distances between points "as much as possible."
If we have the distances between all points in a distance matrix,
Dij = |xi – xj| for all i, j
we can recover the original {xi} coordinates exactly (up to rigid
transformations). Like working out a country map if you know how far
away each city is from every other.
But notice that the original dimension is not encoded in the matrix… we
can re-project to any number of dimensions.
37. MDS Stress minimization
The formula actually minimizes “stress”
Think of “springs” between every pair of points. Spring between xi, xj
has rest length dij
Stress is zero if all high-dimensional distances matched exactly in low
dimension.
stress(x) = Σij (|xi – xj| – dij)²
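A sketch using scikit-learn's MDS, which minimizes exactly this kind of stress function; the data here is random, just to show the shapes:

import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(30, 8)                     # high-dimensional points
D = squareform(pdist(X))                      # pairwise distance matrix
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
Y = mds.fit_transform(D)                      # 2-D layout minimizing stress
print(Y.shape, mds.stress_)                   # residual "spring energy"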
40. Robustness of results
Regarding these analyses of legislative voting, we could still ask:
• Are we modeling the right thing? (What about other legislative
work, e.g. in committee?)
• Are our underlying assumptions correct? (do representatives
really have “ideal points” in a preference space?)
• What are we trying to argue? What will be the effect of pointing
out this result?
46. The Post obtained draft versions of 12 audits by the inspector general’s office,
covering projects from the Caribbean to Pakistan to the Republic of Georgia
between 2011 and 2013. The drafts are confidential and rarely become public.
The Post compared the drafts with the final reports published by the
inspector general’s office and interviewed former and current employees.
E-mails and other internal records also were reviewed.
The Post tracked changes in the language that auditors used to describe
USAID and its mission offices. The analysis found that more than 400
negative references were removed from the audits between the draft and
final versions.
Sentiment analysis used by Washington Post, 2014
48. The Times analyzed Los Angeles Police Department violent crime
data from 2005 to 2012. Our analysis found that the Los Angeles
Police Department misclassified an estimated 14,000 serious assaults
as minor offenses, artificially lowering the city’s crime levels. To
conduct the analysis, The Times used an algorithm that combined
two machine learning classifiers. Each classifier read in a brief
description of the crime, which it used to determine if it was a minor
or serious assault.
An example of a minor assault reads: "VICTS AND SUSPS BECAME
INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS
IN THE FACE.”
49. We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers did
which kind of work for which sorts of
petitioners. For example, in cases where
workers sue their employers, the lawyers
most successful at getting cases before the
court were far more likely to represent the
employers than the employees.
The Echo Chamber, Reuters
52. Document vectors in journalism
- Text clustering for stories, e.g. Message Machine
- Find “key words” or “most important words”
- Topic analysis, e.g. ProPublica’s legislative tracker
- Key component of filtering algorithms, e.g. Google News
- Standard representation for document classification.
- Basis of all text search engines.
A text analysis building block.
53. What is this document "about"?
The most commonly occurring words are a pretty good indicator.
30 the
23 to
19 and
19 a
18 animal
17 cruelty
15 of
15 crimes
14 in
14 for
11 that
8 crime
7 we
55. Features = words works fine
Encode each document as the list of words it contains.
Dimensions = vocabulary of document set.
Value on each dimension = # of times word appears in document
56. Example
D1 = “I like databases”
D2 = “I hate hate databases”
Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = “term frequency”
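A sketch of building these term-frequency vectors for D1 and D2 in plain Python:

from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}
vocab = sorted({w for d in docs.values() for w in d.lower().split()})

# tf(t, d): how many times term t appears in document d
tf = {name: Counter(text.lower().split()) for name, text in docs.items()}

for name in docs:                      # each row = one document vector
    print(name, [tf[name][w] for w in vocab])
# D1 [1, 0, 1, 1]   (columns: databases, hate, i, like)
# D2 [1, 2, 1, 0]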
57. Aka “Bag of words” model
Throws out word order.
e.g. “soldiers shot civilians” and “civilians shot soldiers” encoded
identically.
58. Tokenization
The documents come to us as long strings, not individual words.
Tokenization is the process of converting the string into individual
words, or "tokens."
For this course, we will assume a very simple strategy:
o convert all letters to lowercase
o remove all punctuation characters
o separate words based on spaces
Note that this won't work at all for Chinese. It will fail in many
ways even for English. How?
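The three-step strategy above, as a sketch in Python:

import string

def tokenize(text):
    """The simple strategy from the slide: lowercase, strip punctuation,
    split on spaces. (It mishandles contractions, hyphenated words, and
    does nothing useful for unsegmented languages like Chinese.)"""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

print(tokenize("Victims and suspects became involved in a verbal argument."))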
59. Distance metric for text
Useful for:
• clustering documents
• finding docs similar to example
• matching a search query
Basic idea: look for overlapping terms
60. Cosine similarity
Given document vectors a, b define
similarity(a, b) ≡ a · b
If each word occurs exactly once in each document, equivalent
to counting overlapping words.
Note: not a distance function, as similarity increases when
documents are… similar. (What part of the definition of a
distance function is violated here?)
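As a sketch, the dot-product similarity on toy vectors:

import numpy as np

def similarity(a, b):
    """Raw dot product of two document vectors; when every word occurs
    at most once per document, this counts overlapping terms."""
    return np.dot(a, b)

a = np.array([1, 0, 1])   # toy vectors over a 3-word vocabulary
b = np.array([1, 1, 0])
print(similarity(a, b))   # 1: one shared term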
61. Problem: long documents always win
Let a = “This car runs fast.”
Let b = “My car is old. I want a new car, a shiny car”
Let query = “fast car”
    this  car  runs  fast  my  is  old  I  want  a  new  shiny
a     1    1    1     1    0   0    0   0    0   0   0     0
b     0    3    0     0    1   1    1   1    1   1   1     1
q     0    1    0     1    0   0    0   0    0   0   0     0
62. similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2
similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3
The longer document is more “similar,” simply by virtue of repeating words.
Problem: long documents always win
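The slides don't spell out the remedy at this point, but the standard fix, and the reason slide 60 is titled "cosine" similarity, is to divide the dot product by the vector lengths; a sketch:

import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the vector lengths, so repeating a word
    no longer inflates the score."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# vectors for a, b, q from the table above
a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
b = np.array([0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
q = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
print(cosine_similarity(a, q))   # ~0.71: a now beats b
print(cosine_similarity(b, q))   # ~0.51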
67. Problem: common words
We want to look at words that “discriminate” among documents.
Stopwords: if all documents contain “the,” are all documents
similar?
Common words: if most documents contain “car” then car
doesn’t tell us much about (contextual) similarity.
69. Document Frequency
Idea: de-weight common words
Common = appears in many documents
“document frequency” = fraction of docs containing
term
df(t, D) = |{d ∈ D : t ∈ d}| / |D|
71. TF-IDF
Multiply term frequency by inverse document frequency
n(t,d) = number of times term t in doc d
n(t,D) = number docs in D containing t
tfidf(t, d, D) = tf(t, d) × idf(t, D)
= n(t, d) × log(|D| / n(t, D))
72. TF-IDF depends on entire corpus
The TF-IDF vector for a document changes if we add
another document to the corpus.
TF-IDF is sensitive to context, and the context is all other documents:
tfidf(t, d, D) = tf(t, d) × idf(t, D)
if we add a document, D changes!
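A minimal sketch implementing the formula above in plain Python, using the D1/D2 example from earlier:

import math
from collections import Counter

def tfidf_vectors(docs):
    """tfidf(t, d, D) = n(t, d) * log(|D| / n(t, D)), per the slide."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))   # n(t, D)
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in tokenized]

vecs = tfidf_vectors(["I like databases", "I hate hate databases"])
print(vecs[1])   # 'hate' scores highest; 'i' and 'databases' score 0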
73. What is this document "about"?
Each document is now a vector of TF-IDF scores for every word in the
document. We can look at which words have the top scores.
crimes 0.0675591652263963
cruelty 0.0585772393867342
crime 0.0257614113616027
reporting 0.0208838148975406
animals 0.0179258756717422
michael 0.0156575858658684
category 0.0154564813388897
commit 0.0137447439653709
criminal 0.0134312894429112
societal 0.0124164973052386
trends 0.0119505837811614
conviction 0.0115699047136248
patterns 0.011248045148093
76. Salton’s description of tf-idf
- from Salton et al, A Vector Space Model for Automatic Indexing, 1975
78. Cluster Hypothesis
“documents in the same cluster behave similarly with respect to
relevance to information needs”
- Manning, Raghavan, Schütze, Introduction to Information Retrieval
Not really a precise statement – but the crucial link between human
semantics and mathematical properties.
Articulated as early as 1971, it has been shown to hold at web scale,
and it is widely assumed.
79. Bag of words + TF-IDF widely used
Practical win: good precision-recall metrics in tests with human-tagged
document sets.
Still the dominant text indexing scheme used today. (Lucene, FAST,
Google…) Many variants and extensions.
Some, but not much, theory to explain why this works. (E.g., why that
particular IDF formula? Why doesn't indexing bigrams improve
performance?)
Collectively: the vector space document model
Editor's Notes
To open:
House of lords notebook and blog post
http://www.compjournalism.com/?p=13
https://github.com/jstray/compjournalism2018/blob/master/uk-lords-votes.ipynb
Rotating projection cube
http://1.bp.blogspot.com/-pgMAHiIWvuw/Tql5HIXNdRI/AAAAAAAABLI/I2zPF5cLRwQ/s1600/clust.gif
K-means demo
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
Overview prototype
https://blog.overviewdocs.com/2012/03/16/video-document-mining-with-the-overview-prototype/