5. Barriers to structured data analysis in
the newsroom
• Expensive
• Too hard to collect
• Takes practice
• Takes patience
• Once collected, data has a short shelf life – its
value inside the newsroom effectively ends
once a story is published.
7. Solutions
• User-friendly tool for scraping websites for
structured data
• Packages of algorithms from fraud and other
forensic fields for use with public records
datasets online.
• Packages of queries and statistical tests for
money, dates, geographical identifiers, names
and codes, presented in standard English
• Tools for fuzzy matching of datasets, including
match scoring, best-match likelihood and interactive
machine learning for different datasets.
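As a rough illustration of the fuzzy matching with scoring mentioned above, here is a minimal sketch using Python's standard-library difflib. The function names (normalize, score, best_match), the 0.8 threshold and the sample names are assumptions made for this example, not part of any tool described in these slides. A production matcher would likely need field-specific rules for names, addresses and codes.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the tokens
    so that word order ("Smith, John" vs "John Smith") does not matter."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(sorted(cleaned.split()))


def score(a, b):
    """Similarity between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the highest-scoring candidate,
    or (None, score) when nothing reaches the (assumed) threshold."""
    if not candidates:
        return None, 0.0
    best_score, best = max((score(name, c), c) for c in candidates)
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Hypothetical records from two datasets that spell the same person differently.
match, match_score = best_match(
    "Smith, John A.", ["John A Smith", "Jane Smyth", "Acme Corp"]
)
print(match, round(match_score, 2))
```

Returning the score alongside the match lets a reporter review borderline pairs by hand instead of trusting the threshold blindly.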
9. Too many sources with too little news
• Twitter, Facebook, LinkedIn and other social media
• RSS feeds from other news organizations and blogs
• Press releases from government agencies or beat
subjects
Lack of archiving is just as troubling as the lack of
structure. Reporters can’t hold the powerful
accountable without information from the past.
10. Solutions
• Archiving users’ feeds locally or in the cloud
• Mash up social media and RSS feeds into an app
that reveals more insight into the sources
• Formalize each reporter’s definition of “news”
through machine learning.
• Alerts for important source material. Example:
a change in the time of a press conference.
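One way the archiving and alert ideas above could fit together: if each fetch of a feed is kept as a snapshot, comparing the latest snapshot with the previous one shows what changed. The sketch below assumes each snapshot is a dictionary keyed by an item ID, with hypothetical "time" and "location" fields. The function name diff_snapshots and the sample data are illustrative only.

```python
def diff_snapshots(previous, current, watched_fields=("time", "location")):
    """Compare two archived snapshots of a feed and list changed fields.

    Each snapshot maps an item ID to a dict of fields (an assumed layout).
    Returns a list of (item_id, field, old_value, new_value) tuples.
    """
    alerts = []
    for item_id, new_item in current.items():
        old_item = previous.get(item_id)
        if old_item is None:
            continue  # item is new, so there is no earlier version to compare
        for field in watched_fields:
            old_value = old_item.get(field)
            new_value = new_item.get(field)
            if old_value != new_value:
                alerts.append((item_id, field, old_value, new_value))
    return alerts


# Hypothetical snapshots of a government agency's event feed.
yesterday = {"pc-1": {"time": "10:00", "location": "City Hall"}}
today = {
    "pc-1": {"time": "14:30", "location": "City Hall"},
    "pc-2": {"time": "09:00", "location": "Courthouse"},
}

for item_id, field, old, new in diff_snapshots(yesterday, today):
    print(f"ALERT: {item_id} {field} changed from {old} to {new}")
```

Because the comparison depends on the earlier snapshot, it only works if the feed is archived in the first place.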
13. Solutions
• Visual extractor of data from scanned forms.
• Separate scanned boxes of documents into
their pieces for further analysis
• Use speech recognition tools on government
audio and video
• Run OCR on video to identify the speaker at a hearing
16. Our way
• Hand-enter individual items into spreadsheets
• Transcribe interviews, hearings and other audio
and video content for searching
• Read each document
A newer way
• Leverage web scraping and paid crowdsourcing
for data entry (MT)
• Use speech recognition for the first pass on
searchable audio and video
• Use clustering, information extraction and other
methods for overview of documents