SlideShare uma empresa Scribd logo
1 de 22
Reporterslab.org

Presentation for computational
     journalism students
        February 2012
STRUCTURED DATA
.. And most reporters’ inability to deal with it
New York Times reporters used Word searches and
annotations to analyze Wikileaks documents in 2010
and 2011.
PANDA project trying to help gather data inside newsrooms
Barriers to Structured data analysis in
                the newsroom
•   Expensive
•   Too hard to collect.
•   It takes practice
•   It takes patience.
•   Once collected, data has a short shelf life – its
    value inside the newsroom effectively ends
    once a story is published.
Web-scraping software:
ephemeral or too
expensive for a task not
viewed as mission-
critical.
Solutions
• User-friendly tool for scraping websites for
  structured data
• Packages of algorithms from fraud and other
  forensic fields for use with public records
  datasets online.
• Packages of queries and statistical tests for
  money, dates, geographical identifiers, names
  and codes, presented in standard English
• Tools for fuzzy matching of datasets: include
  scoring, best match likelihood, interactive
  machine learning for different datasets.
TOO MUCH MATERIAL
With too little information
Too many sources with too little news

• Twitter, Facebook, LinkedIn and other social media
• RSS feeds from other news organizations and blogs
• Press releases from government agencies or beat
  subjects

      Lack of archiving is just as troubling as the lack of
      structure. Reporters can’t hold the powerful
      accountable without information from the past.
Solutions
• Archiving users’ feeds locally or in the cloud
• Mash-up social media, rss feeds into an app
  that reveals more insight into the sources
• Formalize each reporter’s definition of “news”
  through machine learning.
• Alerts for important source material. Example:
  changing time of a press conference.
The buried treasure

UNUSABLE RECORDS
Solutions
• Visual extractor of data from scanned forms.
• Separate scanned boxes of documents into
  their pieces for further analysis
• Use speech recognition tools on government
  audio and video
• OCR video to find the speaker at a hearing
For unstructured data

ANTIQUATED METHODS
Our way                         A newer way



• Hand-enter individual items   • Leverage web scraping and
  into spreadsheets               paid crowdsourcing for data
• Transcribe                      entry (MT)
  interviews, hearings and      • Use speech recognition for
  other audio and video           the first pass on searchable
  content for searching           audio and video
• Read each document            • Use clustering, information
                                  extraction and other
                                  methods for overview of
                                  documents
Reporterslab.org working to tame
audio and video
Associated Press
project to bring order
to unstructured data
Wordseer for
historical text
Jigsaw
REPORTERSLAB.ORG
Creating sample data and documents for researchers based on real
stories

Mais conteúdo relacionado

Semelhante a Computational journalism projects

Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxAnusuya123
 
Incentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processIncentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processLouise Corti
 
Semanticnews 230913-final
Semanticnews 230913-finalSemanticnews 230913-final
Semanticnews 230913-finalDavid Newman
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypseENUG
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorialJosh Young
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM
 
Digitization in theory and practice
Digitization in theory and practiceDigitization in theory and practice
Digitization in theory and practiceHelen Nneka Okpala
 
Open minted content_provision
Open minted content_provisionOpen minted content_provision
Open minted content_provisionLucas anastasiou
 
R programming language - Mustafa Wahedi
R programming language - Mustafa WahediR programming language - Mustafa Wahedi
R programming language - Mustafa WahediUNICORNS IN TECH
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Sarah Anna Stewart
 
Advanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsAdvanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsSloan Carne
 
MPhil Lecture on Data Vis for Analysis
MPhil Lecture on Data Vis for AnalysisMPhil Lecture on Data Vis for Analysis
MPhil Lecture on Data Vis for AnalysisShawn Day
 
Change Management for Libraries
Change Management for LibrariesChange Management for Libraries
Change Management for LibrariesThomas King
 
ERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management WebinarERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management WebinarFAIRDOM
 
Crowdsourcing or bust: The Indexer, Archives NZ
Crowdsourcing or bust: The Indexer, Archives NZ Crowdsourcing or bust: The Indexer, Archives NZ
Crowdsourcing or bust: The Indexer, Archives NZ donellemckinley
 
E research africa presentation (19 nov 2014)
E research africa presentation (19 nov 2014)E research africa presentation (19 nov 2014)
E research africa presentation (19 nov 2014)Isak Van der Walt
 
Going Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of PretoriaGoing Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of PretoriaJohann van Wyk
 
Data Management for Undergraduate Researchers
Data Management for Undergraduate ResearchersData Management for Undergraduate Researchers
Data Management for Undergraduate ResearchersRebekah Cummings
 
Digital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social SciencesDigital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social SciencesShawn Day
 

Semelhante a Computational journalism projects (20)

Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
Incentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processIncentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production process
 
Semanticnews 230913-final
Semanticnews 230913-finalSemanticnews 230913-final
Semanticnews 230913-final
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech Proposals
 
Digitization in theory and practice
Digitization in theory and practiceDigitization in theory and practice
Digitization in theory and practice
 
Open minted content_provision
Open minted content_provisionOpen minted content_provision
Open minted content_provision
 
R programming language - Mustafa Wahedi
R programming language - Mustafa WahediR programming language - Mustafa Wahedi
R programming language - Mustafa Wahedi
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
 
Advanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsAdvanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU Investigators
 
co:op-READ-Convention Marburg - Günter Mühlberger
co:op-READ-Convention Marburg - Günter Mühlbergerco:op-READ-Convention Marburg - Günter Mühlberger
co:op-READ-Convention Marburg - Günter Mühlberger
 
MPhil Lecture on Data Vis for Analysis
MPhil Lecture on Data Vis for AnalysisMPhil Lecture on Data Vis for Analysis
MPhil Lecture on Data Vis for Analysis
 
Change Management for Libraries
Change Management for LibrariesChange Management for Libraries
Change Management for Libraries
 
ERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management WebinarERA CoBioTech Data Management Webinar
ERA CoBioTech Data Management Webinar
 
Crowdsourcing or bust: The Indexer, Archives NZ
Crowdsourcing or bust: The Indexer, Archives NZ Crowdsourcing or bust: The Indexer, Archives NZ
Crowdsourcing or bust: The Indexer, Archives NZ
 
E research africa presentation (19 nov 2014)
E research africa presentation (19 nov 2014)E research africa presentation (19 nov 2014)
E research africa presentation (19 nov 2014)
 
Going Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of PretoriaGoing Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of Pretoria
 
Data Management for Undergraduate Researchers
Data Management for Undergraduate ResearchersData Management for Undergraduate Researchers
Data Management for Undergraduate Researchers
 
Digital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social SciencesDigital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social Sciences
 

Último

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Computational journalism projects

  • 1. Reporterslab.org Presentation for computational journalism students February 2012
  • 2. STRUCTURED DATA .. And most reporters’ inability to deal with it
  • 3. New York Times reporters used Word searches and annotations to analyze Wikileaks documents in 2010 and 2011.
  • 4. PANDA project trying to help gather data inside newsrooms
  • 5. Barriers to Structured data analysis in the newsroom • Expensive • Too hard to collect. • It takes practice • It takes patience. • Once collected, data has a short shelf life – its value inside the newsroom effectively ends once a story is published.
  • 6. Web-scraping software: ephemeral or too expensive for a task not viewed as mission- critical.
  • 7. Solutions • User-friendly tool for scraping websites for structured data • Packages of algorithms from fraud and other forensic fields for use with public records datasets online. • Packages of queries and statistical tests for money, dates, geographical identifiers, names and codes, presented in standard English • Tools for fuzzy matching of datasets: include scoring, best match likelihood, interactive machine learning for different datasets.
  • 8. TOO MUCH MATERIAL With too little information
  • 9. Too many sources with too little news • Twitter, Facebook, LinkedIn and other social media • RSS feeds from other news organizations and blogs • Press releases from government agencies or beat subjects Lack of archiving is just as troubling as the lack of structure. Reporters can’t hold the powerful accountable without information from the past.
  • 10. Solutions • Archiving users’ feeds locally or in the cloud • Mash-up social media, rss feeds into an app that reveals more insight into the sources • Formalize each reporter’s definition of “news” through machine learning. • Alerts for important source material. Example: changing time of a press conference.
  • 12.
  • 13. Solutions • Visual extractor of data from scanned forms. • Separate scanned boxes of documents into their pieces for further analysis • Use speech recognition tools on government audio and video • OCR video to find the speaker at a hearing
  • 14.
  • 16. Our way A newer way • Hand-enter individual items • Leverage web scraping and into spreadsheets paid crowdsourcing for data • Transcribe entry (MT) interviews, hearings and • Use speech recognition for other audio and video the first pass on searchable content for searching audio and video • Read each document • Use clustering, information extraction and other methods for overview of documents
  • 17. Reporterslab.org working to tame audio and video
  • 18. Associated Press project to bring order to unstructured data
  • 21.
  • 22. REPORTERSLAB.ORG Creating sample data and documents for researchers based on real stories