Karen Cariani
AAPB Project Director, WGBH
Senior Director, WGBH Media Library & Archives
Let the Computer, and the Public, do the Metadata Work!
The Library of Congress
Packard Campus for Audio Visual Conservation
American Archive of Public Broadcasting
WGBH Educational Foundation
American Archive of Public Broadcasting
the situation
72,000 digitized television and radio programs
incomplete, inaccurate metadata records
limited staff resources
we need to know what we have in the collection
we have a responsibility to users to provide access to the collection
continued growth of the collection (content and sparse metadata)
the potential:
transforming content into data
• Computational Tools
• Speech-to-text
• Audio analysis
• Image Analysis
• Visualization of Data
How can we use them?
a crowdsourcing game
fixit.americanarchive.org
Casey Davis Kaufman
Associate Director, WGBH Media Library and Archives
Project Manager, AAPB
AV crowdsourcing precedents
TiltFactor @ Dartmouth: “Metadata Games”
New York Public Library’s Together We Listen project & Transcript Editing Tool
Netherlands Institute for Sound and Vision
user population
General public
Public media fans
K-12 students
Senior Citizens
People seeking to develop editing skills
People seeking volunteer opportunities
game pipeline
1. Identify errors
2. Suggest corrections
3. Validate corrections
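The three stages above can be sketched as a small voting model. This is a minimal illustration, not the game's actual code: the class and method names are hypothetical, and the way votes are tallied is an assumption. Only the two thresholds (at least 2 players agreeing on whether a line has an error, at least 3 agreeing on a correction) come from the speaker notes for this deck.

```python
from collections import Counter
from dataclasses import dataclass, field

# Thresholds taken from the speaker notes; everything else is illustrative.
ERROR_AGREEMENT = 2       # players who must agree on "error" / "no error"
CORRECTION_AGREEMENT = 3  # players who must agree on one correction


@dataclass
class TranscriptLine:
    """One machine-generated transcript line moving through the three games."""
    text: str
    error_votes: Counter = field(default_factory=Counter)       # True/False -> votes
    correction_votes: Counter = field(default_factory=Counter)  # suggested text -> votes

    def vote_error(self, has_error: bool) -> None:
        """Games 1 and 2: a player says the line does or does not contain an error."""
        self.error_votes[has_error] += 1

    def vote_correction(self, suggestion: str) -> None:
        """Games 2 and 3: a player suggests a correction or validates an existing one."""
        self.correction_votes[suggestion] += 1

    def status(self) -> str:
        """Report where the line currently sits in the (assumed) pipeline."""
        if self.correction_votes:
            _, top_votes = self.correction_votes.most_common(1)[0]
            if top_votes >= CORRECTION_AGREEMENT:
                return "corrected"
        if self.error_votes[False] >= ERROR_AGREEMENT:
            return "clean"
        if self.error_votes[True] >= ERROR_AGREEMENT:
            return "needs_correction"
        return "pending"

    def final_text(self) -> str:
        """Return the agreed correction if there is one, otherwise the original text."""
        if self.status() == "corrected":
            return self.correction_votes.most_common(1)[0][0]
        return self.text


line = TranscriptLine("the quick brown focks")
line.vote_error(True)
line.vote_error(True)
for _ in range(3):
    line.vote_correction("the quick brown fox")
print(line.status(), "->", line.final_text())
```

Counting a suggestion and its later validations as votes for the same text is one simple way to model agreement; the real game may track them separately.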
game improvement targets
Change algorithm and game pipeline to get transcripts through the game quicker
Update Rules page to allow more leniency in corrections. Communicate that we’re looking for acceptable corrections, not perfection.
Add ability for AAPB staff to prioritize transcripts in the game
Remove the preferences feature
Update API to help AAPB staff determine more easily which transcripts are ready to come out of the game.
lessons learned
• Ensure that all team members understand the overall goals of the project from the beginning
• Ensure that all relevant team members are involved in developing the game flow concepts and API
• Stay involved in all decision-making – don’t trust that the developers/contractors will make all the right decisions
• Test, test, test!!
once corrected…
JSON transcripts will be stored on AAPB’s Amazon S3 account
Transcripts will be indexed for keyword searching on the AAPB website
Transcripts will be made available alongside the media on the record page
Transcripts can play as captions within the player
Transcripts can be harvested via an API and used as a dataset for research such as a digital humanities project
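To make the keyword-indexing step concrete, here is a minimal sketch of an in-memory index built from transcript JSON. The JSON shape (`id`, `parts`, `start_time`, `text`), the sample text, and the function names are all assumptions for illustration; the deck does not show the real AAPB schema or search stack.

```python
import json
import re
from collections import defaultdict

# Hypothetical transcript JSON; the real AAPB schema is not shown in this deck.
SAMPLE_JSON = """
[
  {"id": "example-001",
   "parts": [
     {"start_time": 0.0, "text": "Good evening and welcome to the news."},
     {"start_time": 4.5, "text": "Tonight: the city council votes on the budget."}
   ]},
  {"id": "example-002",
   "parts": [
     {"start_time": 0.0, "text": "Welcome back to our budget special."}
   ]}
]
"""


def build_index(transcripts):
    """Map each lowercased word to the (transcript id, start time) pairs where it occurs."""
    index = defaultdict(set)
    for transcript in transcripts:
        for part in transcript["parts"]:
            for word in re.findall(r"[a-z0-9']+", part["text"].lower()):
                index[word].add((transcript["id"], part["start_time"]))
    return index


def search(index, keyword):
    """Return sorted (transcript id, start time) hits for one keyword."""
    return sorted(index.get(keyword.lower(), set()))


index = build_index(json.loads(SAMPLE_JSON))
print(search(index, "budget"))
```

Keeping the start time with each hit is one way a search result could point back to the matching moment in a recording.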
usability & ux research questions
Do users understand the workflow of the game?
Do users understand the iconography?
How do users feel about interacting with random transcripts rather than choosing a specific transcript to work on?
How do users feel about interacting with small bits of transcripts rather than a full transcript at once?
What is the overall user experience when playing the game?
What is the overall satisfaction level in playing the game?
future plans
facebook.com/amarchivepub
@amarchivepub
americanarchive.org
http://fixit.americanarchive.org
#FixItAAPB
Come to our editathon!
Friday, 5:45 – 6:45 pm
Room: Arcadian I
Treats and prizes!

Editor's Notes

  1. And we are talking about using computational tools and crowdsourcing games to increase the metadata and discoverability of digital collections.
  2. We are WGBH, Pop-Up Archive and the University of Texas at Austin School of Information. I am discussing a project generously funded by IMLS. I am Karen Cariani, Senior Director of the WGBH Media Library and Archives, and project director for the American Archive of Public Broadcasting.
  3. ... a collaboration between the Library of Congress…
  4. Home to many hours of prime time PBS programming
  5. The American Archive's goal is to preserve and make accessible significant public radio and television programs before they are lost to posterity. The American Archive is a digital archive with a website, americanarchive.org, the homepage of which you see here. Users anywhere in the U.S. can access a wide range of historical public television and radio programs from the late 1940s to the present. Our primary objective is to preserve public media and assure discoverability and access through a coordinated national effort. In doing this, we support content creators and current stewards of the materials, and facilitate the use of historical public broadcasting by researchers, educators, students, and others.
  6. As an aggregator of content, AAPB hopes to provide a centralized web portal of discovery for public media materials. The collection is growing with new additions. Access is for research, educational, and informational purposes only. Due to rights restrictions, only a portion (about 20,000 items) is available through our On-line Reading Room anywhere in the US. However, the entire collection of over 72,000 items is available for viewing on location at the Library of Congress and WGBH.
  7. As part of the initial project funded by CPB, the AAPB has 72,000 digitized TV and radio programs from about 100 stations across the country. Along with these digital files we received incomplete metadata records with very little descriptive data about the content or the program. We have limited staff resources to fully catalog the 72,000 items. We figured it would take a full time person about 32 years to watch everything, spending only 15 minutes per item cataloguing to complete the collection, all while we are adding up to 25,000 items annually. So you can do the math and figure out that even if we could afford a team of 10 people to just catalogue full time, it would still take a long time and we would barely catch up cataloguing the new acquisitions. However, we need to know what we have (it helps us determine rights and what we can make accessible), and we need to be able to make it findable for users. To do that, currently, we need to be able to expose text for search engines and indexers. So how do you transform large amounts of audio and video into something searchable for search engines and indexers? How can we transform it into a dataset?
  8. We thought this is a great opportunity for collaboration with computational tools and the computer science field, but we need to understand the capabilities of what exists. Here are some of the tools available that can help us with our dilemma. With this IMLS-funded project we are working with Pop-Up Archive to create speech-to-text transcripts of the entire collection, and with UT Austin to analyze the audio to help further identify speakers and sounds. And we will use a crowdsourcing game to help correct or fix the computer-generated transcripts, which will hopefully help further train the tools to improve.
  9. Experience has shown that most speech-to-text tools don’t output clean transcripts. Accurate transcripts are dependent on audio quality, speaker accents, background noise, etc. Given that our collection is from 100 different local TV and radio stations across the country, the audio and its quality vary widely. Some programs are in Spanish, some are musical performances, and nearly all video recordings begin with standard bars and tone. The speech-to-text tool tries to interpret these sounds as text, and it makes a number of other mistakes too. WGBH has created a web-based game to allow the public to help us fix and correct these transcripts.
  10. Before our project there were several projects creating additional metadata through crowdsourcing. A group at Dartmouth called TiltFactor had created a number of metadata games. NYPL built a transcription tool for on-line crowdsourcing of oral histories to create and edit transcripts. And the Netherlands Institute for Sound and Vision developed a social tagging game called Waisda? which asked the public to compete by tagging the same video simultaneously, awarding points for both speed and accuracy of tags. We decided to join the fray.
  11. The game has terms of use that we need players to check off, to make sure they understand that they cannot use the content for anything but helping us correct the transcripts. We’ve kept the clips very short in order to be able to take advantage of fair use.
  12. Our target audience for the game is just about everyone. Maybe not college students or early graduates.
  13. There are 3 games you can play – identify errors, suggest fixes, and validate fixes. You gain points for each action taken.
  14. As a player you can choose preferences by topic or station of choice.
  15. The more you play, the more points you get.
  16. There is a progress board to tell you how far along we are overall.
  17. And there are instructions for each game to let you know how to play.
  18. Each iteration of a game lasts 5 minutes. But you can play multiple times for any length of time. Three lines of the transcript are active at once. You listen to the audio, see the line highlighted and click on it if there is a mistake. There are instructions and guides on what is considered an error and how to mark it. It takes a little bit to figure it out, but after a few times you can pick it up pretty quickly.
  19. For game 1, this is highlighting a mistake that needs correction in this line.
  20. In game 2 you can choose to fix the error or claim it is not an error. We require at least 2 people agreeing on whether it is an error or not.
  21. In game 3 you validate a correction that someone else has made. We are requiring that at least 3 people agree to a correction.
  22. The game board keeps track of points and players, and highlights top scorers. Studies have shown that people play these games for personal satisfaction and a competition doesn’t necessarily increase the desire to play. We hope people will be driven just by the personal satisfaction of getting points and helping us out as opposed to competing against anyone in particular.
  23. There are 260 transcripts in the pipeline and more as new players add new preferences. 49 corrections have been made, but zero transcripts have been fixed.
  24. As you can see, with over 700 players, and 68,000 transcripts we have barely made a dent. There are over 15,000 errors identified across 260 transcripts. We needed to rethink our approach.
  25. We worked with our game developers and decided we needed to change the algorithm that allows transcripts to move through the pipeline. We can limit what transcripts are in the pipeline, overriding the preferences – so not all 68,000 plus are in play at the same time, but a more concentrated number, like 10. We changed the level of perfection we required to say it was fixed, making it a little more lenient. We ensured that once a phrase has been validated as not having an error it moves forward, never to go backwards. And we updated the ability for our staff to go in and manually decide which transcripts were ready to be finished.
  26. So we learned some lessons, most of which are true for any project. As with so many other technical projects where we as archivists rely on a technical team outside our own department, it is important to have the archivist voice heard. We actually did know best: we know the content, and we know what the project goal is. With the first iteration of the game it seemed that the developers lost sight of the fact that the basic goal was to correct transcripts and output them, not just play a game. We seem to be on the right track now and have relaunched the game – so please play it and give us feedback.
  27. Once the transcripts have been verified, the JSON transcripts will be stored in the AAPB’s Amazon S3 account and indexed for keyword searching on the AAPB website. The transcripts will be made available alongside the media on the record page. They can also be played like captions within the video player. And they will be able to be harvested via an API to be used as a data set for research. We are hoping that researchers will begin to look at the collection as a data set and start trying to see trends from programming over the last 60 years, particularly across news programs.
  28. In the meantime we are continuing to improve the interface based on feedback from users. We will be having an editathon session this afternoon – so please join us. There will be treats!
  29. But wait there is more!!!
  30. We plan to utilize the NYPL transcript editor tool to see if it is a more efficient way to correct transcripts and get the public engaged.
  31. And we are launching a Zooniverse project called “Roll the Credits” to help us gather data from the credit rolls – like authenticated titles, broadcast date, producer, writer, etc.
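
The staffing arithmetic in note 7 can be made explicit with a short calculation. This is a rough sketch: the function names are hypothetical, and it assumes that the cataloguing rate implied by the talk (72,000 items in about 32 person-years) stays constant and that the full 25,000 new items arrive every year.

```python
from typing import Optional

COLLECTION_SIZE = 72_000      # digitized programs (from the talk)
YEARS_FOR_ONE_PERSON = 32     # estimate for one full-time cataloguer (from the talk)
NEW_ITEMS_PER_YEAR = 25_000   # "up to" this many additions a year (from the talk)


def items_per_person_year(collection_size: int, years_for_one_person: float) -> float:
    """Cataloguing rate implied by the one-person estimate."""
    return collection_size / years_for_one_person


def years_to_clear(backlog: int, team_size: int, rate_per_person: float,
                   new_items_per_year: int) -> Optional[float]:
    """Years until the backlog is gone, or None if it never shrinks."""
    net_per_year = team_size * rate_per_person - new_items_per_year
    if net_per_year <= 0:
        return None
    return backlog / net_per_year


rate = items_per_person_year(COLLECTION_SIZE, YEARS_FOR_ONE_PERSON)
print("items per person per year:", rate)
print("team of 10, with new intake:", years_to_clear(COLLECTION_SIZE, 10, rate, NEW_ITEMS_PER_YEAR))
print("team of 10, no new intake:", years_to_clear(COLLECTION_SIZE, 10, rate, 0))
```

Under these assumptions a team of 10 catalogues about 22,500 items a year, fewer than the up to 25,000 arriving, which matches the speaker's point that even such a team would barely keep pace with new acquisitions.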