SCAPE


Audio Quality Assurance
An application of cross correlation
Jesper Sindahl Nielsen
The State and University Library & MADALGO

iPRES
Toronto, 2012


                                    This work was partially supported by the SCAPE Project.
       The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
         The State and University Library

• Large national collections
   • Radio & television:
      • More than 1,000,000 hours – approx. 1 PB of data
   • Web archive
      • More than 8 billion pages – approx. 300 TB of data
   • Upcoming newspaper digitization project
      • 32 million pages – approx. 800 TB of data
   • Many other collections of almost any kind and size
• Digital preservation challenges at large scale
   • Fits perfectly with the overall objectives of SCAPE

                 The Two Problems

• Overlap
   • Input: 2 Audio (wav) files
   • Guarantee: They overlap within the last 6 minutes
   • Output: The exact timestamp where the overlap starts
• Quality Assurance of Migration
   • Input: 2 Audio (wav) files
   • Guarantee: Playback of the same file with different players
   • Output: Whether the two files are ‘enough’ alike



                  Cross Correlation

• Main component of our solutions
• Well known technique (folklore by now)
• An audio (wav) file is just a function (signal)
   • At sample 1 we have an amplitude, f(1)
   • At sample 2 we have an amplitude, f(2)
   • ... At sample n, we have an amplitude, f(n)
• What it does: Given two functions, it computes how
  much to shift one function along the x-axis such that
  they have the highest correlation.
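
The definition above can be sketched directly in code. This is a minimal illustration, not the talk's implementation: it assumes the correlation at a shift is the plain sum of products of overlapping samples (no normalisation), and the function name `best_shift_naive`, the `max_shift` parameter, and the toy signals are made up for the example.

```python
def best_shift_naive(f, g, max_shift):
    """Return the shift s in [0, max_shift] maximising sum_i f[i + s] * g[i]."""
    best_s, best_score = 0, float("-inf")
    for s in range(max_shift + 1):
        # Only the samples where the shifted signals overlap contribute.
        n = min(len(f) - s, len(g))
        score = sum(f[i + s] * g[i] for i in range(n))
        if score > best_score:
            best_s, best_score = s, score
    return best_s


# Toy signals (hypothetical): the pulse in f starts two samples later than in g.
f = [0, 0, 1, 3, 1, 0, 0, 0]
g = [1, 3, 1, 0, 0, 0, 0, 0]
print(best_shift_naive(f, g, max_shift=5))  # prints 2
```

Every candidate shift costs one pass over the samples, which is why trying all shifts this way takes quadratic time.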

                  Cross Correlation: Example

[Figure sequence over six slides: cross correlation example; x-axis: Samples]

• The shift shown on the last slide of the example had the highest
  correlation, thus the output is '2'.

                  Cross Correlation

• Naively running this algorithm is slow
   • Running time O(n²)
• Using Fourier transforms it is much faster
   • Running time O(n log n)
• Any textbook on signal processing will describe this
  procedure.
   • A short summary can be found in the paper as well
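
A rough sketch of the Fourier-based approach, using only the Python standard library. The talk's tool relies on FFTW; the small recursive radix-2 FFT below is a stand-in for illustration, and the names `fft` and `cross_correlate_fft` are hypothetical. It uses the correlation theorem: transform both zero-padded signals, multiply one spectrum by the complex conjugate of the other, and transform back.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/n scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def cross_correlate_fft(f, g):
    """Return c with c[s] = sum_i f[i + s] * g[i] for s = 0 .. len(f) - 1."""
    # Pad to a power of two that is long enough to avoid circular wrap-around.
    size = 1
    while size < len(f) + len(g):
        size *= 2
    F = fft([complex(v) for v in f] + [0j] * (size - len(f)))
    G = fft([complex(v) for v in g] + [0j] * (size - len(g)))
    # Correlation theorem: pointwise product with the conjugate spectrum.
    product = [a * b.conjugate() for a, b in zip(F, G)]
    c = fft(product, inverse=True)
    return [v.real / size for v in c[:len(f)]]


# Toy signals (hypothetical): the pulse in f starts two samples later than in g.
c = cross_correlate_fft([0, 0, 1, 3, 1, 0, 0, 0], [1, 3, 1, 0, 0, 0, 0, 0])
best = max(range(len(c)), key=lambda s: c[s])
print(best)
```

The three transforms dominate the cost, each taking O(n log n) time, so all shifts are scored at once instead of one pass per shift.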




                The Overlap problem

• We have 15+ years of radio broadcast from DR
  (Danish Radio) on tape in 2-hour chunks.
• Recorded using 2 tape recorders.
   • Overlap occurs



• Then digitized
• Situation: Someone wants to listen to a program that
  spans two tapes (files)
   • Don’t want to listen to the same clip twice
                  The Overlap problem

• Solutions?
  • Find the longest suffix of the first file that is a prefix of the
    second file (excluding metadata etc.) – bitwise comparison.
      • Does not work. Audio files sounding the same do not necessarily
        have the same bit pattern.
  • Fingerprinting techniques
      • Seems excessive.
      • Some of them even calculate correlation as a subroutine.
  • Cross Correlation
      • It finds exactly what we want.
      • Cut out the last 6 minutes of the first file and the first 6 minutes of
        the second file, use cross correlation on the two clips.
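
The cross correlation approach to the overlap problem might look roughly like the sketch below. It works on already decoded sample lists and uses a naive correlation for brevity; the function names, the `window_seconds` parameter, and the toy data are assumptions made for illustration, not the tool's actual interface.

```python
def best_shift(f, g):
    """Shift s >= 0 maximising sum_i f[i + s] * g[i] (naive, for illustration)."""
    def score(s):
        return sum(a * b for a, b in zip(f[s:], g))
    return max(range(len(f)), key=score)


def overlap_start_seconds(first, second, sample_rate, window_seconds):
    """Estimate the time in `first` at which the recording in `second` begins."""
    window = int(window_seconds * sample_rate)
    tail = first[-window:]    # end of the first file
    head = second[:window]    # start of the second file
    shift = best_shift(tail, head)
    start_sample = len(first) - len(tail) + shift
    return start_sample / sample_rate


# Toy data (hypothetical): 4 samples per second, `second` starts with the
# last six samples of `first` and then continues with new material.
first = [0, 1, 0, -1, 2, 0, -2, 1, 3, -1, 0, 2, -3, 1, 4, -2, 1, 0, -1, 2]
second = first[14:] + [5, -4, 0, 3]
print(overlap_start_seconds(first, second, sample_rate=4, window_seconds=2.5))  # prints 3.5
```

With real files the window would be the 6 minutes guaranteed by the problem statement, and the naive scoring would be replaced by an FFT-based correlation.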
                The Overlap problem

• We implemented the procedure
  • It becomes quite simple when relying on FFT libraries
  • So it relies on FFTW (“Fastest Fourier Transform in the
    West”)
• Results
  • Has been run on approximately 3 months of broadcasts
     • Around 1000 files, took around 85 hours
     • Found errors in the collection (missing files, wrong channel, etc.; ~3%)
     • The rest has been nicely cut (91% of the 3 months)
  • 5 minutes per overlap, including cutting the files and using a
    Quality Assurance check
              The Migration Problem

• Over time file types become endangered
   • When did you last listen to a ‘real audio’ clip?
   • How many ‘gif’ images do you encounter today versus 5 or
     10 years ago?
• We still want to be able to hear/view them in fifty
  years
• Solution: Migrate to a different, more
  preservation-friendly format.
   • Real Audio → WAV files

         The Quality Assurance Problem

• How can we be sure the content is the same?
  • We need methods for doing QA for audio
     • Simple methods: Is the length the same before and after?
     • Better: extract more sophisticated features from the content.
         • Do they have the same average sound level?
     • Our suggestion: Use two different migration programs and cross
       correlate the output




        The Quality Assurance Algorithm

• Split the two output files into blocks of ~5 seconds.
• Cross correlate each pair of corresponding blocks to find the offset for that block.
• If all blocks have the same offset, within a fixed
  parameter, we conclude the two files are equal, and
  the migration went as it should
• Otherwise report an error
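
The block-wise check could be sketched as follows. This is an illustration under stated assumptions, not the tool's code: offsets are searched in a small range around zero with a naive correlation, "within a fixed parameter" is read as "within `tolerance` samples of the first block's offset", and all names and data are hypothetical. The sketch looks only at offsets, as the slide describes, and not at how strong each correlation peak is.

```python
def best_offset(a, b, max_offset):
    """Offset s in [-max_offset, max_offset] maximising sum_i a[i + s] * b[i]."""
    def score(s):
        return sum(a[i + s] * b[i] for i in range(len(b)) if 0 <= i + s < len(a))
    return max(range(-max_offset, max_offset + 1), key=score)


def files_match(x, y, block_size, max_offset, tolerance):
    """Compare two decoded sample sequences block by block."""
    offsets = []
    for start in range(0, min(len(x), len(y)) - block_size + 1, block_size):
        offsets.append(best_offset(x[start:start + block_size],
                                   y[start:start + block_size], max_offset))
    # Pass only if every block's offset is close to the first block's offset.
    return bool(offsets) and all(abs(o - offsets[0]) <= tolerance for o in offsets)


# Toy data (hypothetical): two blocks of 8 samples each.
block1 = [1, -2, 3, 0, -1, 4, -3, 2]
block2 = [2, 0, -4, 1, 3, -2, 1, -1]
x = block1 + block2
y_good = list(x)
# Simulated migration error: the second block is delayed by two samples.
y_bad = block1 + [0, 0] + block2[:6]

print(files_match(x, y_good, block_size=8, max_offset=2, tolerance=0))  # prints True
print(files_match(x, y_bad, block_size=8, max_offset=2, tolerance=0))   # prints False
```

Because each block is processed independently, only a few blocks need to be held in memory at a time, whatever the file length.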
                QA algorithm Results

• We needed a data set
  • We do not actually have any migration errors (yet!)
  • Solution: make an artificial data set
     • Some files turned into complete garbage (except header)
     • Some files had some parts replaced by garbage
     • Rest were OK
  • The data set was 70 file pairs where
     • 3 files were complete garbage
     • 5 files were partly garbage




                QA algorithm Results

• How did we do?
   • The tool reported all the introduced errors
   • It reported false positives
      • These can be removed by tweaking parameters
• How long did it take?
   • 70 files – 4 hours and 45 minutes (each file is 2 hours).
      • ~4-5 minutes per case
• How much memory does it use?
   • Almost nothing (less than 10 MB, no matter the input size)


                       Summary

• Overlap
   • On average about 5 minutes per file.
   • Seems to be very accurate
   • Can we do it faster? – integrate with SCAPE platform?
• Migration
   • On average about 5 minutes per case
   • Caught the errors we thought of
   • Which errors actually occur?
• Can we apply these ideas in other contexts?
   • Video? Images? Is it too slow for this?
                         Questions?




                          Thank you
Code can be found at:
https://github.com/openplanets/scape-xcorrsound


A blog post about xcorrSound can be found on
http://www.openplanetsfoundation.org/blogs/

                                                    22/22

Mais conteúdo relacionado

Semelhante a Audio QA Using Cross Correlation

Actors and Threads
Actors and ThreadsActors and Threads
Actors and Threadsmperham
 
Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Nikolay Savvinov
 
Gfarm presentation and thesis topic introduction
Gfarm presentation and thesis topic introductionGfarm presentation and thesis topic introduction
Gfarm presentation and thesis topic introductionChawanat Nakasan
 
[Lucas Films] Using a Perforce Proxy with Alternate Transports
[Lucas Films] Using a Perforce Proxy with Alternate Transports[Lucas Films] Using a Perforce Proxy with Alternate Transports
[Lucas Films] Using a Perforce Proxy with Alternate TransportsPerforce
 
Jpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE projectJpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE projectSCAPE Project
 
Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...
Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...
Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...jkSlidevault
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHungWei Chiu
 
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...Derek Buitenhuis
 
Real time system_performance_mon
Real time system_performance_monReal time system_performance_mon
Real time system_performance_monTomas Doran
 
Evaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasetsEvaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasetsIsrael Herraiz
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networksbalmanme
 
Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROIgor Sfiligoi
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscoverygwprice
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Paolo Negri
 
Erlang, the big switch in social games
Erlang, the big switch in social gamesErlang, the big switch in social games
Erlang, the big switch in social gamesWooga
 
Cue to You Doc Camera Presentation
Cue to You Doc Camera PresentationCue to You Doc Camera Presentation
Cue to You Doc Camera Presentationmathman314
 
More Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit JunoMore Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit JunoKota Tsuyuzaki
 

Semelhante a Audio QA Using Cross Correlation (20)

Actors and Threads
Actors and ThreadsActors and Threads
Actors and Threads
 
Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...Using the big guns: Advanced OS performance tools for troubleshooting databas...
Using the big guns: Advanced OS performance tools for troubleshooting databas...
 
Lesson 9 compression - Audio
Lesson 9 compression - AudioLesson 9 compression - Audio
Lesson 9 compression - Audio
 
A sip of Elixir
A sip of ElixirA sip of Elixir
A sip of Elixir
 
Gfarm presentation and thesis topic introduction
Gfarm presentation and thesis topic introductionGfarm presentation and thesis topic introduction
Gfarm presentation and thesis topic introduction
 
[Lucas Films] Using a Perforce Proxy with Alternate Transports
[Lucas Films] Using a Perforce Proxy with Alternate Transports[Lucas Films] Using a Perforce Proxy with Alternate Transports
[Lucas Films] Using a Perforce Proxy with Alternate Transports
 
Jpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE projectJpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE project
 
Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...
Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...
Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyze...
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
Every Solution is Wrong: Normalizing Ambiguous, Broken, and Pants-on-Head Cra...
 
Real time system_performance_mon
Real time system_performance_monReal time system_performance_mon
Real time system_performance_mon
 
Evaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasetsEvaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasets
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
 
Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYRO
 
OpenDiscovery
OpenDiscoveryOpenDiscovery
OpenDiscovery
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
 
Erlang, the big switch in social games
Erlang, the big switch in social gamesErlang, the big switch in social games
Erlang, the big switch in social games
 
Zero mq logs
Zero mq logsZero mq logs
Zero mq logs
 
Cue to You Doc Camera Presentation
Cue to You Doc Camera PresentationCue to You Doc Camera Presentation
Cue to You Doc Camera Presentation
 
More Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit JunoMore Efficient Object Replication in OpenStack Summit Juno
More Efficient Object Replication in OpenStack Summit Juno
 

Mais de SCAPE Project

SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Project
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...SCAPE Project
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...SCAPE Project
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsSCAPE Project
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulationSCAPE Project
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsSCAPE Project
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE Project
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalitySCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation WatchSCAPE Project
 

Mais de SCAPE Project (20)

C sz z6
C sz z6C sz z6
C sz z6
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven Schlarb
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 

Último

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Audio QA Using Cross Correlation

  • 1. SCAPE Audio Quality Assurance
      An application of cross correlation
      Jesper Sindahl Nielsen, The State and University Library & MADALGO
      iPRES, Toronto, 2012
      This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  • 2. The State and University Library
      • Large national collections
          • Radio & television: more than 1,000,000 hours, approx. 1 PB of data
          • Web archive: more than 8 billion pages, approx. 300 TB of data
          • Upcoming newspaper digitization project: 32 million pages, approx. 800 TB of data
          • Many other collections of almost any kind and size
      • Digital preservation challenges at large scale
          • Fits perfectly with the overall objectives of SCAPE
  • 3. The Two Problems
      • Overlap
          • Input: 2 audio (wav) files
          • Guarantee: they overlap within the last 6 minutes
          • Output: the exact timestamp where the overlap starts
      • Quality Assurance of Migration
          • Input: 2 audio (wav) files
          • Guarantee: playback of the same file with different players
          • Output: whether the two files are ‘enough’ alike
  • 4. Cross Correlation
      • Main component of our solutions
      • Well-known technique (folklore by now)
      • An audio (wav) file is just a function (signal)
          • At sample 1 we have an amplitude, f(1)
          • At sample 2 we have an amplitude, f(2)
          • ... At sample n we have an amplitude, f(n)
      • What it does: given two functions, it computes how far to shift one function along the x-axis so that the two have the highest correlation.
  • 10. Cross Correlation: Example
      • This shift had the highest correlation, thus the output is ’2’.
      • (Figure; x-axis: samples.)
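The shift search described on slides 4 and 10 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `best_shift_naive`, the `max_shift` parameter, and the sample values are invented for the example. It tries every candidate shift and keeps the one with the largest sum of products.

```python
def best_shift_naive(f, g, max_shift):
    """Return the shift k in [0, max_shift] that maximises sum_i f[i + k] * g[i]."""
    best_k, best_score = 0, float("-inf")
    for k in range(max_shift + 1):
        # Only samples where the shifted signals overlap contribute to the score.
        n = min(len(g), len(f) - k)
        score = sum(f[i + k] * g[i] for i in range(n))
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Hypothetical toy signals: g is a copy of f delayed by two samples.
f = [0, 0, 1, 3, -2, 4, 0, 0]
g = [1, 3, -2, 4]
print(best_shift_naive(f, g, max_shift=4))  # prints 2
```

Each candidate shift costs one pass over the signal, so the work grows with the product of the signal length and the number of shifts tried.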
  • 11. Cross Correlation
      • Naively running this algorithm is slow
          • Running time O(n^2)
      • Using Fourier transforms it is much faster
          • Running time O(n log n)
      • Any textbook on signal processing will describe this procedure.
          • A short summary can be found in the paper as well
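A rough sketch of the Fourier-based variant, under stated assumptions: the toy recursive radix-2 `fft` below stands in for a real FFT library, and the names `fft` and `cross_correlate_fft` are hypothetical. It relies on the correlation theorem: transform both signals, multiply one spectrum by the complex conjugate of the other, and transform back.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def cross_correlate_fft(f, g):
    """Return c with c[k] = sum_i f[i + k] * g[i] for k = 0 .. len(f) - 1 (real inputs)."""
    size = 1
    while size < len(f) + len(g):
        size *= 2  # zero-pad so that the circular correlation does not wrap around
    F = fft([complex(v) for v in f] + [0j] * (size - len(f)))
    G = fft([complex(v) for v in g] + [0j] * (size - len(g)))
    # Correlation theorem: spectrum of f times the conjugate spectrum of g.
    product = [a * b.conjugate() for a, b in zip(F, G)]
    c = fft(product, inverse=True)
    # The inverse transform above is unnormalised, so divide by the length.
    return [v.real / size for v in c[:len(f)]]


c = cross_correlate_fft([0, 0, 1, 3, -2, 4, 0, 0], [1, 3, -2, 4])
best_shift = max(range(len(c)), key=lambda k: c[k])
print(best_shift)  # prints 2
```

Three transforms of length O(n) replace the quadratic loop, which is where the O(n log n) bound comes from.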
  • 12. The Overlap problem
      • We have 15+ years of radio broadcast from DR (Danish Radio) on tape in 2-hour chunks.
          • Recorded using 2 tape recorders.
          • Overlap occurs
          • Then digitized
      • Situation: someone wants to listen to a program that spans two tapes (files)
          • They don’t want to listen to the same clip twice
  • 13. The Overlap problem
      • Solutions?
      • Bitwise comparison: find the longest suffix of the first file that is a prefix of the second file (excluding metadata etc.).
          • Does not work. Audio files that sound the same do not necessarily have the same bit pattern.
      • Fingerprinting techniques
          • Seems excessive.
          • Some of them even calculate correlation as a subroutine.
      • Cross Correlation
          • It finds exactly what we want.
          • Cut out the last 6 minutes of the first file and the first 6 minutes of the second file, then use cross correlation on the two clips.
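The clip-and-correlate idea on slide 13 might look roughly like this. Everything named here is an assumption made for illustration: `find_overlap_start`, the `window` parameter (standing in for "6 minutes" of samples), the synthetic random signal, and the sample rate. A plain quadratic search is used to keep the sketch short; it is not the authors' code.

```python
import random


def find_overlap_start(first, second, window):
    """Return the sample index in `first` at which `second` starts.

    Only the last `window` samples of `first` and the first `window`
    samples of `second` are compared.
    """
    tail = first[-window:]
    head = second[:window]
    best_k, best_score = 0, float("-inf")
    for k in range(len(tail)):
        # Align the start of `head` with position k of `tail`.
        n = min(len(tail) - k, len(head))
        score = sum(tail[k + i] * head[i] for i in range(n))
        if score > best_score:
            best_k, best_score = k, score
    return len(first) - len(tail) + best_k


# Synthetic stand-in for a broadcast split over two overlapping "tapes".
rng = random.Random(0)
broadcast = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
first = broadcast[:600]    # tape 1
second = broadcast[400:]   # tape 2 starts 200 samples before tape 1 ends
start = find_overlap_start(first, second, window=300)

sample_rate = 100          # hypothetical samples per second
print(start, start / sample_rate)  # sample index and timestamp in seconds
```

The test signal is constructed so that the second list begins at sample 400 of the first. Dividing the sample index by the sample rate gives the timestamp at which playback of the first file should stop.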
  • 14. The Overlap problem
      • We implemented the procedure
          • It becomes quite simple when relying on FFT libraries
          • So it relies on FFTW (“Fastest Fourier Transform in the West”)
      • Results
          • Has been run on approximately 3 months of broadcasts
          • Around 1000 files, took around 85 hours
          • Found errors in the collection (missing files, wrong channel, ...; ~3%)
          • The rest has been nicely cut (91% of the 3 months)
          • 5 minutes per overlap, including cutting the files and running a Quality Assurance check
  • 15. The Migration Problem
      • Over time, file types become endangered
          • When did you last listen to a ‘real audio’ clip?
          • How many ‘gif’ images do you encounter today versus 5 or 10 years ago?
      • We still want to be able to hear/view them in fifty years
      • Solution: migrate to a different, more preservation-friendly format.
          • Real Audio → WAV files
  • 16. The Quality Assurance Problem
      • How can we be sure the content is the same?
      • We need methods for doing QA for audio
          • Simple methods: is the length the same before and after?
          • Better: extract more sophisticated features from the content.
              • Do they have the same average sound level?
      • Our suggestion: use two different migration programs and cross correlate the output
  • 17. The Quality Assurance Algorithm
  • 18. The Quality Assurance Algorithm
      • Split the two output files into blocks of ~5 seconds and cross correlate each pair of corresponding blocks.
      • If all blocks have the same offset, within a fixed parameter, we conclude that the two files are equal and that the migration went as it should
      • Otherwise report an error
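The block-wise check on slide 18 can be sketched as follows. This is a simplified reading of the slide, not the published tool: the names `block_offset` and `files_match`, the parameters `block_size`, `max_shift` and `tolerance`, and the synthetic test data are all assumptions. The simulated error is a run of dropped samples, chosen because it shifts every later block by a known amount.

```python
import random


def block_offset(a, b, max_shift):
    """Return the shift k in [-max_shift, max_shift] maximising sum_i a[i + k] * b[i]."""
    best_k, best_score = 0, float("-inf")
    for k in range(-max_shift, max_shift + 1):
        lo = max(0, -k)
        hi = min(len(b), len(a) - k)
        score = sum(a[i + k] * b[i] for i in range(lo, hi))
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def files_match(a, b, block_size, max_shift, tolerance):
    """True if every block pair has the same offset as the first pair, within tolerance."""
    n_blocks = min(len(a), len(b)) // block_size
    offsets = [
        block_offset(a[j * block_size:(j + 1) * block_size],
                     b[j * block_size:(j + 1) * block_size],
                     max_shift)
        for j in range(n_blocks)
    ]
    return all(abs(k - offsets[0]) <= tolerance for k in offsets)


# Synthetic stand-ins for the outputs of two migration programs.
rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(600)]
good = list(reference)                                     # identical output
bad = reference[:250] + reference[255:] + [0.0] * 5        # 5 samples dropped mid-file

print(files_match(reference, good, block_size=100, max_shift=10, tolerance=1))
print(files_match(reference, bad, block_size=100, max_shift=10, tolerance=1))
```

Because each block is processed independently, only a pair of blocks needs to be held in memory at any time, whatever the length of the files.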
  • 19. QA algorithm Results
      • We needed a data set
          • We do not actually have any migration errors (yet!)
      • Solution: make an artificial data set
          • Some files were turned into complete garbage (except the header)
          • Some files had some parts replaced by garbage
          • The rest were OK
      • The data set was 70 file pairs, where
          • 3 files were complete garbage
          • 5 files were partly garbage
  • 20. QA algorithm Results
      • How did we do?
          • The tool reported all the introduced errors
          • It also reported false positives
              • These can be removed by tweaking parameters
      • How long did it take?
          • 70 files: 4 hours and 45 minutes (each file is 2 hours).
          • ~4-5 minutes per case
      • How much memory does it use?
          • Almost nothing (less than 10 MB, no matter the input size)
  • 21. Summary
      • Overlap
          • On average about 5 minutes per file.
          • Seems to be very accurate
          • Can we do it faster? Integrate with the SCAPE platform?
      • Migration
          • On average about 5 minutes per case
          • Caught the errors we thought of
          • Which errors actually occur?
      • Can we apply these ideas in other contexts?
          • Video? Images? Is it too slow for this?
  • 22. Questions?
      Thank you
      Code can be found at: https://github.com/openplanets/scape-xcorrsound
      A blog post about xcorrSound can be found at http://www.openplanetsfoundation.org/blogs/