SCAPE

Matchbox tool
Quality control for digital collections
Roman Graf, Research Area Future Networks and Services
Reinhold Huber-Mörk, Research Area Intelligent Vision Systems
Department Safety & Security, AIT Austrian Institute of Technology

Alexander Schindler
Department of Software Technology and Interactive Systems
Vienna University of Technology

SCAPE training event
Guimarães, Portugal, 6-7 December 2012
                                      This work was partially supported by the SCAPE Project.
         The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
                      Overview

•   Introduction
•   Matchbox Tool Description
•   Image Processing
•   Collection Samples
•   Matchbox Tool Features
•   Training Description
•   Installation Guidelines
•   Practical Exercises and Tool Analysis Results
•   Conclusion
                          Introduction
•   High storage costs
•   Update of digitized collection through an automatic scanning process
•   Use case: Find Duplicates
•   No automatic method to detect duplicates in unstructured collections
•   Lack of expertise and efficient methods for finding images in a huge
    collection
•   Need for automated solutions
•   QA is required to select between the old and the new version of an item
•   Decision support: overwrite automatically or send to human inspection
•   Image: d = 40,000 SIFT descriptors; book: n = 700 images
•   SIFT: d² = 1.6×10⁹ vector comparisons for a single pair of images
•   BoW, typical book: after clustering, n×(n - 1) = 350,000 vector comparisons
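
The comparison counts above follow from simple arithmetic. A minimal Python sketch, using only the values d = 40,000 and n = 700 quoted on the slide (the variable names are illustrative):

```python
# Values quoted on the slide.
d = 40000   # SIFT descriptors per image
n = 700     # images in a typical book

# Brute-force matching compares every descriptor of one image
# with every descriptor of the other image.
sift_comparisons_per_pair = d * d

# With BoW each image is reduced to a single histogram vector, so comparing
# all ordered image pairs of a book costs n * (n - 1) vector comparisons.
bow_comparisons_per_book = n * (n - 1)

print(sift_comparisons_per_pair)  # 1600000000
print(bow_comparisons_per_book)   # 489300
```

For n = 700 the product n×(n - 1) evaluates to 489,300, the same order of magnitude as the figure quoted on the slide. Either way, the BoW cost for a whole book is thousands of times smaller than the SIFT cost for a single image pair.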

              Matchbox Tool Description

• Tool
   • C++ (DLLs on Windows or shared objects on Linux)
• Dataset
   • Austrian National Library - Digital Book Collection (about 600,000
     books that will be digitized over the coming years)
• Main tasks
   • Overwriting existing collection items with new items
   • Image pairs can be compared within a book
• Output
   • Visual dictionary for further analysis
   • Duplicates

                                Image Processing
1.    Document feature extraction
     •     Interest keypoints - Scale Invariant Feature Transform (SIFT)
     •     Local feature descriptors (invariant to geometrical distortions)
2.    Learning visual dictionary
     •     Clustering method applied to all SIFT descriptors of all images
           using k-means algorithm
     •     Collect local descriptors in a visual dictionary using the
           Bag-of-Words (BoW) approach
3.    Create a visual histogram for each image document
4.    Detect similar images based on visual histograms and local descriptors,
      using the Structural SIMilarity (SSIM) approach:
     •     Rotate
     •     Scale
     •     Mask
     •     Overlay
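
Steps 2 and 3 can be illustrated with a small sketch. It is not Matchbox's implementation: the visual dictionary is assumed to be already learned (the k-means step is skipped), 2-D toy vectors stand in for 128-dimensional SIFT descriptors, and the Euclidean histogram distance is an assumed metric chosen for illustration.

```python
import math

def nearest_word(descriptor, dictionary):
    """Return the index of the visual word (cluster centre) closest to descriptor."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(dictionary):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bow_histogram(descriptors, dictionary):
    """Count how often each visual word occurs, normalised to sum to 1."""
    counts = [0] * len(dictionary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, dictionary)] += 1
    total = float(sum(counts)) or 1.0
    return [c / total for c in counts]

def histogram_distance(h1, h2):
    """Euclidean distance between two visual histograms (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

# Toy data: two visual words and three "pages".
dictionary = [(0.0, 0.0), (10.0, 10.0)]
page_a = [(0.1, 0.2), (9.8, 10.1), (0.3, 0.1)]
page_b = [(0.2, 0.1), (10.2, 9.9), (0.0, 0.4)]    # near-duplicate of page_a
page_c = [(9.9, 9.7), (10.3, 10.0), (9.6, 10.2)]  # different content

hist_a = bow_histogram(page_a, dictionary)
hist_b = bow_histogram(page_b, dictionary)
hist_c = bow_histogram(page_c, dictionary)

print(histogram_distance(hist_a, hist_b))  # small: duplicate candidates
print(histogram_distance(hist_a, hist_c))  # large: different pages
```

Because each page is reduced to one short vector, candidate duplicates can be found by comparing histograms, and only those candidates need the more expensive descriptor and SSIM checks.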




Matching of keypoints




Pixel wise comparison - SSIM
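
As a rough illustration of the pixel-wise comparison, the sketch below evaluates the SSIM formula once over the whole image. This is a simplification: SSIM is normally computed over local windows and averaged, and nothing here is taken from the Matchbox source. The function name and the toy pixel data are hypothetical; the constants 0.01 and 0.03 are the usual defaults from the SSIM definition.

```python
def ssim_global(x, y, dynamic_range=255.0):
    """SSIM of two equally sized grayscale images given as flat pixel lists.

    Simplification: statistics are computed over the whole image instead of
    over local sliding windows.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilising constants
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

page = [12, 200, 35, 180, 90, 250, 10, 60]
inverted = [255 - p for p in page]

print(ssim_global(page, page))      # identical images score 1 (up to rounding)
print(ssim_global(page, inverted))  # inverted image scores below 0
```

A score close to 1 indicates that two aligned images are structurally the same, which is why the image pair is rotated, scaled and overlaid before this comparison.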




Images 10 to 17 are duplicates of images 2 to 9




High similarity but no duplicates




              Matchbox Tool Features

• Reduces costs
• Improves quality
• Saves time
• Runs automatically
• Increases the efficiency of human work by giving it a particular focus
• Invariant to format, rotation, scale, translation, illumination,
  resolution, cropping, warping, distortions
• Applications: assembling collections, finding missing files, detecting duplicates,
  comparing two images independently of format (profile, pixel)


                     Training Description
• Goal: to be able to detect duplicates in digital image collections
• Outcomes of training: learn how to install Matchbox and how to set up the
  associated workflows.
• Teacher activity:
    • Tool presentation
    • Carry out a number of duplicate detection experiments
• Attendee activity: complete some workflows for
    •   Image duplicate search
    •   Content-based image comparison
    •   Customize duplicate search workflow
    •   Understand and describe outputs of different commands




                 Installation Guidelines
• Linux OS with more than 10 GB of disk space and 8 GB of RAM
• Git
• Python 2.7
• CMake
• C++ compiler
• The newest OpenCV version
• Matchbox HTTP URL: https://github.com/openplanets/scape.git or
  download the ZIP from the same page ("pc-qa-matchbox")
• The digital collection should contain at least 15 files in order to build the BoW
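
Part of this checklist can be verified before starting the build. The following is a hypothetical helper, not part of Matchbox: the executable names are assumptions, it needs Python 3.3 or later for `shutil.which`, and it does not check RAM or the OpenCV installation.

```python
import shutil

REQUIRED_TOOLS = ["git", "python2.7", "cmake"]   # executable names are assumptions
COMPILER_CANDIDATES = ["g++", "clang++", "c++"]  # any one of these is accepted
MIN_FREE_DISK_GB = 10                            # from the checklist above

def check_prerequisites(path="."):
    """Return a dict mapping each requirement to True (met) or False."""
    report = {tool: shutil.which(tool) is not None for tool in REQUIRED_TOOLS}
    report["c++ compiler"] = any(
        shutil.which(name) is not None for name in COMPILER_CANDIDATES)
    free_gb = shutil.disk_usage(path).free / float(1024 ** 3)
    report["disk space"] = free_gb > MIN_FREE_DISK_GB
    return report

if __name__ == "__main__":
    for name, ok in sorted(check_prerequisites().items()):
        print("%-14s %s" % (name, "ok" if ok else "MISSING"))
```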




                         Practical Exercises
1.   Identifying duplicate images in digital collections
     a.   Copy the digital collection to the server where Matchbox is installed. On
          Windows, use pscp, WinSCP or the web interface.
     b.   Change to the Python directory in the Matchbox source code:
          cd scape/pc-qa-matchbox/Python
     c.   sudo python2.7 ./FindDuplicates.py /home/matchbox/matchbox-data/ all
          --help
     d.   Define which step of the workflow should be executed: all, extract,
          compare, train, bowhist, clean
     e.   Optional parameters are not supported yet
     f.   Correct command sequence if not "all":
          1.   clean
          2.   extract
          3.   train
          4.   bowhist
          5.   compare
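
The required ordering of the individual steps can be enforced with a small wrapper. This is a sketch under assumptions: the command form `python2.7 ./FindDuplicates.py <directory> <step>` is taken from the slide, while `build_commands` and `WORKFLOW_ORDER` are hypothetical names introduced here.

```python
WORKFLOW_ORDER = ["clean", "extract", "train", "bowhist", "compare"]

def build_commands(collection_dir, steps=None, script="./FindDuplicates.py"):
    """Return one argv list per workflow step, sorted into the required order.

    With steps=None a single "all" invocation is returned.
    """
    if steps is None:
        return [["python2.7", script, collection_dir, "all"]]
    unknown = set(steps) - set(WORKFLOW_ORDER)
    if unknown:
        raise ValueError("unknown step(s): %s" % ", ".join(sorted(unknown)))
    ordered = [step for step in WORKFLOW_ORDER if step in steps]
    return [["python2.7", script, collection_dir, step] for step in ordered]

# Steps given in arbitrary order are emitted in the correct sequence.
commands = build_commands(
    "/home/matchbox/matchbox-data/",
    ["compare", "extract", "train", "bowhist", "clean"])
for argv in commands:
    print(" ".join(argv))
    # subprocess.check_call(argv) would run the step (requires `import subprocess`)
```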
Scenario: professional duplicate search




Scenario: find duplicates using nested commands




                  Analysis of the Tool Results

 •   [1 of 20] 1                                     [11 of 20] 11
 •   [2 of 20] 2 => [10]                             [12 of 20] 12
 •   [3 of 20] 3                                     [13 of 20] 13
 •   [4 of 20] 4                                     [14 of 20] 14
 •   [5 of 20] 5                                     [15 of 20] 15 => [7]
 •   [6 of 20] 6                                     [16 of 20] 16 => [8]
 •   [7 of 20] 7 => [15]                             [17 of 20] 17 => [9]
 •   [8 of 20] 8 => [16]                             [18 of 20] 18
 •   [9 of 20] 9 => [17]                             [19 of 20] 19
 •   [10 of 20] 10 => [2]                            [20 of 20] 20
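
Output lines of this form are easy to post-process. The sketch below assumes the line format shown above (`[i of n] image => [duplicates]`) and uses a hypothetical helper name, `parse_duplicates`; it is not part of the tool.

```python
import re

# "[2 of 20] 2 => [10]" -> image 2 with duplicates [10]; the "=> [...]" part is optional.
LINE_PATTERN = re.compile(
    r"\[(\d+) of (\d+)\]\s+(\d+)(?:\s*=>\s*\[([\d,\s]+)\])?")

def parse_duplicates(output):
    """Map each reported image number to the list of its duplicates."""
    result = {}
    for match in LINE_PATTERN.finditer(output):
        image = int(match.group(3))
        duplicates = match.group(4)
        result[image] = (
            [int(item) for item in duplicates.split(",")] if duplicates else [])
    return result

sample = """
[1 of 20] 1
[2 of 20] 2 => [10]
[7 of 20] 7 => [15]
[10 of 20] 10 => [2]
[15 of 20] 15 => [7]
"""
found = parse_duplicates(sample)

# Each duplicate pair is reported from both sides; keep it once.
pairs = sorted({tuple(sorted((image, dup)))
                for image, dups in found.items() for dup in dups})
print(pairs)  # [(2, 10), (7, 15)]
```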

Pages 3, 4, 5, 6 and their duplicates 11, 12, 13, 14 are nearly empty, so these pairs are not reported.
Comparing such a pair directly fails:

compare.exe -l 4 /root/samples/matchboxCollection/00000012.jp2.SIFTComparison.feat.xml.gz /root/samples/matchboxCollection/00000003.jp2.SIFTComparison.feat.xml.gz
OpenCV Error: Assertion failed (CV_IS_MAT(points1) && CV_IS_MAT(points2) && CV_ARE_SIZES_EQ(points1, points2)) in cvFindFundamentalMat, file /root/down/OpenCV-2.4.3/modules/calib3d/src/fundam.cpp, line 599
                             Practical Exercises
Output for collection with multiple duplicates:
=== compare images from directory /root/samples/col_multiple_dup/ ===
...loading features
...calculating distance matrix
[1 of 16] 92
[2 of 16] 85 => [77, 79, 81, 83]
[3 of 16] 82 => [78, 80, 84]
[4 of 16] 78 => [80, 82, 84]
[5 of 16] 87
[6 of 16] 89
[7 of 16] 86
[8 of 16] 88
[9 of 16] 79 => [77, 81, 83, 85]
[10 of 16] 91
[11 of 16] 90
[12 of 16] 83 => [77, 79, 81, 85]
[13 of 16] 84 => [78, 80, 82]
[14 of 16] 81 => [77, 79, 83, 85]
[15 of 16] 77 => [79, 81, 83, 85]
[16 of 16] 80 => [78, 82, 84]
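
In this output every member of a duplicate group lists all other members, so the groups can be recovered by simple deduplication. The sketch below relies on that assumption; if the reported relation were incomplete, a union-find merge would be needed instead. `group_duplicates` is a hypothetical helper name.

```python
def group_duplicates(mapping):
    """Merge an image -> duplicates mapping into sorted groups of duplicates."""
    groups = set()
    for image, duplicates in mapping.items():
        if duplicates:
            groups.add(tuple(sorted([image] + list(duplicates))))
    return sorted(groups)

# Values copied from the output above.
reported = {
    92: [], 85: [77, 79, 81, 83], 82: [78, 80, 84], 78: [80, 82, 84],
    87: [], 89: [], 86: [], 88: [], 79: [77, 81, 83, 85], 91: [], 90: [],
    83: [77, 79, 81, 85], 84: [78, 80, 82], 81: [77, 79, 83, 85],
    77: [79, 81, 83, 85], 80: [78, 82, 84],
}
groups = group_duplicates(reported)
print(groups)  # [(77, 79, 81, 83, 85), (78, 80, 82, 84)]
```

The 16 images therefore contain two groups of multiple duplicates and seven images without any duplicate.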

                           Practical Exercises
2.   Compare two images by profile information
     •   extractfeatures /home/matchbox/matchbox-data/00000001.jp2
     •   extractfeatures /home/matchbox/matchbox-data/00000002.jp2
     •   compare /home/matchbox/matchbox-data/00000001.jp2.ImageProfile.feat.xml.gz
         /home/matchbox/matchbox-data/00000002.jp2.ImageProfile.feat.xml.gz
     •   Output:
         <?xml version="1.0"?>
         <comparison>
          <task level="2" name="ImageProfile">
             <result>0.000353421</result> => high similarity
          </task>
         </comparison>

         <?xml version="1.0"?>
         <comparison>
          <task level="2" name="ImageProfile">
             <result>14.1486</result>      => low similarity
          </task>
         </comparison>
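
The result value can be read programmatically with the Python standard library. This sketch assumes the XML structure shown above, without the "=> high similarity" annotations, which are slide comments rather than tool output. `read_comparison` is a hypothetical helper name, and no decision threshold is implied.

```python
import xml.etree.ElementTree as ET

def read_comparison(xml_text):
    """Return {task name: result value} from a compare result document."""
    root = ET.fromstring(xml_text)
    return {task.get("name"): float(task.findtext("result"))
            for task in root.iter("task")}

similar = """<?xml version="1.0"?>
<comparison>
 <task level="2" name="ImageProfile">
    <result>0.000353421</result>
 </task>
</comparison>"""

different = """<?xml version="1.0"?>
<comparison>
 <task level="2" name="ImageProfile">
    <result>14.1486</result>
 </task>
</comparison>"""

# A smaller value means a higher similarity of the two image profiles.
print(read_comparison(similar)["ImageProfile"])    # 0.000353421
print(read_comparison(different)["ImageProfile"])  # 14.1486
```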
Scenario: compare image pair based on profiles




                              Practical Exercises
3.    Compare two images based on SSIM method
      • python2.7 FindDuplicates.py /root/samples/matchboxCollection/
        --img1=00000001.jp2 --img2=00000002.jp2 compareimagepair
      • Output:
=== compare image pair 00000001.jp2 00000002.jp2 from directory /samples/matchboxCollection/ ===

dir: /root/samples/matchboxCollection/
img1: /root/samples/matchboxCollection/00000001.jp2.BOWHistogram.feat.xml.gz
img2: /root/samples/matchboxCollection/00000002.jp2.BOWHistogram.feat.xml.gz

...calculating distance matrix
[1 of 2] 71         => if images are not duplicates
[1 of 2] 1 => [2] => if images are duplicates




Scenario: check duplicate pair using SSIM




                        Practical Exercises
1.   Exercise: Identifying duplicate images in digital collections
     a.   You have a collection of 20 digital documents. Write a command to search
          for duplicates in a single run
     b.   Write commands to search for duplicates using a customized workflow
     c.   Describe the outputs
2.   Exercise: Identifying multiple duplicates in a digital collection
     a.   You have a collection that contains multiple duplicates of one document. Write a
          command to detect all these duplicates
     b.   Describe outputs
3.   Exercise: Compare two images
     a.   You have analyzed a collection of 20 digital documents. Write a command to
          perform a content-based comparison of two particular documents
     b.   Describe outputs



                Conclusion

• Decision-making support for duplicate
  detection in document image collections
• An automatic approach delivers a significant
  improvement when compared to manual
  analysis
• The tool is available as Taverna components
  for easy invocation and testing
• The system ensures the quality of the digitized
  content and supports managers of libraries
  and archives with regard to long-term digital
  preservation



Thank you for your attention!



 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Matchbox tool. Quality control for digital collections – SCAPE Training event, Guimarães 2012

  • 1. SCAPE Matchbox tool: Quality control for digital collections
      Roman Graf (Research Area Future Networks and Services) and Reinhold Huber-Mörk (Research Area Intelligent Vision Systems), Department Safety & Security, AIT Austrian Institute of Technology
      Alexander Schindler, Department of Software Technology and Interactive Systems, Vienna University of Technology
      SCAPE training event, Guimarães, Portugal, 6-7 December 2012
      This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  • 2. SCAPE Overview
      • Introduction
      • Matchbox Tool Description
      • Image Processing
      • Collection Samples
      • Matchbox Tool Features
      • Training Description
      • Installation Guidelines
      • Practical Exercises and Tool Analysis Results
      • Conclusion
  • 3. SCAPE Introduction
      • High storage costs
      • Digitized collections are updated through an automatic scanning process
      • Use case: find duplicates
      • No automatic method exists for detecting duplicates in unstructured collections
      • Lack of expertise and of efficient methods for finding images in a huge collection
      • Need for automated solutions
      • QA is required to select between the old and the new version
      • Decision support: overwrite or human inspection
      • Image: d = 40,000 SIFT descriptors; book: n = 700 images
      • SIFT: d² = 1.6×10⁹ vector comparisons for a single pair of images
      • BoW, typical book: clustering, n×(n - 1) = 350,000 vector comparisons
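The cost argument on this slide can be checked with a few lines of arithmetic. The sketch below uses the slide's values for d and n; the variable names are illustrative only. Note that evaluating n×(n - 1) exactly for n = 700 gives a value of the same order of magnitude as the figure quoted on the slide, not the identical number.

```python
# Figures taken from the slide; variable names are illustrative.
descriptors_per_image = 40_000   # d: SIFT descriptors in one page image
images_per_book = 700            # n: page images in a typical book

# Brute-force SIFT matching compares every descriptor of one image with
# every descriptor of the other: d * d vector comparisons per image pair.
sift_comparisons_per_pair = descriptors_per_image ** 2

# With one BoW histogram per image, comparing every image with every other
# image costs n * (n - 1) histogram comparisons for the whole book.
bow_comparisons_per_book = images_per_book * (images_per_book - 1)

print(f"SIFT, one image pair: {sift_comparisons_per_pair:.1e}")
print(f"BoW, whole book:      {bow_comparisons_per_book:,}")
```

Comparing a whole book through BoW histograms therefore needs several thousand times fewer vector comparisons than matching the raw SIFT descriptors of a single image pair.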
  • 4. SCAPE Matchbox Tool Description
      • Tool
          • C++ (DLLs on Windows or shared objects on Linux)
      • Dataset
          • Austrian National Library - Digital Book Collection (about 600,000 books that will be digitized over the coming years)
      • Main tasks
          • Overwriting existing collection items with new items
          • Image pairs can be compared within a book
      • Output
          • Visual dictionary for further analysis
          • Duplicates
  • 5. SCAPE Image Processing
      1. Document feature extraction
          • Interest keypoints - Scale Invariant Feature Transform (SIFT)
          • Local feature descriptors (invariant to geometrical distortions)
      2. Learning visual dictionary
          • Clustering method applied to all SIFT descriptors of all images using the k-means algorithm
          • Collect local descriptors in a visual dictionary using the Bag-of-Words (BoW) algorithm
      3. Create a visual histogram for each image document
      4. Detect similar images based on visual histograms and local descriptors, using the Structural SIMilarity (SSIM) approach
          • Rotate
          • Scale
          • Mask
          • Overlaying
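Steps 2 and 3 above can be sketched in a few lines of NumPy. This is a minimal illustration, not Matchbox's implementation: random 128-dimensional vectors stand in for real SIFT descriptors, the k-means routine is a plain Lloyd iteration, and the dictionary size, function names and the Euclidean histogram distance are all assumptions made for the example.

```python
import numpy as np

def assign(points, centres):
    """Index of the nearest centre for every point (squared Euclidean distance)."""
    distances = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns k cluster centres (the visual words)."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        labels = assign(points, centres)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres

def bow_histogram(descriptors, centres):
    """Normalised count of descriptors falling on each visual word."""
    counts = np.bincount(assign(descriptors, centres), minlength=len(centres))
    return counts / counts.sum()

# Stand-in data: 3 "images" with 200 random 128-dimensional descriptors each.
rng = np.random.default_rng(42)
images = [rng.random((200, 128)) for _ in range(3)]

# Step 2: learn the visual dictionary from the descriptors of all images.
dictionary = kmeans(np.vstack(images), k=8)

# Step 3: one histogram per image.
histograms = [bow_histogram(d, dictionary) for d in images]

# An image compared with itself has histogram distance 0.
self_distance = np.linalg.norm(histograms[0] - bow_histogram(images[0], dictionary))
print(dictionary.shape, self_distance)
```

Each image is reduced from thousands of descriptors to one short vector, which is what makes comparing every image with every other image in a book affordable.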
  • 8. SCAPE Images 10 to 17 are duplicates of images 2 to 9
  • 9. SCAPE High similarity but no duplicates
  • 10. SCAPE Matchbox Tool Features
      • Reduces costs
      • Improves quality
      • Saves time
      • Works automatically
      • Increases the efficiency of human work with a particular focus
      • Invariant to format, rotation, scale, translation, illumination, resolution, cropping, warping, distortions
      • Applications: assembling collections, finding missing files, finding duplicates, comparing two images independently of format (profile, pixel)
  • 11. SCAPE Training Description
      • Goal: to be able to detect duplicates in digital image collections
      • Outcomes of training: learn how to install Matchbox and how to set up associated workflows
      • Teacher activity:
          • Tool presentation
          • Carry out a number of duplicate detection experiments
      • Attendee activity: complete some workflows for
          • Image duplicate search
          • Content-based image comparison
          • Customize duplicate search workflow
          • Understand and describe outputs of different commands
  • 12. SCAPE Installation Guidelines
      • Linux OS with more than 10 GB disk space and 8 GB RAM
      • Git
      • Python 2.7
      • CMake
      • C++ compiler
      • The newest OpenCV version
      • Matchbox HTTP URL: https://github.com/openplanets/scape.git or download the ZIP from the same page ("pc-qa-matchbox")
      • The digital collection should contain at least 15 files in order to build the BoW
  • 13. SCAPE Practical Exercises
      1. Identifying duplicate images in digital collections
          a. Move the digital collection to the server where Matchbox is installed. On Windows use pscp, WinSCP or the web interface.
          b. Change to the Python directory in the Matchbox source code: cd scape/pc-qa-matchbox/Python
          c. sudo python2.7 ./FindDuplicates.py /home/matchbox/matchbox-data/ all --help
          d. Define which step of the workflow should be executed: all, extract, compare, train, bowhist, clean
          e. Optional parameters are not supported yet
          f. Correct command sequence if not "all":
              1. clean
              2. extract
              3. train
              4. bowhist
              5. compare
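When the steps are run individually rather than with "all", the order in item f matters. The following sketch only builds the command lines in that order; it assumes the invocation pattern shown in item c (script, collection directory, step), and the helper name `build_commands` is hypothetical, not part of Matchbox.

```python
# Order required by the slide when the steps are not run with "all".
STEP_ORDER = ["clean", "extract", "train", "bowhist", "compare"]

def build_commands(collection_dir, steps=None,
                   script="./FindDuplicates.py", python="python2.7"):
    """Return one argument list per requested step, sorted into the required order.

    Hypothetical helper: it assumes the pattern
    `python2.7 ./FindDuplicates.py <collection dir> <step>` from the slide.
    """
    requested = set(STEP_ORDER if steps is None else steps)
    unknown = requested - set(STEP_ORDER)
    if unknown:
        raise ValueError(f"unknown workflow step(s): {sorted(unknown)}")
    return [[python, script, collection_dir, step]
            for step in STEP_ORDER if step in requested]

commands = build_commands("/home/matchbox/matchbox-data/")
for command in commands:
    print(" ".join(command))
```

Each argument list could then be passed to `subprocess.run(command, check=True)` from within the `scape/pc-qa-matchbox/Python` directory, so that a failing step stops the sequence.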
  • 15. SCAPE Scenario: find duplicates using nested commands
  • 16. SCAPE Analysis of the Tool Results
      [1 of 20] 1
      [2 of 20] 2 => [10]
      [3 of 20] 3
      [4 of 20] 4
      [5 of 20] 5
      [6 of 20] 6
      [7 of 20] 7 => [15]
      [8 of 20] 8 => [16]
      [9 of 20] 9 => [17]
      [10 of 20] 10 => [2]
      [11 of 20] 11
      [12 of 20] 12
      [13 of 20] 13
      [14 of 20] 14
      [15 of 20] 15 => [7]
      [16 of 20] 16 => [8]
      [17 of 20] 17 => [9]
      [18 of 20] 18
      [19 of 20] 19
      [20 of 20] 20
      Images 3, 4, 5, 6 and their associated duplicates 11, 12, 13, 14 are nearly empty pages.
      Command:
      compare.exe -l 4 /root/samples/matchboxCollection/00000012.jp2.SIFTComparison.feat.xml.gz /root/samples/matchboxCollection/00000003.jp2.SIFTComparison.feat.xml.gz
      Error:
      OpenCV Error: Assertion failed (CV_IS_MAT(points1) && CV_IS_MAT(points2) && CV_ARE_SIZES_EQ(points1, points2)) in cvFindFundamentalMat, file /root/down/OpenCV-2.4.3/modules/calib3d/src/fundam.cpp, line 599
  • 17. SCAPE Practical Exercises
      Output for a collection with multiple duplicates:
      === compare images from directory /root/samples/col_multiple_dup/ ===
      ...loading features
      ...calculating distance matrix
      [1 of 16] 92
      [2 of 16] 85 => [77, 79, 81, 83]
      [3 of 16] 82 => [78, 80, 84]
      [4 of 16] 78 => [80, 82, 84]
      [5 of 16] 87
      [6 of 16] 89
      [7 of 16] 86
      [8 of 16] 88
      [9 of 16] 79 => [77, 81, 83, 85]
      [10 of 16] 91
      [11 of 16] 90
      [12 of 16] 83 => [77, 79, 81, 85]
      [13 of 16] 84 => [78, 80, 82]
      [14 of 16] 81 => [77, 79, 83, 85]
      [15 of 16] 77 => [79, 81, 83, 85]
      [16 of 16] 80 => [78, 82, 84]
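Output like the listing above reports every duplicate relation once per image, so the same group appears several times. The sketch below is a hypothetical post-processing helper, not part of Matchbox: it parses lines of the form `[i of n] id => [ids]` with a regular expression and merges them into distinct duplicate groups. The sample text is a shortened excerpt of the slide's output.

```python
import re

# "[2 of 16] 85 => [77, 79, 81, 83]"; the "=> [...]" part is optional.
LINE = re.compile(r"\[(\d+) of (\d+)\]\s+(\d+)(?:\s*=>\s*\[([\d,\s]*)\])?")

def parse_duplicates(output):
    """Map each image id to the list of ids reported as its duplicates."""
    result = {}
    for match in LINE.finditer(output):
        image_id = int(match.group(3))
        listed = match.group(4)
        result[image_id] = [int(x) for x in listed.split(",")] if listed else []
    return result

def duplicate_groups(duplicates):
    """Collapse the per-image lists into distinct, sorted groups."""
    groups = set()
    for image_id, others in duplicates.items():
        if others:
            groups.add(frozenset([image_id, *others]))
    return sorted(sorted(group) for group in groups)

sample = """\
[2 of 16] 85 => [77, 79, 81, 83]
[3 of 16] 82 => [78, 80, 84]
[5 of 16] 87
[9 of 16] 79 => [77, 81, 83, 85]
"""

parsed = parse_duplicates(sample)
groups = duplicate_groups(parsed)
print(groups)
```

For this excerpt the two lines for images 85 and 79 describe the same five-image group, so only two groups remain after merging.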
  • 18. SCAPE Practical Exercises
      2. Compare two images by profile information
          • extractfeatures /home/matchbox/matchbox-data/00000001.jp2
          • extractfeatures /home/matchbox/matchbox-data/00000002.jp2
          • compare /home/matchbox/matchbox-data/00000001.jp2.ImageProfile.feat.xml.gz /home/matchbox/matchbox-data/00000002.jp2.ImageProfile.feat.xml.gz
          • Output for a pair with high similarity:
              <?xml version="1.0"?>
              <comparison>
                <task level="2" name="ImageProfile">
                  <result>0.000353421</result>
                </task>
              </comparison>
          • Output for a pair with low similarity:
              <?xml version="1.0"?>
              <comparison>
                <task level="2" name="ImageProfile">
                  <result>14.1486</result>
                </task>
              </comparison>
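The comparison result can be read programmatically from the XML shown above. This sketch uses Python's standard `xml.etree.ElementTree`; the function name and the cut-off of 1.0 are assumptions chosen only so that the slide's two example values fall on opposite sides, since the slides do not state a threshold.

```python
import xml.etree.ElementTree as ET

def read_result(xml_text):
    """Return (task name, distance) from a <comparison> document."""
    root = ET.fromstring(xml_text)
    task = root.find("task")
    return task.get("name"), float(task.findtext("result"))

# The two outputs shown on the slide.
high_similarity = (
    '<?xml version="1.0"?><comparison><task level="2" name="ImageProfile">'
    "<result>0.000353421</result></task></comparison>"
)
low_similarity = (
    '<?xml version="1.0"?><comparison><task level="2" name="ImageProfile">'
    "<result>14.1486</result></task></comparison>"
)

# Assumed cut-off, not from the slides: smaller values mean more similar profiles.
HYPOTHETICAL_THRESHOLD = 1.0

for xml_text in (high_similarity, low_similarity):
    name, distance = read_result(xml_text)
    verdict = "similar" if distance < HYPOTHETICAL_THRESHOLD else "different"
    print(name, distance, verdict)

first_name, first_distance = read_result(high_similarity)
second_name, second_distance = read_result(low_similarity)
```

A suitable threshold for a real collection would have to be determined from pairs whose status is already known.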
  • 19. SCAPE Scenario: compare image pair based on profiles
  • 20. SCAPE Practical Exercises
      3. Compare two images based on the SSIM method
          • python2.7 FindDuplicates.py /root/samples/matchboxCollection/ --img1=00000001.jp2 --img2=00000002.jp2 compareimagepair
          • Output:
              === compare image pair 00000001.jp2 00000002.jp2 from directory /samples/matchboxCollection/ ===
              dir: /root/samples/matchboxCollection/
              img1: /root/samples/matchboxCollection/00000001.jp2.BOWHistogram.feat.xml.gz
              img2: /root/samples/matchboxCollection/00000002.jp2.BOWHistogram.feat.xml.gz
              ...calculating distance matrix
              [1 of 2] 71          (if the images are not duplicates)
              [1 of 2] 1 => [2]    (if the images are duplicates)
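For readers unfamiliar with SSIM, the index itself can be illustrated with NumPy. The sketch below evaluates the standard SSIM formula once over the whole image, which is a simplification: the usual definition averages the index over local windows, and nothing here is taken from Matchbox's own code. The function name and the random test images are assumptions for the example.

```python
import numpy as np

def global_ssim(x, y, dynamic_range=255.0):
    """SSIM formula evaluated over the whole image as a single window."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * dynamic_range) ** 2   # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilises the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Stand-in 8-bit grayscale images.
rng = np.random.default_rng(0)
page = rng.integers(0, 256, size=(32, 32))
inverted = 255 - page

same_score = global_ssim(page, page)
inverted_score = global_ssim(page, inverted)
print(same_score, inverted_score)
```

An image compared with itself scores 1 up to rounding error, while the inverted copy scores far lower because its structure is anti-correlated with the original.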
  • 21. SCAPE Scenario: check duplicate pair using SSIM
  • 22. SCAPE Practical Exercises
      1. Exercise: Identifying duplicate images in digital collections
          a. You have a collection of 20 digital documents. Write a command to search for duplicates in a single run
          b. Write commands to search for duplicates using a customized workflow
          c. Describe the outputs
      2. Exercise: Identifying multiple duplicates in a digital collection
          a. You have a collection that contains multiple duplicates of one document. Write a command to detect all these duplicates
          b. Describe the outputs
      3. Exercise: Compare two images
          a. You have analyzed a collection of 20 digital documents. Write a command to perform a content-based comparison of two particular documents
          b. Describe the outputs
  • 23. SCAPE Conclusion
      • Decision-making support for duplicate detection in document image collections
      • An automatic approach delivers a significant improvement when compared to manual analysis
      • The tool is available as Taverna components for easy invocation and testing
      • The system ensures the quality of the digitized content and supports managers of libraries and archives with regard to long-term digital preservation
  • 24. SCAPE Thank you for your attention!