This document discusses quality assurance for digital image collections in preservation workflows. It presents a keypoint-based approach for comparing document images that is robust to scaling, rotation and other transformations. Keypoints are detected across images and matched, and structural similarity is evaluated. The method was tested on collections from the Dunhuang manuscripts and Google books, achieving high accuracy in identifying identical and similar image pairs. The goal is to integrate this approach into digital preservation platforms to automate quality control of large image collections.
Quality assurance for document image collections in digital preservation
1. Quality assurance for document image collections in digital preservation
Reinhold Huber-Mörk (1) & Alexander Schindler (1, 2)
(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
(2) Department of Software Technology and Interactive Systems, Vienna University of Technology
2. Overview
Digital preservation
Quality assurance in digital image preservation workflows
Keypoint based approach for image content comparison
Spatially distinctive keypoints
Document image preprocessing
Structural similarity assessment
Results on real-world data sets
12.09.2013
3. Digital preservation
"Set of processes, activities and management of digital information over time to ensure its long term accessibility" (Source: Wikipedia)
Physical damage of digital/digitized content, e.g. "bit rot" related to some storage media
Digital obsolescence of hardware/software, e.g. vanishing file formats
Content modification in preservation, e.g. error injection during file format conversion, digital manipulation, reacquisition, ...
Images provided by historical newspaper collection / The British Library
4. Quality assurance in digital image preservation workflows
Automated preservation workflows are common in large digitization projects (e.g. museum collections, Google books PPPs, ...).
Automated quality assurance ensures file format consistency, detects duplicates, and verifies quality and content preservation.
SCAPE FP7
5. Keypoint based approach for document comparison
Local features are detected & described by the standard LoG/SIFT approach
Scaling, rotation, cropping and additional/missing content are handled
An affine transformation is sufficient (usually no perspective, bending etc.)
Images provided by historical newspaper collection / The British Library
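The affine model mentioned on this slide has six parameters, so three non-collinear point correspondences determine it exactly. The following is a minimal sketch of that idea in plain Python, not the authors' implementation; the helper names `solve3`, `estimate_affine` and `apply_affine` and the sample points are hypothetical.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a * x = b with Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("source points are collinear")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]          # replace one column by the right-hand side
        solution.append(det(m) / d)
    return solution

def estimate_affine(src, dst):
    """Affine map from three (x, y) source points to three target points.

    Returns two rows [a, b, tx] and [c, d, ty] such that
    x' = a*x + b*y + tx and y' = c*x + d*y + ty.
    """
    a = [[x, y, 1.0] for x, y in src]
    row_x = solve3(a, [p[0] for p in dst])
    row_y = solve3(a, [p[1] for p in dst])
    return row_x, row_y

def apply_affine(model, point):
    (a, b, tx), (c, d, ty) = model
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

# Hypothetical correspondences: scale by 2, then shift by (10, 5).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(10.0, 5.0), (12.0, 5.0), (10.0, 7.0)]
model = estimate_affine(src, dst)
mapped = apply_affine(model, (1.0, 1.0))
```

In a RANSAC setting, a three-point estimate like this would be computed repeatedly from random match triples and scored by how many other matches it maps consistently.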
6. Spatially distinctive keypoints (SDKs) (1)
High-resolution document scans contain a large number of keypoints (e.g. ~50,000 keypoints on ~5000x3000 pixel images)
Matching of descriptors results in high computational complexity
Changing the detector edge/peak thresholds often results in a spatially uneven distribution of keypoints
One solution is dense/regular spatial sampling of keypoints
Another solution is adaptive non-maximal suppression (Brown et al., 2005)
Our solution is to enforce keypoint selection at positions locally adjacent to spatially uniformly distributed interest regions
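To give a feeling for the complexity argument, the short calculation below counts descriptor comparisons under the assumption of brute-force all-pairs matching, which the slides do not state explicitly. The figure of 2048 keypoints is taken from the SDK evaluation later in the deck; the function name is hypothetical.

```python
def brute_force_comparisons(n_a, n_b):
    """Descriptor distance evaluations when every keypoint in image A
    is compared with every keypoint in image B (assumed matching scheme)."""
    return n_a * n_b

full = brute_force_comparisons(50_000, 50_000)   # all keypoints per image
reduced = brute_force_comparisons(2048, 2048)    # reduced keypoint set
speedup = full / reduced                         # roughly a factor of 600
```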
7. Spatially distinctive keypoints (SDKs) (2)
Interest regions are distributed over the image using a regular grid
Keypoints with the highest saliency are selected from each interest region (Harris & Stephens corner strength is used as the saliency measure)
Images provided by International Dunhuang Project / The British Library
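The grid-based selection described on this slide can be sketched as follows. This is an illustrative simplification that treats each grid cell as one interest region and assumes the saliency value (e.g. a corner strength) is already attached to every keypoint; `select_sdks` and the sample data are hypothetical.

```python
def select_sdks(keypoints, width, height, grid_cols, grid_rows, per_cell=1):
    """Keep the most salient keypoints in each cell of a regular grid.

    keypoints: list of (x, y, saliency) tuples.
    Returns the selected keypoints, ordered by grid cell (row, column).
    """
    cell_w = width / grid_cols
    cell_h = height / grid_rows
    cells = {}
    for x, y, saliency in keypoints:
        # Clamp so that points on the right/bottom border fall in the last cell.
        col = min(int(x / cell_w), grid_cols - 1)
        row = min(int(y / cell_h), grid_rows - 1)
        cells.setdefault((row, col), []).append((x, y, saliency))
    selected = []
    for key in sorted(cells):
        ranked = sorted(cells[key], key=lambda k: k[2], reverse=True)
        selected.extend(ranked[:per_cell])
    return selected

# Hypothetical keypoints on a 200x200 image, split into a 2x2 grid.
keypoints = [(10, 10, 0.9), (20, 15, 0.4), (150, 20, 0.2),
             (160, 30, 0.7), (30, 170, 0.5)]
sdks = select_sdks(keypoints, width=200, height=200, grid_cols=2, grid_rows=2)
```

Cells without any keypoint simply contribute nothing, so the number of selected keypoints is bounded by `grid_cols * grid_rows * per_cell`.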
8. Evaluation SDK (1)
Mean SSIM vs. number of SDKs (evaluated on 1560 Dunhuang image pairs)
Panels: #SDKs=64, #SDKs=256, #SDKs=512, #SDKs=1024, #SDKs=2048, all keypoints
10. Robust symmetric matching
RANSAC constrained by an affine transformation
Only significant matches are accepted, based on the distance ratio of the best and second-best match
One-to-one matching of descriptors is enforced; ambiguous matches are ignored
Images provided by International Dunhuang Project / The British Library
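The ratio test and the one-to-one constraint from this slide can be illustrated with a small sketch. It uses Euclidean distance on toy 2-D descriptors and a ratio of 0.8, both of which are assumptions for illustration; the RANSAC step is left out, and all function names are hypothetical.

```python
import math

def best_two(query, candidates):
    """Nearest and second-nearest candidate as (distance, index) pairs.
    Requires at least two candidates."""
    dists = sorted((math.dist(query, c), i) for i, c in enumerate(candidates))
    return dists[0], dists[1]

def ratio_matches(desc_a, desc_b, ratio=0.8):
    """Map index in desc_a -> index in desc_b for matches that pass the ratio test."""
    matches = {}
    for i, d in enumerate(desc_a):
        (d1, j), (d2, _) = best_two(d, desc_b)
        if d1 < ratio * d2:           # best match must be clearly better than runner-up
            matches[i] = j
    return matches

def symmetric_matches(desc_a, desc_b, ratio=0.8):
    """Keep only pairs that are each other's accepted match in both directions."""
    ab = ratio_matches(desc_a, desc_b, ratio)
    ba = ratio_matches(desc_b, desc_a, ratio)
    return sorted((i, j) for i, j in ab.items() if ba.get(j) == i)

# Toy descriptors: the third entry of desc_a is equally close to two candidates.
desc_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
desc_b = [(5.1, 5.0), (0.1, 0.0), (9.0, 4.0), (9.0, -4.0)]
pairs = symmetric_matches(desc_a, desc_b)
```

Here the third descriptor of `desc_a` is dropped because its two nearest candidates are equally distant, which is the kind of ambiguous match the slide refers to.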
11. Image preprocessing (1)
Content in (historical) book collections is characterized by a mixture of text, graphical art, empty pages & other artefacts
E.g. consider a sample from the Dunhuang manuscripts
Images provided by International Dunhuang Project / The British Library
12. Image preprocessing (2)
Locally adaptive histogram equalization to enhance paper structure while preserving text structure
Contrast limited adaptive histogram equalization (CLAHE, Pizer et al., 1987), where grid/tile spacing ~ character size (e.g. 40x50 pixels)
Images provided by International Dunhuang Project / The British Library
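The contrast-limiting step at the core of CLAHE can be sketched for a single tile as below. This is a simplified illustration only: full CLAHE additionally interpolates between the mappings of neighboring tiles, which is omitted here, and the function name, the tiny 4-level tile and the clip limit are hypothetical.

```python
def clipped_equalization_map(tile, clip_limit, levels=256):
    """Lookup table for one tile using a clipped histogram.

    tile: 2-D list of integer gray values in range(levels).
    clip_limit: maximum count allowed per histogram bin.
    """
    hist = [0] * levels
    for row in tile:
        for value in row:
            hist[value] += 1
    # Clip each bin and spread the removed counts uniformly over all bins.
    excess = sum(max(h - clip_limit, 0) for h in hist)
    hist = [min(h, clip_limit) for h in hist]
    bonus = excess / levels
    total = sum(hist) + excess        # equals the number of pixels in the tile
    lut, cumulative = [], 0.0
    for h in hist:
        cumulative += h + bonus
        lut.append(round((levels - 1) * cumulative / total))
    return lut

# Tiny 4x4 tile with 4 gray levels, dominated by value 0 (e.g. background).
tile = [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1],
        [2, 2, 3, 3]]
clipped = clipped_equalization_map(tile, clip_limit=6, levels=4)
unclipped = clipped_equalization_map(tile, clip_limit=16, levels=4)
```

With the clip limit set to the tile size no bin is clipped, so the second call behaves like ordinary histogram equalization of the tile; the clipped mapping stretches the dominant background value less strongly.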
13. Image preprocessing (3)
Panels: tile centers, original, global histogram equalization, CLAHE
Images provided by International Dunhuang Project / The British Library
14. Structural similarity (1)
MSE, PSNR, etc. are not well suited for content comparison -> perceptual image quality assessment
Non-blind/full-reference image quality assessment
The mean structural similarity index (SSIM, Wang et al., 2004) compares two images based on luminance, contrast and structure terms.
Mean SSIM is evaluated for the overlapping region of image pairs -> registration
To lower the influence of misregistration, the local minimum of the mean SSIM between the images in the pair is evaluated
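A minimal sketch of the mean SSIM computation is given below. It uses the standard SSIM formula with the usual constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2, but simplifies in two ways that are assumptions of this example: windows are non-overlapping and uniformly weighted rather than Gaussian-weighted, and the images are taken to be already registered. The function names and the synthetic images are hypothetical.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM of two equally sized flat lists of gray values."""
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator

def mean_ssim(img_a, img_b, win=8):
    """Mean SSIM over non-overlapping win x win blocks of two same-size images."""
    height, width = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, height - win + 1, win):
        for c in range(0, width - win + 1, win):
            block_a = [img_a[r + i][c + j] for i in range(win) for j in range(win)]
            block_b = [img_b[r + i][c + j] for i in range(win) for j in range(win)]
            scores.append(ssim_window(block_a, block_b))
    return sum(scores) / len(scores)

# Synthetic 8x8 gradient image and its inverted copy.
image = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
inverted = [[255 - v for v in row] for row in image]
same_score = mean_ssim(image, image)
inverted_score = mean_ssim(image, inverted)
```

Identical images score close to 1, while the inverted copy yields a negative value because its structure term is anti-correlated with the original.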
15. Structural similarity (2)
Registered and overlaid images (SSIM low ... black, SSIM high ... white)
Images provided by International Dunhuang Project / The British Library
16. Results - International Dunhuang Project data (1)
Pairs not matching: mean SSIM = 0, 8 pairs
Pairs with low structural similarity: mean SSIM < 0.67, 78 pairs
Pairs with high structural similarity: mean SSIM > 0.67 (p=5 quantile), 1482 pairs
Total number: 1560 pairs
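The three categories and the 0.67 threshold from this slide can be expressed as a small triage function. This is a sketch only: the slide does not say how a score of exactly 0.67 is handled, so assigning it to the high-similarity class is an assumption, and the function name and sample scores are hypothetical.

```python
from collections import Counter

def triage(mean_ssim, threshold=0.67):
    """Assign an image pair to one of the three result categories."""
    if mean_ssim == 0:
        return "not matching"
    if mean_ssim < threshold:
        return "low structural similarity"
    # Assumption: a score equal to the threshold counts as high similarity.
    return "high structural similarity"

# Hypothetical scores for four image pairs.
scores = [0.0, 0.41, 0.67, 0.93]
labels = [triage(s) for s in scores]
counts = Counter(labels)
```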
17. Results - International Dunhuang Project data (2)
Pairs of high mean SSIM are not subject to human verification
Images provided by International Dunhuang Project / The British Library
18. Results - International Dunhuang Project data (3)
Pairs of medium mean SSIM are possibly subject to human verification
Images provided by International Dunhuang Project / The British Library
19. Results - International Dunhuang Project data (4)
Pairs of low mean SSIM are subject to human verification
Images provided by International Dunhuang Project / The British Library
21. Results - Google books redownload workflow (2)
Panels: pairs not matching; pairs with low similarity (or low overlap); pairs with high similarity
Images provided by Google books collection / Austrian National Library
22. Conclusion and outlook
Keypoint based approach for quality assurance in digital book preservation
Combination of the keypoint approach with perceptual similarity evaluation
Recently: combination with a bag of keypoints approach for duplicate detection and collection comparison
Currently: evaluation at the Austrian National Library (Google books collection) and the British Library (historical newspaper collection)
Future: integration into the SCAPE platform for scalable distributed computing
23. AIT Austrian Institute of Technology
your ingenious partner
reinhold.huber-moerk@ait.ac.at