Europeana Newspapers Project Digitizes Millions of Newspaper Pages
1. Copyright: Olmsted County Historical Society
Europeana Newspapers
…in a nutshell
Newspapers in Europe and the Digital Agenda for Europe - Final Workshop
29 September 2014, London, British Library
Clemens Neudecker, State Library Berlin
@cneudecker
2. Facts & Figures
• Europeana Newspapers – EU ICT-PSP Best Practice Network
• Started in February 2012 and will run until January 2015
• 18 partners, 11 associated partners, 22 networking partners
(28 countries involved)
• Total budget: €5.16M – EC contribution: €4.12M
• Project coordination: State Library Berlin / Preußischer Kulturbesitz
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
3. Europeana Newspapers is all over Europe…and beyond
Red = Project Partners
Blue = Associated Partners
Green = Networking Partners
4. Refinement - we're scaling it up!
• 8 million pages refined with Optical Character Recognition (OCR)
• 2 million pages refined with Optical Layout Recognition (OLR)
• Technical resources for Named Entity Recognition (NER) in
three languages (Dutch, German, French)
• Metadata for >18 million pages ingested to Europeana
In comparison: currently provides access to
8,056,532 pages
5. Quality & Performance
• Bag of Words OCR Evaluation, per language (Success Rate by Language Setting); bar values as extracted: 82.4%, 85.3%, 80.9%, 75.9%, 67.5%, 83.4%, 84.1%, 68.1%, 93.1%, 57.6%, 87.0%, 68.3%, 76.1%, 82.6%, 54.1%, 32.7% (language labels not preserved)
• Bag of Words OCR Evaluation, index based rate vs. count based rate (Success Rate): Index based 71.9%, Count based 74.3%
• Layout Analysis Performance, per evaluation profile (Success Rate, harmonic, area based): Keyword search 79.1%, Phrase search 62.2%, Access via content structure 55.9%, Print/ebook on demand 58.8%, Content based image retrieval 94.7%
• Bag of Words OCR Evaluation, per font (Success Rate): Gothic 67.3%, Normal 81.4%, Mixed 64.0%
• FineReader vs. Tesseract (Success Rate, count based, by OCR Engine): FineReader 75.3%, Tesseract 53.78%
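To make the difference between an index based and a count based bag-of-words rate concrete, here is a minimal Python sketch. The slides do not define the two measures, so the definitions below are assumptions: the index based rate is taken as the share of distinct ground-truth words that appear at least once in the OCR output, and the count based rate as the share of ground-truth word occurrences that are matched by the OCR output. The function names and the simple tokenizer are hypothetical and are not taken from the project's evaluation tools.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bag_of_words_rates(ground_truth, ocr_text):
    """Return (index_based, count_based) success rates, each in the range 0..1.

    Assumed definitions, not confirmed by the slides:
    - index based: fraction of distinct ground-truth words found in the OCR text
    - count based: fraction of ground-truth word occurrences matched in the OCR text
    """
    gt = Counter(tokenize(ground_truth))
    ocr = Counter(tokenize(ocr_text))
    if not gt:
        raise ValueError("ground truth contains no words")
    index_based = sum(1 for word in gt if word in ocr) / len(gt)
    count_based = sum(min(n, ocr[word]) for word, n in gt.items()) / sum(gt.values())
    return index_based, count_based


if __name__ == "__main__":
    # "the" occurs twice in the ground truth but only once in the OCR output.
    index_rate, count_rate = bag_of_words_rates(
        "the cat and the dog", "the cat and tbe dog"
    )
    print(f"index based: {index_rate:.1%}, count based: {count_rate:.1%}")
```

Under these assumed definitions a single misrecognised repeat of a word lowers only the count based rate, because the index based rate ignores how often a word occurs.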
6. Access via TEL & Europeana
• Full text search in TEL Historic Newspapers Browser:
http://www.theeuropeanlibrary.org/tel4/newspapers
(recently updated following usability testing)
• Metadata search in Europeana:
http://www.europeana.eu/portal
(now with embedded object presentation via TEL)
7. Full-text search
8. Browse by date
9. Explore on a map
10. Title list
11. Embedded TEL Viewer in Europeana!
12. Metadata Best Practices
• Europeana Newspapers METS/ALTO Profile (ENMAP)
• Contributions to ALTO standard v2.x, v3.0
• Structural metadata with tool support - Structify
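As a rough illustration of what the ALTO part of such a profile carries, the sketch below reads the recognised text out of a small ALTO fragment using Python's standard library. The sample XML is invented for this example and is far simpler than a real ENMAP file; the helper names are hypothetical. The only ALTO features relied on are the TextLine element and the CONTENT attribute of String elements, and the namespace is stripped so the code does not depend on a particular ALTO version.

```python
import xml.etree.ElementTree as ET

# Invented sample data: a minimal ALTO-like fragment with two text lines.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Berliner"/>
            <SP/>
            <String CONTENT="Tageblatt"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Morgenausgabe"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text):
    """Return one string per TextLine, joining its String CONTENT values with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE_ALTO):
        print(line)
```

Matching on local names rather than a fixed namespace is a design choice made here so the same function can read files written against different ALTO versions; a production reader would more likely validate against the specific schema.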
13. Media, News, Events
14. Lots of opportunities for research & reuse
• Metadata for >18M pages licensed CC0
• Images & full-text for 10M pages licensed public domain
• See also:
http://www.europeana-newspapers.eu/category/interviews-with-researchers/
yet another way to reuse newspapers…
15. Thank you for your attention!
@eurnews
http://www.europeana-newspapers.eu
http://www.theeuropeanlibrary.org/tel4/newspapers
http://www.europeana.eu/