Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands IMPACT: Challenges and solutions
Overview of this presentation
The Content
The full text
Challenges to OCR:
Language Challenges Historical variants of the Dutch word ‘wereld’ (world): werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald we ë led
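The variant list above suggests why a plain keyword search misses historical text: a query for the modern spelling does not match the older forms. A minimal sketch of one way to handle this, assuming a hypothetical lexicon that maps historical spellings to a modern lemma (the names `VARIANT_TO_LEMMA`, `build_lemma_index` and `expand_query` are illustrative only and are not taken from the IMPACT tools):

```python
from collections import defaultdict

# Hypothetical variant lexicon: historical spelling -> modern lemma.
# The spellings are a handful of those listed on the slide; the
# dictionary structure itself is an assumption made for illustration.
VARIANT_TO_LEMMA = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}


def build_lemma_index(variant_to_lemma):
    """Invert the lexicon: modern lemma -> set of historical variants."""
    index = defaultdict(set)
    for variant, lemma in variant_to_lemma.items():
        index[lemma].add(variant)
    return index


def expand_query(term, index):
    """Return the search term plus all known historical variants, sorted."""
    return sorted({term} | index.get(term, set()))


LEMMA_INDEX = build_lemma_index(VARIANT_TO_LEMMA)
print(expand_query("wereld", LEMMA_INDEX))
```

A search for `wereld` is then run against every spelling in the expanded list; a term with no lexicon entry is returned unchanged.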
Answering the challenges – IMPACT
IMPACT - Approach
  • Image enhancement (binarisation, noise removal, geometrical defects correction): NCSR, USAL, ABBYY
  • OCR: ABBYY FR, IBM Adaptive
  • Dictionaries/interface: LMU, INL
  • Experimental engines: USAL, NCSR, UIBK
  • Post correction and enrichment: CONCERT (IBM), Error Profiler (LMU)
  • Language resources: 9 partners
  • Document Understanding Platform: UIBK
  • Preparation and scanning (guidelines and case studies): all partners
IMPACT – Approach continued
IMPACT Achievements: summary
Results: Better and Faster
  • Better: hybrid line segmentation on 2,700 text lines: state of the art (SOA) 90.9% -> IMPACT 98.8%
  • Better: recognition of old fonts, FR9 -> FR10, improved 25%
  • Better, faster: Adaptive OCR on a small test set halves FOM (post-processing level required)
  • Faster: CONCERT increases correction speed by up to 40%
  • Faster: post-correction with the Error Profiler is up to 2.7 times faster than without
  • Better: page split detected on 3,000 images from the dataset: state of the art (SOA) 73% -> IMPACT 94%
  • Better: language resources show improvement for all 9 languages
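To put the two before/after percentages for line segmentation and page split detection on a common footing, they can be expressed as a relative reduction in errors. The sketch below assumes those percentages are accuracy figures, which the slide does not state explicitly; the function name `relative_error_reduction` is illustrative.

```python
def relative_error_reduction(accuracy_before, accuracy_after):
    """Fraction of the original errors that the improvement removes.

    Both arguments are accuracies in percent (an assumption about
    what the slide's figures measure).
    """
    error_before = 100.0 - accuracy_before
    error_after = 100.0 - accuracy_after
    return (error_before - error_after) / error_before


# Before/after figures as reported on the slide.
results = {
    "hybrid line segmentation": (90.9, 98.8),
    "page split detection": (73.0, 94.0),
}

for task, (before, after) in results.items():
    reduction = relative_error_reduction(before, after)
    print(f"{task}: {before}% -> {after}%, {reduction:.1%} fewer errors")
```

Under that assumption, the remaining errors drop by roughly 87% for line segmentation and roughly 78% for page split detection.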
Results: Cheaper
Benefits for the Digital Library
Benefits for the End User
Enjoy!

IMPACT Final Conference - Hildelies Balk-Pennington de Jongh

Editor's Notes

  1. Quality of full text for historical documents is mostly poor. For the period 1600-1900, (much) less than half of the words are found in search.
  2. Damaged pages, bleed-through, difficult layout, historic fonts …
  3. Spelling variants, orthographical variants, inflected forms … and more