1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Workflow Development for OCR
(and beyond)
Clemens Neudecker, KB National Library of the Netherlands
Creating and Communicating Digital Content Conference
Umeå, 26 May 2011
IMPACT – Improving access to text
Funded by the EC as part of the 7th Framework Programme
Coordinated by KB – National Library of the Netherlands
EU funding: € 12 100 000
26 partners: Libraries, Research Institutes, Industry Partners
Start date: 1 January 2008
Duration: 48 months
2012: Centre of Competence
Project website: www.impact-project.eu
IMPACT blog: http://impactocr.wordpress.com/
Twitter: @impactocr, #impactproject
Join us on LinkedIn!
A familiar scene?
VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S' aö'Jifeert mo?
üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te /
sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met
beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
OCR: A multitude of challenges…
I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
OCR: A multitude of challenges…
II. Language challenges (spelling variants, inflection, and many more!)
Example: historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels
zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts
werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts
werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
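One common way to cope with such variation is a lexicon that maps historical spellings to a modern lemma. The sketch below only illustrates that idea and is not the IMPACT lexicon tooling. The table `VARIANT_TO_LEMMA` and the functions `normalise` and `expand_query` are hypothetical names, and only a few spelling variants from the list above are included.

```python
from typing import List

# Hypothetical lookup table: a few of the spelling variants listed above,
# each mapped to the modern Dutch lemma. A real lexicon would be far larger.
VARIANT_TO_LEMMA = {
    variant: "wereld"
    for variant in ["werelt", "weerelt", "weereld", "waereld", "werlt", "vvereld"]
}


def normalise(token: str) -> str:
    """Return the modern lemma for a known variant, else the token unchanged."""
    return VARIANT_TO_LEMMA.get(token.lower(), token)


def expand_query(lemma: str) -> List[str]:
    """Return the lemma plus every known historical variant (for search)."""
    variants = {v for v, target in VARIANT_TO_LEMMA.items() if target == lemma}
    return sorted(variants | {lemma})
```

With such a table, a search for ‘wereld’ can be expanded to its historical spellings, or OCR output can be normalised before indexing.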
And a multitude of solutions!
22 different ‘tools’ from diverse work packages (WPs) and developers:
OCR (C++, C#),
Image Processing & Lexica (DLL),
Command Line Tools (Win/Linux),
Java, Ruby, PHP, Perl, etc.
+ 3rd party software!
“One ring to rule them all...”
IMPACT Interoperability Framework (IIF)
Requirement: Interoperability Framework
Interoperability vs. integration
Web based vs. local installation/platform
Most important: flexible, scalable, user friendly
Java 6
Apache Axis2
Apache Tomcat
Apache Synapse (optional)
Taverna Workflow Engine
Generic Web Service Wrapper
Only requirement: Command Line Application HTML form
Available on OPFlabs:
https://github.com/openplanets/scape/tree/master/xa-toolwrapper
Minimise integration effort: developers can focus on their application and have to worry less about integration = higher quality software
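The wrapping idea can be illustrated with a minimal sketch: a command line template is turned into a callable that takes input data and returns the tool's output. This is not the xa-toolwrapper itself, which generates web services. The names `make_service` and `UPPERCASE_TOOL` and the `{input}`/`{output}` placeholders are assumptions for illustration, and a small Python one-liner stands in for a real OCR tool.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def make_service(command_template):
    """Turn a command line template into a callable: bytes in, bytes out."""
    def service(data: bytes) -> bytes:
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / "input"
            dst = Path(tmp) / "output"
            src.write_bytes(data)
            # Fill the hypothetical {input}/{output} placeholders with real paths.
            cmd = [
                part.replace("{input}", str(src)).replace("{output}", str(dst))
                for part in command_template
            ]
            subprocess.run(cmd, check=True)  # raises if the tool fails
            return dst.read_bytes()
    return service


# Stand-in "tool": upper-cases a text file (assumption, replaces a real OCR binary).
UPPERCASE_TOOL = [
    sys.executable, "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
    "{input}", "{output}",
]

service = make_service(UPPERCASE_TOOL)
```

The tool developer only supplies the command line; everything around it (temporary files, invocation, error handling) is generic.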
Service Oriented Architecture
Java as programming language = platform independence
Standard Apache components = easy to maintain, well supported
Synapse as enterprise service bus = load balancing & failover
HTTPS encryption & authentication = secure
Minimise deployment effort: scalability, hot deployment/update
Workflow development
OCR workflow = data pipeline
Building blocks = processing steps (nodes)
Integration = interaction between nodes (mashup)
Maximise usability
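The "workflow = data pipeline" view can be sketched in a few lines: each node is a function, and a workflow chains nodes so that the output of one becomes the input of the next. This is a simplified illustration, not how Taverna represents workflows. The node functions `dehyphenate`, `collapse_whitespace` and `lowercase` are hypothetical stand-ins for real processing steps.

```python
from typing import Callable

Node = Callable[[str], str]


def pipeline(*nodes: Node) -> Node:
    """Chain nodes left to right into a single workflow callable."""
    def run(data: str) -> str:
        for node in nodes:
            data = node(data)
        return data
    return run


# Hypothetical stand-in nodes for real processing steps.
def dehyphenate(text: str) -> str:
    return text.replace("-\n", "")  # rejoin words split across lines


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


def lowercase(text: str) -> str:
    return text.lower()


workflow = pipeline(dehyphenate, collapse_whitespace, lowercase)
```

Swapping, adding or removing a node changes the workflow without touching the other nodes.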
Workflow management
Web 2.0 style registry: myExperiment
Local client: Taverna Workbench
Web client: project website
Workflow registry
Share resources and experience
Rate/tag/comment workflows
Organised in groups
Workflow modules
“Basic” workflow = wraps exactly one software tool/web service
Documented inputs/outputs
Complex workflows
Tool/data pipeline
Easily derived from workflow modules
Task/goal oriented
Reusable
Local client: Taverna Workbench
http://www.taverna.org.uk/
Background: BioSciences
Developed and maintained by myGrid, UK
Available for Windows/Linux/OSX and as open source
Funding secured until 2014
Web client: Taverna Server/Workflow Parser
SOAP/REST API
Remote execution of workflows (webapp)
Use case: Workflows for Evaluation
Tool A vs Tool B (Tool A(v1) vs Tool A(v2))
Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)
Workflow X vs previously digitised material
Users identify optimal workflow for source material/project
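Such comparisons need a score. The slides do not name a metric, so the sketch below assumes character accuracy against a ground-truth transcription, computed from the Levenshtein edit distance. The function names `edit_distance`, `char_accuracy` and `best_workflow` are illustrative only.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(ocr: str, truth: str) -> float:
    """Share of ground-truth characters recovered, clipped to [0, 1]."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(ocr, truth) / len(truth))


def best_workflow(outputs: Dict[str, str], truth: str) -> str:
    """Return the name of the workflow whose output scores highest."""
    return max(outputs, key=lambda name: char_accuracy(outputs[name], truth))
```

Given the outputs of Workflow X and Workflow Y on the same page, `best_workflow` picks the one closer to the ground truth under this assumed metric.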
Other examples
Workflows for Digitisation: IMPACT
Workflows for Linguistic Analysis: CLARIN
Workflows for Preservation: SCAPE
Interface for automatic storage of results, based on DAV, realised as a workflow module (native beanshell support)
And there are many more…
Benefits & Outlook
Modular
Transparent
Expandable
Scalable
Platform independent
User friendly
Growing interest in workflow management in the cultural heritage (CH) sector
Easy to set up and deploy; free (open source)
Domain independent
Thank you! Questions?