The IMPACT Interoperability Framework - Workflows for OCR and beyond
Better, faster, cheaper. Solutions of the IMPACT Centre of Competence and future challenges, The British Library, 24-25 October 2011, London, United Kingdom.
1. The IMPACT Interoperability Framework:
Workflows for OCR and beyond
Clemens Neudecker, KB National Library of the Netherlands
2nd IMPACT Conference, British Library, London 24/25 October 2011
2. Background
> 20 individual software components for specific challenges
Prototyping new algorithms, improving commercial solutions
Different programming languages (C, C++, Java, etc.) and platforms (Windows/Linux)
Extensible with 3rd party applications
IMPACT Interoperability Framework (IIF)
3. Architecture
Java
Web Services
Apache
Taverna
Open Source available on https://github.com/impactcentre
Free Hackathon 14/15 November, University of Manchester
http://impact-mygrid-taverna-hackathon.wikispaces.com/
4. Integration
Only requirement:
command line executable
Generic command line wrapper
produces web service
Web service exposed as
workflow module with
documentation
Quick & easy integration:
developers can focus on their application and have to worry
less about integration = higher quality software
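The wrapping idea on this slide can be illustrated with a minimal sketch. It is not the IIF's generic command line wrapper, which produces web services; it only shows, in Python, how an arbitrary command line executable can be hidden behind one uniform callable. The names `wrap_command` and `ToolResult` are hypothetical, and the Python interpreter stands in for a real OCR tool.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Uniform result record returned by every wrapped tool (hypothetical)."""
    returncode: int
    stdout: str
    stderr: str


def wrap_command(executable, *fixed_args):
    """Return a callable that runs `executable` with fixed and per-call arguments."""
    def run(*args, timeout=60):
        completed = subprocess.run(
            [executable, *fixed_args, *args],
            capture_output=True,  # collect stdout and stderr instead of printing them
            text=True,            # decode the captured bytes to str
            timeout=timeout,
        )
        return ToolResult(completed.returncode, completed.stdout, completed.stderr)
    return run


# Stand-in for an OCR tool: the Python interpreter upper-casing its argument.
shout = wrap_command(sys.executable, "-c", "import sys; print(sys.argv[1].upper())")
result = shout("impact")
```

Because every wrapped tool returns the same record type, a service layer only needs to know about `ToolResult`, not about the individual executables.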
5. Workflows
OCR workflow =
data pipeline
Building blocks =
processing modules
(nodes)
Integration =
interaction between
nodes (mashups)
Collaboration with
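The "workflow = data pipeline" view above can be sketched as plain function composition. This is only an illustration under stated assumptions: the node functions `binarise`, `recognise` and `extract_text` are hypothetical stand-ins for real processing modules, and the sketch does not use Taverna.

```python
from functools import reduce


def pipeline(*nodes):
    """Chain processing nodes so that each node's output feeds the next node."""
    def run(data):
        return reduce(lambda value, node: node(value), nodes, data)
    return run


# Hypothetical stand-in nodes; real modules would call image and OCR services.
def binarise(page):
    return {**page, "binarised": True}


def recognise(page):
    return {**page, "text": page["image"].upper()}


def extract_text(page):
    return page["text"]


ocr_workflow = pipeline(binarise, recognise, extract_text)
output = ocr_workflow({"image": "scan of page 1"})
```

Swapping one node for another leaves the rest of the workflow unchanged, as long as the replacement accepts and returns the same kind of data.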
6.
7. Evaluation features
Text comparison of result with ground truth,
using Levenshtein distance method
Word evaluation (with reading order)
Layout based comparison of result with ground truth,
using the Page Analysis And Ground Truth Elements Framework
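The text comparison named above can be sketched with the standard dynamic-programming form of the Levenshtein distance. This is a minimal illustration, not the IMPACT evaluation tool; the `accuracy` helper and its formula (1 minus distance divided by ground-truth length) are assumptions added for the example.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute: cost 1 each)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def accuracy(result, ground_truth):
    """1 minus edit distance over ground-truth length, floored at 0 (assumed metric)."""
    if not ground_truth:
        return 1.0 if not result else 0.0
    return max(0.0, 1.0 - levenshtein(result, ground_truth) / len(ground_truth))
```

Passing strings gives a character-level comparison; passing word lists, for example `text.split()`, gives a word-level comparison that is sensitive to reading order.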
8. Community
Web 2.0 style
workflow registry
Ready-to-use and
documented resources
Community of experts
Sharing of experiments
and know how
9. Local client: Taverna Workbench
Background:
BioSciences
Developed and
maintained by
myGrid, UK
Open source
GUI for design and execution of web services & workflows
10. Remote client: Portal
SOAP/REST API
Remote execution of web services & workflows
11. Results Repository
Custom service for IMPACT:
automatic storage of
workflow outputs and
provenance via WebDAV
Fully interoperable,
since HTTP-based
Configurable storage of
result sets
Create reports using POI
12. Scalability
Central ESB proxy
manages multiple
service copies
Process parallelization,
Load distribution,
Fail over, Security
Served >2M requests
Throughput improvements of 94% with every additional instance
Tested on Dutch Supercomputing Cloud (“Enlighten Your Research”)
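Load distribution and fail over across service copies can be sketched as a round-robin dispatcher. This is a toy illustration in Python, not the ESB proxy used in IMPACT; the class `RoundRobinProxy` and the service functions are hypothetical, and a `ConnectionError` is assumed to signal an unreachable copy.

```python
from itertools import cycle


class RoundRobinProxy:
    """Distribute calls over service copies in turn, skipping copies that fail."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one service instance is required")
        self._count = len(instances)
        self._ring = cycle(instances)

    def call(self, request):
        last_error = None
        # Try each copy at most once per request before giving up.
        for _ in range(self._count):
            instance = next(self._ring)
            try:
                return instance(request)
            except ConnectionError as error:
                last_error = error
        raise RuntimeError("all service instances failed") from last_error


# Hypothetical service copies: two healthy ones and one that is down.
def copy_a(request):
    return f"A:{request}"


def broken_copy(request):
    raise ConnectionError("instance unreachable")


def copy_b(request):
    return f"B:{request}"


proxy = RoundRobinProxy([copy_a, broken_copy, copy_b])
responses = [proxy.call(n) for n in range(3)]
```

Here the second request is first routed to the broken copy and then retried on the next one, so the caller never sees the failure.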
13. Outlook
Online service for testing/evaluation
Specification & Guidelines
Extending the scope:
Workflows for linguistic analysis: CLARIN
Workflows for preservation: SCAPE
Even better scalability: Map/Reduce
Supported by a community of developers & practitioners
14.
15. xkcd.com/688
“Anyway, the thing about progress is
that it always seems greater than it really is.”
Ludwig Wittgenstein, Philosophical Investigations
(quoting Johann Nestroy)