By Joaquim Rocha.
Many documents are still stored on paper, which causes problems of preservation, flexibility and even ecology.
With the current Free Software OCR engines it is possible to achieve good accuracy when converting printed text to digital format, but these engines perform only that basic conversion and know nothing about a document's structure and elements.
OCRFeeder is an easy-to-use solution for GNOME that automatically detects the contents of pages, allows manual correction and uses the OCR engines installed on the system to convert the text. It can export documents to various formats, such as ODT, HTML or PDF.
This project stands as the most complete Free Software solution for converting printed documents to digital formats and competes with the proprietary alternatives.
9. No fair conversion apps for GNU/Linux
apart from OCR engines, but...
Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012
10. OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)
11. What's needed is Document Analysis and Recognition
(conversion of documents to an electronic format)
(first projects in the 80s)
15. So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
16. Layouts vary with the type of document
What works for detecting one won't work for others
17. OCRFeeder focuses on contents, not on layouts!
18. Key concept:
If a document image can be divided into windows of 1 (content) or 0 (not content),
then it is possible to group all the 1s and outline the contents
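The key concept can be illustrated with a minimal sketch. This is not OCRFeeder's actual code: the function names `window_grid` and `outline_contents`, the `background` and `tolerance` parameters, and the choice of a 4-connected flood fill to group the 1s are all assumptions made for illustration. The image is assumed to be a grayscale page given as a list of rows of pixel values.

```python
from collections import deque


def window_grid(image, window, background=255, tolerance=0):
    """Mark each window x window tile as 1 (content) or 0 (not content).

    Hypothetical helper: a tile counts as content if any of its pixels
    differs from the assumed background value by more than `tolerance`.
    """
    rows = (len(image) + window - 1) // window
    cols = (len(image[0]) + window - 1) // window
    grid = [[0] * cols for _ in range(rows)]
    for y, line in enumerate(image):
        for x, pixel in enumerate(line):
            if abs(pixel - background) > tolerance:
                grid[y // window][x // window] = 1
    return grid


def outline_contents(grid, window):
    """Group adjacent 1-windows and return one bounding box per group.

    Boxes are (left, top, right, bottom) in pixels. Grouping uses a
    4-connected breadth-first flood fill, which is an assumption of this
    sketch. Boxes are aligned to the window grid, so they may extend past
    the image edge when its size is not a multiple of `window`.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((left * window, top * window,
                          (right + 1) * window, (bottom + 1) * window))
    return boxes
```

Calling `outline_contents(window_grid(image, window), window)` gives one box per group of touching content windows. Smaller windows follow the contents more tightly, while larger windows merge nearby elements into a single box.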
24. User interaction:
Users can edit everything and review the algorithm's results
So, the UI can work in attended and unattended ways
The CLI only works in unattended mode