Katrien Depuydt gives a presentation on the 'Structural analysis of documents and the Functional Extension Parser (FEP)' at the IMPACT Demo Day at the British Library on 12 July 2011.
Down the Islands. A voyage to the Caribbees ... With illustrations. (1888)
BL Demonstrator Set, [prima ids 465024-465278]
William Agnew Paton

The images TOC1 input.png and TOC2 input.png show the embedded full text (OCR output) within the PDF output of ABBYY FineReader. It is interesting to see that "TOC1 input.png" contains three OCR errors that have a strong impact on the quality of the FEP analysis results:
a) The links to page numbers for the first two TOC entries, Introduction and Chapter I, are not detected by the OCR.
b) According to the OCR, the third TOC entry (Chapter II) links to the page labelled with page number 2 (instead of 22).

These errors have the following impact on the analysis (which can be seen in the image TOC1 output.png):
a) The entry Introduction is missed completely.
b) The second TOC entry ends after the two centered lines and has no link to the book content.
c) The second part of the second TOC entry is grouped together with the third TOC entry and has a wrong link to page number 2 instead of 22.
d) The fourth TOC entry contains no OCR errors and is therefore grouped and linked correctly.

The second TOC page (TOC2 input.png) does not contain any OCR errors, and the FEP analysis results for it are correct.

Concerning the TOC reconstruction, the FEP performs as follows (25 TOC entries in total):
1 TOC entry was missed
2 TOC entries are grouped incorrectly
1 TOC entry has no link
1 TOC entry has a wrong link
22 TOC entries are completely correct

The images Example1.png and Example2.png show the results of the logical structure analysis of the FEP. Correct labels are marked with a green border, wrong labels with a red border.
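
The TOC reconstruction counts reported above can be summarised with a little arithmetic. The following is only a minimal sketch in Python: the variable and function names are illustrative assumptions and are not part of the FEP. The error categories appear to overlap (for example, an entry can be both grouped incorrectly and linked wrongly), which would explain why the category counts add up to more than the number of entries that are not completely correct.

```python
# Counts taken from the TOC reconstruction summary in the text.
total_entries = 25
fully_correct = 22

# Error categories as reported. They are assumed to overlap, since one
# entry can fall into more than one category.
reported_errors = {
    "missed": 1,
    "grouped_incorrectly": 2,
    "no_link": 1,
    "wrong_link": 1,
}


def share_correct(correct: int, total: int) -> float:
    """Return the fraction of entries that were reconstructed correctly."""
    return correct / total


# Entries with at least one problem, and the share of fully correct entries.
entries_with_errors = total_entries - fully_correct
accuracy = share_correct(fully_correct, total_entries)

print(f"{entries_with_errors} of {total_entries} entries affected, "
      f"{accuracy:.0%} fully correct")
```

Under these assumptions, 3 of the 25 entries are affected by at least one error and 88% of the entries are reconstructed completely correctly.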