4. OCR: ABBYY FineReader SDK with built-in standard Dutch dictionary
OCR: ABBYY FineReader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch: werreld
13. Material for lexicon building:
- historical dictionaries with quotations (OED, WNT)
- corpus material, ground truth quality
- list of dictionary entries
- modern or historical language computational lexica
22. Resources for lexicon building
Bulgarian. Corpora: Ground Truth, early OCR.
Czech. Dictionaries: Jungmann, Kott. Corpora: Ground Truth, Czech National Corpus. Lexica: based on modern dictionary.
English. Dictionaries: OED. Corpora: Ground Truth.
French. Corpora: Ground Truth, Frantext. Lexica: Morphalou.
Polish. Dictionaries: The dictionary of 17th and early 18th century Polish. Corpora: Ground Truth. Lexica: Grammatical dictionary of Polish.
Slovene. Corpora: AHLib, Wikisource, Ground Truth. Lexica: Multext-East lexicon.
Spanish. Dictionaries: Diccionario de Autoridades (Real Academia Española). Corpora: Cervantes Virtual Library, Ground Truth. Lexica: Apertium lexicon.
23. Issues and challenges
Bulgarian. Issues: some characters in late 19th century Bulgarian not recognized by FineReader; Old Church Slavonic printing not implemented at all; lack of sufficient corpus material. Countermeasures: special font training; lexicon development; ground truth.
Czech. Issues: lack of sufficient corpus material. Countermeasures: lexicon development; ground truth.
Polish. Issues: special glyphs; lack of sufficient corpus material. Countermeasures: lexicon development; ground truth.
A snippet from a Dutch magazine (De Denker. No. 4. Den 24. January 1763)
OCR: improving access to text by improving the quality of the text.
RETRIEVAL: improving access to text by dealing with historical spelling variants. Used: HISTORICAL LEXICON OF DUTCH.
Can we handle ‘the world’? "Yes we can" ought to be our answer, especially when we are investing hugely in mass digitisation. Mass digitisation is the very reason for investing in lexicon building. Digitising huge quantities of historical text demands matching efforts in the quality of both OCR and retrieval. Historical lexicon building for OCR and retrieval, as shown above in this little example, can contribute to that.
An example: in a ground truth corpus of Dutch texts from 1550 until 1950, containing approximately 150 million words, a search for the very common word ‘wereld’ yielded 23396 hits. Using a historical lexicon containing spelling and morphological variants of this word resulted in 34339 hits. I
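The size of that gain is easy to work out from the two hit counts quoted above. The following is only a small arithmetic sketch; the variable names are chosen here for illustration and do not come from the slides.

```python
# Hit counts quoted in the text for the ~150-million-word Dutch corpus.
hits_modern_only = 23396    # search for 'wereld' alone
hits_with_lexicon = 34339   # search including variants from the historical lexicon

# Occurrences that a plain modern-spelling search would have missed.
extra_hits = hits_with_lexicon - hits_modern_only

# Gain relative to the plain search.
relative_gain = extra_hits / hits_modern_only

print(f"extra hits: {extra_hits}")
print(f"relative gain: {relative_gain:.1%}")
```

In other words, the lexicon-based search finds 10943 additional occurrences, roughly 47% more than the search on the modern spelling alone.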
<ed> Again, unsure of what LEMMA means. "Be", "was", "am", "is", etc. are all forms of the same word BE (and that is an example of a lemma).
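To make the idea of a lemma concrete, here is a minimal sketch of a lookup table that maps word forms to their lemma, using the BE example from the note. The table and the function names (`FORM_TO_LEMMA`, `lemma_of`, `forms_of`) are hypothetical illustrations, not the format of the IMPACT lexicon.

```python
# Hypothetical toy lexicon: every word form points to its lemma.
FORM_TO_LEMMA = {
    "be": "BE",
    "was": "BE",
    "am": "BE",
    "is": "BE",
}


def lemma_of(word_form):
    """Return the lemma for a word form, or None if the form is unknown."""
    return FORM_TO_LEMMA.get(word_form.lower())


def forms_of(lemma):
    """Return all known word forms of a lemma, in alphabetical order."""
    return sorted(form for form, lem in FORM_TO_LEMMA.items() if lem == lemma)


print(lemma_of("was"))
print(forms_of("BE"))
```

A historical lexicon works the same way in principle, except that the forms listed under a lemma also include older spellings.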
Two types of variation, examples for Dutch from the lexicon
Here you search for ‘wereld’ and automatically for its variants. WERELD is the modern Dutch spelling. How: by using the lexicon built in IMPACT, which stores each modern word together with its variants in a format that can be integrated into a Lucene search engine.
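One way to picture this kind of lexicon-driven search is query expansion: the modern word typed by the user is replaced by an OR query over the word and its stored variants. The sketch below is an assumption about how that could look, not the actual IMPACT integration. The `LEXICON` table, the `expand_query` function and the field name `text` are hypothetical. Only ‘werreld’ appears in the slides; the other variants are illustrative. The output string follows the classic Lucene query parser syntax and assumes plain alphabetic terms that need no escaping.

```python
# Hypothetical lexicon fragment: modern word -> historical spelling variants.
# Only 'werreld' comes from the slides; the others are illustrative.
LEXICON = {
    "wereld": ["werreld", "waereld", "weereld", "werelt"],
}


def expand_query(term, field="text", lexicon=LEXICON):
    """Build a Lucene-style query string matching the term or any known variant."""
    term = term.lower()
    variants = lexicon.get(term, [])
    # Keep the original order and drop duplicates.
    terms = list(dict.fromkeys([term, *variants]))
    if len(terms) == 1:
        # No variants known: fall back to a plain single-term query.
        return f"{field}:{terms[0]}"
    return f"{field}:(" + " OR ".join(terms) + ")"


print(expand_query("wereld"))
# prints: text:(wereld OR werreld OR waereld OR weereld OR werelt)
```

Expanding at query time keeps the index untouched. An alternative design would be to normalise variants to the modern form at indexing time, at the cost of re-indexing whenever the lexicon grows.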