University Library of KU Leuven 
Sam Alloing and Demmy Verbeke
University Library of KU Leuven 
Divisions involved: 
Arts Faculty Library 
•Collections and services focused on ongoing research and teaching in the Faculty of Arts 
•Some special collections (e.g. Gulden Librije) 
LIBIS 
•Provides services for libraries, museums and archives (inside and outside the university) 
Digitisation Unit 
•Among others, the Digital Lab: a high-tech digital photography centre
Why did we get involved? 
Digitization infrastructure and experience were already in place, but focused on visualization => now: digitization of textual material with a view to creating digital text corpora for research 
http://www.arts.kuleuven.be/ono/meso/projects/digitalisatie 
http://www.illuminare.be/rich_project 
http://www.europeana-photography.eu
Corpus 
13 books from the pretiosa collection of the Gulden Librije: 
-translations from Latin 
-books that had not been digitized yet 
Augustinus, Stad Gods (1876-8)
Augustinus, Belydenis (1741)
Boëthius, Vertroostinge der wysgeerte (1703)
Horatius, Over de dichtkunst (1866)
Horatius, Hekeldichten en brieven (1728)
Nepos, Leevens van doorlugtige mannen (1796)
Nepos, Leeven der doorluchtige veld-ooversten (1726)
Ovidius, Treur-digten (1814-5)
Ovidius, Treur-gesangen (1692)
Seneca, Christelycke Seneca (1705)
Tacitus, Vande ghedenkwaerdige geschiedenissen der Romeinen (1645)
Vergilius, Wercken (1737)
Vergilius, Aeneis (1662)
Assumptions 
•As automated as possible 
•Try as soon as possible, to fail early 
•Use ALTO format throughout the workflow
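To illustrate what carrying ALTO through the workflow can look like in practice, here is a minimal sketch that reads the recognised text back out of an ALTO file. The sample XML is a hypothetical fragment written for this example, not output from the project, and the helper names (`local_name`, `alto_lines`) are assumptions. The sketch relies only on the ALTO convention that each `TextLine` holds `String` elements whose `CONTENT` attribute carries the word.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Stad" WC="0.93"/><SP/><String CONTENT="Gods" WC="0.88"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Eerste" WC="0.71"/><SP/><String CONTENT="boek" WC="0.95"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return one string per TextLine, joining the String CONTENT values with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

Matching on the local tag name rather than a fixed namespace is a design choice for this sketch: ALTO files from different tools may declare different schema versions.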
Workflow OCR 
Attestation 
Improving 
•User pattern training 
•Use dictionary 
•Improve images 
Executing OCR 
Digitisation 
Evaluation set 
ocrevalUAtion 
Lesson learnt: a high error rate is not necessarily bad 
Aletheia 
•Create ground truth 
•User friendly 
Lessons learnt: 
•B&W images 
•Remove border 
•Biggest problem: bleed-through (letters from other pages showing through) 
ABBYY FineReader engine 
•Useful sample applications 
•Windows
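The evaluation step compares OCR output against ground truth. A common way to express the result is the character error rate (CER): the edit distance between the two texts divided by the length of the ground truth. The sketch below shows that standard definition only. The slides do not show how ocrevalUAtion computes its figures, so this is an assumption, and the function names and sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,          # deletion
                    current[j - 1] + 1,       # insertion
                    previous[j - 1] + substitution_cost,
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """CER = edit distance / number of characters in the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example: one misread character in a 12-character word.
print(character_error_rate("doorluchtige", "doorlucbtige"))
```

Because the rate is relative to the ground-truth length, a handful of systematic misreadings (for example of one old letter form) can produce a high figure while remaining easy to correct afterwards.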
Workflow NER 
Attestation 
Training set 
Test set 
Execute NER 
Model 
Input 
Europeana Newspaper NER 
•ALTO input from OCR 
•Lesson learnt: a lot of resources (RAM) needed 
INL Attestation tool 
Lesson learnt: far more ground truth is needed than for OCR 
NERT of INL 
80/20 split training/test 
NERT of INL 
•Different split training and test set 
•Create variants from old spelling 
Improving
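The 80/20 split between training and test set can be sketched as follows. This is a minimal illustration under stated assumptions: the unit being split (here a list of sentences), the function name `split_train_test`, and the fixed random seed are all hypothetical, and the slides do not describe how the split was actually performed for NERT.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=42):
    """Shuffle the items reproducibly and split them into training and test sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split identical between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical annotated sentences standing in for the attested ground truth.
sentences = [f"sentence {n}" for n in range(10)]
training, test = split_train_test(sentences)
print(len(training), len(test))
```

Changing `train_fraction` gives a different split of training and test material, which is one of the parameters that can be varied when trying to improve the model.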
Results NER 
              Precision  Recall  F1
Overall       0.6257     0.5130  0.5638
Location      0.675      0.2903  0.40601
Organization  1.0        0.1666  0.2857
Person        0.6207     0.5571  0.5871
Segmentation  0.6634     0.5438  0.5977
Classification accuracy: 0.9433 
> 60% recognised correctly 
≈ 50% of the entities found
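F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so the F1 column can be recomputed from the other two. The sketch below does this for the figures reported above. The function name `f1_score` and the dictionary layout are assumptions made for the example, and the values are copied from the slide.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given on the results slide.
reported = {
    "Overall": (0.6257, 0.5130, 0.5638),
    "Location": (0.675, 0.2903, 0.40601),
    "Organization": (1.0, 0.1666, 0.2857),
    "Person": (0.6207, 0.5571, 0.5871),
    "Segmentation": (0.6634, 0.5438, 0.5977),
}

for label, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{label:<13} reported={f1:.4f} recomputed={recomputed:.4f}")
```

The Organization row shows why F1 is reported alongside precision: a precision of 1.0 with a recall of 0.1666 still yields a low F1, because few of the organizations were found at all.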
Results NER, an experiment 
Diagram labels: Input; Corrected file; Training file; Test file; Split; Combine 
              Precision  Recall  F1
Overall       0.8398     0.7954  0.8170
Location      0.8741     0.6720  0.7599
Organization  1.0        0.5     0.6666
Person        0.8320     0.8320  0.8320
Segmentation  0.8920     0.8448  0.8677
Classification accuracy: 0.9415 
80% recognised correctly 
≈ 80% entities found
Next steps 
•Create an OCR and NER platform for the university and as part of the LIBIS services 
•New project about OCR and (early modern) Latin texts 
•Looking into other tools: 
•Lexicon building 
•Border detection 
•Automatically remove ‘noise’ from a page 
•NER: 
•Learning to use Latin (and Greek)
Thanks! 
Questions? 
•Sam Alloing (Sam.Alloing@libis.kuleuven.be) 
•Demmy Verbeke (Demmy.Verbeke@arts.kuleuven.be; @viroviacum) 
•http://bib.kuleuven.be/english/ub
