SlideShare uma empresa Scribd logo
1 de 18
Baixar para ler offline
Collatinus :
Lemmatizer and
morphological
analyzer for
Latin texts.
Yves Ouvrard
Philippe Verkerk
Digital Classics III: Re-thinking Text Analysis May 12th 2017
Who are we ?
Yves Ouvrard
Professor of latin
(retired)
Philippe Verkerk
Physicist
Lab. PhLAM
●
Reading support of
latin texts
●
Prepare the wordlist
associated to a text
What was initially Collatinus ?
Collatinus 11
●
Free and Open program (GNU GPL)
- written in C++ (Qt 5)
- resources as plain text files
●
Stand-alone software (Mac OS X, Windows,
Linux)
http://outils.biblissima.fr/en/collatinus/
●
On-line version (Collatinus 10.2)
http://outils.biblissima.fr/en/collatinus-web/
Main functions
●
Lemmatization
Analysis
Reading support
●
Dictionaries
- Gaffiot 2016
- Lewis & Short
- Georges...
●
Scansion●
Inflection
Ārmă (Ārmā) vĭrūmquĕ cānō (cănō̆ ), Trōjāe quī prīmŭs ăb ōrīs (ōrĭs)
--̆ u-u -̆ -̆ -- - -u u --̆
Ītălĭām, fātō prŏfŭgūs, Lāvīnĭăquĕ (Lāvīnĭāquĕ) vēnĭt (vĕnĭt)
-uu- -- uu- --u-̆ u -̆ u
lītŏră, mūlt[um] īll[e] ēt tērrīs jāctātŭs (jāctātūs) ĕt āltō
-uu -` -` - -- ---̆ u --
vī sŭpĕrūm sāevāe mĕmŏrēm Jūnōnĭs ŏb īrăm;
- uu- -- uu- --u u -u
mūltă (mūltā) quōqu[e] (quŏqu[e]) ēt bēllō pāssūs, dūm cōndĕrĕt ūrbĕm,
--̆ -̆ ` - -- -- - -uu -u
īnfērrētquĕ dĕōs Lătĭō, gĕnūs (gĕnŭs) ūndĕ Lătīnŭm,
---u u- uu- u-̆ -u u-u
Ālbānīquĕ pātrēs (pā̆ trēs), ātqu[e] āltāe mōenĭă Rōmāe.
---u -̆ - -` -- -uu --
What makes the difference ?
●
Dictionaries :
a single click opens
the dictionary
●
Allows to compare
two dictionaries
●
Scansion :
knows a priori if a
vowel is short or long
→ not limited to
metrical verses
●
Integrated server
→ can answer a question from another program
●
Expandable(one can add lemmas and/or dictionaries)
Principle of the analysis
●
Not a list of inflected forms
→ closer to the human mechanism
●
Split the word in 2 parts (all possibilities)
●
Look for the 2 parts in the proper list
●
Check for the compatibility
●
24 100 lemmas in the “main lexicon”
●
58 000 lemmas in the “extended lexicon”
The figures
●
24 100 lemmas
(classical Latin)
●
58 000 lemmas in the
extended lexicon
(unusual words)
●
140 “paradigms”
- Lewis & Short ≈ 60 000 entries
- Gaffiot 2016 ≈ 60 000 entries
- K.E. Georges ≈ 50 000 entries
- Gérard Jeanneau ≈ 50 000 entries
Generate automatically the
lemmas in the format of Collatinus
7 500 entries still waiting
●
Contraction in the declension
ĭĭ → ī
amasse↔ amavisse
●
Assimilation of the prefix
(without an extra entry)
obfero↔ offero
Ordering the analysis
●
“Ordering” the analysis
→ a step to desambiguization
≠ from a usual POS tagger (no list of forms)
●
Statistics on the texts lemmatized by the
LASLA (almost 2 000 000 forms)
●
Counting the tags, the sequences of 3 tags
and each lemma (mapped in the lexicon)
●
91 tags : more than the POS
TextiColor
●
Students are
supposed to know
a list of words
●
The text is
prepared with
different colors :
- known
- not known
- medieval form
]
From quantities to rhythm
●
Collatinus knows the quantity of the vowels
→ easy to determine the tonical accent.
●
Can split the syllabes→ hyphenation
abs·cí·di (abs-cido) ≠ áb·sci·di (ab-scindo)
●
Opens the way to the study of rhythmic prose
●
Statistics on paroxytons and proparoxytons.
●
Rhythm everywhere, not only inclausulae
New feature : probabilistic tagger
●
No one knows if a probabilistic tagger works
with Latin (“free” order of the words)
●
No evaluation of the error rate (not yet)
●
2nd order hidden Markov model
●
Training corpus : LASLA.
●
For each sentence, chooses the best
sequence of tags and thus the best analysis
Internal server
●
Collatinus can answer questions asked from
another program. For instance, LibreOffice.
Internal server (2)
●
A “standard” tagger with the data from LASLA
- List of forms, with lemma and analysis
- Exact number of occurrences
- Tagset and statistics on sequences of tags
●
Problem of the forms that have not been seen
before : Ask Collatinus !
→ Answers a LASLA-like analysis
LASLA-Tagger
●
Supervised lemmatization.
●
Edition/correction of the result of the tagger.
→ Hope to reach zero error !
●
Exact number of occurrences. For “patres” :
253 voc. 110 nom. and 75 acc.
Collatinus expects 6 voc. 200 nom. & 320 acc
LASLA-Tagger : screen-shot
What is going on ?
Praelector:
●
“Grammatical assistant” to help pupils in
reading Latin / understanding the sentence.
→“Translation” in French of the sentence
●
Medieval spelling :
e forae, o forau, f forph, etc...
Reduce the medieval forms and the classical
words to the same “phonetic” form.
Future plans
●
Coupling of the probabilistic tagger with the
gramatical analysis of the sentence
●
“Universal” output format to gather all the
information about the text (lemmas,
analysis, scanned and accented forms...)
●
Statistics on rhythmic prose (cursus)
●
Looking for verses in prose
Suggestions
●
Yves.Ouvrard@collatinus.org
●
Philippe.Verkerk@univ-lille1.fr
Thank you for your attention
● http://outils.biblissima.fr/en/collatinus/
● http://outils.biblissima.fr/en/collatinus-web/
● For ancient Greek : http://outils.biblissima.fr/en/eulexis/

Mais conteúdo relacionado

Semelhante a Collatinus: Lemmatizer and morphological analyzer for Latin texts

Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyRobert Viseur
 
A practical parser with combined parsing
A practical parser with combined parsingA practical parser with combined parsing
A practical parser with combined parsingijseajournal
 
Developing OpenResty Framework
Developing OpenResty FrameworkDeveloping OpenResty Framework
Developing OpenResty FrameworkAapo Talvensaari
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLEric Evans
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testingKnoldus Inc.
 
Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text ProcessingSuneel Marthi
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextDataWorks Summit
 
Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyOlivier Bourgeois
 
Rust All Hands Winter 2011
Rust All Hands Winter 2011Rust All Hands Winter 2011
Rust All Hands Winter 2011Patrick Walton
 
Devfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-KoansDevfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-KoansTomoki Aburatani
 
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...Apache OpenNLP
 
LVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYLVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYdmbaturin
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingFlorian Leitner
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeProf. Wim Van Criekinge
 
ANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayMichael Yarichuk
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfeliasabdi2024
 

Semelhante a Collatinus: Lemmatizer and morphological analyzer for Latin texts (20)

Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
A practical parser with combined parsing
A practical parser with combined parsingA practical parser with combined parsing
A practical parser with combined parsing
 
Developing OpenResty Framework
Developing OpenResty FrameworkDeveloping OpenResty Framework
Developing OpenResty Framework
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQL
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testing
 
AINL 2016: Kuznetsova
AINL 2016: KuznetsovaAINL 2016: Kuznetsova
AINL 2016: Kuznetsova
 
Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text Processing
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
 
parsing.pptx
parsing.pptxparsing.pptx
parsing.pptx
 
Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development Efficiency
 
Rust All Hands Winter 2011
Rust All Hands Winter 2011Rust All Hands Winter 2011
Rust All Hands Winter 2011
 
Devfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-KoansDevfest kyoto2018 Lisp-Koans
Devfest kyoto2018 Lisp-Koans
 
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
 
Static analysis for beginners
Static analysis for beginnersStatic analysis for beginners
Static analysis for beginners
 
Bioinformatics v2014 wim_vancriekinge
Bioinformatics v2014 wim_vancriekingeBioinformatics v2014 wim_vancriekinge
Bioinformatics v2014 wim_vancriekinge
 
LVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLYLVEE 2014: Text parsing with Python and PLY
LVEE 2014: Text parsing with Python and PLY
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String Processing
 
Bioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekingeBioinformatics p4-io v2013-wim_vancriekinge
Bioinformatics p4-io v2013-wim_vancriekinge
 
ANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy Way
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdf
 

Mais de Equipex Biblissima

Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteDa Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteEquipex Biblissima
 
eScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document AnalysiseScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document AnalysisEquipex Biblissima
 
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...Equipex Biblissima
 
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...Equipex Biblissima
 
Représentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIFReprésentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIFEquipex Biblissima
 
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...Equipex Biblissima
 
Mise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documentsMise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documentsEquipex Biblissima
 
Actualités et perspectives de IIIF
Actualités et perspectives de IIIFActualités et perspectives de IIIF
Actualités et perspectives de IIIFEquipex Biblissima
 
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIFMieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIFEquipex Biblissima
 
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...Equipex Biblissima
 
IIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in FranceIIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in FranceEquipex Biblissima
 
The Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical NamesThe Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical NamesEquipex Biblissima
 
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...Equipex Biblissima
 
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)Equipex Biblissima
 
Biblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts CollectionsBiblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts CollectionsEquipex Biblissima
 
A la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail BiblissimaA la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail BiblissimaEquipex Biblissima
 
Browse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIFBrowse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIFEquipex Biblissima
 
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...Equipex Biblissima
 

Mais de Equipex Biblissima (20)

Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritteDa Biblissima a Biblissima+ : per un osservatorio delle culture scritte
Da Biblissima a Biblissima+ : per un osservatorio delle culture scritte
 
eScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document AnalysiseScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document Analysis
 
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
Annotate (E-ReColNat) : annotation rapide d’images et de vidéos en sciences n...
 
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
Appliquer les techniques d'apprentissage profond pour détecter les enluminure...
 
Représentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIFReprésentations du chant du Moyen Âge dans les images IIIF
Représentations du chant du Moyen Âge dans les images IIIF
 
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
Réflexions et explorations croisées autour de IIIF, Omeka-s et NumaHOP à la B...
 
Mise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documentsMise en œuvre de IIIF pour la reconnaissance automatique de documents
Mise en œuvre de IIIF pour la reconnaissance automatique de documents
 
Nakala et IIIF
Nakala et IIIFNakala et IIIF
Nakala et IIIF
 
Actualités et perspectives de IIIF
Actualités et perspectives de IIIFActualités et perspectives de IIIF
Actualités et perspectives de IIIF
 
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIFMieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
Mieux diffuser et valoriser ses images sur le Web grâce aux standards IIIF
 
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
Digital Manuscripts Without Borders: A Discovery Platform of Manuscripts and ...
 
IIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in FranceIIIF360: A Service to Support and Promote IIIF in France
IIIF360: A Service to Support and Promote IIIF in France
 
The Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical NamesThe Biblissima Authority File of Geographical Names
The Biblissima Authority File of Geographical Names
 
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
Les référentiels Biblissima : épine dorsale du portail Biblissima et de IIIF-...
 
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
Introduction aux protocoles IIIF. Formation Enssib 23.01.2019 (Régis Robineau)
 
Biblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts CollectionsBiblissima: Connecting Manuscripts Collections
Biblissima: Connecting Manuscripts Collections
 
IIIF et Biblissima
IIIF et BiblissimaIIIF et Biblissima
IIIF et Biblissima
 
A la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail BiblissimaA la recherche du patrimoine écrit avec le portail Biblissima
A la recherche du patrimoine écrit avec le portail Biblissima
 
Browse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIFBrowse and Visualize Manuscripts Illuminations with IIIF
Browse and Visualize Manuscripts Illuminations with IIIF
 
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
Les descripteurs des bases iconographiques Mandragore (BnF) et Initiale (IRHT...
 

Último

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 

Último (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Collatinus: Lemmatizer and morphological analyzer for Latin texts

  • 1. Collatinus : Lemmatizer and morphological analyzer for Latin texts. Yves Ouvrard Philippe Verkerk Digital Classics III: Re-thinking Text Analysis May 12th 2017
  • 2. Who are we ? Yves Ouvrard Professor of latin (retired) Philippe Verkerk Physicist Lab. PhLAM ● Reading support of latin texts ● Prepare the wordlist associated to a text What was initially Collatinus ?
  • 3. Collatinus 11 ● Free and Open program (GNU GPL) - written in C++ (Qt 5) - resources as plain text files ● Stand-alone software (Mac OS X, Windows, Linux) http://outils.biblissima.fr/en/collatinus/ ● On-line version (Collatinus 10.2) http://outils.biblissima.fr/en/collatinus-web/
  • 4. Main functions ● Lemmatization Analysis Reading support ● Dictionaries - Gaffiot 2016 - Lewis & Short - Georges... ● Scansion● Inflection Ārmă (Ārmā) vĭrūmquĕ cānō (cănō̆ ), Trōjāe quī prīmŭs ăb ōrīs (ōrĭs) --̆ u-u -̆ -̆ -- - -u u --̆ Ītălĭām, fātō prŏfŭgūs, Lāvīnĭăquĕ (Lāvīnĭāquĕ) vēnĭt (vĕnĭt) -uu- -- uu- --u-̆ u -̆ u lītŏră, mūlt[um] īll[e] ēt tērrīs jāctātŭs (jāctātūs) ĕt āltō -uu -` -` - -- ---̆ u -- vī sŭpĕrūm sāevāe mĕmŏrēm Jūnōnĭs ŏb īrăm; - uu- -- uu- --u u -u mūltă (mūltā) quōqu[e] (quŏqu[e]) ēt bēllō pāssūs, dūm cōndĕrĕt ūrbĕm, --̆ -̆ ` - -- -- - -uu -u īnfērrētquĕ dĕōs Lătĭō, gĕnūs (gĕnŭs) ūndĕ Lătīnŭm, ---u u- uu- u-̆ -u u-u Ālbānīquĕ pātrēs (pā̆ trēs), ātqu[e] āltāe mōenĭă Rōmāe. ---u -̆ - -` -- -uu --
  • 5. What makes the difference ? ● Dictionaries : a single click opens the dictionary ● Allows to compare two dictionaries ● Scansion : knows a priori if a vowel is short or long → not limited to metrical verses ● Integrated server → can answer a question from another program ● Expandable(one can add lemmas and/or dictionaries)
  • 6. Principle of the analysis ● Not a list of inflected forms → closer to the human mechanism ● Split the word in 2 parts (all possibilities) ● Look for the 2 parts in the proper list ● Check for the compatibility ● 24 100 lemmas in the “main lexicon” ● 58 000 lemmas in the “extended lexicon”
  • 7. The figures ● 24 100 lemmas (classical Latin) ● 58 000 lemmas in the extended lexicon (unusual words) ● 140 “paradigms” - Lewis & Short ≈ 60 000 entries - Gaffiot 2016 ≈ 60 000 entries - K.E. Georges ≈ 50 000 entries - Gérard Jeanneau ≈ 50 000 entries Generate automatically the lemmas in the format of Collatinus 7 500 entries still waiting ● Contraction in the declension ĭĭ → ī amasse↔ amavisse ● Assimilation of the prefix (without an extra entry) obfero↔ offero
  • 8. Ordering the analysis ● “Ordering” the analysis → a step to desambiguization ≠ from a usual POS tagger (no list of forms) ● Statistics on the texts lemmatized by the LASLA (almost 2 000 000 forms) ● Counting the tags, the sequences of 3 tags and each lemma (mapped in the lexicon) ● 91 tags : more than the POS
  • 9. TextiColor ● Students are supposed to know a list of words ● The text is prepared with different colors : - known - not known - medieval form ]
  • 10. From quantities to rhythm ● Collatinus knows the quantity of the vowels → easy to determine the tonical accent. ● Can split the syllabes→ hyphenation abs·cí·di (abs-cido) ≠ áb·sci·di (ab-scindo) ● Opens the way to the study of rhythmic prose ● Statistics on paroxytons and proparoxytons. ● Rhythm everywhere, not only inclausulae
  • 11. New feature : probabilistic tagger ● No one knows if a probabilistic tagger works with Latin (“free” order of the words) ● No evaluation of the error rate (not yet) ● 2nd order hidden Markov model ● Training corpus : LASLA. ● For each sentence, chooses the best sequence of tags and thus the best analysis
  • 12. Internal server ● Collatinus can answer questions asked from another program. For instance, LibreOffice.
  • 13. Internal server (2) ● A “standard” tagger with the data from LASLA - List of forms, with lemma and analysis - Exact number of occurrences - Tagset and statistics on sequences of tags ● Problem of the forms that have not been seen before : Ask Collatinus ! → Answers a LASLA-like analysis
  • 14. LASLA-Tagger ● Supervised lemmatization. ● Edition/correction of the result of the tagger. → Hope to reach zero error ! ● Exact number of occurrences. For “patres” : 253 voc. 110 nom. and 75 acc. Collatinus expects 6 voc. 200 nom. & 320 acc
  • 16. What is going on ? Praelector: ● “Grammatical assistant” to help pupils in reading Latin / understanding the sentence. →“Translation” in French of the sentence ● Medieval spelling : e forae, o forau, f forph, etc... Reduce the medieval forms and the classical words to the same “phonetic” form.
  • 17. Future plans ● Coupling of the probabilistic tagger with the gramatical analysis of the sentence ● “Universal” output format to gather all the information about the text (lemmas, analysis, scanned and accented forms...) ● Statistics on rhythmic prose (cursus) ● Looking for verses in prose
  • 18. Suggestions ● Yves.Ouvrard@collatinus.org ● Philippe.Verkerk@univ-lille1.fr Thank you for your attention ● http://outils.biblissima.fr/en/collatinus/ ● http://outils.biblissima.fr/en/collatinus-web/ ● For ancient Greek : http://outils.biblissima.fr/en/eulexis/