Indexing the Albanian Language
by Andri Xhitoni
May 2015
Indexing the Albanian Language
The problem: Have search functionality
work on a website that's in Albanian.
Intricacies of search
Many think of search as
a straightforward process
“in go search terms, out come results”
It’s not that simple...
Words take on many forms.
Words may have different meanings
based on context
Some words have no real semantic value
and must be ignored (stop words)
How do the big guys do it?
No searching through raw content
Search through optimized versions
of the raw content (indexing)
Basic indexing process
Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'
Basic indexing process
Normalize the characters (transliteration)
and remove punctuation
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought alice `without pictures or conversation?'
Basic indexing process
Remove stop words
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought alice `without pictures or conversation?'
Basic indexing process
Transform each remaining word to its "basic version"
(stemming)
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
Basic indexing process
Store the indexed content alongside the original
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
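A minimal Python sketch of the four steps above, applied to the first line of the sample text. The stop word set and the stem lookup table are tiny hypothetical stand-ins, not data from the talk; a real stemmer would cover far more word forms.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list for the English sample.
STOP_WORDS = {"the", "of", "to", "and", "was", "by", "her", "on", "it",
              "had", "a", "is", "in", "or", "but", "no", "what"}

# Hypothetical stem lookup table standing in for a real stemmer.
STEMS = {"thought": "think", "beginning": "begin", "sitting": "sit",
         "pictures": "picture", "conversations": "conversation"}


def normalize(text):
    """Lowercase, strip diacritics, and replace punctuation with spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9\s]", " ", no_marks)


def index_text(text):
    """Normalize, drop stop words, then map each remaining word to its stem."""
    words = normalize(text).split()
    kept = [w for w in words if w not in STOP_WORDS]
    return [STEMS.get(w, w) for w in kept]


# Store the indexed content alongside the original.
original = "Alice was beginning to get very tired of sitting by her sister on the bank"
record = {"original": original, "indexed": " ".join(index_text(original))}
```

Keeping both fields in one record means the search runs against `indexed`, while the results shown to the user come from `original`.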
Performing the search
the book Alice’s sister was reading
Performing the search
the book alice’s sister was reading
Perform the same indexing on the search terms
Performing the search
Search for the indexed search terms
in the indexed content
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
the book alice’s sister was reading
Performing the search
Rank results according to number of occurrences,
closeness of terms, position in the indexed text
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
the book alice’s sister was reading
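One way to sketch the search and ranking step in Python, assuming the documents and the search terms have already gone through the same indexing. The weights (one point per occurrence, one per pair of adjacent matches, up to one for an early first match) are arbitrary assumptions for illustration, not the ranking formula from the talk.

```python
def score(indexed_words, query_words):
    """Toy relevance score: occurrences, closeness, and position of matches."""
    positions = [i for i, w in enumerate(indexed_words) if w in query_words]
    if not positions:
        return 0.0
    occurrences = len(positions)
    # Closeness: count matches that sit directly next to the previous match.
    adjacent = sum(1 for a, b in zip(positions, positions[1:]) if b - a == 1)
    # Position: an earlier first match earns a larger bonus (at most 1.0).
    earliness = 1.0 / (1 + positions[0])
    return occurrences + adjacent + earliness


def search(documents, query_words):
    """Return the ids of matching documents, best score first."""
    scored = [(score(words, query_words), doc_id)
              for doc_id, words in documents.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s > 0]
```

Here `documents` is a hypothetical mapping from a document id to its list of indexed words, and `query_words` is the set of indexed search terms.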
Add the Albanian language
on top of the problem
No known "stop words" list
Non-trivial stemming process
High irregularity in word formation
Vast number of forms for each single word
Just a taste of the complexity
Nouns: 6 cases
  x 2 numbers (singular, plural)
  x 2 definiteness forms (definite, indefinite)
  ~24 word forms
Verbs: 3 unique word-forming moods (of 6)
  x 4 unique word-forming tenses (of 8)
  x 2 voices (active, passive)
  x 6 conjugated forms
  ~70 word forms
Looking for solutions
Ideally:
A list of stop words
A (huge) list of all possible word forms
for all words in Albanian,
linked to their stem form.
Looking for solutions
Sources:
The Dictionary
highly comprehensive
only base word forms
The Internet
not too comprehensive
many word forms
potential errors
Hybrid source
a probability-based model
picking (hopefully) the best
from both sources
Data mining: Stop words
Get as many texts in Albanian as possible
(the more diverse, the better)
Transliterate the texts
Keep a running count of the occurrence for each word
Sort the list by occurrence count (highest first).
Stop words will float to the top.
Manually white-list obvious false positives
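The steps above can be sketched in Python roughly as follows. The transliteration table (ë to e, ç to c), the function names, and the `top_n` cutoff are assumptions made for this illustration; the talk does not specify them.

```python
import re
from collections import Counter

# Assumed transliteration: map the two Albanian letters outside ASCII.
ALBANIAN_TO_ASCII = str.maketrans({"ë": "e", "ç": "c"})


def transliterate(text):
    """Lowercase the text and replace ë/ç with their ASCII counterparts."""
    return text.lower().translate(ALBANIAN_TO_ASCII)


def stop_word_candidates(texts, top_n, whitelist=frozenset()):
    """Count every word across all texts and return the most frequent ones.

    Words in `whitelist` (manually reviewed false positives) are skipped.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", transliterate(text)))
    ranked = [word for word, _ in counts.most_common()]  # highest count first
    return [w for w in ranked if w not in whitelist][:top_n]
```

The whitelist is applied before the cutoff, so removing a false positive lets the next most frequent word move up into the list.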
Data mining: Stemming
Invert each word from the collected list
Sort the list alphabetically
(effectively sorting by suffixes)
Find highest occurring suffixes of 2, 3 and 4 letters
Manually look for false positives
and put them in a white list
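A rough Python sketch of the suffix mining described above, assuming the input is the already transliterated word list. Function names, the `top_n` cutoff, and the rule that a suffix must leave at least one letter of the word are assumptions for illustration.

```python
from collections import Counter


def invert_and_sort(words):
    """Reverse each distinct word and sort, so shared suffixes end up adjacent."""
    return sorted(word[::-1] for word in set(words))


def common_suffixes(words, lengths=(2, 3, 4), top_n=10, whitelist=frozenset()):
    """Return the most frequent suffixes of the given lengths.

    Suffixes in `whitelist` (manually reviewed false positives) are skipped.
    """
    counts = Counter()
    for inverted in invert_and_sort(words):
        for n in lengths:
            if len(inverted) > n:  # leave at least one letter of the word
                # The first n letters of the inverted word are its suffix.
                counts[inverted[:n][::-1]] += 1
    ranked = [suffix for suffix, _ in counts.most_common()]
    return [s for s in ranked if s not in whitelist][:top_n]
```

Counting over distinct words rather than raw occurrences keeps one very frequent word from inflating the count for its own ending.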
The (basic) indexing algorithm
https://github.com/andrixh/index-albanian
Transliterate the input text
Find and remove all stop words
Go through each word and remove
the found suffixes (largest to smallest)
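A minimal Python sketch of this algorithm, not the code from the repository above. The stop word set and suffix list are small hypothetical samples standing in for the mined lists, and two details are assumptions: at most one suffix is stripped per word, and a stem of at least three letters must remain.

```python
import re

# Assumed transliteration: map the two Albanian letters outside ASCII.
ALBANIAN_TO_ASCII = str.maketrans({"ë": "e", "ç": "c"})

# Hypothetical sample data; real lists would come from the data mining steps.
STOP_WORDS = {"ne", "dhe", "te", "e", "i"}
SUFFIXES = ["ave", "it", "ve"]


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, trying largest to smallest."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def index_albanian(text):
    """Transliterate, drop stop words, then strip suffixes from what is left."""
    words = re.findall(r"[a-z]+", text.lower().translate(ALBANIAN_TO_ASCII))
    return [strip_suffix(w, SUFFIXES) for w in words if w not in STOP_WORDS]
```

Trying the longest suffix first matters: with the sample list, "librave" loses "ave" and becomes "libr", whereas matching "ve" first would leave "libra".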
Indexing the Albanian Language
by Andri Xhitoni
Thank you!
https://github.com/andrixh/index-albanian

Andri Xhitoni - Indexing Albanian Language

  • 1. Indexing the Albanian Language by Andri Xhitoni may 2015
  • 2. Indexing the Albanian Language The problem: Have search functionality work on a website that's in Albanian.
  • 3. Indexing the Albanian Language The problem: Have search functionality work on a website that's in Albanian.
  • 4. Indexing the Albanian Language The problem: Have search functionality work on a website that's in Albanian.
  • 6. Intricacies of search Many think of search as a straight-forward process
  • 7. Intricacies of search Many think of search as a straight-forward process “in go search terms, out come results”
  • 8. it’s not that simple... “in go search terms, out come results”
  • 9. it’s not that simple...
  • 10. it’s not that simple... Words take on many forms.
  • 11. it’s not that simple... Words take on many forms. Words may have different meanings based on context
  • 12. it’s not that simple... Words take on many forms. Words may have different meanings based on context Some words have no real semantic value and must be ignored (stop words)
  • 13. How do the big guys do it?
  • 14. How do the big guys do it? No searching through raw content
  • 15. How do the big guys do it? No searching through raw content Search through optimized versions of the raw content (indexing)
  • 16. Basic indexing process Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
  • 17. Basic indexing process Normalize the characters (transliteration) and remove punctuation alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought alice `without pictures or conversation?'
  • 18. Basic indexing process Remove stop words alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought alice `without pictures or conversation?'
  • 19. Basic indexing process Transform each remaining word to its "basic version" (stemming) alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
  • 20. Basic indexing process Store the indexed content alongside the original alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
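The indexing steps on slides 16 to 20 (normalize, drop stop words, stem, store alongside the original) could be sketched roughly as follows. This is only an illustration: the stop-word set and the suffix list are tiny made-up placeholders, the suffix stripper is a toy stand-in for a real stemmer (it would not turn "thought" into "think"), and the names `normalize`, `stem` and `index_text` are assumptions rather than anything from the talk.

```python
import re
import unicodedata

# Assumption: a tiny illustrative stop-word list, not a complete English one.
STOP_WORDS = {"the", "was", "to", "of", "by", "her", "on", "and", "a",
              "it", "in", "is", "had", "but", "no", "or", "what", "she", "into"}

# Assumption: a toy suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "s")


def normalize(text):
    """Lowercase, strip diacritics, and replace punctuation with spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9\s]", " ", ascii_only)


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_text(text):
    """Return the list of index terms for a piece of text."""
    words = normalize(text).split()
    return [stem(w) for w in words if w not in STOP_WORDS]


original = "Alice was beginning to get very tired of sitting by her sister on the bank"
# Store the indexed form next to the original so results can show the raw text.
record = {"original": original, "indexed": index_text(original)}
print(record["indexed"])
```

Keeping the original next to the indexed terms means the search can match against the reduced form while still showing readers the untouched text.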
  • 22. Performing the search the book Alice’s sister was reading
  • 23. Performing the search the book alice’s sister was reading Perform the same indexing on the search terms
  • 24. Performing the search Search for the indexed search terms in the indexed content alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?' the book alice’s sister was reading
  • 25. Performing the search Rank results according to number of occurrences, closeness of terms, position in the indexed text alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?' the book alice’s sister was reading
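A minimal sketch of the search step on slides 22 to 25, assuming documents and query have already been passed through the same indexing step. It ranks only by the number of occurrences; closeness of terms and position in the text, which the slides also mention, are left out to keep the example short. The document ids, the term lists and the function names are hypothetical.

```python
from collections import Counter


def score(indexed_doc, indexed_query):
    """Count how many times the query terms occur in the indexed document."""
    counts = Counter(indexed_doc)
    return sum(counts[term] for term in set(indexed_query))


def search(index, indexed_query):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    scored = [(doc_id, score(terms, indexed_query)) for doc_id, terms in index.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Assumption: these term lists are what an indexing step might have produced.
index = {
    "alice": ["alice", "tired", "sister", "bank", "book", "sister", "read", "book"],
    "other": ["rabbit", "watch", "waistcoat", "pocket"],
    "mixed": ["book", "shelf", "dust"],
}
query = ["book", "alice", "sister", "read"]
print(search(index, query))
```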
  • 26. Add the Albanian language on top of the problem
  • 27. Add the Albanian language on top of the problem No known "stop words" list
  • 28. Add the Albanian language on top of the problem No known "stop words" list Non-trivial stemming process
  • 29. Add the Albanian language on top of the problem No known "stop words" list Non-trivial stemming process High irregularity in word formation
  • 30. Add the Albanian language on top of the problem No known "stop words" list Non-trivial stemming process High irregularity in word formation Vast number of forms for each single word
  • 31. Just a taste of the complexity Nouns: 6 cases x 2 numbers (singular, plural) x 2 definiteness values (definite, indefinite), ~24 word forms. Verbs: 3 unique word-forming moods (of 6) x 4 unique word-forming tenses (of 8) x 2 voices (active, passive) x 6 conjugative forms, ~70 word forms.
  • 34. Looking for solutions Ideally: A list of stop words
  • 35. Looking for solutions Ideally: A list of stop words A (huge) list of all possible word forms for all words in Albanian, linked to their stem form.
  • 37. Looking for solutions Sources: The Dictionary highly comprehensive only base word forms
  • 38. Looking for solutions Sources: The Dictionary highly comprehensive only base word forms The Internet not too comprehensive many word forms potential errors
  • 39. Looking for solutions Sources: The Dictionary highly comprehensive only base word forms The Internet not too comprehensive many word forms potential errors Hybrid source a probability-based model picking (hopefully) the best from both sources
  • 41. Data mining: Stop words Get as many texts in Albanian as possible (the more diverse, the better)
  • 42. Data mining: Stop words Get as many texts in Albanian as possible (the more diverse, the better) Transliterate the texts
  • 43. Data mining: Stop words Get as many texts in Albanian as possible (the more diverse, the better) Transliterate the texts Keep a running count of the occurrence for each word
  • 44. Data mining: Stop words Get as many texts in Albanian as possible (the more diverse, the better) Transliterate the texts Keep a running count of the occurrence for each word Sort the list by occurrence count (highest first).
  • 45. Data mining: Stop words Get as many texts in Albanian as possible (the more diverse, the better) Transliterate the texts Keep a running count of the occurrence for each word Sort the list by occurrence count (highest first). Stop words will float to the top.
  • 46. Data mining: Stop words Get as many texts in Albanian as possible (the more diverse, the better) Transliterate the texts Keep a running count of the occurrence for each word Sort the list by occurrence count (highest first). Stop words will float to the top. Manually white-list obvious false positives
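The stop-word mining procedure on slides 41 to 46 might look roughly like this in code. It is a sketch under assumptions: the two sample sentences are made up for illustration, `top_n` is an arbitrary cut-off, and the `whitelist` parameter models the manual review step by holding words that rank high but should not be treated as stop words.

```python
import re
import unicodedata
from collections import Counter


def transliterate(text):
    """Lowercase and strip diacritics, e.g. 'është' becomes 'eshte'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stop_word_candidates(texts, top_n, whitelist=frozenset()):
    """Return the top_n most frequent words that are not white-listed."""
    counts = Counter()
    for text in texts:
        # Keep a running count of every word across all texts.
        counts.update(re.findall(r"[a-z]+", transliterate(text)))
    # most_common() sorts by count, highest first.
    ranked = [word for word, _ in counts.most_common()]
    return [w for w in ranked if w not in whitelist][:top_n]


# Assumption: a toy corpus; a real run would use many diverse texts.
texts = [
    "Unë jam në shtëpi dhe ti je në shkollë.",
    "Libri është në tavolinë dhe lapsi është në çantë.",
]
print(stop_word_candidates(texts, top_n=3))
```

With a corpus this small the ranking is only suggestive; the approach relies on having enough varied text for function words to clearly outnumber content words.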
  • 48. Data mining: Stemming Invert each word from the collected list
  • 49. Data mining: Stemming Invert each word from the collected list Sort the list alphabetically (effectively sorting by suffixes)
  • 50. Data mining: Stemming Invert each word from the collected list Sort the list alphabetically (effectively sorting by suffixes) Find highest occurring suffixes of 2, 3 and 4 letters
  • 51. Data mining: Stemming Invert each word from the collected list Sort the list alphabetically (effectively sorting by suffixes) Find highest occurring suffixes of 2, 3 and 4 letters Manually look for false positives and put them in a white list
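A rough sketch of the suffix mining on slides 48 to 51, reading "invert" as reversing the letters of each word so that an alphabetical sort groups words by their endings. The sample word list, the `min_stem` guard and the function names are assumptions added for illustration; the manual white-listing of false positives is not modelled.

```python
from collections import Counter


def sort_by_suffix(words):
    """Reverse each word, sort, and reverse back, grouping words by ending."""
    return [w[::-1] for w in sorted(w[::-1] for w in words)]


def common_suffixes(words, lengths=(2, 3, 4), top_n=5, min_stem=2):
    """Return the most frequent suffixes for each length in `lengths`."""
    unique_words = list(dict.fromkeys(words))  # de-duplicate, keep order
    result = {}
    for n in lengths:
        # Only count a suffix if at least min_stem letters remain in front of it.
        counts = Counter(w[-n:] for w in unique_words if len(w) >= n + min_stem)
        result[n] = counts.most_common(top_n)
    return result


# Assumption: a handful of transliterated word forms standing in for the mined list.
words = ["librit", "qytetit", "malit", "librave", "qyteteve", "maleve"]
print(sort_by_suffix(words))
print(common_suffixes(words))
```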
  • 52. The (basic) indexing algorithm
  • 53. The (basic) indexing algorithm Transliterate the input text
  • 54. The (basic) indexing algorithm Transliterate the input text Find and remove all stop words
  • 55. The (basic) indexing algorithm Transliterate the input text Find and remove all stop words Go through each word and remove the found suffixes (largest to smallest)
  • 56. The (basic) indexing algorithm https://github.com/andrixh/index-albanian Transliterate the input text Find and remove all stop words Go through each word and remove the found suffixes (largest to smallest)
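The three steps of the basic indexing algorithm (slides 53 to 56) can be sketched as below. This is not the code from the linked repository. The stop-word set and suffix list are small placeholders for the mined lists, and removing at most one suffix per word while keeping a minimum stem length is an assumption about how "largest to smallest" is applied.

```python
import re
import unicodedata

# Assumption: placeholder data; real lists would come from the mining steps.
STOP_WORDS = {"ne", "dhe", "eshte", "te", "e", "i"}
SUFFIXES = ["eve", "ave", "it", "ve"]


def transliterate(text):
    """Lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, leaving at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def index_albanian(text):
    """Transliterate, drop stop words, then strip suffixes from what remains."""
    words = re.findall(r"[a-z]+", transliterate(text))
    return [strip_suffix(w, SUFFIXES) for w in words if w not in STOP_WORDS]


print(index_albanian("Çmimi i librave dhe i qyteteve"))
```

Trying the longest suffixes first avoids cutting only "ve" from a word that actually ends in "eve" or "ave".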
  • 57. Indexing the Albanian Language by Andri Xhitoni Thank you! https://github.com/andrixh/index-albanian