12. It’s not that simple...
Words take on many forms.
Words may have different meanings
based on context
Some words have no real semantic value
and must be ignored (stop words)
15. How do the big guys do it?
No searching through raw content
Search through optimized versions
of the raw content (indexing)
16. Basic indexing process
Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'
17. Basic indexing process
Normalize the characters (transliteration)
and remove punctuation
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought alice `without pictures or conversation?'
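The normalization step can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's code: the function name `normalize` is hypothetical, and it assumes that "transliteration" means lowercasing and stripping diacritics (for example "ë" to "e").

```python
import string
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and replace punctuation with spaces."""
    # NFKD splits an accented letter into its base letter plus a
    # combining mark, e.g. "ë" becomes "e" followed by U+0308.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Assumption: only ASCII punctuation is handled here; typographic
    # quotes and dashes would need to be added to the table.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    # split() + join() collapses the runs of spaces left behind.
    return " ".join(without_marks.translate(table).split())


print(normalize("Alice was beginning, `and what is the use of a book?'"))
print(normalize("Një çështje"))
```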
18. Basic indexing process
Remove stop words
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought alice `without pictures or conversation?'
19. Basic indexing process
Transform each remaining word to its "basic version"
(stemming)
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
20. Basic indexing process
Store the indexed content alongside the original
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
23. Performing the search
the book alice’s sister was reading
Perform the same indexing on the search terms
24. Performing the search
Search for the indexed search terms
in the indexed content
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
the book alice’s sister was reading
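A rough Python sketch of this idea: run the query through the same indexing function as the content, then look the resulting terms up in the indexed content. The names `index_terms` and `matching_terms` are hypothetical, the stop word set is a tiny English list made up for this example, and stemming is left out to keep the sketch short.

```python
import re

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"the", "was", "of", "to", "a", "and", "her", "by", "on", "s"}


def index_terms(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def matching_terms(query, document):
    """Index the query like the document and keep the terms found in both."""
    doc_terms = set(index_terms(document))
    return [t for t in index_terms(query) if t in doc_terms]


document = (
    "Alice was beginning to get very tired of sitting by her sister on "
    "the bank, and of having nothing to do: once or twice she had "
    "peeped into the book her sister was reading"
)
print(matching_terms("the book alice's sister was reading", document))
```

Because both sides go through `index_terms`, differences in case, punctuation, and stop words between the query and the content no longer affect the match.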
25. Performing the search
Rank results according to number of occurrences,
closeness of terms, position in the indexed text
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
the book alice’s sister was reading
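One way to combine the three ranking signals is sketched below in Python. The slides do not say how the signals are weighted, so the formula here (a plain sum of three components) is an assumption made for illustration, and the names `score` and `rank` are hypothetical.

```python
def score(query_terms, doc_terms):
    """Score one indexed document against the indexed query terms."""
    wanted = set(query_terms)
    positions = [i for i, term in enumerate(doc_terms) if term in wanted]
    if not positions:
        return 0.0
    # Signal 1: number of occurrences of any query term.
    occurrences = len(positions)
    # Signal 2: closeness, i.e. how densely the hits are packed
    # between the first and the last hit (1.0 means adjacent).
    span = positions[-1] - positions[0] + 1
    closeness = occurrences / span
    # Signal 3: position, where an earlier first hit scores higher.
    earliness = 1.0 - positions[0] / len(doc_terms)
    # Assumption: equal weights; a real ranker would tune these.
    return occurrences + closeness + earliness


def rank(query_terms, documents):
    """Return document names ordered from best to worst score."""
    return sorted(
        documents,
        key=lambda name: score(query_terms, documents[name]),
        reverse=True,
    )


docs = {
    "a": ["book", "sister", "reading", "bank", "tired"],
    "b": ["bank", "tired", "book", "pictures", "use"],
}
print(rank(["book", "sister"], docs))
```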
30. Add the Albanian language
on top of the problem
No known "stop words" list
Non-trivial stemming process
High irregularity in word formation
Vast number of forms for each single word
31. Just a taste of the complexity
Nouns 6 cases
x 2 numbers (singular, plural)
x 2 definiteness states (definite, indefinite)
~24 word forms
Verbs 3 unique word-forming moods (of 6)
x 4 unique word-forming tenses (of 8)
x 2 voices (active, passive)
x 6 conjugative forms
~70 word forms
39. Looking for solutions
Sources:
The Dictionary
highly comprehensive
only base word forms
The Internet
not too comprehensive
many word forms
potential errors
Hybrid source
a probability-based model
picking (hopefully) the best
from both sources
46. Data mining: Stop words
Get as many texts in Albanian as possible
(the more diverse, the better)
Transliterate the texts
Keep a running count of the occurrence for each word
Sort the list by occurrence count (highest first).
Stop words will float to the top.
Manually white-list obvious false positives
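The stop word mining steps above can be sketched in Python along these lines. The function names, the `top_n` cutoff, and the sample texts are hypothetical; the whitelist stands in for the manual review step and would be filled in by hand.

```python
import re
import unicodedata
from collections import Counter


def transliterate(text):
    """Lowercase and strip diacritics (e.g. "ë" becomes "e")."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stop_word_candidates(texts, top_n, whitelist=frozenset()):
    """Return the top_n most frequent words that are not whitelisted."""
    counts = Counter()
    for text in texts:
        # Keep a running count of every word across all texts.
        counts.update(re.findall(r"[a-z]+", transliterate(text)))
    # most_common() sorts by count, highest first, so the likely
    # stop words end up at the front of the list.
    ranked = [word for word, _ in counts.most_common() if word not in whitelist]
    return ranked[:top_n]


texts = ["dhe një dhe libër", "dhe një shtëpi"]  # tiny illustrative corpus
print(stop_word_candidates(texts, top_n=2))
```

With a real corpus the cutoff is a judgment call: a frequent word that still carries meaning goes into the whitelist instead of the stop word list.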
51. Data mining: Stemming
Reverse the letters of each word in the collected list
Sort the list alphabetically
(effectively sorting by suffixes)
Find the most frequent suffixes of 2, 3 and 4 letters
Manually look for false positives
and put them in a white list
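A small Python sketch of the suffix mining idea follows. It shows the reverse-and-sort trick for eyeballing the list, plus a direct count of 2, 3, and 4 letter endings. The function names and the sample words are illustrative assumptions, and the manual false-positive review is not automated here.

```python
from collections import Counter


def sort_by_suffix(words):
    """Sort words by their endings: reverse, sort, reverse back."""
    return [w[::-1] for w in sorted(w[::-1] for w in words)]


def common_suffixes(words, lengths=(2, 3, 4), top_n=10):
    """Count how many distinct words end in each candidate suffix."""
    counts = Counter()
    for word in set(words):
        for n in lengths:
            # Only count endings that leave at least one letter of stem.
            if len(word) > n:
                counts[word[-n:]] += 1
    return counts.most_common(top_n)


# Already transliterated sample words, for illustration only.
words = ["librat", "librave", "shtepive", "qyteteve", "fjaleve"]
print(sort_by_suffix(words))
print(common_suffixes(words, top_n=3))
```

Sorting the reversed words puts words with the same ending next to each other, which makes the manual review of candidate suffixes much easier.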
56. The (basic) indexing algorithm
https://github.com/andrixh/index-albanian
Transliterate the input text
Find and remove all stop words
Go through each word and strip
the mined suffixes (longest first)
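The three steps can be put together in a short Python sketch. This is not taken from the linked repository: the stop words and suffixes below are small hypothetical samples, the `min_stem` guard is an added assumption, and the sketch strips only the first (longest) matching suffix from each word.

```python
import re
import unicodedata

STOP_WORDS = {"dhe", "ne", "nje", "e", "te"}  # hypothetical sample list
SUFFIXES = ["ave", "eve", "ve", "it", "at"]   # hypothetical sample list


def transliterate(text):
    """Lowercase and strip diacritics (e.g. "ë" becomes "e")."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def index_text(text, stop_words=STOP_WORDS, suffixes=SUFFIXES):
    """Transliterate, drop stop words, then strip suffixes from the rest."""
    words = re.findall(r"[a-z]+", transliterate(text))
    return [strip_suffix(w, suffixes) for w in words if w not in stop_words]


print(index_text("Librave dhe shtëpive"))
```

Trying the longest suffixes first keeps a short suffix such as "ve" from matching a word that really ends in the longer "ave".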
57. Indexing the Albanian Language
by Andri Xhitoni
Thank you!
https://github.com/andrixh/index-albanian