This document discusses topic modelling and APIs. It proposes representing algorithms like topic modelling as "mills" that encapsulate work without owning data. Mills for topic modelling are described, including resources for creating a topic model, classifying text with a trained model, and getting the classification results. Finally, it reviews the current state of machine learning APIs and acknowledges some references.
17. Latent Dirichlet Allocation (LDA)
✤ Mainly a “clustering” algorithm
✤ Defines topics as latent variables within the documents
✤ Implementations are available in most programming languages
✤ Python => Gensim; Java => Mallet
18. Topic Modelling concepts in LDA
✤ Document: “Bag of words” vs. “Markov chain”
✤ Word: merely an id (“library”=>123, “librarian”=>789)
✤ Dictionary: set of all words
✤ Corpus: set of all documents
✤ Topic: Distribution over words (LDA)
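The dictionary/corpus/bag-of-words concepts above can be sketched in plain Python (the helper names `build_dictionary` and `to_bow` are invented for this illustration):

```python
def build_dictionary(documents):
    # Dictionary: every distinct word across all documents gets an integer id
    dictionary = {}
    for doc in documents:
        for word in doc:
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary

def to_bow(doc, dictionary):
    # Bag of words: word order is discarded, only (word_id, count) pairs remain
    bow = {}
    for word in doc:
        word_id = dictionary[word]
        bow[word_id] = bow.get(word_id, 0) + 1
    return sorted(bow.items())

docs = [["library", "book", "library"], ["librarian", "book"]]
dictionary = build_dictionary(docs)   # e.g. "library" -> 0, "book" -> 1, ...
corpus = [to_bow(d, dictionary) for d in docs]
```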
19. Using Latent Dirichlet Allocation
✤ Document as a vector of topic weights {0: 0.01, 12: 0.19, 42: 0.23}
✤ Cosine similarity for document similarity
✤ Document similarity works really well
✤ Not great in some domains [to fix => hierarchical LDA]
✤ Boosting
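Cosine similarity over the sparse topic-weight vectors above can be computed directly; this is a stdlib-only sketch (the example weights for `doc_b` are made up):

```python
import math

def cosine(a, b):
    # Dot product over topic ids shared by both documents
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Documents as {topic_id: weight}, as on the slide
doc_a = {0: 0.01, 12: 0.19, 42: 0.23}
doc_b = {12: 0.25, 42: 0.20}  # hypothetical second document
```

A document is always maximally similar to itself (cosine 1.0), and overlap on topics 12 and 42 gives `doc_a` and `doc_b` a high but sub-unit similarity.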
33. Mills
✤ A single piece of work/specialty (& verb)
✤ Encapsulating an “algorithm”
✤ Do not own data (they do own config, though):
raw data in, processed result out
✤ All calls are safe and idempotent
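One way to read the mill properties above is as a stateless object that holds only configuration; this sketch (the `WordCountMill` name and its config are invented for illustration) shows how "raw data in, processed result out" makes every call safe and idempotent:

```python
class WordCountMill:
    """Hypothetical mill: encapsulates one algorithm, owns config but no data."""

    def __init__(self, lowercase=True):
        # Configuration only; no documents are stored on the mill
        self.lowercase = lowercase

    def process(self, raw_text):
        # Raw data in, processed result out; no state kept between calls,
        # so repeating a call with the same input yields the same output
        words = raw_text.lower().split() if self.lowercase else raw_text.split()
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        return counts

mill = WordCountMill()
result = mill.process("Library library book")
```

Because the mill keeps no per-call state, callers can retry freely, which is what makes the calls idempotent in the API sense.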