SlideShare a Scribd company logo
1 of 25
Download to read offline
NLP in the WILD
-or-
Building a System for
Text Language Identification
Vsevolod Dyomkin
12/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
Roles
Langid Problem
* 150+ langs in Wikipedia
* >10 writing systems
(script/alphabet) in active use
* script-lang: 1:1, 1:2, 1:n, n:1 :)
* Latin >50 langs, Cyrillyc >20
* Long texts easy, short hmm– –
* Internet texts (mixed langs)
* Small task => resource-constrained
Twitter Case Study
https://blog.twitter.com/2015/evaluating-language-
identification-performance
Prior Art
* C++: https://github.com/CLD2Owners/cld2
* Python: https://github.com/saffsd/langid.py
* Java: 
https://github.com/shuyo/language-detection/
http://blog.mikemccandless.com/2011/10/accuracy
-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/
YALI WILD
* All of them use weak models
* Wanted to use Wiktionary —
150+ languages, always evolving
* Wanted to do in Lisp
Linguistics
(domain knowledge)
* Polyglots?
* ISO 639
* Internet lang bias
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
* Rule-based ideas:
- 1:1/1:2 scripts
- unique letters
* Per-script/per-lang segmentation
insight
data
Data
* evaluation data:
- smoke test
- in-/out-of-domain data
- precision-/recall-oriented
* training data
- where to get? Wikidata
- how to get? SAX parsing
Wiktionary
* good source for various
dictionaries and word lists (word
forms, definitions, synonyms,…)
* ~100 langs
Wiktionary
* good source for various
dictionaries and word lists (word
forms, definitions, synonyms,…)
* ~100 langs
Wikipedia
* >150 langs
* size? Wikipedia abstracts
* automation?
* filtering?
Alternatives
* API
(defun get-examples (word)
(remove-if-not
^(upper-case-p (char % 0))
(mapcar ^(substitute #Space #_ (? % "text"))
(? (yason:parse
(drakma:http-request
(fmt "http://api.wordnik.com/v4/word.json/~A/examples"
(drakma:url-encode word :utf-8))
:additional-headers *wordnik-auth-headers*))
"examples"))))
* Web scraping
(defmethod scrape ((site (eql :linguaholic)) source)
(match-html source
'(>> article
(aside (>> a ($ user))
(>> li (strong "Native Tongue:") ($ lang)))
(div |...| (>> (div :data-role "commentContent")
($ text) (span) |...|))
!!!))))
Research
(quality)
* Simple task => simple models (NB)
* Challenges
- short texts
- mixed langs
- 90% of data - cryptic
ideas
experiments
Naive Bayes
* Features: 3-/4-char ngrams
* Improvement ideas:
- add words (word unigrams)
- factor in word lengths
- use Internet lang bias
Formula:
(argmax (* (? priors lang)
(or (? word-probs word)
(norm (reduce '* ^(? 3g-probs %)
(word-3gs word)))))
langs)
http://www.paulgraham.com/spam.html
Experiments
* Usual ML setup (70:30) doesn't
work here
* “If you torment the data too
much...” (~c) Yaser Abu-Mosafa
* Comparison with existing systems
helps
Confusion MatrixAB: 0.90 | FR:0.10
AF: 0.80 | EN:0.20
AK: 0.80 | NN:0.10 IT:0.10
AN: 0.90 | ES:0.10
AY: 0.90 | ES:0.10
BG: 0.60 | RU:0.40
BM: 0.80 | FR:0.10 LA:0.10
BS: 0.90 | EN:0.10
CO: 0.90 | IT:0.10
CR: 0.40 | FR:0.30 UND:0.20 MS:0.10
CS: 0.90 | IT:0.10
CU: 0.90 | VI:0.10
CV: 0.80 | RU:0.20
DA: 0.70 | FO:0.10 NO:0.10 NN:0.10
DV: 0.80 | UZ:0.10 EN:0.10
DZ: NIL | BO:0.80 IK:0.10 NE:0.10
EN: 0.90 | NL:0.10
ET: 0.80 | EN:0.20
FF: 0.50 | EN:0.20 FR:0.10 EO:0.10 SV:0.10
FI: 0.80 | FR:0.10 DA:0.10
FJ: 0.90 | OC:0.10
GL: 0.90 | ES:0.10
HA: 0.80 | YO:0.10 EN:0.10
HR: 0.70 | BS:0.10 DE:0.10 GL:0.10
ID: 0.80 | MS:0.20
IE: 0.90 | EN:0.10
IG: 0.60 | EN:0.40
IO: 0.86 | DA:0.14
KG: 0.90 | SW:0.10
KL: 0.90 | EN:0.10
KS: 0.30 | UR:0.60 UND:0.10
KU: 0.90 | EN:0.10
KW: 0.89 | UND:0.11
LA: 0.90 | FR:0.10
LB: 0.90 | EN:0.10
LG: 0.90 | IT:0.10
LI: 0.80 | NL:0.20
MI: 0.90 | ES:0.10
MK: 0.80 | IT:0.10 RU:0.10
MS: 0.80 | ID:0.10 EN:0.10
MT: 0.90 | DE:0.10
NO: 0.90 | DA:0.10
NY: 0.80 | AR:0.10 SW:0.10
OM: 0.90 | EN:0.10
OS: 0.90 | RU:0.10
QU: 0.70 | ES:0.20 EN:0.10
RM: 0.90 | EN:0.10
RN: 0.50 | RW:0.40 YO:0.10
SC: 0.90 | FR:0.10
SG: 0.90 | FR:0.10
SR: 0.80 | HR:0.10 BS:0.10
SS: 0.50 | EN:0.30 DA:0.10 ZU:0.10
ST: 0.90 | PT:0.10
SV: 0.90 | DA:0.10
TI: 0.40 | AM:0.40 LA:0.10 EN:0.10
TK: 0.80 | TR:0.20
TO: 0.50 | EN:0.50
TS: 0.80 | EN:0.10 UZ:0.10
TW: 0.40 | EN:0.40 AK:0.10 YO:0.10
TY: 0.90 | ES:0.10
UG: 0.60 | UZ:0.40
UK: 0.80 | UND:0.10 VI:0.10
VE: 0.90 | EN:0.10
WO: 0.80 | NL:0.10 FR:0.10
XH: 0.80 | UZ:0.10 EN:0.10
YO: 0.80 | EN:0.20
ZU: 0.60 | XH:0.30 PT:0.10
Total quality: 0.90
The Ladder of NLP
Rule-based
Linear ML
Decision Trees & co.
Sequence models
Artificial Neural networks
Better Models
What can be improved?
* Account for word order
* Discriminative models per script
* DeepLearning™ model
Marginal gain is not huge…
Engineer
(efficiency)
* Just a small piece
of the pipeline:
- good-enough speed
- minimize space usage
- minimize external dependencies
* Proper floating-point calculations
* Proper processing of big texts?
* Pre-/post-processing
* Clean API
implementation
optimization
Model Optimization
Initial model size: ~1G
Target: ~10M :)
How to do it?
- Lossy compression: pruning
- Lossless compression:
Huffman coding, efficient DS
API
* Levels of detalization:
- text-langs
- word-langs
- window?
* UI: library, REPL & Web APIs
Recap
* Triple view of any
knowledge-related problem
* Ladder of approaches to solving
NLP problems
* Importance of productive env:
general- & special-purpose
REPL lang API access to data– –
efficient testing–
* Main stages of problem solving:
data experiment→ →
implementation optimization→

More Related Content

Viewers also liked

Metaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispMetaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispDamien Cassou
 
LISP: How I Learned To Stop Worrying And Love Parantheses
LISP: How I Learned To Stop Worrying And Love ParanthesesLISP: How I Learned To Stop Worrying And Love Parantheses
LISP: How I Learned To Stop Worrying And Love ParanthesesDominic Graefen
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Vsevolod Dyomkin
 
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная оберткаVsevolod Dyomkin
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхVsevolod Dyomkin
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Vsevolod Dyomkin
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common LispVsevolod Dyomkin
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Vsevolod Dyomkin
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in PracticeVsevolod Dyomkin
 
Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow Rajiv Shah
 

Viewers also liked (11)

Metaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common LispMetaprogramming and Reflection in Common Lisp
Metaprogramming and Reflection in Common Lisp
 
LISP: How I Learned To Stop Worrying And Love Parantheses
LISP: How I Learned To Stop Worrying And Love ParanthesesLISP: How I Learned To Stop Worrying And Love Parantheses
LISP: How I Learned To Stop Worrying And Love Parantheses
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?
 
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная обертка
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данных
 
CL-NLP
CL-NLPCL-NLP
CL-NLP
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common Lisp
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow
 

Similar to NLP in the WILD or Building a System for Text Language Identification

Building modern web apps with html5, javascript, and java
Building modern web apps with html5, javascript, and javaBuilding modern web apps with html5, javascript, and java
Building modern web apps with html5, javascript, and javaAlexander Gyoshev
 
Lambdas myths-and-mistakes
Lambdas myths-and-mistakesLambdas myths-and-mistakes
Lambdas myths-and-mistakesRichardWarburton
 
Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)
Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)
Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)jaxLondonConference
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate WorksGoro Fuji
 
An Introduction to CSS Preprocessors
An Introduction to CSS PreprocessorsAn Introduction to CSS Preprocessors
An Introduction to CSS PreprocessorsMiloš Sutanovac
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?Jeremy Schneider
 
Developing OpenResty Framework
Developing OpenResty FrameworkDeveloping OpenResty Framework
Developing OpenResty FrameworkAapo Talvensaari
 
How to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero ClicksHow to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero ClicksPuppet
 
How to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero ClicksHow to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero ClicksCarlos Sanchez
 
Creating web APIs with apigility
Creating web APIs with apigilityCreating web APIs with apigility
Creating web APIs with apigilityKaloyan Raev
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisBrendan Gregg
 
How deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performanceHow deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performanceCumulus Networks
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsAjay Ohri
 
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures LibraryAPOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Libraryjexp
 
Beginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleBeginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleAri Lerner
 
WebCamp: Developer Day: The Big, the Small and the Redis - Андрей Савченко
WebCamp: Developer Day: The Big, the Small and the Redis - Андрей СавченкоWebCamp: Developer Day: The Big, the Small and the Redis - Андрей Савченко
WebCamp: Developer Day: The Big, the Small and the Redis - Андрей СавченкоGeeksLab Odessa
 
Spring scala - Sneaking Scala into your corporation
Spring scala  - Sneaking Scala into your corporationSpring scala  - Sneaking Scala into your corporation
Spring scala - Sneaking Scala into your corporationHenryk Konsek
 
The Crystal language *recently* update
The Crystal language *recently* updateThe Crystal language *recently* update
The Crystal language *recently* updatekarupanerura
 

Similar to NLP in the WILD or Building a System for Text Language Identification (20)

Building modern web apps with html5, javascript, and java
Building modern web apps with html5, javascript, and javaBuilding modern web apps with html5, javascript, and java
Building modern web apps with html5, javascript, and java
 
Lambdas myths-and-mistakes
Lambdas myths-and-mistakesLambdas myths-and-mistakes
Lambdas myths-and-mistakes
 
Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)
Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)
Lambda Expressions: Myths and Mistakes - Richard Warburton (jClarity)
 
Scalaxb preso
Scalaxb presoScalaxb preso
Scalaxb preso
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
 
An Introduction to CSS Preprocessors
An Introduction to CSS PreprocessorsAn Introduction to CSS Preprocessors
An Introduction to CSS Preprocessors
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?
 
Developing OpenResty Framework
Developing OpenResty FrameworkDeveloping OpenResty Framework
Developing OpenResty Framework
 
Soap vs-rest
Soap vs-restSoap vs-rest
Soap vs-rest
 
How to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero ClicksHow to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero Clicks
 
How to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero ClicksHow to Develop Puppet Modules: From Source to the Forge With Zero Clicks
How to Develop Puppet Modules: From Source to the Forge With Zero Clicks
 
Creating web APIs with apigility
Creating web APIs with apigilityCreating web APIs with apigility
Creating web APIs with apigility
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
 
How deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performanceHow deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performance
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media Analytics
 
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures LibraryAPOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library
 
Beginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at GoogleBeginner workshop to angularjs presentation at Google
Beginner workshop to angularjs presentation at Google
 
WebCamp: Developer Day: The Big, the Small and the Redis - Андрей Савченко
WebCamp: Developer Day: The Big, the Small and the Redis - Андрей СавченкоWebCamp: Developer Day: The Big, the Small and the Redis - Андрей Савченко
WebCamp: Developer Day: The Big, the Small and the Redis - Андрей Савченко
 
Spring scala - Sneaking Scala into your corporation
Spring scala  - Sneaking Scala into your corporationSpring scala  - Sneaking Scala into your corporation
Spring scala - Sneaking Scala into your corporation
 
The Crystal language *recently* update
The Crystal language *recently* updateThe Crystal language *recently* update
The Crystal language *recently* update
 

Recently uploaded

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

NLP in the WILD or Building a System for Text Language Identification

  • 1. NLP in the WILD -or- Building a System for Text Language Identification Vsevolod Dyomkin 12/2016
  • 2. A Bit about Me * Lisp programmer * 5+ years of NLP work at Grammarly * Occasional lecturer https://vseloved.github.io
  • 4. Langid Problem * 150+ langs in Wikipedia * >10 writing systems (script/alphabet) in active use * script-lang: 1:1, 1:2, 1:n, n:1 :) * Latin >50 langs, Cyrillyc >20 * Long texts easy, short hmm– – * Internet texts (mixed langs) * Small task => resource-constrained
  • 6. Prior Art * C++: https://github.com/CLD2Owners/cld2 * Python: https://github.com/saffsd/langid.py * Java:  https://github.com/shuyo/language-detection/ http://blog.mikemccandless.com/2011/10/accuracy -and-performance-of-googles.html http://lab.hypotheses.org/1083 http://labs.translated.net/language-identifier/
  • 7.
  • 8. YALI WILD * All of them use weak models * Wanted to use Wiktionary — 150+ languages, always evolving * Wanted to do in Lisp
  • 9. Linguistics (domain knowledge) * Polyglots? * ISO 639 * Internet lang bias https://en.wikipedia.org/wiki/Languages_used_on_the_Internet * Rule-based ideas: - 1:1/1:2 scripts - unique letters * Per-script/per-lang segmentation insight data
  • 10. Data * evaluation data: - smoke test - in-/out-of-domain data - precision-/recall-oriented * training data - where to get? Wikidata - how to get? SAX parsing
  • 11. Wiktionary * good source for various dictionaries and word lists (word forms, definitions, synonyms,…) * ~100 langs
  • 12. Wiktionary * good source for various dictionaries and word lists (word forms, definitions, synonyms,…) * ~100 langs
  • 13. Wikipedia * >150 langs * size? Wikipedia abstracts * automation? * filtering?
  • 14. Alternatives * API (defun get-examples (word) (remove-if-not ^(upper-case-p (char % 0)) (mapcar ^(substitute #Space #_ (? % "text")) (? (yason:parse (drakma:http-request (fmt "http://api.wordnik.com/v4/word.json/~A/examples" (drakma:url-encode word :utf-8)) :additional-headers *wordnik-auth-headers*)) "examples")))) * Web scraping (defmethod scrape ((site (eql :linguaholic)) source) (match-html source '(>> article (aside (>> a ($ user)) (>> li (strong "Native Tongue:") ($ lang))) (div |...| (>> (div :data-role "commentContent") ($ text) (span) |...|)) !!!))))
  • 15. Research (quality) * Simple task => simple models (NB) * Challenges - short texts - mixed langs - 90% of data - cryptic ideas experiments
  • 16. Naive Bayes * Features: 3-/4-char ngrams * Improvement ideas: - add words (word unigrams) - factor in word lengths - use Internet lang bias Formula: (argmax (* (? priors lang) (or (? word-probs word) (norm (reduce '* ^(? 3g-probs %) (word-3gs word))))) langs) http://www.paulgraham.com/spam.html
  • 17.
  • 18. Experiments * Usual ML setup (70:30) doesn't work here * “If you torment the data too much...” (~c) Yaser Abu-Mosafa * Comparison with existing systems helps
  • 19. Confusion MatrixAB: 0.90 | FR:0.10 AF: 0.80 | EN:0.20 AK: 0.80 | NN:0.10 IT:0.10 AN: 0.90 | ES:0.10 AY: 0.90 | ES:0.10 BG: 0.60 | RU:0.40 BM: 0.80 | FR:0.10 LA:0.10 BS: 0.90 | EN:0.10 CO: 0.90 | IT:0.10 CR: 0.40 | FR:0.30 UND:0.20 MS:0.10 CS: 0.90 | IT:0.10 CU: 0.90 | VI:0.10 CV: 0.80 | RU:0.20 DA: 0.70 | FO:0.10 NO:0.10 NN:0.10 DV: 0.80 | UZ:0.10 EN:0.10 DZ: NIL | BO:0.80 IK:0.10 NE:0.10 EN: 0.90 | NL:0.10 ET: 0.80 | EN:0.20 FF: 0.50 | EN:0.20 FR:0.10 EO:0.10 SV:0.10 FI: 0.80 | FR:0.10 DA:0.10 FJ: 0.90 | OC:0.10 GL: 0.90 | ES:0.10 HA: 0.80 | YO:0.10 EN:0.10 HR: 0.70 | BS:0.10 DE:0.10 GL:0.10 ID: 0.80 | MS:0.20 IE: 0.90 | EN:0.10 IG: 0.60 | EN:0.40 IO: 0.86 | DA:0.14 KG: 0.90 | SW:0.10 KL: 0.90 | EN:0.10 KS: 0.30 | UR:0.60 UND:0.10 KU: 0.90 | EN:0.10 KW: 0.89 | UND:0.11 LA: 0.90 | FR:0.10 LB: 0.90 | EN:0.10 LG: 0.90 | IT:0.10 LI: 0.80 | NL:0.20 MI: 0.90 | ES:0.10 MK: 0.80 | IT:0.10 RU:0.10 MS: 0.80 | ID:0.10 EN:0.10 MT: 0.90 | DE:0.10 NO: 0.90 | DA:0.10 NY: 0.80 | AR:0.10 SW:0.10 OM: 0.90 | EN:0.10 OS: 0.90 | RU:0.10 QU: 0.70 | ES:0.20 EN:0.10 RM: 0.90 | EN:0.10 RN: 0.50 | RW:0.40 YO:0.10 SC: 0.90 | FR:0.10 SG: 0.90 | FR:0.10 SR: 0.80 | HR:0.10 BS:0.10 SS: 0.50 | EN:0.30 DA:0.10 ZU:0.10 ST: 0.90 | PT:0.10 SV: 0.90 | DA:0.10 TI: 0.40 | AM:0.40 LA:0.10 EN:0.10 TK: 0.80 | TR:0.20 TO: 0.50 | EN:0.50 TS: 0.80 | EN:0.10 UZ:0.10 TW: 0.40 | EN:0.40 AK:0.10 YO:0.10 TY: 0.90 | ES:0.10 UG: 0.60 | UZ:0.40 UK: 0.80 | UND:0.10 VI:0.10 VE: 0.90 | EN:0.10 WO: 0.80 | NL:0.10 FR:0.10 XH: 0.80 | UZ:0.10 EN:0.10 YO: 0.80 | EN:0.20 ZU: 0.60 | XH:0.30 PT:0.10 Total quality: 0.90
  • 20. The Ladder of NLP Rule-based Linear ML Decision Trees & co. Sequence models Artificial Neural networks
  • 21. Better Models What can be improved? * Account for word order * Discriminative models per script * DeepLearning™ model Marginal gain is not huge…
  • 22. Engineer (efficiency) * Just a small piece of the pipeline: - good-enough speed - minimize space usage - minimize external dependencies * Proper floating-point calculations * Proper processing of big texts? * Pre-/post-processing * Clean API implementation optimization
  • 23. Model Optimization Initial model size: ~1G Target: ~10M :) How to do it? - Lossy compression: pruning - Lossless compression: Huffman coding, efficient DS
  • 24. API * Levels of detalization: - text-langs - word-langs - window? * UI: library, REPL & Web APIs
  • 25. Recap * Triple view of any knowledge-related problem * Ladder of approaches to solving NLP problems * Importance of productive env: general- & special-purpose REPL lang API access to data– – efficient testing– * Main stages of problem solving: data experiment→ → implementation optimization→