What one needs to know to work in Natural Language Processing field and the aspects of developing an NLP project using the example of a system to identify text language

NLP Project Full Cycle
Vsevolod Dyomkin
10/2016

A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io

Plan
* Overview of NLP
* NLP Data
* Common NLP problems and approaches
* Example NLP application: text language identification
What Is NLP?
Transforming free-form text into structured data and back.
Intersection of:
* Computational Linguistics
* CompSci & AI
* ML, Stats, Information Theory

Natural Language
* ambiguous
* noisy
* evolving
Roles

linguist [noun]
1. A specialist in linguistics.

linguistics [noun]
1. The scientific study of language.
NLP Data
Types of text data:
* structured
* semi-structured
* unstructured

“Data is ten times more powerful than algorithms.”
-- Peter Norvig, The Unreasonable Effectiveness of Data
http://youtu.be/yvDCzhbjYWs

Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* Internet/user Data

Where to Get Data?
* Linguistic Data Consortium: http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites & the academic community: Stanford, Oxford, CMU, ...

Create Your Own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain
http://goo.gl/hs4qB
Classic NLP Problems
* Linguistically-motivated: segmentation, tagging, parsing
* Analytical: classification, sentiment analysis
* Transformation: translation, correction, generation
* Conversation: question answering, dialog

engineer [noun]
5. A person skilled in the design and programming of computer systems.

Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't" "so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital - Finland / Finlands / Finland’s ?
* what’re, I’m, isn’t - what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard?
* San Francisco - one token or two?
* m.p.h., PhD.
Regular Expressions
Simplest regex: [^\s]+
More advanced regex:
\w+|[!"#$%&'*+,./:;<=>?@^`~…(){}[\]|⟨⟩‒–—«»“”‘’―-]
Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…(){}[\]|⟨⟩«»“”‘’'‒–—―]
|[.!?]+
|-+
In fact, it works:
https://github.com/lang-uk/ner-uk/blob/master/doc/tokenization.md
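As a sketch of how such a regex is used in practice — this is an illustrative simplification, not the production pattern above, and unlike the Treebank-style example on the tokenization slide it keeps contractions such as isn't as single tokens:

```python
import re

# Token classes roughly mirroring the advanced regex: numbers first,
# then words (with internal apostrophes/hyphens), then any single
# punctuation character.
TOKEN_RE = re.compile(r"""
      [+-]?[0-9](?:[0-9,.]*[0-9])?   # numbers like 1.23 or 10,000
    | \w+(?:['’-]\w+)*               # words, incl. isn't, Hewlett-Packard
    | [^\w\s]                        # any single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

tokenize("This is a test that isn't so simple: 1.23.")
# -> ['This', 'is', 'a', 'test', 'that', "isn't", 'so', 'simple',
#     ':', '1.23', '.']
```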
Rule-based Approach
* easy to understand and reason about
* can be arbitrarily precise
* iterative, can be used to gather more data
Limitations:
* recall problems
* poor adaptability

Rule-based NLP tools
* SpamAssassin
* LanguageTool
* ELIZA
* GATE
researcher [noun]
1. One who researches.

research [noun]
1. Diligent inquiry or examination to seek or revise facts, principles, theories, applications, etc.; laborious or continued search after truth.

Models

Statistical Approach
“Probability theory is nothing but common sense reduced to calculation.”
-- Pierre-Simon Laplace
Language Models
Question: what is the probability of a sequence of words/sentence?
Answer: apply the chain rule:
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w0 w1 w2) * ...
where S = w0 w1 w2 ...

Ngrams
Apply the Markov assumption: each word depends only on the N previous words (in practice N=1..4, which results in bigram-fivegram models, because we include the current word as well).
If N=2:
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w1 w2) * ...
According to the chain rule:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
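The count ratio above can be turned into a minimal maximum-likelihood bigram model; this is a sketch only (real language models add smoothing for unseen ngrams), and the toy corpus is invented:

```python
from collections import Counter

def bigram_model(tokens):
    # MLE estimate: P(w2|w1) = count(w1 w2) / count(w1)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(w2, w1):
        if unigrams[w1] == 0:
            return 0.0  # unseen history
        return bigrams[(w1, w2)] / unigrams[w1]
    return prob

corpus = "the cat sat on the mat the cat ate".split()
p = bigram_model(corpus)
p("cat", "the")  # 2 occurrences of "the cat" / 3 of "the" = 2/3
```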
Spam Filtering
A 2-class classification problem with a bias towards minimizing false positives.
Default approach: rule-based (SpamAssassin)
Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of complex features

Bag-of-words Model
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold

A Plan for Spam: http://www.paulgraham.com/spam.html
Initial results: recall 92%, precision 98.84%
Improved results: recall 99.5%, precision 99.97%

Naive Bayes Classifier
P(Y|X) = P(Y) * P(X|Y) / P(X)
select Y = argmax P(Y|X)
Naive step:
P(Y|X) = P(Y) * prod(P(x|Y)) for all x in X
(P(X) is marginalized out because it's the same for all Y)
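The formula above can be turned into a tiny bag-of-words spam classifier. This sketch works in log space and uses add-one smoothing (a common refinement not shown on the slide) so unseen words don't zero out the product; the training examples are invented:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    # Multinomial naive Bayes over bag-of-words features.
    def fit(self, examples):  # examples: list of (words, label) pairs
        self.class_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)
        for words, label in examples:
            self.word_counts[label].update(words)
        self.vocab = {w for words, _ in examples for w in words}
        return self

    def predict(self, words):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            # log P(Y) + sum of log P(x|Y); logs avoid float underflow
            score = math.log(self.class_counts[label] /
                             sum(self.class_counts.values()))
            for w in words:
                # add-one smoothing for words unseen in this class
                score += math.log((self.word_counts[label][w] + 1) /
                                  (total + len(self.vocab)))
            return score
        return max(self.class_counts, key=log_score)

nb = NaiveBayes().fit([
    (["free", "viagra", "click"], "spam"),
    (["free", "money", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
])
nb.predict(["free", "click"])  # -> "spam"
```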
Machine Learning Approach

Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/

Shift-reduce Parsing
Averaged Perceptron
def train(model, number_iter, examples):
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        random.shuffle(examples)
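To make the training loop runnable, here is a hypothetical minimal model exposing the predict/weights interface the slide's code assumes; the feature names are invented, and a true averaged perceptron would additionally average the weights over iterations, which is omitted here for brevity:

```python
import random
from collections import defaultdict

class PerceptronModel:
    # Sparse linear model: one integer weight per (feature, tag) pair.
    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(lambda: defaultdict(int))

    def predict(self, features):
        scores = {tag: sum(self.weights[f][tag] for f in features)
                  for tag in self.tags}
        return max(self.tags, key=lambda tag: scores[tag])

def train(model, number_iter, examples):
    # Training loop from the slide: bump weights toward the true tag,
    # away from the wrong guess, on every mistake.
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        random.shuffle(examples)

# toy POS-style examples with made-up feature names
examples = [(["suffix=ing"], "VERB"), (["suffix=ed"], "VERB"),
            (["word=the"], "DET"), (["word=a"], "DET")]
model = PerceptronModel(["VERB", "DET"])
train(model, 5, examples)
model.predict(["word=the"])  # -> "DET"
```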
ML-based Parsing
The parser starts with an empty stack and a buffer index at 0, with no dependencies recorded. It chooses one of the valid actions and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input.

SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]

def parse(words, tags):
    n = len(words)
    deps = init_deps(n)
    idx = 1
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps
The Hierarchy of ML Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / Log-linear / Logistic Regression; Conditional Random Field
* SVM
Non-linear:
* Decision Trees, Random Forests, Boosted Trees
* Artificial Neural Networks

Semantics
Question: how to model relationships between words?
Answer: build a graph
* Wordnet
* Freebase
* DBPedia
Word Similarity
Next question: now, how do we measure those relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
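The PMI formula computed from counts, as a sketch (the co-occurrence counts below are invented for illustration):

```python
import math
from collections import Counter

def pmi(pair_counts, word_counts, x, y):
    # PMI(x,y) = log( p(x,y) / (p(x) * p(y)) )
    p_xy = pair_counts[(x, y)] / sum(pair_counts.values())
    p_x = word_counts[x] / sum(word_counts.values())
    p_y = word_counts[y] / sum(word_counts.values())
    return math.log(p_xy / (p_x * p_y))

# invented co-occurrence counts for illustration
words = Counter({"new": 50, "york": 20, "city": 40})
pairs = Counter({("new", "york"): 15, ("new", "city"): 2,
                 ("york", "city"): 3})
pmi(pairs, words, "new", "york")  # strongly positive: tight collocation
```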
Distributional Semantics
Distributional hypothesis: "You shall know a word by the company it keeps."
-- John Rupert Firth
Word representations:
* explicit representation
  (number of nonzero dimensions: max 474234, min 3, mean 1595, median 415)
* dense representation (word2vec, GloVe, ...)
* hierarchical representation (Brown clusters)
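Dense representations are typically compared with cosine similarity; a sketch with toy 3-dimensional vectors (real word2vec/GloVe embeddings have hundreds of dimensions, and these particular numbers are invented):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 3-d "embeddings"
vec = {"king":  [0.9, 0.8, 0.1],
       "queen": [0.85, 0.75, 0.2],
       "pizza": [0.1, 0.2, 0.95]}
cosine(vec["king"], vec["queen"])  # close to 1
cosine(vec["king"], vec["pizza"])  # much smaller
```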
Steps to Develop an NLP System
* Translate real-world requirements into a measurable goal
* Find a suitable level and representation
* Find initial data for experiments
* Find and utilize existing tools and frameworks where possible
* Set up and perform a proper experiment (or a series of experiments)
* Optimize the system for production

Going into Prod
* NLP tasks are usually CPU-intensive but stateless
* General-purpose NLP frameworks are (mostly) not production-ready
* Don't trust research results
* Value pre- and post-processing
* Gather user feedback

Text Language Identification
Not an unsolved problem:
* https://github.com/CLD2Owners/cld2 - C++
* https://github.com/saffsd/langid.py - Python
* https://github.com/shuyo/language-detection/ - Java
To read:
* https://blog.twitter.com/2015/evaluating-language-identification-performance
* http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
* http://lab.hypotheses.org/1083
* http://labs.translated.net/language-identifier/
WILD Challenges

YALI WILD
* All of them use weak models
* Wanted to use Wiktionary — 150+ languages, always evolving
* Wanted to do it in Lisp

WILD Linguistics
* Scripts vs languages: http://www.omniglot.com/writing/langalph.htm
* Languages distribution: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
* Frequency word lists: https://invokeit.wordpress.com/frequency-word-lists/
* Word segmentation?
WILD Data
Wiktionary & Wikipedia data: used abstracts, ~175 languages
- download & store
- process (SAX parsing)
- set up learning & test data sets
10,778,404 unique words
481,581 unique character trigrams
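The slides don't show WILD's exact scoring, but a character-trigram language identifier of this kind can be sketched generically as follows; the training samples and the smoothing constant are illustrative:

```python
import math
from collections import Counter

def char_trigrams(text):
    text = f" {text.lower()} "  # pad so word boundaries show up
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

class TrigramLangID:
    def __init__(self, samples):  # samples: {language: training text}
        self.models = {lang: char_trigrams(text)
                       for lang, text in samples.items()}

    def detect(self, text):
        def score(lang):
            model = self.models[lang]
            total = sum(model.values())
            # smoothed log-probability of the text's trigrams;
            # 1000 is an arbitrary stand-in for trigram-vocabulary size
            return sum(n * math.log((model[t] + 1) / (total + 1000))
                       for t, n in char_trigrams(text).items())
        return max(self.models, key=score)

langid = TrigramLangID({
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund",
})
langid.detect("the dog and the fox")  # -> "en"
```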
WILD Engineering
* Initial model size ~1G - script hacks & Huffman coding to the rescue
* Model pruning
* Proper probability calculations
* Efficient testing
* Properly saving the model
* Library & public API