In this presentation and the linked notebooks, we learn the basics of creating a machine learning classifier from scratch, using language classification as a running example. We start by implementing the naive intuition that letter frequency could provide a model for language classification, and then we implement the n-gram approach described in the paper by Cavnar and Trenkle.
In the corresponding notebook, we create a Spark ML Transformer from the n-gram model that can be used to classify text in a Dataset or DataFrame.
2. Gerard Maas
Lead Engineer @ Kensu
Computer Engineer
Scala Programmer
Early Spark Adopter
Spark Notebook Dev
Cassandra MVP (2015, 2016)
Stack Overflow Top Contributor
(Spark, Spark Streaming, Scala)
Wannabe IoT Hacker
Arduino Enthusiast
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
3. DATA SCIENCE GOVERNANCE
Adalog helps enterprises ensure that data pipelines continually deliver
their value by combining the contextual information captured when a pipeline
was created with the evolving environment in which that pipeline executes.
CONNECT - COLLECT - LEARN
8. Letter Frequency
Could we characterize a language by calculating the relative frequency of letters in some text?
Spanish vs English letter frequency
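To make the question concrete, here is a minimal sketch of a relative letter frequency computation in plain Scala. The object and method names are hypothetical, and the choices to fold case and ignore non-letters are assumptions; the notebook may do this differently.

```scala
object LetterFrequency {
  /** Relative frequency of each letter in `text`.
    * Assumption: case is folded and non-letter characters are ignored.
    */
  def relativeFrequency(text: String): Map[Char, Double] = {
    val letters = text.toLowerCase.filter(_.isLetter)
    if (letters.isEmpty) Map.empty
    else {
      val total = letters.length.toDouble
      // Group identical letters and divide each group's size by the total.
      letters.groupBy(identity).map { case (ch, occurrences) =>
        ch -> occurrences.length / total
      }
    }
  }
}
```

For example, `LetterFrequency.relativeFrequency("Hola, mundo")` yields a map from each letter to its share of the nine letters in the text.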
9. n-grams
"cavnar and trenkle"
bi-grams: ca,av,vn,na,ar,r_,_a,an,nd,d_,_t,tr,re,en,nk,kl,le,e_
tri-grams: cav,avn,vna,nar,ar_,r_a,_an,and,nd_,d_t,_tr,tre,ren,enk,nkl,kle,le_
quad-grams: cavn,...
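The lists above can be reproduced with a sliding window over the text. This is a sketch only: the names are hypothetical, and the padding convention (spaces become `_`, plus one trailing `_`) is inferred from the example on this slide rather than taken from the notebook.

```scala
object NGrams {
  /** Character n-grams of `text`, using the slide's convention:
    * whitespace is replaced by '_' and a single '_' is appended at the end.
    */
  def ngrams(text: String, n: Int): List[String] = {
    require(n > 0, "n must be positive")
    val padded = text.toLowerCase.trim.replaceAll("\\s+", "_") + "_"
    // sliding(n) would emit one short chunk if the text is shorter than n.
    if (padded.length < n) Nil else padded.sliding(n).toList
  }
}
```

Calling `NGrams.ngrams("cavnar and trenkle", 2)` starts with `ca, av, vn, na, ar, r_` and ends with `le, e_`.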
http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
Could we characterize a language by calculating the relative frequency of sequences of letters in some text?
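The paper linked above answers this by ranking n-grams from most to least frequent and comparing rank orders with an "out-of-place" measure. The sketch below illustrates that idea under stated assumptions: all names are hypothetical, the defaults (`maxN = 3`, `size = 300`) are illustrative parameters, ties are broken alphabetically, and a missing n-gram is charged the length of the language profile. It is not the notebook's implementation.

```scala
object RankedNGramProfiles {
  // All n-grams of sizes 1..maxN, padded as in the slide's example.
  private def allNGrams(text: String, maxN: Int): List[String] = {
    val padded = text.toLowerCase.trim.replaceAll("\\s+", "_") + "_"
    (1 to maxN).toList.flatMap { n =>
      if (padded.length < n) Nil else padded.sliding(n).toList
    }
  }

  /** n-grams ordered from most to least frequent, truncated to `size`. */
  def profile(text: String, maxN: Int = 3, size: Int = 300): Vector[String] =
    allNGrams(text, maxN)
      .groupBy(identity)
      .map { case (gram, occurrences) => gram -> occurrences.size }
      .toVector
      .sortBy { case (gram, count) => (-count, gram) } // ties: alphabetical
      .take(size)
      .map(_._1)

  /** Sum of rank differences; n-grams absent from `language` get a fixed penalty. */
  def outOfPlace(doc: Vector[String], language: Vector[String]): Int = {
    val rankIn = language.zipWithIndex.toMap
    val maxPenalty = language.size
    doc.zipWithIndex.map { case (gram, rank) =>
      rankIn.get(gram).map(r => math.abs(r - rank)).getOrElse(maxPenalty)
    }.sum
  }

  /** Label of the language profile closest to the text's profile. */
  def classify(text: String, models: Map[String, Vector[String]], maxN: Int = 3): String = {
    require(models.nonEmpty, "at least one language model is required")
    val doc = profile(text, maxN)
    models.minBy { case (_, languageProfile) => outOfPlace(doc, languageProfile) }._1
  }
}
```

A profile compared with itself has distance 0, and the distance grows as the rank orders diverge, so the smallest distance picks the most similar language.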
12. Spark APIs
RDD -> Resilient Distributed Datasets
- Lazy, functional-oriented, low level API
- Basis for execution of all high-level libraries
Dataframes
- Column-oriented, SQL-inspired DSL
- Many optimizations under the hood (Catalyst, Tungsten)
Dataset
- Best of both worlds (except …)
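As a rough illustration of how the three APIs differ, the sketch below counts documents per language three ways. It assumes Spark 2.x or later on the classpath and a local master; the `Doc` case class and the sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SparkApisSketch {
  // Hypothetical record type used only for this illustration.
  final case class Doc(lang: String, text: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("spark-apis-sketch")
      .getOrCreate()
    import spark.implicits._

    val docs = Seq(Doc("en", "the cat"), Doc("es", "el gato"), Doc("en", "a dog"))

    // RDD: low-level and functional; we write the aggregation ourselves.
    val rddCounts = spark.sparkContext.parallelize(docs)
      .map(d => (d.lang, 1))
      .reduceByKey(_ + _)

    // DataFrame: column-oriented, SQL-like DSL, planned by the optimizer.
    val dfCounts = docs.toDF().groupBy($"lang").count()

    // Dataset: typed objects with lambdas, running on the same engine.
    val dsCounts = docs.toDS().groupByKey(_.lang).count()

    rddCounts.collect().foreach(println)
    dfCounts.show()
    dsCounts.show()
    spark.stop()
  }
}
```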
16. Notebook 1 : Naive Language Classification
https://github.com/maasg/spark-notebooks/languageclassification/language-detection-letter-freq.snb
Implements the idea of using a letter frequency model to classify the language of a document.
Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/
It also produces a training set of sampled strings that is reused for the n-gram classifier.
(Note: this notebook is missing a function that is left as an exercise for the reader. The folder /solutions contains the full working version.)
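One possible way to turn letter frequencies into a classifier is to pick the language whose frequency profile is nearest to the document's. The sketch below uses Euclidean distance, which is an assumption made for illustration; all names are hypothetical, and the notebook may use a different comparison.

```scala
object NaiveLanguageClassifier {
  type Profile = Map[Char, Double]

  /** Relative letter frequencies, ignoring case and non-letters. */
  def letterProfile(text: String): Profile = {
    val letters = text.toLowerCase.filter(_.isLetter)
    val total = letters.length.toDouble
    if (letters.isEmpty) Map.empty
    else letters.groupBy(identity).map { case (ch, occ) => ch -> occ.length / total }
  }

  /** Euclidean distance over the union of letters (assumed metric). */
  def distance(a: Profile, b: Profile): Double = {
    val letters = (a.keySet ++ b.keySet).toSeq // Seq keeps equal terms distinct
    math.sqrt(letters.map { ch =>
      val d = a.getOrElse(ch, 0.0) - b.getOrElse(ch, 0.0)
      d * d
    }.sum)
  }

  /** Label of the model profile nearest to the text's profile. */
  def classify(text: String, models: Map[String, Profile]): String = {
    require(models.nonEmpty, "at least one language model is required")
    val doc = letterProfile(text)
    models.minBy { case (_, model) => distance(doc, model) }._1
  }
}
```

In practice each model would be built by calling `letterProfile` on a large sample of text in that language.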
17. Notebook 2 : n-gram Language Classification
https://github.com/maasg/spark-notebooks/languageclassification/n-gram-language-classification.snb
Implements the n-gram algorithm described in the paper.
Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/
Uses the resulting classifier to implement a custom Spark ML Transformer that can be easily used to classify new texts.
Transformers can be combined into Spark ML Pipelines of arbitrary complexity.
(Note: this notebook is missing a function that’s left as an exercise to the reader. The folder /solutions contains the full working version.)
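A custom Transformer along these lines might look like the sketch below. It assumes Spark 2.x or later; the class name, the `text` and `language` column names, and the `classify` function passed in are all hypothetical stand-ins, not the notebook's code.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical Transformer that wraps any String => String classifier.
class LanguageDetector(override val uid: String, classify: String => String)
    extends Transformer {

  def this(classify: String => String) =
    this(Identifiable.randomUID("languageDetector"), classify)

  // Adds a "language" column computed from the "text" column (assumed names).
  override def transform(dataset: Dataset[_]): DataFrame = {
    val classifyUdf = udf(classify)
    dataset.withColumn("language", classifyUdf(col("text")))
  }

  // Declares the output schema so a Pipeline can validate its stages up front.
  override def transformSchema(schema: StructType): StructType =
    schema.add(StructField("language", StringType, nullable = true))

  // No Params are defined, so a copy only needs the same uid and function.
  override def copy(extra: ParamMap): LanguageDetector =
    new LanguageDetector(uid, classify)
}
```

An instance could then be applied with `new LanguageDetector(myClassifier).transform(df)`, where `myClassifier` and `df` are placeholders, or placed as a stage in a Spark ML `Pipeline`.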