Distinguish Pop from Heavy Metal using Apache Spark MLlib

#javaone

https://ua.linkedin.com/in/tarasmatyashovsky

I am not a data science engineer

“I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you”
https://github.com/tmatyashovsky/spark-ml-samples

“I'm a rolling thunder, a pouring rain
I'm gonna get you, Satan get you”

• Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.?
• Count length of a verse?
• Count unique words in a verse?
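The three ideas above can be sketched as hand-crafted features. This is only an illustration: the word list is copied from the slide's examples, and the function and feature names are hypothetical.

```python
import re

# Word list taken from the examples on the slide; a real list would be longer.
DARK_WORDS = {"fear", "fight", "kill", "devil", "death"}


def naive_features(verse: str) -> dict:
    """Compute the three hand-crafted features for one verse."""
    words = re.findall(r"[a-z']+", verse.lower())
    return {
        "dark_word_count": sum(1 for w in words if w in DARK_WORDS),
        "verse_length": len(words),
        "unique_words": len(set(words)),
    }


print(naive_features("Nobody's putting up a fight"))
```

Features like these are easy to compute but say little about meaning, which is what motivates learned representations later in the talk.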

Machine learning is the study of computer algorithms that improve automatically through experience

Supervised learning
Unsupervised learning
Reinforcement learning

• Date & time
• Conference name
• Speaker
• Talk name
• Track
• Duration
• Type
• Overall impression
• Overall rating
• Number of slides
• Time spent on live coding
• Number of jokes
• Etc.

Learning algorithms
Hypotheses:
Cost function:
Features:
Target variable:
Training example:
Training set:
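A minimal sketch of how these terms fit together, assuming a one-feature linear hypothesis and a squared-error cost. Both are common textbook choices, not necessarily the formulas on the original slide. The training set is invented: the feature is the number of jokes and the target variable is the speaker's rating.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis: predicted target for feature value x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, training_set):
    """Squared-error cost averaged over the training set (halved by convention)."""
    m = len(training_set)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in training_set) / (2 * m)


# Hypothetical training set: each training example is (number of jokes, rating).
training_set = [(0, 1.0), (2, 2.0), (4, 3.0)]

print(cost(1.0, 0.5, training_set))  # parameters that fit every example
print(cost(0.0, 0.0, training_set))  # a worse hypothesis has a higher cost
```

A learning algorithm searches for the parameters that minimize the cost over the training set.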

http://www.slideshare.net/liweiyang5/spark-mllib-training-material

[Plot: speaker’s rating vs. number of jokes during a talk]

[Plot: impression (positive or negative) vs. number of jokes during a talk]

• Collect data set of lyrics:
  • Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
• Create training set, i.e. label (0|1) + features
• Train logistic regression (or other classification algorithm)
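The last two steps can be sketched in plain Python. This is a toy batch gradient descent, not Spark's implementation. The feature values are invented, and the mapping of label 0 to metal and 1 to pop is an assumption (the slide only says 0|1).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(training_set, learning_rate=0.5, iterations=2000):
    """Batch gradient descent; the last weight is the bias term."""
    n_features = len(training_set[0][1])
    weights = [0.0] * (n_features + 1)
    for _ in range(iterations):
        gradient = [0.0] * (n_features + 1)
        for label, features in training_set:
            x = list(features) + [1.0]
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - label
            for i, xi in enumerate(x):
                gradient[i] += error * xi
        weights = [w - learning_rate * g / len(training_set)
                   for w, g in zip(weights, gradient)]
    return weights


def predict(weights, features):
    x = list(features) + [1.0]
    return 1.0 if sigmoid(sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0.0


# Hypothetical training set: label (assumed 0 = metal, 1 = pop) + two made-up features.
training_set = [
    (0.0, [0.9, 0.1]),
    (0.0, [0.8, 0.2]),
    (1.0, [0.2, 0.9]),
    (1.0, [0.1, 0.7]),
]
weights = train_logistic_regression(training_set)
print(predict(weights, [0.85, 0.15]), predict(weights, [0.15, 0.8]))
```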


GloVe
Bag of Words
Word2Vec
TF-IDF
http://spark.apache.org/docs/latest/ml-features.html#feature-extractors

• Produces unique fixed-size dense vectors
• Captures semantic and morphologic similarity
https://code.google.com/archive/p/word2vec/

Similar scores (cos ~ 1)
Opposite scores (cos ~ -1)
Unrelated scores (cos ~ 0)
http://bionlp-www.utu.fi/wv_demo/
http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png
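The three cases above follow directly from the definition of cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up two-dimensional vectors:

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [-2.0, 0.0]))  # opposite direction: -1.0
print(cosine_similarity([1.0, 0.0], [0.0, 5.0]))   # orthogonal: 0.0
```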

Verse                                  Cosine Distance
baby one more time                     0.482028
crazy for you                          0.437875
show me the meaning of being lonely    0.258147
highway to hell                        -0.1120049
kill them all                          -0.231876


Under-fitting (high bias)
Over-fitting (high variance)
Appropriate fitting
http://mlwiki.org/index.php/Overfitting

K = 3
Training set (66.6%)
Test set (33.3%)
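With K = 3, every example lands in the test set exactly once and in the training set twice. A small sketch of that split; the function name and the way examples are assigned to folds are illustrative choices, not Spark's.

```python
def k_fold_splits(examples, k=3):
    """Yield one (training_set, test_set) pair per fold."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield training_set, test_set


examples = list(range(9))
for training_set, test_set in k_fold_splits(examples, k=3):
    print(len(training_set), len(test_set))  # 6 3
```

The model is trained and evaluated once per fold, and the evaluation metric is averaged across folds.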

Weka
Encog
Aerosolve
FlinkML
https://github.com/josephmisiti/awesome-machine-learning

Ease of use
Cloud computing
Speed
Generality
Data processing

https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster

• Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
• Allows invoking various algorithms on distributed datasets (RDD/Dataset)
http://spark.apache.org/docs/latest/mllib-guide.html

spark.mllib: built on top of RDDs
spark.ml: built on top of Datasets

• Utilities: linear algebra, statistics, etc.
• Feature extraction, feature transformation, etc.
• Regression
• Classification
• Clustering
• Collaborative filtering, e.g. alternating least squares
• Dimensionality reduction
• And many more

“All” spark.mllib features plus:
• Pipelines
• Persistence
• Model selection and tuning:
  • Train validation split
  • K-fold cross validation
http://spark.apache.org/docs/latest/ml-guide.html

[Diagram: Raw data -> Transformer -> Estimator [parameters] -> Transformer [parameters] -> Estimator [parameters], with a Dataset passed between stages; a Cross Validator [pipeline, evaluator, parameters] wraps the pipeline]
http://spark.apache.org/docs/latest/ml-pipeline.html
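The roles in the diagram can be illustrated without Spark. In the sketch below a Transformer maps one dataset to another, an Estimator learns from a dataset and returns a model that is itself a Transformer, and fitting a pipeline fits each Estimator in order. All class names are hypothetical stand-ins, not the spark.ml classes.

```python
class Lowercaser:
    """Transformer: maps a dataset to a new dataset, nothing to learn."""

    def transform(self, dataset):
        return [row.lower() for row in dataset]


class VocabularyEstimator:
    """Estimator: fit() learns from a dataset and returns a model (a Transformer)."""

    def fit(self, dataset):
        vocabulary = sorted({word for row in dataset for word in row.split()})
        return VocabularyModel(vocabulary)


class VocabularyModel:
    def __init__(self, vocabulary):
        self.index = {word: i for i, word in enumerate(vocabulary)}

    def transform(self, dataset):
        return [[self.index[w] for w in row.split() if w in self.index]
                for row in dataset]


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, dataset):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # Estimator: train it first
                stage = stage.fit(dataset)
            dataset = stage.transform(dataset)
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, dataset):
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset


model = Pipeline([Lowercaser(), VocabularyEstimator()]).fit(
    ["Highway to Hell", "Crazy for You"])
print(model.transform(["Highway to you"]))  # [[3, 4, 5]]
```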

Lyrics

I'm a rolling thunder, a pouring rain
I'm gonna get you, Satan get you

[Pipeline diagram: Lyrics -> Cleanser; each stage outputs a Dataset]


[Pipeline diagram: Lyrics -> Cleanser -> Numerator; each stage outputs a Dataset]

1  Im a rolling thunder a pouring rain
2  Im comin on like a hurricane
3  My lightnings flashing across the sky
4  Youre only young but youre gonna die
5  I wont take no prisoners wont spare no lives
6  Nobodys putting up a fight
7  I got my bell Im gonna take you to hell
8  Im gonna get you Satan get you

[Pipeline diagram: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover; each stage outputs a Dataset]

1  im a rolling thunder a pouring rain
2  im comin on like a hurricane
3  my lightnings flashing across the sky
4  youre only young but youre gonna die
5  i wont take no prisoners wont spare no lives
6  nobodys putting up a fight
7  i got my bell im gonna take you to hell
8  im gonna get you satan get you

[Pipeline diagram: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter; each stage outputs a Dataset]

1  im rolling thunder pouring rain
2  im comin like hurricane
3  lightnings flashing across sky
4  youre young youre gonna die
5  wont take prisoners wont spare lives
6  nobodys putting fight
7  got bell im gonna take hell
8  im gonna get satan get
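The cleansing, tokenizing and stop-word steps shown so far can be approximated in a few lines. This is a sketch only: the regular expression and the tiny stop-word list are assumptions, chosen so that the first lyric line reproduces the output on the slides.

```python
import re

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"a", "i", "my", "on", "but", "no", "up", "to", "you", "only", "the"}


def cleanse(line):
    """Drop everything except letters and spaces (apostrophes, commas, etc.)."""
    return re.sub(r"[^A-Za-z ]", "", line)


def tokenize(line):
    """Lowercase the line and split it on whitespace."""
    return line.lower().split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


line = "I'm a rolling thunder, a pouring rain"
print(cleanse(line))                               # Im a rolling thunder a pouring rain
print(remove_stop_words(tokenize(cleanse(line))))  # ['im', 'rolling', 'thunder', 'pouring', 'rain']
```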

[Pipeline diagram: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse]; each stage outputs a Dataset]

Sentences in verse: 4
verse1:
1  im roll thunder pour rain
2  im comin like hurrican
3  lightn flash across sky
4  your young your gonna die
verse2:
5  wont take prison wont spare live
6  nobodi put fight

Sentences in verse: 8
verse1:
1  im roll thunder pour rain
2  im comin like hurrican
3  lightn flash across sky
4  your young your gonna die
5  wont take prison wont spare live
6  nobodi put fight
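Grouping numbered sentences into verses amounts to chunking by the "sentences in verse" parameter. A minimal sketch, assuming consecutive sentences are simply joined with spaces; the function name is hypothetical and the real Verser may work differently.

```python
def to_verses(sentences, sentences_in_verse):
    """Join every `sentences_in_verse` consecutive sentences into one verse."""
    return [
        " ".join(sentences[i:i + sentences_in_verse])
        for i in range(0, len(sentences), sentences_in_verse)
    ]


# The stemmed sentences visible on the slide.
sentences = [
    "im roll thunder pour rain",
    "im comin like hurrican",
    "lightn flash across sky",
    "your young your gonna die",
    "wont take prison wont spare live",
    "nobodi put fight",
]

print(len(to_verses(sentences, 4)))  # 2 verses
print(len(to_verses(sentences, 8)))  # 1 verse
```

The parameter trades the number of training examples against how much text each example carries, so it is a natural candidate for tuning.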

[Pipeline diagram: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size]; each stage outputs a Dataset]

Sentences in verse: 4
feature1:
[0.036463763926011056,
-0.013076733228398295,
...
0.03816963326281462]
feature2:
[-0.013962931134021625,
0.049275818325650804,
...
-0.058982484615766086]
Sentences in verse: 8
feature1:
[0.036463763926011056,
-0.013076733228398295,
0.044362547532774695,
0.03816963326281462,
...
-0.013962931134021625,
0.049275818325650804,
-0.058982484615766086]
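One way a whole verse can end up as a single fixed-size vector is by averaging the vectors of its words, which is how the Spark documentation describes the Word2Vec model's handling of a document. The sketch below uses made-up two-dimensional word vectors, and its handling of unknown words (skipping them) is an assumption.

```python
def verse_vector(tokens, word_vectors):
    """Average the vectors of the known words; zero vector if none are known."""
    size = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * size
    return [sum(v[i] for v in known) / len(known) for i in range(size)]


# Hypothetical word vectors; real ones are learned and much longer.
word_vectors = {"thunder": [1.0, 0.0], "rain": [0.0, 1.0], "hell": [1.0, 1.0]}

print(verse_vector(["thunder", "rain"], word_vectors))  # [0.5, 0.5]
```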

[Pipeline diagram: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter]; each stage outputs a Dataset]

Probability: [0.9212126972383768, 0.07878730276162313]
Prediction: 0.0
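The prediction shown is consistent with picking the class that has the higher probability: index 0 holds about 0.92, so the predicted label is 0.0. A one-function sketch of that rule (the function name is illustrative):

```python
def predict(probability):
    """Return the index of the most probable class as a float label."""
    return float(max(range(len(probability)), key=lambda i: probability[i]))


probability = [0.9212126972383768, 0.07878730276162313]
print(predict(probability))  # 0.0
```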

[Pipeline diagram: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Word2Vec [Vector size] -> Logistic Regression [Max iterations, Reg parameter] -> Cross Validator Model; each stage outputs a Dataset]

[0.8454839775240359,
0.9061236588248319,
0.9527128936788524,
0.9522790271664413,
...
0.9526248129757111,
0.9522790271664411]


• Other feature extractors:
  • Term Frequency – Inverse Document Frequency (TF-IDF), token counts (TF), etc.
• Other classification algorithms:
  • Naive Bayes, Random Forest, Support Vector Machines (SVM), etc.
http://spark.apache.org/docs/latest/ml-guide.html

https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes
Word     Class 1   Class 2
Love     0.3       0.6
Life     0.4       0.3
Death    0.3       0.1

“Love Life Death”?

“Love Life”?
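The two questions can be worked through with the Naive Bayes rule: multiply the per-class word probabilities and pick the class with the larger product. The word probabilities below are the ones on the slide (Love/Life/Death = 0.3/0.4/0.3 for one class and 0.6/0.3/0.1 for the other). The class names and the equal priors are assumptions.

```python
import math

# Word probabilities from the slide; class names are placeholders.
WORD_PROBABILITIES = {
    "class 1": {"love": 0.3, "life": 0.4, "death": 0.3},
    "class 2": {"love": 0.6, "life": 0.3, "death": 0.1},
}
PRIORS = {"class 1": 0.5, "class 2": 0.5}  # assumed equal priors


def score(words, label):
    """Prior times the product of the word probabilities for this class."""
    return PRIORS[label] * math.prod(WORD_PROBABILITIES[label][w] for w in words)


def classify(words):
    return max(WORD_PROBABILITIES, key=lambda label: score(words, label))


print(classify(["love", "life", "death"]))  # class 1 (0.3*0.4*0.3 beats 0.6*0.3*0.1)
print(classify(["love", "life"]))           # class 2 (0.6*0.3 beats 0.3*0.4)
```

Dropping a single word flips the answer, which shows how strongly individual word probabilities drive the result.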

https://spark.apache.org/docs/latest/ml-features.html#feature-extractors


[Pipeline diagram: Lyrics -> Cleanser -> Numerator -> Tokenizer -> Stop Words Remover -> Exploder -> Stemmer -> Uniter -> Verser [Sentences in verse] -> Hashing TF [Num Features] -> IDF [Min Doc Freq] -> Naive Bayes -> Cross Validator Model; each stage outputs a Dataset]
• ML is not as complex as it seems from an applied perspective
• Existing libraries and frameworks reduce a lot of tedious work
• For instance, Spark MLlib can help to build nice ML pipelines

• https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms
• Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
• https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
• https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
• https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
• https://www.kaggle.com/c/dogs-vs-cats/
• http://yann.lecun.com/exdb/mnist/
• http://www.bcl.hamilton.ie/~barak/teach/F98/ECE547/hw1/index.html
• http://www.slideshare.net/jeykottalam/pipelines-ampcamp
• https://github.com/master/spark-stemming
• https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-apache-spark.html
• http://www.degeneratestate.org/posts/2016/Apr/20/heavy-metal-and-natural-language-processing-part-1/
• https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html
• http://www.slideshare.net/liweiyang5/spark-mllib-training-material
• https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
• http://www.slideshare.net/databricks/combining-machine-learning-frameworks-with-apache-spark
• https://databricks.com/blog/2015/10/20/audience-modeling-with-apache-spark-ml-pipelines.html
• https://github.com/deeplearning4j/deeplearning4j
• http://deeplearning4j.org/spark
• http://mlwiki.org/index.php/Overfitting
• http://bionlp-www.utu.fi/wv_demo/
• https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/
