9. Set the number of reducers sensibly
Configure your PySpark cluster properly
Don’t shuffle (unless you have to)
Don’t groupBy
Repartition your data if necessary
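As a minimal sketch of these points (the file name and partition count are made-up examples, not from the talk; sc is an existing SparkContext):
# Sketch only: 'tokens.txt' (one token per line) and the partition count are assumptions.
from operator import add

pairs = sc.textFile('tokens.txt').map(lambda w: (w, 1))

# groupByKey ships every value across the network before anything is combined:
#   counts = pairs.groupByKey().mapValues(sum)
# reduceByKey combines within each partition first, so far less data is shuffled:
counts = pairs.reduceByKey(add)

# If the input is skewed or has too few partitions, repartition before the expensive step:
counts_repartitioned = pairs.repartition(200).reduceByKey(add)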
How to not crash your Spark job
17. RDD.aggregate(zeroValue, seqOp, combOp)
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral “zero value”.
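A toy illustration (the data is made up, not from the talk) of how the seqOp and combOp fit together, computing a (sum, count) pair in one pass:
# Toy example of RDD.aggregate: compute (sum, count) in a single pass.
nums = sc.parallelize([1, 2, 3, 4, 5])
seqOp = lambda acc, x: (acc[0] + x, acc[1] + 1)        # fold one value into a partition's accumulator
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])       # merge accumulators from two partitions
total, count = nums.aggregate((0, 0), seqOp, combOp)   # -> (15, 5)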
Naive Bayes in Spark
18. from operator import add    # operator.add is used below to sum counts elementwise

class WordFrequencyAgreggator(object):
    def __init__(self):
        # token -> (count in class 0, count in class 1)
        self.S = {}

    def add(self, (token, count)):
        # seqOp: fold one (token, per-class counts) pair into this accumulator
        if token not in self.S:
            self.S[token] = (0, 0)
        self.S[token] = map(add, self.S[token], count)
        return self

    def merge(self, other):
        # combOp: merge another accumulator's counts into this one
        for term, count in other.S.iteritems():
            if term not in self.S:
                self.S[term] = (0, 0)
            self.S[term] = map(add, self.S[term], count)
        return self
Naive Bayes in Spark: Aggregation
22. RDD.treeAggregate(zeroValue, seqOp, combOp, depth=2)
Aggregates the elements of this RDD in a multi-level tree pattern.
With reduceByKey
termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
# ===>
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
With treeAggregate
aggregates = tokencounter.treeAggregate(WordFrequencyAgreggator(),
                                        lambda x, y: x.add(y),
                                        lambda x, y: x.merge(y), depth=4)
Naive Bayes in Spark: treeAggregate
23. On 1B short documents:
RDD.reduceByKey: 18 min
RDD.treeAggregate: 10 min
https://gist.github.com/martingoodson/aad5d06e81f23930127b
treeAggregate performance
27. K-Means in Spark
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

# each line: a word followed by its 300-dimensional vector, tab-separated
word = sc.textFile('GoogleNews-vectors-negative300.txt')
vectors = word.map(lambda line: array(
    [float(x) for x in line.split('\t')[1:]])
)
clusters = KMeans.train(vectors, 50000, maxIterations=10,
                        runs=10, initializationMode="random")
clusters_b = sc.broadcast(clusters)
labels = vectors.map(lambda x: clusters_b.value.predict(x))
28. Semi-Supervised Naive Bayes
● Build an initial naive Bayes classifier, ŵ, from the labeled documents, X, only
● Loop while classifier parameters improve (a sketch of the loop follows the reference below):
○ (E-step) Use the current classifier, ŵ, to estimate component membership of each unlabeled document, i.e. the probability that each class generated each document,
○ (M-step) Re-estimate the classifier, ŵ, given the estimated class membership of each document.
Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien,
A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.
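A minimal sketch of that loop in PySpark terms; train_nb and predict_proba are hypothetical placeholders (not from the talk) standing in for the treeAggregate-based counting shown earlier:
# Hedged sketch only: train_nb(rdd of (doc, [p0, p1])) -> model and
# predict_proba(model, doc) -> [p0, p1] are assumed helpers, not real APIs.
# labelled is an RDD of (doc, [1.0, 0.0] / [0.0, 1.0]); unlabelled is an RDD of docs.
model = train_nb(labelled)                        # initial classifier from labelled documents only
for _ in range(10):                               # fixed number of EM iterations (the talk uses 10)
    model_b = sc.broadcast(model)
    # E-step: probability that each class generated each unlabelled document
    soft = unlabelled.map(lambda doc: (doc, predict_proba(model_b.value, doc)))
    # M-step: re-estimate the classifier from hard-labelled and soft-labelled documents together
    model = train_nb(labelled.union(soft))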
29. Instead of labels:
tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
use probabilities:
# [(u'https', [0.1, 0.3]), (u'fashionblog', [0.01, 0.11]), (u'dress', [0.02, 0.02]), (u'com', [0.13, 0.05]), ...]
Naive Bayes in Spark: EM
30. 500K labelled examples
Precision: 0.27
Recall: 0.15
F1: 0.099
Add 10M unlabelled examples. 10 EM iterations.
Precision: 0.26
Recall: 0.31
F1: 0.14
Naive Bayes in Spark: EM
31. 240M training examples
Precision: 0.31
Recall: 0.19
F1: 0.12
Add 250M unlabelled examples. 10 EM iterations.
Precision: 0.26
Recall: 0.22
F1: 0.12
Naive Bayes in Spark: EM
33. PySpark Configuration: Worked Example
10 x r3.4xlarge (122GB RAM, 16 cores each)
Use about half of each machine for the executor: 60GB
Number of cores: 12 per executor x 10 = 120
OS: ~12GB
Python workers: 12 x ~4GB = ~48GB per machine
Cache: 60% x 60GB x 10 = 360GB
Each Java thread: 40% x 60GB / 12 = ~2GB (see the configuration sketch below)
more here: http://files.meetup.com/13722842/Spark%20Meetup.pdf
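One way to express this arithmetic as configuration, shown as a hedged sketch (the property names are Spark 1.x-era assumptions matching the numbers above, not taken from the slides):
from pyspark import SparkConf, SparkContext

# Sketch only: property names are assumptions chosen to match the worked example above.
conf = (SparkConf()
        .set("spark.executor.instances", "10")       # one executor per r3.4xlarge node
        .set("spark.executor.memory", "60g")         # roughly half of each node's 122GB
        .set("spark.executor.cores", "12")           # 12 of 16 cores -> 120 cores in total
        .set("spark.storage.memoryFraction", "0.6")  # 60% of 60GB for cache -> ~360GB across 10 nodes
        .set("spark.python.worker.memory", "4g"))    # aim for ~4GB per Python worker, ~48GB per node
sc = SparkContext(conf=conf)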