9. Set the number of reducers sensibly
Configure your PySpark cluster properly
Don’t shuffle (unless you have to)
Don’t groupBy
Repartition your data if necessary
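As a minimal sketch of these points (the file name and partition count are made-up examples, not from the talk; sc is an existing SparkContext):
# Sketch only: 'tokens.txt' (one token per line) and the partition count are assumptions.
from operator import add

pairs = sc.textFile('tokens.txt').map(lambda w: (w, 1))

# groupByKey ships every value across the network before anything is combined:
#   counts = pairs.groupByKey().mapValues(sum)
# reduceByKey combines within each partition first, so far less data is shuffled:
counts = pairs.reduceByKey(add)

# If the input is skewed or has too few partitions, repartition before the expensive step:
counts_repartitioned = pairs.repartition(200).reduceByKey(add)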
How to not crash your Spark job
17. RDD.aggregate(zeroValue, seqOp, combOp)
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral “zero value”.
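A toy illustration (the data is made up, not from the talk) of how the seqOp and combOp fit together, computing a (sum, count) pair in one pass:
# Toy example of RDD.aggregate: compute (sum, count) in a single pass.
nums = sc.parallelize([1, 2, 3, 4, 5])
seqOp = lambda acc, x: (acc[0] + x, acc[1] + 1)        # fold one value into a partition's accumulator
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])       # merge accumulators from two partitions
total, count = nums.aggregate((0, 0), seqOp, combOp)   # -> (15, 5)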
Naive Bayes in Spark
18. from operator import add    # operator.add is used below to sum counts elementwise

class WordFrequencyAgreggator(object):
    def __init__(self):
        # token -> (count in class 0, count in class 1)
        self.S = {}

    def add(self, (token, count)):
        # seqOp: fold one (token, per-class counts) pair into this accumulator
        if token not in self.S:
            self.S[token] = (0, 0)
        self.S[token] = map(add, self.S[token], count)
        return self

    def merge(self, other):
        # combOp: merge another accumulator's counts into this one
        for term, count in other.S.iteritems():
            if term not in self.S:
                self.S[term] = (0, 0)
            self.S[term] = map(add, self.S[term], count)
        return self
Naive Bayes in Spark: Aggregation
22. RDD.treeAggregate(zeroValue, seqOp, combOp, depth=2)
Aggregates the elements of this RDD in a multi-level tree pattern.
With reduceByKey
termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
# ===>
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
With treeAggregate
aggregates = tokencounter.treeAggregate(WordFrequencyAgreggator(),
                                        lambda x, y: x.add(y),
                                        lambda x, y: x.merge(y), depth=4)
Naive Bayes in Spark: treeAggregate
23. On 1B short documents:
RDD.reduceByKey: 18 min
RDD.treeAggregate: 10 min
https://gist.github.com/martingoodson/aad5d06e81f23930127b
treeAggregate performance
27. K-Means in Spark
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

# each line: a word followed by its 300-dimensional vector, tab-separated
word = sc.textFile('GoogleNews-vectors-negative300.txt')
vectors = word.map(lambda line: array(
    [float(x) for x in line.split('\t')[1:]])
)
clusters = KMeans.train(vectors, 50000, maxIterations=10,
                        runs=10, initializationMode="random")
clusters_b = sc.broadcast(clusters)
labels = vectors.map(lambda x: clusters_b.value.predict(x))
28. Semi-Supervised Naive Bayes
● Build an initial naive Bayes classifier, ŵ, from the labeled documents, X, only
● Loop while classifier parameters improve (a sketch of the loop follows the reference below):
○ (E-step) Use the current classifier, ŵ, to estimate component membership of each unlabeled document, i.e. the probability that each class generated each document,
○ (M-step) Re-estimate the classifier, ŵ, given the estimated class membership of each document.
Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien,
A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.
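A minimal sketch of that loop in PySpark terms; train_nb and predict_proba are hypothetical placeholders (not from the talk) standing in for the treeAggregate-based counting shown earlier:
# Hedged sketch only: train_nb(rdd of (doc, [p0, p1])) -> model and
# predict_proba(model, doc) -> [p0, p1] are assumed helpers, not real APIs.
# labelled is an RDD of (doc, [1.0, 0.0] / [0.0, 1.0]); unlabelled is an RDD of docs.
model = train_nb(labelled)                        # initial classifier from labelled documents only
for _ in range(10):                               # fixed number of EM iterations (the talk uses 10)
    model_b = sc.broadcast(model)
    # E-step: probability that each class generated each unlabelled document
    soft = unlabelled.map(lambda doc: (doc, predict_proba(model_b.value, doc)))
    # M-step: re-estimate the classifier from hard-labelled and soft-labelled documents together
    model = train_nb(labelled.union(soft))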
29. Instead of labels:
tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
use probabilities:
# [(u'https', [0.1, 0.3]), (u'fashionblog', [0.01, 0.11]), (u'dress', [0.02, 0.02]), (u'com', [0.13, 0.05]), ...]
Naive Bayes in Spark: EM
30. 500K labelled examples
Precision: 0.27
Recall: 0.15
F1: 0.099
Add 10M unlabelled examples. 10 EM iterations.
Precision: 0.26
Recall: 0.31
F1: 0.14
Naive Bayes in Spark: EM
31. 240M training examples
Precision: 0.31
Recall: 0.19
F1: 0.12
Add 250M unlabelled examples. 10 EM iterations.
Precision: 0.26
Recall: 0.22
F1: 0.12
Naive Bayes in Spark: EM
33. PySpark Configuration: Worked Example
10 x r3.4xlarge (122GB RAM, 16 cores each)
Use about half of each machine for the executor: 60GB
Number of cores: 12 per executor x 10 = 120
OS: ~12GB
Python workers: 12 x ~4GB = ~48GB per machine
Cache: 60% x 60GB x 10 = 360GB
Each Java thread: 40% x 60GB / 12 = ~2GB (see the configuration sketch below)
more here: http://files.meetup.com/13722842/Spark%20Meetup.pdf
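One way to express this arithmetic as configuration, shown as a hedged sketch (the property names are Spark 1.x-era assumptions matching the numbers above, not taken from the slides):
from pyspark import SparkConf, SparkContext

# Sketch only: property names are assumptions chosen to match the worked example above.
conf = (SparkConf()
        .set("spark.executor.instances", "10")       # one executor per r3.4xlarge node
        .set("spark.executor.memory", "60g")         # roughly half of each node's 122GB
        .set("spark.executor.cores", "12")           # 12 of 16 cores -> 120 cores in total
        .set("spark.storage.memoryFraction", "0.6")  # 60% of 60GB for cache -> ~360GB across 10 nodes
        .set("spark.python.worker.memory", "4g"))    # aim for ~4GB per Python worker, ~48GB per node
sc = SparkContext(conf=conf)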