NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
Who am I
User of Spark since 2012
Organiser of the London Spark Meetup
Run the Data Science team at Skimlinks
Apache Spark
The RDD
RDD.map
>>> thisrdd = sc.parallelize(range(12), 4)
>>> thisrdd.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> otherrdd = thisrdd.map(lambda x: x % 3)
>>> otherrdd.collect()
[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
RDD.map
>>> otherrdd.zip(thisrdd).collect()
[(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (2, 5), (0, 6), (1, 7), (2, 8), (0, 9), (1, 10), (2, 11)]
>>> otherrdd.zip(thisrdd).reduceByKey(lambda x, y: x + y).collect()
[(0, 18), (1, 22), (2, 26)]
RDD.reduceByKey
How to not crash your Spark job
Set the number of reducers sensibly (see the sketch below)
Configure your PySpark cluster properly
Don't shuffle (unless you have to)
Don't groupBy
Repartition your data if necessary
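A minimal sketch, not from the original deck, of the first and last points: reduceByKey takes an explicit numPartitions argument to control the number of reducers, and repartition() redistributes a badly-partitioned RDD. The numbers here are illustrative only.

# Illustrative only: numPartitions sets how many reduce-side partitions the
# shuffle produces; repartition() rebalances data across the cluster.
pairs = sc.parallelize(range(10**6)).map(lambda x: (x % 100, 1))
counts = pairs.reduceByKey(lambda x, y: x + y, numPartitions=200)
balanced = counts.repartition(50)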
Lots of people will say 'use Scala'
Don't listen to those people.
Naive Bayes - recap
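The recap slide is an image in the original deck; the quantity the following slides compute is the Laplace-smoothed per-class term probability, written here for reference:

P(t \mid c) = \frac{N_{tc} + 1}{\sum_{t'} \left( N_{t'c} + 1 \right)}

where N_{tc} is the number of times term t appears in documents of class c; a document d is then scored with \arg\max_c \, \log P(c) + \sum_{t \in d} \log P(t \mid c).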
Naive Bayes in Spark
from operator import add

# get (class label, word) tuples
label_token = gettokens(docs)
# [(False, u'https'), (True, u'fashionblog'), (True, u'dress'), (False, u'com'), ...]
tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
# [(u'https', [0, 1]), (u'fashionblog', [1, 0]), (u'dress', [1, 0]), (u'com', [0, 1]), ...]
# get the word count for each class
termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
Naive Bayes in Spark
from operator import truediv

termcounts_plus_pseudo = termcounts.map(lambda (term, counts): (term, map(add, counts, (1, 1))))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]
# => [(u'https', [101, 113]), (u'fashionblog', [1, 101]), (u'dress', [6, 16]), ...]
# get the total number of words in each class
values = termcounts_plus_pseudo.map(lambda (term, (truecounts, falsecounts)): (truecounts, falsecounts))
totals = values.reduce(lambda x, y: map(add, x, y))
# [1321, 2345]
P_t = termcounts_plus_pseudo.map(lambda (term, counts): (term, map(truediv, counts, totals)))
reduceByKey(combineByKey)
[diagram, two slides: reduceByKey is built on combineByKey. Within each partition, (k1, v) pairs are first combined locally (combineLocally) into per-partition maps such as {k1: 2, …}, {k1: 3, …} and {k1: 5, …}; these maps are then merged across partitions (_mergeCombiners) into the final {k1: 10, …}. reduceByKey(numPartitions) controls the number of partitions on the reduce side.]
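A minimal sketch, not from the deck, of the same equivalence in code: reduceByKey(func) is combineByKey with func used both to merge a value into a local combiner and to merge combiners across partitions.

# reduceByKey(lambda x, y: x + y, numPartitions=4) is equivalent to:
pairs = sc.parallelize([('k1', 1), ('k1', 2), ('k2', 5)])
summed = pairs.combineByKey(
    lambda v: v,              # createCombiner: first value seen for a key
    lambda c, v: c + v,       # mergeValue: combine locally within a partition
    lambda c1, c2: c1 + c2,   # mergeCombiners: merge across partitions
    numPartitions=4)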
Naive Bayes in Spark
RDD.aggregate(zeroValue, seqOp, combOp)
Aggregates the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value".
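A small standalone example, not from the deck, showing how the two functions interact: computing a mean by threading a (sum, count) pair through seqOp and combOp.

nums = sc.parallelize(range(12), 4)
total, count = nums.aggregate(
    (0, 0),                                    # zeroValue: the neutral accumulator
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp: fold one element into a partition's accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # combOp: merge two partition accumulators
mean = total / float(count)                    # 5.5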
Naive Bayes in Spark: Aggregation
from operator import add

class WordFrequencyAggregator(object):
    def __init__(self):
        self.S = {}

    def add(self, (token, count)):    # seqOp: fold one (token, count) pair in
        if token not in self.S:
            self.S[token] = (0, 0)
        self.S[token] = map(add, self.S[token], count)
        return self

    def merge(self, other):           # combOp: merge another aggregator's counts
        for term, count in other.S.iteritems():
            if term not in self.S:
                self.S[term] = (0, 0)
            self.S[term] = map(add, self.S[term], count)
        return self
Naive Bayes in Spark
With reduceByKey
termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), ...]
# => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), ...]
With aggregate
RDD.aggregate(zeroValue, seqOp, combOp)
aggregates = tokencounter.aggregate(WordFrequencyAggregator(), lambda x, y: x.add(y), lambda x, y: x.merge(y))
Naive Bayes in Spark: Aggregation
[diagram: with plain aggregate, every partition's result is sent to the driver and merged there]
Naive Bayes in Spark: treeAggregation
[diagram: treeAggregate merges partition results in multiple levels before they reach the driver]
Naive Bayes in Spark: treeAggregate
RDD.treeAggregate(zeroValue, seqOp, combOp, depth=2)
Aggregates the elements of this RDD in a multi-level tree pattern.
With reduceByKey
termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
# => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]), ...]
With treeAggregate
aggregates = tokencounter.treeAggregate(WordFrequencyAggregator(), lambda x, y: x.add(y), lambda x, y: x.merge(y), depth=4)
treeAggregate performance
On 1B short documents:
RDD.reduceByKey: 18 min
RDD.treeAggregate: 10 min
https://gist.github.com/martingoodson/aad5d06e81f23930127b
Word2Vec
Training Word2Vec in Spark
from pyspark.mllib.feature import Word2Vec
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
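A quick usage note, assuming a Spark 1.x-era MLlib model: the fitted model can be queried for nearest neighbours in the embedding space.

# returns the num closest words and their cosine similarities
synonyms = model.findSynonyms(u'dress', 5)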
How to use word2vec vectors for classification problems
Averaging (see the sketch below)
Clustering
Convolutional Neural Network
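A minimal sketch, not from the deck, of the averaging option: represent a document by the mean of its word vectors, giving a fixed-length feature vector for any downstream classifier. doc_vector is a hypothetical helper; out-of-vocabulary tokens make model.transform raise, so they are skipped.

import numpy as np

def doc_vector(tokens, model):
    # average the vectors of the in-vocabulary tokens of one document
    vecs = []
    for t in tokens:
        try:
            vecs.append(np.array(model.transform(t)))
        except Exception:   # out-of-vocabulary token
            continue
    return np.mean(vecs, axis=0) if vecs else None

features = doc_vector([u'fashionblog', u'dress'], model)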
K-Means in Spark
from numpy import array
from pyspark.mllib.clustering import KMeans, KMeansModel

word = sc.textFile('GoogleNews-vectors-negative300.txt')
vectors = word.map(lambda line: array(
    [float(x) for x in line.split('\t')[1:]])   # assuming a tab-separated vectors dump
)
clusters = KMeans.train(vectors, 50000, maxIterations=10,
    runs=10, initializationMode="random")
clusters_b = sc.broadcast(clusters)
labels = vectors.map(lambda x: clusters_b.value.predict(x))
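A sketch, not from the deck, of how these cluster labels might feed the clustering option above: map each word to its cluster id, so a document becomes a bag of up to 50,000 'cluster words' that can run through the same naive Bayes pipeline. The word-to-cluster dictionary is a hypothetical construction, and inp is the tokenised corpus from the Word2Vec slide.

# zip is valid here because words and labels derive from the same RDD lineage
words = word.map(lambda line: line.split('\t')[0])
word_to_cluster = sc.broadcast(dict(words.zip(labels).collect()))
docs_as_clusters = inp.map(lambda tokens:
    [word_to_cluster.value[t] for t in tokens if t in word_to_cluster.value])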
Semi-Supervised Naive Bayes
● Build an initial naive Bayes classifier, ŵ, from the labeled documents, X, only
● Loop while classifier parameters improve:
○ (E-step) Use the current classifier, ŵ, to estimate component membership of each unlabeled document, i.e.,
the probability that each class generated each document,
○ (M-step) Re-estimate the classifier, ŵ, given the estimated class membership of each document.
Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien,
A., and Scholkopf, B. (Eds.) Semi-Supervised Learning. MIT Press: Boston. 2006.
Naive Bayes in Spark: EM
Instead of labels:
tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]), ...]
use probabilities:
# [(u'https', [0.1, 0.3]), (u'fashionblog', [0.01, 0.11]), (u'dress', [0.02, 0.02]), (u'com', [0.13, 0.05]), ...]
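A schematic of the loop, not from the deck: train_naive_bayes and classify_probs are hypothetical helpers standing in for the earlier slides' training code and a per-document scoring function.

model = train_naive_bayes(labelled_docs)        # initial classifier from labelled data only
for i in range(10):                             # the experiments below use 10 EM iterations
    # E-step: soft-label each unlabelled document with class probabilities
    soft_labelled = unlabelled_docs.map(lambda doc: (doc, classify_probs(model, doc)))
    # M-step: re-estimate the classifier from labelled plus soft-labelled data
    model = train_naive_bayes(labelled_docs.union(soft_labelled))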
Naive Bayes in Spark: EM
500K labelled examples:
Precision: 0.27
Recall: 0.15
F1: 0.099
Add 10M unlabelled examples, 10 EM iterations:
Precision: 0.26
Recall: 0.31
F1: 0.14
Naive Bayes in Spark: EM
240M training examples:
Precision: 0.31
Recall: 0.19
F1: 0.12
Add 250M unlabelled examples, 10 EM iterations:
Precision: 0.26
Recall: 0.22
F1: 0.12
PySpark Memory: Worked Example
PySpark Configuration: Worked Example
10 x r3.4xlarge (122GB RAM, 16 cores each)
Use half of each machine for the executor: 60GB
Number of cores used: 120 (12 per machine)
OS: ~12GB per machine
Each Python process: ~4GB, so 12 x 4GB = 48GB per machine
Cache: 60% x 60GB x 10 executors = 360GB across the cluster
Each Java thread: 40% x 60GB / 12 = ~2GB
more here: http://files.meetup.com/13722842/Spark%20Meetup.pdf
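A sketch of the same numbers expressed as a SparkConf, assuming Spark 1.x-era property names; this translation is not in the original deck.

from pyspark import SparkConf, SparkContext

# one 60GB executor per machine, 12 cores each (120 total), 60% of the heap
# for the cache, and ~4GB per Python worker before it starts spilling
conf = (SparkConf()
        .set("spark.executor.memory", "60g")
        .set("spark.executor.cores", "12")
        .set("spark.cores.max", "120")
        .set("spark.storage.memoryFraction", "0.6")
        .set("spark.python.worker.memory", "4g"))
sc = SparkContext(conf=conf)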
We are hiring!
martin@skimlinks.com
@martingoodson