Putting it All Together
Databricks, Streaming, DataFrames and
ML
Databricks Training Team
September 2015
Use Case :: Twitter Streaming App
2
[Architecture diagram: Twitter tweets flow from the internet into a Scala Spark app ("You Are Here") running on Databricks and Apache Spark, deployed on an AWS EC2 cluster with an S3 bucket.]
Running in Databricks
• We will walk through a Scala application running
in Databricks, and have hands-on labs.
• The app will be built in three parts:
1. Streaming only: write tweets to the filesystem.
2. Streaming plus DataFrames: write tweets to a Parquet file.
3. Add Machine Learning (ML): KMeans clustering.
• Scala code examples all compile and run in
Databricks attached to a Spark 1.4 cluster (unless
otherwise noted).
• The application will leverage:
• Twitter4j API1
• Spark Streaming API
• Spark DataFrames API
• Spark ML KMeans API2
3
Writing Code in Databricks
• When writing code in Databricks (or any Scala
REPL) it helps to wrap the code in an object,
for scope-control reasons:
object Runner extends Serializable {
/* code */
}
• Databricks becomes the “Driver” of the Spark app.
• Values and variables defined in the notebook can be picked up and sent to
the Spark Workers for use by functions.
– Sometimes inadvertently.
– Sometimes this causes bugs that are difficult to track down in the REPL or
UI (see the sketch below).
4
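As a rough illustration of the capture behavior described above, here is a minimal sketch (our own, not from the lab; Runner, keywords and startedAt are made-up names) showing a value that is legitimately shipped to the workers and a driver-only value marked @transient:

// A rough sketch, assuming a notebook attached to a running cluster.
import org.apache.spark.SparkContext

object Runner extends Serializable {
  // Referenced inside the RDD lambda below, so it must be available on the
  // workers; small, serializable values like this are fine.
  val keywords = Array("#spark", "#databricks")

  // Driver-only state: @transient keeps it out of any serialization that
  // does pick up this object.
  @transient lazy val startedAt = new java.util.Date()

  def run(sc: SparkContext): Long = {
    val lines = sc.parallelize(Seq("love #spark", "hello world"))
    lines.filter(line => keywords.exists(w => line.contains(w))).count()
  }
}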
Writing Code in Databricks
• Use the Factory methods to create (or get) a
StreamingContext:
• Best practice and better for recovery (a usage sketch follows the signatures below):
5
/* Experimental: return the "active" StreamingContext
(that is, started but not stopped) */
StreamingContext.getActiveOrCreate(…)
/* Either recreate a StreamingContext from checkpoint data
or create a new StreamingContext */
StreamingContext.getOrCreate(…)
/*Get the currently active context, if there is one.
Active means started but not stopped. */
StreamingContext.getActive()
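For the checkpoint-based variant, a minimal usage sketch (our own; the checkpoint path and the creator name are hypothetical, not from the lab) looks roughly like this:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "dbfs:/tmp/streaming-checkpoint" // hypothetical path

// The creator function only runs when no usable checkpoint exists.
def creator(): StreamingContext = {
  val ssc = new StreamingContext(SparkContext.getOrCreate(), Seconds(60))
  ssc.checkpoint(checkpointDir)
  // ... define DStreams here before returning ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, creator _)
ssc.start()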
Creating a Stream in Scala (1 of 3)
6
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.auth._
import twitter4j.conf._
import twitter4j.Status // A tweet
//How long between batches, each batch becomes an RDD
val BatchInterval = 60 // seconds
//Write to cluster later when we have a batch of tweets
val ParquetFile = "dbfs:/mnt/strata-nyc-dev-bootcamp/tweets.parquet"
Creating a Stream in Scala (2 of 3)
7
//Within the object Runner …
def start() = {
val ssc = //Spark StreamingContext
StreamingContext.getActiveOrCreate(
createStreamingContext
)
ssc.start()
}
def stop() = { //Just stop streaming
StreamingContext.getActive.map { ssc => //lambda
ssc.stop(stopSparkContext=false)
}
}
Creating a Stream in Scala (3 of 3)
8
//Our implementation of scala.Function0<StreamingContext>
private def createStreamingContext(): StreamingContext =
{
val ssc = new StreamingContext(sc,
Seconds(BatchInterval))
/*TwitterUtils needs StreamingContext and auth details.
* @transient: do not serialize these vals; nothing else needs them. */
@transient val tconf =
TwitterCredentialsFactory.getCredentials()
@transient val auth = new OAuthAuthorization(tconf)
//Here tweets is a DStream[Status]
val tweets = TwitterUtils.createStream(ssc, Some(auth))
/* much more code */
ssc // we promised to return the StreamingContext
}
Filter the Stream
• The stream delivers far more tweets than we need.
• Filter them down to only the tweets about Spark and related topics:
9
//What we want to see tweets about:
val contains = Array("#spark", "#apachespark", "#strata",
"#databricks", "#hadoop", "#bigdata", " spark ", " strata ",
"databricks", "big data", "hadoop")
//Status is a Decorator around the tweet text itself:
val sparkTweets = tweets.filter { status => //lambda
contains.exists { word =>
status.getText.toLowerCase.contains(word)
}
}
Each Batch Becomes an RDD
10
/* Use foreachRDD to process the Tweet Stream... there will
be one RDD created by SparkStreaming per batch... */
sparkTweets.foreachRDD { rdd =>
// What shall we do with each tweet/status?
if (! rdd.isEmpty) { //only if tweets present...
//Let's write out a file and see what we found
rdd.map( status =>
status.getText()).coalesce(1)//One part-00000 file
.saveAsTextFile(
"dbfs:/mnt/strata-nyc-dev-bootcamp/twitter_" +
new Date().getTime()) //unique & easy to read
}
}
Hands On :: Twitter Config
Our first Twitter lab exercise is:
Twitter_Streaming_Lab01_Scala
You will need application
credentials for this lab.
If you have a Twitter account:
https://apps.twitter.com
Or use our custom
TwitterCredentialsFactory
in the Scala lab code.
Your decision.
11
Hands On
Switch back to your
hands on notebook, and
look at the section
entitled
Twitter_Streaming_Lab01_Scala.
Do the first lab exercise
there, which will write to
the attached cluster’s
filesystem.
12
Lab 1: See Results (1 of 3)
display(
dbutils.fs.ls("dbfs:/mnt/strata-nyc-dev-bootcamp")
)
display(dbutils.fs.ls(
"dbfs:/mnt/strata-nyc-dev-bootcamp/
twitter_1442861640052")
)
13
Lab 1: Display Results (2 of 3)
sc.textFile("dbfs:/mnt/strata-nyc-dev-
bootcamp/twitter_1442863860234/part-00000").collect()
//OR TRY THIS:
println (
dbutils.fs.head(
"dbfs:/mnt/training/twitter/
twitter_1442863860234/part-00000", 140)
)
14
Lab 1: Display Results (3 of 3)
15
Adding DataFrames (1 of 2)
16
/* Inside our foreachRDD call... our Function needs this,
* but the SQLContext from Driver is not Serializable */
val sqlContext =
SQLContext.getOrCreate(SparkContext.getOrCreate())
val sparkTweetsRDD = rdd.map( status => //lambda
Tweet(status.getText()) //need the tweet text only
)
/* The toDF method can drag in references, via Scala implicits, that are
not serializable, so we use createDataFrame instead. */
val tweetsDF = sqlContext.createDataFrame(sparkTweetsRDD)
//Need this for DataFrame creation
case class Tweet(text: String)
Adding DataFrames (2 of 2)
17
/* We use a DataFrame here to make it easy to write out to
a parquet file, for ML K-Means clustering later...
This also disconnects our Streaming work from our ML work.
*/
tweetsDF.coalesce(1) // Make sure there is one partition
.write.mode(SaveMode.Append) //Grow the file
.parquet(ParquetFile)
//Need this for Parquet file creation
val ParquetFile = "dbfs:/mnt/strata-nyc-dev-
bootcamp/tweets.parquet"
Hands On
Switch back to your hands on notebook, and
look at the section entitled
Twitter_Streaming_Lab01_Scala.
Do the second lab exercise there, which will
write a parquet file to the filesystem and read it
back.
18
dbfs:/mnt/strata-nyc-dev-bootcamp/tweets.txt
Lab 2: See Results (1 of 2)
display(dbutils.fs.ls(ParquetFile))
val dfTweets = sqlContext.read.parquet(ParquetFile)
display(dfTweets.take(10)) // takes a few seconds
19
Lab 2: See Results (2 of 2)
20
Adding Machine Learning KMeans
• KMeans1 is a clustering ML algorithm (see the objective below).
• We will apply it to our tweet data next:
1. Get ML "training" data (legacy): old tweets.2
2. Featurize the data: turn the text into features.
3. Train the KMeans model on the featurized text.
4. Build predictions of which cluster (category) each tweet falls into:
we chose four categories.
5. Display this using a visualization: a pie chart.
6. Show some examples.
21
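For reference, the standard KMeans objective (the general formulation, not anything Spark-specific) partitions the feature vectors into k clusters S_1..S_k by minimizing the within-cluster sum of squared distances to each cluster mean:

\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2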
Machine Learning (ML) preparation
22
import org.apache.spark.ml._
import org.apache.spark.ml.feature.{HashingTF, IDF, Normalizer, Tokenizer} //Normalizer is used on a later slide
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
import sqlContext.implicits._ //needed for toDF below
//Case class needed for DataFrame
case class Tweet(text: String)
//Raw legacy tweets as a text file for training data
val rawTweets = sc.textFile("/mnt/strata-nyc-dev-bootcamp/tweets.txt")
.filter(text => !text.equals("")) // remove blank lines
//This is the DataFrame[Tweet] representation
val training = rawTweets.map( text => Tweet(text)
).toDF("text")
Sample the Training Data
• Example tweets from training data
(tweets.txt):
23
ML Tokenizer
24
/* Configure an ML pipeline, which consists of four stages:
Tokenizer, HashingTF, IDF, and then KMeans.
A Tokenizer converts the input string (in our case the tweet)
to lowercase and then splits it on whitespace. */
val tokenizer = new Tokenizer()
.setInputCol("text") // DataFrame column
.setOutputCol("words") // Tokenized text
ML Tokenizer
25
TF-IDF
• Term frequency-inverse document frequency (TF-IDF)
is a feature vectorization method widely used in text
mining to reflect the importance of a term to a
document in the corpus.
• We will use this technique to help us with KMeans clustering.
• In MLlib, we separate TF and IDF to make them flexible.1
• Term frequency (TF) is the number of times a term appears in a
document, while document frequency (DF) is the number of documents
that contain the term.
• If we only use term frequency to measure the importance, it is
very easy to over-emphasize terms that appear very often but
carry little information about the document, e.g., “a”, “the”, and
“of”.
• If a term appears very often across the corpus, it carries little
special information about any particular document (see the formula below).
26
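As a reference, the weighting can be written as follows (this matches the smoothed IDF described in the MLlib feature-extraction docs, where |D| is the number of documents in the corpus):

\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D),
\qquad
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1}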
Spark ML :: TF-IDF
27
/* HashingTF Maps a sequence of terms to their term
frequencies using the hashing trick.[1] */
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol) // Link the pipeline
.setOutputCol("features") /* Sparse Vector */
HashingTF “features” Column
28
val tokenOutput = tokenizer.transform(training) //assumed: the Tokenizer output from the previous slide
val hashingOutput = hashingTF.transform(tokenOutput)
display(hashingOutput)
Spark ML :: IDF
29
/* Compute the Inverse Document Frequency (IDF) given a
collection of documents.*/
val inverseDocumentFreq = new IDF()
.setInputCol(hashingTF.getOutputCol) // Link the pipeline
.setOutputCol("frequency”)
IDF “frequency” Column
30
ML Normalizer
31
/* L2 (Euclidean) normalization of the IDF output is needed: each vector
is scaled by its L2 norm (see the formula below). */
val normalizer = new Normalizer()
.setInputCol(inverseDocumentFreq.getOutputCol) //Link
.setOutputCol("norm")
IDF “norm” Column
32
ML KMeans v 1.4 (1 of 3)
33
/* K-means clustering with support for multiple parallel
runs… This is an iterative algorithm that will make
multiple passes over the data, so any RDDs given to it
should be cached by the user. imports not shown */
val km = new KMeans()
.setK(4) //how many clusters (centroids) to group
.setSeed(43L) // Random number generator reproducibility
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF,
inverseDocumentFreq, normalizer)) //The pipeline so far
val pipelineModel = pipeline.fit(training)
val pipelineResult = pipelineModel.transform(training)
KMeans Prediction (2 of 3)
34
val kmInput = pipelineResult.select("norm")
.rdd.map(_(0).asInstanceOf[Vector]) //sparse Vectors
.cache() //iterative algorithm
val kmModel = km.run(kmInput)
// training data predictions only
val kmPredictions = kmModel.predict(kmInput)
val sparkTweetsDF = // new fresh tweets
sqlContext.read.parquet("dbfs:/mnt/strata-nyc-dev-bootcamp/
tweets.parquet")
val newTweetsTransformed =
pipelineModel.transform(sparkTweetsDF)
KMeans Prediction (3 of 3)
35
val kmNewTweetsInput =
newTweetsTransformed.select("norm").rdd
.map(_(0).asInstanceOf[Vector])
val newTweetText = newTweetsTransformed.select("text")
.rdd.map(_(0).asInstanceOf[String])
val newTweetsPrediction = kmModel.predict(kmNewTweetsInput)
/* We combine the new tweet text with the new tweet
prediction (group in KMeans) using a zip association */
val textAndPrediction =
newTweetText.zip(newTweetsPrediction)
println(textAndPrediction.collect().mkString("\n"))
Text and Prediction
36
Display Predictions :: Pie Charts
• display(kmPredictions.toDF("prediction"))
37
Hands On
Switch back to your hands on notebook, and
look at the section entitled
Twitter_Streaming_Lab01_Scala.
Do the third lab exercise there, which will add
KMeans clustering analysis to the DataFrame.
38
// Questions?
val instructor = Databricks.getInstructor()
.setInputCol(student.getOutputCol) //Link
.setOutputCol("answer")
Editor's Notes
  1. Notes: Twitter4J is an unofficial Java library for the Twitter API; see http://twitter4j.org/en/. The KMeans API is in two locations: the RDD-based API in Spark 1.4, and the newer Spark ML Pipelines version in Spark 1.5. We show both ways.
  2. Notes: We created the TwitterCredentialsFactory for you, to reduce the load on the Twitter connection usage. The createStream(…) method creates an input stream that returns tweets received from Twitter.
  3. Notes: In HDFS, an attempt to write to a directory that already exists throws a Hadoop exception. We have to make sure each batch is written to a unique directory.
  4. Notes: You'd want to be more robust about parsing CSV in real life. The serializer can't easily infer the types of the columns if you use a Python class; a namedtuple is preferred.
  5. Notes: See https://en.wikipedia.org/wiki/K-means_clustering: "k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells." There must be a mount point for this to work; use dbutils.fs.mount(…) to get the mounted directory, such as /mnt/strata-nyc-dev-bootcamp. This directory should contain tweets.txt. To apply most machine learning algorithms, we must first preprocess and featurize the data; that is, for each data point, we must generate a vector of numbers describing the salient properties of that data point. See: http://www.cs.berkeley.edu/~rxin/ampcamp-ecnu/machine-learning-with-spark.html
  6. Notes: Remove blank lines from the tweets.txt legacy file; there should be about 200 tweets after blank lines are removed.
  7. Notes: See http://spark.apache.org/docs/latest/mllib-feature-extraction.html
  8. Notes: See https://en.wikipedia.org/wiki/Feature_hashing: "In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array."
  9. Notes: The Normalizer output (the normalized weights) helps with distances between clusters. The input is an RDD of sparse Vectors; we cache it to speed things up, because KMeans is an iterative algorithm, before running the model.