Brian Hess, Rob Murphy, Rocco Varela
Data Science with DataStax Enterprise
Who Are We?
Brian Hess
• Senior Product Manager, Analytics
• 15+ years in data and analytics
• Gov’t, NoSQL, Data Warehousing, Big Data
• Math and CS background
Rob Murphy
• Solution Architect, Vanguard Team
• Background in computational science and science-focused informatics
• Thinks data, stats and modeling are fun
Rocco Varela
• Software Engineer in Test
• DSE Analytics Team
• PhD in Bioinformatics
• Background in predictive modeling, scientific computing
© DataStax, All Rights Reserved. 2
1 Data Science in an Operational Context
2 Exploratory Data Analysis
3 Model Building and Evaluation
4 Deploying Analytics in Production
5 Wrap Up
Willie Sutton
Bank Robber in the 1930s-1950s
FBI Most Wanted List 1950
Captured in 1952
When asked
“Why do you rob banks?”
“Because that’s where
the money is.”
Why is DSE Good for Data Science?
• Analytics on operational data is very valuable
  • Data has a half-life
  • Insights do, as well
• Cassandra is great for operational data
  • Multi-DC, continuous availability, scale-out, etc.
  • Workload isolation allows analytics access to live data
  • No more stale “snapshots”
• Cassandra lets you “operationalize” your analysis
  • Make insights available to users, applications, etc.
  • E.g., recommendations
Exploratory Data Analysis in DSE
What is EDA?
Wikipedia is pretty solid here:
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics, often with visual methods (https://en.wikipedia.org/wiki/Exploratory_data_analysis)
Why EDA?
John Tukey – Exploratory Data Analysis (1977) emphasized methods for exploring and
understanding data as a precursor to Confirmatory Data Analysis (CDA).
You can’t escape statistics, even if you just want to dive head first into machine learning!
Exploratory Data Analysis in DSE
General Statistics
# packages for summary statistics
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics
from pyspark.sql import Row, SQLContext
from pyspark import SparkContext, SparkConf

# sqlContext is assumed to be provided by the PySpark shell
data = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="input_table", keyspace="summit_ds").load()

# colStats expects an RDD of vectors, so convert each (numeric) row
rdd = data.rdd.map(lambda line: Vectors.dense(line[0:]))
summary = Statistics.colStats(rdd)
print(summary.mean())
print(summary.variance())
print(summary.numNonzeros())

# OR, staying with the DataFrame API:
data.describe().toPandas().transpose()
[Diagram nodes: Start, sqlContext, DataFrame, RDD, Spark ML]
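To make the three summaries printed above concrete, here is a minimal plain-Python sketch of what they compute per column. It does not use Spark; the helper name `col_stats` and the sample rows are hypothetical, and the sample (n-1) variance is an assumption about what `colStats` reports.

```python
import statistics

def col_stats(rows):
    """Per-column mean, sample variance, and nonzero count for numeric rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    return {
        "mean": [statistics.mean(c) for c in columns],
        "variance": [statistics.variance(c) for c in columns],  # n-1 denominator
        "numNonzeros": [sum(1 for v in c if v != 0) for c in columns],
    }

# Hypothetical data: three rows, three numeric columns
rows = [
    (1.0, 0.0, 10.0),
    (2.0, 0.0, 20.0),
    (3.0, 5.0, 30.0),
]
stats = col_stats(rows)
print(stats["mean"])
print(stats["variance"])
print(stats["numNonzeros"])
```

On a real cluster the same numbers would come from the distributed computation, so a local check like this is only useful on a small sample.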
Exploratory Data Analysis in DSE
Correlation
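# packages for correlation
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

data = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="input_table", keyspace="summit_ds").load()
rdd = data.rdd.map(lambda line: Vectors.dense(line[0:]))

# Statistics.corr takes an RDD of vectors (not a DataFrame) and returns a correlation matrix
print(Statistics.corr(rdd, method="pearson"))
# Or
print(Statistics.corr(rdd, method="spearman"))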
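The difference between the two methods can be seen on a tiny example. The following is a plain-Python sketch, not Spark code: `pearson`, `ranks`, and `spearman` are hypothetical helper names. Pearson measures linear association, while Spearman is Pearson applied to ranks, so it reaches 1 for any strictly increasing relationship.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v * v for v in x]  # monotonic but not linear
print(pearson(x, y))
print(spearman(x, y))
```

Here the Spearman value is 1 because the relationship is monotonic, while the Pearson value falls short of 1 because it is not linear.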
Exploratory Data Analysis in DSE
Visualization
Building Models
There are a few dragons:
• Spark ML – DataFrame-based and “The Way” of the future
• Spark MLlib, more complete but largely RDD-based.
• Lots of good features are experimental and subject to change (this is Spark, right?)
Building Models
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel

#- Pull data from DSE/Cassandra
data = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="class_table", keyspace="summit_ds").load()

#- Create an RDD of labeled points (column 1 is the label, columns 2 onward are the features)
dataForPredict = data.rdd.map(lambda line: LabeledPoint(line[1], line[2:]))

#- Basic split of train/test
train, test = dataForPredict.randomSplit([0.8, 0.2])

#- Map of feature index -> number of categories for categorical features
catFeatures = {2: 2, 3: 2}

#- Train the classifier with appropriate config
classifier = RandomForest.trainClassifier(train, numClasses=2, categoricalFeaturesInfo=catFeatures,
                                          numTrees=5, featureSubsetStrategy="auto",
                                          impurity="gini", maxDepth=5,
                                          maxBins=100, seed=42)

#- Predict on the held-out features and pair each true label with its prediction
predictions = classifier.predict(test.map(lambda x: x.features))
labelsAndPredictions = test.map(lambda lp: lp.label).zip(predictions)
Evaluating Models
• Spark ML has continuously expanded its model evaluation packages.
• Classification
  • Spark still does not provide useful, ubiquitous coverage.
  • You can create your own confusion matrix.
  • Precision is NOT the magic bullet.
  • You MUST understand how much of the accuracy is attributed to the model and how much is not.
• Regression
  • Spark still does not provide useful, ubiquitous coverage.
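One way to read the points about building your own confusion matrix and attributing accuracy to the model is sketched below in plain Python. The function names and the toy labels are hypothetical, and using the majority-class rate as the "not the model" baseline is an assumption of this sketch rather than something the slides specify.

```python
from collections import Counter

def confusion_counts(labels, predictions, positive=1.0):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(labels, predictions):
    """Fraction of predictions that match the label."""
    hits = sum(1 for a, p in zip(labels, predictions) if a == p)
    return hits / len(labels)

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Hypothetical imbalanced data: 8 negatives, 2 positives
labels = [0.0] * 8 + [1.0] * 2
# A "model" that always predicts the negative class
predictions = [0.0] * 10

print(confusion_counts(labels, predictions))
print(accuracy(labels, predictions))
print(majority_baseline_accuracy(labels))
```

In this toy case the accuracy equals the baseline, so none of it is attributable to the model, even though the raw accuracy looks respectable. With Spark, the (label, prediction) pairs could be collected from a small sample and fed to the same functions.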
Evaluating Models
• Use simple data-driven ‘fit’ measures
• Apply these standard measures across high-level ML classes
• Easy to implement, wholly based on expected vs. predicted label
[Figures: Confusion Matrix; Matthews Correlation Coefficient]
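The Matthews Correlation Coefficient can be computed directly from the four confusion-matrix counts. The sketch below is plain Python with a hypothetical function name; returning 0.0 when the denominator is zero is a common convention that this sketch assumes.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts; ranges from -1 to 1."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return numerator / denominator

print(matthews_corrcoef(tp=6, tn=3, fp=1, fn=2))   # imperfect classifier
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))   # perfect classifier
print(matthews_corrcoef(tp=0, tn=8, fp=0, fn=2))   # always predicts negative
```

Unlike accuracy, the always-negative classifier scores 0 here, which is why a measure based on all four cells is useful on imbalanced data.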
Evaluating Models
from pyspark.ml.classification import RandomForestClassifier
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# data pulled from Cassandra and split into td (training) and testingData

rf = RandomForestClassifier(numTrees=2, maxDepth=2, labelCol="indexed", seed=4)
model = rf.fit(td)
test = model.transform(testingData)

# BinaryClassificationMetrics expects an RDD of (score, label) pairs
predictionAndLabels = test.rdd.map(lambda lp: (float(lp.prediction), lp.label))

# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)
We can easily analyze data with existing workflows
Say, for example, we have multiple streams incoming from a Kafka source.
Suppose we want to cluster data into known categories.
Using Spark's StreamingKMeans, we can easily update a model in real time from one stream, while making predictions on a separate stream.
Let’s see how we can do this.
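We can easily update a clustering model in real time
// define the streaming context
val ssc = new StreamingContext(conf, Seconds(batchDuration))

// define training and testing DStreams by Kafka topic (trainTopic and testTopic are Set[String])
val trainingData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, trainTopic)
val testData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, testTopic)

// setRandomCenters takes (dimension, initial weight, seed)
val model = new StreamingKMeans()
  .setK(numClusters)
  .setDecayFactor(1.0)
  .setRandomCenters(nDimensions, 0.0, seed)

// Kafka delivers (key, value) string pairs, so parse the values before modeling.
// Assumes training values look like "[x1,x2,...]" and test values like "(label,[x1,x2,...])".
model.trainOn(trainingData.map(_._2).map(Vectors.parse))
model.predictOnValues(testData.map(_._2).map(LabeledPoint.parse).map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()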
[Diagram: Streaming Model Setup. Nodes: Start, StreamingContext, Training Stream, Testing Stream, StreamingKMeans Model]
The decay factor controls how quickly old data is forgotten.
Decay = 1 will use all observed data from the beginning for cluster updates.
Decay = 0 will use only the most recent data.
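The effect of the decay factor can be illustrated with a single-cluster update in plain Python. This is a sketch of the update rule as I understand it from the Spark MLlib streaming k-means documentation, where the old center is weighted by its point count times the decay factor; the function name and the numbers are hypothetical.

```python
def update_center(center, weight, batch_mean, batch_count, decay):
    """One 'forgetful' k-means update for a single cluster.

    center:      current cluster center (list of floats)
    weight:      number of points assigned to the cluster so far
    batch_mean:  mean of the points assigned to the cluster in the new batch
    batch_count: number of points assigned to the cluster in the new batch
    decay:       1.0 keeps all history, 0.0 keeps only the new batch
    """
    discounted = weight * decay
    total = discounted + batch_count
    if total == 0:
        return list(center)  # nothing to update from
    return [(c * discounted + x * batch_count) / total
            for c, x in zip(center, batch_mean)]

old_center, old_weight = [0.0, 0.0], 10
batch_mean, batch_count = [1.0, 1.0], 10

print(update_center(old_center, old_weight, batch_mean, batch_count, decay=1.0))
print(update_center(old_center, old_weight, batch_mean, batch_count, decay=0.5))
print(update_center(old_center, old_weight, batch_mean, batch_count, decay=0.0))
```

With decay 1.0 the new center is the average over all twenty points, and with decay 0.0 it jumps straight to the mean of the latest batch. Values in between move the center further toward recent data the smaller the decay is.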
Real-time training: DStream[Vector]. For each RDD, perform a k-means update on a batch of data.
Predictions: DStream[(K, Vector)], then mapOnValues (find the closest cluster center for the given data point), giving DStream[(K, PredictionVector)].
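The prediction step, which keeps each key and replaces the vector with the index of its closest center, can be sketched in plain Python. The names `closest_center` and `predict_on_values` are hypothetical stand-ins, and squared Euclidean distance is assumed as the distance measure.

```python
def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance to point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def predict_on_values(keyed_points, centers):
    """Keep each key and map its vector to the index of the closest center."""
    return [(key, closest_center(vector, centers)) for key, vector in keyed_points]

# Hypothetical model state and one small batch of keyed points
centers = [[0.0, 0.0], [10.0, 10.0]]
batch = [("a", [1.0, 1.0]), ("b", [9.0, 8.0])]

predictions = predict_on_values(batch, centers)
print(predictions)
```

Keeping the key alongside the prediction is what allows the result to be joined back to the original record, for example a label or a row identifier.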
The same setup can be used for a real-time logistic regression model
// define the streaming context
val ssc = new StreamingContext(conf, Seconds(batchDuration))

// define training and testing DStreams by Kafka topic
val trainingData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, trainTopic)
val testData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, testTopic)

val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

// Both topics are assumed to carry "(label,[x1,x2,...])" strings as message values
val trainingPoints = trainingData.map(_._2).map(LabeledPoint.parse)
val testPoints = testData.map(_._2).map(LabeledPoint.parse)

model.trainOn(trainingPoints)
model.predictOnValues(testPoints.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
Layering this with fault tolerance in DataStax Enterprise is straightforward
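Modeling with Fault-tolerance
def main(args: Array[String]): Unit = {
  // Builds a new context; used only when there is no active context or checkpoint to recover
  def createStreamingContext(): StreamingContext = {
    val ssc = new StreamingContext(conf, Seconds(batchDuration)) // Create StreamingContext
    // Define Streams
    // Define Model
    ssc.checkpoint(checkpointPath) // Define checkpoint path
    // Make predictions
    // Process data
    ssc
  }

  // Reuse the active context or recover from the checkpoint; otherwise create a new one
  val ssc = StreamingContext.getActiveOrCreate(checkpointPath, createStreamingContext _)
  ssc.start()
  ssc.awaitTermination()
}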
Things you should take away
• Cassandra is “where the data are”
• Data Science Data Center - access to live data at low operational impact
• Good (and *growing*) set of Data Science tools in Spark
  • Part of Spark, so leverage the rest of Spark for gaps
• Easy to operationalize your Data Science
  • deploy models in streaming context
  • deploy models in batch context
  • save results to Cassandra for low-latency/high-concurrency retrieval in operational apps
Thank You

Mais conteúdo relacionado

Mais procurados

DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...DataStax
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...DataStax
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...DataStax
 
Cassandra Tuning - above and beyond
Cassandra Tuning - above and beyondCassandra Tuning - above and beyond
Cassandra Tuning - above and beyondMatija Gobec
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraJim Hatcher
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016DataStax
 
DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...
DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...
DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...DataStax
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...DataStax
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax
 
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...DataStax
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
 
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...DataStax
 
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...DataStax
 
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016DataStax
 
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...DataStax
 
Load testing Cassandra applications
Load testing Cassandra applicationsLoad testing Cassandra applications
Load testing Cassandra applicationsBen Slater
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016DataStax
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...DataStax
 

Mais procurados (20)

DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
 
Cassandra Tuning - above and beyond
Cassandra Tuning - above and beyondCassandra Tuning - above and beyond
Cassandra Tuning - above and beyond
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
 
DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...
DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...
DataStax | Deploy DataStax Enterprise Clusters with OpsCenter (LCM) (Manikand...
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
Clustering van IT-componenten
Clustering van IT-componentenClustering van IT-componenten
Clustering van IT-componenten
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
 
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
 
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
 
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
 
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Load testing Cassandra applications
Load testing Cassandra applicationsLoad testing Cassandra applications
Load testing Cassandra applications
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
 

Destaque

Multi Data Center Strategies
Multi Data Center StrategiesMulti Data Center Strategies
Multi Data Center StrategiesSteven Francia
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...DataStax Academy
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsAcunu
 
Cassandra and DataStax Enterprise on PCF
Cassandra and DataStax Enterprise on PCFCassandra and DataStax Enterprise on PCF
Cassandra and DataStax Enterprise on PCFVMware Tanzu
 
Distributing Data The Aerospike Way
Distributing Data The Aerospike WayDistributing Data The Aerospike Way
Distributing Data The Aerospike WayAerospike, Inc.
 
Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...
Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...
Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...VMware Tanzu
 
Anatomy of a methodology
Anatomy of a methodologyAnatomy of a methodology
Anatomy of a methodologySaumya Ganguly
 
Press Briefing Slides on the budget and economic outlook 2015 to 2025
Press Briefing Slides on the budget and economic outlook 2015 to 2025Press Briefing Slides on the budget and economic outlook 2015 to 2025
Press Briefing Slides on the budget and economic outlook 2015 to 2025Congressional Budget Office
 
Habitual behaviour(unit 1)
Habitual behaviour(unit 1)Habitual behaviour(unit 1)
Habitual behaviour(unit 1)Hamed Hashemian
 
ХӨС семинар 10
ХӨС семинар 10ХӨС семинар 10
ХӨС семинар 10Usukhuu Galaa
 
совладельцы. сособственники
совладельцы. сособственникисовладельцы. сособственники
совладельцы. сособственникиSychev.n
 
10 Celebrities You Didn't Know Studied Accounting
10 Celebrities You Didn't Know Studied Accounting10 Celebrities You Didn't Know Studied Accounting
10 Celebrities You Didn't Know Studied AccountingAccounting4Free
 
How to start a blog step-by-step guide
How to start a blog   step-by-step guideHow to start a blog   step-by-step guide
How to start a blog step-by-step guideKaran Labra
 
Why FPGA
Why FPGAWhy FPGA
Why FPGAProFAX
 
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...NextMove Software
 

Destaque (20)

Multi Data Center Strategies
Multi Data Center StrategiesMulti Data Center Strategies
Multi Data Center Strategies
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Cassandra and DataStax Enterprise on PCF
Cassandra and DataStax Enterprise on PCFCassandra and DataStax Enterprise on PCF
Cassandra and DataStax Enterprise on PCF
 
Distributing Data The Aerospike Way
Distributing Data The Aerospike WayDistributing Data The Aerospike Way
Distributing Data The Aerospike Way
 
Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...
Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...
Extending Cloud Foundry UAA for Authorizations and Multi-Data Center Deployme...
 
Section 3
Section 3Section 3
Section 3
 
Anatomy of a methodology
Anatomy of a methodologyAnatomy of a methodology
Anatomy of a methodology
 
Press Briefing Slides on the budget and economic outlook 2015 to 2025
Press Briefing Slides on the budget and economic outlook 2015 to 2025Press Briefing Slides on the budget and economic outlook 2015 to 2025
Press Briefing Slides on the budget and economic outlook 2015 to 2025
 
Informal letter
Informal letterInformal letter
Informal letter
 
Habitual behaviour(unit 1)
Habitual behaviour(unit 1)Habitual behaviour(unit 1)
Habitual behaviour(unit 1)
 
ХӨС семинар 10
ХӨС семинар 10ХӨС семинар 10
ХӨС семинар 10
 
совладельцы. сособственники
совладельцы. сособственникисовладельцы. сособственники
совладельцы. сособственники
 
Catalog
CatalogCatalog
Catalog
 
Letter of application
Letter of applicationLetter of application
Letter of application
 
10 Celebrities You Didn't Know Studied Accounting
10 Celebrities You Didn't Know Studied Accounting10 Celebrities You Didn't Know Studied Accounting
10 Celebrities You Didn't Know Studied Accounting
 
How to start a blog step-by-step guide
How to start a blog   step-by-step guideHow to start a blog   step-by-step guide
How to start a blog step-by-step guide
 
Why FPGA
Why FPGAWhy FPGA
Why FPGA
 
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
Extraction, Analysis, Atom Mapping, Classification and Naming of Reactions fr...
 
Phonology
PhonologyPhonology
Phonology
 

Semelhante a DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Summit 2016

Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Data Con LA
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Webinar - Bringing connected graph data to Cassandra with DSE Graph
Webinar - Bringing connected graph data to Cassandra with DSE GraphWebinar - Bringing connected graph data to Cassandra with DSE Graph
Webinar - Bringing connected graph data to Cassandra with DSE GraphDataStax
 
Real time streaming analytics
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Summit 2016

  • 1. Brian Hess, Rob Murphy, Rocco Varela Data Science with DataStax Enterprise
  • 2. Who Are We?
      Brian Hess
        • Senior Product Manager, Analytics
        • 15+ years in data and analytics
        • Gov’t, NoSQL, Data Warehousing, Big Data
        • Math and CS background
      Rob Murphy
        • Solution Architect, Vanguard Team
        • Background in computational science and science-focused informatics
        • Thinks data, stats and modeling are fun
      Rocco Varela
        • Software Engineer in Test
        • DSE Analytics Team
        • PhD in Bioinformatics
        • Background in predictive modeling, scientific computing
      © DataStax, All Rights Reserved.
  • 3.
      1. Data Science in an Operational Context
      2. Exploratory Data Analysis
      3. Model Building and Evaluation
      4. Deploying Analytics in Production
      5. Wrap Up
  • 4. © 2014 DataStax, All Rights Reserved. Company Confidential
  • 5. Willie Sutton: Bank Robber in the 1930s-1950s; FBI Most Wanted List 1950; Captured in 1952
  • 6. Willie Sutton, when asked “Why do you rob banks?”
  • 7. Willie Sutton, when asked “Why do you rob banks?”: “Because that’s where the money is.”
  • 8. Why is DSE Good for Data Science?
  • 9. Why is DSE Good for Data Science?
  • 10. Why is DSE Good for Data Science?
        • Analytics on Operational Data is very valuable
            • Data has a half-life
            • Insights do, as well
        • Cassandra is great for operational data
            • Multi-DC, Continuous Availability, Scale-Out, etc.
            • Workload isolation allows access
            • No more stale “snapshots”
        • Cassandra lets you “operationalize” your analysis
            • Make insights available to users, applications, etc.
            • E.g., recommendations
  • 11. Exploratory Data Analysis in DSE
      What is EDA? Wikipedia is pretty solid here: “Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods” (https://en.wikipedia.org/wiki/Exploratory_data_analysis).
      Why EDA? John Tukey, in Exploratory Data Analysis (1977), emphasized methods for exploring and understanding data as a precursor to Confirmatory Data Analysis (CDA).
      You can’t escape statistics even if you just want to dive head first into machine learning!
  • 12. Exploratory Data Analysis in DSE: General Statistics

        # packages for summary statistics
        import numpy as np
        from pyspark.mllib.linalg import Vectors
        from pyspark.mllib.stat import Statistics
        from pyspark.sql import Row, SQLContext
        from pyspark import SparkContext, SparkConf

        # sqlContext is assumed to exist already (for example, in a PySpark shell)
        data = sqlContext.read.format("org.apache.spark.sql.cassandra") \
            .options(table="input_table", keyspace="summit_ds").load()

        # convert each row to a dense vector (assumes every column is numeric)
        rdd = data.rdd.map(lambda line: Vectors.dense(line[0:]))

        # column-wise summary statistics over the RDD of vectors
        summary = Statistics.colStats(rdd)
        print(summary.mean())
        print(summary.variance())
        print(summary.numNonzeros())

        # OR, staying with the DataFrame:
        data.describe().toPandas().transpose()

      [Diagram labels: Start, sqlContext, DataFrame, RDD, Spark ML]
  • 13. Exploratory Data Analysis in DSE: Correlation

        # packages for summary statistics (same imports as on the previous slide)
        data = sqlContext.read.format("org.apache.spark.sql.cassandra") \
            .options(table="input_table", keyspace="summit_ds").load()

        # Statistics.corr works on an RDD of vectors, not on the DataFrame itself
        rdd = data.rdd.map(lambda line: Vectors.dense(line[0:]))

        # Pearson correlation matrix across columns
        print(Statistics.corr(rdd, method="pearson"))
        # or the rank-based Spearman correlation
        print(Statistics.corr(rdd, method="spearman"))

      [Diagram labels: Start, sqlContext, DataFrame, RDD, Spark ML]
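To make the choice between the two `method` values concrete, here is a minimal plain-Python sketch (no Spark involved) of what each coefficient measures. The function names and the sample data are made up for illustration; this is not how Spark computes the result internally. Spearman is simply Pearson applied to ranks, so it rewards any monotonic relationship, while Pearson rewards only linear ones.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y grows monotonically, but not linearly, with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
print(pearson(x, y))   # below 1: the relationship is not linear
print(spearman(x, y))  # 1.0: the relationship is perfectly monotonic
```

If the two coefficients disagree noticeably on a pair of columns, that is usually a hint to plot the pair before modeling it.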
  • 14. Exploratory Data Analysis in DSE: Visualization
  • 15. Building Models
      There are a few dragons:
        • Spark ML: DataFrames and “The Way” of the future
        • Spark MLlib: more complete, but largely RDD-based
        • Lots of good features are experimental and subject to change (this is Spark, right?)
  • 16. Building Models

        from pyspark.mllib.regression import LabeledPoint
        from pyspark.mllib.tree import RandomForest, RandomForestModel

        #- Pull data from DSE/Cassandra
        data = sqlContext.read.format("org.apache.spark.sql.cassandra") \
            .options(table="class_table", keyspace="summit_ds").load()

        #- Create an RDD of labeled points (label in column 1, features from column 2 on)
        dataForPredict = data.rdd.map(lambda line: LabeledPoint(line[1], line[2:]))

        #- Basic split of train/test
        train, test = dataForPredict.randomSplit([0.8, 0.2])

        #- Features 2 and 3 are categorical, with 2 categories each
        catFeatures = {2: 2, 3: 2}

        #- Create instance of classifier with appropriate config
        classifier = RandomForest.trainClassifier(train,
                                                  numClasses=2,
                                                  categoricalFeaturesInfo=catFeatures,
                                                  numTrees=5,
                                                  featureSubsetStrategy="auto",
                                                  impurity="gini",
                                                  maxDepth=5,
                                                  maxBins=100,
                                                  seed=42)

        #- Predict on the test features and pair each true label with its prediction
        predictions = classifier.predict(test.map(lambda x: x.features))
        labelsAndPredictions = test.map(lambda lp: lp.label).zip(predictions)

      [Diagram labels: Start, sqlContext, DataFrame, RDD, Spark ML]
  • 17. Evaluating Models
    • Spark ML has continuously expanded its model evaluation packages.
    • Classification
      • Spark still does not provide comprehensive coverage of evaluation metrics.
      • You can create your own confusion matrix.
      • Precision is NOT a magic bullet.
      • You MUST understand how much of the accuracy is attributable to the model and how much is not.
    • Regression
      • Spark still does not provide comprehensive coverage of evaluation metrics.
  • 18. Evaluating Models
    • Use simple, data-driven 'fit' measures
    • Apply these standard measures across high-level ML classes
    • Easy to implement, based wholly on expected vs. predicted labels
    [Figures: Confusion Matrix; Matthews Correlation Coefficient]
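As a rough illustration of how little is needed for these two measures, the sketch below computes a binary confusion matrix and the Matthews correlation coefficient from (label, prediction) pairs in plain Python. The function names and the sample pairs are hypothetical; in a Spark job the pairs would come from something like a collected `labelsAndPredictions` RDD. Returning 0.0 when the denominator is zero is a common convention, assumed here.

```python
import math
from collections import Counter

def confusion_counts(labels_and_predictions):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    counts = Counter((int(label), int(pred)) for label, pred in labels_and_predictions)
    tp = counts[(1, 1)]  # label 1, predicted 1
    fp = counts[(0, 1)]  # label 0, predicted 1
    fn = counts[(1, 0)]  # label 1, predicted 0
    tn = counts[(0, 0)]  # label 0, predicted 0
    return tp, fp, fn, tn

def matthews_corrcoef(tp, fp, fn, tn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Made-up (label, prediction) pairs for illustration
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1)]
tp, fp, fn, tn = confusion_counts(pairs)
print(tp, fp, fn, tn, matthews_corrcoef(tp, fp, fn, tn))

# A classifier that always predicts the majority class on imbalanced data
always_zero = [(0, 0)] * 9 + [(1, 0)]
print(matthews_corrcoef(*confusion_counts(always_zero)))
```

The second example is the point of the warning about accuracy: always predicting 0 is 90% accurate on that data, yet its MCC is 0.0, because none of the accuracy comes from the model.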
  • 19. Evaluating Models
      from pyspark.ml.classification import RandomForestClassifier
      from pyspark.mllib.evaluation import BinaryClassificationMetrics

      # <data pulled from Cassandra and split>; td is presumably the training split
      rf = RandomForestClassifier(numTrees=2, maxDepth=2, labelCol="indexed", seed=4)
      model = rf.fit(td)
      test = model.transform(testingData)

      # BinaryClassificationMetrics takes an RDD of (score, label) pairs
      predictionAndLabels = test.rdd.map(lambda lp: (float(lp.prediction), lp.label))

      # Instantiate metrics object
      metrics = BinaryClassificationMetrics(predictionAndLabels)

      # Area under precision-recall curve
      print("Area under PR = %s" % metrics.areaUnderPR)

      # Area under ROC curve
      print("Area under ROC = %s" % metrics.areaUnderROC)
      [Diagram labels: Start sqlContext, DataFrame, RDD, Spark ML]
  • 20. We can easily analyze data with existing workflows
    Say, for example, we have multiple streams coming in from a Kafka source, and suppose we want to cluster the data into known categories. Using Spark's StreamingKMeans, we can update a model in real time from one stream while making predictions on a separate stream. Let's see how we can do this.
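  • 21. We can easily update a clustering model in real time
      import kafka.serializer.StringDecoder
      import org.apache.spark.mllib.clustering.StreamingKMeans
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils

      // define the streaming context
      val ssc = new StreamingContext(conf, Seconds(batchDuration))

      // define training and testing DStreams by Kafka topic (trainTopic and testTopic must be Set[String]).
      // Assumption: message values are text in the Vectors.parse / LabeledPoint.parse formats.
      val trainingData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, trainTopic).map(_._2).map(Vectors.parse)
      val testData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, testTopic).map(_._2).map(LabeledPoint.parse)

      val model = new StreamingKMeans()
        .setK(numClusters)
        .setDecayFactor(1.0)
        .setRandomCenters(nDimensions, 0.0, seed) // (dimension, initial weight, seed)

      model.trainOn(trainingData)
      model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

      ssc.start()
      ssc.awaitTermination()
      [Diagram: Streaming Model Setup. Labels: Start StreamingContext, Training Stream, Testing Stream, StreamingKmeans Model]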
  • 22. We can easily update a clustering model in real time (same code as slide 21)
    Decay factor is used to ignore old data. Decay = 1 will use all observed data from the beginning for cluster updates. Decay = 0 will use only the most recent data.
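The effect of the decay factor can be sketched for a single cluster center in plain Python. This follows the update rule given in the Spark MLlib documentation for streaming k-means, c' = (c·n·a + x·m) / (n·a + m), where c is the current center, n its accumulated weight, a the decay factor, x the mean of the new batch points assigned to the cluster and m their count. The function name `update_center` and the way the weight is carried forward are assumptions made for illustration, not Spark's actual code.

```python
def update_center(center, weight, batch_points, decay):
    """One decayed k-means update for a single cluster.

    center: current center as a list of floats
    weight: accumulated (possibly discounted) number of points seen so far
    batch_points: points from the new batch assigned to this cluster
    decay: 1.0 keeps all history, 0.0 forgets everything before this batch
    """
    m = len(batch_points)
    discounted = weight * decay          # old data counts for less when decay < 1
    if m == 0:
        return center, discounted        # nothing new: center stays, history still fades
    dim = len(center)
    batch_mean = [sum(p[i] for p in batch_points) / m for i in range(dim)]
    total = discounted + m
    new_center = [(c * discounted + x * m) / total
                  for c, x in zip(center, batch_mean)]
    return new_center, total

# Old center at 0.0 backed by 10 points; new batch of 10 points at 10.0
batch = [[10.0]] * 10
print(update_center([0.0], 10.0, batch, decay=1.0))  # old and new data weighted equally
print(update_center([0.0], 10.0, batch, decay=0.0))  # only the new batch matters
```

With decay = 1 the center moves to the midpoint of old and new data; with decay = 0 it jumps straight to the batch mean. Values in between trade stability against responsiveness to drift.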
  • 23. We can easily update a clustering model in real time (same code as slide 21)
    [Diagram] Real-time training: DStream[Vector] → for each RDD, perform a k-means update on a batch of data.
    [Diagram] Predictions: DStream[(K, Vector)] → mapOnValues: find closest cluster center for given data point → DStream[(K, PredictionVector)]
  • 24. The same setup can be used for a real-time logistic regression model
      import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD

      // define the streaming context
      val ssc = new StreamingContext(conf, Seconds(batchDuration))

      // define training and testing DStreams by Kafka topic.
      // Assumption: message values are text in the LabeledPoint.parse format;
      // the regression model trains on labeled points, so both streams are parsed that way.
      val trainingData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, trainTopic).map(_._2).map(LabeledPoint.parse)
      val testData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, testTopic).map(_._2).map(LabeledPoint.parse)

      val model = new StreamingLogisticRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures))

      model.trainOn(trainingData)
      model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

      ssc.start()
      ssc.awaitTermination()
      [Diagram labels: Start StreamingContext, Training Stream, Testing Stream, StreamingModel]
  • 25. Layering this with fault tolerance in DataStax Enterprise is straightforward (model code as on slide 24)
      // Modeling with fault tolerance
      def main(args: Array[String]) {

        def createStreamingContext(): StreamingContext = {
          // create StreamingContext
          val ssc = new StreamingContext(conf, Seconds(batchDuration))
          // define streams, define model, make predictions, process data
          // define checkpoint path
          ssc.checkpoint(checkpointPath)
          ssc
        }

        // recover from the checkpoint if one exists, otherwise build a fresh context
        val ssc = StreamingContext.getActiveOrCreate(checkpointPath, createStreamingContext _)
        ssc.start()
        ssc.awaitTermination()
      }
  • 26. Things you should take away
    • Cassandra is "where the data are"
      • Data Science Data Center: access to live data at low operational impact
    • Good (and *growing*) set of Data Science tools in Spark
      • Part of Spark, so leverage the rest of Spark for gaps
    • Easy to operationalize your Data Science
      • deploy models in streaming context
      • deploy models in batch context
      • save results to Cassandra for low-latency/high-concurrency retrieval in operational apps