SlideShare uma empresa Scribd logo
1 de 58
Streaming Big Data with Apache 
Spark, Kafka and Cassandra 
Helena Edelson 
@helenaedelson 
©2014 DataStax Confidential. Do not distribute without consent. 
1
1 Delivering Meaning In Near-Real Time At High Velocity 
2 Overview Of Spark Streaming, Cassandra, Kafka and Akka 
3 The Spark Cassandra Connector 
4 Integration In Big Data Applications 
© 2014 DataStax, All Rights Reserved. 2
Who Is This Person? 
• Using Scala in production since 2010! 
• Spark Cassandra Connector Committer! 
• Akka (Cluster) Contributor ! 
• Scala Driver for Cassandra Committer ! 
• @helenaedelson! 
• https://github.com/helena ! 
• Senior Software Engineer, Analytics Team, DataStax! 
• Spent the last few years as a Senior Cloud Engineer 
Analytic 
Analytic
Analytic 
Analytic 
Search 
Spark WordCount
Delivering Meaning In Near-Real 
Time At High Velocity At Massive 
Scale 
©2014 DataStax Confidential. Do not distribute without consent. 
5
Strategies 
• Partition For Scale! 
• Replicate For Resiliency! 
• Share Nothing! 
• Asynchronous Message Passing! 
• Parallelism! 
• Isolation! 
• Location Transparency
What We Need 
• Fault Tolerant ! 
• Failure Detection! 
• Fast - low latency, distributed, data locality! 
• Masterless, Decentralized Cluster Membership! 
• Span Racks and DataCenters! 
• Hashes The Node Ring ! 
• Partition-Aware! 
• Elasticity ! 
• Asynchronous - message-passing system! 
• Parallelism! 
• Network Topology Aware!
Why Akka Rocks 
• location transparency! 
• fault tolerant! 
• async message passing! 
• non-deterministic! 
• share nothing! 
• actor atomicity (w/in actor)! 
!
Apache Kafka! 
From Joe Stein
Lambda Architecture 
Spark SQL 
structured 
Spark Core 
Cassandra 
Spark 
Streaming 
real-time 
MLlib 
machine learning 
GraphX 
graph 
Apache ! 
Kafka
Apache Spark and Spark Streaming 
©2014 DataStax Confidential. Do not distribute without consent. 
12 
The Drive-By Version
Analytic 
Analytic 
Search 
What Is Apache Spark 
• Fast, general cluster compute system! 
• Originally developed in 2009 in UC 
Berkeley’s AMPLab! 
• Fully open sourced in 2010 – now at 
Apache Software Foundation! 
• Distributed, Scalable, Fault Tolerant
Apache Spark - Easy to Use & Fast 
• 10x faster on disk,100x faster in memory than Hadoop MR! 
• Works out of the box on EMR! 
• Fault Tolerant Distributed Datasets! 
• Batch, iterative and streaming analysis! 
• In Memory Storage and Disk ! 
• Integrates with Most File and Storage Options 
Up to 100× faster 
(2-10× on disk) 
Analytic 
2-5× less code 
Analytic 
Search
Spark Components 
Spark SQL 
structured 
Spark Core 
Spark 
Streaming 
real-time 
MLlib 
machine learning 
GraphX 
graph
• Functional 
• On the JVM 
• Capture functions and ship them across the network 
• Static typing - easier to control performance 
• Leverage Scala REPL for the Spark REPL 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536p6538. 
html 
Analytic 
Analytic 
Search 
Why Scala?
Spark Versus Spark Streaming 
zillions of bytes gigabytes per second
Input & Output Sources
Analytic 
Analytic 
Search 
Spark Streaming 
Kinesis,'S3'
Common Use Cases 
applications sensors web mobile phones 
intrusion detection 
malfunction 
detection 
site analytics 
network metrics 
analysis 
fraud detection 
dynamic process 
optimisation 
recommendations location based ads 
log processing 
supply chain 
planning 
sentiment analysis …
DStream - Micro Batches 
• Continuous sequence of micro batches! 
• More complex processing models are possible with less effort! 
• Streaming computations as a series of deterministic batch computations on small time intervals 
DStream 
μBatch (ordinary RDD) μBatch (ordinary RDD) μBatch (ordinary RDD) 
Processing of DStream = Processing of μBatches, RDDs
DStreams representing the stream of raw data received from streaming 
sources. ! 
!•!Basic sources: Sources directly available in the StreamingContext API! 
!•!Advanced sources: Sources available through external modules 
available in Spark as artifacts 
22 
InputDStreams
Spark Streaming Modules 
GroupId ArtifactId Latest Version 
org.apache.spark spark-streaming-kinesis-asl_2.10 1.1.0 
org.apache.spark spark-streaming-mqtt_2.10 1.1.0 all (7) 
org.apache.spark spark-streaming-zeromq_2.10 1.1.0 all (7) 
org.apache.spark spark-streaming-flume_2.10 1.1.0 all (7) 
org.apache.spark spark-streaming-flume-sink_2.10 1.1.0 
org.apache.spark spark-streaming-kafka_2.10 1.1.0 all (7) 
org.apache.spark spark-streaming-twitter_2.10 1.1.0 all (7)
Streaming Window Operations 
// where pairs are (word,count) 
pairsStream 
.flatMap { case (k,v) => (k,v.value) } 
.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), 
Seconds(30), Seconds(10)) 
.saveToCassandra(keyspace,table) 
24
Streaming Window Operations 
• window(Duration, Duration) 
• countByWindow(Duration, Duration) 
• reduceByWindow(Duration, Duration) 
• countByValueAndWindow(Duration, Duration) 
• groupByKeyAndWindow(Duration, Duration) 
• reduceByKeyAndWindow((V, V) => V, Duration, 
Duration)
Ten Things About Cassandra 
You always wanted to know but were afraid to ask 
©2014 DataStax Confidential. Do not distribute without consent. 
26
Quick intro to Cassandra 
• Shared nothing 
• Masterless peer-to-peer 
• Based on Dynamo 
• Highly Scalable 
• Spans DataCenters
Apache Cassandra 
• Elasticity - scale to as many nodes as you need, when you need! 
•! Always On - No single point of failure, Continuous availability!! 
!•! Masterless peer to peer architecture! 
•! Designed for Replication! 
!•! Flexible Data Storage! 
!•! Read and write to any node syncs across the cluster! 
!•! Operational simplicity - with all nodes in a cluster being the same, 
there is no complex configuration to manage
Easy to use 
• CQL - familiar syntax! 
• Friendly to programmers! 
• Paxos for locking 
CREATE TABLE users (! 
username varchar,! 
firstname varchar,! 
lastname varchar,! 
email list<varchar>,! 
password varchar,! 
created_date timestamp,! 
PRIMARY KEY (username)! 
); 
INSERT INTO users (username, firstname, lastname, ! 
email, password, created_date)! 
VALUES ('hedelson','Helena','Edelson',! 
[‘helena.edelson@datastax.com'],'ba27e03fd95e507daf2937c937d499ab','2014-11-15 13:50:00');! 
INSERT INTO users (username, firstname, ! 
lastname, email, password, created_date)! 
VALUES ('pmcfadin','Patrick','McFadin',! 
['patrick@datastax.com'],! 
'ba27e03fd95e507daf2937c937d499ab',! 
'2011-06-20 13:50:00')! 
IF NOT EXISTS;
Spark Cassandra Connector 
©2014 DataStax Confidential. Do not distribute without consent. 
30
Spark On Cassandra 
• Server-Side filters (where clauses)! 
• Cross-table operations (JOIN, UNION, etc.)! 
• Data locality-aware (speed)! 
• Data transformation, aggregation, etc. ! 
• Natural Time Series Integration
Spark Cassandra Connector 
• Loads data from Cassandra to Spark! 
• Writes data from Spark to Cassandra! 
• Implicit Type Conversions and Object Mapping! 
• Implemented in Scala (offers a Java API)! 
• Open Source ! 
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams
Spark Cassandra Connector 
!https://github.com/datastax/spark-cassandra-connector 
C* 
User Application 
Spark-Cassandra Connector 
Cassandra C* C* 
C* 
Spark Executor 
C* Java (Soon Scala) Driver
Use Cases: KillrWeather 
• Get data by weather station! 
• Get data for a single date and time! 
• Get data for a range of dates and times! 
• Compute, store and quickly retrieve daily, monthly and annual 
aggregations of data! 
Data Model to support queries 
• Store raw data per weather station 
• Store time series in order: most recent to oldest 
• Compute and store aggregate data in the stream 
• Set TTLs on historic data
Cassandra Data Model 
user_name tweet_id publication_date message 
pkolaczk 4b514a41-­‐9cf... 2014-­‐08-­‐21 
10:00:04 ... 
pkolaczk 411bbe34-­‐11d... 2014-­‐08-­‐24 
16:03:47 ... 
ewa 73847f13-­‐771... 2014-­‐08-­‐21 
10:00:04 ... 
ewa 41106e9a-­‐5ff... 2014-­‐08-­‐21 
10:14:51 ... 
row 
partitioning key 
partition 
primary key data columns
Spark Data Model 
A1 A2 A3 A4 A5 A6 A7 A8 
map 
B1 B2 B3 B4 B5 B6 B7 B8 
filter 
B1 B2 B5 B7 B8 
C 
reduce 
Resilient Distributed Dataset! 
! 
A Scala collection:! 
• immutable! 
• iterable! 
• serializable! 
• distributed! 
• parallel! 
• lazy evaluation
Spark Cassandra Example 
val conf = new SparkConf(loadDefaults = true) 
.set("spark.cassandra.connection.host", "127.0.0.1") 
.setMaster("spark://127.0.0.1:7077") 
! 
val sc = new SparkContext(conf) 
! 
! 
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets") 
! 
val ssc = new StreamingContext(sc, Seconds(30)) 
val stream = KafkaUtils.createStream[String, String, StringDecoder, 
StringDecoder]( 
ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY) 
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount") 
ssc.start() 
ssc.awaitTermination() 
! 
Initialization 
CassandraRDD 
Stream Initialization 
Transformations 
and Action
Spark Cassandra Example 
val sc = new SparkContext(..) 
val ssc = new StreamingContext(sc, Seconds(5)) 
! 
val stream = TwitterUtils.createStream(ssc, auth, filters, StorageLevel.MEMORY_ONLY_SER_2) 
val transform = (cruft: String) => 
Pattern.findAllIn(cruft).flatMap(_.stripPrefix("#")) 
/** Note that Cassandra is doing the sorting for you here. */ 
stream.flatMap(_.getText.toLowerCase.split("""s+""")) 
.map(transform) 
.countByValueAndWindow(Seconds(5), Seconds(5)) 
.transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time))}) 
.saveToCassandra(keyspace, suspicious, SomeColumns(“suspicious", "count", “timestamp"))
Reading: From C* To Spark 
row 
representation keyspace table 
val table = sc 
.cassandraTable[CassandraRow]("keyspace", "tweets") 
.select("user_name", "message") 
.where("user_name = ?", "ewa") 
server side 
column and row 
selection
CassandraRDD 
class CassandraRDD[R](..., keyspace: String, table: String, ...) 
extends RDD[R](...) { 
// Splits the table into multiple Spark partitions, 
// each processed by single Spark Task 
override def getPartitions: Array[Partition] 
// Returns names of hosts storing given partition (for data locality!) 
override def getPreferredLocations(split: Partition): Seq[String] 
// Returns iterator over Cassandra rows in the given partition 
override def compute(split: Partition, context: TaskContext): Iterator[R] 
}
CassandraStreamingRDD 
/** RDD representing a Cassandra table for Spark Streaming. 
* @see [[com.datastax.spark.connector.rdd.CassandraRDD]] 
*/ 
class CassandraStreamingRDD[R] private[connector] ( 
sctx: StreamingContext, 
connector: CassandraConnector, 
keyspace: String, 
table: String, 
columns: ColumnSelector = AllColumns, 
where: CqlWhereClause = CqlWhereClause.empty, 
readConf: ReadConf = ReadConf())( 
implicit ct : ClassTag[R], @transient rrf: RowReaderFactory[R]) 
extends CassandraRDD[R](sctx.sparkContext, connector, keyspace, table, columns, where, readConf)
Paging Reads with .cassandraTable 
• Page size is configurable! 
• Controls how many CQL rows to fetch at a time, when fetching a single 
partition! 
• Connector returns an iterator for rows to Spark! 
• Spark iterates over this, lazily ! 
• Handled by the java driver as well as spark
ResultSet Paging and Pre-Fetching 
Node 1 
Client Cassandra 
Node 1 
request a page 
data 
process data 
request a page 
data 
request a page 
Node 2 
Client Cassandra 
Node 2 
request a page 
data 
process data 
request a page 
data 
request a page
Co-locate Spark and C* for Best 
Performance 
44 
C* 
C* C* 
Spark 
Worker 
C* 
Spark 
Worker 
Spark 
Master 
Spark 
Worker 
Running Spark Workers on 
the same nodes as your C* Cluster 
will save network hops when 
reading and writing
The Key To Speed - Data Locality 
• LocalNodeFirstLoadBalancingPolicy! 
• Decides what node will become the coordinator for the given mutation/read! 
• Selects local node first and then nodes in the local DC in random order! 
• Once that node receives the request it will be distributed ! 
• Proximal Node Sort Defined by the C* snitch! 
Analytic 
•https://github.com/apache/cassandra/blob/trunk/src/java/org/ 
apache/cassandra/locator/DynamicEndpointSnitch.java#L155- 
Analytic 
L190 
Search
Column Type Conversions 
trait 
TypeConverter[T] 
extends 
Serializable 
{ 
def 
targetTypeTag: 
TypeTag[T] 
def 
convertPF: 
PartialFunction[Any, 
T] 
} 
! 
class 
CassandraRow(...) 
{ 
def 
get[T](index: 
Int)(implicit 
c: 
TypeConverter[T]): 
T 
= 
… 
def 
get[T](name: 
String)(implicit 
c: 
TypeConverter[T]): 
T 
= 
… 
… 
} 
! 
implicit 
object 
IntConverter 
extends 
TypeConverter[Int] 
{ 
def 
targetTypeTag 
= 
implicitly[TypeTag[Int]] 
convertPF 
= 
{ 
case 
def 
x: 
Number 
=> 
x.intValue 
case 
x: 
String 
=> 
x.toInt 
} 
}
Rows to Case Class Instances 
val 
row: 
CassandraRow 
= 
table.first 
case 
class 
Tweet(userName: 
String, 
tweetId: 
UUID, 
publicationDate: 
Date, 
message: 
String) 
extends 
Serializable 
! 
val 
tweets 
= 
sc.cassandraTable[Tweet]("db", 
"tweets") 
Scala Cassandra 
message message 
column1 column_1 
userName user_name 
Scala Cassandra 
Message Message 
column1 column1 
userName userName 
Recommended convention:
Convert Rows To Tuples 
! 
val 
tweets 
= 
sc 
.cassandraTable[(String, 
String)]("db", 
"tweets") 
.select("userName", 
"message") 
When returning tuples, always use 
select to ! specify the column order!
Handling Unsupported Types 
case 
class 
Tweet( 
userName: 
String, 
tweetId: 
my.private.implementation.of.UUID, 
publicationDate: 
Date, 
message: 
String) 
! 
val 
tweets: 
CassandraRDD[Tweet] 
= 
sc.cassandraTable[Tweet]("db", 
"tweets") 
Current behavior: runtime error 
Future: compile-time error (macros!) 
Macros could even 
parse Cassandra 
schema!
Connector Code and Docs 
https://github.com/datastax/spark-cassandra-connector! 
! 
! 
! 
! 
! 
! 
Add It To Your Project:! 
! 
val 
connector 
= 
"com.datastax.spark" 
%% 
"spark-­‐cassandra-­‐connector" 
% 
"1.1.0-­‐beta2"
What’s New In Spark? 
•Petabyte sort record! 
• Myth busting: ! 
• Spark is in-memory. It doesn’t work with big data! 
• It’s too expensive to buy a cluster with enough memory to fit our 
data! 
•Application Integration! 
• Tableau, Trifacta, Talend, ElasticSearch, Cassandra! 
•Ongoing development for Spark 1.2! 
• Python streaming, new MLib API, Yarn scaling…
Recent / Pending Spark Updates 
•Usability, Stability & Performance Improvements ! 
• Improved Lightning Fast Shuffle!! 
• Monitoring the performance long running or complex jobs! 
•Public types API to allow users to create SchemaRDD’s! 
•Support for registering Python, Scala, and Java lambda functions as UDF ! 
•Dynamic bytecode generation significantly speeding up execution for 
queries that perform complex expression evaluation
Recent Spark Streaming Updates 
•Apache Flume: a new pull-based mode (simplifying deployment and 
providing high availability)! 
•The first of a set of streaming machine learning algorithms is introduced 
with streaming linear regression.! 
•Rate limiting has been added for streaming inputs
Recent MLib Updates 
• New library of statistical packages which provides exploratory analytic 
functions *stratified sampling, correlations, chi-squared tests, creating 
random datasets…)! 
• Utilities for feature extraction (Word2Vec and TF-IDF) and feature 
transformation (normalization and standard scaling). ! 
• Support for nonnegative matrix factorization and SVD via Lanczos. ! 
• Decision tree algorithm has been added in Python and Java. ! 
• Tree aggregation primitive! 
• Performance improves across the board, with improvements of around
Recent/Pending Connector Updates 
•Spark SQL (came in 1.1)! 
•Python API! 
•Official Scala Driver for Cassandra! 
•Removing last traces of Thrift!!! 
•Performance Improvements! 
• Token-aware data repartitioning! 
• Token-aware saving! 
• Wide-row support - no costly groupBy call
Resources 
•Spark Cassandra Connector ! 
https://github.com/datastax/spark-cassandra-connector ! 
•Apache Cassandra http://cassandra.apache.org! 
•Apache Spark http://spark.apache.org ! 
•Apache Kafka http://kafka.apache.org ! 
•Akka http://akka.io 
Analytic 
Analytic
www.spark-summit.org
Thanks 
for 
listening! 
SEPTEMBER 
10 
-­‐ 
11, 
2014 
| 
SAN 
FRANCISCO, 
CALIF. 
| 
THE 
WESTIN 
ST. 
FRANCIS 
HOTEL 
Cassandra 
Summit

Mais conteúdo relacionado

Mais procurados

Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackAnirvan Chakraborty
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchPatricia Gorla
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
 

Mais procurados (20)

Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise Search
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 

Destaque

Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesDuyhai Doan
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
 
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Helena Edelson
 
Streaming Data Analysis and Online Learning by John Myles White
Streaming Data Analysis and Online Learning by John Myles White Streaming Data Analysis and Online Learning by John Myles White
Streaming Data Analysis and Online Learning by John Myles White Hakka Labs
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand fordThu Hiền
 
Windows Azure Storage: Overview, Internals, and Best Practices
Windows Azure Storage: Overview, Internals, and Best PracticesWindows Azure Storage: Overview, Internals, and Best Practices
Windows Azure Storage: Overview, Internals, and Best PracticesAnton Vidishchev
 
Relational Databases are Evolving To Support New Data Capabilities
Relational Databases are Evolving To Support New Data CapabilitiesRelational Databases are Evolving To Support New Data Capabilities
Relational Databases are Evolving To Support New Data CapabilitiesEDB
 
Scala Abide: A lint tool for Scala
Scala Abide: A lint tool for ScalaScala Abide: A lint tool for Scala
Scala Abide: A lint tool for ScalaIulian Dragos
 
Your Code is Wrong
Your Code is WrongYour Code is Wrong
Your Code is Wrongnathanmarz
 
Puppet at Google
Puppet at GooglePuppet at Google
Puppet at GooglePuppet
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...In-Memory Computing Summit
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkVictor Coustenoble
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark Summit
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterprisePatrick McFadin
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksHortonworks
 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldKonrad Malawski
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
LinkedIn Mobile: How do we do it?
LinkedIn Mobile: How do we do it?LinkedIn Mobile: How do we do it?
LinkedIn Mobile: How do we do it?phegaro
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleHelena Edelson
 

Destaque (20)

Spark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
 
Streaming Data Analysis and Online Learning by John Myles White
Streaming Data Analysis and Online Learning by John Myles White Streaming Data Analysis and Online Learning by John Myles White
Streaming Data Analysis and Online Learning by John Myles White
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
Windows Azure Storage: Overview, Internals, and Best Practices
Windows Azure Storage: Overview, Internals, and Best PracticesWindows Azure Storage: Overview, Internals, and Best Practices
Windows Azure Storage: Overview, Internals, and Best Practices
 
Relational Databases are Evolving To Support New Data Capabilities
Relational Databases are Evolving To Support New Data CapabilitiesRelational Databases are Evolving To Support New Data Capabilities
Relational Databases are Evolving To Support New Data Capabilities
 
Scala Abide: A lint tool for Scala
Scala Abide: A lint tool for ScalaScala Abide: A lint tool for Scala
Scala Abide: A lint tool for Scala
 
Your Code is Wrong
Your Code is WrongYour Code is Wrong
Your Code is Wrong
 
Puppet at Google
Puppet at GooglePuppet at Google
Puppet at Google
 
Why Spark?
Why Spark?Why Spark?
Why Spark?
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance Benchmarks
 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorld
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
LinkedIn Mobile: How do we do it?
LinkedIn Mobile: How do we do it?LinkedIn Mobile: How do we do it?
LinkedIn Mobile: How do we do it?
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 

Semelhante a Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with Apache: Spark, Kafka, Cassandra and Akka

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleTimothy Spann
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleScyllaDB
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataJohn Nestor
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyDataStax Academy
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkDatabricks
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 

Semelhante a Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with Apache: Spark, Kafka, Cassandra and Akka (20)

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Cassandra and Spark SQL
Cassandra and Spark SQLCassandra and Spark SQL
Cassandra and Spark SQL
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any Scale
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any Scale
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 

Mais de Helena Edelson

Toward Predictability and Stability
Toward Predictability and StabilityToward Predictability and Stability
Toward Predictability and StabilityHelena Edelson
 
Disorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At ScaleDisorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At ScaleHelena Edelson
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Helena Edelson
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 

Mais de Helena Edelson (7)

Toward Predictability and Stability
Toward Predictability and StabilityToward Predictability and Stability
Toward Predictability and Stability
 
Patterns In The Chaos
Patterns In The ChaosPatterns In The Chaos
Patterns In The Chaos
 
Disorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At ScaleDisorder And Tolerance In Distributed Systems At Scale
Disorder And Tolerance In Distributed Systems At Scale
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 

Último

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Último (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with Apache: Spark, Kafka, Cassandra and Akka

  • 1. Streaming Big Data with Apache Spark, Kafka and Cassandra Helena Edelson @helenaedelson ©2014 DataStax Confidential. Do not distribute without consent. 1
  • 2. 1 Delivering Meaning In Near-Real Time At High Velocity 2 Overview Of Spark Streaming, Cassandra, Kafka and Akka 3 The Spark Cassandra Connector 4 Integration In Big Data Applications © 2014 DataStax, All Rights Reserved. 2
  • 3. Who Is This Person? • Using Scala in production since 2010! • Spark Cassandra Connector Committer! • Akka (Cluster) Contributor ! • Scala Driver for Cassandra Committer ! • @helenaedelson! • https://github.com/helena ! • Senior Software Engineer, Analytics Team, DataStax! • Spent the last few years as a Senior Cloud Engineer Analytic Analytic
  • 4. Analytic Analytic Search Spark WordCount
  • 5. Delivering Meaning In Near-Real Time At High Velocity At Massive Scale ©2014 DataStax Confidential. Do not distribute without consent. 5
  • 6.
  • 7. Strategies • Partition For Scale! • Replicate For Resiliency! • Share Nothing! • Asynchronous Message Passing! • Parallelism! • Isolation! • Location Transparency
  • 8. What We Need • Fault Tolerant ! • Failure Detection! • Fast - low latency, distributed, data locality! • Masterless, Decentralized Cluster Membership! • Span Racks and DataCenters! • Hashes The Node Ring ! • Partition-Aware! • Elasticity ! • Asynchronous - message-passing system! • Parallelism! • Network Topology Aware!
  • 9. Why Akka Rocks • location transparency! • fault tolerant! • async message passing! • non-deterministic! • share nothing! • actor atomicity (w/in actor)! !
  • 10. Apache Kafka! From Joe Stein
  • 11. Lambda Architecture Spark SQL structured Spark Core Cassandra Spark Streaming real-time MLlib machine learning GraphX graph Apache ! Kafka
  • 12. Apache Spark and Spark Streaming ©2014 DataStax Confidential. Do not distribute without consent. 12 The Drive-By Version
  • 13. Analytic Analytic Search What Is Apache Spark • Fast, general cluster compute system! • Originally developed in 2009 in UC Berkeley’s AMPLab! • Fully open sourced in 2010 – now at Apache Software Foundation! • Distributed, Scalable, Fault Tolerant
  • 14. Apache Spark - Easy to Use & Fast • 10x faster on disk,100x faster in memory than Hadoop MR! • Works out of the box on EMR! • Fault Tolerant Distributed Datasets! • Batch, iterative and streaming analysis! • In Memory Storage and Disk ! • Integrates with Most File and Storage Options Up to 100× faster (2-10× on disk) Analytic 2-5× less code Analytic Search
  • 15. Spark Components Spark SQL structured Spark Core Spark Streaming real-time MLlib machine learning GraphX graph
  • 16. • Functional • On the JVM • Capture functions and ship them across the network • Static typing - easier to control performance • Leverage Scala REPL for the Spark REPL http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536p6538. html Analytic Analytic Search Why Scala?
  • 17. Spark Versus Spark Streaming zillions of bytes gigabytes per second
  • 18. Input & Output Sources
  • 19. Analytic Analytic Search Spark Streaming Kinesis,'S3'
  • 20. Common Use Cases applications sensors web mobile phones intrusion detection malfunction detection site analytics network metrics analysis fraud detection dynamic process optimisation recommendations location based ads log processing supply chain planning sentiment analysis …
  • 21. DStream - Micro Batches • Continuous sequence of micro batches! • More complex processing models are possible with less effort! • Streaming computations as a series of deterministic batch computations on small time intervals DStream μBatch (ordinary RDD) μBatch (ordinary RDD) μBatch (ordinary RDD) Processing of DStream = Processing of μBatches, RDDs
  • 22. DStreams representing the stream of raw data received from streaming sources. ! !•!Basic sources: Sources directly available in the StreamingContext API! !•!Advanced sources: Sources available through external modules available in Spark as artifacts 22 InputDStreams
  • 23. Spark Streaming Modules GroupId ArtifactId Latest Version org.apache.spark spark-streaming-kinesis-asl_2.10 1.1.0 org.apache.spark spark-streaming-mqtt_2.10 1.1.0 all (7) org.apache.spark spark-streaming-zeromq_2.10 1.1.0 all (7) org.apache.spark spark-streaming-flume_2.10 1.1.0 all (7) org.apache.spark spark-streaming-flume-sink_2.10 1.1.0 org.apache.spark spark-streaming-kafka_2.10 1.1.0 all (7) org.apache.spark spark-streaming-twitter_2.10 1.1.0 all (7)
  • 24. Streaming Window Operations // where pairs are (word,count) pairsStream .flatMap { case (k,v) => (k,v.value) } .reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10)) .saveToCassandra(keyspace,table) 24
  • 25. Streaming Window Operations • window(Duration, Duration) • countByWindow(Duration, Duration) • reduceByWindow(Duration, Duration) • countByValueAndWindow(Duration, Duration) • groupByKeyAndWindow(Duration, Duration) • reduceByKeyAndWindow((V, V) => V, Duration, Duration)
  • 26. Ten Things About Cassandra You always wanted to know but were afraid to ask ©2014 DataStax Confidential. Do not distribute without consent. 26
  • 27. Quick intro to Cassandra • Shared nothing • Masterless peer-to-peer • Based on Dynamo • Highly Scalable • Spans DataCenters
  • 28. Apache Cassandra • Elasticity - scale to as many nodes as you need, when you need! •! Always On - No single point of failure, Continuous availability!! !•! Masterless peer to peer architecture! •! Designed for Replication! !•! Flexible Data Storage! !•! Read and write to any node syncs across the cluster! !•! Operational simplicity - with all nodes in a cluster being the same, there is no complex configuration to manage
  • 29. Easy to use • CQL - familiar syntax! • Friendly to programmers! • Paxos for locking CREATE TABLE users (! username varchar,! firstname varchar,! lastname varchar,! email list<varchar>,! password varchar,! created_date timestamp,! PRIMARY KEY (username)! ); INSERT INTO users (username, firstname, lastname, ! email, password, created_date)! VALUES ('hedelson','Helena','Edelson',! [‘helena.edelson@datastax.com'],'ba27e03fd95e507daf2937c937d499ab','2014-11-15 13:50:00');! INSERT INTO users (username, firstname, ! lastname, email, password, created_date)! VALUES ('pmcfadin','Patrick','McFadin',! ['patrick@datastax.com'],! 'ba27e03fd95e507daf2937c937d499ab',! '2011-06-20 13:50:00')! IF NOT EXISTS;
  • 30. Spark Cassandra Connector ©2014 DataStax Confidential. Do not distribute without consent. 30
  • 31. Spark On Cassandra • Server-Side filters (where clauses)! • Cross-table operations (JOIN, UNION, etc.)! • Data locality-aware (speed)! • Data transformation, aggregation, etc. ! • Natural Time Series Integration
  • 32. Spark Cassandra Connector • Loads data from Cassandra to Spark! • Writes data from Spark to Cassandra! • Implicit Type Conversions and Object Mapping! • Implemented in Scala (offers a Java API)! • Open Source ! • Exposes Cassandra Tables as Spark RDDs + Spark DStreams
  • 33. Spark Cassandra Connector !https://github.com/datastax/spark-cassandra-connector C* User Application Spark-Cassandra Connector Cassandra C* C* C* Spark Executor C* Java (Soon Scala) Driver
  • 34. Use Cases: KillrWeather • Get data by weather station! • Get data for a single date and time! • Get data for a range of dates and times! • Compute, store and quickly retrieve daily, monthly and annual aggregations of data! Data Model to support queries • Store raw data per weather station • Store time series in order: most recent to oldest • Compute and store aggregate data in the stream • Set TTLs on historic data
  • 35. Cassandra Data Model user_name tweet_id publication_date message pkolaczk 4b514a41-­‐9cf... 2014-­‐08-­‐21 10:00:04 ... pkolaczk 411bbe34-­‐11d... 2014-­‐08-­‐24 16:03:47 ... ewa 73847f13-­‐771... 2014-­‐08-­‐21 10:00:04 ... ewa 41106e9a-­‐5ff... 2014-­‐08-­‐21 10:14:51 ... row partitioning key partition primary key data columns
  • 36. Spark Data Model A1 A2 A3 A4 A5 A6 A7 A8 map B1 B2 B3 B4 B5 B6 B7 B8 filter B1 B2 B5 B7 B8 C reduce Resilient Distributed Dataset! ! A Scala collection:! • immutable! • iterable! • serializable! • distributed! • parallel! • lazy evaluation
  • 37. Spark Cassandra Example val conf = new SparkConf(loadDefaults = true) .set("spark.cassandra.connection.host", "127.0.0.1") .setMaster("spark://127.0.0.1:7077") ! val sc = new SparkContext(conf) ! ! val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets") ! val ssc = new StreamingContext(sc, Seconds(30)) val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY) stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount") ssc.start() ssc.awaitTermination() ! Initialization CassandraRDD Stream Initialization Transformations and Action
  • 38. Spark Cassandra Example val sc = new SparkContext(..) val ssc = new StreamingContext(sc, Seconds(5)) ! val stream = TwitterUtils.createStream(ssc, auth, filters, StorageLevel.MEMORY_ONLY_SER_2) val transform = (cruft: String) => Pattern.findAllIn(cruft).flatMap(_.stripPrefix("#")) /** Note that Cassandra is doing the sorting for you here. */ stream.flatMap(_.getText.toLowerCase.split("""s+""")) .map(transform) .countByValueAndWindow(Seconds(5), Seconds(5)) .transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time))}) .saveToCassandra(keyspace, suspicious, SomeColumns(“suspicious", "count", “timestamp"))
  • 39. Reading: From C* To Spark row representation keyspace table val table = sc .cassandraTable[CassandraRow]("keyspace", "tweets") .select("user_name", "message") .where("user_name = ?", "ewa") server side column and row selection
  • 40. CassandraRDD class CassandraRDD[R](..., keyspace: String, table: String, ...) extends RDD[R](...) { // Splits the table into multiple Spark partitions, // each processed by single Spark Task override def getPartitions: Array[Partition] // Returns names of hosts storing given partition (for data locality!) override def getPreferredLocations(split: Partition): Seq[String] // Returns iterator over Cassandra rows in the given partition override def compute(split: Partition, context: TaskContext): Iterator[R] }
  • 41. CassandraStreamingRDD /** RDD representing a Cassandra table for Spark Streaming. * @see [[com.datastax.spark.connector.rdd.CassandraRDD]] */ class CassandraStreamingRDD[R] private[connector] ( sctx: StreamingContext, connector: CassandraConnector, keyspace: String, table: String, columns: ColumnSelector = AllColumns, where: CqlWhereClause = CqlWhereClause.empty, readConf: ReadConf = ReadConf())( implicit ct : ClassTag[R], @transient rrf: RowReaderFactory[R]) extends CassandraRDD[R](sctx.sparkContext, connector, keyspace, table, columns, where, readConf)
  • 42. Paging Reads with .cassandraTable • Page size is configurable! • Controls how many CQL rows to fetch at a time, when fetching a single partition! • Connector returns an iterator for rows to Spark! • Spark iterates over this, lazily ! • Handled by the java driver as well as spark
  • 43. ResultSet Paging and Pre-Fetching Node 1 Client Cassandra Node 1 request a page data process data request a page data request a page Node 2 Client Cassandra Node 2 request a page data process data request a page data request a page
  • 44. Co-locate Spark and C* for Best Performance 44 C* C* C* Spark Worker C* Spark Worker Spark Master Spark Worker Running Spark Workers on the same nodes as your C* Cluster will save network hops when reading and writing
  • 45. The Key To Speed - Data Locality • LocalNodeFirstLoadBalancingPolicy! • Decides what node will become the coordinator for the given mutation/read! • Selects local node first and then nodes in the local DC in random order! • Once that node receives the request it will be distributed ! • Proximal Node Sort Defined by the C* snitch! Analytic •https://github.com/apache/cassandra/blob/trunk/src/java/org/ apache/cassandra/locator/DynamicEndpointSnitch.java#L155- Analytic L190 Search
  • 46. Column Type Conversions trait TypeConverter[T] extends Serializable { def targetTypeTag: TypeTag[T] def convertPF: PartialFunction[Any, T] } ! class CassandraRow(...) { def get[T](index: Int)(implicit c: TypeConverter[T]): T = … def get[T](name: String)(implicit c: TypeConverter[T]): T = … … } ! implicit object IntConverter extends TypeConverter[Int] { def targetTypeTag = implicitly[TypeTag[Int]] convertPF = { case def x: Number => x.intValue case x: String => x.toInt } }
  • 47. Rows to Case Class Instances val row: CassandraRow = table.first case class Tweet(userName: String, tweetId: UUID, publicationDate: Date, message: String) extends Serializable ! val tweets = sc.cassandraTable[Tweet]("db", "tweets") Scala Cassandra message message column1 column_1 userName user_name Scala Cassandra Message Message column1 column1 userName userName Recommended convention:
  • 48. Convert Rows To Tuples ! val tweets = sc .cassandraTable[(String, String)]("db", "tweets") .select("userName", "message") When returning tuples, always use select to ! specify the column order!
  • 49. Handling Unsupported Types case class Tweet( userName: String, tweetId: my.private.implementation.of.UUID, publicationDate: Date, message: String) ! val tweets: CassandraRDD[Tweet] = sc.cassandraTable[Tweet]("db", "tweets") Current behavior: runtime error Future: compile-time error (macros!) Macros could even parse Cassandra schema!
  • 50. Connector Code and Docs https://github.com/datastax/spark-cassandra-connector! ! ! ! ! ! ! Add It To Your Project:! ! val connector = "com.datastax.spark" %% "spark-­‐cassandra-­‐connector" % "1.1.0-­‐beta2"
  • 51. What’s New In Spark? •Petabyte sort record! • Myth busting: ! • Spark is in-memory. It doesn’t work with big data! • It’s too expensive to buy a cluster with enough memory to fit our data! •Application Integration! • Tableau, Trifacta, Talend, ElasticSearch, Cassandra! •Ongoing development for Spark 1.2! • Python streaming, new MLib API, Yarn scaling…
  • 52. Recent / Pending Spark Updates •Usability, Stability & Performance Improvements ! • Improved Lightning Fast Shuffle!! • Monitoring the performance long running or complex jobs! •Public types API to allow users to create SchemaRDD’s! •Support for registering Python, Scala, and Java lambda functions as UDF ! •Dynamic bytecode generation significantly speeding up execution for queries that perform complex expression evaluation
  • 53. Recent Spark Streaming Updates •Apache Flume: a new pull-based mode (simplifying deployment and providing high availability)! •The first of a set of streaming machine learning algorithms is introduced with streaming linear regression.! •Rate limiting has been added for streaming inputs
  • 54. Recent MLib Updates • New library of statistical packages which provides exploratory analytic functions *stratified sampling, correlations, chi-squared tests, creating random datasets…)! • Utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling). ! • Support for nonnegative matrix factorization and SVD via Lanczos. ! • Decision tree algorithm has been added in Python and Java. ! • Tree aggregation primitive! • Performance improves across the board, with improvements of around
  • 55. Recent/Pending Connector Updates •Spark SQL (came in 1.1)! •Python API! •Official Scala Driver for Cassandra! •Removing last traces of Thrift!!! •Performance Improvements! • Token-aware data repartitioning! • Token-aware saving! • Wide-row support - no costly groupBy call
  • 56. Resources •Spark Cassandra Connector ! https://github.com/datastax/spark-cassandra-connector ! •Apache Cassandra http://cassandra.apache.org! •Apache Spark http://spark.apache.org ! •Apache Kafka http://kafka.apache.org ! •Akka http://akka.io Analytic Analytic
  • 58. Thanks for listening! SEPTEMBER 10 -­‐ 11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL Cassandra Summit