SlideShare uma empresa Scribd logo
1 de 25
Baixar para ler offline
©2014 DataStax Confidential. Do not distribute without consent.
@rstml
Rustam Aliyev
Solution Architect 
Lightning-fast analytics with Spark
forCassandraandDataStaxEnterprise
1
What is Spark?
* Apache Project since 2010
* Fast
* 10x-100x faster than Hadoop MapReduce
* In-memory storage
* Single JVM process per node
* Easy
* Rich Scala, Java and Python APIs
* 2x-5x less code
* Interactive shell
Analytic
Analytic
Search
API
map! reduce!
API
map!
filter!
groupBy!
sort!
union!
join!
leftOuterJoin!
rightOuterJoin!
reduce!
count!
fold!
reduceByKey!
groupByKey!
cogroup!
cross!
zip!
sample!
take!
first!
partitionBy!
mapWith!
pipe!
save !
...!
API
* Resilient Distributed Datasets
* Collections of objects spread across a
cluster, stored in RAM or on Disk
* Built through parallel transformations
* Automatically rebuilt on failure

* Operations
* Transformations (e.g. map, filter, groupBy)
* Actions (e.g. count, collect, save)
Operator Graph: Optimization and Fault Tolerance
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
map
= Cached partition= RDD
Fast
0
500
1000
1500
2000
2500
3000
3500
4000
1 5 10 20 30
RunningTime(s)
Number of Iterations
Hadoop
Spark
110 sec / iteration
first iteration 80 sec
further iterations 1 sec
* Logistic Regression Performance
"
Why Spark on Cassandra?
* Data model independent queries
* Cross-table operations (JOIN, UNION, etc.)
* Complex analytics (e.g. machine learning)
* Data transformation, aggregation, etc.
* Stream processing (coming soon)
How to Spark on Cassandra?
* DataStax Cassandra Spark driver
* Open source: https://github.com/datastax/cassandra-driver-spark
* Compatible with
* Spark 0.9+
* Cassandra 2.0+
* DataStax Enterprise 4.5+
Cassandra Spark Driver
* Cassandra tables exposed as Spark RDDs
* Read from and write to Cassandra
* Mapping of C* tables and rows to Scala objects
* All Cassandra types supported and converted to Scala types
* Server side data selection
* Virtual Nodes support
* Scala only driver for now
Connecting to Cassandra
//	
  Import	
  Cassandra-­‐specific	
  functions	
  on	
  SparkContext	
  and	
  RDD	
  objects	
  
import	
  com.datastax.driver.spark._	
  
	
  
	
  
//	
  Spark	
  connection	
  options	
  
val	
  conf	
  =	
  new	
  SparkConf(true)	
  
	
   	
  	
  	
  	
  	
  .setMaster("spark://192.168.123.10:7077")	
  
	
   	
  	
  	
  	
  	
  .setAppName("cassandra-­‐demo")	
  
	
  	
  	
  	
  	
  	
  	
  	
  .set("cassandra.connection.host",	
  "192.168.123.10")	
  //	
  initial	
  contact	
  
	
  	
  	
  	
  	
  	
  	
  	
  .set("cassandra.username",	
  "cassandra")	
  
	
  	
  	
  	
  	
  	
  	
  	
  .set("cassandra.password",	
  "cassandra")	
  	
  
	
  
val	
  sc	
  =	
  new	
  SparkContext(conf)	
  
Accessing Data
CREATE	
  TABLE	
  test.words	
  (word	
  text	
  PRIMARY	
  KEY,	
  count	
  int);	
  
	
  
INSERT	
  INTO	
  test.words	
  (word,	
  count)	
  VALUES	
  ('bar',	
  30);	
  
INSERT	
  INTO	
  test.words	
  (word,	
  count)	
  VALUES	
  ('foo',	
  20);	
  
//	
  Use	
  table	
  as	
  RDD	
  
val	
  rdd	
  =	
  sc.cassandraTable("test",	
  "words")	
  
//	
  rdd:	
  CassandraRDD[CassandraRow]	
  =	
  CassandraRDD[0]	
  
	
  
rdd.toArray.foreach(println)	
  
//	
  CassandraRow[word:	
  bar,	
  count:	
  30]	
  
//	
  CassandraRow[word:	
  foo,	
  count:	
  20]	
  
	
  
rdd.columnNames	
  	
  	
  	
  //	
  Stream(word,	
  count)	
  	
  
rdd.size	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  //	
  2	
  
	
  
val	
  firstRow	
  =	
  rdd.first	
  	
  //	
  firstRow:	
  CassandraRow	
  =	
  CassandraRow[word:	
  bar,	
  count:	
  30]	
  
firstRow.getInt("count")	
  	
  //	
  Int	
  =	
  30	
  
* Accessing table above as RDD:
Saving Data
val	
  newRdd	
  =	
  sc.parallelize(Seq(("cat",	
  40),	
  ("fox",	
  50)))	
  
//	
  newRdd:	
  org.apache.spark.rdd.RDD[(String,	
  Int)]	
  =	
  ParallelCollectionRDD[2]	
  
	
  
newRdd.saveToCassandra("test",	
  "words",	
  Seq("word",	
  "count"))	
  
SELECT	
  *	
  FROM	
  test.words;	
  
	
  
	
  word	
  |	
  count	
  
-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
	
  	
  bar	
  |	
  	
  	
  	
  30	
  
	
  	
  foo	
  |	
  	
  	
  	
  20	
  
	
  	
  cat	
  |	
  	
  	
  	
  40	
  
	
  	
  fox	
  |	
  	
  	
  	
  50	
  
	
  
(4	
  rows)	
  
* RDD above saved to Cassandra:
Type Mapping
CQL Type
 Scala Type
ascii	
   String	
  
bigint	
   Long	
  
boolean	
   Boolean	
  
counter	
   Long	
  
decimal	
   BigDecimal, java.math.BigDecimal	
  
double	
   Double	
  
float	
   Float	
  
inet	
   java.net.InetAddress	
  
int	
   Int	
  
list	
   Vector, List, Iterable, Seq, IndexedSeq, java.util.List	
  
map	
   Map, TreeMap, java.util.HashMap	
  
set	
   Set, TreeSet, java.util.HashSet	
  
text, varchar	
   String	
  
timestamp	
   Long, java.util.Date, java.sql.Date, org.joda.time.DateTime	
  
timeuuid	
   java.util.UUID	
  
uuid	
   java.util.UUID	
  
varint	
   BigInt, java.math.BigInteger	
  
*nullable values	
   Option	
  
Mapping Rows to Objects
CREATE	
  TABLE	
  test.cars	
  (	
  
	
  id	
  text	
  PRIMARY	
  KEY,	
  
	
  model	
  text,	
  
	
  fuel_type	
  text,	
  
	
  year	
  int	
  
);	
  
case	
  class	
  Vehicle(	
  
	
  id:	
  String,	
  
	
  model:	
  String,	
  
	
  fuelType:	
  String,	
  
	
  year:	
  Int	
  
)	
  
	
  
sc.cassandraTable[Vehicle]("test",	
  "cars").toArray	
  
//Array(Vehicle(KF334L,	
  Ford	
  Mondeo,	
  Petrol,	
  2009),	
  
//	
  	
  	
  	
  	
  	
  Vehicle(MT8787,	
  Hyundai	
  x35,	
  Diesel,	
  2011)	
  
à
* Mapping rows to Scala Case Classes
* CQL underscore case column mapped to Scala camel case property
* Custom mapping functions (see docs)"
Server Side Data Selection
* Reduce the amount of data transferred
* Selecting columns

* Selecting rows (by clustering columns and/or secondary indexes)
sc.cassandraTable("test",	
  "users").select("username").toArray.foreach(println)	
  
//	
  CassandraRow{username:	
  john}	
  	
  
//	
  CassandraRow{username:	
  tom}	
  
sc.cassandraTable("test",	
  "cars").select("model").where("color	
  =	
  ?",	
  "black").toArray.foreach(println)	
  
//	
  CassandraRow{model:	
  Ford	
  Mondeo}	
  
//	
  CassandraRow{model:	
  Hyundai	
  x35}	
  
Shark
* SQL query engine on top of Spark
* Not part of Apache Spark
* Hive compatible (JDBC, UDFs, types, metadata, etc.)
* Supports in-memory tables
* Available as a part of DataStax Enterprise
Shark In-memory Tables
CREATE	
  TABLE	
  CachedStocks	
  TBLPROPERTIES	
  ("shark.cache"	
  =	
  "true")	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  AS	
  SELECT	
  *	
  from	
  PortfolioDemo.Stocks	
  WHERE	
  value	
  >	
  95.0;	
  
OK	
  
Time	
  taken:	
  1.215	
  seconds	
  
	
  
	
  
SELECT	
  *	
  FROM	
  CachedStocks;	
  
OK	
  
MQT	
  price	
  97.9270442241818	
  
SII	
  price	
  99.69238346610474	
  
.	
  
.	
  (123	
  additional	
  prices)	
  
.	
  
PBG	
  price	
  96.09162963505352	
  
Time	
  taken:	
  0.569	
  seconds	
  
Spark SQL vs Shark
Shark
or
Spark SQL
Streaming
 ML
Spark (General execution engine)
Graph
Cassandra
Compatible
Analytics Workload Isolation
Cassandra
+ Spark DC
Cassandra
Only DC
Online
App
Analytical
App
Mixed Load Cassandra Cluster
Analytics High Availability
* All nodes are Spark Workers
* By default resilient to Worker failures
* First Spark node promoted as Spark Master
* Standby Master promoted on failure
* Master HA available in DataStax Enterprise"
Spark Master
Spark Standby Master
Spark Worker
Questions?
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra

Mais conteúdo relacionado

Mais procurados

Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayMatthias Niehoff
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.Natalino Busa
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceDuyhai Doan
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkTim Vincent
 
Apache cassandra in 2016
Apache cassandra in 2016Apache cassandra in 2016
Apache cassandra in 2016Duyhai Doan
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Cassandra introduction 2016
Cassandra introduction 2016Cassandra introduction 2016
Cassandra introduction 2016Duyhai Doan
 
Datastax enterprise presentation
Datastax enterprise presentationDatastax enterprise presentation
Datastax enterprise presentationDuyhai Doan
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in CassandraJairam Chandar
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Jon Haddad
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryIlya Ganelin
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Data stax academy
Data stax academyData stax academy
Data stax academyDuyhai Doan
 

Mais procurados (20)

Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
 
Lightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
 
Apache cassandra in 2016
Apache cassandra in 2016Apache cassandra in 2016
Apache cassandra in 2016
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Cassandra introduction 2016
Cassandra introduction 2016Cassandra introduction 2016
Cassandra introduction 2016
 
Datastax enterprise presentation
Datastax enterprise presentationDatastax enterprise presentation
Datastax enterprise presentation
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in Cassandra
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Data stax academy
Data stax academyData stax academy
Data stax academy
 

Semelhante a Lightning fast analytics with Spark and Cassandra

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraJim Hatcher
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...DataStax
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsMatt Stubbs
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetupjlacefie
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Dax Declarative Api For Xml
Dax   Declarative Api For XmlDax   Declarative Api For Xml
Dax Declarative Api For XmlLars Trieloff
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...
Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...
Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...Codemotion
 

Semelhante a Lightning fast analytics with Spark and Cassandra (20)

Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Dax Declarative Api For Xml
Dax   Declarative Api For XmlDax   Declarative Api For Xml
Dax Declarative Api For Xml
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...
Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...
Making your elastic cluster perform - Jettro Coenradie - Codemotion Amsterdam...
 

Último

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Último (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Lightning fast analytics with Spark and Cassandra

  • 1. ©2014 DataStax Confidential. Do not distribute without consent. @rstml Rustam Aliyev Solution Architect Lightning-fast analytics with Spark forCassandraandDataStaxEnterprise 1
  • 2. What is Spark? * Apache Project since 2010 * Fast * 10x-100x faster than Hadoop MapReduce * In-memory storage * Single JVM process per node * Easy * Rich Scala, Java and Python APIs * 2x-5x less code * Interactive shell Analytic Analytic Search
  • 5. API * Resilient Distributed Datasets * Collections of objects spread across a cluster, stored in RAM or on Disk * Built through parallel transformations * Automatically rebuilt on failure * Operations * Transformations (e.g. map, filter, groupBy) * Actions (e.g. count, collect, save)
  • 6. Operator Graph: Optimization and Fault Tolerance join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: map = Cached partition= RDD
  • 7. Fast 0 500 1000 1500 2000 2500 3000 3500 4000 1 5 10 20 30 RunningTime(s) Number of Iterations Hadoop Spark 110 sec / iteration first iteration 80 sec further iterations 1 sec * Logistic Regression Performance "
  • 8. Why Spark on Cassandra? * Data model independent queries * Cross-table operations (JOIN, UNION, etc.) * Complex analytics (e.g. machine learning) * Data transformation, aggregation, etc. * Stream processing (coming soon)
  • 9. How to Spark on Cassandra? * DataStax Cassandra Spark driver * Open source: https://github.com/datastax/cassandra-driver-spark * Compatible with * Spark 0.9+ * Cassandra 2.0+ * DataStax Enterprise 4.5+
  • 10. Cassandra Spark Driver * Cassandra tables exposed as Spark RDDs * Read from and write to Cassandra * Mapping of C* tables and rows to Scala objects * All Cassandra types supported and converted to Scala types * Server side data selection * Virtual Nodes support * Scala only driver for now
  • 11. Connecting to Cassandra //  Import  Cassandra-­‐specific  functions  on  SparkContext  and  RDD  objects   import  com.datastax.driver.spark._       //  Spark  connection  options   val  conf  =  new  SparkConf(true)              .setMaster("spark://192.168.123.10:7077")              .setAppName("cassandra-­‐demo")                  .set("cassandra.connection.host",  "192.168.123.10")  //  initial  contact                  .set("cassandra.username",  "cassandra")                  .set("cassandra.password",  "cassandra")       val  sc  =  new  SparkContext(conf)  
  • 12. Accessing Data CREATE  TABLE  test.words  (word  text  PRIMARY  KEY,  count  int);     INSERT  INTO  test.words  (word,  count)  VALUES  ('bar',  30);   INSERT  INTO  test.words  (word,  count)  VALUES  ('foo',  20);   //  Use  table  as  RDD   val  rdd  =  sc.cassandraTable("test",  "words")   //  rdd:  CassandraRDD[CassandraRow]  =  CassandraRDD[0]     rdd.toArray.foreach(println)   //  CassandraRow[word:  bar,  count:  30]   //  CassandraRow[word:  foo,  count:  20]     rdd.columnNames        //  Stream(word,  count)     rdd.size                      //  2     val  firstRow  =  rdd.first    //  firstRow:  CassandraRow  =  CassandraRow[word:  bar,  count:  30]   firstRow.getInt("count")    //  Int  =  30   * Accessing table above as RDD:
  • 13. Saving Data val  newRdd  =  sc.parallelize(Seq(("cat",  40),  ("fox",  50)))   //  newRdd:  org.apache.spark.rdd.RDD[(String,  Int)]  =  ParallelCollectionRDD[2]     newRdd.saveToCassandra("test",  "words",  Seq("word",  "count"))   SELECT  *  FROM  test.words;      word  |  count   -­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐      bar  |        30      foo  |        20      cat  |        40      fox  |        50     (4  rows)   * RDD above saved to Cassandra:
  • 14. Type Mapping CQL Type Scala Type ascii   String   bigint   Long   boolean   Boolean   counter   Long   decimal   BigDecimal, java.math.BigDecimal   double   Double   float   Float   inet   java.net.InetAddress   int   Int   list   Vector, List, Iterable, Seq, IndexedSeq, java.util.List   map   Map, TreeMap, java.util.HashMap   set   Set, TreeSet, java.util.HashSet   text, varchar   String   timestamp   Long, java.util.Date, java.sql.Date, org.joda.time.DateTime   timeuuid   java.util.UUID   uuid   java.util.UUID   varint   BigInt, java.math.BigInteger   *nullable values   Option  
  • 15. Mapping Rows to Objects CREATE  TABLE  test.cars  (    id  text  PRIMARY  KEY,    model  text,    fuel_type  text,    year  int   );   case  class  Vehicle(    id:  String,    model:  String,    fuelType:  String,    year:  Int   )     sc.cassandraTable[Vehicle]("test",  "cars").toArray   //Array(Vehicle(KF334L,  Ford  Mondeo,  Petrol,  2009),   //            Vehicle(MT8787,  Hyundai  x35,  Diesel,  2011)   à * Mapping rows to Scala Case Classes * CQL underscore case column mapped to Scala camel case property * Custom mapping functions (see docs)"
  • 16. Server Side Data Selection * Reduce the amount of data transferred * Selecting columns * Selecting rows (by clustering columns and/or secondary indexes) sc.cassandraTable("test",  "users").select("username").toArray.foreach(println)   //  CassandraRow{username:  john}     //  CassandraRow{username:  tom}   sc.cassandraTable("test",  "cars").select("model").where("color  =  ?",  "black").toArray.foreach(println)   //  CassandraRow{model:  Ford  Mondeo}   //  CassandraRow{model:  Hyundai  x35}  
  • 17. Shark * SQL query engine on top of Spark * Not part of Apache Spark * Hive compatible (JDBC, UDFs, types, metadata, etc.) * Supports in-memory tables * Available as a part of DataStax Enterprise
  • 18. Shark In-memory Tables CREATE  TABLE  CachedStocks  TBLPROPERTIES  ("shark.cache"  =  "true")                            AS  SELECT  *  from  PortfolioDemo.Stocks  WHERE  value  >  95.0;   OK   Time  taken:  1.215  seconds       SELECT  *  FROM  CachedStocks;   OK   MQT  price  97.9270442241818   SII  price  99.69238346610474   .   .  (123  additional  prices)   .   PBG  price  96.09162963505352   Time  taken:  0.569  seconds  
  • 19. Spark SQL vs Shark Shark or Spark SQL Streaming ML Spark (General execution engine) Graph Cassandra Compatible
  • 20. Analytics Workload Isolation Cassandra + Spark DC Cassandra Only DC Online App Analytical App Mixed Load Cassandra Cluster
  • 21. Analytics High Availability * All nodes are Spark Workers * By default resilient to Worker failures * First Spark node promoted as Spark Master * Standby Master promoted on failure * Master HA available in DataStax Enterprise" Spark Master Spark Standby Master Spark Worker