Escape from Hadoop

Escape From Hadoop:
Spark One Liners for C* Ops
Kurt Russell Spitzer
DataStax

Who am I?
• Bioinformatics Ph.D from UCSF
• Works on the integration of
Cassandra (C*) with Hadoop,
Solr, and SPARK!!
• Spends a lot of time spinning
up clusters on EC2, GCE,
Azure, …
http://www.datastax.com/dev/
blog/testing-cassandra-1000-
nodes-at-a-time
• Developing new ways to make
sure that C* Scales

Why escape from Hadoop?
HADOOP
Many Moving Pieces
Map Reduce
Single Points of Failure
Lots of Overhead
And there is a way out!

Spark Provides a Simple and Efficient
framework for Distributed Computations
Node Roles 2
In Memory Caching Yes!
Generic DAG Execution Yes!
Great Abstraction For Datasets? RDD!
Spark
Worker
Spark
Worker
Spark
Master
Spark
Worker
Resilient Distributed
Dataset
Spark Executor

Spark is Compatible with HDFS,
Parquet, CSVs, ….

Spark is Compatible with HDFS,
Parquet, CSVs, ….
AND
APACHE CASSANDRA
Apache
Cassandra

Apache Cassandra is a Linearly Scaling
and Fault Tolerant noSQL Database
Linearly Scaling:
The power of the database
increases linearly with the
number of machines
2x machines = 2x throughput
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Fault Tolerant:
Nodes down != Database Down
Datacenter down != Database Down

Apache Cassandra
Architecture is Very Simple
Node Roles 1
Replication Tunable
Replication
Consistency Tunable
C*
C* C*
C*
Client

DataStax OSS Connector
Spark to Cassandra
https://github.com/datastax/spark4cassandra4connector
Cassandra Spark
Keyspace Table
RDD[CassandraRow]
RDD[Tuples]
Bundled9and9Supported9with9DSE94.5!

Spark Cassandra Connector uses the
DataStax Java Driver to Read from and
Write to C*
Spark C*
Full Token
Range
Each Executor Maintains
a connection to the C*
Cluster
Spark
Executor
DataStax
Java Driver
Tokens 1001 -2000
Tokens 1-1000
Tokens …
RDD’s read into different
splits based on sets of
tokens

Co-locate Spark and C* for
Best Performance
C*
C* C*
Spark
Worker
C*
Spark
Worker
Spark
Master
Spark
Running Spark Workers Worker
on
the same nodes as your
C* Cluster will save
network hops when
reading and writing

Setting up C* and Spark
DSE > 4.5.0
Just start your nodes with
dse cassandra -k
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html

We need a Distributed System
For Analytics and Batch Jobs
But it doesn’t have to be complicated!

Even count needs to be
distributed
Ask me to write a Map Reduce
for word count, I dare you.
You could make this easier by adding yet another
technology to your Hadoop Stack (hive, pig, impala) or
we could just do one liners on the spark shell.

Basics: Getting a Table and
Counting
CREATE&KEYSPACE&newyork&WITH&replication&=&{'class':&'SimpleStrategy',&'replication_factor':&1&};&
use&newyork;&
CREATE&TABLE&presidentlocations&(&time&int,&location&text&,&PRIMARY&KEY&time&);&
INSERT&INTO&presidentlocations&(time,&location&)&VALUES&(&1&,&'White&House'&);&
INSERT&INTO&presidentlocations&(time,&location&)&VALUES&(&5&,&'Air&Force&1'&);&
INSERT&INTO&presidentlocations&(time,&location&)&VALUES&(&8&,&'NYC'&&);&
INSERT&INTO&presidentlocations&(time,&location&)&VALUES&(&10&,&'NYC'&&);

Counting
use&newyork;&
scala>&sc.cassandraTable(“newyork","presidentlocations")&
& &
cassandraTable

Counting
use&newyork;&
& .count&
res3:&Long&=&10
cassandraTable
count
10

Basics: take() and toArray
scala>&sc.cassandraTable("newyork","presidentlocations")&
cassandraTable

scala>&sc.cassandraTable("newyork","presidentlocations").take(1)&
!
res2:&Array[com.datastax.spark.connector.CassandraRow]&=&Array(CassandraRow{time:&9,&location:&NYC})
cassandraTable
take(1)
Array of CassandraRows
9 NYC

!
cassandraTable
take(1)
9 NYC
scala>&sc.cassandraTable(“newyork","presidentlocations")
cassandraTable

!
cassandraTable
take(1)
9 NYC
scala>&sc.cassandraTable(“newyork","presidentlocations").toArray&
!
res3:&Array[com.datastax.spark.connector.CassandraRow]&=&Array(&
& CassandraRow{time:&9,&location:&NYC},&&
& CassandraRow{time:&3,&location:&White&House},&&
& …,
& CassandraRow{time:&6,&location:&Air&Force&1})
cassandraTable
toArray
9 NYC
99 NNYYCC 99 NNYYCC

Basics: Getting Row Values
out of a CassandraRow
scala>&sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time")&
!
res5:&Int&=&9
cassandraTable
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html

!
res5:&Int&=&9
cassandraTable
take(1)
9 NYC

!
res5:&Int&=&9
cassandraTable
take(1)
9 NYC
9
get[Int]
get[Int]
get[String]
…
get[Any]
Got Null ?
get[Option[Int]]

Copy A Table
Say we want to restructure our table or add a new column?
CREATE&TABLE&characterlocations&(&
& time&int,&&
& character&text,&&
& location&text,&&
& PRIMARY&KEY&(time,character)&
);

Copy A Table
& time&int,&&
& character&text,&&
& location&text,&&
);
sc.cassandraTable(“newyork","presidentlocations")&
& .map(&row&=>&(&
& & & row.get[Int](“time"),&
& & & "president",&&
& & & row.get[String](“location")&
& )).saveToCassandra("newyork","characterlocations")
cassandraTable
1 white house

Copy A Table
& time&int,&&
& character&text,&&
& location&text,&&
);
& .map(&row&=>&(&
cassandraTable
get[Int] get[String]
1 white house
1,president,white house

C*
Copy A Table
& time&int,&&
& character&text,&&
& location&text,&&
);
& .map(&row&=>&(&
cassandraTable
1 white house
saveToCassandra

C*
Copy A Table
& time&int,&&
& character&text,&&
& location&text,&&
);
& .map(&row&=>&(&
cqlsh:newyork>&SELECT&*&FROM&characterlocations&;&
!
&time&|&character&|&location&
kkkkkk+kkkkkkkkkkk+kkkkkkkkkkkkk&
&&&&5&|&president&|&Air&Force&1&
&&&10&|&president&|&&&&&&&&&NYC&
…&
…&
cassandraTable
1 white house
saveToCassandra

Filter a Table
What if we want to filter based on a
non-clustering key column?
& .filter(&_.get[Int]("time")&>&7&)&
& .toArray&
!
res9:&Array[com.datastax.spark.connector.CassandraRow]&=&&
Array(&
& CassandraRow{time:&8,&location:&NYC}&
)
cassandraTable

Filter a Table
& .toArray&
!
Array(&
)
cassandraTable
Filter

Filter a Table
& .toArray&
!
Array(&
)
cassandraTable
Filter
_ (Anonymous Param)
1 white house

Filter a Table
& .toArray&
!
Array(&
)
cassandraTable
Filter
1 white house
get[Int]
1
_ (Anonymous Param)

Filter a Table
& .toArray&
!
Array(&
)
cassandraTable
_ (Anonymous Param) >7
1 white house
get[Int]
1
Filter

Backfill a Table with a
Different Key!
CREATE&TABLE&timelines&(&
&&time&int,&
&&character&text,&
&&location&text,&
&&PRIMARY&KEY&((character),&time)&
)
If we actually want to have quick
access to timelines we need a
C* table with a different
structure.

Different Key!
&&time&int,&
&&character&text,&
&&location&text,&
)
structure.
sc.cassandraTable(“newyork","characterlocations")&
& .saveToCassandra("newyork","timelines")
1 white house
cassandraTable
president

Different Key!
&&time&int,&
&&character&text,&
&&location&text,&
)
structure.
1 white house
cassandraTable
saveToCassandra
president C*

Import a CSV
I have some data in another source which I
could really use in my Cassandra table
sc.textFile(“file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv”)&
& .map(_.split(“,"))&
& .map(&line&=>&&
& & (line(0),line(1),line(2)))&
textFile

Import a CSV
& .map(_.split(“,"))&
& .map(&line&=>&&
& & (line(0),line(1),line(2)))&
textFile
Map
plissken,1,Federal Reserve
split
plissken 1 Federal Reserve

Import a CSV
& .map(_.split(“,"))&
& .map(&line&=>&&
& & (line(0),line(1),line(2)))&
textFile
Map
split

Import a CSV
& .map(_.split(“,"))&
& .map(&line&=>&&
& & (line(0),line(1),line(2)))&
textFile
Map
split
saveToCassandra
C*

Import a CSV
& .map(_.split(“,"))&
& .map(&line&=>&&
& & (line(0),line(1),line(2)))&
textFile
Map
plissken,1,white house
split
plissken 1 white house
plissken,1,white house
saveToCassandra
C*
cqlsh:newyork>&select&*&from&timelines&where&character&=&'plissken';&
!
&character&|&time&|&location&
kkkkkkkkkkk+kkkkkk+kkkkkkkkkkkkkkkkk&
&&plissken&|&&&&1&|&Federal&Reserve&
&&plissken&|&&&&4&|&&&&&&&&&&&Court&
&&plissken&|&&&&5&|&&&&&&&&&&&Court&
&&plissken&|&&&&6&|&&&&&&&&&&&Court&
&&plissken&|&&&&7&|&&&&&&&&&&&Court&
&&plissken&|&&&&8&|&&Stealth&Glider&
&&plissken&|&&&&9&|&&&&&&&&&&&&&NYC&
&&plissken&|&&&10&|&&&&&&&&&&&&&NYC

Perform a Join with MySQL
Maybe a little more than one line …
MySQL Table “quotes” in “escape_from_ny”
import&java.sql._&
import&org.apache.spark.rdd.JdbcRDD&
Class.forName(“com.mysql.jdbc.Driver”).newInstance();//Connector/J&added&toSpark&Shell&Classpath&
val&quotes&=&new&JdbcRDD(&
& sc,&&
& ()&=>&{&
& & DriverManager.getConnection("jdbc:mysql://Localhost/escape_from_ny?user=root")},&&
& "SELECT&*&FROM&quotes&WHERE&?&<=&ID&and&ID&<=&?”,&
& 0,&
& 100,&
& 5,&&
& (r:&ResultSet)&=>&{&
& & (r.getInt(2),r.getString(3))&
& }&
)&
!
quotes:&org.apache.spark.rdd.JdbcRDD[(Int,&String)]&=&JdbcRDD[9]&at&JdbcRDD&at&<console>:23&

!
quotes.join(&
& sc.cassandraTable(“newyork","timelines")&
& .filter(&_.get[String]("character")&==&“plissken")&
& .map(&row&=>&(row.get[Int](“time"),row.get[String]("location"))))&
& .take(1)&
& .foreach(println)&
!
(5,&
& (Bob&Hauk:&& There&was&an&accident.&&
& & & About&an&hour&ago,&a&small&jet&went&down&inside&New&York&City.&&
& & & The&President&was&on&board.&
& &Snake&Plissken:&The&president&of&what?,&
& Court)&
)
cassandraTable
JdbcRDD
Needs to be in the form of RDD[K,V]
5, ‘Bob Hauk: …'

!
quotes.join(&
& .take(1)&
!
(5,&
& Court)&
)
cassandraTable
JdbcRDD
plissken,5,court
5,court

!
quotes.join(&
& .take(1)&
!
(5,&
& Court)&
)
cassandraTable
JdbcRDD
plissken,5,court
5,court 5,(‘Bob Hauk: …’,court)

Easy Objects with Case
Classes
We have the technology to make this even easier!
case&class&timelineRow&&(character:String,&time:Int,&location:String)&
sc.cassandraTable[timelineRow](“newyork","timelines")&
& .filter(&_.character&==&“plissken")&
& .filter(&_.time&==&8)&
& .toArray&
res13:&Array[timelineRow]&=&Array(timelineRow(plissken,8,Stealth&Glider))
timelineRow
character,time,location

Classes
& .toArray&
cassandraTable[timelineRow]
timelineRow

Classes
& .toArray&
timelineRow
filter
character == plissken

Classes
& .toArray&
timelineRow
filter
time == 8

Classes
& .toArray&
timelineRow
filter
time == 8
character:plissken,time:8,location: Stealth Glider

Classes
& .toArray&
The Future
timelineRow
filter
time == 8
character:plissken,time:8,location: Stealth Glider

A Map Reduce for Word
Count …
scala>&sc.cassandraTable(“newyork”,"presidentlocations")&
& .map(&_.get[String](“location”)&)&
& .flatMap(&_.split(“&“))&
& .map(&(_,1))&
& .reduceByKey(&_&+&_&)&
& .toArray&
res17:&Array[(String,&Int)]&=&Array((1,3),&(House,4),&(NYC,3),&(Force,3),&(White,4),&(Air,3))
cassandraTable

Count …
& .map(&(_,1))&
& .toArray&
1 white house
cassandraTable
get[String]

Count …
& .map(&(_,1))&
& .toArray&
1 white house
white house
cassandraTable
get[String]
_.split()

Count …
& .map(&(_,1))&
& .toArray&
1 white house
white house
white, 1 house, 1
cassandraTable
get[String]
_.split()
(_,1)

Count …
& .map(&(_,1))&
& .toArray&
1 white house
white house
white, 1 house, 1
house, 1 house, 1
house, 2
cassandraTable
get[String]
_.split()
(_,1)
_ + _

Stand Alone App Example
https://github.com/RussellSpitzer/spark4cassandra4csv
Car,:Model,:Color
Dodge,:Caravan,:Red:
Ford,:F150,:Black:
Toyota,:Prius,:Green
Spark SCC
RDD:
[CassandraRow]
!!!
FavoriteCars
Table
Cassandra
Column:Mapping
CSV

Thanks for listening!
There is plenty more we can do with Spark but …
Questions?

Getting started with Cassandra?!
DataStax Academy offers free online Cassandra training!
Planet Cassandra has resources for learning the basics from ‘Try Cassandra’ tutorials to in depth
language and migration pages!
Find a way to contribute back to the community: talk at a meetup, or share your story on
PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra’s free virtual office hours running weekly!
Email us: Community@DataStax.com!
Thanks for coming to the meetup!!
In production?!
Tweet us: @PlanetCassandra!

Thanks:for:your:Time:and:Come:to:C*:Summit!:
SEPTEMBER91094911,9201499|99SAN9FRANCISCO,9CALIF.99|99THE9WESTIN9ST.9FRANCIS9HOTEL
Cassandra:Summit:Link

Escape from Hadoop

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (19)

Semelhante a Escape from Hadoop

Semelhante a Escape from Hadoop (20)

Mais de DataStax Academy

Mais de DataStax Academy (20)

Último

Último (20)

Escape from Hadoop