2. @doanduyhai
Who Am I?
Duy Hai DOAN
Apache Cassandra Evangelist
• talks, meetups, conferences…
• open-source projects (Achilles, Apache Zeppelin ...)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
3. The HDB++ project
• What is a synchrotron?
• HDB++ project presentation
• Why Spark, Cassandra and Zeppelin?
4. @doanduyhai
What is a synchrotron?
• particle accelerator (electrons)
• electron beams used for crystallographic analysis of:
• materials
• molecular biology
• …
7. @doanduyhai
The HDB++ project
• Sub-project of TANGO, a software toolkit to
• connect
• control/monitor
• integrate sensor devices
• HDB++ = new TANGO event-driven archiving system
• historically used MySQL
• now stores data into Cassandra
17. @doanduyhai
Statistics table
INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'DAY', '2016-06-28', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'HOUR', '2016-06-28:01', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'MONTH', '2016-06', 123.456);

// Request by period of time
SELECT * FROM hdbtest.stat_scalar_devshort_ro
WHERE att_conf_id = xxx
AND type_period = 'DAY'
AND period > '2016-06-20' AND period < '2016-06-28';
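
The slides do not show the DDL for this table; the sketch below is one plausible layout inferred from the queries above (the column types, the timeuuid guess for att_conf_id and the composite partition key are all assumptions, not the project's actual schema), executed through the Spark-Cassandra connector from a shell or Zeppelin session where sc already exists:

import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =>
  // partitioning on (att_conf_id, type_period) with period as clustering
  // column is what allows the range query on period shown above
  session.execute("""
    CREATE TABLE IF NOT EXISTS hdbtest.stat_scalar_devshort_ro(
      att_conf_id timeuuid,
      type_period text,
      period text,
      count_point bigint,
      count_error bigint,
      count_distinct_error bigint,
      value_r_min double,
      value_r_max double,
      value_r_mean double,
      value_r_sd double,
      PRIMARY KEY ((att_conf_id, type_period), period)
    )""")
}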
21. @doanduyhai
Source code
val devShortRo = sqlContext.sql(s"""
SELECT "DAY" AS type_period, att_conf_id, period,
count(att_conf_id) AS count_point,
count(error_desc) AS count_error,
count(DISTINCT error_desc) AS count_distinct_error,
min(value_r) AS value_r_min, max(value_r) AS value_r_max,
avg(value_r) AS value_r_mean,
stddev(value_r) AS value_r_sd
FROM att_scalar_devshort_ro
WHERE period="${day}"
GROUP BY att_conf_id, period""")
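
The slide shows only the aggregation; here is a sketch of the plumbing around it, assuming the raw table is first registered as a Spark SQL temporary table and the result is appended back into the statistics table through the connector's DataFrame source (Spark 1.x API):

import org.apache.spark.sql.SaveMode

// expose the raw Cassandra table to Spark SQL under the name used above
sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "hdbtest", "table" -> "att_scalar_devshort_ro"))
  .load()
  .registerTempTable("att_scalar_devshort_ro")

// ... run the aggregation above, then append the result
devShortRo.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "hdbtest", "table" -> "stat_scalar_devshort_ro"))
  .mode(SaveMode.Append)
  .save()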
30. @doanduyhai
Zeppelin/Spark/Cassandra
• Zeppelin build mode: standard, Spark run mode: local
• on Spark interpreter init, all declared dependencies are fetched from the declared repositories (default = Maven Central + local Maven repo), e.g. via a %dep paragraph as sketched below
• beware of the corporate FIREWALL! 💣
• where are the downloaded dependencies (jars) stored? 💡
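
One way to declare the connector is a %dep paragraph run before the Spark interpreter starts; a minimal sketch, where the artifact coordinates are real but the version must be picked to match your Spark build:

%dep
z.reset()
// resolved at interpreter init from Maven Central (or the local Maven repo)
z.load("com.datastax.spark:spark-cassandra-connector_2.10:1.6.0")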
32. @doanduyhai
Zeppelin/Spark/Cassandra
• Zeppelin build mode: standard, Spark run mode: cluster
• run in local mode at least ONCE so that Zeppelin can download the dependencies into its local repo (zeppelin.interpreter.localRepo)! 💣
33. @doanduyhai
Zeppelin/Spark/Cassandra
• Zeppelin build mode: with connector, Spark run mode: local or cluster
• runs smoothly, because all Spark-Cassandra connector dependencies are merged into the interpreter/spark/dep/zeppelin-spark-dependencies-x.y.z.jar fat jar during the build process 💡
35. @doanduyhai
Zeppelin/Spark/Cassandra
• OSS Spark
• you need to provide all transitive dependencies of the Spark-Cassandra connector yourself! 💣
• in conf/spark-env.sh
• or with the spark-submit --packages groupId:artifactId:version option (see the sketch below)
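
For the spark-submit route, a minimal sketch (the class name and jar are placeholders, and the connector version must match your Spark version); --packages resolves the artifact plus all of its transitive dependencies from Maven Central:

spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
  --class com.example.StatJob \
  stat-job.jar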
38. @doanduyhai
Spark/Cassandra
• Spark client deploy mode
• the default
• needs to ship all driver program dependencies to the workers (network intensive)
• suitable for REPLs (Spark shell, Zeppelin)
• suitable for one-shot jobs/testing
39. @doanduyhai
Spark/Cassandra
• Spark cluster deploy mode
• the driver program runs on a worker node
• all driver program dependencies must be reachable by every worker
• dependencies are usually stored in HDFS, but can also be stored on the local FS of all workers
• suitable for recurrent jobs
• needs a consistent build & deploy process for your jobs (a spark-submit sketch follows)
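
A cluster-mode submission could look like the sketch below (master URL, class name and jar path are placeholders); building an assembly (fat) jar and publishing it to HDFS is one way to make the job and its dependencies reachable by every worker:

spark-submit \
  --deploy-mode cluster \
  --master spark://master-host:7077 \
  --class com.example.StatJob \
  hdfs:///jobs/stat-job-assembly.jar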
40. @doanduyhai
Spark/Cassandra
• The job fails when using spark-submit
• but succeeds with Zeppelin …
• error: value stddev not found
val devShortRo = sqlContext.sql(s"""
SELECT "DAY" AS type_period, att_conf_id, period,
count(att_conf_id) AS count_point,
count(error_desc) AS count_error,
count(DISTINCT error_desc) AS count_distinct_error,
min(value_r) AS value_r_min, max(value_r) AS value_r_max,
avg(value_r) AS value_r_mean,
stddev(value_r) AS value_r_sd
FROM att_scalar_devshort_ro
WHERE period="${day}"
GROUP BY att_conf_id, period""")
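
stddev only became a built-in Spark SQL function in Spark 1.6, so one plausible explanation (the slides do not state the versions involved) is that the cluster's Spark is older than the one bundled into Zeppelin. A workaround sketch for pre-1.6 SQL: expand the standard deviation from avg, keeping in mind this yields the population variant rather than stddev's sample semantics:

val devShortRo = sqlContext.sql(s"""
SELECT "DAY" AS type_period, att_conf_id, period,
count(att_conf_id) AS count_point,
count(error_desc) AS count_error,
count(DISTINCT error_desc) AS count_distinct_error,
min(value_r) AS value_r_min, max(value_r) AS value_r_max,
avg(value_r) AS value_r_mean,
sqrt(avg(value_r * value_r) - avg(value_r) * avg(value_r)) AS value_r_sd
FROM att_scalar_devshort_ro
WHERE period="${day}"
GROUP BY att_conf_id, period""")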