2. @doanduyhai
Who Am I?
Duy Hai DOAN
Apache Cassandra Evangelist
• talks, meetups, conferences…
• open-source projects (Achilles, Apache Zeppelin ...)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
3. The HDB++ project
• What is a synchrotron?
• HDB++ project presentation
• Why Spark, Cassandra and Zeppelin?
4. @doanduyhai
What is a synchrotron?
• particle accelerator (electrons)
• electron beams used for crystallographic analysis of:
• materials
• molecular biology
• …
7. @doanduyhai
The HDB++ project
• Sub-project of TANGO, a software toolkit to
• connect
• control/monitor
• integrate sensor devices
• HDB++ = new TANGO event-driven archiving system
• historically used MySQL
• now stores data into Cassandra
17. @doanduyhai
Statistics table
INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'DAY', '2016-06-28', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'HOUR', '2016-06-28:01', 123.456);

INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
VALUES(xxxx, 'MONTH', '2016-06', 123.456);

// Request by period of time
SELECT * FROM hdbtest.stat_scalar_devshort_ro
WHERE att_conf_id = xxx
AND type_period = 'DAY'
AND period > '2016-06-20' AND period < '2016-06-28';
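
The slides do not show the DDL for this table; the sketch below is one plausible layout inferred from the queries above (the column types, the timeuuid guess for att_conf_id and the composite partition key are all assumptions, not the project's actual schema), executed through the Spark-Cassandra connector from a shell or Zeppelin session where sc already exists:

import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =>
  // partitioning on (att_conf_id, type_period) with period as clustering
  // column is what allows the range query on period shown above
  session.execute("""
    CREATE TABLE IF NOT EXISTS hdbtest.stat_scalar_devshort_ro(
      att_conf_id timeuuid,
      type_period text,
      period text,
      count_point bigint,
      count_error bigint,
      count_distinct_error bigint,
      value_r_min double,
      value_r_max double,
      value_r_mean double,
      value_r_sd double,
      PRIMARY KEY ((att_conf_id, type_period), period)
    )""")
}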
21. @doanduyhai
Source code
val devShortRo = sqlContext.sql(s"""
SELECT "DAY" AS type_period, att_conf_id, period,
count(att_conf_id) AS count_point,
count(error_desc) AS count_error,
count(DISTINCT error_desc) AS count_distinct_error,
min(value_r) AS value_r_min, max(value_r) AS value_r_max,
avg(value_r) AS value_r_mean,
stddev(value_r) AS value_r_sd
FROM att_scalar_devshort_ro
WHERE period="${day}"
GROUP BY att_conf_id, period""")
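
The slide shows only the aggregation; here is a sketch of the plumbing around it, assuming the raw table is first registered as a Spark SQL temporary table and the result is appended back into the statistics table through the connector's DataFrame source (Spark 1.x API):

import org.apache.spark.sql.SaveMode

// expose the raw Cassandra table to Spark SQL under the name used above
sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "hdbtest", "table" -> "att_scalar_devshort_ro"))
  .load()
  .registerTempTable("att_scalar_devshort_ro")

// ... run the aggregation above, then append the result
devShortRo.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "hdbtest", "table" -> "stat_scalar_devshort_ro"))
  .mode(SaveMode.Append)
  .save()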
30. @doanduyhai
Zeppelin/Spark/Cassandra
• Zeppelin build mode: standard, Spark run mode: local
• on Spark interpreter init, all declared dependencies are fetched from the declared repositories (default = Maven Central + local Maven repo), e.g. via a %dep paragraph as sketched below
• beware of the corporate FIREWALL! 💣
• where are the downloaded dependencies (jars) stored? 💡
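
One way to declare the connector is a %dep paragraph run before the Spark interpreter starts; a minimal sketch, where the artifact coordinates are real but the version must be picked to match your Spark build:

%dep
z.reset()
// resolved at interpreter init from Maven Central (or the local Maven repo)
z.load("com.datastax.spark:spark-cassandra-connector_2.10:1.6.0")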
32. @doanduyhai
Zeppelin/Spark/Cassandra
• Zeppelin build mode: standard, Spark run mode: cluster
• run in local mode at least ONCE so that Zeppelin can download the dependencies into its local repo (zeppelin.interpreter.localRepo)! 💣
33. @doanduyhai
Zeppelin/Spark/Cassandra
• Zeppelin build mode: with connector, Spark run mode: local or cluster
• runs smoothly, because all Spark-Cassandra connector dependencies are merged into the interpreter/spark/dep/zeppelin-spark-dependencies-x.y.z.jar fat jar during the build process 💡
35. @doanduyhai
Zeppelin/Spark/Cassandra
• OSS Spark
• you need to provide all transitive dependencies of the Spark-Cassandra connector yourself! 💣
• in conf/spark-env.sh
• or with the spark-submit --packages groupId:artifactId:version option (see the sketch below)
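
For the spark-submit route, a minimal sketch (the class name and jar are placeholders, and the connector version must match your Spark version); --packages resolves the artifact plus all of its transitive dependencies from Maven Central:

spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
  --class com.example.StatJob \
  stat-job.jar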
38. @doanduyhai
Spark/Cassandra
• Spark client deploy mode
• the default
• needs to ship all driver program dependencies to the workers (network intensive)
• suitable for REPLs (Spark shell, Zeppelin)
• suitable for one-shot jobs/testing
39. @doanduyhai
Spark/Cassandra
• Spark cluster deploy mode
• the driver program runs on a worker node
• all driver program dependencies must be reachable by every worker
• dependencies are usually stored in HDFS, but can also be stored on the local FS of all workers
• suitable for recurrent jobs
• needs a consistent build & deploy process for your jobs (a spark-submit sketch follows)
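
A cluster-mode submission could look like the sketch below (master URL, class name and jar path are placeholders); building an assembly (fat) jar and publishing it to HDFS is one way to make the job and its dependencies reachable by every worker:

spark-submit \
  --deploy-mode cluster \
  --master spark://master-host:7077 \
  --class com.example.StatJob \
  hdfs:///jobs/stat-job-assembly.jar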
40. @doanduyhai
Spark/Cassandra
• The job fails when using spark-submit
• but succeeds with Zeppelin …
• error: value stddev not found
val devShortRo = sqlContext.sql(s"""
SELECT "DAY" AS type_period, att_conf_id, period,
count(att_conf_id) AS count_point,
count(error_desc) AS count_error,
count(DISTINCT error_desc) AS count_distinct_error,
min(value_r) AS value_r_min, max(value_r) AS value_r_max,
avg(value_r) AS value_r_mean,
stddev(value_r) AS value_r_sd
FROM att_scalar_devshort_ro
WHERE period="${day}"
GROUP BY att_conf_id, period""")
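
stddev only became a built-in Spark SQL function in Spark 1.6, so one plausible explanation (the slides do not state the versions involved) is that the cluster's Spark is older than the one bundled into Zeppelin. A workaround sketch for pre-1.6 SQL: expand the standard deviation from avg, keeping in mind this yields the population variant rather than stddev's sample semantics:

val devShortRo = sqlContext.sql(s"""
SELECT "DAY" AS type_period, att_conf_id, period,
count(att_conf_id) AS count_point,
count(error_desc) AS count_error,
count(DISTINCT error_desc) AS count_distinct_error,
min(value_r) AS value_r_min, max(value_r) AS value_r_max,
avg(value_r) AS value_r_mean,
sqrt(avg(value_r * value_r) - avg(value_r) * avg(value_r)) AS value_r_sd
FROM att_scalar_devshort_ro
WHERE period="${day}"
GROUP BY att_conf_id, period""")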