JAN 2015
making big data simple
Databricks Cloud:
“A unified platform for building Big Data pipelines
– from ETL to Exploration and Dashboards, to
Advanced Analytics and Data Products.”
• Founded in late 2013
• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~45 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Cloudera
• Hortonworks
• MapR
• DataStax
Topics include:
• Using cloud-based notebooks to develop Enterprise data workflows
• Spark integration with Cassandra, Kafka, Elasticsearch
• Advanced use cases with Spark SQL and Spark Streaming
• Operationalizing Spark on DataStax, Cloudera, MapR, etc.
• Monitoring and evaluating performance metrics
• Estimating cluster resource requirements
• Debugging and troubleshooting Spark apps
• Case studies for production deployments of Spark
• Preparation for Apache Spark developer certification exam
- 3 days of training (Tues – Thurs)
- $2,795
- Limited to 50 seats
- ~40% hands-on labs
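Berkeley Data Analytics Stack
Spark Core Engine (Scala / Python / Java)
Libraries: Spark SQL, BlinkDB (Approx. SQL), Spark Streaming, MLlib (machine learning), GraphX (graph computation), SparkR (R on Spark)
Storage: HDFS, S3, Tachyon (off-heap RDD)
Formats: text, JSON/CSV, Protocol Buffers
Schedulers: Local, Standalone Scheduler, YARN, Mesos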
DEMO #1:
June 2010
“The main abstraction in Spark is that of a resilient
distributed dataset (RDD), which represents a read-only
collection of objects partitioned across a set of
machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across
machines and reuse it in multiple MapReduce-like
parallel operations.
RDDs achieve fault tolerance through a notion of
lineage: if a partition of an RDD is lost, the RDD has
enough information about how it was derived from
other RDDs to be able to rebuild just that partition.”
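The lineage idea in the quote above can be sketched in plain Python. This is a toy, single-process model rather than Spark's implementation: the class name ToyRDD and its methods compute and lose_partition are hypothetical names chosen for illustration only.

```python
class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # list of partitions, only set for a root dataset
        self.parent = parent        # lineage: the dataset this one was derived from
        self.transform = transform  # lineage: how to derive one partition from the parent's
        self.cache = {}             # partition index -> materialized records

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if it is not cached."""
        if index not in self.cache:
            if self.parent is None:
                self.cache[index] = list(self.source[index])
            else:
                self.cache[index] = self.transform(self.parent.compute(index))
        return self.cache[index]

    def lose_partition(self, index):
        """Simulate losing one cached partition (for example, a machine failure)."""
        self.cache.pop(index, None)


numbers = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.compute(1)  # materialize partition 1 only
evens_squared.lose_partition(1)    # drop it
after = evens_squared.compute(1)   # rebuilt from lineage; partition 0 is never computed
```

Only the lost partition is recomputed, and it is rebuilt by replaying the recorded transformations on the parent's matching partition, which is the behaviour the quote describes.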
April 2012
“We present Resilient Distributed Datasets (RDDs), a
distributed memory abstraction that lets
programmers perform in-memory computations on
large clusters in a fault-tolerant manner.
RDDs are motivated by two types of applications
that current computing frameworks handle
inefficiently: iterative algorithms and interactive data
mining tools.
In both cases, keeping data in memory can improve
performance by an order of magnitude.”
“Best Paper Award and Honorable Mention for Community Award”
- NSDI 2012
- Cited 392 times!
- 2 Streaming Paper(s) have been cited 138 times
sqlCtx = new HiveContext(sc)
results = sqlCtx.sql(
"SELECT * FROM people")
names = results.map(lambda p: p.name)
Seamlessly mix SQL queries with Spark programs.
Coming soon!
(Will be published in the upcoming
weeks for SIGMOD 2015)
eBook: $31.99
Print: $39.99
Early release available now!
Physical copy est Feb 2015
(Scala & Python only)
Driver Program
Worker Machine
Worker Machine
An RDD can be created 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc)
Partition 1: Error, ts, msg1 | Warn, ts, msg2 | Error, ts, msg1
Partition 2: Info, ts, msg8 | Warn, ts, msg2 | Info, ts, msg8
Partition 3: Error, ts, msg3 | Info, ts, msg5 | Info, ts, msg5
Partition 4: Error, ts, msg4 | Warn, ts, msg9 | Error, ts, msg1
RDDs are made of multiple partitions (more partitions = more parallelism)
# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])
// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))
// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
- Take an existing in-memory
collection and pass it to
SparkContext’s parallelize method
- Not generally used
outside of prototyping
and testing since it
requires entire dataset in
memory on one machine
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/");
- There are other methods
to read data from HDFS,
C*, S3, HBase, etc.
RDD (4 partitions)
Partition 1: Error, ts, msg1 | Warn, ts, msg2 | Error, ts, msg1
Partition 2: Info, ts, msg8 | Warn, ts, msg2 | Info, ts, msg8
Partition 3: Error, ts, msg3 | Info, ts, msg5 | Info, ts, msg5
Partition 4: Error, ts, msg4 | Warn, ts, msg9 | Error, ts, msg1
Filter(lambda line: "Error" in line)
Partition 1: Error, ts, msg1 | Error, ts, msg1
Partition 2: (empty)
Partition 3: Error, ts, msg3
Partition 4: Error, ts, msg4 | Error, ts, msg1
Coalesce(2)
Partition 1: Error, ts, msg1 | Error, ts, msg1 | Error, ts, msg3
Partition 2: Error, ts, msg4 | Error, ts, msg1
Cache()
Count() returns 5
Filter(lambda line: "msg1" in line)
Partition 1: Error, ts, msg1 | Error, ts, msg1
Partition 2: Error, ts, msg1
Collect() (to the Driver) or saveToCassandra()
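The filter, coalesce and count steps on these slides can be modelled in plain Python, with lists standing in for partitions. This is only an illustrative sketch: the helper names filter_partitions and coalesce are hypothetical, the assignment of records to partitions is assumed, and Spark's real coalesce() chooses its own grouping, so the per-partition contents may differ from what the slides show.

```python
# Plain-Python model of the pipeline; each inner list stands in for one RDD partition.
# The record-to-partition assignment below is an assumption made for illustration.
log_partitions = [
    ["Error, ts, msg1", "Warn, ts, msg2", "Error, ts, msg1"],
    ["Info, ts, msg8", "Warn, ts, msg2", "Info, ts, msg8"],
    ["Error, ts, msg3", "Info, ts, msg5", "Info, ts, msg5"],
    ["Error, ts, msg4", "Warn, ts, msg9", "Error, ts, msg1"],
]


def filter_partitions(partitions, predicate):
    """Element-wise filter: each partition is processed independently."""
    return [[rec for rec in part if predicate(rec)] for part in partitions]


def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions into at most num_partitions (simplified grouping)."""
    group = -(-len(partitions) // num_partitions)  # ceiling division
    return [
        [rec for part in partitions[i:i + group] for rec in part]
        for i in range(0, len(partitions), group)
    ]


errors = filter_partitions(log_partitions, lambda line: "Error" in line)
errors = coalesce(errors, 2)
error_count = sum(len(part) for part in errors)       # models count()

msg1 = filter_partitions(errors, lambda line: "msg1" in line)
collected = [rec for part in msg1 for rec in part]    # models collect() to the driver
```

After the first filter one partition is empty, which is why coalescing to fewer partitions before further work is worthwhile.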
map() intersection() cartesian()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
- Most transformations are element-wise (they work on one element at a time), but this is not
true for all transformations
reduce() takeOrdered()
collect() saveAsTextFile()
count() saveAsSequenceFile()
first() saveAsObjectFile()
take() countByKey()
takeSample() foreach()
saveToCassandra() ...
.cache() == .persist(MEMORY_ONLY)
MEMORY_ONLY (default): Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects.
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental): Store RDD in serialized format in Tachyon. Reduces garbage collection overhead and allows Executors to be smaller.
DEMO #2:
Examples of narrow and wide dependencies.
Each box is an RDD, with partitions shown as shaded rectangles.
[Figure: a DAG of RDDs A through F connected by map, groupBy, filter and join, divided into Stages 1 to 3; shaded boxes mark cached partitions.]
[Figure: Stages 1 to 4 together make up Job #1.]
- Local
- Standalone Scheduler
- YARN
- Mesos
Static Partitioning
Dynamic Partitioning
DSE implementation
Spark Master (Coarse-Grained Scheduler): "I’m HA via System Table"
(Task Scheduler)
spark.executor.memory = 512 MB
spark.cores.max = *
[Figure: a DSE node running C*, Spark Master, Worker, Executor and Driver; labelled sizes are 512 MB, 512 MB, 16 - 256 MB, 16 - 256 MB and "Dynamically set".]
SPARK_TMP_DIR="/tmp/spark"
SPARK_RDD_DIR="/var/lib/spark/rdd"
Default Memory Allocation in Executor JVM
- Cached RDDs: 60%
- Shuffle memory: 20%
- User Programs (remainder): 20%
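As a rough worked example of this default split, the sketch below divides an executor heap into the three regions. It assumes the Spark 1.x defaults of 60% for cached RDDs and 20% for shuffle, with the remainder left to user programs, and it ignores any safety margins Spark applies on top. The function name executor_memory_breakdown is hypothetical.

```python
def executor_memory_breakdown(executor_memory_mb,
                              storage_fraction=0.6,   # assumed default for cached RDDs
                              shuffle_fraction=0.2):  # assumed default for shuffle
    """Split an executor heap into cached-RDD, shuffle and user-program regions."""
    storage = executor_memory_mb * storage_fraction
    shuffle = executor_memory_mb * shuffle_fraction
    user = executor_memory_mb - storage - shuffle     # whatever is left over
    return {"cached_rdds_mb": storage, "shuffle_mb": shuffle, "user_programs_mb": user}


# Example with the 512 MB executor size that appears elsewhere in this deck.
breakdown = executor_memory_breakdown(512)
```

With a 512 MB executor, roughly 307 MB is available for cached RDDs, which is worth keeping in mind when estimating how much data will actually stay in memory.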
Spark Executor
Spark-C* Connector
C* Java Driver
- Open Source
- Implemented mostly in Scala
- Scala + Java APIs
- Does automatic type conversions
val cassandraRDD = sc
  .cassandraTable("ks", "mytable")   // keyspace, table
  .select("col-1", "col-3")          // server-side column selection
  .where("col-5 = ?", "blue")        // server-side row selection
[Figure: rows of a C* table (columns c1, c2), spread across nodes, are read into the four partitions of an RDD (Resilient Distributed Dataset).]
Every Spark task uses a CQL-like query to fetch
data for a given token range:
SELECT "key", "value"
FROM "keyspace"."table"
WHERE token("key") > 384023840238403 AND
token("key") <= 38402992849280
ALLOW FILTERING
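To make the per-task token-range query concrete, here is a minimal sketch that splits a token ring into contiguous ranges and builds one query string per range. It is not the connector's actual splitting logic: the helper names split_token_ring and range_query are hypothetical, the ring bounds assume the Murmur3 partitioner's signed 64-bit token space, and the handling of the very first token is simplified.

```python
MIN_TOKEN = -2**63      # assumption: Murmur3 partitioner token space
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Split the whole ring into num_splits contiguous (start, end] ranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(keyspace, table, key, start, end):
    """Build the CQL-like query one task would run for its token range."""
    return (
        f'SELECT "{key}", "value" FROM "{keyspace}"."{table}" '
        f'WHERE token("{key}") > {start} AND token("{key}") <= {end} '
        f'ALLOW FILTERING'
    )


ranges = split_token_ring(4)   # e.g. one range per Spark partition
queries = [range_query("keyspace", "table", "key", s, e) for s, e in ranges]
```

Each range ends where the next begins, so every row falls into exactly one task's query.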
Configuration Settings
Cassandra Settings
DataStax Enterprise Settings
Spark Settings
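Configuration Settings
/etc/dse/spark/ (Spark Settings)
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_WORKER_WEBUI_PORT=7081
# export SPARK_EXECUTOR_MEMORY="512M"
# export DEFAULT_PER_APP_CORES="1"
# Set the amount of memory used by Spark Worker - if uncommented, it
# overrides the setting initial_spark_worker_resources in dse.yaml.
# export SPARK_WORKER_MEMORY=2048m
# The amount of memory used by Spark Driver program
export SPARK_DRIVER_MEMORY="512M"
# Directory where RDDs will be cached
export SPARK_RDD_DIR="/var/lib/spark/rdd"
# The directory for storing master.log and worker.log files
export SPARK_LOG_DIR="/var/log/spark"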
Configuration Settings
/etc/dse/dse.yaml (DataStax Enterprise Settings)
# The fraction of available system resources to be used by Spark Worker
# This is only the initial value; once it is reconfigured, the new value is
# stored and retrieved on next run.
initial_spark_worker_resources: 0.7
Spark worker memory = initial_spark_worker_resources * (total system memory - memory assigned to C*)
Spark worker cores = initial_spark_worker_resources * total system cores
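The two formulas above can be checked with a short calculation. The node size used here (32 GB of RAM, 8 GB assigned to C*, 16 cores) is a made-up example, the function name spark_worker_resources is hypothetical, and how DSE rounds the results is not stated in the slides, so the values are left unrounded.

```python
def spark_worker_resources(fraction, total_memory_mb, cassandra_memory_mb, total_cores):
    """Apply the two formulas: memory and cores granted to the Spark Worker."""
    worker_memory_mb = fraction * (total_memory_mb - cassandra_memory_mb)
    worker_cores = fraction * total_cores
    return worker_memory_mb, worker_cores


# Hypothetical node: 32 GB RAM, 8 GB assigned to C*, 16 cores, default fraction 0.7.
memory_mb, cores = spark_worker_resources(0.7, 32 * 1024, 8 * 1024, 16)
```

For this example node the worker gets about 17,203 MB and 11.2 cores before any rounding, so the memory given to C* directly reduces what Spark can use.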
[Figure: rows selected from a C* table (columns c1, c2) into an RDD with where(), or filter()]
sc.cassandraTable("KS", "TB").select("C-1", "C-2")
  .where("C-3 = ?", "black")
Interesting Spark Settings
spark.speculation = false
spark.locality.wait = 3000
spark.local.dir = /tmp
spark.serializer = JavaSerializer or KryoSerializer
• HadoopRDD
• FilteredRDD
• MappedRDD
• PairRDD
• ShuffledRDD
• UnionRDD
• PythonRDD
• DoubleRDD
• JdbcRDD
• JsonRDD
• SchemaRDD
• VertexRDD
• EdgeRDD
• CassandraRDD (DataStax)
• GeoRDD (ESRI)
• EsSpark (ElasticSearch)
Spark sorted the same data 3X faster
using 10X fewer machines
than Hadoop MR in 2013.
Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
100TB Daytona Sort Competition 2014
More info: officially-sets-a-new-record-in-large-scale-sorting.html
All the sorting took place on disk (HDFS) without
using Spark’s in-memory cache!
- Stresses “shuffle” which underpins everything from SQL to MLlib
- Sorting is challenging b/c there is no reduction in data
- Sort 100 TB = 500 TB disk I/O and 200 TB network
Engineering Investment in Spark:
- Sort-based shuffle (SPARK-2045)
- Netty native network transport (SPARK-2468)
- External shuffle service (SPARK-3796)
Clever Application level Techniques:
- GC and cache friendly memory layout
- Pipelining
EC2: i2.8xlarge
(206 workers)
- Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores
- 244 GB of RAM
- 8 x 800 GB SSD in a RAID 0 setup, formatted with ext4
- ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes
- Each record: 100 bytes (10 byte key & 90 byte value)
- OpenJDK 1.7
- HDFS 2.4.1 w/ short circuit local reads enabled
- Apache Spark 1.2.0
- Speculative Execution off
- Increased Locality Wait to infinite
- Compression turned off for input, output & network
- Used Unsafe to put all the data off-heap and managed
it manually (i.e. never triggered the GC)
- 32 slots per machine
- 6,592 slots total
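The slot total and the size of the input follow directly from the figures listed above. The short calculation below reproduces them, assuming 1 TB means 10**12 bytes; the variable names are illustrative only.

```python
# Cluster size and slot count from the slide.
workers = 206
slots_per_machine = 32
total_slots = workers * slots_per_machine

# Record layout from the slide: 10-byte key plus 90-byte value.
record_bytes = 10 + 90
dataset_bytes = 100 * 10**12          # assumption: 1 TB = 10**12 bytes
records = dataset_bytes // record_bytes
```

The product matches the 6,592 slots quoted on the slide, and 100 TB of 100-byte records comes to one trillion records to sort.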
Map() Map() Map() Map()
Reduce() Reduce() Reduce()
- Entirely bounded
by I/O reading from
HDFS and writing out
locally sorted files
- Mostly network bound
< 10,000 reducers
- Notice that map
has to keep 3 file
handles open
TimSort
Map() Map() Map() Map()
(28,000 blocks)
RF = 2
250,000+ reducers!
- Only one file handle
open at a time
Map() Map() Map() Map()
(28,000 blocks) - 5 waves of maps
- 5 waves of reduces
Reduce() Reduce() Reduce()
RF = 2
RF = 2
250,000+ reducers!
MergeSort! TimSort
Sustaining 1.1GB/s/node during shuffle
- Actual final run
- Fully saturated
the 10 Gbit link
PySpark at a Glance
Write Spark jobs
in Python
Run interactive
jobs in the shell
Supports C extensions
(Not yet supported by the DataStax open source Spark connector… but soon..)
However, currently supported in DSE 4.6
PySpark - C*
- DSE has an implementation of PySpark that
supports reading and writing to C*
- But it is not in the open source connector (yet)
Spark Core Engine (Scala)
Local, Standalone Scheduler, YARN, Mesos
Java API
PySpark: 41 files, 8,100 loc, 6,300 comments
Process data in Python
and persist/transfer it
in Java
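[Figure: PySpark architecture. On the Driver Machine, the Python Spark Context talks to the Spark Context in the Driver JVM over a Py4j socket, with Local Disk alongside. On each Worker Machine, the Executor JVM exchanges data through a Pipe with Python processes running F(x) over RDD partitions. MLlib, SQL and shuffle run inside the Executor JVMs.]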
Data is stored as pickled objects in an RDD[Array[Byte]]
HadoopRDD, MappedRDD, PythonRDD
(100 KB – 1 MB each pickled object)
Choose Your Python Implementation
PyPy
• JIT, so faster
• less memory
• CFFI support
CPython (default python)
$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/pyspark
OR
$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/spark-submit
The performance speed up will depend on work load (from 20% to 3000%).
Here are some benchmarks:
Job         CPython 2.7   PyPy 2.3.1   Speed up
Word Count  41 s          15 s         2.7 x
Sort        26 s          44 s         1.05 x
Stats       174 s         3.6 s        48 x
Here is the code used for benchmark:
# Python 2 code (note xrange), run through PySpark with an existing SparkContext `sc`
rdd = sc.textFile("text")

def wordcount():
    (rdd.flatMap(lambda x: x.split('/'))
        .map(lambda x: (x, 1))
        .reduceByKey(lambda x, y: x + y)
        .collectAsMap())

def sort():
    rdd.sortBy(lambda x: x, 1).count()

def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
- 102 pages
- DevOps style
- For complete beginners
- Includes:
- Spark Streaming
- Dangers of
GroupByKey vs.
ReduceByKey
- 109 pages
- DevOps style
- For complete beginners
- Includes:
- PySpark
- Spark SQL
- Spark-submit
Q & A
We’re Hiring!
+ Data Scientist
+ JavaScript Engineer
+ Product Manager
+ Data Solutions Engineer
+ Support Operations Engineer
+ Interns


Spark & Cassandra at DataStax Meetup on Jan 29, 2015


