Spark & Cassandra at DataStax Meetup on Jan 29, 2015
1. ARCHITECTURE + DEMO
JAN 2015
http://www.slideshare.net/blueplastic
www.linkedin.com/in/blueplastic
2. AGENDA
• DATABRICKS CLOUD DEMO
• SPARK STANDALONE MODE ARCHITECTURE
• RDDS, TRANSFORMATIONS AND ACTIONS
• NEW SHUFFLE IMPLEMENTATION
• PYSPARK ARCHITECTURE
• P
3. making big data simple
Databricks Cloud:
“A unified platform for building Big Data pipelines
– from ETL to Exploration and Dashboards, to
Advanced Analytics and Data Products.”
• Founded in late 2013
• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~45 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Cloudera
• Hortonworks
• MapR
• DataStax
(http://databricks.workable.com)
4. http://strataconf.com/big-data-conference-ca-2015/public/content/apache-spark
Topics include:
• Using cloud-based notebooks to develop Enterprise data workflows
• Spark integration with Cassandra, Kafka, Elasticsearch
• Advanced use cases with Spark SQL and Spark Streaming
• Operationalizing Spark on DataStax, Cloudera, MapR, etc.
• Monitoring and evaluating performance metrics
• Estimating cluster resource requirements
• Debugging and troubleshooting Spark apps
• Cases studies for production deployments of Spark
• Preparation for Apache Spark developer certification exam
- 3 days of training (Tues – Thurs)
- $2,795
- Limited to 50 seats
- ~40% hands on labs
7. June 2010
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“The main abstraction in Spark is that of a resilient dis-
tributed dataset (RDD), which represents a read-only
collection of objects partitioned across a set of
machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across
machines and reuse it in multiple MapReduce-like
parallel operations.
RDDs achieve fault tolerance through a notion of
lineage: if a partition of an RDD is lost, the RDD has
enough information about how it was derived from
other RDDs to be able to rebuild just that partition.”
8. April 2012
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
“We present Resilient Distributed Datasets (RDDs), a
distributed memory abstraction that lets
programmers perform in-memory computations on
large clusters in a fault-tolerant manner.
RDDs are motivated by two types of applications
that current computing frameworks handle
inefficiently: iterative algorithms and interactive data
mining tools.
In both cases, keeping data in memory can improve
performance by an order of magnitude.”
“Best Paper Award and Honorable Mention for Community Award”
- NSDI 2012
- Cited 392 times!
10. sqlCtx = new HiveContext(sc)
results = sqlCtx.sql(
"SELECT * FROM people")
names = results.map(lambda p: p.name)
Seemlessly mix SQL queries with Spark programs.
Coming soon!
(Will be published in the upcoming
weeks for SIGMOD 2015)
14. An RDD can be created 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc)
RDD
Partition 1 Partition 2 Partition 3 Partition 4
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
RDDs are made of multiple partitions (more partitions = more parallelism)
15. # Parallelize in Python
wordsRDD = sc.parallelize([“fish", “cats“, “dogs”])
// Parallelize in Scala
val wordsRDD= sc.parallelize(List("fish", "cats", "dogs"))
// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList(“fish", “cats“, “dogs”));
- Take an existing in-
memory collection and
pass it to SparkContext’s
parallelize method
- Not generally used
outside of prototyping
and testing since it
requires entire dataset in
memory on one
machine
16. # Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
- There are other methods
to read data from HDFS,
C*, S3, HBase, etc.
19. map() intersection() cartesion()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
(lazy)
- Most transformations are element-wise (they work on one element at a time), but this is not
true for all transformations
23. MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, some partitions will not be cached and will be
recomputed on the fly each time they're needed. (default)
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, store the partitions that don't fit on disk, and read them
from there when they're needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This
is generally more space-efficient than deserialized objects
MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in
memory to disk instead of recomputing them on the fly each time
they're needed.
DISK_ONLY Store the RDD partitions only on disk.
MEMORY_ONLY_2,
MEMORY_AND_DISK_2
Same as the levels above, but replicate each partition on two cluster
nodes.
OFF_HEAP (experimental) Store RDD in serialized format in Tachyon. Reduces garbage collection
overhead and allows Executors to be smaller.
32. Spark Master
(Course-Grained Scheduler) C
*
Ex
OS
S
M
+ RD
D
W
RD
D
Driver
T
T
512 MB
512 MB
16 - 256 MB
16 - 256 MB
Dynamically set
(Task Scheduler)
SPARK_TMP_DIR="/tmp/spark“
SPARK_RDD_DIR="/var/lib/spark/rd
d"
33. 60%20%
20%
Default Memory Allocation in Executor JVM
Cached RDDs
User Programs
(remainder)
Shuffle memory
spark.storage.memoryFraction
spark.storage.memoryFraction
36. +D
A
B
C
c1 c2
A
K
M
c1 c2
G
B
L
c1 c2
I
W
S
c1 c2
C
Y
Z
RDD
(Resilient Distributed Dataset)
c1 c2
G
B
L
c1 c2
I
W
S
c1 c2
C
Y
Z
c1 c2
A
K
M
Partition 1 Partition 2 Partition 3 Partition 4
Every Spark task uses a CQL-like query to fetch
data for a given token range:
SELECT “key”, “value”
FROM “keyspace”.”table”
WHERE
token(“key”) > 384023840238403 AND
token(“key”) <= 38402992849280
ALLOW FILTERING
38. Configuration Settings
/etc/dse/spark/spark-env.sh
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_WORKER_WEBUI_PORT=7081
# export SPARK_EXECUTOR_MEMORY="512M“
# export DEFAULT_PER_APP_CORES="1“
# Set the amount of memory used by Spark Worker - if uncommented, it
overrides the setting initial_spark_worker_resources in dse.yaml.
# export SPARK_WORKER_MEMORY=2048m
# The amount of memory used by Spark Driver program
export SPARK_DRIVER_MEMORY="512M“
# Directory where RDDs will be cached
export SPARK_RDD_DIR="/var/lib/spark/rdd“
# The directory for storing master.log and worker.log files
export SPARK_LOG_DIR="/var/log/spark"
(Spark Settings)
39. Configuration Settings
/etc/dse/dse.yaml (DataStax Enterprise Settings)
# The fraction of available system resources to be used by Spark Worker
# This the only initial value, once it is reconfigured, the new value is
# stored and retrieved on next run.
initial_spark_worker_resources: 0.7
Spark worker memory = initial_spark_worker_resources * (total system memory - memory assigned to C*)
Spark worker cores = initial_spark_worker_resources * total system cores
40. or
filter()
c1 c2
A 5 5
C 0 1
C 0 1
RDD
C*
RDD
c1 c2
A 5 5
C 0 1
C 0 1
RDD
C*
sc.cassandraTable(“KS", “TB").select(“C-1", “C-2")
.where(“C-3 = ?", "black")
44. Spark sorted the same data 3X faster
using 10X fewer machines
than Hadoop MR in 2013.
Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
100TB Daytona Sort Competition 2014
More info:
http://sortbenchmark.org
http://databricks.com/blog/2014/11/05/spark-
officially-sets-a-new-record-in-large-scale-sorting.html
All the sorting took place on disk (HDFS) without
using Spark’s in-memory cache!
45. - Stresses “shuffle” which underpins everything from SQL to Mllib
- Sorting is challenging b/c there is no reduction in data
- Sort 100 TB = 500 TB disk I/O and 200 TB network
Engineering Investment in Spark:
- Sort-based shuffle (SPARK-2045)
- Netty native network transport (SPARK-2468)
- External shuffle service (SPARK-3796)
Clever Application level Techniques:
- GC and cache friendly memory layout
- Pipelining
46. Ex
RD
D
W
RD
D
T
T
EC2: i2.8xlarge
(206 workers)
- Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores
- 244 GB of RAM
- 8 x 800 GB SSD and RAID 0 setup formatted with /ext4
- ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes
- Each record: 100 bytes (10 byte key & 90 byte value)
- OpenJDK 1.7
- HDFS 2.4.1 w/ short circuit local reads enabled
- Apache Spark 1.2.0
- Speculative Execution off
- Increased Locality Wait to infinite
- Compression turned off for input, output & network
- Used Unsafe to put all the data off-heap and managed
it manually (i.e. never triggered the GC
- 32 slots per machine
- 6,592 slots total
47. Map() Map() Map() Map()
Reduce() Reduce() Reduce()
- Entirely bounded
by I/O reading from
HDFS and writing out
locally sorted files
- Mostly network bound
< 10,000 reducers
- Notice that map
has to keep 3 file
handles open
TimSort
48. Map() Map() Map() Map()
(28,000 blocks)
RF = 2
250,000+ reducers!
- Only one file handle
open at a time
52. PySpark at a Glance
Write Spark jobs
in Python
Run interactive
jobs in the shell
Supports C
extensions
(Not yet supported by the DataStax open source Spark connector… but soon..)
However, currently supported in DSE 4.6
53. PySpark - C*
- DSE has an implementation of PySpark that
supports reading and writing to C*
- But it is not in the open source connector (yet)
57. Data is stored as Pickled objects in an RDD[Array[Byte]]HadoopRDD
MappedRDD
PythonRDD RDD[Array[ ] ], , ,
(100 KB – 1MB each picked object)
58. pypy
• JIT, so faster
• less memory
• CFFI support)
CPython
(default python)
Choose Your Python Implementation
Spark
Context
Driver Machine
Worker Machine
$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy
./bin/pyspark
$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/spark-submit
wordcount.py
OR
59. Job CPython 2.7 PyPy 2.3.1 Speed up
Word Count 41 s 15 s 2.7 x
Sort 26 s 44 s 1.05 x
Stats 174 s 3.6 s 48 x
The performance speed up will depend on work load (from 20% to 3000%).
Here are some benchmarks:
Here is the code used for benchmark:
rdd = sc.textFile("text")
def wordcount():
rdd.flatMap(lambda x:x.split('/'))
.map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
def sort():
rdd.sortBy(lambda x:x, 1).count()
def stats():
sc.parallelize(range(1024), 20).flatMap(lambda x:
xrange(5024)).stats()
https://github.com/apache/spark/pull/2144
63. Q & A
We’re Hiring!
+ Data Scientist
+ JavaScript Engineer
+ Product Manager
+ Data Solutions Engineer
+ Support Operations Engineer
+ Interns
Editor's Notes
Q) Who knows why the project was named Spark?
REMEMBER to say out loud the % of hands that go up for each of these:
Q-1) How many people are Cassandra people.. sstables, bloom filters, vnodes?
Q-2) How many people have launched a any Spark REPL like Scala or PySpark or Spark SQL and typed in anything at all? So, anyone who has hands-on experience with typing a Spark Transformation or Action?
Q-3) I’ll be asking 2 questions, you can raise your hand for either or? How many people will be deploying Spark on Hadoop vs. people deploying Spark on Cassandra?
Q-4) Will be deploying Spark in the Amazon Cloud? (You guys would be our ideal customers…)
Mention that I was a Datastax Authorized Training partner and have been using C* since 2010. DS sent me to Singapore, Malaysia, Mexico, San Diego, NYC… just overall, I got to visit 20 countries and around 50 cities.
But recently I started getting interested in Apache Spark and tried to learn it on my own and getting beyond the basics was really hard. So, I decided to join Databricks to *really* learn Spark. When you get stuck with something obscure like how preservesPartitioning works, it helps to be sitting a few feet away from Matei Zaharia when you’re studying how the engine that he built works.
So, anyway, the goal of this talk teach you some of what I’ve learned in the past 6 months. I hope this talk will bootstrap you on the way to really understanding Spark and save you some of the headaches along the way.
Say that the goal of this talk is to get you up and running with Spark’s Python API, basic programming in Spark, understanding it’s architecture and using SparkSQL
Here’s the Agenda for tonight, in no particular order
So, Data
AmpLab was founded in Jan 2011 with a 6 year plan, has 65 students, faculty and postdocs.
Matei designed the Fair Scheduler for Hadoop in mid-2008:
https://issues.apache.org/jira/browse/HADOOP-3746
“The default job scheduler in Hadoop has a first-in-first-out queue of jobs for each priority level. The scheduler always assigns task slots to the first job in the highest-level priority queue that is in need of tasks. This makes it difficult to share a MapReduce cluster between users because a large job will starve subsequent jobs in its queue, but at the same time, giving lower priorities to large jobs would cause them to be starved by a stream of higher-priority jobs.
This JIRA proposes a job scheduler based on fair sharing. Fair sharing splits up compute time proportionally between jobs that have been submitted, emulating an "ideal" scheduler that gives each job 1/Nth of the available capacity. When there is a single job running, that job receives all the capacity. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that everyone gets roughly the same amount of compute time. This lets short jobs finish in reasonable amounts of time while not starving long jobs. This is the type of scheduling used or emulated by operating systems - e.g. the Completely Fair Scheduler in Linux. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that a job should get.”
Remind audience to add me on LinkedIn and I’ll let them know of my next upcoming trainings.
Spark runs everywhere.. Hadoop, Mesos, standalone or in the cloud.. Amazon, Rackspace, Azure. Even Docker, the list just goes on and on…
So, our goal is that you’ll be able to learn one processing engine and then re-use your knowledge against any data source or environment. But you do have to make that initial investment in learning Spark, and hopefully this presentation will get you going along the way.
A philosophy of tight integration has several benefits. First, all libraries and higher level components in the stack benefit from improvements at the lower layers. For example, when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well.
Second, the costs (deployment, maintenance, testing, support) associated with running the stack are minimized, because instead of running 5-10 independent software systems, an organization only needs to run one.
This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to upgrading Spark.
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously analysts can query the resulting data, also in real-time, via SQL, e.g. to join the data with unstructured log files. In addition, more sophisticated data engineers & data scientists can access the same data via the Python shell for ad-hoc analysis. Others might access the data in standalone batch applications. All the while, the IT team only has to maintain one software stack.
Why Scala from Matei: “When we started Spark, we wanted it to have a concise API for users, which Scala did well. At the same time, we wanted it to be fast (to work on large datasets), so many scripting languages didn't fit the bill. Scala can be quite fast because it's statically typed and it compiles in a known way to the JVM. Finally, running on the JVM also let us call into other Java-based big data systems, such as Cassandra, HDFS and HBase.”
Note that Python 3 is not yet supported. You will have to use Python 2.6 or a newer 2.x branch.
Quote, Keep the simple things simple and make the complex possible.
Today we’ll just keep the simple things simple and not get into the complexities of Machine Learning and advanced visualizations
This was just a preview academic paper
By the way, the original Google MapReduce whitepaper from 2008 was cited 12,600 times.
In the code: Counting tweets on a sliding window
Build applications through high-level operators.
Stateful exactly-once semantics out of the box.
Combine streaming with batch and interactive queries.
In the code: Apply functions to results of SQL queries.
Run unmodified Hive queries on existing warehouses.
Connect through JDBC or ODBC.
Special Interest Group for Management of Data
This was an invited paper for the Industrial track to be held in Melbourne, Australia in early June 2015
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
As far as choosing a "good" number you generally want at least as many as the number of executors for parallelism. There already exists some logic to try and determine a "good" amount of parallelism, and you can get this value by calling sc.defaultParallelism
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
Coalesce: “Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.”
The coalesce function is only used to reduce the number of partitions
NEED TO FIX ANIMATIONS HERE (after save to C* addition)
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
To-do: Put the ones for k/v pairs in a different color
Map: Return a new distributed dataset formed by passing each element of the source through a function func.
FlatMap: Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
Filter: Return a new dataset formed by selecting those elements of the source on which func returns true.
Pipe: Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
Sample: Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
Coalesce: Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
Actions force the evaluation of the transformations required for the RDD they are called on, since they are required to actually produce output.
Wide dependency pretty much always means a shuffle
Map, mappartitions, filters are cheap and can be pipelined
Two different types of transformations: Narrow & Wide
Wide dependencies are MORE expensive.
Narrow transformations: where the input data depends on a constant # of partitions of the parent data. Like map, filter, mapPartitions.
Wide transformations: where the input data depends on a variable # of partitions of the parent data. Like groupByKey or reduceByKey.
Some operations can be narrow or wide depending on if the data has the same partitioning, like join.
Figure 5: Example of how Spark computes job stages. Boxes
with solid outlines are RDDs. Partitions are shaded rectangles,
In black if they are already in memory. To run an action on RDD
G, we build stages at wide dependencies and pipeline nar-
row transformations inside each stage. In this case, stage 1’s
output RDD is already in RAM, so we run stage 2 and then 3.
Sameer notes: Here the RDD B is hash partitioned, and is probably a much larger RDD than E. The data from E is having to be sent over the network.. More on this later.
Not correct: Even during the join, Spark understands that these 2 RDDs (from Stage 1 and Stage 2) are hash partitioned to the same node. So, there is no need to over the network shuffling to make the Stage 3 RDD.
The set of stages produced for a particular action is termed a job . In each case when we invoke actions such as count , we are creating a job composed of one or more stages.
Once the stage graph is defined, tasks are created and dispatched to an internal scheduler
which varies depending on the deployment mode being used. Stages in the physical plan
can depend on each other, based on the RDD lineage, so they will be executed in a
specific order. For instance a stage that outputs shuffle data most occur before one that
relies on that data being present.
The set of stages produced for a particular action is termed a job . In each case when we invoke actions such as count , we are creating a job composed of one or more stages.
Once the stage graph is defined, tasks are created and dispatched to an internal scheduler
which varies depending on the deployment mode being used. Stages in the physical plan
can depend on each other, based on the RDD lineage, so they will be executed in a
specific order. For instance a stage that outputs shuffle data most occur before one that
relies on that data being present.
Mention that in YARN we can set spark.dynamicAllocation.enabled to True with minExecutors, maxExecutors and 3 timeouts.
In cluster mode Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver. The
cluster manager is a pluggable component in Spark.
Each one of these resource managers has different pros and cons. One nice thing about running Spark in YARN, for example, is that you can dynamically resize the # of Executors in the application. This feature is not yet possible in Standalone mode.
Static Partitioning: With this approach, each application is given a maximum amount of resources it can use, and holds onto them for its whole duration.
Mesos: In “fine-grained” mode (default), each Spark task runs as a separate Mesos task. This allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity, where each application gets more or fewer machines as it ramps up and down, but it comes with an additional overhead in launching each task. This mode may be inappropriate for low-latency requirements like interactive queries or serving web requests.
The “coarse-grained” mode will instead launch only one long-running Spark task on each Mesos machine, and dynamically schedule its own “mini-tasks” within it. The benefit is much lower startup overhead, but at the cost of reserving the Mesos resources for the complete duration of the application.
In Standalone and Mesos course-grained, you can control the maximum number of resources Spark will acquire. By default, it will acquire all cores in the cluster (that get offered by Mesos), which only makes sense if you run just one application at a time. You can cap the max # of cores using conf.set("spark.cores.max", "10") (for example).
In YARN, ZooKeeper makes the SM HA
You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise)
In YARN the driver is HA via YARN primitives
Spark.cores.max = the maximum amount of CPU cores to request for the application from across the cluster (not from each machine). If not set, the default will bespark.deploy.defaultCores on Spark's standalone cluster manager
As far as C* goes, for example, in my 15 GB VM, the Cassandra JVM gets about 3.5 GB of RAM
The Driver runs within it a Task Scheduler. This scheduler schedules tasks on different Executors.
When running in cluster mode, Spark utilizes a master-slave architecture with one central coordinator (called Driver) and many distributed Executors. A driver and its executors are together termed a Spark application.
A Spark driver is responsible for compiling a user program into units of physical execution called tasks.
# Directory for Spark temporary files. It will be used by Spark Master, Spark Worker, Spark Shell and Spark applications.
export SPARK_TMP_DIR="/tmp/spark"
# Directory where RDDs will be cached
export SPARK_RDD_DIR="/var/lib/spark/rdd"
RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of memory used when caching to a certain fraction of the JVM’s overall heap, set by spark.storage.memoryFraction
Shuffle and aggregation buffers: When performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations in addition to buffering data that is going to be directly output as part of the shuffle.
User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will content for overall memory usage. User code has access to everything “left” in the JVM heap after the space for RDD storage and shuffle storage are allocated.
You can also specify the column order by stating explicitly which columns you want returned…
Spark asks an RDD for a list of its partitions (splits)
Each split consists of one or more token-ranges
For every RDD partition:
Spark asks RDD for a list of preferred nodes to process on
Spark creates a task and sends it to one of the nodes for execution
DataStax Enterprise 4.5.2 and later can control the memory and cores offered by particular Spark Workers in semiautomatic fashion. The initial_spark_worker_resources parameter in dse.yaml file specifies the fraction of system resources available to the Spark Worker.
The SPARK_WORKER_MEMORY option configures the total amount of memory that you can assign to all executors that a single Spark Worker runs on the particular node.
The SPARK_WORKER_CORES option configures the number of cores offered by Spark Worker for use by executors. A single executor can borrow more than one core from the worker. The number of cores used by the executor relates to the number of parallel tasks the executor might perform. The number of cores offered by the cluster is the sum of cores offered by all the workers in the cluster.
export SPARK_REPL_MEM="256M“ will be removed in newer versions of DSE: “SPARK_REPL_MEM is a left-over from some earlier Spark version where I guess SPARK_DRIVER_MEM was not present.We're going to remove this setting in the next version.
I created a ticket for this:
https://datastax.jira.com/browse/DSP-4495”
DataStax Enterprise 4.5.2 and later can control the memory and cores offered by particular Spark Workers in semiautomatic fashion. The initial_spark_worker_resources parameter in dse.yaml file specifies the fraction of system resources available to the Spark Worker.
The SPARK_WORKER_MEMORY option configures the total amount of memory that you can assign to all executors that a single Spark Worker runs on the particular node.
The SPARK_WORKER_CORES option configures the number of cores offered by Spark Worker for use by executors. A single executor can borrow more than one core from the worker. The number of cores used by the executor relates to the number of parallel tasks the executor might perform. The number of cores offered by the cluster is the sum of cores offered by all the workers in the cluster.
To filter rows, you can use the filter transformation provided by Spark. Filter transformation fetches all rows from Cassandra first and then filters them in Spark. Some CPU cycles are wasted serializing and deserializing objects excluded from the result.
To avoid this overhead, CassandraRDD has a method that passes an arbitrary CQL condition to filter the row set on the server.
Spark Speculation will enable speculative execution of tasks, so tasks that are running slowly will be re-launched (this can help cut down on straggler tasks in large clusters). SE only kicks in after 75% of tasks are done (this is separate for each stage). This is off by default. There are 3 other settings: spark.speculation.interval (default 100: How often Spark will check for tasks to speculate, in milliseconds), spark.speculation.quantile (default 0.75; percentage of tasks which must be complete before speculation is enabled for a particular stage, spark.speculation.multiplier (default 1.5; how many times slower a task is than the median to be considered for speculation)
Locality.wait defaults to 3000 ms. This is how long to wait to launch a data-local task before giving up and launching it on a less-local node. There are many locality levels (process-local, node-local, rack-local, any). You should increase this setting if your tasks are long and see poor locality.
Spark.serializer: Class to use for serializing objects that will be sent over the network or need to be cached in serialized form. JavaSerializer is slow, but works with any object. Consider using org.apache.spark.serializer.KryoSerializer. Can be any subclass of org.apache.spark.Serializer
Research some cool things to say about the different RDDs and what they are used for
Why Scala from Matei: “When we started Spark, we wanted it to have a concise API for users, which Scala did well. At the same time, we wanted it to be fast (to work on large datasets), so many scripting languages didn't fit the bill. Scala can be quite fast because it's statically typed and it compiles in a known way to the JVM. Finally, running on the JVM also let us call into other Java-based big data systems, such as Cassandra, HDFS and HBase.”
Organizations from around the world often build dedicated sort machines (specialized software and sometimes specialized hardware) to compete in this benchmark.. Spark actually tied for 1st place with a team from University of California San Diego who have been working on creating a specialized sorting system called TritonSort.
Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs.
Named after Jim Gray, the benchmark workload is resource intensive by any measure: sorting 100 TB of data following the strict rules generates 500 TB of disk I/O and 200 TB of network I/O.
Requires read and write of 500 TB of disk I/O and 200 TB of network (b/c you have to replicate the output to make it fault taulerant)
First time a system based on a public cloud system has won
Zero-Copy (terminology): With Netty, the data from disk only gets sent to one NIC buffer and sent out to other node from there.
Older implementation (normal) would have to first copy to FileSystem kernel buffer cache, then to a Executor JVM user-space buffer, and then to kernel NIC buffer and out to the remove reducer node
GC: Netty does an explicit managed memory, malloc (outside of JVM). Netty is in the normal Executor JVM process that allocates a bunch of memory buffers off-heap and manages these transport buffers entirely by itself.
In HDFS, reads normally go through the DataNode. Thus, when the client asks the DataNode to read a file, the DataNode reads that file off of the disk and sends the data to the client over a TCP socket. So-called "short-circuit" reads bypass the DataNode, allowing the client to read the file directly. Obviously, this is only possible in cases where the client is co-located with the data. Short-circuit reads provide a substantial performance boost to many applications.
<name>dfs.client.read.shortcircuit</name> <value>true</value>
No network usage at all in the Map phase
In the Reduce stage, each reducer fetches its own range of data as determined by the partitioner from all the map outputs. It then sorts these outputs using TimSort and writes the sorted output to HDFS.
48 MB is the amount in flight at a time
In both shuffles: R talks to 5 maps at a time and grabs 48 MB from each
Right now if there are 3 file handles open for each map
The sort based shuffle basically doesn’t require writing a separate file for each reduce task from each mapper
Defaults to sort in 1.2
The sort based shuffle basically doesn’t require writing a separate file for each reduce task from each mapper
Reducer hasn’t changed (basically the same)
Only where you fetch the data has changed
Why Scala from Matei: “When we started Spark, we wanted it to have a concise API for users, which Scala did well. At the same time, we wanted it to be fast (to work on large datasets), so many scripting languages didn't fit the bill. Scala can be quite fast because it's statically typed and it compiles in a known way to the JVM. Finally, running on the JVM also let us call into other Java-based big data systems, such as Cassandra, HDFS and HBase.”
This is based on regular C python… it’s not jython. So that means all python libraries are compatible. If you use numpy or NLTK or something else that has a c extension, then you can import that into PySpark. It’s a regular unmodified python interpreter.
Right now
Built on top of the Java API. It’s a very thin layer that sits on top of Scala and Java Spark.
It provides a way to get data in and out of python and it reuses a lot of the Spark Scala functionality.
You process all the data in python, so all of the UDFs you write will be python.
But Spark will persist + transfer it through the network and storage layers in Scala + Java.
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Essentially, you just call Java methods from your python code as if they were python methods.
These are executor JVMs where num_cores = 2
In the Executor JVM, Pyrolite is used to unpickle the data or to pickle it. “This library allows your Java or .NET program to interface very easily with the Python world. It uses the Pyro protocol to call methods on remote objects. (See https://github.com/irmen/Pyro4). Pyrolite contains and uses a feature complete pickle protocol implementation to exchange data with Pyro/Python. “
Why does the local disk get used on the driver machine? Py4J is very inefficient for transferring large data b/c it uses a text based protocol.. It has to encode in base 64 so really slow. A collect or broadcast may be huge, so Py4j writes one file to the local FS. The final unpickling happens in the Python process on the Driver Machine.
Python functions and closures (with dependencies) are serialized using PiCloud’s CloudPickle module
You could also use Py4J. There is an example on the frontpage and lots of documentation, but essentially, you just call Java methods from your python code as if they were python methods:
You first write a java program where you create a class and run that java program in a JVM. Then you write a Python program that can access that java object via a network socket.
Here is what happens behind the scenes.
On my local machine, I have the Spark Context object… top level entry point for Spark.. Used to create RDDs. SC runs in the Python process.
When a SC is made, behind the scenes it creates a Java Spark context, running on top of regular Spark. Py4J is used to talk (copy data) between a JVM and python using reflection so you can call a java call just as if it was a native python method. It’s pretty cool.
So the python Spark Context is controlling the Java Spark Context. A local file system is also used for communication through multiple files for doing things like collecting results.
The integration with Py4J and Python is fine for like control point stuff, but you don’t want to transfer a whole lot of data through it because reflection is just too expensive.
When you submit a job to the cluster, your job (the Python transformations and stuff) gets mapped down to just a few transformations on this special Python RDD object. Behind the scenes when these objects are shipped to the Spark executor, they actually spin up a bunch of actual python processes. And the Spark Executors communicate with the Python code over pipes using a special format to pass the functions that the user is writing.
SO The executor will pipe all the data into the python processes, get all the results back and serialize it and transfer it.
Naturally, there is some serialization involved
there is one Python process per executor thread (we need this due to Python’s GIL so that CPU-bound tasks can benefit from multiple cores). In older versions of PySpark, we’d end up launching a new Python worker process per task. In newer versions (1.2.0+), executors launch Python processes as needed, keep them around in a pool, and terminate them if they’re idle for too long (1 minute by default); see https://github.com/apache/spark/pull/2259 and the followup patches for more details. The actual subprocess architecture is a little bit complicated because we use a separate process for managing the Python worker processes in order to work around some fork() performance issues in the JVM. This may mean that you’ll see up to 5 Python processes on an executor with 4 threads: one Python process per core, plus the daemon.py middle layer; see https://github.com/apache/spark/pull/1680 for some additional context.
Python UDFs are executed in regular Python interpreters, so things like sc.map(lambda x: some_arbitrary_python_code(x)) are run in Python interpreters, not through anything like Jython. Spark SQL and MLlib may actually end up executing code that runs in the JVM for doing heavy number-crunching and I expect that we’ll see more in-JVM execution of certain operations in future versions of Spark.
See also: the (woefully out-of-date) PySpark internals page on the Spark wiki: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. I really ought to update this page to reflect some of these recent developments.
CPython is the original Python implementation. It is the implementation you download from Python.org. CPython compiles your python code into bytecode (transparently) and interprets that bytecode in a evaluation loop. So CPython does not translate your Python code to C by itself. It instead runs a interpreter loop. There is a project that does translate Python-ish code to C, and that is called Cython.
Pypy uses just in time compilation (JIT) or dynamic translation, so it runs faster. This means that compilation is done during the execution of the program (at run time) rather than just prior to execution.
- - -
C Foreign Function Interface for Python. The goal is to provide a convenient and reliable way to call compiled C code from Python using interface declarations written in C.
- - -
Pypy uses just in time compilation (JIT) or dynamic translation, so it runs faster. This means that compilation is done during the execution of the program (at run time) rather than just prior to execution.
A JIT compiler is a program that turns Java bytecode (instructions that must be interpreted) into instructions that can be sent directly to the processor.
After you've written a Java program, the source language statements are compiled by the Java compiler into bytecode rather than into code that contains instructions that match a particular hardware platform's processor (for example, an Intel Pentium microprocessor or an IBM System/390 processor). The bytecode is platform-independent code that can be sent to any platform and run on that platform.
In the past, most programs written in any language have had to be recompiled, and sometimes, rewritten for each computer platform. One of the biggest advantages of Java is that you only have to write and compile a program once. The Java on any platform will interpret the compiled bytecode into instructions understandable by the particular processor. However, the virtual machine handles one bytecode instruction at a time. Using the Java just-in-time compiler (really a second compiler) at the particular system platform compiles the bytecode into the particular system code (as though the program had been compiled initially on that platform). Once the code has been (re-)compiled by the JIT compiler, it will usually run more quickly in the computer.
- - -
Good explanation of JIT:
A JIT compiler runs after the program has started and compiles the code (usually bytecode or some kind of VM instructions) on the fly (or just-in-time, as it's called) into a form that's usually faster, typically the host CPU's native instruction set. A JIT has access to dynamic runtime information whereas a standard compiler doesn't and can make better optimizations like inlining functions that are used frequently.
This is in contrast to a traditional compiler that compiles all the code to machine language before the program is first run.
To paraphrase, conventional compilers build the whole program as an EXE file BEFORE the first time you run it. For newer style programs, an assembly is generated with pseudocode (p-code). Only AFTER you execute the program on the OS (e.g., by double-clicking on its icon) will the (JIT) compiler kick in and generate machine code (m-code) that the Intel-based processor or whatever will understand.
- - - - -
Research: https://github.com/apache/spark/blob/master/bin/pyspark#L54
https://github.com/apache/spark/pull/2144
In general, PySpark with CPython will run 3-10x slower than Scala/Java.. Unless you use PyPy
One more thing…
This will definitely be sold out
Let audience know that - I’ll be teaching the 1-day Advanced Training sessions @ Spark Summit East (NYC) on March 18-19 or Summit West
Mention that we’re adding many interesting, new features to DBC that will be revealed soon.
And say that I demoed many features of DBC from ETL to Exploration and Dashboards.