ARCHITECTURE + DEMO
JAN 2015
http://www.slideshare.net/blueplastic
www.linkedin.com/in/blueplastic
AGENDA
• DATABRICKS CLOUD DEMO
• SPARK STANDALONE MODE ARCHITECTURE
• RDDS, TRANSFORMATIONS AND ACTIONS
• NEW SHUFFLE IMPLEMENTATION
• PYSPARK ARCHITECTURE
• P
making big data simple
Databricks Cloud:
“A unified platform for building Big Data pipelines
– from ETL to Exploration and Dashboards, to
Advanced Analytics and Data Products.”
• Founded in late 2013
• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~45 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Cloudera
• Hortonworks
• MapR
• DataStax
(http://databricks.workable.com)
http://strataconf.com/big-data-conference-ca-2015/public/content/apache-spark
Topics include:
• Using cloud-based notebooks to develop Enterprise data workflows
• Spark integration with Cassandra, Kafka, Elasticsearch
• Advanced use cases with Spark SQL and Spark Streaming
• Operationalizing Spark on DataStax, Cloudera, MapR, etc.
• Monitoring and evaluating performance metrics
• Estimating cluster resource requirements
• Debugging and troubleshooting Spark apps
• Case studies for production deployments of Spark
• Preparation for Apache Spark developer certification exam
- 3 days of training (Tues – Thurs)
- $2,795
- Limited to 50 seats
- ~40% hands on labs
Spark Core Engine (Scala / Python / Java)

Libraries on top of the core: Spark SQL, BlinkDB (approx. SQL), Spark Streaming, MLlib (machine learning), GraphX (graph computation), SparkR (R on Spark; alpha)

Storage: HDFS, S3, Tachyon (off-heap RDD), plus text, JSON/CSV, and Protocol Buffers formats

Cluster managers: Local, Standalone Scheduler, YARN, Mesos

(This is the Berkeley Data Analytics Stack.)
DEMO #1:
June 2010
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“The main abstraction in Spark is that of a resilient dis-
tributed dataset (RDD), which represents a read-only
collection of objects partitioned across a set of
machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across
machines and reuse it in multiple MapReduce-like
parallel operations.
RDDs achieve fault tolerance through a notion of
lineage: if a partition of an RDD is lost, the RDD has
enough information about how it was derived from
other RDDs to be able to rebuild just that partition.”
April 2012
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
“We present Resilient Distributed Datasets (RDDs), a
distributed memory abstraction that lets
programmers perform in-memory computations on
large clusters in a fault-tolerant manner.
RDDs are motivated by two types of applications
that current computing frameworks handle
inefficiently: iterative algorithms and interactive data
mining tools.
In both cases, keeping data in memory can improve
performance by an order of magnitude.”
“Best Paper Award and Honorable Mention for Community Award”
- NSDI 2012
- Cited 392 times!
TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
- The 2 Spark Streaming papers have been cited 138 times
sqlCtx = HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)

Seamlessly mix SQL queries with Spark programs.
Coming soon!
(Will be published in the upcoming
weeks for SIGMOD 2015)
http://shop.oreilly.com/product/0636920028512.do
eBook: $31.99
Print: $39.99
Early release available now!
Physical copy est Feb 2015
(Scala & Python only)
(Figure: the Driver Program talks to two Worker Machines; each Worker runs an Executor (Ex) that holds RDD partitions and runs Tasks (T).)
An RDD can be created 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc)
RDD (one log line per element, spread across 4 partitions):
- Partition 1: Error, ts, msg1 | Warn, ts, msg2 | Error, ts, msg1
- Partition 2: Info, ts, msg8 | Warn, ts, msg2 | Info, ts, msg8
- Partition 3: Error, ts, msg3 | Info, ts, msg5 | Info, ts, msg5
- Partition 4: Error, ts, msg4 | Warn, ts, msg9 | Error, ts, msg1

RDDs are made of multiple partitions (more partitions = more parallelism)
# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))

// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
- Take an existing in-memory collection and pass it to SparkContext's parallelize method
- Not generally used outside of prototyping and testing, since it requires the entire dataset in memory on one machine
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
- There are other methods to read data from HDFS, C*, S3, HBase, etc.
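A hedged sketch of that: the same textFile() call accepts different URI schemes (the NameNode host, bucket name, and paths below are placeholders, and sc is the SparkContext the pyspark shell creates for you).

# sc is the SparkContext created by the pyspark shell; every endpoint below is a placeholder.
hdfs_lines = sc.textFile("hdfs://namenode:8020/data/README.md")  # an HDFS path
s3_lines = sc.textFile("s3n://my-bucket/logs/2015-01-29.log")    # S3, via the s3n:// scheme of the Spark 1.x era
many_files = sc.textFile("/path/to/dir/*.txt")                   # globs over a directory of files also work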
RDD (4 partitions of log lines):
- Partition 1: Error, ts, msg1 | Warn, ts, msg2 | Error, ts, msg1
- Partition 2: Info, ts, msg8 | Warn, ts, msg2 | Info, ts, msg8
- Partition 3: Error, ts, msg3 | Info, ts, msg5 | Info, ts, msg5
- Partition 4: Error, ts, msg4 | Warn, ts, msg9 | Error, ts, msg1

filter(lambda line: "Error" in line) -> a new RDD, still 4 partitions, keeping only the error lines:
- Partition 1: Error, ts, msg1 | Error, ts, msg1
- Partition 2: (empty)
- Partition 3: Error, ts, msg3
- Partition 4: Error, ts, msg4 | Error, ts, msg1

coalesce(2) -> a new RDD with just 2 partitions:
- Partition 1: Error, ts, msg1 | Error, ts, msg1
- Partition 2: Error, ts, msg3 | Error, ts, msg4 | Error, ts, msg1

cache() -> keep this RDD in memory

count() -> 5 (an action; the result goes back to the Driver)

filter(lambda line: "msg1" in line) -> a new RDD, computed against the cached data:
- Partition 1: Error, ts, msg1 | Error, ts, msg1
- Partition 2: Error, ts, msg1

collect() -> brings the remaining lines back to the Driver

saveToCassandra() -> alternatively, write the results out to C* (DataStax connector)
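Putting that sequence together, here is a minimal PySpark sketch of the same pipeline (the log file path is a placeholder and sc comes from the pyspark shell; saveToCassandra() is only sketched in a comment because it comes from the DataStax connector, not Spark core).

linesRDD = sc.textFile("/path/to/app.log")              # raw log lines, several partitions

errorsRDD = (linesRDD
             .filter(lambda line: "Error" in line)      # keep only the error lines
             .coalesce(2)                                # shrink to 2 partitions
             .cache())                                   # keep the filtered RDD in memory

print(errorsRDD.count())                                 # action: 5 in the example above

msg1RDD = errorsRDD.filter(lambda line: "msg1" in line)  # reuses the cached data
print(msg1RDD.collect())                                 # action: bring the matches back to the Driver

# errorsRDD.saveToCassandra(...)  -- only available once the DataStax connector is loaded,
#                                    and (at the time of these slides) via its Scala/Java API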
map() intersection() cartesian()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
(lazy)
- Most transformations are element-wise (they work on one element at a time), but this is not true for all transformations
reduce() takeOrdered()
collect() saveAsTextFile()
count() saveAsSequenceFile()
first() saveAsObjectFile()
take() countByKey()
takeSample() foreach()
saveToCassandra() ...
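Transformations from the first list only record lineage; nothing executes until an action from the second list is called. A small sketch (placeholder path, sc from the pyspark shell):

rdd = sc.textFile("/path/to/big.log")          # returns immediately; no data is read yet
errors = rdd.filter(lambda l: "Error" in l)    # still lazy; only the lineage is extended

n = errors.count()                             # the action finally triggers reading and filtering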
.cache() == .persist(MEMORY_ONLY)
MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. (default)

MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects.

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY: Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental): Store RDD in serialized format in Tachyon. Reduces garbage collection overhead and allows Executors to be smaller.
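In PySpark the same levels are exposed on pyspark.StorageLevel; a minimal sketch (the RDD itself is a toy placeholder, sc from the pyspark shell):

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))            # toy RDD

rdd.cache()                                  # shorthand for persist(MEMORY_ONLY), as noted above
rdd.unpersist()                              # a storage level can only be changed after unpersisting

rdd.persist(StorageLevel.MEMORY_AND_DISK)    # spill partitions that don't fit in memory to disk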
DEMO #2:
Examples of narrow and wide dependencies.
Each box is an RDD, with partitions shown as shaded rectangles.
(Figure: a lineage graph of RDDs A: through F: connected by map, filter, groupBy, and join; each box is an RDD, shaded rectangles are partitions, and highlighted partitions are cached. The graph is cut into Stages 1-3 at the wide dependencies.)

(Figure: Stages 1 through 4 together make up Job #1.)
(Figure: a chain of narrow transformations such as filter(), map(), map(), and mapPartitions() is pipelined inside a single stage; wide transformations such as groupByKey() and repartition() force a shuffle and start a new stage.)
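One way to see where the stage boundaries fall is an RDD's toDebugString(), which prints the lineage including its shuffle dependencies; a sketch assuming sc from the pyspark shell:

pairs = (sc.parallelize(["a b a", "b c"])
           .flatMap(lambda line: line.split())   # narrow: pipelined into the same stage
           .map(lambda w: (w, 1)))               # narrow: still the same stage

counts = pairs.reduceByKey(lambda a, b: a + b)   # wide: forces a shuffle, so a new stage begins

print(counts.toDebugString())                    # prints the lineage; the reduceByKey shuffle is the stage boundary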
- Local
- Standalone Scheduler
- YARN
- Mesos
Static Partitioning
Dynamic Partitioning
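The cluster manager is picked by the master URL when the SparkContext is created; a hedged sketch with placeholder hosts and illustrative values:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("my-app")
        .setMaster("spark://master-host:7077")  # Standalone; "local[*]", "yarn-client", or "mesos://..." also work
        .set("spark.cores.max", "4")            # cap on cores for Standalone / coarse-grained Mesos
        .set("spark.executor.memory", "512m"))

sc = SparkContext(conf=conf)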
(Figure: Spark Standalone mode, DSE implementation. Each Worker machine runs the OS, a Worker (W) process, and an Executor (Ex) that holds RDD partitions and runs Tasks (T). The Spark Master is the coarse-grained scheduler and, in DSE, is made HA via the Cassandra system table; the Driver runs the task scheduler.)

spark.executor.memory = 512 MB
spark.cores.max = *
(Figure: a single DSE node running Cassandra (C*) alongside the Spark Master (coarse-grained scheduler), a Worker, and an Executor holding RDD partitions and running Tasks, plus the Driver with its task scheduler. The slide shows 512 MB allocations for the Executor and the Driver, and 16 - 256 MB ranges that are set dynamically.)

SPARK_TMP_DIR="/tmp/spark"
SPARK_RDD_DIR="/var/lib/spark/rdd"
Default Memory Allocation in Executor JVM:
- 60% Cached RDDs (spark.storage.memoryFraction)
- 20% Shuffle memory (spark.shuffle.memoryFraction)
- 20% User Programs (remainder)
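Those fractions are ordinary configuration properties; a sketch of overriding them (the values are illustrative, not recommendations):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-tuning")
        .set("spark.storage.memoryFraction", "0.5")   # cached RDDs (default 0.6)
        .set("spark.shuffle.memoryFraction", "0.3"))  # shuffle buffers (default 0.2)

sc = SparkContext(conf=conf)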
https://github.com/datastax/spark-cassandra-connector
Spark Executor
Spark-C*
Connector
C* Java Driver
- Open Source
- Implemented mostly in Scala
- Scala + Java APIs
- Does automatic type conversions
val cassandraRDD = sc
  .cassandraTable("ks", "mytable")   // keyspace, table
  .select("col-1", "col-3")          // server-side column selection
  .where("col-5 = ?", "blue")        // server-side row selection
(Figure: rows of a Cassandra table (columns c1, c2) spread across the cluster by token are read into an RDD (Resilient Distributed Dataset); each token range becomes one of Partitions 1 - 4.)
Every Spark task uses a CQL-like query to fetch
data for a given token range:

SELECT "key", "value"
FROM "keyspace"."table"
WHERE
  token("key") > 384023840238403 AND
  token("key") <= 38402992849280
ALLOW FILTERING
Configuration Settings
- /etc/dse/cassandra/cassandra.yaml (Cassandra settings)
- /etc/dse/dse.yaml (DataStax Enterprise settings)
- /etc/dse/spark/spark-env.sh (Spark settings)
Configuration Settings
/etc/dse/spark/spark-env.sh (Spark Settings)

export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_WORKER_WEBUI_PORT=7081

# export SPARK_EXECUTOR_MEMORY="512M"
# export DEFAULT_PER_APP_CORES="1"

# Set the amount of memory used by Spark Worker - if uncommented, it
# overrides the setting initial_spark_worker_resources in dse.yaml.
# export SPARK_WORKER_MEMORY=2048m

# The amount of memory used by the Spark Driver program
export SPARK_DRIVER_MEMORY="512M"

# Directory where RDDs will be cached
export SPARK_RDD_DIR="/var/lib/spark/rdd"

# The directory for storing master.log and worker.log files
export SPARK_LOG_DIR="/var/log/spark"
Configuration Settings
/etc/dse/dse.yaml (DataStax Enterprise Settings)

# The fraction of available system resources to be used by Spark Worker.
# This is only the initial value; once it is reconfigured, the new value is
# stored and retrieved on next run.
initial_spark_worker_resources: 0.7

Spark worker memory = initial_spark_worker_resources * (total system memory - memory assigned to C*)
Spark worker cores = initial_spark_worker_resources * total system cores
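A worked example of those formulas with assumed numbers (a node with 32 GB of RAM and 8 cores, where roughly 8 GB has been assigned to the Cassandra JVM):

initial_spark_worker_resources = 0.7

total_system_memory_gb = 32   # assumed node size
cassandra_memory_gb = 8       # assumed memory already assigned to C*
total_system_cores = 8

worker_memory_gb = initial_spark_worker_resources * (total_system_memory_gb - cassandra_memory_gb)
worker_cores = int(initial_spark_worker_resources * total_system_cores)

# worker_memory_gb -> 16.8 GB, worker_cores -> 5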
You can either push the row selection down to Cassandra with where(), or read the table into an RDD and narrow it with a Spark-side filter():

sc.cassandraTable("KS", "TB").select("C-1", "C-2")
  .where("C-3 = ?", "black")

(Figure: the same rows (c1, c2 values) either come back from C* already filtered, or arrive as a full RDD that a filter() transformation then reduces.)
Interesting Spark Settings
- spark.speculation = false
- spark.locality.wait = 3000
- spark.local.dir = /tmp
- spark.serializer = JavaSerializer (or KryoSerializer)
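For example, switching to Kryo and tweaking the other two from an application (the serializer class name is Spark's own; the app name and scratch directory are placeholders):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-example")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.local.dir", "/mnt/spark-tmp")   # scratch space for shuffle files (placeholder path)
        .set("spark.locality.wait", "3000"))        # milliseconds to wait for a data-local task slot

sc = SparkContext(conf=conf)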
• HadoopRDD
• FilteredRDD
• MappedRDD
• PairRDD
• ShuffledRDD
• UnionRDD
• PythonRDD
• DoubleRDD
• JdbcRDD
• JsonRDD
• SchemaRDD
• VertexRDD
• EdgeRDD
• CassandraRDD (DataStax)
• GeoRDD (ESRI)
• EsSpark (ElasticSearch)
Spark sorted the same data 3X faster
using 10X fewer machines
than Hadoop MR in 2013.
Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
100TB Daytona Sort Competition 2014
More info:
http://sortbenchmark.org
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
All the sorting took place on disk (HDFS) without
using Spark’s in-memory cache!
- Stresses “shuffle” which underpins everything from SQL to MLlib
- Sorting is challenging b/c there is no reduction in data
- Sort 100 TB = 500 TB disk I/O and 200 TB network
Engineering Investment in Spark:
- Sort-based shuffle (SPARK-2045)
- Netty native network transport (SPARK-2468)
- External shuffle service (SPARK-3796)
Clever Application level Techniques:
- GC and cache friendly memory layout
- Pipelining
EC2: i2.8xlarge
(206 workers)
- Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores
- 244 GB of RAM
- 8 x 800 GB SSD and RAID 0 setup formatted with ext4
- ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes
- Each record: 100 bytes (10 byte key & 90 byte value)
- OpenJDK 1.7
- HDFS 2.4.1 w/ short circuit local reads enabled
- Apache Spark 1.2.0
- Speculative Execution off
- Increased Locality Wait to infinite
- Compression turned off for input, output & network
- Used Unsafe to put all the data off-heap and managed
it manually (i.e. never triggered the GC)
- 32 slots per machine
- 6,592 slots total
Map() Map() Map() Map()
Reduce() Reduce() Reduce()
- Entirely bounded by I/O reading from HDFS and writing out locally sorted files
- Mostly network bound
- < 10,000 reducers
- Notice that map has to keep 3 file handles open
- TimSort
Map() Map() Map() Map()
(28,000 blocks)
RF = 2
250,000+ reducers!
- Only one file handle open at a time
Map() Map() Map() Map()
(28,000 blocks)
- 5 waves of maps
- 5 waves of reduces
Reduce() Reduce() Reduce()
RF = 2
RF = 2
250,000+ reducers!
MergeSort!
TimSort
Sustaining 1.1GB/s/node during shuffle
- Actual final run
- Fully saturated
the 10 Gbit link
PySpark at a Glance
- Write Spark jobs in Python
- Run interactive jobs in the shell
- Supports C extensions

(Not yet supported by the DataStax open source Spark connector… but soon.)
However, it is currently supported in DSE 4.6
PySpark - C*
- DSE has an implementation of PySpark that
supports reading and writing to C*
- But it is not in the open source connector (yet)
(Figure: PySpark is built on the Java API, which wraps the Spark Core Engine (Scala); the engine runs on Local, the Standalone Scheduler, YARN, or Mesos.)

PySpark itself: 41 files, 8,100 lines of code, 6,300 comments
Process data in Python and persist/transfer it in Java
(Figure: PySpark architecture. On the Driver Machine, the Python Spark Context acts as a controller for a Spark Context in the Driver JVM, talking to it over a Py4J socket. On each Worker Machine, the Executor JVM launches Python worker processes through daemon.py and exchanges data with them over pipes and local disk; the user's functions F(x) run in Python against the RDD partitions, while MLlib, SQL, and shuffle run inside the JVM.)
Data is stored as pickled objects in an RDD[Array[Byte]]: a HadoopRDD wrapped in a MappedRDD, wrapped in a PythonRDD (each pickled object is 100 KB – 1 MB).
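A rough illustration of what that batching means on the Python side (plain pickle here, purely illustrative, not PySpark's internal code):

import pickle

batch = ["Error, ts, msg1", "Warn, ts, msg2"] * 5000   # one batch of Python objects
blob = pickle.dumps(batch)                             # the batch becomes a single opaque byte array

# The JVM side only ever handles byte arrays like blob (hence RDD[Array[Byte]]);
# a batch like this pickles to a few hundred KB, in the 100 KB - 1 MB range mentioned above.
records = pickle.loads(blob)                           # Python workers unpickle to get the objects back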
Choose Your Python Implementation
- CPython (the default python)
- PyPy: JIT, so faster; less memory; CFFI support
(Figure: the chosen interpreter runs next to the Spark Context on the Driver Machine and on every Worker Machine.)

$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/pyspark

OR

$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
The performance speed-up depends on the workload (anywhere from 20% to 3000%). Here are some benchmarks:

Job          CPython 2.7    PyPy 2.3.1    Speed up
Word Count   41 s           15 s          2.7 x
Sort         26 s           44 s          1.05 x
Stats        174 s          3.6 s         48 x
Here is the code used for the benchmark:

rdd = sc.textFile("text")

def wordcount():
    rdd.flatMap(lambda x: x.split('/')) \
       .map(lambda x: (x, 1)) \
       .reduceByKey(lambda x, y: x + y) \
       .collectAsMap()

def sort():
    rdd.sortBy(lambda x: x, 1).count()

def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
https://github.com/apache/spark/pull/2144
http://tinyurl.com/dsesparklab
- 102 pages
- DevOps style
- For complete beginners
- Includes:
- Spark Streaming
- Dangers of
GroupByKey vs.
ReduceByKey
http://tinyurl.com/cdhsparklab
- 109 pages
- DevOps style
- For complete beginners
- Includes:
- PySpark
- Spark SQL
- Spark-submit
Q & A
We’re Hiring!
+ Data Scientist
+ JavaScript Engineer
+ Product Manager
+ Data Solutions Engineer
+ Support Operations Engineer
+ Interns
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Editor's Notes

  1. Q) Who knows why the project was named Spark? REMEMBER to say out loud the % of hands that go up for each of these: Q-1) How many people are Cassandra people.. sstables, bloom filters, vnodes? Q-2) How many people have launched a any Spark REPL like Scala or PySpark or Spark SQL and typed in anything at all? So, anyone who has hands-on experience with typing a Spark Transformation or Action? Q-3) I’ll be asking 2 questions, you can raise your hand for either or? How many people will be deploying Spark on Hadoop vs. people deploying Spark on Cassandra? Q-4) Will be deploying Spark in the Amazon Cloud? (You guys would be our ideal customers…) Mention that I was a Datastax Authorized Training partner and have been using C* since 2010. DS sent me to Singapore, Malaysia, Mexico, San Diego, NYC… just overall, I got to visit 20 countries and around 50 cities. But recently I started getting interested in Apache Spark and tried to learn it on my own and getting beyond the basics was really hard. So, I decided to join Databricks to *really* learn Spark. When you get stuck with something obscure like how preservesPartitioning works, it helps to be sitting a few feet away from Matei Zaharia when you’re studying how the engine that he built works. So, anyway, the goal of this talk teach you some of what I’ve learned in the past 6 months. I hope this talk will bootstrap you on the way to really understanding Spark and save you some of the headaches along the way. Say that the goal of this talk is to get you up and running with Spark’s Python API, basic programming in Spark, understanding it’s architecture and using SparkSQL
  2. Here’s the Agenda for tonight, in no particular order
  3. So, Data AmpLab was founded in Jan 2011 with a 6 year plan, has 65 students, faculty and postdocs. Matei designed the Fair Scheduler for Hadoop in mid-2008: https://issues.apache.org/jira/browse/HADOOP-3746 “The default job scheduler in Hadoop has a first-in-first-out queue of jobs for each priority level. The scheduler always assigns task slots to the first job in the highest-level priority queue that is in need of tasks. This makes it difficult to share a MapReduce cluster between users because a large job will starve subsequent jobs in its queue, but at the same time, giving lower priorities to large jobs would cause them to be starved by a stream of higher-priority jobs. This JIRA proposes a job scheduler based on fair sharing. Fair sharing splits up compute time proportionally between jobs that have been submitted, emulating an "ideal" scheduler that gives each job 1/Nth of the available capacity. When there is a single job running, that job receives all the capacity. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that everyone gets roughly the same amount of compute time. This lets short jobs finish in reasonable amounts of time while not starving long jobs. This is the type of scheduling used or emulated by operating systems - e.g. the Completely Fair Scheduler in Linux. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that a job should get.”
  4. Remind audience to add me on LinkedIn and I’ll let them know of my next upcoming trainings.
  5. Spark runs everywhere.. Hadoop, Mesos, standalone or in the cloud.. Amazon, Rackspace, Azure. Even Docker, the list just goes on and on… So, our goal is that you’ll be able to learn one processing engine and then re-use your knowledge against any data source or environment. But you do have to make that initial investment in learning Spark, and hopefully this presentation will get you going along the way. A philosophy of tight integration has several benefits. First, all libraries and higher level components in the stack benefit from improvements at the lower layers. For example, when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. Second, the costs (deployment, maintenance, testing, support) associated with running the stack are minimized, because instead of running 5-10 independent software systems, an organization only needs to run one. This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to upgrading Spark. Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously analysts can query the resulting data, also in real-time, via SQL, e.g. to join the data with unstructured log files. In addition, more sophisticated data engineers & data scientists can access the same data via the Python shell for ad-hoc analysis. Others might access the data in standalone batch applications. All the while, the IT team only has to maintain one software stack. Why Scala from Matei: “When we started Spark, we wanted it to have a concise API for users, which Scala did well. At the same time, we wanted it to be fast (to work on large datasets), so many scripting languages didn't fit the bill. Scala can be quite fast because it's statically typed and it compiles in a known way to the JVM. Finally, running on the JVM also let us call into other Java-based big data systems, such as Cassandra, HDFS and HBase.” Note that Python 3 is not yet supported. You will have to use Python 2.6 or a newer 2.x branch.
  6. Quote, Keep the simple things simple and make the complex possible. Today we’ll just keep the simple things simple and not get into the complexities of Machine Learning and advanced visualizations
  7. This was just a preview academic paper
  8. By the way, the original Google MapReduce whitepaper from 2008 was cited 12,600 times.
  9. In the code: Counting tweets on a sliding window Build applications through high-level operators. Stateful exactly-once semantics out of the box. Combine streaming with batch and interactive queries.
  10. In the code: Apply functions to results of SQL queries. Run unmodified Hive queries on existing warehouses. Connect through JDBC or ODBC. Special Interest Group for Management of Data This was an invited paper for the Industrial track to be held in Melbourne, Australia in early June 2015
  11. You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise) In YARN the driver is HA via YARN primitives
  12. You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise) In YARN the driver is HA via YARN primitives
  13. You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise) In YARN the driver is HA via YARN primitives As far as choosing a "good" number you generally want at least as many as the number of executors for parallelism. There already exists some logic to try and determine a "good" amount of parallelism, and you can get this value by calling sc.defaultParallelism
  14. You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise) In YARN the driver is HA via YARN primitives
  15. You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise) In YARN the driver is HA via YARN primitives
  16. You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise) In YARN the driver is HA via YARN primitives Coalesce: “Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.” The coalesce function is only used to reduce the number of partitions
  17. NEED TO FIX ANIMATIONS HERE (after save to C* addition) You can make the Driver HA in standalone by setting the Boolean for –supervise to true when using Spark Submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise) In YARN the driver is HA via YARN primitives
  18. To-do: Put the ones for k/v pairs in a different color Map: Return a new distributed dataset formed by passing each element of the source through a function func. FlatMap: Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). Filter: Return a new dataset formed by selecting those elements of the source on which func returns true. Pipe: Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings. Sample: Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed. Coalesce: Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
  19. Actions force the evaluation of the transformations required for the RDD they are called on, since they are required to actually produce output.
  20. Wide dependency pretty much always means a shuffle Map, mappartitions, filters are cheap and can be pipelined Two different types of transformations: Narrow & Wide Wide dependencies are MORE expensive. Narrow transformations: where the input data depends on a constant # of partitions of the parent data. Like map, filter, mapPartitions. Wide transformations: where the input data depends on a variable # of partitions of the parent data. Like groupByKey or reduceByKey. Some operations can be narrow or wide depending on if the data has the same partitioning, like join.
  21. Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, In black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline nar- row transformations inside each stage. In this case, stage 1’s output RDD is already in RAM, so we run stage 2 and then 3. Sameer notes: Here the RDD B is hash partitioned, and is probably a much larger RDD than E. The data from E is having to be sent over the network.. More on this later. Not correct: Even during the join, Spark understands that these 2 RDDs (from Stage 1 and Stage 2) are hash partitioned to the same node. So, there is no need to over the network shuffling to make the Stage 3 RDD.
  22. The set of stages produced for a particular action is termed a job. In each case when we invoke actions such as count, we are creating a job composed of one or more stages. Once the stage graph is defined, tasks are created and dispatched to an internal scheduler, which varies depending on the deployment mode being used. Stages in the physical plan can depend on each other, based on the RDD lineage, so they will be executed in a specific order. For instance, a stage that outputs shuffle data must occur before one that relies on that data being present.
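You can inspect the stage structure a job will use from the RDD lineage; a small Scala sketch (the path is made up):

  val counts = sc.textFile("hdfs:///logs")
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  println(counts.toDebugString)   // prints the lineage; indentation marks the shuffle (stage) boundaries
  counts.count()                  // the action creates a job made up of those stages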
  24. Mention that in YARN we can set spark.dynamicAllocation.enabled to true, along with minExecutors, maxExecutors and 3 timeouts. In cluster mode Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver. The cluster manager is a pluggable component in Spark. Each of these resource managers has different pros and cons. One nice thing about running Spark in YARN, for example, is that you can dynamically resize the number of executors in the application. This feature is not yet possible in standalone mode. Static partitioning: with this approach, each application is given a maximum amount of resources it can use, and holds onto them for its whole duration. Mesos: in "fine-grained" mode (the default), each Spark task runs as a separate Mesos task. This allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity, where each application gets more or fewer machines as it ramps up and down, but it comes with an additional overhead in launching each task. This mode may be inappropriate for low-latency requirements like interactive queries or serving web requests. The "coarse-grained" mode will instead launch only one long-running Spark task on each Mesos machine, and dynamically schedule its own "mini-tasks" within it. The benefit is much lower startup overhead, but at the cost of reserving the Mesos resources for the complete duration of the application. In standalone and Mesos coarse-grained mode, you can control the maximum amount of resources Spark will acquire. By default, it will acquire all cores in the cluster (that get offered by Mesos), which only makes sense if you run just one application at a time. You can cap the max number of cores using conf.set("spark.cores.max", "10") (for example).
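A sketch of what those settings look like on a SparkConf (the numbers are arbitrary examples; the dynamic-allocation keys apply when running on YARN and, as far as I recall, also require the external shuffle service):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("resource-sizing-example")
    .set("spark.dynamicAllocation.enabled", "true")      // YARN: grow/shrink the number of executors
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "20")
    .set("spark.shuffle.service.enabled", "true")        // so executors can be removed without losing shuffle files
    .set("spark.cores.max", "10")                        // standalone / Mesos coarse-grained: cap total cores
  val sc = new SparkContext(conf)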
  25. In YARN, ZooKeeper makes the SM HA. You can make the driver HA in standalone by passing --supervise to spark-submit (https://github.com/apache/spark/search?utf8=%E2%9C%93&q=supervise); in YARN the driver is HA via YARN primitives. spark.cores.max = the maximum number of CPU cores to request for the application from across the cluster (not from each machine). If not set, the default will be spark.deploy.defaultCores on Spark's standalone cluster manager.
  26. As far as C* goes, for example, in my 15 GB VM the Cassandra JVM gets about 3.5 GB of RAM. The driver runs a task scheduler within it; this scheduler schedules tasks on different executors. When running in cluster mode, Spark utilizes a master-slave architecture with one central coordinator (called the driver) and many distributed executors. A driver and its executors are together termed a Spark application. A Spark driver is responsible for compiling a user program into units of physical execution called tasks.
# Directory for Spark temporary files. It will be used by the Spark Master, Spark Worker, Spark Shell and Spark applications.
export SPARK_TMP_DIR="/tmp/spark"
# Directory where RDDs will be cached
export SPARK_RDD_DIR="/var/lib/spark/rdd"
  27. RDD storage: when you call .persist() or .cache(), Spark will limit the amount of memory used for caching to a certain fraction of the JVM's overall heap, set by spark.storage.memoryFraction. Shuffle and aggregation buffers: when performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations, in addition to buffering data that is going to be directly output as part of the shuffle. User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to everything "left" in the JVM heap after the space for RDD storage and shuffle storage is allocated.
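As a sketch, the two fractions mentioned above are plain SparkConf settings (the values shown are, to the best of my knowledge, the defaults in this generation of Spark):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.storage.memoryFraction", "0.6")    // share of the executor heap reserved for cached RDDs
    .set("spark.shuffle.memoryFraction", "0.2")    // share of the heap for shuffle/aggregation buffers
  // user code gets whatever is left of the executor heap after these two regions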
  28. You can also specify the column order by stating explicitly which columns you want returned…
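With the DataStax spark-cassandra-connector that looks roughly like this (keyspace, table and column names are made up):

  import com.datastax.spark.connector._

  // select() limits the columns fetched from Cassandra and fixes the order they come back in
  val users = sc.cassandraTable("my_keyspace", "users")
    .select("user_id", "email", "signup_date")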
  29. Spark asks an RDD for a list of its partitions (splits). Each split consists of one or more token ranges. For every RDD partition: Spark asks the RDD for a list of preferred nodes to process it on; Spark then creates a task and sends it to one of those nodes for execution.
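A rough, illustrative Scala sketch of the RDD hooks involved (this is not the connector's real code; the partition class and host lists are invented): getPartitions supplies the splits, and getPreferredLocations tells Spark which replica nodes to try to schedule each task on.

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical partition type: one split covering some token ranges, with its replica hosts
  case class TokenRangePartition(index: Int, hosts: Seq[String]) extends Partition

  class TokenRangeRDD(sc: SparkContext, splits: Seq[Seq[String]]) extends RDD[String](sc, Nil) {
    override protected def getPartitions: Array[Partition] =
      splits.zipWithIndex.map { case (hosts, i) => TokenRangePartition(i, hosts): Partition }.toArray

    override protected def getPreferredLocations(split: Partition): Seq[String] =
      split.asInstanceOf[TokenRangePartition].hosts   // Spark tries to run the task on these nodes first

    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      Iterator.empty                                  // a real implementation would read the token ranges here
  }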
  30. DataStax Enterprise 4.5.2 and later can control the memory and cores offered by particular Spark Workers in semi-automatic fashion. The initial_spark_worker_resources parameter in the dse.yaml file specifies the fraction of system resources available to the Spark Worker. The SPARK_WORKER_MEMORY option configures the total amount of memory that you can assign to all executors that a single Spark Worker runs on the particular node. The SPARK_WORKER_CORES option configures the number of cores offered by the Spark Worker for use by executors. A single executor can borrow more than one core from the worker. The number of cores used by the executor relates to the number of parallel tasks the executor might perform. The number of cores offered by the cluster is the sum of cores offered by all the workers in the cluster.
  31. export SPARK_REPL_MEM="256M" will be removed in newer versions of DSE: "SPARK_REPL_MEM is a left-over from some earlier Spark version where I guess SPARK_DRIVER_MEM was not present. We're going to remove this setting in the next version. I created a ticket for this: https://datastax.jira.com/browse/DSP-4495"
  33. To filter rows, you can use the filter transformation provided by Spark. The filter transformation fetches all rows from Cassandra first and then filters them in Spark, so some CPU cycles are wasted serializing and deserializing objects excluded from the result. To avoid this overhead, CassandraRDD has a method that passes an arbitrary CQL condition to filter the row set on the server.
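Roughly, with the connector (the table, column and value are made up):

  import com.datastax.spark.connector._

  // filter() pulls every row into Spark and filters there...
  val filteredInSpark = sc.cassandraTable("ks", "events").filter(row => row.getInt("year") == 2014)
  // ...while where() pushes the CQL predicate down to the Cassandra nodes
  val filteredInCassandra = sc.cassandraTable("ks", "events").where("year = ?", 2014)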
  34. spark.speculation enables speculative execution of tasks, so tasks that are running slowly will be re-launched (this can help cut down on straggler tasks in large clusters). Speculative execution only kicks in after 75% of tasks are done (this is tracked separately for each stage). It is off by default. There are 3 other settings: spark.speculation.interval (default 100; how often Spark will check for tasks to speculate, in milliseconds), spark.speculation.quantile (default 0.75; percentage of tasks which must be complete before speculation is enabled for a particular stage), spark.speculation.multiplier (default 1.5; how many times slower a task must be than the median to be considered for speculation). spark.locality.wait defaults to 3000 ms. This is how long to wait to launch a data-local task before giving up and launching it on a less-local node. There are many locality levels (process-local, node-local, rack-local, any). You should increase this setting if your tasks are long and you see poor locality. spark.serializer: class to use for serializing objects that will be sent over the network or need to be cached in serialized form. JavaSerializer is slow, but works with any object. Consider using org.apache.spark.serializer.KryoSerializer. Can be any subclass of org.apache.spark.Serializer.
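As a Scala sketch, those knobs set on a SparkConf (the values are the defaults described above, except that speculation is switched on and Kryo is chosen explicitly):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.speculation", "true")                  // speculative execution is off by default
    .set("spark.speculation.interval", "100")          // ms between checks for slow tasks
    .set("spark.speculation.quantile", "0.75")         // fraction of tasks that must finish before speculating
    .set("spark.speculation.multiplier", "1.5")        // how much slower than the median counts as "slow"
    .set("spark.locality.wait", "3000")                // ms to wait for a data-local slot before downgrading locality
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")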
  35. Research some cool things to say about the different RDDs and what they are used for
  36. Why Scala from Matei: “When we started Spark, we wanted it to have a concise API for users, which Scala did well. At the same time, we wanted it to be fast (to work on large datasets), so many scripting languages didn't fit the bill. Scala can be quite fast because it's statically typed and it compiles in a known way to the JVM. Finally, running on the JVM also let us call into other Java-based big data systems, such as Cassandra, HDFS and HBase.”
  37. Organizations from around the world often build dedicated sort machines (specialized software and sometimes specialized hardware) to compete in this benchmark. Spark actually tied for 1st place with a team from the University of California, San Diego, which has been working on a specialized sorting system called TritonSort. Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. Named after Jim Gray, the benchmark workload is resource-intensive by any measure: sorting 100 TB of data following the strict rules generates 500 TB of disk I/O and 200 TB of network I/O (you have to replicate the output to make it fault tolerant). This is also the first time a system based on a public cloud has won.
  38. Zero-copy (terminology): with Netty, the data from disk only gets sent to one NIC buffer and is sent out to the other node from there. The older (normal) implementation would have to first copy to the file system's kernel buffer cache, then to an Executor JVM user-space buffer, and then to the kernel NIC buffer and out to the remote reducer node. GC: Netty does its own explicit memory management (malloc outside of the JVM). Netty runs inside the normal Executor JVM process, allocates a bunch of memory buffers off-heap, and manages these transport buffers entirely by itself.
  39. In HDFS, reads normally go through the DataNode. Thus, when the client asks the DataNode to read a file, the DataNode reads that file off of the disk and sends the data to the client over a TCP socket. So-called "short-circuit" reads bypass the DataNode, allowing the client to read the file directly. Obviously, this is only possible in cases where the client is co-located with the data. Short-circuit reads provide a substantial performance boost to many applications. In hdfs-site.xml: <name>dfs.client.read.shortcircuit</name> <value>true</value>
  40. No network usage at all in the map phase. In the reduce stage, each reducer fetches its own range of data, as determined by the partitioner, from all of the map outputs. It then sorts these outputs using TimSort and writes the sorted output to HDFS. 48 MB is the amount in flight at a time: in both shuffles, a reducer talks to 5 maps at a time and grabs 48 MB from each. Right now there are 3 file handles open for each map.
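If I recall correctly, the 48 MB figure corresponds to a tunable setting in this version of Spark, roughly:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.reducer.maxMbInFlight", "48")   // max MB of map output each reduce task fetches simultaneously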
  41. The sort-based shuffle basically doesn't require writing a separate file for each reduce task from each mapper. It is the default in 1.2.
  42. The sort-based shuffle basically doesn't require writing a separate file for each reduce task from each mapper. The reducer hasn't changed (it's basically the same); only where you fetch the data from has changed.
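As a sketch, the shuffle implementation is selectable via configuration (sort is the default as of Spark 1.2):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.shuffle.manager", "sort")   // or "hash" to fall back to the older hash-based shuffle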
  44. This is based on regular CPython; it's not Jython. So that means all Python libraries are compatible: if you use NumPy or NLTK or something else that has a C extension, you can import that into PySpark. It's a regular, unmodified Python interpreter right now.
  45. Built on top of the Java API. It's a very thin layer that sits on top of Scala and Java Spark. It provides a way to get data in and out of Python, and it reuses a lot of the Spark Scala functionality.
  46. You process all the data in Python, so all of the UDFs you write will be Python. But Spark will persist and transfer it through the network and storage layers in Scala + Java.
  47. Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Essentially, you just call Java methods from your Python code as if they were Python methods. These are executor JVMs where num_cores = 2. In the Executor JVM, Pyrolite is used to unpickle or pickle the data. "This library allows your Java or .NET program to interface very easily with the Python world. It uses the Pyro protocol to call methods on remote objects. (See https://github.com/irmen/Pyro4). Pyrolite contains and uses a feature complete pickle protocol implementation to exchange data with Pyro/Python." Why does the local disk get used on the driver machine? Py4J is very inefficient for transferring large data because it uses a text-based protocol: it has to encode in base64, which is really slow. A collect or broadcast may be huge, so Py4J writes one file to the local FS. The final unpickling happens in the Python process on the driver machine. Python functions and closures (with dependencies) are serialized using PiCloud's CloudPickle module. You could also use Py4J directly: there is an example on the front page and lots of documentation, but essentially you just call Java methods from your Python code as if they were Python methods. You first write a Java program where you create a class and run that Java program in a JVM; then you write a Python program that can access that Java object via a network socket. Here is what happens behind the scenes. On my local machine, I have the SparkContext object, the top-level entry point for Spark, used to create RDDs. The SparkContext runs in the Python process. When a SparkContext is made, behind the scenes it creates a Java SparkContext, running on top of regular Spark. Py4J is used to talk (copy data) between the JVM and Python using reflection, so you can make a Java call just as if it were a native Python method. It's pretty cool. So the Python SparkContext is controlling the Java SparkContext. A local file system is also used for communication, through multiple files, for doing things like collecting results. The integration with Py4J and Python is fine for control-plane stuff, but you don't want to transfer a whole lot of data through it because reflection is just too expensive. When you submit a job to the cluster, your job (the Python transformations and so on) gets mapped down to just a few transformations on a special Python RDD object. Behind the scenes, when these objects are shipped to the Spark executor, they actually spin up a bunch of actual Python processes, and the Spark executors communicate with the Python code over pipes, using a special format to pass the functions that the user is writing. So the executor will pipe all the data into the Python processes, get all the results back, serialize them and transfer them; naturally, there is some serialization involved. There is one Python process per executor thread (we need this due to Python's GIL, so that CPU-bound tasks can benefit from multiple cores). In older versions of PySpark, we'd end up launching a new Python worker process per task. In newer versions (1.2.0+), executors launch Python processes as needed, keep them around in a pool, and terminate them if they're idle for too long (1 minute by default); see https://github.com/apache/spark/pull/2259 and the follow-up patches for more details.
The actual subprocess architecture is a little bit complicated because we use a separate process for managing the Python worker processes in order to work around some fork() performance issues in the JVM. This may mean that you'll see up to 5 Python processes on an executor with 4 threads: one Python process per core, plus the daemon.py middle layer; see https://github.com/apache/spark/pull/1680 for some additional context. Python UDFs are executed in regular Python interpreters, so things like rdd.map(lambda x: some_arbitrary_python_code(x)) are run in Python interpreters, not through anything like Jython. Spark SQL and MLlib may actually end up executing code that runs in the JVM for doing heavy number-crunching, and I expect that we'll see more in-JVM execution of certain operations in future versions of Spark. See also: the (woefully out-of-date) PySpark internals page on the Spark wiki: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. I really ought to update this page to reflect some of these recent developments.
  48. CPython is the original Python implementation; it is the implementation you download from python.org. CPython compiles your Python code into bytecode (transparently) and interprets that bytecode in an evaluation loop, so CPython does not translate your Python code to C by itself; it instead runs an interpreter loop. There is a project that does translate Python-ish code to C, and that is called Cython. - - - CFFI is a C Foreign Function Interface for Python. The goal is to provide a convenient and reliable way to call compiled C code from Python using interface declarations written in C. - - - PyPy uses just-in-time compilation (JIT), or dynamic translation, so it runs faster. This means that compilation is done during the execution of the program (at run time) rather than just prior to execution. A JIT compiler is a program that turns bytecode (instructions that must be interpreted) into instructions that can be sent directly to the processor. After you've written a Java program, the source language statements are compiled by the Java compiler into bytecode rather than into code that contains instructions that match a particular hardware platform's processor (for example, an Intel Pentium microprocessor or an IBM System/390 processor). The bytecode is platform-independent code that can be sent to any platform and run on that platform. In the past, most programs written in any language have had to be recompiled, and sometimes rewritten, for each computer platform. One of the biggest advantages of Java is that you only have to write and compile a program once: the JVM on any platform will interpret the compiled bytecode into instructions understandable by the particular processor. However, the virtual machine handles one bytecode instruction at a time. Using the Java just-in-time compiler (really a second compiler) on the particular system platform compiles the bytecode into the particular system code (as though the program had been compiled initially on that platform). Once the code has been (re-)compiled by the JIT compiler, it will usually run more quickly on that computer. - - - Good explanation of JIT: a JIT compiler runs after the program has started and compiles the code (usually bytecode or some kind of VM instructions) on the fly (or just-in-time, as it's called) into a form that's usually faster, typically the host CPU's native instruction set. A JIT has access to dynamic runtime information, whereas a standard compiler doesn't, and can make better optimizations, like inlining functions that are used frequently. This is in contrast to a traditional compiler that compiles all the code to machine language before the program is first run. To paraphrase: conventional compilers build the whole program as an EXE file BEFORE the first time you run it. For newer-style programs, an assembly is generated with pseudocode (p-code). Only AFTER you execute the program on the OS (e.g., by double-clicking on its icon) will the (JIT) compiler kick in and generate machine code (m-code) that the Intel-based processor (or whatever) will understand. - - - - - Research: https://github.com/apache/spark/blob/master/bin/pyspark#L54 https://github.com/apache/spark/pull/2144
  49. In general, PySpark with CPython will run 3-10x slower than Scala/Java, unless you use PyPy.
  50. One more thing…
  51. This will definitely be sold out. Let the audience know that I'll be teaching the 1-day Advanced Training sessions at Spark Summit East (NYC) on March 18-19, or at Summit West.
  52. Mention that we're adding many interesting new features to DBC that will be revealed soon, and say that I demoed many features of DBC, from ETL to Exploration and Dashboards.