INTRO TO SPARK DEVELOPMENT
June 2015: Spark Summit West / San Francisco
http://training.databricks.com/intro.pdf
https://www.linkedin.com/profile/view?id=4367352
making big data simple
Databricks Cloud:
“A unified platform for building Big Data pipelines
– from ETL to Exploration and Dashboards, to
Advanced Analytics and Data Products.”
• Founded in late 2013
• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~55 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Hortonworks
• MapR
• DataStax
(http://databricks.workable.com)
The Databricks team contributed more than 75% of the code added to Spark in the past year
AGENDA
• History of Big Data & Spark
• RDD fundamentals
• Databricks UI demo
• Lab: DevOps 101
• Transformations & Actions
Before Lunch
• Transformations & Actions (continued)
• Lab: Transformations & Actions
• Dataframes
• Lab: Dataframes
• Spark UIs
• Resource Managers: Local & Standalone
• Memory and Persistence
• Spark Streaming
• Lab: MISC labs
After Lunch
Some slides will be skipped
Please keep Q&A low during class
(5pm – 5:30pm for Q&A with instructor)
2 anonymous surveys: Pre and Post class
Lunch: noon – 1pm
2 breaks (sometime before lunch and after lunch)
Homepage: http://www.ardentex.com/
LinkedIn: https://www.linkedin.com/in/bclapper
@brianclapper
- 30 years experience building & maintaining software
systems
- Scala, Python, Ruby, Java, C, C#
- Founder of Philadelphia area Scala user group (PHASE)
- Spark instructor for Databricks
Sales / Marketing
Data Scientist
Management / Exec
Administrator / Ops
Developer
Survey completed by
58 out of 115 students
SF Bay Area
42%
CA
12%
West US
5%
East US
24%
Europe
4%
Asia
10%
Intern. - O
3%
Retail / Distributor
Healthcare/ Medical
Academia / University
Telecom
Science & Tech
Banking / Finance
IT / Systems
Survey completed by
58 out of 115 students
Vendor Training
SparkCamp
AmpCamp
None
Survey completed by
58 out of 115 students
Zero
48%
< 1 week
26%
< 1 month
22%
1+ months
4%
Survey completed by
58 out of 115 students
Reading
58%
1-node VM
19%
POC / Prototype
21%
Production
2%
Survey completed by
58 out of 115 students
Use Cases
Architecture
Administrator / Ops
Development
Survey completed by
58 out of 115 students
NoSQL battles Compute battles
(then) (now)
Key -> Value: Redis - 95, Memcached - 33, DynamoDB - 16, Riak - 13
Key -> Doc: MongoDB - 279, CouchDB - 28, Couchbase - 24, DynamoDB - 15, MarkLogic - 11
Column Family: Cassandra - 109, HBase - 62
Graph: Neo4j - 30, OrientDB - 4, Titan - 3, Giraph - 1
Search: Solr - 81, Elasticsearch - 70, Splunk - 41
General Batch Processing (2004 - 2013)
Specialized Systems (2007 - 2015?)
(iterative, interactive, ML, streaming, graph, SQL, etc):
Pregel, Dremel, Impala, GraphLab, Giraph, Drill, Tez, S4, Storm, Mahout
General Unified Engine (2014 - ?)
Scheduling Monitoring Distributing
RDBMS
Streaming
SQL
GraphX
Hadoop Input Format
Apps
Distributions:
- CDH
- HDP
- MapR
- DSE
Tachyon
MLlib
DataFrames API
- Developers from 50+ companies
- 400+ developers
- Apache Committers from 16+ organizations
vs
YARN
SQL
MLlib
Streaming
Mesos
Tachyon
10x – 100x
Aug 2009
Source: openhub.net
...in June 2013
CPUs: 10 GB/s
SSD: 600 MB/s, 0.1 ms random access, $0.45 per GB
Disk: 100 MB/s, 3-12 ms random access, $0.05 per GB
Network: 1 Gb/s (125 MB/s) to nodes in the same rack,
0.1 Gb/s to nodes in another rack
June 2010
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“The main abstraction in Spark is that of a resilient
distributed dataset (RDD), which represents a read-only
collection of objects partitioned across a set of
machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across
machines and reuse it in multiple MapReduce-like
parallel operations.
RDDs achieve fault tolerance through a notion of
lineage: if a partition of an RDD is lost, the RDD has
enough information about how it was derived from
other RDDs to be able to rebuild just that partition.”
April 2012
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
“We present Resilient Distributed Datasets (RDDs), a
distributed memory abstraction that lets
programmers perform in-memory computations on
large clusters in a fault-tolerant manner.
RDDs are motivated by two types of applications
that current computing frameworks handle
inefficiently: iterative algorithms and interactive data
mining tools.
In both cases, keeping data in memory can improve
performance by an order of magnitude.”
“Best Paper Award and Honorable Mention for Community Award”
- NSDI 2012
- Cited 400+ times!
TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
- 2 Streaming Paper(s) have been cited 138 times
Analyze real time streams of data in ½ second intervals
sqlCtx = new HiveContext(sc)
results = sqlCtx.sql(
"SELECT * FROM people")
names = results.map(lambda p: p.name)
Seamlessly mix SQL queries with Spark programs.
graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) {
(id, vertex, msg) => ...
}
Analyze networks of nodes and edges using graph processing
https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
SQL queries with Bounded Errors and Bounded Response Times
https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
[Charts: estimated answer vs. # of data points sampled, converging to the true answer over time]
- How do you know when to stop?
- Error bars on every answer!
- Stop when the error is smaller than a given threshold
http://shop.oreilly.com/product/0636920028512.do
eBook: $33.99
Print: $39.99
PDF, ePub, Mobi, DAISY
Shipping now!
http://www.amazon.com/Learning-Spark-Lightning-Fast-
Data-Analysis/dp/1449358624
$30 @ Amazon:
Spark sorted the same data 3X faster
using 10X fewer machines
than Hadoop MR in 2013.
Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
100TB Daytona Sort Competition 2014
More info:
http://sortbenchmark.org
http://databricks.com/blog/2014/11/05/spark-
officially-sets-a-new-record-in-large-scale-sorting.html
All the sorting took place on disk (HDFS) without
using Spark’s in-memory cache!
- Stresses “shuffle” which underpins everything from SQL to MLlib
- Sorting is challenging b/c there is no reduction in data
- Sort 100 TB = 500 TB disk I/O and 200 TB network
Engineering Investment in Spark:
- Sort-based shuffle (SPARK-2045)
- Netty native network transport (SPARK-2468)
- External shuffle service (SPARK-3796)
Clever Application level Techniques:
- GC and cache friendly memory layout
- Pipelining
Ex
RDD
W
RDD
T
T
EC2: i2.8xlarge
(206 workers)
- Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores
- 244 GB of RAM
- 8 x 800 GB SSD and RAID 0 setup formatted with ext4
- ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes
- Each record: 100 bytes (10 byte key & 90 byte value)
- OpenJDK 1.7
- HDFS 2.4.1 w/ short circuit local reads enabled
- Apache Spark 1.2.0
- Speculative Execution off
- Increased Locality Wait to infinite
- Compression turned off for input, output & network
- Used Unsafe to put all the data off-heap and managed
it manually (i.e. never triggered the GC)
- 32 slots per machine
- 6,592 slots total
(Scala & Python only)
Driver Program
Ex
RDD
W
RDD
T
T
Ex
RDD
W
RDD
T
T
Worker Machine
Worker Machine
item-1
item-2
item-3
item-4
item-5
item-6
item-7
item-8
item-9
item-10
item-11
item-12
item-13
item-14
item-15
item-16
item-17
item-18
item-19
item-20
item-21
item-22
item-23
item-24
item-25
RDD
Ex
RDD
W
RDD
Ex
RDD
W
RDD
Ex
RDD
W
more partitions = more parallelism
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
RDD w/ 4 partitions
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
An RDD can be created 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc)
logLinesRDD
# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])
// Parallelize in Scala
val wordsRDD = sc.parallelize(List("fish", "cats", "dogs"))
// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
- Take an existing in-memory
collection and pass it to
SparkContext’s parallelize
method
- Not generally used outside of
prototyping and testing since it
requires entire dataset in
memory on one machine
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
- There are other methods
to read data from HDFS,
C*, S3, HBase, etc.
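For example (a sketch using the same SparkContext sc; the paths and bucket name are illustrative), the same textFile() call reads from HDFS or S3 just by changing the URI scheme:
# Read from HDFS or S3 by pointing textFile() at a different URI (illustrative paths)
hdfsLinesRDD = sc.textFile("hdfs://namenode:8020/path/to/logs/*.log")
s3LinesRDD = sc.textFile("s3n://my-bucket/path/to/data.txt")
# Read whole files as (filename, content) pairs
filesRDD = sc.wholeTextFiles("/path/to/directory")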
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
logLinesRDD
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3 Error, ts, msg4
Error, ts, msg1
errorsRDD
.filter( )
(input/base RDD)
errorsRDD
.coalesce( 2 )
Error, ts, msg1
Error, ts, msg3
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
cleanedRDD
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3 Error, ts, msg4
Error, ts, msg1
.collect( )
Driver
.collect( )
Execute DAG!
Driver
.collect( )
Driver
logLinesRDD
.collect( )
logLinesRDD
errorsRDD
cleanedRDD
.filter( )
.coalesce( 2 )
Driver
Error, ts, msg1
Error, ts, msg3
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
.collect( )
Driver
logLinesRDD
errorsRDD
cleanedRDD
data
.filter( )
.coalesce( 2, shuffle= False)
Pipelined
Stage-1
Driver
logLinesRDD
errorsRDD
cleanedRDD
Driver
data
logLinesRDD
errorsRDD
Error, ts, msg1
Error, ts, msg3
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
cleanedRDD
.filter( )
Error, ts, msg1
Error, ts, msg1 Error, ts, msg1
errorMsg1RDD
.collect( )
.saveToCassandra( )
.count( )
5
logLinesRDD
errorsRDD
Error, ts, msg1
Error, ts, msg3
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
cleanedRDD
.filter( )
Error, ts, msg1
Error, ts, msg1 Error, ts, msg1
errorMsg1RDD
.collect( )
.count( )
.saveToCassandra( )
5
1) Create some input RDDs from external data or parallelize a
collection in your driver program.
2) Lazily transform them to define new RDDs using
transformations like filter() or map()
3) Ask Spark to cache() any intermediate RDDs that will need to
be reused.
4) Launch actions such as count() and collect() to kick off a
parallel computation, which is then optimized and executed
by Spark.
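A minimal sketch of the four steps above, assuming an existing SparkContext sc (the file path and filter condition are illustrative):
# 1) Create an input RDD from external data
logLinesRDD = sc.textFile("/path/to/app.log")
# 2) Lazily transform it to define new RDDs
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))
# 3) Cache an intermediate RDD that will be reused
errorsRDD.cache()
# 4) Launch actions to kick off the (optimized) parallel computation
print(errorsRDD.count())
print(errorsRDD.take(3))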
map() intersection() cartesian()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
(lazy)
- Most transformations are element-wise (they work on one element at a time), but this is not
true for all transformations
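For example (a sketch, assuming an existing SparkContext sc): map() sees one element at a time, while mapPartitions() sees an iterator over a whole partition:
# Element-wise: the lambda is applied to each element independently
doubled = sc.parallelize([1, 2, 3, 4], 2).map(lambda x: x * 2)
# Partition-wise: the function receives an iterator over an entire partition,
# so per-partition setup (e.g. opening a connection) can happen once per partition
def sum_partition(iterator):
    yield sum(iterator)
partitionSums = sc.parallelize([1, 2, 3, 4], 2).mapPartitions(sum_partition)
print(partitionSums.collect())  # [3, 7] with the 2 partitions [1, 2] and [3, 4]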
reduce() takeOrdered()
collect() saveAsTextFile()
count() saveAsSequenceFile()
first() saveAsObjectFile()
take() countByKey()
takeSample() foreach()
saveToCassandra() ...
• HadoopRDD
• FilteredRDD
• MappedRDD
• PairRDD
• ShuffledRDD
• UnionRDD
• PythonRDD
• DoubleRDD
• JdbcRDD
• JsonRDD
• SchemaRDD
• VertexRDD
• EdgeRDD
• CassandraRDD (DataStax)
• GeoRDD (ESRI)
• EsSpark (ElasticSearch)
“Simple things
should be simple,
complex things
should be possible”
- Alan Kay
DEMO:
https://classeast01.cloud.databricks.com
https://classeast02.cloud.databricks.com
- 60 user accounts
- 60 user accounts
- 60 user clusters
- 1 community cluster
- 60 user clusters
- 1 community cluster
Databricks Guide (5 mins)
DevOps 101 (30 mins)
DevOps 102 (30 mins)
Transformations & Actions (30 mins)
SQL 101 (30 mins)
Dataframes (20 mins)
Switch to Transformations & Actions slide deck….
UserID Name Age Location Pet
28492942 John Galt 32 New York Sea Horse
95829324 Winston Smith 41 Oceania Ant
92871761 Tom Sawyer 17 Mississippi Raccoon
37584932 Carlos Hinojosa 33 Orlando Cat
73648274 Luis Rodriguez 34 Orlando Dogs
JDBC/ODBC Your App
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
• Announced Feb 2015
• Inspired by data frames in R
and Pandas in Python
• Works in:
Features
• Scales from KBs to PBs
• Supports wide array of data formats and
storage systems (Hive, existing RDDs, etc)
• State-of-the-art optimization and code
generation via Spark SQL Catalyst optimizer
• APIs in Python, Java
• a distributed collection of data organized into
named columns
• Like a table in a relational database
What is a Dataframe?
Step 1: Construct a DataFrame
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.jsonFile("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
## age name
## null Michael
## 30 Andy
## 19 Justin
Step 2: Use the DataFrame
# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)
# Select only the "name" column
df.select("name").show()
## name
## Michael
## Andy
## Justin
# Select everybody, but increment the age by 1
df.select("name", df.age + 1).show()
## name (age + 1)
## Michael null
## Andy 31
## Justin 20
SQL Integration
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")
SQL + RDD Integration
2 methods for converting existing RDDs into DataFrames:
1. Use reflection to infer the schema of an RDD that
contains different types of objects
2. Use a programmatic interface that allows you to
construct a schema and then apply it to an existing
RDD.
(more concise)
(more verbose)
SQL + RDD Integration: via reflection
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.inferSchema(people)
schemaPeople.registerTempTable("people")
SQL + RDD Integration: via reflection
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
print teenName
SQL + RDD Integration: via programmatic schema
DataFrame can be created programmatically with 3 steps:
1. Create an RDD of tuples or lists from the original RDD
2. Create the schema represented by a StructType matching the
structure of tuples or lists in the RDD created in the step 1
3. Apply the schema to the RDD via createDataFrame method
provided by SQLContext
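A sketch of the programmatic approach, reusing sc and sqlContext from the earlier examples (the field names mirror the reflection example; treating age as a string keeps the sketch short):
from pyspark.sql.types import StructType, StructField, StringType
lines = sc.textFile("examples/src/main/resources/people.txt")
# 1. Create an RDD of tuples from the original RDD
people = lines.map(lambda l: l.split(",")).map(lambda p: (p[0], p[1].strip()))
# 2. Create the schema as a StructType matching the tuples
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", StringType(), True)
])
# 3. Apply the schema to the RDD via createDataFrame
schemaPeople = sqlContext.createDataFrame(people, schema)
schemaPeople.registerTempTable("people")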
Step 1: Construct a DataFrame
# Constructs a DataFrame from the users table in Hive.
users = context.table("users")
# from JSON files in S3
logs = context.load("s3n://path/to/data.json", "json")
Step 2: Use the DataFrame
# Create a new DataFrame that contains “young users” only
young = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
young = users[users.age < 21]
# Increment everybody’s age by 1
young.select(young.name, young.age + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame called logs
young.join(logs, logs.userId == users.userId, "left_outer")
1.4.0 Event timeline all jobs page
1.4.0
Event timeline within 1 job
1.4.0
Event timeline within 1 stage
1.4.0
sc.textFile("blog.txt")
.cache()
.flatMap { line => line.split(" ") }
.map { word => (word, 1) }
.reduceByKey { case (count1, count2) => count1 + count2 }
.collect()
1.4.0
Task-1
Task-2
Task-3
Task-4
logLinesRDD
errorsRDD
Don't know
Mesos
Apache + Standalone
C* + Standalone
Hadoop YARN
Databricks Cloud
Survey completed by
58 out of 115 students
Different Cloud
On-prem
Amazon Cloud
Survey completed by
58 out of 115 students
History:
[Diagram: classic Hadoop MapReduce architecture: a JobTracker (JT) schedules Map (M) and Reduce (R) tasks onto TaskTrackers (TT) running alongside HDFS DataNodes (DN) on each machine, with a NameNode (NN) managing HDFS metadata.]
- Local
- Standalone Scheduler
- YARN
- Mesos
JVM: Ex + Driver
Disk
RDD, P1 Task
3 options:
- local
- local[N]
- local[*]
RDD, P2
RDD, P1
RDD, P2
RDD, P3
Task
Task
Task
Task
Task
CPUs:
Task
Task
Task
Task
Task
Task
Internal
Threads
val conf = new SparkConf()
.setMaster("local[12]")
.setAppName("MyFirstApp")
.set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)
> ./bin/spark-shell --master local[12]
> ./bin/spark-submit --name "MyFirstApp"
--master local[12] myApp.jar
Worker Machine
Ex
RDD, P1
W
Driver
RDD, P2
RDD, P1
T T
T T
T T
Internal
Threads
SSD SSDOS Disk
SSD SSD
Ex
RDD, P4
W
RDD, P6
RDD, P1
T T
T T
T T
Internal
Threads
SSD SSDOS Disk
SSD SSD
Ex
RDD, P7
W
RDD, P8
RDD, P2
T T
T T
T T
Internal
Threads
SSD SSDOS Disk
SSD SSD
Spark
Master
Ex
RDD, P5
W
RDD, P3
RDD, P2
T T
T T
T T
Internal
Threads
SSD SSDOS Disk
SSD SSD
T T
T T
spark-env.sh (can differ per worker machine):
- SPARK_WORKER_CORES
- SPARK_LOCAL_DIRS
> ./bin/spark-submit --name "SecondApp"
    --master spark://host4:port1
    myApp.jar
            Spark Central Master    Who starts Executors?   Tasks run in
Local       [none]                  Human being             Executor
Standalone  Standalone Master       Worker JVM              Executor
YARN        YARN App Master         Node Manager            Executor
Mesos       Mesos Master            Mesos Slave             Executor
spark-submit provides a uniform interface for
submitting jobs across all cluster managers
bin/spark-submit --master spark://host:7077
--executor-memory 10g
my_script.py
Source: Learning Spark
Ex
RDD, P1
RDD, P2
RDD, P1
T T
T T
T T
Internal
Threads
Recommended to use at most 75% of a machine’s memory
for Spark
Minimum Executor heap size should be 8 GB
Max Executor heap size depends… maybe 40 GB (watch GC)
Memory usage is greatly affected by storage level and
serialization format
+Vs.
Persistence level: description
MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM
MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM and spill to disk
MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition)
MEMORY_AND_DISK_SER: Spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed
DISK_ONLY: Store the RDD partitions only on disk
MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but replicate each partition on two cluster nodes
OFF_HEAP: Store RDD in serialized format in Tachyon
RDD.cache() == RDD.persist(MEMORY_ONLY)
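In PySpark the level is passed explicitly; a minimal sketch (the RDD is illustrative):
from pyspark import StorageLevel
rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()      # the first action materializes and caches the partitions
rdd.unpersist()  # drop the cached copies when they are no longer needed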
JVM
deserialized
most CPU-efficient option
RDD.persist(MEMORY_ONLY_SER)
JVM
serialized
+
.persist(MEMORY_AND_DISK)
JVM
deserialized Ex
W
RDD-P1
T
T
OS Disk
SSD
RDD-P1
RDD-P2
+
.persist(MEMORY_AND_DISK_SER)
JVM
serialized
.persist(DISK_ONLY)
JVM
RDD.persist(MEMORY_ONLY_2)
JVM on Node X
deserialized deserialized
JVM on Node Y
+
.persist(MEMORY_AND_DISK_2)
JVM
deserialized
+
JVM
deserialized
.persist(OFF_HEAP)
JVM-1 / App-1
serialized
Tachyon
JVM-2 / App-1
JVM-7 / App-2
.unpersist()
JVM
JVM
?
Intermediate data is automatically persisted during shuffle operations
Remember!
Default Memory Allocation in Executor JVM:
- Cached RDDs: 60% (spark.storage.memoryFraction)
- Shuffle memory: 20% (spark.shuffle.memoryFraction)
- User Programs (remainder): 20%
Spark uses memory for:
RDD Storage: when you call .persist() or .cache(), Spark will limit the amount of
memory used for caching to a certain fraction of the JVM’s overall heap, set by
spark.storage.memoryFraction.
Shuffle and aggregation buffers: when performing shuffle operations, Spark will
create intermediate buffers for storing shuffle output data. These buffers are used to
store intermediate results of aggregations in addition to buffering data that is going
to be directly output as part of the shuffle.
User code: Spark executes arbitrary user code, so user functions can themselves
require substantial memory. For instance, if a user application allocates large arrays
or other objects, these will contend for overall memory. User code has access
to everything “left” in the JVM heap after the space for RDD storage and shuffle
storage is allocated.
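Both fractions can be adjusted on the SparkConf before the SparkContext is created; a sketch (the values are illustrative, not recommendations):
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("MemoryTuningSketch")
        .set("spark.storage.memoryFraction", "0.5")    # heap fraction reserved for cached RDDs
        .set("spark.shuffle.memoryFraction", "0.3"))   # heap fraction reserved for shuffle buffers
sc = SparkContext(conf=conf)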
Serialization is used when:
Transferring data over the network
Spilling data to disk
Caching to memory serialized
Broadcasting variables
Java serialization vs. Kryo serialization
• Uses Java’s ObjectOutputStream framework
• Works with any class you create that implements
java.io.Serializable
• You can control the performance of serialization
more closely by extending java.io.Externalizable
• Flexible, but quite slow
• Leads to large serialized formats for many classes
• Recommended serialization for production apps
• Use Kryo version 2 for speedy serialization (10x) and
more compactness
• Does not support all Serializable types
• Requires you to register the classes you’ll use in
advance
• If set, will be used for serializing shuffle data between
nodes and also serializing RDDs to disk
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
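The same settings work from PySpark; a sketch (the registered class name is illustrative, and spark.kryo.classesToRegister assumes Spark 1.2+):
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # comma-separated JVM class names to register with Kryo (illustrative class)
        .set("spark.kryo.classesToRegister", "org.example.MyRecord"))
sc = SparkContext(conf=conf)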
++ +
Ex
Ex
Ex
x = 5
T
T
x = 5
x = 5
x = 5
x = 5
T
T
• Broadcast variables – Send a large read-only lookup table to all the nodes, or
send a large feature vector in a ML algorithm to all nodes
• Accumulators – count events that occur during job execution for debugging
purposes. Example: How many lines of the input file were blank? Or how many
corrupt records were in the input dataset?
++ +
Spark supports 2 types of shared variables:
• Broadcast variables – allows your program to efficiently send a large, read-only
value to all the worker nodes for use in one or more Spark operations. Like
sending a large, read-only lookup table to all the nodes.
• Accumulators – allows you to aggregate values from worker nodes back to
the driver program. Can be used to count the # of errors seen in an RDD of
lines spread across 100s of nodes. Only the driver can access the value of an
accumulator, tasks cannot. For tasks, accumulators are write-only.
++ +
Broadcast variables let the programmer keep a read-only
variable cached on each machine rather than
shipping a copy of it with tasks
For example, to give every node a copy of a large
input dataset efficiently
Spark also attempts to distribute broadcast variables
using efficient broadcast algorithms to reduce
communication cost
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value
Scala:
Python:
Accumulators are variables that can only be “added” to through
an associative operation
Used to implement counters and sums, efficiently in parallel
Spark natively supports accumulators of numeric value types and
standard mutable collections, and programmers can extend
for new types
Only the driver program can read an accumulator’s value, not the
tasks
++ +
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])
def f(x):
global accum
accum += x
rdd.foreach(f)
accum.value
Scala:
Python:
++ +
TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
Kafka
Flume
HDFS
S3
Kinesis
Twitter
TCP socket
HDFS
Cassandra
Dashboards
Databases
- Scalable
- High-throughput
- Fault-tolerant
Complex algorithms can be expressed using:
- Spark transformations: map(), reduce(), join(), etc
- MLlib + GraphX
- SQL
Batch Realtime
One unified API
Tathagata Das (TD)
- Lead developer of Spark Streaming + Committer
on Apache Spark core
- Helped re-write Spark Core internals in 2012 to
make it 10x faster to support Streaming use cases
- On leave from UC Berkeley PhD program
- Ex: Intern @ Amazon, Intern @ Conviva, Research
Assistant @ Microsoft Research India
- Scales to 100s of nodes
- Batch sizes as small as half a second
- Processing latency as low as 1 second
- Exactly-once semantics no matter what fails
Page views Kafka for buffering Spark for processing
(live statistics)
Smart meter readings
Live weather data
Join 2 live data
sources
(Anomaly Detection)
Input data stream
Batches of
processed data
Batches every X seconds
(Discretized Stream)
Block #1
RDD @ T=5
Block #2 Block #3
Batch interval = 5 seconds
Block #1
RDD @ T=10
Block #2 Block #3
T = 5 T = 10
Input
DStream
One RDD is created every 5 seconds
Block #1 Block #2 Block #3
Part. #1 Part. #2 Part. #3
Part. #1 Part. #2 Part. #3
5 sec
wordsRDD
flatMap()
linesRDD
linesDStream
wordsDStream
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with two working threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)
# Create a DStream that will connect to hostname:port, like localhost:9999
linesDStream = ssc.socketTextStream("localhost", 9999)
# Split each line into words
wordsDStream = linesDStream.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairsDStream = wordsDStream.map(lambda word: (word, 1))
wordCountsDStream = pairsDStream.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCountsDStream.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
linesStream
wordsStream
pairsStream
wordCountsStream
Terminal #1 Terminal #2
$ nc -lk 9999
hello hello world
$ ./network_wordcount.py localhost 9999
. . .
--------------------------
Time: 2015-04-25 15:25:21
--------------------------
(hello, 2)
(world, 1)
Batch interval = 600 ms
[Diagram sequence: a receiver task (R) runs on one executor and turns the incoming stream into blocks, one new block roughly every 200 ms, each replicated to a second executor; when the 600 ms batch interval ends, the blocks collected during that interval become the partitions of the batch's RDD.]
1.4.0
New UI for Streaming
1.4.0
DAG Visualization for Streaming
Batch interval = 600 ms
[Diagram sequence: with 2 input DStreams, 2 receivers run on different executors, each collecting its own replicated blocks during the batch interval; the blocks are then materialized as RDD partitions and the two batches are unioned into a single RDD.]
- File systems
- Socket Connections
- Kafka
- Flume
- Twitter
Sources directly available
in StreamingContext API
Requires linking against
extra dependencies
- Anywhere
Requires implementing
user-defined receiver
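For example, the built-in file-system and socket sources need no extra dependencies; a sketch reusing the StreamingContext ssc from the word-count example (paths and ports are illustrative):
# Monitor a directory for new files
fileDStream = ssc.textFileStream("hdfs://namenode:8020/incoming/")
# Read text from a TCP socket, as in the word-count example
socketDStream = ssc.socketTextStream("localhost", 9999)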
map( )
flatMap( )
filter( )
repartition(numPartitions)
union(otherStream)
count()
reduce( )
countByValue()
reduceByKey( , [numTasks])
join(otherStream,[numTasks])
cogroup(otherStream,[numTasks])
transform( )
RDD
RDD
updateStateByKey( )*
print()
saveAsTextFiles(prefix, [suffix])
foreachRDD( )
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
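For example, foreachRDD() hands you each batch's RDD so ordinary RDD actions can be reused on it; a sketch building on wordCountsDStream from the earlier example (registered before ssc.start()):
# Print the 5 largest counts of each batch from the driver
def handle_batch(rdd):
    for word, count in rdd.takeOrdered(5, key=lambda pair: -pair[1]):
        print(word, count)
wordCountsDStream.foreachRDD(handle_batch)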
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 

Último (20)

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 

Introduction to Spark Training

  • 1. INTRO TO SPARK DEVELOPMENT June 2015: Spark Summit West / San Francisco http://training.databricks.com/intro.pdf https://www.linkedin.com/profile/view?id=4367352
  • 2. making big data simple Databricks Cloud: “A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products.” • Founded in late 2013 • by the creators of Apache Spark • Original team from UC Berkeley AMPLab • Raised $47 Million in 2 rounds • ~55 employees • We’re hiring! • Level 2/3 support partnerships with • Hortonworks • MapR • DataStax (http://databricks.workable.com)
  • 3. The Databricks team contributed more than 75% of the code added to Spark in the past year
  • 4. AGENDA • History of Big Data & Spark • RDD fundamentals • Databricks UI demo • Lab: DevOps 101 • Transformations & Actions Before Lunch • Transformations & Actions (continued) • Lab: Transformations & Actions • Dataframes • Lab: Dataframes • Spark UIs • Resource Managers: Local & Standalone • Memory and Persistence • Spark Streaming • Lab: MISC labs After Lunch
  • 5. Some slides will be skipped Please keep Q&A low during class (5pm – 5:30pm for Q&A with instructor) 2 anonymous surveys: Pre and Post class Lunch: noon – 1pm 2 breaks (sometime before lunch and after lunch)
  • 6. Homepage: http://www.ardentex.com/ LinkedIn: https://www.linkedin.com/in/bclapper @brianclapper - 30 years experience building & maintaining software systems - Scala, Python, Ruby, Java, C, C# - Founder of Philadelphia area Scala user group (PHASE) - Spark instructor for Databricks
  • 7. 0 10 20 30 40 50 60 70 80 Sales / Marketing Data Scientist Management / Exec Administrator / Ops Developer Survey completed by 58 out of 115 students
  • 8. Survey completed by 58 out of 115 students SF Bay Area 42% CA 12% West US 5% East US 24% Europe 4% Asia 10% Intern. - O 3%
  • 9. 0 5 10 15 20 25 30 35 40 45 50 Retail / Distributor Healthcare/ Medical Academia / University Telecom Science & Tech Banking / Finance IT / Systems Survey completed by 58 out of 115 students
  • 10. 0 10 20 30 40 50 60 70 80 90 100 Vendor Training SparkCamp AmpCamp None Survey completed by 58 out of 115 students
  • 11. Survey completed by 58 out of 115 students Zero 48% < 1 week 26% < 1 month 22% 1+ months 4%
  • 12. Survey completed by 58 out of 115 students Reading 58% 1-node VM 19% POC / Prototype 21% Production 2%
  • 13. Survey completed by 58 out of 115 students
  • 14. Survey completed by 58 out of 115 students
  • 15. Survey completed by 58 out of 115 students
  • 16. Survey completed by 58 out of 115 students
  • 17. 0 10 20 30 40 50 60 70 80 90 100 Use Cases Architecture Administrator / Ops Development Survey completed by 58 out of 115 students
  • 18. NoSQL battles Compute battles (then) (now)
  • 19. NoSQL battles Compute battles (then) (now)
  • 20. Key -> Value Key -> Doc Column Family Graph Search Redis - 95 Memcached - 33 DynamoDB - 16 Riak - 13 MongoDB - 279 CouchDB - 28 Couchbase - 24 DynamoDB – 15 MarkLogic - 11 Cassandra - 109 HBase - 62 Neo4j - 30 OrientDB - 4 Titan – 3 Giraph - 1 Solr - 81 Elasticsearch - 70 Splunk – 41
  • 21. General Batch Processing Pregel Dremel Impala GraphLab Giraph Drill Tez S4 Storm Specialized Systems (iterative, interactive, ML, streaming, graph, SQL, etc) General Unified Engine (2004 – 2013) (2007 – 2015?) (2014 – ?) Mahout
  • 23. RDBMS Streaming SQL GraphX Hadoop Input Format Apps Distributions: - CDH - HDP - MapR - DSE Tachyon MLlib DataFrames API
  • 24.
  • 25. - Developers from 50+ companies - 400+ developers - Apache Committers from 16+ organizations
  • 29.
  • 30. CPUs: 10 GB/s 100 MB/s 0.1 ms random access $0.45 per GB 600 MB/s 3-12 ms random access $0.05 per GB 1 Gb/s or 125 MB/s Network 0.1 Gb/s Nodes in another rack Nodes in same rack 1 Gb/s or 125 MB/s
  • 31. June 2010 http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf “The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”
  • 32. April 2012 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf “We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude.” “Best Paper Award and Honorable Mention for Community Award” - NSDI 2012 - Cited 400+ times!
  • 33. TwitterUtils.createStream(...) .filter(_.getText.contains("Spark")) .countByWindow(Seconds(5)) - 2 Streaming Paper(s) have been cited 138 times Analyze real time streams of data in ½ second intervals
  • 34. sqlCtx = new HiveContext(sc) results = sqlCtx.sql( "SELECT * FROM people") names = results.map(lambda p: p.name) Seamlessly mix SQL queries with Spark programs.
  • 35. graph = Graph(vertices, edges) messages = spark.textFile("hdfs://...") graph2 = graph.joinVertices(messages) { (id, vertex, msg) => ... } Analyze networks of nodes and edges using graph processing https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
  • 36. SQL queries with Bounded Errors and Bounded Response Times https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
  • 37. Estimate # of data points true answer How do you know when to stop?
  • 38. Estimate # of data points true answer Error bars on every answer!
  • 39. Estimate # of data points true answer Stop when error smaller than a given threshold time
  • 40. http://shop.oreilly.com/product/0636920028512.do eBook: $33.99 Print: $39.99 PDF, ePub, Mobi, DAISY Shipping now! http://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624 $30 @ Amazon:
  • 41.
  • 42.
  • 43. Spark sorted the same data 3X faster using 10X fewer machines than Hadoop MR in 2013. Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia 100TB Daytona Sort Competition 2014 More info: http://sortbenchmark.org http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html All the sorting took place on disk (HDFS) without using Spark’s in-memory cache!
  • 44.
  • 45. - Stresses “shuffle” which underpins everything from SQL to MLlib - Sorting is challenging b/c there is no reduction in data - Sort 100 TB = 500 TB disk I/O and 200 TB network Engineering Investment in Spark: - Sort-based shuffle (SPARK-2045) - Netty native network transport (SPARK-2468) - External shuffle service (SPARK-3796) Clever Application level Techniques: - GC and cache friendly memory layout - Pipelining
  • 46. Ex RDD W RDD T T EC2: i2.8xlarge (206 workers) - Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores - 244 GB of RAM - 8 x 800 GB SSD and RAID 0 setup formatted with /ext4 - ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes - Each record: 100 bytes (10 byte key & 90 byte value) - OpenJDK 1.7 - HDFS 2.4.1 w/ short circuit local reads enabled - Apache Spark 1.2.0 - Speculative Execution off - Increased Locality Wait to infinite - Compression turned off for input, output & network - Used Unsafe to put all the data off-heap and managed it manually (i.e. never triggered the GC) - 32 slots per machine - 6,592 slots total
  • 47.
  • 48.
  • 52. Error, ts, msg1 Warn, ts, msg2 Error, ts, msg1 RDD w/ 4 partitions Info, ts, msg8 Warn, ts, msg2 Info, ts, msg8 Error, ts, msg3 Info, ts, msg5 Info, ts, msg5 Error, ts, msg4 Warn, ts, msg9 Error, ts, msg1 An RDD can be created 2 ways: - Parallelize a collection - Read data from an external source (S3, C*, HDFS, etc) logLinesRDD
  • 53. # Parallelize in Python wordsRDD = sc.parallelize(["fish", "cats", "dogs"]) // Parallelize in Scala val wordsRDD = sc.parallelize(List("fish", "cats", "dogs")) // Parallelize in Java JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs")); - Take an existing in-memory collection and pass it to SparkContext’s parallelize method - Not generally used outside of prototyping and testing since it requires entire dataset in memory on one machine
  • 54. # Read a local txt file in Python linesRDD = sc.textFile("/path/to/README.md") // Read a local txt file in Scala val linesRDD = sc.textFile("/path/to/README.md") // Read a local txt file in Java JavaRDD<String> lines = sc.textFile("/path/to/README.md"); - There are other methods to read data from HDFS, C*, S3, HBase, etc.
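For illustration, textFile() also accepts other URI schemes; the host, bucket name, and paths below are placeholders rather than part of the course material.

# Reading from HDFS and S3 with placeholder paths (Python)
hdfsRDD = sc.textFile("hdfs://namenode:8020/data/README.md")
s3RDD = sc.textFile("s3n://my-bucket/logs/*.log")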
  • 55. Error, ts, msg1 Warn, ts, msg2 Error, ts, msg1 Info, ts, msg8 Warn, ts, msg2 Info, ts, msg8 Error, ts, msg3 Info, ts, msg5 Info, ts, msg5 Error, ts, msg4 Warn, ts, msg9 Error, ts, msg1 logLinesRDD Error, ts, msg1 Error, ts, msg1 Error, ts, msg3 Error, ts, msg4 Error, ts, msg1 errorsRDD .filter( ) (input/base RDD)
  • 56. errorsRDD .coalesce( 2 ) Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1 cleanedRDD Error, ts, msg1 Error, ts, msg1 Error, ts, msg3 Error, ts, msg4 Error, ts, msg1 .collect( ) Driver
  • 59. .collect( ) logLinesRDD errorsRDD cleanedRDD .filter( ) .coalesce( 2 ) Driver Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1
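Putting the preceding diagrams into code, a minimal Python sketch of the same pipeline might look like this; the file path and the "Error" prefix check are illustrative assumptions, not the course's lab data.

logLinesRDD = sc.textFile("/path/to/log.txt")                           # input/base RDD
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))   # keep only error lines
cleanedRDD = errorsRDD.coalesce(2)                                      # shrink to 2 partitions
results = cleanedRDD.collect()                                          # bring results back to the driver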
  • 63. logLinesRDD errorsRDD Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1 cleanedRDD .filter( ) Error, ts, msg1 Error, ts, msg1 Error, ts, msg1 errorMsg1RDD .collect( ) .saveToCassandra( ) .count( ) 5
  • 64. logLinesRDD errorsRDD Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1 cleanedRDD .filter( ) Error, ts, msg1 Error, ts, msg1 Error, ts, msg1 errorMsg1RDD .collect( ) .count( ) .saveToCassandra( ) 5
  • 65. 1) Create some input RDDs from external data or parallelize a collection in your driver program. 2) Lazily transform them to define new RDDs using transformations like filter() or map() 3) Ask Spark to cache() any intermediate RDDs that will need to be reused. 4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
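A minimal Python sketch of that four-step workflow, assuming a hypothetical application log file:

linesRDD = sc.textFile("/path/to/app.log")              # 1) create an input RDD
errorsRDD = linesRDD.filter(lambda l: "ERROR" in l)     # 2) lazily transform it
errorsRDD.cache()                                       # 3) cache the RDD that will be reused
print(errorsRDD.count())                                # 4) actions launch the parallel computation
print(errorsRDD.take(10))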
  • 66. map() intersection() cartesian() flatMap() distinct() pipe() filter() groupByKey() coalesce() mapPartitions() reduceByKey() repartition() mapPartitionsWithIndex() sortByKey() partitionBy() sample() join() ... union() cogroup() ... (lazy) - Most transformations are element-wise (they work on one element at a time), but this is not true for all transformations
  • 67. reduce() takeOrdered() collect() saveAsTextFile() count() saveAsSequenceFile() first() saveAsObjectFile() take() countByKey() takeSample() foreach() saveToCassandra() ...
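To make the lazy/eager split concrete, here is a small sketch: the transformations only build up a lineage, and nothing runs until an action is called.

nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)                # transformation: no job launched yet
evens = squares.filter(lambda x: x % 2 == 0)       # still lazy
print(evens.count())                               # action: runs the job -> 2
print(evens.collect())                             # action: returns [4, 16] to the driver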
  • 68. • HadoopRDD • FilteredRDD • MappedRDD • PairRDD • ShuffledRDD • UnionRDD • PythonRDD • DoubleRDD • JdbcRDD • JsonRDD • SchemaRDD • VertexRDD • EdgeRDD • CassandraRDD (DataStax) • GeoRDD (ESRI) • EsSpark (ElasticSearch)
  • 69.
  • 70.
  • 71. “Simple things should be simple, complex things should be possible” - Alan Kay
  • 72. DEMO:
  • 73. https://classeast01.cloud.databricks.com https://classeast02.cloud.databricks.com - 60 user accounts - 60 user accounts - 60 user clusters - 1 community cluster - 60 user clusters - 1 community cluster
  • 74. Databricks Guide (5 mins) DevOps 101 (30 mins) DevOps 102 (30 mins) Transformations & Actions (30 mins) SQL 101 (30 mins) Dataframes (20 mins)
  • 75. Switch to Transformations & Actions slide deck….
  • 76. UserID | Name | Age | Location | Pet
28492942 | John Galt | 32 | New York | Sea Horse
95829324 | Winston Smith | 41 | Oceania | Ant
92871761 | Tom Sawyer | 17 | Mississippi | Raccoon
37584932 | Carlos Hinojosa | 33 | Orlando | Cat
73648274 | Luis Rodriguez | 34 | Orlando | Dogs
  • 78.
  • 80. • Announced Feb 2015 • Inspired by data frames in R and Pandas in Python • Works in: Features • Scales from KBs to PBs • Supports wide array of data formats and storage systems (Hive, existing RDDs, etc) • State-of-the-art optimization and code generation via Spark SQL Catalyst optimizer • APIs in Python, Java • a distributed collection of data organized into named columns • Like a table in a relational database What is a Dataframe?
  • 81. Step 1: Construct a DataFrame from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.jsonFile("examples/src/main/resources/people.json") # Displays the content of the DataFrame to stdout df.show() ## age name ## null Michael ## 30 Andy ## 19 Justin
  • 82. Step 2: Use the DataFrame # Print the schema in a tree format df.printSchema() ## root ## |-- age: long (nullable = true) ## |-- name: string (nullable = true) # Select only the "name" column df.select("name").show() ## name ## Michael ## Andy ## Justin # Select everybody, but increment the age by 1 df.select("name", df.age + 1).show() ## name (age + 1) ## Michael null ## Andy 31 ## Justin 20
  • 83. SQL Integration from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.sql("SELECT * FROM table")
  • 84. SQL + RDD Integration 2 methods for converting existing RDDs into DataFrames: 1. Use reflection to infer the schema of an RDD that contains different types of objects 2. Use a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. (more concise) (more verbose)
  • 85. SQL + RDD Integration: via reflection # sc is an existing SparkContext. from pyspark.sql import SQLContext, Row sqlContext = SQLContext(sc) # Load a text file and convert each line to a Row. lines = sc.textFile("examples/src/main/resources/people.txt") parts = lines.map(lambda l: l.split(",")) people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) # Infer the schema, and register the DataFrame as a table. schemaPeople = sqlContext.inferSchema(people) schemaPeople.registerTempTable("people")
  • 86. SQL + RDD Integration: via reflection # SQL can be run over DataFrames that have been registered as a table. teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") # The results of SQL queries are RDDs and support all the normal RDD operations. teenNames = teenagers.map(lambda p: "Name: " + p.name) for teenName in teenNames.collect(): print teenName
  • 87. SQL + RDD Integration: via programmatic schema DataFrame can be created programmatically with 3 steps: 1. Create an RDD of tuples or lists from the original RDD 2. Create the schema represented by a StructType matching the structure of tuples or lists in the RDD created in the step 1 3. Apply the schema to the RDD via createDataFrame method provided by SQLContext
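A sketch of those three steps against the Spark 1.x Python API, reusing the same people.txt example file as the reflection version above:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sqlContext = SQLContext(sc)
lines = sc.textFile("examples/src/main/resources/people.txt")
people = lines.map(lambda l: l.split(",")).map(lambda p: (p[0], int(p[1])))   # 1) RDD of tuples
schema = StructType([                                                         # 2) schema matching the tuples
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
schemaPeople = sqlContext.createDataFrame(people, schema)                     # 3) apply the schema
schemaPeople.registerTempTable("people")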
  • 88. Step 1: Construct a DataFrame # Constructs a DataFrame from the users table in Hive. users = context.table("users") # from JSON files in S3 logs = context.load("s3n://path/to/data.json", "json")
  • 89. Step 2: Use the DataFrame # Create a new DataFrame that contains “young users” only young = users.filter(users.age < 21) # Alternatively, using Pandas-like syntax young = users[users.age < 21] # Increment everybody’s age by 1 young.select(young.name, young.age + 1) # Count the number of young users by gender young.groupBy("gender").count() # Join young users with another DataFrame called logs young.join(logs, logs.userId == users.userId, "left_outer")
  • 90.
  • 91.
  • 92.
  • 93.
  • 94.
  • 95.
  • 96.
  • 97.
  • 98.
  • 99. 1.4.0 Event timeline all jobs page
  • 102. 1.4.0 sc.textFile("blog.txt") .cache() .flatMap { line => line.split(" ") } .map { word => (word, 1) } .reduceByKey { case (count1, count2) => count1 + count2 } .collect()
  • 103. 1.4.0
  • 104.
  • 106. 0 5 10 15 20 25 30 35 40 45 50 Don't know Mesos Apache + Standalone C* + Standalone Hadoop YARN Databricks Cloud Survey completed by 58 out of 115 students
  • 107. 0 10 20 30 40 50 60 70 Different Cloud On-prem Amazon Cloud Survey completed by 58 out of 115 students
  • 108. History: Hadoop MapReduce v1 architecture - a JobTracker (JT) and NameNode (NN) on the master side, with a TaskTracker (TT) and DataNode (DN) running on each worker node's OS and map (M) / reduce (R) slots per node.
  • 109. - Local - Standalone Scheduler - YARN - Mesos
  • 110.
  • 111. Local mode: the Driver and a single Executor share one JVM on the worker machine, with tasks running on internal threads against the local disk. 3 options: - local - local[N] - local[*] val conf = new SparkConf() .setMaster("local[12]") .setAppName("MyFirstApp") .set("spark.executor.memory", "3g") val sc = new SparkContext(conf) > ./bin/spark-shell --master local[12] > ./bin/spark-submit --name "MyFirstApp" --master local[12] myApp.jar
  • 112.
  • 113. Standalone mode: a Spark Master coordinates one Worker (W) per node; each Worker launches an Executor JVM that holds RDD partitions and runs tasks on internal threads against the local OS disk and SSDs. Per-node resources are configured in spark-env.sh (e.g. SPARK_WORKER_CORES, SPARK_LOCAL_DIRS). > ./bin/spark-submit --name "SecondApp" --master spark://host4:port1 myApp.jar
  • 114. Mode | Spark Central Master | Who starts Executors? | Tasks run in
Local | [none] | Human being | Executor
Standalone | Standalone Master | Worker JVM | Executor
YARN | YARN App Master | Node Manager | Executor
Mesos | Mesos Master | Mesos Slave | Executor
  • 115. spark-submit provides a uniform interface for submitting jobs across all cluster managers bin/spark-submit --master spark://host:7077 --executor-memory 10g my_script.py Source: Learning Spark
  • 116.
  • 117. Ex RDD, P1 RDD, P2 RDD, P1 T T T T T T Internal Threads Recommended to use at most only 75% of a machine’s memory for Spark Minimum Executor heap size should be 8 GB Max Executor heap size depends… maybe 40 GB (watch GC) Memory usage is greatly affected by storage level and serialization format
  • 118. +Vs.
  • 119. Persistence level | Description
MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM
MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM and spill to disk
MEMORY_ONLY_SER | Store RDD as serialized Java objects (one byte array per partition)
MEMORY_AND_DISK_SER | Spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed
DISK_ONLY | Store the RDD partitions only on disk
MEMORY_ONLY_2, MEMORY_AND_DISK_2 | Same as the levels above, but replicate each partition on two cluster nodes
OFF_HEAP | Store RDD in serialized format in Tachyon
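As a quick illustration of picking a level explicitly from Python (the path is a placeholder):

from pyspark import StorageLevel

logsRDD = sc.textFile("/path/to/logs")
logsRDD.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions that don't fit in memory
logsRDD.count()                                 # first action materializes the cache
logsRDD.unpersist()                             # release the cached partitions when done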
  • 121.
  • 126. RDD.persist(MEMORY_ONLY_2) JVM on Node X deserialized deserialized JVM on Node Y
  • 130. JVM ?
  • 131. Intermediate data is automatically persisted during shuffle operations Remember!
  • 132. Default Memory Allocation in Executor JVM: 60% Cached RDDs (spark.storage.memoryFraction), 20% Shuffle memory (spark.shuffle.memoryFraction), 20% User Programs (remainder)
  • 133. Spark uses memory for: RDD Storage: used when you call .persist() or .cache(). Spark will limit the amount of memory used when caching to a certain fraction of the JVM’s overall heap, set by spark.storage.memoryFraction. Shuffle and aggregation buffers: When performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations in addition to buffering data that is going to be directly output as part of the shuffle. User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to everything “left” in the JVM heap after the space for RDD storage and shuffle storage are allocated.
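A hedged example of adjusting those fractions in Spark 1.x; these properties were later replaced by unified memory management, and the values and app name below are purely illustrative:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MemoryTuningExample")                 # hypothetical app name
        .set("spark.storage.memoryFraction", "0.5")        # shrink the RDD cache share
        .set("spark.shuffle.memoryFraction", "0.3"))       # grow the shuffle buffers
sc = SparkContext(conf=conf)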
  • 134.
  • 135. Serialization is used when: Transferring data over the network, Spilling data to disk, Caching data in memory in serialized form, Broadcasting variables
  • 136. Java serialization vs. Kryo serialization • Uses Java’s ObjectOutputStream framework • Works with any class you create that implements java.io.Serializable • You can control the performance of serialization more closely by extending java.io.Externalizable • Flexible, but quite slow • Leads to large serialized formats for many classes • Recommended serialization for production apps • Use Kryo version 2 for speedy serialization (10x) and more compactness • Does not support all Serializable types • Requires you to register the classes you’ll use in advance • If set, will be used for serializing shuffle data between nodes and also serializing RDDs to disk conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
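From Python, the usual pattern is simply to switch the serializer in SparkConf; class registration happens on the JVM side, so the registrator class named below is a hypothetical placeholder:

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", "com.example.MyKryoRegistrator"))  # hypothetical registrator class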
  • 137. ++ +
  • 138. Ex Ex Ex x = 5 T T x = 5 x = 5 x = 5 x = 5 T T
  • 139. • Broadcast variables – Send a large read-only lookup table to all the nodes, or send a large feature vector in a ML algorithm to all nodes • Accumulators – count events that occur during job execution for debugging purposes. Example: How many lines of the input file were blank? Or how many corrupt records were in the input dataset? ++ +
  • 140. Spark supports 2 types of shared variables: • Broadcast variables – allows your program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. Like sending a large, read-only lookup table to all the nodes. • Accumulators – allows you to aggregate values from worker nodes back to the driver program. Can be used to count the # of errors seen in an RDD of lines spread across 100s of nodes. Only the driver can access the value of an accumulator, tasks cannot. For tasks, accumulators are write-only. ++ +
  • 141. Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks For example, to give every node a copy of a large input dataset efficiently Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
  • 142. val broadcastVar = sc.broadcast(Array(1, 2, 3)) broadcastVar.value broadcastVar = sc.broadcast(list(range(1, 4))) broadcastVar.value Scala: Python:
  • 143. Accumulators are variables that can only be “added” to through an associative operation Used to implement counters and sums, efficiently in parallel Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types Only the driver program can read an accumulator’s value, not the tasks ++ +
  • 144. val accum = sc.accumulator(0) sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) accum.value accum = sc.accumulator(0) rdd = sc.parallelize([1, 2, 3, 4]) def f(x): global accum accum += x rdd.foreach(f) accum.value Scala: Python: ++ +
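Combining both shared-variable types in one sketch: a broadcast lookup table used inside a transformation, plus an accumulator answering the earlier "how many blank lines?" question (the file path is a placeholder):

severityLookup = sc.broadcast({"Error": 3, "Warn": 2, "Info": 1})
blankLines = sc.accumulator(0)

def score(line):
    if line.strip() == "":
        blankLines.add(1)                 # tasks can only write to the accumulator
    return severityLookup.value.get(line.split(",")[0], 0)

scores = sc.textFile("/path/to/app.log").map(score)
scores.count()                            # run an action before reading the accumulator
print(blankLines.value)                   # only the driver can read the value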
  • 146. Kafka Flume HDFS S3 Kinesis Twitter TCP socket HDFS Cassandra Dashboards Databases - Scalable - High-throughput - Fault-tolerant Complex algorithms can be expressed using: - Spark transformations: map(), reduce(), join(), etc - MLlib + GraphX - SQL
  • 148. Tathagata Das (TD) - Lead developer of Spark Streaming + Committer on Apache Spark core - Helped re-write Spark Core internals in 2012 to make it 10x faster to support Streaming use cases - On leave from UC Berkeley PhD program - Ex: Intern @ Amazon, Intern @ Conviva, Research Assistant @ Microsoft Research India - Scales to 100s of nodes - Batch sizes as small as half a second - Processing latency as low as 1 second - Exactly-once semantics no matter what fails
  • 149. Page views Kafka for buffering Spark for processing (live statistics)
  • 150. Smart meter readings Live weather data Join 2 live data sources (Anomaly Detection)
  • 151. Input data stream Batches of processed data Batches every X seconds
  • 152. (Discretized Stream) Block #1 RDD @ T=5 Block #2 Block #3 Batch interval = 5 seconds Block #1 RDD @ T=10 Block #2 Block #3 T = 5 T = 10 Input DStream One RDD is created every 5 seconds
  • 153. Block #1 Block #2 Block #3 Part. #1 Part. #2 Part. #3 Part. #1 Part. #2 Part. #3 5 sec wordsRDD flatMap() linesRDD linesDStream wordsDStream
  • 154. from pyspark import SparkContext from pyspark.streaming import StreamingContext # Create a local StreamingContext with two working threads and a batch interval of 5 seconds sc = SparkContext("local[2]", "NetworkWordCount") ssc = StreamingContext(sc, 5) # Create a DStream that will connect to hostname:port, like localhost:9999 linesDStream = ssc.socketTextStream("localhost", 9999) # Split each line into words wordsDStream = linesDStream.flatMap(lambda line: line.split(" ")) # Count each word in each batch pairsDStream = wordsDStream.map(lambda word: (word, 1)) wordCountsDStream = pairsDStream.reduceByKey(lambda x, y: x + y) # Print the first ten elements of each RDD generated in this DStream to the console wordCountsDStream.pprint() ssc.start() # Start the computation ssc.awaitTermination() # Wait for the computation to terminate
  • 155. Terminal #1 Terminal #2 $ nc -lk 9999 hello hello world $ ./network_wordcount.py localhost 9999 . . . -------------------------- Time: 2015-04-25 15:25:21 -------------------------- (hello, 2) (world, 1)
  • 156. Ex RDD, P1 W Driver RDD, P2 block, P1 T Internal Threads SSD SSDOS Disk T T T T Ex RDD, P3 W RDD, P4 block, P1 T Internal Threads SSD SSDOS Disk T T T T T Batch interval = 600 ms R
  • 157. Ex RDD, P1 W Driver RDD, P2 block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P3 W RDD, P4 block, P1 T Internal Threads SSD SSDOS Disk T T T T T 200 ms later Ex W block, P2 T Internal Threads SSD SSDOS Disk T T T T T block, P2 Batch interval = 600 ms
  • 158. Ex RDD, P1 W Driver RDD, P2 block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P1 W RDD, P2 block, P1 T Internal Threads SSD SSDOS Disk T T T T T 200 ms later Ex W block, P2 T Internal Threads SSD SSDOS Disk T T T T T block, P2 Batch interval = 600 ms block, P3 block, P3
  • 159. Ex RDD, P1 W Driver RDD, P2 RDD, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P1 W RDD, P2 RDD, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P2 T Internal Threads SSD SSDOS Disk T T T T T RDD, P2 Batch interval = 600 ms RDD, P3 RDD, P3
  • 160. Ex RDD, P1 W Driver RDD, P2 RDD, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P1 W RDD, P2 RDD, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P2 T Internal Threads SSD SSDOS Disk T T T T T RDD, P2 Batch interval = 600 ms RDD, P3 RDD, P3
  • 161. 1.4.0 New UI for Streaming
  • 163. Ex W Driver block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T T Batch interval = 600 ms Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T R block, P1 2 input DStreams
  • 164. Ex W Driver block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T R block, P1 Batch interval = 600 ms block, P2 block, P3 block, P2 block, P3 block, P2 block, P3 block, P2 block, P3
  • 165. Ex W Driver RDD, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex W RDD, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P1 T Internal Threads SSD SSDOS Disk T T T T R RDD, P1 Batch interval = 600 ms RDD, P2 RDD, P3 RDD, P2 RDD, P3 RDD, P2 RDD, P3 RDD, P2 RDD, P3 Materialize!
  • 166. Ex W Driver RDD, P3 T R Internal Threads SSD SSDOS Disk T T T T Ex W RDD, P4 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P3 T Internal Threads SSD SSDOS Disk T T T T R RDD, P6 Batch interval = 600 ms RDD, P4 RDD, P5 RDD, P2 RDD, P2 RDD, P5 RDD, P1 RDD, P1 RDD, P6 Union!
  • 167. - File systems - Socket Connections - Kafka - Flume - Twitter Sources directly available in StreamingContext API Requires linking against extra dependencies - Anywhere Requires implementing user-defined receiver
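As one hedged example of a source that needs an extra dependency, a Kafka receiver in the Spark 1.x Python API looks roughly like this; the ZooKeeper host, consumer group, and topic are placeholders, and the spark-streaming-kafka package must be on the classpath:

from pyspark.streaming.kafka import KafkaUtils

kafkaDStream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", {"pageviews": 1})
linesDStream = kafkaDStream.map(lambda kv: kv[1])   # records arrive as (key, value) pairs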
  • 168.
  • 169.
  • 170. map( ) flatMap( ) filter( ) repartition(numPartitions) union(otherStream) count() reduce( ) countByValue() reduceByKey( ,[numTasks]) join(otherStream,[numTasks]) cogroup(otherStream,[numTasks]) transform( ) (RDD => RDD) updateStateByKey( )*
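A short sketch of updateStateByKey(), building on pairsDStream from the earlier word-count example; stateful operations require a checkpoint directory, and the path here is a placeholder:

def updateCount(newValues, runningCount):
    # merge this batch's counts with the running total kept by Spark
    return sum(newValues) + (runningCount or 0)

ssc.checkpoint("/tmp/streaming-checkpoint")
runningCountsDStream = pairsDStream.updateStateByKey(updateCount)
runningCountsDStream.pprint()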