SlideShare uma empresa Scribd logo
1 de 72
Baixar para ler offline
pipeline.io
After Dark 2.0End-to-End,Real-time, Advanced Analytics and ML
Big Data Reference Pipeline
Atlanta Spark User Group
Sept 22, 2016
Thanks Emory Continuing Education!
Chris Fregly
Research Scientist @ PipelineIO
We’re Hiring - Only Nice People!
pipeline.io
advancedspark.com
pipeline.io
Who Am I?
2
Research Scientist @ PipelineIO
github.com/fluxcapacitor/pipeline
Meetup Founder
Advanced
Book Author
Advanced .
pipeline.io
Who Was I?
3
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
pipeline.io
Advanced Spark and Tensorflow Meetup
Meetup Metrics
Top 5 Most Active Spark Meetup!
4000+ Members in just 1 year!!
6000+ Downloads of Docker Image
with many Meetup Demos!!!
@ advancedspark.com
Meetup Goals
Code dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch,Tachyon, S3,
BlinkDB, Mesos, YARN, Kafka, R, etc
Surface and share patterns and idioms of well-designed,
distributed, big data processing systems
pipeline.io
Atlanta Hadoop User Meetup Last Night
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
pipeline.io
MLconf ATL Tomorrow
pipeline.io
Current PipelineIO Research
Model Deploying and Testing
Model Scaling and Serving
Online Model Training
Dynamic Model Optimizing
7
pipeline.io
PipelineIO Deliverables
100% Open Source!!
Github
https://github.com/fluxcapacitor/
DockerHub
https://hub.docker.com/r/fluxcapacitor
Workshop
http://pipeline.io
8
pipeline.io
Topics of This Talk (20-30 mins each)
① Spark Streaming and Spark ML
Generating Real-time Recommendations
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
9
pipeline.io
Live, Interactive Demo!
Kafka, Cassandra, ElasticSearch, Redis, Spark ML
pipeline.io
Audience Participation Required!
11
You -> Audience Instructions
①Navigate to
demo.pipeline.io
②Swipe software used in
Production Only!
Data ->
Scientist
This is Totally Anonymous!!
pipeline.io
Topics of This Talk (15-20 mins each)
① Spark Streaming and Spark ML
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
12
pipeline.io
Mechanical Sympathy
“Hardware and software working together.”
- Martin Thompson
http://mechanical-sympathy.blogspot.com
“Whatever your data structure, my array will win.”
- Scott Meyers
Every C++ Book, basically
13
pipeline.io
Spark and Mechanical Sympathy
14
Project
Tungsten
(Spark 1.4-1.6+)
100TB
GraySort
Challenge
(Spark 1.1-1.2)
Minimize Memory & GC
Maximize CPU Cache
Saturate Network I/O
Saturate Disk I/O
pipeline.io
CPU Cache Refresher
15
aka
“LLC”
My
Laptop
pipeline.io
CPU Cache Sympathy (AlphaSort paper)
Key (10 bytes) + Pointer (4 bytes) = 14 bytes
16
Key Ptr
Pre-process & Pull Key From Record
PtrKey-Prefix
2x CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
Ptr
Must Dereference to Compare Key
Key
Pointer (4 bytes) = 4 bytes
Key Ptr
Pad
/Pad CPU Cache-line Friendly!
Dereference full key only
to resolve prefix duplicates
pipeline.io
Sort Performance Comparison
17
pipeline.io
Sequential vs Random Cache Misses
18
pipeline.io
Demo!
Sorting
pipeline.io
Instrumenting and Monitoring CPU
Use Linux perf command!
20
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
pipeline.io
Results of Random vs. Sequential Sort
21
Naïve Random Access
Pointer Sort
Cache Friendly Sequential
Key/Pointer Sort
Ptr Key
Must Dereference to Compare Key
Key Ptr
Pre-process & Pull Key From Record
-35%
-90%
-68%
-26%
% Change
-55%
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
pipeline.io
Demo!
Matrix Multiplication
pipeline.io
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ];
23
Bad: Row-wise traversal,
not using full CPU cache line,
ineffective pre-fetching
pipeline.io
CPU Cache Friendly Matrix
Multiplication
// Transpose B
for (i <- 0 until numRowsB)
for (j <- 0 until numColsB)
matBT[ i ][ j ] = matB[ j ][ i ];
24
Good: Full CPU Cache Line,
Effective Prefetching
OLD: matB [ k ][ j ];
// Modify algo for Transpose B
for (i <- 0 until numRowsA)
for (j <- 0 until numColsB)
for (k <- 0 until numColsA)
res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
pipeline.io
Results Of Matrix Multiplication
Cache-Friendly Matrix MultiplyNaïve Matrix Multiply
perf stat –event 
L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, 
LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
-96%
-93%
-93%
-70%
-53%
% Change
-63%
+8543%?
pipeline.io
Demo!
Thread Synchronization
pipeline.io
Thread and Context Switch Sympathy
Problem
Atomically Increment 2 Counters
(each at different increments) by
1000’s of Simultaneous Threads
27
Possible Solutions
① Synchronized Immutable
② Synchronized Mutable
③ AtomicReference CAS
④ Volatile?
Context Switches are Expen$ive!!
aka
“LLC”
pipeline.io
Synchronized Immutable Counters
case class Counters(left: Int, right: Int)
object SynchronizedImmutableCounters {
var counters = new Counters(0,0)
def getCounters(): Counters = {
this.synchronized { counters }
}
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
this.synchronized {
counters = new Counters(counters.left + leftIncrement,
counters.right + rightIncrement)
}
}
} 28
Locks whole
outer object!!
pipeline.io
Synchronized Mutable Counters
class MutableCounters(left: Int, right: Int) {
def increment(leftIncrement: Int, rightIncrement: Int): Unit={
this.synchronized {…}
}
def getCountersTuple(): (Int, Int) = {
this.synchronized{ (counters.left, counters.right) }
}
}
object SynchronizedMutableCounters {
val counters = new MutableCounters(0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
counters.increment(leftIncrement, rightIncrement)
}
29
Locks just
MutableCounters
pipeline.io
Lock-Free AtomicReference Counters
case class Counters(left: Int, right: Int)
object LockFreeAtomicReferenceCounters {
val counters = new AtomicReference[Counters](new Counters(0,0))
def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
var originalCounters: Counters = null
var updatedCounters: Counters = null
do {
originalCounters = getCounters()
updatedCounters = new Counters(originalCounters.left+ leftIncrement,
originalCounters.right+ rightIncrement)
} // Retry lock-free, optimistic compareAndSet() until AtomicRef updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}
30Lock Free!!
pipeline.io
Lock-Free AtomicLong Counters
object LockFreeAtomicLongCounters {
// a single Long (64-bit) will maintain 2 separate Ints (32-bits each)
val counters = new AtomicLong()
…
def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
var originalCounters = 0L
var updatedCounters = 0L
do {
originalCounters = counters.get()
…
// Store two 32-bit Int into one 64-bit Long
// Use >>> 32 and << 32 to set and retrieve each Int from the Long
}
// Retry lock-free, optimistic compareAndSet() until AtomicLong updates
while !(counters.compareAndSet(originalCounters, updatedCounters))
}31 Lock Free!!
A: The JVM does not
guarantee atomic
updates of 64-bit
longs and doubles
Q: Why not use
@volatile long?
pipeline.io
Results of Thread Synchronization
Immutable Case Class
32
Lock-Free AtomicLong
-64%
-46%
-17%
-33%
perf stat –event 
context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, 
LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
% Change
-31%
-32%
-33%
-27%
case class Counters(left:Int, right: Int)
...
this.synchronized {
counters = new Counters(counters.left + leftIncrement,
counters.right + rightIncrement)
}
val counters = new AtomicLong()
…
do {
…
} while !(counters.compareAndSet(originalCounters,
updatedCounters))
pipeline.io
Profile Visualizations: Flame Graphs
33
Example: Spark Word Count
Java Stack Traces are Good! JDK 1.8+
(-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are Bad!
I/O stalls, Heavy CPU
serialization, etc
pipeline.io
Project Tungsten: CPU and Memory
Create Custom Data Structures & Algorithms
Operate on serialized and compressed ByteArrays!
Minimize Garbage Collection
Reuse ByteArrays
In-place updates for aggregations
Maximize CPU Cache Effectiveness
8-byte alignment
AlphaSort-based Key-Prefix
Utilize Catalyst Dynamic Code Generation
Dynamic optimizations using entire query plan
Developer implements genCode() to create Scala source code
(String)
34
pipeline.io
Why is CPU the Bottleneck?
CPU for serialization, hashing, & compression
Spark 1.2 updates saturated Network, Disk I/O
10x increase in I/O throughput relative to CPU
More partition, pruning, and pushdown support
Newer columnar file formats help reduce I/O
35
pipeline.io
Custom Data Structs & Algos: Aggs
UnsafeFixedWidthAggregationMap
Uses BytesToBytesMap internally
In-place updates of serialized aggregation
No object creation on hot-path
TungstenAggregate & TungstenAggregationIterator
Operates directly on serialized, binary UnsafeRow
2 steps to avoid single-key OOMs
① Hash-based (grouping) agg spills to disk if needed
② Sort-based agg performs external merge sort on spills
36
pipeline.io
Custom Data Structures & Algorithms
o.a.s.util.collection.unsafe.sort.
UnsafeSortDataFormat
UnsafeExternalSorter
UnsafeShuffleWriter
UnsafeInMemorySorter
RecordPointerAndKeyPrefix
37
PtrKey-Prefix
2x CPU Cache-line Friendly!
SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]>
Note: Mixing multiplesubclasses of SortDataFormat
simultaneously will prevent JIT inlining.
Supports merging compressed records
(if compression CODEC supports it, ie. LZF)
In-place external sorting of spilled BytesToBytes data
AlphaSort-based, 8-byte aligned sort key
In-place sorting of BytesToBytesMap data
pipeline.io
Code Generation
Problem
Boxing creates excessive objects
Expression tree evaluations are costly
JVM can’t inline polymorphic impls
Lack of polymorphism == poor code design
Solution
Code generation enables inlining
Rewrite and optimize code using overall plan, 8-byte align
Defer source code generation to each operator, UDF,
UDAF
Use Janino to compile generated source code ->bytecode
38
pipeline.io
Autoscaling Spark Workers (Spark 1.5+)
Scaling up is easy J
SparkContext.addExecutors() until max is
reached
Scaling down is hard L
SparkContext.removeExecutors()
Lose RDD cache inside Executor JVM
Must rebuild active RDD partitions in another Executor
JVM
Uses External Shuffle Service from Spark 1.1-1.2
If Executor JVM dies/restarts, shuffle keeps shufflin’! 39
pipeline.io
“Hidden” Spark Submit REST API
http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Submit Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/create 
--header "Content-Type:application/json;charset=UTF-8" 
--data ’{"action" : "CreateSubmissionRequest”,
"mainClass" : "org.apache.spark.examples.SparkPi”,
"sparkProperties" : {
"spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",
"spark.app.name" : "SparkPi",…
}}’
Get Spark Job Status
curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>
Kill Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>
40
(the snitch)
pipeline.io
Outline
① Spark Streaming and Spark ML
Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core
Tuning and Profiling
③ Spark SQL
Tuning and Customizing
41
pipeline.io
Parquet Columnar File Format
Based on Google Dremel paper ~2010
Collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggs
Supports evolving schema
Supports pushdowns
Support nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping 42
Min/Max Heuristics
For Chunk Skipping
pipeline.io
Partitions
Partition Based on Data Access Patterns
/genders.parquet/gender=M/…
/gender=F/… <-- Use Case: Access Users by Gender
/gender=U/…
Dynamic Partition Creation (Write)
Dynamically create partitions on write based on column (ie. Gender)
SQL: INSERT TABLE genders PARTITION (gender) SELECT …
DF: gendersDF.write.format("parquet").partitionBy("gender")
.save("/genders.parquet")
Partition Discovery (Read)
Dynamically infer partitions on read based on paths (ie./gender=F/…)
SQL: SELECT id FROM genders WHERE gender=F
DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id").
.where("gender=F")
43
pipeline.io
Pruning
Partition Pruning
Filter out rows by partition
SELECT id, gender FROM genders WHERE gender = ‘F’
Column Pruning
Filter out columns by column filter
Extremely useful for columnar storage formats (Parquet)
Skip entire blocks of columns
SELECT id, gender FROM genders
44
pipeline.io
Pushdowns
aka. Predicate or Filter Pushdowns
Predicate returns true or false for given function
Filters rows deep into the data source
Reduces number of rows returned
Data Source must implement PrunedFilteredScan
def buildScan(requiredColumns: Array[String],
filters: Array[Filter]): RDD[Row]
45
pipeline.io
Demo!
File Formats, Partitions, Pushdowns, and Joins
pipeline.io
Predicate Pushdowns & Filter Collapsing
47
Filter pushdown
No extra pass
Filter combining
Only 1 extra pass
2 extra passes through the data after retrieval
pipeline.io
Join Between Partitioned &
Unpartitioned
48
pipeline.io
Join Between Partitioned & Partitioned
49
pipeline.io
Broadcast Join vs. Normal Shuffle Join
50
pipeline.io
Cartesian Join vs. Inner Join
51
pipeline.io
Visualizing the Query Plan
52
Effectiveness
of Filter
Cost-based
Join Optimization
Similar to
MapReduce
Map-side Join
& DistributedCache
Peak Memory for
Joins and Aggs
UnsafeFixedWidthAggregationMap
getPeakMemoryUsedBytes()
pipeline.io
Data Source API
Relations (o.a.s.sql.sources.interfaces.scala)
BaseRelation (abstract class): Provides schema of data
TableScan (impl): Read all data from source
PrunedFilteredScan (impl): Column pruning & predicate pushdowns
InsertableRelation (impl): Insert/overwrite data based on SaveMode
RelationProvider (trait/interface): Handle options, BaseRelation factory
Filters (o.a.s.sql.sources.filters.scala)
Filter (abstract class): Handles all filters supported by this source
EqualTo (impl)
GreaterThan (impl)
StringStartsWith (impl) 53
pipeline.io
Native Spark SQL Data Sources
54
pipeline.io
JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
.load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or –
val ratingsDF = sqlContext.read.json
("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")
55
json() convenience method
pipeline.io
Parquet Data Source
Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=false (unless your schema is evolving)
spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
val gendersDF = sqlContext.read.format("parquet")
.load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
56
pipeline.io
ElasticSearch Data Source
Github
https://github.com/elastic/elasticsearch-hadoop
Maven
org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
"es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite)
.options(esConfig).save("<index>/<document-type>")
57
pipeline.io
Cassandra Data Source
Github
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write
.format("org.apache.spark.sql.cassandra")
.mode(SaveMode.Append)
.options(Map("keyspace"->"<keyspace>",
"table"->"<table>")).save(…) 58
pipeline.io
Tips for Cassandra Analytics
By-pass Cassandra CQL “front door”
CQL Optimized for Transactions
Bulk read and write directly against SSTables
Check out Netflix OSS project “Aegisthus”
Cassandra becomesa first-class analytics option
Replicated analytics cluster no longer needed
59
pipeline.io
Creating a Custom Data Source
① Study existing implementations
o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend base traits & implement required methods
o.a.s.sql.sources.{BaseRelation,PrunedFilterScan}
Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
DataStax Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation 60
pipeline.io
Demo!
Create a Custom Data Source
pipeline.io
Publishing Custom Data Sources
62
spark-packages.org
pipeline.io
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Every UDF must use Expressions and
implement Expression.genCode()
to participate in the fun
Lambdas (RDD or Dataset API)
and sqlContext.udf.registerFunction()
are not enough!!
pipeline.io
Creating a Custom UDF with Code Gen
① Study existing implementations
o.a.s.sql.catalyst.expressions.Substring
② Extend and implement base trait
o.a.s.sql.catalyst.expressions.Expression.genCode
③ Don’t forget about Python!
python.pyspark.sql.functions.py
64
pipeline.io
Demo!
Creating a Custom UDF participating in Code Generation
pipeline.io
Spark 1.6 and 2.0 Improvements
Adaptiveness, Metrics, Datasets, and Streaming State
pipeline.io
Adaptive Query Execution
Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)
67
Broadcast Join
(popular keys)
Shuffle Join
(not-so-popular keys)
Adaptive
Hybrid
Join
pipeline.io
Adaptive Memory Management
Spark <1.6
Manual configure between 2 memory regions
Spark execution engine (shuffles, joins, sorts, aggs)
spark.shuffle.memoryFraction
RDD Data Cache
spark.storage.memoryFraction
Spark 1.6+
Unified memory regions
Dynamically expand/contract memory regions
Supports minimum for RDD storage (LRU Cache)
68
pipeline.io
Metrics
Shows exact memory usage per operator & node
Helps debugging and identifying skew
69
pipeline.io
Spark SQL API
Datasets type safe API (similar to RDDs) utilizing Tungsten
val ds = sqlContext.read.text("ratings.csv").as[String]
val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF
val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc)
Typed Aggregators used alongside UDFs and UDAFs
val simpleSum = new Aggregator[Int, Int, Int] with Serializable {
def zero: Int = 0
def reduce(b: Int, a: Int) = b + a
def merge(b1: Int, b2: Int) = b1 + b2
def finish(b: Int) = b
}.toColumn
val sum = Seq(1,2,3,4).toDS().select(simpleSum)
Query files directly without registerTempTable()
%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json` 70
pipeline.io
Spark Streaming State Management
New trackStateByKey()
Store deltas, compact later
More efficient per-key state update
Session TTL
Integrated A/B Testing (?!)
Show Failed Output in Admin UI
Better debugging
71
pipeline.io
Thank You!!
Chris Fregly
Research Scientist @ PipelineIO
(http://pipeline.io)
San Francisco, California, USA
advancedspark.com
Sign up for the Meetup and Book
Contribute on Github!
Run All Demos in Docker
~6000 Docker Downloads!!
Find me on LinkedIn, Twitter, Github, Email, Fax 72

Mais conteúdo relacionado

Mais procurados

Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Chris Fregly
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
Anu Shetty
 

Mais procurados (20)

The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on KubeflowMigrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
Migrating Apache Spark ML Jobs to Spark + Tensorflow on Kubeflow
 
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Python VS GO
Python VS GOPython VS GO
Python VS GO
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
 
On heap cache vs off-heap cache
On heap cache vs off-heap cacheOn heap cache vs off-heap cache
On heap cache vs off-heap cache
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 

Destaque

Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Chris Fregly
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Chris Fregly
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
Chris Fregly
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
Chris Fregly
 

Destaque (20)

Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
 
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
 
Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
 
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
Advanced Spark and TensorFlow Meetup 08-04-2016 One Click Spark ML Pipeline D...
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
 
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
 
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChicago Spark Meetup 03 01 2016 - Spark and Recommendations
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
 

Semelhante a Atlanta Spark User Meetup 09 22 2016

Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Mingliang Liu
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData
 
Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...
Universität Rostock
 

Semelhante a Atlanta Spark User Meetup 09 22 2016 (20)

Protocol Independence
Protocol IndependenceProtocol Independence
Protocol Independence
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
 
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello World
 
Let's Get to the Rapids
Let's Get to the RapidsLet's Get to the Rapids
Let's Get to the Rapids
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war story
 
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Serial-War
Serial-WarSerial-War
Serial-War
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
 
Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...Inside LoLA - Experiences from building a state space tool for place transiti...
Inside LoLA - Experiences from building a state space tool for place transiti...
 

Mais de Chris Fregly

Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
Chris Fregly
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Chris Fregly
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Chris Fregly
 

Mais de Chris Fregly (20)

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
PipelineAI + AWS SageMaker + Distributed TensorFlow + AI Model Training and S...
 
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
Building Google's ML Engine from Scratch on AWS with GPUs, Kubernetes, Istio,...
 

Último

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 

Último (20)

%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 

Atlanta Spark User Meetup 09 22 2016

  • 1. pipeline.io After Dark 2.0End-to-End,Real-time, Advanced Analytics and ML Big Data Reference Pipeline Atlanta Spark User Group Sept 22, 2016 Thanks Emory Continuing Education! Chris Fregly Research Scientist @ PipelineIO We’re Hiring - Only Nice People! pipeline.io advancedspark.com
  • 2. pipeline.io Who Am I? 2 Research Scientist @ PipelineIO github.com/fluxcapacitor/pipeline Meetup Founder Advanced Book Author Advanced .
  • 3. pipeline.io Who Was I? 3 Streaming Data Engineer Netflix Open Source Committer Data Solutions Engineer Apache Contributor Principal Data Solutions Engineer IBM Technology Center
  • 4. pipeline.io Advanced Spark and Tensorflow Meetup Meetup Metrics Top 5 Most Active Spark Meetup! 4000+ Members in just 1 year!! 6000+ Downloads of Docker Image with many Meetup Demos!!! @ advancedspark.com Meetup Goals Code dive deep into Spark and related open source code bases Study integrations with Cassandra, ElasticSearch,Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc Surface and share patterns and idioms of well-designed, distributed, big data processing systems
  • 5. pipeline.io Atlanta Hadoop User Meetup Last Night http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
  • 7. pipeline.io Current PipelineIO Research Model Deploying and Testing Model Scaling and Serving Online Model Training Dynamic Model Optimizing 7
  • 8. pipeline.io PipelineIO Deliverables 100% Open Source!! Github https://github.com/fluxcapacitor/ DockerHub https://hub.docker.com/r/fluxcapacitor Workshop http://pipeline.io 8
  • 9. pipeline.io Topics of This Talk (20-30 mins each) ① Spark Streaming and Spark ML Generating Real-time Recommendations ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing 9
  • 10. pipeline.io Live, Interactive Demo! Kafka, Cassandra, ElasticSearch, Redis, Spark ML
  • 11. pipeline.io Audience Participation Required! 11 You -> Audience Instructions ①Navigate to demo.pipeline.io ②Swipe software used in Production Only! Data -> Scientist This is Totally Anonymous!!
  • 12. pipeline.io Topics of This Talk (15-20 mins each) ① Spark Streaming and Spark ML Kafka, Cassandra, ElasticSearch, Redis, Docker ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing 12
  • 13. pipeline.io Mechanical Sympathy “Hardware and software working together.” - Martin Thompson http://mechanical-sympathy.blogspot.com “Whatever your data structure, my array will win.” - Scott Meyers Every C++ Book, basically 13
  • 14. pipeline.io Spark and Mechanical Sympathy 14 Project Tungsten (Spark 1.4-1.6+) 100TB GraySort Challenge (Spark 1.1-1.2) Minimize Memory & GC Maximize CPU Cache Saturate Network I/O Saturate Disk I/O
  • 16. pipeline.io CPU Cache Sympathy (AlphaSort paper) Key (10 bytes) + Pointer (4 bytes) = 14 bytes 16 Key Ptr Pre-process & Pull Key From Record PtrKey-Prefix 2x CPU Cache-line Friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes Ptr Must Dereference to Compare Key Key Pointer (4 bytes) = 4 bytes Key Ptr Pad /Pad CPU Cache-line Friendly! Dereference full key only to resolve prefix duplicates
  • 20. pipeline.io Instrumenting and Monitoring CPU Use Linux perf command! 20 http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
  • 21. pipeline.io Results of Random vs. Sequential Sort 21 Naïve Random Access Pointer Sort Cache Friendly Sequential Key/Pointer Sort Ptr Key Must Dereference to Compare Key Key Ptr Pre-process & Pull Key From Record -35% -90% -68% -26% % Change -55% perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses
  • 23. pipeline.io CPU Cache Naïve Matrix Multiplication // Dot product of each row & column vector for (i <- 0 until numRowA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matB[ k ][ j ]; 23 Bad: Row-wise traversal, not using full CPU cache line, ineffective pre-fetching
  • 24. pipeline.io CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numRowsB) for (j <- 0 until numColsB) matBT[ i ][ j ] = matB[ j ][ i ]; 24 Good: Full CPU Cache Line, Effective Prefetching OLD: matB [ k ][ j ]; // Modify algo for Transpose B for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res[ i ][ j ] += matA[ i ][ k ] * matBT[ j ][ k ];
  • 25. pipeline.io Results Of Matrix Multiplication Cache-Friendly Matrix MultiplyNaïve Matrix Multiply perf stat –event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend -96% -93% -93% -70% -53% % Change -63% +8543%?
  • 27. pipeline.io Thread and Context Switch Sympathy Problem Atomically Increment 2 Counters (each at different increments) by 1000’s of Simultaneous Threads 27 Possible Solutions ① Synchronized Immutable ② Synchronized Mutable ③ AtomicReference CAS ④ Volatile? Context Switches are Expen$ive!! aka “LLC”
  • 28. pipeline.io Synchronized Immutable Counters case class Counters(left: Int, right: Int) object SynchronizedImmutableCounters { var counters = new Counters(0,0) def getCounters(): Counters = { this.synchronized { counters } } def increment(leftIncrement: Int, rightIncrement: Int): Unit = { this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } } } 28 Locks whole outer object!!
  • 29. pipeline.io Synchronized Mutable Counters class MutableCounters(left: Int, right: Int) { def increment(leftIncrement: Int, rightIncrement: Int): Unit={ this.synchronized {…} } def getCountersTuple(): (Int, Int) = { this.synchronized{ (counters.left, counters.right) } } } object SynchronizedMutableCounters { val counters = new MutableCounters(0,0) … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { counters.increment(leftIncrement, rightIncrement) } 29 Locks just MutableCounters
  • 30. pipeline.io Lock-Free AtomicReference Counters case class Counters(left: Int, right: Int) object LockFreeAtomicReferenceCounters { val counters = new AtomicReference[Counters](new Counters(0,0)) def increment(leftIncrement: Int, rightIncrement: Int) : Long = { var originalCounters: Counters = null var updatedCounters: Counters = null do { originalCounters = getCounters() updatedCounters = new Counters(originalCounters.left+ leftIncrement, originalCounters.right+ rightIncrement) } // Retry lock-free, optimistic compareAndSet() until AtomicRef updates while !(counters.compareAndSet(originalCounters, updatedCounters)) } 30Lock Free!!
  • 31. pipeline.io Lock-Free AtomicLong Counters object LockFreeAtomicLongCounters { // a single Long (64-bit) will maintain 2 separate Ints (32-bits each) val counters = new AtomicLong() … def increment(leftIncrement: Int, rightIncrement: Int): Unit = { var originalCounters = 0L var updatedCounters = 0L do { originalCounters = counters.get() … // Store two 32-bit Int into one 64-bit Long // Use >>> 32 and << 32 to set and retrieve each Int from the Long } // Retry lock-free, optimistic compareAndSet() until AtomicLong updates while !(counters.compareAndSet(originalCounters, updatedCounters)) }31 Lock Free!! A: The JVM does not guarantee atomic updates of 64-bit longs and doubles Q: Why not use @volatile long?
  • 32. pipeline.io Results of Thread Synchronization Immutable Case Class 32 Lock-Free AtomicLong -64% -46% -17% -33% perf stat –event context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, LLC-load-misses, LLC-prefetch-misses,cache-misses,stalled-cycles-frontend % Change -31% -32% -33% -27% case class Counters(left:Int, right: Int) ... this.synchronized { counters = new Counters(counters.left + leftIncrement, counters.right + rightIncrement) } val counters = new AtomicLong() … do { … } while !(counters.compareAndSet(originalCounters, updatedCounters))
  • 33. pipeline.io Profile Visualizations: Flame Graphs 33 Example: Spark Word Count Java Stack Traces are Good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer) Plateaus are Bad! I/O stalls, Heavy CPU serialization, etc
  • 34. pipeline.io Project Tungsten: CPU and Memory Create Custom Data Structures & Algorithms Operate on serialized and compressed ByteArrays! Minimize Garbage Collection Reuse ByteArrays In-place updates for aggregations Maximize CPU Cache Effectiveness 8-byte alignment AlphaSort-based Key-Prefix Utilize Catalyst Dynamic Code Generation Dynamic optimizations using entire query plan Developer implements genCode() to create Scala source code (String) 34
  • 35. pipeline.io Why is CPU the Bottleneck? CPU for serialization, hashing, & compression Spark 1.2 updates saturated Network, Disk I/O 10x increase in I/O throughput relative to CPU More partition, pruning, and pushdown support Newer columnar file formats help reduce I/O 35
  • 36. pipeline.io Custom Data Structs & Algos: Aggs UnsafeFixedWidthAggregationMap Uses BytesToBytesMap internally In-place updates of serialized aggregation No object creation on hot-path TungstenAggregate & TungstenAggregationIterator Operates directly on serialized, binary UnsafeRow 2 steps to avoid single-key OOMs ① Hash-based (grouping) agg spills to disk if needed ② Sort-based agg performs external merge sort on spills 36
  • 37. pipeline.io Custom Data Structures & Algorithms o.a.s.util.collection.unsafe.sort. UnsafeSortDataFormat UnsafeExternalSorter UnsafeShuffleWriter UnsafeInMemorySorter RecordPointerAndKeyPrefix 37 PtrKey-Prefix 2x CPU Cache-line Friendly! SortDataFormat<RecordPointerAndKeyPrefix, Long[ ]> Note: Mixing multiplesubclasses of SortDataFormat simultaneously will prevent JIT inlining. Supports merging compressed records (if compression CODEC supports it, ie. LZF) In-place external sorting of spilled BytesToBytes data AlphaSort-based, 8-byte aligned sort key In-place sorting of BytesToBytesMap data
  • 38. pipeline.io Code Generation Problem Boxing creates excessive objects Expression tree evaluations are costly JVM can’t inline polymorphic impls Lack of polymorphism == poor code design Solution Code generation enables inlining Rewrite and optimize code using overall plan, 8-byte align Defer source code generation to each operator, UDF, UDAF Use Janino to compile generated source code ->bytecode 38
  • 39. pipeline.io Autoscaling Spark Workers (Spark 1.5+) Scaling up is easy J SparkContext.addExecutors() until max is reached Scaling down is hard L SparkContext.removeExecutors() Lose RDD cache inside Executor JVM Must rebuild active RDD partitions in another Executor JVM Uses External Shuffle Service from Spark 1.1-1.2 If Executor JVM dies/restarts, shuffle keeps shufflin’! 39
  • 40. pipeline.io “Hidden” Spark Submit REST API http://arturmkrtchyan.com/apache-spark-hidden-rest-api Submit Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data ’{"action" : "CreateSubmissionRequest”, "mainClass" : "org.apache.spark.examples.SparkPi”, "sparkProperties" : { "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar", "spark.app.name" : "SparkPi",… }}’ Get Spark Job Status curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request> Kill Spark Job curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request> 40 (the snitch)
  • 41. pipeline.io Outline ① Spark Streaming and Spark ML Kafka, Cassandra, ElasticSearch, Redis, Docker ② Spark Core Tuning and Profiling ③ Spark SQL Tuning and Customizing 41
  • 42. pipeline.io Parquet Columnar File Format Based on Google Dremel paper ~2010 Collaboration with Twitter and Cloudera Columnar storage format for fast columnar aggs Supports evolving schema Supports pushdowns Support nested partitions Tight compression Min/max heuristics enable file and chunk skipping 42 Min/Max Heuristics For Chunk Skipping
  • 43. pipeline.io Partitions Partition Based on Data Access Patterns /genders.parquet/gender=M/… /gender=F/… <-- Use Case: Access Users by Gender /gender=U/… Dynamic Partition Creation (Write) Dynamically create partitions on write based on column (ie. Gender) SQL: INSERT TABLE genders PARTITION (gender) SELECT … DF: gendersDF.write.format("parquet").partitionBy("gender") .save("/genders.parquet") Partition Discovery (Read) Dynamically infer partitions on read based on paths (ie./gender=F/…) SQL: SELECT id FROM genders WHERE gender=F DF: gendersDF.read.format("parquet").load("/genders.parquet/").select($"id"). .where("gender=F") 43
  • 44. pipeline.io Pruning Partition Pruning Filter out rows by partition SELECT id, gender FROM genders WHERE gender = ‘F’ Column Pruning Filter out columns by column filter Extremely useful for columnar storage formats (Parquet) Skip entire blocks of columns SELECT id, gender FROM genders 44
  • 45. pipeline.io Pushdowns aka. Predicate or Filter Pushdowns Predicate returns true or false for given function Filters rows deep into the data source Reduces number of rows returned Data Source must implement PrunedFilteredScan def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] 45
  • 47. pipeline.io Predicate Pushdowns & Filter Collapsing 47 Filter pushdown No extra pass Filter combining Only 1 extra pass 2 extra passes through the data after retrieval
  • 50. pipeline.io Broadcast Join vs. Normal Shuffle Join 50
  • 52. pipeline.io Visualizing the Query Plan 52 Effectiveness of Filter Cost-based Join Optimization Similar to MapReduce Map-side Join & DistributedCache Peak Memory for Joins and Aggs UnsafeFixedWidthAggregationMap getPeakMemoryUsedBytes()
  • 53. pipeline.io Data Source API Relations (o.a.s.sql.sources.interfaces.scala) BaseRelation (abstract class): Provides schema of data TableScan (impl): Read all data from source PrunedFilteredScan (impl): Column pruning & predicate pushdowns InsertableRelation (impl): Insert/overwrite data based on SaveMode RelationProvider (trait/interface): Handle options, BaseRelation factory Filters (o.a.s.sql.sources.filters.scala) Filter (abstract class): Handles all filters supported by this source EqualTo (impl) GreaterThan (impl) StringStartsWith (impl) 53
  • 54. pipeline.io Native Spark SQL Data Sources 54
  • 55. pipeline.io JSON Data Source DataFrame val ratingsDF = sqlContext.read.format("json") .load("file:/root/pipeline/datasets/dating/ratings.json.bz2") -- or – val ratingsDF = sqlContext.read.json ("file:/root/pipeline/datasets/dating/ratings.json.bz2") SQL Code CREATE TABLE genders USING json OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2") 55 json() convenience method
  • 56. pipeline.io Parquet Data Source Configuration spark.sql.parquet.filterPushdown=true spark.sql.parquet.mergeSchema=false (unless your schema is evolving) spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable()) spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo] DataFrames val gendersDF = sqlContext.read.format("parquet") .load("file:/root/pipeline/datasets/dating/genders.parquet") gendersDF.write.format("parquet").partitionBy("gender") .save("file:/root/pipeline/datasets/dating/genders.parquet") SQL CREATE TABLE genders USING parquet OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet") 56
  • 57. pipeline.io ElasticSearch Data Source Github https://github.com/elastic/elasticsearch-hadoop Maven org.elasticsearch:elasticsearch-spark_2.10:2.1.0 Code val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>") df.write.format("org.elasticsearch.spark.sql”).mode(SaveMode.Overwrite) .options(esConfig).save("<index>/<document-type>") 57
  • 59. pipeline.io Tips for Cassandra Analytics By-pass Cassandra CQL “front door” CQL Optimized for Transactions Bulk read and write directly against SSTables Check out Netflix OSS project “Aegisthus” Cassandra becomesa first-class analytics option Replicated analytics cluster no longer needed 59
  • 60. pipeline.io Creating a Custom Data Source ① Study existing implementations o.a.s.sql.execution.datasources.jdbc.JDBCRelation ② Extend base traits & implement required methods o.a.s.sql.sources.{BaseRelation,PrunedFilterScan} Spark JDBC (o.a.s.sql.execution.datasources.jdbc) class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation DataStax Cassandra (o.a.s.sql.cassandra) class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation 60
  • 62. pipeline.io Publishing Custom Data Sources 62 spark-packages.org
  • 63. pipeline.io Spark SQL UDF Code Generation 100+ UDFs now generating code More to come in Spark 1.6+ Details in SPARK-8159, SPARK-9571 Every UDF must use Expressions and implement Expression.genCode() to participate in the fun Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!
  • 64. pipeline.io Creating a Custom UDF with Code Gen ① Study existing implementations o.a.s.sql.catalyst.expressions.Substring ② Extend and implement base trait o.a.s.sql.catalyst.expressions.Expression.genCode ③ Don’t forget about Python! python.pyspark.sql.functions.py 64
  • 65. pipeline.io Demo! Creating a Custom UDF participating in Code Generation
  • 66. pipeline.io Spark 1.6 and 2.0 Improvements Adaptiveness, Metrics, Datasets, and Streaming State
  • 67. pipeline.io Adaptive Query Execution Adapt query execution using data from previous stages Dynamically choose spark.sql.shuffle.partitions (default 200) 67 Broadcast Join (popular keys) Shuffle Join (not-so-popular keys) Adaptive Hybrid Join
  • 68. pipeline.io Adaptive Memory Management Spark <1.6 Manual configure between 2 memory regions Spark execution engine (shuffles, joins, sorts, aggs) spark.shuffle.memoryFraction RDD Data Cache spark.storage.memoryFraction Spark 1.6+ Unified memory regions Dynamically expand/contract memory regions Supports minimum for RDD storage (LRU Cache) 68
  • 69. pipeline.io Metrics Shows exact memory usage per operator & node Helps debugging and identifying skew 69
  • 70. pipeline.io Spark SQL API Datasets type safe API (similar to RDDs) utilizing Tungsten val ds = sqlContext.read.text("ratings.csv").as[String] val df = ds.flatMap(_.split(",")).filter(_ != "").toDF() // RDD API, convert to DF val agg = df.groupBy($"rating").agg(count("*") as "ct”).orderBy($"ct" desc) Typed Aggregators used alongside UDFs and UDAFs val simpleSum = new Aggregator[Int, Int, Int] with Serializable { def zero: Int = 0 def reduce(b: Int, a: Int) = b + a def merge(b1: Int, b2: Int) = b1 + b2 def finish(b: Int) = b }.toColumn val sum = Seq(1,2,3,4).toDS().select(simpleSum) Query files directly without registerTempTable() %sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json` 70
  • 71. pipeline.io Spark Streaming State Management New trackStateByKey() Store deltas, compact later More efficient per-key state update Session TTL Integrated A/B Testing (?!) Show Failed Output in Admin UI Better debugging 71
  • 72. pipeline.io Thank You!! Chris Fregly Research Scientist @ PipelineIO (http://pipeline.io) San Francisco, California, USA advancedspark.com Sign up for the Meetup and Book Contribute on Github! Run All Demos in Docker ~6000 Docker Downloads!! Find me on LinkedIn, Twitter, Github, Email, Fax 72