The document summarizes Pinterest's migration of ETL workflows from Cascading and Scalding to Spark. Key points:
- Pinterest runs Spark on AWS but manages its own clusters to avoid vendor lock-in, operating multiple Spark clusters of several hundred to 1000+ nodes.
- The migration plan is to move the remaining Hive, Cascading/Scalding, and Hadoop Streaming workloads to SparkSQL, PySpark, and native Spark over time. An automatic migration service helps with the process.
- Technical challenges included secondary sorting, accumulators behaving differently between frameworks, and output committer issues. Performance profiling and tuning were also important.
- The talk closes with the results of the migration so far and future plans.
2. About Us
• Daniel Dai
  • Tech Lead at Pinterest
  • PMC member of Apache Hive and Pig
• Zirui Li
  • Software Engineer on the Pinterest Spark Platform Team
  • Focuses on building Pinterest's in-house Spark platform and functionality
4. Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
5. We Are on Cloud
• We use AWS, but we build our own clusters
  • Avoids vendor lock-in
  • Timely support from our own team
• We store everything on S3
  • Costs less than HDFS
  • HDFS is for temporary storage only
[Diagram: multiple EC2 clusters, each running YARN and HDFS, all backed by shared S3 storage]
6. Spark Clusters
• We have a couple of Spark clusters
  • From several hundred nodes to 1000+ nodes
  • Spark-only clusters and mixed-use clusters
  • Cross-cluster routing
• R5d instance type for the Spark-only clusters
  • Faster local disk
  • High memory-to-CPU ratio
7. Spark Versions and Use Cases
• We are running Spark 2.4
  • With quite a few internal fixes
  • Will migrate to 3.1 this year
• Use cases
  • Production: SparkSQL, PySpark, and native Spark via Airflow
  • Ad hoc: SparkSQL via Querybook, PySpark via Jupyter
8. Migration Plan
• 40% of workloads are already on Spark
  • The number was 12% one year ago
• Migration in progress
  • Hive to SparkSQL
  • Cascading/Scalding to Spark
  • Hadoop Streaming to Spark pipe
[Chart: "Where are we?" breakdown of remaining workloads across Hive, Cascading/Scalding, and Hadoop Streaming]
9. Migration Plan
• Half of the workloads are still on Cascading/Scalding
  • ETL use cases
• The Spark future
  • Query engine: Presto/SparkSQL
  • ETL: native Spark
  • Machine learning: PySpark
10. Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
11. Cascading
• Simple DAG
  • Only 6 different pipes
• Most logic is in UDFs
  • Each: UDF on the map side
  • Every: UDF on the reduce side
• Java API
[Diagram. Pattern 1: Source → Each → GroupBy → Every → Sink. Pattern 2: two Source → Each branches feeding CoGroup → Every → Sink]
12. Scalding
• Rich set of operators on top of Cascading
• Operators are very similar to Spark RDD operators
• Scala API
13. Migration Path
+
▪ UDF
interface is
private
▪ SQL easy to
migrate to
any engine
Recommend if there’s not
many UDFs
SparkSQL
−
PySpark
▪ Suboptimal
performanc
e, especially
for Python
UDF
▪ Rich Python
libraries
available to
use
+ −
Recommended for Machine
Learning only
+
Native Spark
▪ most structured path to enjoin
rich spark syntax
▪ Work for almost all
Cascading/Scalding
applications
Default & Recommended for
general cases
14. Spark API
• Spark Dataframe/Dataset
  • + Newer and recommended API
  • − Most inputs are Thrift sequence files, and encoding/decoding Thrift objects to/from a dataframe is slow
  • Recommended only for non-Thrift sequence file inputs
• RDD
  • + More flexible for handling Thrift object serialization/deserialization
  • + Semantically close to Scalding
  • − Older API, less performant than Dataframe
  • Default choice for the conversion
15. Approach
• Rewrite the application manually
• Reuse most of the Cascading/Scalding library code
  • However, avoid Cascading-specific structures
• Automatic tooling to help with result validation and performance tuning
16. Translate Cascading
• DAG is usually simple
• Most Cascading pipes have a one-to-one mapping to a Spark transformation

Simulating Cascading's HashJoin with a broadcast variable:

// val processedInput: RDD[(String, Token)]
// val tokenFreq: RDD[(String, Double)]
// Broadcast the small side, then look it up map-side instead of shuffling;
// the Option result gives left-outer-join semantics.
val tokenFreqVar = spark.sparkContext.broadcast(tokenFreq.collectAsMap())
val joined = processedInput.map {
  t => (t._1, (t._2, tokenFreqVar.value.get(t._1)))
}
Cascading Pipe → Spark RDD Operator (Note)
• Each → map-side UDF
• Every → reduce-side UDF
• Merge → union
• CoGroup → join / leftOuterJoin / rightOuterJoin / fullOuterJoin
• GroupBy → groupBy / groupByKey (secondary sort might be needed)
• HashJoin → broadcast join (no native RDD support; simulate via a broadcast variable, as above)
• Complexity is in the UDFs
17. UDF Translation: Cascading UDF vs Spark
• Semantic difference
  • Cascading: one UDF does both filtering and transformation (Java)
  • Spark: separate map + filter (Scala)
• Multi-threading
  • Cascading: single-thread model
  • Spark: multi-thread model; worst case, set executor-cores=1
• UDF initialization and cleanup
  • Cascading: class with initialization and cleanup hooks
  • Spark: no init/cleanup hook; use mapPartitions to simulate (fragment below, fuller sketch after it):
.mapPartitions { iter =>
  // init block: expensive one-time setup per partition
  val results = scala.collection.mutable.ArrayBuffer.empty[Result]
  while (iter.hasNext) {
    val event = iter.next()
    results += process(event)
  }
  // cleanup block: runs after the whole partition has been processed
  results.iterator
}
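A self-contained version of the same pattern; the Dictionary resource and its lookup logic are hypothetical stand-ins for whatever a Cascading UDF would open in its initialization hook:

import org.apache.spark.{SparkConf, SparkContext}

object UdfLifecycleSketch {
  // Hypothetical stand-in for an expensive per-partition resource.
  class Dictionary {
    def lookup(token: String): String = token.toUpperCase // placeholder logic
    def close(): Unit = ()                                // placeholder cleanup
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("udf-lifecycle"))
    val tokens = sc.parallelize(Seq("pin", "board", "feed"))

    val resolved = tokens.mapPartitions { iter =>
      val dict = new Dictionary()            // init once per partition
      val out = iter.map(dict.lookup).toList // materialize before cleanup
      dict.close()                           // cleanup once per partition
      out.iterator
    }

    resolved.collect().foreach(println)
    sc.stop()
  }
}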
18. Translate Scalding
• Most operators have a one-to-one mapping to an RDD operator (see the sketch after the table)
• UDFs can be used in Spark without change

Scalding Operator → Spark RDD Operator (Note)
• map → map
• flatMap → flatMap
• filter → filter
• filterNot → filter (Spark has no filterNot; use filter with the negated condition)
• groupBy → groupBy
• group → groupByKey
• groupAll → groupBy(t => 1)
• ...
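A minimal sketch of a few of these mappings on the Spark side; the input/output paths and key function are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object ScaldingToSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scalding-to-spark"))
    val lines = sc.textFile("events") // hypothetical input

    // Scalding filterNot(_.isEmpty) -> filter with the negated condition
    val nonEmpty = lines.filter(line => !line.isEmpty)

    // Scalding groupBy(_.take(1)) -> the same key function in Spark
    val byPrefix = nonEmpty.groupBy(line => line.take(1))

    // Scalding groupAll -> groupBy(t => 1): a single global group
    val all = nonEmpty.groupBy(_ => 1)
    println(all.count()) // 1

    byPrefix.mapValues(_.size).saveAsTextFile("output") // hypothetical output
    sc.stop()
  }
}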
19. Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
20. Secondary Sort
• Use repartitionAndSortWithinPartitions in Spark
• There is a gap in semantics: Cascading hands the UDF one iterator per group key, while Spark yields one flat sorted stream; use GroupSortedIterator to fill the gap (a sketch of the Spark-side setup follows the example)
output = new GroupBy(output, new Fields("user_id"), new Fields("sec_key"));
//                            group key              sort key

Input:
(2, 2), "apple"
(1, 3), "facebook"
(1, 1), "pinterest"
(1, 2), "twitter"
(3, 2), "google"

Cascading (one iterator per group key):
key 1: (1, 1), "pinterest"; (1, 2), "twitter"; (1, 3), "facebook"
key 2: (2, 2), "apple"
key 3: (3, 2), "google"

Spark (one flat sorted stream):
(1, 1), "pinterest"
(1, 2), "twitter"
(1, 3), "facebook"
(2, 2), "apple"
(3, 2), "google"
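A sketch of the Spark-side setup; GroupSortedIterator is Pinterest-internal, so only the partition-by-group-key, sort-by-full-key arrangement it relies on is shown, and the partitioner class name is hypothetical:

import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

// Partition by the group key (user_id) only, so every record for a user lands
// in the same partition; the sort then orders by the full (user_id, sec_key) pair.
class GroupKeyPartitioner(partitions: Int) extends Partitioner {
  private val hash = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (userId, _) => hash.getPartition(userId)
  }
}

object SecondarySortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("secondary-sort"))
    val input = sc.parallelize(Seq(
      ((2, 2), "apple"), ((1, 3), "facebook"), ((1, 1), "pinterest"),
      ((1, 2), "twitter"), ((3, 2), "google")))

    // Tuple ordering sorts by user_id, then sec_key; consecutive records with
    // the same user_id then form one Cascading-style group.
    val sorted = input.repartitionAndSortWithinPartitions(new GroupKeyPartitioner(4))
    sorted.collect().foreach(println)
    sc.stop()
  }
}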
21. Accumulators
• Spark accumulators are not accurate
  • Stage retry
  • The same code runs multiple times in different stages
• Solutions
  • Deduplicate with stage + partition (sketched after the code below)
  • persist
val sc = new SparkContext(conf)
val inputRecords = sc.longAccumulator("Input")
val a = sc.textFile("studenttab10k")
val b = a.map(line => line.split("\t"))
val c = b.map { t =>
  inputRecords.add(1L)
  (t(0), t(1).toInt, t(2).toDouble)
}
// sum() runs a first job that executes c's map and counts every record
val sumScore = c.map(t => t._3).sum()
// c.persist()  // uncommenting prevents the recount in the job below
c.map { t =>
  (t._1, t._3 / sumScore)
}.saveAsTextFile("output") // second job re-runs c's map, doubling the count
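A hedged sketch of the stage + partition deduplication idea (not Pinterest's actual implementation): key each task's count by (stageId, partitionId), so a re-executed partition overwrites its earlier contribution instead of adding to it.

import org.apache.spark.TaskContext
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

class DedupLongAccumulator extends AccumulatorV2[Long, Long] {
  // Per-(stageId, partitionId) counts; a re-execution replaces, never adds.
  private val perTask = mutable.Map.empty[(Int, Int), Long]

  override def isZero: Boolean = perTask.isEmpty
  override def copy(): DedupLongAccumulator = {
    val acc = new DedupLongAccumulator
    acc.perTask ++= perTask
    acc
  }
  override def reset(): Unit = perTask.clear()
  override def add(v: Long): Unit = {
    val ctx = TaskContext.get() // non-null inside a running task
    val key = (ctx.stageId(), ctx.partitionId())
    perTask(key) = perTask.getOrElse(key, 0L) + v
  }
  override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: DedupLongAccumulator => perTask ++= o.perTask // overwrite on retry
    case _ => throw new UnsupportedOperationException("incompatible accumulator")
  }
  override def value: Long = perTask.values.sum
}

// Usage: val acc = new DedupLongAccumulator; sc.register(acc, "Input")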
22. Accumulators, Continued
• Retrieve the accumulator value from the earliest stage
• Exception: the user intentionally uses the same accumulator in different stages

Example of double counting across stages:
NUM_OUTPUT_TOKENS
Stage 14: 168006868318
Stage 21: 336013736636
val sc = new SparkContext(conf)
val inputRecords = sc.longAccumulator("Input")
val input1 = sc.textFile("input1")
val input1_processed = input1.map(_.split("\t")).map { t =>
  inputRecords.add(1L)
  (t(0), (t(1).toInt, t(2).toDouble))
}
val input2 = sc.textFile("input2")
val input2_processed = input2.map(_.split("\t")).map { t =>
  inputRecords.add(1L)
  (t(0), (t(1).toInt, t(2).toDouble))
}
// Both map stages add to the same accumulator before the join.
input1_processed.join(input2_processed)
  .saveAsTextFile("output")
24. Profiling
• Visualize flame graphs using Nebula
• Realtime
• Ability to segment by stage/task
• Focus on only the useful threads
25. OutputCommitter
• Issues with the default OutputCommitter on S3
  • Slow metadata operations
  • 503 errors
• Netflix s3committer
  • s3committer only supports the old MapReduce API
  • So we built a wrapper for Spark RDD output (see the sketch below)
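A sketch of wiring a custom committer into RDD output through the old mapred API; the class name comes from the open-source Netflix s3committer repo, and both it and the exact wiring should be treated as assumptions to verify against your Spark/Hadoop versions:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // spark.hadoop.* entries are copied into the Hadoop Configuration;
  // mapred.output.committer.class selects the committer for the old-API
  // output paths used by saveAsHadoopFile and similar calls.
  .set("spark.hadoop.mapred.output.committer.class",
       "com.netflix.bdp.s3.S3DirectoryOutputCommitter")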
26. Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
30. Performance Tuning
• Collect runtime memory/vcore usage
• Retry with tuned memory/vcore if necessary
• Tuning passes if the criteria are met (a sketch of the check follows):
  • Runtime reduced
  • Vcore-sec reduced by 20%+
  • Memory increase less than 100%
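A minimal sketch of that acceptance check; the metric names are hypothetical:

// Aggregate resource metrics for one run of a job.
case class RunMetrics(runtimeSec: Long, vcoreSec: Long, memoryMbSec: Long)

def tuningPassed(before: RunMetrics, after: RunMetrics): Boolean =
  after.runtimeSec < before.runtimeSec &&       // runtime reduced
  after.vcoreSec <= before.vcoreSec * 0.8 &&    // vcore-sec reduced 20%+
  after.memoryMbSec < before.memoryMbSec * 2.0  // memory increase less than 100%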
31. Balancing Performance
• Trade-offs
  • More executors: better performance, but costs more
  • More cores per executor: saves memory, but costs more CPU
• Using dynamic allocation usually saves cost
  • Skew won't cost more with dynamic allocation
• Control parallelism (see the sketch below)
  • spark.default.parallelism for RDD
  • spark.sql.shuffle.partitions for Dataframe/Dataset/SparkSQL
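A sketch of those knobs set together; the values are hypothetical and workload-dependent:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
  .set("spark.default.parallelism", "2000")     // parallelism for RDD operations
  .set("spark.sql.shuffle.partitions", "2000")  // shuffle partitions for Dataframe/Dataset/SparkSQL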
32. Automatic Migration & Failure Handling
• Automatic migration
  • Automatically pick Spark over Cascading/Scalding at runtime if the conditions are met:
    • Data validation passes
    • Performance optimization passes
• Failure handling
  • Automatically handle failures with handlers if applicable:
    • Configuration incorrectness
    • OutOfMemory
    • ...
  • Manual troubleshooting is needed for other uncaught failures
33. Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan