
Fabian Hueske – Cascading on Flink


Presented at Flink Forward 2015

Published in: Technology


  1. Cascading on Flink
     Fabian Hueske, @fhueske
  2. What is Cascading?
     “Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
     - Java API for large-scale batch processing
     - Programs are specified as data flows (see the sketch after this slide)
       • pipes, taps, flow, cascade, …
       • each, groupBy, every, coGroup, merge, …
     - Open Source (Apache License 2.0)
       • Developed by Concurrent
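Not part of the original slides: a minimal sketch of such a pipe assembly, loosely following the word-count step of the "Cascading for the Impatient" tutorial. The class name, the input/output paths, and the assumption of a tab-delimited input with a header and a "text" column are placeholders.

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.flow.FlowDef;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountFlow {

      // Builds and runs a simple word-count flow: tap -> split -> groupBy -> count -> tap.
      public static void run(FlowConnector connector, String docPath, String wcPath) {
        Fields token = new Fields("token");
        Fields text = new Fields("text");

        // 'Each' applies a function to every tuple: split the "text" field into tokens.
        Pipe docPipe = new Each("token", text,
            new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]"), Fields.RESULTS);

        // 'GroupBy' groups by token; 'Every' applies an aggregator per group.
        Pipe wcPipe = new GroupBy(new Pipe("wc", docPipe), token);
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

        // Taps are the sources and sinks of the flow (tab-delimited text with a header).
        Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

        FlowDef flowDef = FlowDef.flowDef().setName("wc")
            .addSource(docPipe, docTap)
            .addTailSink(wcPipe, wcTap);

        Flow wcFlow = connector.connect(flowDef);
        wcFlow.complete();
      }
    }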
  3. Cascading on MapReduce
     - Originally built for Hadoop MapReduce
     - Much better API than MapReduce
       • DAG programming model
       • Higher-level operators (join, coGroup, merge)
       • Composable and reusable code
     - Automatic translation to MapReduce jobs
       • Minimizes the number of MapReduce jobs
     - Rock-solid execution due to Hadoop MapReduce
  4. Cascading Example
     - Compute TF-IDF scores for a set of documents
       • TF-IDF: Term Frequency / Inverse Document Frequency
       • Used for weighting the relevance of terms in search engines
     - Building this against the MapReduce API is painful!
     Example taken from docs.cascading.org/impatient
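For reference (not spelled out on the slide), the weight computed in the example is the standard TF-IDF score, where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents:

    tfidf(t, d) = tf(t, d) × log(N / df(t))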
  5. Who uses Cascading?
     - Runs in many production environments
       • Twitter, Soundcloud, Etsy, Airbnb, …
     - More APIs have been put on top
       • Scalding (Scala) by Twitter
       • Cascalog (Datalog)
       • Lingual (SQL)
       • Fluent (fluent Java API)
  6. Cascading 3.0
     - Released in June 2015
     - A new planner
       • Execution backend can be changed
     - Apache Tez executor
       • Cascading programs are compiled to Tez jobs
       • No identity mappers
       • No writing to HDFS between jobs
  7. Cascading on Flink
  8. Why Cascading on Flink?
     - Flink’s unique batch processing runtime
       • Pipelined data exchange
       • Actively managed memory on- and off-heap
       • Efficient in-memory and out-of-core operators
       • Sorting and hashing on binary data
       • No tuning needed for robust operation (OOME, GC)
     - YARN integration
  9. Cascading on Flink Released
     - Available on Github
       • Apache License 2.0
     - Depends on
       • Cascading 3.1 WIP
       • Flink 0.10-SNAPSHOT
       • Will be pinned to the next releases of Cascading and Flink
     - Check Github for details: http://github.com/dataartisans/cascading-flink
  10. Translation Details
  11. Flow Translation
     - Implemented on top of the Java DataSet API
     - Uses Cascading’s rule-based planner
       • A flow is compiled into a single Flink job
       • The operators of a job are partitioned into nodes
       • Chaining of operators
     - Translation rules partition the flow
       • where data is shuffled
       • where data is processed by Flink’s internal operators
       • where flows branch or merge
       • at sources and sinks
  12. Operator Translation
     - Cascading operators have a fixed execution strategy
       • No degree of freedom for Flink’s optimizer
       • Strategies are fixed using hints for Flink’s optimizer (see the sketch below)

     Cascading operator            Flink operator(s)
     (n-ary) GroupBy               (Union +) Reduce
     (n-ary) CoGroup (BufferJoin)  (Union +) Reduce
     (n-ary) CoGroup               (Sequence of) binary hash-partitioned, sorted OuterJoins
     (n-ary) HashJoin              (Sequence of) binary broadcast HashJoins
     (n-ary) Merge                 n-ary Union
     Tap                           Source or Sink
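Not from the slides: a small, hypothetical DataSet API sketch (class name and data are made up) of what pinning an execution strategy with an optimizer hint looks like. cascading-flink does this internally; the snippet only illustrates the mechanism the slide refers to.

    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class JoinHintSketch {

      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> left =
            env.fromElements(new Tuple2<>("a", 1), new Tuple2<>("b", 2));
        DataSet<Tuple2<String, Integer>> right =
            env.fromElements(new Tuple2<>("a", 10), new Tuple2<>("c", 30));

        // A Cascading HashJoin maps to a broadcast hash join; the strategy is
        // pinned with a JoinHint instead of being chosen by Flink's optimizer.
        left.join(right, JoinHint.BROADCAST_HASH_SECOND)
            .where(0).equalTo(0)
            .print();

        // A Cascading CoGroup maps to sort-merge-based outer joins; again the
        // strategy is fixed via a hint.
        left.fullOuterJoin(right, JoinHint.REPARTITION_SORT_MERGE)
            .where(0).equalTo(0)
            .with(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
              @Override
              public String join(Tuple2<String, Integer> l, Tuple2<String, Integer> r) {
                // either side may be null in a full outer join
                return (l == null ? "-" : l.f0) + " | " + (r == null ? "-" : r.f0);
              }
            })
            .print();
      }
    }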
  13. Serializers & Comparators
     - Flink needs information about all processed data types
       • Generation of serializers and comparators
     - Cascading supports
       • Schema-less tuples (no length, no types)
       • Definition of key fields by name and (relative) position
       • Null values in fields and key fields
     - Custom type information for Cascading tuples
       • Native serializers & comparators for fields with known types
       • Kryo for unknown field types
       • Support for null values by wrapping serializers & comparators (sketch below)
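Not the actual cascading-flink classes: a conceptual sketch of the null-wrapping idea, with hypothetical FieldSerializer and NullableFieldSerializer names. A marker is written before delegating to the wrapped, type-specific serializer, so null fields survive serialization.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // Hypothetical interface standing in for a type-specific field serializer.
    interface FieldSerializer<T> {
      void write(T value, DataOutput out) throws IOException;
      T read(DataInput in) throws IOException;
    }

    // Adds null support by wrapping another serializer.
    class NullableFieldSerializer<T> implements FieldSerializer<T> {
      private final FieldSerializer<T> wrapped;

      NullableFieldSerializer(FieldSerializer<T> wrapped) {
        this.wrapped = wrapped;
      }

      @Override
      public void write(T value, DataOutput out) throws IOException {
        if (value == null) {
          out.writeBoolean(false);   // null marker, no payload follows
        } else {
          out.writeBoolean(true);
          wrapped.write(value, out); // delegate to the type-specific serializer
        }
      }

      @Override
      public T read(DataInput in) throws IOException {
        return in.readBoolean() ? wrapped.read(in) : null;
      }
    }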
  14. Going Out-of-Core
     - Join and CoGroup must hold data in memory
       • If the data exceeds memory, we need to go to disk
     - Cascading on MR uses spillable collections
       • Spill to disk if #elements > threshold
       • Part of Cascading (not MapReduce)
       • Threshold is either too low or too high
     - Cascading on Flink uses Flink’s Join and OuterJoin
       • Part of Flink (not Cascading)
       • Backed by Flink’s managed memory
       • Transparently spill to disk if necessary
  15. Running Cascading on Flink
  16. How to Run Cascading on Flink
     - Add the cascading-flink Maven dependency to your Cascading project
       • Available in the Sonatype Nexus repository
       • Or build it from source (Github)
     - Change just one line of code in your Cascading program (see the sketch below)
       • Replace Hadoop2MR1FlowConnector with FlinkConnector
       • Do not change any application logic
     - Execute the Cascading program as a regular Flink program
     - Detailed instructions on Github
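A minimal sketch of the one-line change, assuming the cascading-flink dependency is already on the classpath. The FlinkConnector package path and constructor signature are assumptions from memory; check the project README for the exact coordinates.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.flow.FlowDef;
    // Assumed package of the cascading-flink connector; verify against the README.
    import com.dataartisans.flink.cascading.FlinkConnector;

    public class RunOnFlink {

      // Runs an existing Cascading assembly (flowDef) on Flink.
      public static void run(FlowDef flowDef) {
        Properties properties = new Properties();

        // Before: FlowConnector connector = new Hadoop2MR1FlowConnector(properties);
        // After: the single line that changes.
        FlowConnector connector = new FlinkConnector(properties);

        Flow flow = connector.connect(flowDef);
        flow.complete();  // the rest of the Cascading program is unchanged
      }
    }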
  17. (Preliminary!) Performance Evaluation
     - 8 worker nodes
       • 8 CPUs, 30 GB RAM, 2 local SSDs each
     - Hadoop 2.7.1 (YARN, HDFS, MapReduce)
     - Flink 0.10-SNAPSHOT
     - 80 GB of generated text data
  18. Baseline: WordCount
     [Bar chart: execution time in minutes (0–16) for MapReduce native, Flink native, Cascading on MR, and Cascading on Flink]
     - Cascading on MR compiles to 1 MR job
       • Similar execution strategy (hash-partition, sort)
       • No significant speed gain expected
     - Verifies our implementation
     - Hash aggregators!
  19. Something more complex: TF-IDF
     - Taken from “Cascading for the Impatient”
       • 2 CoGroups, 7 GroupBys, 1 HashJoin
     - http://docs.cascading.org/impatient
  20. TF-IDF on MapReduce
     - Cascading on MapReduce translates the TF-IDF program into 9 MapReduce jobs
     - Each job
       • Reads data from HDFS
       • Applies a Map function
       • Shuffles the data over the network
       • Sorts the data
       • Applies a Reduce function
       • Writes the data to HDFS
  21. TF-IDF on Flink
     - Cascading on Flink translates the TF-IDF job into one Flink job
  22. TF-IDF on Flink
     - Shuffle is pipelined
     - Intermediate results are not written to or read from HDFS
     [Bar chart: execution time in minutes (0–600) for Cascading on MR vs. Cascading on Flink]
  23. Conclusion
     - Executing Cascading jobs on Apache Flink
       • Improves runtime
       • Reduces parameter tuning and avoids failures
       • Virtually no code changes
     - Apache Flink’s runtime is very versatile
       • Apache Hadoop MR
       • Apache Storm
       • Google Dataflow
       • Apache Samoa (incubating)
       • + Flink’s own APIs and libraries…
