Dynamics Talks: "Writing Spark Pipelines with Less Boilerplate Code" - Egor Pakhomov, AirBnB

Apache Spark is a general-purpose big data execution engine. You can work with different data sources through the same set of APIs in both batch and streaming mode. Such flexibility is great if you are an experienced Spark developer solving a complicated data engineering problem, which might include ML or streaming. At Airbnb, 95% of all data pipelines are daily batch jobs that read from Hive tables and write to Hive tables. For such jobs, you would like to trade some flexibility for more extensive functionality around writing to Hive and orchestrating multi-day processing. Another advantage of reducing flexibility is the ability to establish "best practices" that can be followed by less experienced data engineers. At Airbnb, we've created a framework called "Sputnik" that tries to address these issues. In this talk, I'll show the typical boilerplate code that Sputnik reduces and the concepts it introduces to simplify pipeline development.

  1. Writing Spark pipelines with less boilerplate code. EGOR PAKHOMOV • AIRBNB
  2. Typical Spark job
  3. Typical Spark job
  4. Typical Spark job
  5. https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png Too much flexibility! Our case
  6. What a data engineer should only write
  7. Job logic vs Run logic. Job logic: • The job does some business logic (for example, multiply every value from the input table by 2) • The job specifies: source tables and result tables, partitioning schema, validations for the result data. Run logic (examples): • Running the job for a specific date retrieves only that date's input from the input table • The job tries to write to a table that does not exist yet, so we need to create the table • The job runs in testing mode, so all result tables are created with a "_testing" suffix. (A plain-Spark sketch of the job-logic half follows after this list.)
  8. Job logic vs Run logic
  9. Sputnik job
  10. Running the Sputnik job
  11. Writing data in Sputnik
  12. Writing data in Sputnik. SputnikHiveTableWriter: • creates the table with a "CREATE TABLE" Hive statement if the table does not yet exist • updates table metainformation • manages the result table name (staging/testing mode) • normalizes the dataframe schema according to the result Hive table • repartitions and tries to reduce the number of result files on disk • runs the checks on the result before saving it • etc. (A sketch of part of this write boilerplate follows after this list.)
  13. Reading data in Sputnik: reading a dataframe vs. reading a dataset (a generic sketch of both read styles follows after this list)
  14. Testing in Sputnik
  15. Testing in Sputnik
  16. Testing in Sputnik
  17. Testing in Sputnik
  18. Testing in Sputnik
  19. Configs in Sputnik (the job and its config)
  20. Checks on result in Sputnik
  21. Checks on result in Sputnik
  22. Checks on result in Sputnik
  23. Backfilling in Sputnik. Hive table partitions 2019-01-01 through 2019-01-08; Daily job --ds 2019-01-07; Daily job --ds 2019-01-08
  24. Backfilling in Sputnik. Hive table partitions 2019-01-01 through 2019-01-08; Daily job --ds 2019-01-07; Daily job --ds 2019-01-08; Backfill job --startDate 2019-01-01 --endDate 2019-01-06
  25. Backfilling in Sputnik. Hive table partitions 2019-01-01 through 2019-01-06; Backfill job --startDate 2019-01-01 --endDate 2019-01-06 --stepSize 3 (the stepping arithmetic is sketched after this list)
  26. Environments in Sputnik. Production Sputnik reads some_database.some_input_table and writes some_database.result_table
  27. Environments in Sputnik. Production Sputnik reads some_database.some_input_table and writes some_database.result_table; Testing Sputnik reads some_database.some_input_table and writes some_database.result_table_dev
  28. Environments in Sputnik. Testing Sputnik reads some_database.some_input_table and writes some_database.result_table_dev; --writeEnv PROD -> some_database.result_table; --writeEnv STAGE -> some_database.result_table_staging; --writeEnv DEV -> some_database.result_table_dev (the name mapping is sketched after this list)
  29. Flags
  30. https://github.com/airbnb/sputnik pahomov.egor@gmail.com egor.pakhomov@airbnb.com
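
Slide 7 separates job logic (the transformation plus the declaration of source/result tables, partitioning, and validations) from run logic (date filtering, table creation, test-mode table names) that the framework should own. A minimal sketch of the job-logic half in plain Spark, with illustrative table and column names; this is not the actual Sputnik API:

    // Hypothetical job logic from slide 7: multiply every value from the input table by 2.
    // The run logic (filtering to --ds, creating the table, "_testing" suffixes) is assumed
    // to live in the framework, not here.
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    object MultiplyByTwoJobLogic {
      val inputTable  = "some_database.some_input_table"  // source table (slide 26)
      val resultTable = "some_database.result_table"      // result table (slide 26)

      def transform(input: DataFrame): DataFrame =
        input.withColumn("value", col("value") * 2)       // the actual business logic
    }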
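
Slide 12 lists what SputnikHiveTableWriter takes care of. One piece of that boilerplate, sketched in plain Spark under the assumption that the target Hive table already exists: align the dataframe to the table's column order and cut down the number of result files before the insert. Illustrative only, not the Sputnik implementation:

    // Sketch of write-side boilerplate hidden by SputnikHiveTableWriter (slide 12):
    // normalize the dataframe schema to the existing result table and reduce the
    // number of files written to disk.
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    object HiveWriteBoilerplate {
      def writeNormalized(spark: SparkSession, df: DataFrame, table: String, numFiles: Int): Unit = {
        val targetColumns = spark.table(table).columns            // column order of the Hive table
        val normalized    = df.select(targetColumns.map(col): _*) // select/reorder to match it
        normalized
          .coalesce(numFiles)          // fewer, larger result files on disk
          .write
          .mode("overwrite")
          .insertInto(table)           // assumes the table already exists
      }
    }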
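
Slide 13 contrasts reading a dataframe with reading a typed dataset; the code on the slide is not part of this transcript, so here is a generic Spark sketch of the two styles (not the Sputnik reader API), with illustrative table and field names:

    // Generic Spark versions of the two read styles named on slide 13.
    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    case class InputRow(id: Long, value: Double, ds: String)     // illustrative schema

    object ReadingStyles {
      // Untyped: work with Rows and column names.
      def readDataFrame(spark: SparkSession, ds: String): DataFrame =
        spark.table("some_database.some_input_table").where(s"ds = '$ds'")

      // Typed: the same rows as a Dataset of a case class.
      def readDataset(spark: SparkSession, ds: String): Dataset[InputRow] = {
        import spark.implicits._
        readDataFrame(spark, ds).as[InputRow]
      }
    }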
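
Slides 23-25 show a daily run processing a single --ds partition and a backfill covering --startDate through --endDate, optionally split by --stepSize. Assuming --stepSize means "days per backfill chunk", as slide 25 suggests, the date arithmetic looks roughly like this:

    // Split an inclusive backfill range into chunks of stepSize days (slides 23-25).
    import java.time.LocalDate
    import java.time.temporal.ChronoUnit

    object BackfillSteps {
      def chunks(start: LocalDate, end: LocalDate, stepSize: Int): Seq[(LocalDate, LocalDate)] = {
        val days = ChronoUnit.DAYS.between(start, end).toInt + 1   // inclusive day count
        (0 until days by stepSize).map { offset =>
          val from = start.plusDays(offset)
          val to   = start.plusDays(math.min(offset + stepSize - 1, days - 1))
          (from, to)
        }
      }
    }

    // chunks(LocalDate.parse("2019-01-01"), LocalDate.parse("2019-01-06"), 3)
    //   == Seq((2019-01-01, 2019-01-03), (2019-01-04, 2019-01-06)), i.e. two runs of 3 days each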
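
Slides 26-28 show that the --writeEnv flag decides which physical result table a job writes to: PROD keeps the name as-is, STAGE appends "_staging", DEV appends "_dev". A sketch of that mapping; the type and method names are illustrative, not the actual Sputnik API:

    // Resolve the physical result table name from the logical name and --writeEnv (slides 26-28).
    object WriteEnv extends Enumeration {
      val PROD, STAGE, DEV = Value
    }

    object ResultTables {
      def resolve(logicalName: String, env: WriteEnv.Value): String = env match {
        case WriteEnv.PROD  => logicalName                 // some_database.result_table
        case WriteEnv.STAGE => s"${logicalName}_staging"   // some_database.result_table_staging
        case WriteEnv.DEV   => s"${logicalName}_dev"       // some_database.result_table_dev
      }
    }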
