Dali-Spark: Apache Spark Data Access at LinkedIn to Achieve Data Agility

At LinkedIn, changes to our complex data ecosystem can place steep costs on application developers. We have designed Dali to insulate developers from physical attributes of data such as storage format and location. At its core, Dali provides a catalog to define and evolve physical and virtual datasets, a dataset reader that allows applications to read datasets in different environments, and a collection of development tools to manage these datasets. In this talk, we will cover:
* What Dali is and how it simplifies a complex data ecosystem
* Dali as a unified data access layer at LinkedIn
* DaliSpark architecture
* Roadmap, including plans to open source Dali
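
The insulation from storage format described above can be illustrated with a union view. The following is a minimal sketch, not Dali itself: Python's built-in `sqlite3` stands in for the catalog and reader, the names `table1_avro`, `table1_orc`, `table1`, `read_table1` and the columns are hypothetical, and two SQLite tables only imitate one dataset stored in two formats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two physical tables standing in for one dataset kept in two storage formats
# (hypothetical names and columns, for illustration only).
conn.execute("CREATE TABLE table1_avro (member_id INTEGER, event TEXT)")
conn.execute("CREATE TABLE table1_orc (member_id INTEGER, event TEXT)")
conn.executemany("INSERT INTO table1_avro VALUES (?, ?)", [(1, "view"), (2, "click")])
conn.executemany("INSERT INTO table1_orc VALUES (?, ?)", [(3, "view")])

# Readers only ever see the view, never the physical tables behind it.
conn.execute(
    "CREATE VIEW table1 AS "
    "SELECT member_id, event FROM table1_avro "
    "UNION ALL "
    "SELECT member_id, event FROM table1_orc"
)


def read_table1(connection):
    """Reader code: depends on the view name and schema only."""
    query = "SELECT member_id, event FROM table1 ORDER BY member_id"
    return connection.execute(query).fetchall()


rows_before = read_table1(conn)

# Simulate finishing the migration to the newer format.
conn.execute("INSERT INTO table1_orc SELECT member_id, event FROM table1_avro")
conn.execute("DELETE FROM table1_avro")

rows_after = read_table1(conn)
```

In this sketch `read_table1` returns the same rows before and after the migration, which is the property the view is meant to provide: the physical layout changes while the reader code stays untouched.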

Published in: Software

  1. Data Agility with DaliSpark. Adwait Tumbde, LinkedIn
  2. “All problems in computer science can be solved by another layer of indirection” (David Wheeler)
  3. Make data scientists more productive. Data Agility
  4. Data Agility: Dali, Tables + Views, Storage
  5. • Why views? • What problems does it solve? • Technology (Data Agility)
  6. • Why views? • What problems does it solve? • Technology (Data Agility)
  7. View of History: Navigational DBMS (programmers only!); MapReduce (programmers only!); SQL: anyone can query
  8. “MapReduce: A Major Step Backwards” (Michael Stonebraker)
  9. The Good Part: • Scale • Best engine for the job!
  10. A “De-constructed” Database: Tables; Queries & Optimizations; Catalog
  11. Storage; Physical Schema; Conceptual Schema; External Schema; External Schema; External Schema. Views: Public Interface. Abstract physical details
  12. Storage; Physical Schema; Conceptual Schema; External Schema; External Schema; External Schema. Code: Dataset As A Service
  13. • Why views? • What problems does it solve? • Technology (Data Agility)
  14. Challenges for Data Consumers: Data models are in constant flux. Data cleaning hurts productivity! Have to change data processing logic everywhere! (My Raw Data)
  15. Challenge for Infrastructure Providers: HARD TO CHANGE ANYTHING UNDERNEATH! Dependencies on path and hard-coded format. Hard to move to better formats without breaking everyone or copying data twice. (My Raw Data)
  16. Storage; DB; Format; Tables; Engines. Change is the Only Constant!
  17. Data Format Evolution: Table1 (Avro); DaliSpark Reader; HDFS
  18. Data Format Evolution: Table1 (Union view); Table1_Avro; Table1_ORC; DaliSpark Reader; HDFS
  19. Change Storage: Table1 (Union view); Table1_Avro; Table1_ORC; DaliSpark Reader; Blobstore
  20. One View for All: Table1 (Union view); Table1_Avro; Table1_ORC; DaliSpark Reader; Blobstore
  21. Push Down Logic: Mobile Events; Desktop Events; DaliSpark Reader; Union View, Filter Events
  22. Views as Code Sharing: Mobile Events; Desktop Events; DaliSpark Reader; Union View, Filter Events; Flatten Nested Data; DaliSpark Reader
  23. • Why views? • What problems does it solve? • Technology (Data Agility)
  24. Sample Dali Dataset: CREATE VIEW profile_flattened TBLPROPERTIES ('functions' = 'get_profile_section:isb.GetProfileSections', 'dependencies' = 'com.linkedin.dali-udfs:get-profile-sections:0.0.5') AS SELECT get_profile_section(...) FROM prod_identity.profile;
  25. SQL + UDFs: Logical Independence
  26. SQL + UDFs: Logical Independence. Relational Algebra Intermediate Representation (query tree: σ(R ⋈ S) ⋈ T)
  27. Transport UDFs: Single UDF for All Engines
  28. • Portability • Schema Evolution • Privacy compliance (GDPR) • Materialized Views (query tree: σ(R ⋈ S) ⋈ T)
  29. Future: Materialized Views • Intelligent Materialization • Run-time query rewrite to use materialized views. Declarative data processing pipelines
  30. Dali Views: Insulate applications from storage and structure of data • Public API - private implementation • Enable evolution • Focus on business logic
  31. Contributors
  32. Thank you
  33. Backup
  34. Tables, Views, and Dali Views. Table: Path, Format, Schema, Partitioning Scheme, … View: f(table(s) | view(s)). Dali View: f(table(s) | view(s)), UDFs, Dependencies
  35. Transport UDFs: Cross-platform UDF API. Simple: only what is needed to define logic; no boilerplate code. Feature-rich: declarative type signatures with generics; nullable arguments; API-level HDFS support; high-level user-friendly data types. Translatable: can run on multiple platforms; code specific to platform is auto-generated. Performant: direct access to native platform data.
  36. Transport UDF Code Generation: User-defined Transport UDF; Transport Gradle Plugin (Code Analysis – Metadata Generation); Autogenerated Engine Wrappers (Presto, Hive, Spark, …); Autogenerated UDF JARs (Presto, Hive, Spark, …)
  37. DaliSpark: DataFrame API. Dali Catalog; DaliSpark.createDataFrame(…); query tree: σ(R ⋈ S) ⋈ T. Steps: 1. registerBaseTables(R, S, T) 2. registerUDFs(…) 3. generateSparkSQL(…) 4. spark.sql(sqlStmt)
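
The four steps listed on the last slide can be walked through end to end in a toy form. This is only a sketch under stated assumptions, not DaliSpark's implementation: `sqlite3` stands in for Spark SQL, and the `CATALOG` dictionary, the view name `rs_with_team`, the `shout` UDF, the sample rows and every helper function are hypothetical. The helpers merely borrow their names from the steps on the slide.

```python
import sqlite3

# Hypothetical catalog entry: a view is SQL text plus the UDFs it depends on.
# The SQL has the shape of the slide's query tree: select over (R join S), then join T.
CATALOG = {
    "rs_with_team": {
        "sql": (
            "SELECT rs.id, shout(rs.name), t.team "
            "FROM (SELECT r.id AS id, s.name AS name "
            "      FROM r JOIN s ON r.id = s.id "
            "      WHERE s.name <> 'bob') AS rs "
            "JOIN t ON rs.id = t.id "
            "ORDER BY rs.id"
        ),
        "udfs": {"shout": lambda text: text.upper()},
    }
}


def register_base_tables(conn):
    """Step 1: make the base tables R, S and T visible to the engine."""
    conn.execute("CREATE TABLE r (id INTEGER)")
    conn.execute("CREATE TABLE s (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE t (id INTEGER, team TEXT)")
    conn.executemany("INSERT INTO r VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO s VALUES (?, ?)", [(1, "ada"), (2, "bob"), (3, "cy")])
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "eng"), (3, "ops")])


def register_udfs(conn, udfs):
    """Step 2: register each one-argument UDF the view depends on."""
    for name, func in udfs.items():
        conn.create_function(name, 1, func)


def generate_sql(view_name):
    """Step 3: turn the catalog definition into SQL for the target engine."""
    return CATALOG[view_name]["sql"]


def create_rows(view_name):
    """Step 4: run the generated statement and hand the result to the caller."""
    conn = sqlite3.connect(":memory:")
    register_base_tables(conn)
    register_udfs(conn, CATALOG[view_name]["udfs"])
    sql_stmt = generate_sql(view_name)
    return conn.execute(sql_stmt).fetchall()


result = create_rows("rs_with_team")
```

The caller asks for a view by name and never sees the base tables, the UDF registration, or the generated SQL, which mirrors the role the slide gives to `DaliSpark.createDataFrame(…)`.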
