At LinkedIn, changes to our complex data ecosystem can impose steep costs on application developers. We designed Dali to insulate developers from physical attributes of data such as storage format and location. At its core, Dali provides a catalog to define and evolve physical and virtual datasets, a dataset reader that allows applications to read datasets in different environments, and a collection of development tools to manage these datasets. In this talk, we will cover:
* What Dali is and how it simplifies a complex data ecosystem
* Dali as the unified data access layer at LinkedIn
* DaliSpark Architecture
* Roadmap, including plans to open source Dali
13. • Why views?
• What problems do they solve?
• Technology
Data Agility
14. Challenges for Data Consumers
Data models are in constant flux
Data cleaning hurts productivity!
Have to change data processing logic everywhere!
My Raw Data
15. Challenge for Infrastructure Providers
HARD TO CHANGE ANYTHING UNDERNEATH!
Dependencies on path and hard-coded format
Hard to move to better formats without breaking everyone or copying data twice
My Raw Data
28. • Portability
• Schema Evolution
• Privacy compliance (GDPR)
• Materialized Views
[Figure: view query plan, σ(R ⋈ S) ⋈ T]
29. • Intelligent Materialization
• Run-time query rewrite to use materialized views
Declarative data processing pipelines
Future: Materialized Views
30. Dali Views
Insulate applications from storage and structure of data
• Public API - private implementation
• Enable evolution
• Focus on business logic
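A minimal sketch of the "public API, private implementation" idea on this slide. It is not Dali code: SQLite (from the Python standard library) stands in for the catalog and storage, and the table, view, and function names (`raw_members_v1`, `raw_members_v2`, `members`, `member_names`) are hypothetical. The point is only that consumer logic written against the view keeps working when the storage layout underneath changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Private implementation, version 1: a raw table with a single name column.
conn.execute("CREATE TABLE raw_members_v1 (id INTEGER, full_name TEXT)")
conn.execute("INSERT INTO raw_members_v1 VALUES (1, 'Ada Lovelace')")

# Public API: consumers only ever query the view `members`.
conn.execute(
    "CREATE VIEW members AS SELECT id, full_name AS name FROM raw_members_v1"
)


def member_names(connection):
    """Consumer business logic: knows the view, not the storage."""
    rows = connection.execute("SELECT name FROM members ORDER BY id")
    return [row[0] for row in rows]


before = member_names(conn)

# The data owner evolves the storage: the name is now split into two columns.
conn.execute(
    "CREATE TABLE raw_members_v2 (id INTEGER, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO raw_members_v2 VALUES (1, 'Ada', 'Lovelace')")

# Only the view definition changes; member_names() above is untouched.
conn.execute("DROP VIEW members")
conn.execute(
    "CREATE VIEW members AS "
    "SELECT id, first_name || ' ' || last_name AS name FROM raw_members_v2"
)

after = member_names(conn)
```

Here `before` and `after` hold the same result even though the underlying table and its schema changed between the two calls.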
35. Tables, Views, and Dali Views
Table: Path, Format, Schema, Partitioning Scheme, …
View: f(table(s) | view(s))
Dali View: f(table(s) | view(s)), UDFs, Dependencies
36. Transport UDFs: Cross-platform UDF API
• Simple: Only what is needed to define logic. No boilerplate code.
• Feature-rich: Declarative type signatures with generics. Nullable arguments. API-level HDFS support. High-level user-friendly data types.
• Translatable: Can run on multiple platforms. Code specific to platform is auto-generated.
• Performant: Direct access to native platform data.
37. Transport UDF Code Generation
User-defined Transport UDF
Transport Gradle Plugin: Code Analysis – Metadata Generation
Autogenerated Engine Wrappers: Presto, Hive, Spark, …
Autogenerated UDF JARs: Presto, Hive, Spark, …
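A rough sketch of the "write once, wrap per engine" idea behind this slide. It does not use the Transport API: the UDF `add_prefix`, the engine names `engine_a` and `engine_b`, and the helper `make_wrapper` are all hypothetical. It also differs from the framework in two ways. The wrappers are built at run time rather than generated at build time by a Gradle plugin, and values are converted to a common type for brevity, whereas the slides describe direct access to native platform data.

```python
from typing import Optional


def add_prefix(value: Optional[str]) -> Optional[str]:
    """Engine-agnostic UDF logic, written once (hypothetical example)."""
    if value is None:  # nullable argument: null in, null out
        return None
    return "member:" + value


def make_wrapper(udf, to_common, from_common):
    """Build an engine-specific entry point around the shared UDF logic."""
    def wrapper(native_value):
        return from_common(udf(to_common(native_value)))
    return wrapper


def _decode(raw):
    return None if raw is None else raw.decode("utf-8")


def _encode(text):
    return None if text is None else text.encode("utf-8")


# Hypothetical engines: engine_a represents strings as str,
# engine_b represents them as UTF-8 encoded bytes.
ENGINE_ADAPTERS = {
    "engine_a": (lambda v: v, lambda v: v),
    "engine_b": (_decode, _encode),
}

# One wrapper per engine, all sharing the same user-defined logic.
WRAPPERS = {
    name: make_wrapper(add_prefix, to_common, from_common)
    for name, (to_common, from_common) in ENGINE_ADAPTERS.items()
}
```

Calling `WRAPPERS["engine_a"]("42")` and `WRAPPERS["engine_b"](b"42")` runs the same `add_prefix` logic on each engine's own string representation.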
38. DaliSpark: DataFrame API
Dali Catalog
DaliSpark.createDataFrame(…)
[Figure: view query plan, σ(R ⋈ S) ⋈ T]
1. registerBaseTables(R, S, T)
2. registerUDFs(…)
3. generateSparkSQL(…)
4. spark.sql(sqlStmt)
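The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not DaliSpark itself: SQLite stands in for Spark so the example runs without a cluster, and the catalog layout, the view name `my_view`, the physical table names, the UDF `is_active`, and the function `create_data_frame` are all hypothetical.

```python
import sqlite3

# Hypothetical catalog entry for a view shaped like σ(R ⋈ S) ⋈ T.
CATALOG = {
    "my_view": {
        # logical base table name -> physical table name
        "base_tables": {"R": "phys_r", "S": "phys_s", "T": "phys_t"},
        "udfs": {"is_active": lambda status: 1 if status == "active" else 0},
        "sql": (
            "SELECT R.id, T.score FROM R "
            "JOIN S ON R.id = S.id "
            "JOIN T ON R.id = T.id "
            "WHERE is_active(S.status) = 1"
        ),
    }
}


def create_data_frame(conn, view_name):
    """Stand-in for the createDataFrame flow; returns rows, not a DataFrame."""
    view = CATALOG[view_name]
    # 1. Register base tables: expose each physical table under its logical name.
    for logical, physical in view["base_tables"].items():
        conn.execute(f"CREATE TEMP VIEW {logical} AS SELECT * FROM {physical}")
    # 2. Register the UDFs the view depends on (each takes one argument here).
    for name, fn in view["udfs"].items():
        conn.create_function(name, 1, fn)
    # 3. Generate the SQL statement (here simply read from the catalog entry).
    sql_stmt = view["sql"]
    # 4. Hand the statement to the engine.
    return conn.execute(sql_stmt).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phys_r (id INTEGER)")
conn.execute("CREATE TABLE phys_s (id INTEGER, status TEXT)")
conn.execute("CREATE TABLE phys_t (id INTEGER, score REAL)")
conn.executemany("INSERT INTO phys_r VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO phys_s VALUES (?, ?)", [(1, "active"), (2, "inactive")]
)
conn.executemany("INSERT INTO phys_t VALUES (?, ?)", [(1, 0.9), (2, 0.4)])

rows = create_data_frame(conn, "my_view")
```

The caller names only the view; the base tables, UDFs, and SQL all come from the catalog entry.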