Dali-Spark: Apache Spark Data Access at LinkedIn to Achieve Data Agility

At LinkedIn, changes to our complex data ecosystem can impose steep costs on application developers. We designed Dali to insulate developers from physical attributes of data such as storage format and location. At its core, Dali provides a catalog to define and evolve physical and virtual datasets, a dataset reader that allows applications to read datasets in different environments, and a collection of development tools to manage these datasets. In this talk, we will cover:
• What Dali is and how it simplifies a complex data ecosystem
• Dali as the unified data access layer at LinkedIn
• DaliSpark architecture
• Roadmap, including plans to open source Dali

1. Data Agility with DaliSpark. Adwait Tumbde, LinkedIn

2. “All problems in computer science can be solved by another layer of indirection” (David Wheeler)

3. Data Agility: Make data scientists more productive

4. Data Agility: Dali Tables + Views; Storage

5. Data Agility: • Why views? • What problems do they solve? • Technology

7. View of History: Navigational (programmers only!); DBMS (SQL: anyone can query); MapReduce (programmers only!)

8. “MapReduce: A Major Step Backwards” (Michael Stonebraker)

9. The Good Part: • Scale • Best engine for the job!

10. A “De-constructed” Database: Tables; Queries & Optimizations; Catalog

11. Views: Public Interface. Abstract physical details. Layers: Storage; Physical Schema; Conceptual Schema; External Schema (three shown)

12. Dataset As A Service: Code. Layers: Storage; Physical Schema; Conceptual Schema; External Schema (three shown)

13. Data Agility: • Why views? • What problems do they solve? • Technology

14. Challenges for Data Consumers: Data models are in constant flux. Data cleaning hurts productivity! Have to change data processing logic everywhere! (Diagram label: My Raw Data)

15. Challenge for Infrastructure Providers: Hard to change anything underneath! Dependencies on path and hard-coded format. Hard to move to better formats without breaking everyone or copying data twice. (Diagram label: My Raw Data)

16. Change is the Only Constant! Storage; DB; Format; Tables; Engines

17. Data Format Evolution: Table1 (Avro); DaliSpark Reader; HDFS

18. Data Format Evolution: Table1 (union view) over Table1_Avro and Table1_ORC; DaliSpark Reader; HDFS

19. Change Storage: Table1 (union view) over Table1_Avro and Table1_ORC; DaliSpark Reader; Blobstore

20. One View for All: Table1 (union view) over Table1_Avro and Table1_ORC; DaliSpark Reader; Blobstore
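The union-view idea in slides 17 to 20 can be tried out with nothing more than SQLite. This is only a sketch of the pattern, not Dali itself: SQLite stands in for the catalog and storage, and the column names (member_id, event) and sample rows are made up for illustration. The consumer query stays the same before and after the migration from the Avro-named table to the ORC-named one.

```python
import sqlite3

# In-memory SQLite database standing in for catalog + storage (assumption:
# Dali targets HDFS or a blob store; SQLite is used here only for illustration).
conn = sqlite3.connect(":memory:")

# Two physical tables holding the same logical dataset. The names mirror the
# slides; the columns and rows are hypothetical.
conn.executescript("""
    CREATE TABLE Table1_Avro (member_id INTEGER, event TEXT);
    CREATE TABLE Table1_ORC  (member_id INTEGER, event TEXT);
    INSERT INTO Table1_Avro VALUES (1, 'login'), (2, 'search');
    INSERT INTO Table1_ORC  VALUES (3, 'click');

    -- Consumers reference only Table1, never the physical tables.
    CREATE VIEW Table1 AS
        SELECT member_id, event FROM Table1_Avro
        UNION ALL
        SELECT member_id, event FROM Table1_ORC;
""")


def read_table1(connection):
    """Consumer-side query; it does not change when the storage layout does."""
    return connection.execute(
        "SELECT member_id, event FROM Table1 ORDER BY member_id"
    ).fetchall()


before = read_table1(conn)

# Finish the migration: copy the remaining rows, then redefine the view so it
# points at the new physical table only.
conn.executescript("""
    INSERT INTO Table1_ORC SELECT member_id, event FROM Table1_Avro;
    DROP VIEW Table1;
    DROP TABLE Table1_Avro;
    CREATE VIEW Table1 AS SELECT member_id, event FROM Table1_ORC;
""")

after = read_table1(conn)
```

Because the consumer only names the view, the infrastructure owner can change what sits underneath without coordinating a code change with every reader.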

21. Push Down Logic: Mobile Events and Desktop Events; union view, filter events; DaliSpark Reader

22. Views as Code Sharing: Mobile Events and Desktop Events; union view, filter events; DaliSpark Reader; flatten nested data; DaliSpark Reader
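Slides 21 and 22 describe moving shared logic (union, filtering) out of each consumer and into the view. A small sketch of that idea, again using SQLite as a stand-in: the table names follow the slides, while the columns (member_id, action, is_test) and the "drop test traffic" filter are assumptions chosen only to have something concrete to filter on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical per-platform event tables.
    CREATE TABLE mobile_events  (member_id INTEGER, action TEXT, is_test INTEGER);
    CREATE TABLE desktop_events (member_id INTEGER, action TEXT, is_test INTEGER);
    INSERT INTO mobile_events  VALUES (1, 'view', 0), (2, 'view', 1);
    INSERT INTO desktop_events VALUES (3, 'click', 0), (4, 'click', 1);

    -- The union and the filter live in the view, so they are written once
    -- and shared by every consumer.
    CREATE VIEW events AS
        SELECT member_id, action, 'mobile' AS platform
          FROM mobile_events WHERE is_test = 0
        UNION ALL
        SELECT member_id, action, 'desktop' AS platform
          FROM desktop_events WHERE is_test = 0;
""")


def actions_per_platform(connection):
    """A consumer query that contains business logic only, no cleaning code."""
    rows = connection.execute(
        "SELECT platform, COUNT(*) FROM events GROUP BY platform ORDER BY platform"
    ).fetchall()
    return dict(rows)


counts = actions_per_platform(conn)
```

If the filtering rule changes, only the view definition is edited; the consumer function above is untouched.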

23. Data Agility: • Why views? • What problems do they solve? • Technology

24. Sample Dali Dataset
    CREATE VIEW profile_flattened
    TBLPROPERTIES (
      'functions' = 'get_profile_section:isb.GetProfileSections',
      'dependencies' = 'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
    AS SELECT get_profile_section(...)
    FROM prod_identity.profile;
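The view in slide 24 carries its UDF bindings and artifact coordinates as table properties. The sketch below shows one way a reader could unpack those two strings. The parsing rules are assumptions read off the single example on the slide: an alias:class pair for 'functions' (whitespace between multiple pairs is a guess) and a group:artifact:version triple for 'dependencies'. The helper names are hypothetical and are not part of Dali.

```python
def parse_functions_property(value):
    """Map UDF alias -> implementing class from a 'functions' property value.

    Assumes 'alias:class' pairs. Whitespace as the separator between several
    pairs is a guess, since the slide shows only one pair.
    """
    mapping = {}
    for pair in value.split():
        alias, _, class_name = pair.partition(":")
        if not alias or not class_name:
            raise ValueError(f"malformed function entry: {pair!r}")
        mapping[alias] = class_name
    return mapping


def parse_dependency(value):
    """Split an assumed 'group:artifact:version' coordinate into its parts."""
    group, artifact, version = value.split(":")
    return {"group": group, "artifact": artifact, "version": version}


# Property values copied from the slide.
view_properties = {
    "functions": "get_profile_section:isb.GetProfileSections",
    "dependencies": "com.linkedin.dali-udfs:get-profile-sections:0.0.5",
}

udfs = parse_functions_property(view_properties["functions"])
artifact = parse_dependency(view_properties["dependencies"])
```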

25. SQL + UDFs: Logical Independence

26. SQL + UDFs: Logical Independence. Relational Algebra Intermediate Representation (diagram: σ(R ⋈ S) ⋈ T)
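The tree drawn on slide 26 (a join whose left input is a selection over another join) can be written down as a tiny relational-algebra data structure. This is a generic sketch of such an intermediate representation, not Dali's actual one; the class names and the placeholder predicate "p" are hypothetical, and join conditions are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Leaf node: read a base table."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Selection (σ) over a child plan; the predicate is kept as plain text."""
    child: object
    predicate: str


@dataclass(frozen=True)
class Join:
    """Join (⋈) of two child plans."""
    left: object
    right: object


def base_tables(node):
    """Collect the base tables a plan reads, left to right."""
    if isinstance(node, Scan):
        return [node.table]
    if isinstance(node, Filter):
        return base_tables(node.child)
    return base_tables(node.left) + base_tables(node.right)


def render(node):
    """Print the plan as an algebra expression."""
    if isinstance(node, Scan):
        return node.table
    if isinstance(node, Filter):
        return f"σ[{node.predicate}]({render(node.child)})"
    return f"({render(node.left)} ⋈ {render(node.right)})"


# The plan from the slide: σ(R ⋈ S) ⋈ T, with a placeholder predicate "p".
plan = Join(Filter(Join(Scan("R"), Scan("S")), "p"), Scan("T"))
tables = base_tables(plan)
text = render(plan)
```

Keeping the view as a tree rather than as engine-specific SQL text is what lets one definition be translated for more than one engine.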

27. Transport UDFs: Single UDF for All Engines

28. • Portability • Schema Evolution • Privacy compliance (GDPR) • Materialized Views (diagram: σ(R ⋈ S) ⋈ T)

29. Future: Materialized Views. Declarative data processing pipelines: • Intelligent Materialization • Run-time query rewrite to use materialized views

30. Dali Views: Insulate applications from storage and structure of data • Public API, private implementation • Enable evolution • Focus on business logic

31. Contributors

32. Thank you

34. Backup

35. Tables, Views, and Dali Views
    Table: Path, Format, Schema, Partitioning Scheme, …
    View: f(table(s) | view(s))
    Dali View: f(table(s) | view(s)), UDFs, Dependencies

36. Transport UDFs: Cross-platform UDF API
    Simple: Only what is needed to define logic. No boilerplate code.
    Feature-rich: Declarative type signatures with generics. Nullable arguments. API-level HDFS support. High-level user-friendly data types.
    Translatable: Can run on multiple platforms. Code specific to platform is auto-generated.
    Performant: Direct access to native platform data.

37. Transport UDF Code Generation
    User-defined Transport UDF
    Transport Gradle Plugin: Code Analysis, Metadata Generation
    Autogenerated Engine Wrappers: Presto, Hive, Spark, …
    Autogenerated UDF JARs: Presto, Hive, Spark, …
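Slides 36 and 37 describe writing a UDF once and generating engine-specific wrappers around it. The sketch below illustrates only the shape of that idea in plain Python; it is not the Transport API. The UDF (normalize_title), the two wrapper factories, and the two "engines" (one passing dict rows, one passing column batches) are all hypothetical stand-ins, and in the real system the wrappers are generated by a Gradle plugin rather than written by hand.

```python
def normalize_title(title):
    """Platform-neutral UDF logic (hypothetical): tidy up a job title.

    Accepts None, mirroring the nullable arguments mentioned on the slide.
    """
    if title is None:
        return None
    return " ".join(title.split()).title()


def make_row_wrapper(udf, column):
    """Adapter for an imagined engine that hands the UDF one dict row at a time."""
    def call(row):
        return udf(row.get(column))
    return call


def make_batch_wrapper(udf):
    """Adapter for an imagined engine that hands the UDF a whole column batch."""
    def call(values):
        return [udf(value) for value in values]
    return call


# The same logic, exposed through two engine-shaped entry points.
row_udf = make_row_wrapper(normalize_title, "title")
batch_udf = make_batch_wrapper(normalize_title)

row_result = row_udf({"title": "  senior   engineer "})
batch_result = batch_udf(["data  scientist", None])
```

The UDF author touches only normalize_title; everything engine-specific sits in the adapters, which is the part the slides say is auto-generated.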

38. DaliSpark: DataFrame API
    Dali Catalog; DaliSpark.createDataFrame(…); view plan: σ(R ⋈ S) ⋈ T
    1. registerBaseTables(R, S, T)
    2. registerUDFs(…)
    3. generateSparkSQL(…)
    4. spark.sql(sqlStmt)
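The four steps on slide 38 can be walked through with a stub in place of Spark. Everything here is an assumption made for illustration: the function names only echo the slide, their bodies are invented, the catalog entry and its SQL text are made up, and RecordingSession is a stand-in that records calls instead of running them. It shows the order of operations, not how DaliSpark is implemented.

```python
class RecordingSession:
    """Stand-in for a Spark session: records calls, executes nothing."""

    def __init__(self):
        self.registered_tables = []
        self.registered_udfs = []
        self.executed_sql = []

    def sql(self, statement):
        self.executed_sql.append(statement)
        return f"DataFrame<{statement}>"


# Hypothetical catalog entry describing what a view resolves to.
CATALOG = {
    "my_view": {
        "base_tables": ["R", "S", "T"],
        "udfs": {"get_profile_section": "isb.GetProfileSections"},
        "sql": "SELECT * FROM R JOIN S USING (id) JOIN T USING (id)",
    }
}


def register_base_tables(session, tables):
    """Step 1: make the view's base tables visible to the engine."""
    session.registered_tables.extend(tables)


def register_udfs(session, udfs):
    """Step 2: register each UDF alias used by the view."""
    session.registered_udfs.extend(sorted(udfs))


def generate_spark_sql(entry):
    """Step 3: produce engine SQL for the view (here: stored text, unchanged)."""
    return entry["sql"]


def create_data_frame(session, view_name):
    """Run the four steps in the order listed on the slide."""
    entry = CATALOG[view_name]
    register_base_tables(session, entry["base_tables"])
    register_udfs(session, entry["udfs"])
    statement = generate_spark_sql(entry)
    return session.sql(statement)  # Step 4


session = RecordingSession()
frame = create_data_frame(session, "my_view")
```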
