Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read

In this webcast, Jason Pohl, Solutions Engineer at Databricks, covers how to build a Just-in-Time Data Warehouse on Databricks, with a focus on performing Change Data Capture from a relational database and joining that data to a variety of data sources. Not only do Apache Spark and Databricks let you do this more easily and with less code, but the routine also automatically ingests changes to the source schema.
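
The idea of picking up source schema changes automatically can be illustrated with a small sketch. It is plain Python rather than Spark, so it is not the webcast's notebook code: `merge_rows` is a hypothetical helper, and the department values are illustrative. Only the column names `dept_no`, `dept_name`, and `dept_desc` come from the slides.

```python
def merge_rows(target_rows, incoming_rows, key):
    """Merge incoming rows into target rows by key, widening the column set as needed."""
    # Collect every column name seen in either side, in first-seen order.
    columns = []
    for row in [*target_rows, *incoming_rows]:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Incoming rows replace target rows that share the same key.
    merged = {row[key]: row for row in target_rows}
    merged.update({row[key]: row for row in incoming_rows})
    # Rows that predate a new column get None for it.
    return columns, [{name: row.get(name) for name in columns} for row in merged.values()]


# Target as loaded before the source table gained a dept_desc column (illustrative values).
target_rows = [
    {"dept_no": "d001", "dept_name": "Marketing"},
    {"dept_no": "d002", "dept_name": "Finance"},
]
# A changed row read after the source schema was altered.
incoming_rows = [
    {"dept_no": "d001", "dept_name": "Marketing", "dept_desc": "Marketing"},
]

columns, rows = merge_rows(target_rows, incoming_rows, key="dept_no")
print(columns)
for row in rows:
    print(row)
```

No schema is declared up front: the column set is derived from the data at read time, which is the property the description attributes to the routine.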


  1. Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read
     Jason Pohl, Data Solutions Engineer
     Denny Lee, Technology Evangelist
  2. About the speaker: Jason Pohl
     Jason Pohl is a solutions engineer with Databricks, focused on helping customers become successful with their data initiatives. Jason has spent his career building data-driven products and solutions.
  3. About the moderator: Denny Lee
     Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
  4. We are Databricks, the company behind Apache Spark
     • Founded by the creators of Apache Spark in 2013
     • Share of Spark code contributed by Databricks in 2014: 75%
     • Data Value
     • Created Databricks on top of Spark to make big data simple
  5. Apache Spark Engine
     • Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX, …
     • Unified engine across diverse workloads & environments
     • Scale out, fault tolerant
     • Python, Java, Scala, and R APIs
     • Standard libraries
  6. Notable users that presented at Spark Summit 2015 San Francisco
     Source: Slide 5 of Spark Community Update
  7. Traditional Data Warehousing Pain Points
     Inelasticity of compute and storage resources
     • Burst workloads require capacity planning for maximum load
     • Fixed-size DW = compute and storage must scale linearly together (these are orthogonal problems)
     • Expensive conundrum:
       • If your DW is successful, you cannot easily expand
       • If there is overcapacity = idle resources
  8. Traditional Data Warehousing Pain Points
     Rigid architecture that’s difficult to change
     • Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be pre-built
     • Rigidity = maintaining costly ETL pipelines
     • Finite resources are spent continually augmenting pipelines to absorb new data
  9. Traditional Data Warehousing Pain Points
     Limited advanced analytics capabilities
     • Want more than what business intelligence and data warehousing provide
     • More than just counts, aggregates, and trends
     • Desire forecasting using ML, segmentation, graph processing, etc.
  10. Just-in-Time Data Warehousing
      Scale resources on demand
      • Scale resources based on query load
      • Separate compute and storage to scale either independently
      • Easily set up multiple clusters against the same data sources
  11. Just-in-Time Data Warehousing
      Direct access to data sources
      • Scale resources based on query load
      • Separate compute and storage to scale either independently
      • Easily set up multiple clusters against the same data sources
  12. Just-in-Time Data Warehousing
      Scale resources on demand
      • Scale resources based on query load
      • Separate compute and storage to scale either independently
      • Easily set up multiple clusters against the same data sources
  13. Change Data Capture
      What is it?
      • A system that automatically captures changes in a source system (e.g. a transactional database) and automatically applies those changes to a target system (e.g. a data warehouse)
      • Important for data warehouses because it allows the warehouse to record (and ultimately report) any changes, e.g.:
        • Customer A buys a pair of skis for $250 on 1/2/2015
        • On 1/5/2015, realize that the purchase was $350, not $250
  14. Change Data Capture
      Source to Target

      Source:
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $250.00

      Target (empty):
      ID   Date      Product  Price

      Target (after load):
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $250.00
  15. Change Data Capture
      Add new row

      Source:
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $250.00

      Target:
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $250.00

      With row 103 added (shown twice on the slide):
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $250.00
      103  1/3/2016  Disc     $15.00
  16. Change Data Capture
      Update an existing row

      Source:
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $250.00
      103  1/3/2016  Disc     $15.00

      Target:
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $250.00
      103  1/3/2016  Disc     $15.00

      With row 102 updated to $350.00:
      ID   Date      Product  Price
      101  1/1/2016  Skates   $80.00
      102  1/2/2016  Skis     $350.00
      103  1/3/2016  Disc     $15.00
  17. Change Data Capture
      Update an existing row
      Source / Target, with a LastUpdated column. The slide shows the following table states:

      Before the update (shown three times on the slide):
      ID   Date      Product  Price    LastUpdated
      101  1/1/2016  Skates   $80.00   1/1/2016
      102  1/2/2016  Skis     $250.00  1/2/2016
      103  1/3/2016  Disc     $15.00   1/3/2016

      With row 102 updated:
      ID   Date      Product  Price    LastUpdated
      101  1/1/2016  Skates   $80.00   1/1/2016
      102  1/2/2016  Skis     $350.00  1/5/2016
      103  1/3/2016  Disc     $15.00   1/3/2016

      Final table, with the changed row appended:
      ID   Date      Product  Price    LastUpdated
      101  1/1/2016  Skates   $80.00   1/1/2016
      102  1/2/2016  Skis     $250.00  1/5/2016
      103  1/3/2016  Disc     $15.00   1/3/2016
      102  1/2/2016  Skis     $350.00  1/5/2016
  18. Demo
      High Watermark with LastUpdatedDate
  19. Stage Data from Employee Database
  20. Update Records in Employee Source Database
      UPDATE employees
      SET last_name = 'Spark'
      WHERE emp_no = 16894
  21. Job to Automate CDC
      Source, Target, Jobs
      The slide repeats two table layouts, one with a Tag column and one without:

      With Tag:
      ID   Date      Product  Tag    Price    LastUpdated
      101  1/1/2016  Skates   ice    $80.00   1/1/2016
      102  1/2/2016  Skis     snow   $250.00  1/2/2016
      103  1/3/2016  Disc     field  $15.00   1/3/2016

      Without Tag:
      ID   Date      Product  Price    LastUpdated
      101  1/1/2016  Skates   $80.00   1/1/2016
      102  1/2/2016  Skis     $250.00  1/2/2016
      103  1/3/2016  Disc     $15.00   1/3/2016
  22. Add a column to the Departments table
      ALTER TABLE departments
      ADD COLUMN dept_desc VARCHAR(50);
      UPDATE departments
      SET dept_desc = dept_name;
  23. Job to Automate CDC
      Source, Target, Jobs
      Column lists shown on the slide, in order:
      • dept_no, dept_name
      • dept_no, dept_name
      • dept_no, dept_name, dept_desc
  24. Notebooks
      To access the notebooks, see the attachments in the Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read webinar.
      • Stage Data From Employee Database:
        • Notebook that starts the process
        • Defines the ETL process
      • Change Schema in Employee Source Database
      • Update Records in Employee Source Database
      • Validate Departments
  25. Resources
      • Just-in-Time Data Warehousing Solution Brief
      • Building a Turbo-fast Data Warehousing Platform with Databricks
      • Spark DataFrames: Simple and Fast Analysis of Structured Data
      • Transitioning from Traditional DW to Spark in OR Predictive Modeling
      • Advertising Technology Sample Notebook (Part 1)
  26. More resources
      • Databricks Guide
      • Apache Spark User Guide
      • Databricks Community Forum
      • Training courses: public classes, MOOCs, & private training
      • Databricks Community Edition: Free hosted Apache Spark. Join the waitlist for the beta release!
  27. Thanks!
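
The high-watermark approach named in the demo slide ("High Watermark with LastUpdatedDate") can be sketched in a few lines. This is a minimal plain-Python illustration, not the notebook code from the webcast: `apply_changes` and the lowercase field names are hypothetical, and the rows are taken from the slide tables. One difference to note: this sketch overwrites a changed row by key, whereas the final table on the slide appears to append the changed version of row 102.

```python
from datetime import date


def apply_changes(source_rows, target, watermark):
    """Upsert rows changed after `watermark` into `target`; return the new watermark."""
    # Only rows whose last_updated is strictly newer than the watermark are read.
    changed = [row for row in source_rows if row["last_updated"] > watermark]
    for row in changed:
        # Copy the row so later edits to the source do not leak into the target.
        target[row["id"]] = dict(row)
    # Advance the watermark to the newest change seen; keep it if nothing changed.
    return max((row["last_updated"] for row in changed), default=watermark)


# Source rows taken from the slide tables.
source = [
    {"id": 101, "date": date(2016, 1, 1), "product": "Skates", "price": 80.00,
     "last_updated": date(2016, 1, 1)},
    {"id": 102, "date": date(2016, 1, 2), "product": "Skis", "price": 250.00,
     "last_updated": date(2016, 1, 2)},
    {"id": 103, "date": date(2016, 1, 3), "product": "Disc", "price": 15.00,
     "last_updated": date(2016, 1, 3)},
]

target = {}
watermark = apply_changes(source, target, date.min)  # initial load copies all rows

# The price of row 102 is corrected in the source on 1/5/2016.
source[1]["price"] = 350.00
source[1]["last_updated"] = date(2016, 1, 5)

watermark = apply_changes(source, target, watermark)  # picks up only row 102
print(watermark, target[102]["price"])
```

Because the comparison is strict and the column here has day granularity, a row changed later on the same day as the stored watermark would be missed; a timestamp column, or re-reading the watermark day, would avoid that.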
