
Introduction to SQL Analytics on Lakehouse Architecture

Instructor: Doug Bateman



  1. Introduction to SQL Analytics on Lakehouse Architecture. Instructor: Doug Bateman
  2. About Your Instructor ▪ Doug Bateman ▪ Principal Data Engineering Instructor at Databricks ▪ Joined Databricks in 2016 ▪ 20+ years of industry experience
  3. About Your Instructor (Personal) ▪ Two children, 2 and 5 years old ▪ For fun: sailing, rock climbing, snowboarding (badly), and chess (badly)
  4. Course Goals 1. Describe key features of a data Lakehouse. 2. Explain how Delta Lake enables a Lakehouse architecture. 3. Define key features available in the Databricks SQL Analytics user interface.
  5. Course Agenda ▪ Course welcome ▪ Introduction to Lakehouse Architecture ▪ Delta Lake ▪ Databricks SQL Analytics Intro ▪ Databricks SQL Analytics Demo ▪ Wrap up and Q&A
  6. Access the Slides: https://tinyurl.com/lakehouse-webinar
  7. About You (Polls)
  8. Introduction to Lakehouse Architecture
  9. Data-Driven Decisions
  10. Data Warehouses were purpose-built for BI and reporting; however: ▪ No support for video, audio, or text ▪ No support for data science or ML ▪ Limited support for streaming ▪ Closed & proprietary formats. Therefore, most data is stored in data lakes & blob stores. (Diagram: External Data and Operational Data → ETL → Data Warehouses → BI Reports)
  11. Data Lakes could store all your data, letting you determine what you want to know later; however: ▪ Poor BI support ▪ Complex to set up ▪ Poor performance ▪ Unreliable data swamps. (Diagram: Structured, Semi-Structured, and Unstructured Data → ETL → Data Lake → Data Prep and Validation → Data Warehouses and Real-Time Database → BI, Reports, Data Science, Machine Learning)
  12. How do we get the best of both worlds? (Diagram: the data lake pipeline and the data warehouse pipeline from the two previous slides, side by side)
  13. Lakehouse (Diagram: Structured, Semi-Structured, and Unstructured Data flowing into a Lakehouse that combines Data Warehouse and Data Lake, serving Streaming Analytics, BI, Data Science, and Machine Learning)
  14. Lakehouse Summary A Lakehouse has the following key features: ● support for diverse data types and formats ● data reliability and consistency ● support for diverse workloads (BI, data science, machine learning, and analytics) ● ability to use BI tools directly on source data
  15. Building a Lakehouse: the core components 1. Your data lake (cloud blob storage, open source format) 2. Transaction layer to provide consistency (Delta) 3. ETL and data cleansing workflow (Spark + Databricks Delta Pipelines) 4. Security, data integrity, and performance (Databricks Delta Engine) 5. As well as integrations for all of your user communities: a. SQL (Databricks SQL Analytics) b. BI tools and dashboards c. ML d. Streaming (a SQL sketch of steps 1–2 follows below)
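  To make steps 1 and 2 concrete, here is a minimal SQL sketch of registering a Delta table directly on cloud object storage. The table name, schema, and storage path are hypothetical placeholders, not from the webinar:

      -- Create a Delta table whose files live in the data lake itself
      -- (schema and path are illustrative assumptions)
      CREATE TABLE IF NOT EXISTS events (
        id BIGINT,
        eventType STRING,
        eventTime TIMESTAMP
      )
      USING DELTA
      LOCATION 's3://my-bucket/lakehouse/events';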
  16. Delta Lake
  17. The Emergence of Data Lakes ▪ Really cheap, durable storage: 10 nines of durability, cheap, infinite scale ▪ Store all types of raw data: video, audio, text, structured, unstructured ▪ Open, standardized formats: the Parquet format, with a big ecosystem of tools that operate on these files
  18. Challenges with data lakes 1. Hard to append data: adding newly arrived data leads to incorrect reads. 2. Modification of existing data is difficult: GDPR/CCPA require making fine-grained changes to existing data lakes. 3. Jobs failing midway: half of the data appears in the data lake, and the rest is missing.
  19. Challenges with data lakes 4. Real-time operations: mixing streaming and batch leads to inconsistency. 5. Costly to keep historical versions of the data: regulated environments require reproducibility, auditing, and governance. 6. Difficult to handle large metadata: for large data lakes, the metadata itself becomes difficult to manage.
  20. Challenges with data lakes 7. "Too many files" problems: data lakes are not great at handling millions of small files. 8. Hard to get great performance: partitioning the data for performance is error-prone and difficult to change. 9. Data quality issues: it's a constant headache to ensure that all the data is correct and of high quality.
  21. Delta Lake: a new standard for building data lakes ■ An opinionated approach to building data lakes ■ Adds reliability, quality, and performance to data lakes ■ Brings the best of data warehousing and data lakes ■ Based on open source and an open format (Parquet); Delta Lake is itself open source
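  Because Delta stores data as Parquet, an existing Parquet directory can be upgraded to a Delta table in place. A minimal sketch, assuming a hypothetical storage path:

      -- Convert an existing Parquet directory into a Delta table in place
      -- (the path is an illustrative assumption)
      CONVERT TO DELTA parquet.`s3://my-bucket/lakehouse/events`;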
  22. The nine data lake challenges Delta Lake addresses: 1. Hard to append data 2. Modification of existing data is difficult 3. Jobs failing midway 4. Real-time operations are hard 5. Costly to keep historical data versions 6. Difficult to handle large metadata 7. "Too many files" problems 8. Poor performance 9. Data quality issues
  23. ACID Transactions ▪ Make every operation transactional: it either fully succeeds, or it is fully aborted for later retries ▪ The transaction log: /path/to/table/_delta_log contains 0000.json, 0001.json, 0002.json, …, 0010.parquet
  24. ACID Transactions ▪ Each commit records its actions in the log, e.g. { Add file1.parquet, Add file2.parquet, … }
  25. ACID Transactions ▪ A later commit can remove and add files atomically, e.g. { Remove file1.parquet, Add file3.parquet, … }
  26. ACID Transactions ▪ New commits append new log files to /path/to/table/_delta_log: 0010.json, 0011.json, …
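  The commit history sketched above can also be inspected from SQL. A minimal example, assuming a Delta table named events:

      -- List past commits: version, timestamp, operation, and related details
      DESCRIBE HISTORY events;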
  27. ACID Transactions ▪ Review past transactions: all transactions are recorded, and you can go back in time to review previous versions of the data (i.e. time travel): SELECT * FROM events TIMESTAMP AS OF ... / SELECT * FROM events VERSION AS OF ...
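  Filled in with example values (the timestamp and version number below are hypothetical), the time-travel queries look like this:

      -- Read the table as it was at a point in time
      SELECT * FROM events TIMESTAMP AS OF '2020-12-01 00:00:00';

      -- Read the table as it was at a specific commit version
      SELECT * FROM events VERSION AS OF 10;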
  28. Spark under the hood ▪ Spark is built for handling large amounts of data ▪ All Delta Lake metadata is stored in the open Parquet format ▪ Portions of it are cached and optimized for fast access ▪ Data and its metadata always co-exist: no need to keep a catalog and the data in sync
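  Because the metadata travels with the data, table details can be read straight from the table itself. A small example, again assuming the events table:

      -- Show table metadata: format, location, number of files, size, etc.
      DESCRIBE DETAIL events;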
  29. File Consolidation ▪ Automatically optimize a layout that enables fast access: ▪ Partitioning: layout for typical queries ▪ Data skipping: prune files based on statistics on numerical columns ▪ Z-ordering: layout that optimizes across multiple columns ▪ Example: OPTIMIZE events ZORDER BY (eventType)
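  On Databricks, OPTIMIZE can also be restricted to a subset of partitions so that only recent data is compacted. A sketch, assuming (hypothetically) that events is partitioned by a date column:

      -- Compact small files and cluster by eventType, only for recent partitions
      -- (the partition column and cutoff are illustrative assumptions)
      OPTIMIZE events
      WHERE date >= '2021-01-01'
      ZORDER BY (eventType);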
  30. Schema validation and evolution ▪ All data written to a Delta table has to adhere to its declared schema (star, etc.) ▪ Includes schema evolution in merge operations: MERGE INTO events USING changes ON events.id = changes.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
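  Schema evolution during merges is opt-in. A minimal sketch of enabling it for a session before running the MERGE above; treat the exact configuration name as an assumption based on Delta Lake's automatic schema-merging setting:

      -- Allow MERGE to add new source columns to the target table's schema
      -- (configuration name per Delta Lake's schema auto-merge setting)
      SET spark.databricks.delta.schema.autoMerge.enabled = true;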
  31. Delta Lake Summary ▪ Core component of a Lakehouse architecture ▪ Offers guaranteed consistency because it is ACID-compliant ▪ Robust data store ▪ Designed to work with Apache Spark
  32. Elements of Delta Lake ▪ Delta Architecture ▪ Delta Storage Layer ▪ Delta Engine
  33. Delta Architecture (Diagram: data flows from raw ingestion into Bronze, then Silver (filtered, cleaned, augmented), then Gold (business-level aggregates) tables, with data quality improving at each stage, feeding Streaming Analytics and AI & Reporting)
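  A minimal SQL sketch of the bronze → silver → gold flow; the table names, columns, and cleansing rules are hypothetical:

      -- Silver: filter and clean the raw bronze data
      CREATE OR REPLACE TABLE events_silver USING DELTA AS
      SELECT id, eventType, eventTime
      FROM events_bronze
      WHERE id IS NOT NULL;

      -- Gold: business-level aggregates over the cleaned data
      CREATE OR REPLACE TABLE events_gold USING DELTA AS
      SELECT eventType, COUNT(*) AS eventCount
      FROM events_silver
      GROUP BY eventType;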
  34. Delta Storage Layer (Diagram: a structured transactional layer over a data lake for all your data: structured, semi-structured, and unstructured; one platform for every use case: Streaming Analytics, BI, Data Science, Machine Learning)
  35. Databricks' Delta Engine: performance ▪ File management optimizations ▪ Performance optimization with Delta Caching ▪ Dynamic File Pruning ▪ Adaptive Query Execution
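  Two of these optimizations can be toggled per session with Spark configuration. A sketch; the setting names below are the ones I believe correspond to Delta Caching and Adaptive Query Execution, so treat them as assumptions:

      -- Enable the Delta cache (caches remote data on local SSDs)
      SET spark.databricks.io.cache.enabled = true;

      -- Enable Adaptive Query Execution (re-plans queries using runtime statistics)
      SET spark.sql.adaptive.enabled = true;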
  36. (Diagram: the full stack: a high-performance query engine (Delta Engine) over a structured transactional layer (Delta Lake) over a data lake for all your data (structured, semi-structured, and unstructured); one platform for every use case: Streaming Analytics, BI, Data Science, Machine Learning)
  37. SQL Analytics
  38. Data-driven decisions (Diagram: data analysts serving Sales, Executives, Marketing, Operations, and Finance)
  39. Challenges solved by Delta Lake ▪ Stale data ▪ Incomplete data ▪ Silos ▪ Complexity
  40. SQL-native user interface ▪ Familiar SQL Editor ▪ Auto-complete ▪ Built-in visualizations ▪ Data Browser
  41. SQL-native user interface (continued) ▪ Familiar SQL Editor ▪ Auto-complete ▪ Built-in visualizations ▪ Data Browser ▪ Automatic Alerts ▪ Triggered based upon values ▪ Email or Slack integration
  42. SQL-native user interface (continued) ▪ Familiar SQL Editor ▪ Auto-complete ▪ Built-in visualizations ▪ Data Browser ▪ Automatic Alerts ▪ Triggered based upon values ▪ Email or Slack integration ▪ Dashboards ▪ Simply convert queries to dashboards ▪ Share with access controls
  43. Built-in connectors for existing BI tools ▪ Supports your favorite tool: connectors for top BI & SQL clients, plus other BI & SQL clients that support standard ODBC/JDBC ▪ Simple connection setup ▪ Optimized performance ▪ OAuth & Single Sign-On: a quick and easy authentication experience; no need to deal with access tokens ▪ Power BI available now; others coming soon
  44. SQL Analytics Demo
  45. Join us for Part 2: log in and use SQL Analytics hands-on, Dec 15 at 10am (San Francisco time). Thanks for coming!
  46. Setup & Administration
  47. SQL Endpoints ▪ SQL-optimized compute: SQL Endpoints give a quick way to set up SQL/BI-optimized compute. You pick a T-shirt size; Databricks will ensure a configuration that provides the highest price/performance. ▪ Concurrency scaling built in [Private Preview]: virtual clusters can load-balance queries across multiple clusters behind the scenes, providing unlimited concurrency.
  48. Query History ▪ Central query log: track and understand usage across virtual clusters, users, and time. Easily observe workloads across Redash, BI tools, and any other SQL client usage. ▪ Troubleshoot and debug: history is the starting point for understanding and triaging any errors and performance issues. Jump into the detailed Spark query profile as needed.
  49. Performance
  50. Performance: Life of a Query (Diagram: BI & SQL client connectors → ODBC/JDBC drivers → routing service → query planning → query execution → Delta Lake, all within Databricks SQL Analytics)
  51. Up to 9x better price/performance (Chart: 30 TB TPC-DS price/performance; lower is better)
  52. Course Agenda (with durations) ▪ Course welcome: 5 min ▪ Introduction to Lakehouse Architecture: 5 min ▪ Delta Lake: 10 min ▪ Databricks SQL Analytics Intro: 5 min ▪ Databricks SQL Analytics Demo: 20 min ▪ Wrap up and Q&A: 15 min
