Introducing Databricks Delta

Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.

Published in: Data & Analytics

  1. Unifying Data Warehousing with Data Lakes. Ali Ghodsi, Co-Founder & CEO. Oct 25, 2017
  2. Many enterprises are undergoing a data transformation
  3. Databricks Customers Across Industries: Financial Services; Healthcare & Pharma; Media & Entertainment; Technology; Public Sector; Retail & CPG; Consumer Services; Energy & Industrial IoT; Marketing & AdTech; Data & Analytics Services
  4. Health care AI cloud dataset use case: Correlate the EMRs of 50,000 patients with their DNA
  5. Enterprise AI use case: Provide recommendations to sales using NLP and deep learning
  6. Real-time AI use case: Curb abusive behavior across gamers globally
  7. Big Data was the Missing Link for AI. BIG DATA: Customer Data, Emails/Web pages, Click Streams, Sensor data (IoT), Video/Speech, … GREAT RESULTS. Most companies are struggling with Big Data.
  8. Hardest part of AI isn't AI. The hardest part of AI is Big Data. Figure labels: ML Code ("Hidden Technical Debt in Machine Learning Systems", Google, NIPS 2015)
  9. Building Predictive Applications is really Hard!
  10. Unified Analytics Platform: UNIFIED EXPERIENCE ACROSS TEAMS; UNIFIED PROCESSING ENGINE
  11. The Evolution of Big Data
  12. The Era of the Data Warehouse
  13. Data Warehouse (DW): ETL important data to central DW and get Business Intelligence (BI). THE GOOD: Pristine Data; Fast Queries; Transactional. THE BAD: Expensive to Scale, not Elastic; Requires ETL, Stale Data, No Real-Time; No Predictions, No ML; Closed formats (lock in). Not Future Proof: Missing Predictions, Real-time, Scale.
  14. The Era of the Data Lake
  15. Hadoop Data Lake: ETL all data to central scalable open lake for all use cases. THE GOOD: Massive scale; Inexpensive Storage; Open Formats (Parquet, ORC); Promise of ML & Real Time Streaming. THE BAD: Inconsistent Data; Unreliable for Analytics; Lack of Schema; Poor Performance. Becomes a cheap messy data store with poor performance.
  16. The Current State of Data Platforms
  17. Info Sec at a Fortune 100 Company. Diagram labels: Billions of records a day; HADOOP DATA LAKE; Messy data not ready for analytics; Complex ETL; EDW (x3); Incident Response; Alerting; Reports. ENTERPRISE DATA WAREHOUSE: Only 2 weeks of data; Very expensive to scale; Proprietary Formats; No Predictions (ML). DISADVANTAGES OF ARCHITECTURE: Poor agility in responding to new threats; Scale Limitations, no historical data; 6 months and twenty people to build.
  18. The Next Generation Data Platform
  19. Announcing Databricks Delta. First UNIFIED data management system that delivers: The SCALE of data lake; The LOW-LATENCY of streaming; The RELIABILITY & PERFORMANCE of data warehouse
  20. The SCALE of data lake; The LOW-LATENCY of streaming; The RELIABILITY & PERFORMANCE of data warehouse
  21. Databricks Delta Enables Predictions, Real-time and Ad Hoc Analytics at Massive Scale. THE GOOD OF DATA LAKES: Massive scale on Amazon S3; Open Formats (Parquet, ORC); Predictions (ML) & Real Time Streaming. THE GOOD OF DATA WAREHOUSES: Pristine Data; Transactional Reliability; Fast Queries (10-100x).
  22. Databricks Delta Under the Hood. MASSIVE SCALE: Decouple Compute & Storage. RELIABILITY: ACID Transactions & Data Validation. PERFORMANCE: Data Indexing & Caching (10-100x). LOW-LATENCY: Real-Time Streaming Ingest.
  23. Info Sec with Databricks Delta. Diagram labels: Trillion Records a Day; DATABRICKS DELTA, powered by DATABRICKS RUNTIME; ETL, Schema Validation; SQL, ML, Stream. ADVANTAGES: AI capable data warehouse at the scale of a data lake; Interactive analysis on 2 years of data; 2 weeks to build with a 5 person data platform team.
  24. Unified Analytics Platform. UNIFIED EXPERIENCE ACROSS TEAMS: Notebooks, Dashboards, Reports
  25. Unified Analytics Platform + UNIFIED DATA MANAGEMENT: Reliable Transactions, Performance. UNIFIED EXPERIENCE ACROSS TEAMS: Notebooks, Dashboards, Reports
  26. Demo by Michael Armbrust
  27. Evolution of a Cutting-Edge Data Pipeline. Diagram labels: Events; ?; Data Lake; Streaming Analytics; Reporting
  28. Evolution of a Cutting-Edge Data Pipeline. Diagram labels: Events; Data Lake; Streaming Analytics; Reporting
  29. Challenge #1: Historical Queries? Diagram labels: Events; λ-arch; Data Lake; Streaming Analytics; Reporting
  30. Challenge #2: Messy Data? Diagram labels: Events; λ-arch; Validation; Data Lake; Streaming Analytics; Reporting
  31. Challenge #3: Mistakes and Failures? Diagram labels: Events; λ-arch; Validation; Reprocessing; Partitioned; Data Lake; Streaming Analytics; Reporting
  32. Challenge #4: Query Performance? Diagram labels: Events; λ-arch; Validation; Reprocessing; Partitioned; Compaction; Compact Small Files; Scheduled to Avoid Compaction; Data Lake; Streaming Analytics; Reporting
  33. Let’s try it instead with DELTA
  34. The Canonical Data Pipeline. Diagram labels: Events; λ-arch; Validation; Reprocessing; Partitioned; Compaction; Compact Small Files; Scheduled to Avoid Compaction; Data Lake; Streaming Analytics; Reporting. Numbered callouts (1 to 4) mark each Challenge.
  35. The Delta Architecture. Diagram labels: DELTA; DATA LAKE; Reporting; Streaming Analytics. The SCALE of data lake; The LOW-LATENCY of streaming; The RELIABILITY & PERFORMANCE of data warehouse
  36. Sign up for the Private Beta: visit databricks.
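
Slide 32 names "Compact Small Files" as one of the manual steps a hand-built data lake pipeline ends up needing. The deck does not show how that step works, so the following is only a minimal sketch of the general idea, not Databricks Delta's implementation. The function name `plan_compaction`, the file names, and the sizes are all hypothetical. It uses a simple first-fit-decreasing heuristic to group small files into batches that could each be rewritten as one larger file.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group small files into batches whose total size fits within target_bytes.

    Files at or above the target are left alone. This is a greedy
    first-fit-decreasing heuristic, not an optimal bin packing.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    # Consider only files below the target, largest first (ties broken by name).
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < target_bytes),
        key=lambda item: (-item[1], item[0]),
    )

    batches: List[List[str]] = []
    free_space: List[int] = []  # remaining capacity of each batch
    for name, size in small:
        for i, free in enumerate(free_space):
            if size <= free:
                batches[i].append(name)
                free_space[i] -= size
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([name])
            free_space.append(target_bytes - size)

    # Rewriting a single file on its own gains nothing, so drop those batches.
    return [batch for batch in batches if len(batch) > 1]


# Hypothetical file listing: name -> size in arbitrary units.
files = {"part-0": 10, "part-1": 20, "part-2": 128, "part-3": 60, "part-4": 50}
plan = plan_compaction(files, target_bytes=128)
print(plan)
```

A planner like this only decides which files to merge. The harder problem the slides point to ("Scheduled to Avoid Compaction") is running the rewrite without disturbing concurrent readers and writers, which is where the transactional guarantees listed on slide 22 would come in.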
