Data Lakehouse, Data Mesh, and Data Fabric (r1)

Big Data/Data Warehouse Evangelist at Microsoft
August 5, 2021


Editor's Notes

  1. So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric.  What do all these terms mean and how do they compare to a data warehouse?  In this session I’ll cover all of them in detail and compare the pros and cons of each.  I’ll include use cases so you can see what approach will work best for your big data needs.
  2. Fluff, but point is I bring real work experience to the session
  3. http://www.ispot.tv/ad/7f64/directv-hang-gliding
  4. One version of the truth story: different departments used different financial formulas to help their bonuses. This leads to reasons to use BI, and can be used to convince your boss of the need for a DW. Note that you still want to do some reporting off of the source system (e.g. current inventory counts). It’s important to know up front if the data warehouse needs to be updated in real time or very frequently, as that is a major architectural decision. JD Edwards has table names like T117.
  5. https://blogs.technet.microsoft.com/msuspartner/2017/04/05/data-analytics-partners-navigating-data/
  6. Top-down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lots of upfront work to get data to where you can use it. Bottom-up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data.
     There are two approaches to doing information management for analytics:
     Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy, where theories and hypotheses are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics: what happened in the past and why did it happen?
     Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypotheses are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics such as predictive or prescriptive analytics: what will happen and/or how can we make it happen?
     In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach.
  7. Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera).
     https://www.jamesserra.com/archive/2017/06/data-lake-details/
     https://blog.pythian.com/reduce-costs-by-adding-a-data-lake-to-your-cloud-data-warehouse/
     http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/
     http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/
     http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx
     http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/
     http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
     http://www.martinsights.com/?p=1088
     http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/
     http://www.martinsights.com/?p=1082
     http://www.martinsights.com/?p=1094
     http://www.martinsights.com/?p=1102
  8. Adam: 2 min/11 total. Let’s expand on this concept of leaders versus laggards just a bit. There are different stages of enterprise data maturity, as we see on this slide. Organizations go through several stages in this process, from being reactive or informative with data to being predictive and transformative with data. And with every step that an organization takes along these stages, its ability to be successful in digital transformation accelerates. The reason for this acceleration is simple, and to me the secret is found in the seven most important words on this slide, the seven words that define the transformative end of the spectrum: “any data, any source, anywhere at scale”. This is an essential and ambitious goal for any organization. What about third-party governmental data about demographics and income? Yes, any data. How about data formats that you have not seen before, which come from systems arriving with a recent acquisition? Yes, any source. What about data generated by devices that are only intermittently connected to the internet? Yes, anywhere. How about data that comes in 100 times as fast as it ever came in before because a movie star mentioned your product or service? Yes, at scale. The more data that customers bring to the cloud and make available for AI, the more successful they can become. As customers increasingly realize this, they start to leverage AI more and more, creating a demand pipeline for additional data to go to the cloud. Let’s drill down on that next.
  9. Data Fabric adds: data access, data policies, data catalog, MDM, data virtualization, data scientist tools, APIs, building blocks, products
  10. Data Fabric adds: data access, data policies, data catalog, MDM, data virtualization, data scientist tools, APIs, building blocks, products
  11. Delta Lake, Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi).
  12. Reliability. Keeping the data lake and warehouse consistent is difficult and costly. Continuous engineering is required to ETL data between the two systems and make it available to high-performance decision support and BI. Each ETL step also risks incurring failures or introducing bugs that reduce data quality, e.g., due to subtle differences between the data lake and warehouse engines.
     Data staleness. The data in the warehouse is stale compared to that of the data lake, with new data frequently taking days to load. This is a step back compared to the first generation of analytics systems, where new operational data was immediately available for queries. According to a survey by Dimensional Research and Fivetran, 86% of analysts use out-of-date data and 62% report waiting on engineering resources numerous times per month [47].
     Limited support for advanced analytics. Businesses want to ask predictive questions using their warehousing data, e.g., “which customers should I offer discounts to?” Despite much research on the confluence of ML and data management, none of the leading machine learning systems, such as TensorFlow, PyTorch and XGBoost, work well on top of warehouses. Unlike BI queries, which extract a small amount of data, these systems need to process large datasets using complex non-SQL code. Reading this data via ODBC/JDBC is inefficient, and there is no way to directly access the internal warehouse proprietary formats. For these use cases, warehouse vendors recommend exporting data to files, which further increases complexity and staleness (adding a third ETL step!). Alternatively, users can run these systems against data lake data in open formats. However, they then lose rich management features from data warehouses, such as ACID transactions, data versioning and indexing.
     Total cost of ownership. Apart from paying for continuous ETL, users pay double the storage cost for data copied to a warehouse, and commercial warehouses lock data into proprietary formats that increase the cost of migrating data or workloads to other systems.
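The management features named in the note above (ACID transactions, data versioning) are what open table formats try to bring to files in the lake. The following is only a minimal sketch of the general idea, assuming an append-only commit log kept alongside immutable data files. The `TableLog` class and the file names are hypothetical, and this is not the actual metadata layout of Delta Lake, Hudi or Iceberg.

```python
from dataclasses import dataclass, field


@dataclass
class TableLog:
    """Toy append-only commit log over immutable data files (illustration only)."""

    # Each commit is a dict: {"add": [file names], "remove": [file names]}
    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Record one change as a single log entry and return its version number."""
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay the log up to `version` (inclusive) to get the live set of files."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for entry in self.commits[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return sorted(files)


log = TableLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
# An "update" rewrites a file: the old file is dropped and the new one added in one commit.
v2 = log.commit(add=["part-002.parquet"], remove=["part-000.parquet"])

print(log.snapshot())    # files that make up the current table
print(log.snapshot(v1))  # files that made up the table as of version 1
```

Because readers only ever see whole commits, a rewrite appears all at once or not at all, and older versions stay readable for as long as the old files are retained.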
  13. Speed: Queries against relational storage will always be faster than against a data lake (roughly 5X) because of missing features in the data lake, such as the lack of statistics, query plans, result-set caching, materialized views, in-memory caching, SSD-based caches, indexes, and the ability to design and align data and tables.
     Counter: DirectParquet, CSV 2.0, query acceleration, predicate pushdown, and SQL on-demand auto-scaling are some of the features that can make queries against ADLS nearly as fast as a relational database. Then there are features like Delta Lake and the ability to use statistics for external tables that can add even more performance. Plus you can also import the data into Power BI, use Power BI aggregation tables, or import the data into Azure Analysis Services to get even faster performance. Another thing to keep in mind affecting query performance is that Synapse is a massively parallel processing (MPP) technology that has features such as replicated tables for smaller tables (i.e. dimension tables) and distributed tables for large tables (i.e. fact tables), with the ability to control how they are distributed across storage (hash, round-robin). This could provide much greater performance compared to a data lake that uses HDFS, where large files are chunked across the storage.
     Security: Row-level security (RLS), column-level security, dynamic data masking, and data discovery & classification are security-related features that are not available in a data lake.
     Counter: Use RLS in Power BI or RLS on external tables instead of RLS on a database table, which then allows you to use result-set caching in Synapse.
     Complexity: Schema-on-read (ADLS) is more complex to query than schema-on-write (relational database). Schema-on-read means the end-user must define the metadata, whereas with schema-on-write the metadata is stored along with the data. Then there is the difficulty of querying in a file-based world compared to a relational database world.
     Counter: Create a SQL relational view on top of files in the data lake so the end-user does not have to create the metadata, which will make it appear to the end-user that the data is in a relational database. Or you could import the data from the data lake into Power BI, creating a star schema model in a Power BI dataset. But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources. Having the metadata along with the data in a relational database allows everyone to be on the same page as to what the data actually means, versus more of a wild west with a data lake.
     Missing features: Auditing, referential integrity, ACID compliance, updating/deleting rows of data, data caching, Transparent Data Encryption (TDE), workload management, and full support of T-SQL are all unavailable in a data lake.
     Counter: Some of these features can be accomplished when using Delta Lake, Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi), but they will not be as easy to implement as in a relational database and you will be locked into using Spark. Also, features being added to Blob Storage (see More Azure Blob Storage enhancements) can be used instead of resorting to Delta Lake, such as blob versioning as a replacement for time travel in Delta Lake.
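The "SQL relational view on top of files" counter from the note above can be illustrated with a small, hedged sketch. It assumes Python's built-in `sqlite3` as a stand-in for a lake query engine and an in-memory CSV string as a stand-in for a file in the lake. The table, view and column names are hypothetical, and this is not Synapse syntax.

```python
import csv
import io
import sqlite3

# Stand-in for a raw file in the lake: untyped text, schema not enforced anywhere.
RAW_CSV = """order_id,amount,order_date
1,19.99,2021-08-01
2,5.00,2021-08-02
3,42.50,2021-08-02
"""

conn = sqlite3.connect(":memory:")

# Staging table holds the values exactly as they appear in the file (all text).
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, order_date TEXT)")
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
conn.executemany(
    "INSERT INTO raw_orders VALUES (:order_id, :amount, :order_date)", rows
)

# The view defines names and types once, so each consumer does not have to
# re-declare the metadata (the schema-on-read burden) in every query.
conn.execute(
    """
    CREATE VIEW orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           order_date
    FROM raw_orders
    """
)

# End users query the view as if it were an ordinary relational table.
daily_totals = conn.execute(
    "SELECT order_date, ROUND(SUM(amount), 2) "
    "FROM orders GROUP BY order_date ORDER BY order_date"
).fetchall()
print(daily_totals)
```

The design point is that the type and naming decisions live in one shared definition rather than in each analyst's query, which is the gap the note describes between schema-on-read and schema-on-write.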
  14. What is Data Mesh? Data Mesh, an approach founded by Zhamak Dehghani, refers to a decentralized, distributed approach to enterprise data management. It is a holistic concept that sees different datasets as distributed products, orientated around domains. The idea is that each domain-specific dataset has its own embedded engineers and product owners to manage that data and its availability to other teams, driving a level of data ownership and responsibility which is often lacking in current data platforms that are largely centralised, monolithic, and often built around complex pipelines.
     MDW: missing are data access, data policies, data catalog, MDM, data virtualization, data scientist tools, APIs, building blocks, products.
     Data fabric: HR, finance, payroll, operations.
     Concerning data virtualization, if a data fabric uses data virtualization to keep data in place, then I would say a data fabric and a data mesh are the same thing. Maybe a difference is that a data mesh has standards/frameworks on how each domain handles its data, treating data as a product, where a data fabric does not have that.
     Data lakehouse: tradeoffs: speed, security, no metadata, extra complexity, MDM, referential integrity.
     Leading edge: security/ABAC, data ingestion, cataloging, size of 3rd-party data sources, using it to feed products, data explorer/marketplace, GSL, data virtualization, APIs.
     Other companies (generalities): most are at MDW, depends on size.
     Microsoft: Enterprise Scale Analytics and AI.
     MS relationship: advancing capabilities, giving feedback, executive awareness.
     Microsoft gaps: ABAC, MDM, data virtualization (Denodo, Dremio, faxes).
     Removed: How TDF meets business goals and strategic objectives; how TDF addresses stakeholder concerns.
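To make "data as a product, owned by a domain" a little more concrete, here is a minimal sketch assuming each domain team publishes a small descriptor to a shared catalog. The `DataProduct` and `Catalog` classes, the field names and the `example://` endpoints are all hypothetical illustrations, not part of any data mesh standard or product.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes for one dataset."""

    name: str
    domain: str    # owning domain, e.g. "hr" or "finance"
    owner: str     # team accountable for quality and availability
    schema: dict   # column name -> type: the contract offered to consumers
    endpoint: str  # where consumers read the data (placeholder URI)


@dataclass
class Catalog:
    """Shared registry so other teams can discover products across domains."""

    products: dict = field(default_factory=dict)

    def publish(self, product: DataProduct) -> None:
        key = (product.domain, product.name)
        if key in self.products:
            raise ValueError(f"{product.domain}/{product.name} is already published")
        self.products[key] = product

    def by_domain(self, domain: str) -> list:
        """Return the names of all products owned by one domain."""
        return sorted(p.name for (d, _), p in self.products.items() if d == domain)


catalog = Catalog()
catalog.publish(DataProduct(
    name="payroll_runs", domain="hr", owner="hr-data-team",
    schema={"employee_id": "int", "net_pay": "decimal"},
    endpoint="example://hr/payroll_runs",
))
catalog.publish(DataProduct(
    name="ledger_entries", domain="finance", owner="finance-data-team",
    schema={"entry_id": "int", "amount": "decimal"},
    endpoint="example://finance/ledger_entries",
))

print(catalog.by_domain("hr"))
```

The data itself stays with the owning domain. Only the descriptor is centralised, which is one way to read the "standards/frameworks" distinction drawn in the note.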