Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes Beyond the Data Lake

A talk presented by Max Schultze from Zalando and Arif Wider from ThoughtWorks at NDC Oslo 2020.

Abstract:
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando - Europe’s biggest online fashion retailer - we realised that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge - the data owners - while keeping only data governance and metadata information central. Such a decentralized and domain-focused approach has recently been termed a Data Mesh.
The Data Mesh paradigm promotes the concept of Data Products, which go beyond the sharing of files and move towards guarantees of quality and acknowledgement of data ownership.
This talk will take you on a journey of how we went from a centralized Data Lake to embracing a distributed Data Mesh architecture, and will outline the ongoing efforts to make the creation of data products as simple as applying a template.
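
To make the idea of creating a data product by "applying a template" more concrete, here is a minimal Python sketch. It is not Zalando's tooling: the `DataProduct` class, the `from_template` and `violations` helpers, the field names, and the bucket naming scheme are all hypothetical, chosen only to illustrate how a template could fill in ownership, an addressable location, and a simple quality guarantee.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor; field names are illustrative only."""
    name: str
    owner_team: str                   # acknowledged data ownership
    location: str                     # where consumers can address the data
    schema: Dict[str, str]            # column name -> type (self-describing)
    required_fields: Tuple[str, ...]  # a very simple quality guarantee


def from_template(name: str, owner_team: str, schema: Dict[str, str]) -> DataProduct:
    """Fill in the boilerplate a domain team should not have to repeat."""
    if not owner_team:
        raise ValueError("a data product needs an owning team")
    return DataProduct(
        name=name,
        owner_team=owner_team,
        # Assumed naming scheme, not taken from the talk.
        location=f"s3://{owner_team}-data-products/{name}/",
        schema=dict(schema),
        # By default, every declared field must be present in a record.
        required_fields=tuple(schema),
    )


def violations(product: DataProduct, record: dict) -> list:
    """Return the required fields that are missing from a record."""
    return [f for f in product.required_fields if record.get(f) is None]


checkout = from_template(
    "checkout-events", "checkout", {"order_id": "string", "total": "double"}
)
print(checkout.location)
print(violations(checkout, {"order_id": "42"}))
```

In this sketch the owning team supplies only a name and a schema; everything else comes from the template, which is one way to read the abstract's goal of lowering the cost of publishing a data product.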

Slides:

  1. Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake. Max Schultze (max.schultze@zalando.de, @mcs1408), Arif Wider (awider@thoughtworks.com, @arifwider). 12-06-2020
  2. Who are we? Max Schultze ● Lead Data Engineer ● MSc in Computer Science ● Took part in early development of Apache Flink ● Retired semi-professional Magic: the Gathering player. Arif Wider ● Lead Technology Consultant ● Head of AI, ThoughtWorks Germany ● Scala & FP enthusiast ● Coffee geek
  3. 7000+ technologists with 43 offices in 14 countries. Partner for technology-driven business transformation. Barcelona - Madrid - London - Manchester - Berlin - Hamburg - Munich - Cologne
  4. #1 in Agile and Continuous Delivery. 100+ books written. ©ThoughtWorks 2020
  5. #1 in Agile and Continuous Delivery. 100+ books written. ©ThoughtWorks 2020
  6. What to expect: Zalando Analytics Cloud Journey ● What’s this Data Mesh? ● Data Mesh in Practice
  7. Zalando Analytics Cloud Journey
  8. Legacy Analytics DWH
  9. Messaging Bus Data Lake Legacy Evolving
  10. Zalando’s Data Lake: Ingestion Storage Serving
  11. Zalando’s Data Lake: Web Tracking Event Bus DWH Data Center Ingestion Storage Serving
  12. Zalando’s Data Lake: Web Tracking Event Bus DWH Data Center Ingestion Storage Serving Metastore
  13. Zalando’s Data Lake: Data Catalog Web Tracking Event Bus DWH Data Center Ingestion Storage Serving Metastore Fast Query Layer Processing Platform
  14. Centralization Challenges: Datasets provided by data agnostic infrastructure team ● Lack of ownership
  15. Centralization Challenges: Datasets provided by data agnostic infrastructure team ● Lack of ownership. Pipeline responsibility on data agnostic infrastructure team ● Lack of quality. (Figure: table with columns Field_A, Field_B and rows Record_1 to Record_3)
  16. Centralization Challenges: Datasets provided by data agnostic infrastructure team ● Lack of ownership. Pipeline responsibility on data agnostic infrastructure team ● Lack of quality. Organizational scaling ● Central team becomes the bottleneck
  17. A Recurring Pattern: Product teams generating data. Data engineers maintaining the data platform. Decision makers, data scientists consuming data
  18. Why is that? central data platform
  19. Why is that? checkout service, checkout events
  20. What is Data Mesh? Old wine applied to new bottles… → Product Thinking → Domain-Driven Distributed Architecture → Infrastructure as a Platform … creates value from Data
  21. Data as a Product. Data Product: What is my market? What are the desires of my customers? What “price” is justified? How to do marketing? What’s the USP? Are my customers happy?
  22. Domain-Driven Distributed Data Architecture: Domain
  23. Domain-Driven Distributed Data Architecture → The Data Product is the fundamental building block. Domain ● Aggregated Domain
  24. Domain-Driven Distributed Data Architecture → The Data Product is the fundamental building block: Discoverable ● Addressable ● Self-describing ● Trustworthy ● Interoperable (governed by open standard) ● Secure (governed by global access control). Domain ● Aggregated Domain
  25. Self-Service Data Infrastructure. Data Infra as a Platform: Storage, pipeline, catalogue, access control, etc. Data infra engineers. → The Data Product is the fundamental building block: Discoverable ● Addressable ● Self-describing ● Trustworthy ● Interoperable (governed by open standard) ● Secure (governed by global access control). Domain ● Aggregated Domain
  26. Global Governance & Open Standards: Enable interoperability. An Ecosystem of Data Products. Data Infra as a Platform: Storage, pipeline, catalogue, access control, etc. Data infra engineers. → The Data Product is the fundamental building block: Discoverable ● Addressable ● Self-describing ● Trustworthy ● Interoperable (governed by open standard) ● Secure (governed by global access control). Domain ● Aggregated Domain
  27. It’s a mindset shift, FROM → TO: Centralized ownership → Decentralized ownership ● Pipelines as first class concern → Domain Data as first class concern ● Data as a by-product → Data as a Product ● Siloed Data Engineering Team → Cross-functional Domain-Data Teams ● Centralized Data Lake / Warehouse → Ecosystem of Data Products
  28. Data Mesh in Practice
  29. Data Mesh in Practice. Recap: ● From Bottleneck to Infra Platform. Data Infra as a Platform: Storage, pipeline, catalogue, access control, etc
  30. Data Mesh in Practice. Recap: ● From Bottleneck to Infra Platform ● From Data Monolith to Interoperable Services. Data Infra as a Platform: Storage, pipeline, catalogue, access control, etc. central data platform
  31. Central Services with Global Interoperability: Data Lake Storage ● Metadata Layer
  32. Bring Your Own Bucket (BYOB): Data Lake Storage ● Metadata Layer
  33. Central Processing Platform: Data Lake Storage ● Processing Platform ● Metadata Layer
  34. Simplify Data Sharing: Data Lake Storage ● Processing Platform ● Metadata Layer
  35. Central Services with Global Interoperability: Decentralized ownership does not imply decentralized infrastructure! Interoperability is created through convenient solutions of a self-service platform. Decentral Storage ● Central Infrastructure ● Decentral Ownership ● Central Governance
  36. Data Mesh in Practice. Recap: ● Datasets provided through pipelines of data agnostic infrastructure teams
  37. Data Mesh in Practice. Recap: ● Datasets provided through pipelines of data agnostic infrastructure teams. Who is allowed to share data? What are the criteria to enable data consumers? How to ensure data quality?
  38. How to Ensure Data Quality? Make conscious decisions ● Opt-in instead of default storage
  39. How to Ensure Data Quality? Make conscious decisions ● Opt-in instead of default storage ● Classification of data usage
  40. Data Quality - A Contract between Consumer and Producer. Behavioral changes for data producers ● Data is a product not a by-product
  41. Data Quality - A Contract between Consumer and Producer. Behavioral changes for data producers ● Data is a product not a by-product ● Dedicate resources to ○ Understand usage ○ Ensure quality
  42. Into the Future
  43. Into the Future ● Domain Enterprise Architecture ○ Definition of domain responsibilities ○ Appointment of domain specific experts
  44. Into the Future ● Domain Enterprise Architecture ○ Definition of domain responsibilities ○ Appointment of domain specific experts ● “Off the shelf” data products ○ De-centralized archiving ○ Template driven data preparation
  45. Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake. Max Schultze (max.schultze@zalando.de, @mcs1408), Arif Wider (awider@thoughtworks.com, @arifwider)
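
The combination described on slides 32 to 39 (storage stays in team-owned buckets, a central metadata layer provides governance, sharing is opt-in, and data usage is classified) can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the platform shown in the talk: `MetadataLayer`, `Registration`, the `Usage` classes, and the bucket path are hypothetical names, and the slides do not list concrete usage classes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Usage(Enum):
    """Hypothetical usage classes; the slides do not name concrete values."""
    EXPLORATION = "exploration"
    PRODUCTION = "production"


@dataclass
class Registration:
    owner_team: str   # decentral ownership
    bucket: str       # decentral storage: the owning team's own bucket (BYOB)
    usage: Usage      # classification of data usage


class MetadataLayer:
    """Central catalogue: it knows where data lives but does not store it."""

    def __init__(self) -> None:
        self._entries: Dict[str, Registration] = {}

    def opt_in(self, dataset: str, registration: Registration) -> None:
        # Nothing is shared by default; owners register explicitly.
        self._entries[dataset] = registration

    def resolve(self, dataset: str, needed: Usage) -> str:
        """Return the bucket for a dataset if it is shared for the needed usage."""
        entry = self._entries.get(dataset)
        if entry is None:
            raise LookupError(f"{dataset} has not been opted in for sharing")
        if needed is Usage.PRODUCTION and entry.usage is not Usage.PRODUCTION:
            raise PermissionError(f"{dataset} is not classified for production use")
        return entry.bucket


catalog = MetadataLayer()
catalog.opt_in(
    "checkout-events",
    Registration("checkout", "s3://checkout-team-bucket/events/", Usage.PRODUCTION),
)
print(catalog.resolve("checkout-events", Usage.PRODUCTION))
```

The design choice illustrated here is that the central component holds only metadata and access rules, so ownership and storage can remain with the domain team while consumers still have a single place to discover and resolve datasets.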
