
Everything Has Changed Except Us: Modernizing the Data Warehouse


Keynote, Munich, June 2016
The way we make decisions has changed. The data we use has changed. The techniques we can apply to data and decisions have changed. Yet what we build and how we build it has barely changed in 20 years.
 
The definition of madness is doing more of what you already do and expecting different results. The threat to the data warehouse is not from new technology that will replace the data warehouse. It is from destabilization caused by new technology as it changes the architecture, and from failure to adapt to those changes.
 
The technology that we use is problematic because it constrains and sometimes prevents necessary activities. We don’t need more technology and bigger machines. We need different technology that does different things. More product features from the same vendors won’t solve the problem.
 
The data we want to use is challenging. We can’t model and clean and maintain it fast enough. We don’t need more data modeling to solve this problem. We need less modeling and more metadata.
 
And lastly, a change in scale has occurred. It isn’t a simple problem of “big”. The problem with current workloads has been solved, despite the performance problems that many people still have today. Scale has many dimensions – important among them are the number of discrete sources and structures, the rate of change of individual structures, the rate of change in data use, the variety of uses and the concurrency of those uses.
 
In short, we need new architecture that is not focused on creating stability in data, but one that is adaptable to continuous and rapidly changing uses of data.



Everything Has Changed Except Us: Modernizing the Data Warehouse

  1. Everything Has Changed Except Us: Modernizing the Data Warehouse Architecture. June 2016. Mark Madsen, www.ThirdNature.net, @markmadsen
  2. Copyright Third Nature, Inc. The big data market is: … a leap forward. A leap in evolution to a more flexible way of gathering data and generating useful information. ▪ Decide if data is good enough at the time of use ▪ Files are flexible ▪ Save time collecting data ▪ “self service” A new approach to managing and using data.
  3. The big data market is: … a step backward. A step backward to methods not capable of providing the quality, manageability, accessibility and reuse we need. ▪ Ensure data quality up front rather than after the fact ▪ Tables are stable ▪ Save time using data ▪ “self service” A reinvention of the wheel, going back to the manual era.
  4. The architecture we are (still) using today. The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published: “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol. 27, No. 1 (1988).
  5. Environment in 1988: no big data, only big hair. ▪ No real commercial email, public internet barely started ▪ Storage state of the art: 100MB, cost $10,000/GB ▪ Oracle Applications v1 GL released; SAP goes public, enters US market ▪ Unix is mostly run by long-haired freaks ▪ Mobile was this. The environment: scarcity of data, of reports, of computer systems (outside of core financials), of resources, of money to pay for storage.
  6. Data warehouse: centralize, that solves all problems! Creates bottlenecks. Causes scale problems. Enforces a single model.
  7. The data lake solution: no central authority. wtf, it was fully operational!
  8. The data lake solution? There’s a problem: as the lake is envisioned, it is still a centralized data architecture, but this time there is no single global model. Instead it’s files and not modeled. It’s still a death star.
  9. Eventually we run into the same problems. Seriously, wtf? It was agile and operational.
  10. Growing complexity has changed our environment. You can barely keep up with source changes. You can’t keep up with new data requests. You are scale, performance and latency limited. But: the organizations want more, different, current data. The reality is continuous change – lots of maintenance.
  11. Meanwhile, we’ve become the department of “No”. • Environments are more complex • Data volumes are 10,000X larger • Data is more complex • More use of BI • We have entirely new data uses. We’ve been struggling with shrinking load windows, performance problems, and an inability to quickly meet new data and analytics requests for years, yet we keep using the same design.
  12. I never said the “E” in EDW meant “everything”… What do you mean, “3 months?”
  13. It’s going to get a lot worse. [Diagram labels: Not E; E] Conclusion: any methodology built on the premise that you must know and model all data before using it is untenable.
  14. Streams of events and observations. This data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
  15. Declarations. The problems of events apply to text. Which can be an event payload.
  16. Categorizing the measurement data we collect. The convenient data is the transactional data. ▪ Goes in the DW and is used, even if it isn’t the right measurement. The inconvenient data is the observational data. ▪ It’s not neat, clean, or designed into most systems of operation. The difficult data is the declarative data. ▪ What people say and what they do require ground truth. We need an architecture that supports all three categories.
  17. New uses: analytics embiggens the data volume problem. Many of the algorithms we need to make sense of the observations and declarations are O(n²) or worse.
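
A rough illustration of the O(n²) point on slide 17: an all-pairs comparison (for example, naive matching of every record against every other) needs n(n-1)/2 comparisons. The function names and record counts below are illustrative assumptions, not taken from the talk.

```python
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Closed form for the number of unordered pairs among n records."""
    return n * (n - 1) // 2


def count_pairs_brute_force(records: list) -> int:
    """Count the pairs by enumerating them, to check the closed form."""
    return sum(1 for _ in combinations(records, 2))


if __name__ == "__main__":
    # Hypothetical record counts, chosen only to show the growth rate.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} records -> {pairwise_comparisons(n):>13,} comparisons")
```

Ten times the records costs roughly a hundred times the comparisons, which is why adding analytics makes a volume problem worse than the raw row counts suggest.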
  18. Publishing is the primary view of BI, self service, big data.
  19. What people do with data today: not just read it. [Diagram labels: Explore and Understand; Inform and Explain; Convince and Decide; Deliver; Process; Collect]
  20. Data architecture requires understanding data use so we can build the right infrastructure. [Diagram labels: Collect new data; Monitor; Analyze Exceptions; Analyze Causes; Decide; Act; Act on the process; Act within the process] We need to focus on what people do with information as the primary task, not on the data or the technology.
  21. The usage models for conventional BI. [Diagram labels: Collect new data; Monitor; Analyze Exceptions; Analyze Causes; Decide; Act; No problem; No idea; Do nothing; Act on the process (usually days/longer timeframe); Act within the process (usually real-time to daily)] This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP.
  22. The usage models for analytics and “big data”. [Diagram labels: Collect new data; Monitor; Analyze Exceptions; Analyze Causes; Decide; Act; No problem; No idea; Do nothing; Act on the process (usually days/longer timeframe); Act within the process (usually real-time to daily)] Analytics and exploration is focused on new use cases: deeper analysis, causes, prediction, optimizing decisions. This isn’t ad-hoc query, reporting, or OLAP.
  23. BI is part of a dynamic system. There are feedback loops that can change both the data and the model. [Diagram labels: Collect new data; Monitor; Analyze Exceptions; Analyze Causes; Decide; Act; Act on the process; Act within the process; Known: stable, can be modeled in advance; Unknown: new, you can’t model in advance]
  24. The root cause of our problems is change. Our rate of change is slower than the business. This can’t be fixed via buying more technology.
  25. We have a design for stability. We need one for adaptability.
  26. Agile methods without agile architectures fail.
  27. A core problem with one big schema is change.
  28. The big data market has an answer…
  29. Big data answer? Schema-on-read! Rather than make data fit the model, make the model fit the data. There’s a price to pay with using “schema-on-read” for everything. You won’t see the problems with this until you add a second application, and a third.
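
A minimal sketch of the price slide 29 describes, assuming raw JSON events and two hypothetical consumers (`billing_view` and `marketing_view`) that each apply their own schema at read time. The event fields and the rules each view applies are invented for illustration.

```python
import json

# Raw events are stored exactly as they arrived; no schema was enforced on write.
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.90", "ts": "2016-06-01T10:00:00"}',
    '{"user": "b2", "amount": 5, "ts": "2016-06-01T10:05:00"}',
    '{"user": "c3", "ts": "2016-06-01T10:07:00"}',
]


def billing_view(line: str) -> dict:
    """First application: a missing amount is read as 0.0."""
    rec = json.loads(line)
    return {"user": rec["user"], "amount": float(rec.get("amount", 0.0))}


def marketing_view(line: str):
    """Second application: drops events without an amount, rounds the rest."""
    rec = json.loads(line)
    if "amount" not in rec:
        return None
    return {"user": rec["user"], "amount": round(float(rec["amount"]))}


billing_total = sum(billing_view(line)["amount"] for line in RAW_EVENTS)
marketing_rows = [
    view for line in RAW_EVENTS if (view := marketing_view(line)) is not None
]
marketing_total = sum(row["amount"] for row in marketing_rows)

if __name__ == "__main__":
    print("billing:", billing_total, "over", len(RAW_EVENTS), "events")
    print("marketing:", marketing_total, "over", len(marketing_rows), "events")
```

Both readers are individually reasonable, yet they disagree on the event count and the total for the same files. With one application nobody notices; the disagreement only appears once a second reader brings its own schema.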
  30. The data lake: just dump the data in!
  31. Combine with self-service: we’ll figure it all out later! Aren’t we back where we started?
  32. Data hoarding is not a data management strategy.
  33. What does all of this imply? The data is not always known in advance, so it can’t be modeled in advance. The data architecture must be read-write from both the back and front, not a one-way data flow. The data written back may be repeatedly used, persistent data, or it may be temporary. The data may arrive with any frequency, and the rate may not be under your control.
  34. The solution to our problems isn’t technology, it’s architecture.
  35. Big data is a leap forward: flexibility, speed, save dev time. Big data is a fall backward: repeatability, quality, save user time. Choose one or the other.
  36. What if we could reconcile two opposing ideas using ideas borrowed from other domains? Big data: flexibility, speed, save dev time. BI / DW: repeatability, quality, save user time. Choose both.
  37. Separating fast from slow: pace layering and change in buildings. “How Buildings Learn”, Stewart Brand. Complex systems can be decomposed into layers that change at different rates.
  38. Fast layers: absorb change, propose solutions, learn. Get all the attention: fixtures like data discovery tools.
  39. Slow layers: integrate change, constrain options, remember. Do all the work: plumbing like data warehouses.
  40. Architecture: components and layering. Components above: flexibility, repurposing, quicker change above (Application). Layers below: stability, reuse, slow predictable change below (Infrastructure). We thought this was the schema… Infrastructure is just a layer carefully chosen after a lot of experience.
  41. Decoupling: design for isolation of unrelated change.
  42. In reality, you are building three systems, not one. There are three things happening inside a DW: ▪ Data acquisition ▪ Data management ▪ Data delivery. Isolate them and their uses of data from one another. [Diagram label: Data Warehouse] Separate the component systems to isolate unrelated change.
  43. The goal is to decouple: solve the application and infrastructure problems separately. [Diagram labels: Platform Services; Data Access (Deliver & Use); Data storage] This separates BI from other uses of data, allowing each type of use to structure the data specific to its own requirements. Data access is already somewhat separate today. Make the separation of different access methods a formal part of the architecture. Don’t force one model.
  44. The goal is to decouple: solve the application and infrastructure problems separately. [Diagram labels: Platform Services; Data Management (Process & Integrate); Data storage] Data management should not be subject to the constraints of a single use. Data management was blended with both data acquisition and structuring data for client tools. It should be an independent function.
  45. The goal is to decouple: solve the application and infrastructure problems separately. [Diagram labels: Data Acquisition (Collect & Store): Incremental, Batch, One-time copy, Real time; Platform Services; Data storage] Data arrives in many latencies, from real-time to one-time. Acquisition can’t be limited by the management or consumption layers. Data acquisition should not be directly tied to the needs of consumption. It must operate independently of data use.
  46. The full analytic environment subsumes all the functions of the data warehouse, and extends them. [Diagram labels: Data Acquisition (Collect & Store): Incremental, Batch, One-time copy, Real time; Platform Services; Data Management (Process & Integrate); Data Access (Deliver & Use); Data storage] The platform has to do more than serve queries; it has to be read-write.
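
One way to picture the decoupling in slides 42 to 46 is three functions that share nothing except storage. This is a sketch under assumptions: the `Storage` class, the area names (`raw`, `standardized`, `derived`) and the sample records are hypothetical, and a real platform would use separate systems rather than in-memory lists.

```python
from collections import defaultdict


class Storage:
    """Shared storage: the only thing the three components have in common."""

    def __init__(self):
        self._areas = defaultdict(list)

    def write(self, area: str, record: dict) -> None:
        self._areas[area].append(record)

    def read(self, area: str) -> list:
        return list(self._areas[area])


def acquire(storage: Storage, source_records: list) -> None:
    """Data acquisition: collect and store as-is, with no knowledge of consumers."""
    for rec in source_records:
        storage.write("raw", dict(rec))


def manage(storage: Storage) -> None:
    """Data management: process and integrate, independent of any single use."""
    for rec in storage.read("raw"):
        if "customer" in rec:
            storage.write("standardized", {
                "customer": rec["customer"].strip().upper(),
                "amount": float(rec.get("amount", 0)),
            })


def deliver_totals(storage: Storage) -> dict:
    """Data access: shape data for one use, then write the result back."""
    totals = defaultdict(float)
    for rec in storage.read("standardized"):
        totals[rec["customer"]] += rec["amount"]
    result = dict(totals)
    storage.write("derived", result)  # read-write, not a one-way flow
    return result


storage = Storage()
acquire(storage, [
    {"customer": " acme ", "amount": "10"},
    {"customer": "Acme", "amount": 2.5},
    {"note": "unmodeled event"},  # still collected, even though nothing uses it yet
])
manage(storage)
totals = deliver_totals(storage)
```

Because `acquire` never inspects the records, a new source or a new field does not force a change in `manage` or `deliver_totals`; each component can change on its own schedule.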
  47. DATA ARCHITECTURE. We’re so focused on the light switch that we’re not talking about the light.
  48. As with the code, decouple the data architecture. The core of the data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a data architecture that is not limiting: ▪ Deals with data and schema change easily ▪ Does not always require up front modeling ▪ Does not limit the format or structure of data ▪ Assumes a full range of data latencies, from streaming to one-time bulk loads, both in and out ▪ Supports different uses of the same data
  49. The data architecture must align with system components because each of them addresses different data needs. [Diagram labels: Incremental; Collect; Batch; One-time copy; Real time; Manage & Integrate; Data Acquisition (Collect & Store); Data Management (Process & Integrate); Data Access (Deliver & Use)] Data architecture is part of the mechanism for change isolation.
  50. The data is in zones of management, not isolating layers. [Diagram labels: Raw data in an immutable storage area; Standardized or enhanced data; Common or usage-specific data; Transient data] Relax control to enable self-service while avoiding a mess. Do not constrain access to one zone or to a single tool. Focus on visibility of data use, not control of data.
  51. Food supply chain: an analogy for data. Multiple contexts of use, differing quality levels. You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.
  52. This is not a single global data model. Data can live in more than one place, in more than one zone, in more than one form.
  53. This data architecture resolves rate of change problems. [Diagram labels: Raw data in an immutable storage area; Standardized or enhanced data; Common or usage-specific data; Transient data] New data of unknown value, simple requests for new data can land here first, with little work by IT. More effort applied to management, slower. Optimized for specific uses / workloads. Generally the slowest change.
  54. With abundant data our approach has to be inverted. The process we still use today: 1. Model 2. Collect 3. Analyze. The new process is: 1. Collect 2. Analyze 3. Model 4. Promote. This is a shift from planned design to adaptive design for data management.
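
The inverted process on slide 54 can be sketched as four small steps over a raw zone and a standardized zone. Everything here is an illustrative assumption: the function names mirror the slide's verbs, the sensor records are invented, and the 80% threshold in `model` is an arbitrary choice rather than anything the talk prescribes.

```python
from collections import Counter


def collect(raw_zone: list, records: list) -> None:
    """1. Collect: land records as they arrive, with no model applied."""
    raw_zone.extend(records)


def analyze(raw_zone: list) -> Counter:
    """2. Analyze: profile how often each (field, type) pair occurs."""
    profile = Counter()
    for rec in raw_zone:
        for field, value in rec.items():
            profile[(field, type(value).__name__)] += 1
    return profile


def model(profile: Counter, total: int, min_share: float = 0.8) -> dict:
    """3. Model: keep fields seen with one type in at least min_share of records."""
    return {
        field: type_name
        for (field, type_name), count in profile.items()
        if count / total >= min_share
    }


def promote(raw_zone: list, schema: dict) -> list:
    """4. Promote: copy conforming records to a managed zone; raw stays untouched."""
    promoted = []
    for rec in raw_zone:
        if all(type(rec.get(f)).__name__ == t for f, t in schema.items()):
            promoted.append({f: rec[f] for f in schema})
    return promoted


raw_zone: list = []
collect(raw_zone, [
    {"sensor": "s1", "temp": 21.5},
    {"sensor": "s2", "temp": 19.0, "debug": "x"},
    {"sensor": "s3", "temp": 20.1},
    {"sensor": "s4", "temp": "n/a"},
    {"sensor": "s5", "temp": 22.3},
])
profile = analyze(raw_zone)
schema = model(profile, len(raw_zone))
standardized_zone = promote(raw_zone, schema)
```

The model is derived from what was actually collected rather than decided up front, and the non-conforming record remains in the raw zone, so a later, different model can still use it.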
  55. New environment enables adaptive use of data. Data needs change, so you need a system for evolution as well as data infrastructure that provides stability. [Diagram labels: Data Acquisition (Collect & Store): Incremental, Batch, One-time copy, Real time; Data Lake / LDW; AE; Platform Services; Data Management (Process & Integrate); Data Access (Deliver & Use); Data storage] You can’t build this all at once. You need to grow it over time.
  56. What this means is… In this new environment the data warehouse still has enormous value. Data standards, rules and models provide the most directly usable data. Self-service can operate independently, yet still be part of the larger framework that includes the DW. Data management still functions, and the useful discoveries filter through from less-managed to more-managed areas.
  57. However, our role shifts. We no longer control so much as we guide. We aren’t designers of perfect governance and control, we are designers of an environment that is resilient in the face of change.
  58. Be the department of Yes. Instead of trying to model everything in advance, collect it. Instead of trying to control change, accept it. Instead of trying to control what people do with data, focus on visibility.
  59. CC Image Attributions. Thanks to the people who supplied the creative commons licensed images used in this presentation: pyramid_camel_rider.jpg - http://www.flickr.com/photos/khalid-almasoud/1528054134/; House on fire - http://flickr.com/photos/oldonliner/1485881035/; glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721; Building demolition - https://www.flickr.com/photos/gregpc/4429888820; peek_fence_dog.jpg - http://www.flickr.com/photos/webwalker/114998078/; donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/; chinese garden gate.jpg - http://www.flickr.com/photos/musaeum/420467301/
  60. About the Presenter. Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  61. About Third Nature. Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, information strategy and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.
