The Future of Data Warehousing: ETL Will Never be the Same

Traditional data warehouse ETL has become too slow, too complicated, and too expensive to address the torrent of new data sources and new analytic approaches needed for decision making. The new ETL environment is already looking drastically different.

In this webinar, Ralph Kimball, founder of the Kimball Group, and Manish Vipani, Vice President and Chief Architect of Enterprise Architecture at Kaiser Permanente, describe how this new ETL environment is actually implemented at Kaiser Permanente. They cover the successes, the unsolved challenges, and their visions of the future for data warehouse ETL.

Published in: Software

  1. The Future of Data Warehousing: ETL Will Never be the Same
     Ralph Kimball | Founder, Kimball Group
     Manish Vipani | Vice President and Chief Architect, Kaiser Permanente
     © Cloudera, Inc. All rights reserved.
  2. Hadoop’s impact on data warehousing
     • Traditional DBMS stack exploded into separate layers
       • Data layer: HDFS files, not curated relational tables
       • Metadata layer: open, extensible HCatalog, not vendor system tables
       • Query layer: cottage industry of query engines, not vendor-specific SQL
     • Schema on read
       • Allow the query layer to decide how to consume the data
       • Materialize the view later (e.g., into Parquet files) for high performance
     • Integration goes far beyond relational tables
       • Conformed dimensions remain the glue holding together Hadoop applications (even if you have never heard of conformed dimensions!)
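The schema-on-read idea on this slide can be illustrated with a minimal sketch. It is not taken from the webinar: the sample data, the column names, and the helpers `read_with_schema` and `materialize` are all hypothetical, and a plain in-memory string stands in for a raw HDFS file.

```python
import csv
import io

# Hypothetical raw file contents, stored exactly as received
# (no schema is enforced when the data is written).
RAW_EVENTS = "m1,2016-01-05,login\nm2,2016-01-05,refill\nm1,2016-01-06,refill\n"


def read_with_schema(raw_text, columns):
    """Apply a column layout at read time; the stored text is never changed."""
    reader = csv.reader(io.StringIO(raw_text))
    # zip() stops at the shorter sequence, so a narrower schema simply
    # ignores trailing fields.
    return [dict(zip(columns, row)) for row in reader]


def materialize(rows, key):
    """Materialize a view later: pre-aggregate row counts per key value."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


# Two consumers interpret the same raw data with different schemas.
activity_view = read_with_schema(RAW_EVENTS, ["member_id", "event_date", "action"])
audit_view = read_with_schema(RAW_EVENTS, ["subject", "timestamp"])

# A derived, query-friendly summary built only when it is needed.
actions_per_member = materialize(activity_view, "member_id")
```

The raw text is never rewritten. Each reader brings its own schema, and the aggregate plays the role that a materialized Parquet file would play on a real cluster.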
  3. The logical architecture hasn’t changed
     • Original Sources → ETL Step → Exposed Presentation Data → BI Application
     • BUT, the physical architecture of the back room now looks very different
  4. The old and new back rooms
     Old back room
     • Slow transfer from sources
     • Physical transformations required
     • Cleaning, normalization required
     • Mandated RDBMS table targets
     • Metadata limited to system tables
     • Presentation layer vendor mandated
     • Single focus: RDBMS SQL only
     New back room
     • Purpose built for high transfer rates
     • Physical transformations optional
     • Cleaning, normalization discouraged
     • Table targets optional or deferred
     • Extensible metadata via HCatalog
     • Presentation layer open ended
     • Before or after any transformations
     • Analytic client specific
     • Multiple simultaneous personalities
  5. The biggest change to the back room
     Old back room
     • Off limits except to ETL staff
       • “we aren’t ready”
       • “the data must be cleaned”
       • “data governance trumps”
       • “end users not trusted”
     • Traditional IT control
     New back room
     • Doors open to
       • Qualified analytic users
       • Automated processes
       • Experiments, model building
       • Clients other than SQL
     • Open data marketplace
  6. The Landing Zone at Kaiser Permanente
     Implementing the new ETL approach in the real world. A unified data repository for secure and trusted data.
  7. Landing Zone – Home to secure and organized data
     • A self-service data platform hosting both raw and prepared data sets for quick business consumption, to drive advanced business insights and decisions.
     • Allows seamless data access for authorized users across enterprise business functions.
     • Data is organized by domains/use cases in the Raw and Refined zones.
     • Perimeter security with data encrypted at rest.
     • Kerberized, with integration into the Identity and Access Management system.
     Parts of the Landing Zone
     • Raw Zone -> Exact replica of source data.
     • Refined Zone -> Transformed, prepared data sets organized by use case.
     • User Defined Space -> Secure and common access to raw and trusted data.
     • Master Data, Metadata, Internal Reference Data, Industry Reference Data, etc.
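As a rough illustration of the Raw Zone / Refined Zone split described on this slide, the sketch below keeps an unmodified copy of source rows under an information domain and derives a prepared data set for one use case from that copy. Everything in it is an assumption made for illustration: the dictionary `landing_zone`, the domain and use-case names, the sample rows, and the functions `load_raw` and `refine_usage` do not come from Kaiser Permanente's implementation.

```python
from collections import defaultdict

# Hypothetical source rows for one information domain.
source_rows = [
    {"member_id": "m1", "service": "refill", "minutes": 4},
    {"member_id": "m2", "service": "refill", "minutes": 6},
    {"member_id": "m1", "service": "messaging", "minutes": 3},
]

# In-memory stand-in for the two zones (a real platform would use HDFS paths).
landing_zone = {"raw": {}, "refined": {}}


def load_raw(zone, domain, rows):
    """Raw Zone: keep an exact copy of the source rows, organized by domain."""
    zone["raw"][domain] = [dict(row) for row in rows]


def refine_usage(zone, domain, use_case):
    """Refined Zone: derive a prepared data set for one use case from the raw copy."""
    totals = defaultdict(int)
    for row in zone["raw"][domain]:
        totals[row["service"]] += row["minutes"]
    zone["refined"][use_case] = dict(totals)


load_raw(landing_zone, "online_services", source_rows)
refine_usage(landing_zone, "online_services", "service_utilization")
```

Because the refined data set is always rebuilt from the raw copy, a new use case can add its own refinement later without touching the replica or re-extracting from the source system.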
  8. Landing Zone – A self-service data platform hosting both the raw and prepared data sets for quick business consumption.
     [Architecture diagram. Recoverable labels: Source Data; Data Load (Replicate, Extract-Load, Copy); HDFS holding the Raw Zone, Refined Zone, and User Defined Space; Master Data, Meta Data, Internal Reference Data, Industry Reference Data, Usage Data; Data Registry (Tags & Catalog); Perimeter Security, Access Authentication, Role Based Access Control, All Data Encrypted @ Rest; Data Selection via SQL, Java, Pig, Hive, Python, Impala; Exploratory Intelligence (Analyze, Mine, Refine, Discover); Data Extract to DW/DM.]
     • Data Security –
       • Deployed on secured network with traffic monitoring.
       • Data is encrypted at rest.
       • Role-based access and authorization.
     • Data Organization –
       • Exact replica of source data organized by information domains in the Raw Zone.
       • Data organized by use cases in the Refined Zone (transformed, prepared data sets).
       • Separate area allocated to track master data, metadata, internal reference data & industry-specific reference data sets.
  9. The ETL Revolution Poses Significant Challenges
     Some old, some new
  10. Old challenges we’ve seen before
     • Big data world is furiously implementing stovepipes
       • Good news is the excitement of new data sources and analyses
       • Bad news is ignoring integration; the fix is to start over
     • New departments not seen with traditional data warehousing
       • Not on anyone’s radar → rolling their own systems
       • Unusual business user profiles, latency demands, security lapses
     • Big speed bumps when replacing old systems with new
       • Users don’t want to switch
       • New results don’t match old results
     • Legacy hardware and software absurdly expensive, doesn’t scale reasonably
  11. New challenges needing inventive approaches
     • Traditional BI decision makers joined by
       • Data scientists
         • Roll their own ETL, hardware, OSs, programming languages
         • Take results to senior management directly
         • Don’t stick around for documentation, rollout, user support, maintenance
       • Predictive models and modelers
         • Constantly changing schemas
         • Tricky integration, e.g., joining relational tables to HBase
       • Automatic daemons
         • Enormous, bursty demand for computing resources
  12. Kaiser Permanente’s Pragmatic Response to the Challenges
     Pain Points:
     • Lack of user transient store and structural flexibility due to slow adaptation to changes.
     • Lack of ability to do analytics and hypothesis testing of new data from disparate systems.
     Successes:
     • More than 10 proven use cases with some early adopters.
  13. Landing Zone use cases
     Online Member Services – “kp.org”
     Problem
     • Lack insight to understand factors influencing members’ adoption and utilization of online services.
     • Lack data integration and correlation due to disparate systems.
     • Lack 360° member service utilization view and dashboards.
     Resolution
     • Summarized and aggregated data sets in the Landing Zone help improve decision making.
     • Faster and complete access to data at scale for metrics reporting and analytics.
     • Reduced data collection & metric reporting time from 3 weeks to 10 hours.
     • Ease of building “decision-centric” dashboards (8 in 3 months).
  14. Landing Zone use cases cont…
     Moving Traditional Data Warehouse Workload to Landing Zone
     Problem
     • Commercial large-scale data warehouse (Teradata) repository is expensive at scale, grows exponentially, and processes large volumes of queries/month.
     • Continuing workload tuning efforts are slow to yield expected results.
     Resolution
     • Replicate data from Teradata into the Landing Zone.
     • Rewrite and tune queries, eliminating semantically equivalent queries, to achieve better performance.
     Digital Services Dashboard – “Interchange”
     Problem
     • Lack of platform to collect and correlate structured and unstructured data from consumer-facing health monitoring devices, e.g., Fitbit, glucometer, etc.
     • Clinicians cannot track members’ health or weight goals, or see usage patterns.
     Resolution
     • Ingest transactional data and device logs into the Landing Zone and create an analytics workspace.
     • Enable clinicians to generate aggregated data for tracking member adherence and build dashboards using native tools.
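The slide does not say how semantically equivalent queries were identified during the Teradata workload move. One possible first pass, sketched below purely as an assumption, is to group queries in a log by a crude canonical text form. The `normalize` and `group_equivalent` functions and the sample queries are hypothetical, and textual normalization only catches formatting variants, not queries that are logically equivalent but written differently.

```python
import re
from collections import defaultdict


def normalize(sql):
    """Crude canonical form: collapse whitespace, drop a trailing ';', lowercase.

    Limitation: lowercasing also alters string literals, so this is only a
    candidate filter, not a proof of equivalence.
    """
    text = re.sub(r"\s+", " ", sql).strip()
    return text.rstrip(";").strip().lower()


def group_equivalent(queries):
    """Group the original query strings under their canonical form."""
    groups = defaultdict(list)
    for query in queries:
        groups[normalize(query)].append(query)
    return dict(groups)


# Hypothetical query log entries.
query_log = [
    "SELECT member_id FROM visits WHERE region = 1;",
    "select member_id\n  from visits where region = 1",
    "SELECT member_id FROM visits WHERE region = 2",
]

groups = group_equivalent(query_log)
# Canonical forms that appear more than once are candidates for consolidation.
duplicates = {key: found for key, found in groups.items() if len(found) > 1}
```

Anything flagged this way would still need review before queries are merged; a production approach would more likely compare parsed query plans than raw text.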
  15. Landing Zone use cases cont…
     Common Clinical and Analytical Views
     Problem
     • Sequential and fragmented processes have limited ability to enrich data sources to increase accuracy.
     • Lack of clinical and analytical views increases lead time to analysis and produces inconsistent results.
     Resolution
     • Ingest data from fragmented systems into the Landing Zone.
     • Created program-wide clinical and analytical views, cutting refresh time from 18 hours to 7 hours.
     Consumer Information Management Platform – CIMP 2.0
     Problem
     • Current Medicare reporting solution does not maintain history and requires significant effort to recreate prior reports and perform trend analysis.
     • Externally hosted CIMP systems are cost-prohibitive and difficult to scale.
     Resolution
     • Replicate data from 30+ source systems into the Landing Zone, providing access to data internally.
     • Rebuild reports with improved performance that run within a reasonable time at scale.
     • Proved versatility of the platform to handle data at scale and created equivalent reports.
  16. Architectural Wrap-Up
     What does all this mean?
  17. Kaiser Permanente is a work in progress with impressive early results, and insights for moving forward
     • Be the single source of all Kaiser’s data as well as external data leveraged by Kaiser applications, processes, and for Kaiser decision making.
     • “Learn and adapt” model provides common capabilities across a rich data set, with increased agility in provisioning new data sets.
     • Enabling data profiling / tagging, semantic search, and descriptive, predictive, and prescriptive analytics to drive advanced business insights and decisions.
  18. The Back Room Landing Zone has become a Vibrant Marketplace
     • Replaces the quiet ETL back room
     • Challenging (exciting) new service role for IT
     • Open for business
       • Data scientists → A/B testing → experimentation → prototyping
       • Simultaneous ETL pipelines → aggregates, high-performance Parquet files, uploads to EDW
       • Simultaneous SQL and non-SQL clients
     • Immediate access
       • Don’t wait for physical transformation → schema-on-read
       • Purpose built for extreme I/O performance
  19. Thank you
     Ralph Kimball, ralphcollector@gmail.com
     Manish Vipani, manish.x.vipani@kp.org