Hadoop and the Data Warehouse
    Patrick Angeles




1
About Me

    •   Director of Field Engineering at Cloudera
         •   Architect on several dozen Hadoop-based data solutions
             for Cloudera customers
    •   Started with Hadoop in 2008
         •   First Hadoop system processed set-top box log data
    •   Past life
         •   Java EE / Database Architect
         •   Web Data Mining
         •   Cryptography / Public Key Infrastructure



2
What is a Data Warehouse?




3
— The Oracle



4
Database Architecture 1.0




       Products
                                Inventory
       Customers       DB
                                Sales
       Orders




5
Database Architecture 1.0

     •   Dead simple
     •   Tables in 3rd normal form
     •   Reports are SQL queries that join through entity
         relationships and aggregate

                  SELECT   c.gender, p.product_name,
                           sum(o.qty), sum(o.price)
                  FROM     orders o, customers c, products p
                  WHERE    o.customer_id = c.id
                   AND     o.product_id = p.id
                   AND     o.day = '2013-03-21'
                  GROUP BY c.gender, p.product_name ;


6
Database Architecture 1.0

     •   Report queries can become expensive, redundant
     •   Build a layer of abstraction!
     •   Materialize the data to something closer to query
         form.
     •   Create reporting tables
          •   Decide on the report's columns
          •   What query criteria can be parameterized
          •   Periodicity of report generation
          •   Denormalize and aggregate
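
A minimal sketch of the reporting-table idea above, using Python's built-in sqlite3 module as a stand-in database. The table and column names (customers, products, orders, daily_sales_report) and the sample rows are assumptions made for illustration, not taken from the talk. The join and aggregation run once when the reporting table is built; the report itself then reads a single table with the day as its only parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source tables (hypothetical schema and sample data).
    CREATE TABLE customers (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE orders    (customer_id INTEGER, product_id INTEGER,
                            day TEXT, qty INTEGER, price REAL);
    INSERT INTO customers VALUES (1, 'F'), (2, 'M');
    INSERT INTO products  VALUES (10, 'widget'), (11, 'gadget');
    INSERT INTO orders VALUES
        (1, 10, '2013-03-21', 2, 20.0),
        (2, 10, '2013-03-21', 1, 10.0),
        (1, 11, '2013-03-21', 3, 45.0),
        (1, 10, '2013-03-22', 1, 10.0);

    -- Reporting table: denormalized and pre-aggregated per day.
    CREATE TABLE daily_sales_report AS
    SELECT o.day, c.gender, p.product_name,
           SUM(o.qty) AS total_qty, SUM(o.price) AS total_price
    FROM   orders o
    JOIN   customers c ON o.customer_id = c.id
    JOIN   products  p ON o.product_id = p.id
    GROUP BY o.day, c.gender, p.product_name;
""")


def daily_report(day):
    """Parameterized report: reads only the reporting table, no joins."""
    return conn.execute(
        "SELECT gender, product_name, total_qty, total_price "
        "FROM daily_sales_report WHERE day = ? "
        "ORDER BY gender, product_name",
        (day,),
    ).fetchall()


rows = daily_report("2013-03-21")
for row in rows:
    print(row)
```

In practice the reporting table would be rebuilt or appended to on the chosen schedule (for example nightly), which is where the periodicity decision comes in.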

7
Database Architecture 1.1




                               Inventory
               Customers
                                      Sales
                      Orders
           Products




8
Two Database Workloads

           Transactional     Analytic
              Record facts   Reveal patterns

          Write-optimized    Read-optimized

      Random reads/writes    Sequential reads

       Normalized schema     Denormalized schema



9
Analytical Database (2.0)




              Customers          Inventory

                     Orders             Sales
          Products




10
Analytical Database Architecture

      •   Column oriented storage
           •   Reduces I/O on multi-dimensional tables
           •   Improved compression
           •   Skip columns or row ranges
      •   Massively Parallel Processing
           •   Query planner breaks up a task to be executed on
               multiple hosts
      •   Shared-nothing Architecture
           •   Cluster nodes have independent storage and memory
      •   Slow writes, fast reads
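
A toy illustration of the column-oriented points above, in plain Python. It is only a sketch: the data, the "values touched" counter, and the run-length encoder are assumptions chosen to make the idea visible, and real column stores are far more sophisticated. The same table is held row by row and column by column; an aggregate over one column touches every field in the row layout but only that column in the columnar layout, and a column of repeated values compresses well.

```python
# The same four-row table in two layouts (hypothetical sample data).
rows = [
    {"day": "2013-03-21", "gender": "F", "product": "widget", "qty": 2, "price": 20.0},
    {"day": "2013-03-21", "gender": "M", "product": "widget", "qty": 1, "price": 10.0},
    {"day": "2013-03-21", "gender": "F", "product": "gadget", "qty": 3, "price": 45.0},
    {"day": "2013-03-22", "gender": "F", "product": "widget", "qty": 1, "price": 10.0},
]

# Column-oriented layout: one list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_qty_row_store(rows):
    """Sum qty from the row layout; every field of every row is read."""
    total, touched = 0, 0
    for r in rows:
        touched += len(r)      # the whole row comes off disk together
        total += r["qty"]
    return total, touched


def total_qty_column_store(columns):
    """Sum qty from the column layout; the other columns are skipped."""
    qty = columns["qty"]
    return sum(qty), len(qty)


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


row_total, row_touched = total_qty_row_store(rows)
col_total, col_touched = total_qty_column_store(columns)
day_runs = run_length_encode(columns["day"])
print(row_total, row_touched, col_total, col_touched, day_runs)
```

Both layouts give the same total; the difference is how many values have to be read to get it, which stands in here for I/O.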

11
Analytical Database




                    TX     Analytical
                    DB        DB




12
Data Transformation




                   TX      Analytical
                   DB         DB




13
Three Ways to Transform Data

      •   Transform Extract Load
           •   Query from transactional tables into target schema
      •   Extract Load Transform
           •   Load data into analytical database, transform and write
               to target schema
           •   No need for additional hardware
      •   Extract Transform Load
           •   Read data from transactional database into a grid
               system, transform, then write to analytical database
           •   Least load on tx and analytical systems
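
A small sketch contrasting the last two approaches above, again with sqlite3 standing in for both databases. Everything named here (the orders and daily_qty tables, the elt and etl functions, the sample rows) is hypothetical, and the plain Python loop in etl is only a placeholder for whatever grid system does the transformation. In ELT the raw rows land in the analytical database and SQL inside that database produces the target table; in ETL the transformation happens outside both databases and only the result is written.

```python
import sqlite3
from collections import defaultdict

TARGET_DDL = "CREATE TABLE daily_qty (day TEXT, product TEXT, total_qty INTEGER)"


def make_tx_db():
    """Transactional source with a few sample orders (hypothetical data)."""
    tx = sqlite3.connect(":memory:")
    tx.executescript("""
        CREATE TABLE orders (day TEXT, product TEXT, qty INTEGER);
        INSERT INTO orders VALUES
            ('2013-03-21', 'widget', 2),
            ('2013-03-21', 'widget', 1),
            ('2013-03-21', 'gadget', 3);
    """)
    return tx


def elt(tx):
    """Extract, Load, Transform: the analytical database does the work."""
    adb = sqlite3.connect(":memory:")
    adb.execute("CREATE TABLE raw_orders (day TEXT, product TEXT, qty INTEGER)")
    adb.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                    tx.execute("SELECT day, product, qty FROM orders"))
    adb.execute(TARGET_DDL)
    adb.execute("INSERT INTO daily_qty "
                "SELECT day, product, SUM(qty) FROM raw_orders "
                "GROUP BY day, product")
    return adb


def etl(tx):
    """Extract, Transform, Load: transform outside both databases."""
    totals = defaultdict(int)
    for day, product, qty in tx.execute("SELECT day, product, qty FROM orders"):
        totals[(day, product)] += qty
    adb = sqlite3.connect(":memory:")
    adb.execute(TARGET_DDL)
    adb.executemany("INSERT INTO daily_qty VALUES (?, ?, ?)",
                    [(d, p, q) for (d, p), q in totals.items()])
    return adb


def target_rows(adb):
    return adb.execute(
        "SELECT day, product, total_qty FROM daily_qty ORDER BY product"
    ).fetchall()


tx = make_tx_db()
elt_rows = target_rows(elt(tx))
etl_rows = target_rows(etl(tx))
print(elt_rows == etl_rows, elt_rows)
```

Both paths end with the same target table; what differs is which system spends the compute, which is the trade-off the bullets describe.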

14
Business Intelligence Tools




             TX          Analytical
                                      BI
             DB             DB




15
Business Intelligence Tools

      • Can provide canned reports, dashboards, or
        interactive visualizations
      • Typically leverage common standards (SQL,
        JDBC/ODBC) to access data
      • Require low-latency response times from the database
        (sub-second or sub-minute, depending on the query)




16
Observations

      • Separate transactional from analytical workloads
      • Use appropriate database implementation
        according to the workload
          •   ‘Traditional’ row-major store for transactional
          •   MPP column-store for analytic
      • Consider a BI tool so you’re not stuck writing
        reports for analysts who don’t know SQL
      • Consider an ETL tool so you’re not stuck writing
        transformations for analysts who don’t know SQL


17
Welcome to the Enterprise




18
Basic Data Warehouse Architecture




             TX                   BI
                        DW
             DB




19
Data Marts


                       Sales




           TX          Mktg    BI
                  DW
           DB




                       Prch




20
Multiple Data Sources

          TX
          DB                  Sales




         Files           DW   Mktg    BI




         other                Prch




21
Operational Data Store

       TX
       DB                          Sales




      Files                        Mktg    BI
                ODS           DW




      other                        Prch




22
Where’s Hadoop?




23
No Hadoop

      TX
      DB                    Sales




      Files                 Mktg    BI
                 ODS   DW




     other                  Prch




24
Adjacent System

       TX
       DB                   Sales




      Files                 Mktg    BI
                       DW



                ODS
      other                 Prch




25
ETL Engine

       TX
       DB              Sales




      Files            Mktg    BI
                  DW




      other            Prch




26
Tiered Data Warehouse

             TX
             DB              Sales




            Files            Mktg    BI




            other            Prch




27
Analytical Query Engine

               TX
               DB




              Files            BI




              other




28
Simple Database Architecture




        Products
                                    Inventory
        Customers       DB          Sales
        Orders




29
The future?




        Products
                    Inventory
        Customers
                    Sales
        Orders




30
http://www.hbasecon.com/
            San Francisco
            June 13, 2013




31
32

Hadoop and Enterprise Data Warehouse

  • 1. Hadoop and the Data Warehouse Patrick Angeles 1
  • 2. About Me • Director of Field Engineering at Cloudera • Architect on several dozen Hadoop-based data solutions for Cloudera customers • Started with Hadoop in 2008 • First Hadoop system processed set-top box log data • Past life • Java EE / Database Architect • Web Data Mining • Cryptography / Public Key Infrastructure 2
  • 3. What is a Data Warehouse? 3
  • 5. Database Architecture 1.0 Products Inventory Customers DB Sales Orders 5
  • 6. Database Architecture 1.0 • Dead simple • Tables in 3rd normal form • Reports are SQL queries that join through entity relationships and aggregate SELECT c.gender, p.product_name, sum(o.qty), sum(o.price) FROM order o, customer c, product p WHERE o.customer_id = c.id AND o.product_id = p.id AND o.day = ’2013-03-21’ GROUP BY c.gender, p.product_name ; 6
  • 7. Database Architecture 1.0 • Report queries can become expensive, redundant • Build a layer of abstraction! • Materialize the data to something closer to query form. • Create reporting tables • Decide on the reports columns • What query criteria can be parameterized • Periodicity of report generation • Denormalize and aggregate 7
  • 8. Database Architecture 1.1 Inventory Customers Sales Orders Products 8
  • 9. Two Database Workloads Transactional Analytic Record facts Reveal patterns Write-optimized Read-optimized Random reads/writes Sequential reads Normalized schema Denormalized schema 9
  • 10. Analytical Database (2.0) Customers Inventory Orders Sales Products 10
  • 11. Analytical Database Architecture • Column oriented storage • Reduces I/O on multi-dimensional tables • Improved compression • Skip columns or row ranges • Massively Parallel Processing • Query planner breaks up a task to be executed on multiple hosts • Shared-nothing Architecture • Cluster nodes have independent storage and memory • Slow writes, fast reads 11
  • 12. Analytical Database [Diagram labels: TX DB, Analytical DB]
  • 13. Data Transformation [Diagram labels: TX DB, Analytical DB]
  • 14. Three Ways to Transform Data
      • Transform Extract Load
          • Query from transactional tables into target schema
      • Extract Load Transform
          • Load data into analytical database, transform and write to target schema
          • No need for additional hardware
      • Extract Transform Load
          • Read data from transactional database into a grid system, transform, then write to analytical database
          • Least load on tx and analytical systems
  • 15. Business Intelligence Tools [Diagram labels: TX DB, Analytical DB, BI]
  • 16. Business Intelligence Tools
      • Can provide canned reports, dashboards, or interactive visualizations
      • Typically leverage common standards (SQL, JDBC/ODBC) to access data
      • Require low-latency response times from the database (sub-second or sub-minute, depending on the query)
  • 17. Observations
      • Separate transactional from analytical workloads
      • Use appropriate database implementation according to the workload
          • ‘Traditional’ row-major store for transactional
          • MPP column-store for analytic
      • Consider a BI tool so you’re not stuck writing reports for analysts who don’t know SQL
      • Consider an ETL tool so you’re not stuck writing transformations for analysts who don’t know SQL
  • 18. Welcome to the Enterprise
  • 19. Basic Data Warehouse Architecture [Diagram labels: TX DB, DW, BI]
  • 20. Data Marts [Diagram labels: TX DB, DW, Sales, Mktg, Prch, BI]
  • 21. Multiple Data Sources [Diagram labels: TX DB, Files, other, DW, Sales, Mktg, Prch, BI]
  • 22. Operational Data Store [Diagram labels: TX DB, Files, other, ODS, DW, Sales, Mktg, Prch, BI]
  • 24. No Hadoop [Diagram labels: TX DB, Files, other, ODS, DW, Sales, Mktg, Prch, BI]
  • 25. Adjacent System [Diagram labels: TX DB, Files, other, ODS, DW, Sales, Mktg, Prch, BI]
  • 26. ETL Engine [Diagram labels: TX DB, Files, other, DW, Sales, Mktg, Prch, BI]
  • 27. Tiered Data Warehouse [Diagram labels: TX DB, Files, other, Sales, Mktg, Prch, BI]
  • 28. Analytical Query Engine [Diagram labels: TX DB, Files, other, BI]
  • 29. Simple Database Architecture [Diagram labels: DB; Products, Customers, Orders, Inventory, Sales]
  • 30. The future? [Diagram labels: Products, Customers, Orders, Inventory, Sales]
  • 31. http://www.hbasecon.com/ San Francisco, June 13, 2013
  • 32. 32
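
Slides 6 and 7 describe running a report as a join over normalized tables, then materializing it into a reporting table. The following is a minimal sketch of that idea, not the author's implementation. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in database; the table and column names follow the slide 6 query, while the sample rows, the `daily_sales_report` table, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized (3NF-style) transactional tables, named after the slide 6 query.
# "order" is a reserved word in SQL, so the table is called orders here.
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        product_id  INTEGER REFERENCES product(id),
        day         TEXT,
        qty         INTEGER,
        price       REAL
    );
    -- Made-up sample data, for illustration only.
    INSERT INTO customer VALUES (1, 'F'), (2, 'M');
    INSERT INTO product  VALUES (10, 'widget'), (11, 'gadget');
    INSERT INTO orders VALUES
        (100, 1, 10, '2013-03-21', 2, 20.0),
        (101, 1, 10, '2013-03-21', 1, 10.0),
        (102, 2, 11, '2013-03-21', 3, 45.0),
        (103, 2, 10, '2013-03-20', 5, 50.0);
""")


def adhoc_report(day):
    """Slide 6 style: join and aggregate the normalized tables on every request."""
    return conn.execute(
        """
        SELECT c.gender, p.product_name, SUM(o.qty), SUM(o.price)
        FROM orders o, customer c, product p
        WHERE o.customer_id = c.id
          AND o.product_id = p.id
          AND o.day = ?
        GROUP BY c.gender, p.product_name
        ORDER BY c.gender, p.product_name
        """,
        (day,),
    ).fetchall()


# Slide 7 style: denormalize and aggregate once into a reporting table.
# The day stays a column so it can be used as a query parameter later.
conn.execute("""
    CREATE TABLE daily_sales_report AS
    SELECT o.day, c.gender, p.product_name,
           SUM(o.qty) AS total_qty, SUM(o.price) AS total_price
    FROM orders o
    JOIN customer c ON o.customer_id = c.id
    JOIN product  p ON o.product_id = p.id
    GROUP BY o.day, c.gender, p.product_name
""")


def report_for_day(day):
    """Read the pre-aggregated reporting table: no joins at query time."""
    return conn.execute(
        """
        SELECT gender, product_name, total_qty, total_price
        FROM daily_sales_report
        WHERE day = ?
        ORDER BY gender, product_name
        """,
        (day,),
    ).fetchall()


if __name__ == "__main__":
    print(adhoc_report("2013-03-21"))
    print(report_for_day("2013-03-21"))
```

Both functions return the same rows for a given day. The reporting table trades freshness for cheaper reads: it has to be rebuilt or appended to on whatever schedule the report needs, which is the "periodicity" decision on slide 7.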

Editor’s Notes

  1. Architected scores of Hadoop-based data solutions
  2. Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
  3. Turns out that separating the transactional and reporting databases brings other benefits.
  4. I don’t need up-to-the-minute reports. Copy data to reporting DB. Now workloads don’t conflict. I can now have a different reporting schema. Faster queries. Now I have to worry about transforming data. I can now use different technology.
  5. Two other major components that haven’t been mentioned.
  6. I don’t need up-to-the-minute reports. Copy data to reporting DB. Now workloads don’t conflict. I can now have a different reporting schema. Faster queries. Now I have to worry about transforming data. I can now use different technology.
  7. Not a trivial thing… there’s an X-billion-dollar market segment dedicated to making this easier: Informatica, Pervasive, Ab Initio, Pentaho. Speaking of making things easier…
  8. Two things this allows you to do: use different underlying architectures for each database
  9. Not a trivial thing… there’s an X-billion-dollar market segment dedicated to making this easier: Informatica, Pervasive, Ab Initio, Pentaho. Speaking of making things easier…
  10. Two things this allows you to do: use different underlying architectures for each database
  11. Data marts designed for specific department needs. Kimball?
  12. Two things this allows you to do: use different underlying architectures for each database
  13. Ralph Kimball: The Data Warehouse Toolkit. Bill Inmon: Building the Data Warehouse.
  14. Challenge with normal grid-based ETL is you have to load data from source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop, thereby replacing the ETL grid. You could also forgo the ODS if there is one in the architecture. Option to enrich data that is published to the DW by running analytics not available to traditional DW/BI stack. E.g., clustering, classification, statistical
  15. Challenge with normal grid-based ETL is you have to load data from source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop, thereby replacing the ETL grid. You could also forgo the ODS if there is one in the architecture. Option to enrich data that is published to the DW by running analytics not available to traditional DW/BI stack. E.g., clustering, classification, statistical
  16. Store long-term data. Transform and load to data marts.
  17. Store long-term data. BI tools can readily query data in Hadoop using Impala.
  18. Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
  19. Support for insert/update semantics? HBase with typed columns.
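
The "Extract Transform Load" option on slide 14, and the notes about copying data to a separate reporting DB, can be sketched roughly as follows. This is an illustration under stated assumptions rather than anything from the deck: two in-memory SQLite databases stand in for the transactional and analytical systems, plain Python stands in for the "grid system" (or a Hadoop job) that does the transformation, and all table names, sample rows, and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def extract(tx_conn, day):
    """Extract: simple scans only, so the transactional DB does no joins."""
    orders = tx_conn.execute(
        "SELECT customer_id, product_id, qty, price FROM orders WHERE day = ?",
        (day,),
    ).fetchall()
    genders = dict(tx_conn.execute("SELECT id, gender FROM customer"))
    products = dict(tx_conn.execute("SELECT id, product_name FROM product"))
    return orders, genders, products


def transform(day, orders, genders, products):
    """Transform: denormalize and aggregate outside both databases."""
    totals = defaultdict(lambda: [0, 0.0])  # (gender, product) -> [qty, price]
    for customer_id, product_id, qty, price in orders:
        key = (genders[customer_id], products[product_id])
        totals[key][0] += qty
        totals[key][1] += price
    return [(day, g, p, q, amt) for (g, p), (q, amt) in sorted(totals.items())]


def load(dw_conn, rows):
    """Load: write finished rows into the analytical DB's reporting schema."""
    dw_conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales ("
        "day TEXT, gender TEXT, product_name TEXT, "
        "total_qty INTEGER, total_price REAL)"
    )
    dw_conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?, ?, ?)", rows)
    dw_conn.commit()


# Two separate databases: each connect(":memory:") call creates its own DB.
tx = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

# Made-up transactional data, for illustration only.
tx.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         product_id INTEGER, day TEXT, qty INTEGER, price REAL);
    INSERT INTO customer VALUES (1, 'F'), (2, 'M');
    INSERT INTO product  VALUES (10, 'widget'), (11, 'gadget');
    INSERT INTO orders VALUES
        (100, 1, 10, '2013-03-21', 2, 20.0),
        (101, 1, 10, '2013-03-21', 1, 10.0),
        (102, 2, 11, '2013-03-21', 3, 45.0);
""")

DAY = "2013-03-21"
load(dw, transform(DAY, *extract(tx, DAY)))

if __name__ == "__main__":
    print(dw.execute("SELECT * FROM daily_sales ORDER BY gender").fetchall())
```

The heavy work (join and aggregation) happens in the middle tier, which is why slide 14 describes this option as putting the least load on the tx and analytical systems. In the Extract Load Transform variant, the raw rows would be loaded into the analytical database first and the aggregation would run there as SQL.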