2. About Me
• Director of Field Engineering at Cloudera
• Architect on several dozen Hadoop-based data solutions
for Cloudera customers
• Started with Hadoop in 2008
• First Hadoop system processed set-top box log data
• Past life
• Java EE / Database Architect
• Web Data Mining
• Cryptography / Public Key Infrastructure
6. Database Architecture 1.0
• Dead simple
• Tables in 3rd normal form
• Reports are SQL queries that join through entity
relationships and aggregate
SELECT c.gender, p.product_name,
sum(o.qty), sum(o.price)
FROM order o, customer c, product p
WHERE o.customer_id = c.id
AND o.product_id = p.id
AND o.day = '2013-03-21'
GROUP BY c.gender, p.product_name ;
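The joined report query above can be run end to end. A minimal sketch using Python's built-in sqlite3 module, with the slide's schema recreated and a couple of invented rows for illustration (sqlite requires quoting `order`, a reserved word):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal 3rd-normal-form schema matching the slide's query
cur.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE "order"  (customer_id INTEGER, product_id INTEGER,
                       qty INTEGER, price REAL, day TEXT);
INSERT INTO customer VALUES (1, 'F'), (2, 'M');
INSERT INTO product  VALUES (1, 'widget');
INSERT INTO "order"  VALUES (1, 1, 2, 10.0, '2013-03-21'),
                            (2, 1, 1,  5.0, '2013-03-21');
""")

# The report: join through the entity relationships and aggregate
rows = sorted(cur.execute("""
SELECT c.gender, p.product_name, SUM(o.qty), SUM(o.price)
FROM "order" o, customer c, product p
WHERE o.customer_id = c.id
  AND o.product_id = p.id
  AND o.day = '2013-03-21'
GROUP BY c.gender, p.product_name
""").fetchall())
print(rows)  # [('F', 'widget', 2, 10.0), ('M', 'widget', 1, 5.0)]
```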
7. Database Architecture 1.0
• Report queries can become expensive, redundant
• Build a layer of abstraction!
• Materialize the data to something closer to query
form.
• Create reporting tables
• Decide on the report columns
• Which query criteria can be parameterized
• Periodicity of report generation
• Denormalize and aggregate
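The materialization step described above can be sketched with sqlite3. The schema and table name (`sales_report`) are invented to match the earlier query; a real system would rebuild or refresh the table on the chosen reporting period:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE "order"  (customer_id INTEGER, product_id INTEGER,
                       qty INTEGER, price REAL, day TEXT);
INSERT INTO customer VALUES (1, 'F'), (2, 'M');
INSERT INTO product  VALUES (1, 'widget');
INSERT INTO "order"  VALUES (1, 1, 2, 10.0, '2013-03-21'),
                            (2, 1, 1,  5.0, '2013-03-21');
""")

# Denormalize and aggregate into a reporting table: the report columns
# are fixed up front, and `day` is kept as a parameterizable criterion.
cur.execute("""
CREATE TABLE sales_report AS
SELECT o.day, c.gender, p.product_name,
       SUM(o.qty) AS total_qty, SUM(o.price) AS total_price
FROM "order" o, customer c, product p
WHERE o.customer_id = c.id AND o.product_id = p.id
GROUP BY o.day, c.gender, p.product_name
""")

# Reports now read the flat table directly -- no joins at query time
rows = sorted(cur.execute(
    "SELECT gender, product_name, total_qty FROM sales_report WHERE day = ?",
    ("2013-03-21",)).fetchall())
```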
14. Three Ways to Transform Data
• Transform Extract Load
• Query from transactional tables into target schema
• Extract Load Transform
• Load data into analytical database, transform and write
to target schema
• No need for additional hardware
• Extract Transform Load
• Read data from transactional database into a grid
system, transform, then write to analytical database
• Least load on tx and analytical systems
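The Extract-Transform-Load pattern above can be sketched in a few lines: read raw rows off the transactional system, aggregate outside either database (the role of the grid system), and write the target schema to the analytical store. Both databases, the tables, and the data are invented for illustration:

```python
import sqlite3

# Hypothetical source (transactional) and target (analytical) databases
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE "order" (customer_id INTEGER, qty INTEGER, price REAL, day TEXT);
INSERT INTO "order" VALUES (1, 2, 10.0, '2013-03-21'),
                           (1, 1,  5.0, '2013-03-21');
""")
tgt.execute("CREATE TABLE daily_sales "
            "(day TEXT, total_qty INTEGER, total_price REAL)")

# Extract: a plain read, keeping load on the transactional system low
rows = src.execute('SELECT day, qty, price FROM "order"').fetchall()

# Transform: aggregate outside both databases (the grid's job in ETL)
totals = {}
for day, qty, price in rows:
    q, p = totals.get(day, (0, 0.0))
    totals[day] = (q + qty, p + price)

# Load: write the target schema into the analytical database
tgt.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)",
                [(d, q, p) for d, (q, p) in totals.items()])
```

In the ELT variant, the transform step would instead be a SQL statement run inside the target database after a bulk load.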
16. Business Intelligence Tools
• Can provide canned reports, dashboards, or
interactive visualizations
• Typically leverage common standards (SQL,
JDBC/ODBC) to access data
• Require low-latency response times from the
database (sub-second to sub-minute, depending on the query)
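A BI tool talks to the database through a standard driver and issues plain SQL; Python's DB-API plays the same role as JDBC/ODBC in this sketch. The reporting table and data are invented for illustration:

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_report (day TEXT, gender TEXT, product_name TEXT,
                           total_qty INTEGER, total_price REAL);
INSERT INTO sales_report VALUES
  ('2013-03-21', 'F', 'widget', 2, 10.0),
  ('2013-03-21', 'M', 'widget', 1, 5.0);
""")

# A 'canned report' is just a parameterized query the tool fills in
CANNED_REPORT = ("SELECT gender, product_name, total_qty, total_price "
                 "FROM sales_report WHERE day = ?")

start = time.perf_counter()
rows = conn.execute(CANNED_REPORT, ("2013-03-21",)).fetchall()
elapsed = time.perf_counter() - start  # interactive use needs this small
```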
17. Observations
• Separate transactional from analytical workloads
• Use appropriate database implementation
according to the workload
• ‘Traditional’ row-major store for transactional
• MPP column-store for analytic
• Consider a BI tool so you’re not stuck writing
reports for analysts who don’t know SQL
• Consider an ETL tool so you’re not stuck writing
transformations for analysts who don’t know SQL
Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
It turns out that separating the transactional and reporting databases brings other benefits.
I don’t need up-to-the-minute reports. Copy the data to a reporting DB. Now the workloads don’t conflict. I can now have a different reporting schema, and faster queries. Now I have to worry about transforming data, but I can also use different technology.
Two other major components haven’t been mentioned yet.
Not a trivial thing… there’s a market segment worth billions of dollars dedicated to making this easier: Informatica, Pervasive, Ab Initio, Pentaho. Speaking of making things easier…
Two things this allows you to do: use different underlying architectures for each database.
Data marts designed for specific department needs. Kimball?
Ralph Kimball, The Data Warehouse Toolkit; Bill Inmon, Building the Data Warehouse.
The challenge with a normal grid-based ETL system is that you have to load data from the source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop itself, thereby replacing the ETL grid. You could also forego the ODS, if there is one in the architecture. There is also the option to enrich the data published to the DW by running analytics not available in the traditional DW/BI stack, e.g. clustering, classification, statistical analysis.
Store long-term data. Transform and load to data marts.
Store long-term data. BI tools can readily query data in Hadoop using Impala.
Support for insert/update semantics? HBase with typed columns.