2. About Me
• Director of Field Engineering at Cloudera
• Architect on several dozen Hadoop-based data solutions
for Cloudera customers
• Started with Hadoop in 2008
• First Hadoop system processed set-top box log data
• Past life
• Java EE / Database Architect
• Web Data Mining
• Cryptography / Public Key Infrastructure
6. Database Architecture 1.0
• Dead simple
• Tables in 3rd normal form
• Reports are SQL queries that join through entity
relationships and aggregate
SELECT c.gender, p.product_name,
sum(o.qty), sum(o.price)
FROM order o, customer c, product p
WHERE o.customer_id = c.id
AND o.product_id = p.id
AND o.day = '2013-03-21'
GROUP BY c.gender, p.product_name ;
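The joined report query above can be run end to end. A minimal sketch using Python's built-in sqlite3 module, with the slide's schema recreated and a couple of invented rows for illustration (sqlite requires quoting `order`, a reserved word):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal 3rd-normal-form schema matching the slide's query
cur.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE "order"  (customer_id INTEGER, product_id INTEGER,
                       qty INTEGER, price REAL, day TEXT);
INSERT INTO customer VALUES (1, 'F'), (2, 'M');
INSERT INTO product  VALUES (1, 'widget');
INSERT INTO "order"  VALUES (1, 1, 2, 10.0, '2013-03-21'),
                            (2, 1, 1,  5.0, '2013-03-21');
""")

# The report: join through the entity relationships and aggregate
rows = sorted(cur.execute("""
SELECT c.gender, p.product_name, SUM(o.qty), SUM(o.price)
FROM "order" o, customer c, product p
WHERE o.customer_id = c.id
  AND o.product_id = p.id
  AND o.day = '2013-03-21'
GROUP BY c.gender, p.product_name
""").fetchall())
print(rows)  # [('F', 'widget', 2, 10.0), ('M', 'widget', 1, 5.0)]
```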
7. Database Architecture 1.0
• Report queries can become expensive, redundant
• Build a layer of abstraction!
• Materialize the data to something closer to query
form.
• Create reporting tables
• Decide on the report columns
• Which query criteria can be parameterized
• Periodicity of report generation
• Denormalize and aggregate
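The materialization step described above can be sketched with sqlite3. The schema and table name (`sales_report`) are invented to match the earlier query; a real system would rebuild or refresh the table on the chosen reporting period:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE "order"  (customer_id INTEGER, product_id INTEGER,
                       qty INTEGER, price REAL, day TEXT);
INSERT INTO customer VALUES (1, 'F'), (2, 'M');
INSERT INTO product  VALUES (1, 'widget');
INSERT INTO "order"  VALUES (1, 1, 2, 10.0, '2013-03-21'),
                            (2, 1, 1,  5.0, '2013-03-21');
""")

# Denormalize and aggregate into a reporting table: the report columns
# are fixed up front, and `day` is kept as a parameterizable criterion.
cur.execute("""
CREATE TABLE sales_report AS
SELECT o.day, c.gender, p.product_name,
       SUM(o.qty) AS total_qty, SUM(o.price) AS total_price
FROM "order" o, customer c, product p
WHERE o.customer_id = c.id AND o.product_id = p.id
GROUP BY o.day, c.gender, p.product_name
""")

# Reports now read the flat table directly -- no joins at query time
rows = sorted(cur.execute(
    "SELECT gender, product_name, total_qty FROM sales_report WHERE day = ?",
    ("2013-03-21",)).fetchall())
```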
14. Three Ways to Transform Data
• Transform Extract Load
• Query from transactional tables into target schema
• Extract Load Transform
• Load data into analytical database, transform and write
to target schema
• No need for additional hardware
• Extract Transform Load
• Read data from transactional database into a grid
system, transform, then write to analytical database
• Least load on tx and analytical systems
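The Extract-Transform-Load pattern above can be sketched in a few lines: read raw rows off the transactional system, aggregate outside either database (the role of the grid system), and write the target schema to the analytical store. Both databases, the tables, and the data are invented for illustration:

```python
import sqlite3

# Hypothetical source (transactional) and target (analytical) databases
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE "order" (customer_id INTEGER, qty INTEGER, price REAL, day TEXT);
INSERT INTO "order" VALUES (1, 2, 10.0, '2013-03-21'),
                           (1, 1,  5.0, '2013-03-21');
""")
tgt.execute("CREATE TABLE daily_sales "
            "(day TEXT, total_qty INTEGER, total_price REAL)")

# Extract: a plain read, keeping load on the transactional system low
rows = src.execute('SELECT day, qty, price FROM "order"').fetchall()

# Transform: aggregate outside both databases (the grid's job in ETL)
totals = {}
for day, qty, price in rows:
    q, p = totals.get(day, (0, 0.0))
    totals[day] = (q + qty, p + price)

# Load: write the target schema into the analytical database
tgt.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)",
                [(d, q, p) for d, (q, p) in totals.items()])
```

In the ELT variant, the transform step would instead be a SQL statement run inside the target database after a bulk load.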
16. Business Intelligence Tools
• Can provide canned reports, dashboards, or
interactive visualizations
• Typically leverage common standards (SQL,
JDBC/ODBC) to access data
• Require low-latency response times from the
database (sub-second to sub-minute, depending on the query)
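A BI tool talks to the database through a standard driver and issues plain SQL; Python's DB-API plays the same role as JDBC/ODBC in this sketch. The reporting table and data are invented for illustration:

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_report (day TEXT, gender TEXT, product_name TEXT,
                           total_qty INTEGER, total_price REAL);
INSERT INTO sales_report VALUES
  ('2013-03-21', 'F', 'widget', 2, 10.0),
  ('2013-03-21', 'M', 'widget', 1, 5.0);
""")

# A 'canned report' is just a parameterized query the tool fills in
CANNED_REPORT = ("SELECT gender, product_name, total_qty, total_price "
                 "FROM sales_report WHERE day = ?")

start = time.perf_counter()
rows = conn.execute(CANNED_REPORT, ("2013-03-21",)).fetchall()
elapsed = time.perf_counter() - start  # interactive use needs this small
```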
17. Observations
• Separate transactional from analytical workloads
• Use appropriate database implementation
according to the workload
• ‘Traditional’ row-major store for transactional
• MPP column-store for analytic
• Consider a BI tool so you’re not stuck writing
reports for analysts who don’t know SQL
• Consider an ETL tool so you’re not stuck writing
transformations for analysts who don’t know SQL
Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
It turns out that separating the transactional and reporting databases brings other benefits.
I don’t need up-to-the-minute reports. Copy the data to a reporting DB. Now the workloads don’t conflict. I can now have a different reporting schema, and faster queries. Now I have to worry about transforming data, but I can also use different technology.
Two other major components haven’t been mentioned yet.
Not a trivial thing… there’s a market segment worth billions of dollars dedicated to making this easier: Informatica, Pervasive, Ab Initio, Pentaho. Speaking of making things easier…
Two things this allows you to do: use different underlying architectures for each database.
Data marts designed for specific department needs. Kimball?
Ralph Kimball, The Data Warehouse Toolkit; Bill Inmon, Building the Data Warehouse.
The challenge with a normal grid-based ETL system is that you have to load data from the source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop itself, thereby replacing the ETL grid. You could also forego the ODS, if there is one in the architecture. There is also the option to enrich the data published to the DW by running analytics not available in the traditional DW/BI stack, e.g. clustering, classification, statistical analysis.
Store long-term data. Transform and load to data marts.
Store long-term data. BI tools can readily query data in Hadoop using Impala.
Support for insert/update semantics? HBase with typed columns.