Most DBAs are aware that something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves, and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning; what's happening with ETL, metadata and analytics on this platform; and why IaaS and data-warehousing-as-a-service will have such a big impact, sooner than you think.
Hadoop Architectures and the Problems They Solve
1. Mark Rittman, Oracle ACE Director
NEW WORLD HADOOP ARCHITECTURES (& WHAT
PROBLEMS THEY REALLY SOLVE) FOR DBAS
UKOUG DATABASE SIG MEETING
London, February 2017
2. •Oracle ACE Director, Independent Analyst
•Past ODTUG Exec Board Member + Oracle Scene Editor
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Host of the Drill to Detail Podcast (www.drilltodetail.com)
•Based in Brighton & work in London, UK
About The Presenter
2
4. “Hi Mark, In things I have seen and read quite often people
start with a high-level overview of a product (e.g. Hadoop,
Kafka), then describe the technical concepts (using all the
appropriate terminology) …”
“but I am usually left missing something. I think it's around
the area of what problems these technologies are solving
and how they are doing it? Without that context I'm finding
it all very academic”
“Many people say traditional systems will still be
needed. Are these new technologies solving completely
different problems to those handled by traditional IT?
Is there an overlap?”
5. •Started back in 1996 on a bank Oracle DW project
•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics
20 Years in Old-school BI & Data Warehousing
5
12. •Google needed to store and query their vast amounts of server log files
•And wanted to do so using cheap, commodity hardware
•Google File System and MapReduce designed together for this use
Google File System and MapReduce
12
13. •GFS optimised for particular task at hand -
computing PageRank for sites
•Streaming reads for PageRank calcs, block writes for
crawler whole-site dumps
•Master node only holds metadata
•Stops client/master I/O being bottleneck, also acts as
traffic controller for clients
•Simple design, optimised for a specific Google need
•MapReduce focused on simple computations on
abstraction framework
•Select & filter (MAP) and reduce (aggregate) functions,
easy to distribute across a cluster
•MapReduce abstracted cluster compute, HDFS
abstracted cluster storage
•Projects that inspired Apache Hadoop + HDFS
Google File System + MapReduce Key Innovations
13
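The map (select & filter) and reduce (aggregate) split described above can be sketched in a few lines of plain Python — a toy illustration of the programming model, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(records):
    # MAP: select and filter - emit (key, value) pairs per input record
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # SHUFFLE: group values by key (done by the framework in Hadoop)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # REDUCE: aggregate each key's values independently
    return {key: sum(values) for key, values in groups.items()}

logs = ["GET /index.html", "GET /about.html", "POST /index.html"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["/index.html"])  # -> 2
```

Because each map call sees one record and each reduce call sees one key, both phases parallelise trivially across a cluster — which is exactly the abstraction MapReduce offered over raw distributed programming.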
14. How Traditional RDBMS Data Warehousing Scaled-Up
14
Shared-Everything Architectures (e.g.
Oracle RAC, Exadata)
Shared-Nothing Architectures
(e.g. Teradata, Netezza)
17. •Enterprise High-End RDBMSs such as Oracle can scale
•Clustering for single-instance DBs can scale to >PB
•Exadata scales further by offloading queries to storage
•Sharded databases (e.g. Netezza) can scale further
•But cost (and complexity) become limiting factors
•Costs of $1m/node are not uncommon
Cost and Complexity around Scaling DW Clusters
17
18. •A way of storing (non-relational) data cheaply and easily expandable
•Gave us a way of scaling beyond TB-size without paying $$$
•First use-cases were offline storage, active archive of data
Hadoop’s Original Appeal to Data Warehouse Owners
18
19. Hadoop Ecosystem Expanded Beyond MapReduce
19
•Core Hadoop, MapReduce and HDFS
•HBase and other NoSQL Databases
•Apache Hive and SQL-on-Hadoop
•Storm, Spark and Stream Processing
•Apache YARN and Hadoop 2.0
20. •Solution to the problem of storing semi-structured data at-scale
•Built on Google File System
•Scale for capacity, e.g. the webtable:
•100,000,000,000 pages
•10 versions per page
•20 KB / version = 20 PB of data
•Scale for throughput
•Hundreds of millions of users
•Tens of thousands to millions of queries/sec
•At low-latency with high-reliability
Google BigTable, HBase and NoSQL Databases
20
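The webtable capacity figure quoted above multiplies out as follows (using decimal units, 1 PB = 10^15 bytes):

```python
pages = 100_000_000_000   # 10^11 pages
versions = 10             # versions kept per page
kb_per_version = 20       # 20 KB stored per version

# total storage in bytes, then petabytes
total_bytes = pages * versions * kb_per_version * 1_000
total_pb = total_bytes / 1e15
print(total_pb)  # -> 20.0
```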
21. •Optimised for a particular task - fast
lookups of timestamp-versioned web data
•Data stored in multidimensional map keyed
on row, column + timestamp
•Master + data tablets stored on GFS cluster
nodes
•Simple key/value lookup with client doing
interpretation
•Innovation - focus on single job with
different needs to OLTP
•Formed inspiration for Apache HBase
How BigTable Scaled Beyond Traditional RDBMSs
21
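The multidimensional map keyed on row, column and timestamp can be modelled as an in-memory Python class — a toy sketch of BigTable's data layout, not the HBase API:

```python
import time
from collections import defaultdict

class TinyBigTable:
    """Toy model of a BigTable-style store: cells keyed on
    (row key, column, timestamp), keeping multiple versions."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self._cells[(row, column)].append((ts, value))
        self._cells[(row, column)].sort(reverse=True)

    def get(self, row, column, ts=None):
        # Simple key/value lookup: newest version at or before ts.
        # Interpreting the stored bytes is the client's job.
        for cell_ts, value in self._cells[(row, column)]:
            if ts is None or cell_ts <= ts:
                return value
        return None

table = TinyBigTable()
table.put("com.example/index.html", "contents:", "<html>v1</html>", ts=1)
table.put("com.example/index.html", "contents:", "<html>v2</html>", ts=2)
print(table.get("com.example/index.html", "contents:"))        # newest version
print(table.get("com.example/index.html", "contents:", ts=1))  # as of ts=1
```

Note how the row key here is a reversed-domain URL, as in Google's webtable: sorting rows lexically keeps pages from the same site adjacent on the same tablet.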
22. •Originally developed at Facebook, now foundational within Hadoop
•SQL-like language that compiles to MapReduce, Spark, HBase
•Solved the problem of enabling non-programmers to access big data
•And made Hadoop data transformation and aggregation code more productive
•JDBC and ODBC drivers for tool integration
Hive - Hadoop Discovers Set-Based Processing
22
23. •Hive is extensible to help with accessing and integrating new data sets
•SerDes : Serializer-Deserializers that interpret semi-structured sources
•UDFs + Hive Streaming : User-defined functions and streaming input
•File Formats : make use of compressed and/or optimised file storage
•Storage Handlers : use storage other than HDFS (e.g. MongoDB)
Apache Hive as SQL Access Engine For Everything
23
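The job a SerDe does — projecting a semi-structured record onto named, typed columns at read time — can be sketched like this. This is a hypothetical regex-based deserializer in Python for illustration, not Hive's actual Java SerDe interface:

```python
import re

# Hypothetical deserializer: maps an Apache-style access log line
# onto named columns, applied as each row is read
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def deserialize(line):
    """Turn one raw log line into a row of typed columns."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # malformed rows surface as NULLs, as in Hive
    row = m.groupdict()
    row["status"] = int(row["status"])  # typed columns, not raw strings
    row["bytes"] = int(row["bytes"])
    return row

line = '10.0.0.1 - - [01/Feb/2017:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120'
row = deserialize(line)
print(row["host"], row["status"], row["bytes"])  # -> 10.0.0.1 200 5120
```

The raw file on HDFS never changes; the columns exist only in how the deserializer interprets each line — which is what lets Hive query log files, JSON and other semi-structured sources as if they were tables.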
24. •Hadoop as low-cost ETL pre-processing engine - “ETL-offload”
•NoSQL database for landing real-time data at high speed/low latency
•Incoming data then aggregated and stored in RDBMS DW
Common Hadoop/NoSQL Use-Case
24
[Diagram: Hadoop as an online, scalable, flexible and cost-effective landing layer, feeding aggregated data (Σ) into the Data Warehouse and Marts, accessed by Business Intelligence tools]
29. •Driven by pace of business, and user demands for more agility and control
•Traditional IT-governed data loading not always appropriate
•Not all data needed to be modelled right away
•Not all data suited storing in tabular form
•New ways of analyzing data beyond SQL
•Graph analysis
•Machine learning
Data Warehousing and ETL Needed Some Agility
29
30. Problem #2 That Hadoop / NoSQL Solved :
Making Data Warehousing Agile
31. •Storing data in the format it arrived in, and then applying a schema at query time
•Suits data that may be analysed in different ways by different tools
•In addition, some datatypes may have schema embedded in file format
•Key benefit - fast arriving data of unknown value can get to users earlier
•Made possible by tools such as Apache Hive + SerDes and
Apache Drill, self-describing file formats, and HDFS storage
Advent of Schema-on-Read, and Data Lakes
31
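Schema-on-read is easy to demonstrate: the raw records stay untouched on disk, and each query applies its own projection as it reads them. A minimal Python sketch of the idea (not Hive or Drill themselves):

```python
import json

# Raw events land as-is - no upfront modelling
raw_lines = [
    '{"user": "alice", "action": "click", "page": "/home"}',
    '{"user": "bob", "action": "purchase", "amount": 9.99}',
    '{"user": "alice", "action": "purchase", "amount": 4.50}',
]

def query(lines, projection, predicate=lambda row: True):
    """Apply a schema (column projection) at read time over raw data."""
    for line in lines:
        row = json.loads(line)
        if predicate(row):
            # Fields missing from a record become None, not load errors
            yield {col: row.get(col) for col in projection}

# Two different "schemas" over the same raw lines, chosen per query
clicks = list(query(raw_lines, ["user", "page"],
                    lambda r: r.get("action") == "click"))
revenue = sum(r["amount"] for r in
              query(raw_lines, ["amount"],
                    lambda r: r.get("action") == "purchase"))
print(clicks)             # -> [{'user': 'alice', 'page': '/home'}]
print(round(revenue, 2))  # -> 14.49
```

Contrast this with schema-on-write, where the purchase and click events would each need a modelled table (and a rejected-rows process) before a single record could be loaded.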
32. •Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
•Flexible data storage platform with cheap storage, flexible schema support + compute
•Solves the problem of how to store new types of data + choose best time/way to process it
•Hadoop/NoSQL increasingly used for all store/transform/query tasks
Meet the New Data Warehouse : The “Data Lake”
32
[Diagram: the Data Lake architecture. File-based, stream-based and ETL-based integration land operational data (transactions, customer master data) and unstructured data (voice + chat transcripts) into a Hadoop-platform Data Reservoir. Raw customer data is stored in its original format (usually files) such as SS7, ASN.1, JSON etc.; mapped customer data sets are produced by mapping and transforming the raw data in a Data Factory. Discovery & Development Labs provide a safe and secure environment for data sets, samples, models and programs. Business Intelligence tools and Marketing/Sales applications consume the models, machine learning output and segments.]
33. Hadoop 2.0 and YARN
(“Yet Another Resource Negotiator”)
Key Innovation : Separating how data is stored,
from how it is processed
35. •Hadoop started by being synonymous with MapReduce, and Java coding
•But YARN (Yet another Resource Negotiator) broke this dependency
•Hadoop now just handles resource management
•Multiple different query engines can run against data in-place
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine Learning
•Real-Time Processing
Hadoop 2.0 - Enabling Multiple Query Engines
35
38. •New generation of big data platform services from Google, Amazon, Oracle
•Combines three key innovations from earlier technologies:
•Organising of data into tables and columns (from RDBMS DWs)
•Massively-scalable and distributed storage and query (from Big Data)
•Elastically-scalable Platform-as-a-Service (from Cloud)
Elastically-Scalable Data Warehouse-as-a-Service
38
42. •On-premise Hadoop, even with simple resilient clustering, will hit limits
•Clusters can reach 5,000+ nodes, and need to scale up for demand peaks etc.
•Scale limits are encountered way beyond those for DWs…
•… but future is elastically-scaled, query and compute-as-a-service
What Problem Did Analytics-as-a-Service Solve?
42
Oracle Big Data Cloud Compute Edition
Free $300 developer credit at:
https://cloud.oracle.com/en_US/tryit
43. •And things come full-circle … analytics
typically requires tabular data
•Google BigQuery is based on the Dremel
massively-parallel query engine
•But stores data in columnar format and provides
a SQL interface
•Solves the problem of providing DW-like
functionality at scale, as-a-service
•This is the future … ;-)
BigQuery : Big Data Meets Data Warehousing
43
44. (closing slide: repeat of title slide)