6. InfiniDB Background – InfiniDB for Hadoop
• InfiniDB is a non-MapReduce engine
• Reads and writes natively to HDFS
[Diagram: Pig/Hive, HBase, MapReduce, and InfiniDB for Hadoop shown side by side, all running on the Hadoop Distributed File System]
7. InfiniDB Background - InfiniDB for Hadoop
Is InfiniDB a Database?
… not a General Purpose DBMS.
Is InfiniDB NoSQL?
… only in the sense that we discarded traditional DBMS architectures.
Is InfiniDB an SQL-for-Hadoop technology?
… Yes, but not general-purpose SQL. InfiniDB is highly optimized for analytic workloads/queries.
“InfiniDB turns SQL developers into Big Data developers. We deployed it quickly and easily for our online sales analytics. Something we couldn’t do with Hadoop, Mongo, or Teradata.”
8. InfiniDB Foundation - Parallelism
• User Module – Processes SQL Requests
• Performance Module – Executes the Queries
[Diagram: User and Performance Modules deployed on a single server or as an MPP cluster, over local disk / EBS or GlusterFS / HDFS]
9. InfiniDB Foundation - Parallelism
• Purpose-built C++ engine
• Parallelism is at the thread level
• Example: 12 PM servers with 8 cores each yield 96 parallel processing engines.
• SQL is translated into thousands or tens of thousands of discrete jobs, or “primitives”.
• The UM sends primitives to the processing engines.
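The dispatch model described above can be sketched in a few lines. This is an illustrative sketch only, not InfiniDB internals: a "primitive" is modeled as a small job run on a fixed pool of worker threads standing in for the parallel processing engines (e.g. 12 PMs × 8 cores = 96 workers).

```python
from concurrent.futures import ThreadPoolExecutor

def run_primitives(blocks, workers=96):
    """Fan primitives out to a fixed pool of engines, merge the results."""
    def primitive(block):                       # one discrete job
        return sum(block)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(primitive, blocks)  # UM hands jobs to engines
    return sum(partials)                        # results merged back at the UM

data = list(range(1000))
blocks = [data[i:i + 100] for i in range(0, 1000, 100)]  # 10 primitives
total = run_primitives(blocks)
```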
10. InfiniDB Foundation - Parallelism
• User Module – Processes SQL Requests
• Performance Module – Executes the Queries
• Primitives are issued to a thread queue within each PM
• Fixed thread count at each PM
[Diagram: User and Performance Modules on a single server or MPP cluster, over local disk / EBS or GlusterFS / HDFS]
11. Fully Parallel SQL + Full SQL Syntax
SQL operations are translated into thousands of jobs via a custom Distribution of Work (DoW):
• Parallel/Distributed Data Access
• Parallel/Distributed Joins (Inner, Outer)
• Parallel/Distributed Sub-queries (From, Where, Select)
• Parallel/Distributed Group By, Distinct, and Aggregation
• Extensible with Parallel/Distributed User Defined Functions
Results are returned to the User Module in a Reduce phase.
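The Distribution-of-Work / Reduce pattern above can be illustrated with a distributed GROUP BY: each PM computes a partial aggregate over its slice of the data, and the UM merges the partials in the reduce phase. A hypothetical sketch, not InfiniDB code:

```python
from collections import Counter

def partial_group_count(rows):
    """Per-PM work: a partial GROUP BY over one slice of the table."""
    return Counter(r["region"] for r in rows)

def reduce_at_um(partials):
    """Reduce phase: merge the partial aggregates at the User Module."""
    merged = Counter()
    for p in partials:
        merged.update(p)
    return dict(merged)

slice1 = [{"region": "EU"}, {"region": "US"}]   # rows on PM 1
slice2 = [{"region": "EU"}, {"region": "EU"}]   # rows on PM 2
result = reduce_at_um([partial_group_count(slice1),
                       partial_group_count(slice2)])
# result == {"EU": 3, "US": 1}
```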
12. InfiniDB Data Partitioning
2-Dimensional Partitioning Model
• Vertical partitioning by column
o Not column-family (no relation to HBase)
o I/O is done only for the columns requested
• Horizontal partitioning by range of rows
o Meta-data stored in an in-memory structure
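The two dimensions above can be sketched as follows: each column is stored separately (vertical), and each column is split into fixed-size row ranges with min/max metadata kept in memory (horizontal). The names and structures here are hypothetical illustrations, not InfiniDB's on-disk format; InfiniDB's actual row ranges ("extents") are far larger.

```python
EXTENT_ROWS = 4  # rows per horizontal range (tiny, for illustration)

def partition_column(values):
    """Split one column into row ranges, keeping min/max metadata."""
    extents = []
    for start in range(0, len(values), EXTENT_ROWS):
        chunk = values[start:start + EXTENT_ROWS]
        extents.append({"rows": (start, start + len(chunk) - 1),
                        "min": min(chunk), "max": max(chunk),
                        "data": chunk})
    return extents

table = {"id": [1, 2, 3, 4, 5, 6], "amount": [10, 50, 20, 80, 30, 60]}
store = {col: partition_column(vals) for col, vals in table.items()}
# A query touching only "amount" never reads the "id" extents.
```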
13. InfiniDB Data Partitioning
• Partition elimination can occur based on:
o Columns not included in the SQL.
o Filters expressed within the query.
o Filters expressed on a joined table:
a Table1 filter can drive Table2 I/O elimination.
o Intersections between filters:
Filter1 AND Filter2 does I/O only on the intersection.
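Filter-based elimination as described above can be sketched against the in-memory min/max metadata: a row range is skipped when its metadata proves no row inside it can satisfy the filter. Hypothetical structures, not InfiniDB internals:

```python
def extents_to_scan(extents, lo, hi):
    """Keep only extents whose [min, max] range intersects [lo, hi]."""
    return [e for e in extents if e["max"] >= lo and e["min"] <= hi]

extents = [{"min": 0,   "max": 99,  "id": "e1"},
           {"min": 100, "max": 199, "id": "e2"},
           {"min": 200, "max": 299, "id": "e3"}]

# WHERE col BETWEEN 150 AND 180 -> only e2 needs I/O
survivors = extents_to_scan(extents, 150, 180)
```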
15. Additional I/O Efficiency
Techniques to avoid unnecessary I/O:
• Vertical partitioning: read only the columns required
• Horizontal partitioning: focus on the rows required
• Just-in-time materialization
Techniques for efficient I/O:
• Columnar compression reduces I/O from disk
• Global data buffer cache can reduce disk I/O (in-memory)
• Avoidance of random I/O
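Just-in-time (late) materialization, mentioned above, can be illustrated as: evaluate the filter against one column first, then fetch the remaining columns only for the qualifying row positions. A hypothetical example, not InfiniDB code:

```python
def late_materialize(filter_col, other_cols, predicate):
    """Filter one column, then materialize other columns for hits only."""
    hits = [i for i, v in enumerate(filter_col) if predicate(v)]
    return [{name: col[i] for name, col in other_cols.items()}
            for i in hits]                    # I/O only for matching rows

amount = [10, 500, 20, 900, 30]
rows = late_materialize(amount,
                        {"id": [1, 2, 3, 4, 5]},
                        lambda v: v > 100)
# rows == [{"id": 2}, {"id": 4}]
```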
18. (My)SQL for Hadoop
• Leverage existing tools that connect to MySQL (e.g., MicroStrategy, JasperSoft, Pentaho)
• Expose structured data to the business
• Familiar user privilege administration
MySQL ease of use + Hadoop scale + columnar performance
19. Syntax Support
• Broad MySQL SQL syntax supported
• Analytic/windowing functions included with InfiniDB 4
• No indexing needed; partitioning is automatic
[Diagram: InfiniDB supported syntax]
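A minimal example of the standard analytic/window-function syntax referred to above, run here against SQLite purely for illustration (not against InfiniDB; requires SQLite >= 3.25, bundled with modern Python, for window-function support):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 10), ("EU", 30), ("US", 20)])

# SUM() OVER (PARTITION BY ...) computes a per-region total on every row
# without collapsing the rows, unlike a plain GROUP BY.
rows = con.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales ORDER BY region, amount
""").fetchall()
# rows == [("EU", 10, 40), ("EU", 30, 40), ("US", 20, 20)]
```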
20. When to Use InfiniDB for Hadoop
Query size (vision/scope) defines workloads:
[Chart: query size/vision/scope spectrum from 1 to 10,000,000,000 rows, with OLTP/NoSQL workloads at the small end and ROLAP/analytic/reporting workloads at the large end]
General-purpose DBMSs missed the target (dated database technology is generally not optimal).
21. What is your typical query?
[Chart: query vision/scope spectrum from 1 to 10,000,000,000 rows, with OLTP/NoSQL workloads at the small end and analytic workloads at the large end]
• There is no “average” query.
• The challenges are at the extremes:
o The challenge of high concurrency levels with small queries.
o The challenge of latency for very large queries.
• Most use cases imply multiple data technologies.
22. Columnar Appropriate Workloads
[Chart: query vision/scope spectrum from 1 to 10,000,000,000 rows]
• OLTP/NoSQL workloads: pure columnar has about 10x worse I/O for single-record lookups.
• ROLAP/analytic/reporting workloads: pure columnar has about 10x better I/O for large data access patterns.
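The back-of-envelope arithmetic behind the "10x" figures above can be made explicit. The numbers here are illustrative assumptions (actual ratios depend on schema width and block sizes):

```python
COLUMNS = 100          # assumed columns in the table
COLUMNS_QUERIED = 10   # assumed columns touched by an analytic query

# Single-record lookup: a row store reads ~1 block for the whole row;
# a pure column store reads one block per column it must assemble.
row_store_blocks = 1
column_store_blocks = COLUMNS_QUERIED
lookup_penalty = column_store_blocks / row_store_blocks   # ~10x worse

# Large scan: a row store reads every column of every row; a column
# store reads only the queried columns.
scan_benefit = COLUMNS / COLUMNS_QUERIED                  # ~10x better
```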
23. Columnar Appropriate Workloads
Data Dimensions and InfiniDB for Hadoop
[Diagram: data dimensions — unstructured vs. structured data, schema on read vs. schema on write, small queries vs. large queries, transform (ETL) vs. targeted extract, pre-defined vs. ad-hoc queries]
24. InfiniDB Query Performance – Percona
Star Schema Benchmark (SSB)
[Chart: SSB query performance]
• Q1 series: 2-table joins
• Q2 series: 3-table joins
• Q3 series: 4-table joins
• Q5 series: 5-table joins
25. 1000 Genomes Data Set – 289 Billion Rows
• Fast load rate: millions of rows/sec, billions of rows/hour
• Scalable load rate
• 1000 Genomes data set on AWS
26. 1000 Genomes Data Set – ~24 trillion base nucleotide values
• Scaling: 4 –> 8 –> 16 Performance Modules
• Fast analytics: millions of rows/second
• Scalable analytics: automatic parallelism
[Chart: seconds per core vs. Performance Modules (PMs) active]
Figure 2 – TATA Binding Protein. Source: http://en.wikipedia.org/wiki/TATA_binding_protein
27. Impala-InfiniDB Benchmark (Piwik Data Set)
Piwik is an open-source alternative to Google Analytics.
• Queries 1–6 are Piwik production queries
• Queries 7–9 are additional ad-hoc queries covering all data
• Run on an Amazon 5-node cluster
Figure 1 – Piwik Standard Query Performance [chart]
Figure 2 – Piwik Ad-Hoc Query Performance [chart]
28. Columnar Appropriate Workloads
Data Dimensions and InfiniDB for Hadoop
[Diagram: the data-dimensions chart from slide 23 (structured vs. unstructured, schema on read vs. schema on write, small vs. large queries, transform (ETL) vs. targeted extract, pre-defined vs. ad-hoc queries), with InfiniDB positioned on it]