6. InfiniDB Background – InfiniDB for Hadoop
• InfiniDB is a non-MapReduce engine
• Reads and writes natively to HDFS
[Diagram: Pig/Hive, HBase, MapReduce, and InfiniDB for Hadoop shown side by side, all running on the Hadoop Distributed File System]
7. InfiniDB Background - InfiniDB for Hadoop
Is InfiniDB a Database?
… not a General Purpose DBMS.
Is InfiniDB NoSQL?
… only in the sense that we discarded traditional DBMS architectures.
Is InfiniDB an SQL-for-Hadoop technology?
… Yes, but not general-purpose SQL. InfiniDB is highly optimized for analytic workloads/queries.
“InfiniDB turns SQL developers into Big Data developers. We deployed it quickly and easily for our online sales analytics. Something we couldn’t do with Hadoop, Mongo, or Teradata.”
8. InfiniDB Foundation - Parallelism
• User Module – Processes SQL Requests
• Performance Module – Executes the Queries
[Diagram: User and Performance Modules deployed on a single server or as an MPP cluster, over local disk / EBS or GlusterFS / HDFS]
9. InfiniDB Foundation - Parallelism
• Purpose-built C++ engine
• Parallelism is at the thread level
• Example: 12 PM servers with 8 cores each yield 96 parallel processing engines.
• SQL is translated into thousands or tens of thousands of discrete jobs, or “primitives”.
• The UM sends primitives to the processing engines.
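The dispatch model described above can be sketched in a few lines. This is an illustrative sketch only, not InfiniDB internals: a "primitive" is modeled as a small job run on a fixed pool of worker threads standing in for the parallel processing engines (e.g. 12 PMs × 8 cores = 96 workers).

```python
from concurrent.futures import ThreadPoolExecutor

def run_primitives(blocks, workers=96):
    """Fan primitives out to a fixed pool of engines, merge the results."""
    def primitive(block):                       # one discrete job
        return sum(block)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(primitive, blocks)  # UM hands jobs to engines
    return sum(partials)                        # results merged back at the UM

data = list(range(1000))
blocks = [data[i:i + 100] for i in range(0, 1000, 100)]  # 10 primitives
total = run_primitives(blocks)
```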
10. InfiniDB Foundation - Parallelism
• User Module – Processes SQL Requests
• Performance Module – Executes the Queries
• Primitives are issued to a thread queue within each PM
• Fixed thread count at each PM
[Diagram: User and Performance Modules on a single server or MPP cluster, over local disk / EBS or GlusterFS / HDFS]
11. Fully Parallel SQL + Full SQL Syntax
SQL operations are translated into thousands of jobs via a custom Distribution of Work (DoW):
• Parallel/Distributed Data Access
• Parallel/Distributed Joins (Inner, Outer)
• Parallel/Distributed Sub-queries (From, Where, Select)
• Parallel/Distributed Group By, Distinct, and Aggregation
• Extensible with Parallel/Distributed User Defined Functions
Results are returned to the User Module in a Reduce phase.
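The Distribution-of-Work / Reduce pattern above can be illustrated with a distributed GROUP BY: each PM computes a partial aggregate over its slice of the data, and the UM merges the partials in the reduce phase. A hypothetical sketch, not InfiniDB code:

```python
from collections import Counter

def partial_group_count(rows):
    """Per-PM work: a partial GROUP BY over one slice of the table."""
    return Counter(r["region"] for r in rows)

def reduce_at_um(partials):
    """Reduce phase: merge the partial aggregates at the User Module."""
    merged = Counter()
    for p in partials:
        merged.update(p)
    return dict(merged)

slice1 = [{"region": "EU"}, {"region": "US"}]   # rows on PM 1
slice2 = [{"region": "EU"}, {"region": "EU"}]   # rows on PM 2
result = reduce_at_um([partial_group_count(slice1),
                       partial_group_count(slice2)])
# result == {"EU": 3, "US": 1}
```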
12. InfiniDB Data Partitioning
2-Dimensional Partitioning Model
• Vertical partitioning by column
o Not column-family (no relation to HBase)
o I/O is done only for the columns requested
• Horizontal partitioning by range of rows
o Meta-data stored in an in-memory structure
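The two dimensions above can be sketched as follows: each column is stored separately (vertical), and each column is split into fixed-size row ranges with min/max metadata kept in memory (horizontal). The names and structures here are hypothetical illustrations, not InfiniDB's on-disk format; InfiniDB's actual row ranges ("extents") are far larger.

```python
EXTENT_ROWS = 4  # rows per horizontal range (tiny, for illustration)

def partition_column(values):
    """Split one column into row ranges, keeping min/max metadata."""
    extents = []
    for start in range(0, len(values), EXTENT_ROWS):
        chunk = values[start:start + EXTENT_ROWS]
        extents.append({"rows": (start, start + len(chunk) - 1),
                        "min": min(chunk), "max": max(chunk),
                        "data": chunk})
    return extents

table = {"id": [1, 2, 3, 4, 5, 6], "amount": [10, 50, 20, 80, 30, 60]}
store = {col: partition_column(vals) for col, vals in table.items()}
# A query touching only "amount" never reads the "id" extents.
```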
13. InfiniDB Data Partitioning
• Partition elimination can occur based on:
o Columns not included in the SQL.
o Filters expressed within the query.
o Filters expressed on a joined table:
a Table1 filter can drive Table2 I/O elimination.
o Intersections between filters:
Filter1 AND Filter2 does I/O only on the intersection.
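Filter-based elimination as described above can be sketched against the in-memory min/max metadata: a row range is skipped when its metadata proves no row inside it can satisfy the filter. Hypothetical structures, not InfiniDB internals:

```python
def extents_to_scan(extents, lo, hi):
    """Keep only extents whose [min, max] range intersects [lo, hi]."""
    return [e for e in extents if e["max"] >= lo and e["min"] <= hi]

extents = [{"min": 0,   "max": 99,  "id": "e1"},
           {"min": 100, "max": 199, "id": "e2"},
           {"min": 200, "max": 299, "id": "e3"}]

# WHERE col BETWEEN 150 AND 180 -> only e2 needs I/O
survivors = extents_to_scan(extents, 150, 180)
```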
15. Additional I/O Efficiency
Techniques to avoid unnecessary I/O:
• Vertical partitioning: read only the columns required
• Horizontal partitioning: focus on the rows required
• Just-in-time materialization
Techniques for efficient I/O:
• Columnar compression reduces I/O from disk
• Global data buffer cache can reduce disk I/O (in-memory)
• Avoidance of random I/O
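Just-in-time (late) materialization, mentioned above, can be illustrated as: evaluate the filter against one column first, then fetch the remaining columns only for the qualifying row positions. A hypothetical example, not InfiniDB code:

```python
def late_materialize(filter_col, other_cols, predicate):
    """Filter one column, then materialize other columns for hits only."""
    hits = [i for i, v in enumerate(filter_col) if predicate(v)]
    return [{name: col[i] for name, col in other_cols.items()}
            for i in hits]                    # I/O only for matching rows

amount = [10, 500, 20, 900, 30]
rows = late_materialize(amount,
                        {"id": [1, 2, 3, 4, 5]},
                        lambda v: v > 100)
# rows == [{"id": 2}, {"id": 4}]
```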
18. (My)SQL for Hadoop
• Leverage existing tools that connect to MySQL (e.g., MicroStrategy, JasperSoft, Pentaho)
• Expose structured data to the business
• Familiar user privilege administration
MySQL ease of use + Hadoop scale + columnar performance
19. Syntax Support
• Broad MySQL SQL syntax supported
• Analytic/windowing functions included with InfiniDB 4
• No indexing needed; partitioning is automatic
[Diagram: InfiniDB supported syntax]
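A minimal example of the standard analytic/window-function syntax referred to above, run here against SQLite purely for illustration (not against InfiniDB; requires SQLite >= 3.25, bundled with modern Python, for window-function support):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 10), ("EU", 30), ("US", 20)])

# SUM() OVER (PARTITION BY ...) computes a per-region total on every row
# without collapsing the rows, unlike a plain GROUP BY.
rows = con.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales ORDER BY region, amount
""").fetchall()
# rows == [("EU", 10, 40), ("EU", 30, 40), ("US", 20, 20)]
```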
20. When to Use InfiniDB for Hadoop
Query size (vision/scope) defines workloads:
[Chart: query size/vision/scope spectrum from 1 to 10,000,000,000 rows, with OLTP/NoSQL workloads at the small end and ROLAP/analytic/reporting workloads at the large end]
General-purpose DBMSs missed the target (dated database technology is generally not optimal).
21. What is your typical query?
[Chart: query vision/scope spectrum from 1 to 10,000,000,000 rows, with OLTP/NoSQL workloads at the small end and analytic workloads at the large end]
• There is no “average” query.
• The challenges are at the extremes:
o The challenge of high concurrency levels with small queries.
o The challenge of latency for very large queries.
• Most use cases imply multiple data technologies.
22. Columnar Appropriate Workloads
[Chart: query vision/scope spectrum from 1 to 10,000,000,000 rows]
• OLTP/NoSQL workloads: pure columnar has about 10x worse I/O for single-record lookups.
• ROLAP/analytic/reporting workloads: pure columnar has about 10x better I/O for large data access patterns.
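The back-of-envelope arithmetic behind the "10x" figures above can be made explicit. The numbers here are illustrative assumptions (actual ratios depend on schema width and block sizes):

```python
COLUMNS = 100          # assumed columns in the table
COLUMNS_QUERIED = 10   # assumed columns touched by an analytic query

# Single-record lookup: a row store reads ~1 block for the whole row;
# a pure column store reads one block per column it must assemble.
row_store_blocks = 1
column_store_blocks = COLUMNS_QUERIED
lookup_penalty = column_store_blocks / row_store_blocks   # ~10x worse

# Large scan: a row store reads every column of every row; a column
# store reads only the queried columns.
scan_benefit = COLUMNS / COLUMNS_QUERIED                  # ~10x better
```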
23. Columnar Appropriate Workloads
Data Dimensions and InfiniDB for Hadoop
[Diagram: data dimensions — unstructured vs. structured data, schema on read vs. schema on write, small queries vs. large queries, transform (ETL) vs. targeted extract, pre-defined vs. ad-hoc queries]
24. InfiniDB Query Performance – Percona
Star Schema Benchmark (SSB)
[Chart: SSB query performance]
• Q1 series: 2-table joins
• Q2 series: 3-table joins
• Q3 series: 4-table joins
• Q5 series: 5-table joins
25. 1000 Genomes Data Set – 289 Billion Rows
• Fast load rate: millions of rows/sec, billions of rows/hour
• Scalable load rate
• 1000 Genomes data set on AWS
26. 1000 Genomes Data Set – ~24 trillion base nucleotide values
• Scaling: 4 –> 8 –> 16 Performance Modules
• Fast analytics: millions of rows/second
• Scalable analytics: automatic parallelism
[Chart: seconds per core vs. Performance Modules (PMs) active]
Figure 2 – TATA Binding Protein. Source: http://en.wikipedia.org/wiki/TATA_binding_protein
27. Impala-InfiniDB Benchmark (Piwik Data Set)
Piwik is an open-source alternative to Google Analytics.
• Queries 1–6 are Piwik production queries
• Queries 7–9 are additional ad-hoc queries covering all data
• Run on an Amazon 5-node cluster
Figure 1 – Piwik Standard Query Performance [chart]
Figure 2 – Piwik Ad-Hoc Query Performance [chart]
28. Columnar Appropriate Workloads
Data Dimensions and InfiniDB for Hadoop
[Diagram: the data-dimensions chart from slide 23 (structured vs. unstructured, schema on read vs. schema on write, small vs. large queries, transform (ETL) vs. targeted extract, pre-defined vs. ad-hoc queries), with InfiniDB positioned on it]