Hadoop Summit - Hausenblas 20 March

Understanding the value and
architecture of Apache Drill
Michael Hausenblas, Chief Data Engineer EMEA, MapR
Hadoop Summit, Amsterdam, 2013-03-20
1

Kudos to http://cmx.io/
2

Workloads
• Batch processing (MapReduce)
• Light-weight OLTP (HBase, Cassandra, etc.)
• Stream processing (Storm, S4)
• Search (Solr, Elasticsearch)
• Interactive, ad-hoc query and analysis (?)

3

Interactive Query at Scale

[Figure labels: Impala; low-latency]
4

Use Case
• Jane, a marketing analyst
• Determine target segments
• Data from different sources

5

Today’s Solutions
• RDBMS-focused
– ETL data from MongoDB and Hadoop
– Query data using SQL

• MapReduce-focused
– ETL from RDBMS and MongoDB
– Use Hive, etc.

6

Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable and fast
• Reliable

7

Google’s Dremel

http://research.google.com/pubs/pub36632.html

8

Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community-driven, open, hundreds involved

9

Apache Drill Overview

10

High-level Architecture

11

• Each node runs a Drillbit, to maximize data locality
• Coordination, query planning, execution, etc. are distributed
• By default Drillbits hold all roles
• Any node can act as endpoint for a query

[Diagram: four nodes, each hosting a Drillbit next to a storage process]

12

• ZooKeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.

[Diagram: ZooKeeper above four nodes, each with a distributed cache]

13

• Originating Drillbit acts as foreman, manages query execution,
scheduling, locality information, etc.
• Data is streamed between Drillbits, avoiding serialization/deserialization (SerDe)

[Diagram: ZooKeeper above four nodes, each with a distributed cache]
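
The foreman's use of locality information can be illustrated with a small sketch. This is not Drill's scheduler: the function name, the block-to-node map and the greedy tie-breaking are all assumptions made for illustration. It only shows the idea of preferring a Drillbit that sits on a node already holding the data.

```python
from typing import Dict, List


def assign_fragments(block_locations: Dict[str, List[str]],
                     drillbits: List[str]) -> Dict[str, str]:
    """Greedy, hypothetical assignment of data blocks to Drillbits.

    Prefers a Drillbit on a node that holds the block; falls back to the
    least-loaded Drillbit when no holder runs one.
    """
    load = {d: 0 for d in drillbits}
    assignment: Dict[str, str] = {}
    for block, holders in sorted(block_locations.items()):
        local = [d for d in holders if d in load]
        candidates = local or drillbits
        # Least-loaded candidate first; node name breaks ties deterministically.
        chosen = min(candidates, key=lambda d: (load[d], d))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Made-up cluster layout: b3 lives on a node without a Drillbit.
locations = {"b1": ["node1", "node2"], "b2": ["node1"], "b3": ["node9"]}
assignment = assign_fragments(locations, ["node1", "node2", "node3"])
```

Here b1 and b2 are read where they are stored, while b3 has to be read remotely by whichever Drillbit is least busy.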

14

Principled Query Execution

Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution

Source query languages: SQL 2003, DrQL, MongoQL, DSL
Slide labels: parser API, topology, scanner API

Logical plan excerpt:

query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
15

Drillbit Modules
• RPC Endpoint
• Query front ends: SQL, HiveQL, Pig, Parser
• Plans: Logical Plan, Physical Plan
• Execution: Optimizer, Foreman, Scheduler, Operators
• Storage Engine Interface: DFS Engine, HBase Engine, Mongo
• Distributed Cache

16

Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points

17

Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
– Datameer, Tableau, Excel, SAP Crystal Reports
– Use standard ODBC/JDBC driver

18

Nested Data
• Nested data becoming prevalent
– JSON/BSON, XML, ProtoBuf, Avro
– Some data sources support it natively
(MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
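
Why flattening is error-prone can be seen in a small sketch. It uses the donut record from the Example slide; the function and column names are hypothetical and not part of Drill.

```python
from typing import Any, Dict, Iterator


def flatten_batters(record: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per nested batter, repeating the parent fields."""
    parent = {k: v for k, v in record.items() if k != "batters"}
    for batter in record.get("batters", {}).get("batter", []):
        yield {**parent, "batter_id": batter["id"], "batter_type": batter["type"]}


donut = {
    "id": "0001", "type": "donut", "ppu": 0.55,
    "batters": {"batter": [
        {"id": "1001", "type": "Regular"},
        {"id": "1002", "type": "Chocolate"},
    ]},
}

rows = list(flatten_batters(donut))
# Pitfall: the parent value is now duplicated, so a naive sum counts it twice.
naive_total = sum(r["ppu"] for r in rows)
```

One donut becomes two rows, so any aggregate over a parent field such as ppu is silently inflated. A record with an empty batter list would disappear from the flattened output altogether.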

19

Optional Schema
• Many data sources don’t have rigid schemas
– Schema changes rapidly
– Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
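
Schema discovery can be sketched as a single pass that collects the fields and value types actually seen in the data. This is only an illustration of the idea, not Drill's mechanism; the function name and the sample records are made up.

```python
from typing import Any, Dict, Iterable, Set


def discover_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    schema: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical records with a different shape per record.
records = [
    {"id": "0001", "ppu": 0.55},
    {"id": "0002", "ppu": "n/a", "topping": "Glazed"},
]
schema = discover_schema(records)
```

The result shows that ppu holds both floats and strings and that topping is present only in some records, which is the kind of information a query engine needs when no schema was declared up front.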

20

Extensibility Points
• Source query – parser API
• Custom operators, UDF – logical plan
• Optimizer
• Data sources and formats – scanner API

Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
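
The "data sources and formats" extension point can be pictured as a registry of scanner functions keyed by source name. The sketch below is an assumption about the general shape of such a plug-in mechanism; it is not Drill's scanner API (which is Java), and every name in it is hypothetical.

```python
from typing import Any, Callable, Dict, Iterator, List

Record = Dict[str, Any]
Scanner = Callable[[Dict[str, Any]], Iterator[Record]]

# Hypothetical registry: source name -> function producing records.
SCANNERS: Dict[str, Scanner] = {}


def register_scanner(name: str) -> Callable[[Scanner], Scanner]:
    """Decorator that plugs a scanner in under the given source name."""
    def decorator(fn: Scanner) -> Scanner:
        SCANNERS[name] = fn
        return fn
    return decorator


@register_scanner("in-memory")
def scan_in_memory(selection: Dict[str, Any]) -> Iterator[Record]:
    """Toy data source that serves rows passed in through the selection."""
    yield from selection["rows"]


def scan(source: str, selection: Dict[str, Any]) -> List[Record]:
    """Dispatch a scan to whichever scanner is registered for the source."""
    if source not in SCANNERS:
        raise KeyError(f"no scanner registered for source {source!r}")
    return list(SCANNERS[source](selection))


rows = scan("in-memory", {"rows": [{"x": 1}, {"x": 5}]})
```

Adding a new data source then means registering one more function; the code that executes a plan does not change.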

21

… and Hadoop?
• HDFS can be a data source

• Complementary use cases …

• … use Apache Drill
– Find record with specified condition
– Aggregation under dynamic conditions

• … use MapReduce
– Data mining with multiple iterations
– ETL
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
22

Example
data source: donuts.json

{
  "id": "0001",
  "type": "donut",
  "ppu": 0.55,
  "batters":
  {
    "batter":
    [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      …

logical plan: simple_plan.json

query:[ {
  op:"sequence",
  do:[
    {
      op: "scan",
      ref: "donuts",
      source: "local-logs",
      selection: {data: "activity"}
    },
    {
      op: "filter",
      expr: "donuts.ppu < 2.00"
    },
    …

result: out.json

{
  "sales" : 700.0,
  "typeCount" : 1,
  "quantity" : 700,
  "ppu" : 1.0
}
{
  "sales" : 109.71,
  "typeCount" : 2,
  "quantity" : 159,
  "ppu" : 0.69
}
{
  "sales" : 184.25,
  "typeCount" : 2,
  "quantity" : 335,
  "ppu" : 0.55
}

https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
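
To make the scan and filter steps of the logical plan concrete, here is a toy interpreter for just those two operators. It is a sketch under several assumptions and is not the Drill reference interpreter: the in-memory SOURCES table, the second donut record and the tiny expression parser are invented for illustration, the scan's selection field is omitted, and later steps elided on the slide are not modelled.

```python
import operator
from typing import Any, Dict, List

Record = Dict[str, Any]

# Hypothetical stand-in for a storage engine; record "0002" is made up.
SOURCES = {"local-logs": {"donuts": [
    {"id": "0001", "type": "donut", "ppu": 0.55},
    {"id": "0002", "type": "donut", "ppu": 2.50},
]}}

COMPARISONS = {"<": operator.lt, ">": operator.gt}

PLAN = {"op": "sequence", "do": [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "filter", "expr": "donuts.ppu < 2.00"},
]}


def run(plan: Dict[str, Any]) -> List[Record]:
    """Execute the steps of a 'sequence' in order, piping rows between them."""
    rows: List[Record] = []
    for step in plan["do"]:
        if step["op"] == "scan":
            rows = list(SOURCES[step["source"]][step["ref"]])
        elif step["op"] == "filter":
            # Only handles "<ref>.<field> <op> <number>" expressions.
            path, symbol, literal = step["expr"].split()
            field = path.split(".", 1)[1]
            compare = COMPARISONS[symbol]
            rows = [r for r in rows if compare(r[field], float(literal))]
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return rows


result = run(PLAN)
```

Only the record with ppu below 2.00 survives the filter. A real logical plan would carry further operators after the filter, which this sketch leaves out.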

23

Status
• Heavy development by multiple organizations

• Available
– Logical plan (ADSP)
– Reference interpreter
– Basic SQL parser
– Basic demo

24

Status
March/April

• Larger SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution focused on large clusters: high-performance
sort, aggregation and join
• Storage engine implementations (HBase, etc.)

25

Contributing
• Dremel-inspired columnar format: Twitter’s Parquet and
Hive’s ORC file

• Integration with Hive metastore (?)

• DRILL-13 Storage Engine: Define Java Interface

• DRILL-15 Build HBase storage engine implementation

26

Contributing
• DRILL-48 RPC interface for query submission and physical plan
execution

• DRILL-53 Setup cluster configuration and membership mgmt
system
– ZK for coordination
– Helix for partition and resource assignment (?)

• Further schedule
– Alpha Q2
– Beta Q3
27

Kudos to …
• Julian Hyde, Pentaho
• Timothy Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR

28

Engage!
• Follow @ApacheDrill on Twitter

• Sign up for the mailing lists (user|dev)
http://incubator.apache.org/drill/mailing-lists.html

• Learn where and how to contribute
https://cwiki.apache.org/confluence/display/DRILL/Contributing

• Keep an eye on http://drill-user.org/

29
