3. Agenda
• Apache Drill overview
• Key features
• Status and progress
• Discuss potential use cases and cooperation
4. Big Data Workloads
• ETL
• Data mining
• Blob store
• Lightweight OLTP on large datasets
• Index and model generation
• Web crawling
• Stream processing
• Clustering, anomaly detection and classification
• Interactive analysis
5. Example Problem
• Jane works as an analyst at an e-commerce company
• How does she figure out good targeting segments for the next marketing campaign?
• She has some ideas and lots of data
[Diagram: her data sources are transaction information, user profiles and access logs]
6. Solving the Problem with Traditional Systems
• Use an RDBMS
  – ETL the data from MongoDB and Hadoop into the RDBMS
    • MongoDB data must be flattened, schematized, filtered and aggregated
    • Hadoop data must be filtered and aggregated
  – Query the data using any SQL-based tool
• Use MapReduce
  – ETL the data from Oracle and MongoDB into Hadoop
  – Work with the MapReduce team to generate the desired analyses
• Use Hive
  – ETL the data from Oracle and MongoDB into Hadoop
    • MongoDB data must be flattened and schematized
  – But HiveQL is limited, queries take too long and BI tool support is limited
• Challenges: data movement, loss of nesting structure, latency
7. WWGD
         Distributed File System | NoSQL    | Interactive analysis | Batch processing
  Google GFS                     | BigTable | Dremel               | MapReduce
  Hadoop HDFS                    | HBase    | ???                  | MapReduce

Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
8. Google Dremel
• Interactive analysis of large-scale datasets
  – Trillion records at interactive speeds
  – Complementary to MapReduce
  – Used by thousands of Google employees
  – Paper published at VLDB 2010
• Model
  – Nested data model with schema
    • Most data at Google is stored/transferred in Protocol Buffers
    • Normalization (to relational) is prohibitive
  – SQL-like query language with nested data support
• Implementation
  – Column-based storage and processing
  – In-situ data access (GFS and Bigtable)
  – Tree architecture as in Web search (and databases)
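Dremel's column-striping can be sketched in a few lines. This is a toy simplification of my own: real Dremel also stores repetition and definition levels so arbitrary nesting can be reassembled losslessly, while this version handles only flat fields and one level of repeated sub-records.

```python
# Toy column-striping: shred nested records into per-path columns.
# Simplified: no repetition/definition levels are tracked, so only
# flat fields and one level of repeated sub-records are handled.
def shred(records):
    columns = {}
    def emit(path, value):
        columns.setdefault(path, []).append(value)
    for rec in records:
        for key, value in rec.items():
            if isinstance(value, list):            # repeated sub-record
                for item in value:
                    for k2, v2 in item.items():
                        emit(f"{key}.{k2}", v2)
            else:
                emit(key, value)
    return columns

records = [
    {"name": "Homer", "followers": 100,
     "children": [{"name": "Bart"}, {"name": "Lisa"}]},
    {"name": "Ned", "followers": 5},
]
cols = shred(records)
# A query touching only 'followers' now reads one contiguous column.
print(cols["followers"])      # [100, 5]
print(cols["children.name"])  # ['Bart', 'Lisa']
```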
9. Innovations
• MapReduce
  – Highly parallel algorithms running on commodity systems can deliver real value at reasonable cost
  – Scalable IO and compute trumps efficiency with today's commodity hardware
  – With many datasets, schemas and indexes are limiting
  – Flexibility is more important than efficiency
  – An easy, scalable, fault tolerant execution framework is key for large clusters
• Dremel
  – Columnar storage provides significant performance benefits at scale
  – Columnar storage with nesting preserves structure and can be very efficient
  – Avoiding final record assembly as long as possible improves efficiency
  – Optimizing for the query use case can avoid the full generality of MR and thus significantly reduce latency. E.g., no need to start JVMs, just push compact queries to running agents.
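The columnar-storage benefit can be shown with a minimal sketch (illustrative only, not Drill code): an aggregate over one field scans a single array instead of touching every record.

```python
# Row layout vs column layout: an aggregate over one field touches
# every row object in row layout, but only one array in column layout.
rows = [{"user": f"u{i}", "amount": i, "country": "US"} for i in range(5)]

# Row-oriented: each whole record is visited to read one field.
total_row = sum(r["amount"] for r in rows)

# Column-oriented: the same data stored as one array per field.
columns = {"user": [r["user"] for r in rows],
           "amount": [r["amount"] for r in rows],
           "country": [r["country"] for r in rows]}
total_col = sum(columns["amount"])   # only this array is scanned

assert total_row == total_col == 10
```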
10. Apache Drill Overview
• Inspired by Google Dremel/BigQuery … more ambitious
• Interactive analysis of Big Data using standard SQL
• Fast
  – Low latency queries
  – Columnar execution
  – Complement native interfaces and MapReduce/Hive/Pig
• Open
  – Community driven open source project
  – Under Apache Software Foundation
• Modern
  – Standard ANSI SQL:2003 (select/into)
  – Nested/hierarchical data support
  – Schema is optional
  – Supports RDBMS, Hadoop and NoSQL
  – Extensible
[Diagram: data analysts run interactive queries and reporting through Apache Drill (100 ms-20 min), while MapReduce/Hive/Pig serve data mining, modeling and large ETL (20 min-20 hr)]
11. How Does It Work?
• Drillbits run on each node, designed to maximize data locality
• Processing is done outside the MapReduce paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization, scheduling, and execution are distributed

Example query spanning multiple data sources:
  SELECT * FROM
    oracle.transactions,
    mongo.users,
    hdfs.events
  LIMIT 1
12. Key Features
• Full SQL (ANSI SQL:2003)
• Nested data
• Schema is optional
• Flexible and extensible architecture
13. Full SQL (ANSI SQL:2003)
• Drill supports standard ANSI SQL:2003
  – Correlated subqueries, analytic functions, …
  – SQL-like is not enough
• Use any SQL-based tool with Apache Drill
  – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
  – Standard ODBC and JDBC drivers
[Diagram: clients (Tableau, MicroStrategy, Excel, SAP Crystal Reports) connect through ODBC/JDBC drivers to a Drillbit, whose SQL query parser and query planner distribute work to Drill workers on other Drillbits]
14. Nested Data
• Nested data is becoming prevalent
  – JSON, BSON, XML, Protocol Buffers, Avro, etc.
  – The data source may or may not be aware
• MongoDB supports nested data natively
• A single HBase value could be a JSON document (compound nested type)
  – Google Dremel's innovation was efficient columnar storage and querying of nested data
• Flattening nested data is error-prone and very difficult
• Apache Drill supports nested data
  – Extensions to ANSI SQL:2003

JSON example:
  {
    "name": "Homer",
    "gender": "Male",
    "followers": 100,
    "children": [
      {"name": "Bart"},
      {"name": "Lisa"}
    ]
  }

Avro example:
  enum Gender {
    MALE, FEMALE
  }
  record User {
    string name;
    Gender gender;
    long followers;
  }
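As a rough illustration of what querying nested data looks like, here is a hypothetical path evaluator of my own (not Drill's actual SQL extension syntax) that resolves a dotted path with array indexing over a nested record:

```python
# Hypothetical path evaluator for nested records, in the spirit of
# SQL extensions that select e.g. children[1].name from a record.
def get_path(record, path):
    cur = record
    for part in path.split("."):
        if "[" in part:                         # e.g. "children[1]"
            name, idx = part[:-1].split("[")
            cur = cur[name][int(idx)]
        else:
            cur = cur[part]
    return cur

homer = {"name": "Homer", "gender": "Male", "followers": 100,
         "children": [{"name": "Bart"}, {"name": "Lisa"}]}
print(get_path(homer, "children[1].name"))   # Lisa
print(get_path(homer, "followers"))          # 100
```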
15. Schema is Optional
• Many data sources do not have rigid schemas
  – Schemas change rapidly
  – Each record may have a different schema
    • Sparse and wide rows in HBase and Cassandra, MongoDB
• Apache Drill supports querying against unknown schemas
  – Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it automatically
  – System of record may already have schema information
    • Why manage it in a separate system?
  – No need to manage schema evolution

Example HBase table:
  Row Key           | CF contents               | CF anchor
  "com.cnn.www"     | contents:html = "<html>…" | anchor:my.look.ca = "CNN.com"
                    |                           | anchor:cnnsi.com = "CNN"
  "com.foxnews.www" | contents:html = "<html>…" | anchor:en.wikipedia.org = "Fox News"
  …                 | …                         | …
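Automatic schema discovery of the kind described above can be sketched roughly (an illustrative toy of my own, not Drill's implementation): scan sample records that may each carry different fields, and merge the observed per-field types.

```python
# Toy schema discovery: records need not share fields; the inferred
# schema is the union of fields with the set of types seen for each.
def discover_schema(records):
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(types) for f, types in schema.items()}

records = [
    {"row_key": "com.cnn.www", "contents:html": "<html>...",
     "anchor:my.look.ca": "CNN.com"},
    {"row_key": "com.foxnews.www", "contents:html": "<html>...",
     "anchor:en.wikipedia.org": "Fox News"},
]
schema = discover_schema(records)
print(schema)
```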
16. Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
  – Implement a custom scanner to support a new data source or file format
• Query languages
  – SQL:2003 is the primary language
  – Implement a custom parser to support a Domain Specific Language
• Optimizers
  – Drill will have a cost-based optimizer
  – Clear surrounding APIs support easy optimizer exploration
• Operators
  – Custom operators can be implemented
    • Special operators for Mahout (k-means) being designed
  – Operator push-down to data source (RDBMS)
17. Architecture
• Only the execution engine knows the physical attributes of the cluster
  – # nodes, hardware, file locations, …
• Public interfaces enable extensibility
  – Developers can build parsers for new query languages
  – Developers can provide an execution plan directly
• Each level of the plan has a human readable representation
  – Facilitates debugging and unit testing
18. Status: In Progress
• Heavy active development by multiple organizations
• Available
  – Logical plan syntax and interpreter
  – Reference interpreter
• In progress
  – SQL interpreter
  – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
• Significant community momentum
  – Over 200 people on the Drill mailing list
  – Over 200 members of the Bay Area Drill User Group
  – Drill meetups across the US and Europe
  – OpenDremel team joined Apache Drill
• Anticipated schedule:
  – Prototype: Q1
  – Alpha: Q2
  – Beta: Q3
19. Why Apache Drill Will Be Successful
Resources
• Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho
Community
• Development done in the open
• Active contributors from multiple companies
• Rapidly growing
Architecture
• Full SQL
• New data support
• Extensible APIs
• Full Columnar Execution
• Beyond Hadoop
20. MapRâs Innovations for Hadoop
• NFS direct access
  – Makes the Hadoop file system look like any file system
  – Simplifies access to data in a Hadoop cluster
  – Enables non-Hadoop programs to access the data – you know, the existing important applications you already have!
• Transparent compression
  – Saves space and thus $$$
• Web, command line, and REST based management tools
  – Reduces the burden on your admin teams
21. MapRâs Innovations for Hadoop
• Eliminates single points of failure
  – Self healing with automated stateful failover
• Protects your data
  – Snapshots for point-in-time data protection and recovery
  – Mirroring for business continuity includes wide area replication support
• More scalable
  – Central NameNode eliminated
  – Hundreds of billions of files/cluster – over a billion/node
  – File creation rates of over 1000/sec/node
22. MapRâs Innovations for Hadoop
• Speeds jobs by up to 4X
  – 50% - 400% faster than other Hadoop distributions depending on benchmark and hardware
  – Google and MapR demonstrated a Terasort world record
    • http://www.mapr.com/mapr-google
• How did we do it?
  – Lots of C/C++ to avoid Java overhead
  – Raw disk IO
  – Application-level NIC bonding
  – Numerous other optimizations in key components
23. MapR in the Cloud
• Available as a service with Google Compute Engine
• Available as a service with Amazon Elastic MapReduce (EMR)
  – http://aws.amazon.com/elasticmapreduce/mapr
24. Three Editions
• All Hadoop API compatible, majority open source components, full Hadoop stack
• M3
  – Faster, easier to use, better integration
• M5
  – Improved reliability and dependability
• M7
  – HBase APIs on a more performant, scalable, and dependable platform (MapR Data Platform)
25. Questions?
• What problems can Drill solve for you?
• Where does it fit in the organization?
• Which data sources and BI tools are important to you?
29. How Does Impala Fit In?
Impala Strengths
• Beta currently available
• Easy install and setup on top of Cloudera
• Faster than Hive on some queries
• SQL-like query language

Questions
• Open Source "Lite"
• Doesn't support RDBMS or other NoSQLs (beyond Hadoop/HBase)
• Early row materialization increases footprint and reduces performance
• Limited file format support
• Query results must fit in memory!
• Rigid schema is required
• No support for nested data
• Compound APIs restrict optimizer progression
• SQL-like (not SQL)

Many important features are "coming soon". Architectural foundation is constrained. No community development.
30. Why Not Leverage MapReduce?
• Scheduling Model
  – Coarse resource model reduces hardware utilization
  – Acquisition of resources typically takes 100s of milliseconds to seconds
• Barriers
  – Map completion required before shuffle/reduce commencement
  – All maps must complete before reduce can start
  – In chained jobs, one job must finish entirely before the next one can start
• Persistence and Recoverability
  – Data is persisted to disk between each barrier
  – Serialization and deserialization are required between execution phases
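The cost of these barriers can be shown with a toy latency model (idealized numbers of my own, not a benchmark): with a barrier, each phase waits for its slowest task, while a pipelined engine can overlap phases and hide downstream work behind a straggler.

```python
# Toy latency model: barrier vs (idealized) pipelined execution.
map_times    = [4, 5, 9]   # seconds per map task (one straggler)
reduce_times = [3, 3, 3]

# Barrier execution (MapReduce-style): reduce starts only after ALL maps.
barrier_latency = max(map_times) + max(reduce_times)

# Idealized pipelining: a reducer can start consuming as soon as the
# first map output appears, hiding reduce work behind the straggler.
pipelined_latency = max(max(map_times), min(map_times) + max(reduce_times))

print(barrier_latency, pipelined_latency)   # 12 9
```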
Editor's Notes
With the recent explosion of everything related to Hadoop, it is no surprise that new projects/implementations related to the Hadoop ecosystem keep appearing. There have been quite a few initiatives that provide SQL interfaces into Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel. Drill is not trying to replace existing Big Data batch processing frameworks, such as Hadoop MapReduce, or stream processing frameworks, such as S4 or Storm. It rather fills the existing void: real-time interactive processing of large data sets.

Technical detail: Similar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers, a nested, schema-based data model. Drill is planning to extend this data model by adding additional schema-based implementations, for example Apache Avro, and schema-less data models such as JSON and BSON. In addition to a single data structure, Drill is also planning to support "baby joins": joins to small data structures that are loadable in memory.
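The "baby joins" mentioned above are joins against a side small enough to hold in memory. A minimal sketch (my own illustration, not Drill code) is a broadcast hash join: build a hash table from the small side, then stream the large side past it once.

```python
# Sketch of a "baby join" as a broadcast hash join: the small side is
# loaded into an in-memory hash table; the large side is streamed once.
def baby_join(large, small, key):
    lookup = {}
    for row in small:                  # small side: fits in memory
        lookup.setdefault(row[key], []).append(row)
    for row in large:                  # large side: streamed once
        for match in lookup.get(row[key], []):
            yield {**row, **match}     # merge matching rows

users  = [{"uid": 1, "name": "Jane"}, {"uid": 2, "name": "Homer"}]
events = [{"uid": 1, "page": "/cart"},
          {"uid": 1, "page": "/buy"},
          {"uid": 3, "page": "/"}]
joined = list(baby_join(events, users, "uid"))
print(joined)   # two rows, both for uid 1, carrying name "Jane"
```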