Mais conteúdo relacionado Semelhante a Self-Service Data Exploration with Apache Drill (20) Mais de MapR Technologies (20) Self-Service Data Exploration with Apache Drill1. ®
© 2014 MapR Technologies 1
®
© 2014 MapR Technologies
Tomer Shiran, VP Product Management
2. ®
© 2014 MapR Technologies 2
Empowering “as it happens”
businesses by speeding up the
data-to-action cycle
®
3. ®
© 2014 MapR Technologies 3
Top-Ranked NoSQL
Top-Ranked Hadoop
Distribution
Top-Ranked SQL-on-Hadoop
Solution
®
4. ®
© 2014 MapR Technologies 4
Data is doubling in
size every two years
5. ®
© 2014 MapR Technologies 5
SEMI-STRUCTURED
DATA
STRUCTURED DATA
1980 2000 20101990 2020
Unstructured data will account
for more than 80% of the data
collected by organizations
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
TotalDataStored
6. ®
© 2014 MapR Technologies 6
1980 2000 20101990 2020
Fixed schema
DBA controls structure
Dynamic schema (schema-free)
Application controls structure
NON-RELATIONAL DATASTORESRELATIONAL DATABASES
MBs-TBs TBs-PBsVolume
Database
Data Increasingly Stored in Non-Relational Datastores
Structure
Development
Structured Structured, semi-structured and unstructured
Planned (release cycle = months-years) Iterative (release cycle = days-weeks)
7. ®
© 2014 MapR Technologies 7
SQL in a Non-Relational World
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
• Create and maintain schemas on:
– HDFS (Parquet, JSON, etc.)
– HBase
– …
• Transform or copy data
2 DON’T WANT WANT
8. ®
© 2014 MapR Technologies 8
• Data exploration for Hadoop and NoSQL
• Low latency at scale
• Point-and-query vs. schema-first
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
APACHE DRILL
Agility
• Explore big data in its
native format without IT
intervention
Flexibility
• Analyze semi-structured/
nested data coming from
NoSQL applications and
JSON sources
Plug-and-Play
• Leverage existing SQL
skillsets, BI tools and
Apache Hive
deployments
9. ®
© 2014 MapR Technologies 9
Drill is the Top-Ranked SQL-on-Hadoop
10. ®
© 2014 MapR Technologies 10
Evolution Towards Self-Service Data Exploration
Schema
Management and
Transformation
Data Visualization
IT-dependent
IT-dependent
IT-dependent
Self-service
IT-dependent
Self-service
Self-service
Self-service
Traditional BI
w/ RDBMS
Self-Service BI
w/ RDBMS
“Gen 1” SQL-on-
Hadoop
Self-Service
Data Exploration
Zero-day analytics
11. ®
© 2014 MapR Technologies 11
Drill’s Role in the Enterprise Data Architecture
Raw data
• JSON, CSV, ...
“Optimized” data
• Parquet, …
Centrally-structured
data
• Schemas in Hive
Metastore
Relational data
• Highly-structured
data
Hive, Impala, Spark SQL
Oracle, Teradata
Exploration
(known and unknown questions)
12. ®
© 2014 MapR Technologies 12
Leverage Existing SQL Tools and Skills
Leverage SQL-compatible tools (BI, query
builders, etc.) via Drill’s standard ODBC,
JDBC and ANSI SQL support
Enable business analysts, technical
analysts and data scientists to explore and
analyze large volumes of real-time data
13. ®
© 2014 MapR Technologies 13
A storage plugin instance
- DFS
- Hbase/MapRDB
- Hive Metastore/HCatalog
A workspace
- Sub-directory
- Hive database
- HBase namespace
A table
- pathnames
- HBase table
- Hive table
Combine Data from Multiple Sources on the Fly
SELECT
*
FROM
dfs.demo.`yelp/business.json`
!
14. ®
© 2014 MapR Technologies 14© 2014 MapR Technologies
®
How Drill Achieves Data Agility
15. ®
© 2014 MapR Technologies 15
Drill’s Data Model is Flexible
JSON
BSON
HBase
Parquet/Avro
CSV
TSV
Schema-lessFixed schema
Complex
Flat
Flexibility
Flexibility
Name! Gender! Age!
Michael! M! 6!
Jennifer! F! 3!
{!
name: {!
first: Michael,!
last: Smith!
},!
hobbies: [ski, soccer],!
district: Los Altos!
}!
{!
name: {!
first: Jennifer,!
last: Gates!
},!
hobbies: [sing],!
preschool: CCLC!
}!
RDBMS/SQL-on-Hadoop table
Apache Drill table
16. ®
© 2014 MapR Technologies 16
Drill Supports Schema Discovery On-The-Fly
• Fixed schema
• Leverage schema in centralized
repository (Hive Metastore)
• Fixed schema, evolving schema or
schema-less
• Leverage schema in centralized
repository or self-describing data
2Schema Discovered On-The-FlySchema Declared In Advance
SCHEMA ON
WRITE
SCHEMA
BEFORE READ
SCHEMA ON THE
FLY
18. ®
© 2014 MapR Technologies 18
Q&A
@mapr maprtech
tshiran@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
19. ®
© 2014 MapR Technologies 19© 2014 MapR Technologies
®
Under the Hood
20. ®
© 2014 MapR Technologies 20
High Level Architecture
Cluster of commodity servers
– Daemon (drillbit) on each node
ZooKeeper maintains ephemeral cluster membership information
– Drillbit uses ZooKeeper to find other drillbits in the cluster
– Client uses ZooKeeper to find drillbits
Built-in, optimistic query execution engine. Doesn’t require a
particular storage or execution system (MapReduce, Spark, Tez)
– Better performance and manageability
Data processing unit is columnar record batches
– Enables schema flexibility with negligible performance impact
21. ®
© 2014 MapR Technologies 21
Drill Maximizes Data Locality
Data Source Best Practice
HDFS or MapR-FS drillbit on each DataNode
HBase or MapR-DB drillbit on each RegionServer
MongoDB drillbit on each mongod node (when using replicas, run it on the replica node)
drillbit
DataNode/
RegionServer/
mongod
drillbit
DataNode/
RegionServer/
mongod
drillbit
DataNode/
RegionServer/
mongod
ZooKeeper
ZooKeeper
ZooKeeper
…
22. ®
© 2014 MapR Technologies 22
Core Modules within drillbit
SQL Parser
Hive
HBase
StoragePlugins
MongoDB
DFS
PhysicalPlan
ExecutionLogicalPlan Optimizer
23. ®
© 2014 MapR Technologies 23
SELECT * Query Execution
drillbit
ZooKeeper
Client
(JDBC, ODBC,
REST)
1. Find drillbits
(once per session)
3. Create logical and physical execution plans
4. Farm out execution of fragments to cluster
(completely distributed execution)
ZooKeeper
ZooKeeper
drillbit
drillbit
2. Submit query to
drillbit
5. Return results
to client
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4