2. Big Data = Terabytes, Petabytes, …
Image Credit: Gizmodo
© Hortonworks 2013
Page 2
3. But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit
2012 on calculations Twitter is doing via UDFs in Pig.
This equation uses stochastic gradient descent to do
machine learning with their data:
w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
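The update is easier to see in code. A minimal sketch of one stochastic gradient descent step, choosing logistic regression as the model f and log loss as ℓ (the slide does not say which model or loss Twitter uses, so both are assumptions here):

```python
import math

def sgd_step(w, x, y, gamma):
    """One SGD update: w(t+1) = w(t) - gamma(t) * gradient of the loss."""
    # f(x; w): a logistic-regression prediction on the single example x
    z = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1.0 / (1.0 + math.exp(-z))
    # For log loss, the gradient of l(f(x; w), y) with respect to w is (pred - y) * x
    return [wi - gamma * (pred - y) * xi for wi, xi in zip(w, x)]

# Starting from w = 0 the prediction is 0.5, so the step moves w toward the label y = 1
w_next = sgd_step([0.0, 0.0], [1.0, 2.0], 1.0, 0.1)
```

Each call consumes one training example, which is what makes the method "stochastic" and a natural fit for a UDF applied record by record.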
4. And New Tools
• Apache Hadoop brings with it a large selection of tools
and paradigms
– Apache HBase, Apache Cassandra – Distributed, high-volume
reads and writes of individual data records
– Apache Hive - SQL
– Apache Pig, Cascading – Data flow programming for ETL, data
modeling, and exploration
– Apache Giraph – Graph processing
– MapReduce – Batch processing
– Storm, S4 – Stream processing
– Plus lots of commercial offerings
5. Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g.
SAS).
[Diagram: separate silos – Data Mart, Statistical Analysis, Data Warehouse, Cube/MOLAP, and OLTP – each on its own machine]
6. Cloud: Many Tools, One Platform
• Users no longer want to be concerned with what platform their data is in – just
apply the tool to it
• SQL no longer the only or primary data access tool
[Diagram: Statistical Analysis, Data Mart, Data Warehouse, Cube/MOLAP, and OLTP tools all operating on one shared data platform]
7. Upside – Pick the Right Tool for the Job
8. Downside – Tools Don’t Play Well Together
• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user-defined function interfaces
9. Downside – Wasted Developer Time
• Wastes developer time since each tool re-implements the
same functionality
[Diagram: Pig and Hive stacks side by side – each has its own parser, optimizer, physical planner, and executor; Hive also keeps metadata]
10. Downside – Wasted Developer Time
• Wastes developer time since each tool re-implements the
same functionality
[Diagram: the same Pig and Hive stacks, with the overlapping parser, optimizer, physical planner, and executor components highlighted]
11. Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
12. Hadoop = Distributed Data Operating System
Service – Hadoop Component
– Table Management: Hive
– Access To Metadata: HCatalog
– User authentication: Knox
– Resource management: YARN
– Notification: HCatalog
– REST/Connectors: WebHCat, WebHDFS, Hive, HBase, Oozie
– Relational data processing: Tez
(Color legend on the original slide: Exists / Pieces exist in this component / New Project)
14. HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside
Hadoop
• Presents tools with a table paradigm that abstracts away
storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
15. HCatalog – Table Management
[Diagram: Hive accessing its tables through the metastore]
16. HCatalog – Table Management
[Diagram: Pig (via HCatLoader) and MapReduce (via HCatInputFormat) now access the same tables through the Hive metastore]
17. HCatalog – Table Management
[Diagram: external systems reach the same tables over REST through WebHCat, alongside Hive, Pig (HCatLoader), and MapReduce (HCatInputFormat), all going through the metastore]
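The REST path into HCatalog is concrete enough to sketch. WebHCat (Templeton) exposes table metadata at a fixed URL shape; the host, user, and table names below are hypothetical, and 50111 is WebHCat's default port:

```python
from urllib.parse import urlencode

def webhcat_table_url(host, db, table, user, port=50111):
    """Build the WebHCat (Templeton) URL that describes a Hive/HCatalog table."""
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}/templeton/v1/ddl/database/{db}/table/{table}?{query}"

# An HTTP GET on this URL returns the table's schema and location as JSON,
# without the caller needing any knowledge of the underlying storage format
url = webhcat_table_url("hadoop-gw.example.com", "default", "web_logs", "alice")
```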
18. Tez – Moving Beyond MapReduce
• Low level data-processing execution engine
• Serves as the base for MapReduce, Hive, Pig, Cascading,
etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of
the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN
19. Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: Pig/Hive on MR – the query runs as three jobs (Job 1 → Job 2 → Job 3) with an I/O synchronization barrier between each pair]
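What those three jobs compute can be sketched in plain Python (the tables a, b, c and their columns come from the query above; the row values are made up for illustration):

```python
from collections import defaultdict

a = [{"id": 1, "itemId": 10, "state": "CA"}, {"id": 2, "itemId": 11, "state": "CA"}]
b = [{"id": 1}, {"id": 2}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]

# Job 1: join a with b on id (a shuffle keyed on id)
b_ids = {row["id"] for row in b}
ab = [row for row in a if row["id"] in b_ids]

# Job 2: join the result with c on itemId (a second shuffle, keyed on itemId)
prices = {row["itemId"]: row["price"] for row in c}
abc = [(row["state"], prices[row["itemId"]]) for row in ab if row["itemId"] in prices]

# Job 3: group by state, computing COUNT(*) and the average price (a third shuffle)
groups = defaultdict(list)
for state, price in abc:
    groups[state].append(price)
result = {s: (len(p), sum(p) / len(p)) for s, p in groups.items()}
```

Each re-keying step is a shuffle, and under plain MapReduce each shuffle boundary becomes a separate job whose intermediate output is written to HDFS.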
20. Pig/Hive-MR versus Pig/Hive-Tez
[Diagram: on MR the same three jobs run with I/O synchronization barriers between them; on Tez the whole query runs as a single job with no barriers]
21. FastQuery: Beyond Batch with YARN
• Tez generalizes MapReduce – simplified execution plans process data more efficiently
• Always-on Tez service – low-latency processing for all Hadoop data processing
23. Today’s Access Options
• Direct Access
– Access Services via REST (WebHDFS, WebHCat)
– Need knowledge of and access to whole cluster
– Security handled by each component in the cluster
– Kerberos details exposed to users
[Diagram: user → {REST} → Hadoop cluster]
• Gateway / Portal Nodes
– Dedicated nodes behind the firewall
– Users SSH to the node to access Hadoop services
[Diagram: user → SSH → gateway node → Hadoop cluster]
24. Knox Design Goals
• Operators can firewall the cluster without giving end users
access to a “gateway node”
• Users see one cluster end-point that aggregates
capabilities for data access, metadata and job control
• Provide perimeter security to make Hadoop security setup
easier
• Enable integration with enterprise and cloud identity
management environments
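The "one cluster end-point" goal shows up directly in the URL: clients address every service through a single gateway host instead of individual cluster nodes. A sketch of Knox's WebHDFS URL pattern (the gateway host and the topology name "default" are hypothetical; 8443 is Knox's usual port):

```python
def knox_webhdfs_url(gateway, topology, path, op, port=8443):
    """Build a WebHDFS request routed through a Knox gateway endpoint."""
    return f"https://{gateway}:{port}/gateway/{topology}/webhdfs/v1{path}?op={op}"

# One host:port for the whole cluster; Knox authenticates the caller at the
# perimeter and forwards the request to the NameNode internally
url = knox_webhdfs_url("knox.example.com", "default", "/user/alice", "LISTSTATUS")
```

Compare with direct access, where the client would need the NameNode's own hostname and port plus Kerberos credentials for each service.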
25. Perimeter Verification & Authentication
• Authentication – establish identity at the gateway by authenticating against LDAP / AD
• Verification – verify the identity token (e.g. SAML) and propagate the identity into the cluster (KDC, AD, LDAP)
[Diagram: a {REST} client talks to the Knox Gateway, which authenticates against an ID provider (KDC, AD, LDAP) and forwards requests to cluster services – WebHDFS (NameNode, DataNodes), the JobTracker, and WebHCat/Hive]
Editor's Notes
– This is how we tend to think of Big Data.
– Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up; it is hard to apply different types of processing without moving data around.
– Hive is the only SQL-based app in this pile.
– Other apps are still in the picture; it's not like Hadoop is displacing everything.