2. Big Data = Terabytes, Petabytes, …
Image Credit: Gizmodo
© Hortonworks 2013
Page 2
3. But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit
2012 on calculations Twitter is doing via UDFs in Pig.
This equation uses stochastic gradient descent to do
machine learning with their data:
w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
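The update is easier to see in code. A minimal sketch of one stochastic gradient descent step, choosing logistic regression as the model f and log loss as ℓ (the slide does not say which model or loss Twitter uses, so both are assumptions here):

```python
import math

def sgd_step(w, x, y, gamma):
    """One SGD update: w(t+1) = w(t) - gamma(t) * gradient of the loss."""
    # f(x; w): a logistic-regression prediction on the single example x
    z = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1.0 / (1.0 + math.exp(-z))
    # For log loss, the gradient of l(f(x; w), y) with respect to w is (pred - y) * x
    return [wi - gamma * (pred - y) * xi for wi, xi in zip(w, x)]

# Starting from w = 0 the prediction is 0.5, so the step moves w toward the label y = 1
w_next = sgd_step([0.0, 0.0], [1.0, 2.0], 1.0, 0.1)
```

Each call consumes one training example, which is what makes the method "stochastic" and a natural fit for a UDF applied record by record.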
4. And New Tools
• Apache Hadoop brings with it a large selection of tools
and paradigms
– Apache HBase, Apache Cassandra – Distributed, high-volume
reads and writes of individual data records
– Apache Hive - SQL
– Apache Pig, Cascading – Data flow programming for ETL, data
modeling, and exploration
– Apache Giraph – Graph processing
– MapReduce – Batch processing
– Storm, S4 – Stream processing
– Plus lots of commercial offerings
5. Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g.
SAS).
[Diagram: separate silos – Data Mart, Statistical Analysis, Data Warehouse, Cube/MOLAP, and OLTP – each on its own machine]
6. Cloud: Many Tools, One Platform
• Users no longer want to be concerned with what platform their data is in – just
apply the tool to it
• SQL no longer the only or primary data access tool
[Diagram: Statistical Analysis, Data Mart, Data Warehouse, Cube/MOLAP, and OLTP tools all operating on one shared data platform]
7. Upside – Pick the Right Tool for the Job
8. Downside – Tools Don’t Play Well Together
• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user-defined function interfaces
9. Downside – Wasted Developer Time
• Wastes developer time since each tool re-implements the
same functionality
[Diagram: Pig and Hive stacks side by side – each has its own parser, optimizer, physical planner, and executor; Hive also keeps metadata]
10. Downside – Wasted Developer Time
• Wastes developer time since each tool re-implements the
same functionality
[Diagram: the same Pig and Hive stacks, with the overlapping parser, optimizer, physical planner, and executor components highlighted]
11. Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
12. Hadoop = Distributed Data Operating System
Service – Hadoop Component
– Table Management: Hive
– Access To Metadata: HCatalog
– User authentication: Knox
– Resource management: YARN
– Notification: HCatalog
– REST/Connectors: WebHCat, WebHDFS, Hive, HBase, Oozie
– Relational data processing: Tez
(Color legend on the original slide: Exists / Pieces exist in this component / New Project)
14. HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside
Hadoop
• Presents tools with a table paradigm that abstracts away
storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
15. HCatalog – Table Management
[Diagram: Hive accessing its tables through the metastore]
16. HCatalog – Table Management
[Diagram: Pig (via HCatLoader) and MapReduce (via HCatInputFormat) now access the same tables through the Hive metastore]
17. HCatalog – Table Management
[Diagram: external systems reach the same tables over REST through WebHCat, alongside Hive, Pig (HCatLoader), and MapReduce (HCatInputFormat), all going through the metastore]
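The REST path into HCatalog is concrete enough to sketch. WebHCat (Templeton) exposes table metadata at a fixed URL shape; the host, user, and table names below are hypothetical, and 50111 is WebHCat's default port:

```python
from urllib.parse import urlencode

def webhcat_table_url(host, db, table, user, port=50111):
    """Build the WebHCat (Templeton) URL that describes a Hive/HCatalog table."""
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}/templeton/v1/ddl/database/{db}/table/{table}?{query}"

# An HTTP GET on this URL returns the table's schema and location as JSON,
# without the caller needing any knowledge of the underlying storage format
url = webhcat_table_url("hadoop-gw.example.com", "default", "web_logs", "alice")
```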
18. Tez – Moving Beyond MapReduce
• Low level data-processing execution engine
• Serves as the base for MapReduce, Hive, Pig, Cascading,
etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of
the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN
19. Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: Pig/Hive on MR – the query runs as three jobs (Job 1 → Job 2 → Job 3) with an I/O synchronization barrier between each pair]
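What those three jobs compute can be sketched in plain Python (the tables a, b, c and their columns come from the query above; the row values are made up for illustration):

```python
from collections import defaultdict

a = [{"id": 1, "itemId": 10, "state": "CA"}, {"id": 2, "itemId": 11, "state": "CA"}]
b = [{"id": 1}, {"id": 2}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]

# Job 1: join a with b on id (a shuffle keyed on id)
b_ids = {row["id"] for row in b}
ab = [row for row in a if row["id"] in b_ids]

# Job 2: join the result with c on itemId (a second shuffle, keyed on itemId)
prices = {row["itemId"]: row["price"] for row in c}
abc = [(row["state"], prices[row["itemId"]]) for row in ab if row["itemId"] in prices]

# Job 3: group by state, computing COUNT(*) and the average price (a third shuffle)
groups = defaultdict(list)
for state, price in abc:
    groups[state].append(price)
result = {s: (len(p), sum(p) / len(p)) for s, p in groups.items()}
```

Each re-keying step is a shuffle, and under plain MapReduce each shuffle boundary becomes a separate job whose intermediate output is written to HDFS.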
20. Pig/Hive-MR versus Pig/Hive-Tez
[Diagram: on MR the same three jobs run with I/O synchronization barriers between them; on Tez the whole query runs as a single job with no barriers]
21. FastQuery: Beyond Batch with YARN
• Tez generalizes MapReduce – simplified execution plans process data more efficiently
• Always-on Tez service – low-latency processing for all Hadoop data processing
23. Today’s Access Options
• Direct Access
– Access Services via REST (WebHDFS, WebHCat)
– Need knowledge of and access to whole cluster
– Security handled by each component in the cluster
– Kerberos details exposed to users
[Diagram: user → {REST} → Hadoop cluster]
• Gateway / Portal Nodes
– Dedicated nodes behind the firewall
– Users SSH to the node to access Hadoop services
[Diagram: user → SSH → gateway node → Hadoop cluster]
24. Knox Design Goals
• Operators can firewall the cluster without giving end users
access to a “gateway node”
• Users see one cluster end-point that aggregates
capabilities for data access, metadata and job control
• Provide perimeter security to make Hadoop security setup
easier
• Enable integration with enterprise and cloud identity
management environments
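The "one cluster end-point" goal shows up directly in the URL: clients address every service through a single gateway host instead of individual cluster nodes. A sketch of Knox's WebHDFS URL pattern (the gateway host and the topology name "default" are hypothetical; 8443 is Knox's usual port):

```python
def knox_webhdfs_url(gateway, topology, path, op, port=8443):
    """Build a WebHDFS request routed through a Knox gateway endpoint."""
    return f"https://{gateway}:{port}/gateway/{topology}/webhdfs/v1{path}?op={op}"

# One host:port for the whole cluster; Knox authenticates the caller at the
# perimeter and forwards the request to the NameNode internally
url = knox_webhdfs_url("knox.example.com", "default", "/user/alice", "LISTSTATUS")
```

Compare with direct access, where the client would need the NameNode's own hostname and port plus Kerberos credentials for each service.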
25. Perimeter Verification & Authentication
• Authentication – establish identity at the gateway by authenticating against LDAP / AD
• Verification – verify the identity token (e.g. SAML) and propagate the identity into the cluster (KDC, AD, LDAP)
[Diagram: a {REST} client talks to the Knox Gateway, which authenticates against an ID provider (KDC, AD, LDAP) and forwards requests to cluster services – WebHDFS (NameNode, DataNodes), the JobTracker, and WebHCat/Hive]
Editor's Notes
– This is how we tend to think of Big Data.
– Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up; it is hard to apply different types of processing without moving data around.
– Hive is the only SQL-based app in this pile.
– Other apps are still in the picture; it's not like Hadoop is displacing everything.