Hadoop crash course workshop at Hadoop Summit

© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Crash Course
Winter 2015
Version 1.0
Hortonworks. We do Hadoop.
Rafael Coss
rafael@hortonworks.com
@racoss

Hadoop Crash Course
• Why Hadoop?
• Hadoop Ecosystem & Distribution
• Store Data (HDFS)
• Process Data in Hadoop 1 (MapReduce)
• Process Data in Hadoop 2 (YARN + MapReduce/Tez)
• Access Data
• Lab

What disrupted the data center?
?
Data?

Traditional World Of Applications And Data Silos
Constrains data to specific apps
No insight across ALL data
Built for structured data
Does not scale (cost and tech)
ERP CRM SCM WEB

New Data Paradigm Opens Up New Opportunity
Data volume: 2.8 zettabytes in 2012, projected 44 zettabytes in 2020
1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research
NEW data: clickstream, web & social, geolocation, Internet of Things, server logs, files, emails
TRADITIONAL data: ERP, CRM, SCM
Opportunity: transform every industry via full fidelity of data and analytics
[Figure: ability to consume data, LAGGARDS vs. LEADERS, with the enterprise blind spot between them]
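The unit conversion and growth figures on this slide can be checked with a little arithmetic. The sketch below assumes decimal (SI) prefixes, which is what makes 1 ZB equal to 1 million PB; the function name is illustrative only.

```python
# Unit sizes in bytes, assuming decimal SI prefixes.
PETABYTE = 10**15
ZETTABYTE = 10**21

def zb_to_pb(zettabytes: float) -> float:
    """Convert zettabytes to petabytes."""
    return zettabytes * ZETTABYTE / PETABYTE

pb_per_zb = ZETTABYTE // PETABYTE   # petabytes in one zettabyte
growth_factor = 44 / 2.8            # 2012 and 2020 figures quoted on the slide

print(f"1 ZB = {pb_per_zb:,} PB")
print(f"2.8 ZB = {zb_to_pb(2.8):,.0f} PB")
print(f"Growth 2012 -> 2020: about {growth_factor:.1f}x")
```

In other words, the quoted volumes imply roughly a sixteen-fold increase over eight years.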

Hadoop YARN-based Architecture Unlocks Opportunity
Consolidates all data sets
Delivers real-time insights
Integrates with data center
Scalable and affordable
TURN ALL OF YOUR DATA INTO VALUE
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation

Two Paths in a Customer’s Journey to a Data Lake
Journey to the Data Lake with Hadoop
Goal:
• Centralized Architecture
• Data-driven Business
The journey begins with either:
1. Cost Optimization (Data Architecture Optimization)
2. Advanced Analytic Applications
Leaders are Data Driven
[Figure: SCALE and SCOPE axes; the Cost Optimization and Advanced Analytic Apps paths both lead to the DATA LAKE (Systems of Insight)]

Common Drivers of Hadoop Adoption
Industry Hadoop Adoption Journey
• Data Architecture Optimization: keep 100% of data at up to 1/100 the cost and enrich DW analytics
• Single View: customer, product, supply chain
• Predictive Analytics: behavioral insight, preventive maintenance, resource optimization
• Data Discovery: explore datasets, uncover new findings, operationalize insights

Hadoop Ecosystem
runs on
ETL
RDBMS Import/Export
Distributed Storage & Processing Framework
Secure NoSQL DB
SQL on HBase
NoSQL DB
Workflow Management
SQL
Streaming Data Ingestion
Cluster System Operations
Secure Gateway
Distributed Registry
ETL
Search & Indexing
Even Faster Data Processing
Data Management
Machine Learning

Hadoop Architecture
Data Access Engines
Distributed Reliable Storage
Distributed Compute Framework
Resource Mgt, Data Locality
Data Operating System
Batch Interactive Streaming
Governance Security
Apps

Hadoop Key Services
Hortonworks Data Platform
Multi-tenant data platform built on a centralized
architecture of shared enterprise services
YARN: data operating system
Governance Security
Operations
Resource management
Existing
applications
New
analytics
Partner
applications
Data access: batch, interactive, real-time
Storage
Key Services
Resource and workload management
Scalable tiered storage
Consistent operations
Comprehensive security
Trusted data governance

Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
Ensure consistent enterprise services are applied across the Hadoop stack
Vertical Integration with YARN and HDFS
Ensure engines can run reliably and respectfully in a YARN-based cluster
GOVERNANCE: Load data and manage according to policy
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
• Script: Pig
• SQL: Hive
• Java, Scala: Cascading
• Stream: Storm
• Search: Solr
• NoSQL: HBase, Accumulo
• In-Memory: Spark
• Others: ISV Engines
Tez and Slider sit between the engines and YARN
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)

Core
• NameNode (NN), HDFS: /directory/structure/in/memory.txt
• ResourceManager (RM), YARN: resource management + scheduling
• Worker Node: DataNode (HDFS) + NodeManager (YARN); disk, CPU, memory
[Figure legend: Hadoop daemon, User application]

Joys of Real Hardware (Jeff Dean)
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc

Hadoop Distributed File System (HDFS)
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Data locality: processing runs where the data is stored
• Not just storage but computation
[Figure: a logical file is divided into blocks 1 to 4; each block appears three times across the nodes of the cluster]
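The idea of dividing a file into blocks and keeping three copies of each can be sketched in a few lines of Python. This is a toy model, not HDFS code: the 128 MB block size, the node names, and the function names are assumptions made for the example, and uniform random placement is a simplification (real HDFS placement is rack-aware).

```python
import random

def split_into_blocks(file_size: int, block_size: int) -> list[int]:
    """Return the size of each block; only the last block may be short."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3,
                   seed: int = 0) -> dict[int, list[str]]:
    """Pick `replication` distinct nodes for every block (simplified random placement)."""
    rng = random.Random(seed)
    return {b: rng.sample(nodes, replication) for b in range(1, num_blocks + 1)}

MB = 1024 * 1024
# Assumed sizes: a 500 MB file and 128 MB blocks give four blocks.
blocks = split_into_blocks(file_size=500 * MB, block_size=128 * MB)
placement = place_replicas(len(blocks), nodes=[f"node{i}" for i in range(1, 7)])
for block_id, holders in placement.items():
    print(block_id, blocks[block_id - 1] // MB, "MB ->", holders)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.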

The DataNodes
DataNodes to NameNode:
“I’m still here! This is my latest heartbeat.”
“I’m here too! And here is my latest heartbeat.”
NameNode to DataNode 1:
“Hey DataNode1, replicate block 123 to DataNode 3.”
[Figure: NameNode, DataNode 1, DataNode 3 and DataNode 4, with copies of block 123]
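The heartbeat and re-replication exchange on this slide can be modelled with a small toy class. This is a sketch under assumptions, not the NameNode's actual logic: the 30-second timeout, the silent node `DataNode2`, and all method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class NameNode:
    """Toy NameNode: tracks heartbeats and which DataNodes hold each block."""
    replication: int = 3
    timeout: float = 30.0  # hypothetical timeout in seconds
    last_heartbeat: dict[str, float] = field(default_factory=dict)
    block_map: dict[int, set[str]] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now`."""
        self.last_heartbeat[node] = now

    def live_nodes(self, now: float) -> set[str]:
        """Nodes whose last heartbeat is within the timeout."""
        return {n for n, t in self.last_heartbeat.items() if now - t <= self.timeout}

    def replication_orders(self, now: float) -> list[tuple[int, str, str]]:
        """Return (block, source, target) orders for under-replicated blocks."""
        live = self.live_nodes(now)
        orders = []
        for block, holders in sorted(self.block_map.items()):
            alive = sorted(holders & live)
            spare = sorted(live - holders)
            missing = self.replication - len(alive)
            if alive and missing > 0:
                for target in spare[:missing]:
                    orders.append((block, alive[0], target))
        return orders

nn = NameNode()
nn.block_map[123] = {"DataNode1", "DataNode2", "DataNode4"}
for node in ("DataNode1", "DataNode2", "DataNode3", "DataNode4"):
    nn.heartbeat(node, now=0.0)
# DataNode2 (hypothetical) goes silent; the others keep reporting.
for node in ("DataNode1", "DataNode3", "DataNode4"):
    nn.heartbeat(node, now=60.0)
orders = nn.replication_orders(now=60.0)
print(orders)
```

Here the single order produced asks DataNode1 to copy block 123 to DataNode3, which mirrors the message in the slide.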

Batch Processing in Hadoop
MapReduce
Batch Access to Data
Original data access mechanism for Hadoop
• Framework
Made for developing distributed applications to
process vast amounts of data in parallel on large
clusters
• Proven
Reliable interface to Hadoop which works from
GB to PB. But it is batch oriented: speed is not its
strong point.
• Ecosystem
Ported to Hadoop 2 to run on YARN. Supports
original investments in Hadoop by customers and
partner ecosystem.
MapReduce Job Lifecycle
[Figure: Map Phase (a Mapper on each of DataNode1, DataNode2, DataNode3), then Shuffle/Sort (data is shuffled across the network & sorted), then Reduce Phase (a Reducer on each of DataNode1, DataNode2, DataNode3)]
Saying that MapReduce is dead is
preposterous
- Would limit us to only new workloads
- ALL Hadoop clusters use MapReduce
- Why rewrite everything immediately?
Access patterns: Batch, Interactive, Real-Time

What is MapReduce?
Break a large problem into sub-solutions
Map
• Iterate over a large # of records
• Extract something of interest from
each record
Shuffle
• Sort Intermediate results
Reduce
• Aggregate, summarize, filter or
transform intermediate results
• Generate final output
[Figure: data flows into parallel Map processes (Read & ETL), through Shuffle & Sort, into Reduce processes (Aggregation), which write the output data]

WordCount in MapReduce
1. The mappers read the file’s blocks (constitution.txt) from HDFS line-by-line
   We the people, in order to form a...
2. The lines of text are split into words and output to the reducers
   <We,1> <the,1> <people,1> <in,1> <order,1> <to,1> <form,1> <a,1>
3. The shuffle/sort phase combines pairs with the same key
   <We,(1,1,1,1)> <the,(1,1,1,1,1,1,1,...)> <people,(1,1,1,1,1)> <form,(1)>
4. The reducers add up the “1’s” and output the word and its count to HDFS
   <We,4> <the,265> <people,5> <form,1>
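The map, shuffle/sort, and reduce steps of WordCount can be imitated in plain Python on a single machine. This is only an illustrative sketch of the data flow, not Hadoop's MapReduce API; the two input lines are a hypothetical stand-in for constitution.txt, and the function names are made up for the example.

```python
from collections import defaultdict

def mapper(line: str):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.replace(",", " ").split():
        yield word, 1

def shuffle(pairs):
    """Shuffle/sort: group the values of identical keys, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(word: str, ones: list[int]):
    """Reduce: add up the 1s collected for one word."""
    return word, sum(ones)

# Hypothetical stand-in for the lines of constitution.txt.
lines = ["We the people, in order to form a more perfect union",
         "We the people"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(word, ones) for word, ones in shuffle(pairs))
print(counts["We"], counts["the"], counts["people"], counts["form"])
```

In a real cluster the mappers and reducers run on different nodes, and the shuffle moves the pairs across the network; the grouping by key is the same.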

1st Gen Hadoop: Cost Effective Batch at Scale
HADOOP 1.0
Built for Web-Scale Batch Apps
Silos created for distinct use cases
[Figure: separate single-app silos (BATCH, INTERACTIVE, ONLINE), with each batch app on its own HDFS]

Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for
managing large volumes of high velocity and variety of data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
• Manages new data paradigm
• Handles data at scale
• Cost effective
• Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce

What do iOS 6 and Windows 3.1 have in common?

Hadoop Beyond Batch with YARN
A shift from the old to the new…
Hadoop 1: MapReduce as the Base
Single Use System: Batch Apps
Layers (API, Engine, and System), top to bottom:
• Pig (data flow), Hive (SQL), Others
• MapReduce
• HDFS
Hadoop 2: Apache YARN as a Base
Multi Use Data Platform: Batch, Interactive, Online, Streaming, …
• APIs: Pig (Data Flow), Hive (SQL), Cascading (Java Apps)
• Engine: Tez (modern execution engine), MapReduce (Batch)
• System: YARN (Data Operating System: resource management, etc.) on HDFS (redundant, reliable storage)

Why is Tez Important?
Apache Tez is a critical innovation of the Stinger Initiative.
• Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop: Pig, Cascading…
• More efficient processing than MapReduce
• Reduces operations and complexity of back-end processing
• Allows for Map-Reduce-Reduce, which saves hard disk operations
• Implements a “service” which is always on, decreasing start times of jobs
• Allows caching of data in memory
[Figure: Applications (Scripting: Pig; SQL: Hive; Dev: Cascading/Scalding) run on Tez, which runs on YARN over HDFS (Hadoop Distributed File System) on nodes 1 … N]

Tez
Hive – MapReduce Hive – Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Figure: under Hive – MapReduce, the query runs as a chain of map (M) and reduce (R) jobs, with intermediate results written to HDFS between jobs; under Hive – Tez, the same operators (SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVG(c.price)) run as one graph of M and R tasks]
Tez avoids unneeded writes to HDFS

HDP delivers a Centralized Architecture
Other Pure Play Vendors: a siloed “with” YARN architecture
Disjoint, Siloed Clusters
• Inefficient use of resources, single tenant, duplicate storage & processing
• Multiple implementations of governance, security and operations
• New applications require new clusters
[Figure: Cluster1 (Batch, YARN), Cluster2 (Interactive, dedicated resource mgt) … ClusterN (Real-time, dedicated resource mgt), each with its own Application, Security, Storage, Governance and Operations]
Hortonworks Data Platform: a centralized architecture built on YARN
Single cluster, multiple applications
• Efficient storage, processing
• Centralized Security, Operations, Governance
• Run a variety of applications simultaneously
[Figure: Existing Applications, New Analytics and Partner Applications (e.g. SAS) share Data Access (Batch, Interactive & Real-time), Resource Management (YARN), Storage, Governance, Security and Operations]

{Processing + Storage} = {MapReduce/YARN + HDFS} = {Core Hadoop}

Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enable applications to have access to
all your enterprise data through an
efficient centralized platform
• Supported with a centralized approach to
governance, security and operations
• Versatile to handle any applications
and datasets no matter the size or type
[Figure: SOURCES (Existing Systems: ERP, CRM, SCM; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured) feed HDFS, which serves Batch, Interactive, Real-Time and Partner ISV workloads next to the EDW and MPP systems; ANALYTICS consumers are Data Marts, Business Analytics, Visualization & Dashboards, and Applications]

What is Data Access?
Data Access defines ALL the channels
through which data can be accessed,
analyzed, cleansed and consumed within
Hadoop. Each channel can be categorized
into THREE core patterns: Batch, Interactive
and Real-time.
Multiple engines provide
optimized access to your mission
critical data.

Access patterns enabled by YARN
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
[Figure: cluster nodes 1 … N on HDFS]

Apache Projects Enable Access Patterns
• Various open source projects have incubated in order to meet these access pattern needs
• Today, they can all run on a single cluster on a single set of data because of YARN!
• ALL powered by a BROAD open community
[Figure: cluster nodes 1 … N on HDFS]
Batch
MapReduce
Pig
Hive
Interactive
Solr
Spark
Hive
Kafka
Real-Time
HBase
Accumulo
Storm

Scripting Data Flow & ETL
Apache Pig
• Data flow engine and scripting language (Pig Latin)
• Allows you to transform data and datasets
Advantages over MapReduce
• Reduces time to write jobs
• Community support
• Piggybank has a significant number of UDFs to help adoption
• There are a large number of existing shops using Pig

Pig Latin
• Pig executes in a unique fashion:
o During execution, each statement is processed by the Pig
interpreter
o If a statement is valid, it gets added to a logical plan built by the
interpreter
o The steps in the logical plan do not actually execute until a
DUMP or STORE command is used
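The deferred execution described here (statements are validated and added to a plan, and nothing runs until DUMP or STORE) can be imitated with a small Python class. This is a toy analogy, not Pig's implementation; the class, its methods, and the sample rows are hypothetical.

```python
class LazyPlan:
    """Toy stand-in for a logical plan: statements are recorded, not run."""

    def __init__(self, rows):
        self.rows = rows
        self.steps = []      # (name, function) pairs, in the order they were added
        self.executed = 0    # how many steps have actually run

    def add(self, name, func):
        """Validate and record a step; nothing executes yet."""
        if not callable(func):
            raise TypeError(f"invalid statement: {name}")
        self.steps.append((name, func))
        return self

    def dump(self):
        """Like DUMP: only now run every recorded step, in order."""
        data = self.rows
        for _name, func in self.steps:
            data = func(data)
            self.executed += 1
        return list(data)

plan = LazyPlan([("a.com", 3), ("b.com", 10), ("c.com", 7)])
plan.add("FILTER", lambda rows: (r for r in rows if r[1] > 5))
plan.add("ORDER", lambda rows: sorted(rows, key=lambda r: r[1], reverse=True))
steps_run_before_dump = plan.executed
result = plan.dump()
print(steps_run_before_dump, result)
```

Deferring execution this way gives the engine a chance to see the whole plan before running any of it.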

Why use Pig?
• Maybe we want to join two datasets, from different sources, on a
common value, and want to filter, and sort, and get top 5 sites
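To make the scenario concrete, here is the same join, filter, sort, and top-5 pipeline written in plain Python on in-memory data. It only shows the shape of the work that a Pig script would express as a data flow; the datasets, field names, and the `top_sites` function are hypothetical.

```python
# Hypothetical inputs from two different sources: (site, visits) and site -> category.
visits = [("a.com", 120), ("b.com", 45), ("c.com", 300), ("d.com", 80),
          ("e.com", 15), ("f.com", 210), ("g.com", 95)]
categories = {"a.com": "news", "b.com": "news", "c.com": "shop", "d.com": "news",
              "e.com": "shop", "f.com": "news", "g.com": "news"}

def top_sites(visits, categories, category, n=5):
    """Join on site, keep one category, sort by visits (descending), take the top n."""
    joined = [(site, count, categories[site])
              for site, count in visits if site in categories]        # join on the common value
    filtered = [row for row in joined if row[2] == category]          # filter
    ranked = sorted(filtered, key=lambda row: row[1], reverse=True)   # sort
    return [site for site, _count, _cat in ranked[:n]]                # top n

top5 = top_sites(visits, categories, "news")
print(top5)
```

Each of the four steps is one line here, which is the property a data flow language aims to keep when the data no longer fits on one machine.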

Apache Hive: THE de facto standard for SQL in Hadoop
• What?
• Treat your data in Hadoop as tables
• Provides a standard SQL 92 interface to data in Hadoop
• Why?
• Shipped in every distribution… you already have it (although some do not
ship complete versions)
• Quickly find value in raw data files
• Proven at petabyte scale for both batch and interactive queries
• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy,
Business Objects, etc…

Hive Architecture
1. User issues SQL query (Web UI, JDBC / ODBC, or CLI, through HiveServer2)
2. Hive parses and plans query (Compiler, Optimizer, Executor; Hive MetaStore on MySQL, Postgresql, Oracle)
3. Query converted to MapReduce and executed on Hadoop (MapReduce or Tez job, data-local processing)

Using Tez for Hive Queries
Set the following property in either hive-site.xml or in
your script:
set hive.execution.engine=tez;

SQL Compliance
Evolution of SQL Compliance in Hive
SQL Datatypes | SQL Semantics
INT/TINYINT/SMALLINT/BIGINT | SELECT, INSERT
FLOAT/DOUBLE | GROUP BY, ORDER BY, HAVING
BOOLEAN | JOIN on explicit join key
ARRAY, MAP, STRUCT, UNION | Inner, outer, cross and semi joins
STRING | Sub-queries in the FROM clause
BINARY | ROLLUP and CUBE
TIMESTAMP | UNION
DECIMAL | Standard aggregations (sum, avg, etc.)
DATE | Custom Java UDFs
VARCHAR | Windowing functions (OVER, RANK, etc.)
CHAR | Advanced UDFs (ngram, XPath, URL)
Interval Types | Sub-queries for IN/NOT IN, HAVING
 | JOINs in WHERE Clause
 | INSERT/UPDATE/DELETE
Legend: Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap

Overview of Stinger
100X+ Faster Time to Insight
Deeper Analytical Capabilities
Performance Optimizations
• Base Optimizations: generate simplified DAGs; in-memory hash joins
• Vector Query Engine: optimized for modern processor architectures
• Tez: express tasks more simply; eliminate disk writes; pre-warmed containers
• ORCFile: column store; high compression; predicate / filter pushdowns
• YARN: next-gen Hadoop data processing framework
• Query Planner: intelligent cost-based optimizer

YARN: Resource Manager for Hadoop 2.0
Data Processing Engines Run Natively IN Hadoop
• Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
• Efficient: double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
• Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads
[Figure: HDP 2.2 stack in three layers (API, Engine, System). Applications: Batch (MapReduce), Scripting (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Spark, Direct (Java, .NET), Real-Time (Slider), Other ISV. Tez and Slider run on YARN: Data Operating System, over HDFS on nodes 1 … N]

Hive & Pig
Hive & Pig work well together
and many customers use both
Hive is a good choice:
• if you are familiar with SQL
• when you want to query data
• when you need an answer to
specific questions
Pig is a good choice:
• For ETL (Extract, Transform, Load)
• for preparing data for analysis
• when you have a long series of
steps to perform

Pig and Hive Sample Scenario
1. Put the data into HDFS (Hadoop Distributed File System) in its raw format
2. Use Pig to explore and transform (raw data becomes structured data)
3. Data analysts use Hive to query the data: answers to questions = $$
4. Data scientists use MapReduce, R, and Mahout to mine the data: hidden gems = $$

Big Data ETL Life Cycle
Sources: Mobile Apps; Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails
1. Load or archive batch data
2. Replicate changed data & schemas
3. Stream real-time data
4. Mask sensitive data
5. Access customer “golden record” (MDM)
6. Transform & refine data
7. Move results to EDW
8. Explore & validate data
9. Govern & enrich with metadata
10. Correlate real-time events with historical patterns & trends
11. Subscribe to datasets
[Figure labels: Visualization & Analytics; Data Mart; Data Access & Query]

HDP: Any Data, Any Application, Anywhere
Any Application
• Deep integration with ecosystem
partners to extend existing
investments and skills
• Broadest set of applications through
the stable of YARN-Ready applications
Any Data
Deploy applications fueled by clickstream, sensor,
social, mobile, geo-location, server log, and other new
paradigm datasets with existing legacy datasets.
Anywhere
Implement HDP naturally across the complete
range of deployment options
[Figure: data sources (Clickstream, Web & Social, Geolocation, Internet of Things, Server Logs, Files, emails, ERP, CRM, SCM); deployment options (commodity, appliance, cloud, hybrid)]
Over 70 Hortonworks Certified YARN Apps

What next? -> developer.hortonworks.com

Thank you!
rafael@hortonworks.com
@racoss

IoT Data Discovery Lab
• A trucking company has over 100 trucks.
• The geolocation data collected from the trucks contains events generated
while the truck drivers are driving.
• The company’s goal with Hadoop is to mitigate risk:
o Understand correlations between miles driven and events
o Compute the risk factor for each driver based on mileage & events
• Lab Env
o Sandbox 2.3 TP
• Lab Doc
o URL
• Load Data
• Query Data
• Process Data

Move Data Into Hadoop
[Figure: Geolocation.csv and trucks.csv are moved into Hadoop and loaded (LOAD) into the csv staging tables Geolocation_stage and Trucks_stage, then written with SQL into the ORC tables Geolocation and Trucks]

Trucking Risk Analysis – Hadoop ELT
[Figure: the ORC tables Geolocation and Trucks are processed with SQL and PIG (Risk Calculation) into the ORC tables Truck_mileage, Avg_mileage, DriverMileage, Events and RiskFactor]

Calculate Risk
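As a rough illustration of what such a calculation could look like, the sketch below scores each driver by miles driven per non-normal event. The deck does not define the formula, so this definition is an assumption, as are the sample rows, the event names, and the `risk_factors` function; in the lab the inputs would come from the Geolocation and Trucks tables.

```python
from collections import Counter

# Hypothetical sample rows: (driver, event) pairs and total miles per driver.
events = [("A1", "overspeed"), ("A1", "lane departure"), ("A2", "overspeed"),
          ("A1", "normal"), ("A2", "normal"), ("A3", "normal")]
total_miles = {"A1": 10000, "A2": 8000, "A3": 12000}

def risk_factors(events, total_miles, normal="normal"):
    """Return {driver: miles per risky event}; drivers with no risky events are omitted.

    Assumed definition: fewer miles per risky event means higher risk.
    """
    risky = Counter(driver for driver, event in events if event != normal)
    return {driver: total_miles[driver] / count
            for driver, count in risky.items() if driver in total_miles}

factors = risk_factors(events, total_miles)
# Riskiest drivers (fewest miles per risky event) first.
for driver, miles_per_event in sorted(factors.items(), key=lambda kv: kv[1]):
    print(driver, miles_per_event)
```

Under this assumed definition, driver A1 ranks as riskier than A2, and A3 has no risky events to score.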

Cautionary Statement Regarding Forward-Looking Statements
This presentation contains forward-looking statements involving risks and uncertainties. Such
forward-looking statements in this presentation generally relate to future events, our ability to
increase the number of support subscription customers, the growth in usage of the Hadoop
framework, our ability to innovate and develop the various open source projects that will enhance
the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general
business outlook. In some cases, you can identify forward-looking statements because they
contain words such as “may,” “will,” “should,” “expects,” “plans,” “anticipates,” “could,” “intends,”
“target,” “projects,” “contemplates,” “believes,” “estimates,” “predicts,” “potential” or “continue” or
similar terms or expressions that concern our expectations, strategy, plans or intentions. You
should not rely upon forward-looking statements as predictions of future events. We have based
the forward-looking statements contained in this presentation primarily on our current expectations
and projections about future events and trends that we believe may affect our business, financial
condition and prospects. We cannot assure you that the results, events and circumstances
reflected in the forward-looking statements will be achieved or occur, and actual results, events, or
circumstances could differ materially from those described in the forward-looking statements.
The forward-looking statements made in this prospectus relate only to events as of the date on
which the statements are made and we undertake no obligation to update any of the information in
this presentation.
Trademarks
Hortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other
names used herein may be trademarks of their respective owners.

A Definition of Open Enterprise Hadoop
GOVERNANCE: Load data and manage according to policy
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Batch, Interactive, Real-Time
(Cluster Resource Management)
HDFS

[Figure: many data sources feed an EDW schema through chains of ETL jobs]
Fragile workflows make supporting the analytical
models you want expensive and time-consuming.

Options for Data Input
MapReduce
WebHDFS
hadoop fs -put
Vendor Connectors
Hadoop
nfs gateway
Hue Explorer

Risk Factors Viewed in a Graph

Risk Factors Viewed on a Map
