Hadoop Crash Course Hadoop Summit SJ

Apache Hadoop Crash Course
Rafael Coss
Data Evangelist
@racoss
#FutureOfData

© Hortonworks Inc. 2011–2016. All Rights Reserved
Agenda
Future of Data
Traditional Data Architectures
What’s Apache Hadoop?
Data Access with Hadoop
Lab Intro

Customers are building Modern Data Applications to transform their industries –
renovating their IT architectures and innovating with their Data in Motion
or Data at Rest to power actionable intelligence.
Social Mapping
Payment Tracking
Factory Yields
Defect Detection
Call Analysis
Machine Data
Product Design
M & A
Due Diligence
Next Product Recs
Cyber Security
Risk Modeling
Ad Placement
Proactive Repair
Disaster Mitigation
Investment Planning
Inventory Predictions
Customer Support
Sentiment Analysis
Supply Chain
Basket Analysis
Segments
Cross-Sell
Customer Retention
Vendor Scorecards
Optimize Inventories
OPEX Reduction
Mainframe Offloads
Historical Records
Data as a Service
Public Data Capture
Fraud Prevention
Device Data Ingest
Rapid Reporting
Digital Protection

INTERNET OF ANYTHING
The Future of Data is about actionable intelligence derived from a
constantly connected society with easy, secure access to rich data sets
coming from the Internet of Anything.

Data Powers Highway Safety
Tire Pressure
Server log
Mobile
Sensor
Location
Precipitation
Social
Click-stream

New Data Paradigm Opens Up New Opportunity
2.8 zettabytes
in 2012
44 zettabytes
in 2020
NEW
1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research
Clickstream
ERP, CRM, SCM
Web & social
Geolocation
Internet of Things
Server logs
Files, emails
Transform every industry via
full fidelity of data and analytics
Opportunity
TRADITIONAL
LAGGARDS
LEADERS
Ability to
Consume Data
Enterprise
Blind Spot
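
The two data-volume figures on this slide imply a growth rate that is easy to work out. The following is a minimal sketch of that arithmetic; the only inputs are the slide's own numbers (2.8 ZB in 2012, 44 ZB in 2020, 1 ZB = 1 million PB), and the function names are illustrative.

```python
# Figures quoted on the slide.
ZB_2012 = 2.8
ZB_2020 = 44.0
PB_PER_ZB = 1_000_000  # 1 zettabyte = 1 million petabytes


def growth_factor(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def compound_annual_growth_rate(start, end, years):
    """Constant yearly growth rate that turns start into end over the period."""
    return (end / start) ** (1 / years) - 1


factor = growth_factor(ZB_2012, ZB_2020)
cagr = compound_annual_growth_rate(ZB_2012, ZB_2020, 2020 - 2012)

print(f"2012 volume: {ZB_2012 * PB_PER_ZB:,.0f} PB")
print(f"Growth 2012 to 2020: {factor:.1f}x")
print(f"Implied compound annual growth: {cagr:.0%}")
```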

What disrupted the data center?
?
Data?

Modern Data Applications
Polyglot Persistence
SQL
NoSQL
NewSQL
Search
Graph
At-Rest In-Motion
Analytics
Data Variety
Integration
Data Lake Federation
Optimization
Storage, Compute
Distributed Computing
Commodity Hardware
Cloud
Hybrid Distributed Computing

The Future of Data
Actionable Intelligence
DATA IN MOTION
STORAGE
STORAGE
GROUP 1 GROUP 2
GROUP 3 GROUP 4
DATA AT REST
INTERNET OF ANYTHING
Connected Data Platforms
are powering Actionable Intelligence
Any and all data
from sensors,
machines,
geolocation, clicks,
files, social.
Secure point-to-point and
bi-directional data flows
Collect and curate all data.

Traditional Data Architectures

Systems of Intelligence
Systems of
Engagements
Systems of
Interactions
Data Systems
Systems of
Record
Systems of Insight
Events
In
Gray
Analytics
In
Green
Operators
Developers

[Diagram: a traditional data architecture. Components shown: POS, CRM, ERP, ECM, clickstream logs, web & social logs, audio, video, geolocation, JSON, documents, sales data, unstructured data; RDBMS, NoSQL, file server, app server, message bus; ETL and filter; EDW, data marts, archive; statistics, OLAP, business analytics, visualization & dashboards.]

Traditional Data Architecture Challenges with Big Data
• Too expensive and slow as data growth keeps accelerating
• Too slow to get the data prepared for analytics
• Analytics is only leveraging a limited data set
• Cold data becomes archived and is no longer usable for analytics
• Data ingest is rigid and slow for new IoAT data types
• Limited real time insights


Modern Data Applications

Traditional Analytics
Structured & Repeatable
Structure built to store data
Business Users Determine Questions
IT Team Builds System To Answer Known Questions
Capacity constrained down sampling of available information
Carefully cleanse all information before any analysis

Next Generation Analytics
Iterative & Exploratory
Data is the structure
IT Team Delivers Data On Flexible Platform
Business Users Explore and Ask Any Question
Analyze ALL Available Information
Whole population analytics connects the dots
Analyze information as is & cleanse as needed

Modern Data Applications Have Two Themes

Traditional Analytics
Structured & Repeatable
Structure built to store data
Start with hypothesis
Test against selected data
Diagram labels: Hypothesis, Question, Data, Answer, Analyzed Information

Next Generation Analytics
Iterative & Exploratory
Data is the structure
Data leads the way
Explore all data, identify correlations
Diagram labels: Data, Exploration, Correlation, Actionable Insight, All Information
Analyze after landing… Analyze in motion…

Hadoop Architecture
Data Access Engines
Distributed Reliable Storage
Distributed Compute Framework
Resource Management, Data Locality
Data Operating System
Batch Interactive Real-time
Governance
&
Integration
Security
Applications
Deploy Anywhere

Hadoop Data Platform Architecture
Store and process all of your Corporate Data Assets
YARN: Data Operating System
DATA MANAGEMENT
Provide layered
approach to
security through
Authentication,
Authorization,
Accounting, and
Data Protection
SECURITY
Access your data simultaneously in multiple ways
(batch, interactive, real-time)
DATA ACCESS
Load data and
manage according
to policy
GOVERNANCE &
INTEGRATION
ENTERPRISE MGMT & SECURITY
Empower existing operations and
security tools to manage Hadoop
PRESENTATION & APPLICATION
Enable both existing and new applications to
provide value to the organization
Provide deployment choice across on-premise, appliance, virtualized, cloud
DEPLOYMENT OPTIONS
Deploy and
effectively
manage the
platform
OPERATIONS

runs on
ETL
RDBMS Import/Export
Distributed Storage & Processing Framework
Secure NoSQL DB
SQL on HBase
NoSQL DB
Workflow Management
SQL
Streaming Data Ingestion
Cluster System Operations
Secure Gateway
Distributed Registry
ETL
Search & Indexing
Even Faster Data Processing
Data Management
Machine Learning
Hadoop Ecosystem

Open Enterprise Hadoop Capabilities
YARN : Data Operating System
DATA ACCESS SECURITY
GOVERNANCE &
INTEGRATION OPERATIONS
Nodes 1 … N
Data Lifecycle &
Governance
Falcon
Atlas
Administration
Authentication
Authorization
Auditing
Data Protection
Ranger
Knox
Atlas
HDFS Encryption
Data Workflow
Sqoop
Flume
Kafka
NFS
WebHDFS
Provisioning,
Managing, &
Monitoring
Ambari
Cloudbreak
Zookeeper
Scheduling
Oozie
Batch
MapReduce
Script
Pig
Search
Solr
SQL
Hive
NoSQL
HBase
Accumulo
Phoenix
Stream
Storm
In-memory
Spark
Others
ISV Engines
Tez Tez Slider Slider
DATA MANAGEMENT
Hortonworks Data Platform
Deployment Choice: Linux, Windows, On-Premise, Cloud
HDFS Hadoop Distributed File System

HORTONWORKS DATA PLATFORM
DATA MGMT
HDP 2.2
Dec 2014
HDP 2.1
April 2014
HDP 2.0
Oct 2013
2.2.0
2.4.0
2.6.0
Ongoing Innovation in Apache
HDFS
YARN
MapReduce
Hadoop Core
What is Apache Hadoop?
Yahoo!
2006
Hortonworks
Oct 2011
Yahoo! starts to focus on multiple Hadoop apps & clusters
Contributes Hadoop to Apache
2008
HDP 1.0
Oct 2012
Apache Hadoop v2 YARN
Google publishes GFS & MapReduce papers
2004-2005
HDP 2.4
March 2016
2.7.1
HDP 2.2
Dec 2014
HDP 2.3
July 2015
2.7.1

/directory/structure/in/memory.txt
Resource management + scheduling
Disk, CPU, Memory
Core
NameNode
HDFS
ResourceManager
YARN
Hadoop daemon
User application
NN
RM
DataNode
HDFS
NodeManager
YARN
Worker Node

HDFS: Scalable, Reliable and Secure Storage Platform
The Storage Platform for Hadoop 2.0
Scalable
Horizontally grow as data volumes grow, adding
one or multiple nodes at a time
Reliable
Highly available (HA) and fault tolerant to
protect against data loss and corruption
Cost Effective
Leverage Commodity Hardware
Cross workload access
Secure
Strong access controls, integrated with
authentication mechanisms
Granular data access controls to datasets across
users and groups
Protects data over the wire and at rest
HDFS
[Diagram: blocks A, B, and C, each stored as multiple copies across the cluster]
Standards Based
Data Interfaces
NFS
Source /
Destination
REST
RPC
Source /
Destination
Source /
Destination
Ingest and store any data in any format
Flexible read access enables a variety of workloads

Heterogeneous Storage
Before
• DataNode is a single storage
• Storage is uniform - the only storage type is Disk
• Storage types hidden from the file system
New Architecture
• DataNode is a collection of storages
• Support different types of storages
– Disk, SSDs, Memory
All disks as a single storage
S3
Swift
SAN
Filers
Collection of tiered storages

Hadoop Distributed File System (HDFS)
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not Just storage but computation
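
The block-and-replica idea in the bullets above can be sketched in a few lines. This is a toy model, not HDFS code: the node names, the tiny block size, and the purely random placement are assumptions for illustration (real HDFS uses much larger blocks, and its default placement policy also takes racks into account).

```python
import random


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3, seed=0):
    """Pick `replication` distinct nodes at random for every block."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return {block_id: rng.sample(nodes, replication) for block_id in range(num_blocks)}


def nodes_with_local_copy(placement, block_id):
    """Data locality: a task for this block is best scheduled on one of these nodes."""
    return placement[block_id]


nodes = [f"node{i}" for i in range(1, 6)]   # hypothetical 5-node cluster
data = bytes(range(256)) * 4                # 1,024 bytes standing in for a large file
blocks = split_into_blocks(data, block_size=256)
placement = place_replicas(len(blocks), nodes)

for block_id in placement:
    print(f"block {block_id} -> {nodes_with_local_copy(placement, block_id)}")
```

Losing any single node still leaves two copies of every block it held, which is the fault-tolerance property the slide refers to.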
[Diagram: a logical file is divided into blocks 1 to 4, and each block appears three times across the nodes of the cluster.]

Batch Processing in Hadoop
MapReduce
Batch Access to Data
Original data access mechanism for Hadoop
• Framework
Made for developing distributed applications to
process vast amounts of data in-parallel on large
clusters
• Proven
Reliable interface to Hadoop which works from GB to
PB. But it is batch oriented – speed is not its strong point.
• Ecosystem
Ported to Hadoop 2 to run on YARN. Supports original
investments in Hadoop by customers and partner
ecosystem.
DataNode1
Mapper
Data is shuffled
across the network
& sorted
Map Phase Shuffle/Sort Reduce Phase
MapReduce Job Lifecycle
Saying that MapReduce is dead is
preposterous
- Would limit us to only new workloads
- ALL Hadoop clusters use MapReduce
- Proven at Enterprise Scale
DataNode2
Mapper
DataNode3
Mapper
DataNode1
Reducer
DataNode2
Reducer
DataNode3
Reducer
Batch Interactive Real-Time

What is MapReduce?
Break a large problem into sub-solutions
Map
• Iterate over a large # of records
• Extract something of interest from
each record
Shuffle
• Sort Intermediate results
Reduce
• Aggregate, summarize, filter or
transform intermediate results
• Generate final output
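
The three phases above can be sketched with a word count, the customary first MapReduce example. This is a single-process illustration in plain Python, not the Hadoop MapReduce API; the function names and input records are made up for the sketch.

```python
from collections import defaultdict


def map_phase(records):
    """Map: iterate over records and emit a (key, value) pair for each word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


records = [
    "hadoop stores data",
    "hadoop processes data",
    "yarn schedules work",
]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

On a cluster, many map tasks run in parallel on different blocks, and the shuffle moves each key's values across the network to the reducer responsible for that key.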
[Diagram: input data is read by several Map processes (Read & ETL), passes through Shuffle & Sort, and is aggregated by Reduce processes (Aggregation) into output data.]

1st Gen Hadoop: Cost Effective Batch at Scale
HADOOP 1.0
Built for Web-Scale Batch Apps
Single App
BATCH
HDFS
Single App
INTERACTIVE
Single App
BATCH
HDFS
Silos created for distinct use cases
Single App
BATCH
HDFS
Single App
ONLINE

Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for managing large
volumes of high velocity and variety of data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by large
web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
✓ Manages new data paradigm
✓ Handles data at scale
✓ Cost effective
✓ Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce

YARN extends Hadoop into data center leaders
YARN
The Architectural
Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools
(ex. SAS, Syncsort, Actian, etc.)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via
ecosystem of new and existing “YARN Ready” solutions
YARN : Data Operating System
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
Nodes 1 … N
HDFS Hadoop Distributed File System
DATA MANAGEMENT
Batch
MapReduce
Script
Pig
Search
Solr
SQL
Hive
NoSQL
HBase
Accumulo
Phoenix
Stream
Storm
In-memory
Spark
Others
ISV Engines
Tez Tez Slider Slider

What do iOS 6 and Windows 3.1 have in common?

Hadoop Beyond Batch with YARN
Single Use System
Batch Apps
Multi Use Data Platform
Batch, Interactive, Online, Streaming, …
A shift from the old to the new…
HADOOP 1
MapReduce
(cluster resource management
& data processing)
Data Flow
Pig
SQL
Hive
Others
API,
Engine,
and
System
YARN
(Data Operating System: resource management, etc.)
Data Flow
Pig
SQL
Hive
Other
ISV
Apache YARN as a Base
System
Engine
APIs
Nodes 1 … N
HDFS
(redundant, reliable storage)
Nodes 1 … N
HDFS
Batch
MapReduce
Tez Tez
MapReduce as the Base
HADOOP 2

Architecture Enabled by YARN
A single set of data across the entire cluster with multiple access methods
using “zones” for processing
Nodes 1 … n
SQL
Hive
Interactive SQL Query
for Analytics
Pig
Script-based ETL
Algorithm executed in batch to rework
data used by Hive and HBase consumers
• Maximize compute
resources to lower TCO
• No standalone,
siloed clusters
• Simple management
& operations
…all enabled by YARN
Stream Processing
Storm
Identify & act on
real-time events
NoSQL
HBase
Accumulo
Low-latency access serving up
a web front end

Hadoop Workload Evolution
Single Use System
Batch Apps
Multi Use Data Platform
Batch, Interactive, Online, Streaming, …
A shift from the old to the new… Multi Use Platform
Data & Beyond
HADOOP 1
YARN
HADOOP 2
1 ° ° ° °
° ° ° ° N
HDFS
1 ° ° °
° ° ° N
HDFS
MapReduce
HADOOP.Next
YARN
Nodes 1 … N
HDFS
DATA ACCESS APPS
Docker
MySQL
MR2 Others
(ISV Engines)
Multiple
(Script, SQL, NoSQL, …)
MR2 Others
(ISV Engines)
Multiple
(Script, SQL, NoSQL, …)
Docker
Tomcat
Docker
Other

Gartner: What is Hadoop?
• Common Apache Projects
– ALL = 7 (6)
– Except for 1 = 3 (5)
– Except for 2 = 4 (4)
→ About 14 Common Projects
• Uncommon Projects
– Only 1 = 9 (1)
– Only 2 = 7 (2)
– Only 3 = 6 (3)
→ About 22 Uncommon Projects
http://blogs.gartner.com/merv-adrian/2015/07/02/now-what-is-hadoop/
ODPi

HORTONWORKS DATA PLATFORM
Hadoop
&YARN
Flume
Oozie
HDP 2.3 is Apache Hadoop; not “based on” Hadoop
Pig
Hive
Tez
Sqoop
Cloudbreak
Ambari
Slider
Kafka
Knox
Solr
Zookeeper
Spark
Falcon
Ranger
HBase
Atlas
Accumulo
Storm
Phoenix
4.10.2
DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY
HDP 2.2
Dec 2014
HDP 2.1
April 2014
HDP 2.0
Oct 2013
0.12.0 0.12.0
0.12.1 0.13.0 0.4.0
1.4.4 1.4.4 3.3.23.4.5
0.4.00.5.0
0.14.0 0.14.0 3.4.6 0.5.0 0.4.00.9.30.5.2
4.0.04.7.2
1.2.1 0.60.0 0.98.4 4.2.0 1.6.1 0.6.0 1.5.21.4.5 4.1.02.0.0
1.4.0 1.5.1 4.0.0
1.3.1
1.5.1 1.4.4 3.4.5
2.2.0
2.4.0
2.6.0
2.7.1 1.4.6 1.0.0 0.6.0 0.5.02.1.00.8.2 3.4.61.5.25.2.1 0.80.0 0.5.01.7.04.4.0 0.10.0 0.6.10.7.01.2.10.15.0
HDP 2.3
Oct 2015 4.2.0
0.96.1
0.98.0 0.9.1
0.8.1
1.4.1 1.1.2
2.7.1 1.4.6 1.3.0 0.9.0 0.6.02.4.00.10.0 3.4.61.5.25.5.1 0.80.0 0.7.01.7.04.7.0 1.0.1 0.10.00.7.01.2.10.16.0
HDP 2.5*
2H2016
4.2.01.6.2 1.1.2
2.7.1 1.4.6 1.1.0 0.6.0 0.5.02.2.10.9.0 3.4.61.5.25.2.1 0.80.0 0.5.01.7.04.4.0 0.10.0 0.6.10.7.01.2.10.15.0
HDP 2.4
Mar 2016 4.2.01.6.0 1.1.2
Zeppelin
Ongoing Innovation in Apache
0.6.0
* HDP 2.5 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.

Next Generation Data Vendors Investment for the Enterprise
Vertical
Integration with
YARN and HDFS
Ensure engines can run
reliably and respectfully
in a YARN based
cluster
Provision,
Manage &
Monitor
Ambari
Zookeeper
Scheduling
Oozie
Load data and
manage
according
to policy
Provide layered
approach to
security through
Authentication,
Authorization,
Accounting, and
Data Protection
SECURITY
GOVERNANCE
Deploy and
effectively
manage the
platform
° ° ° ° ° ° ° ° ° ° ° ° ° ° °
Script
Pig
SQL
Hive
Java
Scala
Cascading
Stream
Storm
Search
Solr
NoSQL
HBase
Accumulo
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
In-Memory
Spark
Others
ISV
Engines
1 ° ° ° ° ° ° ° ° ° ° ° ° ° °
(Cluster Resource Management)
HDFS
(Hadoop Distributed File System)
Tez Slider Slider Tez Tez
OPERATIONS
Horizontal Integration for Enterprise Services
Ensure consistent enterprise services are applied across the Hadoop stack

What do distributions do?
• Define a stack of components
– Rich and latest set of Apache Projects (open source & open community) without lock-in
• Vertical and Horizontal integration of components
– Vertical: Best Speed and Scale
– Horizontal: Open Enterprise Ready
• Provision and Upgrade stack
– Robust, Easy and Anywhere
• Accelerate time to value (ease of use)
– New Face of Hadoop with UIs from Ambari, Ambari Views, Ranger, Falcon, Atlas
• Partner Ecosystem
– Rich and Deep
• Support
– Industry’s best, SmartSense and influence in the community

How Do You Operate a Hadoop Cluster?
Apache™ Ambari is a platform
to provision, manage and
monitor Hadoop clusters

Ambari Core Features and Extensibility
Install & Configure
Operate, Manage &
Administer
Develop
Optimize & Tune
Developer
Data Architect
Ambari provides core services for operations and development, and
extension points for both
Extensibility Features
Stacks, Blueprints & REST APIs
Core Features
Install Wizard & Web
Web, Operator Views,
Metrics & Alerts
User Views
User Views
Views Framework & REST APIs
Views Framework
Views Framework
How?
Cluster Admin

New user interface enables fast &
easy SQL definition and execution.

New User Views for DevOps
Capacity Scheduler View
Browse and manage YARN queues
Tez View
View information related to Tez jobs that
are executing on the cluster

New User Views for Development
Pig View
Author and execute Pig Scripts.
Hive View
Author, execute and debug Hive
queries.
Files View
Browse HDFS file system.

Apache Zeppelin
• Web-based notebook for data engineers, data
analysts and data scientists
• Brings interactive data ingestion, data
exploration, visualization, sharing and
collaboration features to Hadoop and
Spark
• Modern data science studio
• Scala with Spark
• Python with Spark
• SparkSQL
• Apache Hive, and more.

Access patterns enabled by YARN
Nodes 1 … N
HDFS
Hadoop Distributed File System
Applications Batch
Needs to happen, but with no
timeframe limitations
Interactive
Needs to happen at
Human time
Real-Time
Needs to happen at
Machine Execution time.

Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop
• Quickly find value in raw data files
• Proven at petabyte scale
• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy,
Business Objects, etc…
Sensor
Mobile
Weblog
Operational
/ MPP
SQL Queries

Hive and the Stinger Initiative
Base Optimizations
Generate simplified DAGs
In-memory Hash Joins
Vector Query Engine
Optimized for modern processor
architectures
Tez
Express tasks more simply
Eliminate disk writes
Pre-warmed Containers
ORCFile
Column Store
High Compression
Predicate / Filter Pushdowns
YARN
Next-gen Hadoop data processing
framework
+ +
Query Planner
Intelligent Cost-Based Optimizer
Performance Optimizations
100x+ faster time to insight
Deeper analytical capabilities

Stinger.next and Sub-Second SQL
Emergence of LLAP brings Sub-Second SQL response times within reach with Hive.
BATCH & INTERACTIVE; BATCH & INTERACTIVE; BATCH, INTERACTIVE & SUB-SECOND
SPEED
DELIVERY
SQL
UPDATES
ENGINES
STINGER
DELIVERED
PROGRESS
DELIVERED
FINAL
VERSION
HDP 2.1
VERSION
0.13
VERSION
HDP 2.3
VERSION
1.2.1
SQL:2003+ SQL:2011 SUBSET
READ-ONLY SQL INSERT/UPDATE/DELETE
MR, TEZ MR, TEZ
FUTURE
STINGER NEXT
COMPLETE ACID SUPPORT INCLUDING MERGE
COMPREHENSIVE SQL:2011 BASED ANALYTICS
MR, TEZ, LLAP
DELIVERED IN DEVELOPMENT
Tiered Data Storage
Stinger.next Phase 3
YARN: Containerized
Applications

Data Types SQL Features File Formats Latest Additions…
Numeric Core SQL Features Columnar Scalable Cross Product
FLOAT/DOUBLE Date, Time and Arithmetical Functions ORCFile Primary Key / Foreign Key
DECIMAL INNER, OUTER, CROSS and SEMI Joins Parquet Non-Equijoin
INT/TINYINT/SMALLINT/BIGINT Derived Table Subqueries
Text
Tech Preview:
Proc. Extensions (PL/SQL)
BOOLEAN Correlated + Uncorrelated Subqueries CSV Future
String UNION ALL Logfile ACID MERGE
CHAR / VARCHAR UDFs, UDAFs, UDTFs Nested / Complex Multi Subquery
STRING Common Table Expressions Avro Comparison to sub-select
BINARY UNION DISTINCT JSON INTERSECT and EXCEPT
Date, Time Advanced Analytics XML
DATE OLAP and Windowing Functions Custom Formats
TIMESTAMP CUBE and Grouping Sets Other Features
Interval Types Nested Data Analytics XPath Analytics
Complex Types Nested Data Traversal
ARRAY Lateral Views
MAP ACID Transactions
STRUCT INSERT / UPDATE / DELETE
UNION
Apache Hive: Journey to SQL:2011 Analytics
Legend
Existing
Future
New with Hive 2.0

Storage
Columnar Storage
ORCFile Parquet
Unstructured Data
JSON CSV
Text Avro
Custom
Weblog
Engine
SQL Engines
Row Engine Vector Engine
SQL
SQL Support
SQL:2011 Optimizer HCatalog HiveServer2
Cache
Block Cache
Linux Cache
Distributed
Execution
Hadoop 1
MapReduce
Hadoop 2
Tez Spark
Vector Cache
LLAP
Persistent Server
Historical
Current
In-development
Legend
Apache Hive: Modern Architecture

Apache Tez is a critical innovation of the Stinger Initiative.
• Along with YARN, Tez not only improves
Hive, but improves all things batch and interactive
for Hadoop; Pig, Cascading…
• More Efficient Processing than MapReduce
• Reduce operations and complexity of back end processing
• Allows for Map Reduce Reduce which saves hard disk operations
• Implements a “service” which is always on, decreasing start times of jobs
• Allows Caching of Data in Memory
YARN
Dev
Cascading/
Scalding
Why is Tez Important?
Nodes 1 … N
HDFS
(Hadoop Distributed File
System)
Scripting
Pig
SQL
Hive
Tez Tez
Applications
Tez

Apache Tez
Hive – MapReduce Hive – Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: under Hive on MapReduce, the query runs as a chain of separate map (M) and reduce (R) jobs (the a-b join, the a-c join, then GROUP BY a.state with COUNT(*) and AVG(c.price)), with HDFS shown between the jobs. Under Hive on Tez, the same steps run as one graph of M and R tasks.]
Tez avoids unneeded writes to
HDFS
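
To see what the query on this slide computes, it can be run on toy data. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made only so the example runs anywhere; the table contents are invented, and the ORDER BY is added so the result order is predictable. It says nothing about how Tez or MapReduce would execute the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three small tables with the columns the slide's query refers to.
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
    INSERT INTO a VALUES (1, 10, 'CA'), (2, 11, 'CA'), (3, 10, 'NY');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 4.0), (11, 6.0);
    """
)

# The query from the slide, plus ORDER BY for a stable row order.
QUERY = """
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
ORDER BY a.state
"""

rows = conn.execute(QUERY).fetchall()
for state, row_count, avg_price in rows:
    print(state, row_count, avg_price)
```

The two joins and the aggregation are the stages that appear as separate jobs in the MapReduce half of the slide.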

Scripting Data Pipeline & ETL
Apache Pig
• Data flow engine and scripting language (Pig Latin)
• Allows you to transform data and datasets
Advantages over MapReduce
• Reduces time to write jobs
• Community support
• Piggybank has a significant number of UDFs to help adoption
• There are a large number of existing shops using Pig

Pig Latin
• Pig executes in a unique fashion:
o During execution, each statement is processed by the Pig interpreter
o If a statement is valid, it gets added to a logical plan built by the
interpreter
o The steps in the logical plan do not actually execute until a DUMP or
STORE command is used
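
The deferred execution described above can be imitated in a few lines. This is a conceptual sketch in Python, not Pig's implementation or API: `LazyPlan` and its methods are hypothetical names, and the "plan" is just a list of functions that run only when `dump()` is called.

```python
class LazyPlan:
    """Collects steps in a logical plan and runs them only on dump()."""

    def __init__(self, source):
        self.source = source
        self.steps = []          # the logical plan: (name, operation) pairs
        self.executed_steps = 0  # how many steps have actually run

    def filter(self, predicate):
        # Only records the step; no data is touched yet.
        self.steps.append(("FILTER", lambda rows: [r for r in rows if predicate(r)]))
        return self

    def foreach(self, fn):
        # Only records the step; no data is touched yet.
        self.steps.append(("FOREACH", lambda rows: [fn(r) for r in rows]))
        return self

    def dump(self):
        # Plays the role of DUMP: execute every recorded step in order.
        rows = list(self.source)
        for _name, operation in self.steps:
            rows = operation(rows)
            self.executed_steps += 1
        return rows


plan = LazyPlan([1, 2, 3, 4, 5, 6]).filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)
steps_before_dump = plan.executed_steps   # nothing has run so far
result = plan.dump()
print([name for name, _ in plan.steps], steps_before_dump, result)
```

Deferring execution gives the engine the whole plan to look at before running anything, which is what makes it possible to optimize the pipeline as a unit.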

Why use Pig?
• Maybe we want to join two datasets, from different sources, on a
common value, and want to filter, and sort, and get top 5 sites
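
As a rough illustration of that pipeline (join, filter, sort, top 5), here is a plain-Python sketch. The datasets, the age range, and the function name are invented for the example; in Pig each commented step would be a single Pig Latin statement.

```python
from collections import Counter

# Two hypothetical datasets from different sources, sharing the user name.
users = [("amy", 22), ("bob", 31), ("cat", 19), ("dan", 24), ("eve", 45)]
visits = [
    ("amy", "news.example"), ("amy", "shop.example"), ("amy", "maps.example"),
    ("bob", "news.example"),
    ("cat", "news.example"), ("cat", "video.example"), ("cat", "wiki.example"),
    ("dan", "shop.example"), ("dan", "news.example"), ("dan", "blog.example"),
    ("eve", "mail.example"),
]


def top_sites(users, visits, min_age, max_age, limit=5):
    ages = dict(users)
    # JOIN the two datasets on the common value (user name).
    joined = [(user, site, ages[user]) for user, site in visits if user in ages]
    # FILTER to the age range of interest.
    filtered = [row for row in joined if min_age <= row[2] <= max_age]
    # GROUP by site and count the visits.
    counts = Counter(site for _user, site, _age in filtered)
    # ORDER by count (descending), then by site name to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # LIMIT to the top entries.
    return ranked[:limit]


top5 = top_sites(users, visits, min_age=18, max_age=25)
for site, visit_count in top5:
    print(site, visit_count)
```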

ResourceManagement
Storage
Elegant Developer APIs
DataFrames, Machine Learning, and SQL
Made for Data Science
All apps need to get predictive at scale and fine granularity
Democratize Machine Learning
Spark is doing to ML on Hadoop what Hive did for SQL on
Hadoop
Community
Broad developer, customer and partner interest
Realize Value of Data Operating System
A key tool in the Hadoop toolbox
Apache Spark enthusiasm
Applications
Spark Core Engine
Scala
Java
Python
libraries
MLlib
(Machine
learning)
Spark
SQL*
Spark
Streaming*
Spark Core Engine

Apache Spark & Apache Hadoop Perfect Together
General Purpose Data Access Engine
for fast, large-scale data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python and R
Built-in Libraries
Enable data workers to rapidly iterate over data for:
ETL, Machine Learning, SQL and Stream processing
YARN
Scala
Java
Python
R
APIs
Spark Core Engine
Spark
SQL
Spark
Streaming
MLlib GraphX
Nodes 1 … N
HDFS

Apache Projects Enable Access Patterns
Various open source projects have
incubated in order to meet these access
pattern needs
Today, they can all run on a single cluster
on a single set of data because of YARN
All powered by a broad open community
Nodes 1 … N
HDFS
Hadoop Distributed File System
Interactive
Solr
Spark
Hive
Pig
Real-Time
HBase
Accumulo
Storm
Batch
MapReduce
Applications
Kafka

Connected Data Platforms

Connected Data Platforms Enable Architectural Transformations
Data in
Motion
(Cloud)
Data in
Motion
(on-premises)
Data at
Rest
(on-premises)
Edge
Data
Data in
Motion
Edge
Analytics
Data at
Rest
(Cloud)
Edge
Data
Data at
Rest
(on-premises)
Closed
Loop
Analytics
Machine
Learning
Deep
Historical
Analysis

Must-have Considerations for Technology
Continuous Data
Life Cycle
Real-time
insights from
origin to rest
Enterprise
Ready
Management
Security
Governance
Deployment
Flexibility
On Premise
Cloud
Hybrid
Open
Innovation
Architecture
Community
Ecosystem

HDP 2.4 Sandbox
• Provides a free, preconfigured HDP
– Runs in a Virtual Machine or on Azure
– Hortonworks.com/sandbox
• Easy to Use
– Operations: Ambari
– Dev and DevOps: Ambari User Views
– Web Notebook: Zeppelin
• Works with 60+ free tutorials
– Hortonworks.com/tutorials

Data Discovery Lab
• Elefante Wine Company has a fleet of over 100 trucks.
• The geolocation data collected from the trucks contains events generated while the truck drivers are
driving.
• The company’s goal with Hadoop is to Mitigate Risk:
o Understand correlations between miles driven and events
o Compute the risk factor for each driver based on mileage & events
o Lab Env
o Sandbox 2.4
o Lab Doc
o URL: http://goo.gl/14OAat
o Load Data
o Query Data
o Process Data

Elefante Wine Current Challenges
The Company
Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine
in a highly-regulated industry with stringent transportation requirements.
The Situation
Recently a number of driver violations led to fines and increased insurance rates
The Challenges
• Rising Operational Costs
• Driver Safety
• Risk Management
• Logistics Optimization

Elefante Wine Company has a large fleet of trucks in USA
A truck generates millions of events for a
given route; an event could be:
• ‘Normal’ events: starting / stopping of the
vehicle
• ‘Violation’ events: speeding, excessive
acceleration and braking, unsafe tail distance
Company uses an application that monitors
truck locations and violations from the
truck/driver in real-time to calculate risk
Route?
Truck?
Driver?
Analysts query a broad
history to understand if
today’s violations are
part of a larger problem
with specific routes,
trucks, or drivers

Elefante Wine Risk and Driver Safety Challenges
Trucks outfitted with new sensors generating large
volumes of new data:
• Location
• Speed
• Driver Violations
Need to integrate real-time & historical data
Increase safety and reduce liabilities
Anticipate driver violations BEFORE they
happen and take precautionary actions
Find predictive correlations in driver behavior over
large volumes of real-time data
Difficult to deliver timely insights to the right
people and systems to take action
Data Discovery
Uncover new
findings
Predictive Analytics
Identify your next best
action
Better Understanding
of the Past
Better Prediction
of the Future

What’s our goal?
• Solution:
– Collect additional data via sensors in trucks to better understand Risk Factors
• How:
– Quickly store new sensor data in a common repository
– Prepare the data for analysis
– Explore the data
– Calculate Risk
– Generate a report
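
The "Calculate Risk" and "Generate a report" steps might look like the sketch below. The slides do not give the risk formula, so the metric used here (violation events per 1,000 miles driven) is an assumption for illustration only, as are the driver IDs, event names, and mileage figures; the lab document defines the calculation actually used.

```python
from collections import defaultdict

NORMAL = "normal"  # any other event type is treated as a violation


def risk_by_driver(events, mileage):
    """Assumed metric: violation events per 1,000 miles, per driver."""
    violations = defaultdict(int)
    for driver, event_type in events:
        if event_type != NORMAL:
            violations[driver] += 1
    return {
        driver: 1000 * violations[driver] / miles
        for driver, miles in mileage.items()
        if miles > 0  # skip drivers with no recorded mileage
    }


# Hypothetical geolocation events: (driver, event type).
events = [
    ("A1", "normal"), ("A1", "speeding"), ("A1", "unsafe tail distance"),
    ("A2", "normal"), ("A2", "speeding"),
    ("A3", "normal"),
]
# Hypothetical total miles driven per driver.
mileage = {"A1": 4000, "A2": 10000, "A3": 5000}

risk = risk_by_driver(events, mileage)

# Report: highest risk first.
report = sorted(risk.items(), key=lambda item: item[1], reverse=True)
for driver, score in report:
    print(f"{driver}: {score:.2f} violations per 1,000 miles")
```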

Move Data Into Hadoop
Geolocation.csv
trucks.csv
Geolocation_stage Geolocation
Trucks_stage Trucks
csv
csv ORC
ORC
SQL
SQL
move
LOAD

Geolocation
Trucks
ORC
ORC
SQL
SQL
PIG or Spark
Risk Calculation
Truck_mileage
ORC
Avg_mileage
ORC
DriverMileage
ORC
RiskFactor
ORC
Events
ORC
Trucking Risk Analysis – Hadoop ELT

developer.hortonworks.com

Hortonworks Nourishes the Community
HORTONWORKS COMMUNITY CONNECTION
HORTONWORKS PARTNERWORKS
https://community.hortonworks.com

Thank you!
rafael@hortonworks.com
@racoss
