Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Crash Course
Winter 2015
Version 1.0
Hortonworks. We do Hadoop.
Rafael Coss
rafael@hortonworks.com
@racoss
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Crash Course
• Why Hadoop?
• Hadoop Ecosystem & Distribution
• Store Data (HDFS)
• Process Data in Hadoop 1 (MapReduce)
• Process Data in Hadoop 2 (YARN + MapReduce/Tez)
• Access Data
• Lab
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What disrupted the data center?
?
Data?
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Traditional World Of Applications And Data Silos
Constrains data to specific apps
No insight across ALL data
Built for structured data
Does not scale (cost and tech)
ERP CRM SCM WEB
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
New Data Paradigm Opens Up New Opportunity
2.8 zettabytes
in 2012
44 zettabytes
in 2020
NEW
1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research
Clickstream
ERP, CRM, SCM
Web & social
Geolocation
Internet of Things
Server logs
Files, emails
Transform every industry via
full fidelity of data and analytics
Opportunity
TRADITIONAL
LAGGARDS
LEADERS
Ability to
Consume Data
Enterprise
Blind Spot
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop YARN-based Architecture Unlocks Opportunity
Consolidates all data sets
Delivers real-time insights
Integrates with data center
Scalable and affordable
TURN ALL OF YOUR DATA INTO VALUE
| Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Two Paths in a Customer’s Journey to a Data Lake
SCALE
SCOPE
Goal:
• Centralized Architecture
• Data-driven Business
DATA LAKE
Journey to the Data Lake with Hadoop
Systems of Insight
The journey begins
with either:
1. Cost Optimization (Data
Architecture Optimization)
2. Advanced Analytic
Applications
Leaders are Data Driven
Advanced Analytic
Apps
Cost
Optimization
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Common Drivers of Hadoop Adoption
Data Architecture
Optimization
Keep 100% of Data
at up to 1/100 the Cost
and Enrich DW Analytics
Single
View
Customer
Product
Supply Chain
Predictive
Analytics
Behavioral Insight
Preventive Maintenance
Resource Optimization
Data
Discovery
Explore Datasets
Uncover New Findings
Operationalize Insights
Industry Hadoop Adoption Journey
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Ecosystem
runs on
ETL
RDBMS Import/Export
Distributed Storage & Processing Framework
Secure NoSQL DB
SQL on HBase
NoSQL DB
Workflow Management
SQL
Streaming Data Ingestion
Cluster System Operations
Secure Gateway
Distributed Registry
ETL
Search & Indexing
Even Faster Data Processing
Data Management
Machine Learning
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Architecture
Data Access Engines
Distributed Reliable Storage
Distributed Compute Framework
Resource Mgt, Data Locality
Data Operating System
Batch Interactive Streaming
Governance Security
Apps
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Key Services
Hortonworks Data Platform
Multi-tenant data platform built on a centralized
architecture of shared enterprise services
YARN: data operating system
Governance Security
Operations
Resource management
Existing
applications
New
analytics
Partner
applications
Data access: batch, interactive, real-time
Storage
Key Services
Resource and workload management
Scalable tiered storage
Consistent operations
Comprehensive security
Trusted data governance
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
Ensure consistent enterprise services are applied across the Hadoop stack
Vertical
Integration with
YARN and HDFS
Ensure engines can
run reliably and
respectfully in a YARN
based cluster
Provision,
Manage &
Monitor
Ambari
Zookeeper
Scheduling
Oozie
Load data and
manage
according
to policy
Provide layered
approach to
security through
Authentication,
Authorization,
Accounting, and
Data Protection
SECURITY
GOVERNANCE
Deploy and
effectively
manage the
platform
Script
Pig
SQL
Hive
Java
Scala
Cascading
Stream
Storm
Search
Solr
NoSQL
HBase
Accumulo
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
In-Memory
Spark
Others
ISV
Engines
YARN: Data Operating System
(Cluster Resource Management)
HDFS
(Hadoop Distributed File System)
Tez Slider Slider Tez Tez
OPERATIONS
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
+
/directory/structure/in/memory.txt
Resource management + scheduling
Disk, CPU, Memory
Core
NameNode
HDFS
ResourceManager
YARN
Hadoop daemon
User application
NN
RM
DataNode
HDFS
NodeManager
YARN
Worker Node
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Joys of Real Hardware (Jeff Dean)
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Distributed File System (HDFS)
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not Just storage but computation
[Diagram: a logical file is split into blocks 1 to 4, and each block is stored three times on different nodes of the cluster]
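The block-and-replica idea on this slide can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the 4-byte block size, the node names, and the purely random placement are assumptions chosen to keep the example small (real HDFS blocks are far larger and placement also takes racks into account).

```python
import random

BLOCK_SIZE = 4     # bytes per block; tiny on purpose, hypothetical value
REPLICATION = 3    # copies of each block, as on the slide


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a logical file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = REPLICATION, seed: int = 0) -> dict[int, list[str]]:
    """Pick `replication` distinct nodes for every block id."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return {block_id: rng.sample(nodes, replication) for block_id in range(num_blocks)}


logical_file = b"We the people, in"
nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical worker nodes

blocks = split_into_blocks(logical_file)
placement = place_replicas(len(blocks), nodes)
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.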
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
The DataNodes
DataNodes to NameNode:
“I’m still here! This is my latest heartbeat.”
“I’m here too! And here is my latest heartbeat.”
NameNode to DataNode 1:
“Hey DataNode 1, replicate block 123 to DataNode 3.”
[Diagram: NameNode with DataNode 1, DataNode 3 and DataNode 4; block 123]
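A rough Python sketch of the heartbeat exchange above: a node that stops reporting is treated as dead, and its blocks are copied from a surviving holder to a node that lacks them. The timeout value, the node names, and the function names are hypothetical; this is not the NameNode's actual logic.

```python
HEARTBEAT_TIMEOUT = 30   # seconds of silence before a node counts as dead (hypothetical)
TARGET_REPLICAS = 3


def dead_nodes(last_heartbeat: dict[str, float], now: float) -> set[str]:
    """Nodes whose latest heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > HEARTBEAT_TIMEOUT}


def plan_rereplication(block_locations: dict[int, list[str]],
                       last_heartbeat: dict[str, float],
                       now: float) -> list[tuple[int, str, str]]:
    """Return (block, source, target) copy instructions for under-replicated blocks."""
    dead = dead_nodes(last_heartbeat, now)
    live = sorted(set(last_heartbeat) - dead)
    plan = []
    for block, holders in sorted(block_locations.items()):
        alive_holders = [n for n in holders if n not in dead]
        if not alive_holders:
            continue  # no surviving copy to replicate from
        missing = max(TARGET_REPLICAS - len(alive_holders), 0)
        candidates = [n for n in live if n not in alive_holders]
        for target in candidates[:missing]:
            plan.append((block, alive_holders[0], target))
    return plan


heartbeats = {"DataNode1": 100.0, "DataNode2": 50.0, "DataNode3": 100.0, "DataNode4": 100.0}
locations = {123: ["DataNode1", "DataNode2", "DataNode4"]}
plan = plan_rereplication(locations, heartbeats, now=100.0)
print(plan)
```

In this made-up scenario DataNode2 has gone quiet, so the plan asks DataNode1 to copy block 123 to DataNode3, mirroring the message on the slide.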
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Batch Processing in Hadoop
MapReduce
Batch Access to Data
Original data access mechanism for Hadoop
• Framework
Made for developing distributed applications to
process vast amounts of data in-parallel on large
clusters
• Proven
Reliable interface to Hadoop which works from
GB to PB. But, batch oriented – Speed is not its
strong point.
• Ecosystem
Ported to Hadoop 2 to run on YARN. Supports
original investments in Hadoop by customers and
partner ecosystem.
DataNode1
Mapper
Data is shuffled
across the network
& sorted
Map
Phase
Shuffle/Sort Reduce Phase
MapReduce Job Lifecycle
Saying that MapReduce is dead is
preposterous
- Would limit us to only new workloads
- ALL Hadoop clusters use MapReduce
- Why rewrite everything immediately?
DataNode2
Mapper
DataNode3
Mapper
DataNode1
Reducer
DataNode2
Reducer
DataNode3
Reducer
YARN: Data Operating System
Batch Interactive Real-Time
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is MapReduce?
Break a large problem into sub-solutions
Map
• Iterate over a large # of records
• Extract something of interest from
each record
Shuffle
• Sort Intermediate results
Reduce
• Aggregate, summarize, filter or
transform intermediate results
• Generate final output
[Diagram: input data → Map processes (Read & ETL) → Shuffle & Sort → Reduce processes (Aggregation) → output data]
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
WordCount in MapReduce
1. The mappers read the file’s blocks from HDFS line-by-line (input: constitution.txt in HDFS)
   We the people, in order to form a...
2. The lines of text are split into words and output to the reducers
   <We,1> <the,1> <people,1> <in,1> <order,1> <to,1> <form,1> <a,1>
3. The shuffle/sort phase combines pairs with the same key
   <We,(1,1,1,1)> <the,(1,1,1,1,1,1,1,...)> <people,(1,1,1,1,1)> <form,(1)>
4. The reducers add up the “1’s” and output the word and its count (output: HDFS)
   <We,4> <the,265> <people,5> <form,1>
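The four WordCount steps can be mimicked on a single machine in plain Python. This is a sketch of the data flow only (no Hadoop API is used), and the two input lines are made-up sample text rather than the real constitution.txt.

```python
import re
from collections import defaultdict


def map_phase(line: str):
    """Step 2: emit a (word, 1) pair for every word in a line."""
    for word in re.findall(r"[A-Za-z']+", line):
        yield (word, 1)


def shuffle_sort(pairs) -> dict[str, list[int]]:
    """Step 3: group the values of identical keys and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word: str, ones: list[int]) -> tuple[str, int]:
    """Step 4: add up the 1's for one word."""
    return (word, sum(ones))


lines = ["We the people, in order to form a", "the people form"]  # sample input

pairs = [pair for line in lines for pair in map_phase(line)]   # steps 1 and 2
grouped = shuffle_sort(pairs)                                  # step 3
counts = dict(reduce_phase(w, ones) for w, ones in grouped.items())  # step 4
print(counts)
```

On a cluster the mappers and reducers run on different nodes and the shuffle moves the pairs across the network; here all three phases simply run in sequence.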
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
1st Gen Hadoop: Cost Effective Batch at Scale
HADOOP 1.0
Built for Web-Scale Batch Apps
Silos created for distinct use cases
[Diagram: separate clusters, each running a single app (BATCH on HDFS, INTERACTIVE, ONLINE)]
Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for
managing large volumes of high velocity and variety of data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
• Manages new data paradigm
• Handles data at scale
• Cost effective
• Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce
Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What do iOS 6 and Windows 3.1 have in common?
Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Beyond Batch with YARN
HDFS
MapReduce
Pig
(data flow)
Hive
(SQL)
Others
API,
Engine, and
System
Hadoop 1
MapReduce as the Base
HDFS
(redundant, reliable storage)
YARN
(Data Operating System: resource management, etc.)
Tez
(modern execution engine)
Data Flow
Pig
SQL
Hive
Java Apps
Cascading
Batch
MapReduce
Hadoop 2
Apache YARN as a Base
System
Engine
APIs
Single Use System
Batch Apps
Multi Use Data Platform
Batch, Interactive, Online, Streaming, …
A shift from the old to the new…
Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Tez is a critical innovation of the Stinger Initiative.
• Along with YARN, Tez not only improves
Hive, but improves all things batch and
interactive for Hadoop; Pig, Cascading…
• More Efficient Processing than MapReduce
• Reduce operations and complexity of back end processing
• Allows for Map-Reduce-Reduce (chained reducers without an intermediate map step), which saves hard disk operations
• Implements a “service” which is always on, decreasing start times
of jobs
• Allows Caching of Data in Memory
YARN
Dev
Cascading/Scalding
Why is Tez Important?
HDFS
(Hadoop Distributed File System)
Scripting
Pig
SQL
Hive
Tez Tez
Applications
Tez
YARN: Data Operating System
Batch Interactive Real-Time
Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Tez
Hive – MapReduce Hive – Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: on MapReduce the query runs as a chain of map (M) and reduce (R) jobs for JOIN (a, c), JOIN (a, b), and GROUP BY a.state with COUNT(*) and AVG(c.price), with intermediate results written to HDFS between jobs; on Tez the same operators run as one graph of M and R tasks]
Tez avoids unneeded
writes to HDFS
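The difference in storage traffic can be illustrated with a toy model in Python. Nothing here is Tez or MapReduce code: `CountingStore` is a hypothetical stand-in for HDFS that only counts writes, and the three stages and the sample rows are invented for the example.

```python
class CountingStore:
    """Stand-in for HDFS that counts how many datasets get written to it."""

    def __init__(self):
        self.writes = 0
        self._data = {}

    def preload(self, name, rows):
        # Input that already sits in storage; not counted as a job write.
        self._data[name] = list(rows)

    def write(self, name, rows):
        self.writes += 1
        self._data[name] = list(rows)

    def read(self, name):
        return self._data[name]


def keep_priced(rows):
    return [(state, price) for state, price in rows if price > 0]


def group_by_state(rows):
    groups = {}
    for state, price in rows:
        groups.setdefault(state, []).append(price)
    return sorted(groups.items())


def count_and_average(rows):
    return [(state, len(prices), sum(prices) / len(prices)) for state, prices in rows]


def run_as_job_chain(stages, rows, store):
    """MapReduce style: every stage is its own job and persists its output."""
    store.preload("input", rows)
    current = "input"
    for index, stage in enumerate(stages):
        output = stage(store.read(current))
        current = f"job-{index}-output"
        store.write(current, output)
    return store.read(current)


def run_as_single_dag(stages, rows, store):
    """DAG style: stages pass rows along in memory; only the result is persisted."""
    for stage in stages:
        rows = stage(rows)
    store.write("result", rows)
    return store.read("result")


STAGES = [keep_priced, group_by_state, count_and_average]
SALES = [("CA", 10.0), ("CA", 30.0), ("NY", 20.0), ("NY", 0.0)]

chain_store, dag_store = CountingStore(), CountingStore()
chain_result = run_as_job_chain(STAGES, SALES, chain_store)
dag_result = run_as_single_dag(STAGES, SALES, dag_store)
print(chain_result == dag_result, chain_store.writes, dag_store.writes)
```

Both runs produce the same rows, but the job chain writes once per stage while the single graph writes only the final result.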
Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP delivers a Centralized Architecture
YARN
Other Pure Play Vendors
A siloed “with” YARN architecture
Disjoint, Siloed Clusters
• Inefficient use of resources, single tenant, duplicate storage & processing
• Multiple implementations of governance, security and operations
• New applications require new clusters
Hortonworks Data Platform
A centralized architecture built on YARN
Cluster1
Application
Security
Storage
YARN
Governance
Operations
Batch
Storage
YARN: Data Operating System
Governance Security
Operations
Resource Management
Existing
Applications
New
Analytics
Partner
Applications
(e.g., SAS)
Cluster2
Application
Security
Storage
Governance
Operations
ClusterN
Application
Security
Storage
Governance
Operations
…
Interactive
Dedicated
Resource mgt
Real-time
Dedicated
Resource mgt
Single cluster, multiple applications
• Efficient storage, processing
• Centralized Security, Operations, Governance
• Run a variety of applications simultaneously
Data Access: Batch, Interactive & Real-time
Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
{Processing + Storage}
=
{MapReduce/YARN + HDFS}
=
{Core Hadoop}
Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enable applications to have access to
all your enterprise data through an
efficient centralized platform
• Supported with a centralized approach to
governance, security and operations
• Versatile to handle any applications
and datasets no matter the size or type
Clickstream Web
& Social
Geolocation Sensor
& Machine
Server
Logs
Unstructured
SOURCES
Existing Systems
ERP CRM SCM
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
ANALYTICS
Applications
Business
Analytics
Visualization
& Dashboards
HDFS
(Hadoop Distributed File System)
YARN: Data Operating System
Batch Interactive Real-Time Partner ISV Batch Batch
MPP
EDW
Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is Data Access?
Data Access defines ALL the channels
through which data can be accessed,
analyzed, cleansed and consumed within
Hadoop. Each channel can be categorized
into THREE core patterns: Batch, Interactive
and Real-time.
Multiple engines provide
optimized access to your mission
critical data.
Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Access patterns enabled by YARN
Batch
Needs to happen, but with no
timeframe limitations
Interactive
Needs to happen at
Human time
Real-Time
Needs to happen at
Machine Execution time.
YARN: Data Operating System
HDFS
(Hadoop Distributed File System)
Batch Interactive Real-Time
Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Projects Enable Access Patterns
• Various Open Source
projects have incubated
in order to meet these
access pattern needs
• Today, they can all run
on a single cluster on a
Single set of data
because of YARN!
• ALL powered by a
BROAD Open
Community
YARN: Data Operating System
HDFS
(Hadoop Distributed File System)
Batch
MapReduce
Pig
Hive
Interactive
Solr
Spark
Hive
Kafka
Real-Time
HBase
Accumulo
Storm
Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Scripting Data Flow & ETL
Apache Pig
• Data flow engine and scripting language (Pig Latin)
• Allows you to transform data and datasets
Advantages over MapReduce
• Reduces time to write jobs
• Community support
• Piggybank has a significant number of UDFs to help adoption
• There are a large number of existing shops using Pig
YARN: Data Operating System
Batch Interactive Real-Time
Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pig Latin
• Pig executes in a unique fashion:
o During execution, each statement is processed by the Pig
interpreter
o If a statement is valid, it gets added to a logical plan built by the
interpreter
o The steps in the logical plan do not actually execute until a
DUMP or STORE command is used
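The deferred execution described above can be imitated in Python. This is not how the Pig interpreter is implemented; `LazyRelation`, `load_sites`, and the sample rows are hypothetical names and data used only to show that steps pile up in a plan until a DUMP-like call runs them.

```python
class LazyRelation:
    """Collects transformation steps; nothing runs until dump() is called."""

    def __init__(self, source, plan=()):
        self._source = source      # callable that loads the rows
        self._plan = tuple(plan)   # (name, function) steps added so far

    def filter(self, predicate):
        step = ("FILTER", lambda rows: [r for r in rows if predicate(r)])
        return LazyRelation(self._source, self._plan + (step,))

    def order_by(self, key):
        step = ("ORDER", lambda rows: sorted(rows, key=key))
        return LazyRelation(self._source, self._plan + (step,))

    def explain(self):
        """Show the logical plan without executing it."""
        return [name for name, _ in self._plan]

    def dump(self):
        """Load the data and run every planned step, like DUMP in Pig."""
        rows = list(self._source())
        for _, step in self._plan:
            rows = step(rows)
        return rows


reads = []  # records each time the source is actually read


def load_sites():
    reads.append("read")
    return [("a.example", 3), ("b.example", 9), ("c.example", 5)]


busy = LazyRelation(load_sites).filter(lambda r: r[1] > 4).order_by(lambda r: r[1])
print(busy.explain())   # the plan exists
print(len(reads))       # but the source has not been read yet
print(busy.dump())      # execution happens here
```

Building the plan first gives the engine a chance to look at all steps together before any data is touched.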
Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Why use Pig?
• Maybe we want to join two datasets, from different sources, on a
common value, and want to filter, and sort, and get top 5 sites
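For comparison, here is the same kind of task (join, filter, sort, top 5) written by hand in Python on in-memory lists. The datasets, field layout, and age range are invented for illustration; on a cluster, a few lines of Pig Latin would express these steps over much larger data.

```python
from collections import Counter


def top_sites(users, visits, min_age, max_age, n=5):
    """Join visits to users on the user name, filter by age, count and rank sites."""
    ages = {name: age for name, age in users}                          # lookup side of the join
    joined = [(user, url) for user, url in visits if user in ages]     # JOIN on user name
    kept = [url for user, url in joined if min_age <= ages[user] <= max_age]  # FILTER
    return Counter(kept).most_common(n)                                # GROUP, COUNT, ORDER, LIMIT


users = [("amy", 22), ("bob", 41), ("cat", 19)]      # hypothetical source 1
visits = [                                           # hypothetical source 2
    ("amy", "news.example"), ("cat", "news.example"), ("bob", "shop.example"),
    ("amy", "mail.example"), ("dan", "news.example"), ("cat", "mail.example"),
    ("cat", "news.example"),
]

ranking = top_sites(users, visits, min_age=18, max_age=25)
print(ranking)
```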
Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Hive: THE de facto standard for SQL in Hadoop
• What?
• Treat your data in Hadoop as tables
• Provides a standard SQL 92 interface to data in Hadoop
• Why?
• Shipped in every distribution… you already have it (although some do not
ship complete versions)
• Quickly find value in raw data files
• Proven at petabyte scale for both batch and interactive queries
• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy,
Business Objects, etc…
Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive Architecture
1. User issues SQL query
2. Hive parses and plans query
3. Query converted to MapReduce and executed on Hadoop
[Diagram: clients (Web UI, JDBC / ODBC, CLI) send Hive SQL to HiveServer2; Hive (Compiler, Optimizer, Executor) consults the Hive MetaStore (MySQL, PostgreSQL, Oracle) and submits a MapReduce or Tez job to Hadoop for data-local processing]
Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Using Tez for Hive Queries
Set the following property in either hive-site.xml or in
your script:
set hive.execution.engine=tez;
Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
SQL Compliance
Evolution of SQL Compliance in Hive
SQL Datatypes | SQL Semantics
INT/TINYINT/SMALLINT/BIGINT | SELECT, INSERT
FLOAT/DOUBLE | GROUP BY, ORDER BY, HAVING
BOOLEAN | JOIN on explicit join key
ARRAY, MAP, STRUCT, UNION | Inner, outer, cross and semi joins
STRING | Sub-queries in the FROM clause
BINARY | ROLLUP and CUBE
TIMESTAMP | UNION
DECIMAL | Standard aggregations (sum, avg, etc.)
DATE | Custom Java UDFs
VARCHAR | Windowing functions (OVER, RANK, etc.)
CHAR | Advanced UDFs (ngram, XPath, URL)
Interval Types | Sub-queries for IN/NOT IN, HAVING
 | JOINs in WHERE Clause
 | INSERT/UPDATE/DELETE
Legend (color-coded on the original slide): Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap
YARN: Data Operating System
Batch Interactive Real-Time
Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview of Stinger
Base Optimizations
Generate simplified DAGs
In-memory Hash Joins
Vector Query Engine
Optimized for modern processor
architectures
Tez
Express tasks more simply
Eliminate disk writes
Pre-warmed Containers
ORCFile
Column Store
High Compression
Predicate / Filter Pushdowns
YARN
Next-gen Hadoop data processing
framework
100X+ Faster Time to
Insight
Deeper Analytical Capabilities
Performance Optimizations
Query Planner
Intelligent Cost-Based Optimizer
Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
System
Engine
API
YARN : Data Operating System
HDFS
(Hadoop Distributed File System)
Batch
MapReduce
Real-Time
Slider
Direct
Java
.NET
Scripting
Pig
SQL
Hive
Cascading
Java
Scala
NoSQL
HBase
Accumulo
Stream
Storm
Other
ISV
Other
ISV
Applications
Others
Spark
Other ISV
HDP 2.2
Tez
YARN: Resource Manager for Hadoop 2.0
Flexible
Enables other purpose-built data processing
models beyond MapReduce (batch), such
as interactive and streaming
Efficient
Double processing IN Hadoop on the same
hardware while providing predictable
performance & quality of service
Shared
Provides a stable, reliable, secure
foundation and shared operational
services across multiple workloads
Data Processing Engines Run Natively IN Hadoop
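How one shared pool of resources can serve batch, interactive, and streaming engines at once can be sketched with a toy allocator in Python. This is a first-fit illustration under simplifying assumptions, not YARN's scheduler (real YARN schedulers also handle queues, priorities, and data locality); the node sizes, application names, and memory requests are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


def allocate(nodes, requests):
    """First-fit: place each (app, memory_mb) request on the first node with room."""
    granted, pending = [], []
    for app, mb in requests:
        for node in nodes:
            if node.free_mb >= mb:
                node.free_mb -= mb
                node.containers.append((app, mb))
                granted.append((app, node.name))
                break
        else:
            pending.append((app, mb))  # no node has room right now
    return granted, pending


cluster = [Node("worker1", 4096), Node("worker2", 4096)]
requests = [
    ("mapreduce-batch", 2048),
    ("hive-on-tez", 3072),
    ("storm-topology", 2048),
    ("spark-job", 2048),
]

granted, pending = allocate(cluster, requests)
print(granted)
print(pending)
```

Three different engines end up sharing the same two workers, and the request that does not fit waits instead of needing its own cluster.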
Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive & Pig
Hive & Pig work well together
and many customers use both
Hive is a good choice:
• if you are familiar with SQL
• when you want to query data
• when you need an answer to
specific questions
Pig is a good choice:
• For ETL (Extract, Transform, Load)
• for preparing data for analysis
• when you have a long series of
steps to perform
YARN: Data Operating System
Batch Interactive Real-Time
Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pig and Hive Sample Scenario
Hadoop Distributed
File System
Structured
Data
Raw
Data
1. Put the data into HDFS
in its raw format
Answers to
questions = $$
2. Use Pig to explore and
transform
3. Data analysts use Hive to
query the data
4. Data scientists use MapReduce,
R, and Mahout to mine the data
Hidden gems = $$
Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Big Data ETL Life Cycle
Data sources: Mobile Apps; Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails
1. Load or archive batch data
2. Replicate changed data & schemas
3. Stream real-time data
4. Mask sensitive data
5. Access customer “golden record”
6. Transform & refine data
7. Move results to EDW
8. Explore & validate data
9. Govern & enrich with metadata
10. Correlate real-time events with historical patterns & trends
11. Subscribe to datasets
Other diagram labels: Visualization & Analytics; Data Mart; Data Access & Query; MDM
Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP: Any Data, Any Application, Anywhere
Any Application
• Deep integration with ecosystem
partners to extend existing
investments and skills
• Broadest set of applications through
the stable of YARN-Ready applications
Any Data
Deploy applications fueled by clickstream, sensor,
social, mobile, geo-location, server log, and other new
paradigm datasets with existing legacy datasets.
Anywhere
Implement HDP naturally across the complete
range of deployment options
Clickstream Web
& Social
Geolocation Internet of
Things
Server
Logs
Files, emails
ERP CRM SCM
hybrid
commodity appliance cloud
Over 70 Hortonworks Certified YARN Apps
Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What next? -> developer.hortonworks.com
Page47 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Thank you!
rafael@hortonworks.com
@racoss
Page48 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
IoT Data Discovery Lab
• A trucking company has over 100 trucks.
• The geolocation data collected from the trucks contains events generated
while the truck drivers are driving.
• The company’s goal with Hadoop is to Mitigate Risk:
o Understand correlations between miles driven and events
o Compute the risk factor for each driver based on mileage & events
o Lab Env
o Sandbox 2.3 TP
o Lab Doc
o URL
o Load Data
o Query Data
o Process Data
Page49 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Move Data Into Hadoop
Geolocation.csv
trucks.csv
Geolocation_stage Geolocation
Trucks_stage Trucks
csv
csv ORC
ORC
SQL
SQL
move
LOAD
Page50 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Geolocation
Trucks
ORC
ORC
SQL
SQL
PIG
Risk Calculation
Truck_mileage
ORC
Avg_mileage
ORC
DriverMileage
ORC
RiskFactor
ORC
Events
ORC
Trucking Risk Analysis – Hadoop ELT
Page51 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Calculate Risk
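The slides do not spell out the risk formula, so the Python sketch below assumes one for illustration: risky events per 1,000 miles driven. The driver IDs, event names, and the `risk_per_1000_miles` function are all hypothetical; the lab itself performs this step with Pig over the Hive tables.

```python
def risk_per_1000_miles(mileage, events, normal_event="normal"):
    """Assumed metric: number of non-normal events per 1,000 miles, per driver.

    mileage: {driver_id: total_miles}
    events:  iterable of (driver_id, event_type)
    """
    risky_counts = {}
    for driver, event_type in events:
        if event_type != normal_event:
            risky_counts[driver] = risky_counts.get(driver, 0) + 1
    return {
        driver: 1000 * risky_counts.get(driver, 0) / miles
        for driver, miles in mileage.items()
        if miles > 0  # skip drivers with no recorded mileage
    }


mileage = {"A1": 10000, "A2": 5000, "A3": 8000}   # made-up totals
events = [
    ("A1", "speeding"), ("A1", "normal"),
    ("A2", "lane departure"), ("A2", "speeding"),
    ("A3", "normal"),
]

risk = risk_per_1000_miles(mileage, events)
for driver, score in sorted(risk.items(), key=lambda item: item[1], reverse=True):
    print(driver, score)
```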
Page52 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cautionary Statement Regarding Forward-Looking Statements
This presentation contains forward-looking statements involving risks and uncertainties. Such
forward-looking statements in this presentation generally relate to future events, our ability to
increase the number of support subscription customers, the growth in usage of the Hadoop
framework, our ability to innovate and develop the various open source projects that will enhance
the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general
business outlook. In some cases, you can identify forward-looking statements because they
contain words such as “may,” “will,” “should,” “expects,” “plans,” “anticipates,” “could,” “intends,”
“target,” “projects,” “contemplates,” “believes,” “estimates,” “predicts,” “potential” or “continue” or
similar terms or expressions that concern our expectations, strategy, plans or intentions. You
should not rely upon forward-looking statements as predictions of future events. We have based
the forward-looking statements contained in this presentation primarily on our current expectations
and projections about future events and trends that we believe may affect our business, financial
condition and prospects. We cannot assure you that the results, events and circumstances
reflected in the forward-looking statements will be achieved or occur, and actual results, events, or
circumstances could differ materially from those described in the forward-looking statements.
The forward-looking statements made in this prospectus relate only to events as of the date on
which the statements are made and we undertake no obligation to update any of the information in
this presentation.
Trademarks
Hortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other
names used herein may be trademarks of their respective owners.
Page53 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
A Definition of Open Enterprise Hadoop
Provision,
Manage &
Monitor
Ambari
Zookeeper
Scheduling
Oozie
Load data and
manage
according
to policy
Provide layered
approach to
security through
Authentication,
Authorization,
Accounting, and
Data Protection
SECURITY
GOVERNANCE
Deploy and
effectively
manage the
platform
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
YARN: Data Operating System
(Cluster Resource Management)
HDFS
(Hadoop Distributed File System)
OPERATIONS
Batch Interactive Real-Time
Page55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
[Diagram: multiple data sources feed the EDW through chained ETL jobs and a fixed schema]
Fragile workflows make supporting the analytical
models you want expensive and time-consuming.
Page56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Options for Data Input
MapReduce
WebHDFS
hadoop fs -put
Vendor Connectors
Hadoop
NFS Gateway
Hue Explorer
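Of the options above, WebHDFS is a REST interface, so uploading a file starts with an HTTP PUT to a URL of a known shape. The Python sketch below only builds that URL and sends nothing; the host name, user, and file path are hypothetical, and port 50070 is assumed as the NameNode HTTP port of a Hadoop 2 sandbox.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(namenode_host, hdfs_path, user, port=50070, overwrite=False):
    """Build the URL for the first step of a WebHDFS file upload (op=CREATE)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    # quote() keeps "/" intact and escapes characters such as spaces in the path
    return f"http://{namenode_host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


url = webhdfs_create_url("sandbox.example", "/tmp/Geolocation.csv", user="hdfs")
print(url)
```

A PUT to this URL is answered with a redirect to a DataNode, and the file content is then sent in a second PUT to the redirected location. From a shell, `hadoop fs -put` achieves the same result in one command.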
Page57 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Risk Factors Viewed in a Graph
Page58 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Risk Factors Viewed on a Map

Mais conteúdo relacionado

Mais procurados

Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldDataWorks Summit
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...Hortonworks
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceHortonworks
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopHortonworks
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopHortonworks
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Hortonworks
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramHortonworks
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersDataWorks Summit
 
Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks - What's Possible with a Modern Data Architecture?
Hortonworks - What's Possible with a Modern Data Architecture?Hortonworks
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 

Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Último (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Hadoop crash course workshop at Hadoop Summit

  • 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hadoop Crash Course Winter 2015 Version 1.0 Hortonworks. We do Hadoop. Rafael Coss rafael@hortonworks.com @racoss
  • 2. Hadoop Crash Course: Why Hadoop?; Hadoop Ecosystem & Distribution; Store Data (HDFS); Process Data in Hadoop 1 (MapReduce); Process Data in Hadoop 2 (YARN + MapReduce/Tez); Access Data; Lab
  • 3. What disrupted the data center? Data?
  • 4. Traditional World Of Applications And Data Silos. Constrains data to specific apps. No insight across ALL data. Built for structured data. Does not scale (cost and tech). Silos shown: ERP, CRM, SCM, WEB.
  • 5. New Data Paradigm Opens Up New Opportunity. 2.8 zettabytes in 2012; 44 zettabytes in 2020. 1 zettabyte (ZB) = 1 million petabytes (PB). Sources: IDC, IDG Enterprise, and AMR Research. Data types shown: Clickstream; ERP, CRM, SCM; Web & social; Geolocation; Internet of Things; Server logs; Files, emails. Opportunity: transform every industry via full fidelity of data and analytics. Chart labels: NEW, TRADITIONAL, LAGGARDS, LEADERS, Ability to Consume Data, Enterprise Blind Spot.
  • 6. Hadoop YARN-based Architecture Unlocks Opportunity. Consolidates all data sets. Delivers real-time insights. Integrates with data center. Scalable and affordable. TURN ALL OF YOUR DATA INTO VALUE. Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
  • 7. Two Paths in a Customer’s Journey to a Data Lake. Journey to the Data Lake with Hadoop (axes: SCALE, SCOPE). Goal: Centralized Architecture; Data-driven Business. Systems of Insight. The journey begins with either: 1. Cost Optimization (Data Architecture Optimization) 2. Advanced Analytic Applications. Leaders are Data Driven.
  • 8. Common Drivers of Hadoop Adoption (Industry Hadoop Adoption Journey). Data Architecture Optimization: keep 100% of data at up to 1/100 the cost and enrich DW analytics. Single View: Customer, Product, Supply Chain. Predictive Analytics: Behavioral Insight, Preventive Maintenance, Resource Optimization. Data Discovery: Explore Datasets, Uncover New Findings, Operationalize Insights.
  • 9. Hadoop Ecosystem. Component roles labeled on the slide: ETL; RDBMS Import/Export; Distributed Storage & Processing Framework; Secure NoSQL DB; SQL on HBase; NoSQL DB; Workflow Management; SQL; Streaming Data Ingestion; Cluster System Operations; Secure Gateway; Distributed Registry; ETL; Search & Indexing; Even Faster Data Processing; Data Management; Machine Learning.
  • 10. Hadoop Architecture. Diagram labels: Apps; Data Access Engines (Batch, Interactive, Streaming); Data Operating System; Distributed Compute Framework; Resource Mgt, Data Locality; Distributed Reliable Storage; Governance; Security.
  • 11. Hadoop Key Services. Hortonworks Data Platform: multi-tenant data platform built on a centralized architecture of shared enterprise services. Diagram labels: YARN: data operating system; Governance; Security; Operations; Resource management; Existing applications; New analytics; Partner applications; Data access: batch, interactive, real-time; Storage. Key Services: Resource and workload management; Scalable tiered storage; Consistent operations; Comprehensive security; Trusted data governance.
  • 12. Hortonworks Development Investment for the Enterprise. Horizontal Integration for Enterprise Services: ensure consistent enterprise services are applied across the Hadoop stack. Vertical Integration with YARN and HDFS: ensure engines can run reliably and respectfully in a YARN based cluster. OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie); deploy and effectively manage the platform. GOVERNANCE: load data and manage according to policy. SECURITY: provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection. BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig); SQL (Hive); Java, Scala (Cascading); Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo); In-Memory (Spark); Others (ISV Engines); Tez; Slider. YARN: Data Operating System (Cluster Resource Management). HDFS (Hadoop Distributed File System).
  • 13. Cluster diagram. NameNode (NN, HDFS): /directory/structure/in/memory.txt. ResourceManager (RM, YARN): resource management + scheduling. Worker Node (Disk, CPU, Memory, Core): DataNode (HDFS), NodeManager (YARN). Legend: Hadoop daemon, User application.
  • 14. Joys of Real Hardware (Jeff Dean). Typical first year for a new cluster: ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover); ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back); ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours); ~1 network rewiring (rolling ~5% of machines down over 2-day span); ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back); ~5 racks go wonky (40-80 machines see 50% packet loss); ~8 network maintenances (4 might cause ~30-minute random connectivity losses); ~12 router reloads (takes out DNS and external VIPs for a couple minutes); ~3 router failures (have to immediately pull traffic for an hour); ~dozens of minor 30-second blips for DNS; ~1000 individual machine failures; ~thousands of hard drive failures; slow disks, bad memory, misconfigured machines, flaky machines, etc.
  • 15. Hadoop Distributed File System (HDFS): Fault Tolerant Distributed Storage. Divide files into big blocks and distribute 3 copies randomly across the cluster. Processing: Data Locality. Not just storage but computation. Diagram: a logical file is split into blocks 1 to 4, and each block appears three times across the cluster.
  • 16. The DataNodes. DataNodes to the NameNode: “I’m still here! This is my latest heartbeat.” “I’m here too! And here is my latest heartbeat.” NameNode: “Hey DataNode1, Replicate block 123 to DataNode 3.” Diagram: NameNode, DataNode 1, DataNode 3, DataNode 4, block 123.
  • 17. Batch Processing in Hadoop. MapReduce: Batch Access to Data. Original data access mechanism for Hadoop. Framework: made for developing distributed applications to process vast amounts of data in parallel on large clusters. Proven: reliable interface to Hadoop which works from GB to PB, but batch oriented; speed is not its strong point. Ecosystem: ported to Hadoop 2 to run on YARN; supports original investments in Hadoop by customers and partner ecosystem. MapReduce Job Lifecycle: Map Phase (mappers on DataNode1, DataNode2, DataNode3); Shuffle/Sort (data is shuffled across the network & sorted); Reduce Phase (reducers on DataNode1, DataNode2, DataNode3). Saying that MapReduce is dead is preposterous: it would limit us to only new workloads; ALL Hadoop clusters use MapReduce; why rewrite everything immediately?
  • 18. What is MapReduce? Break a large problem into sub-solutions. Map: iterate over a large # of records; extract something of interest from each record. Shuffle: sort intermediate results. Reduce: aggregate, summarize, filter or transform intermediate results; generate final output. Diagram: Data, Map Processes (Read & ETL), Shuffle & Sort, Reduce Processes (Aggregation), Data.
  • 19. WordCount in MapReduce. Input: constitution.txt in HDFS (“We the people, in order to form a...”). 1. The mappers read the file’s blocks from HDFS line-by-line. 2. The lines of text are split into words and output to the reducers: <We,1> <the,1> <people,1> <in,1> <order,1> <to,1> <form,1> <a,1>. 3. The shuffle/sort phase combines pairs with the same key: <We,(1,1,1,1)> <the,(1,1,1,1,1,1,1,...)> <people,(1,1,1,1,1)> <form,(1)>. 4. The reducers add up the “1’s” and output the word and its count to HDFS: <We,4> <the,265> <people,5> <form,1>.
  • 20. 1st Gen Hadoop: Cost Effective Batch at Scale. HADOOP 1.0: Built for Web-Scale Batch Apps. Silos created for distinct use cases: Single App (BATCH) on HDFS; Single App (INTERACTIVE); Single App (BATCH) on HDFS; Single App (BATCH) on HDFS; Single App (ONLINE).
  • 21. Hadoop emerged as foundation of new data architecture. Apache Hadoop is an open source data platform for managing large volumes of high velocity and variety of data. Built by Yahoo! to be the heartbeat of its ad & search business. Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises. Incredibly disruptive to current platform economics. Traditional Hadoop Advantages: manages new data paradigm; handles data at scale; cost effective; open source. Traditional Hadoop Had Limitations: batch-only architecture; single purpose clusters, specific data sets; difficult to integrate with existing investments; not enterprise-grade. Diagram: Application; Storage (HDFS); Batch Processing (MapReduce).
  • 22. What do iOS 6 and Windows 3.1 have in common?
  • 23. Hadoop Beyond Batch with YARN. A shift from the old to the new… Hadoop 1 (MapReduce as the Base): Single Use System for Batch Apps. Stack: HDFS; MapReduce; Pig (data flow), Hive (SQL), Others. Hadoop 2 (Apache YARN as a Base): Multi Use Data Platform for Batch, Interactive, Online, Streaming, … Stack by layer: System: HDFS (redundant, reliable storage) and YARN (Data Operating System: resource management, etc.); Engine: Tez (modern execution engine), Batch MapReduce; APIs: Data Flow (Pig), SQL (Hive), Java Apps (Cascading).
  • 24. Why is Tez Important? Apache Tez is a critical innovation of the Stinger Initiative. Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop: Pig, Cascading… More efficient processing than MapReduce. Reduces operations and complexity of back-end processing. Allows for Map-Reduce-Reduce, which saves hard disk operations. Implements a “service” which is always on, decreasing start times of jobs. Allows caching of data in memory. Diagram: Tez applications (Scripting: Pig; SQL: Hive; YARN Dev; Cascading/Scalding) run on Tez, over YARN: Data Operating System (Batch, Interactive, Real-Time) and HDFS (Hadoop Distributed File System), nodes 1 to N.
  • 25. Tez: Hive on MapReduce vs. Hive on Tez. Query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. Hive on MapReduce: the stages (SELECT a.state; SELECT c.price; JOIN (a, c); SELECT b.id; JOIN (a, b); GROUP BY a.state; COUNT(*); AVG(c.price)) run as separate map (M) and reduce (R) jobs with HDFS writes between them. Hive on Tez: the stages (SELECT a.state, c.itemId; JOIN (a, c); SELECT b.id; JOIN (a, b); GROUP BY a.state; COUNT(*); AVG(c.price)) run as a single graph of M and R tasks. Tez avoids unneeded writes to HDFS.
  • 26. HDP delivers a Centralized Architecture. Other Pure Play Vendors: a siloed “with” YARN architecture. Disjoint, siloed clusters: inefficient use of resources, single tenant, duplicate storage & processing; multiple implementations of governance, security and operations; new applications require new clusters. Diagram: Cluster1 (Batch, YARN), Cluster2 (Interactive, dedicated resource mgt), … ClusterN (Real-time, dedicated resource mgt), each with its own Application, Security, Storage, Governance and Operations. Hortonworks Data Platform: a centralized architecture built on YARN. Single cluster, multiple applications: efficient storage, processing; centralized Security, Operations, Governance; run a variety of applications simultaneously. Diagram: Existing Applications, New Analytics, Partner Applications (e.g. SAS); Data Access: Batch, Interactive & Real-time; YARN: Data Operating System; Storage; Governance, Security, Operations, Resource Management.
  • 27. {Processing + Storage} = {MapReduce/YARN + HDFS} = {Core Hadoop}
  • 28. Modern Data Architecture emerges to unify data & processing. Modern Data Architecture: enable applications to have access to all your enterprise data through an efficient centralized platform; supported with a centralized approach to governance, security and operations; versatile to handle any applications and datasets no matter the size or type. SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured; Existing Systems (ERP, CRM, SCM). ANALYTICS: Data Marts, Business Analytics, Visualization & Dashboards, Applications. Platform: HDFS (Hadoop Distributed File System); YARN: Data Operating System (Batch, Interactive, Real-Time, Partner ISV); MPP; EDW.
  • 29. What is Data Access? Data Access defines ALL the channels through which data can be accessed, analyzed, cleansed and consumed within Hadoop. Each channel can be categorized into THREE core patterns: Batch, Interactive and Real-time. Multiple engines provide optimized access to your mission critical data.
  • 30. Access patterns enabled by YARN. Batch: needs to happen, but with no timeframe limitations. Interactive: needs to happen at human time. Real-Time: needs to happen at machine execution time. Diagram: YARN: Data Operating System (Batch, Interactive, Real-Time) over HDFS (Hadoop Distributed File System), nodes 1 to N.
  • 31. Apache Projects Enable Access Patterns. Various open source projects have incubated in order to meet these access pattern needs. Today, they can all run on a single cluster on a single set of data because of YARN! ALL powered by a BROAD Open Community. Batch: MapReduce, Pig, Hive. Interactive: Solr, Spark, Hive, Kafka. Real-Time: HBase, Accumulo, Storm. All run on YARN: Data Operating System over HDFS (Hadoop Distributed File System), nodes 1 to N.
  • 32. Scripting Data Flow & ETL. Apache Pig: data flow engine and scripting language (Pig Latin); allows you to transform data and datasets. Advantages over MapReduce: reduces time to write jobs; community support; Piggybank has a significant number of UDFs to help adoption; there are a large number of existing shops using Pig.
  • 33. Pig Latin. Pig executes in a unique fashion: during execution, each statement is processed by the Pig interpreter; if a statement is valid, it gets added to a logical plan built by the interpreter; the steps in the logical plan do not actually execute until a DUMP or STORE command is used.
  • 34. Why use Pig? Maybe we want to join two datasets from different sources on a common value, then filter, sort, and get the top 5 sites.
  • 35. Apache Hive: THE de facto standard for SQL in Hadoop. What? Treat your data in Hadoop as tables. Provides a standard SQL 92 interface to data in Hadoop. Why? Shipped in every distribution… you already have it (although some do not ship complete versions). Quickly find value in raw data files. Proven at petabyte scale for both batch and interactive queries. Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc…
  • 36. Hive Architecture. 1. User issues SQL query (Web UI, JDBC / ODBC, or CLI) to HiveServer2. 2. Hive parses and plans query (Compiler, Optimizer, Executor; Hive MetaStore in MySQL, Postgresql, Oracle). 3. Query converted to a MapReduce or Tez job and executed on Hadoop with data-local processing.
  • 37. Using Tez for Hive Queries. Set the following property in either hive-site.xml or in your script: set hive.execution.engine=tez;
  • 38. SQL Compliance: Evolution of SQL Compliance in Hive. SQL Datatypes: INT/TINYINT/SMALLINT/BIGINT; FLOAT/DOUBLE; BOOLEAN; ARRAY, MAP, STRUCT, UNION; STRING; BINARY; TIMESTAMP; DECIMAL; DATE; VARCHAR; CHAR; Interval Types. SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, HAVING; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in the FROM clause; ROLLUP and CUBE; UNION; Standard aggregations (sum, avg, etc.); Custom Java UDFs; Windowing functions (OVER, RANK, etc.); Advanced UDFs (ngram, XPath, URL); Sub-queries for IN/NOT IN, HAVING; JOINs in WHERE Clause; INSERT/UPDATE/DELETE. Legend (color coding not preserved in this transcript): Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap.
  • 39. Overview of Stinger. Base Optimizations: generate simplified DAGs; in-memory hash joins. Vector Query Engine: optimized for modern processor architectures. Tez: express tasks more simply; eliminate disk writes; pre-warmed containers. ORCFile: column store; high compression; predicate / filter pushdowns. YARN: next-gen Hadoop data processing framework. Query Planner: intelligent cost-based optimizer. Outcomes: Performance Optimizations; Deeper Analytical Capabilities; 100X+ Faster Time to Insight.
  • 40. YARN: Resource Manager for Hadoop 2.0. Flexible: enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming. Efficient: double processing IN Hadoop on the same hardware while providing predictable performance & quality of service. Shared: provides a stable, reliable, secure foundation and shared operational services across multiple workloads. Data Processing Engines Run Natively IN Hadoop. Diagram (System, Engine, API layers; HDP 2.2): YARN: Data Operating System over HDFS (Hadoop Distributed File System), nodes 1 to N; Batch (MapReduce); Tez; Real-Time (Slider); Direct (Java, .NET); Scripting (Pig); SQL (Hive); Cascading (Java, Scala); NoSQL (HBase, Accumulo); Stream (Storm); Spark; Others; Other ISV applications.
  • 41. Hive & Pig. Hive & Pig work well together and many customers use both. Hive is a good choice: if you are familiar with SQL; when you want to query data; when you need an answer to specific questions. Pig is a good choice: for ETL (Extract, Transform, Load); for preparing data for analysis; when you have a long series of steps to perform.
  • 42. Pig and Hive Sample Scenario. 1. Put the data into HDFS (Hadoop Distributed File System) in its raw format. 2. Use Pig to explore and transform (raw data to structured data). 3. Data analysts use Hive to query the data: answers to questions = $$. 4. Data scientists use MapReduce, R, and Mahout to mine the data: hidden gems = $$.
  • 43. Big Data ETL Life Cycle. Sources: Mobile Apps; Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails. Steps: 1. Load or archive batch data. 2. Replicate changed data & schemas. 3. Stream real-time data. 4. Mask sensitive data. 5. Access customer “golden record” (MDM). 6. Transform & refine data. 7. Move results to EDW. 8. Explore & validate data. 9. Govern & enrich with metadata. 10. Correlate real-time events with historical patterns & trends. 11. Subscribe to datasets. Targets: Visualization & Analytics; Data Mart; Data Access & Query.
  • 44. HDP: Any Data, Any Application, Anywhere. Any Application: deep integration with ecosystem partners to extend existing investments and skills; broadest set of applications through the stable of YARN-Ready applications; over 70 Hortonworks Certified YARN Apps. Any Data: deploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new paradigm datasets with existing legacy datasets (Clickstream, Web & Social, Geolocation, Internet of Things, Server Logs, Files, emails, ERP, CRM, SCM). Anywhere: implement HDP naturally across the complete range of deployment options (hybrid, commodity, appliance, cloud).
  • 45. What next? -> developer.hortonworks.com
  • 46. Thank you! rafael@hortonworks.com @racoss
  • 47. IoT Data Discovery Lab. A trucking company has over 100 trucks. The geolocation data collected from the trucks contains events generated while the truck drivers are driving. The company’s goal with Hadoop is to Mitigate Risk: understand correlations between miles driven and events; compute the risk factor for each driver based on mileage & events. Lab Env: Sandbox 2.3 TP. Lab Doc: URL. Lab steps: Load Data; Query Data; Process Data.
  • 48. Move Data Into Hadoop. Geolocation.csv -> Geolocation_stage (csv) -> Geolocation (ORC); trucks.csv -> Trucks_stage (csv) -> Trucks (ORC). Arrow labels on the slide: move, LOAD, SQL.
  • 49. Trucking Risk Analysis – Hadoop ELT. Geolocation (ORC) and Trucks (ORC) feed SQL and Pig steps for the Risk Calculation. Tables shown: Truck_mileage (ORC), Avg_mileage (ORC), DriverMileage (ORC), Events (ORC), RiskFactor (ORC).
  • 50. Calculate Risk
  • 51. Cautionary Statement Regarding Forward-Looking Statements. This presentation contains forward-looking statements involving risks and uncertainties. Such forward-looking statements in this presentation generally relate to future events, our ability to increase the number of support subscription customers, the growth in usage of the Hadoop framework, our ability to innovate and develop the various open source projects that will enhance the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general business outlook. In some cases, you can identify forward-looking statements because they contain words such as “may,” “will,” “should,” “expects,” “plans,” “anticipates,” “could,” “intends,” “target,” “projects,” “contemplates,” “believes,” “estimates,” “predicts,” “potential” or “continue” or similar terms or expressions that concern our expectations, strategy, plans or intentions. You should not rely upon forward-looking statements as predictions of future events. We have based the forward-looking statements contained in this presentation primarily on our current expectations and projections about future events and trends that we believe may affect our business, financial condition and prospects. We cannot assure you that the results, events and circumstances reflected in the forward-looking statements will be achieved or occur, and actual results, events, or circumstances could differ materially from those described in the forward-looking statements. The forward-looking statements made in this prospectus relate only to events as of the date on which the statements are made and we undertake no obligation to update any of the information in this presentation. Trademarks: Hortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other names used herein may be trademarks of their respective owners.
  • 52. A Definition of Open Enterprise Hadoop. GOVERNANCE: Load data and manage according to policy. SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection. OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie). BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Batch, Interactive, Real-Time. YARN: Data Operating System (Cluster Resource Management). HDFS (Hadoop Distributed File System).
  • 53. Big Data ETL Life Cycle. Diagram labels: Mobile Apps; Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails; Visualization & Analytics; Data Mart; Data Access & Query; MDM. Steps: 1. Load or archive batch data 2. Replicate changed data & schemas 3. Stream real-time data 4. Mask sensitive data 5. Access customer “golden record” 6. Transform & refine data 7. Move results to EDW 8. Explore & validate data 9. Govern & enrich with metadata 10. Correlate real-time events with historical patterns & trends 11. Subscribe to datasets
  • 54. [Diagram: many data sources feeding an EDW schema through multiple ETL jobs] Fragile workflows make supporting the analytical models you want expensive and time-consuming.
  • 55. Options for Data Input: MapReduce; WebHDFS; hadoop fs -put; Vendor Connectors; Hadoop NFS Gateway; Hue Explorer
  • 56. Risk Factors Viewed in a Graph
  • 57. Risk Factors Viewed on a Map
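
Among the data input options listed on the "Options for Data Input" slide, WebHDFS is the one that can be driven from any HTTP client. The sketch below only builds the URL for the first step of a WebHDFS upload (`op=CREATE`); it makes no network call. The host name, port 50070, user, and file path are placeholder assumptions for illustration and are not taken from the deck, and the helper name `webhdfs_create_url` is hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the URL for the first step of a WebHDFS upload (op=CREATE).

    All argument values are supplied by the caller; nothing here
    contacts a cluster.
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute HDFS path")
    query = urlencode({
        "op": "CREATE",
        "overwrite": "true" if overwrite else "false",
        "user.name": user,
    })
    # quote() keeps "/" intact and escapes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


if __name__ == "__main__":
    # Placeholder values (assumptions), chosen only to show the URL shape.
    print(webhdfs_create_url("namenode.example.com", 50070,
                             "/user/demo/trucks.csv", "demo"))
```

In a real upload, an HTTP PUT to this URL is answered by the NameNode with a redirect to a DataNode, and the file contents are sent in a second PUT to that location. For a file on a machine that already has the Hadoop client installed, `hadoop fs -put` achieves the same result with one command.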

Editor's Notes

  1. Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
  2. The majority of enterprise data has traditionally come from large scale ERP, CRM, and other applications. Each application has become siloed without the ability to gain insights across ALL the data. Now the enterprise must rationalize existing data silos but also gain value from the explosion of data that is being generated from the new paradigm sources. The challenge is that the existing data management platforms have become both architecturally and financially impractical. Architecturally, these systems were not designed to store or process vast quantities of data. Financially, the licensing structures of the traditional approach are no longer feasible. These challenges and the rate at which data is being produced require a completely new approach to managing data. If we fast-forward another 3 to 5 years, more than 50% of the data under management within the enterprise will be from these new data paradigm sources. We have come to an inflection point on how the enterprise can manage its data. [NEXT SLIDE]
  3. What has created this inflection point is the growth and value from the new paradigm data. New data paradigm sources have put tremendous pressure on existing platforms but have also created tremendous opportunities. Exponential Growth. 85% year over year growth. Varied Nature. The incoming data can have little or no structure, or structure that changes too frequently for reliable schema creation at time of ingest. Value at High Volumes. The incoming data can have little or no value as individual records or small groups of records. But at high volumes and over longer historical perspectives, it can be inspected for patterns and used for advanced analytic applications. This New Data Paradigm opens up the Opportunity for both an architectural and business transformation that applies to virtually every industry. [NEXT SLIDE]
  4. In today’s data-rich world, overlooked insight translates into missed opportunity.   The opportunities afforded by the age of Big Data have given rise to a new ultra-competitive breed of business that consumes the full spectrum of its data, transforming immense volumes and varieties of data into currency.   Our customers are investing in next-generation “systems of insight,” with advanced analytic apps providing a single, holistic view of customers and processes, and delivering predictive analytics around business performance and discovery through machine learning.   Underpinning these capabilities is a YARN-based architecture that delivers huge new processing power, scale, and efficiency especially when it’s properly integrated with existing operational and data warehousing systems.   HDP usage typically begins by creating new analytic applications fueled by the data that was not previously being captured.   As more and more applications are created, more opportunity is unlocked across ALL data sets, from the new types of data from sensors/machines, server logs, clickstreams, and other traditional sources like ERP and CRM.   Ultimately, HDP’s YARN-based architecture acts as a shared service for delivering deep insight across a large, broad, diverse set of data at efficient scale in a way that existing enterprise systems and tools can integrate with.   [NEXT SLIDE]
  5. Ultimately, most organizations that adopt Hadoop aspire to create a data lake where multiple applications use a shared set of resources for both storage and processing, all with a consistent level of service. The value in the data lake ultimately results in delivery of “systems of insight” where advanced algorithms and applications that access multiple data sets allow organizations to derive brand new value from data that was once unable to be investigated or simply too complex to combine and analyze. Hadoop doesn’t just create a Data Lake; it opens the platform for analysts to view multiple data sources in multiple dimensions and reduce time to insight. This journey from apps to lake is only possible with HDP and its YARN-based architecture.
  6. http://hortonworks.com/solutions/data-architecture-optimization/ http://hortonworks.com/solutions/advanced-analytic-apps/#single-view-customer http://hortonworks.com/solutions/advanced-analytic-apps/#predictive-analytics http://hortonworks.com/solutions/advanced-analytic-apps/#data-discovery BAWAG Bank, KPN, Daimler, ING, British Ga
  7. Since starting the company, one of our core missions was to make Hadoop an enterprise viable data platform. With HDP and its YARN-based architecture, the market now has a multi-tenant data platform built on a centralized architecture that provides the shared enterprise services of Resource Management, Operations, Security, Governance in a consistent manner for all Data Access patterns, for batch, interactive, or real-time applications.   These enterprise readiness capabilities help enable HDP to be used everywhere.   While it’s clear that HDP is ready for the enterprise, that doesn’t mean that we stop our work on enterprise readiness.   In fact, it’s just the opposite. There are more security, governance and operational advancements taking place in the Hadoop ecosystem now than ever before. And we continue to advance all of the services with the community.   [NEXT SLIDE]
  8. From Jeff Dean http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
  9. The outline stays the same; Map and Reduce change to fit the problem.
  10. Enter Hadoop. Faced with this challenge, the team at Yahoo conceived and created Apache Hadoop to address it. They were then convinced that contributing this platform to an open community would speed innovation. They open sourced the technology and did so within the governance of the Apache Software Foundation (ASF). This introduced two distinct, significant advantages: not only could they manage new data types at scale, but they now had a commercially feasible approach. However, there were still significant challenges. The first generation of Hadoop was designed and optimized for batch-only workloads, it required dedicated clusters for each application, and it didn’t integrate easily with many of the existing technologies present in the data center. Also, like any emerging technology, Hadoop was required to meet a certain level of readiness required by the enterprise. After running Hadoop at scale at Yahoo, the team spun out to form Hortonworks with the intent to address these challenges and make Hadoop enterprise ready.
  11. Access, Execution, Resource Mgt
  12. Since HDP provides a centralized architecture that is built on YARN with common services for security, operations, and governance, it enables the enterprise to run a wide range of applications simultaneously with well managed service levels. More applications and more data can run in the same shared cluster which simplifies the security, operations, and governance. Since the other pure play vendors have NOT built their products from the ground-up on a centralized YARN architecture, their platform architectures are disjoint. Without a consistent set of services applied to all applications and workloads, users are forced to silo their clusters in order to achieve predictable performance and service levels – which is more complex and costly. And since the critical services for security, operations, and governance are implemented as bolt-ons, the deployment architecture is further complicated.
  13. In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo! This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to have access to all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop that provides the versatility to handle any application and dataset no matter the size or type. Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrated with existing enterprise policies. This work allowed for a new approach to data to emerge: the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
  14. Pig Latin, a language intended to sit between the two. Provides standard relational transforms (join, sort, etc.). Schemas are optional, used when available, and can be defined at runtime. User Defined Functions are first-class citizens. Pig is an engine for executing programs on top of Hadoop. It provides a language, Pig Latin, to specify these programs.
  15. Pig executes in a unique fashion: some commands build on previous commands, while certain commands trigger a MapReduce job.
  16. Interactive queries at scale. Originally created by a team at Facebook.
  17. HDP 2.x ships with HiveServer2, a Thrift-based implementation that allows multiple concurrent connections and also supports Kerberos authentication.
  18. Note that this property is set to mr by default.
  19. The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it. The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the resource manager for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”. [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. With predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
  20. You have talked about the components of Hadoop, now this slide talks about the various roles of Hadoop professionals.
  21. HDP is versatile enough to handle any data, for any application, anywhere. ANY DATA: Hadoop was initially designed to store and process vast quantities of data and is still the optimal platform to do so. With YARN and the introduction of all types of access methods, from batch to interactive and real time, access to process and analyze this data has become even easier. ANY APPLICATION: YARN also opens up Hadoop so that it can extend the value of linear scale storage and processing to existing applications. This also allows you to reuse your existing skillsets and resources, but with Hadoop as a foundation. To date, Hortonworks has certified over 70 ISVs to be YARN ready and the list is growing. ANYWHERE: As a key part of the modern data architecture, Hadoop needs to be available across a wide range of deployment choices, and we enable the widest choice in the industry. In 2011, we established our partnership with Microsoft based on a shared vision of a hybrid world where Hadoop can run on-premises on Windows Server or Linux, within turnkey appliances, and in the cloud as a fully managed service or simply running within virtual machines on infrastructure-as-a-service clouds. Our work with Microsoft brought Hadoop to the Windows Server ecosystem and we’re the only vendor serving that market opportunity today. While most of our customers are deploying on-premises Hadoop clusters, we are uniquely positioned to support a hybrid architecture as enterprises embrace cloud for specific use cases.
  22. This is a great use case, but spend only 3-4 minutes on it. Run Hive queries to refine the trucks data to get the average mileage. Compute the risk factor for each driver (mileage
  23. truck_mileage, avg_mileage
  24. Power Pivot again, this time demonstrating which drivers had the most incidents.
  25. Power Pivot map again – this time showing the areas where the incidents occurred.
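
Note 9 observes that the MapReduce outline stays the same while the Map and Reduce functions change to fit the problem. The following is a minimal single-process sketch of that idea, not the Hadoop API: `run_mapreduce` is a hypothetical driver that plays the role of the fixed outline (map, group by key, reduce), and the two mapper/reducer pairs are illustrative assumptions. The sample truck rows are made up for the example and are not data from the lab.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Fixed outline: map each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase emits (key, value) pairs
            groups[key].append(value)       # grouping stands in for the shuffle
    # Reduce phase: one reducer call per key, keys processed in sorted order.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Problem 1: word count.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


# Problem 2: average mileage per truck (rows are hypothetical (truck_id, mpg) pairs).
def mileage_mapper(row):
    truck_id, mpg = row
    yield truck_id, mpg


def mean_reducer(key, values):
    return sum(values) / len(values)


if __name__ == "__main__":
    print(run_mapreduce(["we do hadoop", "hadoop crash course"],
                        word_count_mapper, sum_reducer))
    print(run_mapreduce([("A1", 10.0), ("A1", 20.0), ("B2", 5.0)],
                        mileage_mapper, mean_reducer))
```

Only the two functions passed to `run_mapreduce` differ between the two problems. On a real cluster the grouping step is a distributed shuffle and the mappers and reducers run in parallel across nodes, which this sketch does not attempt to model.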