Intro to Hadoop: Hype or Reality – you decide
kcrocker@gopivotal.com
Pivotal Meet-up
Kevin Crocker, Consulting Instructor, Pivotal Academy
March 19, 2014
Why is this Meet-up necessary?
•  What is the future of enterprise data architecture?
–  The explosion of data
–  Volume, Variety, Velocity
–  Overruns traditional data stores
–  What is the business value of collecting all this data?
Volume
•  At a recent data conference, one participant told the audience that their company collects 7 PB of data a day – and generates another 7 PB of analytics on top of it
•  That's 63 racks of storage! A day! × 2
•  What do we even call that amount of data?
–  Data Warehouse(s), Data Store(s)
–  New Term: Data Lake
Variety
•  At the same data conference, another presenter participated in a study using wearable medical technology to monitor health
–  Collected 1 million readings a day ≈ 12 readings a second
–  When was the last time you had YOUR blood pressure checked?
•  Toronto has so many sensors that millions of cell phones can be tracked over 400 square miles – 24x7
Velocity
•  Ingesting this amount of data is difficult
•  Analyzing this amount of data in traditional ways is
also difficult
–  A client recently told me that analyzing the data from their sensors used to take 3 weeks; now they do it in 3 hours
Business Value
•  Wall Street Journal – those businesses in Toronto
pay to get summary reports of all that data and
then gear their marketing campaigns to drive new
revenue
The Data Lake Dream, Forbes, 01/14/2014
•  In an article published in Forbes, the author
mentions the term Data Lake and the technology
that addresses the problem of big data => Hadoop
•  Four levels of Hadoop Maturity
–  Life Before Hadoop -> Hadoop is Introduced ->
Growing the Data Lake -> Data Lake and Application
Cloud
So – Let’s talk about Hadoop
•  Hadoop Overview
–  Core Elements: HDFS and MapReduce
–  Ecosystem
•  HDFS Architecture
•  Hadoop MapReduce
•  Hadoop Ecosystem
•  MapReduce Primer
•  Buckle up!
Hadoop Overview
Hadoop Core
•  Based on two Google papers in 2003/4 – Google File System and MapReduce
•  Spun out of the Nutch open-source web-search project because of the need to store its data
•  Open-source Apache project out of Yahoo! in January 2006
•  Distributed fault-tolerant data storage (distribution and replication of resources)
and distributed batch processing (not for random reads/writes, or updates)
•  Provides linear scalability on commodity hardware
•  Adopted by many:
–  Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter,
Yahoo!, and many, many more http://wiki.apache.org/hadoop/PoweredBy
•  Hadoop uses data redundancy rather than backup strategies
Hadoop Overview
•  Consists of:
–  Key sub-projects
•  Hadoop Common: Common utilities/tools for all Hadoop components/sub-projects
•  HDFS: A reliable, high-bandwidth, distributed file system
•  Map/Reduce: A programming framework to process large datasets
•  YARN: A framework for job scheduling and cluster resource management (Hadoop 2)
–  Other key Apache projects in Hadoop ecosystem
•  Avro: A data serialization system
•  HBase/Cassandra: Scalable, distributed NoSQL databases that support structured data storage for large tables
•  Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•  Pig: A high-level data-flow language and execution framework for parallel computation.
•  ZooKeeper: A high-performance coordination service for distributed applications
•  Latest versions of Hadoop:
–  Stable and widely used releases – V1 => 1.2.1, V2 => 2.2.0
Why?
•  Bottom line:
–  Flexible
–  Scalable
–  Inexpensive
Overview
•  Great at
–  Reliable storage for multi-petabyte data sets
–  Batch queries and analytics
–  Complex hierarchical data structures with changing
schemas, unstructured and structured data
•  Not so great at
–  Changes to files (can’t do it…) – not OLTP
–  Low-latency responses
–  Analyst usability
•  This is less of a concern now due to higher-level languages
Data Structure
•  Bytes! And more Bytes! (Peta)
•  No more ETL necessary???
•  Store data now, process later
•  Structure (schema) on read
–  Built-in support for common data types and formats
–  Extendable
–  Flexible
Versioning
•  Versions 0.20.x, 0.21.x, 0.22.x, 0.23.x, 1.x.x
–  Two main MR packages:
•  org.apache.hadoop.mapred (deprecated)
•  org.apache.hadoop.mapreduce (new hotness)
•  Version 2.2.0, GA Oct 2013
–  NameNode HA
–  YARN – Next Gen MapReduce
–  HDFS Federation, Snapshots
HDFS Architecture
HDFS Architecture
HDFS Architecture (Master/Worker)
•  HDFS Master: “Namenode”
–  Manages the filesystem namespace
–  Controls read/write access to files
–  Serves open/close/rename file requests from client
–  Manages block replication (rack-aware block placement, auto re-replication)
–  Checkpoints namespace and journals namespace changes for reliability
•  HDFS Workers: “Datanodes”
–  Serve read/write requests from clients
–  Perform replication tasks upon instruction by Namenode
–  Periodically validate the data checksum
•  HDFS Client
–  Interface available in Java, C, and command line.
–  Client computes and validates checksum stored by Datanode for data integrity check (if block
is corrupt, then other replica is accessed)
Hadoop Distributed File System
Data Model:
•  Data is organized into files and directories
•  Files are divided into uniformly-sized blocks and
distributed across cluster nodes
•  Blocks are replicated to handle hardware failure
•  Filesystem keeps checksums of data for corruption
detection and recovery
•  Read requests are always served from closest replica
•  Not strictly POSIX-compliant
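To make this write-once/read-many model concrete, here is a minimal sketch using the standard Java FileSystem API (the path and file contents are invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write once: the client streams blocks to DataNodes chosen by the NameNode
    Path path = new Path("/demo/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeUTF("hadoop is fun");
    }

    // Read many times: reads are served from the closest replica
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
  }
}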
Hadoop Distributed File System
•  Distributed, Fault-Tolerant & Scalable (petabyte) File System:
•  Designed to run on commodity hardware
Hardware failure is the norm (handled by block-level replication, analogous to RAID-1)
•  High throughput for Streaming/Sequential data access
As opposed to low latency for random I/O
•  Tuned for a smaller number of large data files
•  Simple Coherency model (Write once, read multiple times)
Appending data to a file has been supported since 0.19
•  Support for scalable data processing
Exposes metadata such as the number of block replicas and their locations, for scheduling computations closer to the data
•  Portability across heterogeneous HW & SW platforms
File system written in Java
•  High Availability and Namespace federation support (2.0.x-alpha)
HDFS Overview
•  Hierarchical UNIX-like file system for data storage
–  sort of (files, folders, permissions, users, groups) … but it is
a virtual file system
•  Splitting of large files into blocks
•  Distribution and replication of blocks to nodes
•  Two key services
–  Master NameNode
–  Many DataNodes
•  Checkpoint Node (Secondary NameNode)
NameNode
•  Single master service for HDFS
•  Single point of failure (HDFS 1.x; not 2.x)
•  Stores file to block to location mappings in the
namespace
•  All transactions are logged to disk
•  On startup, the NameNode reads the namespace image and replays the transaction logs
Checkpoint Node (Secondary NN)
•  Performs checkpoints of the NameNode’s
namespace and logs
•  Not a hot backup!
1.  Loads up namespace
2.  Reads log transactions to modify namespace
3.  Saves namespace as a checkpoint
DataNode
•  Stores blocks on local disk
•  Sends frequent heartbeats to NameNode
•  Sends block reports to NameNode (all the block
IDs it has, checksums, etc)
•  Clients connect to DataNode for I/O
How HDFS Works - Writes
[Diagram: Client, NameNode, and DataNodes A–D; blocks A1–A4]
1.  Client contacts NameNode to write data
2.  NameNode says write it to these nodes
3.  Client sequentially writes blocks to the DataNodes
How HDFS Works - Writes (cont.)
[Diagram: blocks A1–A4 replicated across DataNodes A–D]
DataNodes replicate data blocks, orchestrated by the NameNode
How HDFS Works - Reads
[Diagram: Client, NameNode, and DataNodes A–D holding replicated blocks]
1.  Client contacts NameNode to read data
2.  NameNode says you can find it here
3.  Client sequentially reads blocks from the DataNodes
How HDFS Works - Failure
[Diagram: a DataNode serving a block becomes unavailable]
Client connects to another node serving that block
Block Replication
•  Default of three replicas
•  Rack-aware placement
–  One replica on the local node
–  One replica on a different node in the same rack
–  One replica on a node in another rack
•  Automatic re-copy by NameNode, as needed
•  Replication factor is configurable (see the snippet below)
[Diagram: Rack 1 and Rack 2, each holding multiple DataNodes]
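A minimal sketch of that knob: the standard dfs.replication property in hdfs-site.xml (3 is the shipped default, restated here only for illustration):

<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- replicas kept per block -->
</property>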
HDFS 2.0 Features
•  NameNode High-Availability (HA)
–  Two redundant NameNodes in active/passive
configuration
–  Manual or automated failover
•  NameNode Federation
–  Multiple independent NameNodes using the same
collection of DataNodes
Hadoop MapReduce
•  Programming model for processing lists of key/value pairs
•  Map function: processes input key/value pairs and produces set of
intermediate key/value pairs.
•  Reduce function: merges all intermediate values associated with the same
intermediate key and produces output key/value pairs.
Map-Reduce Programming Model:
Input (K1, V1) → Map → Intermediate Output List(K2, V2)
→ Sort or Group by K2 → (K2, List(V2)) → Reduce → Output (K2, List(V3))
Application Writer Specifies:
• Map and Reduce classes
• Input data on HDFS
• Input/Output format classes (optional)
Workflow:
•  The input phase generates a number of logical FileSplits from the input files
–  One Map task is created per logical file split
•  Each Map task loads the Map class and executes the map function to transform input k-v pairs into a new set of k-v pairs
–  A RecordReader, supplied as part of the InputFormat, reads each input record as a k-v pair
•  Map output keys are stored on local disk in sorted partitions, one per reduce task
•  One invocation of the map function per k-v pair from the associated input split
•  Each Reduce task fetches the map output from its associated partition as soon as a map task finishes its processing
•  Map outputs are merged
•  One invocation of the reduce function per distinct key and its associated list of values
•  Output k-v pairs are stored on HDFS, one file per reduce task
•  The framework handles task scheduling and recovery
Parallel Execution Model for Map-Reduce
[Diagram: an input HDFS file is divided into Input Splits 0–2; Map tasks 0–2 each produce sorted partitions (key ranges K1..m and Km+1..N); the shuffle copies each partition to its reducer; Reduce 0 and Reduce 1 merge & sort their partitions and write Output Part-0 and Part-1]
Hadoop MapReduce 1.x
•  Moves the code to the data
•  JobTracker
–  Master service to monitor jobs
•  TaskTracker
–  Multiple services to run tasks in parallel
–  Same physical machine as a DataNode
•  A job contains many tasks (one data block equals one task)
•  A task contains one or more task attempts (a successful attempt completes the task; a failed attempt is given to another TaskTracker to retry – 4 failed attempts for a single task = one failed job)
JobTracker
•  Monitors job and task progress
•  Issues task attempts to TaskTrackers
•  Re-tries failed task attempts
•  Four failed attempts = one failed job
•  Schedules jobs in FIFO order by default
–  Fair Scheduler and Capacity Scheduler available as alternatives
•  Single point of failure for MapReduce
TaskTrackers
•  Runs on same node as DataNode service
•  Sends heartbeats and task reports to JobTracker
•  Configurable number of map and reduce slots
•  Runs map and reduce task attempts
–  Separate JVM!
Exploiting Data Locality
•  JobTracker will schedule task on a TaskTracker
that is local to the block
–  3 options! Because 3 replicas
•  If TaskTracker is busy, selects TaskTracker on
same rack
–  Many options!
•  If still busy, chooses an available TaskTracker at
random – Rare!
YARN (aka MapReduce 2)
•  Abstract framework for distributed application development
•  Split functionality of JobTracker into two components
–  ResourceManager
–  ApplicationMaster
•  TaskTracker becomes NodeManager
–  Containers instead of map and reduce slots
•  Configurable amount of memory per NodeManager
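As a sketch of that last knob: NodeManager memory is set with the standard yarn.nodemanager.resource.memory-mb property in yarn-site.xml (the 8 GB value below is an assumption, not a recommendation):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- MB this NodeManager may hand out to containers -->
</property>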
How MapReduce Works
[Diagram: Client, JobTracker, and TaskTrackers A–D co-located with DataNodes A–D holding blocks A1–A4 and B1–B4]
1.  Client submits job to JobTracker
2.  JobTracker submits tasks to TaskTrackers
3.  Job output is written to DataNodes w/ replication
4.  JobTracker reports metrics back to client
How MapReduce Works - Failure
[Diagram: a TaskTracker fails mid-job]
JobTracker assigns task to a different node
MapReduce 2.x on YARN
•  MapReduce API has not changed
–  Rebuild required to upgrade from 1.x to 2.x
•  MapReduce History Server to store… history
YARN – Architecture
•  Client
–  Submits jobs/applications
•  Resource Manager
–  Schedules resources
•  AppMaster
–  Manages/monitors the lifecycle of the M/R job
•  Node Manager
–  Manages/monitors task lifecycle
•  Container
–  Task JVM
–  No distinction between map and reduce tasks
YARN – Map/Reduce
Hadoop Ecosystem
Hadoop Ecosystem
•  Core Technologies
–  Hadoop Distributed File System
–  Hadoop MapReduce
•  Many other tools…
–  Which I will be describing… now
Moving Data
•  Sqoop
–  Moving data between RDBMS and HDFS
–  Say, migrating MySQL tables to HDFS
•  Flume
–  Streams event data from sources to sinks
–  Say, weblogs from multiple servers into HDFS
Flume Architecture
Higher Level APIs
•  Pig
–  Data-flow language – aptly named PigLatin -- to
generate one or more MapReduce jobs against data
stored locally or in HDFS
•  Hive
–  Data warehousing solution, allowing users to write
SQL-like queries to generate a series of MapReduce
jobs against data stored in HDFS
Pig Word Count
-- Load the input file(s); each record is one line of text
A = LOAD '$input';
-- Split each line into individual words
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
-- Group identical words together
C = GROUP B BY word;
-- Count the occurrences of each word
D = FOREACH C GENERATE group AS word, COUNT(B);
STORE D INTO '$output';
Key/Value Stores
•  HBase
•  Accumulo
•  Implementations of Google's Bigtable for HDFS
•  Provides random, real-time access to big data
•  Supports updates and deletes of key/value pairs
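For a feel of that random, real-time access, a minimal sketch against the classic (pre-1.0) HBase Java client API – the table, row, and column names are invented:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "users");         // hypothetical table

    // Random write: update a single cell by row key
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Kevin"));
    table.put(put);

    // Random read: fetch that cell back in real time
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    table.close();
  }
}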
HBase Architecture
[Diagram: Client and ZooKeeper coordinate with the Master; RegionServers host Regions; each Region holds Stores, and each Store has a MemStore plus StoreFiles persisted to HDFS]
Data Structure
•  Avro
–  Data serialization system designed for the Hadoop
ecosystem
–  Schemas are expressed as JSON (sample after this list)
•  Parquet
–  Compressed, efficient columnar storage for Hadoop
and other systems
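For illustration, a small Avro schema of the kind referenced above – the record and field names are invented:

{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensorId",  "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "value",     "type": "double"}
  ]
}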
Scalable Machine Learning
•  Mahout
–  Library for scalable machine learning written in Java
–  Very robust examples!
–  Classification, Clustering, Pattern Mining, Collaborative
Filtering, and much more
Workflow Management
•  Oozie
–  Scheduling system for Hadoop Jobs
–  Support for:
•  Java MapReduce
•  Streaming MapReduce
•  Pig, Hive, Sqoop, Distcp
•  Any ol’ Java or shell script program
Real-time Stream Processing
•  Storm
–  Open-source project that streams data from sources, called spouts, through a series of processing agents called bolts
–  Scalable and fault-tolerant, with guaranteed processing of data
–  Benchmarks of over a
million tuples processed
per second per node
Distributed Application Coordination
•  ZooKeeper
–  An effort to develop and
maintain an open-source
server which enables
highly reliable distributed
coordination
–  Designed to be simple,
replicated, ordered, and
fast
–  Provides configuration
management, distributed
synchronization, and group
services for applications
ZooKeeper Architecture
Hadoop Streaming
•  Can define Mapper and Reduce using Unix text filters
–  Typically use grep, sed, python, or perl scripts
•  Format for input and output is: key \t value \n
•  Allows for easy debugging and experimentation
•  Slower than Java programs
•  bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh
–  Mapper: /bin/sed -e 's| |\n|g' | /bin/grep .
–  Reducer: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
Hadoop Streaming Architecture
[Diagram: the JobTracker (master) dispatches to TaskTrackers (slaves); in a Map Task, input records from an HDFS file are piped to the mapper executable via STDIN and its K \t V output is read back via STDOUT; a Reduce Task does the same with the reducer executable, writing output to an HDFS file]
http://hadoop.apache.org/docs/stable/streaming.html
SQL on Hadoop
•  Apache Drill
•  Cloudera Impala
•  Hive Stinger
•  Pivotal HAWQ
•  MPP execution of SQL queries against HDFS data
That’s a lot of projects
•  I am likely missing several (Sorry, guys!)
•  Each cropped up to solve a limitation of Hadoop
Core
•  Know your ecosystem
•  Pick the right tool for the right job
Sample Architecture
[Diagram: events from a Website webserver, Sales, and a Call Center flow through Flume agents into HDFS; MapReduce, Pig, HBase, and Storm process the data, with workflows coordinated by Oozie and results accessed via SQL]
MapReduce Primer
MapReduce Paradigm
•  Data processing system with two key phases
•  Map
–  Perform a map function on input key/value pairs to
generate intermediate key/value pairs
•  Reduce
–  Perform a reduce function on intermediate key/value
groups to generate output key/value pairs
•  Groups created by sorting map output
[Worked example with Map Tasks 0–2 and Reduce Tasks 0–1]
Map Input:
(0, "hadoop is fun")   (52, "I love hadoop")   (104, "Pig is more fun")
Map Output:
("hadoop", 1) ("is", 1) ("fun", 1)   ("I", 1) ("love", 1) ("hadoop", 1)   ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1)
SHUFFLE AND SORT
Reducer Input Groups:
("hadoop", {1,1}) ("is", {1,1}) ("fun", {1,1}) ("love", {1}) ("I", {1}) ("Pig", {1}) ("more", {1})
Reducer Output (split across the two reduce tasks):
("hadoop", 2) ("is", 2) ("fun", 2) ("love", 1) ("I", 1) ("Pig", 1) ("more", 1)
Hadoop MapReduce Components
•  Map Phase
–  Input Format
–  Record Reader
–  Mapper
–  Combiner
–  Partitioner
•  Reduce Phase
–  Shuffle
–  Sort
–  Reducer
–  Output Format
–  Record Writer
Writable Interfaces
public interface Writable {

  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
•  BooleanWritable
•  BytesWritable
•  ByteWritable
•  DoubleWritable
•  FloatWritable
•  IntWritable
•  LongWritable
•  NullWritable
•  Text
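As an illustration of these interfaces, a minimal custom key type – the (word, length) pair is invented for this sketch and is not part of Hadoop:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class WordLength implements WritableComparable<WordLength> {
  private String word = "";
  private int length;

  public void set(String w) { word = w; length = w.length(); }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(word);   // serialize fields in a fixed order...
    out.writeInt(length);
  }

  public void readFields(DataInput in) throws IOException {
    word = in.readUTF();  // ...and deserialize them in the same order
    length = in.readInt();
  }

  public int compareTo(WordLength other) {
    return word.compareTo(other.word); // keys sort alphabetically in the shuffle
  }

  public int hashCode() {
    return word.hashCode(); // used by the default HashPartitioner
  }
}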
InputFormat

public abstract class InputFormat<K, V> {

  public abstract List<InputSplit> getSplits(JobContext context);

  public abstract RecordReader<K, V>
      createRecordReader(InputSplit split, TaskAttemptContext context);
}
RecordReader
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

  public abstract void initialize(InputSplit split, TaskAttemptContext context);

  public abstract boolean nextKeyValue();

  public abstract KEYIN getCurrentKey();

  public abstract VALUEIN getCurrentValue();

  public abstract float getProgress();

  public abstract void close();
}
Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void setup(Context context) throws IOException, InterruptedException { /* NOTHING */ }
  protected void cleanup(Context context) throws IOException, InterruptedException { /* NOTHING */ }

  protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue())
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    cleanup(context);
  }
}
Partitioner

public abstract class Partitioner<KEY, VALUE> {

  public abstract int getPartition(KEY key, VALUE value, int numPartitions);

}
•  Default HashPartitioner uses key’s hashCode() % numPartitions
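A sketch of a custom Partitioner, routing words to reducers by their first letter – an invented policy, shown only to contrast with the default HashPartitioner above:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) return 0;
    // Mask the sign bit so the result is non-negative, as HashPartitioner does
    return (Character.toLowerCase(key.charAt(0)) & Integer.MAX_VALUE) % numPartitions;
  }
}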
Reducer
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void setup(Context context) throws IOException, InterruptedException { /* NOTHING */ }
  protected void cleanup(Context context) throws IOException, InterruptedException { /* NOTHING */ }

  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) throws IOException, InterruptedException {
    for (VALUEIN value : values)
      context.write((KEYOUT) key, (VALUEOUT) value);
  }

  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey())
      reduce(context.getCurrentKey(), context.getValues(), context);
    cleanup(context);
  }
}
OutputFormat

public abstract class OutputFormat<K, V> {

  public abstract RecordWriter<K, V>
      getRecordWriter(TaskAttemptContext context);

  public abstract void checkOutputSpecs(JobContext context);

  public abstract OutputCommitter
      getOutputCommitter(TaskAttemptContext context);
}
RecordWriter

public abstract class RecordWriter<K, V> {

  public abstract void write(K key, V value);

  public abstract void close(TaskAttemptContext context);
}
Some M/R Concepts / knobs
•  Configuration
–  {hdfs,yarn,mapred}-default.xml -- default config (contain both services & client config)
–  {hdfs,yarn,mapred}-site.xml -- Service config used for cluster specific over-rides,
–  {hdfs,yarn,mapred}-client.xml -- Client specific config
•  Input/Output Formats
–  TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat
–  Pluggable input/output formats provide ability for Jobs to read/write data in different formats
–  Major function
•  getSplits
•  RecordReader
•  Schedulers
–  Pluggable resource scheduler used by Resource Manager
–  FIFO (default), Capacity Scheduler & Fair Scheduler
•  Combiner
–  Combine individual map output before sending to reducer
–  Lowers intermediate data
•  Partitioner
–  Pluggable class to partition the map output among number of reducers
Some M/R knobs
•  Compression
–  Enable compression of Map/Reduce output
–  Gzip, lzo, bz2 codecs available with framework
•  Counters
–  Ability to keep track of various job statistics e.g. num bytes read, written
–  Available for each task and also aggregated per job.
–  Job can write its own custom counters
•  Speculative Execution
–  Launches backup attempts of slow-running tasks to guard against slow or failing hardware
•  Distributed cache
–  Ability to make job-specific data available to each task
•  Tool – M/R application helper classes; supports the ability for a job to accept generic options (see the sketch after this list), e.g.
–  -conf <configuration file> specify an application configuration file
–  -D <property=value> use value for given property
–  -fs <local|namenode:port> specify a namenode
–  -jt <local|jobtracker:port> specify a job tracker
–  -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
–  -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
–  -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
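A minimal sketch of the Tool pattern described above, so a job accepts those generic options via the standard ToolRunner (the class name and the elided job setup are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() is already populated with -D/-conf/-files overrides
    Configuration conf = getConf();
    // ... build and submit the Job here ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options before calling run()
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}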
Word Count Example
Problem
•  Count the number of
times each word is used
in a body of text
•  Uses TextInputFormat
and TextOutputFormat
map(byte_offset, line)
    foreach word in line
        emit(word, 1)

reduce(word, counts)
    sum = 0
    foreach count in counts
        sum += count
    emit(word, sum)
Word Count Example
Mapper Code
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);

    // Emit (word, 1) for every token in the line
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, ONE);
    }
  }
}
Shuffle and Sort
[Diagram: Mappers 0–3 each write one logically partitioned output file (partitions P0–P3); Reducers 0–3 each copy their partition from every mapper and merge the partitions, sorting by key]
1.  Mapper outputs to a single logically partitioned file
2.  Reducers copy their parts
3.  Reducer merges partitions, sorting by key
Reducer Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable outvalue = new IntWritable();
  private int sum = 0;

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all the counts emitted for this word
    sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    outvalue.set(sum);
    context.write(key, outvalue);
  }
}
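For completeness – the deck stops at the mapper and reducer – a minimal driver that wires them into a job, sketched with the standard Job API (input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordMapper.class);
    job.setCombinerClass(IntSumReducer.class); // safe: summing is associative
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}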
So what’s so hard about it?
[Diagram: a tiny box labeled "MapReduce" inside a much larger box labeled "All the problems you'll ever have ever"]
So what’s so hard about it?
•  MapReduce is a limitation
•  Entirely different way of thinking
•  Simple processing operations such as joins are not so
easy when expressed in MapReduce
•  Proper implementation is not so easy
•  Lots of configuration and implementation details for
optimal performance
–  Number of reduce tasks, data skew, JVM size, garbage
collection
So what does this mean for you?
•  Hadoop is written primarily in Java
•  Components are extendable and configurable
•  Custom I/O through Input and Output Formats
–  Parse custom data formats
–  Read and write using external systems
•  Higher-level tools enable rapid development of big
data analysis
Resources, Wrap-up, etc.
•  http://hadoop.apache.org
•  Very supportive community
•  Plenty of resources available to learn more
–  Blogs
–  Email lists
–  Books
–  Shameless Plug -- MapReduce Design Patterns
Getting Started
•  Pivotal HD Single-Node VM and Community
Edition
–  http://gopivotal.com/pivotal-products/data/pivotal-hd
•  For the brave and bold -- Roll-your-own!
–  http://hadoop.apache.org/docs/current
Acknowledgements
•  Apache Hadoop, the Hadoop elephant logo, HDFS,
Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout,
Oozie, Pig, Sqoop, YARN, and ZooKeeper are
trademarks of the Apache Software Foundation
•  Cloudera Impala is a trademark of Cloudera
•  Parquet is copyright Twitter, Cloudera, and other
contributors
•  Storm is licensed under the Eclipse Public License
•  Talk to us on Twitter: @mewzherder (Tamao, not
me)
•  Sign up for more Hadoop
–  http://bit.ly/POSH0018
•  Pivotal Education
–  http://www.gopivotal.com/training
Learn More. Stay Connected.
Questions ??

Mais conteúdo relacionado

Mais procurados

Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAmazon Web Services
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataWANdisco Plc
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnhdhappy001
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopDataWorks Summit
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentContinuent
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Hortonworks
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopGwen (Chen) Shapira
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableStefan Kupstaitis-Dunkler
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInDataWorks Summit
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 

Mais procurados (20)

Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at Continuent
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Hadoop
Hadoop Hadoop
Hadoop
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
HDFS Tiered Storage
HDFS Tiered StorageHDFS Tiered Storage
HDFS Tiered Storage
 

Semelhante a Apache hadoop: POSH Meetup Palo Alto, CA April 2014

hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptHADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptManiMaran230751
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxDanishMahmood23
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaData Con LA
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and FutureDataWorks Summit
 
Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science Bon Secours...
Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science  Bon Secours...Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science  Bon Secours...
Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science Bon Secours...AyeeshaParveen
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 

Semelhante a Apache hadoop: POSH Meetup Palo Alto, CA April 2014 (20)

Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Hadoop
HadoopHadoop
Hadoop
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.pptHADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
HADOOP AND MAPREDUCE ARCHITECTURE-Unit-5.ppt
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science Bon Secours...
Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science  Bon Secours...Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science  Bon Secours...
Hadoop ecosystem; J.Ayeesha parveen 2 nd M.sc., computer science Bon Secours...
 
Hadoop
HadoopHadoop
Hadoop
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 

Último

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 

Último (20)

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 

Apache hadoop: POSH Meetup Palo Alto, CA April 2014

  • 1. 1© Copyright 2014 Pivotal. All rights reserved. 1© Copyright 2014 Pivotal. All rights reserved. Intro to Hadoop: Hype or Reality – you decide kcrocker@gopivotal.com Pivotal Meet-up Kevin Crocker, Consulting Instructor, Pivotal Academy March 19, 2014
  • 2. 2© Copyright 2014 Pivotal. All rights reserved. Why is this Meet-up necessary •  What is the future of enterprise data architecture? –  The explosion of data –  Volume, Variety, Velocity –  Overruns traditional data stores –  What is the business value of collecting all this data?
  • 3. 3© Copyright 2014 Pivotal. All rights reserved. Volume •  At a recent data conference, one participant told the audience that they collected 7 PB of data a day – and generated another 7 PB of data analytics •  That’s 63 racks! A day! X 2 •  What do we even call that amount of data? –  Data Warehouse(s), Data Store(s) –  New Term: Data Lake
  • 4. 4© Copyright 2014 Pivotal. All rights reserved. Variety •  At the same data conference, another presenter participated in a study using wearable medical technology to monitor health –  Collected 1 million readings a day = 12 readings a second –  when was the last time you had YOUR blood pressure checked? •  Toronto – so many sensors they can track millions of cell phones over 400 square miles – 24x7
  • 5. 5© Copyright 2014 Pivotal. All rights reserved. Velocity •  Ingesting this amount of data is difficult •  Analyzing this amount of data in traditional ways is also difficult –  A client recently told me that it used to take 3 weeks for them to analyze the data from their sensors, now they do it in 3 hours
  • 6. 6© Copyright 2014 Pivotal. All rights reserved. Business Value •  Wall Street Journal – those businesses in Toronto pay to get summary reports of all that data and then gear their marketing campaigns to drive new revenue
  • 7. 7© Copyright 2014 Pivotal. All rights reserved. The Data Lake Dream, Forbes, 01/14/2014 •  In an article published in Forbes, the author mentions the term Data Lake and the technology that addresses the problem of big data => Hadoop •  Four levels of Hadoop Maturity –  Life Before Hadoop -> Hadoop is Introduced -> Growing the Data Lake -> Data Lake and Application Cloud
  • 8. 8© Copyright 2014 Pivotal. All rights reserved. So – Let’s talk about Hadoop •  Hadoop Overview –  Core Elements: HDFS and MapReduce –  Ecosystem •  HDFS Architecture •  Hadoop MapReduce •  Hadoop Ecosystem •  MapReduce Primer •  Buckle up!
  • 9. 9© Copyright 2014 Pivotal. All rights reserved. 9© Copyright 2014 Pivotal. All rights reserved. Hadoop Overview
  • 10. 10© Copyright 2014 Pivotal. All rights reserved. Hadoop Core •  Based on two Google papers in 2003/4 – Google File System and MapReduce •  Spawned off Nutch open-source web-search because of the need to store the data •  Open-source Apache project out of Yahoo! in January 2006 •  Distributed fault-tolerant data storage (distribution and replication of resources) and distributed batch processing (not for random reads/writes, or updates) •  Provides linear scalability on commodity hardware •  Adopted by many: –  Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more http://wiki.apache.org/hadoop/PoweredBy •  Hadoop uses data redundancy rather than backup strategies
  • 11. 11© Copyright 2014 Pivotal. All rights reserved. Hadoop Overview •  Consists of: –  Key sub-projects •  Hadoop Common: Common utilities/tools for all Hadoop components/sub-projects •  HDFS: A reliable, high-bandwidth, distributed file system •  Map/Reduce: A programming framework to process large datasets •  YARN –  Other key Apache projects in Hadoop ecosystem •  Avro: A data serialization system •  Hbase/Cassandra: A scalable, distributed no-sql databases, supports structured data storage for large tables. •  Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. •  Pig: A high-level data-flow language and execution framework for parallel computation. •  ZooKeeper: A high-performance coordination service for distributed application •  Latest version of Hadoop; –  Stable and widely used latest version – V1 => 1.2.1, V2 => 2.2.0
  • 12. 12© Copyright 2014 Pivotal. All rights reserved. Why? •  Bottom line: –  Flexible –  Scalable –  Inexpensive
  • 13. 13© Copyright 2014 Pivotal. All rights reserved. Overview •  Great at –  Reliable storage for multi-petabyte data sets –  Batch queries and analytics –  Complex hierarchical data structures with changing schemas, unstructured and structured data •  Not so great at –  Changes to files (can’t do it…) – not OLTP –  Low-latency responses –  Analyst usability •  This is less of a concern now due to higher-level languages
  • 14. 14© Copyright 2014 Pivotal. All rights reserved. Data Structure •  Bytes! And more Bytes! (Peta) •  No more ETL necessary??? •  Store data now, process later •  Structure (schema) on read –  Built-in support for common data types and formats –  Extendable –  Flexible
  • 15. Versioning •  Versions 0.20.x, 0.21.x, 0.22.x, 0.23.x, 1.x.x –  Two main MR packages: •  org.apache.hadoop.mapred (deprecated) •  org.apache.hadoop.mapreduce (new hotness) •  Version 2.2.0, GA Oct 2013 –  NameNode HA –  YARN – Next Gen MapReduce –  HDFS Federation, Snapshots
  • 16. HDFS Architecture
  • 17. HDFS Architecture
  • 18. HDFS Architecture (Master/Worker) •  HDFS Master: “Namenode” –  Manages the filesystem namespace –  Controls read/write access to files –  Serves open/close/rename file requests from clients –  Manages block replication (rack-aware block placement, auto re-replication) –  Checkpoints the namespace and journals namespace changes for reliability •  HDFS Workers: “Datanodes” –  Serve read/write requests from clients –  Perform replication tasks upon instruction by the Namenode –  Periodically validate the data checksums •  HDFS Client –  Interfaces available in Java, C, and the command line (examples below) –  Client computes and validates the checksum stored by the Datanode for data integrity (if a block is corrupt, another replica is accessed)
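  As a taste of the command-line client, here is a minimal, hedged session (the file and directory names are invented for illustration; the hadoop fs commands themselves are standard, and in Hadoop 2.x the same operations are also available as hdfs dfs):

  hadoop fs -mkdir /data                        # create a directory in HDFS
  hadoop fs -put access.log /data/access.log    # copy a local file in; it is split into blocks and replicated
  hadoop fs -ls /data                           # list the directory
  hadoop fs -cat /data/access.log               # stream the file back
  hadoop fs -get /data/access.log copy.log      # copy it back to the local filesystem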
  • 19. Hadoop Distributed File System Data Model: •  Data is organized into files and directories •  Files are divided into uniformly-sized blocks and distributed across cluster nodes •  Blocks are replicated to handle hardware failure •  Filesystem keeps checksums of data for corruption detection and recovery •  Read requests are always served from closest replica •  Not strictly POSIX-compliant
  • 20. Hadoop Distributed File System •  Distributed, fault-tolerant, and scalable (petabyte) file system •  Designed to run on commodity hardware – hardware failure is the norm (RAID-1-style block-level replication) •  High throughput for streaming/sequential data access, as opposed to low latency for random I/O •  Tuned for a smaller number of large data files •  Simple coherency model (write once, read multiple times); appending data to a file is supported since 0.19 •  Support for scalable data processing – exposes metadata such as the number of block replicas and their locations, for scheduling computations closer to the data •  Portability across heterogeneous HW & SW platforms (file system written in Java) •  High Availability and namespace federation support (2.0.x-alpha)
  • 21. HDFS Overview •  Hierarchical UNIX-like file system for data storage –  sort of (files, folders, permissions, users, groups) … but it is a virtual file system •  Splitting of large files into blocks •  Distribution and replication of blocks to nodes •  Two key services –  Master NameNode –  Many DataNodes •  Checkpoint Node (Secondary NameNode)
  • 22. NameNode •  Single master service for HDFS •  Single point of failure (HDFS 1.x; not 2.x) •  Stores file-to-block-to-location mappings in the namespace •  All transactions are logged to disk •  On startup, the NameNode reads the namespace image and replays the logs
  • 23. Checkpoint Node (Secondary NN) •  Performs checkpoints of the NameNode’s namespace and logs •  Not a hot backup! 1.  Loads up namespace 2.  Reads log transactions to modify namespace 3.  Saves namespace as a checkpoint
  • 24. DataNode •  Stores blocks on local disk •  Sends frequent heartbeats to NameNode •  Sends block reports to NameNode (all the block IDs it has, checksums, etc.) •  Clients connect to DataNode for I/O
  • 25. How HDFS Works – Writes [diagram: Client, NameNode, DataNodes A–D, blocks A1–A4] 1. Client contacts NameNode to write data 2. NameNode says write it to these nodes 3. Client sequentially writes blocks to the DataNodes
  • 26. How HDFS Works – Writes [diagram: blocks A1–A4 replicated across DataNodes A–D] DataNodes replicate data blocks, orchestrated by the NameNode
  • 27. How HDFS Works – Reads [diagram: Client, NameNode, DataNodes A–D holding replicated blocks A1–A4] 1. Client contacts NameNode to read data 2. NameNode says you can find it here 3. Client sequentially reads blocks from the DataNodes
  • 28. How HDFS Works – Failure [diagram: Client, NameNode, DataNodes A–D] Client connects to another node serving that block
  • 29. Block Replication •  Default of three replicas •  Rack-aware system –  One replica on the same rack –  One replica on the same rack, different host –  One replica on another rack •  Automatic re-copy by the NameNode, as needed [diagram: Rack 1 and Rack 2, each holding several DataNodes]
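  The replication factor is a cluster-wide setting. A minimal, hedged hdfs-site.xml snippet (the property name is standard; the value is just the usual default):

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>

  Individual files can also be changed after the fact, e.g. hadoop fs -setrep -w 2 /data/access.log.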
  • 30. HDFS 2.0 Features •  NameNode High-Availability (HA) –  Two redundant NameNodes in active/passive configuration –  Manual or automated failover •  NameNode Federation –  Multiple independent NameNodes using the same collection of DataNodes
  • 31. Hadoop MapReduce
  • 32. Map-Reduce Programming Model •  Programming model processing lists of key/value pairs •  Map function: processes input key/value pairs and produces a set of intermediate key/value pairs •  Reduce function: merges all intermediate values associated with the same intermediate key and produces output key/value pairs [diagram: Input (k1, v1) → Map → intermediate output List(K2, V2) → sort/group by K2 → (K2, List(V2)) → Reduce → Output (K2, List(V3))]
  • 33. Parallel Execution Model for Map-Reduce — Application writer specifies: •  Map and Reduce classes •  Input data on HDFS •  Input/output format classes (optional) Workflow: •  The input phase generates a number of logical FileSplits from the input files •  One Map task is created per logical file split •  Each Map task loads the Map class and executes the map function to transform input k-v pairs into a new set of k-v pairs •  A RecordReader class, supplied as part of the InputFormat, reads each input record as a k-v pair •  One invocation of the map function per k-v pair from the associated input split •  Map output keys are stored on local disk in sorted partitions, one per reduce task •  Each Reduce task fetches map output (from its associated partition) as soon as a map task finishes its processing •  Map outputs are merged and sorted •  One invocation of the reduce function per distinct key and its associated list of values •  Output k-v pairs are stored on HDFS, one file per reduce task •  The framework handles task scheduling and recovery [diagram: input HDFS file → input splits 0–2 → Map 0–2 → sorted partitions → shuffle → merge & sort → Reduce 0–1 → output part-0 / part-1]
  • 34. Hadoop MapReduce 1.x •  Moves the code to the data •  JobTracker –  Master service to monitor jobs •  TaskTracker –  Multiple services to run tasks in parallel –  Same physical machine as a DataNode •  A job contains many tasks (one data block equals one task) •  A task contains one or more task attempts (success = good; failed task attempts are given to another TaskTracker for processing; 4 failed attempts for a single task = one failed job)
  • 35. JobTracker •  Monitors job and task progress •  Issues task attempts to TaskTrackers •  Re-tries failed task attempts •  Four failed attempts = one failed job •  Schedules jobs in FIFO order by default –  Fair Scheduler available as an alternative •  Single point of failure for MapReduce
  • 36. TaskTrackers •  Runs on same node as DataNode service •  Sends heartbeats and task reports to JobTracker •  Configurable number of map and reduce slots •  Runs map and reduce task attempts –  Separate JVM!
  • 37. Exploiting Data Locality •  JobTracker will schedule a task on a TaskTracker that is local to the block –  3 options, because there are 3 replicas! •  If those TaskTrackers are busy, selects a TaskTracker on the same rack –  Many options! •  If still busy, chooses an available TaskTracker at random –  Rare!
  • 38. YARN (aka MapReduce 2) •  Abstract framework for distributed application development •  Split functionality of JobTracker into two components –  ResourceManager –  ApplicationMaster •  TaskTracker becomes NodeManager –  Containers instead of map and reduce slots •  Configurable amount of memory per NodeManager
  • 39. How MapReduce Works [diagram: Client, JobTracker, TaskTrackers A–D co-located with DataNodes A–D holding blocks A1–A4 and B1–B4] 1. Client submits job to JobTracker 2. JobTracker submits tasks to TaskTrackers 3. Job output is written to DataNodes w/replication 4. JobTracker reports metrics back to client
  • 40. How MapReduce Works – Failure [diagram: Client, JobTracker, TaskTrackers A–D] JobTracker assigns the task to a different node
  • 41. MapReduce 2.x on YARN •  MapReduce API has not changed –  Rebuild required to upgrade from 1.x to 2.x •  MapReduce History Server to store… history
  • 42. YARN – Architecture •  Client –  Submits jobs/applications •  ResourceManager –  Schedules resources •  ApplicationMaster –  Manages/monitors the lifecycle of the M/R job •  NodeManager –  Manages/monitors task lifecycles •  Container –  Task JVM –  No distinction between map and reduce tasks
  • 43. YARN – Map/Reduce
  • 44. Hadoop Ecosystem
  • 45. Hadoop Ecosystem •  Core Technologies –  Hadoop Distributed File System –  Hadoop MapReduce •  Many other tools… –  Which I will be describing… now
  • 46. Moving Data •  Sqoop –  Moving data between RDBMS and HDFS –  Say, migrating MySQL tables to HDFS •  Flume –  Streams event data from sources to sinks –  Say, weblogs from multiple servers into HDFS
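  To make the Sqoop case concrete, a hedged example import (the connection string, database, and table names are invented; the flags are standard Sqoop):

  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --table orders \
    --target-dir /data/orders \
    --num-mappers 4

  Sqoop turns this into a MapReduce job that reads the table in parallel and writes delimited files under /data/orders in HDFS.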
  • 47. Flume Architecture
  • 48. Higher Level APIs •  Pig –  Data-flow language – aptly named PigLatin -- to generate one or more MapReduce jobs against data stored locally or in HDFS •  Hive –  Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS
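  For flavor, a hedged HiveQL sketch (the weblogs table and its columns are invented; the statements are standard Hive):

  CREATE TABLE weblogs (host STRING, request STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  LOAD DATA INPATH '/data/weblogs' INTO TABLE weblogs;
  SELECT request, COUNT(*) AS hits
  FROM weblogs
  GROUP BY request
  ORDER BY hits DESC
  LIMIT 10;

  Behind the scenes, the SELECT compiles into one or more MapReduce jobs over the files backing the table.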
  • 49. Pig Word Count

  A = LOAD '$input';
  B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
  C = GROUP B BY word;
  D = FOREACH C GENERATE group AS word, COUNT(B);
  STORE D INTO '$output';
  • 50. Key/Value Stores •  HBase •  Accumulo •  Implementations of Google’s BigTable for HDFS •  Provide random, real-time access to big data •  Support updates and deletes of key/value pairs
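  A hedged HBase shell session to show the data model (the table and column family names are invented; the shell commands are standard):

  create 'users', 'info'                       # table with one column family
  put 'users', 'row1', 'info:name', 'Kevin'    # insert/update a cell
  get 'users', 'row1'                          # random read of one row
  scan 'users'                                 # range scan
  delete 'users', 'row1', 'info:name'          # delete a cell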
  • 51. HBase Architecture [diagram: Client → ZooKeeper + Master → RegionServers; each RegionServer hosts Regions made up of Stores, each with a MemStore and StoreFiles, persisted on HDFS]
  • 52. Data Structure •  Avro –  Data serialization system designed for the Hadoop ecosystem –  Schemas are expressed as JSON •  Parquet –  Compressed, efficient columnar storage for Hadoop and other systems
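  For example, a hedged Avro schema for the kind of sensor readings mentioned earlier (the record and field names are invented; the schema format is standard Avro):

  { "type": "record",
    "name": "SensorReading",
    "fields": [
      { "name": "sensorId", "type": "string" },
      { "name": "value",    "type": "double" },
      { "name": "ts",       "type": "long" }
    ] }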
  • 53. Scalable Machine Learning •  Mahout –  Library for scalable machine learning written in Java –  Very robust examples! –  Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more
  • 54. Workflow Management •  Oozie –  Scheduling system for Hadoop Jobs –  Support for: •  Java MapReduce •  Streaming MapReduce •  Pig, Hive, Sqoop, Distcp •  Any ol’ Java or shell script program
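  To sketch what a workflow looks like, a minimal, hedged workflow.xml with a single map-reduce action (the workflow and action names and the queue property are illustrative; the element structure follows the Oozie workflow schema):

  <workflow-app xmlns="uri:oozie:workflow:0.4" name="wordcount-wf">
    <start to="count-words"/>
    <action name="count-words">
      <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
          <property>
            <name>mapreduce.job.queuename</name>
            <value>default</value>
          </property>
        </configuration>
      </map-reduce>
      <ok to="end"/>
      <error to="fail"/>
    </action>
    <kill name="fail">
      <message>count-words failed</message>
    </kill>
    <end name="end"/>
  </workflow-app>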
  • 55. Real-time Stream Processing •  Storm –  Open-source project that streams data from sources, called spouts, through a series of processing agents called bolts –  Scalable and fault-tolerant, with guaranteed processing of data –  Benchmarks of over a million tuples processed per second per node
  • 56. Distributed Application Coordination •  ZooKeeper –  An effort to develop and maintain an open-source server which enables highly reliable distributed coordination –  Designed to be simple, replicated, ordered, and fast –  Provides configuration management, distributed synchronization, and group services for applications
  • 57. ZooKeeper Architecture
  • 58. Hadoop Streaming •  Can define the Mapper and Reducer using Unix text filters –  Typically grep, sed, python, or perl scripts •  Format for input and output is: key \t value \n •  Allows for easy debugging and experimentation •  Slower than Java programs •  bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh –  Mapper: /bin/sed -e 's| |\n|g' | /bin/grep . –  Reducer: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
  • 59. Hadoop Streaming Architecture [diagram: JobTracker (master) dispatches to TaskTrackers (slaves); each map/reduce task pipes key \t value records to the mapper/reducer executable via STDIN and reads results back via STDOUT, with input and output files on HDFS] http://hadoop.apache.org/docs/stable/streaming.html
  • 60. SQL on Hadoop •  Apache Drill •  Cloudera Impala •  Hive Stinger •  Pivotal HAWQ •  MPP execution of SQL queries against HDFS data
  • 61. That’s a lot of projects •  I am likely missing several (Sorry, guys!) •  Each cropped up to solve a limitation of Hadoop Core •  Know your ecosystem •  Pick the right tool for the right job
  • 62. Sample Architecture [diagram: a website/webserver, sales, and call-center systems feed Flume agents into HDFS; MapReduce, Pig, HBase, Storm, and SQL engines consume the data, with Oozie coordinating the workflow]
  • 63. MapReduce Primer
  • 64. MapReduce Paradigm •  Data processing system with two key phases •  Map –  Perform a map function on input key/value pairs to generate intermediate key/value pairs •  Reduce –  Perform a reduce function on intermediate key/value groups to generate output key/value pairs •  Groups created by sorting map output
  • 65. [diagram: word count data flow]
  Map input:
    Map Task 0: (0, "hadoop is fun")
    Map Task 1: (52, "I love hadoop")
    Map Task 2: (104, "Pig is more fun")
  Map output:
    ("hadoop", 1) ("is", 1) ("fun", 1)
    ("I", 1) ("love", 1) ("hadoop", 1)
    ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1)
  SHUFFLE AND SORT
  Reducer input groups (Reduce Tasks 0 and 1):
    ("hadoop", {1,1}) ("is", {1,1}) ("fun", {1,1}) ("love", {1}) ("I", {1}) ("Pig", {1}) ("more", {1})
  Reducer output:
    ("hadoop", 2) ("fun", 2) ("love", 1) ("I", 1) ("is", 2) ("Pig", 1) ("more", 1)
  • 66. Hadoop MapReduce Components •  Map Phase –  Input Format –  Record Reader –  Mapper –  Combiner –  Partitioner •  Reduce Phase –  Shuffle –  Sort –  Reducer –  Output Format –  Record Writer
  • 67. Writable Interfaces

  public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  public interface WritableComparable<T> extends Writable, Comparable<T> {
  }

  •  BooleanWritable •  BytesWritable •  ByteWritable •  DoubleWritable •  FloatWritable •  IntWritable •  LongWritable •  NullWritable •  Text
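  When the built-in types are not enough, you implement the interface yourself. A hedged sketch of a custom key type (the PageView class and its fields are invented; the Writable plumbing is the standard pattern):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.WritableComparable;

  public class PageView implements WritableComparable<PageView> {
    private long timestamp;   // fields are serialized/deserialized in the same order
    private int userId;

    public PageView() { }     // the framework requires a no-arg constructor

    public void write(DataOutput out) throws IOException {
      out.writeLong(timestamp);
      out.writeInt(userId);
    }

    public void readFields(DataInput in) throws IOException {
      timestamp = in.readLong();
      userId = in.readInt();
    }

    public int compareTo(PageView other) {
      return Long.compare(timestamp, other.timestamp);  // sort order used in the shuffle
    }
  }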
  • 68. InputFormat

  public abstract class InputFormat<K, V> {

    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException, InterruptedException;
  }
  • 69. RecordReader

  public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

    public abstract void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException;

    public abstract boolean nextKeyValue() throws IOException, InterruptedException;

    public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;

    public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;

    public abstract float getProgress() throws IOException, InterruptedException;

    public abstract void close() throws IOException;
  }
  • 70. Mapper

  public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void setup(Context context) { /* NOTHING */ }
    protected void cleanup(Context context) { /* NOTHING */ }

    protected void map(KEYIN key, VALUEIN value, Context context)
        throws IOException, InterruptedException {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }

    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      while (context.nextKeyValue())
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      cleanup(context);
    }
  }
  • 71. Partitioner

  public abstract class Partitioner<KEY, VALUE> {

    public abstract int getPartition(KEY key, VALUE value, int numPartitions);

  }

  •  Default HashPartitioner uses key’s hashCode() % numPartitions
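  A hedged custom Partitioner (the first-letter scheme is invented for illustration; the extension point is the standard one):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      if (key.getLength() == 0) return 0;       // guard against empty keys
      char first = Character.toLowerCase(key.toString().charAt(0));
      return first % numPartitions;             // same first letter -> same reducer
    }
  }

  It is wired into a job with job.setPartitionerClass(FirstLetterPartitioner.class).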
  • 72. Reducer

  public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void setup(Context context) { /* NOTHING */ }
    protected void cleanup(Context context) { /* NOTHING */ }

    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
        throws IOException, InterruptedException {
      for (VALUEIN value : values)
        context.write((KEYOUT) key, (VALUEOUT) value);
    }

    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      while (context.nextKey())
        reduce(context.getCurrentKey(), context.getValues(), context);
      cleanup(context);
    }
  }
  • 73. OutputFormat

  public abstract class OutputFormat<K, V> {

    public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
        throws IOException, InterruptedException;

    public abstract void checkOutputSpecs(JobContext context)
        throws IOException, InterruptedException;

    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
        throws IOException, InterruptedException;
  }
  • 74. RecordWriter

  public abstract class RecordWriter<K, V> {

    public abstract void write(K key, V value)
        throws IOException, InterruptedException;

    public abstract void close(TaskAttemptContext context)
        throws IOException, InterruptedException;
  }
  • 75. Some M/R Concepts / Knobs •  Configuration –  {hdfs,yarn,mapred}-default.xml – default config (contains both service & client config) –  {hdfs,yarn,mapred}-site.xml – service config used for cluster-specific over-rides –  {hdfs,yarn,mapred}-client.xml – client-specific config •  Input/Output Formats –  TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat –  Pluggable input/output formats give jobs the ability to read/write data in different formats –  Major functions: •  getSplits •  RecordReader •  Schedulers –  Pluggable resource scheduler used by the ResourceManager –  Default, Capacity Scheduler & Fair Scheduler •  Combiner –  Combines individual map output before sending it to the reducers –  Lowers intermediate data volume •  Partitioner –  Pluggable class to partition the map output among the reducers
  • 76. Some M/R Knobs •  Compression –  Enable compression of Map/Reduce output –  Gzip, lzo, bz2 codecs available with the framework •  Counters –  Ability to keep track of various job statistics, e.g. number of bytes read/written –  Available for each task and also aggregated per job –  A job can write its own custom counters •  Speculative Execution –  Provides task recovery against hardware issues •  Distributed cache –  Ability to make job-specific data available to each task •  Tool – M/R application helper classes; support the ability for a job to accept generic options, e.g. –  -conf <configuration file> specify an application configuration file –  -D <property=value> use value for given property –  -fs <local|namenode:port> specify a namenode –  -jt <local|jobtracker:port> specify a job tracker –  -files <comma separated list of files> specify files to be copied to the map reduce cluster –  -libjars <comma separated list of jars> specify jar files to include in the classpath –  -archives <comma separated list of archives> specify archives to be unarchived on the compute machines
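  The generic options above come for free when the driver implements the Tool interface. A minimal, hedged skeleton (the class name is invented; Tool, Configured, and ToolRunner are the standard helpers):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class WordCountTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
      Configuration conf = getConf();  // already populated from -conf, -D, etc.
      // ... build and submit the Job here ...
      return 0;
    }

    public static void main(String[] args) throws Exception {
      // ToolRunner parses the generic options before calling run()
      System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
    }
  }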
  • 77. Word Count Example
  • 78. Problem •  Count the number of times each word is used in a body of text •  Uses TextInputFormat and TextOutputFormat

  map(byte_offset, line):
    foreach word in line:
      emit(word, 1)

  reduce(word, counts):
    sum = 0
    foreach count in counts:
      sum += count
    emit(word, sum)
  • 79. Word Count Example
  • 80. Mapper Code

  public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);

      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, ONE);
      }
    }
  }
  • 81. Shuffle and Sort [diagram: Mappers 0–3 and Reducers 0–3 exchanging partitions P0–P3] 1. Mapper outputs to a single logically partitioned file 2. Reducers copy their parts 3. Each reducer merges its partitions, sorting by key
  • 82. Reducer Code

  public class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outvalue = new IntWritable();
    private int sum = 0;

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      outvalue.set(sum);
      context.write(key, outvalue);
    }
  }
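  To run the mapper and reducer above as a job, a hedged driver sketch (WordCountDriver is an invented name; the Job API calls are standard in Hadoop 2.x):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCountDriver.class);

      job.setMapperClass(WordMapper.class);
      job.setCombinerClass(IntSumReducer.class);  // combiner cuts intermediate data
      job.setReducerClass(IntSumReducer.class);

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }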
  • 83. So what’s so hard about it? [diagram: MapReduce – that’s a tiny box – next to all the problems you’ll ever have, ever]
  • 84. So what’s so hard about it? •  The MapReduce model itself is a limitation •  Entirely different way of thinking •  Simple processing operations such as joins are not so easy when expressed in MapReduce •  Proper implementation is not so easy •  Lots of configuration and implementation details for optimal performance –  Number of reduce tasks, data skew, JVM size, garbage collection
  • 85. So what does this mean for you? •  Hadoop is written primarily in Java •  Components are extendable and configurable •  Custom I/O through Input and Output Formats –  Parse custom data formats –  Read and write using external systems •  Higher-level tools enable rapid development of big data analysis
  • 86. Resources, Wrap-up, etc. •  http://hadoop.apache.org •  Very supportive community •  Plenty of resources available to learn more –  Blogs –  Email lists –  Books –  Shameless Plug -- MapReduce Design Patterns
  • 87. Getting Started •  Pivotal HD Single-Node VM and Community Edition –  http://gopivotal.com/pivotal-products/data/pivotal-hd •  For the brave and bold -- Roll-your-own! –  http://hadoop.apache.org/docs/current
  • 88. Acknowledgements •  Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation •  Cloudera Impala is a trademark of Cloudera •  Parquet is copyright Twitter, Cloudera, and other contributors •  Storm is licensed under the Eclipse Public License
  • 89. Learn More. Stay Connected. •  Talk to us on Twitter: @mewzherder (Tamao, not me) •  Sign up for more Hadoop –  http://bit.ly/POSH0018 •  Pivotal Education –  http://www.gopivotal.com/training
  • 90. Questions??