2. Who am I
• Joe Alex
– Software Architect / Data Scientist
Loves to code in Java, Scala
– Areas of Interest: Big Data, Data Analytics,
Machine Learning, Hadoop, Cassandra
– Currently working as Team Lead for Managed
Security Services Portal at Verizon
3. New kind of data
• Social - messages, posts, blogs, photos, videos,
maps, graphs, friends
• Machine – sensors, firewalls, routers, logs,
metrics, health monitoring, cell phones, credit
card transactions
4. New kind of data
• Volume – massive: terabytes to petabytes
– Convert 350 billion annual meter readings to better predict power
consumption
– Turn 12 terabytes of Tweets created each day into improved product
sentiment analysis
• Types – structured, semi/un-structured
– Text, audio, video, click streams, logs, machine
– Monitor hundreds of live video feeds from surveillance cameras to target points of
interest
• Velocity (time sensitive)
– ideally processed as it is streaming (realtime, near-realtime, batch)
– Scrutinize 5 million trade events created each day to identify potential fraud
– Analyze 500 million daily call detail records in real-time to predict customer
churn faster
5. What is Big Data about
• We are drowning in a sea of data; sometimes
we throw a lot of it away
• Still, we can't make much sense of it
• We consider data as a cost
• But Data is an opportunity
• This is what Big Data is about
– New Insights
– New Business
6. Big Data Domains
• Digital marketing
• Data discovery – patterns, trends
• Fraud detection
• Machine generated Data Analytics
– Remote device insight, sensing, location based
intel
• Social
• Data retention
7. Big Data Architecture
• Traditional
– High Availability, RDBMS, Structured data
• Big Data
– High scalability/availability/flexibility,
Compute/Storage on same nodes,
Structured/Semi/Un-Structured data
8. How to tackle Big Data
• Layered Architecture
– Speed Layer
– Batch layer
9. Apache Hadoop
• Open source project under Apache Software Foundation
• Based on papers published by Google
– MapReduce:
http://research.google.com/archive/mapreduce.html
– GFS:
http://research.google.com/archive/gfs.html
10. Reliability
• "Failure is the defining difference between
distributed and local programming"
– Ken Arnold, CORBA designer
11. Why Hadoop
• Data processed by Google every month: 400 PB… in
2007
• Average job size: 180 GB
• Time 180 GB of data would take to read sequentially off
a single disk drive: ~45 minutes
• Solution: parallel reads
– 1 HDD = 75 MB/sec
– 1,000 HDDs = 75 GB/sec (far more acceptable)
• Data access speed is the bottleneck: we can process
data very quickly, but we can only read/write it very slowly
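As a sanity check, the arithmetic above can be reproduced in a few lines of plain Java. The figures (180 GB job, 75 MB/sec per drive) are taken from this slide; real hardware will vary, but the result is in the same ballpark as the slide's 45-minute estimate.

```java
// Back-of-envelope sketch of the numbers on this slide: reading 180 GB
// sequentially from one disk vs. spreading the read across 1,000 disks.
public class ReadTime {
    // Seconds to read `gigabytes` of data at `mbPerSec` MB/s per drive,
    // spread evenly across `drives` drives.
    static long secondsToRead(long gigabytes, long mbPerSec, long drives) {
        long totalMb = gigabytes * 1024;
        return totalMb / (mbPerSec * drives);
    }

    public static void main(String[] args) {
        long single = secondsToRead(180, 75, 1);      // one drive
        long cluster = secondsToRead(180, 75, 1000);  // a thousand drives
        System.out.println("1 drive: " + single / 60 + " min");   // ~40 min
        System.out.println("1,000 drives: " + cluster + " sec");  // ~2 sec
    }
}
```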
12. Core Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• There are many other projects based around core
Hadoop
– Often referred to as the “Hadoop Ecosystem”
– Pig, Hive, HBase, Flume, Oozie, Sqoop etc
• A set of machines running HDFS and MapReduce is
known as a Hadoop Cluster
• Individual machines are known as nodes. A cluster can
have as few as one node or as many as several thousand
• More nodes = better performance
13. System Requirements
• System should support partial failure
– Failure of one part of the system should result in a
graceful decline in performance, not a full halt
• System should support data recoverability
– If components fail, their workload should be picked up
by still functioning units
• System should support individual recoverability
– Nodes that fail and restart should be able to rejoin the
group activity without a full group restart
14. System Requirements (cont’d)
• System should be consistent
– Concurrent operations or partial internal failures
should not cause the results of the job to change
• System should be scalable
– Adding increased load to a system should not
cause outright failure; instead, it should result in a
graceful decline
• Increasing resources should support a
proportional increase in load capacity
15. Hadoop’s radical approach
• Hadoop provides a radical approach to these issues:
– Nodes talk to each other as little as possible –
ideally, never
– This is known as a “shared nothing” architecture
– Programmers do not explicitly write code that
communicates between nodes
• Data is spread throughout machines in the cluster
– Data distribution happens when data is loaded on to the
cluster
• Instead of bringing data to the processors, Hadoop
brings the processing to the data
16. Hadoop’s radical approach
• Batch Oriented
• Data Locality (code is shipped around)
• Heavy Parallelization
• Process Management
• Append-only files
• Express your computation in MapReduce, and get
parallelism and scalability for free
17. Core Hadoop Daemons
• Each node in a Hadoop installation runs one or
more daemons executing MapReduce code or
HDFS commands. Each daemon’s responsibilities
in the cluster are:
– NameNode: manages HDFS and communicates with
every DataNode daemon in the cluster
– JobTracker: dispatches jobs and assigns splits to
mappers or reducers as each stage completes
– TaskTracker: executes tasks sent by the JobTracker and
reports status
– DataNode: Manages HDFS content in the node and
updates status to the NameNode
18. Config files
• hadoop-env.sh — environmental configuration,
JVM configuration etc
• core-site.xml — site wide configuration
• hdfs-site.xml — HDFS block size, Name and Data
node directories
• mapred-site.xml — total MapReduce tasks,
JobTracker address
• masters, slaves files — NameNode, JobTracker,
DataNodes, and TaskTrackers addresses, as
appropriate
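For illustration, a minimal hdfs-site.xml might look like the sketch below. The property names are standard for classic (pre-YARN) Hadoop; the values are examples only, not recommendations.

```xml
<!-- Illustrative hdfs-site.xml sketch (classic Hadoop); values are examples. -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64 MB HDFS block size (the default) -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- three replicas of each block -->
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/hadoop/name</value> <!-- NameNode metadata directory -->
  </property>
</configuration>
```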
19. HDFS: Hadoop Distributed File System
• Based on Google’s GFS (Google File System)
• Provides redundant storage of massive
amounts of data
– Using cheap, unreliable computers
• At load time, data is distributed across all
nodes
– Provides for efficient MapReduce processing
20. HDFS Assumptions
• High component failure rates
– Inexpensive components fail all the time
• “Modest” number of HUGE files
– Just a few million
– Each file likely to be 100MB or larger
– Multi-Gigabyte files typical
• Large streaming reads
– Not random access
• High sustained throughput should be favored
over low latency
21. HDFS Features
• Operates ‘on top of’ an existing filesystem
• Files are stored as ‘blocks’
– Much larger than for most filesystems
– Default is 64MB
• Provides reliability through replication
– Each block is replicated across three or more DataNodes
• Single NameNode stores metadata and co-ordinates access
– Provides simple, centralized management
• No data caching
– Would provide little benefit due to large datasets, streaming reads
• Familiar interface, but a customized API
– Simplifies the problem and focuses on distributed applications
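The block and replication figures above imply some simple storage math. The sketch below works it through in plain Java, assuming the stated defaults (64 MB blocks, replication factor 3); the class and method names are illustrative, not Hadoop APIs.

```java
// Sketch of HDFS block accounting under the defaults named on this slide:
// 64 MB blocks, each replicated 3 times across DataNodes.
public class HdfsBlocks {
    static final long BLOCK_MB = 64;
    static final int REPLICATION = 3;

    // Number of 64 MB blocks needed to hold a file of `fileMb` megabytes.
    static long blocksFor(long fileMb) {
        return (fileMb + BLOCK_MB - 1) / BLOCK_MB; // round up
    }

    // Raw cluster storage consumed, counting all replicas.
    static long rawStorageMb(long fileMb) {
        return blocksFor(fileMb) * BLOCK_MB * REPLICATION;
    }

    public static void main(String[] args) {
        System.out.println(blocksFor(1000));     // a ~1 GB file -> 16 blocks
        System.out.println(rawStorageMb(1000));  // -> 3072 MB of raw storage
    }
}
```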
23. MapReduce
• MapReduce is a method for distributing a task
across multiple nodes in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
– Between the two is a stage known as the shuffle and
sort
• Each Map task operates on a discrete portion of
the overall dataset. Typically one HDFS block of
data
• After all Maps are complete, the MapReduce
system distributes the intermediate data to nodes
which perform the Reduce phase
24. Features of MapReduce
• Automatic parallelization and distribution
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in any scripting language using Hadoop
Streaming
– All of Hadoop is written in Java
• MapReduce abstracts all the “housekeeping” away
from the developer
– Developers can concentrate simply on writing the Map
and Reduce functions
25. MapReduce example
• Map
// assume input is a set of text files;
// k is a line offset, v is the line at that offset
let map(k, v) =
  for each word in v:
    emit(word, 1)
• Reduce
// k is a word; vals is a list of 1s
let reduce(k, vals) =
  emit(k, vals.length())
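The pseudocode above can be simulated end-to-end in plain Java, with no Hadoop involved. The class and method names below are illustrative; each method mirrors one stage of the Map → shuffle/sort → Reduce flow.

```java
import java.util.*;

// Plain-Java simulation of the word-count pseudocode on this slide.
public class WordCountSim {
    // Map phase: emit (word, 1) for every word on every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    out.add(Map.entry(word, 1));
        return out;
    }

    // Shuffle and sort: group the intermediate 1s by word, in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: the count for a word is the length of its list of 1s.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) -> counts.put(word, ones.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat", "the cat");
        System.out.println(reduce(shuffle(map(input)))); // {cat=2, sat=1, the=2}
    }
}
```

In a real cluster the three stages run on different nodes; here they simply run one after another in a single JVM.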
30. Streaming API
• Many organizations have developers skilled in
languages other than Java
– Perl, Ruby, Python, Etc
• The Streaming API allows developers to use any
language they wish to write Mappers and Reducers
– As long as the language can read from standard input and
write to standard output
• Advantages of the Streaming API:
– No need for non-Java coders to learn Java
– Fast development time
– Ability to use existing code libraries
31. Job Driver
• Job Driver
JobConf conf = new JobConf(WordCount.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WordMapper.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setReducerClass(SumReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
• Driver is submitted to the Hadoop cluster for processing, along with
the rest of the code in a .jar file.
32. Mapper
• The basic Java code implementation for the mapper has the
form:
public class WordMapper extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    // Split the line into words and emit (word, 1) for each one
    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        collector.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}
• The implementation itself uses standard Java text
manipulation tools; you can use regular expressions,
scanners, whatever is necessary.
33. Reducer
• Reducer
public class SumReducer extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    // Sum the 1s emitted by the mappers for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    collector.collect(key, new IntWritable(sum));
  }
}
• The reducer iterates over keys and values generated in the
previous step and sums up the occurrence of the word
34. Input/Output Formats
• Input Formats
– KeyValueTextInputFormat — Each line represents
a key and value delimited by a separator
– TextInputFormat — The key is the byte offset, the
value is the text itself for each line
– SequenceFileInputFormat — binary file format of serialized
key/value pairs
• Output Formats
– Specify the format of the job’s final output, e.g.
TextOutputFormat or SequenceFileOutputFormat
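As a sketch of how the two text-based input formats above carve records out of a file, the plain-Java code below mimics their behavior (no Hadoop classes; method names are illustrative). It assumes a tab separator for the key/value format and single-byte characters for the offset math.

```java
import java.util.*;

// Plain-Java imitation of the record boundaries the two text input formats use.
public class InputFormats {
    // KeyValueTextInputFormat-style: key and value split on the first tab.
    static String[] keyValueSplit(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) return new String[] { line, "" };
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    // TextInputFormat-style: key is each line's byte offset, value is the line.
    static Map<Long, String> byOffset(String file) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : file.split("\n", -1)) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the newline (ASCII assumed)
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(keyValueSplit("apple\t3"))); // [apple, 3]
        System.out.println(byOffset("hi\nthere"));                      // {0=hi, 3=there}
    }
}
```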
35. Hadoop Eco System : Hive
• Hive
– SQL-based data warehousing app
– Data analysts are typically far more familiar with SQL than with Java
– Hive allows users to query data using HiveQL, a language
very similar to standard SQL
– Hive turns HiveQL queries into standard MapReduce jobs
– Automatically runs the jobs, and displays the results to the
user
– Note that Hive is not an RDBMS
• Results take many seconds, minutes, or even hours to be produced
• Not possible to modify the data using HiveQL
– Features for analyzing very large data sets
36. Hadoop Eco System: Pig
• Pig
– Data-flow oriented language
– Pig can be used as an alternative to writing
MapReduce jobs in Java (or some other language)
– Provides a scripting language known as Pig Latin
– Abstracts MapReduce details away from the user
– Made up of a set of operations that are applied to the
input data to produce output
– Fairly easy to write complex tasks such as joins of
multiple datasets
– Under the covers, Pig Latin scripts are converted to
MapReduce jobs
37. Hadoop Eco System: HBase
• HBase
– Distributed, sparse, column-oriented datastore
• Distributed: designed to use multiple machines to store and
serve data
• Sparse: each row may or may not have values for all columns
• Column-oriented: Data is stored grouped by column, rather
than by row. Columns are grouped into ‘column families’,
which define what columns are physically stored together
– Leverages HDFS
– Modeled after Google’s BigTable datastore
38. Hadoop Eco System: Others
• Flume
– Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced.
– Ideally suited to gathering logs from multiple systems and
inserting them into HDFS as they are generated
• Sqoop
– Sqoop is “the SQL-to-Hadoop database import tool”
– Designed to import data from RDBMS into Hadoop
– Can also send data the other way, from Hadoop to an RDBMS
– Uses JDBC to connect to the RDBMS
• Oozie
– Workflow scheduler for chaining and coordinating
Hadoop jobs
40. Next Gen
• Storm – distributed realtime computation.
Makes it easy to reliably process unbounded
streams of data, doing for realtime processing
what Hadoop did for batch processing
• Spark – Spark is an open source cluster
computing system that aims to make data
analytics fast.
• Impala – low-latency, near real-time SQL queries on
data stored in HDFS and HBase