2. Agenda
What is BigData and Hadoop?
Hadoop Architecture
HDFS
MapReduce
Installing Hadoop
Develop & Run a MapReduce Program
Hadoop Ecosystems
5. Unstructured Data
Machine Generated
Satellite images
Scientific data
Photographs and video
Radar or sonar data
Human Generated
Word, PDF, Text
Social media data (Facebook, Twitter, LinkedIn)
Mobile data (text messages)
Website content (blogs, Instagram)
7. Key Terms
Commodity Hardware – inexpensive, standard PCs/servers that can be used to form clusters.
Node – a commodity server interconnected with others through a network device.
NameNode = Master Node, DataNode = Slave Node
Cluster – an interconnection of different nodes/systems in a network.
10. BigData
Traditional approaches are not a good fit for data analysis because of the explosive growth of data.
Handling large volumes of data (petabytes and even zettabytes), both structured and
unstructured.
Datasets that grow so large that it is difficult to capture, store, manage, share, analyze
and visualize with the typical database software tools.
Generated by different sources around us, such as systems, sensors, and mobile devices.
About 2.5 quintillion bytes of data are created every day.
80-90% of the data in the world today has been created in the last two years alone.
11. Flood of Data
More than 3 billion internet users in the world today.
The New York Stock Exchange generates about 4-5 TB of data per day.
7TB of data are processed by Twitter every day.
10TB of data are processed by Facebook every day and growing at 7 PB per month.
Interestingly, 80% of this data is unstructured.
With this massive quantity of data, businesses need fast, reliable, deeper data insight.
Therefore, BigData solutions based on Hadoop and other analytics software are
becoming more and more relevant.
12. Dimensions of BigData
Volume – Big data comes in one size: large. Enterprises are awash with data, easily
amassing terabytes and even petabytes of information.
Velocity – Often time-sensitive, big data must be used as it is streaming into the
enterprise in order to maximize its value to the business.
Variety – Big data extends beyond structured data, including unstructured data of all
varieties: text, audio, video, click streams, log files and more.
13. BigData Benefits
Analyze markets and derive new strategies to improve business in different geographic locations.
Gauge the response to campaigns, promotions, and other advertising media.
Use patients' medical history to help hospitals provide better and quicker service.
Re-develop your products.
Perform Risk Analysis.
Create new revenue streams.
Reduce maintenance costs.
Faster, better decision making.
New products & services.
16. Hadoop
Inspired by the Google File System paper (2003).
Developed by Doug Cutting (then at Yahoo!).
Hadoop 0.1.0 was released in April 2006.
Open source project of the Apache Software Foundation.
A Framework written in Java.
Distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware.
Named "Hadoop" after a toy elephant belonging to Doug Cutting's son.
17. Hardware & Software
Hardware (commodity hardware)
Software
OS
RedHat Enterprise Linux (RHEL)
CentOS
Ubuntu
Java
Oracle JDK 1.6 (v 1.6.31)
          Medium                  High
CPU       8 physical cores        12 physical cores
Memory    16 GB                   48 GB
Disk      4 disks x 1 TB = 4 TB   12 disks x 3 TB = 36 TB
Network   1 Gb Ethernet           10 Gb Ethernet or InfiniBand
18. When Hadoop?
When you must process lots of unstructured data.
When your processing can easily be made parallel.
When running batch jobs is acceptable.
When you have access to lots of cheap hardware.
22. Hadoop Configurations
Standalone Mode
All Hadoop services run in a single JVM on a single machine.
Pseudo-Distributed Mode
Each Hadoop service runs in its own JVM, but all on a single machine.
Fully Distributed Mode
Each Hadoop service runs in its own JVM, and the JVMs reside on separate machines that
form a single cluster (see the configuration sketch below).
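Which mode is active is purely a matter of configuration. The following is a minimal sketch, assuming a Hadoop 1.x-style pseudo-distributed setup; in practice these properties live in core-site.xml, hdfs-site.xml and mapred-site.xml rather than in code, and the localhost host/port values are only examples.
import org.apache.hadoop.conf.Configuration;

public class PseudoDistributedConfig {
    // Illustrative helper: returns a Configuration pointing all services at localhost,
    // which is what pseudo-distributed mode means.
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // use HDFS instead of the local file system
        conf.set("mapred.job.tracker", "localhost:9001");     // use a JobTracker instead of the local job runner
        conf.set("dfs.replication", "1");                     // only one DataNode, so keep a single replica
        return conf;
    }
}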
24. How does Hadoop work?
Stage 1
The user submits the job, specifying the locations of the input and output files in HDFS and
the JAR file containing the MapReduce program.
The job is configured by setting different parameters specific to the job.
Stage 2
The Hadoop job client submits the job and its configuration to the JobTracker.
The JobTracker distributes the work to the TaskTrackers running on the slave nodes.
The JobTracker schedules the tasks and monitors them, providing status and diagnostic
information to the job client.
Stage 3
The TaskTrackers execute the tasks as per the MapReduce implementation.
The input is processed and the output is stored in HDFS.
27. Hadoop Distributed File System (HDFS)
A Java-based file system for storing large volumes of data.
Scales to roughly 200 PB of storage in a single cluster of about 4,500 servers.
Supports close to a billion files and blocks.
Access
Java API (see the sketch after this list)
Python/C for Non-Java Applications
Web GUI through HTTP
FS Shell - shell-like commands that directly interact with HDFS
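As an illustration of the Java API route above, here is a minimal sketch (not part of the original slides) that copies a local file into HDFS and lists a directory; the paths are hypothetical, and the cluster settings are assumed to come from the standard configuration files on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system (HDFS here)

        // Copy a local file into HDFS (equivalent of "hadoop fs -put").
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/ranjith/sample.txt"));

        // List the files in an HDFS directory (equivalent of "hadoop fs -ls").
        for (FileStatus status : fs.listStatus(new Path("/user/ranjith"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}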
28. HDFS Features
HDFS can handle large data sets.
Since HDFS deals with large-scale data, it scales across a multitude of machines.
HDFS provides a write-once-read-many access model.
HDFS is built using the Java language, making it portable across various platforms.
Fault tolerance and availability are high.
30. File Storage in HDFS
Files are split into multiple blocks/chunks and stored on different machines (see the sketch below).
Blocks – 64 MB (default), 128 MB (recommended).
Replication – provides fault tolerance and availability; the replication factor is configurable and can be modified.
No storage space is wasted: e.g., with 128 MB blocks, a 420 MB file is stored as three full
128 MB blocks plus one 36 MB block, and the final block occupies only the 36 MB it needs.
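As a small illustration of blocks being spread across machines, the following sketch (with a hypothetical file path) asks the NameNode where each block of a file and its replicas are stored; it is not part of the original slides.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus file = fs.getFileStatus(new Path("/user/ranjith/bigfile.dat")); // hypothetical file
        // One BlockLocation per block; each lists the DataNodes holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(file, 0, file.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}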
31. NameNode
One per Hadoop cluster; acts as the master server.
Typically commodity hardware running a Linux operating system.
The NameNode software runs on this commodity hardware.
Responsible for
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and opening files and directories.
32. Secondary NameNode
The NameNode holds the file system metadata in RAM.
The Secondary NameNode contacts the NameNode periodically and copies the metadata
information out of the NameNode.
When the NameNode crashes, the metadata can be recovered from the Secondary NameNode's copy.
33. DataNode
Many per Hadoop Cluster.
Uses inexpensive commodity hardware.
Contains actual data.
Performs read/write operations on files based on client requests.
Performs block creation, deletion, and replication according to the instructions of the
NameNode.
34. HDFS Command Line Interface
View existing files (hadoop fs -ls)
Copy files from the local file system to HDFS (hadoop fs -copyFromLocal / -put)
Copy files from HDFS to the local file system (hadoop fs -copyToLocal / -get)
Reset the replication factor of a file (hadoop fs -setrep)
37. MapReduce
Heart of Hadoop.
Programming model/Algorithm for data processing.
Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
MapReduce programs are inherently parallel.
Master-Slave Model.
Mapper
Performs filtering and sorting.
Reducer
Performs a summary operation.
39. Job Tracker
One per Hadoop Cluster.
Controls overall execution of MapReduce Program.
Manages the TaskTrackers running on the DataNodes.
Tracks available and utilized resources.
Tracks the running jobs and provides fault tolerance.
Receives periodic heartbeats from each TaskTracker.
40. Task Tracker
Many per Hadoop Cluster.
Executes and manages the individual tasks assigned by Job Tracker.
Sends periodic status updates to the JobTracker about the execution of its tasks.
Handles the data motion between map() and reduce().
Notifies the JobTracker if any task fails.
46. Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line into tokens and emit (word, 1) for each token.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
47. Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word and emit (word, total).
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
48. Main Program
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory (must not already exist)
        job.waitForCompletion(true);
    }
}
49. Input Data
$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/
/user/ranjith/mapreduce/input/file01
/user/ranjith/mapreduce/input/file02
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02
Hello Hadoop Goodbye Hadoop
50. Run
Create the JAR file WordCount.jar
Run Command
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input/ /user/ranjith/mapreduce/output
Output
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Link : http://javabyranjith.blogspot.in/2015/10/hadoop-word-count-example-with-maven.html