Hadoop and BigData - July 2016

Ranjith Sekar, Software Engineer
Agenda
 What is BigData and Hadoop?
 Hadoop Architecture
 HDFS
 MapReduce
 Installing Hadoop
 Develop & Run a MapReduce Program
 Hadoop Ecosystem
Introduction
Data
 Structured
 Relational DB
 Library catalogues (date, author, place, subject, etc.)
 Semi Structured
 CSV, XML, JSON, NoSQL database
 Unstructured
Unstructured Data
 Machine Generated
 Satellite images
 Scientific data
 Photographs and video
 Radar or sonar data
 Human Generated
 Word, PDF, Text
 Social media data (Facebook, Twitter, LinkedIn)
 Mobile data (text messages)
 Website content (blogs, Instagram)
Storage
Key Terms
 Commodity Hardware – inexpensive, off-the-shelf servers/PCs that can be used to form clusters.
 Node – a single commodity server; nodes are interconnected through network devices.
 NameNode = Master Node, DataNode = Slave Node.
 Cluster – the interconnection of many nodes/systems in a network.
BigData
BigData
 Traditional approaches are no longer a good fit for analysis because data volumes keep inflating.
 Handling large volumes of data (petabytes and even zettabytes), structured or unstructured.
 Datasets that grow so large that they are difficult to capture, store, manage, share, analyze and visualize with typical database software tools.
 Generated by many sources around us, such as systems, sensors and mobile devices.
 2.5 quintillion bytes of data are created every day.
 80-90% of the data in the world today has been created in the last two years alone.
Flood of Data
 More than 3 billion internet users in the world today.
 The New York Stock Exchange generates about 4-5 TB of data per day.
 7TB of data are processed by Twitter every day.
 10TB of data are processed by Facebook every day and growing at 7 PB per month.
 Interestingly 80% of these data are unstructured.
 With this massive quantity of data, businesses need fast, reliable, deeper data insight.
 Therefore, BigData solutions based on Hadoop and other analytics software are
becoming more and more relevant.
Dimensions of BigData
Volume – Big data comes in one size: large. Enterprises are awash with data, easily
amassing terabytes and even petabytes of information.
Velocity – Often time-sensitive, big data must be used as it is streaming in to the
enterprise in order to maximize its value to the business.
Variety – Big data extends beyond structured data, including unstructured data of all
varieties: text, audio, video, click streams, log files and more.
BigData Benefits
 Analyze markets and derive new strategies to improve business in different geographic locations.
 Measure the response to campaigns, promotions, and other advertising media.
 Use patients' medical history to help hospitals provide better and quicker service.
 Re-develop your products.
 Perform risk analysis.
 Create new revenue streams.
 Reduce maintenance costs.
 Faster, better decision making.
 New products & services.
Hadoop
Hadoop
 Inspired by the Google File System paper (2003).
 Created by Doug Cutting, who was working at Yahoo!.
 Hadoop 0.1.0 was released in April 2006.
 Open source project of the Apache Software Foundation.
 A Framework written in Java.
 Distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware.
 Naming: Hadoop is named after the toy elephant of Doug Cutting's son.
Hardware & Software
 Hardware (commodity hardware)
 Software
 OS
 RedHat Enterprise Linux (RHEL)
 CentOS
 Ubuntu
 Java
 Oracle JDK 1.6 (v 1.6.31)
          Medium                   High
CPU       8 physical cores         12 physical cores
Memory    16 GB                    48 GB
Disk      4 disks x 1 TB = 4 TB    12 disks x 3 TB = 36 TB
Network   1 Gbit Ethernet          10 Gbit Ethernet or InfiniBand
When Hadoop?
 When you must process lots of unstructured data.
 When your processing can easily be made parallel.
 When running batch jobs is acceptable.
 When you have access to lots of cheap hardware.
Hadoop Distributions
http://www.cloudera.com/downloads/
http://hortonworks.com/downloads/
https://www.mapr.com/products/hadoop-download
http://pivotal.io/big-data/pivotal-hdb
http://www.ibm.com/developerworks/downloads/im/biginsightsquick/
Hadoop Architecture
Hadoop Core Components
Hadoop Configurations
 Standalone Mode
 All Hadoop services run in a single JVM on a single machine.
 Pseudo-Distributed Mode
 Each Hadoop service runs in its own JVM, but all JVMs are on a single machine.
 Fully Distributed Mode
 Each Hadoop service runs in its own JVM, and the JVMs are spread across separate machines in a single cluster (the configuration values that select each mode are sketched below).
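Roughly speaking, for the 0.20/1.x releases covered in this deck, the mode is selected by a couple of configuration values rather than by a different install. The host names in the last column are placeholders for illustration, not part of the original deck:

                     Standalone           Pseudo-Distributed    Fully Distributed
fs.default.name      file:/// (default)   hdfs://localhost/     hdfs://namenode-host/
mapred.job.tracker   local (default)      localhost:8021        jobtracker-host:8021

The pseudo-distributed column is exactly the configuration shown later in this deck.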
Hadoop Core Services
 NameNode
 Secondary NameNode
 DataNode
 ResourceManager
 ApplicationMaster
 NodeManager
How does Hadoop work?
 Stage 1
 The user submits the job, giving the locations of the input and output files in HDFS and the JAR file of the MapReduce program.
 The job is configured by setting parameters specific to the job.
 Stage 2
 The Hadoop job client submits the job and its configuration to the JobTracker.
 The JobTracker distributes the work to TaskTrackers running on the slave nodes.
 The JobTracker schedules the tasks and monitors them, providing status and diagnostic information to the job client.
 Stage 3
 The TaskTrackers execute the tasks as per the MapReduce implementation.
 The input is processed and the output is stored in HDFS.
Hadoop Cluster
HDFS
Hadoop Distributed File System (HDFS)
 Java-based file system to store large volumes of data.
 Scalability of up to 200 PB of storage and a single cluster of 4,500 servers.
 Supporting close to a billion files and blocks.
 Access
 Java API (a minimal sketch follows below)
 Python/C for non-Java applications
 Web GUI through HTTP
 FS shell – shell-like commands that directly interact with HDFS
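To make the Java API bullet concrete, below is a minimal sketch (not from the original deck) that writes and reads a small file through org.apache.hadoop.fs.FileSystem; the path is an illustrative assumption, and the cluster location comes from the configuration files set up later in this deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // HDFS when fs.default.name is hdfs://...
        Path path = new Path("/user/ranjith/example.txt");   // illustrative path
        FSDataOutputStream out = fs.create(path);            // write a small file
        out.writeUTF("Hello HDFS");
        out.close();
        FSDataInputStream in = fs.open(path);                // read it back
        System.out.println(in.readUTF());
        in.close();
        fs.close();
    }
}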
HDFS Features
 HDFS can handle large data sets.
 Since HDFS deals with large scale data, it supports a multitude of machines.
 HDFS provides a write-once-read-many access model.
 HDFS is built using the Java language making it portable across various platforms.
 Fault Tolerance and availability are high.
HDFS Architecture
File Storage in HDFS
 Files are split into multiple blocks/chunks and stored on different machines.
 Blocks – 64 MB (default), 128 MB (recommended).
 Replication – provides fault tolerance and availability; the replication factor is configurable and can be modified.
 No storage space is wasted. E.g. a 420 MB file is stored as six full 64 MB blocks plus one 36 MB block; the final block occupies only 36 MB on disk.
NameNode
 One per Hadoop cluster; acts as the master server.
 Runs on commodity hardware with a Linux operating system.
 The NameNode software itself runs on that commodity hardware.
 Responsible for
 Manages the file system namespace.
 Regulates clients’ access to files.
 Executes file system operations such as renaming, closing, and opening files and directories.
Secondary NameNode
 The NameNode keeps the file system metadata (namespace and block details) in RAM.
 The Secondary NameNode contacts the NameNode periodically and takes a copy of that metadata out of the NameNode.
 If the NameNode crashes, the metadata can be restored from the Secondary NameNode’s copy.
DataNode
 Many per Hadoop Cluster.
 Uses inexpensive commodity hardware.
 Contains actual data.
 Performs read/write operations on files based on client requests.
 Performs block creation, deletion, and replication according to the instructions of the
NameNode.
HDFS Command Line Interface
 View existing files (ls)
 Copy files from local (copyFromLocal / put)
 Copy files to local (copyToLocal / get)
 Reset replication (setrep) – example commands for these operations follow below.
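A few illustrative commands for the operations above; the paths and the replication factor are assumptions for the example (later slides in this deck use the older hadoop dfs form, which behaves the same way):
 > hadoop fs -ls /user/ranjith
 > hadoop fs -copyFromLocal notes.txt /user/ranjith/notes.txt (equivalent to -put)
 > hadoop fs -copyToLocal /user/ranjith/notes.txt notes-copy.txt (equivalent to -get)
 > hadoop fs -setrep -w 2 /user/ranjith/notes.txt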
HDFS Operation Principle
MapReduce
MapReduce
 Heart of Hadoop.
 Programming model/Algorithm for data processing.
 Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
 MapReduce programs are inherently parallel.
 Master-Slave Model.
 Mapper
 Performs filtering and sorting.
 Reducer
 Performs a summary operation (a worked word-count flow is shown below).
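As a concrete illustration of the two roles, here is the simplified flow for one line of the word-count example developed later in this deck:
Input line:      "Hello World Bye World"
Map output:      (Hello, 1) (World, 1) (Bye, 1) (World, 1)
Shuffle & sort:  (Bye, [1]) (Hello, [1]) (World, [1, 1])
Reduce output:   (Bye, 1) (Hello, 1) (World, 2)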
MapReduce Architecture
Job Tracker
 One per Hadoop Cluster.
 Controls overall execution of MapReduce Program.
 Manages the Task Tracker running on Data Node.
 Tracking of available & utilized resources.
 Tracks the running jobs and provides fault tolerance.
 Receives periodic heartbeats from every TaskTracker (typically every few seconds).
Task Tracker
 Many per Hadoop Cluster.
 Executes and manages the individual tasks assigned by Job Tracker.
 Periodic status to the JobTracker about the execution of the Job.
 Handles the data motion between map() and reduce().
 Notifies JobTracker if any task failed.
MapReduce Engine
Hadoop Installation
Installing Hadoop
 Prerequisites
 Installation
 Download : http://hadoop.apache.org/releases.html
 > tar xzf hadoop-x.y.z.tar.gz
 > export JAVA_HOME=/user/software/java6/
 > export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
 > export PATH=$PATH:$HADOOP_INSTALL/bin
 > hadoop version
Hadoop 0.20.0
Pseudo-Distributed Mode Configuration
core-site.xml:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
 Formatting HDFS
 > hadoop namenode -format
 Start HDFS & MapReduce
 > start-dfs.sh
 > start-mapred.sh
 Stop HDFS & MapReduce
 > stop-dfs.sh
 > stop-mapred.sh
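Assuming the JDK's jps tool is on the PATH, a quick sanity check that the pseudo-distributed daemons came up is:
 > jps
Each daemon should appear on its own line (a process id followed by NameNode, SecondaryNameNode, DataNode, JobTracker or TaskTracker).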
Develop &
Run a MapReduce Program
Mapper
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Reducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // In the new (org.apache.hadoop.mapreduce) API the values arrive as an Iterable,
    // not an Iterator; with Iterator the method would not override reduce() and the
    // counts would never be summed.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Main Program

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
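A small optional tweak, not part of the original deck: because this reducer simply sums counts, the same class can also be registered as a combiner, so partial sums are computed on the map side and less data is shuffled between map and reduce:

job.setCombinerClass(WordCountReducer.class);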
Input Data
$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/
/user/ranjith/mapreduce/input/file01
/user/ranjith/mapreduce/input/file02
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02
Hello Hadoop Goodbye Hadoop
Run
 Create the jar WordCount.jar
 Run Command
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input/ /user/ranjith/mapreduce/output
 Output
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
 Link : http://javabyranjith.blogspot.in/2015/10/hadoop-word-count-example-with-maven.html
Hadoop Ecosystem
Hadoop Ecosystem
 HDFS & MapReduce
 Ambari - provisioning, managing, and monitoring Apache Hadoop clusters.
 Pig – high-level scripting language (Pig Latin) whose scripts compile into MapReduce programs.
 Mahout – scalable, commercial-friendly machine learning for building intelligent applications.
 Hive – data warehouse providing an SQL-like query language (HiveQL) over HDFS data, backed by a metastore.
 HBase – open-source, non-relational, distributed database built on top of HDFS.
 Sqoop – CLI application for transferring data between relational databases and Hadoop.
 ZooKeeper – distributed configuration service, synchronization service, and naming registry for large distributed systems.
 Oozie – workflow scheduler to define and manage Hadoop job workflows.
Queries ?
 http://www.slideshare.net/java2ranjith
 java2ranjith@gmail.com