Hadoop Eco System
Presented by : Sreenu Musham
27th March, 2015
Sreenu Musham
Data Warehouse Architect
sreenu.musham@yahoo.com
Hadoop Introduction
Agenda
Big Data and Its Challenges (Recap)
Hadoop and Its Evolution
Terminology Used
HDFS
MapReduce
Hadoop Eco System
Hadoop Distributions
Feel of Hadoop (how it looks)
Big Data and Challenges
 1024 KB = 1 MB
 1024 MB = 1 GB
 1024 GB = 1 TB
 1024 TB = 1 Petabyte
 And so on: Exabytes, Zettabytes, Yottabytes, Brontobytes, Geopbytes
Big Data and Challenges
Disk I/O, network, and processing bottlenecks as data grows
Storage is costly on enterprise-class machines
Vertical scaling is not always the right solution
Reliability
Handling unstructured data
Schema-less data
Disk vs Transfer Rate
[Chart: disk capacity has grown faster than transfer rate, so the time to read a whole disk has stretched from about 126 seconds, to 58 minutes, to 4 hours across disk generations.]
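A quick back-of-the-envelope check (illustrative numbers, not from the slide): a 1 TB disk streaming at 100 MB/s needs 1,000,000 MB / 100 MB/s = 10,000 seconds, roughly 2.8 hours, for a full scan; one hundred such disks read in parallel finish in under two minutes. Reading many disks in parallel is exactly the idea Hadoop exploits.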
What is Hadoop?
 An Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data
 A framework in which a job is divided among the nodes and processed in parallel
 Hides underlying system details and complexities from the user
 Developed in Java
 A set of machines running HDFS and MapReduce is known as a Hadoop cluster
 Core components: HDFS and MapReduce
Hadoop is not for all types of work:
• Processing transactions
• Low-latency data access
• Lots of small files
• Intensive calculations with little data
 Hadoop initiated new kinds of analysis:
• Not just old thinking on bigger data
• Iterate over whole data sets, not only sample sets
• Use multiple data sources (beyond structured data)
• Work with schema-less data
Warehouse Themes
Story Behind Hadoop
Searching in the 1990s, and Google's victory in 2000
Story Behind Hadoop
2003: Google publishes its paper on GFS
2004: Google publishes its paper on MapReduce
2005: Created by Doug Cutting and Michael Cafarella (Yahoo)
2006: Named "Hadoop" by Doug Cutting; Yahoo donates the project to Apache
2008: Wins the terabyte sort benchmark
2009: Launches Hive
2010: Runs a 4000-node Hadoop cluster
Node-Rack-Cluster
[Diagram: a Hadoop cluster is made up of racks (Rack 1, Rack 2, ... Rack N), each containing many nodes (Node 1, Node 2, ... Node N); spreading data and work across nodes and racks provides high availability (HA).]
Computation Method
[Diagram: the traditional approach ships data from the data source to a central server, paying heavily in I/O and processing time; a Hadoop cluster instead ships the computation to the nodes that already hold the data.]
What is HDFS?
HDFS runs on top of the existing file system
Uses blocks to store a file, or parts of a file
Stores data across multiple nodes
The size of a file can be larger than any single disk in the network
The default block size is 64 MB
The main objectives:
Storing very large files
Streaming data access
Commodity hardware
Allow access to data on any node in the cluster
Able to handle hardware failures
What is HDFS?
If a chunk of the file is smaller than the HDFS block size, only the needed space is used
Example: a 300 MB file with 64 MB blocks occupies four full 64 MB blocks plus one final block holding only 44 MB; that last block consumes 44 MB of disk, not 64 MB
HDFS blocks are replicated to several nodes for reliability
Maintains checksums of data for corruption detection and recovery
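To see these settings on a live cluster, here is a minimal Java sketch (the class name and root path are illustrative, and the Path-taking calls assume Hadoop 2.x or later) that reads the default block size and replication factor through Hadoop's FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: prints the cluster's default block size and
// replication factor (e.g. 64 MB and 3 on a classic Hadoop setup).
public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path root = new Path("/");  // defaults may vary per path
        System.out.println("Block size (bytes): " + fs.getDefaultBlockSize(root));
        System.out.println("Replication       : " + fs.getDefaultReplication(root));
    }
}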
What is MapReduce?
An algorithm/programming model for processing the data in the Hadoop cluster
Consists of two phases: Map, and then Reduce
Each Map task operates on a discrete portion of the overall dataset
• Typically one HDFS block of data
After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase (a complete Java example follows the word-count walkthrough below)
The Hadoop framework parallelizes the computation and handles failures, communication, and performance issues
Sample Data Flow
MAP:
- Take a large problem and divide it into sub-problems
- Break the data set down into small chunks
- Perform the same function on all sub-tasks
REDUCE:
- Combine the output from all sub-tasks
It works like a Unix pipeline:
cat input | grep <pattern> | sort | uniq -c > output
Input | Map | Shuffle & Sort | Reduce | Output
Hadoop Architecture
[Diagram: a master/slave, shared-nothing architecture. HDFS layer: one Name Node (master) and many Data Nodes (slaves). MapReduce layer: one Job Tracker (master) and a Task Tracker (slave) on each node. The design is scalable: capacity grows by adding slave nodes.]
Sample Data Flow: Word Count
Input lines: Air, Box, Car, Box, Do, Air
map input: (0, Air), (1, Box), (2, Car), (3, Box), (4, Do), (5, Air)
map output: (Air, 1), (Box, 1), (Car, 1), (Box, 1), (Do, 1), (Air, 1)
Shuffle & Sort: (Air, [1, 1]), (Box, [1, 1]), (Car, [1]), (Do, [1])
Reduce output: (Air, 2), (Box, 2), (Car, 1), (Do, 1)
In general: <k1, v1> → list(<k2, v2>) → <k2, list(v2)> → list(<k3, v3>)
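The same flow can be expressed in code. Below is the classic WordCount job in Java, essentially the standard Apache Hadoop example (lightly commented; not part of the original slides): the Mapper emits a (word, 1) pair per token, the framework shuffles and sorts, and the Reducer sums each word's list of counts.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the list of 1s per word, e.g. (Air, [1, 1]) -> (Air, 2).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it runs the way the later "Accessing Hadoop" slide shows: hadoop jar wordcount.jar WordCount in_dir out_dir.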
Sample Data Flow
map output: (1950, 0), (1950, 22), (1950, −11), (1949, 111), (1949, 78)
Shuffle & Sort: (1949, [111, 78]), (1950, [0, 22, −11])
Reduce output (maximum temperature per year): (1949, 111), (1950, 22)
In general: <k1, v1> → list(<k2, v2>) → <k2, list(v2)> → list(<k3, v3>)
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(1, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(2, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(3, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(4, 0043012650999991949032418004...0500001N9+00781+99999999999...)
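A map function for this data might look like the following Java sketch, modeled on the well-known NCDC max-temperature example; the character offsets are assumptions about the full fixed-width record, since the lines above are elided with "...":

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a Mapper that extracts (year, temperature) from each line.
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;  // sentinel for "no reading"

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);   // e.g. "1950"
        int airTemperature;
        if (line.charAt(87) == '+') {           // signed temperature field
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

A matching reduce function would simply take the maximum of each year's list of values.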
MapReduce: on splits
How Hadoop runs a MapReduce Job
Reading Data from HDFS
Writing Data to HDFS
Replication of Blocks
[Diagram: an HDFS client writes file.txt to the Hadoop cluster. The Name Node records the file-to-block mapping (file.txt: B1, B2, B3), and each block is replicated to Data Nodes spread across Rack 1, Rack 2, and Rack 3, so that no single node or rack failure loses data.]
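Where each block's replicas actually landed can be checked through Hadoop's FileSystem API. A minimal Java sketch (the file path is illustrative) that prints one line per block with its offset, length, and replica hosts:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/sreenu/file.txt");  // illustrative path
        FileStatus status = fs.getFileStatus(file);
        // Ask the Name Node for the block list of the whole file
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " replicas=" + String.join(",", b.getHosts()));
        }
    }
}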
Hadoop Eco System
Hadoop Distributions
Hadoop Versions
Accessing Hadoop/HDFS
hadoop fs -ls <path>
hadoop fs -mkdir testsreenu
hadoop fs -copyFromLocal samplefile.txt (equivalently: hadoop fs -put samplefile.txt)
Running MapReduce:
hadoop jar newjob.jar samplefile in_dir out_dir
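The same file operations are available from Java; as a small sketch (paths are illustrative), the programmatic equivalent of -copyFromLocal is:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Java equivalent of: hadoop fs -copyFromLocal samplefile.txt /user/sreenu/
public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path("samplefile.txt"),               // local source
                             new Path("/user/sreenu/samplefile.txt")); // HDFS target
    }
}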
Hive
Pig
Pig code
a. Load customer records:
cust = LOAD '/input/custs' USING PigStorage(',') AS (custid:chararray, firstname:chararray, lastname:chararray, age:long, profession:chararray);
b. Select only 100 records:
amt = LIMIT cust 100;
DUMP amt;
c. Group customer records by profession:
groupbyprofession = GROUP cust BY profession;
DESCRIBE groupbyprofession;
Feel of Hadoop
Hadoop vs Other Systems

Computing Model:
- Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Data Model:
- Distributed databases: structured data with a known schema; read/write mode
- Hadoop: any data in any format (unstructured, semi-structured, structured); read-only mode

Cost Model:
- Distributed databases: expensive servers
- Hadoop: cheap commodity machines

Fault Tolerance:
- Distributed databases: failures are rare; recovery mechanisms
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance

Key Characteristics:
- Distributed databases: efficiency, optimizations, fine-tuning
- Hadoop: scalability, flexibility, fault tolerance
• Cloud Computing
• A computing model where any computing infrastructure can run on the cloud
• Hardware and software are provided as remote services
• Elastic: grows and shrinks based on the user's demand
• Example: Amazon EC2
Q & A
Hadoop Eco System
Presented By
Sreenu Musham