Big Data & Hadoop

Ankan Banerjee
Photography Intern at India Indulge
Overview
• Big Data
• 3 Vs of Big Data
• Hadoop
• HDFS
• MapReduce
• Big Data Market Size
• Big Data in India
What is Big Data?
It is data that is created very fast and is too big to
be processed on a single machine. This data
comes from various sources in various formats.
Examples at different scales:
o Order details for a store
o All orders across hundreds of stores
o A person’s stock portfolio
o All stock transactions for a stock exchange
How the 3 Vs define Big Data
• Volume: large volumes of data
• Velocity: quickly moving data
• Variety: structured data, unstructured data, images, etc.
Volume
Volume is the size of the data, which determines the value and potential of the
data under consideration. The name ‘Big Data’ itself refers to size,
hence this characteristic.
Variety
Data today comes in all types of formats: structured data in traditional
databases; unstructured text documents, email, stock ticker data and
financial transactions; and semi-structured data too.
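The three kinds of data named under Variety can be illustrated with a minimal Python sketch. All sample values below (the orders, the user record, the sentence) are made up for illustration and do not come from the slides.

```python
import csv
import io
import json

# Structured: fixed columns, as in a relational table (made-up rows).
structured = "order_id,store,amount\n1,Pune,250.0\n2,Delhi,99.5\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing, but fields may vary per record.
semi_structured = '{"user": "a01", "tags": ["stocks", "nse"], "note": null}'
record = json.loads(semi_structured)

# Unstructured: free text with no schema; the most we can do is tokenize it.
unstructured = "Markets closed higher today as IT stocks rallied."
words = unstructured.lower().rstrip(".").split()

print(rows[0]["store"], record["tags"], len(words))
```

Each format needs a different parser before any analysis can start, which is one reason variety is treated as a challenge in its own right.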
Velocity
Velocity is the speed at which data is generated and how quickly it must be
processed to meet demand and support growth.
Why Big Data?
The real issue is not that you are acquiring large amounts of data. It's
what you do with the data that counts. The hopeful vision is that
organizations will be able to take data from any source, harness
relevant data and analyse it to find answers that enable:
  1) cost reductions
  2) time reductions
  3) new product development and optimized offerings
  4) smarter business decision making
What is Hadoop?
• Hadoop is a distributed file system and data processing engine that is
designed to handle extremely high volumes of data in any structure.
• Hadoop has two components:
  o The Hadoop Distributed File System (HDFS), which supports data in structured
  relational form, in unstructured form, and in any form in between
  o The MapReduce programming paradigm for managing applications on multiple
  distributed servers
• The focus is on supporting redundancy, distributed architectures, and
parallel processing.
• Low cost: The open-source framework is free and uses commodity hardware to
store large quantities of data.
• Computing power: Its distributed computing model can quickly process very large
volumes of data.
• Scalability: You can easily grow your system simply by adding more nodes with
little administration.
• Storage flexibility: Unlike traditional relational databases, you don’t have to
pre-process data before storing it. You can store as much data as you want.
• Inherent data protection: Data and application processing are protected against
hardware failure.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed
file system designed to run on commodity hardware. It is a
scalable file system that distributes and stores data across
all machines in a Hadoop cluster.
HDFS Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of:
• A single NameNode, a master server that manages the file system
namespace and regulates access to files by clients.
• A number of DataNodes, which manage storage attached to the nodes
that they run on. Internally, a file is split into one or more blocks and
these blocks are stored in DataNodes.
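The NameNode/DataNode split described above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class names, the 4-byte block size and the round-robin placement are invented for illustration, and the model passes file data through the NameNode object for brevity, whereas in real HDFS clients exchange block data with DataNodes directly and the NameNode holds only metadata.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; tiny on purpose (real HDFS blocks are far larger)


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


@dataclass
class NameNode:
    datanodes: list
    # path -> ordered list of (block_id, DataNode name)
    namespace: dict = field(default_factory=dict)

    def put(self, path, data):
        """Split data into blocks and record where each block lives."""
        locations = []
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{path}#{index}"
            node = self.datanodes[index % len(self.datanodes)]  # round-robin
            node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            locations.append((block_id, node.name))
        self.namespace[path] = locations

    def get(self, path):
        """Reassemble a file by following the recorded block locations."""
        by_name = {node.name: node for node in self.datanodes}
        return b"".join(
            by_name[name].blocks[block_id]
            for block_id, name in self.namespace[path]
        )


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes)
nn.put("/logs/a.txt", b"hello hdfs")
print(nn.namespace["/logs/a.txt"])
print(nn.get("/logs/a.txt"))
```

The point of the sketch is the division of labour: the namespace maps a path to block locations, while the bytes themselves sit on the DataNodes.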
Files in HDFS
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can
create directories and store files inside these directories. The NameNode maintains the file
system namespace. Any change to the file system namespace or its properties is recorded
by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file.
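The block arithmetic above can be worked through in a few lines of Python. The helper name `block_layout` is hypothetical, and the 128 MB block size and replication factor of 3 are assumed example values (common defaults in recent Hadoop releases), not figures from the slides.

```python
def block_layout(file_size, block_size, replication):
    """Return (number of blocks, size of the last block, raw bytes stored)."""
    if file_size == 0:
        return 0, 0, 0
    n_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    last_block = file_size - (n_blocks - 1) * block_size
    raw_bytes = file_size * replication  # every block is stored `replication` times
    return n_blocks, last_block, raw_bytes


MB = 1024 * 1024
n_blocks, last_block, raw_bytes = block_layout(300 * MB, 128 * MB, 3)
print(n_blocks, last_block // MB, raw_bytes // MB)
```

With these assumed values a 300 MB file occupies two full 128 MB blocks plus one 44 MB block, and the cluster stores 900 MB in total once every block is replicated three times.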
HDFS Robustness
The primary objective of HDFS is to store data reliably
even in the presence of failures. The common types of
failures are DataNode failures and NameNode failures.
Data Disk Failure and Re-Replication
DataNodes may lose connectivity with the NameNode. The NameNode detects this condition by the absence of the periodic
Heartbeat messages that each DataNode sends, marks those DataNodes as dead and does not forward any new IO requests
to them. The NameNode constantly tracks which blocks need to be replicated and initiates re-replication
whenever necessary.
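The failure handling described above can be sketched as a toy simulation. The timeout value, the target replication factor and the function names are assumptions made for illustration; the real NameNode logic is considerably more involved.

```python
REPLICATION = 2         # target number of copies per block (assumed value)
HEARTBEAT_TIMEOUT = 10  # seconds of silence before a node counts as dead (assumed value)


def dead_nodes(last_heartbeat, now):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT}


def re_replicate(block_map, dead, live):
    """Drop replicas on dead nodes, then copy under-replicated blocks to live nodes."""
    for holders in block_map.values():
        holders -= dead
        for candidate in sorted(live):
            if len(holders) >= REPLICATION:
                break
            # A copy can only be made if at least one replica survived.
            if holders and candidate not in holders:
                holders.add(candidate)
    return block_map


last_heartbeat = {"dn1": 100, "dn2": 100, "dn3": 85}
block_map = {"b0": {"dn1", "dn3"}, "b1": {"dn2", "dn3"}, "b2": {"dn1", "dn2"}}

dead = dead_nodes(last_heartbeat, now=101)
live = set(last_heartbeat) - dead
re_replicate(block_map, dead, live)
print(dead, block_map)
```

Here `dn3` has been silent for 16 seconds, so it is marked dead, and the two blocks that had a replica on it are copied to a surviving node to restore two copies each.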
Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS
instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple
copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and
EditLogs to get updated synchronously.
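The idea of keeping several synchronously updated copies of the EditLog can be sketched as follows. This is a simplified illustration, not HDFS's implementation: the `EditLog` class here, the `edits.log` file name and the JSON record format are all invented for the example.

```python
import json
import os
import tempfile


class EditLog:
    """Append every namespace edit to each configured copy before returning."""

    def __init__(self, directories):
        self.paths = [os.path.join(d, "edits.log") for d in directories]

    def record(self, op, path):
        line = json.dumps({"op": op, "path": path}) + "\n"
        for log_path in self.paths:
            with open(log_path, "a", encoding="utf-8") as f:
                f.write(line)
                f.flush()
                os.fsync(f.fileno())  # force this copy to disk before the next one

    def replay(self, copy=0):
        """Rebuild the namespace (a set of paths) from one copy of the log."""
        namespace = set()
        with open(self.paths[copy], encoding="utf-8") as f:
            for line in f:
                edit = json.loads(line)
                if edit["op"] == "create":
                    namespace.add(edit["path"])
                elif edit["op"] == "delete":
                    namespace.discard(edit["path"])
        return namespace


dirs = [tempfile.mkdtemp(), tempfile.mkdtemp()]
log = EditLog(dirs)
log.record("create", "/data")
log.record("create", "/data/a.txt")
log.record("delete", "/data/a.txt")
print(log.replay(0) == log.replay(1))
```

Because every edit reaches all copies before `record` returns, any single copy is enough to rebuild the namespace if another copy is corrupted. The cost is that each metadata update waits for several disk writes.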
Mappers and Reducers
Mappers
• Mappers are small programs that each deal with a relatively small amount of data and work in parallel.
• A mapper maps its input to a set of intermediate key/value pairs.
• Once mapping is done, a MapReduce phase called shuffle and sort runs on the intermediate data.
Reducers
• A reducer reduces a set of intermediate values that share a key to a smaller set of values.
• It receives a key and the list of all values for that key, then writes the final result.
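The mapper, shuffle-and-sort and reducer roles above can be sketched in plain Python using the classic word-count task. This runs in a single process and only mimics the data flow; the function names and the input lines are chosen for illustration and are not Hadoop's API.

```python
from collections import defaultdict


def mapper(line):
    """Emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and order the keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Collapse all values that share a key into one count."""
    return key, sum(values)


lines = ["big data big clusters", "data moves fast"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = [reducer(key, values) for key, values in shuffle_and_sort(intermediate)]
print(result)
```

Each mapper call sees only one line, and each reducer call sees only one key with its list of values, which is what lets a real framework run many of them at once.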
MapReduce
MapReduce applications typically implement the Mapper and Reducer interfaces
to provide the map and reduce methods.
MapReduce divides workloads into multiple tasks that can be executed in
parallel.
Why MapReduce?
The initial approach is to process data serially, i.e. from top to bottom.
For very large data sets:
o It won’t work.
o We may run out of memory.
o Data processing may take a long time.
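The contrast between serial processing and dividing the work into tasks can be sketched as below. The order records and function names are made up, and threads are used only to show the structure: in CPython they do not speed up CPU-bound work, and a real cluster would run the tasks on separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_splits(records, n_splits):
    """Cut the input into at most n_splits contiguous chunks."""
    size = max(1, (len(records) + n_splits - 1) // n_splits)
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_task(split):
    """Count orders per store within one split only."""
    return Counter(store for store, _amount in split)


def count_serial(records):
    """Top-to-bottom pass over the whole data set."""
    return Counter(store for store, _amount in records)


def count_parallel(records, n_splits=3):
    """Run one task per split, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        partials = list(pool.map(map_task, make_splits(records, n_splits)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


records = [("pune", 10), ("delhi", 5), ("pune", 7), ("goa", 1), ("delhi", 2)]
print(count_parallel(records) == count_serial(records))
```

Each task only ever holds its own split, so no single worker needs the whole data set in memory, and the merged result matches the serial pass.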
MapReduce in Action
[Figure: MapReduce execution flow. Components: User Program, Master, five Workers.
(2) The Master assigns map tasks and reduce tasks to Workers.
Map phase: Workers (3) read the input files (Split 0, Split 1, Split 2) and (4) write intermediate files to local disks.
Reduce phase: Workers (5) remotely read the intermediate files and (6) write the output files (Output File 0, Output File 1).]
Mapper: split, read, emit intermediate key-value pairs
Reducer: repartition, emit final output
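The flow in the figure, where intermediate pairs are repartitioned so that each reduce worker writes its own output file, can be sketched as follows. This is an illustration under assumptions: the CRC32-modulo partition function, the two reducers and the in-memory "files" are choices made here, not details taken from the slide.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # one output "file" per reduce task (assumed value)


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reduce task for a key; crc32 gives the same answer on every run."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits):
    # Map phase: each split yields (word, 1) pairs, bucketed per reduce task.
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for split in splits:
        for word in split.split():
            buckets[partition(word)][word].append(1)
    # Reduce phase: each reduce task produces one output (a dict stands in for a file).
    return [{key: sum(values) for key, values in sorted(bucket.items())}
            for bucket in buckets]


outputs = run_job(["a b a", "b c", "c c a"])
for index, output in enumerate(outputs):
    print(f"Output File {index}: {output}")
```

Because the partition depends only on the key, every occurrence of a word lands in the same reduce task, so no key is split across output files.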
Market Size
Source: Wikibon, Taming Big Data
By 2015: 4.5 million IT jobs in Big Data, 2 million of them in the US alone.
In India
• Gaining traction
• Huge market opportunities for IT services (82.9% of revenues) and
analytics firms (17.1%)
• Market size by end of 2015: $1 billion
• India will require a minimum of 1 lakh (100,000) data scientists in the next couple
of years, in addition to data analysts and data managers, to support the
Big Data space.