Big Data and Hadoop

Mr. Ankit, Student at Techno India, Salt Lake
• What is Big Data
• How the 3Vs define Big Data
• Hadoop and its ecosystem
• HDFS
• MapReduce and YARN
• Career in Big Data and Hadoop
What is Big Data?
• It is data that is created very fast and is too big to be processed on a single machine. This data comes from various sources in various formats.
Examples of data at different scales:
o Order details for a single store
o All orders across hundreds of stores
o One person’s stock portfolio
o All stock transactions for a stock exchange
How do the 3Vs define Big Data?
1. Volume
• Volume is the size of the data, which determines the value and potential of the data under consideration. The name ‘Big Data’ itself refers to size, hence this characteristic.
2. Variety
• Data today comes in all types of formats: structured, numeric data in traditional databases; unstructured text documents, email, stock ticker data and financial transactions; and semi-structured data too.
3. Velocity
• Velocity is the speed at which data is generated and processed to meet the demands and challenges of growth and development.
Summary
• Veracity (added well after the original 3Vs, as a fourth V)
o The quality of the data being captured can vary greatly. The accuracy of an analysis depends on the veracity of the source data.
What is Hadoop?
“Hadoop” was the name of a yellow toy elephant owned by the son of one of its inventors.
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.
• Open-source software. Open-source software differs from commercial software in that a broad, open network of developers creates and manages the programs.
• Framework. In this case, it means everything you need to develop and run your software applications is provided: programs, tool sets, connections, and so on.
• Distributed. Data is divided and stored across multiple computers, and computations can be run in parallel across multiple connected machines.
• Massive storage. The Hadoop framework can store huge amounts of data by breaking the data into blocks and storing them on clusters of lower-cost commodity hardware.
• Faster processing. Hadoop processes large amounts of data in parallel across clusters of tightly connected low-cost computers for quick results.
• Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
• Computing power. Its distributed computing model can quickly process very large volumes of data.
• Scalability. You can easily grow your system by adding more nodes, with little administration.
• Storage flexibility. Unlike traditional relational databases, you don’t have to pre-process data before storing it. You can store as much data as you want.
• Inherent data protection. Data and application processing are protected against hardware failure.
• Self-healing capabilities. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail, and Hadoop automatically stores multiple copies of all data.
What’s in Hadoop?
• HDFS: the Java-based distributed file system that can store all kinds of data without prior organization.
• MapReduce: a software programming model for processing large sets of data in parallel.
• YARN: a resource management framework for scheduling and handling resource requests from distributed applications.
Hadoop Ecosystem
• HDFS and MapReduce are the two core components of the Hadoop ecosystem and are at the heart of the Hadoop framework.
• Other Apache projects built around the Hadoop framework are also part of the Hadoop ecosystem.
HDFS (Hadoop Distributed File System)
• HDFS enables Hadoop to store huge files. It is a scalable file system that distributes and stores data across all machines in a Hadoop cluster.
o Scale-Out Architecture - Add servers to increase capacity
o High Availability - Serve mission-critical workflows and applications
o Fault Tolerance - Automatically and seamlessly recover from failures
o Load Balancing - Place data intelligently for maximum efficiency and utilization
o Tunable Replication - Multiple copies of each file provide data protection and computational performance
NameNode and DataNode
[Diagram: a 150 MB text file split into blocks of 64 MB, 64 MB and 22 MB, stored on the DataNodes (DN) of a cluster managed by a NameNode]
• When a file (say, a 150 MB text file) is uploaded to HDFS, it is split into blocks (here 64 MB, 64 MB and 22 MB), and each block is stored on a node in the Hadoop cluster.
• NameNode: runs on a master node and tracks and directs the storage of the cluster. A separate machine, the NameNode, keeps a record of which blocks make up the original 150 MB file and where they are stored. The information stored here is called metadata.
• DataNode: a piece of software that runs on each of the slave (worker) nodes, which make up the majority of the machines in a cluster. The NameNode decides which DataNodes each block is placed on.
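The block split in the 150 MB example can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `split_into_blocks` is made up for illustration, and the 64 MB block size is taken from the slide rather than from any particular Hadoop configuration.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into.

    Hypothetical helper: it only reproduces the arithmetic of the slide's
    example and does not talk to HDFS.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(150))  # [64, 64, 22]
```

The last block is smaller than the others because it holds only the remaining 22 MB of the file.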
How HDFS Works
[Diagram: a NameNode and three DataNodes (DN)]
Which of these is a problem if it occurs?
o Network failure between the nodes
o Disk failure on a DataNode
o Not all DataNodes are used
o Block sizes differ between DataNodes
o Disk failure on the NameNode
• We may lose some DataNodes and, with them, some of the data, say 64 MB of the 150 MB text file.
• We may also have a hardware problem on the NameNode and lose it too.
How HDFS Works (continued)
Solution to the problem (DataNode lost)
o Replication factor (RF): the number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
• Hadoop replicates each block of a file three times as it stores it in HDFS (RF = 3 by default).
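To see why a replication factor of 3 protects against losing a DataNode, here is a small simulation. It is a sketch under stated assumptions: the names `place_replicas` and `readable_blocks`, the block and node labels, and the round-robin placement are all invented for illustration. Real HDFS uses a rack-aware placement policy, not round-robin.

```python
def place_replicas(blocks, datanodes, rf=3):
    """Assign each block to `rf` distinct DataNodes, round-robin.

    Round-robin is a simplification chosen for this sketch only.
    """
    if rf > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + k) % len(datanodes)] for k in range(rf)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    ]


blocks = ["blk_1", "blk_2", "blk_3"]        # the three blocks of the 150 MB file
datanodes = ["dn1", "dn2", "dn3", "dn4"]    # hypothetical four-node cluster

with_rf3 = place_replicas(blocks, datanodes, rf=3)
with_rf1 = place_replicas(blocks, datanodes, rf=1)

# With RF = 3, losing two DataNodes still leaves every block readable.
print(readable_blocks(with_rf3, failed={"dn1", "dn2"}))
# With RF = 1, losing a single DataNode loses the block stored on it.
print(readable_blocks(with_rf1, failed={"dn1"}))
```

With three copies per block, a block only becomes unreadable if all three nodes holding it fail at the same time.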
How HDFS Works (continued)
Solution to the problem (NameNode lost)
• For a long time, if the NameNode (and the metadata stored on it) was lost, the entire cluster became inaccessible. Now there are two techniques for protecting the metadata.
• NFS (Network File System): the metadata is stored not only on the NameNode’s own hard drive but also on NFS, which is a method of mounting a remote disk. That way, if the NameNode and its metadata are lost, a copy of the metadata still exists elsewhere on the network.
• Even better, two NameNodes are nowadays configured:
o NameNode (active): works under normal conditions
o NameNode (standby): takes over if the active NameNode fails
MapReduce
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
o Scale-Out Architecture - Add servers to increase processing power
o Security & Authentication - Works with HDFS security to make sure that only approved users can operate against the data in the system
o Resource Manager - Employs data locality and server resources to determine optimal computing operations
o Optimized Scheduling - Completes jobs according to prioritization
o Flexibility - Procedures can be written in virtually any programming language
o Resiliency & High Availability - Multiple job and task trackers ensure that jobs fail independently and restart automatically
Why MapReduce?
• Processing data serially, that is, from top to bottom, can take a long time.
• Historically, we might use an associative array or a hash table, but this can lead to serious problems.
• As the hash table grows, heap pressure becomes more of an issue.
Say we are processing 1 TB of data. What issues may occur?
o It won’t work.
o We may run out of memory.
o Data processing may take a long time.
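The serial, hash-table approach described above might look like the following sketch. The function name `total_sales_serial` and the sample records are invented for illustration. The point is that one process reads every record from top to bottom and keeps every key in one in-memory dictionary, which is what stops scaling at around 1 TB of input.

```python
from collections import defaultdict


def total_sales_serial(records):
    """Sum the sale amounts per store by scanning all records in order.

    One process, one in-memory hash table: fine for small inputs, but the
    table and the running time both grow with the data.
    """
    totals = defaultdict(int)
    for store, amount in records:
        totals[store] += amount
    return dict(totals)


# Made-up sample records: (store, sale amount).
sample = [("store_a", 10), ("store_b", 5), ("store_a", 20)]
print(total_sales_serial(sample))  # {'store_a': 30, 'store_b': 5}
```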
How MapReduce Works
Solution to the problem
• MapReduce divides workloads into multiple tasks that can be executed in parallel.
• MapReduce applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.
Mappers and Reducers
Mappers
• Mappers are the individual tasks that transform input records into intermediate records.
• They are small programs that each deal with a relatively small amount of data and work in parallel.
• A mapper maps input key/value pairs to a set of intermediate key/value pairs. This output is called the intermediate records.
• Once mapping is done, a phase of MapReduce called shuffle and sort takes place on the intermediate data.
• Shuffle is the movement of intermediate records from mappers to reducers.
• Sort means that the intermediate records are organized in sorted key order for the reducers.
Reducers
• A reducer reduces a set of intermediate values that share a key to a smaller set of values.
• It works on one set of records at a time: it gets a key and the list of all values for that key, and then writes the final result.
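The map, shuffle-and-sort and reduce phases can be imitated in a single Python process with a word count. This is a minimal sketch of the programming model only: it does not use the Hadoop API, nothing runs in parallel, and the function names and the two input lines are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map one input record (a line) to intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Sort intermediate pairs by key and group the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """Reduce all values that share a key to a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases one after another in this single process."""
    intermediate = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle_and_sort(intermediate))


counts = run_job(["big data and hadoop", "hadoop stores big data"])
print(counts)
```

In a real cluster, many mappers would each process one split of the input on different machines, and the shuffle would move records over the network to the reducers. Here the same three steps run one after another in a single process.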
YARN (introduced in Hadoop 2 to separate resource management from MapReduce)
• YARN is the architectural centre of Hadoop. It allows multiple data processing engines, such as interactive SQL, real-time streaming, data science and batch processing, to handle data stored on a single platform, which opens up a new approach to analytics.
Career in Big Data and Hadoop