Hadoop Introduction: Why and What is Hadoop?

sudhakara st
Senior Big Data Engineer / Hadoop Developer at Cyanogen
Introduction to Hadoop
Agenda
1. Introduction to Hadoop
 What is Big Data and why Hadoop?
 Big Data characteristics and challenges
 Comparison between Hadoop and RDBMS
 Hadoop history and origin
 Hadoop ecosystem overview
 Anatomy of a Hadoop cluster
Hands-on exercise – Installing the Cloudera Hadoop VM
Big Data
Think at Scale
Data is in TB, even PB
• Facebook has 400 terabytes of stored data and ingests 20 terabytes of
new data per day. It hosts approx. 10 billion photos, 5 PB (2011), and is
growing by 4 TB per day
• NYSE generates 1 TB of data per day
• The Internet Archive stores around 2 PB of data and is growing at a rate
of 20 TB per month
 A flood of data is coming from many sources:
 Social network profiles, activity, logging and tracking
 Public web information
 Data warehouse appliances
 Internet Archive stores etc. [Q: How big is big?]
Big Data: what is it? What does it mean?
DATA Value
 Velocity
Data flows continuously; time-sensitive,
streaming flow
Batch, Real time, Streams, Historic
 Variety
Big Data extends beyond structured data, including semi-
structured and unstructured data of every variety:
text, logs, XML, audio, video, streams, flat files etc.
Structured, Semi-structured, Unstructured
 Veracity
Quality, consistency, reliability and
provenance of data
Good, bad, undefined, inconsistent,
incomplete.
 Volume
Big Data comes in on a large scale; it is
in TB and even PB
Records, Transactions, Tables, Files
Big Data: what is it? What does it mean? (continued)
Social media and websites
IT: services, software and hardware services and support
Finance: Better and deeper understanding of risk to avoid credit
crises
Telecommunications: More reliable networks where we can
predict and prevent failures
Media: More content that is lined up with your personal
preferences
Life sciences: Better-targeted medicines with fewer complications
and side effects
Retail: A personal experience with products and offers that are
just what you need
Google, Yahoo! and others need to index the entire internet and
return search results in milliseconds
Business Drivers and Scenarios for Large Data
Challenges in Big Data Storage and Analysis
Slow to process, can't scale
Disk seek for every access
Buffered reads, locality  still seeking every disk page
It is not storage capacity but access speed that is the bottleneck.
Challenges to both store and analyze datasets
Scaling is expensive
 Hard drive capacity grows faster than the speed to process it
IDE drive – 75 MB/s, 10 ms seek
SATA drive – 300 MB/s, 8.5 ms seek
SSD – 800 MB/s, 2 ms “seek”
Apart from this: analysis, compute, aggregation, processing delay,
etc.
 Unreliable machines: risk
1 machine: mean time between failures of 1 in 3 years
1,000 machines: mean time between failures of 1 day
Reliability
Partial failure: graceful decline rather than a full halt
Data recoverability: if a node fails, another picks up its workload
Node recoverability: a fixed node can rejoin the group without a full
group restart
Scalability: adding resources adds load capacity
Backup
 Not affordable: expensive (faster and more reliable means more cost)
Easy to use and secure
Process data in parallel
Challenges in Big Data Storage and Analysis (continued)
 An Idea: Parallelism
• Transfer speed improves at a greater rate than seek speed.
• Process reads/writes in parallel rather than sequentially.
1 drive – 75 MB/s: 16 days for 100 TB
1,000 drives – 75 GB/s: 22 minutes for 100 TB (checked below)
 A problem: Parallelism is hard
• Synchronization
• Deadlock
• Limited bandwidth
• Timing issues and coordination
• Split & aggregation
 Computers are complicated
• Drive failure
• Data availability
• Coordination
Hey! We have distributed computing!!!
Process data in parallel? – not so simple 
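A quick sanity check of the scan-time arithmetic above (taking 1 TB = 10^6 MB):

100 TB / 75 MB/s = 10^8 MB / 75 MB/s ≈ 1.33 × 10^6 s ≈ 15.4 days, i.e. roughly 16 days on one drive.
Split across 1,000 drives in parallel: 1.33 × 10^6 s / 1,000 ≈ 1,333 s ≈ 22 minutes.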
Yes, we have distributed computing, and it also comes with some
challenges 
Resource sharing: access any data and utilize CPU resources across the
system
Portability, reliability
Concurrency: allow concurrent access and update of shared resources,
with availability and high throughput
Scalability: with data, with load
Fault tolerance: by having provisions for redundancy and recovery
Heterogeneity: different operating systems, different hardware
Transparency: should appear as a whole instead of a collection of
computers
Hide the details and complexity of meeting the above challenges from
the user, and provide a common unified interface to interact with the system.
To address most of these challenges (but not all), Hadoop comes in.
Common Challenges in Distributed Computing
Apache Hadoop is a framework that allows for the distributed
processing of large data sets across clusters of commodity
computers using a simple programming model. It is designed
to scale up from single servers to thousands of machines,
each providing computation and storage.
Hadoop is an open-source implementation of Google's
MapReduce and GFS (distributed file system).
Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library.
Hadoop fulfills the need for a common infrastructure:
– Efficient, reliable, easy to use
– Open source, Apache License
Hadoop origins
The Name ‘Hadoop’?
The name is not an acronym: Hadoop was the name Doug Cutting's young son gave to a stuffed yellow elephant, which also explains the elephant logo.
Store and process large amounts of data (petabytes)
Performance, storage and processing scale linearly
Compute should move to data
Simple core, modular and extensible
Failure is normal, expected
Manageable and self-healing
Designed to run on commodity hardware – cost effective
Hadoop Design Axioms
For Storage and Distributed Computing (MapReduce):
Split up the data
Process data in parallel
Sort and combine to get the answer
Schedule, process and aggregate independently
Failures are independent; handle failures
Handle fault tolerance
Solution: Hadoop achieves complete parallelism (sketched in code below)
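To make the split-process-sort-combine flow concrete, here is a minimal word-count sketch in Java against the standard Hadoop MapReduce API. The class names and tokenization are illustrative, not from the slides: the framework splits the input, runs map() in parallel on each split, sorts and groups the emitted pairs by key, and reduce() combines them.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel, one task per input split; emits (word, 1).
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // the framework sorts and groups by key
      }
    }
  }
}

// Reduce phase: receives each word with all of its counts, already grouped.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get(); // combine partial counts into the final answer
    }
    context.write(word, new IntWritable(sum));
  }
}

The driver that packages and submits these classes appears after the MapReduce architecture slide further below.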
2002-2004: Doug Cutting and Mike Cafarella start
working on Nutch
2003-2004: Google publishes the GFS and MapReduce papers
2004: Doug Cutting adds DFS and MapReduce support to
Nutch
2006: Yahoo! hires Cutting and builds a team to develop Hadoop
2007: The New York Times converts 4 TB of archives on a 100-node EC2
Hadoop cluster
Web-scale deployment at Yahoo!, Facebook, Twitter
May 2009: Yahoo! does the fastest sort of a TB: 62 seconds over
1,460 nodes
Yahoo! sorts a PB in 16.25 hours over 3,658 nodes
Hadoop History
An elephant can't jump. But it can carry a heavy load!!!
 A fundamental tenet of relational databases is structure defined by a
schema. But large data sets are often unstructured or semi-structured,
and for these Hadoop is the better choice: the Hadoop MR framework uses
key/value pairs as its basic data unit, which is flexible enough to
work with the less-structured data types.
 Scaling commercial relational databases is expensive and limited.
 SQL is a high-level declarative language over a black-box query
engine: you query data by stating the result you want and let the
database engine figure out how to derive it. But you may want to build
complex statistical models from your data, do analytical reporting, or
reformat your image data; SQL is not well designed for such tasks. MapReduce
tries to collocate the data with the compute node, so data access is
fast since it is local.
 Coordinating the processes in a large-scale distributed computation
is a challenge. HDFS and MR make it easy to split, store, process and
aggregate.
Hadoop vs. RDBMS
 To run a bigger database you need to buy a bigger machine, but high-end
machines are not cost effective for many applications. For example, a
machine with four times the power of a standard PC costs a lot more than
putting four such PCs in a cluster. Hadoop is designed to be a scale-out
architecture operating on a cluster of commodity hardware. Adding more
resources means adding more machines to the Hadoop cluster.
Effective cost per user TB: $250/TB
Other solutions (RDBMS) cost in the range of $100 to $100K per user TB
 The hardest aspect is gracefully handling partial failure (when you don't
know whether a remote process has failed or not) while still making progress
with the overall computation. MapReduce spares the programmer from having
to think about failure, since the implementation detects failed map or
reduce tasks and reschedules them with suitable replacements. MapReduce is
able to do this because it is a shared-nothing architecture, meaning that tasks
have no dependence on one another.
Hadoop vs. RDBMS (continued)
Hardware failure: as soon as you start using many pieces of hardware, the
chance that one will fail is fairly high. A common way of avoiding data loss is
through replication: redundant copies of the data are kept by the system so
that in the event of a failure, another copy is available. Node failures and disk
failures are handled efficiently by the Hadoop framework (see the configuration
sketch below).
Hadoop vs. RDBMS (continued)
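Replication in HDFS is driven by a per-file replication factor with a cluster-wide default. A minimal sketch of the relevant setting (3 is HDFS's usual default; treat the exact file contents as an assumed example):

<!-- hdfs-site.xml: number of DataNodes holding a copy of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>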
Is Hadoop an alternative to RDBMSs?
Hadoop is not replacing the traditional data systems used for building
analytic applications (the RDBMS, EDW and MPP systems) but rather is a
complement.
 Interoperates with existing systems and tools; at the moment Apache
Hadoop is not a substitute for a database
 No relations, key/value pairs
 Big Data: unstructured (text) & semi-structured (sequence/binary files)
 Structured (HBase, modeled on Google BigTable)
 Works fine together with RDBMSs; Hadoop is being used to distill
large quantities of data into something more manageable.
Hadoop vs. RDBMS (continued)
Hadoop is designed and built on two independent frameworks:
Hadoop = HDFS + MapReduce
HDFS (storage and file system): HDFS is a reliable
distributed file system that provides high-throughput
access to data.
MapReduce (processing): MapReduce is a
framework for performing high-performance distributed
data processing using the divide-and-aggregate
programming paradigm.
Hadoop has a master/slave architecture for both
storage and processing.
Hadoop Architecture
Hadoop Master and Slave Architecture
Hadoop storage and file system – the Hadoop Distributed File System (HDFS)
The components (daemons) of HDFS are:
• NameNode is the master of the system. It maintains the name system
(directories and files) and manages the blocks that are present on the DataNodes.
• DataNodes are the slaves, deployed on each machine, which provide the
actual storage. They are responsible for serving read and write requests for the
clients (see the shell examples below).
• Secondary NameNode is responsible for performing periodic checkpoints, so in
the event of NameNode failure, you can restart the NameNode using the checkpoint.
[Diagram: NameNode (master) with a Secondary NameNode performing periodic checkpoints; DataNodes (slaves), one per worker machine]
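A few standard HDFS shell commands that exercise the daemons just described; the paths are illustrative:

# Write: the client asks the NameNode for block locations, then streams to DataNodes
hadoop fs -put access.log /user/demo/access.log

# Namespace query: answered from the NameNode's metadata
hadoop fs -ls /user/demo

# Read: blocks are streamed back from the DataNodes holding them
hadoop fs -cat /user/demo/access.log

# Health check: report files, blocks and replication status
hadoop fsck /user/demo/access.log -files -blocks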
Hadoop Master and Slave Architecture (continued)
Parallel and distributed computation – the MapReduce paradigm
The components (daemons) of MapReduce are:
 JobTracker is the master of the system; it manages the jobs and
resources in the cluster (TaskTrackers). The JobTracker tries to schedule each
map as close as possible to the actual data being processed, i.e. on the
TaskTracker that is running on the same DataNode as the underlying block.
 TaskTrackers are the slaves, deployed on each machine. They are
responsible for running the map and reduce tasks as instructed by the
JobTracker (see the driver sketch below).
[Diagram: JobTracker (master); TaskTrackers (slaves), one per worker machine]
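A minimal driver, in the Hadoop 1.x style matching the JobTracker/TaskTracker picture above, wiring the earlier word-count mapper and reducer into a job and submitting it. The input/output paths are illustrative assumptions, and the mapper/reducer classes are assumed to sit in the same package:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml etc.
    Job job = new Job(conf, "word count");     // Hadoop 1.x style constructor
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);    // from the earlier sketch
    job.setCombinerClass(WordCountReducer.class); // local pre-aggregation per node
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
    // Submits to the JobTracker, which farms map/reduce tasks out to TaskTrackers.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}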
Core Hadoop Ecosystem
Hadoop Ecosystem Development
Hadoop Ecosystem now…
The Hadoop ecosystem has grown over the last few years, and there is a lot of
jargon in terms of tools and frameworks; Hadoop has become the kernel.
Hadoop can be configured in three modes:
Standalone: all Hadoop daemons run inside a single Java
process, using the local file system for storage. Standalone mode is
helpful for debugging Hadoop applications.
Pseudo-distributed: each Hadoop daemon runs in a different
JVM, as a separate process, but all processes run on a
single machine (a configuration sketch follows this slide).
Fully distributed: Hadoop's actual power (parallel processing,
scalability and independence of task execution, replication
management, workflow management, fault tolerance, and data
consistency) lies in the fully distributed mode. The Hadoop
fully distributed mode is a highly effective centralized data
structure that allows multiple machines to contribute processing
power and storage to the cluster.
Hadoop Environment
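For pseudo-distributed mode, the classic Hadoop 1.x configuration (matching the JobTracker/TaskTracker architecture above) points the default file system and the job tracker at the local machine. The host/port values below are the conventional ones, shown here as an assumed sketch:

<!-- core-site.xml: point the default file system at a single-node HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: run the JobTracker on the same machine -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>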
Distributed framework for processing and storing data, generally on commodity hardware.
Completely open source and written in Java
Store anything:
unstructured or semi-structured data
Storage capacity:
scales linearly; cost is not exponential
Data locality and process it your way:
code moves to data;
in MR you specify the actual steps in processing the data and derive the output
Stream access: process data in any language (see the streaming sketch below)
Failure and fault tolerance:
detects failure and heals itself;
reliable: data is replicated, failed tasks are rerun, no need to maintain backups of data
Cost effective: Hadoop is designed to be a scale-out architecture operating on a cluster of
commodity PC machines
The Hadoop framework transparently provides applications with both
reliability and data motion
Primarily used for batch processing, not real-time/transactional user applications.
& Many more……..
Hadoop Features and Summary
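"Process data in any language" refers to Hadoop Streaming, which runs ordinary executables as the map and reduce steps over stdin/stdout. An invocation sketch (the jar path follows the Hadoop 1.x layout; map.py and reduce.py are hypothetical executable scripts):

# Ship two arbitrary executables to the cluster and run them as map/reduce
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/demo/input \
  -output /user/demo/output \
  -mapper map.py \
  -reducer reduce.py \
  -file map.py -file reduce.py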
Hadoop: Open Source at Apache
No strategic agenda
 quality is emergent, continuously evolving
Community-based and strong
 diverse organizations collaborating voluntarily
 decisions by consensus
 transparent
Allows competing projects
 survival of the fittest
A loose federation of projects
 permits evolution
Insures against vendor lock-in
 can't buy Apache
Who Wrote Hadoop? It's the Community
Contributors and Development
Lifetime patches contributed for all Hadoop-related projects: community members by
current employer
* source : JIRA tickets
What is Hadoop used for?
• Search
– Yahoo, Amazon, Zvents
• Log processing
– Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation Systems
– Facebook
• Data Warehouse
– Facebook, AOL
• Video and Image Analysis
– New York Times, Eyealike
..... Almost in every domain !!!
Who uses Hadoop?
 Amazon/A9
 Facebook
 Google
 IBM
 Joost
 Last.fm
 New York Times
 PowerSet
 Veoh
 Yahoo!
 Twitter
 LinkedIn
…No list is too big now
References
Hadoop: The Definitive Guide, Third Edition, by Tom White (O'Reilly).
http://hadoop.apache.org/
http://www.cloudera.com/
[Ecosystem diagram: ZooKeeper (coordination); MapReduce (job scheduling and execution); HBase (key-value store); Hive (warehouse); Pig (data flow); ETL tools; Avro (serialization); Sqoop, Flume; MR streaming; application DBs, transactional DBs, REST API]
Editor's Notes

  1. Look around at the technology we have today, and it's easy to come to the conclusion that it's all about data. We live in the data age. It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006, forecasting tenfold growth to 1.8 zettabytes by 2011 and 5 zettabytes by 2015. This flood of data is coming from many sources. Consider the following: • The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month. • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
  2. Instead of growing a system onto larger and larger hardware (scale-up), the scale-out approach spreads the processing onto more and more machines. These traditional approaches to scale-up and scale-out are often not feasible: the purchase costs are high, as is the effort to develop and manage the systems.