SlideShare uma empresa Scribd logo
1 de 45
Baixar para ler offline
www.edureka.co/big-data-and-hadoop
Introduction to big data and hadoop
Slide 2 www.edureka.co/big-data-and-hadoop
Slide 3 www.edureka.co/big-data-and-hadoop
At the end of this session , you will understand the:
→ Big Data Learning Paths
→ Big Data Introduction
→ Hadoop and Its Eco-System
→ Hadoop Architecture
→ Next Step on How to Setup Hadoop
Slide 4 www.edureka.co/big-data-and-hadoop
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R
Developer/Testing
Administration
Data Analyst
Big Data and Hadoop
MapReduce
Design Patterns
Apache
Spark & Scala
Apache Cassandra
Linux Administration Hadoop Administration
Data Science
Business Analytics
Using R
Advance Predictive
Modelling in R
Talend for Big Data
Data Visualization
Using Tableau
Slide 5 www.edureka.co/big-data-and-hadoop
→ Lots of Data (Terabytes or Petabytes)
→ Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications
→ The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization Big Data
Slide 6 www.edureka.co/big-data-and-hadoop
→ Systems / Enterprises generate huge amount of data from Terabytes to Petabytes of information
Stock market generates about one terabyte of new trade data per day to
perform stock trading analytics to determine trends for optimal trades
Slide 7 www.edureka.co/big-data-and-hadoop
→ By 2020, IDC (International Data Corporation) predicts the number will have reached 40,000 EB, or 40 Zettabytes (ZB)
→ The world’s information is doubling every two years. By 2020 the world will generate 50 times the amount of
information and 75 times the number of information containers
Slide 8 www.edureka.co/big-data-and-hadoop
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
VOLUME
Web
logs
Images
Videos
Audios
Sensor
Data
VARIET
Y
VELOCITY VERACITY
Min Max Mean SD
4.3 7.9 5.84 0.83
2.0 4.4 3.05 0.43
0.1 2.5 1.20 0.76
Slide 9 www.edureka.co/big-data-and-hadoop
Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
Slide 10 www.edureka.co/big-data-and-hadoop
Map the following to corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)
Slide 11 www.edureka.co/big-data-and-hadoop
Ans. XML files, e-mail body → Semi-structured data
Audio, Video, Image, Files, Archived documents → Unstructured data
Data from Enterprise systems (ERP, CRM etc.) → Structured data
Slide 12 www.edureka.co/big-data-and-hadoop
More on Big Data
• http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
• http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
• http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
• http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
• http://www-01.ibm.com/software/data/bigdata/
Slide 13Slide 13Slide 13 www.edureka.co/big-data-and-hadoop
→ Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
→ Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Slide 14Slide 14Slide 14 www.edureka.co/big-data-and-hadoop
→ Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
→ Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
Slide 15Slide 15Slide 15 www.edureka.co/big-data-and-hadoop
→ Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
→ Retail
» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Slide 16Slide 16Slide 16 www.edureka.co/big-data-and-hadoop
→ Insight into data can provide Business Advantage.
→ Some key early indicators can mean Fortunes to Business.
→ More Precise Analysis with more data.
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS etc., to store and process the customer activity and sales data.
Case Study: Sears Holding Corporation
Slide 17Slide 17Slide 17 www.edureka.co/big-data-and-hadoop
Mostly Append
BI Reports + Interactive Apps
RDBMS (Aggregated Data)
ETL Compute Grid
Storage only Grid (Original Raw Data)
Collection
Inctrumentation
A meagre
10% of the
~2PB data is
available for
BI
Storage
2. Moving data to compute doesn’
t scale
90% of
the ~2PB
archived
Processing
3. Premature data
death
1. Can’t explore original
high fidelity raw data
Slide 18Slide 18Slide 18 www.edureka.co/big-data-and-hadoop
Mostly Append
BI Reports + Interactive Apps
RDBMS (Aggregated Data)
Hadoop : Storage + Compute Grid
Collection
Instrumentation
Both
Storage
And
Processing
Entire ~2PB
Data is
available for
processing
No Data
Archiving
1. Data Exploration &
Advanced analytics
2. Scalable throughput for ETL &
aggregation
3. Keep data alive
forever
*Sears moved to a 300-Node Hadoop cluster to keep 100% of its data available for processing rather than a meagre 10% as
was the case with existing Non-Hadoop solutions.
Slide 19Slide 19Slide 19 www.edureka.co/big-data-and-hadoop
Read 1 TB Data
4 I/O Channels
Each Channel – 100 MB/s
1 Machine
4 I/O Channels
Each Channel – 100 MB/s
10 Machine
Slide 20Slide 20Slide 20 www.edureka.co/big-data-and-hadoop
4 I/O Channels
Each Channel – 100 MB/s
1 Machine
4 I/O Channels
Each Channel – 100 MB/s
10 Machine
43 Minutes
Read 1 TB Data
Slide 21Slide 21Slide 21 www.edureka.co/big-data-and-hadoop
4 I/O Channels
Each Channel – 100 MB/s
1 Machine
4 I/O Channels
Each Channel – 100 MB/s
10 Machine
4.3 Minutes43 Minutes
Read 1 TB Data
Slide 22Slide 22Slide 22 www.edureka.co/big-data-and-hadoop
→ Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters
of commodity computers using a simple programming model.
→ It is an Open-source Data Management with scale-out storage and distributed processing.
Slide 23 www.edureka.co/big-data-and-hadoop
Hadoop is a framework that allows for the distributed
processing of:
» Small Data Sets
» Large Data Sets
Slide 24 www.edureka.co/big-data-and-hadoop
Ans. Large Data Sets.
It is also capable of processing small data-sets. However, to
experience the true power of Hadoop, one needs to have
data in TB’s. Because this is where RDBMS takes hours and
fails whereas Hadoop does the same in couple of minutes.
Slide 25Slide 25Slide 25 www.edureka.co/big-data-and-hadoop
Pig Latin
Data Analysis
Hive
DW System
Other YARN
Frameworks
(MPI, GRAPH)
HBaseMapReduce Framework
YARN
Cluster Resource Management
Apache Oozie
(Workflow)
HDFS
(Hadoop Distributed File System)
Hadoop 2.0
Sqoop
Unstructured or
Semi-structured Data Structured Data
Flume
Mahout
Machine Learning
Slide 26Slide 26Slide 26 www.edureka.co/big-data-and-hadoop
DataNode
Node
Manager
DataNode DataNode DataNode
YARN
HDFS
Cluster
Resource
Manager
NameNode
Node
Manager
Node
Manager
Node
Manager
Slide 27 www.edureka.co/big-data-and-hadoop
RAM: 16GB
Hard disk: 6 x 2TB
Processor: Xenon with 2 cores
Ethernet: 3 x 10 GB/s
OS: 64-bit CentOS
RAM: 16GB
Hard disk: 6 x 2TB
Processor: Xenon with 2 cores.
Ethernet: 3 x 10 GB/s
OS: 64-bit CentOS
RAM: 64 GB,
Hard disk: 1 TB
Processor: Xenon with 8 Cores
Ethernet: 3 x 10 GB/s
OS: 64-bit CentOS
Power: Redundant Power Supply
RAM: 32 GB,
Hard disk: 1 TB
Processor: Xenon with 4 Cores
Ethernet: 3 x 10 GB/s
OS: 64-bit CentOS
Power: Redundant Power Supply
Active NameNodeSecondary NameNode
DataNode DataNode
RAM: 64 GB,
Hard disk: 1 TB
Processor: Xenon with 8 Cores
Ethernet: 3 x 10 GB/s
OS: 64-bit CentOS
Power: Redundant Power Supply
StandBy NameNode
Slide 28 www.edureka.co/big-data-and-hadoop
Master
NameNode
http://master:50070/
ResourceManager
http://master:8088
Slave01
DataNode
NodeManager
Slave02
DataNode
NodeManager
Slave03
DataNode
NodeManager
Slave04
DataNode
NodeManager
Slave05
DataNode
NodeManager
Slide 29 www.edureka.co/big-data-and-hadoop
NodeManager
DataNode
NodeManager
HDFS YARN
NameNode
DataNode
NodeManager DataNode
ResourceManager
DataNode
NodeManager
DataNode
NodeManager
NodeManager
DataNode
NodeManager
DataNode
NodeManager
DataNode
Slide 30 www.edureka.co/big-data-and-hadoop
Namenode NS
Storage
…
NamespaceBlockStorage
Namespace
NN-1 NN-k NN-n
Common Storage
BlockStorage
Pool 1 Pool k Pool n
Block Pools
… …
Hadoop 1.0 Hadoop 2.0
DatanodeDatanode
Datanode 1
…
Datanode m
…
Datanode 2
…
Block Management
Slide 31 www.edureka.co/big-data-and-hadoop
How does HDFS Federation help HDFS Scale horizontally?
a. Reduces the load on any single NameNode by using the multiple,
independent NameNode to manage individual parts of the file system
namespace.
b. Provides cross-data centre (non-local) support for HDFS, allowing a
cluster administrator to split the Block Storage outside the local cluster.
Slide 32 www.edureka.co/big-data-and-hadoop
Ans. Option (a)
In order to scale the name service horizontally, HDFS federation
uses multiple independent NameNode. The NameNode are
federated, that is, the NameNode are independent and don’t
require coordination with each other.
Slide 33 www.edureka.co/big-data-and-hadoop
You have configured two name nodes to manage /marketing and
/finance respectively. What will happen if you try to put a file in
/accounting directory?
Slide 34 www.edureka.co/big-data-and-hadoop
Ans. Put will fail. None of the namespace will manage the file and you
will get an IOException with a No such file or directory error.
Slide 35 www.edureka.co/big-data-and-hadoop
Node Manager
Container
App
Master
Node Manager
Container
App
Master
HDFS YARN
Resource
Manager
All name space edits
logged to shared NFS
storage; single writer
(fencing)
Read edit logs and
applies to its own
namespace
Secondary
Name Node
DataNode
Standby
NameNode
Active
NameNode
DataNode Data Node
DataNodeDataNode
NameNode
High
Availability
Next Generation
MapReduce
*Not necessary to
configure
Secondary
NameNode
Client
Shared Edit Logs
HDFS HIGH AVAILABILITY
Node Manager
Container
App
Master
Node Manager
Container
App
Master
Slide 36 www.edureka.co/big-data-and-hadoop
Node Manager
Container
App
Master
Node Manager
Container
App
Master
HDFS YARN
Resource
Manager
All name space edits
logged to shared NFS
storage; single writer
(fencing)
Read edit logs and
applies to its own
namespace
Secondary
Name Node
DataNode
Standby
NameNode
Active
NameNode
DataNode Data Node
DataNodeDataNode
NameNode
High
Availability
Next Generation
MapReduce
*Not necessary to
configure
Secondary
NameNode
Client
Shared Edit Logs
HDFS HIGH AVAILABILITY
Node Manager
Container
App
Master
Node Manager
Container
App
Master
Slide 37 www.edureka.co/big-data-and-hadoop
HDFS HA was developed to overcome the following disadvantage in Hadoop
1.0?
a. Single Point of Failure of Name-Node
b. Only one version can be run in classic Map-Reduce
c. Too much burden on Job Tracker
Slide 38 www.edureka.co/big-data-and-hadoop
Ans. Single Point of Failure of NameNode
Slide 39 www.edureka.co/big-data-and-hadoop
…
Slide 40 www.edureka.co/big-data-and-hadoop
Facebook
→ We use Hadoop to store copies of internal log and dimension data sources and use
it as a source for reporting/analytics and machine learning.
→ Currently we have 2 major clusters:
» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both streaming as well as the Java APIs. We have built
a higher level data warehousing framework using these features called Hive
(see the http://Hadoop.apache.org/hive/). We have also developed a FUSE
implementation over HDFS.
Slide 41 www.edureka.co/big-data-and-hadoop
Hadoop can run in any of the following three modes:
Fully-Distributed Mode
Pseudo-Distributed Mode
→ No daemons, everything runs in a single JVM.
→ Suitable for running MapReduce programs during development.
→ Has no DFS.
→ Hadoop daemons run on the local machine.
→ Hadoop daemons run on a cluster of machines.
Standalone (or Local) Mode
Slide 42 www.edureka.co/big-data-and-hadoop
→ Apache Hadoop and HDFS
→ http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
→ Apache Hadoop HDFS Architecture
→ http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Slide 43 www.edureka.co/big-data-and-hadoop
• Referring the documents present in the LMS under assignment solve the below problem.
How many such DataNodes you would need to read 100TB data in 5 minutes in your Hadoop Cluster?
Slide 44
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make
the course better!
Please spare few minutes to take the survey after the webinar.
www.edureka.co/big-data-and-hadoop
Introduction to Big Data & Hadoop

Mais conteúdo relacionado

Mais procurados

Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Edureka!
 

Mais procurados (20)

Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hadoop - Introduction to Hadoop
Hadoop - Introduction to HadoopHadoop - Introduction to Hadoop
Hadoop - Introduction to Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 

Destaque

Best Practices for Identity Management Projects
Best Practices for Identity Management ProjectsBest Practices for Identity Management Projects
Best Practices for Identity Management Projects
Hitachi ID Systems, Inc.
 
Kristen cookie case
Kristen cookie caseKristen cookie case
Kristen cookie case
jldolan1
 
Strategic evaluation and control
Strategic evaluation and controlStrategic evaluation and control
Strategic evaluation and control
Radhey Shyam Yadav
 
Sales of goods_act_1967
Sales of goods_act_1967Sales of goods_act_1967
Sales of goods_act_1967
izrena
 
Full report gas absorption
Full report gas  absorptionFull report gas  absorption
Full report gas absorption
Erra Zulkifli
 
Patient recruitment
Patient recruitmentPatient recruitment
Patient recruitment
swati2084
 
대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급
대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급
대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급
giefheoie
 
Key Architectural Aspects of a Enterprise Mobility Solution
Key Architectural Aspects of a Enterprise Mobility SolutionKey Architectural Aspects of a Enterprise Mobility Solution
Key Architectural Aspects of a Enterprise Mobility Solution
roshanjk
 

Destaque (20)

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Best Practices for Identity Management Projects
Best Practices for Identity Management ProjectsBest Practices for Identity Management Projects
Best Practices for Identity Management Projects
 
Kristen cookie case
Kristen cookie caseKristen cookie case
Kristen cookie case
 
Strategic evaluation and control
Strategic evaluation and controlStrategic evaluation and control
Strategic evaluation and control
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Do you know what the arrow on a NYC Parking Sign means?
Do you know what the arrow on a NYC Parking Sign means?Do you know what the arrow on a NYC Parking Sign means?
Do you know what the arrow on a NYC Parking Sign means?
 
Sales of goods_act_1967
Sales of goods_act_1967Sales of goods_act_1967
Sales of goods_act_1967
 
What are Software Defined Application Services
What are Software Defined Application ServicesWhat are Software Defined Application Services
What are Software Defined Application Services
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
immunological assays - ELISA and RIA
immunological assays - ELISA and RIAimmunological assays - ELISA and RIA
immunological assays - ELISA and RIA
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Top 10 military interview questions with answers
Top 10 military interview questions with answersTop 10 military interview questions with answers
Top 10 military interview questions with answers
 
Full report gas absorption
Full report gas  absorptionFull report gas  absorption
Full report gas absorption
 
Patient recruitment
Patient recruitmentPatient recruitment
Patient recruitment
 
Intro to Algebra II
Intro to Algebra IIIntro to Algebra II
Intro to Algebra II
 
대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급
대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급
대출확실한곳『LG777』.『XYZ』경찰신용대출 미국비자발급
 
Modern Enterprise integration Strategies
Modern Enterprise integration StrategiesModern Enterprise integration Strategies
Modern Enterprise integration Strategies
 
Key Architectural Aspects of a Enterprise Mobility Solution
Key Architectural Aspects of a Enterprise Mobility SolutionKey Architectural Aspects of a Enterprise Mobility Solution
Key Architectural Aspects of a Enterprise Mobility Solution
 

Semelhante a Introduction to Big Data & Hadoop

Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Edureka!
 

Semelhante a Introduction to Big Data & Hadoop (20)

Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Learn Hadoop
Learn HadoopLearn Hadoop
Learn Hadoop
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And Hadoop
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Hadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionalsHadoop for Data Warehousing professionals
Hadoop for Data Warehousing professionals
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop : The Pile of Big Data
Hadoop : The Pile of Big DataHadoop : The Pile of Big Data
Hadoop : The Pile of Big Data
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
 
Empower Data-Driven Organizations
Empower Data-Driven OrganizationsEmpower Data-Driven Organizations
Empower Data-Driven Organizations
 

Mais de Edureka!

Mais de Edureka! (20)

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | Edureka
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | Edureka
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | Edureka
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | Edureka
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | Edureka
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | Edureka
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| Edureka
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | Edureka
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | Edureka
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | Edureka
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | Edureka
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | Edureka
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | Edureka
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | Edureka
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | Edureka
 

Introduction to Big Data & Hadoop

  • 3. Slide 3 www.edureka.co/big-data-and-hadoop At the end of this session , you will understand the: → Big Data Learning Paths → Big Data Introduction → Hadoop and Its Eco-System → Hadoop Architecture → Next Step on How to Setup Hadoop
  • 4. Slide 4 www.edureka.co/big-data-and-hadoop • Java / Python / Ruby • Hadoop Eco-system • NoSQL DB • Spark • Linux Administration • Cluster Management • Cluster Performance • Virtualization • Statistics Skills • Machine Learning • Hadoop Essentials • Expertise in R Developer/Testing Administration Data Analyst Big Data and Hadoop MapReduce Design Patterns Apache Spark & Scala Apache Cassandra Linux Administration Hadoop Administration Data Science Business Analytics Using R Advance Predictive Modelling in R Talend for Big Data Data Visualization Using Tableau
  • 5. Slide 5 www.edureka.co/big-data-and-hadoop → Lots of Data (Terabytes or Petabytes) → Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications → The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization Big Data
  • 6. Slide 6 www.edureka.co/big-data-and-hadoop → Systems / Enterprises generate huge amount of data from Terabytes to Petabytes of information Stock market generates about one terabyte of new trade data per day to perform stock trading analytics to determine trends for optimal trades
  • 7. Slide 7 www.edureka.co/big-data-and-hadoop → By 2020, IDC (International Data Corporation) predicts the number will have reached 40,000 EB, or 40 Zettabytes (ZB) → The world’s information is doubling every two years. By 2020 the world will generate 50 times the amount of information and 75 times the number of information containers
  • 8. Slide 8 www.edureka.co/big-data-and-hadoop IBM’s Definition – Big Data Characteristics http://www-01.ibm.com/software/data/bigdata/ VOLUME Web logs Images Videos Audios Sensor Data VARIET Y VELOCITY VERACITY Min Max Mean SD 4.3 7.9 5.84 0.83 2.0 4.4 3.05 0.43 0.1 2.5 1.20 0.76
  • 9. Slide 9 www.edureka.co/big-data-and-hadoop Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
  • 10. Slide 10 www.edureka.co/big-data-and-hadoop Map the following to corresponding data type: » XML files, e-mail body » Audio, Video, Images, Archived documents » Data from Enterprise systems (ERP, CRM etc.)
  • 11. Slide 11 www.edureka.co/big-data-and-hadoop Ans. XML files, e-mail body → Semi-structured data Audio, Video, Image, Files, Archived documents → Unstructured data Data from Enterprise systems (ERP, CRM etc.) → Structured data
  • 12. Slide 12 www.edureka.co/big-data-and-hadoop More on Big Data • http://www.edureka.in/blog/the-hype-behind-big-data/ Why Hadoop? • http://www.edureka.in/blog/why-hadoop/ Opportunities in Hadoop • http://www.edureka.in/blog/jobs-in-hadoop/ Big Data • http://en.wikipedia.org/wiki/Big_Data IBM’s definition – Big Data Characteristics • http://www-01.ibm.com/software/data/bigdata/
  • 13. Slide 13Slide 13Slide 13 www.edureka.co/big-data-and-hadoop → Web and e-tailing » Recommendation Engines » Ad Targeting » Search Quality » Abuse and Click Fraud Detection → Telecommunications » Customer Churn Prevention » Network Performance Optimization » Calling Data Record (CDR) Analysis » Analysing Network to Predict Failure http://wiki.apache.org/hadoop/PoweredBy
  • 14. Slide 14Slide 14Slide 14 www.edureka.co/big-data-and-hadoop → Government » Fraud Detection and Cyber Security » Welfare Schemes » Justice → Healthcare and Life Sciences » Health Information Exchange » Gene Sequencing » Serialization » Healthcare Service Quality Improvements » Drug Safety http://wiki.apache.org/hadoop/PoweredBy
  • 15. Slide 15Slide 15Slide 15 www.edureka.co/big-data-and-hadoop → Banks and Financial services » Modeling True Risk » Threat Analysis » Fraud Detection » Trade Surveillance » Credit Scoring and Analysis → Retail » Point of Sales Transaction Analysis » Customer Churn Analysis » Sentiment Analysis http://wiki.apache.org/hadoop/PoweredBy
  • 16. Slide 16Slide 16Slide 16 www.edureka.co/big-data-and-hadoop → Insight into data can provide Business Advantage. → Some key early indicators can mean Fortunes to Business. → More Precise Analysis with more data. *Sears was using traditional systems such as Oracle Exadata, Teradata and SAS etc., to store and process the customer activity and sales data. Case Study: Sears Holding Corporation
  • 17. Slide 17Slide 17Slide 17 www.edureka.co/big-data-and-hadoop Mostly Append BI Reports + Interactive Apps RDBMS (Aggregated Data) ETL Compute Grid Storage only Grid (Original Raw Data) Collection Inctrumentation A meagre 10% of the ~2PB data is available for BI Storage 2. Moving data to compute doesn’ t scale 90% of the ~2PB archived Processing 3. Premature data death 1. Can’t explore original high fidelity raw data
  • 18. Slide 18Slide 18Slide 18 www.edureka.co/big-data-and-hadoop Mostly Append BI Reports + Interactive Apps RDBMS (Aggregated Data) Hadoop : Storage + Compute Grid Collection Instrumentation Both Storage And Processing Entire ~2PB Data is available for processing No Data Archiving 1. Data Exploration & Advanced analytics 2. Scalable throughput for ETL & aggregation 3. Keep data alive forever *Sears moved to a 300-Node Hadoop cluster to keep 100% of its data available for processing rather than a meagre 10% as was the case with existing Non-Hadoop solutions.
  • 19. Slide 19Slide 19Slide 19 www.edureka.co/big-data-and-hadoop Read 1 TB Data 4 I/O Channels Each Channel – 100 MB/s 1 Machine 4 I/O Channels Each Channel – 100 MB/s 10 Machine
  • 20. Slide 20Slide 20Slide 20 www.edureka.co/big-data-and-hadoop 4 I/O Channels Each Channel – 100 MB/s 1 Machine 4 I/O Channels Each Channel – 100 MB/s 10 Machine 43 Minutes Read 1 TB Data
  • 21. Slide 21Slide 21Slide 21 www.edureka.co/big-data-and-hadoop 4 I/O Channels Each Channel – 100 MB/s 1 Machine 4 I/O Channels Each Channel – 100 MB/s 10 Machine 4.3 Minutes43 Minutes Read 1 TB Data
  • 22. Slide 22Slide 22Slide 22 www.edureka.co/big-data-and-hadoop → Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. → It is an Open-source Data Management with scale-out storage and distributed processing.
  • 23. Slide 23 www.edureka.co/big-data-and-hadoop Hadoop is a framework that allows for the distributed processing of: » Small Data Sets » Large Data Sets
  • 24. Slide 24 www.edureka.co/big-data-and-hadoop Ans. Large Data Sets. It is also capable of processing small data-sets. However, to experience the true power of Hadoop, one needs to have data in TB’s. Because this is where RDBMS takes hours and fails whereas Hadoop does the same in couple of minutes.
  • 25. Slide 25Slide 25Slide 25 www.edureka.co/big-data-and-hadoop Pig Latin Data Analysis Hive DW System Other YARN Frameworks (MPI, GRAPH) HBaseMapReduce Framework YARN Cluster Resource Management Apache Oozie (Workflow) HDFS (Hadoop Distributed File System) Hadoop 2.0 Sqoop Unstructured or Semi-structured Data Structured Data Flume Mahout Machine Learning
  • 26. Slide 26Slide 26Slide 26 www.edureka.co/big-data-and-hadoop DataNode Node Manager DataNode DataNode DataNode YARN HDFS Cluster Resource Manager NameNode Node Manager Node Manager Node Manager
  • 27. Slide 27 www.edureka.co/big-data-and-hadoop RAM: 16GB Hard disk: 6 x 2TB Processor: Xenon with 2 cores Ethernet: 3 x 10 GB/s OS: 64-bit CentOS RAM: 16GB Hard disk: 6 x 2TB Processor: Xenon with 2 cores. Ethernet: 3 x 10 GB/s OS: 64-bit CentOS RAM: 64 GB, Hard disk: 1 TB Processor: Xenon with 8 Cores Ethernet: 3 x 10 GB/s OS: 64-bit CentOS Power: Redundant Power Supply RAM: 32 GB, Hard disk: 1 TB Processor: Xenon with 4 Cores Ethernet: 3 x 10 GB/s OS: 64-bit CentOS Power: Redundant Power Supply Active NameNodeSecondary NameNode DataNode DataNode RAM: 64 GB, Hard disk: 1 TB Processor: Xenon with 8 Cores Ethernet: 3 x 10 GB/s OS: 64-bit CentOS Power: Redundant Power Supply StandBy NameNode
  • 29. Slide 29 www.edureka.co/big-data-and-hadoop NodeManager DataNode NodeManager HDFS YARN NameNode DataNode NodeManager DataNode ResourceManager DataNode NodeManager DataNode NodeManager NodeManager DataNode NodeManager DataNode NodeManager DataNode
  • 30. Slide 30 www.edureka.co/big-data-and-hadoop Namenode NS Storage … NamespaceBlockStorage Namespace NN-1 NN-k NN-n Common Storage BlockStorage Pool 1 Pool k Pool n Block Pools … … Hadoop 1.0 Hadoop 2.0 DatanodeDatanode Datanode 1 … Datanode m … Datanode 2 … Block Management
  • 31. Slide 31 www.edureka.co/big-data-and-hadoop How does HDFS Federation help HDFS Scale horizontally? a. Reduces the load on any single NameNode by using the multiple, independent NameNode to manage individual parts of the file system namespace. b. Provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
  • 32. Slide 32 www.edureka.co/big-data-and-hadoop Ans. Option (a) In order to scale the name service horizontally, HDFS federation uses multiple independent NameNode. The NameNode are federated, that is, the NameNode are independent and don’t require coordination with each other.
  • 33. Slide 33 www.edureka.co/big-data-and-hadoop You have configured two name nodes to manage /marketing and /finance respectively. What will happen if you try to put a file in /accounting directory?
  • 34. Slide 34 www.edureka.co/big-data-and-hadoop Ans. Put will fail. None of the namespace will manage the file and you will get an IOException with a No such file or directory error.
  • 35. Slide 35 www.edureka.co/big-data-and-hadoop Node Manager Container App Master Node Manager Container App Master HDFS YARN Resource Manager All name space edits logged to shared NFS storage; single writer (fencing) Read edit logs and applies to its own namespace Secondary Name Node DataNode Standby NameNode Active NameNode DataNode Data Node DataNodeDataNode NameNode High Availability Next Generation MapReduce *Not necessary to configure Secondary NameNode Client Shared Edit Logs HDFS HIGH AVAILABILITY Node Manager Container App Master Node Manager Container App Master
  • 36. Slide 36 www.edureka.co/big-data-and-hadoop Node Manager Container App Master Node Manager Container App Master HDFS YARN Resource Manager All name space edits logged to shared NFS storage; single writer (fencing) Read edit logs and applies to its own namespace Secondary Name Node DataNode Standby NameNode Active NameNode DataNode Data Node DataNodeDataNode NameNode High Availability Next Generation MapReduce *Not necessary to configure Secondary NameNode Client Shared Edit Logs HDFS HIGH AVAILABILITY Node Manager Container App Master Node Manager Container App Master
  • 37. Slide 37 www.edureka.co/big-data-and-hadoop HDFS HA was developed to overcome the following disadvantage in Hadoop 1.0? a. Single Point of Failure of Name-Node b. Only one version can be run in classic Map-Reduce c. Too much burden on Job Tracker
  • 38. Slide 38 www.edureka.co/big-data-and-hadoop Ans. Single Point of Failure of NameNode
  • 40. Slide 40 www.edureka.co/big-data-and-hadoop Facebook → We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning. → Currently we have 2 major clusters: » A 1100-machine cluster with 8800 cores and about 12 PB raw storage. » A 300-machine cluster with 2400 cores and about 3 PB raw storage. » Each (commodity) node has 8 cores and 12 TB of storage. » We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see the http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
  • 41. Slide 41 www.edureka.co/big-data-and-hadoop Hadoop can run in any of the following three modes: Fully-Distributed Mode Pseudo-Distributed Mode → No daemons, everything runs in a single JVM. → Suitable for running MapReduce programs during development. → Has no DFS. → Hadoop daemons run on the local machine. → Hadoop daemons run on a cluster of machines. Standalone (or Local) Mode
  • 42. Slide 42 www.edureka.co/big-data-and-hadoop → Apache Hadoop and HDFS → http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/ → Apache Hadoop HDFS Architecture → http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
  • 43. Slide 43 www.edureka.co/big-data-and-hadoop • Referring the documents present in the LMS under assignment solve the below problem. How many such DataNodes you would need to read 100TB data in 5 minutes in your Hadoop Cluster?
  • 44. Slide 44 Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better! Please spare few minutes to take the survey after the webinar. www.edureka.co/big-data-and-hadoop