Big Data and Hadoop Basics

PRESENTATION FOR
BIG DATA & HADOOP
BY SONAL TIWARI
UNDERSTANDING BIG DATA
WHAT IS BIG DATA?
Big data involves the data produced by different devices and
applications. Some of the fields that come under Big Data are:
01. Black Box Data − It is a component of helicopters, airplanes, jets,
etc. It captures the voices of the flight crew, recordings of microphones
and earphones, and the performance information of the aircraft.
02. Social Media Data − Social media such as Facebook and Twitter hold
information and the views posted by millions of people across the globe.
03. Stock Exchange Data − The stock exchange data holds information
about the ‘buy’ and ‘sell’ decisions made on shares of different
companies by the customers.
UNDERSTANDING BIG DATA
04. Transport Data − Transport data includes the model, capacity,
distance and availability of a vehicle.
05. Search Engine Data − Search engines retrieve lots of data from
different databases.
06. Power Grid Data − The power grid data holds information about the
power consumed by a particular node with respect to a base station.
UNDERSTANDING BIG DATA
 Big data is a collection of large datasets that cannot be processed
using traditional computing techniques.
 The 4 V’s that define the data sets in Big Data are:
o Volume
o Velocity
o Variety
o Veracity
DEFINITION OF BIG DATA?
UNDERSTANDING BIG DATA
4V’S OF BIG DATA
 Volume − Refers to the vast amount of data generated every second.
 Variety − Refers to the different types of data, such as messages,
audio and video recordings, and images.
 Velocity − Refers to the speed at which new data is generated and the
speed at which it moves around.
 Veracity − Refers to the messiness and trustworthiness of the data.
UNDERSTANDING BIG DATA
DEFINITION OF BIG DATA?
Big Data Challenges:
 Capturing Data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
UNDERSTANDING BIG DATA
 The enterprise stores and processes Big data in a
computer/database such as Oracle, IBM, etc.
 The user interacts with the application, which in turn handles the
part of data storage and analysis.
TRADITIONAL APPROACH OF BIG DATA PROCESSING AND LIMITATIONS
(Diagram: User → Centralised System → Relational Database)
LIMITATIONS
 This approach works fine for applications that process less
voluminous data, i.e. data that can be accommodated by standard
database servers, or up to the limit of the processor that is
processing the data.
UNDERSTANDING BIG DATA
 Google solved the limitations of traditional methods using an
algorithm called MapReduce.
 This algorithm divides the task into small parts and assigns them to
many computers, and collects the results from them which when
integrated, form the result dataset.
LATEST APPROACH: GOOGLE SOLUTION
(Diagram: User → Centralised System → multiple Commodity Hardware nodes)
UNDERSTANDING HADOOP & ITS COMPONENTS
 Using the solution provided by Google, Doug Cutting and his
team developed an open source project called HADOOP.
 Hadoop runs applications using the MapReduce algorithm, where
the data is processed in parallel on different nodes.
 Hadoop is used to develop applications that can perform
complete statistical analysis on huge amounts of data.
 Hadoop is an Apache open source framework written in Java
that allows distributed processing of large datasets across
clusters of computers using simple programming models.
 The Hadoop framework application works in an environment
that provides distributed storage and computation across
clusters of computers.
 Hadoop is designed to scale up from a single server to thousands
of machines, each offering local computation and storage.
INTRODUCTION TO HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
INTRODUCTION TO HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
 Hadoop has two major layers, namely −
 Processing/Computation layer (MapReduce), and
 Storage layer (Hadoop Distributed File System).
 MapReduce: MapReduce is a parallel programming model for writing
distributed applications, devised at Google for efficient processing of
large amounts of data (multi-terabyte data sets) on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner. The MapReduce program runs on Hadoop, which is an Apache
open-source framework.
 Hadoop Distributed File System: The Hadoop Distributed File System
(HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on commodity hardware.
It is highly fault-tolerant and is designed to be deployed on low-cost
hardware. It provides high-throughput access to application data and is
suitable for applications having large datasets.
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
 Apart from the above-mentioned two core components, Hadoop
framework also includes the following two modules −
 Hadoop Common − These are Java libraries and utilities
required by other Hadoop modules.
 Hadoop YARN − This is a framework for job scheduling and
cluster resource management.
UNDERSTANDING HADOOP & ITS COMPONENTS
ADVANTAGES OF HADOOP
 The Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes
the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault tolerance and
high availability (FTHA); rather, the Hadoop library itself has been
designed to detect and handle failures at the application layer.
 Servers can be added or removed from the cluster dynamically
and Hadoop continues to operate without interruption.
 Hadoop is open source and is compatible with all platforms
since it is Java based.
COMPONENTS OF HADOOP ECOSYSTEM
 HDFS: Hadoop Distributed File System
 MapReduce: Programming based Data Processing
 YARN: Yet Another Resource Negotiator
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
COMPONENTS OF HADOOP ECOSYSTEM
 Data Storage:
 HDFS (File System)
 HBASE (Column DB Storage)
 Data Processing:
 Map Reduce (Distributed Data Processing)
 YARN (Cluster & Resource Management)
 Data Access:
 Hive (SQL)
 Pig (Dataflow)
 Mahout (Machine Learning)
 Avro (RPC)
 Sqoop (RDBMS Connector)
 Data Management:
 Oozie (Workflow Monitoring)
 Chukwa (Monitoring)
 Flume (Monitoring)
 ZooKeeper (Management)
DATA STORAGE COMPONENT OF HADOOP
HDFS
 The Hadoop Distributed File System was developed using distributed
file system design.
 It runs on commodity hardware.
 Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.
 HDFS holds a very large amount of data and provides easier access.
To store such huge data, the files are stored across multiple
machines.
 The files are stored in a redundant fashion to rescue the system
from possible data losses in case of failure.
 HDFS also makes applications available for parallel processing.
DATA STORAGE COMPONENT OF HADOOP
HDFS - ARCHITECTURE
DATA STORAGE COMPONENT OF HADOOP
HDFS ARCHITECTURE
HDFS follows the master-slave architecture and it has the following
elements.
 Namenode: The namenode is the commodity hardware that
contains the GNU/Linux operating system and the namenode
software. It is software that can be run on commodity
hardware. The system having the namenode acts as the master
server and it does the following tasks:
 Manages the file system namespace.
 Regulates client’s access to files.
 It also executes file system operations such as renaming,
closing, and opening files and directories.
DATA STORAGE COMPONENT OF HADOOP
HDFS ARCHITECTURE
 Datanode: The datanode is commodity hardware having the
GNU/Linux operating system and datanode software. For every
node (commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file systems,
as per client request.
 They perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
 Block: The user data is stored in the files of HDFS. The file in a
file system will be divided into one or more segments and/or
stored in individual data nodes. These file segments are called
blocks. In other words, the minimum amount of data that HDFS can
read or write is called a block. The default block size is 64 MB,
but it can be increased as per the need via the HDFS configuration.
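To make the namenode/datanode/block description above concrete, here is a minimal sketch of how a client application talks to HDFS through Hadoop's Java FileSystem API. The namenode address, path and file contents are placeholder assumptions, not values taken from this presentation.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder namenode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/sample.txt");   // hypothetical path

                // Write: the client streams data; HDFS splits it into blocks and
                // replicates each block across datanodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
                }

                // Read: the namenode supplies block locations, the datanodes supply bytes.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }

Higher-level tools such as Hive and Pig ultimately read and write their data through this same FileSystem abstraction.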
DATA STORAGE COMPONENT OF HADOOP
HBASE
 HBase is a NoSQL database which supports all kinds of data and is
thus capable of handling any kind of data in a Hadoop database.
 It provides capabilities similar to Google’s BigTable and is thus able
to work on Big Data sets effectively.
 At times when we need to search or retrieve the occurrences of
something small in a huge database, the request must be
processed within a short span of time. At such times, HBase comes
in handy as it gives us a tolerant way of storing limited data.
DATA STORAGE COMPONENT OF HADOOP
HBASE- COMPONENTS
 HBase master: It is not part of the actual data storage, but it
manages load balancing activities across all Region Servers.
 It controls the failovers.
 Performs administration activities which provide an interface
for creating, updating and deleting tables.
 Handles DDL operations.
 It maintains and monitors the Hadoop cluster.
 Region Server: It is a worker node which handles read, write,
update, and delete requests from clients. The Region Server process
runs on every node of the Hadoop cluster, on the HDFS data nodes.
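A minimal HBase client sketch in Java, showing the kind of Put/Get requests that the Region Servers above serve. The table name, column family and ZooKeeper quorum address are hypothetical, and the sketch assumes the table has already been created.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk-host");   // placeholder quorum

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user1", column family "info", qualifier "city".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
                table.put(put);

                // Read it back: the request is routed to the Region Server owning the row.
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }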
DATA PROCESSING COMPONENT OF HADOOP
MAP REDUCE
 By making use of distributed and parallel algorithms,
MapReduce makes it possible to carry the processing logic to the data
and helps in writing applications which transform big data sets into
manageable ones.
 MapReduce makes use of two functions, i.e. Map() and
Reduce(), whose tasks are:
 Map() performs sorting and filtering of data and thereby
organizes it into groups. Map generates a key-value
pair based result which is later processed by the
Reduce() method.
 Reduce(), as the name suggests, does the summarization by
aggregating the mapped data. In simple terms, Reduce() takes the
output generated by Map() as input and combines those tuples
into a smaller set of tuples.
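As an illustration of the Map()/Reduce() split described above, here is the classic word-count example sketched against Hadoop's Java MapReduce API; the class names are arbitrary, and a small driver class (not shown) would normally configure the job and its input/output paths.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map(): filters/organizes the input, emitting (word, 1) key-value pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce(): aggregates all values for a key into a smaller set of tuples.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }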
DATA PROCESSING COMPONENT OF HADOOP
MAP REDUCE- FEATURES
Features of MapReduce:
 Simplicity − jobs are easy to run
 Scalability − can process petabytes of data
 Speed − parallel processing improves speed
 Fault Tolerance − takes care of failures
DATA PROCESSING COMPONENT OF HADOOP
YARN
 YARN (Yet Another Resource Negotiator), as the name implies, is the
component that helps to manage the resources across the clusters. It
performs scheduling and resource allocation for the Hadoop
system.
 It consists of three major components, i.e.
 Resource Manager
 Node Manager
 Application Manager
 The Resource Manager has the privilege of allocating resources for
the applications in the system, whereas Node Managers work on the
allocation of resources such as CPU, memory and bandwidth per
machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource
Manager and the Node Manager and performs negotiations as per the
requirement of the two.
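One concrete way to see the resource allocation described above is to query the Resource Manager with Hadoop's YarnClient API, as in the sketch below. It assumes the configuration points at a running Resource Manager (for example via yarn-site.xml on the classpath).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnInfoExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();   // reads yarn-site.xml if present

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Node Managers report their capacity (CPU, memory) to the Resource Manager.
            for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
            }

            // Applications currently known to the Resource Manager.
            for (ApplicationReport app : yarnClient.getApplications()) {
                System.out.println(app.getApplicationId() + " -> " + app.getName());
            }

            yarnClient.stop();
        }
    }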
DATA PROCESSING COMPONENT OF HADOOP
YARN- KEY BENEFITS
Key benefits of YARN:
 Improved cluster utilization
 Highly scalable
 Beyond Java
 Novel programming models & services
 Agility
DATA ACCESS COMPONENT OF HADOOP
HIVE
 With the help of an SQL methodology and interface, HIVE performs
reading and writing of large data sets. Its query language is called
HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time and batch
processing. Also, all the SQL datatypes are supported by
Hive, making query processing easier.
 HIVE comes with two components: JDBC Drivers and the HIVE
Command Line.
 JDBC, along with ODBC drivers, works on establishing the data
storage permissions and connection, whereas the HIVE Command Line
helps in the processing of queries.
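Since Hive exposes itself over JDBC as mentioned above, a minimal Java sketch looks like the following; the HiveServer2 host, credentials and the web_logs table are assumptions made purely for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Placeholder host/port/database; adjust to the actual HiveServer2 endpoint.
            String url = "jdbc:hive2://hiveserver-host:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // HQL looks like SQL; Hive compiles it into MapReduce (or Tez/Spark) jobs.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
                    }
                }
            }
        }
    }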
DATA ACCESS COMPONENT OF HADOOP
HIVE
DATA ACCESS COMPONENT OF HADOOP
PIG
 Pig was developed by Yahoo. It works on the Pig Latin language,
which is a query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and
analyzing huge data sets.
 Pig does the work of executing commands, and in the background
all the activities of MapReduce are taken care of. After the
processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and
runs on the Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and
hence is a major segment of the Hadoop Ecosystem.
DATA ACCESS COMPONENT OF HADOOP
PIG
(Diagram: a Pig script, written in Pig Latin to express data flows and
registering UDFs present in the local file system, is passed to the Pig
Latin compiler, which produces a MapReduce job; the execution environment
is either local execution in a single JVM or execution of the MapReduce
job on the cluster, with the input and output files stored in HDFS.)
DATA ACCESS COMPONENT OF HADOOP
MAHOUT
 Mahout brings machine learnability to a system or application.
 Machine Learning helps a system to develop itself based on
some patterns, user/environmental interaction, or on the basis of
algorithms.
 It provides various libraries and functionalities such as
collaborative filtering, clustering, and classification, which are
nothing but concepts of Machine Learning.
 It allows invoking algorithms as per our need with the help of its
own libraries.
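As an illustration of the collaborative-filtering functionality mentioned above, here is a sketch against Mahout's classic "Taste" recommender API (found in the older 0.x releases; newer Mahout versions focus on the Samsara/Spark DSL instead). The ratings.csv file, with lines of the form userID,itemID,preference, is a hypothetical input.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MahoutRecommenderExample {
        public static void main(String[] args) throws Exception {
            // Each line of ratings.csv: userID,itemID,preference
            DataModel model = new FileDataModel(new File("ratings.csv"));

            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 item recommendations for user 1, based on similar users' preferences.
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println("item " + item.getItemID() + " score " + item.getValue());
            }
        }
    }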
DATA ACCESS COMPONENT OF HADOOP
AVRO
 Apache Avro works as a data serialization system. It helps Hadoop
with data serialization and data exchange.
 Avro enables the exchange of big data between programs written in
different languages. It serializes data into files or messages.
 Avro Schema: A schema helps Avro in the serialization and
deserialization process without code generation. Avro needs a
schema to read and write data.
 Dynamic typing: it means serializing and deserializing data
without generating any code. Code generation remains only an
optional optimization, worth implementing mainly for statically
typed languages.
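The dynamic-typing point above can be shown with Avro's GenericRecord API, which serializes and deserializes records from a schema at runtime with no generated classes; the schema and the users.avro file name below are illustrative assumptions.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroGenericExample {
        private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"id\",\"type\":\"int\"}]}";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            File file = new File("users.avro");

            // Serialize: build a record dynamically from the schema, no generated code.
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("id", 1);
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, file);
                writer.append(user);
            }

            // Deserialize: the schema travels with the file, so reading needs no code either.
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord record : reader) {
                    System.out.println(record.get("name") + " / " + record.get("id"));
                }
            }
        }
    }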
DATA ACCESS COMPONENT OF HADOOP
SQOOP
 Sqoop works as a front-end loader for Big Data.
 Sqoop is a front-end interface that enables moving bulk data
between Hadoop, relational databases, and variously structured
data marts.
 Sqoop replaces the function of ‘developing scripts’ to import
and export data. It mainly helps in moving data from an
enterprise database to a Hadoop cluster for performing the ETL
process.
 Sqoop fulfills the growing need to transfer data from the
mainframe to HDFS.
 Sqoop helps in achieving improved compression and light-weight
indexing for advanced query performance.
DATA ACCESS COMPONENT OF HADOOP
SQOOP
 It provides the facility to transfer data in parallel for effective
performance and optimal system utilization.
 Sqoop creates fast data copies from an external source into
Hadoop.
 It acts as a load balancer by mitigating extra storage and
processing loads to other devices.
DATA ACCESS COMPONENT OF HADOOP
SQOOP: AS AN ETL
(Diagram: Sqoop sits between an RDBMS (MySQL, Oracle, etc.) and the
Hadoop file system (HDFS, HIVE, etc.), importing data from the RDBMS
into Hadoop and exporting data from Hadoop back to the RDBMS.)
DATA MANAGEMENT COMPONENT OF HADOOP
OOZIE
 Apache Oozie is a tool in which all sorts of programs can be
pipelined in a required manner to work in Hadoop's distributed
environment.
 Oozie works as a scheduler system to run and manage Hadoop
jobs.
 Oozie allows combining multiple complex jobs to be run in
sequential order to achieve the desired output.
 It is strongly integrated with the Hadoop stack, supporting various
jobs like Pig, Hive, and Sqoop, as well as system-specific jobs like
Java and Shell.
 Oozie is an open source Java web application.
DATA MANAGEMENT COMPONENT OF HADOOP
OOZIE
Oozie consists of two kinds of jobs:
 Oozie workflow: It is a collection of actions arranged to perform
the jobs one after another. It is just like a relay race, where one
runner has to start right after the previous one finishes, to
complete the race.
 Oozie Coordinator: It runs workflow jobs based on the
availability of data and predefined schedules.
DATA MANAGEMENT COMPONENT OF HADOOP
FLUME
 Apache Flume is a tool/service/data-ingestion mechanism for
collecting, aggregating, and transporting large amounts of
streaming data, such as log files and events, from various
sources to a centralized data store.
 Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from
various web servers to HDFS.
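Besides pulling from web servers, applications can push events into a Flume agent programmatically through Flume's Java RPC client, as in this sketch; the agent host and port are placeholders and assume an agent configured with an Avro source.

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeClientExample {
        public static void main(String[] args) throws Exception {
            // Connects to a Flume agent whose Avro source listens here (placeholder address).
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent-host", 41414);
            try {
                Event event = EventBuilder.withBody(
                        "127.0.0.1 - GET /index.html 200", StandardCharsets.UTF_8);
                client.append(event);   // the agent's channel/sink then moves it towards HDFS
            } finally {
                client.close();
            }
        }
    }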
DATA MANAGEMENT COMPONENT OF HADOOP
ZOOKEEPER
 Apache ZooKeeper is an open source project designed to
coordinate multiple services in the Hadoop ecosystem.
 ZooKeeper performs tasks like synchronization, inter-component
communication, grouping, and maintenance.
 Features of Zookeeper:
 Zookeeper acts fast enough with workloads where reads to
data are more common than writes.
 Zookeeper maintains a record of all transactions.
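A minimal ZooKeeper client sketch in Java, showing the kind of small, read-heavy coordination data the features above refer to; the ensemble address and the znode path/contents are illustrative assumptions.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);

            // Placeholder ensemble address; a real cluster usually lists 3 or 5 servers.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Znodes act as small, consistently replicated records that components
            // use for coordination (configuration, locks, leader election, ...).
            String path = "/demo/config";               // hypothetical znode path
            if (zk.exists("/demo", false) == null) {
                zk.create("/demo", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            byte[] data = zk.getData(path, false, null);  // reads are cheap and common
            System.out.println(new String(data));

            zk.close();
        }
    }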
OTHER IMPORTANT COMPONENTS OF HADOOP
 Solr, Lucene: These are two services that perform the task of
searching and indexing with the help of Java libraries. Lucene in
particular is a Java library that also provides a spell-check
mechanism; Solr is built around and driven by Lucene.
 Spark
 It’s a platform that handles all the process-consumptive tasks
like batch processing, interactive or iterative real-time
processing, graph conversions, and visualization, etc.
 It uses in-memory resources, thus being faster than
disk-based MapReduce processing in terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop
MapReduce is best suited for structured data or batch
processing; hence both are used in most companies
interchangeably.
BIG DATA & HADOOP SECURITY
 Knox provides a framework for managing security and supports
security implementations on Hadoop clusters.
 Knox is a REST API gateway developed within the Apache
community to support monitoring, authorization management,
auditing, and policy enforcement on Hadoop clusters.
 Knox provides a single access point for all REST interactions with
clusters.
 Through Knox, system administrators can manage authentication
via LDAP and Active Directory, conduct HTTP header-based
federated identity management, and audit hardware on the
clusters.
 Knox supports enhanced security because it can integrate with
enterprise identity management solutions and is Kerberos
compatible.
HADOOP SECURITY MANAGEMENT TOOL: KNOX
BIG DATA & HADOOP SECURITY
 Ranger provides a centralized framework that can be used to
manage policies at the resource level, such as files, folders,
databases, and even specific rows and columns within
databases.
 Ranger helps administrators implement access policies by group,
data type, etc.
 Ranger has different authorization functionality for different
Hadoop components such as YARN, HBase, Hive, etc.
HADOOP SECURITY MANAGEMENT TOOL: RANGER
BIG DATA & HADOOP SECURITY
 In core Hadoop technology, HDFS has directories called
encryption zones. When data is written to Hadoop, it is
automatically encrypted (with a user-selected algorithm) and
assigned to an encryption zone.
 Encryption is file specific, not zone specific. That means each file
within the zone is encrypted with its own unique data encryption
key (DEK).
 Clients retrieve an encrypted data encryption key (EDEK) from
HDFS, decrypt it to obtain the DEK, and then use the DEK to read
and write data.
 Encryption zones and DEK encryption occur between the file
system and database levels of the architecture.
HADOOP ENCRYPTION