Haden Pereira
Data Engineer, Applications Work Group @ EMC
5+ Years Experience in the Big Data Space
Quick Survey
How many Programmers/Developers?
Quick Survey
How many SQL Developers?
Quick Survey
How many Application Developers
(Java, C#, etc.)?
Quick Survey
How many System Administrators
(Database, Tomcat, etc.)?
Quick Survey
How many of you have heard of Hadoop?
Quick Survey
How many of you have hands-on experience
with Hadoop?
Quick Survey
How many of you have worked with any of
the NoSQL tools?
Cassandra, MongoDB, Elasticsearch
Hadoop
A quick walkthrough
What is Hadoop?
Hadoop is an open-source framework for
large-scale data storage and processing.
Why Hadoop?
• Traditional data processing was done on large single systems.
• Every time the need for better performance arose, the old computer was
replaced with a better one.
• Scaling up was expensive.
• Scaling was also limited to the maximum resources available on a
single system.
How does Hadoop Scale?
• "Scale out", rather than "scale up"
• If the data set or processing requirement grows, add one more
server.
• Eliminates the strategy of growing computing capacity by throwing
more expensive hardware at the problem.
Core Components of Hadoop
Hadoop v1 - HDFS & Map/Reduce
Hadoop v2 - HDFS & YARN
HDFS
Distributed: Data is growing faster than the capacity of a single storage
disk, so a cluster of disks distributed over a network is necessary.
Scalable: Extends to handle growing data requirements.
Fault-Tolerant: Replication protects against the increased failure
probability that comes with a large number of disks.
HDFS
[Diagram sequence: six servers of 1 TB each, for a total capacity of 6 TB. A 300 MB File.txt is split into three 100 MB blocks (F1, F2, F3), and each block is written to a different server. Each block is then replicated until three replicas (R1, R2, R3) of every block sit on separate servers. The final slide shows an extra copy of F3-R2 appearing, illustrating how a lost replica is re-created elsewhere after a server failure.]
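The splitting and replication shown above can be sketched in a few lines of Python. This is an illustrative simulation, not HDFS code; the block size, replication factor and round-robin placement are taken from the diagrams, while the function names are invented:

```python
# Illustrative sketch (not real HDFS code): split a file into fixed-size
# blocks and place three replicas of each block on distinct servers,
# round-robin, as in the diagram above.
from itertools import cycle

BLOCK_SIZE = 100          # MB, matching the 100 MB blocks on the slides
REPLICATION = 3
SERVERS = [f"Server {i}" for i in range(1, 7)]

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    """Return block labels F1, F2, ... covering the file."""
    n = -(-file_size_mb // block_size)   # ceiling division
    return [f"F{i}" for i in range(1, n + 1)]

def place_replicas(blocks, servers=SERVERS, replication=REPLICATION):
    """Assign each replica to a server, never twice on the same server."""
    placement = {}
    server_cycle = cycle(servers)
    for block in blocks:
        chosen = []
        while len(chosen) < replication:
            s = next(server_cycle)
            if s not in chosen:
                chosen.append(s)
        placement[block] = chosen
    return placement

blocks = split_into_blocks(300)          # -> ['F1', 'F2', 'F3']
print(place_replicas(blocks))
```

With six servers, F1 lands on Servers 1–3 and F2 on Servers 4–6, mirroring the diagram; real HDFS placement also considers racks and free space, which this sketch ignores.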
Map Reduce
A framework for writing applications that process large amounts of
structured and unstructured data in parallel, across clusters of
thousands of machines, in a reliable, fault-tolerant manner.
Map Reduce
File.txt
300 MB
….. , ….. , ….. , ….. , ….. , ….. , ….. , 654 , INR
….. , ….. , ….. , ….. , ….. , ….. , ….. , 432 , AED
….. , ….. , ….. , ….. , ….. , ….. , ….. , 573 , USD
….. , ….. , ….. , ….. , ….. , ….. , ….. , 948 , EUR
….. , ….. , ….. , ….. , ….. , ….. , ….. , 392 , GBP
CSV file with around 1 million lines
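The model can be sketched in plain Python: a map step emits (currency, amount) pairs from rows shaped like the ones above, a shuffle groups them by key, and a reduce sums each group. This is a toy illustration of the programming model, not the Hadoop API, and the row values are invented:

```python
# Toy map/shuffle/reduce over CSV-like rows ending in "amount, currency".
from collections import defaultdict

rows = [
    "..., 654, INR",
    "..., 432, AED",
    "..., 573, USD",
    "..., 120, INR",
]

def map_phase(row):
    """Emit a (currency, amount) pair from one CSV row."""
    *_, amount, currency = [field.strip() for field in row.split(",")]
    return (currency, int(amount))

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum each currency's amounts."""
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map(map_phase, rows)))
print(totals)  # {'INR': 774, 'AED': 432, 'USD': 573}
```

In Hadoop the map and reduce functions run on different machines and the shuffle moves data over the network; the structure of the computation is the same.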
Map Reduce
• 1 × 300 MB file: 1 hour to process
• 2 × 150 MB files, processed in parallel: 1/2 hour
• 4 × 75 MB files, processed in parallel: 1/4 hour
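The halving pattern is just total work divided across parallel workers. A minimal sketch, assuming the fixed per-server rate of 300 MB/hour implied by the slides:

```python
# Processing time when an equal chunk of the file goes to each server.
RATE_MB_PER_HOUR = 300   # one server processes 300 MB in 1 hour (from the slides)

def processing_hours(file_mb, n_servers):
    """Hours to process the file with chunks handled in parallel."""
    chunk = file_mb / n_servers
    return chunk / RATE_MB_PER_HOUR

assert processing_hours(300, 1) == 1.0
assert processing_hours(300, 2) == 0.5
assert processing_hours(300, 4) == 0.25
```

Real jobs do not scale perfectly linearly (shuffle and coordination add overhead), but the slides' point stands: more servers, less wall-clock time.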
Map Reduce
[Diagram sequence: the same six 1 TB servers holding block replicas F1-R1 through F3-R3. Map tasks are dispatched to the servers holding the blocks, and each task writes a partial result (P1-R1, P2-R1, P3-R1) on the server where its block resides.]
Map Reduce
• Re-runs tasks in case of server failures
• Distributes tasks evenly
• Tries to run each task on the server where its data block resides
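The locality-first rule in the last bullet can be sketched as a scheduling function: prefer a free server that already holds a replica of the block, otherwise fall back to any free server (which implies a remote read). Everything here, including the server names, is invented for illustration:

```python
# Sketch of locality-first task scheduling: data-local if possible,
# any free server otherwise.
all_servers = ["Server 1", "Server 2", "Server 3", "Server 4"]

def schedule_task(block, replica_locations, busy_servers):
    """Return the server that should run the map task for `block`."""
    local = [s for s in replica_locations[block] if s not in busy_servers]
    if local:
        return local[0]                      # data-local execution
    free = [s for s in all_servers if s not in busy_servers]
    return free[0] if free else None         # remote read as last resort

replicas = {"F1": ["Server 1", "Server 2", "Server 3"]}

print(schedule_task("F1", replicas, busy_servers={"Server 1"}))
# -> Server 2: a replica holder that is free
print(schedule_task("F1", replicas, busy_servers={"Server 1", "Server 2", "Server 3"}))
# -> Server 4: no replica holder free, so the data is read remotely
```

Hadoop's actual scheduler also considers rack locality (a server in the same rack as a replica) between these two cases.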
YARN
Multi-tenancy - YARN allows multiple access engines (open source or
proprietary) to use Hadoop as a common standard, so batch, interactive and
real-time engines can simultaneously access the same data set.
Cluster utilization - YARN's dynamic allocation of cluster resources improves
utilization over the more static Map Reduce rules used in early versions of Hadoop.
Scalability - Data center processing power continues to expand rapidly. YARN's
Resource Manager focuses exclusively on scheduling and keeps pace as clusters
expand to thousands of nodes managing petabytes of data.
Compatibility - Existing Map Reduce applications developed for Hadoop 1 can run
on YARN without any disruption to processes that already work.
YARN
[Architecture diagram omitted]
Hadoop Ecosystem
Pig (scripting): Platform for analyzing large data sets. It comprises a high-
level language (Pig Latin) that is translated to Map Reduce, cutting down the
code to be written. Ideal for extract-transform-load (ETL) data pipelines,
research on raw data, and iterative processing of data.
Hive (SQL): Provides data warehouse infrastructure, enabling data
summarization, ad-hoc query and analysis of large data sets. The query
language, HiveQL (HQL), is similar to SQL.
HCatalog (SQL): Table and storage management layer that provides users of
Pig, MapReduce and Hive with a relational view of data in HDFS. Provides REST
APIs so that external systems can access these tables' metadata.
Hadoop Ecosystem
Ambari: Provides an open operational framework for provisioning, managing
and monitoring Hadoop clusters.
Zookeeper: Provides a distributed configuration service, a synchronization
service and a naming registry for distributed systems.
Oozie: Enables Hadoop administrators to build complex data transformations out
of multiple component tasks, enabling greater control over complex jobs and
making it easier to schedule repetitions of those jobs.
Hadoop Ecosystem
Tez: Leverages the MapReduce paradigm to enable the creation and execution of
more complex directed acyclic graphs (DAGs) of tasks. Tez eliminates unnecessary
tasks, synchronization barriers and reads from and writes to HDFS, speeding up
data processing across both small-scale/low-latency and large-scale/high-
throughput workloads.
Spark: Fast and general in-memory processing engine that uses YARN as a
framework for deployment and can read/write data from HDFS.
Hadoop Ecosystem
Sqoop: Tool designed to transfer data between Hadoop and relational database
servers.
HBase (NoSQL): Non-relational database that provides random real-time access
to data in very large tables. HBase provides transactional capabilities to Hadoop,
allowing users to conduct updates, inserts and deletes.
Flume: Distributed, reliable and available service for efficiently collecting,
aggregating and moving large amounts of streaming data into HDFS.