Apache Hadoop
- Large Scale Data Processing
Sharath Bandaru & Sai Dinesh Koppuravuri
Advanced Topics Presentation
ISYE 582: Engineering Information Systems
Overview
• Understanding Big Data
• Structured/Unstructured Data
• Limitations Of Existing Data Analytics Architecture
• Apache Hadoop
• Hadoop Architecture
• HDFS
• Map Reduce
• Conclusions
• References
Understanding Big Data
• Big data is creating large and growing files.
• These files are measured in terabytes (10^12 bytes) and petabytes (10^15 bytes).
• The data is largely unstructured.
Structured/Unstructured Data
Why now? Data Growth
[Figure: data growth from 1980 to 2013. Structured data: 20%. Unstructured data: 80%.]
Source: Cloudera, 2013
Challenges posed by Big Data
• Velocity: 400 million tweets a day on Twitter; 1 million transactions by Wal-Mart every hour
• Volume: 2.5 petabytes created by Wal-Mart transactions in an hour
• Variety: videos, photos, text messages, images, audio, documents, emails, etc.
Limitations Of Existing Data Analytics Architecture
Layers of the existing stack, from reports down to instrumentation:
• BI Reports + Interactive Apps
• RDBMS (aggregated data)
• ETL Compute Grid
• Storage Only Grid (original raw data)
• Collection
• Instrumentation
Limitations:
• Moving data to compute doesn’t scale
• Can’t explore original high fidelity raw data
• Archiving = premature data death
So What is Apache Hadoop?
• A set of tools that supports running applications on big data.
• Core Hadoop has two main systems:
- HDFS: self-healing, high-bandwidth clustered storage.
- Map Reduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
History
Source : Cloudera, 2013
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place, which transforms the data into the database's internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
Pros:
• Read is fast
• Standards/Governance

Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
Pros:
• Load is fast
• Flexibility/Agility
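The schema-on-read idea can be illustrated with a minimal sketch. It is not Hadoop's or Hive's SerDe API: the `RAW_STORE` list, the `read_columns` function, and the record fields are all hypothetical names chosen for illustration. Raw lines are stored untouched, and columns are extracted only when the data is read.

```python
import json

# Raw records are kept exactly as they arrived (no load-time transformation).
# The records and field names are made up for this example.
RAW_STORE = [
    '{"user": "ann", "amount": 12.5}',
    '{"user": "bob", "amount": 3.0, "coupon": "SPRING"}',  # a new field starts flowing
]


def read_columns(raw_lines, columns):
    """Hypothetical deserializer: parse each line at read time and
    return only the requested columns (missing fields become None)."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({name: record.get(name) for name in columns})
    return rows


# The first version of the reader only knows two columns.
v1_rows = read_columns(RAW_STORE, ["user", "amount"])

# Once the reader is updated, "coupon" appears retroactively,
# even for data that was stored before the update.
v2_rows = read_columns(RAW_STORE, ["user", "amount", "coupon"])

if __name__ == "__main__":
    print(v1_rows)
    print(v2_rows)
```

Nothing in the store had to be reloaded when the new column was added; only the reader changed, which is the late-binding trade-off listed above.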
Use The Right Tool For The Right Job
Relational Databases: use when:
• Interactive OLAP analytics (< 1 sec)
• Multistep ACID transactions
• 100% SQL compliance

Hadoop: use when:
• Structured or not (flexibility)
• Scalability of storage/compute
• Complex data processing
Traditional Approach
[Figure: the enterprise approach, in which big data is fed to one powerful computer until it reaches its processing limit.]
Hadoop Architecture
[Figure, shown in four build steps: a master node running the Job Tracker and the Name Node (alongside its own Task Tracker and Data Node), and slave nodes that each run a Task Tracker and a Data Node. The Job Tracker and Task Trackers form the Map Reduce layer; the Name Node and Data Nodes form the HDFS layer. In the later steps, an application submits its work to the Job Tracker.]
HDFS: Hadoop Distributed File System
• A given file is broken into blocks (default = 64 MB), and the blocks are then replicated across the cluster (default replication factor = 3).
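[Figure: a file split into blocks 1 to 5 and stored in HDFS on five nodes holding blocks (3, 4, 5), (1, 2, 5), (1, 3, 4), (2, 4, 5), and (1, 2, 3), so that every block has three replicas.]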
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block replication for:
• Durability
• Availability
• Throughput
Block replicas are distributed across servers and racks.
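The block and replica arithmetic can be sketched as follows. The 64 MB block size and replication factor of 3 come from the slide; the round-robin placement, the node names, and the 300 MB file are assumptions made for illustration. Real HDFS uses a rack-aware placement policy, which this sketch does not attempt to reproduce.

```python
import math

BLOCK_SIZE_MB = 64       # default block size stated on the slide
REPLICATION_FACTOR = 3   # default replication factor stated on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; the last block may be smaller."""
    count = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * count
    if count:
        sizes[-1] = file_size_mb - block_size_mb * (count - 1)
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assumed round-robin placement (NOT HDFS's rack-aware policy):
    each block is assigned to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(300)  # a hypothetical 300 MB file
layout = place_replicas(len(blocks), ["n1", "n2", "n3", "n4", "n5"])
raw_storage_mb = sum(blocks) * REPLICATION_FACTOR

if __name__ == "__main__":
    print(blocks)
    print(layout)
    print(raw_storage_mb)
```

The point of the sketch is the cost model: a file occupies its own size times the replication factor in raw cluster storage, in exchange for the durability and availability listed above.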
Fault Tolerance for Data
[Figure: the master/slave cluster diagram, labeled HDFS: the Name Node on the master and a Data Node on every node.]
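One way to picture the "self-healing" behavior is the sketch below: when a node is lost, blocks with fewer than three live replicas are copied to other live nodes. This only illustrates the idea and is not HDFS's actual algorithm. The function names, node names, block layout, and the "pick nodes in sorted order" policy are all assumptions.

```python
def under_replicated(placement, live_nodes, replication=3):
    """Return {block: missing_replica_count} for blocks with too few live replicas."""
    missing = {}
    for block, holders in placement.items():
        alive = [node for node in holders if node in live_nodes]
        if len(alive) < replication:
            missing[block] = replication - len(alive)
    return missing


def re_replicate(placement, live_nodes, replication=3):
    """Copy each under-replicated block to live nodes that do not hold it yet.
    Hypothetical policy: choose candidate nodes in sorted order."""
    healed = {}
    for block, holders in placement.items():
        alive = [node for node in holders if node in live_nodes]
        candidates = sorted(node for node in live_nodes if node not in alive)
        needed = max(0, replication - len(alive))
        healed[block] = alive + candidates[:needed]
    return healed


# Example layout: five blocks, three replicas each, on nodes A to E.
placement = {
    1: ["B", "C", "E"],
    2: ["B", "D", "E"],
    3: ["A", "C", "E"],
    4: ["A", "C", "D"],
    5: ["A", "B", "D"],
}
live = {"A", "B", "C", "D"}  # node "E" has failed

before = under_replicated(placement, live)
healed = re_replicate(placement, live)
after = under_replicated(healed, live)

if __name__ == "__main__":
    print(before)
    print(healed)
    print(after)
```

Because every block still has two live replicas after a single node failure, no data is lost, and the cluster can restore the third copy in the background.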
Fault Tolerance for Processing
[Figure: the master/slave cluster diagram, labeled Map Reduce: the Job Tracker on the master and a Task Tracker on every node.]
[Figure: the same diagram with the note "Tables are backed up".]
Map Reduce
[Figure: Input Data flows into five parallel Map tasks, then through Shuffle into two Reduce tasks, which produce the Results.]
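The map, shuffle, and reduce stages can be sketched in plain Python with a word count. This is a single-process illustration of the data flow, not Hadoop's Java API. The function names and the two input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: collapse one key's list of values into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three stages in order on a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["big data is big", "data is growing"])

if __name__ == "__main__":
    print(counts)
```

In a real cluster each call to `map_phase` could run on a different node, close to the block it reads, which is why the stages are kept free of shared state.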
Understanding the concept of Map Reduce
The Story Of Sam
[Figure: Mother, Sam, and an apple.]
• Sam's mother believed “an apple a day keeps a doctor away”
• Sam thought of “drinking” the apple
• He used a knife to cut the apple and a blender to make juice.
Next day
• Sam applied his invention to all the fruits he could find in the fruit basket
• (map ‘( )’)
• (reduce ‘( )’)
Classical notion of Map Reduce in functional programming: a list of values is mapped into another list of values, which is then reduced into a single value.
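The functional-programming notion can be shown with Python's built-in `map` and `functools.reduce`. The basket contents and the juice yields per fruit are hypothetical numbers chosen only to make the example concrete.

```python
from functools import reduce

# Hypothetical juice yield per fruit, in millilitres (illustrative numbers only).
JUICE_ML = {"apple": 120, "orange": 90, "pear": 100}

basket = ["apple", "orange", "pear", "apple"]

# map: apply the same function to every element, producing a new list.
yields = list(map(lambda fruit: JUICE_ML[fruit], basket))

# reduce: fold the list into a single value.
total_ml = reduce(lambda acc, ml: acc + ml, yields, 0)

if __name__ == "__main__":
    print(yields)
    print(total_ml)
```

Here `map` plays the role of the knife (one operation per fruit) and `reduce` the role of the blender (many pieces in, one result out).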
18 Years Later
• Sam got his first job at “Tropicana” for his expertise in making juices.
• Now, it’s not just one basket but a whole container of fruits
• Also, they produce a list of juice types separately
• In other words: large data, and a list of values for output
• But Sam had just ONE knife and ONE blender. NOT ENOUGH!!
Wait!
Brave Sam
• Implemented a parallel version of his innovation
• Each input to a map is a list of <key, value> pairs: (<a, > , <o, > , <p, > , …)
• Each output of a map is a list of <key, value> pairs: (<a’, > , <o’, > , <p’, > , …)
• The map outputs are grouped by key
• Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <a’, ( …)>
• Each input is reduced into a list of values
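The grouping/hashing step can be sketched as follows: each key is hashed to pick a reducer, so all values for one key reach the same reducer, and one reducer may receive several <key, value-list> inputs. The CRC32 hash, the function names, and the sample pairs are illustrative assumptions, not Hadoop's partitioner.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Assign a key to a reducer with a stable hash (CRC32, an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def group_for_reducers(pairs, num_reducers):
    """Build, for each reducer, a mapping of key -> value-list."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [dict(bucket) for bucket in buckets]


# Hypothetical map output: one <key, value> pair per piece of fruit.
map_output = [("apple", 1), ("orange", 1), ("apple", 1), ("pear", 1), ("orange", 1)]

reducer_inputs = group_for_reducers(map_output, num_reducers=2)

# Each reducer runs separately on every <key, value-list> it received.
totals = {
    key: sum(values)
    for bucket in reducer_inputs
    for key, values in bucket.items()
}

if __name__ == "__main__":
    print(reducer_inputs)
    print(totals)
```

A stable hash is used instead of Python's built-in `hash`, whose result for strings can change between runs, so a given key always goes to the same reducer.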
• Sam realized:
– To create his favorite mixed-fruit juice, he can use a combiner after the reducers
– If several <key, value-list> pairs fall into the same group (based on the grouping/hashing algorithm), then use the blender (reducer) separately on each of them
– The knife (mapper) and blender (reducer) should not contain residue after use: they must be side-effect free
Source: Ekanayake, 2010.
Conclusions
• The key benefits of Apache Hadoop:
1) Agility/Flexibility (Quickest Time to Insight)
2) Complex Data Processing (Any Language, Any Problem)
3) Scalability of Storage/Compute (Freedom to Grow)
4) Economical Storage (Keep All Your Data Alive Forever)
• The key systems for Apache Hadoop are:
1) Hadoop Distributed File System: self-healing, high-bandwidth clustered storage.
2) Map Reduce: distributed, fault-tolerant resource management coupled with scalable data processing.
References
• Ekanayake, S. (2010, March). Map Reduce: The Story Of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html
• Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.
• The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/
• Drost, I. (2010, February). Apache Hadoop: Large Scale Data Analysis Made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8
• Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI
