Overview of Hadoop and HDFS
www.oralytics.com 
t : @brendantierney 
e : brendan.tierney@oralytics.com 

 

Introduction, Background to Hadoop and HDFS
Brendan Tierney

What is Big Data?
O’Reilly Radar definition:
•  Big data is when the size of the data itself becomes part of the problem
EMC/IDC definition:
•  Big Data technologies describe a new generation of technologies and
architectures, designed to economically extract value from very large
volumes of a wide variety of data, by enabling high velocity capture,
discovery and/or analysis
McKinsey definition:
•  Big Data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyse
http://www.oreilly.com/data/free/big-data-now-2012.csp
http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf

Big Data
Some companies continue to generate large amounts of data:
•  Facebook ~ 6 billion messages per day
•  eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage
•  Satellite Images by Skybox Imaging ~ 1 Terabyte per day
•  These numbers were probably out of date before I finished writing this slide
Important: this applies to some companies, not all companies
Big Data is part of their data management architecture. It will not replace existing DBs etc.

Basic idea
•  The basic idea behind the phrase Big Data is that everything we do is increasingly
leaving a digital trace (data) which we can use and analyse
•  Big Data therefore refers to our ability to make use of ever increasing volumes of
data
Traditional data storage methods can be a challenge! Why?

Big Data

2013

2014
Where is Predictive Analytics?

2015

Hadoop
•  Existing tools were not designed to handle such large amounts of data
•  “The Apache™ Hadoop™ project develops open-source software for reliable,
scalable, distributed computing.”
•  http://hadoop.apache.org
•  Process Big Data on clusters of commodity hardware
•  Vibrant open-source community
•  Many products and tools reside on top of Hadoop

Who is using Hadoop in Ireland ?
Big websites

Big telcos

Big Banks

Big Financial

CERN

Big ….

Access Speeds?
1990: typical drive ~1370MB, transfer speed ~4.4MB/s, read the whole drive in ~5 mins
2010: typical drive ~1TB, transfer speed ~100MB/s, read the whole drive in ~2.5 hrs
Hadoop: 100 drives working at the same time can read 1TB of data in 2 minutes
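
The arithmetic behind these figures can be checked with a short sketch. The function name `read_time_seconds` is made up for illustration, decimal units are assumed (1TB = 1,000,000MB), and the data is assumed to be spread evenly across the drives.

```python
def read_time_seconds(size_mb: float, rate_mb_per_s: float, drives: int = 1) -> float:
    """Time to read `size_mb` when the data is spread evenly over `drives`
    disks that are all read in parallel at `rate_mb_per_s` each."""
    return size_mb / (rate_mb_per_s * drives)


# 1990: ~1370MB drive at ~4.4MB/s
t_1990 = read_time_seconds(1370, 4.4)
# 2010: ~1TB drive at ~100MB/s (assuming 1TB = 1,000,000MB)
t_2010 = read_time_seconds(1_000_000, 100)
# 100 drives, each holding 1/100 of the same 1TB
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990 drive: {t_1990 / 60:.1f} minutes")
print(f"2010 drive: {t_2010 / 3600:.1f} hours")
print(f"100 drives: {t_parallel / 60:.1f} minutes")
```

Under these assumptions the single 2010 drive needs roughly 2.8 hours, a little more than the rounded slide figure, while 100 drives finish in under two minutes.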

Scaling issue
•  Scale-Up
•  Harder and more expensive to do (“It Depends” still applies)
•  Adds resources (CPU, RAM) to an existing node
•  Moore’s Law can’t keep up with data growth
•  New units must be purchased if the required resources cannot be added
•  Also known as scaling vertically
•  Scale-Out
•  Adds more nodes/machines to an existing distributed application
•  The software layer is designed for nodes to be added or removed
•  Hadoop takes this approach: a set of nodes is bonded together as a single
distributed system
•  Very easy to scale down as well

Hadoop Principles
•  Scale-Out rather than Scale-Up
•  Bring code to data rather than data to code
•  Deal with failures – they are common
•  Abstract complexity of distributed and concurrent applications
•  Self managing
•  Auto parallel processing

Big Data – Example Applications
Not all of these are using Hadoop or require Hadoop!

Hadoop Cluster
•  A set of "cheap" commodity hardware
•  Networked together
•  Resides in the same location
•  Set of servers in a set of racks in a data center
•  “Cheap” Commodity Server Hardware
•  No need for super-computers, use commodity unreliable hardware
•  Not desktops
Yes, you can build a Hadoop cluster using Raspberry Pis

Abstracting Complexity
•  Distributed Computing is HARD WORK
•  Hadoop abstracts many complexities in distributed and concurrent applications
•  Defines a small number of components
•  Provides simple and well-defined interfaces for interactions between these
components
•  Frees the developer from worrying about system-level challenges
•  race conditions, data starvation
•  processing pipelines, data partitioning, code distribution, etc.
•  Allows developers to focus on application development and business logic

Hadoop vs RDBMS
•  Always keep the phrase “It Depends” in mind when discussing Big Data
•  Hadoop != RDBMS
•  Hadoop will not replace RDBMS
•  Hadoop is part of your data management architecture
•  and only if it is needed!

                      RDBMS                      Hadoop
Data size             Gigabytes                  Petabytes
Access                Interactive & Batch        Batch
Updates               Read & write many times    Write once, read many times
Integrity             High                       Low
Scaling               Non Linear                 Linear
Data representation   Structured                 Unstructured, semi-structured

Current trends for Hadoop

Working together
•  Hadoop and RDBMS frequently complement each other within an architecture
•  For example, a website that
•  has a small number of users
•  produces a large amount of audit logs

Hadoop Ecosystem

Hadoop Distributions
•  Large number of independent products (Apache projects) 
•  Can be challenging to get all/some of these to work together
•  We will be working with Hadoop, installing and using some products
•  Hadoop Distributions aim to resolve version incompatibilities
•  Distribution Vendor will
•  Integration Test a set of Hadoop products
•  Package Hadoop products in various installation formats
•  Linux Packages, tarballs, etc.
•  Distributions may provide additional scripts to execute Hadoop
•  Some vendors may choose to backport features and bug fixes made by Apache
•  Typically vendors will employ Hadoop committers so the bugs they find will make it
into Apache’s repository

Hadoop Distributions
•  Cloudera Distribution for Hadoop (CDH)
•  Check out the pre-built VM with most of Cloudera products (Hadoop, etc)
•  http://www.cloudera.com/downloads/quickstart_vms/5-8.html
•  MapR Distribution
•  Check out the MapR Sandbox VM
•  https://www.mapr.com/products/mapr-sandbox-hadoop 
•  Hortonworks Data Platform (HDP)
•  Check out the Hortonworks Sandbox VM
•  http://hortonworks.com/products/sandbox/ 
•  Oracle Big Data Appliance
•  Check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed
and configured for you to use
•  http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html

Hadoop - “move-code-to-data” approach
•  Data is distributed among the nodes as it is initially stored in the system
•  Data is replicated multiple times on the system for increased reliability & availability
•  Master allocates work to nodes 
•  Computation happens on the nodes where the data is stored - data locality
•  Nodes work in parallel each on their own part of the overall dataset
•  Nodes are independent and self-sufficient - shared-nothing architecture
•  If a node fails, master detects the failure and re-assigns work to other nodes
•  If a failed node restarts, it is automatically added back into the system and
assigned new tasks
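
As a rough illustration of the points above, the sketch below has a "master" assign one task per block to a node that already stores a replica of it, then re-plan after a node fails. The block and node names and the `assign_tasks` function are hypothetical, and real Hadoop scheduling is considerably more involved than this.

```python
from collections import defaultdict

# Hypothetical cluster state: block id -> nodes holding a replica of that block.
block_locations = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
    "blk_3": ["node-a", "node-c", "node-d"],
    "blk_4": ["node-a", "node-b", "node-d"],
}


def assign_tasks(block_locations, live_nodes):
    """Assign one task per block to the least-loaded live node that already
    stores a replica of that block (data locality)."""
    load = defaultdict(int)
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        candidates = [n for n in replicas if n in live_nodes]
        if not candidates:
            raise RuntimeError(f"no live replica for {block}")
        # Pick the least-loaded candidate; ties are broken by node name.
        node = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment


live = {"node-a", "node-b", "node-c", "node-d"}
plan = assign_tasks(block_locations, live)

# Simulate a failure: the master drops the node and re-plans
# using the surviving replicas.
live.discard("node-a")
replan = assign_tasks(block_locations, live)

print(plan)
print(replan)
```

Because every block is replicated, losing one node still leaves each task with a node that holds its data locally.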

HDFS
•  A distributed file system modelled on the Google File System (GFS)

[http://research.google.com/archive/gfs.html]
•  Data is split into blocks, typically 64MB or 128MB in size, spread across many
nodes
•  Works better on large files >= 1 HDFS block in size
•  Each block is replicated to a number of nodes (typically 3)
•  ensures reliability and availability
•  Files in HDFS are write once - no random writes to files allowed
•  HDFS is optimised for large streaming reads of files rather than random access
to files
•  see HIVE later on for more DBMS-type access to HDFS files....
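
To make the block and replication figures concrete, here is a small sketch. The `hdfs_footprint` function is hypothetical, and the 128MB block size and replication factor of 3 are simply the typical values quoted above, not a statement about any particular cluster.

```python
import math


def hdfs_footprint(file_size_mb: int, block_size_mb: int = 128, replication: int = 3):
    """Return (number of blocks, number of block replicas, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies as much disk as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)  # a 1000MB file
print(blocks, replicas, raw_mb)
```

With a 64MB block size the same file would be split into twice as many blocks, while the raw storage needed stays the same.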

HDFS is good for
•  Storing large files
•  Terabytes, Petabytes, etc...
•  Millions rather than billions of files
•  100MB or more per file
•  Streaming data
•  Unstructured data (in practice, data of mixed structure)
•  Write once and read-many times patterns
•  Schema on Read (RDBMS = schema on write)
•  Huge time saving at data write time
•  BUT !!!
•  Optimized for streaming reads rather than random reads
•  “Cheap” Commodity Hardware
•  No need for super-computers, use less reliable commodity hardware
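
Schema on read can be sketched in a few lines: records are stored exactly as they arrive, and a schema is applied only when the data is read. The sample records, field names and the `read_with_schema` helper are all invented for illustration.

```python
import csv
import io

# Raw records are stored exactly as they arrived; no schema is enforced on write.
raw_store = io.StringIO()
raw_store.write("2016-01-04,alice,login\n")
raw_store.write("2016-01-04,bob,purchase,19.99\n")  # extra field is accepted as-is
raw_store.write("2016-01-05,carol,logout\n")


def read_with_schema(raw_text, schema):
    """Apply a schema (list of (name, converter) pairs) while reading.
    Missing trailing fields become None; unexpected extra fields are ignored."""
    for row in csv.reader(io.StringIO(raw_text)):
        record = {}
        for i, (name, convert) in enumerate(schema):
            record[name] = convert(row[i]) if i < len(row) else None
        yield record


# Two different schemas applied to the same stored bytes at read time.
events_schema = [("day", str), ("user", str), ("action", str)]
sales_schema = [("day", str), ("user", str), ("action", str), ("amount", float)]

events = list(read_with_schema(raw_store.getvalue(), events_schema))
sales = [r for r in read_with_schema(raw_store.getvalue(), sales_schema)
         if r["amount"] is not None]

print(len(events), sales)
```

The write path does no validation at all, which is where the time saving comes from; the cost is that every reader has to cope with records that do not fit its schema.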

HDFS is not so good at
•  Low-latency reads
•  High-throughput rather than low latency for small chunks of data
•  HBase and other DBs can address this issue (?)
•  Large numbers of small files
•  Better for millions of large files instead of billions of small files
•  Block size of 128MB or 256MB
•  For example each file can be 100MB or more
•  Multiple Writers
•  Single writer per file
•  Writes only at the end of a file, no support for writing at an arbitrary offset
•  Time needed for replication
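
The small-files point can be illustrated by counting how many file and block entries would need to be tracked for the same volume of data. This is only a sketch: `namenode_objects` is a hypothetical helper, decimal units are assumed (1TB = 1,000,000MB), and all files are assumed to be the same size.

```python
import math


def namenode_objects(total_data_mb: int, file_size_mb: float, block_size_mb: int = 128) -> int:
    """Count the file and block entries that have to be tracked when
    `total_data_mb` is stored as equally sized files."""
    files = math.ceil(total_data_mb / file_size_mb)
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    return files + files * blocks_per_file


one_tb = 1_000_000  # MB, decimal units assumed
small = namenode_objects(one_tb, file_size_mb=1)     # 1MB files
large = namenode_objects(one_tb, file_size_mb=1000)  # 1000MB files

print(small, large)
```

Storing the same terabyte as 1MB files produces a couple of hundred times more entries than storing it as 1000MB files, even though the amount of data is identical.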

HDFS
•  Two types of nodes in an HDFS cluster
•  NameNode - the master node 
•  DataNodes - slave or worker nodes
•  NameNode manages the file system
•  keeps track of the metadata - which blocks make up a file (using 2 files - namespace
image and the edit log)
•  knows on which DataNodes the blocks are stored
•  DataNodes do the work
•  store the blocks
•  retrieve blocks when requested to (by the client or the NameNode)
•  poll and report back to the NameNode periodically with the list of blocks that they are
storing
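
The division of labour above can be sketched with a toy model in which the master keeps the file-to-blocks metadata and learns block locations from the reports the workers send in. `ToyNameNode` and its methods are invented for illustration and are not Hadoop's API.

```python
from collections import defaultdict


class ToyNameNode:
    """Illustrative only: keeps file -> blocks metadata and learns block
    locations from the reports DataNodes send in."""

    def __init__(self):
        self.file_blocks = {}                    # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNode names

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Forget what this DataNode reported previously, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return [(block id, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_file("/logs/audit.log", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
nn.receive_block_report("dn3", ["blk_2"])

print(nn.locate("/logs/audit.log"))
```

Note that the toy NameNode never stores block contents; it only knows which blocks make up a file and which DataNodes last reported holding them.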

HDFS
•  When a client application wants to read a file...
•  it communicates with the NameNode to determine which blocks make up the file,
and on which DataNodes the blocks reside
•  it then communicates directly with the DataNodes
•  NameNode is a single point of failure of a Hadoop system
•  back up periodically to remote NFS (set up as part of Hadoop configuration)
•  use a Secondary NameNode
•  not a standby copy of the NameNode
•  periodically merges the namespace image with the edit log and maintains a copy
HDFS Architecture [from Hadoop in Practice, Alex Holmes]
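
The read path described on this slide can be sketched with in-memory stand-ins: the client first asks for the block list and locations, then fetches each block directly from a node that holds it. The dictionaries, node names and `read_file` function below are hypothetical and do not reflect the real HDFS client API.

```python
# Hypothetical in-memory stand-ins for a NameNode's metadata and DataNodes' disks.
namenode_metadata = {
    "/data/report.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}
datanode_storage = {
    "dn1": {"blk_1": b"Hello, "},
    "dn2": {"blk_1": b"Hello, ", "blk_2": b"HDFS!"},
    "dn3": {"blk_2": b"HDFS!"},
}


def read_file(path, unavailable=frozenset()):
    """Step 1: ask the NameNode which blocks make up the file and where they live.
    Step 2: fetch each block directly from a DataNode that holds it."""
    data = b""
    for block_id, locations in namenode_metadata[path]:
        for dn in locations:
            if dn not in unavailable:
                data += datanode_storage[dn][block_id]
                break
        else:
            # No listed DataNode could serve this block.
            raise IOError(f"no reachable replica of {block_id}")
    return data


print(read_file("/data/report.txt"))
print(read_file("/data/report.txt", unavailable={"dn2"}))
```

The file contents never pass through the NameNode in this sketch, and losing one DataNode does not stop the read because another replica of each block is still reachable.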

Files and Blocks
•  Files are split into blocks (single unit of storage)
•  Managed by NameNode, stored by DataNode
•  Transparent to user
•  Replicated across machines at load time
•  Same block is stored on multiple machines
•  Good for fault-tolerance and access
•  Can lead to inconsistent reads 
•  Default replication is 3
Have you ever experienced inconsistent reads?

HDFS File Writes

HDFS File Reads

Who is using Hadoop in Ireland ?
•  List of Cloudera customers in Ireland
•  Citi
•  Allianz
•  Deutsche Bank
•  Ulster Bank
•  Dun & Bradstreet
•  Ryanair
•  BT
•  Vodafone
•  Novartis
•  Airbnb
•  Dell
•  Intel
•  Rockwell Automation
•  Revenue
•  Adecco
•  Experian
•  M&S

Discuss

Hadoop is not FREE :-)
vs
Hadoop is not FREE :-(

Something to think about

Mais conteúdo relacionado

Mais procurados

Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
jeffturner
 

Mais procurados (20)

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop
HadoopHadoop
Hadoop
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop
HadoopHadoop
Hadoop
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 

Destaque

Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 
shared-ownership-21_FINAL
shared-ownership-21_FINALshared-ownership-21_FINAL
shared-ownership-21_FINAL
Christoph Sinn
 

Destaque (20)

Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
SQL: The one language to rule all your data
SQL: The one language to rule all your dataSQL: The one language to rule all your data
SQL: The one language to rule all your data
 
Predictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable productPredictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable product
 
OUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th JanuaryOUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th January
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle Database
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Random number generators
Random number generatorsRandom number generators
Random number generators
 
Open Canary - novahackers
Open Canary - novahackersOpen Canary - novahackers
Open Canary - novahackers
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
 
Home Arcade setup (NoVA Hackers)
Home Arcade setup (NoVA Hackers)Home Arcade setup (NoVA Hackers)
Home Arcade setup (NoVA Hackers)
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
shared-ownership-21_FINAL
shared-ownership-21_FINALshared-ownership-21_FINAL
shared-ownership-21_FINAL
 

Semelhante a Overview of Hadoop and HDFS

Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
Andrew Brust
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Simplilearn
 

Semelhante a Overview of Hadoop and HDFS (20)

Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop Solution
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Enterprise Hadoop with Hortonworks and Nimble Storage
Enterprise Hadoop with Hortonworks and Nimble StorageEnterprise Hadoop with Hortonworks and Nimble Storage
Enterprise Hadoop with Hortonworks and Nimble Storage
 
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
 
Yahoo! Hack Europe
Yahoo! Hack EuropeYahoo! Hack Europe
Yahoo! Hack Europe
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Level Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationLevel Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop Acceleration
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integration
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
 
Hortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopHortonworks Big Data & Hadoop
Hortonworks Big Data & Hadoop
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 

Último

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 

Último (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx

Overview of Hadoop and HDFS

  • 1. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Introduction, Background to Hadoop and HDFS. Brendan Tierney
  • 2. What is Big Data? O’Reilly Radar definition: • Big data is when the size of the data itself becomes part of the problem EMC/IDC definition: • Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis McKinsey definition: • Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse Sources: http://www.oreilly.com/data/free/big-data-now-2012.csp http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf
  • 3. Big Data Some companies continue to generate large amounts of data: • Facebook ~ 6 billion messages per day • eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage • Satellite images by Skybox Imaging ~ 1 Terabyte per day • These numbers were probably out of date before I finished writing this slide Important: this applies to some companies, not all companies. Big Data is part of their data management architecture. It will not replace existing DBs etc.
  • 4. Basic idea • The basic idea behind the phrase Big Data is that everything we do is increasingly leaving a digital trace (data) which we can use and analyse • Big Data therefore refers to our ability to make use of ever-increasing volumes of data Traditional data storage methods can be a challenge! Why?
  • 5. Big Data
  • 6. 2013
  • 7. 2014 Where is Predictive Analytics?
  • 8. 2015
  • 9. Hadoop • Existing tools were not designed to handle such large amounts of data • "The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing." (http://hadoop.apache.org) • Processes Big Data on clusters of commodity hardware • Vibrant open-source community • Many products and tools reside on top of Hadoop
  • 10. Who is using Hadoop in Ireland? • Big websites • Big telcos • Big banks • Big financial • CERN • Big …
  • 11. Access Speeds? • 1990: typical drive ~1370MB, transfer speed ~4.4MB/s, read the whole drive in ~5 mins • 2010: typical drive ~1TB, transfer speed ~100MB/s, read the whole drive in ~2.5 hrs • Hadoop: 100 drives working at the same time can read 1TB of data in ~2 minutes
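The arithmetic behind the access-speed slide can be checked with a short sketch. It assumes decimal units (1 TB = 1,000,000 MB), data spread evenly across drives, and no seek, network, or coordination overhead; the function name `read_time_seconds` is made up for this illustration.

```python
def read_time_seconds(capacity_mb, mb_per_s, drives=1):
    """Idealised time to scan capacity_mb when `drives` disks read in parallel.

    Assumes the data is split evenly and ignores seek and network overhead.
    """
    return capacity_mb / (mb_per_s * drives)


# 1990: ~1370 MB drive at ~4.4 MB/s
drive_1990 = read_time_seconds(1370, 4.4)

# 2010: ~1 TB drive at ~100 MB/s (1 TB taken as 1,000,000 MB)
drive_2010 = read_time_seconds(1_000_000, 100)

# The same 1 TB spread over 100 drives that all read at once
parallel_2010 = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990 single drive : {drive_1990 / 60:.1f} minutes")
print(f"2010 single drive : {drive_2010 / 3600:.1f} hours")
print(f"2010, 100 drives  : {parallel_2010 / 60:.1f} minutes")
```

Under these assumptions the single 2010 drive takes a little under 3 hours and the 100-drive scan a little under 2 minutes, which is in line with the rounded figures on the slide.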
  • 12. Scaling issue $ $ $ $ ?
  • 13. Scaling issue • Scale-Up • It is harder and more expensive to scale up (“It Depends” needs to be applied) • Add additional resources to an existing node (CPU, RAM) • Moore’s Law can’t keep up with data growth • New units must be purchased if the required resources cannot be added • Also known as scaling vertically • Scale-Out • Add more nodes/machines to an existing distributed application • Software layer is designed for node additions or removal • Hadoop takes this approach: a set of nodes is bonded together as a single distributed system • Very easy to scale down as well
  • 14. Hadoop Principles • Scale-Out rather than Scale-Up • Bring code to data rather than data to code • Deal with failures – they are common • Abstract complexity of distributed and concurrent applications • Self-managing • Auto parallel processing
  • 15. Big Data – Example Applications Not all of these are using Hadoop or require Hadoop!
  • 16. Hadoop Cluster • A set of "cheap" commodity hardware • Networked together • Resides in the same location • Set of servers in a set of racks in a data center • “Cheap” commodity server hardware • No need for super-computers, use unreliable commodity hardware • Not desktops Yes, you can build a Hadoop cluster using Raspberry Pis
  • 17. Abstracting Complexity • Distributed computing is HARD WORK • Hadoop abstracts many complexities in distributed and concurrent applications • Defines a small number of components • Provides simple and well-defined interfaces for interactions between these components • Frees the developer from worrying about system-level challenges • race conditions, data starvation • processing pipelines, data partitioning, code distribution, etc. • Allows developers to focus on application development and business logic
  • 18. Hadoop vs RDBMS • Always keep the phrase “It Depends” in mind when discussing Big Data • Hadoop != RDBMS • Hadoop will not replace RDBMS • Hadoop is part of your data management architecture • and only if it is needed!
  • 19.
                            | RDBMS                   | Hadoop
      Data size             | Gigabytes               | Petabytes
      Access                | Interactive & Batch     | Batch
      Updates               | Read & write many times | Write once, read many times
      Integrity             | High                    | Low
      Scaling               | Non Linear              | Linear
      Data representation   | Structured              | Unstructured, semi-structured
  • 20. Current trends for Hadoop
  • 21. Current trends for Hadoop
  • 22. Current trends for Hadoop
  • 23. Current trends for Hadoop
  • 24.
  • 25. Current trends for Hadoop
  • 26. Working together • Hadoop and RDBMS frequently complement each other within an architecture • For example, a website that • has a small number of users • produces a large amount of audit logs
  • 27. Hadoop Ecosystem
  • 28. Hadoop Ecosystems
  • 29. Hadoop Ecosystems
  • 30. Hadoop Distributions • Large number of independent products (Apache projects) • Can be challenging to get all/some of these to work together • We will be working with Hadoop, installing and using some products • Hadoop Distributions aim to resolve version incompatibilities • Distribution vendor will • Integration-test a set of Hadoop products • Package Hadoop products in various installation formats • Linux packages, tarballs, etc. • Distributions may provide additional scripts to execute Hadoop • Some vendors may choose to backport features and bug fixes made by Apache • Typically vendors will employ Hadoop committers so the bugs they find will make it into Apache’s repository
  • 31. Hadoop Distributions • Cloudera Distribution for Hadoop (CDH) • Check out the pre-built VM with most of Cloudera products (Hadoop, etc) • http://www.cloudera.com/downloads/quickstart_vms/5-8.html • MapR Distribution • Check out the MapR Sandbox VM • https://www.mapr.com/products/mapr-sandbox-hadoop • Hortonworks Data Platform (HDP) • Check out the Hortonworks Sandbox VM • http://hortonworks.com/products/sandbox/ • Oracle Big Data Appliance • Check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed and configured for you to use • http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
  • 32. Hadoop - “move-code-to-data” approach • Data is distributed among the nodes as it is initially stored in the system • Data is replicated multiple times on the system for increased reliability & availability • Master allocates work to nodes • Computation happens on the nodes where the data is stored (data locality) • Nodes work in parallel, each on their own part of the overall dataset • Nodes are independent and self-sufficient (shared-nothing architecture) • If a node fails, the master detects the failure and re-assigns work to other nodes • If a failed node restarts, it is automatically added back into the system and assigned new tasks
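The "master allocates work where the data lives, and re-assigns it when a node fails" idea can be sketched as a toy scheduler. This is an illustration only, not Hadoop's actual scheduler: the function `assign_tasks`, the block names, and the node names are all hypothetical, and the policy (pick the least-loaded live node that holds a replica) is an assumption chosen for simplicity.

```python
def assign_tasks(block_locations, live_nodes):
    """Assign one task per block to a live node that already stores that block.

    block_locations: dict of block id -> list of node names holding a replica.
    live_nodes: iterable of node names currently considered alive.
    Toy policy: the least-loaded holder wins, ties broken by node name.
    """
    load = {node: 0 for node in live_nodes}
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        candidates = [node for node in holders if node in load]
        if not candidates:
            raise RuntimeError(f"no live replica for block {block}")
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster: three blocks, each replicated on three of four nodes.
locations = {
    "b0": ["n1", "n2", "n3"],
    "b1": ["n2", "n3", "n4"],
    "b2": ["n1", "n3", "n4"],
}

before_failure = assign_tasks(locations, ["n1", "n2", "n3", "n4"])

# n2 stops responding: the master recomputes using only the surviving nodes.
after_failure = assign_tasks(locations, ["n1", "n3", "n4"])

print(before_failure)
print(after_failure)
```

Because every block has replicas on other nodes, the work that was on the failed node can still run next to a copy of its data.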
  • 33. HDFS • A distributed file system modelled on the Google File System (GFS) [http://research.google.com/archive/gfs.html] • Data is split into blocks, typically 64MB or 128MB in size, spread across many nodes • Works best with large files, at least one HDFS block in size • Each block is replicated to a number of nodes (typically 3) • ensures reliability and availability • Files in HDFS are write once: no random writes to files allowed • HDFS is optimised for large streaming reads of files: no random access to files allowed • see HIVE later on for more DBMS-type access to HDFS files
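To make the block and replication numbers concrete, here is a minimal sketch of how a file would be cut into blocks and how much raw storage its replicas need. The function `block_layout` is hypothetical and only does the arithmetic; the 128MB block size and replication factor of 3 are the typical values quoted on the slide, and the sketch assumes the final block occupies only the bytes it actually holds.

```python
def block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return how a file of file_size_mb would be split and replicated.

    Pure arithmetic: every block is full except possibly the last one.
    """
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return {
        "blocks": len(sizes),
        "block_sizes_mb": sizes,
        "block_replicas": len(sizes) * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


large_file = block_layout(300)   # a 300MB file
tiny_file = block_layout(1)      # a 1MB file still needs its own block entry

print(large_file)
print(tiny_file)
```

A 300MB file becomes three blocks (128, 128 and 44MB) held as nine block replicas, while a 1MB file still costs one block and three replicas, which hints at why many small files are a poor fit.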
  • 34. HDFS is good for • Storing large files • Terabytes, Petabytes, etc. • Millions rather than billions of files • 100MB or more per file • Streaming data • Unstructured data => really mixed structured data • Write-once, read-many access patterns • Schema on read (RDBMS = schema on write) • Huge time saving at data write time • BUT: optimized for streaming reads rather than random reads • “Cheap” commodity hardware • No need for super-computers, use less reliable commodity hardware
  • 35. HDFS is not so good at • Low-latency reads • High throughput rather than low latency for small chunks of data • HBase and other DBs can address this issue (?) • Large numbers of small files • Better for millions of large files instead of billions of small files • Block size of 128MB or 256MB • For example, each file can be 100MB or more • Multiple writers • Single writer per file • Writes only at the end of a file, no support for writes at arbitrary offsets • Time needed for replication
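The "millions of large files rather than billions of small files" point can be illustrated by counting how many file and block entries the metadata layer would have to track for the same volume of data. This is a rough sketch under stated assumptions: `namespace_objects` is a hypothetical helper, sizes use binary units, all files are the same size, and one entry per file plus one per block is a simplification of what the NameNode really stores.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def namespace_objects(total_bytes, file_bytes, block_bytes=128 * 1024**2):
    """Count file entries plus block entries needed to hold total_bytes.

    Assumes equally sized files; each file needs at least one block.
    """
    files = ceil_div(total_bytes, file_bytes)
    blocks_per_file = max(1, ceil_div(file_bytes, block_bytes))
    return files + files * blocks_per_file


ONE_TIB = 1024**4

as_large_files = namespace_objects(ONE_TIB, 128 * 1024**2)  # 128MB files
as_small_files = namespace_objects(ONE_TIB, 1024**2)        # 1MB files

print(as_large_files, as_small_files, as_small_files // as_large_files)
```

For the same terabyte, storing it as 1MB files needs 128 times as many entries as storing it as 128MB files, so the metadata burden grows with file count rather than with data volume.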
  • 36. HDFS • Two types of nodes in an HDFS cluster • NameNode - the master node • DataNodes - slave or worker nodes • NameNode manages the file system • keeps track of the metadata: which blocks make up a file (using 2 files, the namespace image and the edit log) • knows on which DataNodes the blocks are stored • DataNodes do the work • store the blocks • retrieve blocks when requested to (by the client or the NameNode) • poll and report back to the NameNode periodically with the list of blocks that they are storing
  • 37. HDFS • When a client application wants to read a file... • it communicates with the NameNode to determine which blocks make up the file, and on which DataNodes the blocks reside • it then communicates directly with the DataNodes • NameNode is the single point of failure of a Hadoop system • back up periodically to remote NFS (set up as part of Hadoop configuration) • use a Secondary NameNode • not the same as the NameNode • periodically merges the namespace image with the edit log and maintains a copy
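The division of labour between the NameNode and the DataNodes on a read can be sketched with a small in-memory model. This is a toy, not the Hadoop API: the classes, `write_file`, `read_file`, the round-robin placement, and the tiny 4-byte block size are all assumptions made for illustration, and the write path is heavily simplified.

```python
class DataNode:
    """Stores block contents; knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks form a file and who reported them."""

    def __init__(self):
        self.file_blocks = {}  # path -> ordered list of block ids
        self.locations = {}    # block id -> list of DataNode names

    def register_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def block_report(self, node_name, block_ids):
        for block_id in block_ids:
            holders = self.locations.setdefault(block_id, [])
            if node_name not in holders:
                holders.append(node_name)

    def lookup(self, path):
        return [(b, list(self.locations.get(b, []))) for b in self.file_blocks[path]]


def write_file(path, data, namenode, datanodes, block_size=4, replication=3):
    """Toy write: split into blocks, place replicas round-robin, then report."""
    names = sorted(datanodes)
    block_ids = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#{index}"
        block_ids.append(block_id)
        for r in range(replication):
            target = names[(index + r) % len(names)]
            datanodes[target].store(block_id, data[offset:offset + block_size])
    namenode.register_file(path, block_ids)
    for name in names:
        namenode.block_report(name, sorted(datanodes[name].blocks))


def read_file(path, namenode, datanodes, down=frozenset()):
    """Ask the NameNode for locations, then fetch bytes from DataNodes directly."""
    parts = []
    for block_id, holders in namenode.lookup(path):
        live = [h for h in holders if h not in down]
        if not live:
            raise IOError(f"all replicas of {block_id} are unavailable")
        parts.append(datanodes[live[0]].read(block_id))
    return b"".join(parts)


cluster = {f"dn{i}": DataNode(f"dn{i}") for i in range(4)}
namenode = NameNode()
write_file("/logs/a.txt", b"abcdefghij", namenode, cluster)

print(read_file("/logs/a.txt", namenode, cluster))
print(read_file("/logs/a.txt", namenode, cluster, down={"dn0", "dn1"}))
```

Note that file bytes never pass through the `NameNode` object in this model; it only answers the lookup, which mirrors the slide's point that the client talks to the DataNodes directly once it knows where the blocks are.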
  • 38. [from Hadoop in Practice, Alex Holmes] HDFS Architecture
  • 39. Files and Blocks • Files are split into blocks (single unit of storage) • Managed by NameNode, stored by DataNode • Transparent to user • Replicated across machines at load time • Same block is stored on multiple machines • Good for fault-tolerance and access • Can lead to inconsistent reads • Default replication is 3 Have you ever experienced inconsistent reads?
  • 40. HDFS File Writes
  • 41. HDFS File Reads
  • 42. Who is using Hadoop in Ireland? • List of Cloudera customers in Ireland • Citi • Allianz • Deutsche Bank • Ulster Bank • Dun & Bradstreet • Ryanair • BT • Vodafone • Novartis • Airbnb • Dell • Intel • Rockwell Automation • Revenue • Adecco • Experian • M&S
  • 43. Discuss: Hadoop is not FREE :-) vs Hadoop is not FREE :-(
  • 44. Something to think about