SlideShare uma empresa Scribd logo
1 de 41
Baixar para ler offline
Big
Data
Systems
• Before 2004 “Google have implemented
hundreds of special-purpose computations
that process large amounts of raw data, such
as crawled documents, web request logs, etc.,
to compute various kinds of derived data, such
as inverted indices etc.”
• Nutch search system at 2004 was effectively
limited to 100M web pages
Use Cases
• 2002: Doug Cutting started Nutch: crawler & search
system
• 2003: GoogleFS paper
• 2004: Start of NDFS project (Nutch Distributed FS)
• 2004: Google MapReduce paper
• 2005: MapReduce implementation in Nutch
• 2006: HDFS and MapReduce to Hadoop subproject
• 2008: Yahoo! Production search index by a 10000-core
Hadoop cluster
• 2008: Hadoop – top-level Apache project
Hadoop History
• Need to process Multi Petabyte Datasets
• Need to provide framework for reliable application
execution
• Need to encapsulate nodes failures from application
developer.
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need common infrastructure
– Efficient, reliable, Open Source Apache License
Hadoop Objectives
• Hadoop Distributed File System (HDFS)
• Hadoop MapReduce
• Hadoop Common
Hadoop
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detect failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to
where data resides
– Provides very high aggregate bandwidth
Goals of GFS/HDFS
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
HFDS Details
Client reading data from HDFS
Client writing data to HDFS
Compression
• Java API
• Command Line
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir myfile.txt
– hadoop dfsadmin –report
– hadoop dfsadmin -decommission datanodename
• Web Interface
– http://host:port/dfshealth.jsp
HDFS User Interface
HDFS Web UI
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in generic framework
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
Hadoop MapReduce
Map function
Reduce function
Run this program as a
MapReduce job
Lifecycle of a MapReduce Job
MapReduce in Hadoop (1)
MapReduce in Hadoop (2)
MapReduce in Hadoop (3)
Hadoop WebUI
Hadoop WebUI
• 190+ parameters in
Hadoop
• Set manually or defaults
are used
Hadoop Configuration
Pro:
• Cheap components
• Replication
• Fault tolerance
• Parallel processing
• Free license
• Linear scalability
• Amazon support
Con:
• No realtime
• Difficult to add MR tasks
• File edit is not supported
• High support cost
Summary
• Distributed Grep
• Count of URL Access Frequency
• Reverse Web-Link Graph
• Inverted Index
Examples
• Streaming
• Hive
• Pig
• HBase
Hadoop
API to MapReduce that uses Unix standard streams
as the interface between Hadoop and your program
MAP: map.rb
#!/usr/bin/env ruby
STDIN.each_line do |line|
val = line
year, temp, q = val[15,4], val[87,5], val[92,1]
puts "#{year}t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
% cat input/ncdc/sample.txt | map.rb
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078
LOCAL EXECUTION
Hadoop Streaming (1)
REDUCE: reduce.rb
#!/usr/bin/env ruby
last_key, max_val = nil, 0
STDIN.each_line do |line|
key, val = line.split("t")
if last_key && last_key != key
puts "#{last_key}t#{max_val}"
last_key, max_val = key, val.to_i
else
last_key, max_val = key, [max_val, val.to_i].max
end
end
puts "#{last_key}t#{max_val}" if last_key
% cat input/ncdc/sample.txt | map.rb | sort | reduce.rb
1949 111
1950 22
LOCAL EXECUTION
Hadoop Streaming (2)
HADOOP EXECUTION
% hadoop jar 
$HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar 
-input input/ncdc/sample.txt 
-output output 
-mapper map.rb 
-reducer reduce.rb
Hadoop Streaming (3)
 Intuitive
 Make the unstructured data looks like tables regardless how
it really lay out
 SQL based query can be directly against these tables
 Generate specify execution plan for this query
 What’s Hive
 A data warehousing system to store structured data on
Hadoop file system
 Provide an easy query these data by execution Hadoop
MapReduce plans
Hive: overview
HDFS
Map Reduce
Hive: architecture
hive> SHOW TABLES;
hive> CREATE TABLE shakespeare (freq
INT, word STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ‘t’
STORED AS TEXTFILE;
hive> DESCRIBE shakespeare;
loading data…
hive> SELECT * FROM shakespeare LIMIT 10;
hive> SELECT * FROM shakespeare
WHERE freq > 100 SORT BY freq ASC
LIMIT 10;
Hive: shell
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records
BY temperature != 9999
AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records
GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;
Pig
Initial public launch
Move from local workstation to shared, remote hosted
MySQL instance with a well-defined schema.
Service becomes more popular; too many reads hitting the
database
Add memcached to cache common queries. Reads are
now no longer strictly ACID; cached data must expire.
Service continues to grow in popularity; too many writes
hitting the database
Scale MySQL vertically by buying a beefed up server
with 16 cores, 128 GB of RAM,
and banks of 15 k RPM hard drives. Costly.
RDBMS scaling story (1)
New features increases query complexity; now we have
too many joins
Denormalize your data to reduce joins.
Rising popularity swamps the server; things are too slow
Stop doing any server-side computations.
Some queries are still too slow
Periodically prematerialize the most complex
queries, try to stop joining in most cases.
Reads are OK, but writes are getting slower and slower
Drop secondary indexes and triggers (no indexes?).
RDBMS scaling story (1)
NoSQL
• Tables have one primary index, the row key
• No join operators
• Data is unstructured and untyped
• No accessed or manipulated via SQL
– Programmatic access via Java, REST, or Thrift APIs
• There are three types of lookups:
– Fast lookup using row key and optional timestamp
– Full table scan
– Range scan from region start to end
Hbase: differences from RDBMS
• Automatic partitioning
• Scale linearly and automatically with new
nodes
• Commodity hardware
• Fault tolerance: Apache Zookeeper
• Batch processing: Apache Hadoop
Hbase: benefits over RDBMS
 Tables are sorted by Row
 Table schema only define it’s column families .
 Each family consists of any number of columns
 Each column consists of any number of versions
 Columns only exist when inserted, NULLs are free.
 Columns within a family are sorted and stored together
 Everything except table names are byte[]
 (Row, Family: Column, Timestamp)  Value
Row key
Column Family
valueTimeStamp
Hbase: data model
• Master
– Responsible for monitoring region servers
– Load balancing for regions
– Redirect client to correct region servers
• regionserver slaves
– Serving requests (Write/Read/Scan) of Client
– Send HeartBeat to Master
Hbase: members
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
2 row(s) in 0.0825 seconds
Hbase: shell
Hbase: Web UI
• Amazon
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Who uses Hadoop?
Books

Mais conteĂşdo relacionado

Mais procurados

SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSofian Hadiwijaya
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overviewSiva Pandeti
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemRajkumar Singh
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalogmarkgrover
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 

Mais procurados (20)

SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Apache drill
Apache drillApache drill
Apache drill
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 

Destaque

MoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQLMoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQLAlex Tomic
 
JBug_React_and_Flux_2015
JBug_React_and_Flux_2015JBug_React_and_Flux_2015
JBug_React_and_Flux_2015Lukas Vlcek
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearchLukas Vlcek
 
Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"George Stathis
 
OseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platformOseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platform@CULT Srl
 
Social Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 minSocial Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 minSocial Miner
 
Oxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic SearchOxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic SearchOxalide
 
Elasticsearch first-steps
Elasticsearch first-stepsElasticsearch first-steps
Elasticsearch first-stepsMatteo Moci
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...Simone Onofri
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integrationTommaso Teofili
 
Elastic search
Elastic searchElastic search
Elastic searchRahul Agarwal
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search medcl
 
Elastic search Walkthrough
Elastic search WalkthroughElastic search Walkthrough
Elastic search WalkthroughSuhel Meman
 
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...Jozias Rolim
 
Elastic search adaptto2014
Elastic search adaptto2014Elastic search adaptto2014
Elastic search adaptto2014Vivek Sachdeva
 
Using Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text SearchUsing Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text SearchSumy PHP User Grpoup
 
03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data OutOpenThink Labs
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 

Destaque (20)

MoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQLMoSQL: An Elastic Storage Engine for MySQL
MoSQL: An Elastic Storage Engine for MySQL
 
JBug_React_and_Flux_2015
JBug_React_and_Flux_2015JBug_React_and_Flux_2015
JBug_React_and_Flux_2015
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearch
 
Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"
 
OseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platformOseeGenius - Semantic search engine and discovery platform
OseeGenius - Semantic search engine and discovery platform
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Social Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 minSocial Miner: Webinar people marketing em 30 min
Social Miner: Webinar people marketing em 30 min
 
Oxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic SearchOxalide Academy : Workshop #3 Elastic Search
Oxalide Academy : Workshop #3 Elastic Search
 
Elasticsearch first-steps
Elasticsearch first-stepsElasticsearch first-steps
Elasticsearch first-steps
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
Amministratori Di Sistema: Adeguamento al Garante Privacy - Log Management e ...
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Elastic search
Elastic searchElastic search
Elastic search
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search
 
Elastic search Walkthrough
Elastic search WalkthroughElastic search Walkthrough
Elastic search Walkthrough
 
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
[Case machine learning- iColabora]Text Mining - classificando textos com Elas...
 
Elastic search adaptto2014
Elastic search adaptto2014Elastic search adaptto2014
Elastic search adaptto2014
 
Using Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text SearchUsing Elastic Search Outside Full-Text Search
Using Elastic Search Outside Full-Text Search
 
03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 

Semelhante a Apache Hadoop 1.1

Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecturesaipriyacoool
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop PrimerSteve Staso
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoopyaevents
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_pptjerrin joseph
 
מיכאל
מיכאלמיכאל
מיכאלsqlserver.co.il
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Hadoop
HadoopHadoop
Hadoopavnishagr
 

Semelhante a Apache Hadoop 1.1 (20)

Apache Spark
Apache SparkApache Spark
Apache Spark
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
מיכאל
מיכאלמיכאל
מיכאל
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Hadoop
HadoopHadoop
Hadoop
 

Mais de Sperasoft

особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4Sperasoft
 
концепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted Worldконцепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted WorldSperasoft
 
Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4Sperasoft
 
Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек Sperasoft
 
Gameplay Tags
Gameplay TagsGameplay Tags
Gameplay TagsSperasoft
 
Data Driven Gameplay in UE4
Data Driven Gameplay in UE4Data Driven Gameplay in UE4
Data Driven Gameplay in UE4Sperasoft
 
Code and Memory Optimisation Tricks
Code and Memory Optimisation Tricks Code and Memory Optimisation Tricks
Code and Memory Optimisation Tricks Sperasoft
 
The theory of relational databases
The theory of relational databasesThe theory of relational databases
The theory of relational databasesSperasoft
 
Automated layout testing using Galen Framework
Automated layout testing using Galen FrameworkAutomated layout testing using Galen Framework
Automated layout testing using Galen FrameworkSperasoft
 
Sperasoft talks: Android Security Threats
Sperasoft talks: Android Security ThreatsSperasoft talks: Android Security Threats
Sperasoft talks: Android Security ThreatsSperasoft
 
Sperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on AndroidSperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on AndroidSperasoft
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft
 
Effective Мeetings
Effective МeetingsEffective Мeetings
Effective МeetingsSperasoft
 
Unreal Engine 4 Introduction
Unreal Engine 4 IntroductionUnreal Engine 4 Introduction
Unreal Engine 4 IntroductionSperasoft
 
JIRA Development
JIRA DevelopmentJIRA Development
JIRA DevelopmentSperasoft
 
MOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JSMOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JSSperasoft
 
Quick Intro Into Kanban
Quick Intro Into KanbanQuick Intro Into Kanban
Quick Intro Into KanbanSperasoft
 
ECMAScript 6 Review
ECMAScript 6 ReviewECMAScript 6 Review
ECMAScript 6 ReviewSperasoft
 
Console Development in 15 minutes
Console Development in 15 minutesConsole Development in 15 minutes
Console Development in 15 minutesSperasoft
 
Database Indexes
Database IndexesDatabase Indexes
Database IndexesSperasoft
 

Mais de Sperasoft (20)

особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4
 
концепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted Worldконцепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted World
 
Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4
 
Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек
 
Gameplay Tags
Gameplay TagsGameplay Tags
Gameplay Tags
 
Data Driven Gameplay in UE4
Data Driven Gameplay in UE4Data Driven Gameplay in UE4
Data Driven Gameplay in UE4
 
Code and Memory Optimisation Tricks
Code and Memory Optimisation Tricks Code and Memory Optimisation Tricks
Code and Memory Optimisation Tricks
 
The theory of relational databases
The theory of relational databasesThe theory of relational databases
The theory of relational databases
 
Automated layout testing using Galen Framework
Automated layout testing using Galen FrameworkAutomated layout testing using Galen Framework
Automated layout testing using Galen Framework
 
Sperasoft talks: Android Security Threats
Sperasoft talks: Android Security ThreatsSperasoft talks: Android Security Threats
Sperasoft talks: Android Security Threats
 
Sperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on AndroidSperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on Android
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
 
Effective Мeetings
Effective МeetingsEffective Мeetings
Effective Мeetings
 
Unreal Engine 4 Introduction
Unreal Engine 4 IntroductionUnreal Engine 4 Introduction
Unreal Engine 4 Introduction
 
JIRA Development
JIRA DevelopmentJIRA Development
JIRA Development
 
MOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JSMOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JS
 
Quick Intro Into Kanban
Quick Intro Into KanbanQuick Intro Into Kanban
Quick Intro Into Kanban
 
ECMAScript 6 Review
ECMAScript 6 ReviewECMAScript 6 Review
ECMAScript 6 Review
 
Console Development in 15 minutes
Console Development in 15 minutesConsole Development in 15 minutes
Console Development in 15 minutes
 
Database Indexes
Database IndexesDatabase Indexes
Database Indexes
 

Último

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vĂĄzquez
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Christopher Logan Kennedy
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

Último (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Apache Hadoop 1.1

  • 2. • Before 2004 “Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices etc.” • Nutch search system at 2004 was effectively limited to 100M web pages Use Cases
  • 3. • 2002: Doug Cutting started Nutch: crawler & search system • 2003: GoogleFS paper • 2004: Start of NDFS project (Nutch Distributed FS) • 2004: Google MapReduce paper • 2005: MapReduce implementation in Nutch • 2006: HDFS and MapReduce to Hadoop subproject • 2008: Yahoo! Production search index by a 10000-core Hadoop cluster • 2008: Hadoop – top-level Apache project Hadoop History
  • 4. • Need to process Multi Petabyte Datasets • Need to provide framework for reliable application execution • Need to encapsulate nodes failures from application developer. – Failure is expected, rather than exceptional. – The number of nodes in a cluster is not constant. • Need common infrastructure – Efficient, reliable, Open Source Apache License Hadoop Objectives
  • 5. • Hadoop Distributed File System (HDFS) • Hadoop MapReduce • Hadoop Common Hadoop
  • 6. • Very Large Distributed File System – 10K nodes, 100 million files, 10 PB • Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detect failures and recovers from them • Optimized for Batch Processing – Data locations exposed so that computations can move to where data resides – Provides very high aggregate bandwidth Goals of GFS/HDFS
  • 7. • Data Coherency – Write-once-read-many access model – Client can only append to existing files • Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes • Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode HFDS Details
  • 11. • Java API • Command Line – hadoop dfs -mkdir /foodir – hadoop dfs -cat /foodir/myfile.txt – hadoop dfs -rm /foodir myfile.txt – hadoop dfsadmin –report – hadoop dfsadmin -decommission datanodename • Web Interface – http://host:port/dfshealth.jsp HDFS User Interface
  • 13. • The Map-Reduce programming model – Framework for distributed processing of large data sets – Pluggable user code runs in generic framework • Common design pattern in data processing cat * | grep | sort | uniq -c | cat > file input | map | shuffle | reduce | output • Natural for: – Log processing – Web search indexing – Ad-hoc queries Hadoop MapReduce
  • 14. Map function Reduce function Run this program as a MapReduce job Lifecycle of a MapReduce Job
  • 20. • 190+ parameters in Hadoop • Set manually or defaults are used Hadoop Configuration
  • 21. Pro: • Cheap components • Replication • Fault tolerance • Parallel processing • Free license • Linear scalability • Amazon support Con: • No realtime • Difficult to add MR tasks • File edit is not supported • High support cost Summary
  • 22. • Distributed Grep • Count of URL Access Frequency • Reverse Web-Link Graph • Inverted Index Examples
  • 23. • Streaming • Hive • Pig • HBase Hadoop
  • 24. API to MapReduce that uses Unix standard streams as the interface between Hadoop and your program MAP: map.rb #!/usr/bin/env ruby STDIN.each_line do |line| val = line year, temp, q = val[15,4], val[87,5], val[92,1] puts "#{year}t#{temp}" if (temp != "+9999" && q =~ /[01459]/) end % cat input/ncdc/sample.txt | map.rb 1950 +0000 1950 +0022 1950 -0011 1949 +0111 1949 +0078 LOCAL EXECUTION Hadoop Streaming (1)
  • 25. REDUCE: reduce.rb #!/usr/bin/env ruby last_key, max_val = nil, 0 STDIN.each_line do |line| key, val = line.split("t") if last_key && last_key != key puts "#{last_key}t#{max_val}" last_key, max_val = key, val.to_i else last_key, max_val = key, [max_val, val.to_i].max end end puts "#{last_key}t#{max_val}" if last_key % cat input/ncdc/sample.txt | map.rb | sort | reduce.rb 1949 111 1950 22 LOCAL EXECUTION Hadoop Streaming (2)
  • 26. HADOOP EXECUTION % hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar -input input/ncdc/sample.txt -output output -mapper map.rb -reducer reduce.rb Hadoop Streaming (3)
  • 27.  Intuitive  Make the unstructured data looks like tables regardless how it really lay out  SQL based query can be directly against these tables  Generate specify execution plan for this query  What’s Hive  A data warehousing system to store structured data on Hadoop file system  Provide an easy query these data by execution Hadoop MapReduce plans Hive: overview
  • 29. hive> SHOW TABLES; hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘t’ STORED AS TEXTFILE; hive> DESCRIBE shakespeare; loading data… hive> SELECT * FROM shakespeare LIMIT 10; hive> SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10; Hive: shell
  • 30. -- max_temp.pig: Finds the maximum temperature by year records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int); filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9); grouped_records = GROUP filtered_records BY year; max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature); DUMP max_temp; Pig
  • 31. Initial public launch Move from local workstation to shared, remote hosted MySQL instance with a well-defined schema. Service becomes more popular; too many reads hitting the database Add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire. Service continues to grow in popularity; too many writes hitting the database Scale MySQL vertically by buying a beefed up server with 16 cores, 128 GB of RAM, and banks of 15 k RPM hard drives. Costly. RDBMS scaling story (1)
  • 32. New features increases query complexity; now we have too many joins Denormalize your data to reduce joins. Rising popularity swamps the server; things are too slow Stop doing any server-side computations. Some queries are still too slow Periodically prematerialize the most complex queries, try to stop joining in most cases. Reads are OK, but writes are getting slower and slower Drop secondary indexes and triggers (no indexes?). RDBMS scaling story (1)
  • 33. NoSQL
  • 34. • Tables have one primary index, the row key • No join operators • Data is unstructured and untyped • No accessed or manipulated via SQL – Programmatic access via Java, REST, or Thrift APIs • There are three types of lookups: – Fast lookup using row key and optional timestamp – Full table scan – Range scan from region start to end Hbase: differences from RDBMS
  • 35. • Automatic partitioning • Scale linearly and automatically with new nodes • Commodity hardware • Fault tolerance: Apache Zookeeper • Batch processing: Apache Hadoop Hbase: benefits over RDBMS
  • 36.  Tables are sorted by Row  Table schema only define it’s column families .  Each family consists of any number of columns  Each column consists of any number of versions  Columns only exist when inserted, NULLs are free.  Columns within a family are sorted and stored together  Everything except table names are byte[]  (Row, Family: Column, Timestamp)  Value Row key Column Family valueTimeStamp Hbase: data model
  • 37. • Master – Responsible for monitoring region servers – Load balancing for regions – Redirect client to correct region servers • regionserver slaves – Serving requests (Write/Read/Scan) of Client – Send HeartBeat to Master Hbase: members
  • 38. $ hbase shell > create 'test', 'data' 0 row(s) in 4.3066 seconds > list test 1 row(s) in 0.1485 seconds > put 'test', 'row1', 'data:1', 'value1' 0 row(s) in 0.0454 seconds > put 'test', 'row2', 'data:2', 'value2' 0 row(s) in 0.0035 seconds > scan 'test' ROW COLUMN+CELL row1 column=data:1, timestamp=1240148026198, value=value1 row2 column=data:2, timestamp=1240148040035, value=value2 2 row(s) in 0.0825 seconds Hbase: shell
  • 40. • Amazon • Facebook • Google • IBM • Joost • Last.fm • New York Times • PowerSet • Veoh • Yahoo! Who uses Hadoop?
  • 41. Books