Apache Hadoop 0.22
and Other Versions
Konstantin V Shvachko
Principal Hadoop Architect, eBay
IBM Karmasphere Twitter
February – March, 2012
eBay Inc. confidential
Apache Hadoop Ecosystem
• Hadoop Core
– Common – communication and user facing APIs
– HDFS – distributed file system
– MapReduce – distributed computation framework
• Pig – dataflow language
• Hive – data warehouse, SQL
• Zookeeper – distributed coordination service
• HBase – columnar store
• Oozie – complex job workflow
• eBay Specific
– Cascading
– Lzo compression
Hadoop Versioning
• Straight line from 0.1 to 0.20
• Fanned out starting from 0.20.2
• Multiple distributions in 2010 based on 0.20
– Apache, Y, CDH, FB
– More today
• Focus on Apache Releases
– Release 0.20.2 2010-02-16
– Release 0.21.0 2010-08-13
– Release 0.20.203.0 2011-05-11 Security Stable
– Release 0.20.204.0 2011-09-05 Improvements
– Release 0.20.205.0 2011-10-17 HBase support
• Genealogy of elephants
Major Branches
• Hadoop 1.0.0 (security branch) 2011-12-27
– Rename of 0.20.205
– Beta
• Hadoop 0.22.0 2011-12-10
– Continuation of 0.21.0
– Beta
• Hadoop 0.23.0 2011-11-11
– Federation – static partitioning of HDFS namespace
– Yarn – new implementation of MapReduce
– Scalability
– Alpha
• 2011 – record number of major releases!
• No unifying release, containing all the good features
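The release dates on the versioning and major-branches slides show that version numbers stopped tracking chronology once the branches fanned out. A minimal Python sketch, using only the dates listed in the slides; the helper name `version_key` is hypothetical:

```python
from datetime import date

# Release dates exactly as listed in the slides.
RELEASES = {
    "0.20.2": date(2010, 2, 16),
    "0.21.0": date(2010, 8, 13),
    "0.20.203.0": date(2011, 5, 11),
    "0.20.204.0": date(2011, 9, 5),
    "0.20.205.0": date(2011, 10, 17),
    "0.23.0": date(2011, 11, 11),
    "0.22.0": date(2011, 12, 10),
    "1.0.0": date(2011, 12, 27),
}


def version_key(version: str) -> tuple:
    """Turn '0.20.203.0' into (0, 20, 203, 0) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


# Order the same releases two ways: by version number and by release date.
by_number = sorted(RELEASES, key=version_key)
by_date = sorted(RELEASES, key=RELEASES.get)

print("by number:", by_number)
print("by date:  ", by_date)
```

The two orderings differ: by date, 0.21.0 precedes the 0.20.20x security releases, and 0.23.0 precedes 0.22.0.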
Hadoop 0.22 Branch
• Branched 2010-11-17
• Released 2011-12-10
• Many events in-between
• RM role – started in August 2011
• Stabilization
–Hadoop Platform team, eBay
–Many contributors from the community
Features HDFS - 0.22
• New implementation of file append
• HBase support with hflush and hsync
• Symbolic links
• BackupNode and CheckpointNode
• DataNodes tolerate single disk failure. Disk-fail-in-place
• File concatenation
• SLive test
• Sticky bit
• Offline Image Viewer
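As an illustration of the sticky bit item above: on a directory, the sticky bit restricts deletion of a file to the superuser, the directory owner, or the file owner. The check below is a toy model in Python, not HDFS code; the function `may_delete` and the user names are hypothetical, and it assumes the caller already has write access to the directory.

```python
import stat


def may_delete(dir_mode: int, dir_owner: str, file_owner: str, user: str,
               is_superuser: bool = False) -> bool:
    """Toy sticky-bit check; assumes `user` can already write to the directory."""
    if not dir_mode & stat.S_ISVTX:
        # No sticky bit: write access to the directory is enough.
        return True
    # Sticky bit set: only superuser, directory owner, or file owner may delete.
    return is_superuser or user in (dir_owner, file_owner)


shared_tmp = 0o1777  # world-writable directory with the sticky bit set
print(may_delete(shared_tmp, "hdfs", "alice", "bob"))    # another user's file
print(may_delete(shared_tmp, "hdfs", "alice", "alice"))  # own file
```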
Features MapReduce - 0.22
• Hierarchical job queues
• Job limits per queue / pool
• Dynamically stop / start job queues
• Advances in new MapReduce API
– Input/Output formats, ChainMapper / ChainReducer
• TaskTracker blacklisting
• DistributedCache sharing
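The ChainMapper / ChainReducer idea mentioned above can be sketched in plain Python: several mappers run back to back inside one map phase, and further mappers may run after the reducer. This is a toy model, not the Hadoop API; every function name here is hypothetical.

```python
from collections import defaultdict


def chain(mappers, records):
    """Feed every output record of one mapper into the next one."""
    for mapper in mappers:
        records = [out for key, value in records for out in mapper(key, value)]
    return records


def lowercase(key, value):
    yield key.lower(), value


def drop_stopwords(key, value):
    if key not in {"the", "a"}:
        yield key, value


def sum_reducer(key, values):
    yield key, sum(values)


def only_repeated(key, value):
    if value > 1:
        yield key, value


def run_job(records, map_chain, reducer, post_reduce_chain):
    """Map chain -> group by key -> reduce -> mapper chain after the reducer."""
    groups = defaultdict(list)
    for key, value in chain(map_chain, records):
        groups[key].append(value)
    reduced = [out for key in sorted(groups) for out in reducer(key, groups[key])]
    return chain(post_reduce_chain, reduced)


words = [("The", 1), ("Hadoop", 1), ("hadoop", 1), ("a", 1), ("HDFS", 1)]
result = run_job(words, [lowercase, drop_stopwords], sum_reducer, [only_repeated])
print(result)
```

Chaining keeps intermediate records inside one task instead of writing them out between separate jobs, which is the point of the feature.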
Features not Supported in Hadoop 0.22.0
Compared to Hadoop 1.0
• Security
– LinuxTaskController removed MAPREDUCE-2767
• Optimizations (operability) of the MapReduce framework
introduced in the Hadoop 0.20.security line of releases
– Limits on per-job JobConf, Counters, StatusReport, Split-Sizes
– User / queue limits on tasks / jobs in the CapacityScheduler
• Disk-fail-in-place – MapReduce part
• JMX-based metrics v2
• Jetty workaround
• CapacityScheduler should assign multiple tasks per heartbeat
• User's task logs filling up local disks on the TaskTrackers
• FairScheduler back-port from trunk
Not in Hadoop 0.22.0 HDFS Part
• Short-circuit local reads: a local client reads a DataNode's files directly
– Important HBase optimization
– Porting is in progress
• WebHDFS: accessing HDFS over HTTP
– New experimental feature, back-ported from trunk
• NameNode startup time
– Handling block reports and missed heartbeats from DataNodes
– The rest is forward ported from 1.0
– More startup improvements in 0.22
Hadoop 0.23 Features
• HDFS Federation
– Independent NameNodes sharing a common pool of DataNodes
– Cluster is a family of volumes with shared block storage layer
– User sees volumes as isolated file systems
– ViewFS: the client-side mount table
– Federated approach provides a static partitioning of the federated namespace
• Yarn: Scalability for MapReduce framework
– Separation of JobTracker functions
1. Job scheduling and resource allocation:
• Fundamentally centralized
2. Job monitoring and job life-cycle coordination
• Delegate coordination of different jobs to other nodes
– Dynamic partitioning of cluster resources: no fixed slots
• “Apache Hadoop: The scalability update” USENIX ;login: June, 2011
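The client-side mount table behind ViewFS can be pictured as a fixed mapping from path prefixes to independent NameNodes. The Python sketch below is a toy model under that assumption, not the ViewFS implementation; the class `MountTable` and the `example.com` host names are hypothetical.

```python
from typing import Dict


class MountTable:
    """Toy client-side mount table: path prefix -> target URI on some NameNode."""

    def __init__(self, mounts: Dict[str, str]):
        # Longest prefixes first, so the most specific mount point wins.
        self._mounts = sorted(mounts.items(), key=lambda m: len(m[0]), reverse=True)

    def resolve(self, path: str) -> str:
        for prefix, target in self._mounts:
            # Match whole path components only: "/user" must not match "/userdata".
            if path == prefix or path.startswith(prefix + "/"):
                return target + path[len(prefix):]
        raise FileNotFoundError(f"no mount point covers {path}")


# Static partitioning: the table is fixed when the client is configured.
table = MountTable({
    "/user": "hdfs://nn1.example.com/user",
    "/data": "hdfs://nn2.example.com/data",
    "/data/logs": "hdfs://nn3.example.com/logs",
})

print(table.resolve("/user/alice/part-0"))
print(table.resolve("/data/logs/2012/02"))
```

Because the mapping lives on the client, moving a subtree to another NameNode means changing the table, which is what "static partitioning" implies.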
Append and HBase
• Append means
– Reopening of existing files for appending new data
– Replica synchronization after failure
– Consistent view of file data during writing by different clients
– hflush, hsync – guarantee data delivered to DNs and persisted on NN
• First implementation of append in 0.19 HADOOP-1700
– 0.20-append branch
• Redesign of append in 0.21 HDFS-265
• HBase needs hflush and hsync only
• Hadoop 1.0 - HBase support via hflush, hsync
• Hadoop 0.22 – fully functional append, including HBase support
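The delivery guarantee of hflush described above can be illustrated with a toy write pipeline in Python. This is a simplified model, not the HDFS client API: the classes `ToyReplica` and `ToyAppendStream` are hypothetical, and hsync, failures, and replica recovery are not modeled.

```python
class ToyReplica:
    """Stands in for one DataNode replica of the file's last block."""

    def __init__(self):
        self.data = bytearray()


class ToyAppendStream:
    """Toy model of a client write pipeline; not the HDFS client API."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._buffer = bytearray()

    def write(self, chunk: bytes) -> None:
        # Data is buffered on the client only; readers cannot see it yet.
        self._buffer += chunk

    def hflush(self) -> None:
        # Deliver the buffered bytes to every replica in the pipeline.
        for replica in self.replicas:
            replica.data += self._buffer
        self._buffer.clear()


def read(replica: ToyReplica) -> bytes:
    return bytes(replica.data)


replicas = [ToyReplica() for _ in range(3)]
stream = ToyAppendStream(replicas)

stream.write(b"row-1;")
before = read(replicas[0])  # nothing delivered yet
stream.hflush()
after = read(replicas[0])   # visible on the replicas after hflush

print(before, after)
```

A write-ahead log such as HBase's relies on this property: an edit counts as durable only once the flush call has returned.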
BackupNode
• BackupNode is a read-only NameNode
– Contains all file system metadata: files and directories
excluding block locations
– Can perform NameNode operations that don’t modify namespace
• BN maintains up-to-date in-memory image of file system namespace
always synchronized with the NameNode state
– NameNode streams journal to BackupNode
• BackupNode can create a checkpoint without downloading
checkpoint and journal files from active NameNode
• Intended to evolve into hot HA HDFS-2064
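The journal streaming described above can be sketched as follows: the NameNode forwards each namespace edit to the BackupNode, which applies it to its own in-memory image and can therefore checkpoint locally. This Python model is a simplification under stated assumptions; `ToyNameNode`, `ToyBackupNode`, and the two operation names are hypothetical, and block locations are left out, as on the slide.

```python
def apply_edit(namespace: set, op: str, path: str) -> None:
    """Apply one journal record to an in-memory namespace image."""
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown journal op: {op}")


class ToyBackupNode:
    def __init__(self):
        self.namespace = set()

    def on_journal(self, op: str, path: str) -> None:
        # Stay in sync by applying every streamed edit immediately.
        apply_edit(self.namespace, op, path)

    def checkpoint(self) -> frozenset:
        # The image is already in memory: nothing to download from the NameNode.
        return frozenset(self.namespace)


class ToyNameNode:
    def __init__(self):
        self.namespace = set()
        self.journal_streams = []

    def edit(self, op: str, path: str) -> None:
        apply_edit(self.namespace, op, path)
        for stream in self.journal_streams:
            stream.on_journal(op, path)


active = ToyNameNode()
backup = ToyBackupNode()
active.journal_streams.append(backup)

active.edit("create", "/user/alice/a.txt")
active.edit("create", "/user/alice/b.txt")
active.edit("delete", "/user/alice/a.txt")

image = backup.checkpoint()
print(sorted(image))
```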
Hadoop at eBay
• 2011 started with 532-node 5 PB cluster running CDH2
• eBay 0.20.203-based build (Wilma)
– Hadoop 0.20.203 – latest stable Apache release
• HDFS, MapReduce, Pig, Hive, Cascading, Mobius, lzo
– 500+ users; 2000 jobs per day
• Runs on 1000-node cluster
– 24 PB – capacity, 72 GB RAM / node
• Many smaller clusters
• Stabilization of Hadoop platform based on 0.22
Testing
• One year of testing by different groups in Hadoop ecosystem
• Extensive testing of append by HBase community
• Fully automated build and certification with BigTop
• Hadoop platform team at eBay
– Extensive stabilization effort starting September
– Most bugs found in 0.22 are also in trunk and 0.23
– All new features tested
– Stress testing
– Reliability testing
• Works with:
– Pig 0.8, Hive 0.7: custom changes
– HBase 0.92, Oozie: open sourced
– Zookeeper, Cascading: no changes needed
Testing Tools, Examples
• TeraSort, TestDFSIO, DistCp
• GridMix, Rumen – production job traces
• SLive – adjustable mix of HDFS operations, permanent load
• Upgrade / rollback from 0.20.? and 0.20.203 to 0.22
• Oversubscribed cluster running out of memory
• Losing racks with running jobs and HBase
– Cluster survived consecutive loss of 4 racks, shrinking to single rack
with HBase still alive and MR jobs completing
• Disk-fail-in-place helps identify bad drives during hardware burn-in
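An "adjustable mix of HDFS operations", as listed for SLive above, can be pictured as a weighted random choice over operation types. The Python sketch below only illustrates that idea; it is not SLive itself, and the function `generate_ops`, the operation names, and the weights are all hypothetical.

```python
import random
from collections import Counter


def generate_ops(mix: dict, count: int, seed: int = 0) -> list:
    """Draw `count` operation names according to the relative weights in `mix`."""
    rng = random.Random(seed)  # fixed seed keeps a test run repeatable
    ops = list(mix)
    weights = [mix[op] for op in ops]
    return rng.choices(ops, weights=weights, k=count)


# Relative weights; a weight of 0 switches an operation off.
mix = {"create": 40, "read": 40, "delete": 20, "rename": 0}

ops = generate_ops(mix, 1000)
histogram = Counter(ops)
print(histogram)
```

Changing the weights shifts the load profile, for example toward a read-heavy or a metadata-heavy workload, without touching the driver.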
Benchmarking
• TestDFSIO: 10 GB files (same as 100 GB)
• TeraSort: -5% (scheduler to blame)
• YCSB - same
• Internal eBay applications – same or better
• Lots of tuning: Hadoop, Java, OS, HW
– Gradual improvement of results
Throughput (MB/sec)
              Read   Write   Append
Hadoop-0.22   100    84      83
0.20 breed    96     66      n/a
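The throughput figures above can be turned into relative gains with a few lines of Python. The numbers are taken from the table; the function `relative_gain` is a hypothetical helper, and append is skipped because the 0.20 line has no value for it.

```python
# Throughput in MB/sec, as given in the table.
throughput = {
    "Hadoop-0.22": {"read": 100, "write": 84, "append": 83},
    "0.20 breed": {"read": 96, "write": 66, "append": None},
}


def relative_gain(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100


for op in ("read", "write"):
    gain = relative_gain(throughput["Hadoop-0.22"][op], throughput["0.20 breed"][op])
    print(f"{op}: {gain:+.1f}%")
```

This works out to roughly 4% faster reads and 27% faster writes for 0.22 in this measurement.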
Good to have for 0.22.1
• Restore Security
• Disk Fail in place for MapReduce
• Optimizations
– Multiple tasks per heartbeat for CapacityScheduler
– CapacityScheduler preemption
• MR job and task limits
• Cluster startup time
• Add HA?
• Merge MR-1.0 into Hadoop 0.22?
Important
• Works but not 0.20
– Good new features
– Reliability is the first concern
– Performance and missing functionality can be reconstructed
• Community release
– Not distributed / advertised by commercial distributors
– Community involvement important
• Don’t try to upgrade from Hadoop 0.21 to Hadoop 1.0
It’s the other way around
– Go to Hadoop 0.22 instead
• Forward-going release progress
– Stop porting new features, start releasing them
Thank you
Hadoop 0.22 Contributions Accepted

Apache Hadoop 0.22 and Other Versions

  • 1. Apache Hadoop 0.22 and Other Versions Konstantin V Shvachko Principal Hadoop Architect, eBay IBM Karmasphere Twitter February – March, 2012
  • 2. eBay Inc. confidential Apache Hadoop Ecosystem • Hadoop Core – Common – communication and user facing APIs – HDFS – distributed file system – MapReduce – distributed computation framework • Pig – dataflow language • Hive – data warehouse, SQL • Zookeeper – distributed coordination service • HBase – columnar store • Oozie – complex job workflow • eBay Specific – Cascading – Lzo compression 2
  • 3. eBay Inc. confidential Hadoop Versioning • Straight line from 0.1 to 0.20 • Fanned out starting from 0.20.2 • Multiple distributions in 2010 based on 0.20 – Apache, Y, CDH, FB – More today • Focus on Apache Releases – Release 0.20.2 2010-02-16 – Release 0.21.0 2010-08-13 – Release 0.20.203.0 2011-05-11 Security Stable – Release 0.20.204.0 2011-09-05 Improvements – Release 0.20.205.0 2011-10-17 HBase support • Genealogy of elephants 3
  • 5. eBay Inc. confidential Major Branches • Hadoop 1.0.0 (security branch) 2011-12-27 – Rename of 0.20.205 – Beta • Hadoop 0.22.0 2011-12-10 – Continuation of 0.21.0 – Beta • Hadoop 0.23.0 2011-11-11 – Fedaration – static partitioning of HDFS namespace – Yarn – new implementation of MapReduce – Scalability – Alpha • 2011 – record number of major releases! • No unifying release, containing all the good features 5
  • 6. eBay Inc. confidential Hadoop 0.22 Branch • Branched 2010-11-17 • Released 2011-12-10 • Many events in-between • RM role – started in August 2011 • Stabilization –Hadoop Platform team, eBay –Many contributors from the community 6
  • 7. Features HDFS - 0.22
      • New implementation of file append
      • HBase support with hflush and hsync
      • Symbolic links
      • BackupNode and CheckpointNode
      • DataNodes tolerate single disk failure. Disk-fail-in-place
      • File concatenation
      • SLive test
      • Sticky bit
      • Offline Image Viewer
  • 8. Features MapReduce - 0.22
      • Hierarchical job queues
      • Job limits per queue / pool
      • Dynamically stop / start job queues
      • Advances in new MapReduce API
          – Input/Output formats, ChainMapper / ChainReducer
      • TaskTracker blacklisting
      • DistributedCache sharing
  • 9. Features not Supported in Hadoop 0.22.0 Compared to Hadoop 1.0
      • Security – LinuxTaskController removed (MAPREDUCE-2767)
      • Optimizations (operability) of the MapReduce framework introduced in the Hadoop 0.20.security line of releases
          – Limits on per-job JobConf, Counters, StatusReport, Split-Sizes
          – User / queue limits on tasks / jobs in the CapacityScheduler
      • Disk-fail-in-place – MapReduce part
      • JMX-based metrics v2
      • Jetty workaround
      • CapacityScheduler should assign multiple tasks per heartbeat
      • User's task logs filling up local disks on the TaskTrackers
      • FairScheduler back-port from trunk
  • 10. Not in Hadoop 0.22.0: HDFS Part
      • Shortcut local client reads to a DataNode's files directly
          – Important HBase optimization
          – Porting is in progress
      • WebHDFS: accessing HDFS over HTTP
          – New experimental feature, back-ported from trunk
      • NameNode startup time
          – Handling block reports and missed heartbeats from DataNodes
          – The rest is forward ported from 1.0
          – More startup improvements in 0.22
  • 11. Hadoop 0.23 Features
      • HDFS Federation
          – Independent NameNodes sharing a common pool of DataNodes
          – Cluster is a family of volumes with shared block storage layer
          – User sees volumes as isolated file systems
          – ViewFS: the client-side mount table
          – Federated approach provides a static partitioning of the federated namespace
      • Yarn: Scalability for MapReduce framework
          – Separation of JobTracker functions
              1. Job scheduling and resource allocation
                  • Fundamentally centralized
              2. Job monitoring and job life-cycle coordination
                  • Delegate coordination of different jobs to other nodes
          – Dynamic partitioning of cluster resources: no fixed slots
      • “Apache Hadoop: The scalability update”, USENIX ;login:, June 2011
  • 12. Append and HBase
      • Append means
          – Reopening of existing files for appending new data
          – Replica synchronization after failure
          – Consistent view of file data during writing by different clients
          – hflush, hsync – guarantee data delivered to DNs and persisted on NN
      • First implementation of append in 0.19 (HADOOP-1700)
          – 0.20-append branch
      • Redesign of append in 0.21 (HDFS-265)
      • HBase needs hflush and hsync only
      • Hadoop 1.0 – HBase support via hflush, hsync
      • Hadoop 0.22 – fully functional append, including HBase support
  • 13. BackupNode
      • BackupNode is a read-only NameNode
          – Contains all file system metadata: files and directories, excluding block locations
          – Can perform NameNode operations that don’t modify the namespace
      • BN maintains an up-to-date in-memory image of the file system namespace, always synchronized with the NameNode state
          – NameNode streams journal to BackupNode
      • BackupNode can create a checkpoint without downloading checkpoint and journal files from the active NameNode
      • Intended to evolve into hot HA (HDFS-2064)
  • 14. Hadoop at eBay
      • 2011 started with 532-node 5 PB cluster running CDH2
      • eBay 0.20.203-based build (Wilma)
          – Hadoop 0.20.203 – latest stable Apache release
              • HDFS, MapReduce, Pig, Hive, Cascading, Mobius, lzo
          – 500+ users; 2000 jobs per day
      • Runs on 1000-node cluster
          – 24 PB capacity, 72 GB RAM / node
      • Many smaller clusters
      • Stabilization of Hadoop platform based on 0.22
  • 15. Testing
      • One year of testing by different groups in Hadoop ecosystem
      • Extensive testing of append by HBase community
      • Fully automated build and certification with BigTop
      • Hadoop platform team at eBay
          – Extensive stabilization effort starting September
          – Most bugs found in 0.22 are also in trunk and 0.23
          – All new features tested
          – Stress testing
          – Reliability testing
      • Works with: Pig 0.8, Hive 0.7, custom changes HBase 0.92, Oozie, open sourced Zookeeper, Cascading no changes needed
  • 16. Testing Tools, Examples
      • TeraSort, TestDFSIO, DistCp
      • GridMix, Rumen – production job traces
      • SLive – adjustable mix of HDFS operations, permanent load
      • Upgrade / rollback from 0.20.? and 0.20.203 to 0.22
      • Oversubscribed cluster running out of memory
      • Losing racks with running jobs and HBase
          – Cluster survived consecutive loss of 4 racks, shrinking to a single rack with HBase still alive and MR jobs completing
      • Disk-fail-in-place helps identify bad drives during hardware burn-in
  • 17. Benchmarking
      • TestDFSIO: 10 GB files (same as 100 GB)
      • TeraSort: -5% (scheduler to blame)
      • YCSB – same
      • Internal eBay applications – same or better
      • Lots of tuning: Hadoop, Java, OS, HW
          – Gradual improvement of results

      Throughput, MB/sec    Read    Write    Append
      Hadoop-0.22           100     84       83
      0.20 breed            96      66       n/a
  • 18. Good to have for 0.22.1
      • Restore Security
      • Disk Fail in place for MapReduce
      • Optimizations
          – Multiple tasks per heartbeat for CapacityScheduler
          – CapacityScheduler preemption
      • MR job and task limits
      • Cluster startup time
      • Add HA?
      • Merge MR-1.0 into Hadoop 0.22?
  • 19. Important
      • Works but not 0.20
          – Good new features
          – Reliability is the first concern
          – Performance and missing functionality can be reconstructed
      • Community release
          – Not distributed / advertised by commercial distributors
          – Community involvement important
      • Don’t try to upgrade from Hadoop 0.21 to Hadoop 1.0. It’s the other way around
          – Go to Hadoop 0.22 instead
      • Forward-going release progress
          – Stop porting new features, start releasing them
  • 20. Thank you
      Hadoop 0.22 Contributions Accepted
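
The Benchmarking slide gives TestDFSIO throughput for Hadoop 0.22 and for the "0.20 breed" but leaves the comparison to the reader. The short Python sketch below works out the relative change for reads and writes from those figures. The dictionary layout and the function name `relative_change` are illustrative assumptions; only the numbers come from the slide. Append is left out because the slide lists no 0.20 baseline for it.

```python
# Throughput in MB/sec, copied from the Benchmarking slide.
THROUGHPUT = {
    "Hadoop-0.22": {"read": 100, "write": 84},
    "0.20 breed": {"read": 96, "write": 66},
}


def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


def compare(table: dict) -> dict:
    """Relative change of Hadoop 0.22 over the 0.20 breed, per operation."""
    new, old = table["Hadoop-0.22"], table["0.20 breed"]
    return {op: relative_change(new[op], old[op]) for op in new}


if __name__ == "__main__":
    for op, pct in compare(THROUGHPUT).items():
        print(f"{op}: {pct:+.1f}%")
```

On these figures, reads improve by roughly 4% and writes by roughly 27%.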
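
The Hadoop 0.23 Features slide describes ViewFS as a client-side mount table over independent NameNodes. One way to picture that idea is a longest-prefix lookup from a client path to the namespace that owns it. The sketch below is a toy model only, not the ViewFS implementation: the mount points, host names, and the `resolve` function are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical mount table: client-visible path prefix -> NameNode URI.
# The host names are made up for illustration.
MOUNT_TABLE: Dict[str, str] = {
    "/user": "hdfs://nn1.example.com:8020",
    "/data": "hdfs://nn2.example.com:8020",
    "/data/logs": "hdfs://nn3.example.com:8020",
}


def resolve(path: str, table: Dict[str, str]) -> Tuple[str, str]:
    """Return (namenode_uri, path) using the longest mount point covering path."""
    best = None
    for mount in table:
        # Match whole path components only, so "/database" is not under "/data".
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return table[best], path


if __name__ == "__main__":
    print(resolve("/data/logs/2012/01", MOUNT_TABLE))
    print(resolve("/user/alice", MOUNT_TABLE))
```

Because the table lives on the client and is fixed in configuration, the partitioning of the namespace is static, which is the limitation the slide points out.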
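
The BackupNode slide says the NameNode streams its journal to the BackupNode, which keeps an in-memory image in sync and can therefore checkpoint without downloading anything. The toy Python model below sketches that flow under heavy simplification: the namespace is just a set of paths, and the class and method names (`ToyNameNode`, `ToyBackupNode`, `modify`, `receive`, `checkpoint`) are hypothetical rather than Hadoop APIs.

```python
from typing import List, Set, Tuple

Op = Tuple[str, str]  # (operation, path), e.g. ("create", "/a")


class Namespace:
    """Simplified in-memory namespace image: a set of existing paths."""

    def __init__(self) -> None:
        self.paths: Set[str] = set()

    def apply(self, op: Op) -> None:
        kind, path = op
        if kind == "create":
            self.paths.add(path)
        elif kind == "delete":
            self.paths.discard(path)
        else:
            raise ValueError(f"unknown operation: {kind}")


class ToyBackupNode:
    """Applies every streamed journal record to its own in-memory image."""

    def __init__(self) -> None:
        self.image = Namespace()

    def receive(self, op: Op) -> None:
        self.image.apply(op)

    def checkpoint(self) -> List[str]:
        # The image is already current, so no download from the NameNode is needed.
        return sorted(self.image.paths)


class ToyNameNode:
    """Applies each modification locally, then streams it to the BackupNode."""

    def __init__(self, backup: ToyBackupNode) -> None:
        self.image = Namespace()
        self.backup = backup

    def modify(self, op: Op) -> None:
        self.image.apply(op)
        self.backup.receive(op)


if __name__ == "__main__":
    bn = ToyBackupNode()
    nn = ToyNameNode(bn)
    for op in [("create", "/a"), ("create", "/b"), ("delete", "/a")]:
        nn.modify(op)
    print(bn.checkpoint())
```

The model omits block locations entirely, matching the slide's note that the BackupNode holds file and directory metadata but not block locations.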