Sector vs. Hadoop: A Brief Comparison Between the Two Systems

Background
  • Sector is a relatively “new” system that is broadly comparable to Hadoop, and people want to know what the differences are.
  • Is Sector simply another clone of GFS/MapReduce? No.
  • These slides compare the most recent versions of Sector and Hadoop as of Nov. 2010.
  • Both systems are still under active development.
  • If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.

Design Goals
Sector: three-layer functionality
  • Distributed file system
  • Data collection, sharing, and distribution (over wide area networks)
  • Massive in-storage parallel data processing with simplified interface
Hadoop: two-layer functionality
  • Distributed file system
  • Massive in-storage parallel data processing with simplified interface

History
Sector:
  • Started around 2005-2006 as a P2P file sharing and content distribution system for extremely large scientific datasets.
  • Switched to a centralized general-purpose file system in 2007.
  • Introduced in-storage parallel data processing in 2007.
  • First “modern” version released in 2009.
Hadoop:
  • Started as a web crawler & indexer, Nutch, that adopted GFS/MapReduce between 2004 and 2006.
  • Yahoo! took over the project in 2006. Hadoop split from Nutch.
  • First “modern” version released in 2008.

Architecture
Sector:
  • Master-slave system
  • Masters store metadata, slaves store data
  • Multiple active masters
  • Clients perform IO directly with slaves
Hadoop:
  • Master-slave system
  • Namenode stores metadata, datanodes store data
  • Single namenode (single point of failure)
  • Clients perform IO directly with datanodes

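The read path that both architectures share (ask a master for metadata, then move data directly to or from a storage node) can be sketched as a toy model. This is a minimal illustration, not either system's API: the classes `Master` and `Slave`, the `alive` flag, and `client_read` are all hypothetical names introduced here.

```python
from typing import Dict, List


class Slave:
    """Storage node: holds file contents, knows nothing about placement."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self.files[path]


class Master:
    """Metadata node: records which storage nodes hold each file."""

    def __init__(self, alive: bool = True) -> None:
        self.alive = alive  # hypothetical flag used to simulate a failed master
        self.locations: Dict[str, List[str]] = {}

    def locate(self, path: str) -> List[str]:
        if not self.alive:
            raise ConnectionError("master unreachable")
        return self.locations[path]


def client_read(path: str, masters: List[Master], slaves: Dict[str, Slave]) -> bytes:
    """Look up metadata on any reachable master, then read from a slave directly."""
    for master in masters:
        try:
            holders = master.locate(path)
        except ConnectionError:
            continue  # with several active masters, try the next one
        # The file bytes never pass through the master.
        return slaves[holders[0]].read(path)
    raise ConnectionError("no master reachable")
```

With a list of several masters the lookup survives one of them being down; with a single-element list, the same failure makes every read fail, which is the single point of failure the slide refers to.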
Distributed File System
Sector:
  • General purpose IO
  • Optimized for large files
  • File based (file not split by Sector but users have to take care of it)
  • Use replication for fault tolerance
HDFS:
  • Write once read many (no random write)
  • Optimized for large files
  • Block based (64MB block as default)
  • Use replication for fault tolerance

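To make the file-based versus block-based distinction concrete, the sketch below counts how many storage units one file occupies under each model. The function names are hypothetical and belong to neither system; the 64 MB figure is the HDFS default quoted in the slide.

```python
MB = 1024 * 1024
HDFS_DEFAULT_BLOCK = 64 * MB  # default block size quoted in the slide


def block_based_units(file_size: int, block_size: int = HDFS_DEFAULT_BLOCK) -> int:
    """Blocks needed when the file system splits files into fixed-size blocks."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: a partially filled last block still counts.
    return (file_size + block_size - 1) // block_size


def file_based_units(file_size: int) -> int:
    """A file-based system keeps every file whole, so it is always one unit."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    return 1
```

For example, a 200 MB file occupies four 64 MB blocks in the block-based model but stays a single unit in the file-based model, which is why the slide notes that Sector users must split very large files themselves.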
Replication
Sector:
  • System level default in configuration
  • Per-file replica factor can be specified in a configuration file and can be changed at run-time
  • Replicas are stored as far away as possible, but within a distance limit, configurable at per-file level
  • File location can be limited to certain clusters only
HDFS:
  • System level default in configuration
  • Per-file replica factor is supported during file creation
  • For 3 replicas (default), 2 on the same rack, the 3rd on a different rack

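The two placement policies can be contrasted with a small sketch. Everything here is an assumption made for illustration: the `(cluster, rack, host)` node labels, the four-level `distance` function, and the greedy farthest-first strategy in `place_far_apart` are one possible reading of "as far away as possible, but within a distance limit", not Sector's actual algorithm. `place_rack_aware` follows only the slide's description of the HDFS default.

```python
from typing import List, Tuple

Node = Tuple[str, str, str]  # (cluster, rack, host): hypothetical topology labels


def distance(a: Node, b: Node) -> int:
    """Toy distance: 0 same host, 1 same rack, 2 same cluster, 3 otherwise."""
    if a == b:
        return 0
    if a[:2] == b[:2]:
        return 1
    if a[0] == b[0]:
        return 2
    return 3


def place_rack_aware(nodes: List[Node], first: Node) -> List[Node]:
    """Three replicas as the slide describes: two on one rack, one elsewhere."""
    # next() raises StopIteration if the topology has no suitable node.
    same_rack = next(n for n in nodes if distance(first, n) == 1)
    other_rack = next(n for n in nodes if distance(first, n) >= 2)
    return [first, same_rack, other_rack]


def place_far_apart(nodes: List[Node], first: Node, copies: int, limit: int) -> List[Node]:
    """Greedy farthest-first placement within `limit` of the first replica."""
    chosen = [first]
    candidates = [n for n in nodes if n != first and distance(first, n) <= limit]
    while len(chosen) < copies and candidates:
        # Take the candidate whose nearest already-chosen replica is farthest away.
        best = max(candidates, key=lambda n: min(distance(n, c) for c in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

Raising `limit` lets replicas spread across clusters, which helps remote readers find a nearby copy; lowering it keeps all copies close together at the cost of wide-area coverage.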
Security
Sector:
  • A Sector security server is used to maintain user credentials and permission, server ACL, etc.
  • Security server can be extended to connect to other sources, e.g., LDAP
  • Optional file transfer encryption
  • UDP-based hole punching firewall traversing for clients
Hadoop:
  • Still in active development, new features in 2010
  • Kerberos/token based security framework to authenticate users
  • No file transfer encryption

Wide Area Data Access
Sector:
  • Sector ensures high performance data transfer with UDT, a high speed data transfer protocol
  • As Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica
  • Thus, Sector can be used as content distribution network for very large datasets
HDFS:
  • HDFS has no special consideration for wide area access. Its performance for remote access would be close to a stock FTP server.
  • Its security mechanism may also be a problem for remote data access.

In-Storage Data Processing
Sphere:
  • Apply arbitrary user-defined functions (UDFs) on data segments
  • UDFs can be Map, Reduce, or others
  • Support native MapReduce as well
Hadoop MapReduce:
  • Support the MapReduce framework

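The idea that Map and Reduce are just two of many possible UDFs can be sketched as follows. This is a single-process toy, assuming a data segment is a list of text records; `apply_udf`, `shuffle`, and the three example UDFs are hypothetical names and do not reflect the Sphere API.

```python
from collections import defaultdict
from typing import Callable, Iterable, List


def apply_udf(segments: Iterable[list], udf: Callable[[list], list]) -> List[list]:
    """Apply one user-defined function to every data segment independently."""
    return [udf(segment) for segment in segments]


def map_udf(segment: list) -> list:
    """Map-style UDF: emit a (word, 1) pair for every word in the segment."""
    return [(word, 1) for record in segment for word in record.split()]


def reduce_udf(pairs: list) -> list:
    """Reduce-style UDF: sum the counts per key within one segment."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def head_udf(segment: list) -> list:
    """Neither map nor reduce: keep only the first record of a segment."""
    return segment[:1]


def shuffle(mapped: Iterable[list], buckets: int) -> List[list]:
    """Route each pair to an output bucket chosen deterministically from its key."""
    out: List[list] = [[] for _ in range(buckets)]
    for segment in mapped:
        for key, value in segment:
            out[sum(key.encode()) % buckets].append((key, value))
    return out
```

Chaining `apply_udf(segments, map_udf)`, `shuffle`, and `apply_udf(..., reduce_udf)` reproduces a word count in MapReduce style, while `head_udf` shows a stage that fits neither role yet runs through the same `apply_udf` call.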
Sphere UDF vs. Hadoop MapReduce
Sphere UDF:
  • Parsing: permanent record offset index if necessary
  • Data segments (records, blocks, files, and directories) are processed by UDFs
  • Transparent load balancing and fault tolerance
  • Sphere is about 2-4x faster in various benchmarks
Hadoop MapReduce:
  • Parsing: run-time data parsing with default or user-defined parser
  • Data records are processed by Map and Reduce operations
  • Transparent load balancing and fault tolerance

Why Is Sphere Faster than Hadoop?
  • C++ vs. Java
  • Different internal data flows
  • Sphere UDF model is more flexible than MapReduce
      • UDF disassembles MapReduce and gives developers more control over the process
  • Sphere has better input data locality
      • Better performance for applications that process files and groups of files as the minimum input unit
  • Sphere has better output data locality
      • Output location can be used to optimize iterative and combinative processing, such as join
  • Different implementations and optimizations
      • UDT vs. TCP (significant in wide area systems)

Compatibility with Existing Systems
Sector/Sphere:
  • Sector files can be accessed from outside if necessary
  • Sphere can simply apply an existing application executable on multiple files or even directories in parallel, if the executable accepts a file or a directory as input
Hadoop:
  • Data in HDFS can only be accessed via HDFS interfaces
  • In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but will cause extra data movement if file size is greater than block size
  • Hadoop cannot process multiple files within one operation.

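The pattern of applying an unmodified executable to many files in parallel can be illustrated locally. This sketch runs on one machine using the Python standard library and only mimics the shape of the idea; it does not run anything on storage nodes, and `run_on_files` is a hypothetical helper, not a Sphere call.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def run_on_files(command: Sequence[str], paths: List[str], workers: int = 4) -> Dict[str, str]:
    """Run `command <path>` once per path, in parallel, and collect each stdout."""

    def run_one(path: str) -> str:
        # The executable is used as-is; it only needs to accept a path argument.
        done = subprocess.run(
            [*command, path], capture_output=True, text=True, check=True
        )
        return done.stdout

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(run_one, paths))  # map preserves input order
    return dict(zip(paths, outputs))
```

Because each invocation receives a whole file, no file is split across workers, which is the property the slide contrasts with wrapping executables in Map or Reduce operations over fixed-size blocks.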
Conclusions
  • Sector is a unique system that integrates a distributed file system, a content sharing network, and a parallel data processing framework.
  • Hadoop is mainly focused on large data processing within a single data center.
  • They overlap on the parallel data processing support.
