Pregel: A System for Large-Scale Graph
                 Processing
                                     Paper Review

                                   Maria Stylianou

                                 November 2, 2012


1 Motivation
Large-scale graphs, such as the Web graph and social networks, are among the
main sources of new computing problems, and processing them efficiently is a
challenge. MapReduce can be used, but it is inefficient for graph algorithms
because the entire state of the graph must be passed from one stage to the next.
The authors therefore propose Pregel, a distributed programming model designed
specifically for processing large-scale graphs while providing efficiency,
scalability and fault tolerance [1].


2 Contributions
Until now, there has been a gap in the area of frameworks for large-scale graph
processing: a framework that offers scalability while being distributed and
fault-tolerant. Pregel is designed with exactly these characteristics. The authors
designed Pregel for the Google cluster architecture, in which clusters are
interconnected and geographically distributed, each containing thousands of
commodity machines. Their main contributions include:
1. Design of a fault-tolerant distributed programming framework that enables
   execution of graph algorithms in parallel over thousands of machines.
2. Provision of an API with direct message passing among vertices, combiners for
   reducing overhead, aggregators for global communication and monitoring, and
   topology mutations with resolution of conflicting requests.


3 Solution
Pregel operates as a sequence of synchronized computation steps on vertices. When
a graph is loaded as input, it is divided into partitions, each consisting of a set
of vertices and their outgoing edges. The partitions are assigned to machines, and
one machine acts as a master that coordinates the worker machines. The workers then
run a series of iterations, called supersteps. In every superstep, all active
vertices on each worker execute the same user-defined function, which can (a)
receive messages sent during the previous superstep, (b) modify the state of the
vertex and its outgoing edges and (c) send messages to be delivered during the
next superstep. At the end of each superstep a global synchronization point occurs.
Vertices can become inactive, and an inactive vertex is reactivated when it
receives a message. The sequence of iterations terminates when all vertices are
inactive and there are no messages in transit. During computation, the master also
sends ping messages to detect worker failures. Vertices and edges stay on the
machine that computes on them, so the network is used only for sending messages,
which significantly reduces the communication overhead.
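
The superstep loop described above can be sketched on a single machine. The
following Python code is only an illustration, not Pregel's API: the function
name run_max_value, the dictionary-based graph and the max_supersteps cap are
assumptions made for this sketch. It propagates the largest vertex value through
the graph, and every vertex votes to halt after computing, so that only an
incoming message reactivates it.

```python
from collections import defaultdict


def run_max_value(values, out_edges, max_supersteps=100):
    """Single-machine sketch of vertex-centric supersteps (hypothetical helper).

    values:    dict mapping vertex -> initial numeric value
    out_edges: dict mapping vertex -> list of target vertices
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    inbox = {}             # vertex -> messages delivered in this superstep
    active = set(values)   # every vertex is active in superstep 0
    superstep = 0

    while active and superstep < max_supersteps:
        outbox = defaultdict(list)
        for vertex in active:
            old = values[vertex]
            # (a) receive messages sent during the previous superstep
            # (b) modify the state of the vertex
            values[vertex] = max([old] + inbox.get(vertex, []))
            # (c) send messages to be delivered in the next superstep
            if superstep == 0 or values[vertex] > old:
                for target in out_edges.get(vertex, []):
                    outbox[target].append(values[vertex])
            # The vertex now votes to halt; only a message reactivates it.

        # Global synchronization point: messages become visible only here.
        inbox = dict(outbox)
        active = set(inbox)
        superstep += 1

    return values, superstep


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(run_max_value({"a": 3, "b": 6, "c": 2, "d": 1}, graph))
```

The loop stops exactly when no vertex is active and no message is pending, which
mirrors the termination condition stated above. On the four-vertex chain in the
usage lines, every vertex ends with the value 6.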


4 Strong Points
S1 Fault tolerance is achieved with checkpoints, in which the state of the
    workers' partitions is saved to persistent storage. Upon a machine failure
    during computation, the remaining machines reload their partition state from
    the most recent checkpoint.
S2 Combiners are an optimization that reduces network traffic and must be enabled
    explicitly by the user. With this option, several messages destined for the
    same vertex can be combined and sent as a single message, reducing the overhead.
S3 Aggregators are a mechanism for global communication and monitoring. They have
    several uses, for example computing statistics, global coordination or more
    advanced implementations. ...
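
Points S2 and S3 can be pictured with a small Python sketch. The function names
combine_min and aggregate_sum and the list-based message format are assumptions
made for illustration and are not Pregel's API. The combiner keeps one message
per destination vertex, which is only safe when the combining operation is
commutative and associative (here, minimum). The aggregator reduces the values
that vertices contribute during one superstep to a single global value.

```python
def combine_min(outgoing):
    """Combiner sketch: collapse messages per destination into one message.

    outgoing is a list of (target_vertex, value) pairs produced on one worker.
    Only the minimum value per target is kept, so fewer messages cross the network.
    """
    combined = {}
    for target, value in outgoing:
        combined[target] = min(value, combined.get(target, value))
    return combined


def aggregate_sum(contributions):
    """Aggregator sketch: reduce per-vertex contributions to one global value.

    In the model, the result would be made available to all vertices
    in the following superstep.
    """
    return sum(contributions)


if __name__ == "__main__":
    messages = [("v1", 5), ("v2", 7), ("v1", 3), ("v1", 9)]
    print(combine_min(messages))     # four messages reduced to two
    print(aggregate_sum([2, 0, 3]))  # e.g. total out-degree seen by three vertices
```

A minimum combiner fits algorithms such as shortest paths, where a vertex only
needs the smallest distance offered to it and not every individual message.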


5 Weak Points
W1 The user has to write a considerable amount of code to adapt Pregel to his/her
    needs. More precisely, the user has to write code to enable combiners and to
    customize aggregators. Additionally, the user is responsible for resolving
    conflicting topology mutation requests: he/she needs to define handlers, which
    increases the complexity of the system.
W2 No failure detection is mentioned for the master, making it a single point of failure.
W3 The evaluation presented in the paper is very limited and comes with very little
    explanation. There is no clear comparison with other systems; an experimental
    comparison with MapReduce would be interesting. Also, there is no experiment
    evaluating the fault tolerance of the system. ...


References
[1] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and
    G. Czajkowski, “Pregel: a system for large-scale graph processing,” in
    Proceedings of the 2010 ACM SIGMOD International Conference on Management of
    data, SIGMOD ’10, (New York, NY, USA), pp. 135–146, ACM, 2010.



