Processing graph/relational data
             with
         Map-Reduce
             and
   Bulk Synchronous Parallel
              v. 1.1




                          Tomasz Chodakowski,

                          1st Bristol Hadoop Workshop, 08-11-2010
Irregular Algorithms

●   Map-reduce – a simplified model for “embarrassingly
    parallel” problems
        –   Easily separable into independent tasks
        –   Captured by static dependence graph

●   Most graph algorithms are irregular, i.e.:
        –   Dependencies between tasks arise during
             execution
        –   “don't care non-determinism” - tasks can be
              executed in arbitrary order yet still yield
              correct results.
Irregular Algorithms

●   Often operate on data structures with
    complex topologies:
          –   Graphs, trees, grids, ...
          –   Where “data elements” are connected by
               “relations”


●   Computations on such structures depend
    strongly on relations between data elements
          –   primary source of dependencies between
                tasks

    more in [ADP] “Amorphous Data-parallelism in Irregular Algorithms”
Relational Data

●   Example relations between elements:
        –   social interactions (co-authorship,
              friendship)
        –   web links, document references
        –   linked data or semantic network relations
        –   geo-spatial relations
        –   ...
●   Different from a relational model
        –   in that relations are arbitrary
Graph Algorithms Rough Classification

●   Aggregation, feature extraction
        –   Not leveraging latent relations
●   Network analysis (matrix-based, single relational)
        –   Geodesic (radius, diameter etc.)
        –   Spectral (eigenvector-based, centrality)
●   Algorithmic/node-based algorithms
        –   Recommender systems, belief/label
             propagation
        –   Traversal, path detection, interaction
              networks, etc.
Iterative Vertex-based Graph Algorithms

●   Iteratively:
         –   Compute local function of a vertex that
              depends on the vertex state and local
              graph structure (neighbourhood)
         –   and/or Modify local state
         –   and/or Modify local topology
         –   pass messages to neighbouring nodes

●   -> “vertex-based computation”
             Amorphous Data-Parallelism [ADP] operator formulation:
             “repeated application of neighbourhood operators in a specific order”
Recent applications/developments



●   Google work on graph-based YouTube
    recommendations:
        –   Leveraging latent information
        –   Diffusing interest in sparsely labeled video
             clips
●   User profiling, sentiment analysis
        –   Facebook likes, Hunch, Gravity, MusicMetric
             ...
Single Source Shortest Path

[Figure sequence: a directed graph labelled with positive integers, with its
structure split into two partitions (P1, P2), shown next to a time-space view
of the workload and communication between the partitions.]

●   Turquoise rectangles show the computational work load for a
    partition (work); active vertices are in turquoise
●   Signals being passed along relations are in light green and are
    labelled with sums such as 0+1, 0+6 and 0+9
●   Thick green lines show costly inter-partition communications (comm)
●   The vertical grey line is a barrier synchronisation to avoid race
    conditions (barrier)
●   Work, comm and barrier form a BSP superstep
●   Vertices become active upon receiving a signal in a previous
    superstep
●   After performing local computation they send signals to their
    neighbouring vertices
●   Computation ends when there are no active vertices left
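The walk-through above can be sketched as a single-process loop. This is only an illustration under assumptions: the class name `BspSsspSketch`, the adjacency-list layout and the four-vertex `demoGraph()` are hypothetical and are not the graph from the slides or the talk's framework. Each pass of the `while` loop plays the role of one superstep: work on the active vertices, emission of signals, then a barrier when the outbox becomes the next inbox.

```java
import java.util.*;

// Hypothetical single-process sketch of BSP-style SSSP (not the talk's framework).
class BspSsspSketch {
    // edges.get(u) holds {target, length} pairs for the outgoing relations of u.
    static int[] run(List<List<int[]>> edges, int source) {
        int[] dist = new int[edges.size()];
        Arrays.fill(dist, Integer.MAX_VALUE);
        // inbox: vertex id -> signals received in the previous superstep
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        inbox.put(source, new ArrayList<>(Arrays.asList(0)));
        while (!inbox.isEmpty()) {                 // no signals means no active vertices
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                int v = e.getKey();
                int minDist = Collections.min(e.getValue());   // work
                if (minDist < dist[v]) {
                    dist[v] = minDist;
                    for (int[] edge : edges.get(v)) {          // comm
                        outbox.computeIfAbsent(edge[0], k -> new ArrayList<>())
                              .add(minDist + edge[1]);
                    }
                }
            }
            inbox = outbox;                        // barrier: signals become visible
        }
        return dist;
    }

    // Assumed example graph: 0->1 (1), 0->2 (6), 1->2 (3), 2->3 (2).
    static List<List<int[]>> demoGraph() {
        List<List<int[]>> g = new ArrayList<>();
        for (int i = 0; i < 4; i++) g.add(new ArrayList<>());
        g.get(0).add(new int[]{1, 1});
        g.get(0).add(new int[]{2, 6});
        g.get(1).add(new int[]{2, 3});
        g.get(2).add(new int[]{3, 2});
        return g;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(demoGraph(), 0))); // [0, 1, 4, 6]
    }
}
```

On the assumed graph, vertex 2 first settles on 6 and is later improved to 4, and vertex 3 goes from 8 to 6, which is the kind of correction the later supersteps in the figure sequence illustrate.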
Bulk Synchronous Parallel

[Figure: supersteps 0, 1, 2, 3, ... across partitions P1, P2, ..., Pn; each
superstep i is annotated with its work wi, communication hi and barrier li.]

superstep n cost = wn + hn + ln
        –   wn: time to finish work on the slowest partition
        –   hn: cost of bulk communication
        –   ln: barrier synchronization time
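The cost formula can be made concrete with a small calculation. The sketch below is an assumption-laden illustration: the class `BspCostSketch`, its method names and the sample numbers are invented for this example, and the units are arbitrary.

```java
// Hypothetical helper illustrating the superstep cost w + h + l.
class BspCostSketch {
    // w is the work time of the slowest partition in this superstep.
    static double superstepCost(double[] workPerPartition, double h, double l) {
        double w = 0;
        for (double x : workPerPartition) {
            w = Math.max(w, x);
        }
        return w + h + l;
    }

    // Total cost is the sum of the per-superstep costs.
    static double totalCost(double[][] work, double[] h, double[] l) {
        double total = 0;
        for (int i = 0; i < work.length; i++) {
            total += superstepCost(work[i], h[i], l[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] work = {{4, 2}, {1, 3}};   // two supersteps, two partitions
        double[] h = {1, 2};
        double[] l = {0.5, 0.5};
        System.out.println(totalCost(work, h, l)); // 11.0
    }
}
```

Because only the slowest partition counts towards w, an unbalanced partitioning raises the cost of every superstep even when the total amount of work is unchanged.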
Bulk Synchronous Parallel

●   Advantages
           –   Simple and portable execution model
           –   Clear cost model
           –   No concurrency control, no data races,
                deadlocks, etc.
●   Disadvantages
           –   Coarse grained
                    ●   Depends on a large “parallel slack”
           –   Requires well-partitioned problem space for
                efficiency (well balanced partitions)

    more in [BSP] “A bridging model for parallel computation”
Bulk Synchronous Parallel - extensions

●   Combiners
        –   minimizing inter-node communication (h
             factor)
●   Aggregators
        –   Computing global state (e.g. map/reduce)


            And other extensions...
Sample code

public void superStep() {
    // The starting element proposes distance 0; every other vertex starts at "infinity"
    int minDist = this.isStartingElement() ? 0 : Integer.MAX_VALUE;

    // Choose the minimum proposed distance
    for (DistanceMessage msg : messages()) {
        minDist = Math.min(minDist, msg.getDistance());
    }

    // If it improves the path, store and propagate
    if (minDist < this.getCurrentDistance()) {
        this.setCurrentDistance(minDist);
        IVertex v = this.getElement();
        for (IEdge r : v.getOutgoingEdges(DemoRelationshipTypes.KNOWS)) {
            IElement recipient = r.getOtherElement(v);
            int rDist = this.getLengthOf(r);
            this.sendMessage(new DistanceMessage(minDist + rDist, recipient.getId()));
        }
    }
}
SSSP - Map-Reduce Naive

●   Idea [DPMR]:
        –   In map phase:
                ●  emit both signals and local vertex
                    structure and state
        –   In reduce phase:
                ●  gather signals and local vertex
                    structure messages
                ● reconstruct vertex structure and state
SSSP - Map-Reduce Naive
def map(Id nId, Node N):
  //emit state and structure
  emit(nId, N.graphStateAndStruct)
  if(N.isActive)
    for(nbr : N.adjacencyL)
      //local computation
      dist := N.currDist + DistToNbr
      //emit signals
      emit(nbr.id, dist)

def reduce(Id rId, {m1,m2,..}):
  new M; M.deActivate
  minDist = MAX_VALUE
  for(m in {m1,m2,..})
    if(m is Node) M := m               //state
    else if(m is Distance)             //signals
      minDist = min(minDist, m)
  if(M.currDist > minDist)
    M.currDist := minDist
    M.activate
  emit(rId, M)
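A minimal in-memory sketch of the naive pattern, assuming a hypothetical `Node` record and a `TreeMap` standing in for the shuffle; it does not use the Hadoop API and is not the Cloud9 code. It assumes every neighbour id also appears as a node, and it deactivates the reconstructed node before testing for an improvement.

```java
import java.util.*;

// Hypothetical in-memory simulation of naive map-reduce SSSP (no Hadoop involved).
class NaiveMrSsspSketch {
    static final int INF = Integer.MAX_VALUE;

    static class Node {
        final int id; int currDist; boolean active;
        final Map<Integer, Integer> adjacency;   // neighbour id -> edge length
        Node(int id, int currDist, boolean active, Map<Integer, Integer> adjacency) {
            this.id = id; this.currDist = currDist;
            this.active = active; this.adjacency = adjacency;
        }
    }

    static void emit(Map<Integer, List<Object>> shuffle, int key, Object value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Map: emit state and structure, plus one signal per neighbour if active.
    static void map(Node n, Map<Integer, List<Object>> shuffle) {
        emit(shuffle, n.id, n);
        if (n.active) {
            for (Map.Entry<Integer, Integer> e : n.adjacency.entrySet()) {
                emit(shuffle, e.getKey(), n.currDist + e.getValue());
            }
        }
    }

    // Reduce: reconstruct the node, then apply the smallest proposed distance.
    static Node reduce(int rId, List<Object> messages) {
        Node m = null;
        int minDist = INF;
        for (Object msg : messages) {
            if (msg instanceof Node) m = (Node) msg;              // state
            else minDist = Math.min(minDist, (Integer) msg);      // signal
        }
        m.active = false;
        if (m.currDist > minDist) { m.currDist = minDist; m.active = true; }
        return m;
    }

    // One map-reduce job per iteration, repeated until no node is active.
    static Map<Integer, Node> iterate(Map<Integer, Node> graph) {
        boolean anyActive = true;
        while (anyActive) {
            Map<Integer, List<Object>> shuffle = new TreeMap<>();
            for (Node n : graph.values()) map(n, shuffle);
            Map<Integer, Node> next = new TreeMap<>();
            anyActive = false;
            for (Map.Entry<Integer, List<Object>> e : shuffle.entrySet()) {
                Node r = reduce(e.getKey(), e.getValue());
                next.put(r.id, r);
                anyActive |= r.active;
            }
            graph = next;
        }
        return graph;
    }

    static Map<Integer, Integer> adj(int... pairs) {
        Map<Integer, Integer> m = new TreeMap<>();
        for (int i = 0; i < pairs.length; i += 2) m.put(pairs[i], pairs[i + 1]);
        return m;
    }

    // Assumed example graph: 0->1 (1), 0->2 (6), 1->2 (3), 2->3 (2); source is 0.
    static Map<Integer, Node> demoGraph() {
        Map<Integer, Node> g = new TreeMap<>();
        g.put(0, new Node(0, 0, true, adj(1, 1, 2, 6)));
        g.put(1, new Node(1, INF, false, adj(2, 3)));
        g.put(2, new Node(2, INF, false, adj(3, 2)));
        g.put(3, new Node(3, INF, false, adj()));
        return g;
    }

    public static void main(String[] args) {
        for (Node n : iterate(demoGraph()).values()) {
            System.out.println(n.id + " -> " + n.currDist);
        }
    }
}
```

Note that every node object travels through the shuffle on every iteration, even when it receives no signal.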
SSSP - Map Reduce Naive - issues

●   Cost associated with marshaling intermediate
    <k,v> pairs for combiners (which are optional)
        –   -> in-line combiner

●   Need to pass the whole graph state and structure
    around
        –   -> “Shimmy trick” -- pin down the structure

●   Partitions vertices without regard to graph
    topology
        –   -> cluster highly connected components
              together
Inline Combiners

●   In job configure:
        –   Initialize a map<NodeId, Distance>;
●   In job map operation:
        –   Do not emit intermediate pairs ( emit(nbr.id, dist) );
        –   Store them in the local map;
        –   Combine values in the same slots.
●   In job close:
        –   Emit a value from each slot in the map to a
             corresponding neighbour
                 ●   emit(nbr.id, map[nbr.id])
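The three steps above might look roughly like this for SSSP, where combining means keeping the minimum distance per neighbour. The class `InMapperCombinerSketch` and its method names are hypothetical and only mirror the configure/map/close life cycle; they are not the Hadoop Mapper API.

```java
import java.util.*;

// Hypothetical sketch of an in-line (in-mapper) combiner for SSSP signals.
class InMapperCombinerSketch {
    // "configure": one slot per recipient node id, holding the best distance so far.
    private final Map<Integer, Integer> slots = new HashMap<>();

    // "map": instead of emitting (nbr.id, dist), combine it into the local map.
    void map(int currDist, Map<Integer, Integer> adjacency) {
        for (Map.Entry<Integer, Integer> e : adjacency.entrySet()) {
            slots.merge(e.getKey(), currDist + e.getValue(), Integer::min);
        }
    }

    // "close": emit one combined value per slot (returned here, sorted by id).
    Map<Integer, Integer> close() {
        return new TreeMap<>(slots);
    }

    public static void main(String[] args) {
        InMapperCombinerSketch mapper = new InMapperCombinerSketch();
        Map<Integer, Integer> first = new HashMap<>();
        first.put(1, 1);
        first.put(2, 6);
        mapper.map(0, first);                       // proposes 1 -> 1 and 2 -> 6
        mapper.map(1, Collections.singletonMap(2, 3)); // proposes 2 -> 4
        System.out.println(mapper.close());         // {1=1, 2=4}
    }
}
```

Three proposed signals collapse into two emitted pairs, and nothing is marshalled until close.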
“Shimmy trick”

●   Store graph structure in a file system (no shuffle)
●   Inspired by a parallel merge join



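[Figure: parallel merge join. Two inputs sorted by join key are each
partitioned into p1, p2, p3, so that matching partitions, sorted and
partitioned by join key, can be joined independently.]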
“Shimmy trick”

●   Assume:
        –   Graph G representation sorted by node ids;
        –   G partitioned into n parts: G1, G2, .., Gn
        –   Use the same partitioner as in MR
        –   Set number of reducers to n
●   The above gives us:


        –   Reducer Ri, receives the same intermediate
             keys as those in Gi graph partition (in
             sorted order).
“Shimmy trick”
def configure():
  P.openGraphPartition()

def reduce(Id rId, {m1,m2,..}):
  repeat:
    (id nId, node N) <- P.read()
    if (nId != rId): N.deact; emit(nId, N)
  until: nId == rId
  minDist = MAX_VALUE
  for(m in {m1,m2,..}):
    minDist = min(minDist, m)
  if(N.currDist > minDist)
    N.currDist := minDist
    N.activate
  emit(rId, N)

def close():
  repeat:
    (id nId, node N) <- P.read()
    N.deactivate
    emit(nId, N)
  until: P is exhausted
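A minimal sketch of the merge-join style reducer, assuming the partition is an in-memory list sorted by node id and that reduce is called with keys in the same sorted order. `ShimmySketch`, `Node` and the `emitted` list are hypothetical stand-ins for the partition file reader and the job output; every key passed to reduce is assumed to exist in the partition.

```java
import java.util.*;

// Hypothetical sketch of a "Shimmy" reducer: structure is read locally, only signals are shuffled.
class ShimmySketch {
    static class Node {
        final int id; int currDist; boolean active;
        Node(int id, int currDist, boolean active) {
            this.id = id; this.currDist = currDist; this.active = active;
        }
    }

    private final Iterator<Node> partition;          // "configure": open the graph partition
    final List<Node> emitted = new ArrayList<>();    // stands in for emit(id, node)

    ShimmySketch(List<Node> sortedPartition) {
        this.partition = sortedPartition.iterator();
    }

    // Called once per key that received at least one distance signal.
    void reduce(int rId, List<Integer> distances) {
        Node n = partition.next();
        while (n.id != rId) {          // nodes skipped over received no signals
            n.active = false;
            emitted.add(n);
            n = partition.next();
        }
        int minDist = Collections.min(distances);
        n.active = false;
        if (n.currDist > minDist) { n.currDist = minDist; n.active = true; }
        emitted.add(n);
    }

    // "close": pass through whatever remains in the partition.
    void close() {
        while (partition.hasNext()) {
            Node n = partition.next();
            n.active = false;
            emitted.add(n);
        }
    }

    public static void main(String[] args) {
        int inf = Integer.MAX_VALUE;
        ShimmySketch r = new ShimmySketch(Arrays.asList(
            new Node(0, 0, true), new Node(1, inf, false),
            new Node(2, inf, false), new Node(3, inf, false)));
        r.reduce(1, Arrays.asList(1));
        r.reduce(2, Arrays.asList(6));
        r.close();
        for (Node n : r.emitted) {
            System.out.println(n.id + " dist=" + n.currDist + " active=" + n.active);
        }
    }
}
```

Because both the partition and the incoming keys are sorted, each node is read exactly once and the graph structure never enters the shuffle.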
“Shimmy trick”

●   Improvements:
        –   Files containing graph structure reside on
              dfs
        –   Reducers arbitrarily assigned to cluster
             machines
                ●   -> remote reads.

●   -> change the scheduler to assign key ranges to
    the same machines consistently.
Topology-aware Partitioner

●   Choose a partitioner that:
         –   minimizes inter-block traffic;
         –   maximizes intra-block traffic;
         –   places adjacent nodes in the same block

●   Difficult to achieve particularly with many real world
    datasets:
         –   Power-law distributions
         –   Reported that state-of-the-art partitioners
              (e.g. ParMETIS) fail for such cases (???)
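One way to compare partitioners against the goals above is to count the relations that cross block boundaries. The sketch below assumes a hypothetical `EdgeCutSketch` helper and a toy graph of two triangles joined by a single edge; it measures a given assignment and is not itself a partitioning algorithm.

```java
// Hypothetical helper: counts edges whose endpoints fall into different blocks.
class EdgeCutSketch {
    // edges[i] = {u, v}; block[v] = block assigned to vertex v.
    static int cutEdges(int[][] edges, int[] block) {
        int cut = 0;
        for (int[] e : edges) {
            if (block[e[0]] != block[e[1]]) {
                cut++;
            }
        }
        return cut;
    }

    public static void main(String[] args) {
        // Assumed toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}, {3, 4}, {4, 5}, {3, 5}, {2, 3}};
        int[] hashLike = {0, 1, 0, 1, 0, 1};     // block = id % 2, ignores topology
        int[] clustered = {0, 0, 0, 1, 1, 1};    // each triangle kept together
        System.out.println(cutEdges(edges, hashLike));   // 5
        System.out.println(cutEdges(edges, clustered));  // 1
    }
}
```

Each cut edge corresponds to a signal that must cross a partition boundary, so a lower count means less inter-block traffic for the same computation.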
MR Graph Processing Design Pattern

●   [DPMR] reports a 60%-70% improvement over the naive
    implementation
●   Solution closely resembles the BSP model
BSP (inspired) implementations

●   Google Pregel:
          –   classic BSP, C++, production
●   CMU GraphLab
          –   inspired by BSP, C++, multi-core
          –   consistency models, custom schedulers
●   Apache Hama
          –   scientific computation package that runs on top of
                Hadoop, BSP, MS Dryad (?)
●   Signal/Collect (University of Zurich)
          –   Scala, not yet distributed
●   ...
Open questions

●   What problems are particularly suitable for MR and
    which ones for BSP – where are the boundaries?
        –   Topology-based centrality algorithms
             (PageRank):
                ●   Algebraic, matrix-based methods vs.
                     vertex-based ones?

●   When considering graph algorithms:
        –   MR user base vs. BSP ergonomics?
        –   Performance overheads?
●   Relaxing the BSP synchronous schedule -->
    “Amorphous data parallelism”
POC, Sample Code

●   Project Masuria (early stages, 2011-02)
         –   http://masuria-project.org/
         –   As much POC of BSP framework as it is
               (distributed) OSGI playground.
●   Sample code:
         –   https://github.com/tch/Cloud9 *
         –   git@git.assembla.com:tch_sandbox.git
         –   RunSSSPNaive.java
         –   RunSSSPShimmy.java *
    * - expect (my) bugs
    Based on Jimmy Lin and Michael Schatz Cloud9 library
References

●   [ADP] “Amorphous Data-parallelism in Irregular Algorithms”, Keshav
    Pingali et al.
●   [BSP] “A bridging model for parallel computation”, Leslie G. Valiant
●   [DPMR] “Design Patterns for Efficient Graph Algorithms in
    MapReduce”, Jimmy Lin and Michael Schatz

 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destaque (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

  • 1. Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel v. 1.1 Tomasz Chodakowski, 1st Bristol Hadoop Workshop, 08-11-2010
  • 2. Irregular Algorithms ● Map-reduce – a simplified model for “embarrassingly parallel” problems – Easily separable into independent tasks – Captured by static dependence graph ● Most graph algorithms are irregular, i.e.: – Dependencies between tasks arise during execution – “don't care non-determinism” - tasks can be executed in arbitrary order yet still yield correct results.
  • 3. Irregular Algorithms ● Often operate on data structures with complex topologies: – Graphs, trees, grids, ... – Where “data elements” are connected by “relations” ● Computations on such structures depend strongly on relations between data elements – primary source of dependencies between tasks. More in [ADP] “Amorphous Data-parallelism in Irregular Algorithms”
  • 4. Relational Data ● Example relations between elements: – social interactions (co-authorship, friendship) – web links, document references – linked data or semantic network relations – geo-spatial relations – ... ● Different from a relational model – in that relations are arbitrary
  • 5. Graph Algorithms Rough Classification ● Aggregation, feature extraction – Not leveraging latent relations ● Network analysis (matrix-based, single relational) – Geodesic (radius, diameter etc.) – Spectral (eigenvector-based, centrality) ● Algorithmic/node-based algorithms – Recommender systems, belief/label propagation – Traversal, path detection, interaction networks, etc.
  • 6. Iterative Vertex-based Graph Algorithms ● Iteratively: – Compute local function of a vertex that depends on the vertex state and local graph structure (neighbourhood) – and/or Modify local state – and/or Modify local topology – pass messages to neighbouring nodes ● -> “vertex-based computation” Amorphous Data-Parallelism [ADP] operator formulation: “repeated application of neighbourhood operators in a specific order”
  • 7. Recent applications/developments ● Google work on graph-based YouTube recommendations: – Leveraging latent information – Diffusing interest in sparsely labeled video clips ● User profiling, sentiment analysis – Facebook likes, Hunch, Gravity, MusicMetric ...
  • 8. Single Source Shortest Path [Figure: a directed graph labelled with positive integers, with the graph structure and work split into two partitions (P1, P2), next to a time-space view that shows workload and communication between partitions.] Turquoise rectangles show the computational work load for a partition (work).
  • 9. Single Source Shortest Path [Figure: work and comm phases of the first superstep; the source vertex (distance 0) sends signals 0+6, 0+1 and 0+9.] Active vertices are in turquoise. Signals being passed along relations are in light green. Thick green lines show costly inter-partition communications.
  • 10. Single Source Shortest Path [Figure: work, comm and barrier phases of the first superstep, with signals 0+6, 0+1 and 0+9.] The vertical grey line is a barrier synchronisation to avoid race conditions.
  • 11. Single Source Shortest Path [Figure: work phase of the second superstep.] Work, comm and barrier form a BSP superstep. Vertices become active upon receiving a signal in the previous superstep.
  • 12. Single Source Shortest Path [Figure: comm phase of the second superstep, with signals 1+3, 6+2 and 1+1.] After performing local computation, vertices send signals to their neighbouring vertices.
  • 13. Single Source Shortest Path [Figure: barrier closing the second superstep, after signals 1+3, 6+2 and 1+1.]
  • 14. Single Source Shortest Path [Figure: work phase of the third superstep.]
  • 15. Single Source Shortest Path [Figure: comm phase of the third superstep, with signal 4+2.]
  • 16. Single Source Shortest Path [Figure: barrier closing the third superstep, after signal 4+2.]
  • 17. Single Source Shortest Path [Figure: work phase of the fourth superstep.]
  • 18. Single Source Shortest Path [Figure: work, comm and barrier phases of the fourth superstep.] Computation ends when there are no active vertices left.
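The superstep walkthrough above can be replayed in a few lines of Python. This is only a minimal single-process sketch, not the deck's implementation: the function name `bsp_sssp`, the `inbox`/`outbox` dictionaries and the five-vertex `EDGES` graph are assumptions made for illustration (the graph is hypothetical and is not a reconstruction of the one in the figures).

```python
import math
from collections import defaultdict


def bsp_sssp(edges, source):
    """Run SSSP as a sequence of BSP supersteps.

    edges: dict mapping a vertex to a list of (neighbour, weight) pairs.
    Returns (distances, number_of_supersteps).
    """
    dist = defaultdict(lambda: math.inf)
    inbox = {source: [0]}            # signals delivered at the previous barrier
    supersteps = 0
    while inbox:                     # a vertex is active iff it received a signal
        outbox = defaultdict(list)
        for vertex, signals in inbox.items():
            best = min(signals)                      # work: local computation
            if best < dist[vertex]:
                dist[vertex] = best
                for nbr, weight in edges.get(vertex, []):
                    outbox[nbr].append(best + weight)  # comm: signal neighbours
        inbox = dict(outbox)         # barrier: signals become visible only now
        supersteps += 1
    return dict(dist), supersteps


# Hypothetical directed graph with positive integer weights (an assumption).
EDGES = {
    "a": [("b", 6), ("c", 1), ("d", 9)],
    "b": [("d", 2)],
    "c": [("b", 3), ("e", 1)],
}

if __name__ == "__main__":
    print(bsp_sssp(EDGES, "a"))
```

Because signals are swapped in only at the barrier, no vertex ever reads a distance that another vertex is updating in the same superstep, which is the race the barrier is there to avoid.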
  • 19. Bulk Synchronous Parallel [Table: supersteps 0, 1, 2, 3, ... running across partitions P1, P2, ..., Pn; superstep n has work wn, communication hn and barrier ln.] Superstep n cost = wn + hn + ln, where wn is the time to finish work on the slowest partition, hn is the cost of bulk communication and ln is the barrier synchronization time.
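As a small worked example of the cost model above, the sketch below adds up per-superstep costs. The function names and the sample numbers are assumptions for illustration; the only thing taken from the slide is the formula wn + hn + ln with wn being the time of the slowest partition.

```python
def superstep_cost(work_per_partition, comm_cost, barrier_cost):
    """Cost of one superstep: slowest partition + bulk communication + barrier."""
    return max(work_per_partition) + comm_cost + barrier_cost


def total_cost(supersteps):
    """Sum the cost of a run given (work_per_partition, h, l) per superstep."""
    return sum(superstep_cost(w, h, l) for w, h, l in supersteps)


if __name__ == "__main__":
    # Two partitions, two supersteps, arbitrary time units (made-up numbers).
    run = [([3, 5], 2, 1), ([4, 1], 1, 1)]
    print(total_cost(run))  # (5 + 2 + 1) + (4 + 1 + 1)
```

The `max` over partitions is why unbalanced partitions hurt: every partition waits at the barrier for the slowest one.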
  • 20. Bulk Synchronous Parallel ● Advantages – Simple and portable execution model – Clear cost model – No concurrency control, no data races, deadlocks, etc. ● Disadvantages – Coarse grained ● Depends on a large “parallel slack” – Requires well-partitioned problem space for efficiency (well balanced partitions). More in [BSP] “A bridging model for parallel computation”
  • 21. Bulk Synchronous Parallel - extensions ● Combiners – minimizing inter-node communication (h factor) ● Aggregators – Computing global state (e.g. map/reduce) And other extensions...
  • 22. Sample code
        public void superStep() {
            int minDist = this.isStartingElement() ? 0 : Integer.MAX_VALUE;
            // Choose min. proposed distance
            for (DistanceMessage msg : messages()) {
                minDist = Math.min(minDist, msg.getDistance());
            }
            // If improves the path, store and propagate
            if (minDist < this.getCurrentDistance()) {
                this.setCurrentDistance(minDist);
                IVertex v = this.getElement();
                for (IEdge r : v.getOutgoingEdges(DemoRelationshipTypes.KNOWS)) {
                    IElement recipient = r.getOtherElement(v);
                    int rDist = this.getLengthOf(r);
                    this.sendMessage(new DistanceMessage(minDist + rDist, recipient.getId()));
                }
            }
        }
  • 23. SSSP - Map-Reduce Naive ● Idea [DPMR]: – In map phase: ● emit both signals and local vertex structure and state – In reduce phase: ● gather signals and local vertex structure messages ● reconstruct vertex structure and state
  • 24. SSSP - Map-Reduce Naive
        def map(Id nId, Node N):
            //emit state and structure
            emit(nId, N.graphStateAndStruct)
            if(N.isActive)
                for(nbr : N.adjacencyL)
                    //local computation
                    dist := N.currDist + DistToNbr
                    //emit signals
                    emit(nbr.id, dist)

        def reduce(Id rId, {m1,m2,..}):
            new M; M.deActivate
            minDist = MAX_VALUE
            for(m in {m1,m2,..})
                if(m is Node) M := m            //state
                else if(m is Distance)          //signals
                    minDist = min(minDist, m)
            if(M.currDist > minDist)
                M.currDist := minDist
                M.activate
            emit(rId, M)
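To make the naive pattern concrete, here is a runnable single-process sketch in Python that simulates one map, shuffle and reduce round per iteration. It is an illustration under assumptions, not Hadoop code: the dictionary-based node records, the helper names (`make_node`, `map_node`, `reduce_node`, `run_iteration`, `sssp`) and the sample graph are all hypothetical, every vertex is assumed to have a node record, and the reducer explicitly marks a node inactive unless its distance improves.

```python
import math
from collections import defaultdict


def make_node(adj, dist=math.inf, active=False):
    """Hypothetical node record: adjacency list plus SSSP state."""
    return {"adj": adj, "dist": dist, "active": active}


def map_node(node_id, node):
    """Map: emit the node record, plus one signal per out-edge if active."""
    yield node_id, node                          # state and structure
    if node["active"]:
        for nbr, weight in node["adj"]:
            yield nbr, node["dist"] + weight     # distance signal


def reduce_node(node_id, values):
    """Reduce: rebuild the node and keep the smallest proposed distance."""
    node, min_dist = None, math.inf
    for value in values:
        if isinstance(value, dict):              # the node record
            node = dict(value, active=False)     # copy; inactive unless improved
        else:                                    # a distance signal
            min_dist = min(min_dist, value)
    if min_dist < node["dist"]:
        node["dist"], node["active"] = min_dist, True
    return node_id, node


def run_iteration(graph):
    """One MR job: map every node, group by key (shuffle), then reduce."""
    shuffled = defaultdict(list)
    for node_id, node in graph.items():
        for key, value in map_node(node_id, node):
            shuffled[key].append(value)
    return dict(reduce_node(key, values) for key, values in shuffled.items())


def sssp(graph):
    """Chain MR jobs until no vertex is active; returns (graph, iterations)."""
    iterations = 0
    while any(node["active"] for node in graph.values()):
        graph = run_iteration(graph)
        iterations += 1
    return graph, iterations


# Hypothetical input: source "a" starts active with distance 0.
GRAPH = {
    "a": make_node([("b", 6), ("c", 1), ("d", 9)], dist=0, active=True),
    "b": make_node([("d", 2)]),
    "c": make_node([("b", 3), ("e", 1)]),
    "d": make_node([]),
    "e": make_node([]),
}

if __name__ == "__main__":
    final, rounds = sssp(GRAPH)
    print({k: v["dist"] for k, v in final.items()}, rounds)
```

Note that `map_node` re-emits every node record in every iteration, even for inactive vertices, so the whole graph is shuffled each round.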
  • 25. SSSP - Map Reduce Naive - issues ● Cost associated with marshaling intermediate <k,v> pairs for combiners (which are optional) – -> in-line combiner ● Need to pass the whole graph state and structure around – -> “Shimmy trick” -- pin down the structure ● Partitions vertices without regard to graph topology – -> cluster highly connected components together
  • 26. Inline Combiners ● In job configure: – Initialize a map<NodeId, Distance>; ● In job map operation: – Do not emit interm. pairs ( emit(nbr.id, dist) ) ; – Store them in the local map; – Combine values in the same slots. ● In job close: – Emit a value from each slot in the map to a corresponding neighbour ● emit(nbr.id, map[nbr.id])
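The three steps above (configure, map, close) could look roughly like the following Python sketch. The class name `InMapperCombiningMapper`, the dictionary node records and the choice to still emit the node record from `map` are assumptions for illustration; this mimics the shape of a mapper lifecycle and does not use any real Hadoop API.

```python
import math


class InMapperCombiningMapper:
    """Buffers distance signals per target and emits only the minimum on close."""

    def configure(self):
        self.best = {}                      # map<NodeId, Distance>

    def map(self, node_id, node):
        """Return the pairs emitted right away (only the node record)."""
        if node["active"]:
            for nbr, weight in node["adj"]:
                proposed = node["dist"] + weight
                # Combine values that fall into the same slot: keep the minimum.
                if proposed < self.best.get(nbr, math.inf):
                    self.best[nbr] = proposed
        return [(node_id, node)]            # signals are held back, not emitted

    def close(self):
        """Emit one combined signal per neighbour, sorted for readability."""
        return sorted(self.best.items())


if __name__ == "__main__":
    mapper = InMapperCombiningMapper()
    mapper.configure()
    mapper.map("a", {"dist": 0, "active": True, "adj": [("b", 6), ("d", 9)]})
    mapper.map("b", {"dist": 6, "active": True, "adj": [("d", 2)]})
    print(mapper.close())
```

The trade-off is memory: the local map holds one slot per distinct neighbour seen by this mapper until `close` runs.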
  • 27. “Shimmy trick” ● Store graph structure in a file system (no shuffle) ● Inspired by a parallel merge join [Figure: two inputs, each sorted and partitioned by join key into partitions p1, p2, p3, joined partition by partition.]
  • 28. “Shimmy trick” ● Assume: – Graph G representation sorted by node ids; – G partitioned into n parts: G1, G2, .., Gn – Use the same partitioner as in MR – Set number of reducers to n ● The above gives us: – Reducer Ri receives the same intermediate keys as those in the Gi graph partition (in sorted order).
  • 29. “Shimmy trick”
        def configure( ):
            P.openGraphPartition()

        def reduce(Id rId, {m1,m2,..} ):
            repeat:
                (id nId, node N) <- P.read()
                if (nId != rId): N.deact; emit(nId, N)
            until: nId == rId
            minDist = MAX_VALUE
            for(m in {m1,m2,..}):
                minDist = min( minDist, m )
            if(N.currDist > minDist)
                N.currDist := minDist
                N.activate
            emit(rId, N)

        def close( ):
            repeat:
                (id nId, node N) <- P.read()
                N.deactivate
                emit(nId, N)
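The merge-join idea behind the pseudocode above can be sketched in Python as follows. This is a hedged illustration, not the Cloud9 code: the class name `ShimmyReducer`, the in-memory `partition` list standing in for the sorted graph file, and the `out` list standing in for emitted output are all assumptions. Unlike the slide's pseudocode, the sketch also marks the matched node inactive when its distance does not improve.

```python
import math


class ShimmyReducer:
    """Merges sorted reducer keys with a sorted graph partition read side-by-side."""

    def __init__(self, partition):
        # partition: list of (node_id, node) sorted by node_id. It stands in for
        # the graph partition file that configure() would open.
        self.records = iter(partition)
        self.out = []

    def reduce(self, r_id, distances):
        """Advance through the partition up to r_id, then apply its signals."""
        for n_id, node in self.records:
            node = dict(node, active=False)      # copy; inactive unless improved
            if n_id == r_id:
                best = min(distances, default=math.inf)
                if best < node["dist"]:
                    node["dist"], node["active"] = best, True
                self.out.append((n_id, node))
                return
            self.out.append((n_id, node))        # node that received no signals
        raise KeyError(r_id)                     # key missing from this partition

    def close(self):
        """Flush the nodes that sort after the last reducer key."""
        for n_id, node in self.records:
            self.out.append((n_id, dict(node, active=False)))
        return self.out


if __name__ == "__main__":
    part = [
        ("a", {"dist": 0, "active": True}),
        ("b", {"dist": 6, "active": True}),
        ("c", {"dist": 1, "active": True}),
        ("d", {"dist": 9, "active": True}),
        ("e", {"dist": 2, "active": False}),
    ]
    reducer = ShimmyReducer(part)
    reducer.reduce("b", [4])          # keys must arrive in sorted order
    reducer.reduce("d", [8, 11])
    print(reducer.close())
```

The sketch relies on the same assumption as the slides: reducer keys arrive in the same sorted order as the partition, so a single forward pass over the file is enough and the graph structure never goes through the shuffle.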
  • 30. “Shimmy trick” ● Improvements: – Files containing graph structure reside on dfs – Reducers arbitrarily assigned to cluster machines ● -> remote reads. ● -> change the scheduler to assign key ranges to the same machines consistently.
  • 31. Topology-aware Partitioner ● Choose a partitioner that: – minimizes inter-block traffic; – maximizes intra-block traffic; – places adjacent nodes in the same block ● Difficult to achieve, particularly with many real world datasets: – Power-law distributions – Reported that state of the art partitioners (e.g. ParMETIS) fail for such cases (???)
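One simple way to compare candidate partitioners against the goals above is to count cut edges, i.e. edges whose endpoints land in different blocks. The Python sketch below is an illustration only: the function name `edge_cut`, the toy edge list and both block assignments are made up, and edge cut is used here as a rough proxy for inter-block traffic rather than as the deck's own metric.

```python
def edge_cut(edges, block_of):
    """Count edges whose two endpoints are assigned to different blocks."""
    return sum(1 for u, v in edges if block_of[u] != block_of[v])


if __name__ == "__main__":
    # A toy path graph a-b-c-d-e (hypothetical data).
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

    # Assignment that keeps neighbours together.
    clustered = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
    # Assignment that ignores topology (alternating blocks).
    scattered = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0}

    print(edge_cut(edges, clustered), edge_cut(edges, scattered))
```

Every cut edge means a signal that crosses a partition boundary, so a lower count should translate into less inter-block communication for the same workload.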
  • 32. MR Graph Processing Design Pattern ● [DPMR] reports a 60-70% improvement over the naive implementation ● Solution closely resembles the BSP model
  • 33. BSP (inspired) implementations ● Google Pregel: – classic BSP, C++, production ● CMU GraphLab – inspired by BSP, java, multi-core – consistency models, custom schedulers ● Apache Hama – scientific computation package that runs on top of Hadoop, BSP, MS Dryad (?) ● Signal/Collect (Zurich University) – Scala, not yet distributed ● ...
  • 34. Open questions ● What problems are particularly suitable for MR and which ones for BSP – where are the boundaries? – Topology-based centrality algorithms (PageRank): ● Algebraic, matrix-based methods vs. vertex-based ones? ● When considering graph algorithms: – MR user base vs. BSP ergonomics? – Performance overheads? ● Relaxing the BSP synchronous schedule --> “Amorphous data parallelism”
  • 35. POC, Sample Code ● Project Masuria (early stages, 2011-02) – http://masuria-project.org/ – As much POC of BSP framework as it is (distributed) OSGI playground. ● Sample code: – https://github.com/tch/Cloud9 * – git@git.assembla.com:tch_sandbox.git – RunSSSPNaive.java – RunSSSPShimmy.java * * - expect (my) bugs Based on Jimmy Lin and Michael Schatz Cloud9 library
  • 36. References ● [ADP] “Amorphous Data-parallelism in Irregular Algorithms”, Keshav Pingali et al. ● [BSP] “A bridging model for parallel computation”, Leslie G. Valiant ● [DPMR] “Design Patterns for Efficient Graph Algorithms in MapReduce”, Jimmy Lin and Michael Schatz