Building Content-Based
Publish/Subscribe Systems
with Distributed Hash Tables



 David Tam, Reza Azimi, Hans-Arno Jacobsen
          University of Toronto, Canada
                 September 8, 2003
Introduction:
Publish/Subscribe Systems

  Push model (a.k.a. event notification)
      subscribe, publish, match
  Applications:
      stock market, auctions (e.g. eBay), news
  Two types:
      topic-based: ≈ Usenet newsgroup topics
      content-based: attribute-value pairs,
         e.g. (attr1 = value1) ∧ (attr2 = value2) ∧ (attr3 > value3)
         (a small matching sketch follows)
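Content-based matching can be pictured as evaluating a conjunction of per-attribute predicates against a publication's attribute-value pairs. A minimal sketch, with made-up attribute names (a stock-quote flavour) rather than anything taken from the slides:

```python
# Illustrative sketch only (attribute names and values are invented): a
# content-based subscription as a conjunction of attribute predicates.
from operator import eq, gt

# Each predicate: attribute -> (comparison operator, reference value)
subscription = {
    "symbol":   (eq, "IBM"),    # attr1 = value1
    "exchange": (eq, "NYSE"),   # attr2 = value2
    "price":    (gt, 80.0),     # attr3 > value3
}

publication = {"symbol": "IBM", "exchange": "NYSE", "price": 84.5, "volume": 1000}

def matches(pub, sub):
    """A publication matches if every predicate in the subscription holds."""
    return all(attr in pub and op(pub[attr], ref)
               for attr, (op, ref) in sub.items())

print(matches(publication, subscription))  # True
```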
The Problem:
Content-Based Publish/Subscribe

  Traditionally centralized
      scalability?
  More recently: distributed
      e.g. SIENA
         small set of brokers
         not P2P
  How about fully distributed?
      exploit P2P
      1000s of brokers
Proposed Solution:
Use Distributed Hash Tables
  DHTs
      hash buckets are mapped to P2P nodes
  Why DHTs?
      scalability, fault tolerance, load balancing
  Challenges
      subscribing, publishing, and matching must be distributed
      yet co-ordinated and light-weight; matching is the hard part
      (a minimal DHT-mapping sketch follows)
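As a rough picture of what "hash buckets are mapped to P2P nodes" means, here is a minimal sketch that assigns each key to the node whose ID is numerically closest, loosely in the spirit of Pastry; the node naming and closeness rule are assumptions for illustration, not the DHT's actual interface.

```python
# Minimal sketch (assumed node naming and closeness rule, not Pastry's API):
# each hash key is owned by the node whose ID is numerically closest to it.
import hashlib

def sha1_int(text: str) -> int:
    return int(hashlib.sha1(text.encode()).hexdigest(), 16)

NODE_IDS = sorted(sha1_int(f"node-{i}") for i in range(1000))  # 1000s of brokers

def home_node(key: int) -> int:
    """Return the ID of the node responsible for (closest to) this key."""
    return min(NODE_IDS, key=lambda nid: abs(nid - key))

print(hex(home_node(sha1_int("attr1=value1"))))
```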
Basic Scheme
   A matching publisher & subscriber must come up with
   the same hash keys based on the content

   [Figure, shown over several animation steps: the buckets of a Distributed
   Hash Table are spread across the distributed publish/subscribe system.
   The subscriber hashes its subscription and stores it at the key's home
   node; the publisher hashes the matching content to the same key, so its
   publication reaches the same home node, which matches it against the
   stored subscription and forwards the publication to the subscriber.]
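The whole scheme rests on the publisher and the subscriber deriving identical keys from the same content, so both requests land on the same home node. A minimal sketch of that invariant (the key format is an assumption):

```python
# Minimal sketch: both sides hash the same attribute-value content with the
# same function, so subscription and publication reach the same home node.
import hashlib

def content_key(attr: str, value: str) -> int:
    return int(hashlib.sha1(f"{attr}={value}".encode()).hexdigest(), 16)

sub_key = content_key("symbol", "IBM")   # computed at the subscriber
pub_key = content_key("symbol", "IBM")   # computed at the publisher
assert sub_key == pub_key                # same key -> same home node,
                                         # which matches and notifies
```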
Naïve Approach

   [Figure: a publication carrying Attr1–Attr7, all with values, and a
   subscription specifying only some of them (Attr1, Attr4, Attr6); each
   side feeds its attribute-value pairs through the hash function to
   produce keys.]

   Publisher must produce keys for all possible attribute combinations:
        2^N keys for each publication
   Bottleneck at hash bucket node
        subscribing, publishing, matching
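To see where the 2^N comes from: since a subscriber may specify any subset of attributes, the naive publisher must emit a key for every non-empty attribute combination. A small sketch of that enumeration, with N = 7 attributes mirroring the slide's figure:

```python
# Minimal sketch of the naive blow-up: one key per non-empty attribute subset.
from itertools import combinations

publication = {f"attr{i}": f"value{i}" for i in range(1, 8)}  # N = 7

def all_subset_keys(pub):
    attrs = sorted(pub)
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            yield tuple((a, pub[a]) for a in subset)  # hashed in the real system

print(sum(1 for _ in all_subset_keys(publication)))  # 127 == 2**7 - 1
```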
Our Approach
  Domain Schema
     eliminates the 2^N problem
     similar to an RDBMS schema
     set of attribute names
     set of value constraints
     set of indices
        create hash keys for indices only
        choose groups of attributes that are common,
        but whose value combinations are rare
     well-known to all publishers and subscribers
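A domain schema could be written down as simply as the structure below; the field names and the example value constraint are illustrative, not the paper's notation.

```python
# Illustrative sketch of a domain schema: attribute names, value constraints,
# and the indices (attribute groups) for which hash keys are composed.
DOMAIN_SCHEMA = {
    "attributes": [f"attr{i}" for i in range(1, 8)],
    "constraints": {"attr3": ("int", 0, 1_000_000)},  # assumed example constraint
    # Indices: attribute groups that are common across events but whose
    # value combinations are rare, keeping each hash bucket small.
    "indices": [("attr1",), ("attr1", "attr4"), ("attr6", "attr7")],
}
```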
Hash Key Composition
    Indices: {attr1}, {attr1, attr4}, {attr6, attr7}

    [Figure: the publication (Attr1–Attr7, all valued) is hashed once per
    index, producing Key1, Key2, and Key3; the subscription (Attr1, Attr4,
    Attr6 specified) produces Key1 and Key2 for the indices it covers.]

    Possible false positives
         because a key match covers only the index attributes (partial matching)
         filtered out by the system
    Possible duplicate notifications
         because one subscription may register multiple keys
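A sketch of how keys might be composed from the indices above; the concatenation format and the rule that a subscription keys every index it fully covers are assumptions made for illustration.

```python
# Minimal sketch: a publication yields one key per index; a subscription
# yields a key for every index whose attributes it constrains.
import hashlib

INDICES = [("attr1",), ("attr1", "attr4"), ("attr6", "attr7")]

def index_key(index, values):
    material = "|".join(f"{a}={values[a]}" for a in index)
    return hashlib.sha1(material.encode()).hexdigest()[:8]

def publication_keys(pub):
    return [index_key(idx, pub) for idx in INDICES]           # Key1, Key2, Key3

def subscription_keys(sub):
    return [index_key(idx, sub) for idx in INDICES
            if all(a in sub for a in idx)]                    # covered indices only

pub = {f"attr{i}": f"value{i}" for i in range(1, 8)}
sub = {"attr1": "value1", "attr4": "value4", "attr6": "value6"}
print(publication_keys(pub))   # 3 keys
print(subscription_keys(sub))  # 2 keys -> one event may notify twice (duplicates)
```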
Our Approach (cont’d)
  Multicast Trees
      eliminate the bottleneck at hash bucket nodes
      distributed subscribing, publishing, matching

      [Figure, shown over several animation steps: a multicast tree rooted
      at the home node (hash bucket node). Existing subscribers form the
      tree; new subscribers attach to the nearest tree node, possibly
      through non-subscriber nodes that only forward traffic.]
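A toy sketch of the tree mechanics: the join rule below mimics the Scribe idea of stopping at the first node already on the tree, but the data structures and function names are invented for illustration, not Scribe's real API.

```python
# Minimal sketch: the subscribers of one hash key form a multicast tree rooted
# at that key's home node. Joins stop at the first on-tree node, and
# publications fan out down the tree, so the root is not a bottleneck.
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.on_tree = False

def join(path_to_root):
    """path_to_root: new subscriber first, home node (root) last."""
    for child, parent in zip(path_to_root, path_to_root[1:]):
        child.on_tree = True
        if child not in parent.children:
            parent.children.append(child)
        if parent.on_tree:        # spliced into the existing tree; stop here
            return
        parent.on_tree = True     # non-subscriber forwarder joins the tree

def deliver(node, publication):
    """The home node pushes a matching publication down the tree."""
    print(f"{node.name} receives {publication}")
    for child in node.children:
        deliver(child, publication)

root, fwd, sub1 = Node("home"), Node("forwarder"), Node("sub1")
root.on_tree = True
join([sub1, fwd, root])   # sub1 attaches via a non-subscriber forwarder
deliver(root, "event")
```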
Handling Range Queries
  Hash function ruins locality

      [Figure: adjacent input intervals 1, 2, 3, 4 hash to scattered
      outputs such as 40, 90, 120, 170.]

  Divide the range of values into intervals
      hash on interval labels
          e.g. RAM attribute: intervals [0, 128), [128, 256), [256, 512), [512, ∞)
               labelled A, B, C, D
          For RAM > 384, submit hash keys:
             RAM = C
             RAM = D
      intervals can be sized according to the value's probability distribution
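A minimal sketch of the interval-label rewrite, using the RAM boundaries from the slide; the half-open-interval convention is an assumption.

```python
# Minimal sketch: rewrite a range predicate into the interval labels it
# overlaps and submit one hash key per label (boundaries from the RAM example).
BOUNDS = [0, 128, 256, 512, float("inf")]
LABELS = ["A", "B", "C", "D"]

def labels_for_range(lo, hi=float("inf")):
    """Labels of intervals [left, right) overlapping the query range (lo, hi)."""
    return [lab for lab, left, right in zip(LABELS, BOUNDS, BOUNDS[1:])
            if left < hi and right > lo]

print(labels_for_range(384))  # ['C', 'D'] -> submit keys RAM=C and RAM=D
```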
Implementation & Evaluation
  Main Goal: scalability
  Metric: message traffic
  Built using:
      Pastry DHT
      Scribe multicast trees
  Workload Generator: uniformly random distributions
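The slides do not give the generator's parameters; as a hedge, here is one plausible shape of a uniformly random workload, with made-up attribute domains and subscription sizes.

```python
# Illustrative sketch only: publications drawn uniformly from each attribute's
# domain, subscriptions constraining a random subset of attributes.
import random

DOMAINS = {f"attr{i}": range(100) for i in range(1, 8)}  # assumed value ranges

def random_publication():
    return {a: random.choice(vals) for a, vals in DOMAINS.items()}

def random_subscription(num_attrs=3):
    chosen = random.sample(sorted(DOMAINS), num_attrs)
    return {a: random.choice(DOMAINS[a]) for a in chosen}

print(random_publication())
print(random_subscription())
```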
Event Scalability: 1000 nodes

   [Results chart omitted.]

   Need a well-designed schema with low false positives
Node Scalability: 40,000 subs & pubs

   [Results chart omitted.]
Range Query Scalability: 1000 nodes

   [Results chart omitted.]

  Multicast tree benefits
       e.g. 1 range attribute vs. 0 range attributes, at 40,000 subs & pubs:
           expected 2.33× the messages, but measured only 1.6×
       subscription costs decrease
Range Query Scalability: 40,000 subs & pubs

   [Results chart omitted.]
Conclusion
  Method: DHT + domain schema
  Scales to 1000s of nodes
  Multicast trees are important
  Interesting point in the design space
      some restrictions on how content is expressed
          must adhere to the domain schema



Future Work
  range query techniques
  examine multicast tree in detail
  locality-sensitive workload distributions
  real-world workloads
  detailed modelling of P2P network
  fault-tolerance
