Tamer Elsayed, Jimmy Lin, and Douglas Oard


         Niveda Krishnamoorthy
• Pairwise Similarity
• MapReduce Framework
• Proposed algorithm
  • Inverted Index Construction
  • Pairwise document similarity calculation
• Results
• PubMed – “More like this”
• Similar blog posts
• Google – Similar pages
• Framework that supports distributed computing on clusters of computers
• Introduced by Google in 2004
• Map step
• Reduce step
• Combine step (optional)
• Applications
• Consider two files:

      File 1: Hello World Bye World
      File 2: Hello Hadoop Goodbye Hadoop

      Word count output:
      Hello, 2
      World, 2
      Bye, 1
      Hadoop, 2
      Goodbye, 1
Map 1
Hello             <Hello,1>
World             <World,1>
Bye               <Bye,1>
World             <World,1>

Map 2
Hello             <Hello,1>
Hadoop            <Hadoop,1>
Goodbye           <Goodbye,1>
Hadoop            <Hadoop,1>
Map output: <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Shuffle & sort       Reducer     Output
<Hello,(1,1)>        Reduce 1    Hello, 2
<World,(1,1)>        Reduce 2    World, 2
<Bye,(1)>            Reduce 3    Bye, 1
<Hadoop,(1,1)>       Reduce 4    Hadoop, 2
<Goodbye,(1)>        Reduce 5    Goodbye, 1
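
The word count walk-through above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle & sort, and reduce stages, not Hadoop code; the function names (map_words, shuffle_and_sort, reduce_counts) are made up for this sketch.

```python
from collections import defaultdict

def map_words(text):
    """Map step: emit a <word, 1> pair for every word in one file."""
    return [(word, 1) for word in text.split()]

def shuffle_and_sort(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_counts(word, ones):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(ones)

# The two example files from the slide.
files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

intermediate = [pair for text in files for pair in map_words(text)]
grouped = shuffle_and_sort(intermediate)
word_counts = dict(reduce_counts(word, ones) for word, ones in grouped.items())
print(word_counts)
```

In a real cluster each file would go to a separate mapper and each key to a reducer; here everything runs in one process to keep the data flow visible.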
MAPREDUCE ALGORITHM (scalable and efficient)
• Inverted Index Computation
• Pairwise Similarity
Input                    Mapper   Output
Document 1: A A B C      Map 1    <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
Document 2: B D D        Map 2    <B,(d2,1)> <D,(d2,2)>
Document 3: A B B E      Map 3    <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
Map output: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)> <B,(d2,1)> <D,(d2,2)> <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Shuffle & sort                    Reducer     Output
<A,[(d1,2),(d3,1)]>               Reduce 1    <A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>        Reduce 2    <B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>                      Reduce 3    <C,[(d1,1)]>
<D,[(d2,2)]>                      Reduce 4    <D,[(d2,2)]>
<E,[(d3,1)]>                      Reduce 5    <E,[(d3,1)]>
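
The inverted index construction shown above can be sketched as follows. This is a single-process illustration, assuming term frequency is the value stored in each posting (as in the diagram); map_document and build_index are hypothetical names, and the real job would run the map and reduce parts on separate machines.

```python
from collections import Counter, defaultdict

def map_document(doc_id, terms):
    """Map step: emit <term, (doc_id, term_frequency)> for each distinct term."""
    return [(term, (doc_id, tf)) for term, tf in Counter(terms).items()]

def build_index(docs):
    """Shuffle & sort plus reduce: collect the postings of each term into one list."""
    postings = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, posting in map_document(doc_id, terms):
            postings[term].append(posting)
    return {term: sorted(plist) for term, plist in sorted(postings.items())}

# The three example documents from the slide.
docs = {
    "d1": ["A", "A", "B", "C"],
    "d2": ["B", "D", "D"],
    "d3": ["A", "B", "B", "E"],
}

index = build_index(docs)
for term, plist in index.items():
    print(term, plist)
```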
• Group by document ID, not pairs
• Golomb’s compression for postings
• Individual Postings
• List of Postings
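
The slide only names Golomb compression, so the following is a generic sketch of Golomb coding applied to a postings list, not the presenters' implementation. It assumes integer document IDs stored as gaps between neighbours, and the parameter m = 3 is chosen arbitrarily for the example; golomb_encode and gaps are hypothetical helper names.

```python
def golomb_encode(n, m):
    """Golomb-encode a non-negative integer n with parameter m >= 1 as a bit string."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"              # quotient in unary, terminated by a 0
    b = (m - 1).bit_length()          # number of bits needed for a remainder
    cutoff = (1 << b) - m             # remainders below this use one bit less
    if r < cutoff:
        width, value = b - 1, r
    else:
        width, value = b, r + cutoff
    if width > 0:                     # m == 1 needs no remainder bits
        bits += format(value, "0{}b".format(width))
    return bits

def gaps(doc_ids):
    """Turn a sorted list of integer document IDs into gaps between neighbours."""
    previous = 0
    result = []
    for doc_id in doc_ids:
        result.append(doc_id - previous)
        previous = doc_id
    return result

doc_ids = [3, 7, 8, 15]               # made-up postings for one term
encoded = [golomb_encode(gap, 3) for gap in gaps(doc_ids)]
print(gaps(doc_ids), encoded)
```

Small gaps get short codes, which is why gap encoding pays off for terms that occur in many documents.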
Postings list                     Mapper    Output
<A,[(d1,2),(d3,1)]>               Map 1     <(d1,d3),2>
<B,[(d1,1),(d2,1),(d3,2)]>        Map 2     <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
<C,[(d1,1)]>
<D,[(d2,2)]>
<E,[(d3,1)]>

Terms that occur in only one document (C, D, E) produce no pairs.
Map output: <(d1,d3),2> <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>

Shuffle & sort        Reducer     Output
<(d1,d2),[1]>         Reduce 1    <(d1,d2),[1]>
<(d2,d3),[2]>         Reduce 2    <(d2,d3),[2]>
<(d1,d3),[2,2]>       Reduce 3    <(d1,d3),[4]>
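
The pairwise similarity step above can be sketched like this. It assumes the per-term contribution is the product of the two term weights in a postings list, which is what the numbers in the diagram correspond to; raw term frequencies stand in for the weights, and map_postings and pairwise_similarity are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

# Inverted index from the previous step (term -> list of (doc_id, weight)).
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

def map_postings(postings):
    """Map step: for every pair of documents sharing a term, emit the weight product."""
    return [
        ((d_i, d_j), w_i * w_j)
        for (d_i, w_i), (d_j, w_j) in combinations(sorted(postings), 2)
    ]

def pairwise_similarity(index):
    """Shuffle & sort plus reduce: sum the partial products of each document pair."""
    partial = defaultdict(list)
    for postings in index.values():
        for pair, product in map_postings(postings):
            partial[pair].append(product)
    return {pair: sum(products) for pair, products in sorted(partial.items())}

similarities = pairwise_similarity(index)
print(similarities)
```

A postings list of length k emits k(k-1)/2 pairs, so the most frequent terms dominate the intermediate output.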
• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents
• Tokenization
• Stop word removal
• Stemming
• Df-cut
  • Fraction of terms with highest document frequency is eliminated – 99% cut (9093)

            Linear space and time complexity

  • 3.7 billion pairs vs. 8.1 trillion pairs
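
A df-cut can be sketched as a filter on the inverted index before the pairwise step. This is a toy illustration under stated assumptions: keep_fraction is a hypothetical parameter (0.99 would correspond to a 99% cut), ties in document frequency are broken alphabetically, and the five-term index is the example from the earlier slides rather than real data.

```python
def apply_df_cut(index, keep_fraction):
    """Keep the keep_fraction of terms with the lowest document frequency."""
    by_df = sorted(index, key=lambda term: (len(index[term]), term))
    keep = int(len(by_df) * keep_fraction)
    return {term: index[term] for term in by_df[:keep]}

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

# With only five terms, an 80% cut drops the single most frequent term (B).
pruned = apply_df_cut(index, 0.8)
print(sorted(pruned))
```

Dropping B removes the longest postings list, so the mapper no longer emits its three document pairs; on a real collection the same effect is what shrinks the intermediate output.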
• Complexity: O(n²)
• Df-cut of 99 percent eliminates meaning-bearing terms and some irrelevant terms
  • Cornell, arthritis
  • sleek, frail
• Df-cut can be relaxed to 99.9 percent
• Exact algorithms used for inverted index construction and pairwise document similarity are not specified.
• Df-cut: does a df-cut of 99 percent affect the quality of the results significantly?
• The results have not been evaluated.
Pairwise document similarity in large collections with map reduce
