2. Pairwise Similarity
MapReduce Framework
Proposed algorithm
• Inverted Index Construction
• Pairwise document similarity calculation
Results
3. PubMed – “More like this”
Similar blog posts
Google – Similar pages
4. Framework that supports distributed
computing on clusters of computers
Introduced by Google in 2004
Map step
Reduce step
Combine step (Optional)
Applications
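A minimal single-process sketch of the three steps, in Python. The driver `run_mapreduce`, the mapper, and the sample splits are illustrative assumptions, not part of any real framework; the point is only that the optional combine step shrinks what is shuffled without changing the result.

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer, combiner=None):
    """Run map, optional combine, shuffle, and reduce in one process."""
    shuffled = defaultdict(list)
    records_shuffled = 0
    for split in splits:
        # Map step: every record in the split yields (key, value) pairs.
        local = defaultdict(list)
        for record in split:
            for key, value in mapper(record):
                local[key].append(value)
        # Combine step (optional): pre-aggregate values on the mapper side.
        for key, values in local.items():
            if combiner is not None:
                values = [combiner(key, values)]
            shuffled[key].extend(values)
            records_shuffled += len(values)
    # Reduce step: one call per key, keys visited in sorted order.
    output = {key: reducer(key, shuffled[key]) for key in sorted(shuffled)}
    return output, records_shuffled

def length_mapper(word):
    # Hypothetical mapper: key is the word length, value is a count of 1.
    yield len(word), 1

def sum_values(key, values):
    return sum(values)

splits = [["map", "reduce", "map"], ["combine", "map"]]
plain, n_plain = run_mapreduce(splits, length_mapper, sum_values)
combined, n_combined = run_mapreduce(splits, length_mapper, sum_values,
                                     combiner=sum_values)
print(plain, n_plain, n_combined)
```

Summation is associative and commutative, which is why the same function can serve as combiner and reducer here.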
6. Consider two files:
File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop
Word count output: Hello, 2; World, 2; Bye, 1; Hadoop, 2; Goodbye, 1
8. Map output:
<Hello,1> <World,1> <Bye,1> <World,1>
<Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>
Shuffle & sort, then reduce:
<Hello,(1,1)> -> Reduce 1 -> Hello, 2
<World,(1,1)> -> Reduce 2 -> World, 2
<Bye,(1)> -> Reduce 3 -> Bye, 1
<Hadoop,(1,1)> -> Reduce 4 -> Hadoop, 2
<Goodbye,(1)> -> Reduce 5 -> Goodbye, 1
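The word count flow on this slide can be sketched in a few lines of Python. This is a single-process illustration, assuming the two input files contain "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop"; the function names are made up for the example and are not Hadoop API.

```python
from collections import defaultdict

def map_words(text):
    """Map step: emit <word, 1> for every token in the file."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group values by key. Keys stay in first-seen order to match the
    slide; a real framework would sort them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_counts(groups):
    """Reduce step: sum the list of ones collected for each word."""
    return {word: sum(ones) for word, ones in groups.items()}

files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for text in files for pair in map_words(text)]
grouped = shuffle(mapped)
counts = reduce_counts(grouped)
print(counts)
```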
9. MapReduce algorithm
• Inverted index computation
• Pairwise similarity
Scalable and efficient
10. Document 1 (A A B C) -> Map 1 -> <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
Document 2 (B D D) -> Map 2 -> <B,(d2,1)> <D,(d2,2)>
Document 3 (A B B E) -> Map 3 -> <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
11. Map output:
<A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
<B,(d2,1)> <D,(d2,2)>
<A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
Shuffle & sort, then reduce:
<A,[(d1,2),(d3,1)]> -> Reduce 1 -> <A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]> -> Reduce 2 -> <B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]> -> Reduce 3 -> <C,[(d1,1)]>
<D,[(d2,2)]> -> Reduce 4 -> <D,[(d2,2)]>
<E,[(d3,1)]> -> Reduce 5 -> <E,[(d3,1)]>
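A minimal Python sketch of the indexing phase, assuming three toy documents with the contents A A B C, B D D, and A B B E. The slides do not specify the exact algorithm, so the helper names and the use of raw term frequency as the posting weight are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    """Map: emit <term, (doc_id, tf)> for each distinct term in a document."""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def index_reduce(pairs):
    """Reduce: collect the postings of each term into one sorted list."""
    postings = defaultdict(list)
    for term, posting in pairs:
        postings[term].append(posting)
    return {term: sorted(plist) for term, plist in sorted(postings.items())}

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
emitted = [pair for doc_id, text in docs.items()
           for pair in index_map(doc_id, text)]
index = index_reduce(emitted)
print(index)
```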
12. Group by document ID, not pairs
Golomb compression for postings
Individual Postings
List of Postings
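For reference, a small Python sketch of Golomb coding applied to a postings list. The slide only names the technique; storing gaps between integer document IDs, the bit-string representation, and the parameter value m = 3 are all assumptions chosen to keep the example short.

```python
def _bits(value, width):
    """Fixed-width binary string; empty when width is 0."""
    return format(value, "b").zfill(width) if width > 0 else ""

def golomb_encode(n, m):
    """Encode a non-negative integer n with Golomb parameter m."""
    q, r = divmod(n, m)
    code = "1" * q + "0"            # quotient in unary
    b = (m - 1).bit_length()        # ceil(log2(m))
    cutoff = (1 << b) - m
    if r < cutoff:                  # truncated binary: short codeword
        return code + _bits(r, b - 1)
    return code + _bits(r + cutoff, b)

def golomb_decode(bits, m):
    """Decode one codeword from the front of bits: (value, bits used)."""
    q = pos = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                        # skip the 0 that ends the unary part
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    r = 0
    if b > 0:
        if b > 1:
            r = int(bits[pos:pos + b - 1], 2)
            pos += b - 1
        if r >= cutoff:             # long codeword: read one more bit
            r = ((r << 1) | int(bits[pos])) - cutoff
            pos += 1
    return q * m + r, pos

def encode_postings(doc_ids, m):
    """Encode ascending integer doc IDs as Golomb-coded gaps."""
    gaps = [cur - prev for prev, cur in zip([0] + doc_ids, doc_ids)]
    return "".join(golomb_encode(gap, m) for gap in gaps)

def decode_postings(bits, m):
    doc_ids, pos, current = [], 0, 0
    while pos < len(bits):
        gap, used = golomb_decode(bits[pos:], m)
        current += gap
        doc_ids.append(current)
        pos += used
    return doc_ids

encoded = encode_postings([3, 7, 8, 15], m=3)
print(encoded, decode_postings(encoded, m=3))
```

Gaps between sorted document IDs are small for frequent terms, which is what makes a code that favors small integers worthwhile.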
14. Map output:
<(d1,d3),2>
<(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
Shuffle & sort, then reduce:
<(d1,d2),[1]> -> Reduce 1 -> <(d1,d2),[1]>
<(d2,d3),[2]> -> Reduce 2 -> <(d2,d3),[2]>
<(d1,d3),[2,2]> -> Reduce 3 -> <(d1,d3),[4]>
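The pairwise similarity phase can be sketched the same way. This Python example assumes the toy inverted index from the indexing slides and takes each term's contribution to a document pair to be the product of the two term frequencies, which matches the numbers on the slide; the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def pair_map(postings):
    """Map: for one term, emit <(di, dj), tf_i * tf_j> for every pair of
    documents that share the term."""
    return [((di, dj), wi * wj)
            for (di, wi), (dj, wj) in combinations(sorted(postings), 2)]

def pair_reduce(pairs):
    """Reduce: sum the per-term contributions of each document pair."""
    sims = defaultdict(int)
    for key, contribution in pairs:
        sims[key] += contribution
    return dict(sims)

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
emitted = [pair for postings in index.values() for pair in pair_map(postings)]
similarity = pair_reduce(emitted)
print(similarity)
```

Terms that occur in a single document (C, D, E) emit nothing, so only shared terms cost any work.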
16. Tokenization
Stop word removal
Stemming
Df-cut
• The fraction of terms with the highest document
frequency is eliminated: a 99% cut removes 9,093 terms
Linear space and time complexity
• 3.7 billion pairs vs. 8.1 trillion pairs
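A small Python sketch of how a df-cut trims the index and the number of intermediate pairs. The toy index, the 80% cut, and the helper names are assumptions for illustration only; the figures on this slide come from a far larger collection.

```python
def df_cut(index, cut):
    """Keep the `cut` fraction of terms with the lowest document frequency,
    dropping the most frequent ones (cut=0.99 mirrors a 99% df-cut)."""
    by_df = sorted(index, key=lambda term: (len(index[term]), term))
    n_keep = round(len(by_df) * cut)
    return {term: index[term] for term in by_df[:n_keep]}

def intermediate_pairs(index):
    """Count document pairs emitted by the pairwise map step: a term with
    document frequency df contributes df * (df - 1) / 2 pairs."""
    return sum(len(p) * (len(p) - 1) // 2 for p in index.values())

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
trimmed = df_cut(index, cut=0.8)   # drops the single most frequent term, B
print(intermediate_pairs(index), intermediate_pairs(trimmed))
```

Because the pair count grows quadratically in document frequency, removing a few very frequent terms removes most of the intermediate pairs.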
19. Complexity: O(n^2)
A df-cut of 99 percent eliminates some meaning-bearing
terms as well as some irrelevant terms
• Meaning-bearing: Cornell, arthritis
• Irrelevant: sleek, frail
The df-cut can be relaxed to 99.9 percent
20. Exact algorithms used for inverted index
construction and pairwise document
similarity are not specified.
Df-cut: does a df-cut of 99 percent significantly
affect the quality of the results?
The results have not been evaluated.