Generating Links by Mining Quotations

OKAN KOLAK AND BILL N. SCHILIT

PRESENTATION BY DUSTIN SMITH
THE UNIVERSITY OF TEXAS AT AUSTIN
SCHOOL OF INFORMATION

Outline

• Introduction
• Challenges
• Algorithm
  • Phase 1: Generating the Shingle Table
  • Phase 2: Extracting Shared Sequences
  • Phase 3: Sequence Grouping
  • Filtering and Ranking
• User Interface
• Evaluation

INF384H                                     10/24/2011
Introduction

• What is the goal and why?
  • Engaging user interface in Google Books
  • Richer hypertext for scanned books
  • Achieving these goals at scale for large sets of books
    • Via MapReduce

Challenges

• Mining quality quotations from millions of books in a
  scalable and efficient manner.
• Filtering out misleading quotations and ranking the
  good quotations by quality.
• Incorporating the proposed link structure online in a
  clear and effective way for users.

Algorithm: Phase 1

• Generation of shingle tables
  • Pass text through the shingler
  • Text is parsed, normalized, and output as a stream of
    overlapping shingles
  • Generate a shingle table

Algorithm: Phase 1 (cont)

• Each book is passed through the shingler.
• A shingle is a sequence of k consecutive words of text.
• Ex.
  • The 2-shingles for the text “a lucky dog” are “a lucky” and
    “lucky dog”.

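The shingling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' shingler; the function name `shingles` and the whitespace tokenization are assumptions made for the example.

```python
def shingles(tokens, k):
    """Return every run of k consecutive tokens, in order of appearance."""
    # A text shorter than k tokens yields no shingles at all.
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


print(shingles("a lucky dog".split(), 2))
# [('a', 'lucky'), ('lucky', 'dog')]
```

Consecutive shingles overlap by k - 1 words, which is what later allows matching shingles to be stitched back into longer shared passages.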
Algorithm: Phase 1 (cont)

• Prior to shingling, the text is parsed and normalized.
• Possible normalizations:
  • Lowercasing
  • Removing punctuation and accents
  • Stemming
  • Removing stop-words
  • Collapsing numbers to single tokens

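A rough sketch of such a normalizer, using only the Python standard library. Which normalizations the authors actually apply is not stated on the slide, so the choices here are assumptions: the `STOP_WORDS` set is a tiny illustrative list, the `<num>` placeholder is a made-up token, and stemming is left out because it needs a third-party stemmer.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative list, an assumption


def normalize(text, drop_stop_words=False):
    """Lowercase, strip accents and punctuation, and collapse numbers."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces so neighbouring words stay separate.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    # Collapse every all-digit token into a single placeholder token.
    tokens = ["<num>" if tok.isdigit() else tok for tok in tokens]
    if drop_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return tokens


print(normalize("Café, in 1999!"))
# ['cafe', 'in', '<num>']
```

Heavier normalization makes matching more tolerant of OCR and edition differences, at the cost of more accidental matches.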
Algorithm: Phase 1 (cont)

• Shingle Tables

          Key              Shingle info   Shingle info
          Shingle key(1)   <B,i>          <B,i>
          Shingle key(2)   <B,i>          <B,i>

• Shingle key: a unique shingle footprint
• B: ID of the book in which the shingle occurs
• i: index of the shingle within book B

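An in-memory version of the shingle table might look like the sketch below. It assumes books are already tokenized, uses the joined shingle text as a stand-in for the hashed footprint, and keeps everything in one dictionary, whereas the real table is built by a large sort over millions of books. The name `build_shingle_table` is hypothetical.

```python
from collections import defaultdict


def build_shingle_table(books, k):
    """Map each shingle key to the list of <B, i> pairs where it occurs."""
    table = defaultdict(list)
    for book_id, tokens in books.items():
        for i in range(len(tokens) - k + 1):
            key = " ".join(tokens[i:i + k])  # stand-in for a hashed footprint
            table[key].append((book_id, i))
    return table


books = {
    "B1": "all human beings are born free".split(),
    "B2": "that all human beings are equal".split(),
}
table = build_shingle_table(books, k=3)
print(table["all human beings"])
# [('B1', 0), ('B2', 1)]
```

Each row of the table (a "bucket") lists every place a given shingle appears, so shingles shared between books are exactly the buckets that mention more than one book ID.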
Algorithm: Phase 1 (cont)

• Shingle Tables
  • Building the table requires a single linear pass and a very large sorting phase
  • The authors observe that quotes shorter than 8 words are not significant
    quotations, so they set the shingle length to 8 words.

Algorithm: Phase 2

• Involves extracting shingles that are shared between
  books
• Books are processed one at a time
  • Current book = “Source book”
  • All other books = “Target books”

Algorithm: Phase 2 (cont)

• Process for a single book:
  • Generate a list of shingles in the order that they appear
  • Take each shingle and use the shingle table to find all other
    occurrences

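The per-book process can be sketched as follows. This is not the pseudo-code from the paper; it is one simple way to walk a source book's shingles in order, look each one up in the table, and merge consecutive matches in the same target book into a single shared sequence. The helper names and the output tuple layout are assumptions.

```python
from collections import defaultdict


def build_table(books, k):
    # Shingle text -> list of (book_id, index); the text stands in for a hashed key.
    table = defaultdict(list)
    for book_id, tokens in books.items():
        for i in range(len(tokens) - k + 1):
            table[" ".join(tokens[i:i + k])].append((book_id, i))
    return table


def shared_sequences(source_id, books, table, k):
    """Return (source_start, target_id, target_start, n_shingles) per maximal run."""
    tokens = books[source_id]
    open_runs = {}  # (target_id, last matched target index) -> (src_start, tgt_start, n)
    results = []
    for i in range(len(tokens) - k + 1):
        next_runs = {}
        for target_id, j in table.get(" ".join(tokens[i:i + k]), []):
            if target_id == source_id:
                continue
            # Extend the run that ended at target shingle j - 1, or start a new one.
            s, t, n = open_runs.pop((target_id, j - 1), (i, j, 0))
            next_runs[(target_id, j)] = (s, t, n + 1)
        # Runs that this shingle did not extend are complete.
        results.extend((s, tid, t, n) for (tid, _), (s, t, n) in open_runs.items())
        open_runs = next_runs
    results.extend((s, tid, t, n) for (tid, _), (s, t, n) in open_runs.items())
    return results


books = {
    "B1": "all human beings are born free".split(),
    "B2": "that all human beings are equal".split(),
}
table = build_table(books, k=3)
print(shared_sequences("B1", books, table, k=3))
# [(0, 'B2', 1, 2)]
```

In the example, two consecutive 3-word shingles match, so the shared passage spans four words ("all human beings are"), starting at word 0 in B1 and word 1 in B2.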
Algorithm: Phase 2 (cont)

• Pseudo-code for Phase 2 (Figure 1):

Algorithm: Phase 2 (cont)

• MapReduce adaptation:
  • Mapper:
    • Takes the shingle table as input
    • Uses the equivalent method for looking up all shingle buckets for
      a given book’s shingles
    • Emits (source book ID, relevant shingle bucket)
  • Reducer:
    • Input: (source book ID, list of relevant shingle buckets)
    • Uses the algorithm from the previous slide (Figure 1) with a few
      modifications

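The data flow of the MapReduce adaptation can be simulated in plain Python. This is only a sketch of the shape of the job: the `run` function plays the role of the framework's shuffle, and the reducer merely counts shared shingles per target book as a stand-in for the actual sequence extraction. All function names are hypothetical.

```python
from collections import defaultdict


def mapper(shingle_key, bucket):
    """Emit (source book ID, bucket) once for every book in the bucket."""
    book_ids = sorted({book_id for book_id, _ in bucket})
    if len(book_ids) < 2:
        return  # the shingle occurs in one book only, so nothing is shared
    for book_id in book_ids:
        yield book_id, (shingle_key, bucket)


def reducer(source_id, buckets):
    """Count shared shingles per target book (placeholder for sequence extraction)."""
    counts = defaultdict(int)
    for _, bucket in buckets:
        for target_id in {b for b, _ in bucket if b != source_id}:
            counts[target_id] += 1
    return source_id, dict(counts)


def run(table):
    grouped = defaultdict(list)  # shuffle step: group mapper output by source book ID
    for key, bucket in table.items():
        for source_id, value in mapper(key, bucket):
            grouped[source_id].append(value)
    return dict(reducer(s, values) for s, values in sorted(grouped.items()))


table = {
    "all human beings": [("B1", 0), ("B2", 1)],
    "human beings are": [("B1", 1), ("B2", 2)],
    "beings are born": [("B1", 2)],
}
print(run(table))
# {'B1': {'B2': 2}, 'B2': {'B1': 2}}
```

Keying the mapper output by source book ID means each reducer sees every bucket relevant to one book, which is what lets the single-book algorithm run unchanged in spirit.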
Algorithm: Phase 2 (cont)

• One notable issue:
  • Common shingles that are shared by many books greatly
    increase overhead.
  • These are often insignificant quotes and should be discarded.

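One simple way to discard overly common shingles is to drop any bucket that spans more than a fixed number of distinct books. The slide does not say how the cutoff is chosen, so `max_books` below is a hypothetical parameter.

```python
def drop_common_shingles(table, max_books):
    """Keep only buckets whose shingle occurs in at most max_books distinct books."""
    return {
        key: bucket
        for key, bucket in table.items()
        if len({book_id for book_id, _ in bucket}) <= max_books
    }


table = {
    "all rights reserved": [("B1", 0), ("B2", 3), ("B3", 9)],
    "born free and equal": [("B1", 1), ("B2", 4)],
}
print(drop_common_shingles(table, max_books=2))
# {'born free and equal': [('B1', 1), ('B2', 4)]}
```

Because the cost of Phase 2 grows with bucket size, pruning the largest buckets removes most of the overhead while losing few interesting quotations.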
Algorithm: Phase 3

• Sequence Grouping:
• Why?

Algorithm: Phase 3 (cont)

• Sequence Grouping:
• How does it work?

Filtering and Ranking

• Some shared passages turn out to be copyright sentences,
  legal boilerplate, publisher addresses, bibliography
  citations, or titles of other books
  by the author or publisher
  • These are not desirable or quality quotations.
  • They need to be filtered out

Filtering and Ranking (cont)

• Filtering:
  • Quotations on “low content” pages
  • Unusual characteristic filtering
    • Too many digits or special characters, repeated tokens, etc.
  • Book edition filtering

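The "unusual characteristic" filter could be approximated as below. The two ratios and their thresholds (`max_non_alpha`, `max_repeat`) are invented for illustration; the slide only names the kinds of signals used, not how they are measured.

```python
def looks_unusual(text, max_non_alpha=0.3, max_repeat=0.5):
    """Flag text with too many digits/special characters or heavily repeated tokens."""
    chars = [ch for ch in text if not ch.isspace()]
    tokens = text.lower().split()
    if not chars or not tokens:
        return True  # empty text cannot be a useful quotation
    # Share of non-letter characters (digits, punctuation, symbols).
    non_alpha = sum(1 for ch in chars if not ch.isalpha()) / len(chars)
    # Share of tokens that repeat an earlier token.
    repeated = 1 - len(set(tokens)) / len(tokens)
    return non_alpha > max_non_alpha or repeated > max_repeat


print(looks_unusual("All human beings are born free and equal in dignity and rights"))
# False
print(looks_unusual("ISBN 0-13-110362-8 pp. 12-34, 56-78"))
# True
```

Cheap surface checks like these catch bibliographic and tabular debris without needing to understand the passage itself.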
Filtering and Ranking (cont)

• Ranking:
  • Some quotes are more interesting than others, e.g.:
    “The unemployment rate is the percentage of the
    labor force that is unemployed” vs. “All human
    beings are born free and equal in dignity and
    rights…”
  • This is difficult to distinguish automatically

Filtering and Ranking (cont)

• Scoring method for ranking
  • Passages that are too short or too long receive low scores
  • The optimal length lies in the middle ground; a
    piecewise function is used to represent this scoring.
  • What defines “too short” and “too long” is
    determined by “experimental tuning”
  • The same scoring method is used for frequency

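A piecewise score of this kind can be illustrated with a trapezoid: zero at the extremes, one across the preferred range, and linear in between. The exact shape and breakpoints used by the authors are not given on the slide, so the function and the numbers in the example are assumptions.

```python
def piecewise_score(x, low, best_low, best_high, high):
    """Trapezoid score: 0 outside (low, high), 1 on [best_low, best_high], linear between.

    Assumes low < best_low <= best_high < high.
    """
    if x <= low or x >= high:
        return 0.0
    if x < best_low:
        return (x - low) / (best_low - low)
    if x <= best_high:
        return 1.0
    return (high - x) / (high - best_high)


# Hypothetical breakpoints for passage length in words.
for length in (5, 14, 40, 130, 300):
    print(length, piecewise_score(length, low=8, best_low=20, best_high=60, high=200))
# 5 0.0
# 14 0.5
# 40 1.0
# 130 0.5
# 300 0.0
```

The same function can be reused for frequency by passing the number of books that contain the passage as `x`, with its own tuned breakpoints.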
User Interface

• How to present this concept of general links between
  books?
• “Popular Passages” not “Quotations”
• Display issues:
  • Long quotes containing shorter, more familiar quotes
  • Quote order variations
• Skyline vectors are used to address these issues and
  do so effectively.
  • Basically the “best” quotes are chosen for presentation to the
    user

User Interface (cont)

• Navigation within books
  • Goals:
    • Provide a general feel for the book
    • Provide an interface in which the user can quickly navigate to
      important passages within the book

User Interface (cont)

• Navigation between books

Evaluation

• Manual labeling to determine accuracy
• User study (passive) over a 30-day period
• Analysis of the distribution of link types within Google’s
  scanned books.

Evaluation (cont)

• Manual labeling:
  • Sampled 120 passages from low scores and 120 from
    high scores (to avoid precision bias).
  • Used a Likert scale of 1 to 5 with 1-2 meaning good, 3
    meaning neutral, and 4-5 meaning bad.
  • Inter-annotator agreement was 88.5% (± 3.5% to
    account for neutral labels)
  • 88% marked good

Evaluation (cont)

• User study:
  • Consisted of monitoring user activity in Google
    Books.
    • Specifically, whether users navigated via popular passages
      (Quotations); to other editions of the book (Editions); to other
      similar books within a cluster (Related); or to books that cite
      the current book (Cited By)
    • Results →

Evaluation (cont)

Evaluation (cont)

• Coverage:
  • What is the distribution of these link types in scanned books?

Related Work & Future Work

• Related Work
  • Automatic Hypertext
  • Plagiarism Detection
• Future Work
  • Improved Ranking
  • Incremental Processing
  • Primary Source Identification
  • Attribution

Questions + Discussion

The End.

Questions & discussion.

...Go Rangers!
