



Focused Crawling for Vertical Search

                       Marcelo Mendoza


                              11.11.11




                        - JCC 2011 - Curicó, Chile - 11.11.11


Overview



1   Vertical Search


2   Crawling


3   State-of-the-art


4   Conclusion




Focused Crawling for Vertical Search   Vertical Search


Why Does Web Vertical Search Matter?




   Web size: More than 20 billion pages.
   Millions of users, millions of queries, millions of needs.
   Advantages:
     1   Greater precision due to limited scope
     2   Leverage domain knowledge (ontologies)
   Domains: business, medicine, science, education, ...






Science Vertical Search




scienceresearch.com


Business Vertical Search




biznar.com


Education Vertical Search




contentcompass.cl [1]
  [1] Fondef D08I1155
Focused Crawling for Vertical Search   Crawling


Hyperlinks among web pages






The Web as a graph

           Nodes: web pages

           Edges: hyperlinks






The Web: Some facts

   The size of the Web: 11.5 billion pages (indexable, 2005).
   The deep Web: available by querying databases.
   Static / dynamic pages.
   Graph model: Scale-free network, degree distribution ≈ power law.
   The Web structure: Bow-tie model (IN/SCC/OUT/ISLANDS).






Crawler architecture




Online resource: C. Castillo, Effective Web Crawling (PhD thesis)



Crawling strategies




    Breadth-first crawlers: URL frontier implemented as a FIFO queue.
    Preferential crawlers: URL frontier implemented as a priority queue.
    Priority scores:
      1   Topological properties (e.g. indegree of the target page).
      2   Content properties (e.g. similarity between a query and the source
          page).
      3   Hybrid measures.
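
The two frontier disciplines above can be sketched as follows. This is a minimal illustration, not a real crawler: the link graph `LINKS` and the scores in `SCORE` are hypothetical values invented for the example, and the score stands in for any of the priority measures listed (topological, content-based, or hybrid).

```python
import heapq
from collections import deque

# Hypothetical toy link graph: page -> outgoing links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}

# Hypothetical priority scores (e.g. indegree or query similarity).
SCORE = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}


def breadth_first(seed):
    """Visit pages in the order they were discovered (FIFO frontier)."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def preferential(seed):
    """Visit the highest-scoring known page first (priority-queue frontier)."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier, seen, order = [(-SCORE[seed], seed)], {seed}, []
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE[link], link))
    return order
```

On this toy graph, `breadth_first("seed")` visits seed, a, b, c, d, while `preferential("seed")` visits seed, b, d, a, c: the high-scoring branch is exhausted before the low-scoring one is touched.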






Universal / Focused crawling



    Universal crawlers: General purpose.
    Challenges:
      1   Scalability
      2   Coverage / Freshness


    Focused crawlers: We may want to crawl only pages on certain topics.
    Challenges:
      1   Coverage / Accuracy






Focused Crawling
Breadth-first: depth 1



                 Seed
                 Target






Focused Crawling
Breadth-first: depth 2



                 Seed
                 Target






Focused Crawling
Breadth-first: depth 3



                 Seed
                 Target






Focused Crawling
Breadth-first: unreachable pages, excessive computational costs!



                 Seed
                 Target




Focused Crawling for Vertical Search   State-of-the-art


Early algorithms: Fish search




De Bra, P., and Post, R. (1994)
Query (keywords), source page terms, term-based distance, best-first


Early algorithms: Shark search




Hersovici et al. (1998)
Query (keywords), anchor text, term-based distance, best-first
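
The difference between scoring a link by its source page (Fish search) and by its anchor text (Shark search) can be illustrated with a small sketch. It is a simplification under stated assumptions, not the published algorithms: the keyword-overlap `similarity`, the `decay` and `gamma` constants, and the function names are all hypothetical choices made for this example.

```python
def similarity(query, text):
    """Toy term-based measure: fraction of query keywords found in the text."""
    words = set(text.lower().split())
    terms = [t.lower() for t in query]
    return sum(t in words for t in terms) / len(terms)


def potential_score(query, parent_text, inherited, anchor_text,
                    decay=0.5, gamma=0.5):
    """Score an unvisited link from its source page and its anchor text.

    decay and gamma are arbitrary illustration values (assumptions).
    """
    parent_sim = similarity(query, parent_text)
    # A relevant source page passes on its own similarity, decayed;
    # an irrelevant one passes on what it inherited, decayed again.
    inherited_child = decay * (parent_sim if parent_sim > 0 else inherited)
    # The anchor text gives direct evidence about the link target.
    anchor_sim = similarity(query, anchor_text)
    return gamma * inherited_child + (1 - gamma) * anchor_sim
```

Setting `gamma=1.0` ignores the anchor text and leaves only the source-page signal, which is closer in spirit to the Fish search slide. Lower values let a promising anchor rescue a link found on a weak page.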


Early algorithms: ARACHNID




Menczer, F. (1997)
Multi-agent, evolution-inspired: mutation (new seeds), fitness (score
acc.), term-based scores.


Context: Link Analysis



The Web graph as an information source (beyond the text)

Kleinberg, J. (1998)
HITS: authoritative pages (many in-links), hub pages (many out-links).

Brin, S. & Page, L. (1998)
PageRank: Random walk over the Web graph, stationary probability
vector.
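
The random-walk view of PageRank can be made concrete with a short power-iteration sketch. The four-page graph `GRAPH`, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration; the slides do not specify them.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate the stationary distribution of the random surfer.

    links maps every page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleportation: with probability (1 - damping) jump anywhere.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical example graph.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
```

In `GRAPH`, page c collects links from a, b and d and ends up with the highest score, while d has no in-links and keeps only the teleportation share.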






Link-based algorithms



Cho, J., Garcia-Molina, H., and Page, L. (1998)
Link-based scores: Backlinks count, PageRank

Chakrabarti, S., Van den Berg, M., and Dom, B. (1999)
Topic distillation: Text-based classifier over web page examples per
category (off-line dataset construction, human labeling, content text
positive and negative examples). On-line phase: Anchor-based score (ML)
+ HITS-based score for distillation.






Link-based algorithms: Basic assumptions


                        Seed
                        Target




Davison, B. (2000)
Topical locality: Locality based on anchor text and links.


Link-based algorithms: Basic assumptions




Menczer, F. (2004)
Link cluster conjecture: Related pages tend to be linked.


Link-based algorithms: Backlink graph
Considering how far away the target is: layered backlink graph!




Diligenti et al. (2000)
Using the backlink graph for multiclass learning. Greedy approach.

Babaria et al. (2007)
Using the backlink graph for ordinal regression. Greedy approach.
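
A layered backlink graph can be built by a breadth-first search that follows links backwards from the known targets. The sketch below assumes the link structure is already available as a dictionary; the function name and the toy graph in the usage note are hypothetical.

```python
from collections import deque


def backlink_layers(links, targets):
    """Return {page: layer}, the minimum number of links from page to a target.

    links maps each page to the list of pages it links to.
    Pages that cannot reach any target are absent from the result.
    """
    # Invert the graph: for each page, which pages link to it?
    backlinks = {}
    for page, outlinks in links.items():
        for target in outlinks:
            backlinks.setdefault(target, []).append(page)

    layer = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        page = queue.popleft()
        for parent in backlinks.get(page, []):
            if parent not in layer:
                layer[parent] = layer[page] + 1
                queue.append(parent)
    return layer
```

For `links = {"s": ["x"], "x": ["y"], "y": ["t"], "z": ["t"], "t": []}` and target `"t"`, the layers are t: 0, y: 1, z: 1, x: 2, s: 3. The layer number can then serve as a class label (multiclass learning) or as an ordered label (ordinal regression) for a model that estimates how far a newly fetched page is from a target.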


Off-line learning-based algorithms
Kinds of features
    The content of the web pages which are known to link to the
    candidate URL.
    URL tokens from the candidate URL.






Off-line learning-based algorithms



Rennie & McCallum (1999)
1st stage (Off-line): Text-based features (anchor + header + title of the
target). 2nd stage (On-line): Candidate URL scoring based on the text
classifier, applied to the candidate URL's anchor and URL text.

Li et al. (2005)
1st stage (Off-line): ID3 learning strategy. Anchor text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier,
applied to the candidate URL's anchor text.
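
The two-stage pattern (train a text classifier off-line, then score candidate URLs on-line) can be sketched with a tiny classifier. Naive Bayes is used here only because it fits in a few lines; it is a stand-in, not the learner of any cited paper, and the training examples and function names are hypothetical.

```python
import math
from collections import Counter


def train(examples):
    """Off-line stage: examples is a list of (text, is_on_topic) pairs.

    Both classes must occur at least once in the examples.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def score_url(model, anchor_text):
    """On-line stage: log-odds that the link target is on-topic."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_prob = {}
    for label in (True, False):
        lp = math.log(doc_counts[label] / total_docs)
        # Laplace smoothing avoids zero probabilities for unseen pairs.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in anchor_text.lower().split():
            if word in vocab:
                lp += math.log((word_counts[label][word] + 1) / denom)
        log_prob[label] = lp
    return log_prob[True] - log_prob[False]


# Hypothetical labeled anchor texts.
TRAINING = [
    ("focused crawler tutorial", True),
    ("web crawling survey", True),
    ("cheap flights deals", False),
    ("football results today", False),
]
```

With `model = train(TRAINING)`, the anchor "crawling tutorial" receives a positive score and "cheap football deals" a negative one, so the first link would be placed ahead of the second in a priority-queue frontier.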






Off-line learning-based algorithms


Pant & Srinivasan (2006)
1st stage (Off-line): SVM learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier,
applied to the text surrounding the candidate URL.

Feng et al. (2010)
1st stage (Off-line): Term-based weights. Weighted graph construction.
2nd stage (Off-line): PageRank over the weighted graph. 3rd stage
(Off-line): Labeling based on PageRank. Term-based learning. 4th stage
(On-line): Candidate URL scoring based on the text classifier, applied to
the candidate URL's anchor text.





Machine Learning-based adaptive algorithms

Learning on-the-fly from the context






Machine Learning-based adaptive algorithms
Learning on-the-fly from the context

[Figure: pages labelled "Bach" and a candidate URL]




Aggarwal et al. (2000)
1st stage (Off-line): Crawling for dataset construction. Human labeling
(positive examples). Bayes learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier +
feature selection based on interest ratio, applied to the candidate URL's
anchor text.
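
The interest-ratio idea can be sketched as follows, assuming the ratio compares the probability that a page is relevant given a feature with the overall probability of relevance, P(relevant | feature) / P(relevant). That definition, the data layout, and the function name are assumptions made for this illustration.

```python
def interest_ratio(crawled, feature):
    """Estimate P(relevant | feature) / P(relevant) from pages seen so far.

    crawled is a list of (feature_set, is_relevant) pairs.
    Returns 1.0 (neutral) when there is no evidence for the feature.
    """
    total = len(crawled)
    relevant = sum(1 for _, rel in crawled if rel)
    with_feature = [rel for feats, rel in crawled if feature in feats]
    if total == 0 or relevant == 0 or not with_feature:
        return 1.0
    p_relevant = relevant / total
    p_relevant_given_feature = sum(with_feature) / len(with_feature)
    return p_relevant_given_feature / p_relevant
```

A ratio above 1 means the feature (for example the word "Bach" in the linking page) makes relevance more likely than average, so the crawler can favour candidate URLs carrying it. Because the statistics come from pages fetched so far, the ratios adapt as the crawl proceeds.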


Machine Learning-based adaptive algorithms
Learning on-the-fly from the context




Chakrabarti et al. (2002)
1st stage (Off-line): Crawling for dataset construction. Human labeling
(positive examples). Content text-based features. 2nd stage (On-line):
Training from positive examples using fetched pages (more sophisticated
features such as DOM tree). 3rd stage (On-line): URL scoring based on
the apprentice learner.


Machine Learning-based adaptive algorithms
Learning to skip off-topic pages



                Seed
                Target






Machine Learning-based adaptive algorithms
Learning to skip off-topic pages

[Figure: crawl graph with Seed and Target nodes, per-page relevance scores ranging from 0.1 to 0.8, and one node labelled "Dud"]






Machine Learning-based adaptive algorithms
Learning to skip off-topic pages: Tunneling!




Bergmark et al. (2002)
1st stage (Off-line): Crawling for dataset construction. Human labeling
(positive examples). Content text-based features. 2nd stage (Off-line):
Tunneling module construction. Cutoff threshold learning based on
nugget-dud paths. 3rd stage (On-line): Apprentice tunneling learner.
Adaptive cutoff based on paths evaluated by using fetched pages.
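
Tunneling with a cutoff can be sketched as a crawl that keeps following links through off-topic pages, but gives up on a path after too many consecutive misses. In this simplified version the cutoff is a fixed parameter rather than a learned, adaptive one, and the graph `LINKS`, the scores in `RELEVANCE`, and the threshold are hypothetical.

```python
from collections import deque

# Hypothetical link graph and page relevance scores.
LINKS = {
    "seed": ["d1", "e1"],
    "d1": ["d2"], "d2": ["nugget"], "nugget": [],
    "e1": ["e2"], "e2": ["e3"], "e3": ["far"], "far": [],
}
RELEVANCE = {
    "seed": 0.9, "d1": 0.1, "d2": 0.2, "nugget": 0.8,
    "e1": 0.1, "e2": 0.1, "e3": 0.1, "far": 0.9,
}


def crawl_with_tunneling(links, relevance, seed, threshold=0.5, cutoff=2):
    """Fetch pages, tolerating at most `cutoff` consecutive off-topic pages."""
    # Each entry: (page, number of consecutive off-topic pages before it).
    frontier = deque([(seed, 0)])
    seen = {seed}
    fetched = []
    while frontier:
        page, run = frontier.popleft()
        fetched.append(page)
        # An on-topic page (nugget) resets the run; a dud extends it.
        run = 0 if relevance[page] >= threshold else run + 1
        if run > cutoff:
            continue  # abandon this path, do not expand its links
        for link in links[page]:
            if link not in seen:
                seen.add(link)
                frontier.append((link, run))
    return fetched
```

With `cutoff=2`, the crawl tunnels through d1 and d2 and reaches `nugget`, but stops at e3 and never fetches `far`. With `cutoff=0` it behaves like a strict focused crawler and fetches only seed, d1 and e1.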



Machine Learning-based adaptive algorithms

Agents for path detection: Ants




Gasparetti & Micarelli (2004)
Close in aim to ARACHNID (multi-agent, multi-seed). Back-and-forth
trips to relevant resources generate pheromone trails. Shortest paths
attract more ants.



Ontology driven crawling strategies


Knowledge representation: Ontologies
      sc      :   SubClassOf
      dom     :   Domain
      range   :   Range
      i       :   InstanceOf
      eq      :   Equivalent
      sp      :   SubPropertyOf

[Figure: example ontology graph whose edges carry the labels above. Nodes: Camp Nou, stadiums, sports, football, soccer, national teams, plays_in, Barcelona F.C., city, coastal_city, Barcelona, country, Spain]






Ontology driven crawling strategies

Ontology-based match expansion




Ehrig & Maedche (2003)
Relevance scoring. 1st stage: Concept matching (ontology + lexicon). 2nd
stage: Ontology-based expansion. 3rd stage: Summarization.
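
The three stages can be illustrated with a toy relevance function. Everything specific in it is an assumption made for this sketch: the lexicon entries, the `RELATED` links between concepts, the expansion weight, and the use of a plain sum as the summarization step.

```python
# Hypothetical lexicon: surface word -> ontology concept.
LEXICON = {
    "football": "soccer",
    "soccer": "soccer",
    "stadium": "stadiums",
    "stadiums": "stadiums",
}

# Hypothetical ontology relations used for expansion.
RELATED = {"soccer": ["sports"], "stadiums": ["sports"]}


def relevance(text, target_concepts, expansion_weight=0.5):
    """Score a page against the concepts the user is interested in."""
    # Stage 1: concept matching through the lexicon.
    counts = {}
    for word in text.lower().split():
        concept = LEXICON.get(word)
        if concept is not None:
            counts[concept] = counts.get(concept, 0) + 1

    # Stage 2: ontology-based expansion to related concepts,
    # with a reduced weight for indirect evidence.
    expanded = {concept: float(n) for concept, n in counts.items()}
    for concept, n in counts.items():
        for neighbour in RELATED.get(concept, []):
            expanded[neighbour] = expanded.get(neighbour, 0.0) + expansion_weight * n

    # Stage 3: summarization into a single relevance value.
    return sum(expanded.get(c, 0.0) for c in target_concepts)
```

A page containing "football stadium news" never mentions "sports", yet `relevance("football stadium news", {"sports"})` is 1.0 because both matched concepts are related to it. That indirect match is the benefit of the expansion stage.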





Ontology driven crawling strategies
Ontology-based learning strategy




Zheng et al. (2008)
Relevance scoring for fetched pages. 1st stage: Concept matching
(ontology + lexicon), concept distances, document scoring. 2nd stage: ANN
training. 3rd stage (On-line): Term-based URL scoring (ANN, anchor as
input).


More features for unvisited URL scoring




Feng et al. (2010)
On-line PageRank + term scoring (anchor, surrounding)

Patel & Schmidt (2011)
Term scoring based on matching and document structure (structure of the
current page).




Focused Crawling for Vertical Search   Conclusion


Challenges




   Precision / recall trade-off
   Benchmarking
   Ontology IE for effective crawling
   Unbiased seed identification
   Efficiency issues (scalability, ...)





Mais conteúdo relacionado

Destaque

Tcm Presentation 2011
Tcm Presentation 2011Tcm Presentation 2011
Tcm Presentation 2011erwinr
 
Producto 2 sesion 2
Producto  2 sesion 2Producto  2 sesion 2
Producto 2 sesion 2evita03
 
Industrial revolution
Industrial revolution Industrial revolution
Industrial revolution chrkie
 
Sesion 3 productos 3 y 4
Sesion 3 productos 3 y 4Sesion 3 productos 3 y 4
Sesion 3 productos 3 y 4evita03
 
Maltrato infantil en el caqueta
Maltrato infantil en el caquetaMaltrato infantil en el caqueta
Maltrato infantil en el caquetaangelikita14
 
La direcció de l’empresa
La direcció de l’empresaLa direcció de l’empresa
La direcció de l’empresamaxim2011
 
Kelly McQueen - Visual resume
Kelly McQueen - Visual resumeKelly McQueen - Visual resume
Kelly McQueen - Visual resumeKygal36200
 
Visual Walk - demo slides
Visual Walk - demo slidesVisual Walk - demo slides
Visual Walk - demo slidesFredrik Arvas
 

Destaque (16)

Tcm Presentation 2011
Tcm Presentation 2011Tcm Presentation 2011
Tcm Presentation 2011
 
Programa de Fiestas de Anzo anzofé
Programa de Fiestas de Anzo anzoféPrograma de Fiestas de Anzo anzofé
Programa de Fiestas de Anzo anzofé
 
Presentación1
Presentación1Presentación1
Presentación1
 
Producto 2 sesion 2
Producto  2 sesion 2Producto  2 sesion 2
Producto 2 sesion 2
 
Industrial revolution
Industrial revolution Industrial revolution
Industrial revolution
 
Sesion 3 productos 3 y 4
Sesion 3 productos 3 y 4Sesion 3 productos 3 y 4
Sesion 3 productos 3 y 4
 
97 2003
97 200397 2003
97 2003
 
Maltrato infantil en el caqueta
Maltrato infantil en el caquetaMaltrato infantil en el caqueta
Maltrato infantil en el caqueta
 
97 2003
97 200397 2003
97 2003
 
La direcció de l’empresa
La direcció de l’empresaLa direcció de l’empresa
La direcció de l’empresa
 
Electronica
ElectronicaElectronica
Electronica
 
97 2003
97 200397 2003
97 2003
 
Fabricio moreno-personal experience
Fabricio moreno-personal experienceFabricio moreno-personal experience
Fabricio moreno-personal experience
 
Kelly McQueen - Visual resume
Kelly McQueen - Visual resumeKelly McQueen - Visual resume
Kelly McQueen - Visual resume
 
Visual Walk - demo slides
Visual Walk - demo slidesVisual Walk - demo slides
Visual Walk - demo slides
 
Second life UVT
Second life UVTSecond life UVT
Second life UVT
 

Último

Deira Call Girls # 0522916705 # Call Girls In Deira Dubai || (UAE)
Deira Call Girls # 0522916705 #  Call Girls In Deira Dubai || (UAE)Deira Call Girls # 0522916705 #  Call Girls In Deira Dubai || (UAE)
Deira Call Girls # 0522916705 # Call Girls In Deira Dubai || (UAE)wdefrd
 
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad EscortsIslamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escortswdefrd
 
Turn Lock Take Key Storyboard Daniel Johnson
Turn Lock Take Key Storyboard Daniel JohnsonTurn Lock Take Key Storyboard Daniel Johnson
Turn Lock Take Key Storyboard Daniel Johnsonthephillipta
 
Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...
Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...
Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...home
 
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Servicesonnydelhi1992
 
FULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | Delhi
FULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | DelhiFULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | Delhi
FULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | DelhiMalviyaNagarCallGirl
 
Call girls in Kanpur - 9761072362 with room service
Call girls in Kanpur - 9761072362 with room serviceCall girls in Kanpur - 9761072362 with room service
Call girls in Kanpur - 9761072362 with room servicediscovermytutordmt
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Pari Chowk | Noida
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Pari Chowk | NoidaFULL ENJOY 🔝 8264348440 🔝 Call Girls in Pari Chowk | Noida
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Pari Chowk | Noidasoniya singh
 
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...
Indira Nagar Lucknow #Call Girls Lucknow ₹7.5k Pick Up & Drop With Cash Payme...akbard9823
 
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...akbard9823
 
Akola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service AkolaAkola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service Akolasrsj9000
 
Lucknow 💋 Russian Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Ser...
Lucknow 💋 Russian Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Ser...Lucknow 💋 Russian Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Ser...
Lucknow 💋 Russian Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Ser...anilsa9823
 
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Lajpat Nagar Delhi >༒9667401043 Escort Servicesonnydelhi1992
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Vasant Kunj | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Vasant Kunj | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Vasant Kunj | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Vasant Kunj | Delhisoniya singh
 
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...gurkirankumar98700
 
The First Date by Daniel Johnson (Inspired By True Events)
The First Date by Daniel Johnson (Inspired By True Events)The First Date by Daniel Johnson (Inspired By True Events)
The First Date by Daniel Johnson (Inspired By True Events)thephillipta
 
Alex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson StoryboardAlex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson Storyboardthephillipta
 
FULL ENJOY - 9953040155 Call Girls in Wazirabad | Delhi
FULL ENJOY - 9953040155 Call Girls in Wazirabad | DelhiFULL ENJOY - 9953040155 Call Girls in Wazirabad | Delhi
FULL ENJOY - 9953040155 Call Girls in Wazirabad | DelhiMalviyaNagarCallGirl
 
exhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptxexhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptxKurikulumPenilaian
 
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...akbard9823
 

Último (20)

Deira Call Girls # 0522916705 # Call Girls In Deira Dubai || (UAE)
Deira Call Girls # 0522916705 #  Call Girls In Deira Dubai || (UAE)Deira Call Girls # 0522916705 #  Call Girls In Deira Dubai || (UAE)
Deira Call Girls # 0522916705 # Call Girls In Deira Dubai || (UAE)
 
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad EscortsIslamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
 
Turn Lock Take Key Storyboard Daniel Johnson
Turn Lock Take Key Storyboard Daniel JohnsonTurn Lock Take Key Storyboard Daniel Johnson
Turn Lock Take Key Storyboard Daniel Johnson
 
Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...
Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...
Authentic # 00971556872006 # Hot Call Girls Service in Dubai By International...
 
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort ServiceYoung⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
Young⚡Call Girls in Uttam Nagar Delhi >༒9667401043 Escort Service
 
FULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | Delhi
FULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | DelhiFULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | Delhi
FULL ENJOY - 9953040155 Call Girls in Kotla Mubarakpur | Delhi
 
Call girls in Kanpur - 9761072362 with room service
Focused Crawling for Vertical Search

  • 1. Focused Crawling for Vertical Search. Marcelo Mendoza. JCC 2011, Curicó, Chile, 11.11.11.
  • 2. Overview: (1) Vertical Search; (2) Crawling; (3) State-of-the-art; (4) Conclusion.
  • 3. Vertical Search: Why does Web vertical search matter? Web size: more than 20 billion pages. Millions of users, millions of queries, millions of needs. Advantages: (1) greater precision due to limited scope; (2) leverage of domain knowledge (ontologies). Domains: business, medicine, science, education, ...
  • 4. Vertical Search: Science vertical search. Example: scienceresearch.com
  • 5. Vertical Search: Business vertical search. Example: biznar.com
  • 6. Vertical Search: Education vertical search. Example: contentcompass.cl (footnote 1: Fondef D08I1155).
  • 7. Crawling: Hyperlinks among web pages.
  • 8. Crawling: The Web as a graph. [Figure labels: web pages, hyperlinks.]
  • 9. Crawling: The Web, some facts. Size of the Web: 11.5 billion pages (indexable, 2005). The deep Web: available by querying databases. Static / dynamic pages. Graph model: scale-free network, degree distribution ≈ power law. Web structure: bow-tie model (IN/SCC/OUT/ISLANDS).
  • 10. Crawling: Crawler architecture. Online resource: C. Castillo, Effective Web Crawling (PhD thesis) URL
  • 11. Crawling: Crawling strategies. Breadth-first crawlers: URL frontier implemented as a FIFO queue. Preferential crawlers: URL frontier implemented as a priority queue. Priority scores: (1) topological properties (e.g. indegree of the target page); (2) content properties (e.g. similarity between a query and the source page); (3) hybrid measures.
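The two frontier disciplines on this slide can be contrasted with a minimal sketch. The link graph `LINKS` and the per-page scores `SCORE` are hypothetical toy data, not taken from the talk, and the score of a link is assumed to be known when the link is discovered (a real crawler would estimate it from topology or content).

```python
import heapq
from collections import deque

# Hypothetical toy link graph: page -> outgoing links (not from the slides).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["target"],
    "c": [],
    "target": [],
}

# Hypothetical priority scores in [0, 1]; assumed known at discovery time.
SCORE = {"seed": 0.5, "a": 0.1, "b": 0.9, "c": 0.1, "target": 1.0}


def breadth_first(seed):
    """Visit pages in FIFO order of discovery."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def preferential(seed):
    """Visit the highest-scoring known page first."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier, seen, order = [(-SCORE[seed], seed)], {seed}, []
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE[link], link))
    return order
```

On this toy graph the preferential crawler reaches `target` on its third fetch, while the breadth-first crawler fetches every other page first.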
  • 12. Crawling: Universal / focused crawling. Universal crawlers: general purpose. Challenges: (1) scalability; (2) coverage / freshness. Focused crawlers: crawl only pages on certain topics. Challenges: (1) coverage / accuracy.
  • 13. Crawling: Focused crawling. Breadth-first: depth 1. [Figure labels: Seed, Target.]
  • 14. Crawling: Focused crawling. Breadth-first: depth 2. [Figure labels: Seed, Target.]
  • 15. Crawling: Focused crawling. Breadth-first: depth 3. [Figure labels: Seed, Target.]
  • 16. Crawling: Focused crawling. Breadth-first: unreachable pages, excessive computational costs! [Figure labels: Seed, Target.]
  • 17. State-of-the-art: Early algorithms: Fish search. De Bra, P., and Post, R. (1994). Query (keywords), source page terms, term-based distance, best-first.
  • 18. State-of-the-art: Early algorithms: Shark search. Hersovici et al. (1998). Query (keywords), anchor text, term-based distance, best-first.
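A term-based score between a keyword query and a link's anchor text can be sketched as follows. This is not the published Fish search or Shark search formula; it is a plain cosine similarity over term counts, chosen here as an assumption to illustrate the kind of score a best-first frontier could be ordered by. The function name `cosine_similarity` and the whitespace tokenization are illustrative choices.

```python
import math
from collections import Counter


def cosine_similarity(query, text):
    """Cosine similarity between term-count vectors of two strings."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(q[w] * t[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0
```

A candidate URL whose anchor text shares no term with the query scores 0.0 and would sit at the back of the priority queue.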
  • 19. State-of-the-art: Early algorithms: ARACHNID. Menczer, F. (1997). Multi-agent, evolution-inspired: mutation (new seeds), fitness (score acc.), term-based scores.
  • 20. State-of-the-art: Context: link analysis. The Web graph as an information source (beyond the text). Kleinberg, J. (1998). HITS: authoritative pages (linked to by good hubs), hub pages (linking to good authorities). Brin, S. & Page, L. (1998). PageRank: random walk over the Web graph, stationary probability vector.
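The stationary probability vector of the random walk can be approximated by power iteration. The sketch below assumes a damping factor of 0.85 and uniform redistribution of rank from pages without out-links; both are conventional choices rather than details given on the slide. It also assumes every link target appears as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every
    target is assumed to also be a key of the dictionary.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleportation term: the walker jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For the hypothetical graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` receives links from two pages and ends up with the highest rank, while `b`, which nobody links to, keeps only the teleportation share.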
  • 21. State-of-the-art: Link-based algorithms. Cho, J., Garcia-Molina, H., Page, L. (1998). Link-based scores: backlink count, PageRank. Chakrabarti, S., Van den Berg, M., and Dom, B. (1999). Topic distillation: text-based classifier over web page examples per category (off-line dataset construction, human labeling, content text positive and negative examples). On-line phase: anchor-based score (ML) + HITS-based score for distillation.
  • 22. State-of-the-art: Link-based algorithms: basic assumptions. Davison, B. (2000). Topical locality: locality based on anchor text and links. [Figure labels: Seed, Target.]
  • 23. State-of-the-art: Link-based algorithms: basic assumptions. Menczer, F. (2004). Link cluster conjecture: related pages tend to be linked.
  • 24. State-of-the-art: Link-based algorithms: backlink graph. Considering how far away the target is: layered backlink graph! Diligenti et al. (2000): using the backlink graph for multiclass learning. Greedy approach. Babaria et al. (2007): using the backlink graph for ordinal regression. Greedy approach.
  • 25. State-of-the-art: Off-line learning-based algorithms. Kinds of features: the content of the web pages which are known to link to the candidate URL; URL tokens from the candidate URL.
  • 26. State-of-the-art: Off-line learning-based algorithms. Rennie & McCallum (1999). 1st stage (off-line): text-based features (anchor + header + title of the target). 2nd stage (on-line): candidate URL scoring based on the text classifier (candidate URL (anchor + URL text)). Li et al. (2005). 1st stage (off-line): ID3 learning strategy. Anchor text-based features. 2nd stage (on-line): candidate URL scoring based on the text classifier (candidate URL (anchor)).
  • 27. State-of-the-art: Off-line learning-based algorithms. Pant & Srinivasan (2006). 1st stage (off-line): SVM learning strategy. Content text-based features. 2nd stage (on-line): candidate URL scoring based on the text classifier (candidate URL (surrounding text)). Feng et al. (2010). 1st stage (off-line): term-based weights. Weighted graph construction. 2nd stage (off-line): PageRank over the weighted graph. 3rd stage (off-line): labeling based on PageRank. Term-based learning. 4th stage (on-line): candidate URL scoring based on the text classifier (candidate URL (anchor)).
  • 28. State-of-the-art: Machine learning-based adaptive algorithms. Learning on-the-fly from the context.
  • 29. State-of-the-art: Machine learning-based adaptive algorithms. Learning on-the-fly from the context. [Figure labels: "Ba ch", "Bach", candidate URL.] Aggarwal et al. (2000). 1st stage (off-line): crawling for dataset construction. Human labeling (positive examples). Bayes learning strategy. Content text-based features. 2nd stage (on-line): candidate URL scoring based on the text classifier + feature selection based on interest ratio (candidate URL (anchor)).
  • 30. State-of-the-art: Machine learning-based adaptive algorithms. Learning on-the-fly from the context. Chakrabarti et al. (2002). 1st stage (off-line): crawling for dataset construction. Human labeling (positive examples). Content text-based features. 2nd stage (on-line): training from positive examples using fetched pages (more sophisticated features such as the DOM tree). 3rd stage (on-line): URL scoring based on the apprentice learner.
  • 31. State-of-the-art: Machine learning-based adaptive algorithms. Learning to skip off-topic pages. [Figure labels: Seed, Target.]
  • 32. State-of-the-art: Machine learning-based adaptive algorithms. Learning to skip off-topic pages. [Figure: link graph from Seed to Target with per-page scores between 0.1 and 0.8; one low-scoring page is labeled Dud.]
  • 33. State-of-the-art: Machine learning-based adaptive algorithms. Learning to skip off-topic pages: tunneling! Bergmark et al. (2002). 1st stage (off-line): crawling for dataset construction. Human labeling (positive examples). Content text-based features. 2nd stage (off-line): tunneling module construction. Cutoff threshold learning based on nugget-dud paths. 3rd stage (on-line): apprentice tunneling learner. Adaptive cutoff based on paths evaluated by using fetched pages.
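The idea of tunneling through off-topic pages up to a cutoff can be sketched as below. This is a simplified illustration with a fixed cutoff, not the adaptive, learned cutoff described on the slide. The link graph, the relevance values, the relevance `threshold`, and the function name are all hypothetical.

```python
from collections import deque

# Hypothetical chain: a relevant seed, two off-topic pages ("duds"), a target.
LINKS = {"seed": ["dud1"], "dud1": ["dud2"], "dud2": ["target"], "target": []}
RELEVANCE = {"seed": 0.8, "dud1": 0.1, "dud2": 0.2, "target": 0.9}


def crawl_with_tunneling(seed, threshold=0.5, cutoff=2):
    """Follow links through at most `cutoff` consecutive off-topic pages."""
    # Each frontier entry carries the number of consecutive duds on its path.
    frontier = deque([(seed, 0)])
    seen, fetched = {seed}, []
    while frontier:
        page, duds = frontier.popleft()
        fetched.append(page)
        # A relevant page (a "nugget") resets the count; a dud extends it.
        duds = 0 if RELEVANCE[page] >= threshold else duds + 1
        if duds > cutoff:
            continue  # The tunnel is too long: do not expand this page.
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append((link, duds))
    return fetched
```

With `cutoff=2` the crawl passes through both duds and reaches `target`; with `cutoff=1` it gives up after fetching `dud2`, which shows the trade-off between recall and wasted fetches that the cutoff controls.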
  • 34. State-of-the-art: Machine learning-based adaptive algorithms. Agents for path detection: ants. Gasparetti & Micarelli (2004). Close in aim to ARACHNID (multi-agent, multi-seed). Back-and-forth trips to relevant resources generate pheromone trails. Shortest paths attract more ants.
  • 35. State-of-the-art: Ontology-driven crawling strategies. Knowledge representation: ontologies. [Figure: example ontology. Edge legend: sc = SubClassOf, sp = SubPropertyOf, i = InstanceOf, eq = Equivalent, dom = Domain, range = Range. Nodes: sports, football, soccer, stadiums, national teams, plays_in, city, coastal_city, country, Camp Nou, Barcelona, Barcelona F.C., Spain.]
  • 36. State-of-the-art: Ontology-driven crawling strategies. Ontology-based match expansion. Ehrig & Maedche (2003). Relevance scoring. 1st stage: concept matching (ontology + lexicon). 2nd stage: ontology-based expansion. 3rd stage: summarization.
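The first two stages (concept matching through a lexicon, then expansion along ontology relations) can be illustrated with a minimal sketch. It is not the scoring function of the cited work: the lexicon, the neighbour relation, the target concept set, and the `decay` weight are all hypothetical, with concept names borrowed from the ontology example in the talk.

```python
# Hypothetical lexicon: surface term -> ontology concept.
LEXICON = {
    "soccer": "football",
    "football": "football",
    "stadium": "stadiums",
    "barcelona": "Barcelona F.C.",
}

# Hypothetical ontology neighbours used for match expansion.
RELATED = {
    "football": ["national teams"],
    "stadiums": ["sports"],
    "Barcelona F.C.": ["football", "Camp Nou"],
}

# Hypothetical set of concepts the crawler is focused on.
TARGET_CONCEPTS = {"football"}


def relevance(text, decay=0.5):
    """Score a text by direct and one-step expanded concept matches."""
    # Stage 1: map terms to concepts through the lexicon.
    matched = {LEXICON[w] for w in text.lower().split() if w in LEXICON}
    score = 0.0
    for concept in matched:
        if concept in TARGET_CONCEPTS:
            score += 1.0  # Direct match with a target concept.
        elif any(n in TARGET_CONCEPTS for n in RELATED.get(concept, [])):
            score += decay  # Stage 2: match found only after expansion.
    return score
```

A text that mentions only "barcelona" still earns a partial score, because the matched concept is one relation away from the target concept; a plain keyword match would have scored it zero.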
  • 37. State-of-the-art: Ontology-driven crawling strategies. Ontology-based learning strategy. Zheng et al. (2008). Relevance scoring for fetched pages. 1st stage: concept matching (ontology + lexicon), concept distances, document scoring. 2nd stage: ANN training. 3rd stage (on-line): term-based URL scoring (ANN, anchor as input).
  • 38. State-of-the-art: More features for unvisited URL scoring. Feng et al. (2010): on-line PageRank + term scoring (anchor, surrounding text). Patel & Schmidt (2011): term scoring based on matching and document structure (structure of the current page).
  • 39. Conclusion: Challenges. Precision / recall trade-off; benchmarking; ontology IE for effective crawling; unbiased seed identification; efficiency issues (scalability, ...).