Investigating the Change of
Web Pages’ Titles Over Time

   Martin Klein and Michael L. Nelson
         Old Dominion University

       {mklein,mln}@cs.odu.edu

                InDP 2009
                Austin, TX
                06/19/2009
The Problem

http://www.pspcentral.org/events/annual_meeting_2003.html
The Environment

Web Infrastructure (WI) [McCown07]

• Web search engines (Google, Yahoo!, MSN Live) and their caches
• Research projects (CiteSeer)
• Web archives (Internet Archive)

[McCown07] - F. McCown, “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
The Bigger Picture

(1) Query for URL in:
    · search engine caches
    · Internet Archive
(2) Present results; if user is satisfied: DONE
(3) If no results found, or user is not satisfied:
    · identify dissimilar pages
    · extract titles
    · generate LSs
    · obtain tags
    · query search engines
(4) Present results; if user is satisfied: DONE
(5) Otherwise:
    · include link neighborhood
    · relevance feedback
    · user interaction:
      - request keywords
      - change number of terms in LS
      - add/delete term from LS
      - advanced search operators
(6) Present results: DONE

• System catches 404 “Page not found” errors
• Discovers copy of missing page in WI and provides it to user
• Obtains further data about missing page (LS, title, tags) and feeds that back into WI
• Provides page at its new location or “good enough” alternative page
• More sophisticated methods needed if unsuccessful so far
The Bigger Picture

REAL TIME!!!
All of step (3) has to run in real time:
    · identify dissimilar pages
    · extract titles
    · generate LSs
    · obtain tags
    · query search engines
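One possible reading of the flow chart is a cascade of three lookup stages, each followed by a check on whether the user accepts the results. The sketch below only illustrates that control flow. Every function and parameter name in it is hypothetical, and the stage implementations are passed in as callables because the slides do not specify them.

```python
from typing import Callable, List, Optional

# All names here are hypothetical; the slides do not name any functions.
Lookup = Callable[[str], List[str]]        # URL -> candidate copies or pages
Satisfied = Callable[[List[str]], bool]    # does the user accept the results?


def rediscover(url: str,
               query_caches: Lookup,
               query_by_title_ls_tags: Lookup,
               query_with_feedback: Lookup,
               user_is_satisfied: Satisfied) -> Optional[List[str]]:
    """Run stages (1)-(2), then (3)-(4), then (5)-(6); stop at the first success."""
    for stage in (query_caches, query_by_title_ls_tags, query_with_feedback):
        results = stage(url)
        if results and user_is_satisfied(results):
            return results  # DONE
    return None  # no stage produced results the user accepted
```

Passing the stages in as callables keeps the sketch independent of any particular search engine or archive API.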
Search Engine Queries

  •    Lexical signatures (LSs)

      •    Small set of terms capturing the “aboutness” of a document

      •    Generated following the TF-IDF scheme

      •    Phelps and Wilensky assumed ‘5’[Phelps00]

  •    We have shown that 5- and 7-term LSs perform best [Klein08]




[Phelps00] - T. A. Phelps and R. Wilensky “Robust Hyperlinks Cost Just Five Words Each”, Technical Report 2000
                                                                                                                 5
[Klein08] - M.Klein and M.L.Nelson ”Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008
Search Engine Queries

  •    Lexical signatures (LSs)

      •    Small set of terms capturing the “aboutness” of a document

      •    Generated following the TF-IDF scheme

      •    Phelps and Wilensky assumed ‘5’[Phelps00]

  •    We have shown that 5- and 7-term LSs perform best [Klein08]

       BUT:




[Phelps00] - T. A. Phelps and R. Wilensky “Robust Hyperlinks Cost Just Five Words Each”, Technical Report 2000
                                                                                                                 5
[Klein08] - M.Klein and M.L.Nelson ”Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008
Search Engine Queries

  •    Lexical signatures (LSs)

      •    Small set of terms capturing the “aboutness” of a document

      •    Generated following the TF-IDF scheme

      •    Phelps and Wilensky assumed ‘5’[Phelps00]

  •    We have shown that 5- and 7-term LSs perform best [Klein08]

       BUT:

  •    IDF can only be estimated when the entire web is the corpus



[Phelps00] - T. A. Phelps and R. Wilensky “Robust Hyperlinks Cost Just Five Words Each”, Technical Report 2000
                                                                                                                 5
[Klein08] - M.Klein and M.L.Nelson ”Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008
Search Engine Queries

  •    Lexical signatures (LSs)

      •    Small set of terms capturing the “aboutness” of a document

      •    Generated following the TF-IDF scheme

      •    Phelps and Wilensky assumed ‘5’[Phelps00]

  •    We have shown that 5- and 7-term LSs perform best [Klein08]

       BUT:

  •    IDF can only be estimated when the entire web is the corpus

  •    Expensive to generate

[Phelps00] - T. A. Phelps and R. Wilensky “Robust Hyperlinks Cost Just Five Words Each”, Technical Report 2000
                                                                                                                 5
[Klein08] - M.Klein and M.L.Nelson ”Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008
Search Engine Queries

• Lexical signatures (LSs)
  • Small set of terms capturing the “aboutness” of a document
  • Generated following the TF-IDF scheme
  • Phelps and Wilensky assumed 5 terms [Phelps00]
• We have shown that 5- and 7-term LSs perform best [Klein08]

BUT:

• IDF can only be estimated when the entire web is the corpus
• Expensive to generate

Alternative: web pages’ titles

[Phelps00] - T. A. Phelps and R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000.
[Klein08] - M. Klein and M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008.
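To make the TF-IDF step concrete, here is a minimal sketch of how a k-term lexical signature could be computed. It assumes a precomputed document-frequency table (`doc_freq`) and a corpus size (`corpus_size`), both hypothetical stand-ins for the web-scale estimates the slide says are hard to obtain. The weighting `tf * log(N / df)` is one common TF-IDF variant and not necessarily the one used in [Klein08].

```python
import math
import re
from collections import Counter
from typing import Dict, List


def lexical_signature(text: str, doc_freq: Dict[str, int],
                      corpus_size: int, k: int = 5) -> List[str]:
    """Return the k terms of `text` with the highest TF-IDF score."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)

    def score(term: str) -> float:
        # Assumption: a term missing from doc_freq occurs in one document.
        df = doc_freq.get(term, 1)
        return tf[term] * math.log(corpus_size / df)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]
```

The document-frequency lookup is the expensive part in practice: it requires statistics over the whole corpus, which is the cost the slide points to.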
Web Pages’ Titles

• Easier/cheaper to obtain than LSs
• High availability (only 1-2% of web pages have no title)
• Also capture the “aboutness” of a web page
• We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
• Investigate change of titles over time
  • General frequency of change
  • Degree of change as Levenshtein score
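As an illustration of why titles are cheap to obtain and how their change can be scored, the sketch below extracts a page's title with Python's standard `html.parser` and computes the Levenshtein (edit) distance between two titles. The helper names are hypothetical. The length-normalized score is an assumption added for illustration, since the slides do not say whether the score was normalized.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return " ".join(parser.title.split())  # collapse runs of whitespace


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_change(a: str, b: str) -> float:
    """Assumed normalization: edit distance divided by the longer title's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

With this normalization, 0.0 means the two titles are identical and 1.0 means every character position had to be edited.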
Dataset

• 6k URLs randomly sampled from DMOZ
• Parsed the pages and extracted up to three URLs referring to in-domain pages
• Filtered out:
  • Inaccessible pages
  • Pages not containing any links
  • Pages not in the .com, .net, .org or .edu domain
  • Pages without copies in the IA

Result: 1090 URLs and more than 100K observations
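Two of the dataset steps lend themselves to a short sketch: the top-level-domain filter and the extraction of up to three in-domain links. The function names are hypothetical, and "in-domain" is assumed here to mean "same hostname as the source page", which the slides do not define. The accessibility and Internet Archive checks need network access and are left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


def has_allowed_tld(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)


class LinkParser(HTMLParser):
    """Collects the href values of all <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_domain_links(page_url: str, html: str, limit: int = 3) -> List[str]:
    """Return up to `limit` distinct links that stay on the page's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).hostname
    found: List[str] = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).hostname == host and absolute not in found:
            found.append(absolute)
        if len(found) == limit:
            break
    return found
```

A page for which `in_domain_links` returns an empty list would fall under the "pages not containing any links" filter, at least as far as in-domain links are concerned.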
Dataset

Length = 1               Length = 2
foo.bar/                 foo.bar/bar/
foo.bar/index.html       foo.bar/bar/index.html
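The four example URLs suggest that a URL's length is one plus the number of directory segments in its path, with the file name ignored. The sketch below implements that reading. It is inferred from the examples only, and the function name `url_length` is hypothetical.

```python
from urllib.parse import urlparse


def url_length(url: str) -> int:
    """Length as suggested by the slide's examples: 1 + number of directories."""
    if "://" not in url:
        url = "http://" + url  # the examples omit the scheme
    path = urlparse(url).path
    # Everything before the last "/" is treated as directories. A final
    # segment without a trailing slash is treated as a file name (assumption).
    directories = [segment for segment in path.split("/")[:-1] if segment]
    return 1 + len(directories)
```

Applied to the slide's examples, `foo.bar/` and `foo.bar/index.html` give 1, while `foo.bar/bar/` and `foo.bar/bar/index.html` give 2.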
Frequency of Change

[Figure: Number of Changes and Observations in the IA. x-axis: URLs (0 to 1000); y-axis: Number of Changes/Observations (log scale, 1 to 10000); series: Number of Changes, Number of Observations; URLs ordered in increasing order by 1) observations, 2) changes]

• Generally low number of changes
• Max changes: 25
• Number of observations does not impact the number of changes
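The number of changes per URL can be read as the number of times the title differs between consecutive archived observations. A minimal sketch under that assumption follows. The input format, (timestamp, title) pairs with sortable timestamp strings, and the function name are hypothetical.

```python
from typing import Sequence, Tuple


def count_title_changes(observations: Sequence[Tuple[str, str]]) -> int:
    """Count how often the title differs from the previous observation.

    `observations` holds (timestamp, title) pairs; timestamps are assumed
    to sort chronologically as strings, e.g. "20030415".
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    titles = [title for _, title in ordered]
    return sum(1 for prev, curr in zip(titles, titles[1:]) if prev != curr)
```

Under this definition a URL with many observations but a stable title still counts zero changes, in line with the slide's remark that the number of observations does not drive the number of changes.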
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time


Investigating the Change of Web Pages’ Titles Over Time

  • 1. Investigating the Change of Web Pages’ Titles Over Time. Martin Klein and Michael L. Nelson, Old Dominion University, {mklein,mln}@cs.odu.edu. InDP 2009, Austin, TX, 06/19/2009
  • 7. The Environment: Web Infrastructure (WI) [McCown07]
      • Web search engines (Google, Yahoo!, MSN Live) and their caches
      • Research projects (CiteSeer)
      • Web archives (Internet Archive)
    [McCown07] F. McCown, “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
  • 13. The Bigger Picture (flowchart with boxes labeled (1) to (6))
      • Query for the URL in search engine caches and the Internet Archive; present results; DONE if the user is satisfied
      • If no results are found: identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines; present results; DONE if the user is satisfied
      • If still unsuccessful: include link neighborhood, relevance feedback, user interaction (request keywords, change number of terms in LS, add/delete term from LS, advanced search operators); present results; DONE
    Annotations:
      • System catches 404 “Page not found” errors
      • Discovers copy of missing page in WI and provides to user
      • Obtains further data about missing page (LS, title, tags) and feeds that back into WI
      • Provides page at its new location or “good enough” alternative page
      • More sophisticated methods needed if unsuccessful so far
  • 15. The Bigger Picture: the same flowchart, with “REAL TIME!!!” marked next to step (3) (identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines)
  • 20. Search Engine Queries
      • Lexical signatures (LSs)
          • Small set of terms capturing the “aboutness” of a document
          • Generated following the TF-IDF scheme
          • Phelps and Wilensky assumed ‘5’ [Phelps00]
          • We have shown that 5- and 7-term LSs perform best [Klein08]
      • BUT:
          • IDF can only be estimated when the entire web is the corpus
          • Expensive to generate
      • Web pages’ titles
    [Phelps00] T. A. Phelps and R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000.
    [Klein08] M. Klein and M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008.
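
The slide says LSs are generated following the TF-IDF scheme. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the tokenizer, the add-one smoothing, and the use of a small local corpus in place of web-scale document frequencies are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a small local list of documents standing in for the web
    (an assumption; the slides note that web-wide IDF can only be estimated).
    """
    doc_terms = Counter(tokenize(document))
    corpus_term_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_term_sets)

    scores = {}
    for term, tf in doc_terms.items():
        df = sum(1 for terms in corpus_term_sets if term in terms)
        # Add-one smoothing keeps IDF finite for terms unseen in the corpus.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = tf * idf

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Calling `lexical_signature(page_text, corpus, k=5)` or `k=7` would correspond to the 5- and 7-term signatures mentioned on the slide; a term that occurs in every corpus document scores zero and falls to the bottom of the ranking.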
  • 25. Web Pages’ Titles
      • Easier/cheaper to obtain than LSs
      • High availability (1-2% of web pages have no title)
      • Also capturing “aboutness” of a web page
      • We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
      • Investigate change of titles over time
          • General frequency of change
          • Degree of change as Levenshtein score
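
The slide measures the degree of title change as a Levenshtein score. A minimal sketch of the standard edit-distance computation follows. The `normalized_distance` helper, which scales the score by the longer title's length, is an assumption added for illustration; the slides do not say whether or how the score was normalized.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the edit distance to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest
```

Under this sketch, two observed titles that are identical score 0, and a complete rewrite of the title approaches 1 on the normalized scale.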
  • 27. Dataset
      • 6k URLs randomly sampled from DMOZ
      • Parsed the pages and extracted up to three URLs referencing in-domain pages
      • Applied filter for:
          • Inaccessible pages
          • Pages not containing any links
          • Pages not in the .com, .net, .org or .edu domain
          • Pages without copies in the IA
      • Result: 1090 URLs and more than 100K observations
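
Two of the dataset steps above can be sketched offline: extracting up to three in-domain links from a parsed page, and the top-level-domain filter. This is an illustrative sketch only. Reading "in-domain" as "same host", and the names `LinkCollector`, `has_allowed_tld` and `in_domain_links`, are assumptions; the accessibility and Internet Archive checks need network access and are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

ALLOWED_TLDS = {"com", "net", "org", "edu"}


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def has_allowed_tld(url):
    """True if the URL's host ends in .com, .net, .org or .edu."""
    host = urlsplit(url).hostname
    return bool(host) and host.rsplit(".", 1)[-1] in ALLOWED_TLDS


def in_domain_links(page_url, html, limit=3):
    """Return up to `limit` distinct absolute links on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).hostname
    selected = []
    for href in collector.hrefs:
        if len(selected) >= limit:
            break
        absolute = urljoin(page_url, href)
        if urlsplit(absolute).hostname == host and absolute not in selected:
            selected.append(absolute)
    return selected
```

A page with no `<a href>` at all yields an empty list, which matches the "pages not containing any links" filter in spirit.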
  • 28. Dataset
      • Length = 1: foo.bar/, foo.bar/index.html
      • Length = 2: foo.bar/bar/, foo.bar/bar/index.html
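
The "Length" examples on the Dataset slide are consistent with counting path levels, so that the site root and a file directly under it both have length 1. A small sketch under that reading follows; the function name `url_length` and the rule itself are inferred from the four examples, not stated in the slides.

```python
from urllib.parse import urlsplit


def url_length(url):
    """Count path levels: 1 for the site root, plus 1 per directory below it."""
    path = urlsplit(url).path or "/"  # treat a missing path as the root
    return path.count("/")
```

Paths without a trailing slash or file name (for example `http://foo.bar/bar`) are ambiguous under this rule and are counted here as length 1.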
  • 35. Frequency of Change
    Figure: Number of Changes and Observations in the IA, ordered in increasing order by 1) observations, 2) changes. Series: Number of Changes, Number of Observations. x-axis: URLs (0 to 1000); y-axis: Number of Changes/Observations (1 to 10000).
      • generally low number of changes
      • max changes: 25
      • number of observations does not impact the number of changes
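
The counts behind the figure can be sketched as follows, assuming each URL comes with a chronologically ordered list of the titles observed in the IA. Treating any inequality between consecutive titles as one change is an assumption, and the names `count_title_changes` and `summarize` are hypothetical.

```python
def count_title_changes(titles):
    """Count how often consecutive observations carry a different title."""
    return sum(1 for prev, cur in zip(titles, titles[1:]) if prev != cur)


def summarize(observations_by_url):
    """Return (url, n_observations, n_changes) rows in the figure's order.

    Rows are sorted by number of observations first, then by changes.
    """
    rows = [
        (url, len(titles), count_title_changes(titles))
        for url, titles in observations_by_url.items()
    ]
    return sorted(rows, key=lambda row: (row[1], row[2]))
```

Plotting the second and third columns of `summarize(...)` against the row index would give a chart of the same shape as the one described on the slide.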