TrendMachine:
Temporal Resilience of Web Pages
@WaybackMachine
IIPC Web Archiving Conference (WAC), May 03, 2023, Online
Sawood Alam (Internet Archive)
Mark Graham (Internet Archive)
Kritika Garg (Old Dominion University)
Michele C. Weigle (Old Dominion University)
Michael L. Nelson (Old Dominion University)
Dietrich Ayala (Protocol Labs)
@WebSciDL @ProtocolLabs
Supported in part by Protocol Labs and Filecoin Foundation
TrendMachine: Temporal Resilience of Web Pages | IIPC WAC 2023 | Sawood Alam <@ibnesayeed> 2
Research Question
How healthy has a web page been throughout its lifetime?
Temporal and Spatial Landscape of Archival Analysis
Long Duration, Single Webpage
● TMVis
● Wayback Machine Changes
● TrendMachine
Long Duration, Webpage Collection
● MementoMap
● CDX Summary
● Archives Unleashed Toolkit
Short Duration, Single Webpage
● Memento Damage
● Archival ACID Test
● Reconstructive
Short Duration, Webpage Collection
● Warrick
● Wayback Machine Downloader
● Video Archiving Insights
Modeling Web Page Health: Linear vs. S-Curve
Sigmoid Function for Web Page Resilience
Spread: How far up or down the value can go from its starting position?
Shift: How soon any significant change in the value can begin?
Slope: How quickly the value reaches close to the maximum change?
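The three parameters can be made concrete with a generalized logistic curve. The sketch below is only an illustration under assumptions: the slide text does not give the exact formula, and the function name `sigmoid_health` and its parameter names and defaults are hypothetical.

```python
import math


def sigmoid_health(t, start=0.5, spread=0.5, shift=5.0, slope=1.0):
    """Illustrative S-curve (hypothetical parameterization, not from the slides).

    start:  value before any change has accumulated
    spread: maximum distance the value can move from start
            (a negative spread makes the curve fall instead of rise)
    shift:  position on the t axis where half of the change has happened
    slope:  steepness of the transition around the shift point
    """
    return start + spread / (1.0 + math.exp(-slope * (t - shift)))


if __name__ == "__main__":
    # Print the curve at the start, at the shift point, and well past it.
    for t in (0, 5, 10):
        print(t, round(sigmoid_health(t), 3))
```

A larger `shift` delays the onset of change, a larger `slope` compresses the transition into fewer steps, and `spread` bounds how far the value can travel in total.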
TrendMachine: Composite Sigmoid Parameters of Resilience
TrendMachine: Overview
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
TrendMachine: Temporal Distribution of Archiving Activities
The page is archived as few as one or zero times and as many as tens of thousands of times in a single day.
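A per-day distribution like this can be derived by bucketing capture timestamps by date. The sketch below assumes 14-digit `YYYYMMDDhhmmss` timestamps, as used in Wayback Machine capture URLs; the function name `captures_per_day` is hypothetical.

```python
from collections import Counter


def captures_per_day(timestamps):
    """Count captures per calendar day from 14-digit YYYYMMDDhhmmss strings."""
    # The first eight characters are the YYYYMMDD date of the capture.
    return Counter(ts[:8] for ts in timestamps)


if __name__ == "__main__":
    sample = ["20230503120000", "20230503130000", "20230504000000"]
    for day, count in sorted(captures_per_day(sample).items()):
        print(day, count)
```

Days that never appear as a key had zero captures, which is what makes a filling policy necessary later on.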
Specimen Selection Algorithm
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]
FOREACH st OF PRIORITY
    IF st IN statuses(day)
        specimen = statuses(day).match(st)[0]
        BREAK
DAY1 DAY2 DAY3 DAY4
4xx 3xx 5xx 3xx
3xx 3xx 3xx 5xx
2xx 3xx 5xx 3xx
5xx 4xx 5xx
2xx 4xx
A 3xx specimen usually suggests that the URL is redirecting to somewhere other than a variation of the same URL.
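The pseudocode above translates almost directly into Python. In this sketch the helper names `status_class` and `select_specimen` are hypothetical, and the input is assumed to be one day's numeric status codes in capture order. Because 3xx comes last in the priority list, a 3xx specimen is chosen only when the day has no 2xx, 4xx, or 5xx capture at all.

```python
PRIORITY = ["2xx", "4xx", "5xx", "3xx"]


def status_class(code):
    """Map a numeric HTTP status such as 301 to its class label, e.g. '3xx'."""
    return f"{int(code) // 100}xx"


def select_specimen(day_statuses):
    """Return the first capture of the highest-priority class seen that day.

    day_statuses: status codes of one day's captures, in capture order.
    Returns None when no capture falls into a listed class.
    """
    for st in PRIORITY:
        for code in day_statuses:
            if status_class(code) == st:
                return code
    return None


if __name__ == "__main__":
    print(select_specimen([404, 301, 200, 503, 200]))
    print(select_specimen([301, 302]))
```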
Filling Missing Observations
Policy DAY1 DAY2 DAY3 DAY4 DAY5 DAY6
Identical 2xx 2xx 2xx 4xx 2xx
Closest 2xx 2xx 2xx 4xx 4xx 2xx
Forward 2xx 2xx 2xx 2xx 4xx 2xx
Backward 2xx 2xx 4xx 4xx 4xx 2xx
ANY 2xx 2xx
Do not fill the gap if the status codes before and after are not identical.
Do not fill the gap if it is larger than a configured threshold.
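One possible reading of these policies is sketched below. The interpretation of each policy name, the tie-breaking rule for "closest", the function name `fill_gaps`, and the `max_gap` parameter are all assumptions for illustration, not the tool's confirmed behavior. Missing days are represented as `None`.

```python
def fill_gaps(days, policy="identical", max_gap=None):
    """Fill None entries in a list of daily specimens.

    Assumed meaning of each policy:
      forward   - repeat the last observation before the gap
      backward  - repeat the first observation after the gap
      closest   - use whichever neighbor is nearer (ties go to the earlier one)
      identical - fill only when both neighbors carry the same value
    Gaps longer than max_gap, and gaps at either end, are left unfilled.
    """
    if policy not in ("forward", "backward", "closest", "identical"):
        raise ValueError(f"unknown policy: {policy}")
    filled = list(days)
    i, n = 0, len(days)
    while i < n:
        if days[i] is not None:
            i += 1
            continue
        j = i
        while j < n and days[j] is None:
            j += 1
        # The gap spans indices i .. j-1.
        has_neighbors = i > 0 and j < n
        short_enough = max_gap is None or (j - i) <= max_gap
        if has_neighbors and short_enough:
            before, after = days[i - 1], days[j]
            for k in range(i, j):
                if policy == "forward":
                    filled[k] = before
                elif policy == "backward":
                    filled[k] = after
                elif policy == "closest":
                    filled[k] = before if (k - (i - 1)) <= (j - k) else after
                elif before == after:  # policy == "identical"
                    filled[k] = before
        i = j
    return filled


if __name__ == "__main__":
    observed = ["2xx", "2xx", None, None, "4xx", "2xx"]
    for name in ("identical", "closest", "forward", "backward"):
        print(name, fill_gaps(observed, policy=name))
```

The sample input is an assumed set of observations with a two-day gap between a 2xx and a 4xx; under "identical" that gap stays empty because its neighbors differ.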
TrendMachine: TimeMap Status Codes vs. Daily Specimens
Most of the self-redirect 3xx observations (HTTP/HTTPS or WWW/Apex domain) are eliminated in daily specimens.
About one third of the days since the first observation have no captures, of which some are filled using a filling policy.
TrendMachine: Resilience
● The Resilience score is calculated using a Sigmoid function on the status codes of daily specimens
● The score starts at an initial value of 0.5 and is normalized between 0 and 1
● After the first few observations, the Wayback Machine did not archive the page for several months in 2002
● Towards the end of 2002, the Resilience score went up slowly due to infrequent archiving
● In 2003 “wikipedia.org” started to redirect to “en.wikipedia.org”
● After 2005, Resilience of the Wikipedia home page has mostly been stable and high
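To make the idea of a score that starts at 0.5 and stays between 0 and 1 concrete, here is a deliberately simplified sketch. It is an assumed model, not the formula used by TrendMachine: a running tally moves up on 2xx days and down otherwise, and a logistic function maps the tally to a score. The names `resilience_series`, `step`, and `limit` are hypothetical.

```python
import math


def resilience_series(specimens, step=0.5, limit=10.0):
    """Return one score per daily specimen, each strictly between 0 and 1.

    Assumed model: a running tally x starts at 0, moves up by `step` on a
    2xx day and down by `step` on any other day, and is clamped to
    [-limit, limit]. The score is logistic(x), so the value before any
    observation would be exactly 0.5.
    """
    x = 0.0
    scores = []
    for status in specimens:
        x += step if status == "2xx" else -step
        x = max(-limit, min(limit, x))  # bound the memory of past days
        scores.append(1.0 / (1.0 + math.exp(-x)))
    return scores


if __name__ == "__main__":
    history = ["2xx", "2xx", "3xx", "3xx", "3xx", "2xx", "2xx", "2xx"]
    print([round(s, 3) for s in resilience_series(history)])
```

In this toy model a run of redirects pulls the score below 0.5 and a later run of 2xx days pulls it back up, which mirrors the dip-and-recovery shape described for the Wikipedia home page.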
TrendMachine: Fixity
● The Fixity score (normalized) is calculated using a Sigmoid function on the content digests of daily specimens
● The content digest reported in CDX can be sensitive to Content-Encoding, resulting in false alarms even when the underlying content remains unchanged
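The Content-Encoding sensitivity is easy to reproduce. The sketch below assumes a Base32-encoded SHA-1 digest, the form commonly seen in CDX records, computed over the bytes as transferred; the helper name `sha1_base32` and the sample body are made up for illustration.

```python
import base64
import gzip
import hashlib


def sha1_base32(payload: bytes) -> str:
    """Base32-encoded SHA-1 digest of a byte string."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


# The same page body, once as-is and once as it would be transferred
# with Content-Encoding: gzip.
body = b"<html><body>Hello, archive!</body></html>"
encoded = gzip.compress(body)

digest_identity = sha1_base32(body)
digest_gzip = sha1_base32(encoded)

if __name__ == "__main__":
    print(digest_identity)
    print(digest_gzip)
    print("content identical after decoding:", gzip.decompress(encoded) == body)
```

The two digests differ although decoding yields the same bytes, so a digest-based Fixity signal would register a change where the page content did not change.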
TrendMachine: Chaos
● Chaos score (normalized) is calculated using a Run-Length Encoding inspired technique on all status codes of the CDX data in which consecutive duplicates are removed in the numerator
● An alternate sliding-window calculation is performed on the last N observations as the score becomes insensitive to recent changes on large TimeMaps
● A high Chaos along with a high Resilience is often an indication of canonical redirects (e.g., adoption of HTTPS and/or consolidation of WWW and Apex domain)
\[
\text{Chaos} = \frac{|\,\text{2xx, 3xx, 2xx}\,|}{|\,\text{2xx, 2xx, 2xx, 3xx, 3xx, 2xx}\,|} = \frac{3}{6} = 0.5
\]
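The ratio can be computed by counting runs of identical consecutive status codes and dividing by the number of observations. In this sketch the function name `chaos`, the `window` parameter, and the value returned for empty input are assumptions; any further normalization the tool applies is not shown.

```python
from itertools import groupby


def chaos(statuses, window=None):
    """Ratio of runs of identical consecutive statuses to total observations.

    window: optional positive int; when given, only the last `window`
    observations are considered (the sliding-window variant).
    Returns 0.0 for empty input, which is an assumption of this sketch.
    """
    if window:
        statuses = statuses[-window:]
    if not statuses:
        return 0.0
    # groupby yields one group per run of equal consecutive items.
    runs = sum(1 for _ in groupby(statuses))
    return runs / len(statuses)


if __name__ == "__main__":
    observed = ["2xx", "2xx", "2xx", "3xx", "3xx", "2xx"]
    print(chaos(observed))
    print(chaos(observed, window=3))
```

On a long TimeMap the denominator grows with every capture, so a recent burst of changes barely moves the overall ratio; restricting the input to the last N observations keeps the score responsive.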
TrendMachine: Status Code Transitions
● Large numbers along the major diagonal indicate status code stability for extended periods of time
● Large numbers in non-diagonal cells suggest frequent changes in Resilience curve
● Web pages with high Resilience score for extended periods usually exhibit large numbers in the top-left cell (2xx -> 2xx)
● A large number in the 3xx -> 3xx cell usually indicates extended periods of redirection to other URLs (e.g., URL restructuring, login wall, domain change, and parked domain)
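A transition table of this kind can be built by counting consecutive pairs of status classes. The sketch below is illustrative: the name `transition_matrix` is hypothetical, and the row and column order beyond the 2xx -> 2xx top-left cell is an assumption.

```python
from collections import Counter

CLASSES = ["2xx", "3xx", "4xx", "5xx"]


def transition_matrix(specimens):
    """Count transitions between consecutive status classes.

    specimens: list of class labels in chronological order.
    Returns a list of rows; rows are the 'from' class and columns the 'to'
    class, both in CLASSES order, so cell [0][0] is 2xx -> 2xx.
    """
    counts = Counter(zip(specimens, specimens[1:]))
    return [[counts[(a, b)] for b in CLASSES] for a in CLASSES]


if __name__ == "__main__":
    history = ["2xx", "2xx", "3xx", "3xx", "3xx", "2xx"]
    for label, row in zip(CLASSES, transition_matrix(history)):
        print(label, row)
```

Counts on the diagonal correspond to days where the status class did not change, while off-diagonal counts mark the days on which it flipped.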
TrendMachine: Compare First and Last Mementos
TrendMachine: Live Web Page With Headers
Potential Use Cases
● Detect points of interest in a large TimeMap
● Sample captures/mementos from TimeMaps for visual summarization
● Detect archival sinks (like login pages, paywalls, and misconfigured redirects)
● Detect poor-quality pages like Soft-404 and parked domains
● Detect potential link rot (and fix it when possible, as in a wiki page)
● Optimize crawl jobs by minimizing wasteful downloads and maximizing coverage
● Archival quality assurance
● Cluster pages of a large archival collection in different categories
Future Work
● Report heuristics-based archival summary by combining various scores
● Report/embed captures/mementos that can be points of interest
● Calculate Fixity using less-sensitive digests (e.g., SimHash)
● Calculate Chaos after applying convolutions to smooth out alternating changes
● Allow alternate web page health models (not just Sigmoid functions)
● Deploy in production by integrating with Wayback Machine
Summary
Code: https://github.com/internetarchive/trendmachine
Demo: https://trendmachine.sawood-dev.us.archive.org/
● A mathematical model to quantify temporal health of a web page
● Resilience, Fixity, Chaos, Distributions, Transitions, etc. reports
● An interactive portal with configuration options for experiments
● An evolving open-source codebase and demo deployment
