Impact of HTTP Cookie Violations in Web Archives
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
@ibnesayeed @WebSciDL
Supported by NSF Grant IIS-1526700
WADL '19, June 6, 2019, Urbana-Champaign, Illinois

Cookies Are Why Your Archived Twitter Page Is Not in English
https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html

All Your Tweets Are Belong To Kannada
9,000+ mementos of @BarackObama
English: 53%
Kannada: 22%
Other 45 languages: 25%
https://blog.dshr.org/2018/04/all-your-tweets-are-belong-to-kannada.html

Is JavaScript Causing This?
Twitter seems to render translated phrases on the server, so JavaScript cannot be responsible.

Is Cache Conflicting at a Shared Proxy?
Twitter goes to great lengths (sometimes in the wrong ways) to ensure that its pages are not cached.

Is On-demand Archiving Bringing User Preferences In?
The Internet Archive (IA) replays users’ headers in Save Page Now, but other archives do not have on-demand archiving.
Archive.is sends a custom Accept-Language header, not the one a user’s browser sends to it.

Is Geo-location Affecting It?
Most archival crawlers run in the USA or Europe, which does not explain why Kannada (a regional Indian language) is so common.

Is Heritrix Sending Wrong Accept-Language Headers?
Heritrix-generated WARC files do not contain any Accept-Language header.
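
A check of this kind could be run over the request records of a WARC file. The sketch below works on the raw header text of a single request; the sample request and the helper name `request_header_names` are hypothetical and are not taken from an actual Heritrix capture.

```python
def request_header_names(raw_request: str) -> set:
    """Return the lower-cased header names found in a raw HTTP request."""
    # Keep only the header block (everything before the first blank line).
    head = raw_request.replace("\r\n", "\n").split("\n\n", 1)[0]
    names = set()
    for line in head.split("\n")[1:]:          # skip the request line
        if ":" in line:
            names.add(line.split(":", 1)[0].strip().lower())
    return names


# Hypothetical request block, shaped like the payload of a WARC "request" record.
SAMPLE_REQUEST = (
    "GET /phonedude_mln/ HTTP/1.1\r\n"
    "Host: twitter.com\r\n"
    "User-Agent: example-crawler\r\n"
    "Accept: text/html\r\n"
    "\r\n"
)

names = request_header_names(SAMPLE_REQUEST)
has_accept_language = "accept-language" in names
print(has_accept_language)
```

Header names are compared in lower case because HTTP header field names are case-insensitive.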

Language Content Negotiation in Twitter
The “?lang=<lang-code>” query parameter has the highest precedence.
Twitter honors the Accept-Language header for content negotiation, but does not advertise it in a Vary header.
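
The precedence described here can be sketched as a small selection function. This is a toy model, not Twitter's implementation: the position of the cookie between the query parameter and the Accept-Language header is an assumption, the `SUPPORTED` set is an illustrative subset, and q-values in Accept-Language are ignored for brevity.

```python
from urllib.parse import urlsplit, parse_qs

SUPPORTED = {"en", "ar", "kn", "pt", "ur"}     # illustrative subset of language codes


def pick_language(url, cookies, accept_language, default="en"):
    """Pick a response language: query parameter, then cookie, then header."""
    query = parse_qs(urlsplit(url).query)
    candidates = list(query.get("lang", []))           # 1. explicit ?lang=<lang-code>
    if "lang" in cookies:
        candidates.append(cookies["lang"])             # 2. cookie (assumed position)
    for part in (accept_language or "").split(","):
        tag = part.split(";", 1)[0].strip().lower()    # drop any q-value
        if tag:
            candidates.append(tag.split("-", 1)[0])    # 3. Accept-Language, as listed
    for code in candidates:
        if code in SUPPORTED:
            return code
    return default


print(pick_language("https://twitter.com/?lang=ar", {"lang": "kn"}, "en-US,en;q=0.9"))
print(pick_language("https://twitter.com/phonedude_mln/", {"lang": "kn"}, "en-US"))
print(pick_language("https://twitter.com/phonedude_mln/", {}, "pt-BR,pt;q=0.8"))
```

A server that negotiates on these inputs would normally list them in a `Vary` response header so that caches and archives know which request headers influenced the representation.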

Alternate Language Links Pollute Crawler’s Frontier Queue
Kannada (kn), being at the end of the list, causes its “lang” cookie to stick around for a long time, affecting many subsequent Twitter URLs.
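
The effect of link order can be illustrated with a toy crawl loop that shares one cookie jar across the whole frontier. The language list, the second profile URL, and the `crawl` helper are hypothetical; only the idea that the last alternate-language link leaves its cookie behind comes from the slide.

```python
from collections import deque
from urllib.parse import urlsplit, parse_qs


def crawl(frontier):
    """Visit URLs in FIFO order with one shared cookie jar; return (url, language) pairs."""
    jar = {}                         # cookies shared by the whole crawl session
    captured = []
    queue = deque(frontier)
    while queue:
        url = queue.popleft()
        lang = parse_qs(urlsplit(url).query).get("lang", [None])[0]
        if lang:
            jar["lang"] = lang       # mimics "Set-Cookie: lang=<code>" on the response
        captured.append((url, jar.get("lang", "en")))
    return captured


# Hypothetical frontier: alternate-language links first (kn last), then other pages.
alt_links = [f"https://twitter.com/?lang={code}" for code in ("fr", "de", "ar", "kn")]
pages = ["https://twitter.com/phonedude_mln/", "https://twitter.com/WebSciDL/"]
result = crawl(alt_links + pages)
for url, lang in result:
    print(lang, url)
```

In this model every page fetched after the last alternate-language link is captured in Kannada until another response resets the cookie.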

Experiment With Heritrix On Two Seed URIs
● https://twitter.com/?lang=ar
○ First request has an explicit lang query parameter
○ First response has a “Set-Cookie: lang=ar” header
● https://twitter.com/phonedude_mln/
○ Second request has no lang query parameter, but sends a “Cookie: lang=ar” header
○ Second response returns the page in Arabic
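
The header exchange in this experiment can be reproduced offline with a stand-in server. The sketch below is not Heritrix or Twitter code: `toy_server` and `crawl_seeds` are hypothetical helpers that only mimic the Set-Cookie and Cookie behavior listed above, using the standard-library `http.cookies` module.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit, parse_qs


def toy_server(url, cookie_header=""):
    """Stand-in for the origin server: returns (response_headers, page_language)."""
    lang = parse_qs(urlsplit(url).query).get("lang", [None])[0]
    headers = {}
    if lang:
        headers["Set-Cookie"] = f"lang={lang}"      # explicit parameter sets the cookie
    else:
        cookie = SimpleCookie(cookie_header)
        lang = cookie["lang"].value if "lang" in cookie else "en"
    return headers, lang


def crawl_seeds(seeds):
    """Fetch seeds in order, sending back any stored cookies like a cookie-aware crawler."""
    jar = SimpleCookie()
    log = []
    for url in seeds:
        sent = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
        headers, lang = toy_server(url, sent)
        if "Set-Cookie" in headers:
            jar.load(headers["Set-Cookie"])
        log.append((url, sent, lang))
    return log


log = crawl_seeds(["https://twitter.com/?lang=ar", "https://twitter.com/phonedude_mln/"])
for url, sent, lang in log:
    print(f"{url}  Cookie: {sent or '-'}  ->  {lang}")
```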

Replaying Captured WARC With PyWB
[Screenshots: PyWB replay of https://twitter.com/?lang=ar and https://twitter.com/phonedude_mln/]

Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages
https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html

Defaced Composite Mementos That Never Existed on the Live Web
● Live leakage (Zombies)
● Temporal Violations
● Origin Violations
And now, Cookie Violations!
https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html

Anatomy of a Twitter Timeline
● Page is loaded with the initial set of tweets
● Navigation bar is in the current language
● Some sidebar blocks are loaded lazily
● New tweets are polled every 30 seconds
● Global trends are polled every 5 minutes

Twitter Returns Server-side Rendered Markup
Cookies set by prior responses may impact subsequent XHR responses.
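
A minimal sketch of how this yields a mixed-language page: every fragment is rendered on the server from whatever the “lang” cookie holds at request time, so a cookie change between the initial load and a later XHR produces two languages in one composite. The `render` stub and the fragment names are hypothetical.

```python
def render(fragment, cookies):
    """Server-side rendering stub: the language comes from the 'lang' cookie."""
    lang = cookies.get("lang", "en")
    return f'<div class="{fragment}" lang="{lang}"></div>'


cookies = {"lang": "en"}                        # shared by every request to the site
page = [render("navigation", cookies), render("timeline", cookies)]

# Another response (another tab, or another URI in the same crawl) resets the cookie.
cookies["lang"] = "ur"

# A later XHR poll for new tweets is rendered with the updated cookie.
page.append(render("new-tweets", cookies))

languages = {fragment.split('lang="')[1].split('"')[0] for fragment in page}
print(sorted(languages))
```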

Pages With Explicit lang Parameter Are Consistent
?lang=pt
?lang=en
?lang=ur
Mementos with an explicit “lang” parameter are language-consistent.

Replicate Heritrix Behavior on the Live Web
● Load https://twitter.com/ in browser tab B
● Retweet a tweet in tab A
● Load https://twitter.com/?lang=en in browser tab A
● Expand notification in tab B
● Change lang param in tab A

What Can We Do About These Cookie Violations?
● Crawling
○ Sandbox short crawl sessions
○ Explicitly enforce a short cookie expiration time and garbage collect frequently
○ Identify such sources of cookie violations and filter them out
● Replay
○ Respect content negotiation headers (advertised in the “Vary” header)
○ Identify non-advertised cookies that affect the content and incorporate them in replay
○ Classify cookies into categories such as session, tracking, and configuration
Ignoring cookies in replay causes cookie violations and raises privacy concerns in personal archiving.
Blindly using cookies causes false positives (hurts discovery of archived resources).
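
The crawl-side suggestion of short cookie expiration with frequent garbage collection might look like the sketch below. `ShortLivedCookieJar` is a hypothetical class, not part of Heritrix or any other crawler, and the 60-second cap is an arbitrary value chosen for illustration. Time is passed in explicitly so the behavior is easy to follow.

```python
class ShortLivedCookieJar:
    """Cookie store that caps every cookie's lifetime, whatever the server asked for."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self._cookies = {}             # name -> (value, expiry timestamp)

    def set(self, name, value, now, requested_max_age=None):
        """Store a cookie, clamping the server-requested lifetime to the cap."""
        if requested_max_age is None:
            age = self.max_age
        else:
            age = min(requested_max_age, self.max_age)
        self._cookies[name] = (value, now + age)

    def collect_garbage(self, now):
        """Drop expired cookies and return how many were removed."""
        expired = [n for n, (_, expiry) in self._cookies.items() if expiry <= now]
        for name in expired:
            del self._cookies[name]
        return len(expired)

    def header(self, now):
        """Build the Cookie request header from cookies that are still valid."""
        self.collect_garbage(now)
        return "; ".join(f"{n}={v}" for n, (v, _) in self._cookies.items())


jar = ShortLivedCookieJar(max_age_seconds=60)    # arbitrary cap for illustration
jar.set("lang", "kn", now=0, requested_max_age=86400)
early = jar.header(now=30)      # cookie is still attached shortly after it was set
late = jar.header(now=120)      # cookie has expired, so the request goes out without it
print(repr(early), repr(late))
```

With a cap like this, a “lang” cookie picked up from one alternate-language link can only affect requests made within the cap, rather than the rest of the crawl.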

Conclusions
● Cookies Are Why Your Archived Twitter Page Is Not in English
○ https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
● Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages
○ https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
● Identified yet another source of bias in archives (over-represented languages)
● Described behavior of cookies in crawling and replay (cookie violations)
● Proposed some potential solutions like keeping cookies short-lived
● Described open problems that need more in-depth research