Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests


  1. Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests. Kritika Garg (1), Himarsha R. Jayanetti (1), Sawood Alam (2), Michele C. Weigle (1), and Michael L. Nelson (1). (1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk, VA, USA, @WebSciDL. (2) Wayback Machine, Internet Archive, San Francisco, California, USA, @internetarchive. The 24th International Conference on Asia-Pacific Digital Libraries (ICADL 2022), November 30. @kritika_garg @HimarshaJ @ibnesayeed @weiglemc @phonedude_mln
  2. Several types of web pages make repeated HTTP requests to the server for the latest/live updates: social media (https://twitter.com/home), live news (https://www.cbsnews.com/), radio (https://www.iheart.com/), and live sports scores (https://www.livesport.com/en/).
  3. Web archive rehosting the captured web page (memento): https://web.archive.org/web/20221115072418/https://oduwsdl.github.io/ The archive banner provides details of the capture; for example, this capture is from Nov 15, 2022. All the embeds and outlinked pages are also served from the web archive, for example https://web.archive.org/web/20221115072418im_/https://oduwsdl.github.io/img/bg-masthead.jpg
  4. Archived web page averaging 1098 requests per minute: https://arquivo.pt/wayback/20090628044051/http://www.radiocomercial.iol.pt/
  5. 1098 requests per minute reach the server because embedded resources are missing. Page: https://arquivo.pt/wayback/20090628044051/http://www.radiocomercial.iol.pt/ Resource shown on the slide: http://www.radiocomercial.iol.pt/styles/slideshow/loader-0.png (carousel with missing images). The following types of archived web pages are more likely to cause recurring requests:
     1. Web pages with image carousels, banners, widgets, etc.
     2. Web pages that require regular updates and poll the server periodically for them, for example sports score updates, stock market updates, news updates, chat applications, and social media feeds.
  6. Linear growth in the number of wasteful requests to the server by the radiocomercial.iol.pt memento. Figure: the cumulative number of requests per second by the radiocomercial.iol.pt memento. Growth is linear after the first 203 requests due to recurring requests (1098 requests/min).
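
The linear growth described on this slide can be approximated with a short calculation. This is a minimal sketch, not the authors' analysis: the function name `cumulative_requests` is hypothetical, and it assumes the first 203 requests all happen at time zero and that the recurring rate stays constant at 1098 requests per minute.

```python
def cumulative_requests(seconds, initial_requests=203, recurring_per_min=1098):
    """Estimate total requests after `seconds` of replay under a linear model.

    Assumption: the initial page-load requests are counted at t=0 and the
    recurring requests arrive at a constant rate afterwards.
    """
    return initial_requests + recurring_per_min * seconds / 60


# Print the estimate for a few replay durations.
for minutes in (1, 10, 60):
    total = cumulative_requests(minutes * 60)
    print(f"{minutes:>3} min -> about {total:,.0f} requests")
```

Under these assumptions, leaving the memento open in a single browser tab for an hour sends tens of thousands of requests to the archive, nearly all of them for resources that will never be found.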
  7. Archived web pages at esdica.pt with a missing banner, averaging 400 requests per minute: https://arquivo.pt/wayback/20131105211447/http://esdica.pt/
  8. The archived web page of livesport.com polls for regular feeds, which causes unnecessary recurring requests: https://web.archive.org/web/20210901092755/https://www.livesport.com/en/
  9. Some archives may patch missing resources by archiving the resource from the live web: https://web.archive.org/web/20221122230303/https://edition.cnn.com/ The slide shows a resource being archived by requesting the live web, and then the resource successfully archived.
  10. Patching the archive from the live web creates unnecessary writes and reads: https://web.archive.org/web/20100822133654/http://www.radiocomercial.iol.pt/ Here the resource is missing, and archiving it is unsuccessful because the resource does not exist on the live web.
  11. Minimal reproducible example (MRE): carousel. https://kritikagarg.github.io/Unnecessary-Archival-Replay-Requests/MREcarousel_working.html
  12. MRE with missing embedded resources, averaging 174 requests/min: https://kritikagarg.github.io/Unnecessary-Archival-Replay-Requests/MREcarousel.html
  13. Archived carousel example making recurring requests to the server due to missing resources. We archived this carousel example locally using pywb.
  14. Avoid recurring requests using the Cache-Control HTTP response header. The Cache-Control HTTP header field is used to specify directives for caching mechanisms in both requests and responses (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control).
      ● public: The response may be cached by any cache, even if the response would normally be non-cacheable.
      ● max-age: The cached response remains fresh for N seconds.
      Exchange between the client and the web archive server:
      ● First request for the archived resource: GET /1.jpg HTTP/1.1
      ● Response sent from the server: HTTP/1.1 404 Not Found, with Cache-Control: public, max-age=600
      ● Recurring request within the next 600 s: GET /1.jpg HTTP/1.1
      ● Response sent from the cache: HTTP/1.1 404 Not Found
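
The freshness rule behind this exchange can be sketched in a few lines. This is an illustrative simplification rather than a full HTTP cache: the helper names `parse_max_age` and `is_fresh` are hypothetical, only the `max-age` directive is interpreted, and the age of a stored response is taken to be the time elapsed since it was stored.

```python
def parse_max_age(cache_control):
    """Return max-age in seconds from a Cache-Control value, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def is_fresh(stored_at, now, cache_control):
    """True if a response stored at `stored_at` may still be reused at `now`.

    Simplification: the response's age is just `now - stored_at` (in seconds).
    """
    max_age = parse_max_age(cache_control)
    return max_age is not None and (now - stored_at) < max_age


header = "public, max-age=600"
print(is_fresh(stored_at=0, now=30, cache_control=header))   # reused from cache
print(is_fresh(stored_at=0, now=600, cache_control=header))  # stale, refetch
```

With the header from the slide, a 404 stored at time zero answers every repeat of `GET /1.jpg` from the cache until its age reaches 600 seconds, after which one new request goes to the server.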
  15. Avoid recurring requests for missing resources in the MRE by caching HTTP 404 responses. We used an Nginx proxy server to set the Cache-Control HTTP response header so that HTTP 404 responses are cached.
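
The slides do not reproduce the Nginx configuration, so the sketch below is a stand-in and not the authors' setup. It shows the same idea (a layer in front of the replay server that adds Cache-Control to 404 responses) as a small Python WSGI wrapper. The names `cache_404_middleware` and `always_missing` are hypothetical, and the 600-second default simply mirrors the value used in the slides.

```python
def cache_404_middleware(app, max_age=600):
    """Wrap a WSGI app so that its 404 responses carry a Cache-Control header."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            if status.startswith("404"):
                # Drop any upstream Cache-Control value, then add ours.
                headers = [(k, v) for k, v in headers if k.lower() != "cache-control"]
                headers.append(("Cache-Control", f"public, max-age={max_age}"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def always_missing(environ, start_response):
    """Hypothetical upstream replay app in which every resource is missing."""
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]


# The wrapped app behaves like the upstream app, plus the caching header on 404s.
app = cache_404_middleware(always_missing)
```

Responses with any other status pass through unchanged, so only the missing resources are affected by the added header.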
  16. Optimizing the replay using the Cache-Control HTTP response header.
  17. Halt in growth of recurring requests by the MRE to the server after caching. Figure: the cumulative number of requests per second by the MRE memento before and after caching 404 responses. Before caching, growth is linear after the first seven requests due to recurring requests (174 requests/min). After caching 404 responses, there are 0 recurring requests per second: no new requests are sent to the server until the max-age value times out.
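
The before-and-after behaviour on this slide can be imitated with a toy simulation. This is a sketch under stated assumptions, not the authors' measurement: the function `server_requests` is hypothetical, it models a single missing resource that the page re-requests at a fixed interval, and it assumes the browser cache honours max-age exactly.

```python
def server_requests(duration_s, interval_s, max_age_s=None):
    """Count requests that reach the server during `duration_s` seconds.

    One missing resource is re-requested every `interval_s` seconds.
    If `max_age_s` is given, a cached 404 is reused while its age is below it;
    if it is None, nothing is cached and every request reaches the server.
    """
    hits = 0
    cached_at = None
    i = 0
    while i * interval_s < duration_s:
        now = i * interval_s
        fresh = (
            max_age_s is not None
            and cached_at is not None
            and now - cached_at < max_age_s
        )
        if not fresh:
            hits += 1        # the request goes to the server
            cached_at = now  # the 404 response is stored at this time
        i += 1
    return hits


# Twenty minutes of replay, one request per second (illustrative numbers).
print("without caching:", server_requests(1200, 1))
print("with max-age=600:", server_requests(1200, 1, max_age_s=600))
```

In this model the uncached count grows with every tick, while the cached count rises by one only when the stored 404 expires, which matches the flat line reported after caching.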
  18. No recurring wasteful requests to the server by the radiocomercial.iol.pt memento after caching. Figure: the cumulative number of requests per second by the radiocomercial.iol.pt memento before caching (red line) and the anticipated number after caching 404 responses (blue line). Before caching, growth is linear after the first 203 requests due to recurring requests (1098 requests/min). The blue line shows the anticipated rate of recurring requests after caching 404 responses (until the max-age value times out). Arquivo.pt has implemented this solution: they have added a Cache-Control HTTP response header to cache HTTP 404 responses.
  19. Summary: use Cache-Control response headers.
      ● Replaying an archived web page with carousels, widgets, etc. should not cause ~1000 requests/min to the web archive!
      ● Web archives that try to patch 404s from the live web may cause even more unnecessary traffic (reads + writes) to the web archive.
      ● We demonstrated that these requests can be mitigated by sending the 404 responses with:
        ○ Cache-Control: public, max-age=600
