Web Archives at the Nexus of Good Fakes and Flawed Originals

Michael L. Nelson

Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln

With:
ODU: Michele C. Weigle, John Berlin, Mohamed Aturban, Justin Whitlock
LANL: Martin Klein, DANS: Herbert Van de Sompel

CNI Spring 2019 Membership Meeting, 2019-04-09,
@phonedude_mln, @WebSciDL

  1. Web Archives at the Nexus of Good Fakes and Flawed Originals. Michael L. Nelson, Old Dominion University, Web Science & Digital Libraries Research Group, @WebSciDL, @phonedude_mln. With: ODU: Michele C. Weigle, John Berlin, Mohamed Aturban, Justin Whitlock; LANL: Martin Klein; DANS: Herbert Van de Sompel. Supported in part by The Andrew Mellon Foundation and the National Science Foundation.
  2. "You’re in a desert walking along in the sand when all of the sudden you look down, and you see a tortoise..."
  3. Blade Runner: film (1982), source novel (1968). https://en.wikipedia.org/wiki/Blade_Runner National Film Registry Induction, 1993: https://www.loc.gov/loc/lcib/94/9405/film.html http://www.loc.gov/static/programs/national-film-preservation-board/documents/blade_runner.pdf
  4. https://www.youtube.com/watch?v=LwDdP88Dr54
  5. https://www.youtube.com/watch?v=LwDdP88Dr54
  6. We’re not going to review Ridley Scott’s (RS) or Philip K. Dick’s (PKD) predictions. https://www.cnn.com/2018/12/28/movies/blade-runner-predictions-2019-trnd/ https://twentytwowords.com/blade-runner-was-set-in-2019/ https://nwn.blogs.com/nwn/2019/01/blade-runner-los-angeles-2019.html https://www.theregister.co.uk/2019/01/01/blade_runner_today/
  7. Common themes in the works of Philip K. Dick: • identity • self vs. the other • memory • humanity • authenticity • reality vs. simulacra • unreliable narrator
  8. Blade Runner in 239 characters
  9. Voight-Kampff Test: distinguishing authentic (humans) vs. fake (replicants). https://www.youtube.com/watch?v=ic0PuvJbdu0 "You’re in a desert walking along in the sand when all of the sudden you look down, and you see a tortoise. You reach down, you flip the tortoise over on its back. The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can’t, not without your help. But you’re not helping. Why is that?"
  10. Robots indistinguishable from humans, off-world slaves, perpetually “dark and stormy” Los Angeles – all good cyberpunk sci-fi tropes – but that’s not our 2019, right?
  11. "The future is already here – it's just not evenly distributed." -- William Gibson (yes, I’m mixing sci-fi authors) https://twitter.com/badnetworker/status/1093864777179430912 https://geekologie.com/2018/02/boston-dynamics-tests-door-opening-robot.php
  12. “So when do we get to that part about web archiving?”
  13. Web archives are science fiction. Web archives are enabling a reality, as foreseen by PKD and other sci-fi authors, where we can insert bespoke fakes into our collective memory.
  14. Web archives are like science fiction because they’re a paradox: we need a significant and continuous technology investment today to be able to say a page “used to look like this.”
  15. Web archiving is not file backup. Backup = prevent, detect, and repair changes. Web archiving = continuous change to better simulate the past. Web archiving is a simulacrum of the past.
  16. The essence of a web archive is to modify its holdings. https://web.archive.org/web/19971211010502/https://www.cni.org/ Rewrite links so they point back into the archive. Provide an archival metadata banner (what, when, how many). Relatively simple for the Web of 1997. Today, it’s not so easy.
  17. Some modifications make yesterday’s formats safe for, or available to, today’s browsers. http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html Cf. https://techcrunch.com/2017/07/25/get-ready-to-say-goodbye-to-flash-in-2020/ http://web.archive.org/web/20100605013233/http://www.youtube.com/watch?v=1aPPSIDr3Mc&feature=player_embedded/
  18. Web archive software is continuously evolving, in part to better realize a more authentic version of the past. https://github.com/internetarchive/wayback/releases https://github.com/webrecorder/pywb/releases
  19. Evidentiary use of “screenshots” of archived pages, United States v. Gasperini, No. 17-2479 (2d Cir. 2018): "...the government presented testimony from the office manager of the Internet Archive, who explained how the Archive captures and preserves evidence of the contents of the internet at a given time. The witness also compared the screenshots sought to be admitted with true and accurate copies of the same websites maintained in the Internet Archive, and testified that the screenshots were authentic and accurate copies of the Archive’s records. Based on this testimony, the district court found that the screenshots had been sufficiently authenticated." https://law.justia.com/cases/federal/appellate-courts/ca2/17-2479/17-2479-2018-07-02.html
  20. Evidentiary use of “screenshots” of archived pages, United States v. Gasperini, No. 17-2479 (2d Cir. 2018): screenshots matching IA’s records are not the same thing as IA’s records matching the past…
  21. So why is it so hard to recreate the past? If we just had isolated, static pages (jpegs, pdfs, mp3s, etc.) then there’d be no problem.
  22. Real HTML pages are complex: links, Javascript (modifying the page), and embedded resources (possibly including other HTML pages via iframes).
  23. Javascript is why we can’t have nice (archival) things
  24. Load the archived page, get an eagle https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  25. Hit “reload”, get a tiger https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  26. Hit “reload” again, get a mountain https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  27. “Look on my Javascript, ye Mighty, and despair!”
  28. Actually, the fws.gov example was super easy; most changes are much harder to trace. Mohamed Aturban, unpublished; memento: http://web.archive.org/web/20130724144801/http://www.cnn.com/ Animated GIF: https://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html
  29. Embedded resources + Javascript = our simulation of what CNN.com looked like then is flawed. It will never be 2013 again, so in some sense that page is lost.
  30. Zombies: live web “leaking” into an archived page. http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html Screenshot annotations: this page is from 2008; this ad is from 2012 (when this screenshot was taken). As of late 2017, zombies mostly no longer occur. https://blog.dshr.org/2017/09/attacking-users-of-wayback-machine.html
  31. Temporal violations: reconstructing legitimately archived resources into a page that never existed. http://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html Example: the text (2004-12) says rain, the image (2005-09) is clear.
  32. Incorrectly replaying the 2004 weather forecast for Varina, Iowa is hardly the stuff of dystopian cyberpunk. There are cases where temporal violations begin to look like tampering…
  33. Remember the case of Joy Reid’s blog? https://www.odu.edu/news/2018/5/michael_nelson https://twitter.com/DrDanetteAllen/status/990228054952865793
  34. https://twitter.com/phonedude_mln/status/990054945457147904 HTML archived on 2006-01-11; JS archived on 2006-02-07. Reid was a prolific blogger, so a gap of nearly a month is catastrophic for temporal integrity.
  35. Not always Javascript: cookies cause the web archive to store the Urdu-language page at the URL for the English page. https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
  36. Cookies + Javascript = a combo Urdu / Portuguese / English page that never existed. https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
  37. Web archives are unreliable narrators. Unreliable narrators cause us to question everything we’ve been told.
  38. Let’s prove Lester Holt did not “fudge the tape”! https://twitter.com/AaronBlake/status/1035124642456002565 https://twitter.com/realDonaldTrump/status/1035120511259500544 https://news.vice.com/en_us/article/ne5x3d/trump-lester-holt-james-comey-nbc
  39. The May 2017 NBC interview was not archived until August 2018 (and even then, the video itself is not archived). https://www.nbcnews.com/nightly-news/video/pres-trump-s-extended-exclusive-interview-with-lester-holt-at-the-white-house-941854787582?v=raila https://web.archive.org/web/*/https://www.nbcnews.com/nightly-news/video/pres-trump-s-extended-exclusive-interview-with-lester-holt-at-the-white-house-941854787582?v=raila https://web.archive.org/web/20180825094239/https://www.nbcnews.com/nightly-news/video/pres-trump-s-extended-exclusive-interview-with-lester-holt-at-the-white-house-941854787582?v=raila Clicking through to the video reveals a loop of a postal carrier slipping on ice, not the Lester Holt interview.
  40. Errors in crawling and playback are hard to distinguish from tampering. https://twitter.com/katestarbird/status/911257133231910913 https://er.educause.edu/articles/2018/10/managing-the-cultural-record-in-the-information-warfare-era "I want to explicitly note here the difference between the act of quietly rewriting the record and enjoying the results of the rewrites that are accepted as truth and that of deliberately destroying the confidence of the public (including the scholarly community) by creating compromise, confusion, and ambiguity to suggest that the record cannot be trusted."
  41. Disinformation applied to web archives doesn’t necessarily mean you have to insert a specific narrative into the archive. You just need to cast doubt on the archive as our collective memory.
  42. We’re unaware of any cases where web archive content has been hacked or faked for any substantive goal. However, web archives are not immune. It’s just that the theater of conflict has yet to expand to include web archives.
  43. Twitter then and now http://inventorspot.com/articles/top_ten_twitterati_tweet_above_rest_31806 https://www.vox.com/policy-and-politics/2017/10/19/16504510/ten-gop-twitter-russia
  44. Facebook then and now https://twitter.com/Pinboard/status/975013825010458624 https://web.archive.org/web/20090722095954/http://facebook.com/zuck See also: https://www.businessinsider.com/facebook-old-posts-mark-zuckerberg-disappeared-2019-3
  45. Gmail then and now http://googlepress.blogspot.com/2004/04/google-gets-message-launches-gmail.html https://www.avanan.com/resources/gmail-exploit-allows-dnc-email-attack
  46. Web archives then and soon https://web.archive.org/web/20020601134105/http://www.businessweek.com/technology/content/feb2002/tc20020228_1080.htm ?
  47. Why do we expect things to be different for web archives? Our trust model for web archives is still rooted in the 1980s / early 90s.
  48. My chronology with Unix. Late 80s: 1 computer, many users (used an X terminal to access Cray, Convex supercomputers). 90s: 1 computer, 1 user (my Sun IPX workstation was the first www.larc.nasa.gov). Now: many computers, 1 user (I’m not even sure how many computers I have access to).
  49. TMC->WAIS Inc->AOL->Alexa->IA https://twitter.com/phonedude_mln/status/1105160308866338816

      From brewster@wais.com Sun Apr 25 00:03:19 1993
      Received: from express.larc.nasa.gov by blearg.larc.nasa.gov with SMTP (5.65.2/server2.4) id AA28277; Sun, 25 Apr 93 00:00:26 -0400
      Received: from wais.wais.com by express.larc.nasa.gov with SMTP id BA21157 (SMTP/Lite-1.15) for <m.l.nelson@larc.nasa.gov>; Sun, 25 Apr 93 00:00:20 -0400
      Received: by wais.wais.com (4.1/SMI-4.1/Brent-911016) id AA14369; Sat, 24 Apr 93 20:47:54 PDT
      Date: Sat, 24 Apr 93 20:47:54 PDT
      Message-Id: <9304250347.AA14369@wais.wais.com>
      From: Brewster Kahle <brewster@wais.com>
      To: abc@concert.net
      To: admin@ds.internic.net
      To: akers@fiddle.oit.unc.edu
      To: anders@ifi.uio.no
      To: anders@munin.ub2.lu.se
      …
      To: m.l.nelson@LaRC.NASA.GOV
      …
      To: root@ds.internic.net
      To: root@ncgia.ucsb.edu
      To: root@fiddle.oit.unc.edu
      To: root@oac.hsc.uth.tmc.edu
      To: root@samba.acs.unc.edu
      To: root@spk41.usace.mil
      To: root@stone.ucs.indiana.edu
      To: root@sunsite.unc.edu
      To: root@uniwa.uwa.oz.au
      To: root@uva.ci.uv.es
      To: root@nic.funet.fi
      …

      WAIS server maintainers,

      As you probably know through wais-discussion, we are announcing the commercial WAIS server this thursday. There is a big press event and showcase at the WAIS Inc offices. Thank you, everyone, for making it possible for us to pull off a startup company.

      We are considering running a special price for a limited time for those that know and understand WAIS already. We would like to discuss this with those that might be interested in it, and would like to help us determine how it should work. Most people will continue to use the freeware, and that is fine, this is for those that might be interested in a commercial version. At this time, we will not be discussing the differences between things or other products.

      Given that the press has started to call and ask for information before hand (to scoop this story, you know the press...), we have had to keep a very quiet profile.

      On the other hand, we need the help from all of you. Generally, this is done with a signed non-disclosure basis, but this wont work on the Internet and not in time. What I was thinking was to ask anyone that would like to discuss this, to send an "email non-disclosure" to non-disclosed-waisites-request@wais.com. I wish this weren't so baroque, but you could not believe some of the members of the press I have talked to. If one reporter publishes early, it can spoil things (and get it wrong).

      (please dont email to me. At this point, my cup floweth over. I will dig out after the showcase!)

      -brewster
  50. The same 1993 email, annotated: When computers were $$$, an email to “root” could be expected to be received by someone entrusted with the necessary $$$ to responsibly administer the machine. IOW, “root” was almost always a white hat. It hasn’t been like that for a long time. Web archives are like the Unix mainframes of today. TMC->WAIS Inc->AOL->Alexa->IA https://twitter.com/phonedude_mln/status/1105160308866338816
  51. How well do you know root@archive.org? As in, could you call/email him right now and expect a response? Our entire national digital preservation strategy is predicated on Brewster Kahle “not being evil”™. If he is leading a 25+ year sleeper cell, we’re doomed.
  52. How well do you know these roots? Many more: https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
  53. Up until now, we’ve only looked at failures or edge cases in crawling and replay. What about deliberate fakes?
  54. Cut-n-paste / mashup “fakes” for humor. Victorian Photo Collage: https://www.metmuseum.org/exhibitions/listings/2010/victorian-photocollage “The Flying Saucer” (1956): https://en.wikipedia.org/wiki/The_Flying_Saucer_(song) https://www.youtube.com/watch?v=XCrn6QXvHLg Brian Williams Raps ‘Gin & Juice’: https://www.youtube.com/watch?v=XlGLhYFrv6w
  55. More convincing fakes require significant skills, knowledge, and access. https://en.wikipedia.org/wiki/Piltdown_Man https://en.wikipedia.org/wiki/Shroud_of_Turin https://www.npr.org/templates/story/story.php?storyId=94461486
  56. “deep learning” + “fake” = deepfakes https://motherboard.vice.com/en_us/article/7x799b/selling-ai-generated-fake-porn-is-probably-a-good-way-to-get-sued https://motherboard.vice.com/en_us/article/ev5eba/ai-fake-porn-of-friends-deepfakes
  57. Becoming more mainstream, no longer buried in the dark corners of Reddit. Includes a “safe for work” example. https://twitter.com/MikaelThalen/status/1090349932266094593 https://deepfakesapp.online/
  58. “Detecting” deepfakes will happen. “Preventing” deepfakes won’t happen; they’re here to stay. Mementos, even of a fake past, are core to the human condition. “Did you get your precious photos?” “Implants. Those aren't your memories, they're somebody else's. They're Tyrell's niece's.” http://deepemotions.free.fr/theme_1.html Real photos, fake memories: replicants attach significant value to photos, even when they know the memories are fake.
  59. Next Thanksgiving dinner, liven up the discussion with your extended family: (1) Extract just 0:23–0:26 of the Obama/Peele video. (2) Embed it in an HTML page. (3) Use Javascript to rewrite the banner and browser URL (Datetime: 2016-11-09; URL: www.whitehouse.gov/totallyNotFake). (4) Claim the deep state deleted the page from the live web. https://www.theverge.com/tldr/2018/4/17/17247334/ai-fake-news-video-barack-obama-jordan-peele-buzzfeed https://www.youtube.com/watch?time_continue=43&v=cQ54GDm1eL0#t=0m23s
  60. Not just hypothetical.
  61. Inserting fakes into real archives. Here’s an actual page in the IA “proving” Brian Williams released “Gin and Juice” in 1992, a full year before Snoop Dogg. John Berlin, MS Thesis, 2018. https://www.youtube.com/watch?v=k3QTcJZdFfs (actual URI-R & URI-M have also been obscured in the video to hide the technique). The content is clearly fake, but it demonstrates that it’s possible to write Javascript that attacks the archive’s playback capability. It takes an archiving expert to tell the difference.
  62. We’ve known about these & other attacks for nearly two years. http://labs.rhizome.org/presentations/security.html#/ https://acmccs.github.io/papers/p1741-lernerAT3.pdf https://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html https://ws-dl.blogspot.com/2018/04/2018-05-01-high-fidelity-ms-thesis-to.html
  63. There are other ways, presumably still hypothetical, to attack the archives. https://twitter.com/internetarchive/status/596768668756774914 https://xkcd.com/538/
  64. Before you say “that will never happen!”: “Planes, trains and fake names: the trail left by Skripal suspects” https://www.theguardian.com/uk-news/2018/sep/05/planes-trains-and-fake-names-the-trail-left-by-skripal-suspects “Surveillance footage shows Saudi 'body double' in Khashoggi's clothes after he was killed, Turkish source says” https://www.cnn.com/2018/10/22/middleeast/saudi-operative-jamal-khashoggi-clothes/index.html Reminder: agents, dissidents, journalists have all disappeared; they won’t mind adding a librarian/sysadmin to the list.
  65. I’ve got good news and bad news: setting up a web archive is neither as difficult nor as expensive as it used to be. OpenWayback, WAIL, pywb, et al. + cloud storage = you can have a web archive running in about the same time it took to generate the Steve Buscemi / Jennifer Lawrence deepfake. https://github.com/iipc/openwayback https://github.com/N0taN3rd/wail https://machawk1.github.io/wail/ https://github.com/webrecorder/pywb
  66. Inserting fakes into fake archives: breitbart.com/wayback/*/whitehouse.gov/totallyNotFake, infowars.com/web/*/whitehouse.gov/totallyNotFake, iluv.aynrand.org/*/whitehouse.gov/totallyNotFake, InternetResearchAgency.ru/whitehouse.gov/totallyNotFake. How well do you know root at these archives? Are they really four different archives, or one root for all of them? What if 99.9% of the time they faithfully replay pages?
  67. http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html What if we start off with > (n/2)+1 archives compromised?
  68. What if the archives were targeted to amplify a specific disinformation narrative? And what if the archives had no choice but to cooperate?
  69. The University of Farmington is fake. DHS strong-armed a “.edu” registration; they could do the same to IA & others too. https://twitter.com/nwarikoo/status/1090726638034276352 https://web.archive.org/web/20161023170733/https://universityoffarmington.edu/ https://twitter.com/phonedude_mln/status/1092464939040755712 First capture: 2016-10-23
  70. Blockchain to the rescue!!! <lasers> <sirens> <disco-thumping-soundtrack> nope. https://www.multichain.com/blog/2015/11/avoiding-pointless-blockchain-project/ https://eprint.iacr.org/2017/375.pdf https://blog.dshr.org/search/label/bitcoin
  71. There is no shortage of deepfake vs. blockchain stories. https://www.wired.com/story/the-blockchain-solution-to-our-deepfake-problems/ https://www.longhash.com/news/the-coming-war-between-deepfakes-and-blockchain
  72. A Voight-Kampff Test for deepfakes doesn’t seem that silly now. https://twitter.com/TechCrunch/status/1009556795965296642 https://www.technologyreview.com/s/611726/the-defense-department-has-produced-the-first-tools-for-catching-deepfakes/
  73. Are we prepared for the unintended consequences? “Enforcing digital signatures for all cameras and video devices would offer the same capability in reverse. Suddenly every photograph and video shared online could be traced back to its original owner. Security services in a repressive regime could scour social media for all videos depicting them in a negative light and trace them back to the precise individuals who captured the video, arresting them en masse.” https://www.forbes.com/sites/kalevleetaru/2018/09/09/why-digital-signatures-wont-prevent-deep-fakes-but-will-help-repressive-governments/
  74. On the other hand, “blockchaining” our pets is a study in incompatibility, so tracking photos may never happen. In Blade Runner, synthetic pets had serial numbers (real pets are unavailable to all but the richest). https://moviepaws.com/2017/10/22/owls-snakes-and-unicorns-the-animals-of-blade-runner/ “While most of the world has accepted these standards, North America has not. The primary problem is a competitive, technological one involving the compatibility of the microchips and the readers that are used by shelters and veterinary clinics.” https://www.aspca.org/about-us/aspca-policy-and-position-statements/microchips
  75. As for blockchains and web archives…
  76. This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
  77. This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps “…right now you can get timestamps for every book, movie, song, computer program, legal document, etc. in the thousands of collections in the archive. In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshots…”
  78. That’s never going to happen (at least not by a 3rd party through the playback interface).
  79. Sample 16k+ Mementos from 17 Web Archives
      Archive                    URI-Ms
      --------------------------------
      perma-archives.org            182
      bibalex.org                   199
      webarchive.org.uk             349
      bac-lac.gc.ca                 351
      proni.gov.uk                  469
      digar.ee                      488
      webharvest.gov                712
      internetmemory.org            979
      nationalarchives.gov.uk       994
      stanford.edu                 1222
      archive-it.org               1383
      archive.is                   1396
      web.archive.org              1566
      arquivo.pt                   1569
      webcitation.org              1585
      vefsafn.is                   1589
      loc.gov                      1594
      --------------------------------
      Total                       16627
  80. Periodically Replay Each Archived Page Above example: http://perma-archives.org/warc/20170101182813/http://umich.edu/ 35 times, from Nov. 2017 – Oct. 2018 For each replay, we download both the rewritten version and the “raw” version (where possible).
  81. Periodically Replay Each Archived Page Above example: http://perma-archives.org/warc/20170101182813/http://umich.edu/ 35 times, from Nov. 2017 – Oct. 2018 For each replay, we download both the rewritten version and the “raw” version (where possible). Partial archive outage because of security / maintenance upgrade
  82. Periodically Replay Each Archived Page Above example: http://perma-archives.org/warc/20170101182813/http://umich.edu/ 35 times, from Nov. 2017 – Oct. 2018 For each replay, we download both the rewritten version and the “raw” version (where possible). Post-upgrade, replay is variable.
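
The slides do not say how the “raw” version is requested. As a minimal sketch, assuming a Wayback-style archive that accepts the `id_` modifier after the 14-digit timestamp (not every archive in the sample necessarily does), the raw URI-M can be derived from the rewritten one. The helper name `raw_urim` is hypothetical.

```python
import re

# Assumption: the URI-M contains a 14-digit timestamp path segment,
# as in the Wayback-style examples on the slides.
_TIMESTAMP = re.compile(r"/(\d{14})/")


def raw_urim(urim: str) -> str:
    """Return the URI-M with the id_ modifier added after the timestamp."""
    new, count = _TIMESTAMP.subn(r"/\1id_/", urim, count=1)
    if count == 0:
        # e.g. short-link archives, where no raw variant can be derived this way
        raise ValueError(f"no 14-digit timestamp segment in {urim!r}")
    return new
```

For URI-Ms with no timestamp segment (such as http://archive.is/PaKx6), the function raises, which matches the slide's “where possible” caveat.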
  83. More Archived Pages Changed Every Time Than Never Changed (yes, this experiment used “raw” mode). Never changed: 2007 URI-Ms (1 in 8). Always changed: 2773 URI-Ms (1 in 6). Fixity-based approaches, including blockchain, will not work.
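
A minimal sketch of how repeated replays of one URI-M could be sorted into these buckets. The slides do not define the buckets precisely, so this assumes “never changed” means every replay hashed identically and “always changed” means every replay differed from the one before it. SHA-256 and the function names are illustrative choices, not the study's stated method.

```python
import hashlib


def digest(body: bytes) -> str:
    """Fixity value for one downloaded replay (SHA-256 is an assumption)."""
    return hashlib.sha256(body).hexdigest()


def classify_replays(bodies: list[bytes]) -> str:
    """Classify a URI-M by comparing each replay with the previous one."""
    digests = [digest(b) for b in bodies]
    if len(digests) < 2:
        raise ValueError("need at least two replays to compare")
    pairs = list(zip(digests, digests[1:]))
    if all(a == b for a, b in pairs):
        return "never changed"
    if all(a != b for a, b in pairs):
        return "always changed"
    return "sometimes changed"
```

Under this definition, a hash recorded at one replay only verifies later replays for URI-Ms in the “never changed” bucket.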
  84. “Hash the screen shot, not the HTML!” That doesn’t work either.
  85. 1 WARC file, 2 Wayback Machines, 3 Browsers = 6 different replays http://wayback.archive-it.org/all/20130106140348/http://www.harvard.edu/ http://web.archive.org/web/20130106140348/http://www.harvard.edu/ see also: https://ws-dl.blogspot.com/2016/12/2016-12-20-archiving-pages-with.html
  86. Why not create a LOCKSS for web archives?
  87. Web archives are not especially interoperable. There are many issues regarding interoperability, but generational loss is a good demonstration of incompatible assumptions about simulating the past.
  88. https://web.archive.org/web/20180501125952/https:/twitter.com/phonedude_mln/status/990054945457147904
  89. http://archive.is/PaKx6
  90. https://perma.cc/3HMS-TB59
  91. http://www.webcitation.org/77RhNeyoZ
  92. https://web.archive.org/web/20190407024654/https://perma.cc/3HMS-TB59
  93. https://web.archive.org/web/20190407031659/http://www.webcitation.org/77RhNeyoZ
  94. Web archiving interoperability: a metaphor (non-synthetic pets, possibly microchipped) https://www.youtube.com/watch?v=SQudKvrwDAU
  95. To summarize: Existing, trusted archives can be compromised by: 1) crawling malicious pages, 2) attacking facilities / personnel, or 3) court orders. Lowered resource threshold for archives allows: 1) “long game” archives: faithful now, corrupt later, and 2) “sock puppet” archives: surreptitiously cooperating archives. The nature of web archives is to change content – current fixity-based approaches will not help.
  96. Looking forward: We need new models for web archiving and verifying authenticity. The Heritrix / Wayback Machine technology stack, while successful, has limited our thinking.
  97. “Studies generally suggest that, year after year, less than 60 percent of web traffic is human; … For a period of time in 2013, the Times reported this year, a full half of YouTube traffic was “bots masquerading as people,” a portion so high that employees feared an inflection point after which YouTube’s systems for detecting fraudulent traffic would begin to regard bot traffic as real and human traffic as fake. They called this hypothetical event “the Inversion.”” http://nymag.com/intelligencer/2018/12/how-much-of-the-internet-is-fake.html Robots outnumber humans 10:1 in sessions, 5:4 in HTTP connections in the IA, ca. 2012 http://arxiv.org/abs/1309.4016 https://giphy.com/gifs/harrison-ford-blade-runner-sean-young-yjB2fwqjv5rry/media
  98. I suspect the core of the new model will have a lot in common with click farms https://twitter.com/mbrennanchina/status/1072114511212109824
  99. Record what we saw at crawl time as a baseline, then we need a distance measure between crawl time and replay time http://dx.doi.org/10.5210/fm.v22i112.8097 https://ws-dl.blogspot.com/2013/05/2013-05-25-game-walkthroughs-as.html Documenting instead of archiving… 1) Robotic witnesses 2) New Nielsen families
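
The slide calls for a distance measure but does not name one. As one illustrative possibility only, a Jaccard distance over word shingles scores how far a replay's extracted text has drifted from the crawl-time baseline, where exact hashes can only say “same” or “different”. The function names and the shingle size `k=3` are assumptions made for this sketch.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(baseline: str, replay: str, k: int = 3) -> float:
    """0.0 means identical shingle sets, 1.0 means nothing in common."""
    a, b = shingles(baseline, k), shingles(replay, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

A graded score like this lets small differences (a rewritten banner, a changed timestamp) be told apart from wholesale content changes. Any threshold between the two would have to be chosen empirically.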
  100. Some of you might be thinking “but I don’t like Blade Runner – what can I take away from this talk?” (my wife refers to the film as “serious white guys talking”) Two methods for passing the Voight-Kampff Test for Blade Runner fandom
  101. 1) Is Deckard a replicant? In the book, he’s definitely human. In the seven (!) versions of the movie, it ranges from “ambiguous” to “replicant”. https://moviepaws.com/2017/10/22/owls-snakes-and-unicorns-the-animals-of-blade-runner/ https://en.wikipedia.org/wiki/Themes_in_Blade_Runner https://en.wikipedia.org/wiki/Blade_Runner#Versions
  102. 2) “Tears in Rain” – Greatest monologue in sci-fi? Or greatest monologue of all time? I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die. https://www.youtube.com/watch?v=9hDo80ddn4Q https://en.wikipedia.org/wiki/Tears_in_rain_monologue https://www.youtube.com/watch?v=BM54jXndyvQ
  103. 2) “Tears in Rain” – Greatest monologue in sci-fi? Or greatest monologue of all time? I've crawled things you people wouldn't believe. Clickjacking attacks off the x-frame-options: sameorigin. I watched ajax requests redirect at the aggregator TimeGate. All those pages will be lost in time, like tears in rain. Time to lie. https://www.youtube.com/watch?v=9hDo80ddn4Q https://en.wikipedia.org/wiki/Tears_in_rain_monologue https://www.youtube.com/watch?v=BM54jXndyvQ
