A closing talk I gave at the JISC/DPC 'Missing Links' conference on web archiving in July 2009. The talks were on the DPC site but ironically the link is now broken.
I’m here to talk with you today about some use cases for web archives, and more widely about what people might want to know about the history of the web and what’s on the web. And the reason for thinking about these things is that they can and should influence how we collect stuff, what we collect, but more importantly how we provide access to it. But first, to my title. It is somewhat contrived. And I can reveal that the reason I chose it is simply because it can be shortened, so. This enables me to lay claim to be the first person to mention web 8.0. And that’s pretty pointless, isn’t it. The one thing we can be sure about is that whenever we need to refer to another shift in the usage model of the web similar to that which inspired the name ‘web 2.0’, we won’t be calling it ‘web 3.0’ (although some marketers already are.) We’ll need some other entirely new metaphor to describe an entirely new shift. But if we did stick to using numbers, web 8.0 wouldn’t be with us for another 60 to 90 years. Most of us will be dead. None of us can imagine what it will be like. And some people will be looking back at the web and web 2.0 and marvelling at how primitive, quaint and amusing we all were. And how frustrating - because we can guarantee that we won’t have captured something of importance to the future.
Perhaps we can get some perspective on this by considering our present-day view of a similar technological revolution of the past, the telegraph. This popular history by Tom Standage explicitly tries to draw parallels between the change which the telegraph brought to its day, and that which the internet has brought us now. The comparison is a bit strained, but amongst the points of interest to me are the aspects of the history of the telegraph that make up the book. The contents of messages and their authorship are certainly part of the history. But just as important are contemporary reactions to it and descriptions of how the technology affected other aspects of life. One image that has stayed with me is of the receptacles which appeared on the back of horse-drawn trams in Paris into which people could drop telegraph messages on pre-paid stationery, from whence they would be delivered to a telegraph office en route for speedy onward delivery. Historians of communication technology are interested in more than simply what was communicated, and by whom.
Let’s look at another communication technology, which also allows individuals to send messages to each other. If we have a small number of these messages we might well be interested in analysing their content. If we have a few tens of millions of them, though, other things become interesting, as this visualisation demonstrates. It shows SMS traffic in Amsterdam in the days leading up to and following New Year’s Eve. There’s a lot of information in this visualisation and a lot more that could be done with the data behind it. Yet it was done without knowing the contents of any of the messages, nor who they were from nor who they were to. We simply know where they were sent and when. When you have enough data of a given type, the useless becomes useful.
So my concern about web archives comes partly from a feeling that some of them are too document-centred, imagining that the only user is one sitting in front of a screen selecting past pages to view one by one. That is one possible use of a web archive, but by no means the only one. We can be interested in content in aggregate; we can be interested in the properties of content, the web of data and as data, and material about the web as well as from it - as Standage’s history makes clear.
So I’m not suggesting that no one wants to view single pages. Sometimes we’re interested in their content more than their presentation, as we heard from the ArchivePress folk today. And sometimes the presentation is key, and sometimes it’s both. We might be interested in timeslices - a bunch of pages from a given point in time - or in investigating how one page or set of pages changes over time. Brian Kelly’s presentation of the history of the University of Bath homepage is a useful illustration, painstakingly constructed by him using the Internet Archive. Couldn’t we make access like this a little easier for people?
But when we have content in aggregate, new possibilities arise. Textual analysis can tell us many things. We can look at the contrasting use of language between different types of sites, we can track the spread of neologisms, concepts and rumours, or we can do something as simple as constructing word clouds from pages over time. Neil Grindley is currently examining the text of the entire JISC website to look for overuse of particular forms of language, or to contrast mentions of teaching and learning as opposed to research. And none of these involve a human looking at a web page.
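The kind of analysis described above needs no human reader at all. As a minimal sketch - with made-up snapshot data, and a deliberately crude tag-stripper standing in for proper HTML parsing - word frequencies for one archived page could be compared across crawl dates like this:

```python
import re
from collections import Counter

# Hypothetical snapshots of one archived page at two crawl dates
# (invented for illustration; a real archive would supply these).
snapshots = {
    "2005-06-01": "<html><body>Teaching and learning resources on the web.</body></html>",
    "2009-06-01": "<html><body>Research data and the web of linked data.</body></html>",
}

def word_counts(html):
    """Crudely strip markup, lower-case, and count word frequencies."""
    text = re.sub(r"<[^>]+>", " ", html).lower()
    return Counter(re.findall(r"[a-z]+", text))

for date, html in sorted(snapshots.items()):
    print(date, word_counts(html).most_common(3))
```

The same counts, aggregated over many sites rather than one page, are what would drive the word clouds and language comparisons mentioned above.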
But properties of the content can be just as interesting. To take one simple example, we might want to see how the takeup of a particular image format such as PNG took place. Was it geographically uniform, or uniform over types of site? What did it replace - all image formats, or GIF, BMP or JPG preferentially? And looking back further, whatever happened to XPM, the original icon format used by early graphical web browsers?
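A format-takeup question like this reduces to tallying resource types per crawl year. A sketch, with invented records and using file extensions as a stand-in for the MIME types a real archive index would record:

```python
from collections import Counter

# Hypothetical crawl records: (year, resource URL) pairs, invented for
# illustration, as an archive index might expose them.
records = [
    (1998, "http://example.org/logo.gif"),
    (1998, "http://example.org/photo.jpg"),
    (2002, "http://example.org/logo.png"),
    (2002, "http://example.org/chart.png"),
    (2002, "http://example.org/photo.jpg"),
]

def format_share_by_year(records):
    """Tally image-format extensions per crawl year."""
    by_year = {}
    for year, url in records:
        ext = url.rsplit(".", 1)[-1].lower()
        by_year.setdefault(year, Counter())[ext] += 1
    return by_year

for year, counts in sorted(format_share_by_year(records).items()):
    total = sum(counts.values())
    print(year, {fmt: round(n / total, 2) for fmt, n in counts.items()})
```

Run against a whole archive, and split further by country code or site type, the same tally would answer the geographic-uniformity question directly.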
We expect archives to offer some sort of search. But how should it operate? This was the result of carrying out a search using a well known search engine yesterday for ‘web archive’. What would the results have been if I carried out that search 10 years ago? Is any archive’s search interface capable of telling me that, or even getting close to it?
And on a larger scale, the web itself is data, connected data, connected metadata, with connections that shift and break and change meaning over time. There are a host of ways of visualising this, some more suited to individual sites, some to clusters of sites on a topic or over time, and some to the entire web, or large parts of it. I want to be able to do these things with archived web content.
And increasingly there is talk of the web of data - the use of the web to link together data and construct something greater than the sum of the parts doing it. This well-known diagram is a visualisation of the state of the web of data a few months ago, and it’s since been joined by much more government and research data from a variety of sources. Now one can argue that dealing with this isn’t the job of a web archive - the data is not of the web, but simply on the web, and it’s the job of database archives to preserve it. But some of the data really does make sense only in a web context, and there’s no doubt that the web is being used to do stuff with it that wouldn’t happen otherwise. I think we should be able to look at the web of data as it was, as well as as it is and will be. I confess I don’t know how we might do that, but the possibility should not be ruled out.
APIs are key to allowing much of this to happen. Those which are web-friendly encourage the development of innovative ways of accessing content. They allow archives to concentrate on collecting material, protecting it and providing permanent references for their content, whilst permitting a variety of viewing and access methods to emerge. Moreover, those that permit bulk access enable intelligent agents to work for us or to work alongside us in exploring archived content - agents that don’t need to be developed by the archives themselves, but which can be exploited by them. If the archive space fragments, APIs are particularly important to allow people to spread research over a variety of archives. Let’s look at one reason that should persuade you why APIs are a good thing. Taggalaxy allows the exploration of flickr content in a way that is completely different from flickr, yet depends only on the metadata that flickr exposes via its API. Imagine exploring a web archive this way.
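To make the bulk-access point concrete: the Internet Archive already exposes a line-oriented CDX index of its captures, and machine-readable indexes like it are exactly what third-party agents can build on. A sketch of parsing such a response - the sample lines below are invented, and the field layout shown is a simplified assumption rather than the full CDX specification:

```python
# Invented sample of a CDX-style index response: one capture per line,
# here simplified to urlkey, timestamp, original URL, MIME type, status.
sample = """\
uk,ac,bath)/ 19981212032125 http://www.bath.ac.uk/ text/html 200
uk,ac,bath)/ 20040610084321 http://www.bath.ac.uk/ text/html 200
"""

def parse_cdx(text):
    """Split each capture line into a dict of its fields."""
    rows = []
    for line in text.strip().splitlines():
        urlkey, ts, original, mime, status = line.split()
        rows.append({"timestamp": ts, "url": original,
                     "mime": mime, "status": status})
    return rows

for row in parse_cdx(sample):
    print(row["timestamp"][:4], row["url"])
```

Anything from a Taggalaxy-style visual explorer to a format-takeup study could sit on top of an index feed like this, without the archive itself having to build any of those front ends.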
And let’s not forget that there’s more stuff than web pages that might be of future interest. Some of it is already being preserved, such as traditional media about the web - the flurry of consumer magazines that emerged in the 1990s, for instance, although not the premium-rate telephone line that told you the ‘cool site of the day.’ Usage logs, server configurations and the like are all part of the history of the web in some form, as is the software that makes it possible. I’m not suggesting that web archivists deal with this stuff - but someone should.
But finally a few more of those content visualisations, all taken from Martin Dodge’s cyber-geography pages - a discipline that’s been going long enough that his pages are no longer maintained and are out of date.