Presentations from the Oxford Internet Institute, the Internet Archive, and Hanzo Archives Ltd on the results of a JISC-NEH funded transatlantic digitisation project.
1. Slides from Humanities on the Web: Is it working?
Date: Thursday, 19 March 2009, 10-4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH
Afternoon Event:
1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in
conjunction with the Internet Archive: The World Wide Web of Humanities
OII: Selecting and analysing the sample WWI and WWII collections
(Christine Madsen & Dr. Eric Meyer)
The Internet Archive: Extracting the data (Molly Bragg)
Hanzo Archives Ltd.: Working with the data (Mark Middleton)
Discussion and questions
Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238
3. Why WWI and WWII?
Many branches of the humanities: history, journalism, art, art history, advertising, literature, poetry, political science, military history
4. Why WWI and WWII?
Well-rounded set of materials
5. Why WWI and WWII?
• Changes over time
• Differences between WWI and WWII
[Dimensions compared: language, doc types, secondary domains, top-level domains]
7. Building the Collection
Seeds are: the website or portion of the website that you plan to include in your collection
[Diagram: Seed 1, Seed 2, Seed 3 forming the Initial Collection]
8. Building the Collection
A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site
[Diagram: Expanded Collection, with Seeds 1-6 connected through a network of linked sites]
11. Building the Collection
‘World War One’, ‘World War I’, ‘First World War’, ‘the Great War’, ‘Première Guerre Mondiale’
‘World War II’, ‘World War Two’, ‘zweiter Weltkrieg’
12. Building the Collection
[Cycle: Record links from first 20 pages of search → Following links → Returning to ‘hub’ sites for further analysis (include dead links)]
13. Building the Collection
Expanding scope
http://www.greatwar.co.uk/westfront/Somme/index.htm
http://www.greatwar.co.uk
14. Building the Collection
Expanding scope
memory.loc.gov/ammem/collections/maps/wwii/index.html
memory.loc.gov/ammem/collections/maps/wwii/
15. Building the Collection
Dealing with illogical or flat directory structures
www.eyewitnesstohistory.com/ <= don’t want whole site
www.eyewitnesstohistory.com/blitzkrieg.htm
www.eyewitnesstohistory.com/dday.html
www.eyewitnesstohistory.com/midway.htm
www.eyewitnesstohistory.com/airbattle.htm
www.eyewitnesstohistory.com/dunkirk.htm
www.eyewitnesstohistory.com/francesurrenders.htm
16. Building the Collection
• Stop when most results are redundant
• Narrow in on more specific topics
[Example topics: Churchill, Hitler, ‘zweiter Weltkrieg’, ‘Battle of the Bulge’, ‘Great War’, Guadalcanal, WWI, WWII, Allies, Home front]
17. Building the Collection
• Materials in foreign languages
– Focused on German sites
– Consider local conventions, not just translations
WWII
(zweiter Weltkrieg)
the period of National Socialism
(Zeit des Nationalsozialismus)
the period in which the Nazis ruled
(Nazizeit)
18. • Other foreign languages were included, but
not sought after
Belarusian; Catalan/Valencian; Chamorro;
Czech; Danish; German; Dzongkha; English;
Spanish/Castilian; Finnish; French; Hebrew;
Hungarian; Italian; Japanese; Luba-Katanga;
Dutch/Flemish; Polish; Portuguese; Russian;
Slovenian; Turkish; Ukrainian; Chinese
19. Building the Collection
Difficult to find and include:
Museums, libraries, archives
Some improvement through targeted searches
NYPL (2,100 photographs) Harvard Libraries (1,000 WWI Pamphlets)
Directory Structures still limiting
http://pds.lib.harvard.edu/pds/view/7845178
(first page of a multipage object)
20. The World Wide Web of Humanities
“Extracting The Data”
St Anne's College, Oxford
March 19, 2009
Molly Bragg, Partner Specialist
Web Group
The Internet Archive
21. Agenda
Brief Introduction to IA’s Web Archives
Discipline Specific Data Extraction from
Longitudinal Web Archives: The
WWWoH Case Study
Recommendations for Future Research
and Tools Development Efforts
23. The Internet Archive is…
A digital library of ~4 petabytes of information
Web Pages
Educational Courseware
Films & Videos
Music & Spoken Word
Books & Texts
Software
Images
The Archive’s combined collections receive
over 6 mil downloads a day!
www.archive.org
24. IA Web Archives
1.6+ petabytes of primary data (compressed)
150+ billion URIs, culled from 85+ million
sites, harvested from 1996 to the present
Includes captures from every domain
Encompasses content in over 40 languages
As of 2009, IA will add ½ petabyte to 1 petabyte of
data to these collections each year.
27. WWWoH Case Study
Unique URLs in the collection: 5,362,425
Total number of captures: 23,006,857
Captures span: May, 1996 to Aug, 2008
Total size of compressed data: ~250 GB
28. The Data Extraction
Process
Oxford Internet Institute selected relevant
sites/URLs
Identified all captures related to the seeds
Identified all files embedded in each capture
(on & off seed domains) for extraction
Attempted to locate additional candidate
seed URLs/domains for inclusion in the
collection using outbound link data
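The notes below describe this process in prose. As a rough illustration of the first step, enumerating captures per seed, here is a stdlib-only sketch against the Wayback Machine's public CDX API; the project itself used IA-internal tooling, and the seed URL and limit here are placeholder values:

```python
# A minimal sketch: list captures of a seed via the public Wayback CDX API.
# Illustrative only; the project used IA-internal tooling. "matchType=prefix"
# returns captures for the seed URL and everything beneath it.
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def list_captures(seed, limit=100):
    """Return (timestamp, original_url) pairs for captures of `seed`."""
    query = urllib.parse.urlencode({
        "url": seed,
        "matchType": "prefix",   # include pages under the seed directory
        "output": "json",
        "limit": str(limit),
    })
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{query}") as response:
        rows = json.load(response)
    if not rows:
        return []
    header, data = rows[0], rows[1:]   # first row names the fields
    ts, orig = header.index("timestamp"), header.index("original")
    return [(row[ts], row[orig]) for row in data]

# Example: a few captures of one of the project's seeds.
for timestamp, url in list_captures("greatwar.co.uk", limit=10):
    print(timestamp, url)
```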
29. The Data Extraction Process
Relevant URLs not identified as seeds
were not extracted.
Automatically harvesting ALL outbound links
can capture relevant non-seed URLs; however, it
can also introduce a large amount of
extraneous content into the collection
Manually curating outbound links excludes
non-relevant content; however, it can be an
overwhelming task due to the volume of links
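One middle path between these two extremes is a keyword filter over outbound links. The sketch below is purely illustrative: the term list and sample links are invented for the example, and the project's curation was in fact done manually:

```python
# A minimal sketch of semi-automatic outbound-link curation: keep only links
# whose URL or anchor text matches topic keywords. The keyword list is an
# invented example, not the project's actual criteria.
TOPIC_TERMS = ("wwi", "wwii", "world-war", "worldwar", "great-war",
               "somme", "dday", "weltkrieg")

def is_candidate(url, anchor_text=""):
    """True if an outbound link looks on-topic for the collection."""
    haystack = (url + " " + anchor_text).lower()
    return any(term in haystack for term in TOPIC_TERMS)

outbound = [
    ("http://www.greatwar.co.uk/westfront/Somme/index.htm", "Battle of the Somme"),
    ("http://example.com/cat-pictures", "cute cats"),
]
candidates = [u for u, text in outbound if is_candidate(u, text)]
print(candidates)  # only the Somme page survives the filter
```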
30. WWWoH Case Study: WWI
Number of Seeds: 2263
Unique Hosts: 906
Number of Links: 143+ mil
43. Challenges
Identifying subject-matter-specific
resources of interest for an extraction and
then automating those procedures.
Tools are missing from the workflow that
might make the initial scoping of an extraction
easier to define and revise
Available tools for collection building and
access are too technically focused for the
average humanities scholar
45. Implications for Future
Research
Need link and web graphing tools
that use inbound and outbound link
data to identify further resources of
interest
Need to experiment with a more
diverse range of UI navigational
paradigms that address the
dimension of time and curatorial input
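To make the first of these needs concrete, here is a minimal sketch of link-graph analysis: rank non-seed hosts by how many distinct seeds link to them. The seed set and edge list are toy assumptions, not an existing tool:

```python
# A minimal sketch of the link-graph tooling called for above: rank hosts
# outside the seed list by how many distinct seeds link to them. The edge
# list is a toy example; in practice it would come from crawl link data.
from collections import defaultdict
from urllib.parse import urlparse

seeds = {"greatwar.co.uk", "eyewitnesstohistory.com"}

# (source URL, target URL) pairs harvested from crawled pages.
edges = [
    ("http://greatwar.co.uk/links.htm", "http://firstworldwar.com/maps.htm"),
    ("http://eyewitnesstohistory.com/dday.html", "http://firstworldwar.com/"),
    ("http://greatwar.co.uk/links.htm", "http://example.com/unrelated"),
]

inbound_from_seeds = defaultdict(set)
for source, target in edges:
    src_host = urlparse(source).netloc
    dst_host = urlparse(target).netloc
    if src_host in seeds and dst_host not in seeds:
        inbound_from_seeds[dst_host].add(src_host)

# Hosts linked from the most seeds are the strongest new-seed candidates.
for host, linking_seeds in sorted(inbound_from_seeds.items(),
                                  key=lambda kv: -len(kv[1])):
    print(host, len(linking_seeds))
```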
48. Opportunities
Extractions make it easier for humanities
scholars to locate and assemble source
materials of interest.
These collections can accelerate and/or
augment discipline-specific research efforts.
Extractions can encourage distributed
collaboration and cooperation between entities
who might not otherwise be aware of one
another
Aside from being relevant for transatlantic cooperation, because of the involvement of so many countries, the materials available on the World Wars represent a well-rounded set of humanities materials that will allow us to test the tools against a variety of types of documents and resources. World War collections on the web include materials that fall under the topics of history, journalism, art, art history, advertising, literature, poetry, political science, military history and others.
The types of materials that have been digitized also cover a range of challenges that will allow robust testing of our approach, including multiple formats (text, images of documents, photos, audio), multiple languages (English, German, etc.), and many document types.
All of this started with identifying a set of seed sites. A seed site is a web site from which additional sites can be discovered via the hyperlinks of the site, through in-links to and out-links from the seed site.
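As a rough sketch of what discovery via hyperlinks can look like mechanically (not the project's actual tooling), the stdlib-only snippet below fetches one seed page and collects the external hosts it links out to:

```python
# A minimal sketch of seed-based discovery: fetch a seed page and collect the
# hosts of its outbound links as candidate sites. Illustrative only; the
# project combined this idea with manual review rather than blind crawling.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

def discover_hosts(seed_url):
    """Return the set of external hosts linked from a seed page."""
    with urllib.request.urlopen(seed_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    collector = LinkCollector(seed_url)
    collector.feed(html)
    seed_host = urlparse(seed_url).netloc
    return {urlparse(link).netloc for link in collector.links} - {seed_host, ""}

print(discover_hosts("http://www.greatwar.co.uk/"))
```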
Early on in the seed selection process, though, we realized that this selection policy would not result in anything close to the original target of 100-250 million pages, as the first few passes through the collections yielded barely 1 million pages. In the end, our collections are smaller than the total possible limits identified by those responsible for the technological implementation. This was the first lesson for the whole team: even though the data deluge (Hey & Trefethen, 2003) is often identified as a key challenge for researchers across fields, focused collections in the humanities are still relatively unlikely to encompass hundreds of millions of objects.
The seeds were identified in a process that began with topic-based web searches. The searches began with the most general topics, ‘World War I’ and ‘World War II’, being sure to include all variations in language, spelling, and phrasing, such as ‘World War One’ and ‘First World War.’ This was followed by searching regional localizations of the phrases and topics, such as ‘the Great War,’ ‘Première Guerre Mondiale,’ and ‘zweiter Weltkrieg.’
For each search, the first twenty pages of the search results were captured by following links from the search results page and copying and pasting the URLs into a spreadsheet. Sites with lists of links to other relevant sites were bookmarked and returned to at a later time for exploration and capture. As the goal was to gather a collection of archived web sites, links to sites that no longer exist were also recorded. These dead links, which appear to be useless on the live web, represent one advantage to this collection method: if the Internet Archive includes archived versions of these pages, they can still be included in the collection. This represents an improvement over the native interface to the Internet Archive’s Wayback Machine, which requires users to type in a URL and then select from various snapshots of those pages collected over time.
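The Wayback Machine's public availability endpoint offers one present-day way to make this concrete: a link that is dead on the live web can be checked for an archived capture. The sketch below is illustrative only; the project worked with the Internet Archive directly rather than through this API:

```python
# A minimal sketch of why dead links were still worth recording: ask the
# Wayback Machine's public availability API whether an archived capture
# exists. Illustrative only, not the project's actual workflow.
import json
import urllib.parse
import urllib.request

def closest_capture(url):
    """Return the URL of the closest archived capture, or None."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = f"http://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint) as response:
        payload = json.load(response)
    snapshot = payload.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None

# A link that is dead on the live web may still have an archived copy.
print(closest_capture("http://www.greatwar.co.uk/"))
```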
The next step was to generalize the URLs in order to maximize the number of pages in the collection. For each URL copied, references to specific pages were removed and the URL truncated to the root site or most logical directory. For example, it was logical to conclude that the entirety of http://www.greatwar.co.uk was on topic, so all references to specific pages, such as http://www.greatwar.co.uk/westfront/Somme/index.htm, were removed and replaced with http://www.greatwar.co.uk. (Duplicate sites were removed automatically.) Many collections of materials, in particular those from universities, archives, and libraries, were not resident on unique domains. In these cases, the URL could only be truncated as far back as the directory containing the relevant materials. For example: http://memory.loc.gov/ammem/collections/maps/wwii/index.html to http://memory.loc.gov/ammem/collections/maps/wwii/.
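A minimal sketch of this truncation step, assuming the root-vs-directory decision has already been made by a human curator (only the mechanical URL handling is shown):

```python
# A minimal sketch of the generalisation step described above: truncate a
# captured URL either to its site root or to a containing directory. In the
# project the choice between the two was a human judgement; only the
# mechanical part is sketched here.
from urllib.parse import urlparse, urlunparse

def to_root(url):
    """Truncate a URL to its site root."""
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, "/", "", "", ""))

def to_directory(url):
    """Drop the page file, keeping the containing directory."""
    parts = urlparse(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return urlunparse((parts.scheme, parts.netloc, directory, "", "", ""))

print(to_root("http://www.greatwar.co.uk/westfront/Somme/index.htm"))
# -> http://www.greatwar.co.uk/
print(to_directory("http://memory.loc.gov/ammem/collections/maps/wwii/index.html"))
# -> http://memory.loc.gov/ammem/collections/maps/wwii/
```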
Illogical directory structures were often encountered and were a clear barrier to increasing the number of sites collected. EyeWitnesstoHistory.com contains first-person accounts of historical events, including almost fifty pages dedicated to the First and Second World Wars. Each page file sits in the root directory, though, and so needed to be recorded individually. The entire site (http://www.eyewitnesstohistory.com/) could not be included because only a fraction of it falls within the scope of the collection; therefore individual pages (http://www.eyewitnesstohistory.com/blitzkrieg.htm, ../dday.html, etc.) had to be recorded.
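Under these constraints, scope checking reduces to two rules: directory-prefix seeds plus an explicit page list. A minimal sketch, with seed lists taken from the examples above:

```python
# A minimal sketch of seed scoping with flat sites: a URL is in scope if it
# falls under a directory seed OR exactly matches an individually listed
# page. Seed entries are examples from the slides, not the full seed list.
directory_seeds = [
    "http://www.greatwar.co.uk/",
    "http://memory.loc.gov/ammem/collections/maps/wwii/",
]
page_seeds = {
    "http://www.eyewitnesstohistory.com/blitzkrieg.htm",
    "http://www.eyewitnesstohistory.com/dday.html",
    "http://www.eyewitnesstohistory.com/dunkirk.htm",
}

def in_scope(url):
    """True if `url` belongs to the collection under these seed rules."""
    return url in page_seeds or any(url.startswith(d) for d in directory_seeds)

print(in_scope("http://www.eyewitnesstohistory.com/dday.html"))        # True
print(in_scope("http://www.eyewitnesstohistory.com/pompeii.htm"))      # False
print(in_scope("http://www.greatwar.co.uk/westfront/Somme/index.htm")) # True
```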
Although this process may seem to result in an almost infinite number of sites, it became clear that after gathering several hundred seeds, most of the resulting sites identified were redundant. At that point, more precise search terms were selected and the process re-initiated. Narrower topic searches were commonly either biographical (Hitler, Churchill, etc.), event-based (Battle of Midway, Guadalcanal campaign, surrender of Japan), or based on subjects that, while technically broader in scope, are commonly associated with one of the two wars (Holocaust, Allies, home front).
Because of the time-consuming nature of the collection-building process, a decision was made to focus the foreign-language part of the collection on German sites, with the idea that it would be more useful to have one language with a deep collection than many with shallow ones. (Sites identified in other languages were included, but not sought after.) Native German speakers were consulted and helped design a search strategy to maximize the number of resulting German sites. This strategy took into account the local convention of speaking not only of World War II (zweiter Weltkrieg) but more commonly of the period in which the Nazis ruled (Nazizeit) or the period of National Socialism (Zeit des Nationalsozialismus). This approach illustrated the need for localization, not just translation, when building a collection of sites in other languages.
As the topics for collection development were narrowed, the collection of seed sites continued to grow, but several content areas remained difficult to include. A majority of the material from museums, libraries, and archives was not findable using the subject searches mentioned above. Most of this material was identified using targeted searches of domains likely to contain relevant content. Many of these institutions deliver content through local databases that are not publicly indexed by common search engines. The New York Public Library has an extensive digital collection of photographs, over 2,100 of which are relevant to one of the world wars. These materials can only be located by first going to NYPL’s site. Similarly, Harvard University has a collection of almost one thousand digitized pamphlets from World War I. They can only be found by searching the library’s union catalogue. In each of these cases, knowing that the materials exist, or might exist, is a prerequisite for being able to find them. But even when located, materials in databases remained problematic. There is usually no directory structure that can capture a number of items at once, nor are the URLs generated by database searches commonly stable. URLs to the Harvard materials, for example (http://pds.lib.harvard.edu/pds/view/7845178), only provide access to the first page of multi-page objects. While NYPL does provide stable URLs for the objects in its database, these need to be identified within each bibliographic record in order to be added to the seed list.
A 501(c)(3) non-profit located in The Presidio, San Francisco, California. Started in 1996 to build an ‘Internet library’ of archived Web pages; expanded in 1999 to include all media, texts, etc. Focus: harvest, storage, management & access to digital content; contribution and use of open source Web archiving software tools and services; access to digital assets in the public domain. Web: 150+ billion objects, ~1.6 petabytes of data compressed. Moving images: Prelinger, public domain films. Still images: NASA. Texts: Project Gutenberg, public domain texts, Children’s Digital Library. Audio: LMA, Grateful Dead, public domain audio clips. Educational courseware. Other collections: software & television (subsidiary).
100s of thousands of online journals and blogs; millions of digitized texts; 100s of millions of web sites; 100s of billions of unique web pages; 100s of file/MIME types. But too many files to count… A single snapshot of the visible Web now exceeds a petabyte of data…
Nuts and bolts of the data extraction process.