9. Archivematica is open source
Accessible Data
Proprietary Software By Karin Apricot via www.flickr.com/people/karenapricot/
By Konrad Summers [CC-BY-SA-2.0
(www.creativecommons.org/licenses/by-sa/2.0)], via Wikimedia
Commons
City Data
By Paul Rudman via
http://www.flickr.com/photos/thecanonrattman/
Open Source Accessible Data
By Paul Rudman via http://www.flickr.com/photos/thecanonrattman/
Digital Preservation
By Trish Steel [CC-BY-SA-2.0 (www.creativecommons.org/licenses/by-sa/2.0)], via geograph.org.uk
27. URLs for Resources
• Vancouver Open Data
http://data.vancouver.ca/datacatalogue/index.htm
• Europeana Data http://pro.europeana.eu/web/guest/linked-open-data
• Europeana portal http://www.europeana.eu/portal/
• Map Warper http://mapwarper.net/
• Map Warper at NYPL http://maps.nypl.org/warper/
• Akoma Ntoso main site http://www.akomantoso.org/
• Akoma Ntoso examples http://examples.akomantoso.org/
• Waisda? http://woordentikkertje.manbijthond.nl/
Editor's Notes
Open data is a good fit for archives. Opening records for free public use is the core of what we do, and what we’ve done for decades. Archivists are trained to administer privacy legislation, and we routinely consider privacy concerns when we make information available. In the past we made analogue records available, so we weren’t a resource for digital data researchers; now we have the digital infrastructure, and are acquiring the knowledge, to be able to offer digital data. At our archives, we are looking at open data in three different contexts:
1. We are the official repository and custodians of the older open data that the City has released
2. We have our own metadata that we’d like to release
3. We want to turn some of our analogue archival records into digital data
Let’s look at the City’s Open Data sets first. The City of Vancouver maintains a web site offering nearly 140 different data sets, most of them downloadable in multiple formats. Different sets are updated at different frequencies: daily, weekly, or monthly.
Each set has its own descriptive metadata.
Privacy concerns have already been dealt with. The Archives is NOT acquiring every incremental update of every set, just regular snapshots. We are still planning exactly HOW we will do this, but we intend to preserve all these data sets and make them freely downloadable.
We will preserve the data using Archivematica, a preservation system that we played a large role in developing. This is a screenshot of the first beta release, which came out just a few days ago.
How does it embody best practices? It’s open source. We cannot preserve anything by putting it into a proprietary black box and hoping for the best: with this system we know every action taken, because the code is open and transparent. In other words, Jeff Goldblum is not going to unexpectedly turn into The Fly. He will remain the essence of Jeff Goldblum, and we’ll be able to show how that happened.
Archivematica is based on standards. It was designed from the very beginning to conform to this ISO standard OAIS, which is a framework for digital preservation and access. It also incorporates several metadata standards, such as METS and PREMIS.
We need to develop preservation plans for some of the filetypes before we put them into the system, as Archivematica does not have existing plans for everything yet.
We need to take a look at the licence under which the City releases its data, and how long that licence will apply. This could become tricky in the future if the City changes the licence, and it’s likely to evolve. It would be awkward for the end user if the same data sets carried different licences in different years.
On to the second context: we have rich metadata about our holdings which we’d like to make available for use, and I’m sure most of you have the same: your catalogue metadata. Cultural institutions such as galleries, museums, archives, and libraries are making catalogue metadata easier for others to use for analysis, not just for searching. There isn’t a single best practice for sharing these data sets, although most applications use fairly common formats and schemas. Best practice is to provide what the community can use, and that can mean more than one type of access or format. Data can be shared as XML, JSON, or even CSV; it can be made available for download, via API, or both.
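The point about serving multiple formats from one set of records can be sketched with a few lines of code. This is a minimal illustration, not our actual system; the record fields and identifiers here are made up, and a real catalogue would follow a defined schema such as Dublin Core.

```python
import csv
import io
import json

# A few hypothetical catalogue records; field names are illustrative only.
records = [
    {"identifier": "AM1234", "title": "Council minutes, 1898", "date": "1898"},
    {"identifier": "AM5678", "title": "Harbour survey map", "date": "1923"},
]

def as_json(items):
    """Serialize the records as a JSON array, suitable for download or an API response."""
    return json.dumps(items, indent=2)

def as_csv(items):
    """Flatten the same records into CSV for spreadsheet users."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["identifier", "title", "date"])
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()
```

The same in-memory records feed both serializers, which is the practical meaning of "more than one type of access or format": one canonical data set, several delivery shapes.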
Europeana, Europe’s collaborative program for access to digitized cultural heritage, released linked open metadata about its 20 million digital cultural objects a few days ago. The participating institutions signed a Data Exchange Agreement so they all agree to release the metadata under a Creative Commons Zero licence and to use a standard metadata schema. The public can download the entire data set, or subsets.
Catalogue metadata can be uploaded or harvested by an application such as Viewshare, an open source platform that allows people to look at metadata and digital objects in various views, with no hacking or coding required.
It automatically creates timeline views, pie chart views, and map views. All these views are available with faceting.
More sophisticated hacking has already been done with cultural metadata. Last year, a competition was held in the UK to build services using open data, including some catalogue data, library user activity data, and even OpenURL router data, which logs patron requests for digital academic papers. The winners included a service that links information about musical composers and one that tells you which English outdoor heritage features are near you so you can visit them.
Finally: Turning Analogue Records into Data. We have been digitizing archival records for 15 years, taking analogue records and, until recently, turning them into still or moving images. We want to go further and turn them into data, and there are different approaches we can use depending on the medium. Crowd-sourcing will be necessary for some of this, and we’d love it if libraries could encourage people to do some of this work.
This software is called the Map Warper. We are in discussion with the developers, hoping to roll this out next year. An open source application for georectifying images of old maps, the Map Warper was further developed by the New York Public Library and made easier for public use. We intend that the application will reside on City servers, and we will upload high-resolution scans of our maps to it. Each scan would exist merely as an image until someone wanted to use it. Then they would match known points on the old map with known points on OpenStreetMap, and, if there were enough control points, they would rectify the old map: that is, the map would know where it belongs geographically. Once the map is rectified, the user can save it to the system in common formats, and then others can download the rectified versions.
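Under the hood, rectification means fitting a transform that maps pixel coordinates on the scan to geographic coordinates, using the matched control points. Here is a minimal sketch of that idea with three invented control points and an exact affine fit; real tools like the Map Warper accept more points and use least-squares or polynomial warps, so this is only the simplest case.

```python
# Fit an affine transform  lon = a*x + b*y + c,  lat = d*x + e*y + f
# from three control points mapping (pixel_x, pixel_y) -> (lon, lat).
# Three points determine an affine transform exactly.

def solve3(m, v):
    """Solve a 3x3 linear system m * s = v by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
              - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
              + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    solution = []
    for i in range(3):
        mi = [row[:] for row in m]
        for r in range(3):
            mi[r][i] = v[r]  # replace column i with the right-hand side
        solution.append(det(mi) / d)
    return solution

def fit_affine(control_points):
    """control_points: three ((px, py), (lon, lat)) pairs."""
    m = [[px, py, 1.0] for (px, py), _ in control_points]
    lons = [lon for _, (lon, lat) in control_points]
    lats = [lat for _, (lon, lat) in control_points]
    return solve3(m, lons), solve3(m, lats)

def warp(point, params):
    """Map a pixel coordinate through the fitted transform."""
    (a, b, c), (d, e, f) = params
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)

# Hypothetical control points on a scanned map of Vancouver.
cps = [((0, 0),    (-123.20, 49.30)),
       ((1000, 0), (-123.10, 49.30)),
       ((0, 1000), (-123.20, 49.25))]
params = fit_affine(cps)
```

Once the transform is fitted, every pixel on the scan can be assigned a longitude and latitude, which is what lets the rectified map be exported in standard geographic formats.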
We’d like to make the City Council minutes available as structured digital information. Presently, they’re mostly available as handwritten or typed pages, although later years do exist in various digital formats. This is a huge project and it’s just starting: a project to transcribe the handwritten pages has just begun, and we’re also planning to scan and OCR the typewritten ones. But even when the OCR and transcription are done, it’s still not data; there’s no machine-readable structure, just words.
We’re looking at applying the Akoma Ntoso XML schema. Developed for African parliaments, it is becoming widely used for legislative and parliamentary documents. BC’s Queen’s Printer has developed a tool for marking up documents in this schema, and legislative XML documents were featured in a Victoria hackathon this year. This slide shows a report, viewed in XML and HTML.
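To show what "machine-readable structure" adds over plain transcribed words, here is a sketch that builds an Akoma Ntoso-style fragment for one minutes item. The element names and the sample motion are illustrative only; the real schema defines strict document types and required metadata blocks, so treat this as a flavour of the markup, not a valid instance.

```python
import xml.etree.ElementTree as ET

# Build a simplified, Akoma Ntoso-style fragment for one council minutes item.
# Element and attribute names here are illustrative; the actual schema is stricter.
root = ET.Element("akomaNtoso")
report = ET.SubElement(root, "debateReport")
body = ET.SubElement(report, "debateBody")
section = ET.SubElement(body, "debateSection", {"name": "motions"})

heading = ET.SubElement(section, "heading")
heading.text = "Motion 4: Street lighting"

# A hypothetical speaker reference and motion text.
speech = ET.SubElement(section, "speech", {"by": "#alderman-smith"})
p = ET.SubElement(speech, "p")
p.text = "Moved that the lighting bylaw be amended."

xml_text = ET.tostring(root, encoding="unicode")
```

The gain is that a query can now ask for every motion, or every speech by a given speaker, instead of searching a flat wall of OCR text.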
This is a project of the Dutch Institute for Sound and Vision. They had hundreds of hours of digitized television broadcast footage that they wanted to have tagged. They created an open source, web-based application to allow crowd-sourced tagging.
To encourage both participation and accuracy, they made it a game – here you can see the top scorers.
To play the game, people watch the footage and type what they see, and the program associates each tag with a time code in the video. Then the Institute uses software to analyze the tags and fix errors and inconsistencies; for example, they use Freebase to figure out what some of the tags mean. Maybe this stretches the idea of data being structured information, but I think taking a visual medium and turning it into a structure of time codes and tags counts, and it could be very useful to digital humanities researchers.

To conclude, I think the cultural sector should be careful to use existing standards, even de facto ones, or make sure they can transform their data and metadata into those standards easily, and also to make the licencing as open as possible, because it’s going to be increasingly important that these data sets be interoperable.