Presentation for the CNI (Coalition for Networked Information) Fall Forum, December 2012. Describes Emory University Library’s first-hand experience in interlinking Civil War-related materials and other online resources by leveraging open linked data principles. The library has been actively evaluating linked data’s potential to replace current library processes and services (bibliographic services, finding aids, cataloging, and metadata work) as a more efficient and sustainable means, and one that could bring greater benefit to end users for research and learning. The Library’s initial focus was on workforce education and hands-on learning through real-time experiments: the Connections project was begun to prepare staff to work with linked data, a process that has culminated in a 3-month hands-on pilot to build and convert some data. The pilot introduced the concept to a wide range of staff, including subject liaisons, archivists, metadata librarians, and programmers. Emory’s “silos” of data were interlinked with other open data sources as a way to enhance user discovery and use of library materials on a very limited scale.
Piloting Linked Data to Connect Library and Archive Resources to the New World of Data, and Staff to New Skills
1. Connections:
Piloting linked data to connect library and
archive resources to the new world of data, and
staff to new skills
CNI Fall Meeting, December 11, 2012
Laura Akerman
Metadata Librarian
Robert W. Woodruff Library
Emory University

Zheng (John) Wang
AUL, Digital Access, Resources, and IT
Hesburgh Library
University of Notre Dame
17. Ingredients
• Leader/teacher/evangelist
• Learning group – open to all
  o 2 "classes" a month, 5 months.
• Pilot: 3 months
  o Brainstorming a pilot project
  o Start small
  o Team: programmer, subject liaison, metadata specialists, archivist, digital curator, fellow.
  o 1-3 hrs/week for all but leader
  o A sandbox running Linux
19. (Diagram: pilot architecture)
• Our own triplestore, queried via SPARQL
• RDF converted from EAD (Civil War finding aids), TEI, and MARCXML
• External data: id.loc.gov, DBPedia, rosters (and MARC), data from other archives, National Park Service data, other Civil War data
• User interface: navigation, timelines, maps, crowdsourcing, faculty project
• Integrate linked data into discovery layer (catalog)?
• Redesign metadata creation as RDF
21. Sampling little bites of the meal:
• EAD (starting from ArchivesHub stylesheet)
• id.loc.gov URIs for LC subjects and names (scripted)
• MARCXML (starting from LC DC stylesheet)
• DBPedia/subjects
• Make some RDF metadata (by hand)
• Sesame triplestore
• Visualization – Simile Welkin
22. A few of the connections...
HTTP:OurResourceURL  HasSubject  "Mobley, Thomas"
27. We learned:
Selecting material that will “link up” without SPARQL is too hard!
Even when items are in a unified “discovery layer”, the types of search are limited.
Get it into triples, then find out!
28. We learned:
There are many ways of modeling data
• (No one model to follow has emerged. We have to think about this ourselves.)
30. LC's MARCXML to RDF/Dublin Core:
dc:subject "Geary, John White, 1819-1873."
31. Simile MARC to MODS to RDF:
<modsrdf:subject rdf:resource="http://simile.mit.edu/2006/01/Entity#Geary_John_White_18191873"/>
<rdf:Description rdf:about="http://simile.mit.edu/2006/01/Entity#Geary_John_White_18191873">
  <rdf:type rdf:resource="http://simile.mit.edu/2006/01/ontologies/mods3#Person"/>
  <modsrdf:fullName>Geary, John White</modsrdf:fullName>
  <modsrdf:dates>1819-1873</modsrdf:dates>
</rdf:Description>
32. We learned:
Linked data is HUGE
It’s coming at us FAST
It’s not “cooked” yet
33. More learnings
• We learned more by doing than by "class".
• Making DBPedia mappings or links by hand is
very time consuming! We need better tools.
• We need to spend a lot more time learning
about OWL, and linked data modeling.
34. Challenges
• Easily available tools are not ideal!
• Skills we needed more of: HTML5, CSS,
Javascript
• Time!
• Visualization/killer app not there yet.
• Can't do things without the data! No timeline
if no dates!
35. What we got out of it
Test triplestore for training and more
development
Better ideas on what to pilot next
Convinced some doubters
"Gut knowledge" about triples, SPARQL, scale
Beginning to realize how this can be so much
more than a better way to provide "search"
36. Outside our reach for now
Transform ILS system to use triple store instead of
MARC
Create hub of all data our researchers might want
Make a bank of shared transformations for EAD,
MARC, etc.
Shared vocabulary mappings
Social/networking aspect (e.g. Vivo, OpenSocial...)
- need a culture shift?
37. Next? Maybe...
Build user navigation?
More Civil War triples including other local
institutions’ stuff?
Publishing plan?
Integrate ILS with DBPedia links?
Suite of “portal tools” for scholars?
Use linked data for crowdsourcing metadata?
More classes?
Connect with others at Emory around linked data
38. Recommendation:
Individual Institutions
• Focus on unique digital content
• Publish unique triples
• Reuse existing linked data
40. Recommendation:
Librarians’ Role?
• Interdisciplinary linking?
• Metadata librarians - Linking association and
normalization
41. Acknowledgements
Connections group sponsors: Lars Meyer, John
Ellinger
Connections Pilot team: Laura Akerman (leader), Tim
Bryson, Kim Durante, Kyle Fenton, Bernardo Gomez,
Elizabeth Roke, John Wang
Fellows who joined us: Jong Hwan Lee, Bethany Nash
Our website:
https://scholarblogs.emory.edu/connections/
Laura Akerman, liblna@emory.edu
John Wang, Zheng.Wang.257@nd.edu
"Connections: Piloting linked data to connect library and archive resources to the new world of data, and staff to new skills". Emory University Libraries (EUL) will share their first-hand experience in interlinking Civil War-related materials and other online resources, leveraging Open Linked Data principles. As library linked data are emerging, EUL planned to evaluate linked data's potential to replace current library processes and services (bibliographic services, finding aids, cataloging, and metadata work) as a more efficient and sustainable means, and to understand its payoff for end-user research and learning. We initially focused on workforce education and hands-on learning through real-time experiments. Over the last year, the Connections project has begun to educate staff and prepare them to work with linked data, culminating in a 3-month hands-on pilot to build and convert some data. The pilot introduced the concept to a wide range of staff, including subject liaisons, archivists, metadata librarians, and programmers. We interlinked our "silos" of data with other open data sources as a road to enhancing user discovery and use of our material, on a very limited scale. From our experience (insights, as well as some limitations and stumbling blocks), the group developed better-informed recommendations for more education and staff involvement, as well as for potential incorporation of this technology into library services and workflows. Our experience will be particularly helpful for institutions that are awake to linked data's transformative potential and are making plans. We will share our assessment of the readiness of the entire linked open data ecosystem for libraries to cross-link disciplines, and the possible roles of libraries in a linked world. Beyond that, we suggest some possible routes to help peers involve their staff with this new paradigm of information curation and dissemination.
• Can linked data replace library bibliographic services?
• How to start such an initiative
• Readiness
• Recommendations
  – Education
  – Integration
  – Potential areas of work/roles of librarians and staff
Reduce Duplicative Work (downloading, editing, creating holding records). Shorten Process Time (knowledge linking). Enhance Authority Control (this John Wang is from Emory). Give the Library Universal Attention (Web Scale). Help Libraries Achieve Their Missions.
More efficient than current biblio service tool sets. Don't treat it as an additional thing. Connect to the larger ecosystem. Convert/make some RDF. Show value.
Finished ILS migration. Set up cloud services. Set up biblio vendor services. How we got two division directors to sponsor the initiative.
Now I'm going to talk a little about our experience, and some of the discoveries we made. What we've produced so far isn't that significant, compared to what some other institutions have done... but maybe this is one way to think about involving staff in learning and preparing for using linked data in libraries.
We started having classes toward the end of last year. As John explained, our library had a lot on its plate, and the people whom we wanted to involve are very busy. So we usually had brown bag lunches, every other week. This was a high-level overview... sort of “ABC’s” This is a triple, this is how SPARQL queries work, here’s what OWL can be used for, here are some things about publishing linked data that we need to think about. I worked with a graduate fellow to team-teach these. We had a core group that was asked to attend every time, but a lot of folks from across the library attended. By the time we got to brainstorming about a “pilot project” for the summer, the group had a lot of ideas...
And, we decided to try as many as we could with our pilot.
We chose to center our pilot around a small topic, the American Civil War, because we had some interesting resources and it’s the 150th anniversary... We wanted to show how linked data could link up our metadata “silos”, enhance our unique content, and integrate it with data from other places (manually and automatically). Of course, that would include DBPedia but also other archives, other sources of data specific to our theme, and perhaps even data we converted from other formats. We planned to have cool visualizations like maps and timelines. Maybe our data could contribute to a faculty project on the Battle of Atlanta. Oh, and also, we were going to build an interface to create metadata as “native RDF”. We would choose which data models we wanted to use. And, we would investigate which free or open source tools were most useful for doing this work. All, in 3 months... using enthusiastic but busy people who were not the “A team” (our actual developers). Oh, but we were only going to work on a small sample of our metadata – chosen carefully ahead of time, to “connect up well”...
So, this was very ambitious, but linked data sounded so simple! We were only able to accomplish a fraction of all we’d planned in the 3 months… Investigated Virtuoso and Sesame, and also Callimachus, a new “beta” web framework; we decided to go with Sesame, which had a web client people could use to do SPARQL queries and load things. By the way, our programmer Bernardo Gomez converted a copy of our ILS database to RDF (mostly Dublin Core) – this wasn’t part of the pilot per se, but interesting exploration. Transformed a small number of finding aids using the ArchivesHub stylesheet as a starting point (lots of modifications still needed!). Transformed a subset of MARCXML for some digitized books, using the LC stylesheet as a starting point (experimenting a little with RDA vocabularies, but not getting too far into it). Made some N3 triples by hand, in Notepad, to describe images with no metadata. Included id.loc.gov links and DBPedia links. Retrieved id.loc.gov name/subject URIs via script. Our programmer looked at scripting some links to DBPedia based on our names and subjects but found this too involved to attempt for this pilot. Building a navigation interface turned out to be a bit too much to accomplish in 3 months, but we had some adventures along the way. At the end, a power outage corrupted our Sesame/OWLIM triplestore... so no live demo today, but we can rebuild it. Investigated lots of software (mostly free and open source) for display, navigation, and publishing of linked data. Some of us were interested in using Drupal 7’s linked data capabilities to create a user interface, but we’re not sure this was the application we wanted. We had not planned to publish our data for this pilot! But we came to realize that if we wanted web-based tools such as LinkSailor to navigate our data, we’d have to publish it. We had fun with a few simple visualization tools that we could plug some of the data into directly.
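The hand-made N3/Turtle triples were along these lines. This is a reconstructed sketch, not our actual data: the local resource URI and title are invented placeholders, and the id.loc.gov identifier is illustrative; only the DBPedia resource is a real entity.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# A digitized image that had no metadata record at all (placeholder URI).
<http://example.emory.edu/image/cw-0042>
    dcterms:title   "Unidentified soldier, ca. 1862" ;
    # Subject as an id.loc.gov URI instead of a text string (illustrative id)...
    dcterms:subject <http://id.loc.gov/authorities/subjects/sh85023850> ;
    # ...and as a DBPedia entity, linking us into the wider web of data.
    dcterms:subject <http://dbpedia.org/resource/American_Civil_War> .
```

Even a handful of triples like these gave us something real to load into Sesame and query.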
This sequence illustrates the kind of connections we want to be able to make. I’m using sort of generic terms for the predicates, rather than any particular vocabulary, for simplicity. We go from a name/text string as a subject (in this case a person)
To a URI identifier which we came up with. With the ArchivesHub model we were able to see lots of “coined” URIs for entities such as names associated with an archive. We began to see some wisdom in having our own URIs which we can make assertions about (like, our URL identifies the same person as one from id.loc.gov) without having to assert things about other people’s data... But in this case, we don’t think anyone else has made an identifier for Mr. Mobley so we would have to. The number of URIs we would need to mint was kind of astonishing for us, one of the things we learned is, we need a strategy for this.
We can then assert that he’s a member of a particular Civil War regiment. We have “NACO Authority” strings...
From here we could link to a regimental history in our collection. And, if our URI for the regiment was linked to a DBPedia entity, we could link to whatever information Wikipedia has on it and navigate to other regiments and much more. And who knows what other data might link to the DBPedia entity?
Or, a user could explore other material in that MARBL manuscript collection, or in any other collection that had material on that regiment, or the Civil War...
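Sketched in Turtle, with the same deliberately generic predicates as on the slides (not any particular vocabulary) and with placeholder URIs throughout, the whole chain looks something like:

```turtle
@prefix ex:  <http://example.emory.edu/vocab/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# 1. The converted finding aid gives us only a text string...
<http://example.emory.edu/findingaid/123> ex:hasSubject "Mobley, Thomas" .

# 2. ...which we upgrade to a URI we minted ourselves, one we can
#    make assertions about without asserting things about others' data.
<http://example.emory.edu/findingaid/123>
    ex:hasSubject <http://example.emory.edu/person/mobley-thomas> .

# 3. Assert that the person is a member of a particular regiment...
<http://example.emory.edu/person/mobley-thomas>
    ex:memberOf <http://example.emory.edu/unit/georgia-regiment> .

# 4. ...and link the regiment to DBPedia (placeholder resource name),
#    opening navigation to Wikipedia content and anything else
#    that links to that entity.
<http://example.emory.edu/unit/georgia-regiment>
    owl:sameAs <http://dbpedia.org/resource/Some_Georgia_Regiment> .
```

Once step 4 exists, the same triples also work in reverse: anyone who finds the DBPedia entity can navigate back to our collections.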
So, what we learned on our summer non-vacation... We spent too much time trying to select specific records to convert for our pilot. In the end, we loaded all our regimental histories and a subset of our finding aids, and a SPARQL query told us which ones had common subject headings. SPARQL is a skill that I think many librarians could start learning, by the way. There are plenty of SPARQL endpoints...
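A query along these lines is what told us which records shared headings. This is a sketch against a generic dc:subject mapping, not our exact data model:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

# Which pairs of loaded resources share a subject heading?
SELECT DISTINCT ?history ?findingAid ?subject
WHERE {
  ?history    dc:subject ?subject .
  ?findingAid dc:subject ?subject .
  FILTER (?history != ?findingAid)
}
```

The point of "get it into triples, then find out" is exactly this: the join across silos is one FILTER away, instead of a manual record-selection project.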
When we contrast ArchivesHub's "associatedWith" construction to express concepts - in this case, a person - with archives that have material about them,
With the very simple mapping to Dublin Core,
To Simile's MARC to MODS to RDF approach... And I haven't had time to play with the conversion the BIBFRAME project has come up with, yet, or you'd get a slide of that! You can see that there are a lot of choices. What we wondered was, were there enough similarities in the relationships that we could find some common models and vocabularies across our data? That would make querying easier...
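To make the contrast concrete, here are the three shapes side by side for the same heading, as Turtle. These are reconstructed sketches: the record and person URIs are placeholders, and the ArchivesHub-style triple in particular paraphrases the idea rather than reproducing the stylesheet's exact output.

```turtle
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix ex:      <http://example.org/vocab/> .
@prefix modsrdf: <http://simile.mit.edu/2006/01/ontologies/mods3#> .

# 1. LC MARCXML -> Dublin Core: the heading stays an opaque text string.
<http://example.org/record/1> dc:subject "Geary, John White, 1819-1873." .

# 2. ArchivesHub-style (paraphrased): a coined URI for the person,
#    associated with the archival material.
<http://example.org/record/1>
    ex:associatedWith <http://example.org/person/geary-john-white> .

# 3. Simile MARC -> MODS -> RDF: a typed entity with structured parts.
<http://example.org/record/1> modsrdf:subject
    <http://simile.mit.edu/2006/01/Entity#Geary_John_White_18191873> .
<http://simile.mit.edu/2006/01/Entity#Geary_John_White_18191873>
    a modsrdf:Person ;
    modsrdf:fullName "Geary, John White" ;
    modsrdf:dates "1819-1873" .
```

Only shapes 2 and 3 give you a node you can link onward; shape 1 dead-ends in a string, which is why querying across the three models is hard.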
From my own perspective, it can be a bit overwhelming to follow linked data just now... so much to learn, so much happening. But I think you just have to dive in. Since this has been a major focus for me this year, and I've followed so many email lists and tried to keep up with projects here and in Europe, and am starting to feel a sense of urgency about this - we need to be on board. At the same time, there are aspects of the "how" that are still unclear and difficult - provenance, tools, vocabulary mappings are just a few.
So we learned a lot more in the pilot than in class. People were more engaged because they were doing an assignment they came up with. Those of us that worked with DBPedia could see real possibilities, both as a means to link to Wikipedia content and as a vocabulary in itself. String matching LC subjects didn't work very well - this needs to be a larger project - maybe it's already happening? using algorithms but also, we think, some human review. By larger, I mean, community effort. Who's doing it?
We had challenges finding and using tools (especially the non-programmers); how do we find what’s new, what’s good, and what do we need to build ourselves? I had been warned that there weren’t really good tools for non-programmers out there... but it was interesting how many new tools appeared in beta just in the 3 months we were doing the pilot. We just ran out of time to try them all. The pilot made some of us painfully aware that our web skills were out of date. Most of the group expressed regret that they didn't have more time to get involved in this project, but felt they got a lot out of it anyway. In our discussions we recognized a conflict between the desire to create more of our metadata as data, to provide more hooks, and the reality that we have limited staff working at capacity... we talked about crowdsourcing. We also need to explore how this would change the tools we are using to create metadata. Is it possible to make it easy to make more links?
One of our members said at one point, "this is really like a relational database, just not with tables" and from his perspective there's some truth to that, but we are starting to see that we can do way more with this than replicate what we're doing with relational databases, MARC, and XML. Linked data is not just all about “search”. We can make discoveries about our collections as a whole, but we can also link our content to the "things" it relates to and really weave them into the research environment for scholars, such as the articles appearing in UniProt and other scientific databases... As we look back, although we don't have a killer app yet, we've gotten a lot out of the last 3 months. We have our test triplestore and can begin to expand it bit by bit towards realizing some of our grand schemes... but we also have other ideas.
We also get a sense of our limitations. Some developments really call for big communities, maybe global effort. Who is going to host banks of shared transformations and vocabulary mappings? Some of us are interested in the social tools but our library isn't ready for that right now, however we can begin to feel out our faculty and students about their interest.
Our sponsors haven’t made decisions on where we go next. I’m pretty sure we’re not in a position yet to invest more staff time, but: I think many of our original ambitions for the Civil War pilot could be achieved if we can continue at a slow pace, one step at a time. There’s also some interest in at least demoing, interlinking our Primo discovery layer with DBPedia. We want to continue learning and broaden the participation of staff at the library, coordinate with more people in our Systems division - We know of at least one faculty digital scholarship project that our programmers are involved in that uses linked data, and we'd like to open our information sharing group to others at the University.
Management (learning and experience). Technical and learning aspects: different publishing methods; technology readiness; ecosystem readiness. Users' perspective on what they get from the library and how they might use the data. Who should learn, and who should be in the conversation.
(enable linking and creation of linked data)
Given that the “infrastructure” of global LD isn’t “mature” yet, why not wait for big players to sort it out? What can we do now? (Our project was an attempt to answer.) Our library is “pinched” for staff time – what can we do? Who in your organization do you get involved in learning/transition, and when? (Our project started from systems and “tech services,” but public services folks came in and we discovered we need them – everybody!) Is LD only “big data” – or is small data a part? How can we get data (metadata) in RDF when we don’t have it? Standardization? Who decides? Tools for everyone! Who will build? Where is the community? We need X..... Big jobs – e.g. linking LCSH to Wikipedia. Share info on tools (DLF Zot group – no traction – what would work?)