This slide deck describes building an archival identity management network to transform archival practice and historical research. The project draws name data from 150,000 finding aids and other sources to build a prototype system with historical information, social networks, and links to archival resources. Challenges include merging name records from different sources and handling variant names and variant forms of the same name. The project aims to reduce the proliferation and inconsistency of name data and so improve access to archival collections.
2. Building an Archival Identity
Management Network: Transforming
Archival Practice and Historical
Research
Daniel Pitti* and Brian Tingle**
* Institute for Advanced Technology in the Humanities
** California Digital Library
Thanks to Ray R. Larson of the University of California, Berkeley, School of Information
for many of the slides here
12/11/12
2012-11-04 - SLIDE
3. Funding and People
• Funding and Timeline
– National Endowment for the Humanities
– May 2010-April 2012
– Andrew W. Mellon Foundation
– May 2012-April 2014
• People
– Daniel Pitti (PI) and Worthy Martin (Institute for Advanced
Technology in the Humanities, University of Virginia)
– Adrian Turner and Brian Tingle (California Digital Library,
University of California)
– Ray Larson (School of Information, University of California,
Berkeley)
4. The Source Data
• EAD-encoded finding aids (guides to archival
records)
– 150K
– Primarily from U.S. sources, but also U.K. and
France
• Archival authority records (360K)
– National Archives and Records Administration
– State Archive of New York
– Smithsonian Institution
– British Library
– National Archives (France) & BnF
• WorldCat Archival Descriptions: 2M
5. Library and Museum Authority Records
• Getty Vocabulary Program: Union List of
Artist Names (293K personal and corporate
names)
• Virtual International Authority File (16M+
cluster records)
– Contributed from around the world by national
libraries and others
7. Methods and Processing
• Extract EAC-CPF records from existing EAD-
encoded archival descriptions
– Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another and
against existing authority records (ULAN, VIAF,
LCNAF)
– Enhance EAC-CPF by normalizing entries, adding
alternative entries, titles (VIAF), and historical data
(ULAN)
• Create a prototype historical resource and access
system
– Historical data and social-professional networks
– Links to archive, library, and museum resources (by
and about)
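The extraction step can be sketched in a few lines of Python. This is a minimal illustration, not the project's pipeline: it assumes the finding aid is well-formed XML using the uppercase tag names of the example record, and the `extract_names` helper, the `role` labels, and the tagged correspondent in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# EAD name elements mapped to CPF entity types (corporate body, person, family).
NAME_TAGS = {"CORPNAME": "corporateBody", "PERSNAME": "person", "FAMNAME": "family"}

# Hypothetical, cut-down finding aid modeled on the Tabley Muniments example.
SAMPLE = """<ARCHDESC>
  <DID>
    <ORIGINATION>
      <FAMNAME>Warren, family, of Tabley, Cheshire</FAMNAME>
      <PERSNAME>Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet</PERSNAME>
    </ORIGINATION>
  </DID>
  <SCOPECONTENT>
    <P>Correspondents include <PERSNAME>Edmund Gosse</PERSNAME>.</P>
  </SCOPECONTENT>
</ARCHDESC>"""


def extract_names(xml_text):
    """Return every CPF name, marked as a creator or a referenced name."""
    root = ET.fromstring(xml_text)
    # Names nested inside ORIGINATION are treated as creators.
    creator_ids = {
        id(el)
        for origination in root.iter("ORIGINATION")
        for el in origination.iter()
        if el.tag in NAME_TAGS
    }
    names = []
    for el in root.iter():
        if el.tag in NAME_TAGS:
            text = " ".join("".join(el.itertext()).split())  # collapse whitespace
            role = "creator" if id(el) in creator_ids else "referenced"
            names.append({"name": text, "type": NAME_TAGS[el.tag], "role": role})
    return names


names = extract_names(SAMPLE)
```

Each resulting dictionary would be the seed of one EAC-CPF record; real finding aids vary in case, namespaces, and nesting, so a production extractor needs more tolerant parsing.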
8. Example EAD Record (Hub)
<EAD>
<EADHEADER LANGENCODING = "ISO 639">
<EADID>
GB 0133 TAB
</EADID>
<FILEDESC>
<TITLESTMT>
<TITLEPROPER>
Tabley Muniments
</TITLEPROPER>
</TITLESTMT>
<PUBLICATIONSTMT>
<PUBLISHER>
John Rylands University Library of Manchester
</PUBLISHER>
<ADDRESS>
<ADDRESSLINE>
150 Deansgate
</ADDRESSLINE>
<ADDRESSLINE>
Manchester
</ADDRESSLINE>
<ADDRESSLINE>
... (Parts removed )…
</FRONTMATTER>
<ARCHDESC LEVEL = "FONDS" LANGMATERIAL = "English">
<DID>
<REPOSITORY>
University of Manchester, John Rylands University Library of Manchester
</REPOSITORY>
<UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" REPOSITORYCODE = "0133">
GB 0133 TAB
</UNITID>
<UNITTITLE LABEL = "Title" ENCODINGANALOG = "ISADG3.1.2.">
Tabley Muniments
</UNITTITLE>
<UNITDATE LABEL = "Dates of Creation" ENCODINGANALOG = "ISADG3.1.3.">
19th century
</UNITDATE>
<PHYSDESC LABEL = "Extent" ENCODINGANALOG = "ISADG3.1.5.">
<EXTENT>
1.24 cu.m
</EXTENT>
</PHYSDESC>
<ORIGINATION LABEL = "Creator" ENCODINGANALOG = "ISADG3.2.1.">
<FAMNAME SOURCE = "NCARULES">
Warren, family, of Tabley, Cheshire
</FAMNAME>
<PERSNAME SOURCE = "NCARULES">
Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet
</PERSNAME>
</ORIGINATION>
</DID>
9. Example EAD Record (Hub)
<BIOGHIST ENCODINGANALOG = "ISADG3.2.2.">
<HEAD>
Administrative/Biographical History
</HEAD>
<P>
The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire,
was born in 1835, the son of the 2nd Baron de Tabley (1811-1887), and his wife, Catherina. His mother was Italian,
the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He
was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he
published under the pseudonyms George F. Preston (1859-1862) and William Lancaster (1863-1868), but latterly
under his own name.
</P>
<P>
His early verse included
<TITLE>
Praeterita
</TITLE>
(1863),
<TITLE>
Eclogues and Monodramas
</TITLE>
(1864),
<TITLE>
Studies in Verse
</TITLE>
(1865),
<TITLE>
Philoctetes
</TITLE>
(1866), and
<TITLE>
Orestes
</TITLE>
(1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and
Swinburne. In 1873 he produced …. (some data removed)…
10. Example EAD Record (Hub)
<SCOPECONTENT ENCODINGANALOG = "ISADG3.3.1.">
<HEAD>
Scope and Content
</HEAD>
<P>
The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in
literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian
figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates.
</P>
<P>
Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert
Bridges. There are volumes of Tabley's essays and verse, as well as a considerable number of notebooks and
loose manuscripts of verse and other writings. There are various bundles and boxes relating to
"Coins", "Botany", "Poetry", "Literary", "Financial"
and bookplates.
</P>
</SCOPECONTENT>
<ADD>
<OTHERFINDAID ENCODINGANALOG = "ISADG3.4.6.">
<P>
Preliminary survey list.
</P>
</OTHERFINDAID>
<RELATEDMATERIAL ENCODINGANALOG = "ISADG3.5.3.">
<P>
There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM.
The Library also has custody of the important Tabley Book Collection.
</P>
</RELATEDMATERIAL>
<SEPARATEDMATERIAL>
<P>
The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record
Office. Some of these papers were originally in the custody of the John Rylands University Library
of Manchester.
</P>
</SEPARATEDMATERIAL>
</ADD>
12. 2010-2012 Extraction Results
• Source data: 30,000 finding aids
• EAC-CPF records extracted
– LoC: 43,702 from 1,159 finding aids
– OAC: 91,811 from ~15,400
– NWDA: 22,609 from 5,160
– VH: 15,175 from 8,390
– Total 173,297
13. Phase II preliminary results
• Unmerged SIA Henry Correspondence
– 32,988 names
• Unmerged WorldCat MARC
– 4,548,270 names
15. The Problem
• Proliferation of the forms of names
– Different names for the same person
– Different people with the same names
• Examples
– from Books in Print (semi-controlled but not
consistent)
– ERIC author index (not controlled)
18. Library and Archive Authority Control
• Library (or bibliographic) authority control is almost
exclusively about the control of names
• Archival identity control involves biographical-
historical description of the CPF entity
– Descriptions based on controlled vocabularies, for
example, occupations, place of birth and death
– But also biographical-historical description
• Prose
• Chronological list
• Archival authority control provides context for
understanding records, the context of their creation,
the provenance
19. Merging EAC-CPF Records
[Diagram: merging pipeline]
• Authority sources, queried through Cheshire Search: LCNAF Repository, VIAF Repository, ULAN Repository
• Processing steps: Connect exactly matching records; Connect records using name authority information; Merge
• Data stores: EAC Repository; Repository of connected EAC Records; Repository of merged EAC Records (MongoDB)
20. Merging EAC-CPF Records
[Diagram: merging pipeline]
• Authority source, queried through Cheshire Search: VIAF Repository
• Processing steps: Connect exactly matching records; Connect records using name authority information; Merge
• Data stores: EAC Repository; Repository of connected EAC Records; Repository of merged EAC Records (MongoDB)
21. Connect Exact Matches
• The EAC-CPF records provide the names
without having to parse texts, etc.
• Allows us to use some simple methods like
exact matching
– Assume identical name entries mean the
same person/corporate body/family
– Enter the full names and record IDs into a
database and flag IDs with same names for
merging
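The exact-match pass can be sketched as below. It is a minimal illustration that uses an in-memory dictionary in place of the database the slide mentions; the `flag_exact_matches` helper and the record IDs are hypothetical.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry; keep names seen 2+ times."""
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    # Only names shared by more than one record are flagged for merging.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical record IDs; the name strings come from the examples in this deck.
records = [
    ("loc-001", "Taylor, Zachary, 1784-1850."),
    ("oac-417", "Taylor, Zachary, 1784-1850."),
    ("nwda-09", "Zachary Taylor."),
]
flagged = flag_exact_matches(records)
```

Here the two identical entries are flagged together, while the uninverted form is left unmatched, which is the limitation the following slides discuss.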
22. But…
• Exact merging assumes that archives are
following LC cataloging practice in their
EAD records
– There are some problems with this assumption
23. Some failures for merging…
• Different abbreviations:
– A. & G. Carisch & C.
– A. & G. Carisch & Co.
• And spacing issues:
– A. C. Peters & Bro.
– A. C. Peters & Brother.
– A. C. Peters. (??)
– A. C.Peters & Bro.
• Completeness and alternate rules
– Tabb, John B. (John Banister), 1845-1909.
– Tabb, John Banister, 1845-1909.
• Also differing transliterations for non-Latin scripts
24. More…
• Variant romanizations (and spacing):
– M. P. Belaieff.
– M. P. Belaïeff.
– M. P. Bieliaev.
– M.P. Belaïeff.
– M.P.Belaïeff.
• Initials vs. names:
– Zabolotskii, N.A.
– Zabolotskii, Nikolai Alekseevich, 1903-1958.
– Zabolotskii.
25. More…
• Inverted order vs. uninverted
– Taylor, Zachary, 1784-1850.
– Zachary Taylor.
• Various combinations:
– Tchaikovsky, Peter I.
– Tchaikovsky, Pëtr Il.
– Tchaikovsky, Piotr Ilyich.
– Tchaikovsky, Pyotr Il.
– Tchaikovsky, Pyotr Ilyich.
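Some of these failures are purely typographic, and a light normalization pass before exact matching would catch them. The sketch below is an illustration only, not the project's normalization rules: the `normalize_name` helper is hypothetical, and it deliberately handles just case, diacritics, punctuation, and spacing.

```python
import re
import unicodedata


def normalize_name(name):
    """Fold case, strip diacritics, and collapse punctuation and spacing."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "ï" compares equal to "i".
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Turn every run of punctuation or whitespace (except "&") into one space.
    spaced = re.sub(r"[^\w&]+", " ", no_marks.casefold())
    return " ".join(spaced.split())


# Spacing and diacritic variants collapse to the same key...
same_spacing = normalize_name("M.P.Belaïeff.") == normalize_name("M. P. Belaieff.")
# ...but a different romanization or an inverted name order does not.
same_romanization = normalize_name("M. P. Belaieff.") == normalize_name("M. P. Bieliaev.")
same_order = normalize_name("Zachary Taylor.") == normalize_name("Taylor, Zachary, 1784-1850.")
```

Romanization, abbreviation, and name-order variants survive this kind of normalization, which is why the later stages consult authority files and fuzzier matching.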
27. Search Authority Files
• For each name, formulate a search of the
VIAF database using the Cheshire system
(SGML/XML retrieval system with
probabilistic and Boolean matching)
– Search both the “authoritative” and “non-
authoritative” forms
– Consider any name matching a non-
authoritative form to be a candidate match for
the authoritative form
– Flag EAC records that match the same
authority record as potential matches
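The flagging logic can be sketched as follows. This is a simplified stand-in: a dictionary lookup replaces Cheshire's probabilistic and Boolean retrieval, the authority records and all IDs are hypothetical, and the choice of which form counts as authoritative is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical authority records built from name forms shown in this deck.
AUTHORITY = {
    "auth-1": {
        "authoritative": "Tabb, John B. (John Banister), 1845-1909.",
        "variants": ["Tabb, John Banister, 1845-1909."],
    },
    "auth-2": {
        "authoritative": "Zabolotskii, Nikolai Alekseevich, 1903-1958.",
        "variants": ["Zabolotskii, N.A."],
    },
}


def build_index(authority):
    """Map every authoritative and non-authoritative form to its record ID."""
    index = {}
    for auth_id, record in authority.items():
        for form in [record["authoritative"], *record["variants"]]:
            index[form.casefold()] = auth_id
    return index


def flag_candidates(eac_records, index):
    """Flag EAC records that resolve to the same authority record."""
    by_authority = defaultdict(list)
    for record_id, name in eac_records:
        auth_id = index.get(name.casefold())
        if auth_id is not None:
            by_authority[auth_id].append(record_id)
    return {a: ids for a, ids in by_authority.items() if len(ids) > 1}


eac_records = [
    ("eac-1", "Tabb, John B. (John Banister), 1845-1909."),
    ("eac-2", "Tabb, John Banister, 1845-1909."),
    ("eac-3", "Zabolotskii, N.A."),
    ("eac-4", "Taylor, Zachary, 1784-1850."),
]
candidates = flag_candidates(eac_records, build_index(AUTHORITY))
```

The two Tabb records are flagged because one matches the authoritative form and the other a variant of the same authority record; records with no match, or with no partner, are left alone.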
28. NGRAM or Shingle Matching
Name: Einstein Albert
Shingle sequence: ein, ins, nst, ste, tei, ein … , ert
The probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein
Shingle Language Model for names
Krishna Janakiraman and Sean Marimpietri - Biograph
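Shingling itself is easy to sketch. The example below produces three-letter shingles per word and compares names by Jaccard overlap of their shingle sets; this is a simpler measure swapped in for illustration, not the shingle language model credited above, and the `shingles` and `jaccard` helpers are hypothetical.

```python
def shingles(name, n=3):
    """Return the overlapping n-letter shingles of each word, in order."""
    out = []
    for token in name.casefold().split():
        out.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return out


def jaccard(a, b):
    """Share of distinct shingles that two names have in common."""
    set_a, set_b = set(shingles(a)), set(shingles(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


reordered = jaccard("Einstein Albert", "Albert Einstein")
romanized = jaccard("Einstein Albert", "Ainshtain Albert")
```

Because shingles are taken per word, the inverted name scores as identical, while the variant romanization still shares a substantial fraction of its shingles; a threshold on that score would be one way to propose candidate matches.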
29. Name 1: Einstein Albert; Name 2: Ainshtain Albert; Name 3: Albert Einstein
[Figure: one graph per name, with the name's three-letter shingles (ein, ins, nst, ste, tei, alb, lbe, ert, and for Name 2 ain, nsh, sht, hta, tai) as nodes]
Shingle Language Model for names
Krishna Janakiraman and Sean Marimpietri - Biograph
31. Merge Flagged Records
• For all of the exact matches and authority
matches
– Use the Authoritative form of the name
– Combine data from each match into a single
EAC-CPF record
– Retain all source record IDs and information
• Finally, output the merged EAC-CPF
records
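A merge along these lines might look like the sketch below. The dictionary fields (`alternative_names`, `source_ids`, `resources`) are hypothetical stand-ins rather than EAC-CPF element names, and the record IDs and resource labels are invented for the example.

```python
def merge_records(authoritative_name, records):
    """Combine flagged records into one, keeping every source ID."""
    merged = {
        "name": authoritative_name,
        "alternative_names": [],
        "source_ids": [],
        "resources": [],
    }
    for record in records:
        merged["source_ids"].append(record["id"])  # retain provenance
        name = record["name"]
        if name != authoritative_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        for resource in record.get("resources", []):
            if resource not in merged["resources"]:
                merged["resources"].append(resource)
    return merged


flagged = [
    {"id": "eac-1", "name": "Tabb, John B. (John Banister), 1845-1909.",
     "resources": ["finding-aid-A"]},
    {"id": "eac-2", "name": "Tabb, John Banister, 1845-1909.",
     "resources": ["finding-aid-A", "finding-aid-B"]},
]
merged = merge_records("Tabb, John B. (John Banister), 1845-1909.", flagged)
```

Keeping the source IDs means a merge can later be audited or undone, which matters once editing and "forced" merges are supported.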
32. Inputs to SNAC merging
• LoC: 43,702 EAC-CPF records derived from 1,159
finding aids
• OAC: 91,814 EAC-CPF records derived from
~15,400 finding aids
• NWDA: 24,952 EAC-CPF records derived from
5,568 finding aids
• VH: 15,175 EAC-CPF records
• Total: 175,688 Input EAC records for merging
• Result: 128,781 “unique” names
33. Another view of the numbers…
• 95,624 Person names merged from
125,555 Person records
• 31,287 Institutions merged from 47,189
Institution records
• 1,980 Families merged from 2,899 Family
records
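The per-type figures on this slide can be tallied to see how much each entity type shrank in merging. The sketch uses only the counts given here; the variable names are illustrative.

```python
# (input records, merged names) per entity type, as given on this slide.
counts = {
    "person": (125_555, 95_624),
    "institution": (47_189, 31_287),
    "family": (2_899, 1_980),
}

total_records = sum(records for records, _ in counts.values())
total_merged = sum(merged for _, merged in counts.values())

# Fraction of input records eliminated by merging, per entity type.
reduction = {
    kind: 1 - merged / records
    for kind, (records, merged) in counts.items()
}

for kind, share in reduction.items():
    print(f"{kind}: {share:.1%} fewer entries after merging")
```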
34. Merging Conclusions
• There will not be a single merging method,
but a staged set of approaches that will
allow us to go from the simplest exact
matches, to (we hope) reliably identifying
various variant forms of a name, etc. when
corroborated by contextual (date, etc.)
information
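A staged approach with date corroboration could be sketched as below. The two stages, the `staged_match` helper, and the rule "same surname plus identical life dates" are assumptions chosen for illustration, not the project's actual criteria.

```python
import re

LIFE_DATES = re.compile(r"\b(\d{4})-(\d{4})\b")


def life_dates(name):
    """Return (birth, death) years found in a name entry, or None."""
    match = LIFE_DATES.search(name)
    return (int(match[1]), int(match[2])) if match else None


def surname(name):
    """Text before the first comma, case-folded (assumes inverted order)."""
    return name.split(",")[0].strip().casefold()


def staged_match(a, b):
    """Stage 1: exact match. Stage 2: same surname, corroborated by dates."""
    if a == b:
        return "exact"
    dates_a, dates_b = life_dates(a), life_dates(b)
    if surname(a) == surname(b) and dates_a is not None and dates_a == dates_b:
        return "variant-corroborated"
    return None  # not enough evidence to merge


tabb = staged_match("Tabb, John B. (John Banister), 1845-1909.",
                    "Tabb, John Banister, 1845-1909.")
zabolotskii = staged_match("Zabolotskii, N.A.",
                           "Zabolotskii, Nikolai Alekseevich, 1903-1958.")
```

The Tabb variants merge because their dates agree, while the initials-only Zabolotskii entry is held back: without dates there is nothing to corroborate the match.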
35. Next
• Developing an updateable database of
merged EAC data (dumping Mongo for
PostgreSQL)
– Will permit incremental addition of new data
and support editing and “forced” merges
• Process the 2M WorldCat archival
descriptions
• Process the 150,000 finding aids
• Convert several hundred thousand archival
authority records into EAC-CPF and
run them through the match/merge process
37. Outline
• User Persona
• Search and Display
• Network graph visualization
• Linked Data / RDF
• Future Plans
38. Meet the target users
Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or
product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
• Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families
and networks. Sometimes he comes to the site looking for information on specific people; other times he is
looking for information on a specific subject or event. He also TAs an undergraduate history class and
sometimes has to help students find topics for papers.
• Connie: Works at an institution that contributed records to the project. Will be asking
how this site would be useful to the institution's users. Wants to understand how its records were used
and what the added value is.
• Quincy: Library School Student working to QA record matching.
• Adele: Person doing authority work during collection processing.
• Lenny: Likes linked data, and wants to be able to mine the links that have been established
programmatically.
68. Outline
• User Persona
• Search and Display
• Network graph visualization
• Context widget (needs new name)
• Linked Data / RDF
• Future Plans
69. TinkerPop graph database stack
• Simple "property graph" model
• "JDBC for graph databases" [SNAC is using Neo4j
for the graphDB]
• XPath-like "Gremlin" for graph query
• REST interfaces with "Rexster"
• For me, this was 10 to 100 times easier than using
RDF
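The property graph model is simple enough to sketch in plain Python: vertices and edges both carry key-value properties, and edges are directed and labeled. This stand-in does not use the TinkerPop, Gremlin, Rexster, or Neo4j APIs; the `PropertyGraph` class, the vertex IDs, and the `correspondedWith` label are hypothetical.

```python
class PropertyGraph:
    """Minimal in-memory property graph: labeled, directed edges."""

    def __init__(self):
        self.vertices = {}  # vertex id -> property dict
        self.edges = []     # (out_id, label, in_id, property dict)

    def add_vertex(self, vertex_id, **props):
        self.vertices[vertex_id] = props

    def add_edge(self, out_id, label, in_id, **props):
        self.edges.append((out_id, label, in_id, props))

    def out(self, vertex_id, label=None):
        """IDs reachable over outgoing edges, optionally filtered by label."""
        return [
            in_id
            for out_id, edge_label, in_id, _ in self.edges
            if out_id == vertex_id and (label is None or edge_label == label)
        ]


# Correspondents named in the Tabley Muniments scope note.
g = PropertyGraph()
g.add_vertex("v1", name="Warren, John Byrne Leicester, 1835-1895", entity_type="person")
g.add_vertex("v2", name="Edmund Gosse", entity_type="person")
g.add_vertex("v3", name="Robert Bridges", entity_type="person")
g.add_edge("v1", "correspondedWith", "v2", source="finding aid")
g.add_edge("v1", "correspondedWith", "v3", source="finding aid")

correspondents = [g.vertices[v]["name"] for v in g.out("v1", "correspondedWith")]
```

A one-hop traversal like `out` is the kind of query a graph query language expresses as a short path expression.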
80. What is Linked Open Data?
• W3C Semantic Web Technology Stack
• Web of atomized data, not a web of documents
• RDF; OWL ontologies; SPARQL queries; triple/quad/quint
stores
• httpRange-14; content negotiation; CURIE
• No restrictions on data use; free and easy license
• Lenny wants it, but does Randy?
81. What is Linked Open Data?
• Getting to the good stuff
• Blue underlined text
• Pulling in data from multiple sources, in an
intelligent way, into a "document"
• Understand and discover relationships
• Open access for research, education, private study
and other fair use
91. My opinion on the use cases for W3C RDF
tech
• Good for publishing data
• Good for controlled vocabularies
• Data models?
• Most people with open source RDF-store type
systems do the real stuff with Solr
• Consider a graph database
93. Outline
• User Persona
• Search and Display
• Linked Data / RDF
• Network graph visualization
• Future Plans
94. Future Plans
• Conduct assessment activities involving members of
target audiences to establish mental model of users for
design work
• Scale interface to millions of names
• Visualizations useful and integrated (network and
geospatial)
• Stable URLs between batches for linked data
• Social and personalization features (gateway to
crowdsourcing)
• Integration with local systems (such as with the context
In order of importance, Lenny the link head is last.
So, this is what happens when you let the programmer design the user interface.
In phase two, Rachel Hu, CDL's user experience designer in our in-house assessment group, will be helping.
Hopefully this is where the user will focus.
A-Z browse
Featured items on home page (rather than 0-9).
Note the tabs to limit by record type.
Also note the subject and occupation facets.
Person
Advanced search hides, allows
On other browsers, hierarchy represented graphically.
Advanced search help
Autocomplete
Search results for Oppenheimer
View EAD Report data issue link has been added back.
Will come back to the radial graph demo.
Sometimes the related resources will come from the EAD, but most of these are from VIAF.
This whole section is hard to use when there are lots of related items.
This was the first iteration of the graph visualization.