Building an Archival Identity Management Network: Transforming Archival Practice and Historical Research

Daniel Pitti* and Brian Tingle**
* Institute for Advanced Technology in the Humanities
** California Digital Library

Thanks to Ray R. Larson of the University of California, Berkeley, School of Information, for many of the slides here




                                                                                                      12/11/12

                                                                                        2012-11-04 - SLIDE
Funding and People
• Funding and Timeline
   – National Endowment for the Humanities: May 2010-April 2012
   – Andrew W. Mellon Foundation: May 2012-April 2014
• People
   – Daniel Pitti (PI) and Worthy Martin (Institute for Advanced
     Technology in the Humanities, University of Virginia)
   – Adrian Turner and Brian Tingle (California Digital Library,
     University of California)
   – Ray Larson (School of Information, University of California,
     Berkeley)



The Source Data
• EAD-encoded finding aids (guides to archival
  records)
  – 150K
  – Primarily from U.S. sources, but also U.K. and
    France
• Archival authority records (360K)
  –   National Archives and Records Administration
  –   State Archive of New York
  –   Smithsonian Institution
  –   British Library
  –   National Archives (France) & BnF
• WorldCat Archival Descriptions: 2M

Library and Museum Authority Records

• Getty Vocabulary Program: Union List of
  Artist Names (293K personal and corporate
  names)
• Virtual International Authority File (16M+
  cluster records)
  – Contributed from around the world by national
    libraries and others




Methods and Processing
• Extract EAC-CPF records from existing EAD-
  encoded archival descriptions
  – Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another and
  against existing authority records (ULAN, VIAF,
  LCNAF)
  – Enhance EAC-CPF by normalizing entries, adding
    alternative entries, titles (VIAF), and historical data
    (ULAN)
• Create a prototype historical resource and access
  system
  – Historical data and social-professional networks
  – Links to archive, library, and museum resources (by
    and about)


Example EAD Record (Hub)
<EAD>
 <EADHEADER LANGENCODING = "ISO 639">
  <EADID>
GB 0133 TAB
  </EADID>
  <FILEDESC>
   <TITLESTMT>
    <TITLEPROPER>
Tabley Muniments
    </TITLEPROPER>
   </TITLESTMT>
   <PUBLICATIONSTMT>
    <PUBLISHER>
John Rylands University Library of Manchester
    </PUBLISHER>
    <ADDRESS>
     <ADDRESSLINE>
150 Deansgate
     </ADDRESSLINE>
     <ADDRESSLINE>
Manchester
     </ADDRESSLINE>
     <ADDRESSLINE>
... (Parts removed )…
 </FRONTMATTER>
 <ARCHDESC LEVEL = "FONDS" LANGMATERIAL = "English">
  <DID>
   <REPOSITORY>
University of Manchester, John Rylands University Library of Manchester
   </REPOSITORY>
   <UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" REPOSITORYCODE = "0133">
GB 0133 TAB
   </UNITID>
   <UNITTITLE LABEL = "Title" ENCODINGANALOG = "ISADG3.1.2.">
Tabley Muniments
   </UNITTITLE>
   <UNITDATE LABEL = "Dates of Creation" ENCODINGANALOG = "ISADG3.1.3.">
19th century
   </UNITDATE>
   <PHYSDESC LABEL = "Extent" ENCODINGANALOG = "ISADG3.1.5.">
    <EXTENT>
1.24 cu.m
    </EXTENT>
   </PHYSDESC>
   <ORIGINATION LABEL = "Creator" ENCODINGANALOG = "ISADG3.2.1.">
    <FAMNAME SOURCE = "NCARULES">
Warren, family, of Tabley, Cheshire
    </FAMNAME>
    <PERSNAME SOURCE = "NCARULES">
Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet
    </PERSNAME>
   </ORIGINATION>
  </DID>

Example EAD Record (Hub)
  <BIOGHIST ENCODINGANALOG = "ISADG3.2.2.">
    <HEAD>
  Administrative/Biographical History
    </HEAD>
    <P>
  The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire,
  was born in 1835, the son of the 2nd Baron de Tabley (1811-1887), and his wife, Catherina. His mother was Italian,
  the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He
  was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he
  published under the pseudonyms George F. Preston (1859-1862) and William Lancaster (1863-1868), but latterly
   under his own name.
    </P>
    <P>
  His early verse included
     <TITLE>
  Praeterita
     </TITLE>
   (1863),
     <TITLE>
  Eclogues and Monodramas
     </TITLE>
   (1864),
     <TITLE>
  Studies in Verse
     </TITLE>
   (1865),
     <TITLE>
  Philoctetes
     </TITLE>
   (1866), and
     <TITLE>
  Orestes
     </TITLE>
   (1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and
  Swinburne. In 1873 he produced …. (some data removed)…

Example EAD Record (Hub)
     <SCOPECONTENT ENCODINGANALOG = "ISADG3.3.1.">
        <HEAD>
     Scope and Content
        </HEAD>
        <P>
     The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in
     literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian
     figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates.
        </P>
        <P>
     Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert
     Bridges. There are volumes of Tabley's essays and verse, as well as a considerable number of notebooks and
     loose manuscripts of verse and other writings. There are various bundles and boxes relating to
     &quot;Coins&quot;, &quot;Botany&quot;, &quot;Poetry&quot;, &quot;Literary&quot;, &quot;Financial&quot;
     and bookplates.
        </P>
       </SCOPECONTENT>
       <ADD>
        <OTHERFINDAID ENCODINGANALOG = "ISADG3.4.6.">
        <P>
     Preliminary survey list.
        </P>
        </OTHERFINDAID>
        <RELATEDMATERIAL ENCODINGANALOG = "ISADG3.5.3.">
        <P>
     There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM.
     The Library also has custody of the important Tabley Book Collection.
        </P>
        </RELATEDMATERIAL>
        <SEPARATEDMATERIAL>
        <P>
     The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record
     Office. Some of these papers were originally in the custody of the John Rylands University Library
     of Manchester.
        </P>
        </SEPARATEDMATERIAL>
       </ADD>


Example EAD Record (Hub)
<CONTROLACCESS>
  <HEAD>
Index terms
  </HEAD>
  <GEOGNAME SOURCE = "NCARULES">
<EMPH ALTRENDER = "a">Tabley Inferior</EMPH>
<EMPH ALTRENDER = "a-">Cheshire SJ7378</EMPH>
  </GEOGNAME>
  <PERSNAME SOURCE = "NCARULES">
<EMPH ALTRENDER = "surname">Benson</EMPH>
<EMPH ALTRENDER = "forename">Arthur Christopher</EMPH>
<EMPH ALTRENDER = "dates">1862-1923</EMPH>
  </PERSNAME>
  <PERSNAME SOURCE = "NCARULES">
<EMPH ALTRENDER = "surname">Bridges</EMPH>
<EMPH ALTRENDER = "forename">Robert Seymour</EMPH>
<EMPH ALTRENDER = "dates">1844-1930</EMPH>
  </PERSNAME>
  <PERSNAME SOURCE = "NCARULES">
<EMPH ALTRENDER = "surname">Duff</EMPH>
<EMPH ALTRENDER = "title">Sir</EMPH>
<EMPH ALTRENDER = "forename">Mountstuart Elphinstone Grant</EMPH>
<EMPH ALTRENDER = "dates">1829-1906</EMPH>
<EMPH ALTRENDER = "epithet">Knight</EMPH>
  </PERSNAME>
  <PERSNAME SOURCE = "NCARULES">
<EMPH ALTRENDER = "surname">Gosse</EMPH>
<EMPH ALTRENDER = "title">Sir</EMPH>
<EMPH ALTRENDER = "forename">Edmund William</EMPH>
<EMPH ALTRENDER = "dates">1849-1928</EMPH>
<EMPH ALTRENDER = "epithet">Knight</EMPH>
  </PERSNAME>
  <PERSNAME SOURCE = "NCARULES">
<EMPH ALTRENDER = "surname">Milnes</EMPH>
<EMPH ALTRENDER = "forename">Richard Monckton</EMPH>
<EMPH ALTRENDER = "dates">1809-1885</EMPH>
<EMPH ALTRENDER = "epithet">1st Baron Houghton</EMPH>
  </PERSNAME>
  <SUBJECT SOURCE = "LCSH">
<EMPH ALTRENDER = "a">Bookplates</EMPH>
  </SUBJECT>
  <SUBJECT SOURCE = "LCSH">
<EMPH ALTRENDER = "a">Botany</EMPH>
  </SUBJECT>
  <SUBJECT SOURCE = "LCSH">
<EMPH ALTRENDER = "a">Numismatics</EMPH>
  </SUBJECT>
  <SUBJECT SOURCE = "LCSH">
<EMPH ALTRENDER = "a-">Poetry</EMPH>
<EMPH ALTRENDER = "a">Modern</EMPH>
<EMPH ALTRENDER = "y">19th century</EMPH>
  </SUBJECT>
 </CONTROLACCESS>
</ARCHDESC>
</EAD>
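
To make the extraction step concrete, here is a minimal sketch that pulls creator names (from ORIGINATION) and referenced names (from CONTROLACCESS) out of a trimmed copy of the record above. It is an illustration only, not the project's extraction code: it assumes upper-case element names as in this sample, and the function name `extract_names` and the output dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample record (most elements omitted).
SAMPLE = """<EAD><ARCHDESC LEVEL="FONDS">
 <DID>
  <ORIGINATION LABEL="Creator">
   <FAMNAME SOURCE="NCARULES">Warren, family, of Tabley, Cheshire</FAMNAME>
   <PERSNAME SOURCE="NCARULES">Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet</PERSNAME>
  </ORIGINATION>
 </DID>
 <CONTROLACCESS>
  <PERSNAME SOURCE="NCARULES"><EMPH ALTRENDER="surname">Bridges</EMPH>
<EMPH ALTRENDER="forename">Robert Seymour</EMPH>
<EMPH ALTRENDER="dates">1844-1930</EMPH></PERSNAME>
  <SUBJECT SOURCE="LCSH"><EMPH ALTRENDER="a">Botany</EMPH></SUBJECT>
 </CONTROLACCESS>
</ARCHDESC></EAD>"""

# EAD name elements mapped to EAC-CPF entity types (corporate body, person, family).
NAME_TAGS = {"CORPNAME": "corporateBody", "PERSNAME": "person", "FAMNAME": "family"}


def text_of(elem):
    """Concatenate all text inside an element and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def extract_names(ead_xml):
    """Return creator and referenced CPF names found in an EAD document."""
    root = ET.fromstring(ead_xml)
    found = []
    for section, role in (("ORIGINATION", "creator"), ("CONTROLACCESS", "referenced")):
        for parent in root.iter(section):
            for child in parent.iter():
                kind = NAME_TAGS.get(child.tag)
                if kind:
                    found.append({"role": role, "type": kind, "name": text_of(child)})
    return found


names = extract_names(SAMPLE)
for entry in names:
    print(entry)
```

Subject terms such as Botany are skipped because only the three CPF name elements are mapped.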




2010-2012 Extraction Results
• Source data: 30,000 finding aids
• EAC-CPF records extracted
  – LoC: 43,702 from 1,159 finding aids
  – OAC: 91,811 from ~15,400
  – NWDA: 22,609 from 5,160
  – VH: 15,175 from 8,390
  – Total 173,297



Phase II preliminary results
• unmerged SIA Henry Correspondence
• 32,988 Names

• unmerged WorldCat MARC
• 4,548,270 Names




The Problem
• Proliferation of the forms of names
  – Different names for the same person
  – Different people with the same names


• Examples
  – from Books in Print (semi-controlled but not
    consistent)
  – ERIC author index (not controlled)


Goethe




         …etc…


John Muir




Library and Archive Authority Control
• Library (or bibliographic) authority control is almost
  exclusively about the control of names
• Archival identity control involves biographical-
  historical description of the CPF entity
  – Descriptions based on controlled vocabularies, for
    example, occupations, place of birth and death
  – But also biographical-historical description
      • Prose
      • Chronological list
• Archival authority control provides context for
  understanding records, the context of their creation,
  the provenance


Merging EAC-CPF Records
[Diagram: EAC Repository -> Connect exactly matching records -> Connect records using name authority information (Cheshire Search over the LCNAF, VIAF, and ULAN repositories) -> Repository of connected EAC Records (MongoDB) -> Merge -> Repository of merged EAC Records]

Merging EAC-CPF Records
[Diagram: EAC Repository -> Connect exactly matching records -> Connect records using name authority information (Cheshire Search over the VIAF Repository) -> Repository of connected EAC Records (MongoDB) -> Merge -> Repository of merged EAC Records]

Connect Exact Matches
• The EAC-CPF records provide the names
  without having to parse texts, etc.
• Allows us to use some simple methods like
  exact matching
  – Assume identical name entries means the
    same person/corporate body/family
  – Enter the full names and record IDs into a
    database and flag IDs with same names for
    merging
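
The exact-match step can be sketched as follows. This is a minimal illustration, assuming records arrive as (record ID, name entry) pairs; the record IDs and the function name `flag_exact_matches` are hypothetical, and an in-memory dictionary stands in for the database.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry; keep groups with two or more IDs."""
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical record IDs attached to name entries.
records = [
    ("rec-001", "Tabb, John Banister, 1845-1909."),
    ("rec-002", "Tabb, John B. (John Banister), 1845-1909."),
    ("rec-003", "Tabb, John Banister, 1845-1909."),
    ("rec-004", "Taylor, Zachary, 1784-1850."),
]

flagged = flag_exact_matches(records)
print(flagged)
```

Only rec-001 and rec-003 are flagged for merging; rec-002 names the same person but differs as a string, so exact matching misses it.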


But…
• Exact merging assumes that archives are
  following LC cataloging practice in their
  EAD records
  – There are some problems with this assumption




Some failures for merging…
• Different abbreviations:
   – A. & G. Carisch & C.
   – A. & G. Carisch & Co.
• And spacing issues:
   –   A. C. Peters & Bro.
   –   A. C. Peters & Brother.
   –   A. C. Peters. (??)
   –   A. C.Peters & Bro.
• Completeness and alternate rules
   – Tabb, John B. (John Banister), 1845-1909.
   – Tabb, John Banister, 1845-1909.
• Also differing transliterations for non-Latin scripts

More…
• Variant romanizations (and spacing):
  – M. P. Belaieff.
  – M. P. Belaïeff.
  – M. P. Bieliaev.
  – M.P. Belaïeff.
  – M.P.Belaïeff.
• Initials vs. names:
  – Zabolotskii, N.A.
  – Zabolotskii, Nikolai Alekseevich, 1903-1958.
  – Zabolotskii.

More…
• Inverted order vs. uninverted
  – Taylor, Zachary, 1784-1850.
  – Zachary Taylor.
• Various combinations:
  – Tchaikovsky, Peter I.
  – Tchaikovsky, Pëtr Il.
  – Tchaikovsky, Piotr Ilyich.
  – Tchaikovsky, Pyotr Il.
  – Tchaikovsky, Pyotr Ilyich.
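
Some of these failures, though not all, yield to light normalization before exact matching. The sketch below is one possible normalizer, not the project's: it strips diacritics, repairs missing spaces after periods, collapses whitespace, and drops trailing periods and case. The function name `normalize_name` is an assumption for illustration.

```python
import re
import unicodedata


def normalize_name(name):
    """Build a comparison key that ignores diacritics, spacing, case, and trailing periods."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    spaced = re.sub(r"\.(?=\S)", ". ", no_marks)  # "M.P.Belaieff" -> "M. P. Belaieff"
    collapsed = " ".join(spaced.split())
    return collapsed.rstrip(".").lower()


variants = ["M. P. Belaieff.", "M. P. Belaïeff.", "M.P. Belaïeff.", "M.P.Belaïeff.", "M. P. Bieliaev."]
keys = {v: normalize_name(v) for v in variants}
for variant, key in keys.items():
    print(f"{variant!r:22} -> {key!r}")
```

The four Belaieff spellings collapse to one key, but the romanization Bieliaev does not, and neither do inverted versus uninverted forms such as "Taylor, Zachary, 1784-1850." and "Zachary Taylor."; those cases need the authority-file and similarity steps.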

Search Authority Files
• For each name, formulate a search of the
  VIAF database using the Cheshire system
  (SGML/XML retrieval system with
  probabilistic and Boolean matching)
  – Search both the “authoritative” and “non-
    authoritative” forms
  – Consider any name matching a non-
    authoritative form to be a candidate match for
    the authoritative form
  – Flag EAC records that match the same
    authority record as potential matches
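
The flagging logic can be sketched with a plain dictionary lookup standing in for the Cheshire search (which is probabilistic and far more tolerant than this). Everything here is illustrative: the authority IDs, the record IDs, the choice of which form is authoritative, and the function names are assumptions, not VIAF data.

```python
from collections import defaultdict

# Hypothetical authority records: one authoritative form plus non-authoritative variants.
AUTHORITY = {
    "auth-1": {"authoritative": "Taylor, Zachary, 1784-1850.", "variants": ["Zachary Taylor."]},
    "auth-2": {
        "authoritative": "Tabb, John B. (John Banister), 1845-1909.",
        "variants": ["Tabb, John Banister, 1845-1909."],
    },
}


def build_form_index(authority):
    """Map every name form, authoritative or not, to its authority record ID."""
    index = {}
    for auth_id, record in authority.items():
        for form in [record["authoritative"], *record["variants"]]:
            index[form] = auth_id
    return index


def flag_by_authority(eac_records, authority):
    """Flag EAC record IDs that resolve to the same authority record."""
    index = build_form_index(authority)
    hits = defaultdict(list)
    for record_id, name in eac_records:
        auth_id = index.get(name)
        if auth_id is not None:
            hits[auth_id].append(record_id)
    return {auth_id: ids for auth_id, ids in hits.items() if len(ids) > 1}


eac_records = [
    ("rec-001", "Zachary Taylor."),
    ("rec-002", "Taylor, Zachary, 1784-1850."),
    ("rec-003", "Tabb, John Banister, 1845-1909."),
]
flagged = flag_by_authority(eac_records, AUTHORITY)
print(flagged)
```

Here rec-001 and rec-002 are flagged as potential matches because both resolve to auth-1, even though their name strings differ.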

NGRAM or Shingle Matching

Name: Einstein Albert


  Shingle sequence: ein, ins, nst, ste, tei, ein … , ert



Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein




                       Shingle Language Model for names

                                                    Krishna Janakiraman and Sean Marimpietri - Biograph

Name 1: Einstein Albert    Name 2: Ainshtain Albert    Name 3: Albert Einstein

[Figure: one shingle graph per name, linking the name's character shingles (for example ein, ins, nst, ste, tei, alb, lbe, ert for Names 1 and 3; ain, ins, nsh, sht, hta, tai, alb, lbe, ert for Name 2)]

Shingle Language Model for names

Krishna Janakiraman and Sean Marimpietri - Biograph
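
To show what shingling buys, the sketch below generates character trigrams for the three names and compares them. Note the swap: the slides describe a shingle language model (probabilities over shingle sequences), whereas this sketch uses plain Jaccard overlap of shingle sets as a simpler stand-in. The function names are illustrative.

```python
def shingles(name, n=3):
    """Return the character n-grams of each word in the name, in order."""
    grams = []
    for word in name.lower().split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def jaccard(name_a, name_b, n=3):
    """Share of distinct shingles the two names have in common (0.0 to 1.0)."""
    a, b = set(shingles(name_a, n)), set(shingles(name_b, n))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


name_1, name_2, name_3 = "Einstein Albert", "Ainshtain Albert", "Albert Einstein"

print(shingles(name_1))
print(jaccard(name_1, name_3))  # word order does not matter to a set of shingles
print(jaccard(name_1, name_2))  # a different romanization still shares some shingles
```

Because the comparison is on shingle sets, the inverted and uninverted forms score as identical, while the variant romanization scores lower but well above zero.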

Merge Flagged Records
• For all of the exact matches and authority
  matches
  – Use the Authoritative form of the name
  – Combine data from each match into a single
    EAC-CPF record
  – Retain all source record IDs and information


• Finally, output the merged EAC-CPF
  records
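
The merge step can be sketched with plain dictionaries standing in for EAC-CPF records. This is an assumption-laden illustration: the field names (`source_ids`, `alternative_names`, `relations`), the record IDs, and the finding-aid labels are all hypothetical.

```python
def merge_records(flagged, authoritative_name):
    """Combine flagged records into one, keeping every source ID and name variant."""
    merged = {
        "name": authoritative_name,
        "alternative_names": [],
        "source_ids": [],
        "relations": [],
    }
    for record in flagged:
        merged["source_ids"].append(record["id"])
        name = record["name"]
        if name != authoritative_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        for relation in record.get("relations", []):
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
    return merged


flagged = [
    {"id": "rec-001", "name": "Zachary Taylor.", "relations": ["finding-aid-A"]},
    {"id": "rec-002", "name": "Taylor, Zachary, 1784-1850.",
     "relations": ["finding-aid-B", "finding-aid-A"]},
]
merged = merge_records(flagged, "Taylor, Zachary, 1784-1850.")
print(merged)
```

Keeping the source IDs on the merged record is what makes a later review or un-merge possible.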

Inputs to SNAC merging
• LoC: 43,702 EAC-CPF records derived from 1,159
  finding aids
• OAC: 91,814 EAC-CPF records derived from
  ~15,400 finding aids
• NWDA: 24,952 EAC-CPF records derived from
  5,568 finding aids
• VH: 15,175 EAC-CPF records
• Total: 175,688 Input EAC records for merging
• Result: 128,781 “unique” names

Another view of the numbers…
• 95,624 Person names merged from
  125,555 Person records
• 31,287 Institutions merged from 47,189
  Institution records
• 1,980 Families merged from 2,899 Family
  records
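
As a quick arithmetic check on these figures, the sketch below computes how much each entity type shrank through merging. The counts are copied from the slide; the variable and function names are illustrative.

```python
# (merged names, input records) per entity type, as given on the slide.
counts = {
    "Person": (95624, 125555),
    "Institution": (31287, 47189),
    "Family": (1980, 2899),
}


def reduction(merged, records):
    """Fraction of input records eliminated by merging."""
    return 1 - merged / records


reductions = {kind: reduction(merged, records) for kind, (merged, records) in counts.items()}
for kind, value in reductions.items():
    print(f"{kind}: {value:.1%} fewer entries after merging")
```

On these numbers, merging removes roughly a quarter of the person records and about a third of the institution and family records.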




Merging Conclusions
• There will not be a single merging method,
  but a staged set of approaches that takes us
  from the simplest exact matches to (we hope)
  reliable identification of variant forms of a
  name when corroborated by contextual
  information (dates, etc.)
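
One way to picture such a staged approach is the sketch below, which tries an exact match first and then accepts a variant only when life dates corroborate it. It is a deliberately crude illustration: comparing the text before the first comma is a stand-in for real variant matching, and the stage labels and function names are assumptions.

```python
import re

LIFE_DATES = re.compile(r"(\d{4})-(\d{4})")


def life_dates(name):
    """Return (birth, death) years found in a name entry, or None."""
    match = LIFE_DATES.search(name)
    return (match.group(1), match.group(2)) if match else None


def leading_element(name):
    """Text before the first comma, lowercased, without a trailing period."""
    return name.split(",")[0].strip().rstrip(".").lower()


def staged_match(name_a, name_b):
    """Classify a pair of name entries, from strongest evidence to weakest."""
    if name_a == name_b:
        return "exact"
    if leading_element(name_a) == leading_element(name_b):
        dates_a, dates_b = life_dates(name_a), life_dates(name_b)
        if dates_a is not None and dates_a == dates_b:
            return "variant, corroborated by dates"
        return "candidate, needs review"
    return "no match"


print(staged_match("Tabb, John B. (John Banister), 1845-1909.", "Tabb, John Banister, 1845-1909."))
print(staged_match("Zabolotskii, N.A.", "Zabolotskii, Nikolai Alekseevich, 1903-1958."))
print(staged_match("Taylor, Zachary, 1784-1850.", "Zachary Taylor."))
```

The Taylor pair falls through to "no match" here, which is why the authority-file and shingle stages are still needed alongside date corroboration.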



Next
• Developing an updateable database of
  merged EAC data (dumping Mongo for
  PostgreSQL)
  – Will permit incremental addition of new data
    and support editing and “forced” merges
• Process the 2M WorldCat archival
  descriptions
• Process the 150,000 finding aids
• Convert several hundred thousand archival
  authority records into EAC-CPF and run them
  through the match/merge process

Outline

• User Persona

• Search and Display

• Network graph visualization

• Linked Data / RDF

• Future Plans



Meet the target users
Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)

• Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers.

• Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to the institution's users. Wants to understand how the institution's records were used and what the added value is.

• Quincy: Library school student working to QA record matching.

• Adele: Person doing authority work during collection processing.

• Lenny: Likes linked data, and wants to be able to mine the links that have been established programmatically.





Advanced limits match EAC sections
Outline

• User Persona

• Search and Display

• Network graph visualization

    • Context widget (needs new name)


• Linked Data / RDF

• Future Plans

Tinkerpop graph database stack

• Simple "property graph" model

• "JDBC for graph databases" [SNAC is using Neo4J
  for the graphDB]

• XPath-like "Gremlin" language for graph queries

• REST interfaces with "Rexster"

• For me, this was 10 to 100 times easier than using
  RDF
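
For readers new to the "property graph" model, here is a minimal in-memory sketch: vertices and edges both carry key/value properties, and edges have a label and a direction. This is not the TinkerPop, Gremlin, or Neo4j API; the class, the vertex IDs, and the `correspondedWith` edge label are illustrative assumptions.

```python
class PropertyGraph:
    """Tiny property graph: vertices and labeled, directed edges, each with properties."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties

    def add_edge(self, out_id, label, in_id, **properties):
        self.edges.append({"out": out_id, "label": label, "in": in_id, "properties": properties})

    def neighbors(self, vertex_id, label=None):
        """IDs reachable over outgoing edges, optionally restricted to one edge label."""
        return [
            edge["in"]
            for edge in self.edges
            if edge["out"] == vertex_id and (label is None or edge["label"] == label)
        ]


graph = PropertyGraph()
graph.add_vertex("de-tabley", name="Warren, John Byrne Leicester, 1835-1895", entity_type="person")
graph.add_vertex("gosse", name="Gosse, Edmund William, 1849-1928", entity_type="person")
graph.add_vertex("bridges", name="Bridges, Robert Seymour, 1844-1930", entity_type="person")
graph.add_edge("de-tabley", "correspondedWith", "gosse", source="Tabley Muniments")
graph.add_edge("de-tabley", "correspondedWith", "bridges", source="Tabley Muniments")

for vertex_id in graph.neighbors("de-tabley", label="correspondedWith"):
    print(graph.vertices[vertex_id]["name"])
```

The appeal over RDF for this kind of work is that a relationship and its provenance live on one edge rather than being spread across several triples.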



What is Linked Open Data?

• W3C Semantic Web Technology Stack

• Web of atomized Data, not a web of documents

• RDF; OWL ontologies; SPARQL queries; triple/quad/quint
  stores

• httpRange-14; content negotiation; CURIE

• No restrictions on data use; free and easy license

• Lenny wants it, but does Randy?
What is Linked Open Data?

• Getting to the good stuff

    • Blue underlined text

    • Pulling in data from multiple sources, in an
      intelligent way, into a "document"

• Understand and discover relationships

• Open access for research, education, private study
  and other fair use
RDFa owl:sameAs
HTML 5 microdata in chron list
RDF of the social graph
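
As a rough idea of what such statements look like as data, the sketch below serializes an owl:sameAs link and a name as N-Triples lines using string formatting only. The owl:sameAs and foaf:name property URIs are the standard ones; the example.org subject and object URIs are placeholders, not real SNAC or VIAF identifiers, and the helper names are assumptions.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def uri_triple(subject, predicate, obj):
    """One N-Triples line whose object is a URI."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def literal_triple(subject, predicate, text):
    """One N-Triples line whose object is a plain string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


# Placeholder identifiers for illustration only.
entity = "http://example.org/snac/entity/1"
authority = "http://example.org/authority/1"

lines = [
    literal_triple(entity, FOAF_NAME, "Taylor, Zachary, 1784-1850."),
    uri_triple(entity, OWL_SAME_AS, authority),
]
print("\n".join(lines))
```

The owl:sameAs statement is what lets a linked-data consumer such as Lenny follow a merged record out to the matching authority record.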




                          Thanks Ed Summers!
Silvia Mazzini
                                     regesta.exe srl

http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
&mode=xml2owl [experimental]




My opinion on the use cases for W3C RDF tech

• Good for publishing data

• Good for controlled vocabularies

• Data models?

• Most people with open source RDF-store type
  systems do the real stuff with Solr

• Consider a graph database


Outline

• User Persona

• Search and Display

• Linked Data / RDF

• Network graph visualization

• Future Plans



Future Plans

• Conduct assessment activities involving members of
  target audiences to establish mental model of users for
  design work

• Scale interface to millions of names

• Visualizations useful and integrated (network and
  geospatial)

• Stable URLs between batches for linked data

• Social and personalization features (gateway to
  crowdsourcing)

• Integration with local systems (such as with the context
• Photo attribution http://www.flickr.com/photos/dsevilla

• http://xtf.cdlib.org/

• http://code.google.com/p/eac-graph-load/source/bro

• http://tinkerpop.com/

• http://thejit.org/

• https://github.com/tingletech/snac-related-widget



Building an Archival Identity Management Network

  • 2. Building an Archival Identity Management Network: Transforming Archival Practice and Historical Research Daniel Pitti* and Brian Tingle** * Institute for Advance Technology in the Humanities ** California Digital Library Thanks to Ray R. Larson of the University of California, Berkeley, School of Information for many of the slides here 12/11/12 2012-11-04 - SLIDE
  • 3. Funding and People • Funding and Timeline – National Endowment for the Humanities – May 2010-April 2012 – Andrew W. Mellon Foundation – May 2012-April 2014 • People – Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia) – Adrian Turner and Brian Tingle (California Digital Library, University of California) – Ray Larson (School of Information, University of California, Berkeley) 12/11/12 2012-11-04 - SLIDE
  • 4. The Source Data • EAD-encoded finding aids (guides to archival records) – 150K – Primarily from U.S. sources, but also U.K. and France • Archival authority records (360K) – National Archives and Records Administration – State Archive of New York – Smithsonian Institution – British Library – National Archives (France) & BnF • WorldCat Archival Descriptions: 2M 12/11/12 2012-11-04 - SLIDE
  • 5. Library and Museum Authority Records • Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names) • Virtual International Authority File (16M+ cluster records) – Contributed from around the world by national libraries and others 12/11/12 2012-11-04 - SLIDE
  • 7. Methods and Processing • Extract EAC-CPF records from existing EAD- encoded archival descriptions – Extracting both creators and referenced CPF names • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF) – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN) • Create a prototype historical resource and access system – Historical data and social-professional networks – Links to archive, library, and museum resources (by and about) 12/11/12 2012-11-04 - SLIDE
  • 8. Example EAD Record (Hub) <ARCHDESC LEVEL = "FONDS" LANGMATERIAL = "English"> <EAD> <DID> <EADHEADER LANGENCODING = "ISO 639"> <REPOSITORY> <EADID> University of Manchester, John Rylands University Library of Manchester GB 0133 TAB </REPOSITORY> </EADID> <UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" <FILEDESC> REPOSITORYCODE = "0133"> <TITLESTMT> GB 0133 TAB <TITLEPROPER> </UNITID> Tabley Muniments <UNITTITLE LABEL = "Title" ENCODINGANALOG = "ISADG3.1.2."> </TITLEPROPER> Tabley Muniments </TITLESTMT> </UNITTITLE> <PUBLICATIONSTMT> <UNITDATE LABEL = "Dates of Creation" ENCODINGANALOG = "ISADG3.1.3."> <PUBLISHER> 19th century John Rylands University Library of </UNITDATE> Manchester <PHYSDESC LABEL = "Extent" ENCODINGANALOG = "ISADG3.1.5."> </PUBLISHER> <EXTENT> <ADDRESS> 1.24 cu.m <ADDRESSLINE> </EXTENT> 150 Deansgate </PHYSDESC> </ADDRESSLINE> <ORIGINATION LABEL = "Creator" ENCODINGANALOG = "ISADG3.2.1."> <ADDRESSLINE> <FAMNAME SOURCE = "NCARULES"> Manchester Warren, family, of Tabley, Cheshire </ADDRESSLINE> </FAMNAME> <ADDRESSLINE> <PERSNAME SOURCE = "NCARULES"> ... (Parts removed )… Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet </FRONTMATTER> </PERSNAME> </ORIGINATION> </DID> 12/11/12 2012-11-04 - SLIDE
  • 9. Example EAD Record (Hub) <BIOGHIST ENCODINGANALOG = "ISADG3.2.2."> <HEAD> Administrative/Biographical History </HEAD> <P> The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire, was born in 1835, the son of the 2nd Baron de Tabley (1811-1887), and his wife, Catherina. His mother was Italian, the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he published under the pseudonyms George F. Preston (1859-1862) and William Lancaster (1863-1868), but latterly under his own name. </P> <P> His early verse included <TITLE> Praeterita </TITLE> (1863), <TITLE> Eclogues and Monodramas </TITLE> (1864), <TITLE> Studies in Verse </TITLE> (1865), <TITLE> Philocletes </TITLE> (1866), and <TITLE> Orestes </TITLE> (1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and Swinburne. In 1873 he produced …. (some data removed)… 12/11/12 2012-11-04 - SLIDE
  • 10. Example EAD Record (Hub) <SCOPECONTENT ENCODINGANALOG = "ISADG3.3.1."> <HEAD> Scope and Content </HEAD> <P> The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates. </P> <P> Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert Bridges. There are volumes of Tabley's essays and verse, as well as a considerable number of notebooks and loose manuscripts of verse and other writings. There are various bundles and boxes relating to &quot;Coins&quot;, &quot;Botany&quot;, &quot;Poetry&quot;, &quot;Literary&quot;, &quot;Financial&quot; and bookplates. </P> </SCOPECONTENT> <ADD> <OTHERFINDAID ENCODINGANALOG = "ISADG3.4.6."> <P> Preliminary survey list. </P> </OTHERFINDAID> <RELATEDMATERIAL ENCODINGANALOG = "ISADG3.5.3."> <P> There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM. The Library also has custody of the important Tabley Book Collection. </P> </RELATEDMATERIAL> <SEPARATEDMATERIAL> <P> The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record Office. Some of these papers were originally in the custody of the John Rylands University Library of Manchester. </P> </SEPARATEDMATERIAL> </ADD> 12/11/12 2012-11-04 - SLIDE
  • 11. Example EAD Record (Hub) <CONTROLACCESS> <PERSNAME SOURCE = "NCARULES"> <HEAD> <EMPH ALTRENDER = "surname">Milnes</EMPH> Index terms <EMPH ALTRENDER = "forename">Richard Monckton</EMPH> </HEAD> <EMPH ALTRENDER = "dates">1809-1885</EMPH> <GEOGNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "epithet">1st Baron Houghton</EMPH> <EMPH ALTRENDER = "a">Tabley Inferior</EMPH> </PERSNAME> <EMPH ALTRENDER = "a-">Cheshire SJ7378</EMPH> <SUBJECT SOURCE = "LCSH"> </GEOGNAME> <EMPH ALTRENDER = "a">Bookplates</EMPH> <PERSNAME SOURCE = "NCARULES"> </SUBJECT> <EMPH ALTRENDER = "surname">Benson</EMPH> <SUBJECT SOURCE = "LCSH"> <EMPH ALTRENDER = "forename">Arthur Christopher</EMPH> <EMPH ALTRENDER = "a">Botany</EMPH> <EMPH ALTRENDER = "dates">1862-1923</EMPH> </SUBJECT> </PERSNAME> <SUBJECT SOURCE = "LCSH"> <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "a">Numismatics</EMPH> <EMPH ALTRENDER = "surname">Bridges</EMPH> </SUBJECT> <EMPH ALTRENDER = "forename">Robert Seymour</EMPH> <SUBJECT SOURCE = "LCSH"> <EMPH ALTRENDER = "dates">1844-1930</EMPH> <EMPH ALTRENDER = "a-">Poetry</EMPH> </PERSNAME> <EMPH ALTRENDER = "a">Modern</EMPH> <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "y">19th century</EMPH> <EMPH ALTRENDER = "surname">Duff</EMPH> </SUBJECT> <EMPH ALTRENDER = "title">Sir</EMPH> </CONTROLACCESS> <EMPH ALTRENDER = "forename">Mountstuart Elphinstone Grant</EMPH> </ARCHDESC> <EMPH ALTRENDER = "dates">1829-1906</EMPH> </EAD> <EMPH ALTRENDER = "epithet">Knight</EMPH> </PERSNAME> <PERSNAME SOURCE = "NCARULES"> <EMPH ALTRENDER = "surname">Gosse</EMPH> <EMPH ALTRENDER = "title">Sir</EMPH> <EMPH ALTRENDER = "forename">Edmund William</EMPH> <EMPH ALTRENDER = "dates">1849-1928</EMPH> <EMPH ALTRENDER = "epithet">Knight</EMPH> </PERSNAME> 12/11/12 2012-11-04 - SLIDE
  • 12. 2010-2012 Extraction Results • Source data: 30,000 finding aids • EAC-CPF records extracted – LoC: 43,702 from 1,159 finding aids – OAC: 91,811 from ~15,400 – NWDA: 22,609 from 5,160 – VH: 15,175 from 8,390 – Total 173,297 12/11/12 2012-11-04 - SLIDE
  • 13. Phase II preliminary results • unmerged SIA Henry Correspondence • 32,988 Names • unmerged WorldCat MARC • 4,548,270 Names 12/11/12 2012-11-04 - SLIDE
  • 14. Methods and Processing • Extract EAC-CPF records from existing EAD- encoded archival descriptions – Extracting both creators and referenced CPF names • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF) – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN) • Create a prototype historical resource and access system – Historical data and social-professional networks – Links to archive, library, and museum resources (by and about) 12/11/12 2012-11-04 - SLIDE
  • 15. The Problem • Proliferation of the forms of names – Different names for the same person – Different people with the same names • Examples – from Books in Print (semi-controlled but not consistent) – ERIC author index (not controlled) 12/11/12 2012-11-04 - SLIDE
  • 16. Goethe …etc… 12/11/12 2012-11-04 - SLIDE
  • 17. John Muir 12/11/12 2012-11-04 - SLIDE
  • 18. Library and Archive Authority Control • Library (or bibliographic) authority control is almost exclusively about the control of names • Archival identity control involves biographical- historical description of the CPF entity – Descriptions based on controlled vocabularies, for example, occupations, place of birth and death – But also biographical-historical description • Prose • Chronological list • Archival authority control provides context for understanding records, the context of their creation, the provenance 12/11/12 2012-11-04 - SLIDE
  • 19. Merging EAC-CPF Records LCNAF Repository VIAF Repository ULAN Repository Cheshire Search Connect Connect exactly records using Merge matching name authority records information Repository of Repository of EAC Repository merged EAC connected EAC Records Records (MongoDB) 12/11/12 2012-11-04 - SLIDE
  • 20. Merging EAC-CPF Records VIAF Repository Cheshire Search Connect Connect exactly records using Merge matching name authority records information Repository of Repository of EAC Repository merged EAC connected EAC Records Records (MongoDB) 12/11/12 2012-11-04 - SLIDE
  • 21. Connect Exact Matches • The EAC-CPF records provide the names without having to parse texts, etc. • Allows us to use some simple methods like exact matching – Assume identical name entries means the same person/corporate body/family – Enter the full names and record IDs into a database and flag IDs with same names for merging 12/11/12 2012-11-04 - SLIDE
  • 22. But… • Exact merging assumes that archives are following LC cataloging practice in their EAD records – There are some problems with this assumption 12/11/12 2012-11-04 - SLIDE
  • 23. Some failures for merging… • Different abbreviations: – A. & G. Carisch & C. – A. & G. Carisch & Co. • And spacing issues: – A. C. Peters & Bro. – A. C. Peters & Brother. – A. C. Peters. (??) – A. C.Peters & Bro. • Completeness and alternate rules – Tabb, John B. (John Banister), 1845-1909. – Tabb, John Banister, 1845-1909. • Also differing transliterations for non-Latin scripts 12/11/12 2012-11-04 - SLIDE
  • 24. More… • Variant romanizations (and spacing): – M. P. Belaieff. – M. P. Belaïeff. – M. P. Bieliaev. – M.P. Belaïeff. – M.P.Belaïeff. • Initials vs. names: – Zabolotskii, N.A. – Zabolotskii, Nikolai Alekseevich, 1903-1958. – Zabolotskii. 12/11/12 2012-11-04 - SLIDE
  • 25. More… • Inverted order vs. uninverted – Taylor, Zachary, 1784-1850. – Zachary Taylor. • Various combinations: – Tchaikovsky, Peter I. – Tchaikovsky, Pëtr Il. – Tchaikovsky, Piotr Ilyich. – Tchaikovsky, Pyotr Il. – Tchaikovsky, Pyotr Ilyich. 12/11/12 2012-11-04 - SLIDE
  • 26. Merging EAC-CPF Records VIAF Repository Cheshire Search Connect Connect exactly records using Merge matching name authority records information Repository of Repository of EAC Repository merged EAC connected EAC Records Records (MongoDB) 12/11/12 2012-11-04 - SLIDE
  • 27. Search Authority Files • For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching) – Search both the “authoritative” and “non- authoritative” forms – Consider any name matching a non- authoritative form to be a candidate match for the authoritative form – Flag EAC records that match the same authority record as potential matches 12/11/12 2012-11-04 - SLIDE
  • 28. NGRAM or Shingle Matching Name: Einstein Albert Shingle sequence: ein, ins, nst, ste, tei, ein … , ert Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein Shingle Language Model for names Krishna Janakiraman and Sean Marimpietri - Biograph 12/11/12 2012-11-04 - SLIDE
  • 29. Name 1 : Einstein Albert Name 2 : Ainshtain Albert Name 3 : Albert Einstein ein In hta tai na ein In ain ste na sht ste al al nst al nsh nst alb alb ins alb ins ins lbe ein lbe Ain ein lbe ert ert ein ert ein ein rte tei rte tei tei rte Shingle Language Model for names Krishna Janakiraman and Sean Marimpietri - Biograph 12/11/12 2012-11-04 - SLIDE
  • 30. Merging EAC-CPF Records VIAF Repository Cheshire Search Connect Connect exactly records using Merge matching name authority records information Repository of Repository of EAC Repository merged EAC connected EAC Records Records (MongoDB) 12/11/12 2012-11-04 - SLIDE
  • 31. Merge Flagged Records • For all of the exact matches and authority matches – Use the Authoritative form of the name – Combine data from each match into a single EAC-CPF record – Retain all source record IDs and information • Finally, output the merged EAC-CPF records 12/11/12 2012-11-04 - SLIDE
  • 32. Inputs to SNAC merging • LoC: 43,702 EAC-CPF records derived from 1159 finding aids • OAC: 91,814 EAC-CPF records derived from ~15,400 finding aids • NWDA: 24952 EAC-CPF records derived from 5,568 finding aids • VH: 15,175 EAC-CPF records • Total: 175,688 Input EAC records for merging • Result: 128,781 “unique” names 12/11/12 2012-11-04 - SLIDE
  • 33. Another view of the numbers… • 95624 Person names merged from 125555 Person records • 31287 Institutions merged from 47189 Institution records • 1980 Families merged from 2899 Family records 12/11/12 2012-11-04 - SLIDE
  • 34. Merging Conclusions • There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches, to (we hope) reliably identifying various variant forms of a name, etc. when corroborated by contextual (date, etc.) information 12/11/12 2012-11-04 - SLIDE
  • 35. Next • Developing an updateable database of merged EAC data (replacing MongoDB with PostgreSQL) – Will permit incremental addition of new data and support editing and “forced” merges • Process the 2M WorldCat archival descriptions • Process the 150,000 finding aids • Convert several hundred thousand archival authority records into EAC-CPF and run them through the match/merge process
  • 36. Methods and Processing • Extract EAC-CPF records from existing EAD-encoded archival descriptions – Extracting both creators and referenced CPF names • Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF) – Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN) • Create a prototype historical resource and access system – Historical data and social-professional networks – Links to archive, library, and museum resources (by and about)
  • 37. Outline • User Persona • Search and Display • Network graph visualization • Linked Data / RDF • Future Plans
  • 38. Meet the target users • Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing) • Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers. • Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to the institution's users. Wants to understand how its records were used and what the added value is. • Quincy: Library school student working to QA record matching. • Adele: Person doing authority work during collection processing. • Lenny: Likes linked data, and wants to be able to mine programmatically the links that have been established.
  • 39. Outline • User Persona • Search and Display • Network graph visualization • Linked Data / RDF • Future Plans
  • 40.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50. Advanced limits match EAC sections
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 67.
  • 68. Outline • User Persona • Search and Display • Network graph visualization • Context widget (needs new name) • Linked Data / RDF • Future Plans
  • 69. TinkerPop graph database stack • Simple "property graph" model • "JDBC for graph databases" [SNAC is using Neo4j for the graph database] • XPath-like "Gremlin" language for graph queries • REST interfaces with "Rexster" • For me, this was 10 to 100 times easier than using RDF
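The "property graph" model named on this slide is simple enough to sketch in plain Python: vertices and edges both carry key/value properties, and edges are directed and labeled. The class below is a toy written for illustration only; it is not the TinkerPop, Gremlin, or Neo4j API, and the vertex IDs and edge labels are hypothetical.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy in-memory property graph (illustration only)."""

    def __init__(self):
        self.vertices = {}                   # vertex id -> property dict
        self.out_edges = defaultdict(list)   # vertex id -> [(label, target id, properties)]

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties

    def add_edge(self, source, label, target, **properties):
        self.out_edges[source].append((label, target, properties))

    def out(self, vertex_id, label=None):
        """Follow outgoing edges from a vertex, optionally filtered by label."""
        return [target for (edge_label, target, _) in self.out_edges[vertex_id]
                if label is None or edge_label == label]


if __name__ == "__main__":
    graph = PropertyGraph()
    for vertex_id, name in [("a", "Person A"), ("b", "Person B"), ("c", "Person C")]:
        graph.add_vertex(vertex_id, name=name, entity_type="person")
    graph.add_edge("a", "correspondedWith", "b")
    graph.add_edge("b", "associatedWith", "c")

    # Two-hop traversal: everyone reachable from "a" in exactly two steps.
    two_hops = [t for mid in graph.out("a") for t in graph.out(mid)]
    print([graph.vertices[v]["name"] for v in two_hops])
```

A path-style query language expresses the same two-hop walk as a chain of steps from a starting vertex, which is the sense in which the slide compares it to XPath.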
  • 70.
  • 71.
  • 72.
  • 73.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79. Outline • User Persona • Search and Display • Network graph visualization • Linked Data / RDF • Future Plans
  • 80. What is Linked Open Data? • W3C Semantic Web technology stack • Web of atomized data, not a web of documents • RDF; OWL ontologies; SPARQL queries; triple/quad/quint stores • httpRange-14; content negotiation; CURIE • No restrictions on data use; free and easy license • Lenny wants it, but does Randy?
  • 81. What is Linked Open Data? • Getting to the good stuff • Blue underlined text • Pulling in data from multiple sources, in an intelligent way, into a "document" • Understand and discover relationships • Open access for research, education, private study and other fair use
  • 83. HTML 5 microdata in chron list
  • 84. RDF of the social graph Thanks Ed Summers!
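To make "RDF of the social graph" concrete, the sketch below writes a small graph as N-Triples using only string formatting. The vocabulary choice (FOAF `name` and `knows`), the `http://example.org/entity/` base URI, and the identifiers are assumptions for the example; the slides do not say which vocabulary or URI scheme the project's RDF used.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "http://example.org/entity/"   # hypothetical URI prefix


def literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(names: dict, edges: list) -> str:
    """Serialize names and person-to-person links as N-Triples lines."""
    lines = []
    for ident, name in names.items():
        lines.append(f"<{BASE}{ident}> <{FOAF}name> {literal(name)} .")
    for source, target in edges:
        lines.append(f"<{BASE}{source}> <{FOAF}knows> <{BASE}{target}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    names = {"a": "Person A", "b": "Person B"}
    print(to_ntriples(names, [("a", "b")]))
```

Each line is one subject, predicate, object statement, which is the "atomized data" the Linked Open Data slides refer to.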
  • 85.
  • 86.
  • 87.
  • 88. Silvia Mazzini regesta.exe srl http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
  • 89.
  • 91. My opinion on the use cases for W3C RDF tech • Good for publishing data • Good for controlled vocabularies • Data models? • Most people with open source RDF-store type systems do the real stuff with Solr • Consider a graph database
  • 92.
  • 93. Outline • User Persona • Search and Display • Linked Data / RDF • Network graph visualization • Future Plans
  • 94. Future Plans • Conduct assessment activities involving members of target audiences to establish a mental model of users for design work • Scale interface to millions of names • Visualizations useful and integrated (network and geospatial) • Stable URLs between batches for linked data • Social and personalization features (gateway to crowdsourcing) • Integration with local systems (such as with the context
  • 95. Photo attribution http://www.flickr.com/photos/dsevilla • http://xtf.cdlib.org/ • http://code.google.com/p/eac-graph-load/source/bro • http://tinkerpop.com/ • http://thejit.org/ • https://github.com/tingletech/snac-related-widget

Editor's Notes

  1. In order of importance, Lenny the link head is last.
  2. So, this is what happens when you let the programmer design the user interface. In phase two, Rachel Hu, CDL's user experience designer in our in-house assessment group, will be helping.
  3. Hopefully this is where the user will focus
  4. AZ browse
  5. Featured items on home page (rather than 0-9). Note the tabs to limit by record type.
  6. Also note the subject and occupation facets
  7. Person
  8. Advanced search hides, allows On other browsers, hierarchy represented graphically
  9. Advanced search help
  10. Autocomplete
  11. Search results for Oppenheimer
  12. View EAD. The "Report data issue" link has been added back. Will come back to the radial graph demo.
  13. Sometimes the related resources will come from the EAD, but most of these are from VIAF. This whole section is hard to use when there are lots of related items.
  14. This was the first iteration of the graph visualization