
Validating 126 million MARC records (DATeCH 2019)


  1. Validating 126 million MARC records. DATeCH 2019, Brussels, 2019-05-10. Péter Király, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG). http://bit.ly/qa-datech2019
     Card catalog at Gent University Library, photo: Pieter Morlion, 2010, CC-BY 4.0, https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
  2. Part I. Short introduction to MARC
     ❏ MAchine Readable Cataloging
     ❏ a format and a semantic specification
     ❏ comes from the age of punch cards, when information compression was required
     ❏ invented in the early 1960s
     ❏ people love to criticise it: “MARC must die”*, “Stockholm syndrome of MARC”**
     ❏ “There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs.”
     * Roy Tennant, http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
     ** Niklas Lindström at ELAG 2019, https://twitter.com/cm_harlow/status/1126068414928293888
  3. A (pretty-printed) example
     LDR 01136cnm a2200253ui 4500
     001 002032820
     005 20150224114135.0
     008 031117s2003 gw 000 0 ger d
     020 $a3805909810
     100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
     245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
     250 $aNeubearb. 2003$bvon Jörn Eckert
     260 $aBerlin :$bSellier-de Gruyter,$c2003.
     300 $a534 p. ;.
     500 $aCiteertitel: BGB.
     500 $aBandtitel: Staudinger BGB.
     700 1 $aEckert, Jörn
     852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
  4. Looks like rocket science... Apollo 11 (moon landing) source code: https://github.com/chrislgarry/Apollo-11
  5. Positional fields: 008, e.g. ‘801003s1958 ja 000 0 jpn ’
     Position ruler:
       0         1         2         3
       0123456789012345678901234567890123456789
     Unit patterns by record type:
       aaaaaabccccddddeeefffgh   All materials
       IIIIjkLLLLmnopqr          Books
       ijklmnOOOpqrs             Continuing Resources
       iijklmNNNNNNOOp           Music
       IIIIjjklmnOO              Maps
       Iiijklmn                  Visual Materials
       ijkl                      Computer Files
       i                         Mixed Materials
     Legend:
     ❏ lower case = distinct units
     ❏ upper case = repeatable units
     ❏   = undefined
     The meaning of a position depends on the record type (calculated from Leader values).
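
The positional idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the validator's own code: it slices the record-type-independent ("All materials") units of an 008 value using the position ranges defined by MARC 21. The names `COMMON_008` and `parse_008` are made up for the example, and the sample value is a hand-padded reconstruction of the 008 from slide 3 with the material-specific positions 18-34 left blank.

```python
# Common ("All materials") positions of MARC 21 field 008, as 0-based
# (start, end) pairs with an exclusive end. Positions 18-34 depend on
# the record type and are not interpreted here.
COMMON_008 = {
    "date_entered": (0, 6),
    "date_type": (6, 7),
    "date1": (7, 11),
    "date2": (11, 15),
    "place": (15, 18),
    "language": (35, 38),
    "modified_record": (38, 39),
    "cataloging_source": (39, 40),
}


def parse_008(value: str) -> dict:
    """Slice a 40-character 008 value into its record-type-independent units."""
    if len(value) != 40:
        raise ValueError(f"008 must be 40 characters long, got {len(value)}")
    return {name: value[start:end] for name, (start, end) in COMMON_008.items()}


# Hypothetical sample: the units of slide 3's 008 value, re-padded by hand
# to 40 characters, with the material-specific positions 18-34 blanked.
sample = "031117" + "s" + "2003" + "    " + "gw " + " " * 17 + "ger" + " " + "d"
parsed = parse_008(sample)
print(parsed["date1"], parsed["place"].strip(), parsed["language"])
```

A full parser would additionally pick the layout for positions 18-34 from the Leader, which is the part the slide describes as depending on the record type.
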
  6. Datafields (repeatable/non-repeatable)
     Structure: Indicator1, Indicator2, Subfield1, ..., Subfieldn
     Indicators: always 1 character long, dictionary term
     Subfields:
     ❏ code
     ❏ value
        ❏ free text
        ❏ dictionary term
        ❏ fixed format (e.g. yymmdd)
        ❏ fixed format + dictionary terms (d7i2)
        ❏ fixed positions + dictionary terms
     ❏ repeatable/non-repeatable
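
As a rough illustration of this structure, the sketch below models a data field as a tag, two one-character indicators and an ordered list of (code, value) subfields. `DataField` and `parse_datafield` are hypothetical names, not part of the tool presented here, and the parser assumes the `$`-prefixed notation of the pretty-printed example on slide 3 (a literal `$` inside a value would not be handled).

```python
from dataclasses import dataclass, field


@dataclass
class DataField:
    tag: str
    ind1: str
    ind2: str
    # (code, value) pairs; a list keeps the order and allows repeated codes
    subfields: list = field(default_factory=list)

    def values(self, code: str) -> list:
        """Return all values stored under the given subfield code."""
        return [value for c, value in self.subfields if c == code]


def parse_datafield(tag: str, indicators: str, content: str) -> DataField:
    """Parse '$aFoo$bBar' style content into a DataField."""
    if len(indicators) != 2:
        raise ValueError("expected exactly two one-character indicators")
    # Everything before the first '$' is ignored; each chunk is code + value.
    chunks = [chunk for chunk in content.split("$")[1:] if chunk]
    subfields = [(chunk[0], chunk[1:]) for chunk in chunks]
    return DataField(tag, indicators[0], indicators[1], subfields)


title = parse_datafield(
    "245", "10", "$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger."
)
print(title.ind1, title.ind2, title.values("c"))
```
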
  7. Versions
     ❏ changes of the standard
        ❏ no versioning
        ❏ new, deleted and changed elements every year
     ❏ localized versions
        ❏ introducing new fields
        ❏ overwriting existing fields
        ❏ mixing localized versions
        ❏ no indication of which localization applies
        ❏ 50+ localizations (international, national, consortial)
  8. Size: number of data elements implemented

                           MARC 21   versions   total
       control fields            7                  7
       control subfields       211                211
       data fields             215         68     283
       indicators              175          8     183
       subfields              2259        344    2603
       total                                     3287

     Java classes (data model, qa-metadata-marc.jar), exported as Avram JSON (machine-readable standard)
  9. Remember heroines!
     ❏ Margaret Hamilton, https://qz.com/726338/
     ❏ Henriette D. Avram, smithsonianmag.com
  10. Part II. Record validation and quality assessment
      Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0, https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
  11. Quality assessment workflow
      1. ingest
      2. measure records
      3. aggregate
      4. report
      5. evaluate with experts (feedback loop: improve records)
  12. Step 1: ingest data
      ❏ Bavarian union catalogue (bay): 27.3 million records
      ❏ Baden-Württemberg union catalogue (bzb): 23.1 m
      ❏ Columbia (col): 6.0 m
      ❏ Heritage of the Printed Book Database, CERL (cer): 6.7 m
      ❏ German National Bibliography (dnb): 16.7 m
      ❏ Gent (gen): 1.8 m
      ❏ Harvard (har): 13.7 m
      ❏ Library of Congress (loc): 10.1 m
      ❏ Michigan (mic): 1.3 m
      ❏ Finnish National Bibliography (nfi): 1.0 m
      ❏ Répertoire International des Sources Musicales (ris): 1.3 m
      ❏ San Francisco Public Library (sfp): 0.9 m
      ❏ Stanford (sta): 9.4 m
      ❏ Szeged (szt): 1.2 m
      ❏ TIB Hannover (tib): 3.5 m
      ❏ Toronto Public Library (tor): 2.5 m
      Library types: union catalogues, national libraries, university libraries, public libraries
  13. Step 2: measure records
      $ ./validator [options] [file]
      001999999 852 undefined subfield L https://www.loc.gov/...
      002000005 035 undefined subfield 9 https://www.loc.gov/...
      002000005 852 undefined subfield L https://www.loc.gov/...
      002000005 852 undefined subfield L https://www.loc.gov/...
      002000008 035 undefined subfield 9 https://www.loc.gov/…
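
To show how output like this could feed the aggregation step, here is a small Python sketch. It assumes the report is whitespace-separated exactly as it appears on the slide (record ID, tag, issue type, offending value, documentation URL); the tool's real output format may well differ, and `parse_issue` is a name invented for this example.

```python
import re
from collections import Counter

# Assumed column layout: record id, tag, issue type (may contain spaces),
# offending value, URL of the field's documentation.
LINE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s+(\S+)\s+(https?://\S+)$")


def parse_issue(line: str) -> dict:
    """Split one report line into named parts."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised report line: {line!r}")
    record_id, tag, issue, value, url = match.groups()
    return {"record": record_id, "tag": tag, "issue": issue, "value": value, "url": url}


report = """\
001999999 852 undefined subfield L https://www.loc.gov/...
002000005 035 undefined subfield 9 https://www.loc.gov/...
002000005 852 undefined subfield L https://www.loc.gov/...
002000005 852 undefined subfield L https://www.loc.gov/...
002000008 035 undefined subfield 9 https://www.loc.gov/...
"""

issues = [parse_issue(line) for line in report.splitlines()]
by_kind = Counter((i["tag"], i["issue"], i["value"]) for i in issues)
records_with_issues = {i["record"] for i in issues}
print(by_kind.most_common(), len(records_with_issues))
```
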
  14. Step 3: aggregating results: records with issues (% of records)

       catalogue     all   filtered
       bay         100.0       18.8
       bzb         100.0       76.1
       cer           2.8        2.8
       col          90.4       66.0
       dnb          13.9        0.2
       gen          40.8       27.3
       har         100.0       97.3
       loc          30.5       29.3
       mic          80.8       67.5
       nfi          62.1       58.1
       ris          99.7       57.1
       sfp          82.7       60.4
       sta          92.7       92.5
       szt          30.8       30.6
       tib         100.0      100.0
       tor         100.0       74.2

      Filtered = issues excluding the undocumented tags and subfields
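
The difference between the "all" and "filtered" columns can be illustrated with a short Python sketch. It assumes that "undocumented tags and subfields" corresponds to the issue types "undefined field" and "undefined subfield"; the function name and the sample data are hypothetical and do not come from any of the catalogues above.

```python
# Issue types treated as "undocumented" for the filtered figure (assumption).
UNDOCUMENTED = frozenset({"undefined field", "undefined subfield"})


def share_with_issues(issues_by_record, total_records, ignore=frozenset()):
    """Percentage of records with at least one issue whose type is not in `ignore`."""
    affected = sum(
        1
        for issues in issues_by_record.values()
        if any(issue not in ignore for issue in issues)
    )
    return round(100.0 * affected / total_records, 1)


# Hypothetical catalogue of 10 records, 3 of which have issues.
issues_by_record = {
    "r1": ["undefined subfield"],
    "r2": ["undefined subfield", "invalid ISBN"],
    "r3": ["non-repeatable field"],
}
all_pct = share_with_issues(issues_by_record, total_records=10)
filtered_pct = share_with_issues(issues_by_record, total_records=10, ignore=UNDOCUMENTED)
print(all_pct, filtered_pct)  # 30.0 20.0
```
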
  15. Issue types
      Issues on record level
      ❏ R1 ambiguous linkage
      ❏ R2 invalid linkage
      ❏ R3 type error
      Control field issues
      ❏ C1 invalid code
      ❏ C2 invalid value
      Field issues
      ❏ F1 missing reference subfield (880$6)
      ❏ F2 non-repeatable field
      ❏ F3 undefined field
      Indicator issues
      ❏ I1 invalid value
      ❏ I2 non-empty value
      ❏ I3 obsolete value
      Subfield issues
      ❏ S1 classification
      ❏ S2 invalid ISBN
      ❏ S3 invalid ISSN
      ❏ S4 invalid length
      ❏ S5 invalid value
      ❏ S6 repetition
      ❏ S7 undefined subfield
      ❏ S8 non well-formatted value
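
Two of the subfield issues above (S2 invalid ISBN, S3 invalid ISSN) come down to standard check-digit arithmetic. The Python sketch below shows those public checksum rules; it is not taken from the tool, whose own checks may be stricter or more lenient, and the function names are invented for the example.

```python
import re


def _compact(text: str) -> str:
    """Drop hyphens and spaces, upper-case a possible 'x' check character."""
    return text.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn: str) -> bool:
    s = _compact(isbn)
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    values = [10 if c == "X" else int(c) for c in s]
    # Weights 10..1; the weighted sum must be divisible by 11.
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    s = _compact(isbn)
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    # Alternating weights 1, 3; the weighted sum must be divisible by 10.
    return sum((1 if i % 2 == 0 else 3) * int(c) for i, c in enumerate(s)) % 10 == 0


def is_valid_issn(issn: str) -> bool:
    s = _compact(issn)
    if not re.fullmatch(r"[0-9]{7}[0-9X]", s):
        return False
    values = [10 if c == "X" else int(c) for c in s]
    # Weights 8..1; the weighted sum must be divisible by 11.
    return sum(w * v for w, v in zip(range(8, 0, -1), values)) % 11 == 0


# 020$a of the example record on slide 3
print(is_valid_isbn10("3805909810"))
```
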
  16. Number of subfields in catalogues

       catalogue   total    1%   10%
       bay           854   144    51
       bzb           522   144    65
       crl           169    65    39
       col          1862   196    59
       dnb           575   186    97
       gnt           955   122    47
       har          2024   154    49
       loc          1156   128    40
       mic          1233   138    37
       nfi           811   145    54
       ris           138    88    52
       sfp          1046   125    37
       sta          2997   225    64
       szt          1210    74    42
       tib            46    41    35
       tor          1733   163    46

      The tool has 2600+ subfield definitions.
      total: total number of fields; 1%: fields available in at least 1% of the records; 10%: fields available in at least 10% of the records.
      Top fields (not in the table): 50%: 13-25 fields, 80%: 4-18 fields, 90%: 0-16 fields
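
The three columns of this table can be computed with a simple counting pass. The Python sketch below is an illustration under assumptions: each record is represented as a list of subfield identifiers such as `245$a`, a subfield counts once per record no matter how often it repeats, and the sample catalogue is invented.

```python
from collections import Counter


def subfield_coverage(records, thresholds_percent=(1, 10)):
    """Count distinct subfields overall and those present in at least N% of records."""
    presence = Counter()
    total_records = 0
    for record in records:
        total_records += 1
        presence.update(set(record))  # count each subfield once per record
    result = {"total": len(presence)}
    for pct in thresholds_percent:
        # Integer comparison avoids floating-point edge cases at the threshold.
        result[f"{pct}%"] = sum(
            1 for n in presence.values() if n * 100 >= pct * total_records
        )
    return result


# Hypothetical catalogue of 20 records.
records = [["245$a", "245$a"] for _ in range(17)]
records += [["245$a", "250$a"], ["245$a", "250$a"], ["245$a", "852$x"]]
coverage = subfield_coverage(records)
print(coverage)  # {'total': 3, '1%': 3, '10%': 2}
```
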
  17. Completeness by field groups
  18. Summary of errors
  19. K-means clustering, Spark (Scala)
      ❏ increasing the number of clusters decreases the distance from the centroids
      ❏ after a point this gain is not so big (“elbow effect”), at least in theory
      Cluster labels: big number of low-quality records; small clusters with ‘in between’ quality records; the acceptable average; clusters with good-quality records
      Thompson and Traill (2017) http://journal.code4lib.org/articles/12828
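
The "elbow effect" can be demonstrated on toy data. The slide's analysis ran on Spark (Scala); the sketch below swaps in a plain-Python version of Lloyd's k-means with a deterministic start, purely for illustration. The 2-D points stand in for hypothetical per-record quality scores and are not taken from any catalogue.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    # Deterministic start: k input points spread evenly over the input order.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical scores forming three well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
wcss = {k: kmeans_wcss(points, k) for k in range(1, 5)}
for k in sorted(wcss):
    print(k, round(wcss[k], 2))
```

With this data the distance drops sharply up to three clusters and barely improves afterwards, which is the elbow the slide refers to; real catalogue data is rarely this clear-cut.
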
  20. Step 4: report (web UI)
  24. Finding problems with facets
      Publisher name variants (publisher est. 1735):
      ❏ Vandenhoeck und Ruprecht
      ❏ Vandenhoeck & Ruprecht
      ❏ Vandenhoeck u. Ruprecht
      ❏ Vandenhoeck
      ❏ Vandenhoek & Ruprecht
      ❏ Vandenhoek und Ruprecht
      ❏ Bandenhoed und Ruprecht
      ❏ Vandenhoeck et Ruprecht
      ❏ Vandenhoeck & Reprecht
      ❏ Vandenhoed und Ruprecht
      ❏ V&R unipress
      ❏ V&R Unipress
      ❏ V & R Unipress
      ❏ V & R unipress
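
A facet is essentially a value count, and near-duplicate values can then be surfaced with fuzzy matching. The Python sketch below uses the standard library's `collections.Counter` and `difflib.get_close_matches`; the web UI on the previous slides is not claimed to work this way, and the counts in the sample are invented.

```python
from collections import Counter
from difflib import get_close_matches


def facet(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def variants_of(name, values, cutoff=0.8):
    """List other values that look like spelling variants of `name`."""
    candidates = sorted(set(values) - {name})
    return get_close_matches(name, candidates, n=max(1, len(candidates)), cutoff=cutoff)


# Hypothetical publisher statements with made-up frequencies.
publishers = (
    ["Vandenhoeck & Ruprecht"] * 5
    + ["Vandenhoeck und Ruprecht"] * 3
    + ["Vandenhoek & Ruprecht", "Vandenhoeck & Reprecht", "V&R unipress"]
)
top_value, top_count = facet(publishers)[0]
variants = variants_of(top_value, publishers)
print(top_value, top_count, variants)
```

The cutoff of 0.8 is an arbitrary choice for the example: it catches misspellings of the full name but not the separate imprint "V&R unipress", which a reviewer would still have to judge by eye.
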
  25. Cataloging frontline
      ❏ intensive backward cataloging (maybe importing?)
      ❏ backward cataloging is still intensive, the tendency continues
      ❏ peak is > 13K
      ❏ 2000-07-10, the “golden day”: 95K new records
      ❏ forward cataloging
  26. Everything else … at least regarding this project
      ❏ code & docs: https://github.com/pkiraly/metadata-qa-marc
      ❏ Web UI source code: https://github.com/pkiraly/metadata-qa-marc-web
      ❏ Avram Specification (Jakob Voß): http://format.gbv.de/schema/avram/specification
      ❏ https://twitter.com/kiru
      ❏ peter.kiraly@gwdg.de
