SlideShare uma empresa Scribd logo
1 de 33
Measuring library catalogs
ADOCHS meeting
Royal Library, Brussels, 2017-11-21.
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
Part I. Introduction to MARC
❏ MAchine Readable Catalog
❏ format and semantic specification
❏ comes from the age of punchcards - information compression
❏ invented in early 60’s
❏ even the lapidary “MARC must die” article* celebrated its 15th anniversary
last month, but MARC is still living
❏ „There are only two kinds of people who believe themselves able to read a
MARC record without referring to a stack of manuals: a handful of our top
catalogers and those on serious drugs.”
* by Roy Tennnant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
2
an example
LEADER 01136cnm a2200253ui 4500
001 002032820
005 20150224114135.0
008 031117s2003 gw 000 0 ger d
020 $a3805909810
100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
250 $aNeubearb. 2003$bvon Jörn Eckert
260 $aBerlin :$bSellier-de Gruyter,$c2003.
300 $a534 p. ;.
500 $aCiteertitel: BGB.
500 $aBandtitel: Staudinger BGB.
700 1 $aEckert, Jörn
852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
3
Positional fields - Leader
00928nam a2200265 c 4500
0 1 2
01234 5 6 7 8 9 0 1 2345 6 7 8 9 0 1 2 3
00928|n|a|m| |a|2|2|0026|5| |c| |4|5|0|0
❏ LDR/0-4 Record length: ‘00928’ - is a number padding with 0-s (max. value: 99999)
❏ LDR/5 Record status: ‘n’ - is a dictionary term, means “new”
❏ LDR/6 Type of record: ‘a’ - is a dictionary term, means “Language material”
❏ LDR/7 Bibliographic level: ‘m’ - means “Monograph/Item”
❏ ...
4
Record type
Type of record Bibliographic level type
a a or c or d or m Books
a b or i or s Continuing Resources
t Books
c or d or i or j Music
e or f Maps
g or k or o or r Visual Materials
m Computer Files
p Mixed Materials
5
Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
0 1 2 3
012345 6 7890 1234 567 8901 23 4 5 67 8 9 0 1 2 34 567 8 9
‘801003|s|1958| |ja | | |#| |##|0|0|#|0|#|0 |jpn| | ‘
common for all types
part I
type specific part
common for all types
part II
6
Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
0 1 2 3
0123456789012345678901234567890123456789
aaaaaabccccddddeeefffgh All materials
IIIIjkLLLLmnopqr Books
ijklmnOOOpqrs Continuing Resources
iijklmNNNNNNOOp Music
IIIIjjklmnOO Maps
Iiijklmn Visual Materials
ijkl Computer Files
i Mixed Materials
lower case = distinct units
upper case = repeatable units
 = undefined position
depends on record
type (calculated from
Leader values)
7
Datafields
repeatable/non-repeatable
Indicator1
Indicator2
Subfield1, ... , Subfieldn
always 1 char long dictionary term
❏ code
❏ value
❏ free text
❏ dictionary term
❏ fixed format (e.g. yymmdd)
❏ fixed format + dictionary terms (d7i2)
❏ fixed positions + dictionary terms
❏ repeatable/non-repeatable
8
Versions
❏ Changes of the standard
❏ No versioning
❏ New, deleted and changed elements every year
❏ Localized versions
❏ Introducing new fields
❏ Overwriting existing fields
❏ Mixing localized versions
❏ No notion about the localization
❏ 50+ localizations (international, national, consortial)
9
Handling versions (020, ISBN)
setSubfieldsWithCardinality(
"a", "International Standard Book Number", "NR",
"c", "Terms of availability", "NR",
"q", "Qualifying information", "R",
...
);
setHistoricalSubfields(
"b", "Binding information (BK, MP, MU) [OBSOLETE]"
);
putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R")
));
10
Addressing elements - MARCspec
XML: XPath﹣W3C standard
JSON: JSONPath﹣by Stefan Gössner (http://goessner.net/articles/JsonPath/)
MARC: MARCspec﹣by Carsten Klee (Zeitschriftendatenbank, Berlin)
❏ 260﹣field
❏ 245^2﹣the second indicator of a field
❏ 700[0]﹣the first instance of a field
❏ 245$c﹣a subfield
❏ 245$b{007/0=a|007/0=t}﹣subfield ‘b’ of field ‘245’, if character with
position ‘0’ of field 007 equals ‘a’ OR ‘t’.
❏ 020$c{$q=paperback}﹣subfield ‘c’ if subfield ‘q’ equals to ‘paperback’.
http://marcspec.github.io/MARCspec/marc-spec.html
11
Part II.
record validation
and quality assurance
Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
12
validating individual records
./validator [file]
001999999 852 undefined subfield L
https://www.loc.gov/...
002000005 035 undefined subfield 9
https://www.loc.gov/...
002000005 852 undefined subfield L
https://www.loc.gov/...
002000005 852 undefined subfield L
https://www.loc.gov/...
002000008 035 undefined subfield 9
https://www.loc.gov/… 13
summary of errors
./validator --summary [file]
006/01-02 (tag006music01): invalid value ' ' (https...) (1 times)
006/01-04 (tag006book01): contains invalid code ''0' in '060 '' (https...) (2 times)
006/01-04 (tag006book01): contains invalid code ''6' in '060 '' (https...) (1 times)
006/01-04 (tag006book01): contains invalid code ''n' in 'nnn '' (https...) (3 times)
006/01-04 (tag006book01): contains invalid code ''n' in 'uunn'' (https...) (2 times)
006/01-04 (tag006book01): contains invalid code ''u' in 'uunn'' (https...) (2 times)
14
other options
./validator --marcVersion “GENT” [file]
./validator --format “tsv” [file]
./validator --defaultRecordType “BOOKS” [file]
SEVERE: Error with record '002066968'. Leader/06
(typeOfRecord): 'n', Leader/07 (bibliographicLevel): 'm'
./validator --fileName “my-report” [file]
./validator ... [file] | catmandu … | RScript … | python … | grep ...
15
viewing/filtering/selecting records
Displaying record with given ID
./formatter --id “002032820” [file]
Displaying records matching a query
./formatter --search ‘245$c=Shakespeare’ [file]
Retrieve given elements
./formatter --selector ‘245$c’ [file]
16
calculating Thompson-Traill completeness
Thompson and Traill (2017) Leveraging Python to improve ebook metadata selection, ingest, and management (Code4Lib
Journal 38, http://journal.code4lib.org/articles/12828) 17
calculating Thompson-Traill completeness
./tt-completeness [options] [file]
output:
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date
26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of
Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
"010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7
18
K-means clustering
Spark (Scala)
increasing number of clusters
decreasing the distance from
the centroids
after a point this gain is not so
big (“elbow effect”) -- in theory
Big number or low
quality records
small clusters with ‘in
between’ quality records
the acceptable average
clusters with good quality
records
19
Indexing with Solr
"marc-tags" format
"100a_ss": "Jung-Baek, Myong Ja",
"100ind1_ss": "Surname",
"245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
"human-readable" format
"MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"MainPersonalName_type_ss": "Surname",
"Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
"mixed" format
"100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"100ind1_MainPersonalName_type_ss": "Surname",
"245a_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
20
How
to
name
the
fields?
Facetted search interface
21
accessing every record element
22
Finding problems with facets
Vandenhoeck und Ruprecht
Vandenhoeck & Ruprecht
Vandenhoeck u. Ruprecht
Vandenhoeck
Vandenhoek & Ruprecht
Vandenhoek und Ruprecht
Bandenhoed und Ruprecht
Vandenhoeck et Ruprecht
Vandenhoeck & Reprecht
Vandenhoed und Ruprecht
V&R unipress
V&R Unipress
V & R Unipress
V & R unipress
23
http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html
Usage in DH
Benjamin Smith (2017) A brief
visual history of MARC cataloging
at the Library of Congress.
1. extract fields from MARC
2. data cleaning
3. visualize with R
24
./formatter --selector "260c;008~0-5" [file] > dates.tsv
or put into a cleaning pileline
./formatter --selector "260c;008~0-5" [file] 
| sed ... | grep ... | awk ... 
> dates.tsv
Extract data
260c 008~0-
5
1977. 780804
1977. 781121
[1973]. 740215
publication record
1977 1978-08-04
1977 1978-11-21
1973 1974-02-15
25
Filtering out extreme values
data %>%
filter(publication > 2018) %>%
arrange(desc(publication))
publication record
<int> <int>
1 5732 1990
2 4185 2013
3 2201 2012
4 2030 2015
5 2022 2016
6 2020 2011
7 2019 2015
26
cataloging
frontline
intensive backward
cataloging -
maybe importing?
backward
cataloging is still
intensive, the
tendency continues
peak is > 13K
2000-07-10, the “golden day”:
95K new records
forward cataloging
27
28
reproducibility of science
❏ accessing users (first one: Gent)
❏ making easy of usage (downloadable binaries, helper scripts, documentation)
❏ distribution via Maven Central
❏ continuous integration (Travis CI)
❏ code coverage report
❏ list of freely reusable library catalogs
❏ licencing (GPL-3.0)
29
available catalogs to measure
30
❏ Library of Congress
❏ Harvard University Library
❏ Columbia University Library
❏ Deutsche Nationalbibliothek
❏ Universiteitsbibliotheek Gent
❏ Bibliotheksservice-Zentrum Baden Würtemberg
❏ Bibliotheksverbundes Bayern
❏ University of Michigan Library
❏ Toronto Public Library
❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB)
❏ Répertoire International des Sources Musicales
❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich)
❏ British library
❏ Talis
https://github.com/pkiraly/metadata-qa-marc#datasources
Future work
❏ implementing more validation rules
❏ visual dashboard
❏ communication with catalogers
❏ writing articles/dissertation
31
Authority entries
Responsibility statement:
Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en
Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent
(vormgeving).
Authority entries:
❏ Herr Seele
❏ Coussement, Toon
❏ Claes, Peter
❏ Van Sande, Hera
32
everything else
… at least regarding to this project
https://github.com/pkiraly/metadata-qa-marc
https://twitter.com/kiru
peter.kiraly@gwdg.de
33

Mais conteúdo relacionado

Mais de Péter Király

Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Péter Király
 
Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Péter Király
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Péter Király
 
FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)Péter Király
 
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)Péter Király
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Péter Király
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Péter Király
 
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Péter Király
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Péter Király
 
Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Péter Király
 
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Péter Király
 
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)Péter Király
 
Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)Péter Király
 
SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)Péter Király
 
Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)Péter Király
 
Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)Péter Király
 
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Péter Király
 
Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Péter Király
 
Stiller & Király, Multilinguality of Metadata
Stiller & Király, Multilinguality of MetadataStiller & Király, Multilinguality of Metadata
Stiller & Király, Multilinguality of MetadataPéter Király
 
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...Péter Király
 

Mais de Péter Király (20)

Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
 
Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
 
FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)
 
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)
 
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
 
Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)
 
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
 
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
Measuring Completeness as Metadata Quality Metric in Europeana (CAS 2018)
 
Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)Measuring MARC (ELAG 2018)
Measuring MARC (ELAG 2018)
 
SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)SHACL shortly (ELAG 2018)
SHACL shortly (ELAG 2018)
 
Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)Measuring Metadata Quality (ELAG, 2018)
Measuring Metadata Quality (ELAG, 2018)
 
Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)Measuring completeness as metadata quality metric in Europeana (DH 2017)
Measuring completeness as metadata quality metric in Europeana (DH 2017)
 
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)Nothing is created, nothing is lost, everything changes (ELAG, 2017)
Nothing is created, nothing is lost, everything changes (ELAG, 2017)
 
Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)Towards an extensible measurement of metadata quality (DATeCH 2017)
Towards an extensible measurement of metadata quality (DATeCH 2017)
 
Stiller & Király, Multilinguality of Metadata
Stiller & Király, Multilinguality of MetadataStiller & Király, Multilinguality of Metadata
Stiller & Király, Multilinguality of Metadata
 
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...
Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana‘s...
 

Último

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 

Último (20)

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

Measuring library catalogs (ADOCHS 2017)

  • 1. Measuring library catalogs ADOCHS meeting Royal Library, Brussels, 2017-11-21. Péter Király Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG) Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0 https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
  • 2. Part I. Introduction to MARC ❏ MAchine Readable Catalog ❏ format and semantic specification ❏ comes from the age of punchcards - information compression ❏ invented in early 60’s ❏ even the lapidary “MARC must die” article* celebrated its 15th anniversary last month, but MARC is still living ❏ „There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs.” * by Roy Tennnant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/ 2
  • 3. an example LEADER 01136cnm a2200253ui 4500 001 002032820 005 20150224114135.0 008 031117s2003 gw 000 0 ger d 020 $a3805909810 100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766 245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger. 250 $aNeubearb. 2003$bvon Jörn Eckert 260 $aBerlin :$bSellier-de Gruyter,$c2003. 300 $a534 p. ;. 500 $aCiteertitel: BGB. 500 $aBandtitel: Staudinger BGB. 700 1 $aEckert, Jörn 852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147 3
  • 4. Positional fields - Leader 00928nam a2200265 c 4500 0 1 2 01234 5 6 7 8 9 0 1 2345 6 7 8 9 0 1 2 3 00928|n|a|m| |a|2|2|0026|5| |c| |4|5|0|0 ❏ LDR/0-4 Record length: ‘00928’ - is a number padding with 0-s (max. value: 99999) ❏ LDR/5 Record status: ‘n’ - is a dictionary term, means “new” ❏ LDR/6 Type of record: ‘a’ - is a dictionary term, means “Language material” ❏ LDR/7 Bibliographic level: ‘m’ - means “Monograph/Item” ❏ ... 4
  • 5. Record type Type of record Bibliographic level type a a or c or d or m Books a b or i or s Continuing Resources t Books c or d or i or j Music e or f Maps g or k or o or r Visual Materials m Computer Files p Mixed Materials 5
  • 6. Positional fields - 008 ‘801003s1958 ja 000 0 jpn ‘ 0 1 2 3 012345 6 7890 1234 567 8901 23 4 5 67 8 9 0 1 2 34 567 8 9 ‘801003|s|1958| |ja | | |#| |##|0|0|#|0|#|0 |jpn| | ‘ common for all types part I type specific part common for all types part II 6
  • 7. Positional fields - 008 ‘801003s1958 ja 000 0 jpn ‘ 0 1 2 3 0123456789012345678901234567890123456789 aaaaaabccccddddeeefffgh All materials IIIIjkLLLLmnopqr Books ijklmnOOOpqrs Continuing Resources iijklmNNNNNNOOp Music IIIIjjklmnOO Maps Iiijklmn Visual Materials ijkl Computer Files i Mixed Materials lower case = distinct units upper case = repeatable units = undefined position depends on record type (calculated from Leader values) 7
  • 8. Datafields repeatable/non-repeatable Indicator1 Indicator2 Subfield1, ... , Subfieldn always 1 char long dictionary term ❏ code ❏ value ❏ free text ❏ dictionary term ❏ fixed format (e.g. yymmdd) ❏ fixed format + dictionary terms (d7i2) ❏ fixed positions + dictionary terms ❏ repeatable/non-repeatable 8
  • 9. Versions ❏ Changes of the standard ❏ No versioning ❏ New, deleted and changed elements every year ❏ Localized versions ❏ Introducing new fields ❏ Overwriting existing fields ❏ Mixing localized versions ❏ No notion about the localization ❏ 50+ localizations (international, national, consortial) 9
  • 10. Handling versions (020, ISBN) setSubfieldsWithCardinality( "a", "International Standard Book Number", "NR", "c", "Terms of availability", "NR", "q", "Qualifying information", "R", ... ); setHistoricalSubfields( "b", "Binding information (BK, MP, MU) [OBSOLETE]" ); putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList( new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R") )); 10
  • 11. Addressing elements - MARCspec XML: XPath﹣W3C standard JSON: JSONPath﹣by Stefan Gössner (http://goessner.net/articles/JsonPath/) MARC: MARCspec﹣by Carsten Klee (Zeitschriftendatenbank, Berlin) ❏ 260﹣field ❏ 245^2﹣the second indicator of a field ❏ 700[0]﹣the first instance of a field ❏ 245$c﹣a subfield ❏ 245$b{007/0=a|007/0=t}﹣subfield ‘b’ of field ‘245’, if character with position ‘0’ of field 007 equals ‘a’ OR ‘t’. ❏ 020$c{$q=paperback}﹣subfield ‘c’ if subfield ‘q’ equals to ‘paperback’. http://marcspec.github.io/MARCspec/marc-spec.html 11
  • 12. Part II. record validation and quality assurance Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0 https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg 12
  • 13. validating individual records ./validator [file] 001999999 852 undefined subfield L https://www.loc.gov/... 002000005 035 undefined subfield 9 https://www.loc.gov/... 002000005 852 undefined subfield L https://www.loc.gov/... 002000005 852 undefined subfield L https://www.loc.gov/... 002000008 035 undefined subfield 9 https://www.loc.gov/… 13
  • 14. summary of errors ./validator --summary [file] 006/01-02 (tag006music01): invalid value ' ' (https...) (1 times) 006/01-04 (tag006book01): contains invalid code ''0' in '060 '' (https...) (2 times) 006/01-04 (tag006book01): contains invalid code ''6' in '060 '' (https...) (1 times) 006/01-04 (tag006book01): contains invalid code ''n' in 'nnn '' (https...) (3 times) 006/01-04 (tag006book01): contains invalid code ''n' in 'uunn'' (https...) (2 times) 006/01-04 (tag006book01): contains invalid code ''u' in 'uunn'' (https...) (2 times) 14
  • 15. other options ./validator --marcVersion “GENT” [file] ./validator --format “tsv” [file] ./validator --defaultRecordType “BOOKS” [file] SEVERE: Error with record '002066968'. Leader/06 (typeOfRecord): 'n', Leader/07 (bibliographicLevel): 'm' ./validator --fileName “my-report” [file] ./validator ... [file] | catmandu … | RScript … | python … | grep ... 15
  • 16. viewing/filtering/selecting records Displaying record with given ID ./formatter --id “002032820” [file] Displaying records matching a query ./formatter --search ‘245$c=Shakespeare’ [file] Retrieve given elements ./formatter --selector ‘245$c’ [file] 16
  • 17. calculating Thompson-Traill completeness Thompson and Traill (2017) Leveraging Python to improve ebook metadata selection, ingest, and management (Code4Lib Journal 38, http://journal.code4lib.org/articles/12828) 17
  • 18. calculating Thompson-Traill completeness ./tt-completeness [options] [file] output: id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total "010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4 "01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5 "010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5 "010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6 "010023623",0,0,3,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,7 18
  • 19. K-means clustering Spark (Scala) increasing number of clusters decreasing the distance from the centroids after a point this gain is not so big (“elbow effect”) -- in theory Big number or low quality records small clusters with ‘in between’ quality records the acceptable average clusters with good quality records 19
  • 20. Indexing with Solr "marc-tags" format "100a_ss": "Jung-Baek, Myong Ja", "100ind1_ss": "Surname", "245c_ss": "Vorgelegt von Myong Ja Jung-Baek." "human-readable" format "MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja", "MainPersonalName_type_ss": "Surname", "Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek." "mixed" format "100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja", "100ind1_MainPersonalName_type_ss": "Surname", "245a_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek." 20 How to name the fields?
  • 23. Finding problems with facets Vandenhoeck und Ruprecht Vandenhoeck & Ruprecht Vandenhoeck u. Ruprecht Vandenhoeck Vandenhoek & Ruprecht Vandenhoek und Ruprecht Bandenhoed und Ruprecht Vandenhoeck et Ruprecht Vandenhoeck & Reprecht Vandenhoed und Ruprecht V&R unipress V&R Unipress V & R Unipress V & R unipress 23
  • 24. http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html Usage in DH Benjamin Smith (2017) A brief visual history of MARC cataloging at the Library of Congress. 1. extract fields from MARC 2. data cleaning 3. visualize with R 24
  • 25. ./formatter --selector "260c;008~0-5" [file] > dates.tsv or put into a cleaning pileline ./formatter --selector "260c;008~0-5" [file] | sed ... | grep ... | awk ... > dates.tsv Extract data 260c 008~0- 5 1977. 780804 1977. 781121 [1973]. 740215 publication record 1977 1978-08-04 1977 1978-11-21 1973 1974-02-15 25
  • 26. Filtering out extreme values data %>% filter(publication > 2018) %>% arrange(desc(publication)) publication record <int> <int> 1 5732 1990 2 4185 2013 3 2201 2012 4 2030 2015 5 2022 2016 6 2020 2011 7 2019 2015 26
  • 27. cataloging frontline intensive backward cataloging - maybe importing? backward cataloging is still intensive, the tendency continues peak is > 13K 2000-07-10, the “golden day”: 95K new records forward cataloging 27
  • 28. 28
  • 29. reproducibility of science ❏ accessing users (first one: Gent) ❏ making easy of usage (downloadable binaries, helper scripts, documentation) ❏ distribution via Maven Central ❏ continuous integration (Travis CI) ❏ code coverage report ❏ list of freely reusable library catalogs ❏ licencing (GPL-3.0) 29
  • 30. available catalogs to measure 30 ❏ Library of Congress ❏ Harvard University Library ❏ Columbia University Library ❏ Deutsche Nationalbibliothek ❏ Universiteitsbibliotheek Gent ❏ Bibliotheksservice-Zentrum Baden Würtemberg ❏ Bibliotheksverbundes Bayern ❏ University of Michigan Library ❏ Toronto Public Library ❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB) ❏ Répertoire International des Sources Musicales ❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich) ❏ British library ❏ Talis https://github.com/pkiraly/metadata-qa-marc#datasources
  • 31. Future work ❏ implementing more validation rules ❏ visual dashboard ❏ communication with catalogers ❏ writing articles/dissertation 31
  • 32. Authority entries Responsibility statement: Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent (vormgeving). Authority entries: ❏ Herr Seele ❏ Coussement, Toon ❏ Claes, Peter ❏ Van Sande, Hera 32
  • 33. everything else … at least regarding to this project https://github.com/pkiraly/metadata-qa-marc https://twitter.com/kiru peter.kiraly@gwdg.de 33