Measuring library catalogs
ELAG 2018
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
Part I. Introduction to MARC
❏ MAchine-Readable Cataloging
❏ format and semantic specification
❏ comes from the age of punch cards, when information had to be compressed
❏ invented in the 1960s
❏ even the lapidary “MARC must die” article* celebrated its 15th anniversary last year, but MARC is still alive
❏ “There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs.”
* by Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
2
an example
LEADER 01136cnm a2200253ui 4500
001 002032820
005 20150224114135.0
008 031117s2003 gw 000 0 ger d
020 $a3805909810
100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
250 $aNeubearb. 2003$bvon Jörn Eckert
260 $aBerlin :$bSellier-de Gruyter,$c2003.
300 $a534 p. ;.
500 $aCiteertitel: BGB.
500 $aBandtitel: Staudinger BGB.
700 1 $aEckert, Jörn
852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
3
Positional fields - Leader
00928nam a2200265 c 4500
position: 0-4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12-16 | 17 | 18 | 19 | 20 | 21 | 22 | 23
value: 00928 | n | a | m | (blank) | a | 2 | 2 | 00265 | (blank) | c | (blank) | 4 | 5 | 0 | 0
❏ LDR/0-4 Record length: ‘00928’, a number padded with zeros (max. value: 99999)
❏ LDR/5 Record status: ‘n’, a dictionary term meaning “new”
❏ LDR/6 Type of record: ‘a’, a dictionary term meaning “Language material”
❏ LDR/7 Bibliographic level: ‘m’, a dictionary term meaning “Monograph/Item”
❏ ...
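The positional layout above can be read with plain string slicing. The following is a minimal sketch, assuming a 24-character leader; the function name `parse_leader` and the dictionary keys are hypothetical and are not part of the validator tool.

```python
def parse_leader(leader):
    """Split a 24-character MARC leader into a few named positions."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters long")
    return {
        "record_length": int(leader[0:5]),  # LDR/0-4, zero-padded number
        "record_status": leader[5],         # LDR/5, dictionary term
        "type_of_record": leader[6],        # LDR/6, dictionary term
        "bibliographic_level": leader[7],   # LDR/7, dictionary term
    }


fields = parse_leader("00928nam a2200265 c 4500")
print(fields)
```

Only the four positions named on the slide are extracted; the remaining positions could be added to the dictionary in the same way.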
4
Record type
Type of record (LDR/6) | Bibliographic level (LDR/7) | Type
a | a, c, d or m | Books
a | b, i or s | Continuing Resources
t | any | Books
c, d, i or j | any | Music
e or f | any | Maps
g, k, o or r | any | Visual Materials
m | any | Computer Files
p | any | Mixed Materials
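The table above amounts to a small lookup. A minimal sketch, assuming the two Leader characters have already been extracted; the function name `record_type` and the constant names are hypothetical, and combinations that the table does not list simply return `None`.

```python
BOOK_LEVELS = {"a", "c", "d", "m"}
CONTINUING_LEVELS = {"b", "i", "s"}

# Types of record whose category does not depend on the bibliographic level.
TYPE_ONLY = {
    "t": "Books",
    "c": "Music", "d": "Music", "i": "Music", "j": "Music",
    "e": "Maps", "f": "Maps",
    "g": "Visual Materials", "k": "Visual Materials",
    "o": "Visual Materials", "r": "Visual Materials",
    "m": "Computer Files",
    "p": "Mixed Materials",
}


def record_type(type_of_record, bibliographic_level):
    """Return the record type for LDR/6 and LDR/7, or None if not in the table."""
    if type_of_record == "a":
        if bibliographic_level in BOOK_LEVELS:
            return "Books"
        if bibliographic_level in CONTINUING_LEVELS:
            return "Continuing Resources"
        return None
    return TYPE_ONLY.get(type_of_record)


print(record_type("a", "m"))
print(record_type("a", "s"))
```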
5
Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
0 1 2 3
012345 6 7890 1234 567 8901 23 4 5 67 8 9 0 1 2 34 567 8 9
‘801003|s|1958| |ja | | |#| |##|0|0|#|0|#|0 |jpn| | ‘
008/00-17: common for all types, part I
008/18-34: type-specific part
008/35-39: common for all types, part II
6
Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
0         1         2         3
0123456789012345678901234567890123456789
aaaaaabccccddddeee                 fffgh  All materials
                  IIIIjkLLLLmnop-qr       Books
                  ij-klmnOOOpq---rs       Continuing Resources
                  iijklmNNNNNNOO-p-       Music
                  IIIIjj-k--lm-n-OO       Maps
                  iii-j-----kl---mn       Visual Materials
                  ----ij--k-l------       Computer Files
                  -----i-----------       Mixed Materials
lower case = distinct units
upper case = repeatable units
- = undefined position
The type-specific part (positions 18-34) depends on the record type (calculated from Leader values).
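The split into two common parts and one type-specific part can be sketched with fixed slices. This is an illustration only: the names `COMMON_008` and `split_008` are hypothetical, and the example value is rebuilt with an empty type-specific part because the exact spacing of the slide's value is not recoverable from this transcript.

```python
# Positions shared by all record types, as (start, end) with end exclusive.
COMMON_008 = {
    "date_entered": (0, 6),
    "type_of_date": (6, 7),
    "date1": (7, 11),
    "date2": (11, 15),
    "place_of_publication": (15, 18),
    "language": (35, 38),
    "modified_record": (38, 39),
    "cataloging_source": (39, 40),
}


def split_008(value):
    """Slice a 40-character 008 field into common units and the type-specific part."""
    if len(value) != 40:
        raise ValueError("an 008 field is exactly 40 characters long")
    parts = {name: value[start:end] for name, (start, end) in COMMON_008.items()}
    parts["type_specific"] = value[18:35]  # meaning depends on the record type
    return parts


# Assumed padding: blanks for date2, the type-specific part and positions 38-39.
example = "801003" + "s" + "1958" + " " * 4 + "ja " + " " * 17 + "jpn" + "  "
parts = split_008(example)
print(parts["date1"], parts["place_of_publication"], parts["language"])
```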
7
Datafields
❏ Datafield: repeatable/non-repeatable
❏ Indicator1, Indicator2: always a 1-character dictionary term
❏ Subfield1, ... , Subfieldn
  ❏ code
  ❏ value
    ❏ free text
    ❏ dictionary term
    ❏ fixed format (e.g. yymmdd)
    ❏ fixed format + dictionary terms (d7i2)
    ❏ fixed positions + dictionary terms
  ❏ repeatable/non-repeatable
8
Versions
❏ Changes of the standard
  ❏ No versioning
  ❏ New, deleted and changed elements every year
❏ Localized versions
  ❏ Introducing new fields
  ❏ Overwriting existing fields
  ❏ Mixing localized versions
  ❏ No notion about the localization
  ❏ 50+ localizations (international, national, consortial)
9
Handling versions (020, ISBN)
setSubfieldsWithCardinality(
"a", "International Standard Book Number", "NR",
"c", "Terms of availability", "NR",
"q", "Qualifying information", "R",
...
);
setHistoricalSubfields(
"b", "Binding information (BK, MP, MU) [OBSOLETE]"
);
putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R")
));
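The same idea, a base definition plus per-version additions, can be sketched outside Java as a dictionary merge. This is not the tool's implementation; `BASE_020`, `VERSION_SPECIFIC_020` and `subfields_for` are hypothetical names, and only the subfields shown on the slide are included.

```python
# Subfield code -> (label, cardinality); "NR" = non-repeatable, "R" = repeatable.
BASE_020 = {
    "a": ("International Standard Book Number", "NR"),
    "c": ("Terms of availability", "NR"),
    "q": ("Qualifying information", "R"),
}

# Extra subfields that only exist in a given localized MARC version.
VERSION_SPECIFIC_020 = {
    "DNB": {"9": ("ISBN mit Bindestrichen", "R")},
}


def subfields_for(version):
    """Return the 020 subfield definitions valid for the given MARC version."""
    merged = dict(BASE_020)
    merged.update(VERSION_SPECIFIC_020.get(version, {}))
    return merged


print(sorted(subfields_for("DNB")))
print(sorted(subfields_for("MARC21")))
```

With this layout a validator would accept `020$9` only when the DNB version is selected.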
10
Addressing elements - MARCspec
XML: XPath﹣W3C standard
JSON: JSONPath﹣by Stefan Gössner (http://goessner.net/articles/JsonPath/)
MARC: MARCspec﹣by Carsten Klee (Zeitschriftendatenbank, Berlin)
❏ 260﹣field
❏ 245^2﹣the second indicator of a field
❏ 700[0]﹣the first instance of a field
❏ 245$c﹣a subfield
❏ 245$b{007/0=a|007/0=t}﹣subfield ‘b’ of field ‘245’, if the character at position ‘0’ of field ‘007’ equals ‘a’ OR ‘t’
❏ 020$c{$q=paperback}﹣subfield ‘c’ of field ‘020’, if subfield ‘q’ equals ‘paperback’
http://marcspec.github.io/MARCspec/marc-spec.html
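To make the addressing scheme concrete, here is a minimal sketch that parses only the four simplest forms listed above (field, indicator, field instance, subfield). It is not a MARCspec implementation: conditions in braces, character positions and ranges are deliberately rejected, and `parse_simple_spec` is a hypothetical name.

```python
import re

# tag, optional [index], then optionally either ^indicator or $subfield
SIMPLE_SPEC = re.compile(
    r"^(?P<tag>[0-9A-Za-z]{3})"
    r"(?:\[(?P<index>\d+)\])?"
    r"(?:\^(?P<indicator>[12])|\$(?P<subfield>[0-9a-z]))?$"
)


def parse_simple_spec(spec):
    """Parse a very small subset of MARCspec into its parts."""
    match = SIMPLE_SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported spec: {spec}")
    parts = match.groupdict()
    if parts["index"] is not None:
        parts["index"] = int(parts["index"])
    return parts


for spec in ["260", "245^2", "700[0]", "245$c"]:
    print(spec, parse_simple_spec(spec))
```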
11
Part II.
record validation
and quality assurance
Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
12
preparation
cd ~/Prog
mkdir marc
cd marc
wget https://www.loc.gov/cds/downloads/MDSConnect/BooksAll.2014.part01.utf8.gz
gunzip BooksAll.2014.part01.utf8.gz
mv BooksAll.2014.part01.utf8 loc01.mrc
cd ../metadata-qa-marc/target
# optional:
wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.2/metadata-qa-marc-0.2-SNAPSHOT-jar-with-dependencies.jar
13
validating individual records
> ./validator ../marc/loc01.mrc
> less validation-report.txt
Error in ' 00000057 ':
082$ind1: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd082.html)
Error in ' 00000119 ':
700$ind1: invalid value '2' (https://www.loc.gov/marc/bibliographic/bd700.html)
Error in ' 00000234 ':
082$ind1: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd082.html)
Errors in ' 00000294 ':
050$ind2: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd050.html)
260$ind1: obsolete value '0' (https://www.loc.gov/marc/bibliographic/bd260.html)
710$ind2: invalid value '0' (https://www.loc.gov/marc/bibliographic/bd710.html)
710$ind2: invalid value '0' (https://www.loc.gov/marc/bibliographic/bd710.html)
14
summary of errors
> ./validator --summary ../marc/loc01.mrc
> less validation-report.txt
006/01-04 (tag006book01): contains invalid code 'n' in ' n '
(https://www.loc.gov/marc/bibliographic/bd006.html) (1 times)
...
020$a: invalid ISBN '052179296X' is not a valid ISBN 10 value'
(https://en.wikipedia.org/wiki/International_Standard_Book_Number) (1 times)
...
045$a: invalid lengh 'v v': length is not 4 char' (https://www.loc.gov/marc/bibliographic/bd045.html)
(1 times)
…
740$ind2: obsolete value '1' (https://www.loc.gov/marc/bibliographic/bd740.html) (22 times)
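The ISBN message in the report can be reproduced with the standard ISBN-10 checksum: the ten characters are weighted 10 down to 1, a final ‘X’ counts as 10, and the sum must be divisible by 11. The sketch below shows that check only; `is_valid_isbn10` is a hypothetical name and nothing here claims to mirror how the validator itself is written.

```python
def is_valid_isbn10(isbn):
    """Check an ISBN-10 with the weighted mod-11 checksum."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            digit = 10  # 'X' is only allowed as the check character
        elif char in "0123456789":
            digit = int(char)
        else:
            return False
        total += (10 - position) * digit
    return total % 11 == 0


print(is_valid_isbn10("052179296X"))  # the value reported as invalid above
print(is_valid_isbn10("3805909810"))  # the 020$a of the example record
```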
15
Specifying the MARC version
> ./validator --marcVersion "GENT" [file]
Currently supported versions:
★ MARC21 -- Library of Congress MARC21
★ DNB -- Deutsche Nationalbibliothek's MARC version
★ OCLC -- OCLCMARC
★ GENT -- Gent University (Belgium)
★ SZTE -- Szegedi Tudományegyetem (Hungary)
★ FENNICA -- the Fennica catalog of Finnish National Library
16
output format
> ./validator --format "tab-separated" ../marc/loc01.mrc
Options:
★ text
★ tab-separated
★ comma-separated
★ json*
* in progress
17
default record type
SEVERE: Error with record '002066968'. Leader/06 (typeOfRecord): 'n',
Leader/07 (bibliographicLevel): 'm'
> ./validator --defaultRecordType "BOOKS" ../marc/loc01.mrc
★ BOOKS
★ CONTINUING_RESOURCES
★ MUSIC
★ MAPS
★ VISUAL_MATERIALS
★ COMPUTER_FILES
★ MIXED_MATERIALS
18
suppress log messages
> ./validator --nolog ../marc/loc01.mrc
19
output file name
./validator --fileName "my-report" ../marc/loc01.mrc
Special filename: “stdout”
./validator --fileName "stdout" --nolog [file] | catmandu … | Rscript … | python … | grep ...
20
processing MARC XML file
> ./validator --marcxml [example.marcxml]
21
processing a subset of records
process records 1-1000
> ./validator --limit 1000 ../marc/loc01.mrc
process records 5001-6000
> ./validator --offset 5000 --limit 1000 ../marc/loc01.mrc
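The two flags behave like a slice over the record stream. A minimal sketch, assuming that `--offset` counts records to skip and `--limit` counts records to process afterwards (which is what "records 5001-6000" suggests); `select_records` and the generated IDs are hypothetical.

```python
from itertools import islice


def select_records(records, offset=0, limit=None):
    """Skip `offset` records, then yield at most `limit` records."""
    stop = None if limit is None else offset + limit
    return islice(records, offset, stop)


# Stand-in for a record stream: IDs 00000001 .. 00010000.
ids = [f"{n:08d}" for n in range(1, 10001)]
subset = list(select_records(ids, offset=5000, limit=1000))
print(subset[0], subset[-1], len(subset))
```

Because `islice` works on any iterator, the same function would also apply to a lazily read file.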
22
Fix ALEPHSEQ placeholder '^'
ALEPH exports contain '^' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before validation.
./validator --fixAlephseq [file]
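The replacement itself is a one-liner restricted to the three control fields. A minimal sketch of that step; `fix_alephseq` and `CONTROL_TAGS` are hypothetical names, and the sample value is made up for illustration.

```python
CONTROL_TAGS = {"006", "007", "008"}


def fix_alephseq(tag, value):
    """Replace the ALEPH '^' placeholder with a space in control fields only."""
    if tag in CONTROL_TAGS:
        return value.replace("^", " ")
    return value


print(repr(fix_alephseq("008", "801003s1958^^^^ja^")))
print(repr(fix_alephseq("245", "a^b")))  # other fields are left untouched
```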
23
viewing/selecting records
Displaying the record with a given ID
> ./formatter --id "002032820" ../marc/loc01.mrc
Displaying the Nth record
> ./formatter --countNr 1 ../marc/loc01.mrc
Displaying records matching a query
> ./formatter --search '260$c=1899.' ../marc/loc01.mrc
Retrieving given elements
> ./formatter --selector '245$c' ../marc/loc01.mrc
24
extract given elements
> ./formatter --selector '245$c' ../marc/loc01.mrc
By S. H. Aurand.
by Charles E. Chadman.
> ./formatter --selector '245$c' --withId ../marc/loc01.mrc
00000002 By S. H. Aurand.
00000004 by Charles E. Chadman.
> ./formatter --selector '245$a;245$c' --withId ../marc/loc01.mrc
00000002 1899. By S. H. Aurand.
00000004 1899. by Charles E. Chadman.
25
calculating Thompson-Traill completeness
Thompson and Traill (2017) Leveraging Python to improve ebook metadata selection, ingest, and management (Code4Lib Journal 38, http://journal.code4lib.org/articles/12828)
26
calculating Thompson-Traill completeness
./tt-completeness ../marc/loc01.mrc
options: limit, offset, fileName, nolog
> less tt-completeness.csv
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
27
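The tt-completeness.csv sample shown above can be read with a standard CSV parser. In the four rows of the sample, `total` equals the sum of the other score columns; the sketch below checks exactly that observation and does not claim that the tool defines the total this way. The names `HEADER`, `ROWS` and `component_sum` are hypothetical.

```python
import csv
import io

HEADER = (
    "id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,"
    "Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,"
    "Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total"
)
ROWS = [
    '"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4',
    '"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5',
    '"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5',
    '"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6',
]


def component_sum(row):
    """Add up every score column except the record id and the total."""
    return sum(int(v) for k, v in row.items() if k not in ("id", "total"))


rows = list(csv.DictReader(io.StringIO("\n".join([HEADER] + ROWS))))
for row in rows:
    print(row["id"], component_sum(row), row["total"])
```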
K-means clustering
Spark (Scala)
❏ increasing the number of clusters decreases the distance from the centroids
❏ after a point this gain is not so big (“elbow effect”), at least in theory
Clusters found:
❏ a big number of low quality records
❏ small clusters with ‘in between’ quality records
❏ the acceptable average
❏ clusters with good quality records
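The elbow effect can be illustrated without Spark. The following pure-Python sketch runs a simple one-dimensional k-means (Lloyd's algorithm with a deterministic start) for k = 1..4 and compares how much the summed squared distance to the centroids drops at each step. The `scores` list is invented for the illustration and is not taken from any catalog; `kmeans_1d` is a hypothetical helper, not the code used in the talk.

```python
def kmeans_1d(values, k, iterations=20):
    """Run Lloyd's algorithm on 1-D data; return (centroids, inertia)."""
    data = sorted(values)
    # Deterministic start: pick k evenly spread points of the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in data:
            nearest = min(range(k), key=lambda c: abs(value - centroids[c]))
            clusters[nearest].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, inertia


# Made-up completeness totals forming three visible groups.
scores = [4, 5, 5, 6, 12, 13, 13, 14, 20, 21, 21, 22]
inertia_by_k = {k: kmeans_1d(scores, k)[1] for k in range(1, 5)}
gains = {k: inertia_by_k[k - 1] - inertia_by_k[k] for k in range(2, 5)}
print(inertia_by_k)
print(gains)
```

On this toy data the gain from a fourth cluster is far smaller than the gain from the second or third, which is the elbow.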
28
indexing with Solr
java -cp $JAR de.gwdg.metadataqa.marc.cli.MarcToSolr [options] [file]
Options:
★ solrUrl
★ doCommit
★ solrFieldType
○ marc-tags
○ human-readable
○ mixed
★ nolog
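The three `solrFieldType` options differ only in how the Solr field name is composed from the MARC address and a human-readable label. A minimal sketch of that naming scheme, assuming the `_ss` suffix and the labels that appear in this deck's examples; `LABELS` and `solr_field` are hypothetical names and not the tool's API.

```python
# (tag, code) -> (field label, subfield or indicator label)
LABELS = {
    ("100", "a"): ("MainPersonalName", "personalName"),
    ("100", "ind1"): ("MainPersonalName", "type"),
    ("245", "c"): ("Title", "responsibilityStatement"),
}


def solr_field(tag, code, field_type):
    """Compose a Solr field name in one of the three naming styles."""
    marc_part = f"{tag}{code}"
    label_part = "_".join(LABELS[(tag, code)])
    if field_type == "marc-tags":
        name = marc_part
    elif field_type == "human-readable":
        name = label_part
    elif field_type == "mixed":
        name = f"{marc_part}_{label_part}"
    else:
        raise ValueError(f"unknown field type: {field_type}")
    return name + "_ss"


for style in ("marc-tags", "human-readable", "mixed"):
    print(solr_field("100", "a", style))
```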
29
Indexing with Solr
"marc-tags" format
"100a_ss": "Jung-Baek, Myong Ja",
"100ind1_ss": "Surname",
"245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
"human-readable" format
"MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"MainPersonalName_type_ss": "Surname",
"Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
"mixed" format
"100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"100ind1_MainPersonalName_type_ss": "Surname",
"245c_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
30
Faceted search interface
How to name the fields?
31
https://github.com/pkiraly/metadata-qa-marc-web
accessing every record element
32
Finding problems with facets
Vandenhoeck und Ruprecht
Vandenhoeck & Ruprecht
Vandenhoeck u. Ruprecht
Vandenhoeck
Vandenhoek & Ruprecht
Vandenhoek und Ruprecht
Bandenhoed und Ruprecht
Vandenhoeck et Ruprecht
Vandenhoeck & Reprecht
Vandenhoed und Ruprecht
V&R unipress
V&R Unipress
V & R Unipress
V & R unipress
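Many of the variants in this facet differ only in case, spacing or the spelling of "and". A minimal sketch of a normalization key that merges those; it does not catch the spelling errors (e.g. "Vandenhoek", "Reprecht"), which would need fuzzy matching. `normalize_publisher` is a hypothetical helper.

```python
import re
from collections import defaultdict


def normalize_publisher(name):
    """Build a grouping key: lower case, unified '&', single spaces."""
    text = name.lower()
    # Treat "und", "et" and "u." as spellings of "&".
    text = re.sub(r"\b(?:und|et)\b|\bu\.", "&", text)
    # Put exactly one space around "&" and collapse other whitespace.
    text = re.sub(r"\s*&\s*", " & ", text)
    return re.sub(r"\s+", " ", text).strip()


names = [
    "Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht",
    "Vandenhoeck u. Ruprecht", "Vandenhoeck",
    "Vandenhoek & Ruprecht", "Vandenhoek und Ruprecht",
    "Bandenhoed und Ruprecht", "Vandenhoeck et Ruprecht",
    "Vandenhoeck & Reprecht", "Vandenhoed und Ruprecht",
    "V&R unipress", "V&R Unipress", "V & R Unipress", "V & R unipress",
]

groups = defaultdict(list)
for name in names:
    groups[normalize_publisher(name)].append(name)

for key, members in groups.items():
    print(key, len(members))
```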
33
Usage in DH
Benjamin Schmidt (2017) A brief visual history of MARC cataloging at the Library of Congress.
http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html
1. extract fields from MARC
2. data cleaning
3. visualize with R
34
Extract data
./formatter --selector "260c;008~0-5" [file] > dates.tsv
or put it into a cleaning pipeline
./formatter --selector "260c;008~0-5" [file] \
| sed ... | grep ... | awk ... \
> dates.tsv
Raw output:
260c | 008~0-5
1977. | 780804
1977. | 781121
[1973]. | 740215
After cleaning:
publication | record
1977 | 1978-08-04
1977 | 1978-11-21
1973 | 1974-02-15
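The cleaning step from raw values to the publication/record columns can be sketched as follows. This is one possible approach, not the sed/grep/awk pipeline of the talk: the publication year is taken as the first four-digit run, and the two-digit year of 008/00-05 is expanded with Python's `%y` rule (69-99 become 1969-1999, 00-68 become 2000-2068), which is an assumption about the data. The function names are hypothetical.

```python
import re
from datetime import datetime


def publication_year(raw):
    """Return the first four-digit number in a 260$c value, or None."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None


def record_date(yymmdd):
    """Turn the yymmdd creation date from 008/00-05 into an ISO date string."""
    return datetime.strptime(yymmdd, "%y%m%d").date().isoformat()


for raw_year, raw_date in [("1977.", "780804"), ("1977.", "781121"), ("[1973].", "740215")]:
    print(publication_year(raw_year), record_date(raw_date))
```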
35
Filtering out extreme values
data %>%
filter(publication > 2018) %>%
arrange(desc(publication))
publication record
<int> <int>
1 5732 1990
2 4185 2013
3 2201 2012
4 2030 2015
5 2022 2016
6 2020 2011
7 2019 2015
36
[Figure annotations]
❏ cataloging frontline
❏ intensive backward cataloging (maybe importing?)
❏ backward cataloging is still intensive, the tendency continues
❏ peak is > 13K
❏ 2000-07-10, the “golden day”: 95K new records
❏ forward cataloging
37
38
available catalogs to measure
39
❏ Library of Congress
❏ Harvard University Library
❏ Columbia University Library
❏ Deutsche Nationalbibliothek
❏ Universiteitsbibliotheek Gent
❏ Bibliotheksservice-Zentrum Baden-Württemberg
❏ Bibliotheksverbund Bayern
❏ University of Michigan Library
❏ Toronto Public Library
❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB)
❏ Répertoire International des Sources Musicales
❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich)
❏ British Library
❏ Talis
https://github.com/pkiraly/metadata-qa-marc#datasources
Authority entries
Responsibility statement:
Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans
en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent
(vormgeving).
Authority entries:
❏ Herr Seele
❏ Coussement, Toon
❏ Claes, Peter
❏ Van Sande, Hera
40
everything else
… at least regarding this project
https://github.com/pkiraly/metadata-qa-marc
https://twitter.com/kiru
peter.kiraly@gwdg.de
41

Measuring MARC (ELAG 2018)

  • 1. Measuring library catalogs ELAG 2018 Péter Király Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG) Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0 https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
  • 2. Part I. Introduction to MARC ❏ MAchine Readable Catalog ❏ format and semantic specification ❏ comes from the age of punchcards - information compression ❏ invented in early 60’s ❏ even the lapidary “MARC must die” article* celebrated its 15th anniversary last year, but MARC is still living ❏ „There are only two kinds of people who believe themselves able to read a MARC record without referring to a stack of manuals: a handful of our top catalogers and those on serious drugs.” * by Roy Tennnant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/ 2
  • 3. an example LEADER 01136cnm a2200253ui 4500 001 002032820 005 20150224114135.0 008 031117s2003 gw 000 0 ger d 020 $a3805909810 100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766 245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger. 250 $aNeubearb. 2003$bvon Jörn Eckert 260 $aBerlin :$bSellier-de Gruyter,$c2003. 300 $a534 p. ;. 500 $aCiteertitel: BGB. 500 $aBandtitel: Staudinger BGB. 700 1 $aEckert, Jörn 852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147 3
  • 4. Positional fields - Leader: ‘00928nam a2200265 c 4500’, split by position (0-23) as 00928|n|a|m| |a|2|2|0026|5| |c| |4|5|0|0 ❏ LDR/0-4 Record length: ‘00928’ - a number padded with zeros (max. value: 99999) ❏ LDR/5 Record status: ‘n’ - a dictionary term, means “new” ❏ LDR/6 Type of record: ‘a’ - a dictionary term, means “Language material” ❏ LDR/7 Bibliographic level: ‘m’ - means “Monograph/Item” ❏ ...
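The positional reading of the Leader described on this slide can be sketched in Python. This is only an illustration, not the metadata-qa-marc implementation: the function name `parse_leader` and the dictionary keys are hypothetical, and only the four elements listed on the slide are decoded.

```python
def parse_leader(leader: str) -> dict:
    """Slice the Leader positions named on the slide (hypothetical helper)."""
    if len(leader) != 24:
        raise ValueError(f"a MARC21 leader has 24 characters, got {len(leader)}")
    return {
        "record_length": int(leader[0:5]),   # LDR/0-4: zero-padded number
        "record_status": leader[5],          # LDR/5: dictionary term
        "type_of_record": leader[6],         # LDR/6: dictionary term
        "bibliographic_level": leader[7],    # LDR/7: dictionary term
    }


print(parse_leader("00928nam a2200265 c 4500"))
```

The dictionary terms are returned as raw codes; translating them ("n" to "new", "a" to "Language material") would need the code lists from the MARC21 documentation.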
  • 5. Record type (Type of record / Bibliographic level → type): a / a, c, d or m → Books; a / b, i or s → Continuing Resources; t → Books; c, d, i or j → Music; e or f → Maps; g, k, o or r → Visual Materials; m → Computer Files; p → Mixed Materials
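The record type table can be expressed as a small lookup. The sketch below assumes the mapping is exactly the one on the slide; the function name `record_type` is hypothetical and this is not the tool's own code.

```python
# Types decided by Leader/06 alone, as listed on the slide.
BY_TYPE_OF_RECORD = {
    "t": "Books",
    "c": "Music", "d": "Music", "i": "Music", "j": "Music",
    "e": "Maps", "f": "Maps",
    "g": "Visual Materials", "k": "Visual Materials",
    "o": "Visual Materials", "r": "Visual Materials",
    "m": "Computer Files",
    "p": "Mixed Materials",
}


def record_type(type_of_record: str, bibliographic_level: str) -> str:
    """Derive the record type from Leader/06 and Leader/07 (hypothetical helper)."""
    if type_of_record == "a":
        # For language material the bibliographic level decides.
        if bibliographic_level in {"a", "c", "d", "m"}:
            return "Books"
        if bibliographic_level in {"b", "i", "s"}:
            return "Continuing Resources"
    elif type_of_record in BY_TYPE_OF_RECORD:
        return BY_TYPE_OF_RECORD[type_of_record]
    raise ValueError(
        f"no record type for Leader/06={type_of_record!r}, "
        f"Leader/07={bibliographic_level!r}"
    )


print(record_type("a", "m"))
```

Combinations outside the table raise an error here; how such records should be treated is a design decision the table itself does not settle.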
  • 6. Positional fields - 008: ‘801003s1958 ja 000 0 jpn ’ splits into three parts: positions 00-17 are common for all types (part I), positions 18-34 are the type-specific part, and positions 35-39 are common for all types (part II).
  • 7. Positional fields - 008: ‘801003s1958 ja 000 0 jpn ’. Position map (one letter per position): aaaaaabccccddddeeefffgh All materials; IIIIjkLLLLmnopqr Books; ijklmnOOOpqrs Continuing Resources; iijklmNNNNNNOOp Music; IIIIjjklmnOO Maps; Iiijklmn Visual Materials; ijkl Computer Files; i Mixed Materials. Legend: lower case = distinct units; upper case = repeatable units; blank = undefined position. The type-specific layout depends on the record type (calculated from Leader values).
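A minimal Python sketch of slicing the parts of 008 that are common to all material types. The position ranges follow the MARC21 definition of 008 (00-17 and 35-39); the names in `COMMON_008` and the function `parse_008_common` are hypothetical. Because the transcript collapses runs of spaces, the 40-character sample value is rebuilt with explicit padding and a blanked type-specific part, which is an assumption.

```python
# (start, end) slices for the positions shared by all material types.
COMMON_008 = {
    "date_entered": (0, 6),           # 008/00-05, yymmdd
    "type_of_date": (6, 7),           # 008/06
    "date1": (7, 11),                 # 008/07-10
    "date2": (11, 15),                # 008/11-14
    "place_of_publication": (15, 18), # 008/15-17
    "language": (35, 38),             # 008/35-37
    "modified_record": (38, 39),      # 008/38
    "cataloging_source": (39, 40),    # 008/39
}


def parse_008_common(value: str) -> dict:
    """Return the type-independent elements of an 008 value (hypothetical helper)."""
    if len(value) != 40:
        raise ValueError(f"008 has 40 characters, got {len(value)}")
    return {name: value[start:end] for name, (start, end) in COMMON_008.items()}


# Assumed reconstruction of the slide's example with the padding restored.
sample = "801003s1958" + " " * 4 + "ja " + " " * 17 + "jpn" + "  "
print(parse_008_common(sample))
```

Positions 18-34 are left untouched here because their layout depends on the record type derived from the Leader.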
  • 8. Datafields (repeatable/non-repeatable): Indicator1, Indicator2, Subfield1, ... , Subfieldn. Indicators: always 1 char long, dictionary term. Subfields: ❏ code ❏ value: free text; dictionary term; fixed format (e.g. yymmdd); fixed format + dictionary terms (d7i2); fixed positions + dictionary terms ❏ repeatable/non-repeatable
  • 9. Versions ❏ Changes of the standard ❏ No versioning ❏ New, deleted and changed elements every year ❏ Localized versions ❏ Introducing new fields ❏ Overwriting existing fields ❏ Mixing localized versions ❏ No notion about the localization ❏ 50+ localizations (international, national, consortial) 9
  • 10. Handling versions (020, ISBN) setSubfieldsWithCardinality( "a", "International Standard Book Number", "NR", "c", "Terms of availability", "NR", "q", "Qualifying information", "R", ... ); setHistoricalSubfields( "b", "Binding information (BK, MP, MU) [OBSOLETE]" ); putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList( new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R") )); 10
  • 11. Addressing elements - MARCspec XML: XPath﹣W3C standard JSON: JSONPath﹣by Stefan Gössner (http://goessner.net/articles/JsonPath/) MARC: MARCspec﹣by Carsten Klee (Zeitschriftendatenbank, Berlin) ❏ 260﹣field ❏ 245^2﹣the second indicator of a field ❏ 700[0]﹣the first instance of a field ❏ 245$c﹣a subfield ❏ 245$b{007/0=a|007/0=t}﹣subfield ‘b’ of field ‘245’, if the character at position ‘0’ of field 007 equals ‘a’ OR ‘t’. ❏ 020$c{$q=paperback}﹣subfield ‘c’ if subfield ‘q’ equals ‘paperback’. http://marcspec.github.io/MARCspec/marc-spec.html
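To make the four simplest forms above concrete (`260`, `245^2`, `700[0]`, `245$c`), here is a deliberately small Python sketch. It covers only those forms; conditional subspecs in braces and the rest of the MARCspec grammar are rejected. The pattern and the function `parse_simple_spec` are hypothetical and not part of any MARCspec library.

```python
import re

# Tag, optional [index], then optionally either ^indicator or $subfield.
SIMPLE_SPEC = re.compile(
    r"(?P<tag>\d{3})"
    r"(?:\[(?P<index>\d+)\])?"
    r"(?:\^(?P<indicator>[12])|\$(?P<subfield>[0-9a-z]))?"
)


def parse_simple_spec(spec: str) -> dict:
    """Parse a tiny subset of MARCspec into its parts (hypothetical helper)."""
    match = SIMPLE_SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"not covered by this simplified sketch: {spec!r}")
    index = match.group("index")
    indicator = match.group("indicator")
    return {
        "tag": match.group("tag"),
        "index": int(index) if index is not None else None,
        "indicator": int(indicator) if indicator is not None else None,
        "subfield": match.group("subfield"),
    }


for spec in ("260", "245^2", "700[0]", "245$c"):
    print(spec, parse_simple_spec(spec))
```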
  • 12. Part II. record validation and quality assurance Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0 https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg 12
  • 13. preparation cd ~/Prog mkdir marc cd marc wget https://www.loc.gov/cds/downloads/MDSConnect/BooksAll.2014.part01.utf8.gz gunzip BooksAll.2014.part01.utf8.gz mv BooksAll.2014.part01.utf8 loc01.mrc cd ../metadata-qa-marc/target # optional: wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.2/metadata-qa-marc-0.2-SNAPSHOT-jar-with-dependencies.jar
  • 14. validating individual records > ./validator ../marc/loc01.mrc > less validation-report.txt Error in ' 00000057 ': 082$ind1: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd082.html) Error in ' 00000119 ': 700$ind1: invalid value '2' (https://www.loc.gov/marc/bibliographic/bd700.html) Error in ' 00000234 ': 082$ind1: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd082.html) Errors in ' 00000294 ': 050$ind2: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd050.html) 260$ind1: obsolete value '0' (https://www.loc.gov/marc/bibliographic/bd260.html) 710$ind2: invalid value '0' (https://www.loc.gov/marc/bibliographic/bd710.html) 710$ind2: invalid value '0' (https://www.loc.gov/marc/bibliographic/bd710.html) 14
  • 15. summary of errors > ./validator --summary ../marc/loc01.mrc > less validation-report.txt 006/01-04 (tag006book01): contains invalid code 'n' in ' n ' (https://www.loc.gov/marc/bibliographic/bd006.html) (1 times) ... 020$a: invalid ISBN '052179296X' is not a valid ISBN 10 value' (https://en.wikipedia.org/wiki/International_Standard_Book_Number) (1 times) ... 045$a: invalid lengh 'v v': length is not 4 char' (https://www.loc.gov/marc/bibliographic/bd045.html) (1 times) … 740$ind2: obsolete value '1' (https://www.loc.gov/marc/bibliographic/bd740.html) (22 times) 15
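The ISBN message in this report can be reproduced with the standard ISBN-10 check: the ten characters are weighted 10 down to 1 and the sum must be divisible by 11, with "X" standing for 10 in the last position. The Python sketch below illustrates that rule only; it is not the validator's code and `is_valid_isbn10` is a hypothetical name.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 checksum (weights 10..1, sum divisible by 11)."""
    compact = isbn.replace("-", "").replace(" ", "").upper()
    if len(compact) != 10:
        return False
    total = 0
    for position, char in enumerate(compact):
        if char == "X" and position == 9:
            value = 10                      # 'X' is only allowed as check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("052179296X"))  # the value reported in the summary
print(is_valid_isbn10("3805909810"))  # 020$a of the example record on slide 3
```

For '052179296X' the weighted sum is 212, which is not a multiple of 11, so the check fails; for '3805909810' the sum is 253 = 11 × 23, so it passes.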
  • 16. Specifying the MARC version > ./validator --marcVersion "GENT" [file] Currently supported versions: ★ MARC21 -- Library of Congress MARC21 ★ DNB -- Deutsche Nationalbibliothek's MARC version ★ OCLC -- OCLCMARC ★ GENT -- Gent University (Belgium) ★ SZTE -- Szegedi Tudományegyetem (Hungary) ★ FENNICA -- the Fennica catalog of Finnish National Library
  • 17. output format > ./validator --format "tab-separated" ../marc/loc01.mrc Options: ★ text ★ tab-separated ★ comma-separated ★ json* * in progress
  • 18. default record type SEVERE: Error with record '002066968'. Leader/06 (typeOfRecord): 'n', Leader/07 (bibliographicLevel): 'm' > ./validator --defaultRecordType "BOOKS" ../marc/loc01.mrc ★ BOOKS ★ CONTINUING_RESOURCES ★ MUSIC ★ MAPS ★ VISUAL_MATERIALS ★ COMPUTER_FILES ★ MIXED_MATERIALS
  • 19. suppress log messages > ./validator --nolog ../marc/loc01.mrc 19
  • 20. output file name ./validator --fileName "my-report" ../marc/loc01.mrc Special filename: "stdout" ./validator --fileName "stdout" --nolog [file] | catmandu … | RScript … | python … | grep ...
  • 21. processing MARC XML file > ./validator --marcxml [example.marcxml] 21
  • 22. processing a subset of records process records 1-1000 > ./validator --limit 1000 ../marc/loc01.mrc process records 5001-6000 > ./validator --offset 5000 --limit 1000 ../marc/loc01.mrc
  • 23. Fix ALEPHSEQ placeholder '^' ALEPH exports contain '^' characters instead of spaces in control fields (006, 007, 008). This flag replaces them with spaces before validation ./validator --fixAlephseq [file]
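The substitution this flag performs can be sketched in a few lines of Python. The sketch assumes a record is available as (tag, value) pairs; the names `CONTROL_FIELDS` and `fix_alephseq` are hypothetical and the tool's own implementation may differ.

```python
# Control fields in which ALEPH exports use '^' as a space placeholder.
CONTROL_FIELDS = {"006", "007", "008"}


def fix_alephseq(tag: str, value: str) -> str:
    """Replace '^' with a space, but only in the affected control fields."""
    if tag in CONTROL_FIELDS:
        return value.replace("^", " ")
    return value


print(repr(fix_alephseq("008", "801003s1958^^^^ja^")))
```

Restricting the replacement to 006, 007 and 008 keeps a literal '^' in data fields untouched.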
  • 24. viewing/selecting records Displaying record with given ID > ./formatter --id "002032820" ../marc/loc01.mrc Displaying Nth record > ./formatter --countNr 1 ../marc/loc01.mrc Displaying records matching a query > ./formatter --search '260$c=1899.' ../marc/loc01.mrc Retrieve given elements > ./formatter --selector '245$c' ../marc/loc01.mrc
  • 25. extract given elements > ./formatter --selector '245$c' ../marc/loc01.mrc By S. H. Aurand. by Charles E. Chadman. > ./formatter --selector '245$c' --withId ../marc/loc01.mrc 00000002 By S. H. Aurand. 00000004 by Charles E. Chadman. > ./formatter --selector '245$a;245$c' --withId ../marc/loc01.mrc 00000002 1899. By S. H. Aurand. 00000004 1899. by Charles E. Chadman.
  • 26. calculating Thompson-Traill completeness Thompson and Traill (2017) Leveraging Python to improve ebook metadata selection, ingest, and management (Code4Lib Journal 38, http://journal.code4lib.org/articles/12828) 26
  • 27. calculating Thompson-Traill completeness ./tt-completeness ../marc/loc01.mrc options: limit, offset, fileName, nolog > less tt-completeness.csv id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total "010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4 "01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5 "010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5 "010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6 27
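In the sample rows above, the `total` column equals the sum of the individual score columns. The Python sketch below checks that relationship on two of the rows; it assumes the CSV layout shown on the slide, and `rows_with_wrong_total` is a hypothetical helper rather than part of the tool.

```python
import csv
import io

SAMPLE_LINES = [
    "id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,"
    "Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,"
    "Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total",
    '"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4',
    '"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5',
]


def rows_with_wrong_total(csv_text: str) -> list:
    """Return the ids of rows whose 'total' differs from the sum of the scores."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores = [int(v) for k, v in row.items() if k not in ("id", "total")]
        if sum(scores) != int(row["total"]):
            mismatches.append(row["id"])
    return mismatches


print(rows_with_wrong_total("\n".join(SAMPLE_LINES)))
```

The same loop can be pointed at the contents of `tt-completeness.csv` as a quick sanity check before analysing the scores further.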
  • 28. K-means clustering Spark (Scala) ❏ increasing the number of clusters decreases the distance from the centroids ❏ after a point this gain is not so big (“elbow effect”) -- in theory ❏ clusters found: a big number of low quality records; small clusters with ‘in between’ quality records; the acceptable average; clusters with good quality records
  • 29. indexing with Solr java -cp $JAR de.gwdg.metadataqa.marc.cli.MarcToSolr [options] [file] Options: ★ solrUrl ★ doCommit ★ solrFieldType ○ marc-tags ○ human-readable ○ mixed ★ nolog 29
  • 30. Indexing with Solr: How to name the fields? "marc-tags" format "100a_ss": "Jung-Baek, Myong Ja", "100ind1_ss": "Surname", "245c_ss": "Vorgelegt von Myong Ja Jung-Baek." "human-readable" format "MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja", "MainPersonalName_type_ss": "Surname", "Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek." "mixed" format "100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja", "100ind1_MainPersonalName_type_ss": "Surname", "245c_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
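The three naming schemes differ only in which parts of a field's identity are concatenated. A hypothetical Python sketch, assuming the `_ss` suffix and the labels shown on the slide (the function `solr_field_name` is not part of the tool):

```python
def solr_field_name(tag: str, code: str, field_label: str,
                    subfield_label: str, style: str) -> str:
    """Build a Solr field name in one of the three styles from the slide."""
    marc_part = f"{tag}{code}"                      # e.g. "100a", "100ind1"
    human_part = f"{field_label}_{subfield_label}"  # e.g. "MainPersonalName_type"
    if style == "marc-tags":
        name = marc_part
    elif style == "human-readable":
        name = human_part
    elif style == "mixed":
        name = f"{marc_part}_{human_part}"
    else:
        raise ValueError(f"unknown style: {style!r}")
    return name + "_ss"


for style in ("marc-tags", "human-readable", "mixed"):
    print(solr_field_name("100", "a", "MainPersonalName", "personalName", style))
```

The mixed style keeps names sortable by MARC tag while staying readable, at the cost of longer field names.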
  • 33. Finding problems with facets: Vandenhoeck und Ruprecht; Vandenhoeck & Ruprecht; Vandenhoeck u. Ruprecht; Vandenhoeck; Vandenhoek & Ruprecht; Vandenhoek und Ruprecht; Bandenhoed und Ruprecht; Vandenhoeck et Ruprecht; Vandenhoeck & Reprecht; Vandenhoed und Ruprecht; V&R unipress; V&R Unipress; V & R Unipress; V & R unipress
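Some of the variants in this facet list differ only in case, spacing, or how "and" is written, so a simple normalization key already groups them. The Python sketch below shows that idea; the rule set and the names `publisher_key` and `group_variants` are assumptions made for illustration. Genuine misspellings such as "Vandenhoek" or "Reprecht" are not caught and would need fuzzy matching.

```python
from collections import defaultdict

# Tokens treated as equivalent spellings of "and" (assumed rule set).
AND_WORDS = {"und", "u.", "et", "&"}


def publisher_key(name: str) -> str:
    """Lowercase, unify the 'and' word, and drop whitespace."""
    tokens = ["&" if token in AND_WORDS else token for token in name.lower().split()]
    return "".join(tokens)


def group_variants(names: list) -> dict:
    """Group facet values that share the same normalization key."""
    groups = defaultdict(list)
    for name in names:
        groups[publisher_key(name)].append(name)
    return dict(groups)


facet_values = [
    "Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht",
    "Vandenhoeck u. Ruprecht", "Vandenhoeck et Ruprecht",
    "Vandenhoek & Ruprecht", "V&R unipress", "V & R Unipress",
]
for key, variants in group_variants(facet_values).items():
    print(key, variants)
```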
  • 34. Usage in DH Benjamin Schmidt (2017) A brief visual history of MARC cataloging at the Library of Congress. http://sappingattention.blogspot.de/2017/05/a-brief-visual-history-of-marc.html 1. extract fields from MARC 2. data cleaning 3. visualize with R
  • 35. Extract data ./formatter --selector "260c;008~0-5" [file] > dates.tsv or put into a cleaning pipeline ./formatter --selector "260c;008~0-5" [file] | sed ... | grep ... | awk ... > dates.tsv Raw output (260c, 008~0-5): 1977., 780804; 1977., 781121; [1973]., 740215. After cleaning (publication, record): 1977, 1978-08-04; 1977, 1978-11-21; 1973, 1974-02-15
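The slide leaves the cleaning step to sed, grep and awk. As an alternative illustration, the same transformation can be sketched in Python; the function names are hypothetical. One assumption matters: the two-digit year in 008/00-05 is expanded with Python's `%y` convention (69-99 become 1969-1999, 00-68 become 2000-2068), which should be reviewed against the actual data.

```python
import re
from datetime import datetime


def clean_publication_year(raw: str):
    """Pull the first four-digit run out of a 260$c value, e.g. '[1973].' -> 1973."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None


def clean_record_date(raw: str) -> str:
    """Turn a yymmdd value from 008/00-05 into an ISO date string."""
    return datetime.strptime(raw, "%y%m%d").date().isoformat()


for publication, record in [("1977.", "780804"), ("[1973].", "740215")]:
    print(clean_publication_year(publication), clean_record_date(record))
```

Values of 260$c without any four-digit run yield `None`, so they can be filtered out in a later step rather than silently guessed.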
  • 36. Filtering out extreme values data %>% filter(publication > 2018) %>% arrange(desc(publication)) Output (row: publication <int>, record <int>): 1: 5732, 1990; 2: 4185, 2013; 3: 2201, 2012; 4: 2030, 2015; 5: 2022, 2016; 6: 2020, 2011; 7: 2019, 2015
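The slide's filter is written in R with dplyr. For readers working in Python, an equivalent of the same two steps (keep publication years after 2018, sort descending) might look like the sketch below. The rows are (publication, record) pairs; four come from the slide's output and one plausible row is added as a hypothetical contrast. `extreme_values` is a made-up name.

```python
def extreme_values(rows, max_year=2018):
    """Keep rows whose publication year exceeds max_year, largest first."""
    flagged = [row for row in rows if row[0] > max_year]
    return sorted(flagged, key=lambda row: row[0], reverse=True)


rows = [
    (1977, 1978),   # hypothetical plausible row, should be dropped
    (2201, 2012),
    (5732, 1990),
    (2019, 2015),
    (4185, 2013),
]
for publication, record in extreme_values(rows):
    print(publication, record)
```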
  • 37. cataloging frontline ❏ intensive backward cataloging - maybe importing? ❏ backward cataloging is still intensive, the tendency continues ❏ peak is > 13K ❏ 2000-07-10, the “golden day”: 95K new records ❏ forward cataloging
  • 38. 38
  • 39. available catalogs to measure ❏ Library of Congress ❏ Harvard University Library ❏ Columbia University Library ❏ Deutsche Nationalbibliothek ❏ Universiteitsbibliotheek Gent ❏ Bibliotheksservice-Zentrum Baden-Württemberg ❏ Bibliotheksverbund Bayern ❏ University of Michigan Library ❏ Toronto Public Library ❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB) ❏ Répertoire International des Sources Musicales ❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich) ❏ British Library ❏ Talis https://github.com/pkiraly/metadata-qa-marc#datasources
  • 40. Authority entries Responsibility statement: Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent (vormgeving). Authority entries: ❏ Herr Seele ❏ Coussement, Toon ❏ Claes, Peter ❏ Van Sande, Hera 40
  • 41. everything else … at least regarding this project https://github.com/pkiraly/metadata-qa-marc https://twitter.com/kiru peter.kiraly@gwdg.de