Word Occurrence Based Extraction of Work Contributors from Statements of Responsibility

Nuno Freire
The European Library

TPDL-2013
Valletta, September 2013

Overview
Statements of responsibility from library bibliographic data:
“French Canadian freely arranged by Katherine K. Davis”.
“ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by
Coop Himmelblau.”
“W. Lange, A.C. Zeven and N.G. Hogenboom, editors”
“Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de
Luis”

Extracting work contributors for use in a rights
infrastructure: ARROW
http://arrow-net.eu

Outline
 The context
• The ARROW rights infrastructure
• The use of national bibliographies in ARROW
 The problem
 The approach
 Evaluation
 Conclusion and future work

The ARROW rights
infrastructure
 ARROW aims to support mass digitisation projects
with automated ways to clear the rights of the books to
be digitised.
 To identify and clear the rights associated with a book
a complex process needs to be undertaken:
• Determine the work(s) contained within the book
• Identify all the other expressions of the same work(s)
• Identify the publisher(s) and contributor(s) involved
• Determine the dates of publication at work level
• Determine whether that work(s), and not the book itself, is
still in commerce
• If necessary, obtain any licenses from the rights holders or
collective rights organizations

What is ARROW
 A rights infrastructure and system for the
identification of:
• Rights status
• In or out of copyright
• In or out of print / commercialised or not
• Rights
• Which rights are involved
• Right holders
• Authors
• Publishers
• How and where to clear the rights
• Orphan Works and their registration

Sources of Information in ARROW
 ARROW makes information available
from several sources:
• The European Library:
• National bibliographies - to identify the book and to
cluster it with all other books containing the same
intellectual work
• Virtual International Authority File - to better identify
the authors and support the identification of in copyright
works
• Books in Print database - to know if any of the books
concerned are actively commercialised by any publisher
• Reproduction Rights Organisation – to see if they know
or can trace the rightholders

The Role of Libraries
• National Libraries as Metadata Providers
• Provide the National Bibliographies to The European Library

The Role of The European Library (TEL)
• To match library requests with national bibliographies
• Identify all other manifestations that potentially share
intellectual work with a manifestation
• To create a Work record: work metadata, manifestations,
contributors, etc.

The Role of Books-in-Print (BIP)
• To provide data about in print/out of print status
• To provide data about publishers
• To add new manifestation records of the work

The Role of Reproduction Rights Organisation (RRO)
• RROs as Metadata Provider
• To provide data about authors and publishers
• To provide data about available licenses
…

Statements of responsibility

 These statements usually contain information
about authorship, editors, photographers,
translators, and others involved in creating the
work
 In printed books, the statement of responsibility
is typically present on the title page
• The statement of responsibility is transcribed by the cataloguer
exactly as it appears in the book
(according to Anglo-American Cataloguing Rules)

Examples of statements of
responsibility
“French Canadian freely arranged by Katherine K. Davis”.
“ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by
Coop Himmelblau.”
“W. Lange, A.C. Zeven and N.G. Hogenboom, editors”
“by Pamela and Neal Priestland”
“Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de
Luis”

The problem
 National bibliographies reliably represent the first author of
a work in structured form
 But secondary contributors are often not
represented in structured form
 Secondary contributors may reside only within
the statements of responsibility

The approach
 To approach the problem as a Named Entity Recognition
task in text that may not be grammatically correct, thus
lacking lexical evidence
 Some requirements from the ARROW context
• Easily applicable to several languages
• The outcomes of the recognition task must be explainable

 Design decisions
• Exploring the structured data within national bibliographies
• By analysis of the frequency of word occurrences in names of
persons, and in other textual data
• Using word occurrence frequency makes it possible to
• bypass the need for building training sets
• provide simpler explanations of the name recognition
results

The process – pre-processing

 A pre-processing of each national
bibliography is performed:
• Word frequency is calculated
• The frequency values are normalized, for
independence on the size of the national bibliography
• The pre-processing results in four dictionaries:
• Words in titles
• Words in persons’ surnames
• Words in parts of persons’ names other than the surname
• Words that appear in lowercase in person names
(such as “von” in German names, or “de” in Portuguese
names)
• The dictionaries contain the normalized frequency
associated with the words
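
A minimal sketch of this pre-processing step, under stated assumptions: the record layout (`title`, `persons` as surname/forename pairs), the function name `build_dictionaries`, and normalization by dividing counts by the number of records are all hypothetical. The slides only say that frequencies are normalized to be independent of the size of the bibliography.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits and punctuation are ignored.
WORD = re.compile(r"[^\W\d_]+")

def build_dictionaries(records):
    """Return four word -> normalized-frequency dictionaries.

    Assumed (hypothetical) record layout:
    {"title": str, "persons": [(surname, forenames), ...]}
    """
    counters = {
        "titles": Counter(),
        "surnames": Counter(),
        "other_names": Counter(),
        "lowercase": Counter(),
    }
    for rec in records:
        for word in WORD.findall(rec["title"]):
            counters["titles"][word.lower()] += 1
        for surname, forenames in rec["persons"]:
            for text, key in ((surname, "surnames"), (forenames, "other_names")):
                for word in WORD.findall(text):
                    if word.islower():
                        # e.g. "von", "de": lowercase particles inside names
                        counters["lowercase"][word] += 1
                    else:
                        counters[key][word.lower()] += 1
    # Assumption: normalize by the number of records in the bibliography.
    n = max(len(records), 1)
    return {key: {w: c / n for w, c in counter.items()}
            for key, counter in counters.items()}

records = [
    {"title": "Faust", "persons": [("von Goethe", "Johann Wolfgang")]},
    {"title": "Der Zauberberg", "persons": [("Mann", "Thomas")]},
]
dictionaries = build_dictionaries(records)
```

Because every value is a relative frequency, dictionaries built from bibliographies of very different sizes can be compared with the same thresholds.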

The process – bibliographic record
processing
 The named entity recognition is performed for a
record as follows:
• Statement of responsibility is tokenized
• The person names are recognized by comparing the
tokens with the dictionaries
• The recognized names are compared against the
names of the contributors present in the structured
fields of the record.
• If no similar name exists in the record, the contributor
is added to the record in a structured data field
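
The last two steps above (comparison against the structured fields, then addition of missing contributors) could look roughly like the sketch below. The similarity measure is an assumption: the slides do not say how names are compared, so this example uses `difflib.SequenceMatcher` on order-insensitive, lowercased tokens with a hypothetical threshold of 0.85. The record layout and function names are also hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation and sort tokens, so that
    'Noever, Peter' and 'Peter Noever' compare as equal."""
    return " ".join(sorted(re.findall(r"\w+", name.lower())))

def is_similar(a, b, threshold=0.85):
    # Assumption: string similarity stands in for the name comparison.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def add_missing_contributors(record, recognized, threshold=0.85):
    """Append recognized names that match no structured contributor.
    Returns the list of names that were added."""
    added = []
    for name in recognized:
        known = record["contributors"]
        if not any(is_similar(name, k, threshold) for k in known):
            known.append(name)
            added.append(name)
    return added

record = {"contributors": ["Noever, Peter"]}
added = add_missing_contributors(record, ["Peter Noever", "Frank O. Gehry"])
```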

The process – named entity
recognition
Possible token sequences used to locate person names:
(in Augmented Backus–Naur Form)

    non-ambiguous-surname
    /
    ( initial /
      non-ambiguous-first-name /
      non-ambiguous-surname /
      non-ambiguous-non-capitalized-name )
    *( initial / first-name / surname / non-capitalized-name )
    surname

(more details on the definition of these tokens are included in the paper)
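
As an illustration only, the token sequences above can be matched with a greedy scan over already classified tokens. How a token gets its classes (for example, when a surname counts as non-ambiguous) is defined in the paper and is not reproduced here; in this sketch the labels are simply given as input, and the function name `find_names` is hypothetical. It assumes that a non-ambiguous token also carries its general label (a non-ambiguous surname is also labelled `surname`).

```python
# Labels allowed for the first token of a multi-token name.
START = {"initial", "non-ambiguous-first-name",
         "non-ambiguous-surname", "non-ambiguous-non-capitalized-name"}
# Labels allowed for the following tokens; the last one must be a surname.
MIDDLE = {"initial", "first-name", "surname", "non-capitalized-name"}

def find_names(tokens, labels):
    """Return person names found in tokens; labels[i] is the set of
    classes assigned to tokens[i]. Longest match wins."""
    names, i = [], 0
    while i < len(tokens):
        last = None
        if labels[i] & START:
            j = i + 1
            while j < len(tokens) and labels[j] & MIDDLE:
                if "surname" in labels[j]:
                    last = j  # furthest position where the name can end
                j += 1
        if last is not None:
            names.append(" ".join(tokens[i:last + 1]))
            i = last + 1
        elif "non-ambiguous-surname" in labels[i]:
            names.append(tokens[i])  # first alternative: a lone surname
            i += 1
        else:
            i += 1
    return names

tokens = ["ed", "by", "Peter", "Noever", "and", "W", "Lange"]
labels = [set(), set(),
          {"first-name", "non-ambiguous-first-name"},
          {"surname", "non-ambiguous-surname"},
          set(), {"initial"}, {"surname"}]
found = find_names(tokens, labels)
```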

Evaluation data set

(size of bibliographies and evaluation samples)

National Bibliography | Total records | Main language | Evaluation sample: statements of responsibility | Evaluation sample: referred persons
British Library | 13.4 million | English | 205 | 328
German National Library | 9.4 million | German | 200 | 378
National Library of the Netherlands | 3.2 million | Dutch | 200 | 335
National Library of Greece | 0.4 million | Greek | 297 | 379
Central Institute for the Union Catalogue of Italian Libraries | 12.4 million | Italian | 224 | 297
Royal Library of Belgium | 1 million | French and Dutch | 203 | 387
Total: | | | 1329 | 2104

Evaluation results

Dataset | Exact match: Precision | Exact match: Recall | Partial match: Precision | Partial match: Recall
British Library | 0.981 | 0.979 | 0.991 | 0.991
German National Library | 0.975 | 0.934 | 0.992 | 0.992
National Library of the Netherlands | 0.973 | 0.875 | 0.977 | 0.979
National Library of Greece | 0.656 | 0.414 | 0.758 | 0.868
Central Institute for the Union Catalogue of Italian Libraries | 0.97 | 0.896 | 0.971 | 0.973
Royal Library of Belgium | 0.981 | 0.959 | 0.981 | 0.982
Overall: | 0.948 | 0.837 | 0.958 | 0.963
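
For readers unfamiliar with the two metrics, here is a hedged sketch of how precision and recall can be computed under an exact and a partial matching criterion. The slides do not define partial match; the rule used below (the recognized and the reference name share at least one word) is an assumption for illustration, and the function names are hypothetical.

```python
def precision_recall(predicted, reference, match):
    """Precision: share of predicted names that match some reference name.
    Recall: share of reference names matched by some predicted name."""
    hit_pred = sum(any(match(p, r) for r in reference) for p in predicted)
    hit_ref = sum(any(match(p, r) for p in predicted) for r in reference)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_ref / len(reference) if reference else 0.0
    return precision, recall

def exact(p, r):
    return p == r

def partial(p, r):
    # Assumption: a partial match means sharing at least one word.
    return bool(set(p.split()) & set(r.split()))

reference = ["Peter Noever", "Frank O. Gehry", "Coop Himmelblau"]
predicted = ["Peter Noever", "O. Gehry"]
exact_p, exact_r = precision_recall(predicted, reference, exact)
partial_p, partial_r = precision_recall(predicted, reference, partial)
```

In this toy case the truncated name "O. Gehry" counts as an error under exact match but as a hit under partial match, which is the kind of gap visible between the two column groups of the results table.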

Evaluation results analysis

 The main causes of recognition errors:
• Foreign person names negatively affected recall
• Names of persons used in names of
organizations negatively affected precision
• Two persons with the same surname mentioned
together negatively affected recall. For
example:
• “hrsg. von Volker und Michael Kriegeskorte”
• “by Pamela and Neal Priestland”

Conclusions
 The approach performed reliably in most
languages and bibliographic datasets
• Datasets of at least one million records
• Precision and recall above 0.97 on all but one dataset
(partial match metric)

 The results obtained on the Greek national
bibliography were not satisfactory
• This dataset has distinct characteristics from the
others:
• smaller size,
• a different alphabet
• different language
• Further investigation of the Greek national
bibliography is necessary

Future work
 Evaluation of the impact of this solution on the
final results of the rights clearance process of
ARROW
 Building the dictionaries from comprehensive
sources of person names
• Virtual International Authority File (VIAF)
• International Standard Name Identifier (ISNI)

 Further functionality:
• recognition of organization names
• recognition of the role of the recognized contributors
(illustrator, editor, etc.)

 Other application scenarios
• Functional Requirements for Bibliographic Records
• Resource Description and Access

Acknowledgments
 The European Library
• Marcela Strelcova, Chiara Latronico and
Eva Kralt-Yap

 Associazione Italiana Editori
 University of Innsbruck
 This work was partially supported by the
ARROWplus project, with co-funding by the
European Commission programme
eContentplus

Thank you
Questions or comments?
Contact:
Nuno Freire – nuno.freire@kb.nl
