Managing Your Metadata Quality 2010 CrossRef Workshops

Patricia Feeney
Metadata Quality Coordinator
Managing your
metadata quality

Agenda
I. Metadata quality audit
II. DOI registration
III. Conflicts overhaul (discussion)
IV. Metadata Quality tools

Best query ever -> bad metadata = no match
Mediocre query -> bad metadata = no match
Horrible query -> bad metadata = no match
Best query ever -> good metadata = match ✓+
Mediocre query -> good metadata = match (probably) ✓
Horrible query -> good metadata = match (maybe) ✓-
Metadata Quality Audit: Overview
Accurate and complete metadata is vital to querying and citation
linking.
If the metadata for a DOI is incorrect, incomplete, or messy, a
match can't be made, regardless of the quality of a query.

Current efforts include:
• Reports
    - Resolution report (emailed monthly)
    - Depositor report (on website)
    - Crawler (on website)
    - Field report (on website)
    - Conflict report (on website, emailed monthly)
    - Schematron reports (emailed weekly)
    - Failed query report (on website)
    - DOI error reports (emailed daily)
• Contact members individually (as issues arise)
• Documentation and communication

Metadata Quality Audit
A Metadata Quality Audit will:
• provide publishers with detailed feedback on the quality of their metadata by identifying problem areas
• identify members who need attention
• provide motivation and support to members with metadata issues
The intent of the audit is to provide information, but there
may be consequences for extreme abusers.

Audit Scope
I. DOI resolution
II. Conflicts
III. Overall metadata quality
IV. Metadata maintenance
(Cartoon: "Hello, I’d like to audit you" / "Great, let’s get started! Hooray!")

I. DOI Resolution
Level I: DOIs that have been distributed but not deposited and resolve to the Handle error page. *
Level II: DOIs resolving to an error page *
Level III: DOIs with response page blocked by access control
Level IV: DOIs that resolve to an inadequate response page.
* actionable transgressions

II. Conflicts
Conflicts occur when two (or more) DOIs are
deposited with identical metadata.
Level I: conflicts created between members *
Level II: conflicts within a publisher prefix(es) *
Level III: conflicts created due to insufficient metadata +
Level IV: conflicts created due to item/content type +
* actionable transgressions
+ this may change, more later

Quality of deposited metadata
I. Missing metadata: is all available metadata
deposited?
II. Accuracy: is metadata correct?
III. Unusual metadata: does metadata fit into the
correct content type?
IV. Overall quality: is metadata messy?

Maintenance
I. Gaps in coverage - this usually indicates
undeposited DOIs (very very bad)
II. Currency of deposits - are deposits made ahead
of DOIs being distributed?
III. Title maintenance - less of a problem with recent
title restrictions, but we still have problems (title
abbreviations, for example)
IV. Reference linking compliance

Actionable Areas
DOI Resolution:
Level I (Undeposited DOIs)
Level II (DOIs resolving to error page)
• If action is not taken within a reasonable time period (TBD), DOIs will be registered on behalf of the member (eventually for a fee)
• Continual distribution of unregistered DOIs may affect membership
Conflicts:
Level I (conflicts created between members)
Level II (conflicts within a publisher prefix)
• A $2 per DOI conflict penalty fee may be imposed for conflicts of this type if they are not resolved within a reasonable time period (TBD).
Metadata Maintenance:
Outbound linking compliance
• Members found not to be linking during the audit will be subject to non-linking penalties

Audit Process
1. Notification: auditees will be informed of pending audit; data collection begins (1-2 weeks for most members)
2. Data delivery: audit document will be emailed to member for review 2 weeks prior to audit (longer if necessary); audit scheduled
3. Audit: phone conference, follow-up scheduled (if necessary)
4. Response: member/CR reconvene to discuss progress on audit findings
5. Follow-up (if necessary)

II. DOI Registration Pilot
DOIs should without exception be registered before
they are released to the public.
Most DOIs resolve, but the ones that don’t are a big
problem.
Solution: we’re going to register them*
*(ideal solution: publisher registers them)

DOI selection: At the moment, we will register DOIs
reported by end users, using the DOI error report as
a source.

DOI error report:
Implemented mid-2008
~4,000 DOI errors reported
monthly
> 1,400 fixed monthly through
publisher deposits
Some of the unfixed DOIs are
not ‘real’ DOIs, but many are.

We will register DOIs that meet the following criteria:
• Have been distributed publicly by the publisher/prefix owner
• Have an identifiable response page
• Have been reported to the publisher’s technical and business contacts

DOI Registration Process
1. DOI reported: a user reports an unresolving DOI
using the DOI error form
2. Technical contact notified (DOI error report email)
3. CrossRef review: CR staff reviews reported DOIs and
expires DOIs that do not meet our registration criteria
4. Business contact notified: 2 weeks from the initial
report, business contact is notified of remaining valid
unregistered DOIs.
5. CR deposit: after 2 weeks have passed from business
contact notification, CrossRef will register any
undeposited DOIs.
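
The two waiting periods in steps 4 and 5 can be turned into dates. The following is a minimal sketch, assuming each period is exactly 14 days counted from the previous step; the function name, dictionary keys, and the report date are illustrative and do not come from CrossRef.

```python
from datetime import date, timedelta

# Both waiting periods named in the process are 2 weeks.
NOTICE_PERIOD = timedelta(weeks=2)


def registration_timeline(reported_on: date) -> dict:
    """Return the key dates for one reported DOI (hypothetical helper)."""
    # Step 4: business contact is notified 2 weeks after the initial report.
    business_notified = reported_on + NOTICE_PERIOD
    # Step 5: CrossRef may deposit 2 weeks after that notification.
    crossref_deposit = business_notified + NOTICE_PERIOD
    return {
        "reported": reported_on,
        "business_contact_notified": business_notified,
        "earliest_crossref_deposit": crossref_deposit,
    }


# Made-up report date, used only to show the arithmetic.
timeline = registration_timeline(date(2010, 11, 1))
for step, day in timeline.items():
    print(f"{step}: {day.isoformat()}")
```

Under these assumptions a publisher has four weeks from the first error report to register the DOI itself before CrossRef steps in.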

Conflicts overhaul
Conflicts occur when two (or more) DOIs share
the same metadata, suggesting two DOIs are
assigned to a single item.

Why are conflicts bad?
• Only one DOI should be assigned per item
• Queries will return multiple DOIs, causing confusion
• Some queries (OpenURL) may not return a DOI if multiple results are present
• Conflicts between two DOIs often result in one of the DOIs being neglected***

We currently have ~200,000+ conflicts in our
system. Not all of them are a problem:
• For some items, our schema only allows minimal metadata
• Some content types require matching metadata (for example, standards and book chapters with minimal metadata, such as dictionaries)

Legitimate conflicts
Conflict between 2 prefixes:
http://dx.doi.org/10.1639/0044-7447(2001)030[0037:IOPOFU]2.0.CO;2
http://dx.doi.org/10.1579/0044-7447-30.1.37
Journal Title  Year  Vol  Issue  Page  Author  Article Title
AMBIO          2001  30   1      37    Köhlin  Impact of Plantations on Forest Use a...
Conflict within 1 prefix:
http://dx.doi.org/10.3724/SP.J.1006.2008.00070
http://dx.doi.org/10.3724/SP.J.1006.2008.00770
Journal Title           Year  Vol  Iss  Page  Author  Article Title
ACTA AGRONOMICA SINICA  2008  34   5    770   Zhang   Differential Gene Expression in Upper…

‘Bad’ conflicts
Conflicts with minimal metadata:
10.1002/ijc.11095
10.1002/ijc.11093
Journal Title                    Year  Vol  Issue  Page  Author  Article Title
International Journal of Cancer  2003  104  6      798   -       Errata
Conflict due to content type:
10.1520/C0506-10
10.1520/C0506-10A
10.1520/C0506-10B
Book Title                                Year  Edition  Page  Author         Title
Specification for Reinforced Concrete...  2010  2010     -     C13 Committee  -

Elements considered during conflict generation:
• Content type
• Journal, book and/or series title
• Article title / content_item title (book chapters)
• Publication year
• Volume
• Issue
• First page
• Author
• Edition
If there is a match between all deposited elements, a conflict is generated.
Two items with matching journal title, volume, issue, and article title will cause a conflict.
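
The matching rule above can be sketched as a grouping step. This is an illustration only, not CrossRef's actual conflict code: the field names are made up, values are compared case-insensitively, and an element that was not deposited is assumed to match only another missing element. The two 10.1002 DOIs and their metadata come from the ‘Bad’ conflicts slide; the 10.5555 record is invented for contrast.

```python
from collections import defaultdict

# Elements listed on the slide; these field names are illustrative only.
CONFLICT_FIELDS = (
    "content_type", "container_title", "item_title", "year",
    "volume", "issue", "first_page", "author", "edition",
)


def conflict_key(record: dict) -> tuple:
    """Build a comparison key; elements that were not deposited become None."""
    key = []
    for field in CONFLICT_FIELDS:
        value = record.get(field)
        text = str(value).strip().lower() if value is not None else ""
        key.append(text or None)
    return tuple(key)


def find_conflicts(records: dict) -> list:
    """Group DOIs whose deposited elements all match; keep groups of 2 or more."""
    groups = defaultdict(list)
    for doi, record in records.items():
        groups[conflict_key(record)].append(doi)
    return [sorted(dois) for dois in groups.values() if len(dois) > 1]


errata = {
    "content_type": "journal_article",
    "container_title": "International Journal of Cancer",
    "item_title": "Errata",
    "year": "2003", "volume": "104", "issue": "6", "first_page": "798",
}
records = {
    "10.1002/ijc.11095": dict(errata),
    "10.1002/ijc.11093": dict(errata),
    # Made-up record: same issue, different first page, so no conflict.
    "10.5555/example": {**errata, "first_page": "801"},
}
print(find_conflicts(records))
```

The errata pair shows why minimal metadata produces conflicts: with no author and a generic title, nothing remains to tell the two items apart.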

Ideas?
What should our minimum set of metadata
be?
How should conflicts be
monitored/reported?

Sample #1: incorrect metadata
Q: My link resolver is retrieving the wrong metadata for DOI
10.1002/rra.1288, causing our links to break - here is my
query*:
http://www.crossref.org/openurl?pid=pfeeney@crossref.org&aulast=Null&title=River Research and Applications&volume=26&issue=6&page=663&year=2010
*query metadata matches the response page metadata
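A: Two problems with deposited metadata (DOI query):
#1 the year is deposited as 2009, but the query and response page say 2010:
<year media_type="print">2009</year>
#2 the page numbers are deposited as "n/a" instead of the real first page (663):
<pages>
<first_page>n/a</first_page>
<last_page>n/a</last_page>
</pages>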
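
The query in this sample contains raw spaces in the journal title. A minimal sketch of building the same query with encoded values follows, assuming the endpoint accepts standard form encoding (spaces as "+"); the helper name and the pid placeholder are hypothetical, while the endpoint and parameter names are those shown in the query above.

```python
from urllib.parse import urlencode

# Endpoint as shown in the sample query.
OPENURL_ENDPOINT = "http://www.crossref.org/openurl"


def build_openurl_query(pid: str, **metadata: str) -> str:
    """Return an OpenURL query URL with every parameter value encoded."""
    params = {"pid": pid, **metadata}
    return f"{OPENURL_ENDPOINT}?{urlencode(params)}"


url = build_openurl_query(
    "you@example.org",  # placeholder; substitute your own registered address
    aulast="Null",
    title="River Research and Applications",
    volume="26",
    issue="6",
    page="663",
    year="2010",
)
print(url)
```

Encoding only removes one source of failure on the query side. It cannot compensate for the deposited year and page values shown in this sample.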

Sample #2: messy metadata
Q: I know DOI 10.1068/p6742 exists, why doesn’t my query
work?
A: Let’s check the guest query form
Metadata for article:
Newport R, Preston C, 2010, "Pulling the finger off disrupts agency, embodiment and
peripersonal space" Perception 39(9) 1296 – 1298
Problem is: author surname is deposited as:
<person_name sequence="first" contributor_role="author">
<given_name>Roger</given_name></given_name>
<surname><surname>Newport</surname></surname>
</person_name>

Sample #3: duplicate authors
Q: Why does DOI 10.2307/1382491 have multiple versions of
the same author?
A: The depositor included both spellings of the surname (Sæther and Saether) in an attempt to improve query matching
<contributors>
<person_name sequence="first" contributor_role="author">
<given_name>Erling Johan</given_name>
<surname>Solberg</surname>
</person_name>
<person_name sequence="additional" contributor_role="author">
<given_name>Bernt-Erik</given_name>
<surname>Sæther</surname>
</person_name>
<person_name sequence="additional" contributor_role="author">
<given_name>Bernt-Erik</given_name>
<surname>Saether</surname>
</person_name>
</contributors>
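
A second spelling like the one in this deposit can be derived mechanically. The sketch below is one possible approach, not necessarily how this depositor produced its extra author entry: the replacement table is a small, hypothetical sample, and transliteration conventions vary (ø, for instance, is sometimes written "oe").

```python
import unicodedata

# Letters that Unicode decomposition does not reduce to plain ASCII.
# Partial, illustrative table; real conventions differ by language.
SPECIAL = {"æ": "ae", "Æ": "Ae", "ø": "o", "Ø": "O", "ß": "ss"}


def ascii_variant(surname: str) -> str:
    """Best-effort plain-letter spelling of a surname."""
    replaced = "".join(SPECIAL.get(ch, ch) for ch in surname)
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def surname_variants(surname: str) -> list:
    """Return the original spelling plus a distinct plain-letter variant, if any."""
    variant = ascii_variant(surname)
    return [surname] if variant == surname else [surname, variant]


for name in ("Solberg", "Sæther", "Köhlin"):
    print(name, "->", surname_variants(name))
```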

New(ish) tools for managing
metadata and deposit problems
Schema documentation:
http://www.crossref.org/schema/documentation/ or linked
from help doc
Reporting problems / asking for help:
• Help documentation (http://www.crossref.org/help/)
• Support portal and forums (http://support.crossref.org)
• Contact support@crossref.org

Schematron update
Schematron reports notify depositors of non-fatal deposit issues
• 35-40 emails sent out weekly
• Alerts are generated for < 1% of deposits
• Tend to identify ‘messy’ deposits
• Rules updated periodically

Schematron Warnings
'Jr.' in surname: 61%
punctuation in surname: 26%
last page contains dash: 7%
first page contains dash: 4%
page number contains underscore: 2%
Jr. in surname:
Araújo Jr
Prata Jr.
Szezech Jr.
Punctuation in surname:
(Earven) Tribble
Frederick (Frikkie) J.
Arch Marin march@ub.edu
Plauchu********
Other rules:
• ‘ed’ ‘iss’ ‘vol’ in edition, issue, volume elements
• Publication year exceeds current year by >2
• Surname / title all upper case
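
Publishers who want to catch these issues before depositing could run similar checks locally. The following is a rough Python sketch of the warnings listed above, not the actual Schematron rules: the flat record layout, field names, regular expressions, and warning texts are all assumptions, and hyphens and apostrophes in surnames are deliberately allowed.

```python
import re
from datetime import date

# Label prefixes that should not appear inside the element values.
LABEL_PREFIXES = {"edition": "ed", "issue": "iss", "volume": "vol"}


def check_record(record: dict, today: date) -> list:
    """Return warning strings for one flat record (hypothetical field names)."""
    warnings = []
    surname = record.get("surname", "").strip()
    if re.search(r"\bJr\.?$", surname):
        warnings.append("Jr. in surname")
    elif re.search(r"[^\w\s'-]", surname):
        warnings.append("punctuation in surname")
    if len(surname) > 1 and surname.isupper():
        warnings.append("surname all upper case")
    for field in ("first_page", "last_page"):
        value = record.get(field, "")
        if "-" in value:
            warnings.append(f"{field.replace('_', ' ')} contains dash")
        if "_" in value:
            warnings.append("page number contains underscore")
    for field, prefix in LABEL_PREFIXES.items():
        # Flags values such as "vol. 39" or "2nd ed." that carry a label.
        if re.search(rf"\b{prefix}", record.get(field, ""), re.IGNORECASE):
            warnings.append(f"'{prefix}' in {field} element")
    year = record.get("year", "")
    if year.isdigit() and int(year) > today.year + 2:
        warnings.append("publication year exceeds current year by more than 2")
    return warnings


# Made-up record combining several of the problems listed above.
sample = {"surname": "Prata Jr.", "first_page": "1296-1298",
          "volume": "vol. 39", "year": "2013"}
for warning in check_record(sample, today=date(2010, 11, 1)):
    print(warning)
```

Passing the date in as a parameter keeps the year rule reproducible in tests.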

Questions?
support@crossref.org
pfeeney@crossref.org
