This document discusses annotating research datasets. It defines annotation as adding notes or explanations for clarification. Genome annotation attaches biological information to sequences. Research data annotation makes opaque data visible, sensible and valuable. It notes that many researchers have limited funding for data services and are not taught proper data management, so their datasets are difficult to find and curate. The document proposes that research data is well-suited for crowd-sourced annotation to help address these issues.
1. Annotating Research Datasets
11 April 2013
University of California Curation Center
California Digital Library
2. Term skew
Annotation: The act of adding a note by way of comment or explanation.
Genome annotation: The process of attaching biological information to sequences. E.g.,
• Protein Data Bank annotation manual: 247 pages
Research data annotation: (?!) Adding to opaque data to make it visible, sensible, and valuable.
4. The Long Tail
Graph labels: Size of dataset; # datasets; # researchers
5. The Long Tail
Graph labels: Size of dataset; # datasets; # researchers; # grants
6. The Long Tail
Graph labels: Size of dataset; grant ($); # datasets; # researchers; # grants
7. The Long Tail
Graph labels: With data managers and fancy tools; Size of dataset; grant ($); # datasets; # researchers; Do-it-yourself tools; # grants
8. UGLY TRUTH
Many researchers…
• have limited funding for data services
• are not taught data management
• don’t know what metadata or data centers are
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data
Image: from Flickr, by puck90
9. The research data problem
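• Journal article
– Uniquely and persistently identified
– Concept of “publish”
– Multiple copies
– Easily findable
– Impact metrics, etc.
– Curation funding
• Research data
– Uniquely and persistently identified: Nope
– Concept of “publish”: Not really
– Multiple copies: Typically one
– Easily findable: Difficult
– Impact metrics, etc.: Nope
– Curation funding: Barely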
Research data is ripe for crowd-sourced annotation
Editor's Notes
10 minutes, Day 2, 9am April 11. Abstract: A huge amount of incredibly diverse research data remains beyond the reach of internet search engines, peer review processes, and systematic cataloging. The ability of consumers to annotate data is an important mitigation, harnessing "the crowd" to make it easier for everyone to discover and re-use data.
One way of looking at Big Data is this graph showing dataset size on the vertical axis against numbers of datasets on the horizontal axis. While there are some very large, celebrated datasets produced by satellites, ocean sensors, etc., there’s a very long tail off to the right of smaller, more obscure datasets that cumulatively account for a large portion of Big Data.
There are many more researchers out in the field collecting heterogeneous data, such as species counts obtained by visual sightings.
And there are many more grants supporting this kind of research...
And those grants are usually much smaller in terms of dollar amounts.
As a result, the large, celebrated datasets tend to come with staff positions for data management, along with well-supported, standardized software tools that support rich description and discovery and enforce certain curation standards. So for a huge number of grants and datasets, especially in Earth, environmental, and ecological sciences, ...
... there’s an ugly truth. Many of these researchers, ... This amounts to a whole lot of inertia that keeps a large part of the scientific record invisible, at-risk, and unavailable for re-use.