1. Scientific Data Management
A tutorial at ICADL 2011
October 24, 2011
Jian Qin
School of Information Studies, Syracuse University
http://eslib.ischool.syr.edu/
2. The morning ahead
An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?
Case study: The gravitational wave research data management
Group work: Role play in developing data management initiatives
12/18/11 15:51 Overview of E-Science 2
3. An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?
Overview of E-Science
Characteristics of e-science
Data sets, data collections, and data repositories
Why does it matter to libraries?
4. E-Science
“In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.”
National e-Science Center. (2008). Defining e-Science. http://www.nesc.ac.uk/nesc/define.html
5. E-Infrastructure for the research lifecycle
http://epubs.cclrc.ac.uk/bitstream/3857/science_lifecycle_STFC_poster1.PDF
6. Shift in Science Paradigms
Thousand years ago: Science was empirical, describing natural phenomena
A few hundred years ago: Theoretical branch using models, generalizations
A few decades ago: A computational approach simulating complex phenomena
Today: Data exploration (eScience) unify theory, experiment, and simulation
-- Data captured by instruments or generated by simulator
-- Processed by software
-- Information/Knowledge stored in computer
-- Scientist analyzes database/files using data management and statistics
Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
8. X-Info
• The evolution of X-Info and Comp-X for each discipline X
• How to codify and represent our knowledge
Diagram: facts flow in from Experiments & Instruments, Other Archives, Literature, and Simulations; questions go in and answers come out.
The Generic Problems
• Data ingest
• Managing a petabyte
• Query and Vis tools
• Building and executing models
• Common schema
• How to organize it
• Integrating data and Literature
• Documenting experiments
• How to reorganize it
• How to share with others
• Curation and long-term preservation
Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
9. Useful resources
• What is eScience?
• eScience Initiatives
• Science Research and Data
• Science Data Management
• Literature Reviews
• Data Policy Issues
• eScience Research Centers
• http://eslib.ischool.syr.edu/index.php?option=com_content&view=section&id=9&Itemid=83
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
10. A FEW IMPORTANT CONCEPTS
11. Data
Any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
Image caption: An artist’s conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems.
http://www.nsf.gov/pubs/2007/nsf0728/nsf0728_4.pdf
12. Scientific data formats
Common data format
Image formats
Matrix formats
Microarray file formats
Communication protocols
13. Scientific datasets
• The scientific data set, or SDS, is a group of data structures used to store and describe multidimensional arrays of scientific data.
• The boundaries of datasets vary from discipline to discipline
NCSA HDF Development Group. (1998). HDF 4.1r2 User's Guide. http://www.hdfgroup.org/training/HDFtraining/UsersGuide/SDS_SD.fm1.html#48894
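To make the SDS definition concrete, the following is a minimal Python sketch of "a group of data structures used to store and describe" a multidimensional array. It is not the HDF library API; the class name, field names, and sample values are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import prod


@dataclass
class ScientificDataSet:
    """Hypothetical stand-in for an SDS: an array plus the structures describing it."""

    name: str
    shape: tuple[int, ...]      # size of each dimension
    values: list[float]         # array contents, flattened in row-major order
    dim_names: tuple[str, ...] = ()
    attributes: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The description (shape) must agree with the stored data.
        if len(self.values) != prod(self.shape):
            raise ValueError("values do not fill the declared shape")

    def get(self, *index: int) -> float:
        """Look up one cell by its multidimensional index."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, size in zip(index, self.shape):
            if not 0 <= i < size:
                raise IndexError("index out of range")
            offset = offset * size + i
        return self.values[offset]


# Illustrative values only: a 2 x 3 grid of made-up temperatures.
sds = ScientificDataSet(
    name="surface_temperature",
    shape=(2, 3),
    values=[280.1, 281.4, 279.9, 282.0, 283.3, 281.7],
    dim_names=("latitude", "longitude"),
    attributes={"units": "kelvin"},
)
```

The point of the sketch is that the array alone is not the dataset: the name, dimension labels, and attributes travel with it so that someone else can interpret the numbers.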
14. Scientific workflows
• Steps in data collection and analysis process
• Different types of scientific workflows:
– Data-intensive
– Compute-intensive
– Analysis-intensive
– Visualization-intensive
Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., & Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10): 1039-1065.
15. Example: Ecological dataset
• Floristic diversity data
– Related links
– Data attributes
– Download link
16. Example: Biodiversity dataset
• Actions for Porcupine Marine Natural History Society - Marine flora and fauna records from the North-east Atlantic
– Metadata record output in different standard formats
– URL for dataset download
17. Example: The Significant Earthquake Database
• The Significant Earthquake Database
– A database containing data about significant earthquake events and the damages caused
– An interface for extracting a subset of data
– A link to download the whole dataset
– Documentation
19. Research data collections
Data output size: Larger, discipline-based / Smaller, team-based
Metadata standards: Multiple, comprehensive / None or random
Management: Organized, institutionalized / Heroic individual inside the team
20. Research collections
• Limited processing or long-term management
• Do not conform to any data standards
• Varying sizes and formats of data files
• Low level of processing, lack of plan for data products
• Low awareness of metadata standards and data management issues
21. Resource collections
• Authored by a community of investigators, within a domain of science or engineering
• Developed with community level standards
• Lifetime is between mid- and long-term
• Example: Hubbard Brook Ecosystem Study (http://www.hubbardbrook.org)
– One of the regional sites in the Long Term Ecological Research Network (LTER)
– Community of the ecological domain
– Community of investigators from around the country on ecosystem study
– Ecological Metadata Language (EML), a community-level standard
– Cataloged, searchable dataset collections
22. Reference collection
• Example: Global Biodiversity Information Facility
– Created by large segments of science community
– Conform to robust, well-established and comprehensive standards, e.g.
    • ABCD (Access to Biological Collection Data)
    • Darwin Core
    • DiGIR (Distributed Generic Information Retrieval)
    • Dublin Core Metadata standard
    • GGF (Global Grid Forum)
    • Invasive Alien Species Profile
    • LSID (Life Sciences Identifier)
    • OGC (Open Geospatial Consortium)
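As one illustration of what conforming to a metadata standard looks like in practice, here is a minimal Python sketch that writes a few Dublin Core elements as XML. The namespace URI is the standard one for the Dublin Core element set; the wrapper element `metadata`, the function name, and every field value are assumptions made up for this example, not records from GBIF.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialize element-name/value pairs as Dublin Core XML.

    The <metadata> wrapper is a hypothetical container chosen for this sketch.
    """
    root = ET.Element("metadata")
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
record = dublin_core_record(
    {
        "title": "Example occurrence records",
        "creator": "Example Survey Team",
        "type": "Dataset",
        "format": "text/csv",
    }
)
print(record)
```

Because the element names come from a shared standard, a harvester that has never seen this collection can still find the title and creator without custom code.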
23. Global Biodiversity Information Facility
http://www.tdwg.org/standards/
http://www.gbif.org/informatics/discoverymetadata/a-metadata-infrastructure/
24. Datasets, data collections, and data repositories
• Data collections are built for larger segments of science and engineering
• Datasets
– typically centered around an event or a study
– contain a single file or multiple files in various formats
– coupled with documentation about the background of data collection and processing
Data repository: System for storing, managing, preserving, and providing access to datasets
A repository may contain one or more data collections
A data collection may contain one or more datasets
A dataset may contain one or more data files
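The containment relationships on this slide can be sketched as nested data structures. The following is a minimal Python illustration, assuming simple one-to-many containment at each level; all class names, field names, and sample values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str
    file_format: str


@dataclass
class Dataset:
    """Centered on an event or study; one or more files plus documentation."""

    title: str
    documentation: str
    files: list[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    """Groups one or more datasets."""

    title: str
    datasets: list[Dataset] = field(default_factory=list)


@dataclass
class DataRepository:
    """Stores, manages, preserves, and provides access to collections."""

    name: str
    collections: list[DataCollection] = field(default_factory=list)

    def all_files(self) -> list[DataFile]:
        """Walk repository -> collection -> dataset -> file."""
        return [
            f
            for collection in self.collections
            for dataset in collection.datasets
            for f in dataset.files
        ]


# Hypothetical example content.
survey = Dataset(
    title="Stream chemistry survey",
    documentation="Sampling protocol and processing notes",
    files=[DataFile("samples.csv", "CSV"), DataFile("sites.xml", "XML")],
)
repo = DataRepository(
    name="Example repository",
    collections=[DataCollection(title="Watershed studies", datasets=[survey])],
)
```

Keeping documentation on the dataset rather than on individual files mirrors the slide's point that a dataset is the unit that is described and interpreted.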
25. An emerging trend in academic libraries
26. Initiatives in research libraries
Data support and services in institutions: 73%
Libraries involved in supporting eScience: 45%
• Pressure points:
– Lack of resources
– Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
– Lack of a unifying direction on campus
Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://www.arl.org/bm~doc/escience_report2010.pdf
27. Data management challenges
• No one-size-fits-all solution
• Requires an in-depth understanding of scientific workflows and research lifecycle
• Involves not only technical design and planning but also organizational collaboration and institutionalization of data policy
28. Data preservation challenges
• Data formats
– Vary in data types, e.g. vector and raster data types
– Format conversions, e.g. from an old version to a newer one
• Data relations
– e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
• Semantic issues
– Naming datasets and attributes
29. Data access challenges
• Reliability
• Authenticity
• Leverage technology to make data access easier and more effective
– Cross-database search
– Integration applications
30. Supporting digital research data
• Lifecycle of research data
– Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
– Edit: organize, annotate, clean, filter…
– Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
– Publish: disseminate data via portals and associate datasets with research publications
– Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
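The lifecycle stages above can be sketched as a small state model. The stage names come from the slide; the allowed transitions between them are an assumption made for this illustration (the slide lists the stages but does not say which moves are permitted), and the function name is hypothetical.

```python
from enum import Enum


class Stage(Enum):
    CREATE = "create"
    EDIT = "edit"
    USE = "use/reuse"
    PUBLISH = "publish"
    PRESERVE_OR_DESTROY = "preserve/destroy"


# Assumed transitions: data may loop between editing and use before it is
# published, and the final stage is treated as terminal in this sketch.
NEXT_STAGES = {
    Stage.CREATE: {Stage.EDIT},
    Stage.EDIT: {Stage.USE},
    Stage.USE: {Stage.EDIT, Stage.PUBLISH, Stage.PRESERVE_OR_DESTROY},
    Stage.PUBLISH: {Stage.USE, Stage.PRESERVE_OR_DESTROY},
    Stage.PRESERVE_OR_DESTROY: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage if the move is allowed, else raise ValueError."""
    if target not in NEXT_STAGES[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# Walk one possible path through the lifecycle.
stage = Stage.CREATE
for step in (Stage.EDIT, Stage.USE, Stage.PUBLISH, Stage.PRESERVE_OR_DESTROY):
    stage = advance(stage, step)
```

A support service could attach different responsibilities to each stage, for example metadata capture at Create and repository deposit at Publish.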
31. Supporting data management
The data deluge
Numerical, image, video
Models, simulations, bit streams
XML, CSV, DB, HTML
Researchers need:
Specialized search engines to discover the data they need
Powerful data mining tools to use and analyze the data
32. Research data management
Community
Institution
eScience librarian
Financial and policy support
Science domain
Data content idiosyncrasies
User requirements
Evolving and interconnecting: Institutional repository, Community repository, National repository, International repository
33. Implications for the scholarly communication process
Publishing: Data publishing; new scholarly publishing models: open access, institutional and community repositories, self-publishing, library publishing, ....
Curation: Maintaining, preserving and adding value to digital research data throughout its lifecycle.
Archiving: The long-term storage, retrieval, and use of scientific data and methods.
45. Summary
• Managing research data is motivated by:
– Government funding agency’s policy
– Needs for data sharing, cross validation of data and research, credit, and large-scale interdisciplinary discovery
• Organizational changes:
– New organizational units within the university library or at the university level
– Virtual group
– Collaboration among key units: Libraries, IT services, research administration office
46. Summary
• Types of services
– Training faculty and students for data literacy
– Data curation services (data repositories, digital libraries, archiving data)
– Consulting services
– Data management plan
– Developing data policies