Big data and open access: a collision course for science

Keynote talk at 2nd Int’l LSDMA Symposium: The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013

Big data and open access: on
track for collision of cosmic
proportions?
Beth Plale, PhD, MBA
Director, Data To Insight Center
School of Informatics and Computing
Indiana University

Open access, open cleaning, open data yields greatest degree of science advancement on grand societal questions we face

Open Access

“Data is the New Gold”
Title of Opening Remarks, Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011

Applied Forces
Open access initiatives by federal governments
Big Data

Applied Force Distorts Object
Enables societal grand challenges addressed in:
→ Climate change
→ Food security
→ New economies
Open access initiatives by federal governments
→ Grows concerns about privacy of personal data
Big Data

Negative form of tension (tension I)

Chilling effect on data sharing where social phenomena are involved
Social pressures for privacy overwhelm and spill over to non-personal data

Exponential Growth in Data Production

Similar growth in societal expectations
that large societal problems will be
solved by more data

Tension II: Rapid growth in data and expectations yields impossible-to-reach success

Technical barriers to easing tensions but first …

DRIVING APPLICATIONS:
LIBRARY TEXTS; URBAN
SCIENCE; WIND AND WATER

Hathi Trust Research Center
Text mining at scale

#HTRC
#HathiTrust


→ HathiTrust is a large corpus, providing opportunity for new forms of computational investigation.
→ The bigger the data, the less able we are to move it to a researcher’s desktop machine.
→ Future research on large collections will require that computation move to the data, not vice versa.

HTRC Partners
  Indiana University School of Informatics and Computing
  Indiana University Libraries
  University of Illinois Graduate School of Library and Information Science
  University of Illinois Libraries
  Brandeis University Library
  University of Michigan

http://www.hathitrust.org/htrc


HTRC Non-Consumptive Research
Paradigm
No action or set of actions on part of users, either
acting alone or in cooperation with other users
over duration of one or multiple sessions can result
in sufficient information gathered from collection of
copyrighted works to reassemble pages from
collection.
Definition disallows collusion between users, or
accumulation of material over time. Differentiates
human researcher from proxy which is not a user.
Users are human beings.


Topic modeling on author
Two topics with identical centralities but separate themes

Yearly values of a ratio between two wordlists in
three different genres. 4,275 volumes. 1700-1899.

Underwood et al. Research
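The slide does not say how the ratio between the two wordlists was computed. As a minimal sketch, assuming the ratio is simply the count of words from one list divided by the count of words from the other, aggregated per publication year, it could look like the following. The function name, the tokenization, and the sample volumes are hypothetical and are not taken from the Underwood et al. work.

```python
from collections import defaultdict


def yearly_wordlist_ratio(volumes, list_a, list_b):
    """Return {year: occurrences of list_a words / occurrences of list_b words}.

    volumes is an iterable of (year, text) pairs. Years in which no list_b
    word occurs are skipped to avoid dividing by zero.
    """
    set_a = {w.lower() for w in list_a}
    set_b = {w.lower() for w in list_b}
    counts = defaultdict(lambda: [0, 0])  # year -> [count_a, count_b]
    for year, text in volumes:
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")  # crude punctuation stripping
            if word in set_a:
                counts[year][0] += 1
            elif word in set_b:
                counts[year][1] += 1
    return {year: a / b for year, (a, b) in sorted(counts.items()) if b > 0}


# Hypothetical toy input: three short "volumes" in two years.
volumes = [
    (1750, "The ancient castle stood upon the hill."),
    (1750, "She walked home and ate her bread."),
    (1850, "Home, bread, and a walk by the sea."),
]
ratios = yearly_wordlist_ratio(volumes, ["ancient", "castle"], ["home", "bread"])
```

Running the same function once per genre would give one yearly series per genre, which is the shape of the figure described on the slide.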

Computation moves to data
  REST-based Web services architecture and protocols
  Registry of services and algorithms
  Solr full-text index
  NoSQL store as volume store
  OpenID authentication
  Portal front end, programmatic access
  SEASR text mining algorithms
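One way to picture "computation moves to data" is a client that sends only a small query to the full-text index and gets back volume identifiers, never the texts themselves. The sketch below only builds such a query URL and does not send it. The `select` handler and the `q`, `rows`, `wt`, and `fl` parameters are standard Solr query parameters; the host name, the core name `volumes`, and the field name `ocr` are assumptions made for illustration and are not the HTRC deployment.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for matching document ids only."""
    params = {
        "q": query,     # full-text query, evaluated next to the data
        "rows": rows,   # maximum number of hits to return
        "wt": "json",   # response format
        "fl": "id",     # return identifiers only, not page text
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical endpoint and field name.
url = build_solr_query_url(
    "http://solr.example.org/solr/volumes", 'ocr:"natural history"', rows=5
)
```

Returning identifiers rather than text keeps the amount of data leaving the index small, which fits the point that large collections cannot be moved to a researcher's desktop.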

[Architecture diagram. Components shown: Portal; Blacklight; programmatic access; agent framework with agent instances; SEASR analytics service; Meandre orchestration; WSO2 registry (services, collections, data capsule images); WSO2 Identity Server; HTRC Data API v0.1; Solr index; volume stores (Cassandra); task deployment; non-consumptive data capsules; NCSA local resources; NSF XSEDE; Big Red II / IU Quarry; rsync; HathiTrust corpus (page/volume tree, file system), University of Michigan.]

HTRC: Open Data, Open Access, Open Cleaning?

  HathiTrust collection (69%) is not open
data
  Constrained by authors who hold
copyright to the books
  Computational analysis is by all accounts
“fair use” under US copyright

HTRC: Open Data, Open Access, Open Cleaning?

  “Open cleaning” – enhancing OCR and
MARC metadata
  HTRC is opening data and “cleaning” as
fully as we can to make the collection
useful to scholarly and scientific
investigation

Wind and Water: the hydrologist’s
(atmospheric) observational data
dilemma
Thanks to Jerry Brotzge, PhD meteorology, Oklahoma University

* Credit/blame for title goes to Beth Plale

Atmospheric Observing Systems
Recent addition of plethora of new observing systems to national US atmosphere observing infrastructure
  Improves ability to analyze current state of atmosphere, thus allowing new applications in hydrology and biology
Challenges in:
  Data access; unique sensing requirements
  Data quality, calibrations, and errors
  Complex and non-uniform metadata

Use Case
Use observational data from 3 different radars: FAA TDWR,
WSR-88D, and local X-band (CASA)
Feed data through OU-custom QA/calibration workflow.
Feed into Vflow hydrological model. Note that Vflow is able
to operate on (ingest) the “raw” reflectivity data directly.
That is, it does not require the data to be turned into
gridded precipitation data. Vflow is unique among
hydrology models because of this ability.
Done in real time, that is, continuously ingesting data over
fixed interval.

List of Issues for Flood Forecasting using Radar data
Problem | Cause | Potential Solution
Hail contamination | Assumes high rainfall rate | Use of dual-pol, QC
Bright band | Ice at mid-levels biases dBZ | Real-time QC, 2 radar beams
Ground clutter | Wind farms, blockage | Use of Neural Net, velocity
Radar attenuation | High-frequency radars | Real-time QC model, fix
Anomalous propagation | High stable environment | Use of Level 1, velocity
Velocity de-aliasing | High velocity returns | Real-time QC
Radar calibration | Poor maintenance | Post QC
Over/under estimation below beam | Radar too far from area of interest; undersampled | Improved radar sampling; additional sfc input
Poor time sampling | Radar 5-min volume sampling | Improved temporal sampling
ET under beam | Lack of surface information | Additional surface data
Spatial interpolation | Polar to Cartesian coordinates | Interpolation algorithm
Use of Reflectivity | Does not measure rain directly | Calibration against sfc data

Example Workflow
[Workflow diagram. Inputs: WSR-88D data; other radar systems (TDWR, CASA). Steps shown: Quality Control; clear-air echoes removed; anomalous propagation (AP) removed; clutter removal; interpolation from polar to a common Cartesian grid; hail contamination removal; velocity de-aliasing; radar calibration; melting layer contamination removal; undersampling; representativeness; convert radar reflectivity dBZ to rainfall rate; radar merger (across same network and multiple networks); integrate radar data with satellite, surface observations on grid.]
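The "interpolation from polar to a common Cartesian grid" step can be sketched as follows. This is a simplified illustration, not the workflow used in the talk: it assumes a flat earth, ignores beam elevation, treats azimuth as degrees clockwise from north, and keeps the maximum reflectivity per grid cell rather than applying a real interpolation algorithm. All function and parameter names are hypothetical.

```python
import math


def polar_to_cartesian(range_m, azimuth_deg):
    """Convert (range, azimuth clockwise from north) to (x east, y north) in meters."""
    az = math.radians(azimuth_deg)
    return range_m * math.sin(az), range_m * math.cos(az)


def grid_gates(gates, cell_m, half_width_cells):
    """Bin radar gates onto a square grid centered on the radar.

    gates is an iterable of (range_m, azimuth_deg, dbz). Returns a dict
    {(row, col): dbz} keeping the largest dBZ that falls in each cell.
    """
    size = 2 * half_width_cells
    grid = {}
    for range_m, azimuth_deg, dbz in gates:
        x, y = polar_to_cartesian(range_m, azimuth_deg)
        col = math.floor(x / cell_m) + half_width_cells
        row = math.floor(y / cell_m) + half_width_cells
        if 0 <= row < size and 0 <= col < size:
            grid[(row, col)] = max(dbz, grid.get((row, col), dbz))
    return grid


# Three hypothetical gates: two due east of the radar, one due north.
gates = [(1500.0, 90.0, 30.0), (1600.0, 90.0, 45.0), (1500.0, 0.0, 20.0)]
grid = grid_gates(gates, cell_m=1000.0, half_width_cells=5)
```

Putting every radar on one shared grid is what makes the later merger across networks possible, which is also why the choice of interpolation algorithm appears in the issues table.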

Examine hail contamination in more detail
  Level II radar data that is widely available (through the LDM tool of UCAR in the US) has not been “cleaned” of the effects of clear-air echoes, hail, undersampling, and melting layer contamination
  Hail produces high reflectivity readings, and these high readings can be misinterpreted as high rainfall
  Meteorologists can detect hail easily by eyeballing a visual plot of reflectivity intensities, so they can go back to the Level II data and remove the hail contamination
  Meteorologists solve the problem through a trained eye and good in-house scripts. What does the poor hydrologist do?
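To see why hail inflates rainfall estimates, consider the usual power-law relation between reflectivity and rain rate, Z = a R^b, with Z = 10^(dBZ/10). The sketch below assumes the Marshall-Palmer coefficients a = 200 and b = 1.6 and an illustrative cap of 53 dBZ to limit hail contamination. These values are assumptions for the example and are not taken from the talk or from the OU workflow.

```python
import math


def dbz_to_rain_rate(dbz, a=200.0, b=1.6, hail_cap_dbz=None):
    """Convert reflectivity in dBZ to rain rate in mm/h using Z = a * R**b.

    If hail_cap_dbz is given, reflectivity above the cap is clipped first,
    a crude way to keep hail from being read as extreme rainfall.
    """
    if hail_cap_dbz is not None:
        dbz = min(dbz, hail_cap_dbz)
    z = 10.0 ** (dbz / 10.0)  # linear reflectivity factor, mm^6 / m^3
    return (z / a) ** (1.0 / b)


# A 65 dBZ return, typical of hail, with and without the assumed cap.
uncapped = dbz_to_rain_rate(65.0)
capped = dbz_to_rain_rate(65.0, hail_cap_dbz=53.0)
```

Because the relation is exponential in dBZ, a few decibels of hail contamination change the estimated rain rate by a large factor, so a hydrologist using uncleaned Level II data can badly overestimate rainfall.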

Meteorology/Hydrology: Open Data, Open Access, Open Cleaning?
Data is open, but how to handle cleaning?
A: Force all Level II data through workflow. Hydrologist uses only processed data (i.e., gridded precipitation data).
  Advantage: hides details from hydrologist
  Disadvantage: black box approach reduces trust
A: Make “raw” Level II data and QA workflow tasks available to hydrologist.
  Advantage: hydrologist can develop high level of trust in data
  Disadvantage: current metadata not sufficiently descriptive to capture the kinds of QA that have been applied

Urban Science

Tag cloud of related tweet topics
#smartcityjam thanks to Jennifer Belissent, PhD
* Credit/blame for title goes to Beth Plale

Urban Science
  Harness data from disparate sources with goal of improving city life.
  Fuses physical, biological, and informational sensing of the city
    In-situ sensors for environment: light, temperature, pollution
    Video: pedestrian and vehicular traffic
    Personal sensors: Fitbit and Up wristbands
    Internet sources: Twitter feeds, blogs, news articles, crowdsourced sensing
  Two examples in US
    Center for Urban Science and Progress, New York University
    Urban Center for Computation and Data, University of Chicago

Urban Science

Thanks to Physics Today, Sept 2013

Graphic courtesy NYU Center for Urban Science and Progress
* Credit/blame for title goes to Beth Plale

Urban science: open data, open access, open
cleaning?
CUSP is cleaning its own data for integration. Is this being done in a way that Chicago can use? Likely not.
Temporal streams are relatively simple to understand even with bad metadata. They are observational-physical and observational-social data sources, so they come with relatively well-known trust and attribution.
What happens when CUSP wants to integrate predictive weather forecasting model results? Weak metadata and attribution can significantly compromise accuracy of results.

Data Provenance

Work of Data To Insight Center at IU, its
affiliated faculty and students

Provenance for situational analysis of agent-based model used in social-ecological systems research
Village labor sharing for agriculture production in Africa

Provenance capture AMSR-E
data processing pipeline

Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer.
Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.


NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection

Dataset: D2I-AMSR-E-Provenance Dataset
Owner and Creator: Data to Insight Center
Size: 15MB
The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4, 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph, which is approximately 13MB when represented as XML.
Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei,
Conover, Helen. 2012. Provenance of AMSR-E Data from the
National Snow and Ice Data Center (NSIDC). OPM XML Ver.
1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight
Center. http://dx.doi.org/10.5967/M0F47M2D

Provenance History Layout Algorithm
Provenance of 1 month
processing of NASA satellite
ingest processing pipeline.
Can help trace an error back to its cause.
Shows relationship between daily products (each clover flower in clover leaf chain) and final monthly products at left end.
Provenance of a seaIce daily workflow


Provenance graph
compare: failed
runs

Left: complete provenance of successful execution. Right: failed run, because final data product (green on left) cannot be matched.


Graph compare: dropped provenance

Left: successful execution. Right: although successful execution, shows dropped notifications in provenance capture, because all nodes except some edges in left graph cannot be matched.
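The two graph comparisons can be approximated with plain set differences. The sketch below assumes each provenance graph is a pair of sets, node labels and (source, relation, target) edges, and that nodes are matched by label. It reads the captions as: a missing node suggests a failed run, while missing edges with all nodes present suggest dropped notifications. Both the representation and that reading are assumptions; this is not the comparison algorithm used in the work shown.

```python
def compare_provenance(reference, observed):
    """Report which reference nodes and edges have no match in observed."""
    ref_nodes, ref_edges = reference
    obs_nodes, obs_edges = observed
    missing_nodes = sorted(ref_nodes - obs_nodes)
    missing_edges = sorted(ref_edges - obs_edges)
    if missing_nodes:
        verdict = "nodes unmatched: possible failed run"
    elif missing_edges:
        verdict = "edges unmatched: possible dropped notifications"
    else:
        verdict = "graphs match"
    return {"missing_nodes": missing_nodes,
            "missing_edges": missing_edges,
            "verdict": verdict}


# Hypothetical graphs using OPM-style relation names.
reference = (
    {"ingest", "swath", "daily_seaice"},
    {("swath", "wasGeneratedBy", "ingest"),
     ("daily_seaice", "wasDerivedFrom", "swath")},
)
failed = (
    {"ingest", "swath"},
    {("swath", "wasGeneratedBy", "ingest")},
)
dropped = (
    {"ingest", "swath", "daily_seaice"},
    {("swath", "wasGeneratedBy", "ingest")},
)
failed_report = compare_provenance(reference, failed)
dropped_report = compare_provenance(reference, dropped)
```

Matching by label only works when runs of the same workflow name their products consistently; graphs without stable labels would need structural matching instead.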


Role of provenance in Open Data, Open Access,
Open Cleaning
Key contribution of provenance is to data quality.
We posit that quality of data provenance has 3 dimensions:
  Correctness
  Completeness
  Relevancy
Assumption: provenance collection process is automated
Assessment is focused on correctness and completeness of
captured provenance
Steps:
1)  Detect ambiguities and conflicts in real and synthetic
provenance traces
2)  Complete portions of missing provenance traces
3)  Validate provenance traces when possible
4)  Score the quality of provenance traces


Provenance Quality Analysis Overview

G : Graph level
M-G : Multi-Graph (Multiple graphs) Level
N / E : Node/Edge Level


Wrapping Up: Open Data, Open Cleaning,
Open Access
Stimulating new business opportunity on stable interfaces to open data
Open interfaces
Open cleaning
Open data
Who’s working on: Research Data Alliance
How? e.g., Creative Commons license
Personal privacy respected

Applied Forces Come Together to Distort
Object into New Space
Open access initiatives
Fundamental advances in
→ Climate change
→ Food security
→ New economies
Big Data
Personal data privacy, social issues of sharing
Research Data Alliance
Maturity in provenance and metadata

plale@indiana.edu

Our hosts RDA Plenary 1, Chalmers Univ, Gothenburg, Sweden
Photo courtesy Leif Laaksonen
