This document summarizes key points from a keynote talk on big data and open access given at the 2nd International LSDMA Symposium. It discusses how open access to data, along with open cleaning and analysis of data, can help address major societal challenges related to climate change, food security, and new economies. However, rapid growth in data and expectations is creating tensions between enabling new discoveries and addressing privacy concerns. Technical solutions exist but cultural and policy changes may also be needed to fully realize the benefits of open data while protecting privacy. The document provides several examples of open data initiatives related to text analysis, urban science, and atmospheric observations.
Big data and open access: a collision course for science
1. Keynote talk at 2nd Int’l LSDMA Symposium: The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013
Big data and open access: on
track for collision of cosmic
proportions?
Beth Plale, PhD, MBA
Director, Data To Insight Center
School of Informatics and Computing
Indiana University
2. Open access, open cleaning, open data yields greatest degree of science advancement on grand societal questions we face
3. Open Access
“Data is the New Gold”
Title of Opening Remarks, Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011
5. Applied Force Distorts Object
Enables societal grand challenges addressed in:
→ Climate change
→ Food security
→ New economies
Open access initiatives by federal governments
→ Grows concerns about privacy of personal data
Big Data
6. Negative form of tension (tension I)
Chilling effect on data sharing where social phenomena are involved
Social pressure for privacy overwhelms and spills over to non-personal data
8. Similar growth in societal expectations
that large societal problems will be
solved by more data
9. Tension II: Rapid growth in data and
expectations yields impossible-to-reach success
10. Technical barriers to easing tensions but first …
DRIVING APPLICATIONS:
LIBRARY TEXTS; URBAN
SCIENCE; WIND AND WATER
11. Hathi Trust Research Center
Text mining at scale
#HTRC
#HathiTrust
12. → HathiTrust is a large corpus
providing opportunity for new
forms of computational
investigation.
→ The bigger the data, the less
able we are to move it to a
researcher’s desktop machine
→ Future research on large
collections will require that
computation move to the data,
not vice versa
13. HTRC Partners
Indiana University School of Informatics and Computing
Indiana University Libraries
University of Illinois Graduate School of Library and
Information Science
University of Illinois Libraries
Brandeis University Library
University of Michigan
http://www.hathitrust.org/htrc
14. HTRC Non-Consumptive Research
Paradigm
No action or set of actions on part of users, either
acting alone or in cooperation with other users
over duration of one or multiple sessions can result
in sufficient information gathered from collection of
copyrighted works to reassemble pages from
collection.
Definition disallows collusion between users, or
accumulation of material over time. Differentiates
human researcher from proxy which is not a user.
Users are human beings.
15. Topic modeling on author
Two topics with identical centralities but separate themes
16. Yearly values of a ratio between two wordlists in
three different genres. 4,275 volumes. 1700-1899.
Underwood et al. Research
17. Computation moves to data
REST based Web services architecture and
protocols
Registry of services and algorithms
Solr full text index
NoSQL store as volume store
OpenID authentication
Portal front-end, programmatic access
SEASR text mining algos
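A minimal sketch of what "computation moves to data" can look like from the client side: the researcher sends a small query to the full-text index and gets back volume identifiers rather than page text. The endpoint URL and the `ocr` field name are assumptions for illustration only; the slides do not give the real HTRC service addresses or schema. The `q`, `fl`, `rows`, and `wt` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: the slides do not give the real HTRC Solr URL.
SOLR_SELECT_URL = "https://solr.example.org/solr/volumes/select"


def build_fulltext_query(term, rows=10):
    """Build a Solr select URL that asks for volume IDs only, not page text."""
    params = {
        "q": f"ocr:{term}",  # 'ocr' is an assumed full-text field name
        "fl": "id",          # return identifiers only; the text stays at the data
        "rows": rows,        # cap the number of hits returned
        "wt": "json",        # ask for a JSON response
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_fulltext_query("whale", rows=5)
print(url)
```

Only the query and a list of identifiers cross the network; any heavier analysis would then be submitted to run next to the volume store.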
18. [Architecture diagram; labels recovered from the figure]
Portal; Blacklight; programmatic access
SEASR analytics service; Meandre Orchestration; task deployment
Agent framework with multiple agent instances
WSO2 registry: services, collections, data capsule images
WSO2 Identity Server
HTRC Data API v0.1
Solr index
Non-consumptive data capsules
Volume stores (Cassandra); rsync
Compute resources: NCSA local resources; NSF XSEDE; Big Red II / IU Quarry
HathiTrust corpus, page/volume tree (file system), University of Michigan
19. HTRC: Open Data, Open Access, Open Cleaning?
HathiTrust collection (69%) is not open
data
Constrained by authors who hold
copyright to the books
Computational analysis is by all accounts
“fair use” under US copyright
20. HTRC: Open Data, Open Access, Open Cleaning?
“Open cleaning” – enhancing OCR and
MARC metadata
HTRC is opening data and “cleaning” as
fully as we can to make the collection
useful to scholarly and scientific
investigation
21. Wind and Water: the hydrologist’s
(atmospheric) observational data
dilemma
Thanks to Jerry Brotzge, PhD meteorology, Oklahoma University
* Credit/blame for title goes to Beth Plale
22. Atmospheric Observing Systems
Recent addition of plethora of new observing systems to
national US atmosphere observing infrastructure
Improves ability to analyze current state of atmosphere, thus
allowing new applications in hydrology and biology
Challenges in:
Data access; unique sensing requirements
Data quality, calibrations, and errors
Complex and non-uniform metadata
23. Use Case
Use observational data from 3 different radars: FAA TDWR,
WSR-88D, and local X-band (CASA)
Feed data through OU-custom QA/calibration workflow.
Feed into Vflow hydrological model. Note that Vflow is able
to operate on (ingest) the “raw” reflectivity data directly.
That is, it does not require the data to be turned into
gridded precipitation data. Vflow is unique among
hydrology models because of this ability.
Done in real time, that is, continuously ingesting data over
fixed interval.
24. List of Issues for Flood Forecasting using Radar data
Problem | Cause | Potential Solution
Hail contamination | Assumes high rainfall rate | Use of dual-pol, QC
Bright band | Ice at mid-levels biases dBZ | Real-time QC, 2 radar beams
Ground clutter | Wind farms, blockage | Use of Neural Net, velocity
Radar attenuation | High-frequency radars | Real-time QC model, fix
Anomalous propagation | High stable environment | Use of Level 1, velocity
Velocity de-aliasing | High velocity returns | Real-time QC
Radar calibration | Poor maintenance | Post QC
Over/under estimation below beam | Radar too far from area of interest; undersampled | Improved radar sampling; additional sfc input
Poor time sampling | Radar 5-min volume sampling | Improved temporal sampling
ET under beam | Lack of surface information | Additional surface data
Spatial interpolation | Polar to Cartesian coordinates | Interpolation algorithm
Use of Reflectivity | Does not measure rain directly | Calibration against sfc data
25. Example Workflow Quality Control
[Workflow diagram; steps recovered from the figure]
Inputs: WSR-88D data; other radar systems (TDWR, CASA)
Clear-air echoes removed
Anomalous propagation (AP) removed
Clutter removal
Interpolation from polar to a common Cartesian grid
Hail contamination removal
Velocity de-aliasing
Radar calibration
Melting layer contamination removal
Undersampling
Representativeness
Convert radar reflectivity dBZ to rainfall rate
Radar merger (across same network and multiple networks)
Integrate radar data with satellite, surface observations on grid
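The step that converts reflectivity to rainfall rate can be sketched with a power-law Z-R relationship, Z = a·R^b, where Z is linear reflectivity (dBZ = 10·log10 Z) and R is rain rate in mm/h. The slides do not say which coefficients the OU workflow uses; the classic Marshall-Palmer values a = 200 and b = 1.6 are assumed here purely for illustration.

```python
import math


def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Convert reflectivity in dBZ to rain rate in mm/h via Z = a * R**b.

    The defaults are the Marshall-Palmer coefficients, assumed here;
    operational workflows tune a and b per radar and storm type.
    """
    z_linear = 10.0 ** (dbz / 10.0)      # undo the decibel scaling
    return (z_linear / a) ** (1.0 / b)   # invert the power law for R


for dbz in (20.0, 35.0, 50.0):
    print(f"{dbz:.0f} dBZ -> {dbz_to_rain_rate(dbz):.2f} mm/h")
```

Because the relationship is a steep power law, small reflectivity biases from the issues listed on the previous slide translate into large rain-rate errors, which is why the QC steps come first.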
26. Examine hail contamination in more detail
Level II radar data that is widely available (through LDM
tool of UCAR in US) has not been “cleaned” of effects of
clear-air echoes, hail, undersampling, and melting layer
contamination
Hail has effect of high reflectivity readings and these
high readings can be misinterpreted as high rainfall
Meteorologists can detect hail easily by eyeballing a
visual plot of reflectivity intensities so can go back to
Level II data and process by removing hail contamination
Meteorologists solve problem through trained eye, and
good in-house scripts. What does poor hydrologist do?
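As a rough stand-in for the kind of in-house script mentioned above, the sketch below flags and caps reflectivity values that are more likely hail than rain. The 53 dBZ cap is an assumed threshold chosen for illustration, and simple capping is a crude substitute for the dual-pol QC named in the issues list; it is not the meteorologists' actual procedure.

```python
HAIL_CAP_DBZ = 53.0  # assumed threshold; not taken from the slides


def cap_hail(dbz_values, cap=HAIL_CAP_DBZ):
    """Cap reflectivity at `cap` and report which gates were altered.

    Returns (cleaned_values, flagged_indices) so the change stays visible
    to downstream users instead of being silently applied.
    """
    cleaned = [min(v, cap) for v in dbz_values]
    flagged = [i for i, v in enumerate(dbz_values) if v > cap]
    return cleaned, flagged


cleaned, flagged = cap_hail([30.0, 60.0, 53.0])
print(cleaned, flagged)
```

Returning the flagged indices alongside the cleaned values keeps a record of what was changed, which matters for the trust question raised on the next slide.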
27. Meteorology/Hydrology: Open Data, Open Access,
Open Cleaning?
Data is open, but how to handle cleaning?
A: force all level II data through workflow. Hydrologist uses
only processed data (i.e., gridded precipitation data).
Advantage: hides details from hydrologist
Disadvantage: black box approach reduces trust
A: Make “raw” level II data and QA workflow tasks
available to hydrologist.
Advantage: hydrologist can develop high level of
trust in data
Disadvantage: current metadata not sufficiently
described to capture the kinds of QA that have
been applied
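One way to picture the missing metadata is a record that travels with each radar product and lists the cleaning steps already applied, with their parameters. The class and field names below are hypothetical; they sketch the idea and do not correspond to any existing metadata standard or to the OU workflow.

```python
from dataclasses import dataclass, field


@dataclass
class QAStep:
    """One cleaning step and the parameters it ran with (hypothetical)."""
    name: str
    parameters: dict = field(default_factory=dict)


@dataclass
class RadarProduct:
    """A radar product plus the ordered list of QA steps applied to it."""
    source: str
    level: str
    qa_steps: list = field(default_factory=list)

    def apply(self, name, **parameters):
        """Record that a step was applied; returns self for chaining."""
        self.qa_steps.append(QAStep(name, parameters))
        return self

    def was_applied(self, name):
        """Let a downstream user check what has been done to the data."""
        return any(step.name == name for step in self.qa_steps)


product = RadarProduct("WSR-88D", "II").apply("hail_contamination_removal", cap_dbz=53.0)
print(product.was_applied("hail_contamination_removal"))
print(product.was_applied("velocity_dealiasing"))
```

With such a record, a hydrologist receiving the product could check which corrections were made before deciding whether to trust it or rerun the workflow.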
28. Urban Science
Tag cloud of related tweet topics #smartcityjam, thanks to Jennifer Belissent, PhD
* Credit/blame for title goes to Beth Plale
29. Urban Science
Harness data from disparate sources with goal of
improving city life.
Fuses physical, biological, and informational sensing of
the city
in-situ sensors for environment: light, temperature, pollution
Video: pedestrian and vehicular traffic
Personal sensors: Fitbit and Up wristbands
Internet sources: Twitter feeds, blogs, news articles, crowdsourced sensing
Two examples in US
Center for Urban Science and Progress, New York University
Urban Center for Computation and Data, University of
Chicago
30. Urban Science
Thanks to Physics Today, Sept 2013
Graphic courtesy NYU Center for Urban Science and Progress
* Credit/blame for title goes to Beth Plale
31. Urban science: open data, open access, open
cleaning?
CUSP is cleaning its own data for integration. Is this being
done in way that Chicago can use? Likely not.
Temporal streams are relatively simple to understand with
even bad metadata. They are observational-physical and
observational-social data sources so come with relatively
known trust and attribution.
What happens when CUSP wants to integrate predictive
weather forecasting model results? Weak metadata and
attribution can significantly compromise accuracy of results.
36. Provenance capture AMSR-E
data processing pipeline
Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer.
Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.
37. NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection
38. Dataset: D2I-AMSR-E-Provenance Dataset
Owner and Creator: Data to Insight Center
Size: 15MB
The University of Alabama in Huntsville processes data from the
NASA AMSR-E instrument. The Karma project at Indiana
University instrumented the ingest processing system and
captured provenance for 3,890 runs for the period of September
2 - October 4 2011. The details of the runs are in Figure III-16
below; the largest provenance graph is the monthly rain graph
that, when represented as XML, is approximately 13MB.
Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei,
Conover, Helen. 2012. Provenance of AMSR-E Data from the
National Snow and Ice Data Center (NSIDC). OPM XML Ver.
1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight
Center. http://dx.doi.org/10.5967/M0F47M2D
39. Provenance History Layout Algorithm
Provenance of 1 month
processing of NASA satellite
ingest processing pipeline.
Can help tracing error back to
its cause.
Shows relationship between
daily products (each clover
flower in clover leaf chain) and
final monthly products at left end.
Provenance of a seaIce daily workflow
40. Provenance graph compare: failed runs
Left: complete provenance of successful execution.
Right: failed run, because final data product (green on left) cannot be matched.
41. Graph compare: dropped provenance
Left: successful execution.
Right: although successful execution, shows dropped notifications in provenance capture, because all nodes except some edges in left graph cannot be matched.
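The two graph comparisons can be sketched as set differences between a reference provenance graph and an observed one. The graph representation (sets of node IDs and of source/target pairs), the node names, and the classification rule are assumptions made for illustration; they follow one reading of the two slides and are not the Karma project's actual matching algorithm.

```python
def compare_provenance(reference, observed):
    """Return the nodes and edges of `reference` that `observed` lacks.

    Each graph is a dict with a set of node IDs under "nodes" and a set
    of (source, target) pairs under "edges" (an assumed representation).
    """
    return {
        "unmatched_nodes": reference["nodes"] - observed["nodes"],
        "unmatched_edges": reference["edges"] - observed["edges"],
    }


def classify_run(reference, observed, final_product):
    """Label a run: a missing final product means failure; missing edges
    alone suggest dropped provenance notifications (an assumed rule)."""
    diff = compare_provenance(reference, observed)
    if final_product in diff["unmatched_nodes"]:
        return "failed run"
    if diff["unmatched_edges"]:
        return "dropped provenance"
    return "match"


# Hypothetical node names standing in for pipeline data products.
reference = {
    "nodes": {"swath", "daily_ice", "monthly_ice"},
    "edges": {("swath", "daily_ice"), ("daily_ice", "monthly_ice")},
}
failed = {"nodes": {"swath", "daily_ice"}, "edges": {("swath", "daily_ice")}}
dropped = {"nodes": set(reference["nodes"]), "edges": {("swath", "daily_ice")}}

print(classify_run(reference, failed, "monthly_ice"))
print(classify_run(reference, dropped, "monthly_ice"))
```

Real provenance graphs rarely share identical node IDs across runs, so a practical comparison would match nodes by type or role rather than by exact name.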
42. Role of provenance in Open Data, Open Access,
Open Cleaning
Key contribution of provenance is to data quality.
We posit that quality of data provenance has 3 dimensions:
Correctness
Completeness
Relevancy
Assumption: provenance collection process is automated
Assessment is focused on correctness and completeness of
captured provenance
Steps:
1) Detect ambiguities and conflicts in real and synthetic
provenance traces
2) Complete portions of missing provenance traces
3) Validate provenance traces when possible
4) Score the quality of provenance traces
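Step 4 can be illustrated with a toy scoring function covering two of the three dimensions. Here correctness is taken as the share of captured edges whose endpoints both exist as nodes, and completeness as the share of expected edges that were captured. Both definitions, and the existence of a known list of expected edges, are assumptions for this sketch; the slides do not define the scoring formulas.

```python
def score_trace(nodes, edges, expected_edges):
    """Score a provenance trace on correctness and completeness (0.0 to 1.0).

    correctness: fraction of captured edges with both endpoints present.
    completeness: fraction of expected edges that were captured.
    Both definitions are illustrative assumptions.
    """
    dangling = [e for e in edges if e[0] not in nodes or e[1] not in nodes]
    correctness = 1.0 - len(dangling) / len(edges) if edges else 1.0
    captured = set(edges) & set(expected_edges)
    completeness = len(captured) / len(expected_edges) if expected_edges else 1.0
    return {"correctness": correctness, "completeness": completeness}


scores = score_trace(
    nodes={"a", "b", "c"},
    edges=[("a", "b"), ("b", "x")],           # ("b", "x") points at a missing node
    expected_edges=[("a", "b"), ("b", "c")],  # ("b", "c") was never captured
)
print(scores)
```

Relevancy, the third dimension, depends on the question being asked of the provenance and is left out of the sketch.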
43. Provenance Quality Analysis Overview
G : Graph level
M-G : Multi-Graph (Multiple graphs) Level
N / E : Node/Edge Level
44. Wrapping Up: Open Data, Open Cleaning,
Open Access
Stimulating new business opportunity on stable interfaces to open data
Open interfaces
Open cleaning
Open data
Who’s working on: Research Data Alliance
How? e.g., Creative Commons license
Personal privacy respected
45. Applied Forces Come Together to Distort
Object into New Space
Open access initiatives
Fundamental advances in
→ Climate change
→ Food security
→ New economies
Big Data
Personal data privacy, social issues of sharing
Research Data Alliance
Maturity in provenance and metadata