Keynote talk at 2nd Int’l LSDMA Symposium: The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013

Big data and open access: on
track for collision of cosmic
proportions?
Beth Plale, PhD, MBA
Director, Data To Insight Center
School of Informatics and Computing
Indiana University
Open access, open cleaning, and open data yield the greatest degree of science advancement on the grand societal questions we face
  
Open Access
“Data is the New Gold”: title of opening remarks, Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011
  
Applied Forces
Open access initiatives by federal governments
Big Data
  
Applied Force Distorts Object
Enables societal grand challenges addressed in:
→ Climate change
→ Food security
→ New economies
Open access initiatives by federal governments
→ Grows concerns about privacy of personal data
Big Data
  
Negative form of tension (tension I)

Chilling effect on data sharing where social phenomena are involved
Social pressures for privacy overwhelm and spill over to non-personal data
  
Exponential Growth in Data Production
Similar growth in societal expectations
that large societal problems will be
solved by more data
Tension II: Rapid growth in data and
expectations yields impossible-to-reach success
Technical barriers to easing tensions but first …

DRIVING APPLICATIONS:
LIBRARY TEXTS; URBAN
SCIENCE; WIND AND WATER
Hathi Trust Research Center
Text mining at scale

	
#HTRC #HathiTrust
→ HathiTrust is a large corpus providing opportunity for new forms of computational investigation.
→ The bigger the data, the less able we are to move it to a researcher’s desktop machine
→ Future research on large collections will require that computation moves to the data, not vice versa
HTRC Partners
• Indiana University School of Informatics and Computing
• Indiana University Libraries
• University of Illinois Graduate School of Library and Information Science
• University of Illinois Libraries
• Brandeis University Library
• University of Michigan

http://www.hathitrust.org/htrc

	
  
HTRC Non-Consumptive Research
Paradigm
No action or set of actions on part of users, either
acting alone or in cooperation with other users
over duration of one or multiple sessions can result
in sufficient information gathered from collection of
copyrighted works to reassemble pages from
collection.
Definition disallows collusion between users, or
accumulation of material over time. Differentiates
human researcher from proxy which is not a user.
Users are human beings.
	
  
Topic modeling on author
Two topics with identical centralities but separate themes
Yearly values of a ratio between two wordlists in
three different genres. 4,275 volumes. 1700-1899.

Underwood et al. Research
• Computation moves to data
• REST-based Web services architecture and protocols
• Registry of services and algorithms
• Solr full text index
• NoSQL store as volume store
• OpenID authentication
• Portal front-end, programmatic access
• SEASR text mining algorithms
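As a rough illustration of "computation moves to data", a client can send a small query to the full text index and get back only volume identifiers rather than the texts themselves. The sketch below builds a request URL for Solr's standard select handler. The host, the core name (`htrc`), and the field names (`ocr`, `id`) are assumptions made for illustration and are not taken from the HTRC deployment.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: host and core name are placeholders, not HTRC's real service.
SOLR_SELECT = "http://solr.example.org/solr/htrc/select"

def build_search_url(term, rows=10):
    """Build a Solr select URL that returns volume ids only (no page text)."""
    params = {
        "q": f'ocr:"{term}"',   # 'ocr' is an assumed full-text field name
        "fl": "id",             # return only the identifier field
        "rows": rows,           # limit the number of hits returned
        "wt": "json",           # ask Solr for a JSON response
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"

url = build_search_url("whale", rows=5)
print(url)
```

Only the query and a short list of identifiers cross the network; any heavier analysis would then be submitted to run next to the volume store.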
  
[Figure: HTRC architecture. Components shown: Portal and Blacklight front end; SEASR analytics service; agent framework with agent instances; WSO2 registry (services, collections, data capsule images); HTRC Data API v0.1; WSO2 Identity Server; Solr index; task deployment; Meandre orchestration; non-consumptive data capsules; volume stores (Cassandra); rsync from the HathiTrust corpus page/volume tree (file system) at University of Michigan; compute on NCSA local resources, NSF XSEDE, and Big Red II/IU Quarry; programmatic access.]
HTRC: Open Data, Open Access, Open Cleaning?
• HathiTrust collection (69%) is not open data
• Constrained by authors who hold copyright to the books
• Computational analysis is by all accounts “fair use” under US copyright
HTRC: Open Data, Open Access, Open Cleaning?
• “Open cleaning”: enhancing OCR and MARC metadata
• HTRC is opening data and “cleaning” as fully as we can to make the collection useful to scholarly and scientific investigation
Wind and Water: the hydrologist’s
(atmospheric) observational data
dilemma
Thanks to Jerry Brotzge, PhD (meteorology), University of Oklahoma

* Credit/blame for title goes to Beth Plale
  
Atmospheric Observing Systems
Recent addition of plethora of new observing systems to national US atmosphere observing infrastructure
Improves ability to analyze current state of atmosphere, thus allowing new applications in hydrology and biology
Challenges in:
• Data access; unique sensing requirements
• Data quality, calibrations, and errors
• Complex and non-uniform metadata
Use Case
Use observational data from 3 different radars: FAA TDWR,
WSR-88D, and local X-band (CASA)
Feed data through OU-custom QA/calibration workflow.
Feed into Vflow hydrological model. Note that Vflow is able
to operate on (ingest) the “raw” reflectivity data directly.
That is, it does not require the data to be turned into
gridded precipitation data. Vflow is unique among
hydrology models because of this ability.
Done in real time, that is, continuously ingesting data over
fixed interval.
List of Issues for Flood Forecasting using Radar data
Problem | Cause | Potential Solution
Hail contamination | Assumes high rainfall rate | Use of dual-pol, QC
Bright band | Ice at mid-levels biases dBZ | Real-time QC, 2 radar beams
Ground clutter | Wind farms, blockage | Use of Neural Net, velocity
Radar attenuation | High-frequency radars | Real-time QC model, fix
Anomalous propagation | High stable environment | Use of Level 1, velocity
Velocity de-aliasing | High velocity returns | Real-time QC
Radar calibration | Poor maintenance | Post QC
Over/under estimation below beam | Radar too far from area of interest; undersampled | Improved radar sampling; additional sfc input
Poor time sampling | Radar 5-min volume sampling | Improved temporal sampling
ET under beam | Lack of surface information | Additional surface data
Spatial interpolation | Polar to Cartesian coordinates | Interpolation algorithm
Use of Reflectivity | Does not measure rain directly | Calibration against sfc data
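The "Spatial interpolation" row concerns moving radar gates from polar coordinates onto a Cartesian grid. A minimal sketch of that geometry follows, under loud assumptions: a flat earth, no beam elevation, azimuth measured clockwise from north, and simple binning that keeps the strongest echo per cell. It is not the interpolation algorithm used in the workflow described in the talk, and the function names are hypothetical.

```python
import math

def polar_to_cartesian(range_km, azimuth_deg):
    """Convert a radar gate (range, azimuth clockwise from north) to x east, y north in km."""
    az = math.radians(azimuth_deg)
    return range_km * math.sin(az), range_km * math.cos(az)

def grid_by_binning(gates, cell_km=1.0):
    """Assign each gate to the grid cell that contains it; keep the max dBZ per cell.

    gates: iterable of (range_km, azimuth_deg, dbz) tuples.
    Returns a dict mapping (col, row) cell indices to reflectivity in dBZ.
    """
    grid = {}
    for range_km, azimuth_deg, dbz in gates:
        x, y = polar_to_cartesian(range_km, azimuth_deg)
        cell = (math.floor(x / cell_km), math.floor(y / cell_km))
        grid[cell] = max(dbz, grid.get(cell, float("-inf")))
    return grid

# Two gates due east of the radar fall in the same 1 km cell; one gate lies due north.
gates = [(10.0, 90.0, 35.0), (10.2, 90.0, 40.0), (5.0, 0.0, 20.0)]
print(grid_by_binning(gates))
```

Taking the maximum per cell is one arbitrary choice among several (mean, nearest gate, distance weighting); which one is used is exactly the kind of detail that would need to be recorded in metadata.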
  
Example Workflow: Quality Control
[Figure: workflow diagram. Inputs: WSR-88D data; other radar systems (TDWR, CASA). Processing boxes: clear-air echoes removed; anomalous propagation (AP) removed; clutter removal; interpolation from polar to a common Cartesian grid; hail contamination removal; velocity de-aliasing; radar calibration; melting layer contamination removal; undersampling/representativeness; convert radar reflectivity dBZ to rainfall rate; radar merger (across same network and multiple networks); integrate radar data with satellite, surface observations on grid.]
  
Examine hail contamination in more detail
• Level II radar data that is widely available (through LDM tool of UCAR in US) has not been “cleaned” of effects of clear-air echoes, hail, undersampling, and melting layer contamination
• Hail has effect of high reflectivity readings, and these high readings can be misinterpreted as high rainfall
• Meteorologists can detect hail easily by eyeballing a visual plot of reflectivity intensities, so they can go back to Level II data and process it by removing hail contamination
• Meteorologists solve problem through trained eye and good in-house scripts. What does poor hydrologist do?
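To make the hail problem concrete, the sketch below converts reflectivity to rain rate with a power-law Z-R relation, Z = a R^b, using the classic Marshall-Palmer coefficients (a = 200, b = 1.6), and shows what capping reflectivity before conversion does. The cap value of 53 dBZ and the function names are assumptions for illustration only; the talk does not say which relation or threshold the in-house scripts use.

```python
def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Rain rate in mm/h from reflectivity in dBZ using Z = a * R**b."""
    z_linear = 10.0 ** (dbz / 10.0)      # dBZ is 10*log10(Z)
    return (z_linear / a) ** (1.0 / b)

def capped_rain_rate(dbz, cap_dbz=53.0):
    """Limit reflectivity before conversion so hail cores do not imply extreme rain.

    cap_dbz is an illustrative threshold, not a value taken from the talk.
    """
    return dbz_to_rain_rate(min(dbz, cap_dbz))

# Compare uncapped and capped rain rates for moderate, heavy, and hail-like echoes.
for dbz in (30.0, 50.0, 65.0):
    print(dbz, round(dbz_to_rain_rate(dbz), 1), round(capped_rain_rate(dbz), 1))
```

Because the relation is exponential in dBZ, a hail core a few dBZ above heavy rain inflates the estimated rain rate many times over, which is why a hydrologist fed uncleaned Level II data would overestimate flooding.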
Meteorology/Hydrology: Open Data, Open Access,
Open Cleaning?
Data is open, but how to handle cleaning?
A: Force all Level II data through workflow. Hydrologist uses only processed data (i.e., gridded precipitation data).
• Advantage: hides details from hydrologist
• Disadvantage: black box approach reduces trust
A: Make “raw” Level II data and QA workflow tasks available to hydrologist.
• Advantage: hydrologist can develop high level of trust in data
• Disadvantage: current metadata is not sufficiently descriptive to capture the kinds of QA that have been applied
Urban Science

Tag cloud of related tweet topics #smartcityjam, thanks to Jennifer Belissent, PhD
  
  
Urban Science
• Harness data from disparate sources with goal of improving city life.
• Fuses physical, biological, and informational sensing of the city
  • In-situ sensors for environment: light, temperature, pollution
  • Video: pedestrian and vehicular traffic
  • Personal sensors: Fitbit and Up wristbands
  • Internet sources: Twitter feeds, blogs, news articles, crowdsourced sensing
• Two examples in US
  • Center for Urban Science and Progress, New York University
  • Urban Center for Computation and Data, University of Chicago
Urban Science

Thanks to Physics Today, Sept 2013

Graphic courtesy NYU Center for Urban Science and Progress
  
  
Urban science: open data, open access, open
cleaning?
CUSP is cleaning its own data for integration. Is this being
done in way that Chicago can use? Likely not.
Temporal streams are relatively simple to understand with
even bad metadata. They are observational-physical and
observational-social data sources so come with relatively
known trust and attribution.
What happens when CUSP wants to integrate predictive
weather forecasting model results? Weak metadata and
attribution can significantly compromise accuracy of results.
Data Provenance

Work of Data To Insight Center at IU, its
affiliated faculty and students
Provenance Core (W3C PROV)
Provenance for situational analysis of agent based model used in social ecological systems research
Village labor sharing for agriculture production in Africa
Provenance capture AMSR-E
data processing pipeline

Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer.
Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.
NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection
  
Dataset: D2I-AMSR-E-Provenance Dataset
Owner and Creator: Data to Insight Center
Size: 15MB
The University of Alabama in Huntsville processes data from the
NASA AMSR-E instrument. The Karma project at Indiana
University instrumented the ingest processing system and
captured provenance for 3,890 runs for the period of September
2 - October 4 2011. The details of the runs are in Figure III-16
below; the largest provenance graph is the monthly rain graph
that, when represented as XML, is approximately 13MB.
Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei,
Conover, Helen. 2012. Provenance of AMSR-E Data from the
National Snow and Ice Data Center (NSIDC). OPM XML Ver.
1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight
Center. http://dx.doi.org/10.5967/M0F47M2D
Provenance History Layout Algorithm
Provenance of 1 month
processing of NASA satellite
ingest processing pipeline.
Can help trace an error back to
its cause.
Shows relationship between
daily products (each clover
flower in clover leaf chain) and
final monthly products at left end.
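Tracing an error back to its cause amounts to walking the provenance graph upstream from a suspect product. The sketch below does this with a breadth-first traversal over a toy "derived from" mapping. The product names and the dictionary layout are hypothetical and stand in for whatever the captured provenance actually records.

```python
from collections import deque

# Hypothetical trace: each key was derived from the items in its list.
DERIVED_FROM = {
    "monthly_rain": ["daily_rain_01", "daily_rain_02"],
    "daily_rain_01": ["swath_a"],
    "daily_rain_02": ["swath_b", "swath_c"],
}

def upstream(product, derived_from):
    """Return every ancestor of a product, nearest first (breadth-first)."""
    seen, order, queue = set(), [], deque([product])
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:       # avoid revisiting shared ancestors
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

print(upstream("monthly_rain", DERIVED_FROM))
```

If the monthly product looks wrong, the returned list names the daily products and input swaths worth inspecting, starting with the closest ones.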
Provenance of a seaIce daily workflow
  
Provenance graph compare: failed runs
Left: complete provenance of successful execution. Right: failed run, because final data product (green on left) cannot be matched.
  
Graph compare: dropped provenance
Left: successful execution. Right: execution also succeeded, but provenance capture dropped notifications: all nodes can be matched, but some edges in the left graph cannot.
Role of provenance in Open Data, Open Access,
Open Cleaning
Key contribution of provenance is to data quality.
We posit that quality of data provenance has 3 dimensions:
• Correctness
• Completeness
• Relevancy
Assumption: provenance collection process is automated
Assessment is focused on correctness and completeness of
captured provenance
Steps:
1)  Detect ambiguities and conflicts in real and synthetic
provenance traces
2)  Complete portions of missing provenance traces
3)  Validate provenance traces when possible
4)  Score the quality of provenance traces
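A minimal sketch of what steps 1 and 4 could look like for a single trace, assuming a hypothetical metric: completeness is the fraction of derived entities that have a recorded generating activity, and a conflict is a generation record that refers to an undeclared entity. Neither the metric nor the trace layout comes from the talk; they only illustrate the kind of check involved.

```python
def completeness_score(trace, inputs):
    """Fraction of non-input entities that have a recorded generating activity.

    trace: dict with 'entities' (set) and 'generated_by' (entity -> activity).
    inputs: entities that are allowed to have no generating activity.
    """
    derived = trace["entities"] - set(inputs)
    if not derived:
        return 1.0
    explained = sum(1 for entity in derived if entity in trace["generated_by"])
    return explained / len(derived)

def conflicts(trace):
    """Generation records that point at entities the trace never declared."""
    return set(trace["generated_by"]) - trace["entities"]

# Toy trace: monthly_rain has no generating activity; orphan_product is undeclared.
trace = {
    "entities": {"swath_a", "daily_rain", "monthly_rain"},
    "generated_by": {"daily_rain": "make_daily", "orphan_product": "unknown_step"},
}

print(completeness_score(trace, inputs={"swath_a"}))
print(conflicts(trace))
```

A score below 1.0 flags traces worth repairing in step 2, and the conflict set lists records to validate in step 3.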

Provenance Quality Analysis Overview

G : Graph level
M-G : Multi-Graph (Multiple graphs) Level
N / E : Node/Edge Level

Wrapping Up: Open Data, Open Cleaning,
Open Access
Stimulating new business opportunity on stable interfaces to open data
Open interfaces
Open cleaning
Open data
Who’s working on: Research Data Alliance
How? e.g., Creative Commons license
Personal privacy respected
  
Applied Forces Come Together to Distort
Object into New Space
Open access initiatives
Fundamental advances in
→ Climate change
→ Food security
→ New economies
Big Data
Personal data privacy, social issues of sharing
Research Data Alliance
Maturity in provenance and metadata
  
plale@indiana.edu

Our hosts, RDA Plenary 1, Chalmers Univ, Gothenburg, Sweden
Photo courtesy Leif Laaksonen
  

Big data and open access: a collision course for science

  • 1. Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013. Big data and open access: on track for collision of cosmic proportions? Beth Plale, PhD, MBA, Director, Data To Insight Center, School of Informatics and Computing, Indiana University
  • 2. Open access, open cleaning, open data yields greatest degree of science advancement on grand societal questions we face
  • 3. Open Access. “Data is the New Gold”: title of opening remarks, Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011
  • 4. Applied Forces. Open access initiatives by federal governments. Big Data
  • 5. Applied Force Distorts Object. Enables societal grand challenges addressed in: → Climate change → Food security → New economies. Open access initiatives by federal governments. → Grows concerns about privacy of personal data. Big Data
  • 6. Negative form of tension (tension I). Chilling effect on data sharing where social phenomena are involved. Social pressure for privacy overwhelms and spills over to non-personal data
  • 7. Exponential Growth in Data Production
  • 8. Similar growth in societal expectations that large societal problems will be solved by more data
  • 9. Tension II: Rapid growth in data and expectations yields impossible-to-reach success
  • 10. Technical barriers to easing tensions but first … DRIVING APPLICATIONS: LIBRARY TEXTS; URBAN SCIENCE; WIND AND WATER
  • 11. HathiTrust Research Center: Text mining at scale. #HTRC #HathiTrust
  • 12. → HathiTrust is a large corpus providing opportunity for new forms of computational investigation. → The bigger the data, the less able we are to move it to a researcher’s desktop machine. → Future research on large collections will require that computation moves to the data, not vice versa
  • 13. HTRC Partners: Indiana University School of Informatics and Computing; Indiana University Libraries; University of Illinois Graduate School of Library and Information Science; University of Illinois Libraries; Brandeis University Library; University of Michigan. http://www.hathitrust.org/htrc
  • 14. HTRC Non-Consumptive Research Paradigm. No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the collection of copyrighted works to reassemble pages from the collection. Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy, which is not a user. Users are human beings.
  • 15. Topic modeling on author: two topics with identical centralities but separate themes
  • 16. Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899. Underwood et al. Research
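The per-year ratio described on slide 16 can be illustrated with a minimal sketch. The slide does not say how Underwood et al. tokenized text or which wordlists they used, so the whitespace tokenizer, the sample volumes, and both wordlists below are hypothetical placeholders.

```python
from collections import Counter, defaultdict


def yearly_wordlist_ratio(volumes, list_a, list_b):
    """Return {year: hits(list_a) / hits(list_b)}, pooling all volumes of a year.

    volumes: iterable of (year, text) pairs.
    Years with no hits from list_b are skipped to avoid division by zero.
    """
    set_a = {w.lower() for w in list_a}
    set_b = {w.lower() for w in list_b}
    hits = defaultdict(Counter)
    for year, text in volumes:
        # Naive whitespace tokenization (an assumption, not the authors' method).
        for token in text.lower().split():
            if token in set_a:
                hits[year]["a"] += 1
            elif token in set_b:
                hits[year]["b"] += 1
    return {y: c["a"] / c["b"] for y, c in sorted(hits.items()) if c["b"] > 0}


# Hypothetical toy corpus and wordlists, for illustration only.
volumes = [
    (1750, "the house and the garden by the sea"),
    (1750, "a garden of stone"),
    (1850, "the mind and the soul and the house"),
]
ratios = yearly_wordlist_ratio(
    volumes, ["house", "garden"], ["mind", "soul", "sea", "stone"]
)
print(ratios)
```

In a non-consumptive setting, a function like this would run next to the corpus and return only the per-year ratios, not page text.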
  • 17. Computation moves to data; REST based Web services architecture and protocols; Registry of services and algorithms; Solr full text index; noSQL store as volume store; openID authentication; Portal front-end, programmatic access; SEASR text mining algos
  • 18. [Architecture diagram] Components shown: Portal; Blacklight; SEASR analytics service; Agent framework with agent instances; WSO2 registry (services, collections, data capsule images); HTRC Data API v0.1; WSO2 Identity Server; Solr index; Task deployment; Meandre Orchestration; Non-consumptive Data capsules; NCSA local resources; Volume stores (Cassandra); rsync; NSF XSEDE; Big Red II/IU Quarry; Programmatic access; HathiTrust corpus; Page/volume tree (file system); University of Michigan
  • 19. HTRC: Open Data, Open Access, Open Cleaning? Much of the HathiTrust collection (69%) is not open data. Constrained by authors who hold copyright to the books. Computational analysis is by all accounts “fair use” under US copyright
  • 20. HTRC: Open Data, Open Access, Open Cleaning? “Open cleaning”: enhancing OCR and MARC metadata. HTRC is opening data and “cleaning” as fully as we can to make the collection useful to scholarly and scientific investigation
  • 21. Wind and Water: the hydrologist’s (atmospheric) observational data dilemma. Thanks to Jerry Brotzge, PhD meteorology, Oklahoma University. * Credit/blame for title goes to Beth Plale
  • 22. Atmospheric Observing Systems. Recent addition of a plethora of new observing systems to the national US atmosphere observing infrastructure. Improves ability to analyze current state of atmosphere, thus allowing new applications in hydrology and biology. Challenges in: data access; unique sensing requirements; data quality, calibrations, and errors; complex and non-uniform metadata
  • 23. Use Case Use observational data from 3 different radars: FAA TDWR, WSR-88D, and local X-band (CASA) Feed data through OU-custom QA/calibration workflow. Feed into Vflow hydrological model. Note that Vflow is able to operate on (ingest) the “raw” reflectivity data directly. That is, it does not require the data to be turned into gridded precipitation data. Vflow is unique among hydrology models because of this ability. Done in real time, that is, continuously ingesting data over fixed interval.
  • 24. List of Issues for Flood Forecasting using Radar data
      Problem | Cause | Potential Solution
      Hail contamination | Assumes high rainfall rate | Use of dual-pol, QC
      Bright band | Ice at mid-levels biases dBZ | Real-time QC, 2 radar beams
      Ground clutter | Wind farms, blockage | Use of Neural Net, velocity
      Radar attenuation | High-frequency radars | Real-time QC model, fix
      Anomalous propagation | High stable environment | Use of Level 1, velocity
      Velocity de-aliasing | High velocity returns | Real-time QC
      Radar calibration | Poor maintenance | Post QC
      Over/under estimation below beam | Radar too far from area of interest; undersampled | Improved radar sampling; additional sfc input
      Poor time sampling | Radar 5-min volume sampling | Improved temporal sampling
      ET under beam | Lack of surface information | Additional surface data
      Spatial interpolation | Polar to Cartesian coordinates | Interpolation algorithm
      Use of Reflectivity | Does not measure rain directly | Calibration against sfc data
  • 25. Example Workflow: Quality Control. Inputs: WSR-88D data; other radar systems (TDWR, CASA). Tasks: clear-air echoes removed; anomalous propagation (AP) removed; clutter removal; interpolation from polar to a common Cartesian grid; hail contamination removal; velocity de-aliasing; radar calibration; melting layer contamination removal; undersampling; representativeness; convert radar reflectivity dBZ to rainfall rate; radar merger (across same network and multiple networks); integrate radar data with satellite, surface observations on grid
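One task in the workflow above, converting radar reflectivity in dBZ to a rainfall rate, is commonly done with a power-law Z-R relation. The sketch below assumes the classic Marshall-Palmer coefficients (Z = 200 R^1.6, Z in mm^6/m^3, R in mm/h); the slides do not say which relation the OU workflow uses, so the coefficients are an assumption and would normally be tuned per radar and rain regime.

```python
import math


def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Convert reflectivity (dBZ) to rain rate (mm/h) via Z = a * R**b.

    Defaults are the Marshall-Palmer coefficients (an assumption here,
    not taken from the talk).
    """
    z_linear = 10.0 ** (dbz / 10.0)  # dBZ is 10 * log10(Z)
    return (z_linear / a) ** (1.0 / b)


# By construction, Z = 200 (about 23 dBZ) corresponds to 1 mm/h.
one_mm_per_hour = dbz_to_rain_rate(10.0 * math.log10(200.0))
heavy_rain = dbz_to_rain_rate(40.0)
print(round(one_mm_per_hour, 3), round(heavy_rain, 1))
```

Because the relation is exponential in dBZ, a few dB of calibration error or contamination changes the estimated rain rate substantially, which is why the QC tasks precede this conversion.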
  • 26. Examine hail contamination in more detail. Level II radar data that is widely available (through the LDM tool of UCAR in the US) has not been “cleaned” of the effects of clear-air echoes, hail, undersampling, and melting layer contamination. Hail produces high reflectivity readings, and these high readings can be misinterpreted as high rainfall. Meteorologists can detect hail easily by eyeballing a visual plot of reflectivity intensities, so they can go back to Level II data and process it by removing hail contamination. Meteorologists solve the problem through a trained eye and good in-house scripts. What does the poor hydrologist do?
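As a rough illustration of what an in-house hail script might do, the sketch below masks range gates whose reflectivity exceeds a fixed threshold. This is a deliberately crude stand-in: the 53 dBZ threshold is an assumed value, and the talk itself points to dual-polarization data and QC as the better remedy.

```python
def mask_suspected_hail(dbz_values, threshold=53.0):
    """Replace reflectivity values above `threshold` (dBZ) with None.

    The threshold is an assumed, illustrative value; operational hail
    detection relies on more than a single reflectivity cutoff.
    """
    return [None if v > threshold else v for v in dbz_values]


# Hypothetical reflectivity values along one radar ray.
ray = [20.0, 45.5, 58.0, 61.2, 30.0]
cleaned = mask_suspected_hail(ray)
flagged = sum(1 for v in cleaned if v is None)
print(cleaned, flagged)
```

Recording that this mask was applied, and with which threshold, is exactly the kind of cleaning metadata a downstream hydrologist would need in order to trust the data.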
  • 27. Meteorology/Hydrology: Open Data, Open Access, Open Cleaning? Data is open, but how to handle cleaning? A: Force all Level II data through workflow. Hydrologist uses only processed data (i.e., gridded precipitation data). Advantage: hides details from hydrologist. Disadvantage: black box approach reduces trust. A: Make “raw” Level II data and QA workflow tasks available to hydrologist. Advantage: hydrologist can develop high level of trust in data. Disadvantage: current metadata not sufficiently described to capture the kinds of QA that have been applied
  • 28. Urban Science. Tag cloud of related tweet topics, #smartcityjam, thanks to Jennifer Belissent, PhD. * Credit/blame for title goes to Beth Plale
  • 29. Urban Science. Harness data from disparate sources with goal of improving city life. Fuses physical, biological, and informational sensing of the city: in-situ sensors for environment (light, temperature, pollution); video (pedestrian and vehicular traffic); personal sensors (Fitbit and Up wristbands); Internet sources (Twitter feeds, blogs, news articles, crowdsourced sensing). Two examples in US: Center for Urban Science and Progress, New York University; Urban Center for Computation and Data, University of Chicago
  • 30. Urban Science. Thanks to Physics Today, Sept 2013. Graphic courtesy NYU Center for Urban Science and Progress. * Credit/blame for title goes to Beth Plale
  • 31. Urban science: open data, open access, open cleaning? CUSP is cleaning its own data for integration. Is this being done in way that Chicago can use? Likely not. Temporal streams are relatively simple to understand with even bad metadata. They are observational-physical and observational-social data sources so come with relatively known trust and attribution. What happens when CUSP wants to integrate predictive weather forecasting model results? Weak metadata and attribution can significantly compromise accuracy of results.
  • 32. Data Provenance Work of Data To Insight Center at IU, its affiliated faculty and students
  • 35. Provenance for situational analysis of agent based model used in social ecological systems research. Village labor sharing for agriculture production in Africa
  • 36. Provenance capture: AMSR-E data processing pipeline. Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer. Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.
  • 37. NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection
  • 38. Dataset: D2I-AMSR-E-Provenance Dataset. Owner and Creator: Data to Insight Center. Size: 15MB. The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph that, when represented as XML, is approximately 13MB. Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei, Conover, Helen. 2012. Provenance of AMSR-E Data from the National Snow and Ice Data Center (NSIDC). OPM XML Ver. 1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight Center. http://dx.doi.org/10.5967/M0F47M2D
  • 39. Provenance History Layout Algorithm. Provenance of 1 month processing of NASA satellite ingest processing pipeline. Can help tracing error back to its cause. Shows relationship between daily products (each clover flower in clover leaf chain) and final monthly products at left end. Provenance of a sea ice daily workflow
  • 40. Provenance graph compare: failed runs. Left: complete provenance of successful execution. Right: failed run, because final data product (green on left) cannot be matched.
  • 41. Graph compare: dropped provenance. Left: successful execution. Right: although successful execution, shows dropped notifications in provenance capture, because all nodes except some edges in left graph cannot be matched.
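The two comparison cases on slides 40 and 41 can be sketched as set differences between a reference provenance graph and a candidate one. This is a simplification under stated assumptions: nodes carry identifiers that are directly comparable across runs, slide 41 is read as "nodes match but some edges do not", and the node names are hypothetical. It is not the matching algorithm used by the Karma project.

```python
def compare_provenance(reference, candidate):
    """Compare two provenance graphs given as (nodes, edges).

    nodes: set of node ids; edges: set of (source, target) pairs.
    Assumes ids are comparable across runs (an illustrative simplification).
    """
    ref_nodes, ref_edges = reference
    cand_nodes, cand_edges = candidate
    return {
        "missing_nodes": ref_nodes - cand_nodes,
        "missing_edges": ref_edges - cand_edges,
        "extra_nodes": cand_nodes - ref_nodes,
        "extra_edges": cand_edges - ref_edges,
    }


# Hypothetical reference run: raw data -> ingest -> daily product -> monthly product.
success = (
    {"raw", "ingest", "daily", "monthly"},
    {("raw", "ingest"), ("ingest", "daily"), ("daily", "monthly")},
)
# Failed run: the final product never appears.
failed = ({"raw", "ingest", "daily"}, {("raw", "ingest"), ("ingest", "daily")})
# Dropped notifications: every node is present but one edge was not captured.
dropped = (
    {"raw", "ingest", "daily", "monthly"},
    {("raw", "ingest"), ("daily", "monthly")},
)

failed_diff = compare_provenance(success, failed)
dropped_diff = compare_provenance(success, dropped)
print(failed_diff["missing_nodes"], dropped_diff["missing_edges"])
```

A missing final node points to a failed execution, while missing edges with all nodes present points to gaps in provenance capture rather than in the pipeline itself.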
  • 42. Role of provenance in Open Data, Open Access, Open Cleaning. Key contribution of provenance is to data quality. We posit that quality of data provenance has 3 dimensions: Correctness; Completeness; Relevancy. Assumption: provenance collection process is automated. Assessment is focused on correctness and completeness of captured provenance. Steps: 1) Detect ambiguities and conflicts in real and synthetic provenance traces; 2) Complete portions of missing provenance traces; 3) Validate provenance traces when possible; 4) Score the quality of provenance traces
  • 43. Provenance Quality Analysis Overview. G: Graph level; M-G: Multi-Graph (multiple graphs) level; N/E: Node/Edge level
  • 44. Wrapping Up: Open Data, Open Cleaning, Open Access. Stimulating new business opportunity on stable interfaces to open data. Open interfaces. Open cleaning. Open data. Who’s working on: Research Data Alliance. How? e.g., Creative Commons license. Personal privacy respected
  • 45. Applied Forces Come Together to Distort Object into New Space. Open access initiatives. Fundamental advances in → Climate change, → Food security, → New economies. Big Data. Personal data privacy, social issues of sharing. Research Data Alliance. Maturity in provenance and metadata
  • 46. plale@indiana.edu. Our hosts: RDA Plenary 1, Chalmers Univ, Gothenburg, Sweden. Photo courtesy Leif Laaksonen
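The detect-and-score steps on slide 42 can be illustrated with a toy check over a single trace. The two rules used here are assumptions made for the sketch, not the center's published method: an artifact recorded as generated by more than one process is treated as a conflict, and a non-input artifact with no recorded generating process is treated as incomplete. The scoring formula and all names are hypothetical.

```python
def assess_trace(artifacts, inputs, generated_by):
    """Flag conflicts and gaps in one provenance trace and return a score.

    artifacts: set of artifact ids in the trace.
    inputs: subset of artifacts declared as external inputs.
    generated_by: list of (artifact, process) pairs.
    The score is the share of artifacts with no flagged issue (illustrative only).
    """
    producers = {}
    for artifact, process in generated_by:
        producers.setdefault(artifact, set()).add(process)
    # Conflict: more than one process claims to have generated the artifact.
    conflicts = {a for a, procs in producers.items() if len(procs) > 1}
    # Gap: a derived artifact with no recorded generating process.
    orphans = {a for a in artifacts - inputs if a not in producers}
    flagged = conflicts | orphans
    score = 1.0 - len(flagged) / len(artifacts) if artifacts else 1.0
    return {"conflicts": conflicts, "orphans": orphans, "score": score}


# Hypothetical trace: one conflicting product and one product with no producer.
report = assess_trace(
    artifacts={"raw", "daily_ice", "daily_rain", "monthly_rain"},
    inputs={"raw"},
    generated_by=[("daily_ice", "p1"), ("daily_rain", "p2"), ("daily_rain", "p3")],
)
print(report)
```

Steps 2 and 3 on the slide (completing and validating traces) are not covered here; they would need domain knowledge about the pipeline that the slides do not provide.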