Gulf of Mexico Hydrocarbon Database: Integrating Heterogeneous Data for Improved Model Development

Anne E. Thessen, Sean McGinnis, Elizabeth North, and Ian Mitchell
http://www.slideshare.net/athessen

Thank You to Data Providers
• NOAA/NOS Office of Response and Restoration
• Commonwealth Scientific and Industrial Research Organization
• Environmental Protection Commission of Hillsborough County
• National Estuarine Research Reserves
• Sarah Allan
• Kim Anderson
• Jamie Pierson
• Nan Walker
• Ed Overton
• Richard Aronson
• Ryan Moody
• Charlotte Brunner
• William Patterson
• Kyeong Park
• Kendra Daly
• Liz Kujawinski
• Jana Goldman
• Jay Lunden
• Samuel Georgian
• Leslie Wade
• British Petroleum

• Joe Montoya
• Terry Hazen
• Mandy Joye
• Richard Camilli
• Chris Reddy
• John Kessler
• David Valentine
• Tom Soniat
• Matt Tarr
• Tom Bianchi
• Tom Miller
• Elise Gornish
• Terry Wade
• Steven Lohrenz
• Dick Snyder
• Paul Montagna
• Patrick Bieber
• Wei Wu
• Mitchell Roffer
• Dongjoo Joung
• Mark Williams
• Don Blake
• Jordan Pino

• John Valentine
• Jeffrey Baguley
• Gary Ervin
• Erik Cordes
• Michael Perdue
• Bill Stickle
• Andrew Zimmerman
• Andrew Whitehead
• Alice Ortmann
• Alan Shiller
• Laodong Guo
• A. Ravishankara
• Ken Aikin
• Tom Ryerson
• Prabhakar Clement
• Christine Ennis
• Eric Williams
• Ed Sherwood
• Julie Bosch
• Wade Jeffrey
• Chet Pilley
• Just Cebrian
• Ambrose Bordelon

LTRANS
• Lagrangian Transport Model
• Open Source
• http://northweb.hpl.umces.edu/LTRANS.htm
• Used to predict transport of particles, subsurface hydrocarbons, and surface oil slicks (in development)
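
As a rough illustration of what a Lagrangian transport model does, the sketch below moves particles through a velocity field one time step at a time, with an optional random-walk term for turbulent spreading. It is not taken from the LTRANS source code: the forward-Euler scheme, the function names, and the uniform current are all assumptions chosen to keep the example short.

```python
import math
import random


def advect(particles, velocity, dt, steps, diffusivity=0.0, rng=None):
    """Track particles through a velocity field with forward-Euler steps.

    particles   -- list of (x, y) positions in metres
    velocity    -- function (x, y, t) -> (u, v) in m/s (hypothetical field)
    dt          -- time step in seconds
    diffusivity -- horizontal diffusivity in m^2/s for the random-walk term
    """
    rng = rng or random.Random(0)
    sigma = math.sqrt(2.0 * diffusivity * dt)  # std. dev. of one random step
    t = 0.0
    for _ in range(steps):
        moved = []
        for x, y in particles:
            u, v = velocity(x, y, t)
            # Deterministic advection plus a random displacement.
            x += u * dt + rng.gauss(0.0, sigma)
            y += v * dt + rng.gauss(0.0, sigma)
            moved.append((x, y))
        particles = moved
        t += dt
    return particles


def uniform_current(x, y, t):
    """Hypothetical steady eastward current of 0.1 m/s."""
    return (0.1, 0.0)


# One particle released at the origin and tracked for 24 hourly steps.
track = advect([(0.0, 0.0)], uniform_current, dt=3600.0, steps=24)
```

With the diffusivity left at zero the particle simply drifts east with the current; a nonzero diffusivity spreads a cloud of particles around that path.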

GISR Deepwater Horizon Database

[Figure: number of data points]

• Over 8 million georeferenced data points
• Over 13 GB
• Over 2000 analytes and parameters

Database Contents
• Oceanographic Data
– Salinity
– Temperature
– Oxygen
– More
• Chemistry Data
– Hydrocarbons
– Heavy metals
– Nutrients
– More
• Air
• Water
• Tissue
• Sediment/Soil
n > 10,000

Challenges
• Obtaining the data
• Heterogeneity
• Metadata
• Comparison

The Great Data Hunt
• Discovery
– Project directory
– Funding agency records
– Literature
– Internet search

[Figure: Total Data Sets Discovered, n = 146; labeled category: Relevant]

The Great Data Hunt
• Access
– Online
– Ask directly
– Literature

[Figure: outcomes of data requests; categories: data and response, no data and response, no data no response, data no response]

We received responses to 58% of our inquiries and obtained 40% of the identified data sets

Heterogeneity
• Heterogeneity
– Terms
– Units
– Format
– Structure
– Quality Codes

[Figure: terms used for one analyte: Benzoic Acid, Carboxybenzene, E210, Dracylic Acid, C7H6O2; the figure also shows the values 2,212 and 1,367]
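
One way to reconcile differing terms is a lookup table that maps every known synonym to a single preferred name. The sketch below uses the benzoic acid terms from this slide; the table layout, the function name, and the choice of "Benzoic Acid" as the preferred name are assumptions, not a description of how the database actually does it. A molecular formula such as C7H6O2 can match more than one compound, so a real table would need to treat formulas with more care.

```python
# Hypothetical synonym table: lower-cased term -> preferred analyte name.
SYNONYMS = {
    "benzoic acid": "Benzoic Acid",
    "carboxybenzene": "Benzoic Acid",
    "e210": "Benzoic Acid",
    "dracylic acid": "Benzoic Acid",
    "c7h6o2": "Benzoic Acid",
}


def canonical_name(term, synonyms=SYNONYMS):
    """Return the preferred name for a reported term, or None if unmapped."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(term.split()).lower()
    return synonyms.get(key)
```

Returning None for unmapped terms, rather than guessing, leaves a clear list of terms that still need a person to review them.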

Heterogeneity
• Heterogeneity
– Terms
– Units
– Format
– Structure
– Quality Codes

[Figure: units reported for n-Decane: parts per trillion, ppbv, μg/g, ng/g, ppt, mg/kg, μg/kg, ppb; the figure also shows the values 122 and 37]
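
Unit differences can be handled in a similar way, by converting each value to one common unit. The sketch below converts only the explicit mass-per-mass units from this slide to ng/g. It deliberately refuses ppt, ppb, and ppbv: those depend on whether the ratio is by mass or by volume, and "ppt" can mean parts per trillion or parts per thousand. The target unit and the function name are assumptions made for illustration.

```python
import unicodedata

# Conversion factors to ng/g for units that are unambiguous mass fractions.
TO_NG_PER_G = {
    "μg/g": 1000.0,   # 1 μg/g = 1000 ng/g
    "ng/g": 1.0,
    "mg/kg": 1000.0,  # 1 mg/kg = 1 μg/g
    "μg/kg": 1.0,     # 1 μg/kg = 1 ng/g
}


def to_ng_per_g(value, unit):
    """Convert a concentration to ng/g, or raise ValueError if ambiguous."""
    # NFKC folds the micro sign (U+00B5) into the Greek letter mu (U+03BC),
    # so both spellings of "μg" find the same table entry.
    key = unicodedata.normalize("NFKC", unit.strip())
    try:
        factor = TO_NG_PER_G[key]
    except KeyError:
        raise ValueError(f"no unambiguous conversion for unit {unit!r}") from None
    return value * factor
```

For example, `to_ng_per_g(2.0, "mg/kg")` gives 2000.0, while `to_ng_per_g(1.0, "ppt")` raises an error so that the value can be checked against its source.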

Metadata
• Metadata
– Missing
– Not computable

[Figure: metadata attached to each data point: Name, Unit, Location, Attribution, Time]

Metadata
• Metadata
– Missing
– Not computable

[Figure: metadata attached to each data point: Name, Unit, Method, Location, Attribution, Uncertainty, Time]
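
The two metadata problems named above, missing and not computable, can both be checked automatically. The sketch below assumes each record is a dictionary whose keys follow the metadata in the figure; the key names, the ISO 8601 time format, and the decimal-degree coordinates are assumptions for this example rather than the database's actual schema.

```python
from datetime import datetime

# Hypothetical field names, one per piece of metadata a data point needs.
REQUIRED = ("name", "unit", "method", "latitude", "longitude",
            "time", "attribution", "uncertainty")


def metadata_problems(record):
    """List the metadata that is missing or cannot be parsed in a record."""
    problems = []
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            problems.append(f"missing: {field}")

    # A time is computable here only if it parses as an ISO 8601 string.
    time = record.get("time")
    if time not in (None, ""):
        try:
            datetime.fromisoformat(time)
        except (TypeError, ValueError):
            problems.append("not computable: time")

    # Coordinates are computable only if they convert to numbers.
    for field in ("latitude", "longitude"):
        value = record.get(field)
        if value not in (None, ""):
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append(f"not computable: {field}")
    return problems


# A made-up record with no method or uncertainty and a free-text time.
example = {
    "name": "n-Decane", "unit": "ng/g",
    "latitude": "28.7", "longitude": "-88.4",
    "time": "summer 2010", "attribution": "hypothetical provider",
}
problems = metadata_problems(example)
```

Here `problems` lists the missing method and uncertainty, followed by the time that cannot be parsed.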

Comparing to Model Output
[Figure: model output in netCDF format (Parameter, Depth, Latitude, Longitude, TimeStamp) is matched by a Nearest Neighbor Algorithm to records in the SQL database (Parameter, Depth, Latitude, Longitude, TimeStamp)]

Comparing to Model Output
• Set limits on what is considered nearest neighbor
• Not all data points have to be matched
• Data points can have many neighbors
• Matching is done before query
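
The matching rules above can be sketched as a function that, for each database record, collects every model point within set limits of distance, depth, and time. The limit values, the dictionary keys, and the haversine distance are assumptions for this example; the slides do not say which limits or distance measure the project uses. Records with no model point inside the limits are left unmatched, and the result can be stored ahead of time so that later queries only look it up.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def match_neighbors(observations, model_points, max_km=5.0,
                    max_depth_m=10.0, max_dt=timedelta(hours=6)):
    """Map each observation index to the model indices within all limits."""
    matches = {}
    for i, obs in enumerate(observations):
        found = []
        for j, mod in enumerate(model_points):
            if abs(obs["depth"] - mod["depth"]) > max_depth_m:
                continue
            if abs(obs["time"] - mod["time"]) > max_dt:
                continue
            d = haversine_km(obs["lat"], obs["lon"], mod["lat"], mod["lon"])
            if d <= max_km:
                found.append((d, j))
        if found:  # observations without neighbors stay unmatched
            matches[i] = [j for _, j in sorted(found)]  # nearest first
    return matches


noon = datetime(2010, 6, 1, 12)
hour = timedelta(hours=1)
# Made-up observations and model points for illustration only.
observations = [
    {"lat": 28.74, "lon": -88.37, "depth": 1100.0, "time": noon},
    {"lat": 27.00, "lon": -90.00, "depth": 5.0, "time": noon},
]
model_points = [
    {"lat": 28.75, "lon": -88.37, "depth": 1095.0, "time": noon + hour},
    {"lat": 28.74, "lon": -88.36, "depth": 1105.0, "time": noon - hour},
    {"lat": 29.50, "lon": -88.37, "depth": 1100.0, "time": noon},
]
matches = match_neighbors(observations, model_points)
```

In this example the first observation has two neighbors and the second has none, so only the first appears in `matches`.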

Attribution and Citation
• Literature citation
• Repository identifier
• Generate new

Future Work
• More data
• User feedback
• Web Access
• Users’ Guide
• Manuscripts
• Improved query

The Great Data Hunt
• Discovery
• Access
– Online
– Ask directly
– Literature

We received responses to 58% of our inquiries and obtained 40% of the identified data sets

40% of those responses were received within 24 hours and 27% were received within the first week

[Figure: histogram of Number of Responses (0 to 25) by Time to First Response (Days); bins: First Day, 2 to 7, 8 to 30, 31 to 60, 61 to 90, 91 to 120, 121 to 150, 151 to 180]

The Great Data Hunt
• Discovery
• Access
– Online
– Ask directly
– Literature

0-24 email exchanges per data set

[Figure: histogram of Number of Data Sets (0 to 7) by Number of Emails (0 to 24)]

Why didn’t people share?
• Paper not published yet – 30%
• Passed the buck – 17%
• Too busy – 9%
• Medical problems – 9%
• Poor quality – 9%
