Data management principles
Contents
• Introduction
• Formats: from text to relational structures
• Global scientific metadata systems
• Data availability and access
• Principles of data policies
• Value of data
Introduction
• Basic principles of why we do data management (see
last week)
– Selfish reasons
– Altruistic reasons
– Moral obligation → costs of generating data
• Nice ideas, nice examples, ...
– The work behind them is sometimes less nice
– For data management there are some
rules/techniques/principles
1. Data formats
• Large heterogeneity in data formats
• Data format = the physical or electronic shape
in which data is stored
• Piece of paper with hand written text = data
format
• However, the focus here is on:
– Electronic data formats
– Commonly used data formats
1. Data formats
• Why use which format?
– Historical reasons:
• Old data mostly in text based list formats
• Software and technology accompany certain
formats
• Example: XML could only be used after its invention
– Other reasons:
• Depending on data generator:
– Machine generated data (mostly ascii format)
• Worldwide agreed formats for certain types of data
– Facilitate exchange of data packages
1. Data formats
• Exchange of data formats
– Most formats can be converted into each other
– Mostly top down:
• Relational structure → spreadsheet → text-based
Data formats: different classifications
• Physical types:
– ASCII
– BINARY
• Format types : 15 often used data types
Dataformat – ascii format (1)
• Ascii: American Standard Code for Information
Interchange
• ASCII data are encoded so that the human reader can
see and understand the values, because they are
displayed as normal integers and real numbers. This
means that the actual digital file contains print and
display information for the human-readable
characters, not the actual values of the data. The
benefit of using ASCII data is that the user can
see, understand and edit the file contents directly; the
downside of using ASCII is that the data files are much
larger.
Dataformat – ascii format (2)
• Combination of letters
and numbers
• Readable by any
computer
• No complex software
required
Dataformat – Binary data
• Binary data are numeric data whose values
are expressed in bits and bytes, instead of the
human-readable ascii code.
• Number values can be stored in much smaller
files → can be read more rapidly (by machines)
• The preferred method for large data files, especially
gridded data.
• Using binary data is not so easy → interpreting
steps are required
Dataformat – Binary data
• Contents and structure
of binary files may vary:
– Type of data stored:
• Bit (0-1) – 1 bit
• Byte (0-255) – 8 bits
• Short integer (-32,768 to
32,767) – 16 bits
• Interpreter – translator
is required
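The size difference between ASCII and binary storage, and the need for an interpreting step, can be illustrated with a small sketch. The sample values and the little-endian 16-bit layout are assumptions chosen for illustration only.

```python
import struct

# Hypothetical sample values that all fit in a 16-bit short integer.
values = [-32768, -1200, 0, 1534, 32767]

# ASCII: one human-readable number per line.
ascii_bytes = "\n".join(str(v) for v in values).encode("ascii")

# Binary: each value packed as a little-endian signed 16-bit integer ("h").
binary_bytes = struct.pack("<5h", *values)

# Reading binary back requires knowing the layout (the "interpreter" step).
decoded = list(struct.unpack("<5h", binary_bytes))

print(len(ascii_bytes), len(binary_bytes), decoded == values)
```

Without the format string "<5h" the ten bytes are meaningless, which is why the layout of a binary file has to be documented.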
Data formats – 15 commonly used types
• Text files → ASCII/binary
• Spreadsheets
• Relational structures
• Others
– Images
– Maps
1 & 2 : Auxiliary Formats
• Auxiliary Formats - Information about data
files; these are not really "data" files, but are
included here for completeness
– 1 Header Formats - Information about the
format, location or geo-referencing; usually very
short
– 2 Metadata Formats - see also metadata
3. Document
• Digital data in proprietary formats (or
sometimes just simple ASCII) designed for
visual inspection, but not for data processing
• ASCII ,MS Word DOC , WordPerfect , HTML ,
PDF - Adobe Acrobat , PS/EPS -
PostScript/Encapsulated PS , Desktop
publisher programs - all proprietary ...
3. Document
• Advantages: Very polished appearance;
powerful editors available; compatibility with
other major document editing software.
• Disadvantages: (hard to use in data mining)
– ASCII text must be extracted for the sections of
interest.
– Embedded images must be converted to more
easily used GIF, JPG or BMP formats. PDF and
PS/EPS very tricky to convert to other formats.
4. Gridded data
• File formats:
– ASCII : example - SURFER (*.GRD) - with "DSAA"
header lines
– Binary : Plain binary grids: byte, short integer, long
integer, single-precision or double-precision; with
or without ASCII Header Files (see earlier)
4. Gridded data
• Creation of the Grid:
– The gridded data file is created from scattered
data points in the real world, by a process called
"gridding."
– mathematical methods to create the grid
– algorithms are available to examine data points
4. Gridded data
• Gridded data files commonly contain more than a single grid
– Data mostly available for different parameters
– Using sequences of XYZ dimensions and parameter dimensions
– There is no "correct" way to construct files of multiple data grids
• It is extremely important to document the sequence in which the dimensions
(XYZ location, time, parameters) are "read."
• Vector Grids: To represent vectors (literally arrows showing the
direction of flow) in ocean and meteorological datasets two
methods have been devised: provide the U and V components of
the vector, or provide the direction and magnitude of the
arrow. Both of these methods have been adapted to grids, for
vector results from gridded models for instance. The grids can be
contained in separate files, or sequentially listed in the same file.
4. Gridded data
• Advantages:
– Saves storage space compared with XYZ storage, which
requires 3 values per grid point.
– Binary takes much less space than ASCII.
– Reading the data is usually very straightforward:
• a DO LOOP routine (or nest of routines) that follows the order
in which the data were stored
• Disadvantages:
– Binary data are not liked by those who want "to see"
their data at all times.
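A minimal sketch of such a nested-loop reader is given here. The grid size, the number of parameters, the storage order (parameter by parameter, row by row) and the 32-bit float type are all assumptions; in practice they must come from the documentation or header of the file.

```python
import struct

# Assumed layout (must be documented with the file): 2 parameters,
# each a grid of 2 rows (Y) by 3 columns (X), stored parameter by
# parameter, row by row, as little-endian 32-bit floats.
N_PARAMS, N_ROWS, N_COLS = 2, 2, 3

# Build a small in-memory "file" so the example is self-contained.
flat = [float(i) for i in range(N_PARAMS * N_ROWS * N_COLS)]
raw = struct.pack(f"<{len(flat)}f", *flat)

def read_grids(raw_bytes):
    """Rebuild grids[param][row][col] by looping in the stored order."""
    values = struct.unpack(f"<{N_PARAMS * N_ROWS * N_COLS}f", raw_bytes)
    grids = []
    index = 0
    for _ in range(N_PARAMS):
        grid = []
        for _ in range(N_ROWS):
            grid.append(list(values[index:index + N_COLS]))
            index += N_COLS
        grids.append(grid)
    return grids

grids = read_grids(raw)
print(grids[1][0][2])  # parameter 1, row 0, column 2
```

If the loops were nested in a different order than the one used when writing, the same bytes would be assigned to the wrong grid cells without any error being raised.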
5. Hard copy
• Older, hard copy datasets
• A necessary evil
– Much older (pre-1960s) ocean data has never been digitized
• These datasets range from technical reports to hand-
written log sheets and lab sheets.
– Reports usually contain enough information to be
successfully digitized
– Manuscript holdings often require tedious collation and
cross-referencing in order to assemble all the needed
parts.
– Datasets with missing critical parts (e.g. station data) exist,
as well as analysis and synthesis reports containing
statistics, graphs and tables, but no data.
5. Hard copy
• Examples:
– Lab sheets
– Journal articles
– Technical Reports
– 80-character punch cards - Included here because
many locations lack the facilities to read them
– Hand-annotated charts/graphs
– Specimen identification cards
– Diaries
– Ship logs
5. Hard copy
• Risk of data loss:
– Rule in many data centres: No paper data should be mailed or shipped unless photocopied.
– All ORIGINAL paper data should be gathered by the data manager immediately after the
relevant cruise and grouped into named folios whose contents are indexed.
• All paper data should be submitted to supervised digitization as soon as possible.
– Example: heritage library
• Metadata of hard copy data: should fully describe the folios
– numbers of pages
– Color of front page
– Other identifying characteristics
• Advantages: They still exist.
• Disadvantages:
– Cannot be used in modern digital analysis.
– Digital capture is very labor intensive.
– Access is a tricky political issue in some institutions.
• Compatibilities: Published papers in good condition can be scanned and converted
to ASCII text with many commercial packages. (OCR techniques)
– Check the result afterwards ...
5. Hard copy
• From hard copy to digital copy ...
– Technique used depends on the aim and type of data
– Often just transformed into ‘document’ format
– If converted to other formats: often done manually
• In many cases going back to the hard copy is the only
way to work (due to lack of metadata, file
versions, ...)
6. Simple Images
– Graphics file without earth mapping information
– Interpretation is purely man-based
– Very variable
– Many file formats:
• TIFF, GIF, JPG, BMP …
• RAW versus compressed
– RAW: all image information is stored without compression
– Compressed: JPG/GIF information is compressed by
approximation or by reducing colors → smaller files but loss of
information
6. Simple images
• Some images have added artistic borders
– outside the geographic grid, which obscure the pixel-to-coordinates
relationship
• Advantages:
– Quick visualization of data that may have originally
been extremely complex. Subjective analyses that do
not require positional accuracy.
• Disadvantages:
– Quantification difficult; synthesis nearly impossible
unless with pictures derived in exactly the same fashion.
• Compatibilities:
– Nearly all graphic picture formats are interchangeable with
editor programs.
7. Geo-referenced images
• Graphics file, with ancillary mapping
information, showing 1 or more parameters of
the earth's system in a rectilinear grid, usually
derived by processing and decimation of very
high-density information from aerial or space
sensors.
– Coordinates of pixel correspond to XY geo-
coordinate.
– Color of pixel represents a parameter
7. Geo-referenced images
• TIF files can be made into Geo-Referenced Image files by the addition of
internal geographic tags, which require exact knowledge of the image
dimensions and its proper location on the earth's surface.
• JPG, TIF and BMP can be made into Geo-Referenced Image formats by the
addition of header "world files," which require exact knowledge of the
image dimensions and its proper location on the earth's surface. A world
file is a simple ASCII file with the following contents:
– X-pixel size (delta X)
– Rotation term for row (normally zero)
– Rotation term for column (normally zero)
– Y-pixel size (delta Y)
– X-coordinate of center of upper left pixel
– Y-coordinate of center of upper left pixel
• World files for TIF have the extension TFW;
• world files for JPG have the extension JGW;
• world files for BMP have the extension BPW.
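How the six world file values turn a pixel position into map coordinates can be sketched as follows. The sketch assumes the usual convention in which the six lines are the coefficients A, D, B, E, C, F of the transformation x = A*col + B*row + C and y = D*col + E*row + F; the example values are hypothetical.

```python
def parse_world_file(text):
    """Parse the six lines of a world file into named coefficients."""
    a, d, b, e, c, f = (float(line) for line in text.split())
    return {"A": a, "D": d, "B": b, "E": e, "C": c, "F": f}

def pixel_to_world(coeffs, col, row):
    """Map a pixel (column, row) to the map coordinates of its center."""
    x = coeffs["A"] * col + coeffs["B"] * row + coeffs["C"]
    y = coeffs["D"] * col + coeffs["E"] * row + coeffs["F"]
    return x, y

# Hypothetical world file: 0.5-degree pixels, no rotation, upper-left
# pixel centered at longitude 2.25, latitude 51.75.
example = "0.5\n0.0\n0.0\n-0.5\n2.25\n51.75\n"
coeffs = parse_world_file(example)
print(pixel_to_world(coeffs, 0, 0))
print(pixel_to_world(coeffs, 4, 2))
```

The Y-pixel size is negative in this example because image rows are counted downwards from the top, while latitude increases upwards.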
8-9-10. Mapping data
• Mapping - Mapping data consisting of digital
representations of individual objects (points, lines,
polygons, etc.)
– 8 XY- Mapping line objects, in X (usually longitude) and Y
(usually latitude) coordinates only
– 9 List- Mapping objects (points, lines, symbols, text, etc.)
without topology or descriptive attributes
– 10 Geographic Information System (GIS) - Mapping
objects (points, lines, polygons, etc.) on the earth
incorporated into robust data assemblages that contain
additional detailed information about the properties and
topologies of the objects. [NOTE: Most GIS systems can
also accommodate gridded, geo-referenced image,
relational and spreadsheet formats.]
8. XY data
• Description:
– simplest kind of geographic information:
• lines specified by their ordered X and Y coordinates.
• country boundaries: separated by several different markers
• ASCII Export Format from GEBCO Database/Software
(actually YX in column order)
• Advantages: Simple to write, easy to read (when
ASCII).
• Disadvantages: Contain no topological relationships
between objects, or attributes of the objects.
• Text is rendered as drawing instructions, and cannot be
retrieved as recognizable data.
9. Mapping data - List
• ordered list of "map primitives" to be drawn:
– such as points, lines, circles, labels, etc.
• These formats are extremely specific to certain software.
• They could almost be called "plotter formats" because they do little
more than draw pictures of geographically referenced information.
• Small amounts of data can be included, however, coded into the
appearance of such primitives as the circle (variable diameters), the
vector arrow (variable lengths), and contour lines (colors).
• Advantages: usually easy to read/write.
• Disadvantages: exist in many variant subtypes; MS Word and
WordPerfect differ markedly in the versions they accept.
10. Geographic Information System
(GIS)
• Charting and mapping: tools for natural resource management.
• Digital methods are becoming much more common in ocean data
analysis.
• Geographic Information System (GIS) data formats contain
complex, multi-theme collections of spatial information that can
be used to create maps and charts, and to perform analyses.
• The data formats that can support these systems are not just
sufficient to draw maps, but also contain necessary ancillary data
about the features included (in space and time).
• NOTE: GIS files can be vector-type or raster-type, and many GIS
software systems can handle both. Conversion utilities exist that
can convert these files in either direction, although the raster-to-
vector conversion often requires intensive quality control by skilled
operators.
10. Geographic Information System
(GIS)
• Software:
– Esri/Mapinfo/Surfer/...
• Recently: also many online gis-tools
– OBIS
– Open Gis standards : Open Geospatial Consortium
• an international industry consortium of 334 companies,
government agencies and universities participating in a
consensus process to develop publicly available
geoprocessing specifications.
• Open Geospatial Consortium (OGC) protocols include Web
Map Service (WMS) and Web Feature Service (WFS).
10. Geographic Information System
(GIS)
• Formats within this group: ESRI Shapefiles (SHP), VPF
• Advantages:
– Rapid creation of new maps and charts using the same databases.
– No laborious hand-drawing methods.
– Synthesis of different kinds of information, on an as-needed basis,
from a common pool of datasets.
– Instant changes in projection, scale, coverage area, etc.
• Disadvantages:
– GIS formats tend to be very complex, and populating them with the
actual data of interest is laborious.
• Compatibilities: most of the major software systems now recognize
each other's formats.
– Most have ASCII export routines for simple versions of the internal
datafiles (e.g. DXF).
11. Message data
• Ocean and meteorological data compressed into official (usually
WMO-sanctioned) formats for transmission over approved
international channels, especially the WMO's Global
Telecommunications System (GTS). These highly compacted
formats usually require unpacking programs before they can be
used for analysis purposes. [The Self-Describing Formats BUFR and
GRIB are also often used for data and analysis messages within the
GTS.]
• Formats : DBCP-x, AAXX, BBXX, EEAA, EEBB, EECC, EEDD , IIAA, IIBB,
IICC, IIDD , JJXX, JJYY, PPAA, PPBB, PPCC, PPDD , QQAA, QQBB,
QQCC, QQDD , TTAA, TTBB, TTCC, TTDD , UUAA, UUBB, UUCC,
UUDD , VVAA, VVCC , YYXX , ZZYY
• As an example, the JJYY format encodes real-time
bathythermograph data; it replaces an older format, JJXX, used until
1995.
11. Message data
• Advantages :
– Cheap and quick to send over often crowded
circuits; widely accepted among non-technical
marine community.
– when of poor quality, they create a "placeholder"
for the higher quality data which should follow
• Disadvantages
– Only very coarse resolution and/or low precision is
possible due to the message format limitations.
11. Message data
This element defines an observation report on temperature, salinity and
currents at one particular location on the ocean surface, or in subsurface
layers.
12. Relational database
• A suite of spreadsheet-like tables with explicit links
between them in special linkage arrangements (usually
contained in additional tables).
• This collection of linked tables, known as a Relational
Database (RD), divides up very large initial tables into
much smaller tables and eliminates much duplication
of information that would otherwise be required.
• Relational Databases require the use of special
software (in which they are created, manipulated, and
analyzed) called Relational Database Management
Systems (RDBMS).
• Formats: MS Access, Oracle, Sybase, dBase, SQL Server
12. Relational database
• Advantages:
– Enormously flexible systems, capable of most typical statistical
and graphical analyses of data.
– Some have immediate Web compatibility for publishing
databases directly on the Internet; ability to exchange data (via
I/O operations or direct linking).
• Disadvantages
– Ocean data are seldom published in commercial RDBMS
formats, due to the machine- and software-specific
requirements they would carry with them.
– Users cannot immediately "look at" their data, although this
only requires simple queries that can be written in minutes.
• More about these formats later
13. Spreadsheets
• Spreadsheet formats are simply row-and-column data tables.
• Can easily be imported into several proprietary spreadsheet software
programs and many public domain programs.
• Each row is called a "record."
• The separate "fields" may be labeled by a single "label row" at the
beginning of the spreadsheet
• Formats: EXCEL , WK*
• Advantages
– Extremely easy to create, read, quality-control and manipulate in
commercial spreadsheet programs. Each record (data line) is unique
and complete.
• Disadvantages
– Can be quite large, compared to binary files of the same data.
14. Self describing data formats
• Data files that contain information about their own contents and structure.
• Collections of other format types :
– Together with metadata about the main data components.
• The rules and syntax :
– provided by (international) oversight groups
• Examples:
– HDF - widely used for satellite data archives
– NetCDF - widely used for gridded data and satellite data
– BUFR - meteorological format for observations
– GRIB - meteorological format for gridded data
• Advantages:
– Metadata and data are "married" within a single structure
– Software programs can find and browse desired data by working with the data files
themselves rather than external indexes.
– Wide use has given rise to a long list of community software and "read" libraries.
• Disadvantages:
– There is a steep learning curve for all these formats, due to their complexity and
comprehensiveness.
15. Stratified data formats
• A very common method to reduce the large size of Spreadsheet format data is to
take the slowly changing fields, which take up a lot of room in each record and to
place them in a totally separate "Cruise/Station" record that precedes all "Data"
records to which it refers.
• Naturally, this new type of record will have a different format from the other
records.
• This process can be taken further, so that "Cruise" records, "Station" records, and
"Data" records all have different formats.
– significance in the order of the records: because each "Data" record takes its full meaning
from the closest preceding "Cruise" and "Station" records.
• ICES Standard Profile
• Advantages:
– Smaller in size than spreadsheet.
• Disadvantages :
– Tricky to write software, due to multiple line formats.
– Usually the lines are formatted, so it is difficult for the human eye to read the data values.
– Use with spreadsheet software is very limited (editing, block sorting/cutting/pasting) due to
the different line formats.
– Import to relational databases with "off the shelf" routines is impossible.
15. Stratified data formats
Cruiseid A B C
stationid x y Z W
Sampleid l p K
Sampleid2 l2 p2 K2
Sampleid3 l3 p2 K3
Stationid x2 y2 Z2 W2
... ... ... ... ...
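The dependence of each "Data" record on the closest preceding "Cruise" and "Station" records can be sketched as a small parser that expands a stratified file back into complete spreadsheet-like rows. The one-letter record type codes (C, S, D) and the field values are hypothetical; real stratified formats such as the ICES Standard Profile define their own record layouts.

```python
def flatten_stratified(lines):
    """Expand stratified records into one complete row per data record."""
    cruise, station, rows = None, None, []
    for line in lines:
        kind, *fields = line.split()
        if kind == "C":          # cruise record: applies until the next one
            cruise, station = fields, None
        elif kind == "S":        # station record: applies to following data
            station = fields
        elif kind == "D":        # data record: inherits cruise and station
            if cruise is None or station is None:
                raise ValueError("data record without cruise/station")
            rows.append(cruise + station + fields)
        else:
            raise ValueError(f"unknown record type: {kind}")
    return rows

stratified = [
    "C cruise1 A B C",
    "S station1 x y Z W",
    "D sample1 l p K",
    "D sample2 l2 p2 K2",
    "S station2 x2 y2 Z2 W2",
    "D sample3 l3 p3 K3",
]
rows = flatten_stratified(stratified)
for row in rows:
    print(row)
```

Because the meaning of a line depends on the lines before it, sorting or cutting lines in a spreadsheet program would silently attach samples to the wrong station.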
16. Extra - XML
• Currently widely used
• Data exchange format
• Extensible Markup Language (XML)
16. Extra - XML
• Text based – small file size
• Ascii format
• Similar to stratified formats → hierarchy
• Formats defined by international organisations (see
also stratified)
• Metadata can be embedded in data
• Data exchange format – through internet
• Both for data delivery & data request
• Used in GIS in recent versions of software
• Web technology (e.g. Newsitems, search engines, ...)
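The hierarchical nature of XML can be illustrated with Python's standard xml.etree.ElementTree module. The element and attribute names in this sketch are invented for illustration and do not follow any published exchange schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are made up for
# illustration and do not follow any published schema.
doc = """
<cruise id="cruise1">
  <station id="station1" lat="51.5" lon="2.5">
    <sample id="sample1" species="Nudora sp1" density="32"/>
    <sample id="sample2" species="Nudora sp2" density="45"/>
  </station>
  <station id="station2" lat="51.7" lon="2.9">
    <sample id="sample3" species="Nudora sp1" density="12"/>
  </station>
</cruise>
"""

root = ET.fromstring(doc)
records = []
for station in root.findall("station"):
    for sample in station.findall("sample"):
        # Each sample takes its context from the enclosing elements.
        records.append((root.get("id"), station.get("id"),
                        sample.get("species"), int(sample.get("density"))))

for record in records:
    print(record)
```

Unlike a stratified text file, the nesting is explicit in the tags, so the order of sibling elements no longer carries meaning.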
Extra: Relational databases
Introduction
• Most commonly used data format next to
spreadsheets.
• Spreadsheets are relatively easy to understand
• Research projects mostly claim that their data is
stored in a relational database.
• Understanding a relational structure opens
access to many datasets
Relational databases - Data mining
• Exploration of data
• Prerequisite: data should be available in a
minable format - database
• Database = electronic document storing data
– Non-relational: 1 bulk system with unrelated
items (e.g. MS Excel files, text documents,
unrelated tables)
– Relational: all items (tables) are linked to each
other (see further)
Relational databases
Why use a database
• Relational database:
– All your data is stored in 1 file
• Easy to retrieve data
• Easy to backup
– Data and metadata stored together
• Data ...
• Metadata: data about the data (documentation)
– Many data files contain undocumented values:
– Species A has an abundance of 17 (what does the value 17 mean?)
Relational databases
Why use a database
• In a well-designed relational database, all data
is stored only once:
– Example: species list → typing errors
• Nudora thorakista
• Nudora thorrakista
• Nudora thorakhista
• Nudora thorakisa
– 1 species → species richness calculation: 4
– Solution: 1 table with 1 record for each species,
used as a reference
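The effect of the reference table can be sketched with Python's built-in sqlite3 module. The table and column names are assumptions made for this illustration; the four spellings are those from the example above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE observation (
        id INTEGER PRIMARY KEY,
        species_id INTEGER NOT NULL REFERENCES species(id)
    );
""")

# Free-text entry: four spellings of one species look like four species.
typed_names = ["Nudora thorakista", "Nudora thorrakista",
               "Nudora thorakhista", "Nudora thorakisa"]
richness_free_text = len(set(typed_names))

# Reference table: the name is stored once, observations point to its id.
con.execute("INSERT INTO species (id, name) VALUES (1, 'Nudora thorakista')")
con.executemany("INSERT INTO observation (species_id) VALUES (?)",
                [(1,)] * len(typed_names))
richness_reference = con.execute(
    "SELECT COUNT(DISTINCT species_id) FROM observation").fetchone()[0]

print(richness_free_text, richness_reference)
```

With free text the richness comes out as 4; with the reference table the four observations all point to the same species record and the richness is 1.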
Why use a database
• Data is much more rigid ...
– More difficult to make errors
– E.g. sorting in Excel
Relational databases
Principle - Exercise
• A practical example to understand ...
– Make a list of 15 people you know
– Make a list of all genders
– Make a list of characters and indicate for each
character whether nice or not
– Make a list of countries
• Start coupling all your lists
• You made a relational database
Relational database - biology
[Diagram: relational scheme linking the tables Species, Person, Places, Sample, Country, Density and Equipment]
• Example question: which person was present on samplings in Sweden?
• Example question: which species sampled with a core occur in densities higher than 40?
[Diagram: the scheme extended with further tables such as Variable, Var_value, Taxonomy, Photo, Literature, ...]
Relational databases
Principles
• Think before you start ...
– Structure of a database is the key to a good
dataset
– Structure has to translate the whole concept
• One look at the structure (relational scheme) should
explain the database
Relational databases - components
• Tables
– Basic structures containing the data
– Structure of table important
– ID
• Relations
– Definition of how different tables are connected
and form a meaningful unit
• Queries
– Extractions of data from database
Table designs ...
• A table consists of a series of columns (fields) and rows (records)
• Each row is a record, which has:
– Different fields
– Design of table must be done
before data is entered
– Each field: name, data type
– Each field can also be formatted → layout
Table designs ...
• Field types:
– Numeric – integer/double
– Text
– Date/Time
– Memo
– Autonumber  ID
– Yes/No
Exercise on field types:
• 12
• 15 jan 1988
• hallo
• 12,456
• 12:56
• Azdazdazd azdda zda azdd dad zd dadazdzd
azdazddazdd azdazd azdazd dzdzdzzd ada zzd
azdaz dda azd da az d z azdzadazd a zd a azd
azd z dd da a z a z zd d ddaa zd
• 09:89
Special field in a table: key
• A key = a unique identifier for a record
– Example: passport number:
• Number in a database which is unique and relates to all data
about you
– Each record in a table gets also a key
– This key is used to link tables to each other
– Example:
• Nudora sp1 – id: 123776
• Nudora sp2 – id: 34688
– Advantage: species name changes: linked taxa remain
linked
Linking tables through IDs
• Storing numbers is the most efficient way to store
data:
• Nudora sp1 is found in the north sea with a
density of 32
• Species 123776 is found in station 2 (North
sea) with a density of 32
• Record in table density becomes:
123776 | 2 | 32
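As a sketch of how such an id-only record is turned back into readable information, the example below stores the record 123776 | 2 | 32 in an in-memory SQLite database and resolves the ids with a join. Table and column names are assumptions; the new species name used in the update is made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE density (
        species_id INTEGER NOT NULL REFERENCES species(id),
        station_id INTEGER NOT NULL REFERENCES station(id),
        density    REAL NOT NULL
    );
    INSERT INTO species VALUES (123776, 'Nudora sp1'), (34688, 'Nudora sp2');
    INSERT INTO station VALUES (2, 'North Sea');
    INSERT INTO density VALUES (123776, 2, 32);
""")

QUERY = """
    SELECT species.name, station.name, density.density
    FROM density
    JOIN species ON species.id = density.species_id
    JOIN station ON station.id = density.station_id
"""
before = con.execute(QUERY).fetchall()

# Renaming the species touches one record; the density row stays linked.
con.execute("UPDATE species SET name = 'Nudora renamed' WHERE id = 123776")
after = con.execute(QUERY).fetchall()

print(before, after)
```

The density record itself never changes: only the species table is updated, which is the advantage of linking through keys rather than names.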
Setting up relations between tables
• Relations: links between tables
• Connecting tables through certain fields in a
rigid way to each other
• Advantage: database becomes a strong unity
• Types of relations:
– 1 to many
– Many to many ( = 2 times 1 to many)
Examples of relations
• Table places: field country (numeric)
• Table countries – list of countries,
each country has unique id
• Relation is made between:
– Field country in places
– Field id in country
• One to many relation: 1 record in table
country linked to multiple records in places
• No deleting of countries possible
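The one-to-many relation and the refusal to delete a country that is still in use can be sketched with sqlite3. The place names are hypothetical, and the blocking behaviour shown is that of an SQLite foreign key with enforcement switched on, used here as a stand-in for the relation defined in the database software.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces links on request
con.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE places (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        country INTEGER NOT NULL REFERENCES country(id)
    );
    INSERT INTO country VALUES (1, 'Sweden'), (2, 'Belgium');
    INSERT INTO places VALUES (1, 'Place A', 1), (2, 'Place B', 1);
""")

try:
    con.execute("DELETE FROM country WHERE id = 1")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False  # refused: two places still refer to Sweden

remaining = con.execute("SELECT COUNT(*) FROM country").fetchone()[0]
print(deleted, remaining)
```

The database refuses the deletion because it would leave records in places pointing to a country that no longer exists.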
Places
Country
Examples of relations
• Many to many
• Id of sample
• Id of species
• Table density: unique combination of sample,
species ...
Species
Sample
Density
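A many-to-many relation between samples and species, resolved through a density table, might look like the sketch below. It answers a question of the type "which species sampled with a core occur in densities higher than 40". As a simplification, the equipment is stored as a text column of the sample table rather than as a separate table, and all values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sample  (id INTEGER PRIMARY KEY, equipment TEXT NOT NULL);
    CREATE TABLE density (
        sample_id  INTEGER NOT NULL REFERENCES sample(id),
        species_id INTEGER NOT NULL REFERENCES species(id),
        density    REAL NOT NULL,
        PRIMARY KEY (sample_id, species_id)  -- each pair occurs only once
    );
    INSERT INTO species VALUES (1, 'Nudora sp1'), (2, 'Nudora sp2');
    INSERT INTO sample  VALUES (10, 'core'), (11, 'grab');
    INSERT INTO density VALUES (10, 1, 32), (10, 2, 55), (11, 1, 70);
""")

rows = con.execute("""
    SELECT DISTINCT species.name
    FROM density
    JOIN species ON species.id = density.species_id
    JOIN sample  ON sample.id  = density.sample_id
    WHERE sample.equipment = 'core' AND density.density > 40
    ORDER BY species.name
""").fetchall()
print(rows)
```

The density table holds two one-to-many relations (sample to density, species to density), which together form the many-to-many relation.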
Queries
• All data in database:
– Next step: get it out again
– Selections on 1 table: by using filters
– Selections on multiple tables: using queries
– Queries can be saved and reused
– Queries can be the basis for new queries
Sorting on tables
• Sorting
Filtering on tables
Making a simple selection Query
• Create ... Query in design view
• Switching between views:
Making a simple selection Query
• Select the tables and/or queries needed
Making a simple selection Query
• Select the fields needed for output/selection/sorting
Making a simple selection Query
• Set the criteria
Making a simple selection Query
• Select the values to output and add sorting
options
Output the results
• Go to datasheet view
Making a simple selection Query
• Special options ...
Exporting data
• From MS Access it is possible to export to
different formats!
• Tables, queries, ...
• Exports can be used to do further data mining:
– Through MS Excel → making graphs
– To do statistical analysis
Exporting data
Step by step demonstration
• Open a database
• Different items in database
• Open tables, sorting, filtering
• Table design
• Relationships
• Queries
Query operators
= equals
> Larger than
< Smaller than
>= larger than or equals
Between ... And ...
Is null
Like ...
Not like ...
Query operators
and both true
or at least 1 true
Examples on the field VOORNAAM (first name):
>"q*" and <"u*" → René, Robbie, Stefan, Stijn, Tim, Tristam
="r*" or "s*" → Robbie, Stefan, Stijn
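The same operators exist in standard SQL, with the difference that the LIKE wildcard is % rather than the * used in the query design view. The sketch below uses sqlite3 with a hypothetical person table; the ages are invented and one is deliberately left empty to show Is null.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (voornaam TEXT, age INTEGER)")
con.executemany("INSERT INTO person VALUES (?, ?)", [
    ("Anna", 31), ("Robbie", 25), ("Stefan", 40),
    ("Stijn", None), ("Tim", 19),
])

def names(where):
    """Return the first names matching a WHERE clause, sorted."""
    sql = f"SELECT voornaam FROM person WHERE {where} ORDER BY voornaam"
    return [row[0] for row in con.execute(sql)]

print(names("voornaam LIKE 'r%' OR voornaam LIKE 's%'"))
print(names("age BETWEEN 20 AND 35"))
print(names("age IS NULL"))
print(names("voornaam NOT LIKE 's%' AND age >= 25"))
```

Note that a record with an empty age is matched by IS NULL only; comparisons such as >= or BETWEEN never select it.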
Intermezzo ... Design a dataset
• Research project:
– You work with 3 persons on it
– You will sample 4 times on 3 locations
– You will measure 5 environmental characteristics
– You will identify all species
– You will count them
– Extra: you will measure each specimen
– Task: design on paper how your dataset will look
like
GLOBAL Scientific Data and
Metadata systems
Global Change Master Directory
• NASA's Global Change Master Directory (GCMD):
– is a comprehensive directory of descriptions of data sets of relevance
to global change research.
– includes descriptions of data sets covering :
• climate change, agriculture, the atmosphere, biosphere, hydrosphere &
oceans, geology, geography, and human dimensions of global change.
– freely searchable
– Only metadata records:
• nature of the data (e.g., parameters measured, geographic location, time
range)
• where stored.
• Adding a data description is simple:
– A web-based registration form
– free of charge
Metadata standards (1)
• Used to avoid the arbitrary use of properties when describing a dataset
• A document that presents a set of statements:
– rules of usage for metadata elements = metadata specification = metadata
standard.
• Some examples of metadata specifications:
– The common metadata standards for describing geospatial datasets are ISO
19115, DIF and FGDC.
– Common Communications Format: Developed by UNESCO and others as "a
common bibliographic exchange format that would be useful both to libraries
and other information services." - Used in UNESCO's library software
– CDI: Common Data Index - Used to describe oceanographic cruises. The
hard-copy forms were formerly known as ROSCOPS
• Global catalog online at the ICES site
– DIF: Directory Interchange Format: Format used by the Global Change Master
Directory and MEDI. Used to describe earth science datasets
Metadata standards (1)
• Dublin Core (ISO 15836): An element set for describing a wide range of networked
resources, focusing on bibliographic needs.
– It has also been used for other metadata documentation purposes. Also known as NISO Standard
Z39.85
• FGDC: Content Standard for Digital Geospatial Metadata (CSDGM) from the [US]
Federal Geographic Data Committee:
– Used to describe geospatial data. The FGDC metadata standard is lengthy (>200 fields) and
compliance with the standard has proved to be difficult.
• FGDC/BDP: FGDC with Biological Data Profile Extension
• ISBD: International Standard Bibliographic Description:
– A standard agreed upon at the International Meeting of Cataloguing Experts held in
Copenhagen in 1969; it provides a standard order and content for the description of
monographic material and facilitates the international exchange of bibliographic information
by standardizing the elements to be used in the bibliographic description, assigning an order
to these elements in the entry, and
specifying a system of symbols to be used in punctuating these elements
• ISO 19115: ISO 19115:2003 Geographic Information
– contains almost 300 elements. However, only a small number of these form part of the core
metadata, and only a few of those comprising the core metadata are mandatory.
– ISO 19115 allows the creation of extensions and profiles. A profile is a formalised extension
requiring registration of the profile.
Metadata standards (1)
• MARC 21 (Machine-Readable Cataloging): Most
widely used format for bibliographic records
– Several variants exist, e.g. US MARC
• RDF: Proposed by W3C for cataloguing web
resources. Uses a complex syntax that
incorporates the topology of the resource
objects (i.e. captures relationships)
Metadata standards (1)
• W3C:
– World Wide Web Consortium
– develops interoperable technologies
(specifications, guidelines, software, and tools)
• System used:
– Propose a specification
– Period of evaluation
– After common agreement
– Setting of standard
– Examples: XML/SVG/HTML/PNG
ROSCOP
• ROSCOP (Report of Observations/Samples
collected by Oceanographic Programmes)
• Conceived by IOC in the late 1960s
• low level inventory: for tracking oceanographic
data collected on Research Vessels
• revised in 1990: re-named as CSR (Cruise
Summary Report)
• Disciplines included:
– physical, chemical, and biological oceanography,
fisheries, marine contamination/pollution, and marine
meteorology.
FGDC
• Federal Geographic Data Committee Content Standard for
Digital Geospatial Metadata (FGDC)
• Metadata Profile for Shoreline Data, FGDC-STD-001.2-2001
• The Federal Geographic Data Committee:
– coordinates development of the National Spatial Data
Infrastructure (NSDI).
– The NSDI encompasses policies, standards, and procedures for
organizations to cooperatively produce and share geographic
data
• promotes the coordinated development, use, sharing, and
dissemination of geospatial data on a national basis (US)
• This nationwide data publishing effort is known as the
National Spatial Data Infrastructure (NSDI).
Marine Metadata
• A key part of any marine dataset is the accompanying metadata.
• Metadata describe :
– content, quality, condition and other characteristics of a dataset.
– mechanism to describe data in a consistent form
• Some formats:
– CSR/ROSCOP . The CSR System (also known as ROSCOP forms) is used to support a global
inventory of data collected at sea and to provide ready access to scientists, program managers
and data managers to timely information on data collected.
– DIF/MEDI. The DIF format, developed by NASA's Global Change Master Directory
(GCMD), is used by the Marine Environmental Data Inventory (MEDI) metadata catalogue. This
format has also been adopted by a number of international programs, including UNEP.
• Structured descriptions of marine datasets are the oldest environmental
metadata, going back to the 1960s with the creation of the "Report of
Observations/Samples Collected by Oceanographic Programmes" (ROSCOP) paper
forms. ROSCOPs have been replaced by CSRs (see below), while a number of other
major systems have been created to describe larger (or more complex) data
collections. Three major systems of great importance to contemporary
oceanography are described here:
• Marine Environmental Data Inventory
• MEDI is a directory system:
– for marine-related datasets and data inventories
– within the IOC's International Oceanographic Data and Information Exchange (IODE) system
– containing metadata, or "data about data"
• The aim of MEDI is to address the questions:
– What data do we have?
– When and where was it collected?
– Who holds that data?
• MEDI is a reference point for locating marine and coastal datasets
• MEDI is not limited to governmental datasets or restricted to freely distributed
data.
• Structure of MEDI:
– based on the Global Change Master Directory (GCMD) developed by NASA
– both systems use the DIF metadata format.
• The MEDI authoring tool encourages data collectors and scientists to produce
metadata descriptions for their datasets.
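The three questions MEDI aims to answer (what, when and where, who) can be pictured as the fields of a single discovery-metadata record. The sketch below is only an illustration: the class `DatasetRecord`, its field names and the sample values are hypothetical and are not the actual DIF fields.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass
class DatasetRecord:
    """Hypothetical discovery-metadata record (NOT the real DIF schema)."""
    title: str                               # what data do we have?
    start: date                              # when was it collected (first day)?
    end: date                                # when was it collected (last day)?
    bbox: Tuple[float, float, float, float]  # where: (west, south, east, north), degrees
    holder: str                              # who holds the data?

    def covers(self, lon: float, lat: float) -> bool:
        """Return True if the point lies inside the dataset's bounding box."""
        west, south, east, north = self.bbox
        return west <= lon <= east and south <= lat <= north


# Illustrative values only; this is not a real dataset.
record = DatasetRecord(
    title="Example benthos survey (illustrative)",
    start=date(2001, 5, 1),
    end=date(2001, 5, 14),
    bbox=(2.0, 51.0, 3.5, 51.9),
    holder="Example Institute (placeholder)",
)

print(record.covers(2.5, 51.5))
```

Note that the record describes the dataset without containing any measurements, which is what makes it "data about data".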
Data catalogues
• The ability to discover and access oceanographic data resources for use in
visualisation, planning, and decision support is an important requirement
for research and planning.
• A data catalogue or gateway provides search and access to referenced data
• It is populated with metadata that describe the attributes and contents of
datasets, databases, images, maps, documents and other catalogues and
collections of resources that are available both on-line and off-line.
• Types:
– Met-Ocean Data Catalogs - Catalogs (indexes) of data and products from all
Biological, Chemical, Geological and Physical fields indicated in Sciences of
Oceanography. The term "Met-Ocean" is used to emphasize the integration
of marine meteorology and all aspects of climate into this category.
– Remote Sensing Data Catalogs - Catalogs of data and products from remote
sensing (usually satellites).
– Ancillary and Applied Data Catalogs - Catalogs of data and products from all
other fields useful to oceanography.
Practical Example of metadata
datasets
• MARBEF
– Marine Biodiversity and Ecosystem Functioning
network of excellence
– European network of institutes
– One of the main aims during the project:
• Inventory all data generated through the project
• Inventory older data as well
– MARBEF website: access to dataset metadata
Practical Example of metadata
datasets
• What data are available from the Mediterranean?
• To what taxon level are data available?
• What kinds of environmental variables have
been measured?
• Who is responsible for the dataset?
• Are the data experimental?
• Which datasets are about molluscs?
• What is the geographic range of a dataset?
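Questions like these are answered by filtering metadata records, not by opening the datasets themselves. A minimal sketch follows, assuming a catalogue held as a list of dictionaries; the records, the field names and the helper `find` are invented for illustration and do not reflect the real MARBEF catalogue.

```python
# Illustrative metadata records only; not taken from any real catalogue.
catalogue = [
    {"title": "Dataset A (illustrative)", "region": "Mediterranean",
     "taxa": ["Mollusca", "Polychaeta"], "experimental": False},
    {"title": "Dataset B (illustrative)", "region": "North Sea",
     "taxa": ["Mollusca"], "experimental": True},
    {"title": "Dataset C (illustrative)", "region": "Mediterranean",
     "taxa": ["Copepoda"], "experimental": False},
]


def find(records, region=None, taxon=None):
    """Return titles of records matching every criterion that was given."""
    hits = []
    for rec in records:
        if region is not None and rec["region"] != region:
            continue
        if taxon is not None and taxon not in rec["taxa"]:
            continue
        hits.append(rec["title"])
    return hits


print(find(catalogue, region="Mediterranean"))
print(find(catalogue, taxon="Mollusca"))
```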
Data policies - history
General background
• Key principle:
– Open access to data, as much as possible
– Science versus commercial interests
– What is the economic value of data?
• Important in decisions about free availability of data
• Fish stock/catch data
• Sand extraction data
• Oil/gas exploitation data
– A lot of data is hidden: collected by private
companies
General background
• Ways to calculate the economic value of certain data
– Biodiversity data
• Question:
– Global distribution data of a shrimp
• What price to pay for such data?
– Global distribution of whale species
• What price to pay?
– Global distribution of gas hydrate areas
• What price to pay?
General background
• Data originator = data owner?
– Master thesis at Ghent University:
• Who is the owner of the data?
• Can you claim intellectual property?
• When is intellectual property created?
– Counting data of species
– Weighing species
– Identifying species
– Describing species
– Genetic codes of species
General background
• Data ownership depends largely on the body
funding the project!
– Example of Belgium
– Politics are reflected in property rights of data
• University
• Flanders – IWT
• National Science Foundation
• European Projects
• Private companies
History of data availability
• Some milestones:
– Up to the 17th century:
• Salons, letters
– 1665: first scientific journal: Philosophical
Transactions of the Royal Society of London
– Journals → libraries
– Journals → sent to PAYING subscribers
– Went on for 300 years
History of data availability
• 1950s:
– Origins of the Science Citation Index (proposed in 1955, first published in 1964)
• Number of citations:
– Influences journal impact
– Influences CV of scientist
• Hierarchy in journals
• Evaluation method of science
• Hierarchy of journals → commercially interesting
• Journals were bought by commercial companies
• Journal prices rose
• Smaller journals ran into problems
• 1991: first preprint service (arXiv)
History of data availability
• 2003: World Summit on the Information
Society (WSIS)
– Role of ICT
– Access to information and knowledge
– http://www.itu.int/wsis/docs/geneva/official/dop.html
• Berlin Declaration on Open Access to
Knowledge in the Sciences and Humanities
Berlin declaration
Data policies (1)
• What?
– Importance of setting rules
– Official document in project framework
– Official document accompanying a dataset
• Contents:
– Who is the owner of the data
– Who may use what parts of the dataset
– How to get access to the data
– On what timeframe revision is needed
– What may be done with the data
Data policies (2)
• Examples:
– If data are used: data originator becomes co-author of the publication
• Generates many extra publications (a benefit of sharing data)
– Data can be shared partly on global information
systems
• Example: GBIF
• Full dataset: a small part is shared with GBIF
• Advertisement of data
– Rules such as: data become free 2, 3 or 5 years
after publication
– Data management: obligation to preserve all data in good
condition
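A rule of the "free after 2, 3 or 5 years" kind is easy to state precisely once the publication date is recorded in the metadata. The sketch below is one possible reading of such a rule, assuming the embargo is counted in whole calendar years from the publication date; the function names `release_date` and `is_free` are hypothetical.

```python
from datetime import date


def release_date(published: date, embargo_years: int) -> date:
    """Date on which the data become free under an N-year embargo."""
    try:
        return published.replace(year=published.year + embargo_years)
    except ValueError:
        # Published on 29 February and the target year is not a leap year.
        return published.replace(year=published.year + embargo_years, day=28)


def is_free(published: date, embargo_years: int, today: date) -> bool:
    """True once the embargo period has passed."""
    return today >= release_date(published, embargo_years)


print(release_date(date(2010, 6, 15), 3))
print(is_free(date(2010, 6, 15), 5, date(2014, 1, 1)))
```

Storing the publication date and the embargo length with the dataset lets the release date be checked automatically instead of relying on someone remembering it.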

Data management principles

  • 2. Contents • Introduction • Formats: from text to relational structures • Global scientific metadata systems • Data availability and access • Principles of data policies • Value of data
  • 3. Introduction • Basic principles of why data-mangement (see last week) – Selfish reasons – Altruistic reasons – Moral obligation  costs of generation of data • Nice ideas, nice examples, ... – Work behind is sometimes less nice – For datamanagement there some rules/techniques/principles
  • 4. 1. Data formats • Large heterogeneity in data formats • Data format = the physical or electronic shape in which data is stored • Piece of paper with hand written text = data format • However focuss here: – Electronic data formats – Commonly used data formats
  • 5. 1. Data formats • Why use which format? – Historical reasons: • Old data mostly in text based list formats • Software and technology is accompagning certain formats • Example: xml is only being used after its invention – Other reasons: • Depending on data generator: – Machine generated data (mostly ascii format) • Worldwide agreed formats for certain types of data – Facilitate exchange of data packages
  • 6. 1. Data formats • Exchange of data formats – Most formats are exchangeable into eachother – Mostly top down: • Relational structure  spreadsheet  txt-based
  • 7. Data formats: different classifications • Physical types: – ASCII – BINARY • Format types : 15 often used data types
  • 8. Dataformat – ascii format (1) • Ascii: American Standard Code for Information Interchange • ASCII data are encoded so that the human reader can see and understand the values, because they are displayed as normal integers and real numbers. This means that the actual digital file contains print and display information for the human-readable characters, not the actual values of the data. The benefit of using ASCII data is that the user can see, understand and edit the file contents directly; the downside of using ASCII is that the data files are much larger.
  • 9. Dataformat – ascii format (2) • Combination of letters and numbers • Readable by any computer • No complex software required
  • 10. Dataformat – Binary data • Binary data are numeric data whose values are expressed in bits and bytes, instead of the human-readable ascii code. • Number values can be stored in much smaller files:  be read more rapidly (by machines) • the method for large datafiles, especially gridded data. • To use binary data: not so easy  interpreting steps are required
  • 11. Dataformat – Binary data • Contents and structure of binary files may vary: – Type of data stored: • Bit (0-1) – 1 bit • Byte (0-255) – 8 bits • Short integer (-32,768 32,767) – 16 bits • Interpreter – translator is required
  • 12. Data formats – 15 common used types • Text – files  Ascii/Binary • Spreadsheets • Relational structures • Others – Images – Maps
  • 13. 1 & 2 : Auxiliary Formats • Auxiliary Formats - Information about data files; these are not really "data" files, but are included here for completeness – 1 Header Formats - Information about the format, location or geo-referencing; usually very short – 2 Metadata Formats - see also metadata
  • 14. 3. Document • Digital data in proprietary formats (or sometimes just simple ASCII) designed for visual inspection, but not for data processing • ASCII ,MS Word DOC , WordPerfect , HTML , PDF - Adobe Acrobat , PS/EPS - PostScript/Encapsulated PS , Desktop publisher programs - all proprietary ...
  • 15. 3. Document • Advantages: Very polished appearance; powerful editors available; compatibility with other major document editing software. • Disadvantages: (hard to use in data mining) – ASCII text must be extracted for the sections of interest. – Embedded images must be converted to more easily used GIF, JPG or BMP formats. PDF and PS/EPS very tricky to convert to other formats.
  • 16. 4. Gridded data • File formats: – ASCII : example - SURFER (*.GRD) - with "DSAA" header lines – Binary : Plain binary grids: byte, short integer, long integer, single-precision or double-precision; with or without ASCII Header Files (see earlier)
  • 17.
  • 18. 4. Gridded data • Creation of the Grid: – The gridded data file is created from scattered data points in the real world, by a process called "gridding." – mathematical methods to create the grid – algorithms are available to examine data points
  • 19. 4. Gridded data • Gridded data files commonly contain more than a single grid – Data mostly avaiable for different parameters – Using sequences of XYZ dimensions and parameter dimensions – There is no "correct" way to construct files of multiple data grids • It is extremely important to document the sequence in which the dimensions (XYZ location, time, parameters) are "read." • Vector Grids: To represent vectors (literally arrows showing the direction of flow) in ocean and meteorological datasets two methods have been devised: provide the U and V components of the vector, or provide the direction and magnitude of the arrow. Both of these methods have been adapted to grids, for vector results from gridded models for instance. The grids can be contained in separate files, or sequentially listed in the same file.
  • 20. 4. Gridded data • Advantages: – Saves storage space – XYZ storage which requires 3 data per gridpoint. – Binary takes much less space than ASCII. – Reading the data is usually a very straightforward creation of a • DO LOOP routine (or nest of routines) that follows the order in which the data were stored • Disadvantages: – Binary data are not liked by those who want "to see" their data at all times.
  • 21. 5. Hard copy • Older, hard copy datasets • necessary evil – (pre-60s) ocean data has never been digitized • These datasets range from technical reports to hand- written log sheets and lab sheets. – Reports usually contain enough information to be successfully digitized – Manuscript holdings often require tedious collation and cross-referencing in order to assemble all the needed parts. – Datasets with missing critical parts (e.g. station data) exist, as well as analysis and synthesis reports containing statistics, graphs and tables, but no data.
  • 22. 5. Hard copy • Examples: – Lab sheets – Journal articles – Technical Reports – 80-character punch cards - Included here because many locations lack the facilities to read them – Hand-annotated charts/graphs – Specimen identification cards – Diaries – Ship logs
  • 23. 5. Hard copy • Risk of data loss: – Rule in many data centres: No paper data should be mailed or shipped unless photocopied. – All ORIGINAL paper data should be gathered by the data manager immediately after the relevant cruise and grouped into named folios whose contents are indexed. • All paper data should be submitted to supervised digitization as soon as possible. – Example: heritage library • Metadata of hard copy data: should fully describe the folios – numbers of pages – Color of frontpage – Other identifying characteristics • Advantages: They still exist. • Disadvantages: – Cannot be used in modern digital analysis. – Digital capture is very labor intensive. – Access is a tricky political issue in some institutions. • Compatibilities: Published papers in good condition can be scanned and converted to ASCII text with many commercial packages. (OCR techniques) – Controll afterwards ….
  • 24. 5. Hard copy • From hard copy to digital copy ... – Technique used depends on aim and type of data – Often just transformed in ‘document’ format – If to other formats – often man-driven • In many cases going back to hard copy only way to work (due to lack of metadata, file versions, ...)
  • 25. 6. Simple Images – Graphics file without earth mapping information – Interpretation is purely man-based – Very variable – Many file formats: • TIFF, GIF, JPG, BMP … • RAW versus compressed – RAW: all image information is stored without compression – Compressed: JPG/GIF information is compressed by extrapolation, reducing colors  smaller files but loss of information
  • 26.
  • 27. 6. Simple images • Some images have added artistic borders - – outside the geographic grid: that obscure the pixel-to- coordinates relationship • Advantages – Quick visualization of data that may have originally been extremely complex. Subjective analyses that do not require positional accuracy. – Disadvantages Quantification difficult; synthesis nearly impossible unless with pictures derived in exactly the same fashion Compatibilities Nearly all graphic picture formats are interchangeable with editor programs.
  • 28. 7. Geo-referenced images • Graphics file, with ancillary mapping information, showing 1 or more parameters of the earth's system in a rectilinear grid, usually derived by processing and decimation of very high-density information from aerial or space sensors. – Coordinates of pixel correspond to XY geo- coordinate. – Color of pixel represents a parameter
  • 29. 7. Geo-referenced images • TIF files can be made into Geo-Referenced Image files by the addition of internal geographic tags, which require exact knowledge of the image dimensions and its proper location on the earth's surface. • JPG, TIF and BMP can be made into Geo-Referenced Image formats by the addition of header "world files," which require exact knowledge of the image dimensions and its proper location on the earth's surface. A world file is a simple ASCII file with the following contents: – X-pixel size (delta X) – Rotation term for row (normally zero) – Rotation term for column (normally zero) – Y-pixel size (delta Y) – X-coordinate of center of upper left pixel – Y-coordinate of center of upper left pixel • World files for TIF have the extension TFW; • world files for JPG have the extension JPW; • world files for BMP have the extension BPW.
  • 31. 8-9-10. Mapping data • Mapping data consist of digital representations of individual objects (points, lines, polygons, etc.) – 8. XY: mapping line objects, in X (usually longitude) and Y (usually latitude) coordinates only – 9. List: mapping objects (points, lines, symbols, text, etc.) without topology or descriptive attributes – 10. Geographic Information System (GIS): mapping objects (points, lines, polygons, etc.) on the earth incorporated into robust data assemblages that contain additional detailed information about the properties and topologies of the objects. [NOTE: Most GIS systems can also accommodate gridded, geo-referenced image, relational and spreadsheet formats.]
  • 32. 8. XY data • Description: – simplest kind of geographic information: • lines specified by their ordered X and Y coordinates. • country boundaries: separated by several different markers • ASCII Export Format from GEBCO Database/Software (actually YX in column order) • Advantages: Simple to write, easy to read (when ASCII). • Disadvantages: Contain no topological relationships between objects, or attributes of the objects. • Text is rendered as drawing instructions, and cannot be retrieved as recognizable data.
  • 35. 9. Mapping data - List • Ordered list of "map primitives" to be drawn: – such as points, lines, circles, labels, etc. • These formats are extremely specific to certain software. • They could almost be called "plotter formats" because they do little more than draw pictures of geographically referenced information. • Small amounts of data can be included, however, coded into the appearance of such primitives as the circle (variable diameters), the vector arrow (variable lengths), and contour lines (colors). • Advantages: usually easy to read/write. • Disadvantages: exist in many variant subtypes; MS Word and WordPerfect differ markedly in the versions they accept.
  • 36. 10. Geographic Information System (GIS) • Charting and mapping: tools for natural resource management. • Digital methods are becoming much more common in ocean data analysis. • Geographic Information System (GIS) data formats contain complex, multi-theme collections of spatial information that can be used to create maps and charts, and to perform analyses. • The data formats that can support these systems are not just sufficient to draw maps, but also contain necessary ancillary data about the features included (in space and time). • NOTE: GIS files can be vector-type or raster-type, and many GIS software systems can handle both. Conversion utilities exist that can convert these files in either direction, although the raster-to-vector conversion often requires intensive quality control by skilled operators.
  • 37. 10. Geographic Information System (GIS) • Software: – Esri/MapInfo/Surfer/... • Recently: also many online GIS tools – OBIS • Open GIS standards: Open Geospatial Consortium (OGC) – an international industry consortium of 334 companies, government agencies and universities participating in a consensus process to develop publicly available geoprocessing specifications. – OGC protocols include Web Map Service (WMS) and Web Feature Service (WFS).
  • 38. 10. Geographic Information System (GIS) • Formats within this group: ESRI Shapefiles (SHP), VPF • Advantages: – Rapid creation of new maps and charts using the same databases. – No laborious hand-drawing methods. – Synthesis of different kinds of information, on an as-needed basis, from a common pool of datasets. – Instant changes in projection, scale, coverage area, etc. • Disadvantages: – GIS formats tend to be very complex, and populating them with the actual data of interest is laborious. • Compatibilities: – Most of the major software systems now recognize each other's formats. – Most have ASCII export routines for simple versions of the internal data files (e.g. DXF).
  • 39. 11. Message data • Ocean and meteorological data compressed into official (usually WMO-sanctioned) formats for transmission over approved international channels, especially the WMO's Global Telecommunication System (GTS). These highly compacted formats usually require unpacking programs before they can be used for analysis purposes. [The self-describing formats BUFR and GRIB are also often used for data and analysis messages within the GTS.] • Formats: DBCP-x, AAXX, BBXX, EEAA, EEBB, EECC, EEDD, IIAA, IIBB, IICC, IIDD, JJXX, JJYY, PPAA, PPBB, PPCC, PPDD, QQAA, QQBB, QQCC, QQDD, TTAA, TTBB, TTCC, TTDD, UUAA, UUBB, UUCC, UUDD, VVAA, VVCC, YYXX, ZZYY • As an example, the JJYY format encodes real-time bathythermograph data; it replaces an older format, JJXX, used until 1995.
  • 40. 11. Message data • Advantages: – Cheap and quick to send over often crowded circuits; widely accepted among the non-technical marine community. – Even when of poor quality, they create a "placeholder" for the higher quality data which should follow. • Disadvantages: – Only very coarse resolution and/or low precision is possible due to the message format limitations.
  • 42. 11. Message data This element defines an observation report on temperature, salinity and currents at one particular location on the ocean surface, or in subsurface layers.
  • 43. 12. Relational database • A suite of spreadsheet-like tables with explicit links between them in special linkage arrangements (usually contained in additional tables). • This collection of linked tables, known as a Relational Database (RD), divides up very large initial tables into much smaller tables and eliminates much duplication of information that would otherwise be required. • Relational databases require the use of special software (in which they are created, manipulated, and analyzed) called Relational Database Management Systems (RDBMS). • Formats: MS Access, Oracle, Sybase, dBase, SQL Server
  • 44. 12. Relational database • Advantages: – Enormously flexible systems, capable of most typical statistical and graphical analyses of data. – Some have immediate Web compatibility for publishing databases directly on the Internet; ability to exchange data (via I/O operations or direct linking). • Disadvantages: – Ocean data are seldom published in commercial relational database formats, due to the machine- and software-specific requirements they would carry with them. – Users cannot immediately "look at" their data, although this only requires simple queries that can be written in minutes. • More about these formats later
  • 45. 13. Spreadsheets • Spreadsheet formats are simply row-and-column data tables. • They can easily be imported into several proprietary spreadsheet software programs and many public domain programs. • Each row is called a "record." • The separate "fields" may be labeled by a single "label row" at the beginning of the spreadsheet. • Formats: EXCEL, WK* • Advantages: – Extremely easy to create, read, quality-control and manipulate in commercial spreadsheet programs. Each record (data line) is unique and complete. • Disadvantages: – Can be quite large, compared to binary files of the same data.
  • 46. 14. Self-describing data formats • Data files that contain information about their own contents and structure. • Collections of other format types: – Together with metadata about the main data components. • The rules and syntax: – provided by (international) oversight groups • Examples: – HDF - widely used for satellite data archives – NetCDF - widely used for gridded data and satellite data – BUFR - meteorological format for observations – GRIB - meteorological format for gridded data • Advantages: – Metadata and data are "married" within a single structure. – Software programs can find and browse desired data by working with the data files themselves rather than external indexes. – Wide use has given rise to a long list of community software and "read" libraries. • Disadvantages: – There is a steep learning curve for all these formats, due to their complexity and comprehensiveness.
  • 47. 15. Stratified data formats • A very common method to reduce the large size of spreadsheet format data is to take the slowly changing fields, which take up a lot of room in each record, and to place them in a totally separate "Cruise/Station" record that precedes all "Data" records to which it refers. • Naturally, this new type of record will have a different format from the other records. • This process can be taken further, so that "Cruise" records, "Station" records, and "Data" records all have different formats. – The order of the records is significant, because each "Data" record takes its full meaning from the closest preceding "Cruise" and "Station" records. • Example: ICES Standard Profile • Advantages: – Smaller in size than spreadsheet. • Disadvantages: – Tricky to write software, due to multiple line formats. – Usually the lines are formatted, so it is difficult for the human eye to read the data values. – Use with spreadsheet software is very limited (editing, block sorting/cutting/pasting) due to the different line formats. – Import to relational databases with "off the shelf" routines is impossible.
  • 48. 15. Stratified data formats • Example record layout (indentation shows which header record each line depends on):
      Cruiseid A B C
        Stationid x y Z W
          Sampleid l p K
          Sampleid2 l2 p2 K2
          Sampleid3 l3 p2 K3
        Stationid x2 y2 Z2 W2
          ... ... ... ... ...
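The way each data record takes its meaning from the closest preceding cruise and station records can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-letter record prefixes `C`, `S` and `D` and the whitespace-separated fields are hypothetical and do not describe the ICES format or any other real stratified format.

```python
def flatten_stratified(lines):
    """Expand stratified records into complete, spreadsheet-style rows.

    Hypothetical layout: the first token is the record type
    (C = cruise, S = station, D = data), the rest are field values.
    """
    rows = []
    cruise = None
    station = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        kind, values = fields[0], fields[1:]
        if kind == "C":
            cruise = values
            station = None  # a new cruise invalidates the previous station
        elif kind == "S":
            station = values
        elif kind == "D":
            if cruise is None or station is None:
                raise ValueError("data record without preceding cruise/station")
            # Repeat the slowly changing fields in front of every data record.
            rows.append(cruise + station + values)
        else:
            raise ValueError(f"unknown record type: {kind!r}")
    return rows


example_lines = [
    "C cruise1 A B",
    "S station1 x y",
    "D sample1 l p",
    "D sample2 l2 p2",
    "S station2 x2 y2",
    "D sample3 l3 p3",
]
flat_rows = flatten_stratified(example_lines)
```

The flattened rows are larger than the stratified input because the cruise and station fields are repeated on every line, which is exactly the size saving the stratified format trades against ease of use.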
  • 49. 16. Extra - XML • Currently widely used • Data exchange format • Extensible Markup Language (XML)
  • 50. 16. Extra - XML • Text based – small file size • ASCII format • Similar to stratified → hierarchy • Formats defined by international organisations (see also stratified) • Metadata can be embedded in data • Data exchange format – through internet • Both for data delivery & data request • Used in GIS in recent versions of software • Web technology (e.g. news items, search engines, ...)
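To make the hierarchy idea concrete, the following Python sketch reads a small XML document with the standard library and flattens it into rows. The element and attribute names (`cruise`, `station`, `sample`, `density`, ...) are invented for this illustration and do not follow any official exchange schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the nesting plays the role of the
# cruise/station/data records of a stratified format.
XML_TEXT = """<cruise id="C1">
  <station id="S1" lat="51.5" lon="2.5">
    <sample id="A" species="Nudora sp1" density="32"/>
    <sample id="B" species="Nudora sp2" density="7"/>
  </station>
  <station id="S2" lat="51.7" lon="2.9">
    <sample id="C" species="Nudora sp1" density="11"/>
  </station>
</cruise>"""

root = ET.fromstring(XML_TEXT)

xml_rows = []
for station in root.findall("station"):
    for sample in station.findall("sample"):
        # Each sample inherits the identifiers of its parent elements.
        xml_rows.append(
            (
                root.get("id"),
                station.get("id"),
                sample.get("species"),
                int(sample.get("density")),
            )
        )
```

Because the tags name every value, the file documents itself to some extent, which is what the slide means by metadata embedded in the data.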
  • 52. Introduction • Most commonly used data format next to spreadsheets. • Spreadsheets are relatively easy • Research projects mostly state that their data are stored in a relational database. • Understanding a relational structure opens up access to many datasets
  • 53. Relational databases - Data mining • Exploration of data • Prerequisite: data should be available in a minable format - database • Database = electronic document storing data – Non-relational: 1 bulk system with non-related items (e.g. MS Excel files, text documents, unrelated tables) – Relational: all items (tables) are linked to each other (see further)
  • 54. Relational databases: why use a database • Relational database: – All your data is stored in 1 file • Easy to retrieve data • Easy to back up – Data and metadata stored together • Data ... • Metadata: data about the data (documentation) – Many data files contain undocumented values: • Species A has an abundance of 17 (what does the value 17 mean?)
  • 55. Relational databases: why use a database • In a well-designed relational database, each piece of data is stored only once: – Example: species list → typing errors • Nudora thorakista • Nudora thorrakista • Nudora thorakhista • Nudora thorakisa – 1 species → species richness calculation: 4 – Solution: 1 table with 1 record for each species, used as a reference
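The effect of the typing errors on a richness count, and of the reference-table solution, can be shown with a tiny Python sketch. The variable names and the ID value 1 are assumptions chosen for the illustration.

```python
# Species names typed freely into each observation: every spelling
# variant is counted as a separate species.
typed_names = [
    "Nudora thorakista",
    "Nudora thorrakista",
    "Nudora thorakhista",
    "Nudora thorakisa",
]
naive_richness = len(set(typed_names))

# Reference table: each species is stored once, observations only
# store the ID of the species record.
species_table = {1: "Nudora thorakista"}
observation_species_ids = [1, 1, 1, 1]
richness = len(set(observation_species_ids))
```

Here `naive_richness` is 4 while `richness` is 1, because a misspelled name can no longer be entered once observations may only point at existing species records.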
  • 56. Why use a database • Data is much more rigid ... – More difficult to make errors – E.g. sorting in Excel (sorting a single column on its own breaks the rows apart)
  • 57. Relational databases Principle - Exercise • A practical example to understand ... – Make a list of 15 people you know – Make a list of all genders – Make a list of characters and indicate for each character whether nice or not – Make a list of countries • Start coupling all your lists • You made a relational database
  • 58. Relational database - biology • Tables in the example scheme: Species, Person, Places, Sample, Country, Density, Equipment
  • 62. Relational databases Principles • Think before you start ... – Structure of a database is the key to a good dataset – Structure has to translate the whole concept • One look at the structure (relational scheme) should explain the database
  • 63. Relational databases - components • Tables – Basic structures containing the data – Structure of table important – ID • Relations – Definition of how different tables are connected and form a meaningful unit • Queries – Extractions of data from the database
  • 64. Table designs ... • A table consists of a series of columns ... • Each record as such: – Different fields – Design of table must be done before data is entered – Each field: name, data type – Each field can also be formatted → layout • [Figure labels: Record, Column, Field]
  • 65. Table designs ... • Field types: – Numeric – integer/double – Text – Date/Time – Memo – Autonumber → ID – Yes/No
  • 66. Exercise on field types: • 12 • 15 jan 1988 • hallo • 12,456 • 12:56 • Azdazdazd azdda zda azdd dad zd dadazdzd azdazddazdd azdazd azdazd dzdzdzzd ada zzd azdaz dda azd da az d z azdzadazd a zd a azd azd z dd da a z a z zd d ddaa zd • 09:89
  • 67. Special field in a table: key • A key = a unique identifier for a record – Example: passport number: • Number in a database which is unique and relates to all data about you – Each record in a table also gets a key – This key is used to link tables to each other – Example: • Nudora sp1 – id: 123776 • Nudora sp2 – id: 34688 – Advantage: when a species name changes, linked taxa remain linked
  • 68. Linking tables through IDs • Storing numbers is the most efficient way to store data: • Nudora sp1 is found in the North Sea with a density of 32 • Species 123776 is found in station 2 (North Sea) with a density of 32 • Record in table density becomes: 123776 | 2 | 32
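The density record 123776 | 2 | 32 can be tried out with Python's built-in `sqlite3` module. This is a sketch only: the slides work with MS Access, and the table and column names below, as well as the replacement name 'Nudora renamed', are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE density (
    species_id INTEGER NOT NULL REFERENCES species(id),
    station_id INTEGER NOT NULL REFERENCES station(id),
    density    INTEGER NOT NULL
);
INSERT INTO species VALUES (123776, 'Nudora sp1'), (34688, 'Nudora sp2');
INSERT INTO station VALUES (2, 'North Sea');
INSERT INTO density VALUES (123776, 2, 32);
""")

# Translate the stored numbers back into readable names.
READABLE = """
SELECT species.name, station.name, density.density
FROM density
JOIN species ON species.id = density.species_id
JOIN station ON station.id = density.station_id
"""
before_rename = conn.execute(READABLE).fetchall()

# Renaming the species touches one record; the density record is unchanged.
conn.execute("UPDATE species SET name = 'Nudora renamed' WHERE id = 123776")
after_rename = conn.execute(READABLE).fetchall()
```

The density table never stores the name itself, so the rename is picked up everywhere the ID is used.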
  • 69. Setting up relations between tables • Relations: links between tables • Connecting tables through certain fields in a rigid way to each other • Advantage: database becomes a strong unity • Types of relations: – 1 to many – Many to many ( = 2 times 1 to many)
  • 70. Examples of relations • Table places: field country (numeric) • Table countries – list of countries, each country has a unique id • Relation is made between: – Field country in places – Field id in country • One to many relation: 1 record in table country linked to multiple records in places • A country that is still referenced by a place cannot be deleted • [Figure: tables Places and Country]
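The one-to-many relation and the rule that a referenced country cannot be deleted can be sketched with `sqlite3`. The place and country names are hypothetical, and SQLite stands in for MS Access here; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce relations by default
conn.executescript("""
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE places (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    country INTEGER NOT NULL REFERENCES country(id)
);
INSERT INTO country VALUES (1, 'Belgium'), (2, 'Kenya');
INSERT INTO places VALUES (10, 'Ghent', 1), (11, 'Ostend', 1);
""")


def try_delete_country(country_id):
    """Return True if the country could be deleted, False if a relation blocks it."""
    try:
        conn.execute("DELETE FROM country WHERE id = ?", (country_id,))
        return True
    except sqlite3.IntegrityError:
        return False


deleted_belgium = try_delete_country(1)  # still linked to two places
deleted_kenya = try_delete_country(2)    # no places refer to it
remaining = [row[0] for row in conn.execute("SELECT name FROM country")]
```

One country record is linked to two places, so the database refuses to remove it, while the unreferenced country can be deleted.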
  • 71. Examples of relations • Many to many – Id of sample – Id of species – Table density: unique combination of sample, species ... • [Figure: tables Species, Sample and Density]
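A many-to-many relation built from two one-to-many relations can be sketched as a linking table whose key is the combination of both IDs. Again this uses `sqlite3` as a stand-in, with invented sample labels and density values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample  (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE density (
    sample_id  INTEGER NOT NULL REFERENCES sample(id),
    species_id INTEGER NOT NULL REFERENCES species(id),
    density    INTEGER NOT NULL,
    PRIMARY KEY (sample_id, species_id)  -- each combination occurs once
);
INSERT INTO sample VALUES (1, 'S1'), (2, 'S2');
INSERT INTO species VALUES (123776, 'Nudora sp1'), (34688, 'Nudora sp2');
INSERT INTO density VALUES (1, 123776, 32), (1, 34688, 5), (2, 123776, 12);
""")

# A second density for the same sample and species is refused.
try:
    conn.execute("INSERT INTO density VALUES (1, 123776, 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# One species occurs in many samples, one sample holds many species.
samples_per_species = dict(
    conn.execute("SELECT species_id, COUNT(*) FROM density GROUP BY species_id")
)
```

The composite key is one possible way to express the "unique combination of sample, species" mentioned on the slide.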
  • 73. Queries • All data in database: – Next step: get it out again – Selections on 1 table: by using filters – Selections on multiple tables: using queries – Queries can be saved and reused – Queries can be the basis for new queries
  • 76. Making a simple selection Query • Create ... Query in design view • Switching between views:
  • 77. Making a simple selection Query • Select the tables and/or queries needed
  • 78. Making a simple selection Query • Select the fields needed for output/selection/sorting
  • 82. Making a simple selection Query • Set the criteria
  • 83. Making a simple selection Query • Select the values to output and add sorting options
  • 84. Output the results • Go to datasheet view
  • 85. Making a simple selection Query • Special options ...
  • 86. Exporting data • From MS Access it is possible to export to different formats! • Tables, queries, ... • Exports can be used to do further data mining: – Through MS Excel → making graphs – To do statistical analysis
  • 89. Step by step demonstration • Open a database • Different items in database • Open tables, sorting, filtering • Table design • Relationships • Queries
  • 90. Query operators • = : equals • > : larger than • < : smaller than • >= : larger than or equal to • Between ... And ... • Is null • Like ... • Not like ...
  • 92. Query operators • and: both conditions true • or: at least 1 condition true • < : smaller than • >= : larger than or equal to • Between ... And ... • Is null • Like ... • Not like ... • Examples on the field VOORNAAM (first name): – >"q*" and <"u*" → René, Robbie, Stefan, Stijn, Tim, Tristam – ="r*" or "s*" → Robbie, Stefan, Stijn
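The operators can be tried out outside MS Access as well. The sketch below uses Python's `sqlite3` module, so the syntax differs from the slide: SQL uses `%` where Access uses `*` as wildcard. The table name `people`, the helper `select` and the extra names Anna, Pieter and Ulrike are assumptions for the illustration, and the results are those of this sketch, not a reproduction of the Access output on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE makes comparisons ignore ASCII upper/lower case.
conn.execute("CREATE TABLE people (voornaam TEXT COLLATE NOCASE)")
names = ["Anna", "Pieter", "René", "Robbie", "Stefan",
         "Stijn", "Tim", "Tristam", "Ulrike"]
conn.executemany("INSERT INTO people VALUES (?)", [(name,) for name in names])


def select(where):
    """Return the first names matching a WHERE clause, sorted alphabetically."""
    sql = f"SELECT voornaam FROM people WHERE {where} ORDER BY voornaam"
    return [row[0] for row in conn.execute(sql)]


# and: both conditions must be true.
between_q_and_u = select("voornaam > 'q' AND voornaam < 'u'")

# or: at least one condition must be true.
starts_with_r_or_s = select("voornaam LIKE 'r%' OR voornaam LIKE 's%'")
```

In this sketch `between_q_and_u` holds the six names from René to Tristam, and `starts_with_r_or_s` holds René, Robbie, Stefan and Stijn.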
  • 93. Intermezzo ... Design a dataset • Research project: – You work on it with 3 persons – You will sample 4 times at 3 locations – You will measure 5 environmental characteristics – You will identify all species – You will count them – Extra: you will measure each specimen – Task: design on paper what your dataset will look like
  • 94. GLOBAL Scientific Data and Metadata systems
  • 95. Global Change Master Directory • NASA's Global Change Master Directory (GCMD): – is a comprehensive directory of descriptions of datasets of relevance to global change research. – includes descriptions of datasets covering: • climate change, agriculture, the atmosphere, biosphere, hydrosphere & oceans, geology, geography, and human dimensions of global change. – freely searchable – Only metadata records: • nature of the data (e.g., parameters measured, geographic location, time range) • where stored. • Adding a data description is simple: – A web-based registration form – free of charge
  • 99. Metadata standards (1) • Used to avoid the arbitrary use of properties when describing a dataset • A document that presents a set of statements: – rules of usage for metadata elements = metadata specification = metadata standard. • Some examples of metadata specifications: – The common metadata standards for describing geospatial datasets are ISO 19115, DIF and FGDC. – Common Communications Format: developed by UNESCO and others as "a common bibliographic exchange format that would be useful both to libraries and other information services." Used in UNESCO's library software. – CDI: Common Data Index. Used to describe oceanographic cruises. The hard-copy forms were formerly known as ROSCOPs. • Global catalog online at the ICES site – DIF: Directory Interchange Format. Format used by the Global Change Master Directory and MEDI; used to describe earth science datasets.
  • 100. Metadata standards (1) • Dublin Core (ISO 15836): An element set for describing a wide range of networked resources, focusing on bibliographic needs. – Has also been used for other metadata documentation purposes. Also known as NISO Standard Z39.85. • FGDC: Content Standard for Digital Geospatial Metadata (CSDGM) from the [US] Federal Geographic Data Committee: – Used to describe geospatial data. The FGDC metadata standard is lengthy (>200 fields) and compliance with the standard has proved to be difficult. • FGDC/BDP: FGDC with Biological Data Profile Extension • ISBD (International Standard Bibliographic Description): – A standard agreed upon at the International Meeting of Cataloguing Experts held in Copenhagen in 1969; it provides a standard order and content for the description of monographic material and facilitates the international exchange of bibliographic information by standardizing the elements to be used in the bibliographic description, assigning an order to these elements in the entry, and specifying a system of symbols to be used in punctuating these elements. • ISO 19115 (ISO 19115:2003 Geographic Information): – Contains almost 300 elements. However, only a small number of these form part of the core metadata, and only a few of those comprising the core metadata are mandatory. – ISO 19115 allows the creation of extensions and profiles. A profile is a formalised extension requiring registration of the profile.
  • 101. Metadata standards (1) • MARC 21 (Machine-Readable Cataloging): Most widely used format for bibliographic records – Several variants exist, e.g. US MARC • RDF (Resource Description Framework): Proposed by W3C for cataloguing web resources. Uses a complex syntax that incorporates the topology of the resource objects (i.e. captures relationships).
  • 102. Metadata standards (1) • W3C: – World Wide Web Consortium – develops interoperable technologies (specifications, guidelines, software, and tools) • System used: – Propose a specification – Period of evaluation – After common agreement – Setting of standard – Examples: XML/SVG/HTML/PNG
  • 104. ROSCOP • ROSCOP (Report of Observations/Samples collected by Oceanographic Programmes) • Conceived by IOC in the late 1960s • low level inventory: for tracking oceanographic data collected on Research Vessels • revised in 1990: re-named as CSR (Cruise Summary Report) • Disciplines included: – physical, chemical, and biological oceanography, fisheries, marine contamination/pollution, and marine meteorology.
  • 106. FGDC • Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata • Metadata Profile for Shoreline Data, FGDC-STD-001.2-2001 • The Federal Geographic Data Committee: – coordinates development of the National Spatial Data Infrastructure (NSDI). – The NSDI encompasses policies, standards, and procedures for organizations to cooperatively produce and share geographic data. – promotes the coordinated development, use, sharing, and dissemination of geospatial data on a national basis → US • This nationwide data publishing effort is known as the National Spatial Data Infrastructure (NSDI).
  • 110. Marine Metadata • A key part of any marine dataset is the accompanying metadata. • Metadata describe: – content, quality, condition and other characteristics of a dataset. – mechanism to describe data in a consistent form • Some formats: – CSR/ROSCOP. The CSR system (also known as ROSCOP forms) is used to support a global inventory of data collected at sea and to provide scientists, program managers and data managers with ready access to timely information on data collected. – DIF/MEDI. The DIF format, developed by NASA's Global Change Master Directory (GCMD), is used by the Marine Environmental Data Inventory (MEDI) metadata catalogue. This format has also been adopted by a number of international programs, including UNEP. • Structured descriptions of marine datasets are the oldest environmental metadata, going back to the early 1960's with the creation of the "Report of Observations/Samples Collected by Oceanographic Programmes" (ROSCOP) paper forms. ROSCOPs have been replaced by CSRs (see below), while a number of other major systems have been created to describe larger (or more complex) data collections. Three major systems of great importance to contemporary oceanography are described here:
  • 111. Marine Environmental Data Inventory (MEDI) • MEDI is a directory system for: – marine related datasets and data inventories – within IOC’s International Oceanographic Data and Information Exchange (IODE) system. – MEDI contains metadata or "data about data". • The aim of MEDI is to address the questions: – What data do we have? – When and where was it collected? – Who holds that data? • MEDI is a reference point for locating marine and coastal datasets • MEDI is not limited to governmental datasets or restricted to freely distributed data. • Structure of MEDI: – based on the Global Change Master Directory (GCMD) developed by NASA – both systems use the DIF metadata format. • The MEDI authoring tool encourages data collectors and scientists to produce metadata descriptions for their datasets.
  • 113. Data catalogues • The ability to discover and access oceanographic data resources for use in visualisation, planning, and decision support is an important requirement to support research and planning. • A data catalog or gateway provides search and access to referenced data • It is populated with metadata which describe the attributes and contents of datasets, databases, images, maps, documents and other catalogs and collections of resources that are available both on-line and off-line. • Types: – Met-Ocean Data Catalogs - Catalogs (indexes) of data and products from all biological, chemical, geological and physical fields of oceanography. The term "Met-Ocean" is used to emphasize the integration of marine meteorology and all aspects of climate into this category. – Remote Sensing Data Catalogs - Catalogs of data and products from remote sensing (usually satellites). – Ancillary and Applied Data Catalogs - Catalogs of data and products from all other fields useful to oceanography.
  • 114. Practical example of metadata datasets • MARBEF – Marine Biodiversity and Ecosystem Functioning network of excellence – European network of institutes – One of the main aims during the project: • Inventory all data generated through the project • Also inventory older data – MARBEF website: access to dataset metadata
  • 116. Practical example of metadata datasets • What data is available from the Mediterranean? • To what taxon level is data available? • What kind of environmental variables have been measured? • Who is responsible for the dataset? • Experimental data? • What datasets are about molluscs? • Geographic ranges of a dataset
  • 117. Data policies - history
  • 118. General background • Key principle: – Open access to data – as much as possible – Science versus commercial – What is the economic value of data? • Important in decisions about free availability of data • Fish stock/catch data • Sand extraction data • Oil/gas exploitation data – A lot of data is hidden: collected by private companies
  • 119. General background • Ways to calculate the economic value of certain data – Biodiversity data • Question: – Global distribution data of a shrimp • What price to pay for such data? – Global distribution of whale species • What price to pay? – Global distribution of gas hydrate areas • What price to pay?
  • 120. General background • Data originator – data owner ... ??? – Master thesis at Ghent University: • Who is the owner of the data? • Can you claim intellectual property? • When is intellectual property created? – Counting data of species – Weighing species – Identifying species – Describing species – Genetic codes of species
  • 121. General background • Data ownership depends a lot on the funding body of a project! – Example of Belgium – Politics are reflected in property rights of data • University • Flanders – IWT • National Science Foundation • European projects • Private companies
  • 122. History of data availability • Some milestones: – Up to 17th century: • Salons – letters – 1665: first scientific journal: Philosophical Transactions of the Royal Society of London – Journals → libraries – Journals → sent to PAYING subscribers – Went on for 300 years
  • 123. History of data availability • 1950s: – Founding of the Science Citation Index • Number of citations: – Influences journal impact – Influences CV of scientist • Hierarchy in journals • Evaluation method of science • Hierarchy of journals → commercially interesting • Journals were bought by commercial companies • Price of journals rose • Smaller journals ran into problems • 1991: first preprint service
  • 124. History of data availability • 2003: World Summit on the Information Society (WSIS) – Role of ICT – Access to information and knowledge – http://www.itu.int/wsis/docs/geneva/official/dop.html • Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities
  • 127. Data policies (1) • What? – Importance of setting rules – Official document in project framework – Official document accompanying a dataset • Contents: – Who is the owner of the data – Who may use what parts of the dataset – How to get access to data – On what timeframe revision is needed – What may be done with the data
  • 128. Data policies (2) • Examples: – If data is used: data originator becomes co-author of the publication • Generates many extra publications (benefit of sharing data) – Data can be shared partly on global information systems • Example: GBIF • Of the full dataset, a small part is shared with GBIF • Advertisement of data – Rules like: 2-3-5 years after publication data becomes free – Data management: obligation to preserve all data in a good state