TDWG Annual Meeting 2011: Issues in Providing Federated Access to Digital Biodiversity Data
1. Taxonomic Databases Working Group Annual Meeting 2011. GBIF: Issues in providing federated access to digital information related to biological specimens. David Remsen, Senior Programme Officer, Global Biodiversity Information Facility (GBIF). TDWG 2011.
2.
3.
4. “Wrapper” software: PyWrapper (Python), TAPIRLink (PHP), DiGIR (PHP). Install one of these ‘wrappers’ in front of your database (insect collection, bird observations, herbarium data) to expose records as ABCD or DarwinCore.
5. The promise of federation: the GBIF Data Portal as a gateway. User: “Any specimens from Thailand?” Portal: “I will ask!” Each provider (insect collection, herbaria, bird observations) answers “I do!” or “Nope!”
6. The challenge of federation: the portal calls out (“Hello?”) but one provider’s server is not available, while the others answer “Hi!”
7. The rise of indexing: the GBIF Data Portal as a data index. Portal to providers: “Send me an index of all of your data.” The portal (now holding data itself) can answer “Any data records from Thailand?” directly.
8. The wrong tools for the job. Harvesting via the federated protocols: “Send me an index of your data once per month.” “Here is page one. If I go offline, start again.” “Not too fast!” “You ask the same questions every time.”
10. A refined approach: each provider publishes its data at a URL and the portal fetches it directly. “This is fast!” “This is easy!”
11. Growth of indexed records: 2007, 70 million; 2008, 147 million; 2009, 180 million; 2010, 201 million; today, 302 million. The need for a new standard was identified during this period.
To start with, GBIF strives to create a global biodiversity data network that facilitates free and open access to primary biodiversity data worldwide. The network currently includes over 9,200 datasets from over 340 data publishers representing over 100 countries and international organisations. Collectively, the network provides access to over 300 million data records.
The GBIF data network has historically been based on access to biodiversity databases mediated through one of the TDWG protocols listed above. These protocols provide a standard means of querying databases and returning results formatted according to the Darwin Core or ABCD XML specifications.
These protocols were designed to support a fully federated network, in which a user queries the network through a gateway that propagates the query to every member and assembles the responses for the user.
The GBIF network, however, was never able to function in this federated role. Real-time querying of databases was hampered by many factors, not least that at any given time up to a quarter of the data servers were offline.
As a result, the GBIF data portal provides discovery of data through a central index. This index consists of a subset of all the data served through the network, sufficient to answer the key questions about the data store: what species are included, where were they found, and when were they collected.
DiGIR, TAPIR and BioCASe are not well suited to building indexes of databases: harvesting an entire dataset requires a long series of paged queries. For example, a dataset of 260,000 specimens served via TAPIR allows 200 records to be retrieved per request. Harvesting it requires 1,300 request/response pairs and takes over 9 hours to complete, during which 500 MB of XML is transferred. Once processed on the GBIF server, this is reduced to a 32 MB text file, which could have been further compressed to a 3 MB zip file. Producing such an export and zipping it would take under a minute if done by the database itself. Thus, in 2009, GBIF began to promote the use of a new indexing data format.
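The harvest arithmetic above can be checked directly; the figures below are simply the ones quoted in the text.

```python
# Paged TAPIR harvest of a 260,000-record dataset, 200 records per
# request, versus a single pre-built export (figures from the text).
records = 260_000
page_size = 200                         # records returned per request
requests = records // page_size         # request/response pairs needed

xml_mb, text_mb, zip_mb = 500, 32, 3    # transfer and export sizes
overhead = xml_mb / zip_mb              # XML moved per zipped MB

print(requests)   # 1300 round trips just to copy the dataset
```

The ratio makes the case for the new format: the protocol transfers well over a hundred times more bytes than a zipped export of the same data would.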
Darwin Core Archives provide Darwin Core-based occurrence and taxonomic data in a simple, text-based format. The format simplifies the exchange of indexes by eliminating the federated transfer protocols: data is fetched from a simple URL over HTTP.
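Because a Darwin Core Archive is essentially a zip file of delimited text, consuming one needs no special protocol machinery. A minimal sketch, assuming a tab-delimited core file named `occurrence.txt` with a header row (the file name and columns are illustrative, not mandated; a real archive describes its layout in a `meta.xml` descriptor):

```python
import csv
import io
import zipfile


def read_occurrences(archive, core_file="occurrence.txt"):
    """Yield each row of the archive's core data file as a dict.

    `archive` may be a path or a file-like object; the core file is
    assumed to be tab-delimited with a header row.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(core_file) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            for row in csv.DictReader(text, delimiter="\t"):
                yield row
```

Compared with paged protocol harvesting, the entire exchange reduces to one HTTP download followed by a local unzip and parse.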
Darwin Core Archives give GBIF the means to reduce the month or more that currently elapses between a data publisher registering data and its appearance in the data portal. We anticipate that, with increased uptake of Darwin Core Archives and improvements in our data-integration processes, we can reduce this latency from roughly a month to a week or less. In addition, Darwin Core Archives have enabled us to index very large datasets that simply could not be harvested using the federated protocols.
Thus, since the Darwin Core Archive standard was adopted, GBIF has seen a significant increase in the number of data records published through the network, with a 50% increase in 2011 alone.
A second significant challenge to effective delivery of biodiversity data in a federated network concerns the quality of the geospatial properties of records.
This map shows raw data, as harvested from data providers, that is asserted to originate in the United States. Note the mirror image of the United States over India and China, caused by a missing minus sign in the longitude values.
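The kind of check that catches these mirrored records can be sketched simply. This is an illustration only, using an approximate bounding box for conterminous-US longitudes; real interpretation pipelines use proper country geometries.

```python
# Approximate western-hemisphere longitude range for the
# conterminous United States (illustrative assumption).
US_LON_RANGE = (-125.0, -66.0)


def plausible_us_longitude(lon):
    """True if the longitude could fall inside the conterminous US."""
    return US_LON_RANGE[0] <= lon <= US_LON_RANGE[1]


def repair_sign(lon):
    """If a record's longitude is implausible but its mirror image is
    plausible, assume a dropped minus sign and flip it."""
    if not plausible_us_longitude(lon) and plausible_us_longitude(-lon):
        return -lon
    return lon
```

A record asserted to be in the United States but carrying a longitude of 77.0 (over India) would be repaired to -77.0, while an already-plausible value is left untouched.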
This is how the data look after improved interpretation methods have been applied: international waters and offshore islands are now recognised.
Providing taxonomic access to biodiversity data is a key requirement for many users. Both DarwinCore and ABCD allow data publishers to include the Linnean classification of the referenced species within the data record. In a federated network, the result is that the same taxon may be classified in different ways. Not only does this complicate the assembly of a common taxonomic backbone for organising indexed data, it also complicates distinguishing actual homonyms: cases where the same name has been applied to two different taxa. In addition, scientific names are often misspelled, and even a correctly spelled name may exist in many different orthographies.
GBIF assembles a taxonomic backbone from taxonomic sources that are more authoritative than the classifications included with collections data. These sources derive from new capacities within the GBIF network that enable species information to be published in the same manner as collections (species-occurrence) data. The backbone, once assembled from a mix of authoritative and collections-based classifications, is now composed entirely of published taxonomic catalogue data.
An example of how this impacts data organisation and delivery is illustrated in the map above. A European bird species whose name does not occur in the Catalogue of Life was mistakenly placed within the hummingbirds (a New World group) on the basis of classification information tied to some of its specimens. The result is the map above, where one erroneous species grouping distorts the map for the entire family.
With access to a wider array of authoritative taxonomic sources, we are able to match more taxa using more reliable sources and improve the taxonomic backbone used to organise all species data records.
This improved taxonomic reconciliation extends to the resolution of homonyms, names for different taxa that are spelled alike. Relying solely on the taxonomic information within occurrence-data sources yields a confusing array of possible homonyms; relying on taxonomic authority files instead reveals that exactly two genera bear the name in question, and supplies common names to help distinguish them.
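The lookup this describes can be sketched as a match against authority entries keyed by both name and higher classification. The tiny in-memory authority file below is an illustrative stand-in, using the well-known genus Oenanthe, which exists both as a plant genus (water dropworts) and a bird genus (wheatears):

```python
# Illustrative stand-in for a taxonomic authority file: each entry
# carries the context needed to tell homonymous genera apart.
AUTHORITY = [
    {"genus": "Oenanthe", "kingdom": "Plantae",  "common": "water dropworts"},
    {"genus": "Oenanthe", "kingdom": "Animalia", "common": "wheatears"},
]


def resolve_homonym(genus, kingdom):
    """Return the single authority entry matching genus and kingdom,
    or None if the combination is unknown or ambiguous."""
    hits = [e for e in AUTHORITY
            if e["genus"] == genus and e["kingdom"] == kingdom]
    return hits[0] if len(hits) == 1 else None
```

The point is the design choice: the genus string alone is ambiguous, but pairing it with even one higher rank from an authority file makes the match deterministic.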
Lastly, informatics improvements complement the addition of authoritative taxonomic sources by providing better methods for matching names to authority files. GBIF's name-parsing service parses names into recognised component parts and builds canonical representations, allowing different forms of the same name to be matched to authority-file information.
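The idea behind canonicalisation can be shown with a deliberately crude sketch. This is not the GBIF name parser, which handles vastly more cases (authorship, ranks, hybrids, brackets); it only illustrates why stripping authorship lets variant spellings of a name converge on one key:

```python
def canonical_name(name):
    """Very rough canonical form: keep the capitalised genus plus the
    first lowercase epithet, dropping authorship and anything after.

    A real name parser handles infraspecific ranks, hybrid markers,
    bracketed authors, and many other cases this sketch ignores.
    """
    parts = name.split()
    if len(parts) >= 2 and parts[0][:1].isupper() and parts[1].islower():
        return f"{parts[0]} {parts[1]}"
    return parts[0] if parts else name
```

With this, "Abies alba Mill." and "Abies alba Miller" both reduce to the canonical "Abies alba" and can be matched to the same authority-file entry.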