Talk presented at iEvoBio 2014 conference in Raleigh, North Carolina. Though there's a similar title and overlap with the talk I posted last week, there is new material here especially geared towards an informatics crowd savvy in the tools and technology.
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
1. Frontiers of discovery with
Encyclopedia of Life
TraitBank research and other case studies
Cyndy Parr
Smithsonian Institution National Museum of Natural History
parrc@si.edu @cydparr http://www.slideshare.net/csparr
2. Central challenges
• What are all the organisms on the planet?
• What do we know about them?
• How can we build new knowledge about
them?
3. GenBank
60 million DNA sequence records
900,000 species
4,000 genomes
How are these related to traits?
4. Phenomes: the next
frontier
In Phenoscape
57 publications had 565,158 anatomical trait
descriptions for 2,527 kinds of organisms
= 223 traits/organism
In ZFIN
38,189 trait descriptions for 4,727 genes for
Zebrafish
1.9 million species on the planet
= LOTS OF TRAITS
5. • How is EOL different
• How EOL gets used
• Introducing TraitBank
• Loading up TraitBank
• EOL & TraitBank in research
• Future of TraitBank
Outline
13. Anatolia Zooarchaeology Case Study led by
Alexandria Archive Institute
1. 14 different sites
2. 34+ zooarchaeologists
3. Decoding, cleanup, metadata documentation
4. 220,000+ specimens
5. 450 entities linked to 143 EOL taxon concepts
6. Anatomical entities linked to Uberon.org
7. Biometrics linked to measurement ontology
8. Collaborative analysis
http://opencontext.org/
Kansa, E., Kansa, S. W., & Arbuckle, B. (2014). Publishing and Pushing:
Mixing Models for Communicating Research Data in Archaeology.
International Journal for Digital Curation, 9.
14. Page, R. D. M. (2013). BioNames: linking taxonomy,
texts, and trees. PeerJ, 1, e190. doi:10.7717/peerj.190
BioNames.org
Rod Page
16. Search & Download
Data Sources
Data Summaries on
EOL Taxon Pages
Which plants grow well in
acidic soil?
What do water bears eat?
What is the biggest
species of whale?
Structured Data
TraitBank
JSON-LD API
17. • Numeric data
(measurements)
• Categorical data
(controlled vocabulary)
• Species interactions
• Mostly summaries for
populations, species
• Individual specimens
• Higher taxa
http://eol.org/traitbank
released January 2014
26. TraitBank Uploading Darwin Core
Archives
Common names | Taxa | References | MeasurementsOrFacts | Associations | Events |
Occurrences
27. Term URIs from existing ontologies
bioportal.bioontologies.org
Subject Area | Ontology | Example terms
Statistics | Semanticscience Integrated Ontology (SIO) | mean, minimal value, standard deviation
Units of measure | Units of Measurement Ontology (UO) | meter, years, degree Celsius
Habitat information | Environments Ontology (EnvO) | wetland, desert, snow field
Attributes of organisms | Phenotype Quality Ontology (PATO) | aerobic, conical, evergreen
Plant attributes | Plant Trait Ontology | flower color, life cycle habit, salt tolerance
Animal attributes | Vertebrate Trait Ontology | body mass, total life span, onset of fertility
Animal natural history | Animal Natural History and Life History Ontology (ETHAN) | nocturnal, oviparous, scavenger
28. Term URIs from existing
ontologies
•Where necessary: request terms
•Last resort: create provisional terms with
http://eol.org/schema/terms/xxxx
•Still to do
• create “equivalentTo” or “similarTo” relations
• even more fancy inference
30. TraitBank data sources
Sources include:
Databases
(OBIS, AnAge, Paleodb, Phenoscape)
Literature
(Dryad, Ecological Archives, Data tables)
Natural History Collections
(Label data)
Legacy/unpublished data
Loading up TraitBank
32. Text mining
Environments-EOL
Evangelos Pafilis, Hellenic Centre for Marine Research (HCMR), Institute of
Marine Biology, Biotechnology and Aquaculture (IMBBC), Crete, Greece
491,616 habitat terms for 136,548 taxa
34. Morphological Data from NMNH catalog
Abi Nishimura
Project: Clean-up morphological data from
NMNH KE-Emu catalog and publish to
TraitBank
Goal: Make it easier to access and analyze
this valuable morphological data
Sakurai Midori,
http://eol.org/data_objects/26918624
Raw data from Spectral Tarsier Tarsius tarsier
database search
35. RESULTS
•Primate data published (320 taxa)
•Comprehensive mammals data to be
published soon (4662 taxa)
•Bird catalog currently being mined
Wan Hong, http://eol.org/data_objects/29203274
36. Mineralization of tissue in
marine organisms
Jen Hammock with Steve Cairns
For modeling impacts of ocean acidification
143,000 records for 119,000 species and subspecies of Micro- and Macroalgae,
Cnidaria, Polychaetes, Bryozoans, Brachiopods, Sponges, Mollusks,
Echinoderms and Arthropods
Mineralized tissue =
●Biogenic silica
●Calcium carbonate
○ Calcite
○ and/or Aragonite
37.
38. 2013-14 EOL Rubenstein Fellows
EOL & TraitBank research
1. EnvO habitat terms (Pafilis et al.)
2. Altitude Specificity of Flower Coloration (Wright & Seltmann)
3. Morphological impacts of extinction risk in fish (Chang)
4. Butterfly-host plant associations (Ferrer-Parris et al.)
5. Taxon Tree Tool (Lin)
6. Global Biotic Interactions (GLoBI, Poelen & Mungall et al)
http://www.globalbioticinteractions.org/
7. Reol: An R interface for EOL (Banbury, O’Meara)
Banbury, B. L., & O’Meara, B. C. (2014). Ecology and Evolution, 4(12).
doi:10.1002/ece3.1109
41. 1. Character displacement across the Tree of Life
2. Illuminating the Dark Parts of the Tree of Life
3. Evolution in the usage of anatomical concepts in the biodiversity
literature
4. Planning for global change: using species interactions in
conservation
5. No place like home: Defining “habitat” for biodiversity science
6. Assessing risk status of Mexican amphibians
7. Quantifying color from digital imagery: color may determine
species’ responses to habitat edges and to climate change
8. More is less - Identifying global trends in species’ niche width
9. Identifying key species traits associated with climate change
vulnerability
NESCent-EOL-BHL Research
Sprint
42. Quantifying color from digital imagery
1. Automate processing of almost 300k images (of EOL’s 2.4 million)
2. Identify pinned specimen images
3. Process these for color and pattern information
4. Put this info into TraitBank
Elise Larsen, Yan Wong
43. Illuminating the Dark Parts of the Tree of
Life
Jessica Oswald, Karen Cranston, Gordon Burleigh, Cyndy Parr
1. Query EOL, GBIF,
GenBank for # records
2. Create score for amount
of information available
3. Map score to phylogeny
44. Global Genome Initiative Data Portal
For every family:
•Use TraitBank to assemble counts of records in repositories
•Compute a score (percentile) to assess knowledge available relative
to other families
•Make it easy to browse to find families that require effort
Beta launch end of June
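A minimal sketch of the per-family score, in Python. The slide does not say how the percentile is defined, so this assumes the simplest reading: the percentage of families whose record count is less than or equal to this family's count. The function name and the family names and counts in the usage note are hypothetical.

```python
def percentile_scores(record_counts):
    """Map each family to a 0-100 percentile score.

    Assumption: a family's score is the percentage of all families whose
    record count is less than or equal to its own. Low scores flag families
    with comparatively little information available.
    """
    counts = list(record_counts.values())
    n = len(counts)
    if n == 0:
        return {}
    scores = {}
    for family, count in record_counts.items():
        at_or_below = sum(1 for c in counts if c <= count)
        scores[family] = 100.0 * at_or_below / n
    return scores


def families_needing_effort(record_counts, threshold=25.0):
    """Return families at or below the threshold score, lowest score first."""
    scores = percentile_scores(record_counts)
    flagged = [f for f, s in scores.items() if s <= threshold]
    return sorted(flagged, key=lambda f: scores[f])
```

With made-up counts such as `{"Family A": 0, "Family B": 10, "Family C": 100, "Family D": 10}`, `families_needing_effort` would flag only `Family A`. A rank-based score is used here because record counts tend to be highly skewed, but any monotone scoring would fit the slide's description equally well.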
45. • Decorate trees with traits
• NSF Genealogy of Life
• NSF Big Data
• NSF ABI Isotopes and Interactions
• Microsoft/WCMC Global Ecosystem Models
TraitBank future plans
46. Leveraging social networks
Ahn, J., et al. (2012). Visually Exploring Social Participation in Encyclopedia of
Life. In 2012 International Conference on Social Informatics (pp. 149–156). IEEE.
Rotman, D., et al. (2014). Motivations affecting initial and long-term participation in
citizen science projects in three countries. In iConference 2014 Proceedings (pp.
110-124).
http://biotracker.umd.edu
• motivation model for citizen scientists
• international attitudes of scientists and
citizens to working together
• factors that increase curation network
activity
• currently working on motivations of EOL
content partners
47. Annotation of a specimen record
Ovary size and reproductive state
Age markers
Fat status
Body mass and other size
attributes
49. For more information
• See & cite Parr, et al. 2014 Biodiv. Data Journal
• See our TraitBank paper (in review)
http://www.semantic-web-journal.net/content/traitbank-practical
• Open source code https://github.com/EOL/
• APIs at http://eol.org/api
• Become an EOL Curator
50. Take home messages
• EOL can be useful for research
• TraitBank is already awesome
• Mutualism between collections,
EOL, citizen science
• Let’s collaborate
51. Atlas of Living Australia • Biodiversity Heritage Library Consortium • Chinese
Academy of Sciences • La Comisión Nacional para el Conocimiento y Uso de la
Biodiversidad (CONABIO) • The Field Museum • Harvard University • El Instituto
Nacional de Biodiversidad (INBio) • Marine Biological Laboratory • Missouri
Botanical Garden • Muséum National d’histoire Naturelle • Naturalis Netherlands
• New Library of Alexandria • Smithsonian Institution • South African National
Biodiversity Institute • All of our content providers and curators
Steve Cairns • John Keltner • Katie Barker • Jonathan Coddington • Sean Brady •
Tom Orrell • Chris Meyers • Yan Wong • Jon Norenburg • Torsten Dikow • Yurong
He • Jenny Preece and others on BioTracker team • Pensoft Publishing • EOL
Science Advisory Board
Katja Schulz, Jen Hammock, Marie Studer, Jeff Holmes, Nathan Wilson, Patrick
Leary, Jeremy Rice, Lisa Walley, Bob Corrigan, Erick Mata, Dmitry Mozzherin, Abi
Nishimura • Sarah Miller • Anthony Goddard, Mark Westneat and former BioSynC
staff
http://eol.org @cydparr parrc@si.edu
Major Funding for TraitBank provided by the Alfred P. Sloan
Foundation. Fellows program supported by Daniel M.
Rubenstein, Research sprint by Richard Lounsbery Foundation.
52. 1. Terms are not in any existing ontology
e.g., seawater oxygen saturation, eutrophic pond, north-facing bluff
2. Synonyms are not included
e.g., vernal pond/intermittent pond
3. Standard classifications should be mapped
e.g., NatureServe, NOAA
4. Environment estimates vs. well-documented niche
parameters
e.g., text mining results vs. NatureServe habitats, OBIS data vs. niche analyses
Challenges
53. 14 datasets with 25k taxa, 422k interactions, for 3k locations
alpha version of ingestion, normalization, aggregation
alpha version of web API
alpha version of data exports
GLoBI http://globalbioticinteractions.wordpress.com/
Jorrit Poelen, Chris Mungall, James Simon GoMexSi
Editor's Notes
Thank you for inviting me to give a keynote. Some of you may remember that last year I gave an iEvoBio lightning talk and a preview of TraitBank in the software bazaar. It is really great to have an extended time slot to go into more detail now that we have launched TraitBank. I’m going to do a whirlwind tour here, but I will be putting the talk up on SlideShare so you can get all the URLs.
These are central challenges for the life sciences. We know there are at least 1.9 million species. Estimates of course vary for how many are still left to describe, but considering the ones we know already, we want to learn as much as we can about them. There are many reasons we want to know about these organisms: applied reasons, such as why and how some of them are pathogens or pests, or for ecosystem services, or just because we strive to understand the processes of ecology, evolution, and development. Individuals and groups of scientists are trying to learn everything we can about not only a few model organisms but the vast riches of biological diversity. EOL has so far focused most on the second question: gathering together as much as possible about what we know.
This talk is really focussed on the third challenge – specifically how EOL can be part of the process of building new knowledge through research.
Before I talk about TraitBank, let’s think a little bit about GenBank. We are in the midst of a genomics revolution and most people would agree this has been a major advance in our ability to construct new knowledge about organisms. The cost to generate a full genome sequence is dropping more or less daily.
What is all this genetic information DOING? How does it relate to what we can see and measure about organisms, their phenotypes, or their traits? How does DNA interact with the environment to result in both normal and abnormal development? How did it evolve? How fast do DNA changes make a difference in the lives of organisms?
Last year I did some calculations. These may be a bit out of date but should still work for scale. Phenoscape is a database that is looking at anatomical traits in fishes. Looking just at 57 publications they have more than 500K descriptions for 2500 kinds of organisms.
ZFIN is a model organism database for zebrafish, a common model organism for developmental biologists. In just this one species they have captured nearly 40,000 traits – just for ONE very well-studied SPECIES
Lots of traits AND LOTS OF WAYS TO DESCRIBE THEM, spread across stovepiped projects, formats, and vocabularies.
The rate of accumulation of biodiversity knowledge over the ages has been impacted by major innovations. From the Linnaean system of taxonomy, to the microscope, to current tools for high throughput genomics, each set of technological advances ushers in a new era of exploration. The next frontier is arguably phenomics, the study at large scales of organismal phenotypes -- or more generally traits or attributes. The Encyclopedia of Life's contribution to this revolution is TraitBank -- a triple-store-based repository of structured descriptive information about all biodiversity. Leveraging its names-based infrastructure and tools for aggregation and curation, EOL now harvests, semantically annotates, and serves numeric, controlled vocabulary, and ecological relationship data across the tree of life and across biological disciplines. We review different kinds of workflows that populate TraitBank with information (over 6 million records so far) sourced originally from the literature or from disparate databases or from text mining. A centralized resource like TraitBank promises to enable many kinds of research, but here we will discuss its potential for large scale evolutionary informatics.
Several case studies illustrate TraitBank's power. A couple of efforts are analyzing the gaps in genomic and other knowledge across the tree of life. Others are finding evidence for character displacement at larger scales than ever before. There is great potential for explaining character evolution or assisting the search for the genetic bases for important phenotypic development. In all cases, analyses involve combining TraitBank data with data from complementary resources such as the Global Biodiversity Information Facility, GenBank, ITIS, or the Open Tree of Life. All of these cases are works in progress -- some of them were initiated at a Research Sprint held in February 2014 at the National Evolutionary Synthesis Center. They share certain challenges of aligning identifiers, navigating multiple methods of accessing data, and the need to visualize large amounts of data.
The future of phenomics likely parallels the genomics and bioinformatics revolutions. There will be advances in both automated technology and in distributed crowd-sourcing for capturing and curating descriptive information. Our initial research in this direction, in collaboration with University of Maryland, suggests that many challenges remain. However, our ability to incorporate this descriptive information in ever-larger and more sophisticated comparative analysis and modeling exercises should feed back to generate demand for even more advances in descriptive technology.
started 2008
text, media, literature
all species, genera, etc.
names infrastructure
data curation
2.6 million images
1.3 million taxa with content
Over 5 million visitors/year
75,000 registered members
AND multi-lingual – latest global partner is the French National Museum of Natural History
We have a working infrastructure as well as more than 200 partners,
We harvest and sort text and multimedia by topic and by species and put it on our pages. Curation + user-added content from the crowds is added to the mix. This is fed back to providers, giving them traffic, quality control on their own content, and new content for them to use
And, we are already seeing spinoff products. We make it easy for developers, and everything is either public domain or CC-licensed so it can be re-used.
General reference by the public, people listen to our podcasts, cited in Wikipedia, links from OneTree from James Rosindell, Field Guides, Notes from Nature
Games
James Rosindell, Luke Harmon, Yan Wong and others: OneZoom
Photomosaic from all the descendants of a particular mammal ancestor, in the shape of Shrewdinger, a reconstruction of that ancestral mammal by
.
TraitBank data are ingested, standardized as much as possible using ontologies, and managed as triples in a Virtuoso triple store. The trait information for each taxon is displayed in a Data tab on EOL taxon pages. There is search and download and a JSON-LD interface.
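To make the idea of "traits as triples" concrete, here is a minimal Python sketch of one trait record written as a JSON-LD-style dictionary and then flattened into subject, predicate, object triples. It is an illustration only, not TraitBank's actual schema or API: the Darwin Core and Dublin Core namespaces are real, but the helper names and every `example.org` URI are hypothetical placeholders.

```python
import json


def make_trait_record(taxon_uri, trait_uri, value, unit_uri=None, source=None):
    """Build a JSON-LD-style dict for one measurement about one taxon.

    Assumption: Darwin Core MeasurementOrFact terms are used as keys;
    this layout is illustrative, not the structure TraitBank serves.
    """
    record = {
        "@context": {
            "dwc": "http://rs.tdwg.org/dwc/terms/",
            "dcterms": "http://purl.org/dc/terms/",
        },
        "@id": taxon_uri,
        "dwc:measurementType": {"@id": trait_uri},
        "dwc:measurementValue": value,
    }
    if unit_uri is not None:
        record["dwc:measurementUnit"] = {"@id": unit_uri}
    if source is not None:
        record["dcterms:source"] = source
    return record


def as_triples(record):
    """Flatten the record into (subject, predicate, object) tuples."""
    subject = record["@id"]
    triples = []
    for key, val in record.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        obj = val["@id"] if isinstance(val, dict) else val
        triples.append((subject, key, obj))
    return triples


if __name__ == "__main__":
    # All URIs and the value below are made-up placeholders.
    rec = make_trait_record(
        "http://example.org/taxon/1",
        "http://example.org/term/body_length",
        25.0,
        unit_uri="http://example.org/unit/meter",
    )
    print(json.dumps(rec, indent=2))
    for triple in as_triples(rec):
        print(triple)
```

Keeping the trait, unit, and source as URIs rather than free text is what lets records from different providers be grouped and compared once they share ontology terms.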
This January we released the first version of our TraitBank platform. The scope is on the slide. The example is a chart made by ordering the body lengths of cetaceans from largest to smallest, so the biggest here is the blue whale, and we’re showing one value for a beaked whale. This visualization is just a prototype; we haven’t released it yet. To find TraitBank, you can go to this URL or look in the footer of every EOL page.
You can also go to almost any page on EOL with content and you will see a quick facts box on the overview tab. This is a selection of traits. These particular traits, things like growth habit, invasive listings, and habitat keywords, are all controlled vocabulary terms. If you go to “See all” or to the Data tab you’ll see..
We also have numeric data with units and life stage identifiers. If you hover over a term you can get its definition. Different sources may provide the same or similar data (PanTHERIA and AnAge provide body mass and weight). We try not to lose any information but group things together (point to tabs) so you can see related measurements.
Each record is annotated with rich metadata, including provenance, citation, information about methods etc.
TraitBank has a search interface, which is currently limited to queries for a single trait, with the option to restrict the search to a particular clade or a range of values.
Search results can be explored through the EOL interface or downloaded in a csv file that contains all metadata for a given record, including the uri mappings to ontologies.
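Once a result set has been downloaded, the same kind of single-trait, range-restricted query can be repeated offline. The Python sketch below assumes a simplified CSV with `taxon`, `trait`, `value`, and `unit` columns; those column names and the sample rows are invented for illustration and will differ from the real download, which also carries the metadata and URI columns described above.

```python
import csv
import io


def filter_records(csv_text, trait, min_value=None, max_value=None):
    """Return rows for one trait, optionally restricted to a value range.

    Assumption: the CSV has 'trait' and 'value' columns and the values
    for the requested trait are numeric.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["trait"] != trait:
            continue
        value = float(row["value"])
        if min_value is not None and value < min_value:
            continue
        if max_value is not None and value > max_value:
            continue
        rows.append(row)
    return rows


# Made-up sample rows, not real TraitBank data.
SAMPLE = """taxon,trait,value,unit
Taxon A,body length,30.0,m
Taxon B,body length,5.5,m
Taxon C,body mass,1200,kg
"""

if __name__ == "__main__":
    for row in filter_records(SAMPLE, "body length", min_value=10):
        print(row["taxon"], row["value"], row["unit"])
```

Rows for other traits are skipped before any numeric conversion, so categorical values elsewhere in the file do not cause errors.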
Unless an extension is noted, all terms are from Darwin Core. Not all properties are indicated.
Taxon is the core concept, and a taxon can have another taxon as a parent. This captures the biological hierarchy information. Taxa can have media associated with them that can be text, or images, or sounds, etc. We use a media extension to Darwin Core that draws from Audubon Core. They can also have common names and we use a vernacular name extension for that. Media objects and taxa can both have bibliographic references as well as agents, such as authors, suppliers, photographers, etc.
The media, common name, reference and agents
Using JSON-LD
So that web crawlers and other machines and programmers can easily find all the data
TraitBank data sources include databases like..., data from the literature deposited into repositories like Dryad, EA or Pangaea, label data from Natural History Collections and legacy or unpublished data from individual researchers or projects.
TraitBank currently holds data from X different data sets which provide about 7 million records for 326 traits across over 300,000 taxa.
Here’s an overview of all the traits we have, weighted by the number of records represented in our triple store.
TraitBank covers all kinds of traits, including geographic distribution, morphology, ecology, and life history, but we currently have a focus on marine data and data of use to conservation science.
We get most of these data from the Environments-EOL project, led by Vangelis Pafilis, which mines the taxon descriptions in the EOL text collection for environment descriptive terms and maps them to the vocabulary of the ENVO ontology.
Currently Environments-EOL provides us with almost half a million habitat keywords for over 100,000 taxa.
Ongoing project, but here’s an example species to show why it’s important:
Spectral tarsier– a small, nocturnal, insectivorous primate, found in Indonesia
This is some of what you’d see if you search Spectral Tarsier in NMNH database and download results.
PROBLEM WITH THIS FORMAT: Measurements all in one column. To study tail length, you’d have to manually extract that one measurement– painstaking and slow.
We cleaned up the data: separated the measurements and combined them with contextual “meta-data”
We uploaded the sorted data to TraitBank.
The “Spectral Tarsier” EOL page now contains organized, accessible NMNH measurement data (green arrows)
You can see that it would now be easier to study a measurement like tail length.
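The clean-up step can be sketched in Python as follows. The actual layout of the NMNH measurement column is not shown in these notes, so the sketch assumes a hypothetical format of semicolon-separated `name: value unit` entries; the pattern, the function name, and the example values are illustrative only and are not the workflow that was actually used.

```python
import re

# Assumed entry format: "<name>: <number> <unit>", e.g. "Tail length: 120 mm".
ENTRY = re.compile(
    r"\s*(?P<name>[^:;]+?)\s*:\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)\s*"
)


def split_measurements(cell):
    """Split one combined measurement cell into separate records.

    Entries that do not match the assumed format are skipped rather
    than guessed at.
    """
    records = []
    for part in cell.split(";"):
        match = ENTRY.fullmatch(part)
        if match is None:
            continue
        records.append(
            {
                "measurement": match.group("name"),
                "value": float(match.group("value")),
                "unit": match.group("unit"),
            }
        )
    return records


if __name__ == "__main__":
    # Made-up values, not taken from the NMNH catalog.
    cell = "Total length: 230 mm; Tail length: 120 mm; Weight: 110 g"
    for record in split_measurements(cell):
        print(record)
```

Once each measurement is its own record with a name, value, and unit, a question like "what is the tail length?" becomes a simple filter instead of a manual read through every cell.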
Note that many morphological categories ONLY have data from NMNH; it’s clearly important to make the database measurements accessible.
Data come from Steve Cairns and the literature so far
Here’s a scenario
Most at species and subspecies level
Controlled vocabulary data,
annotated w/verbal modifiers (eg: “High Mg C”, “inferred from Superfamily”)
CHEME/
Saturation horizons with respect to all mineral phases are migrating toward the surface, potentially risking the survival of calcifiers in the neritic, shelf, and slope environment... Orr et al. (2005) predicted that by 2100 the Southern and the Arctic Oceans could be undersaturated with respect to aragonite, and then calcite would follow in ~50–100 years. This has also major implications for calcifying taxa at those latitudes.
(http://www.esajournals.org/doi/full/10.1890/09-0553.1)
If you went to Trevor Price’s talk last night, you know that we are still trying to understand the evolution of color and how it might be impacted by the environment – Altitude specificity
How about some more focused work with EOL and traits. Next I’m going to give a few case studies where biologists really want to know the answers to some biological questions and are using TraitBank’s data and aggregation & integration & to be quite honest, people power to answer the questions.
Some of you may recall the Rubenstein program – in the last year of our Rubenstein program, we have funded projects that aim to lay the groundwork for biological research using EOL.
Use BHL or EOL and other sources to tackle biological questions
Matched each awardee with informatics expert
4-7 February 2014, Durham, NC
organized by Cynthia Parr and Craig McClain
Funded in part by Richard Lounsbery Foundation
Some of these are conservation oriented research, e.g. 1, 6, and 9
Other topics are more basic evolutionary biology or ecology research or
We have focused so far on being the species-based repository for aggregating and integrating the information, not providing analysis tools but providing general access to it, which then can be served and repurposed for various other projects.
EOL has also been a platform for social science research.
iDigBio
Also, BioCubes
It is early days yet
Major funding for the development of TraitBank was provided by the Alfred P. Sloan Foundation, with additional support from our global partner institutions.
These are in addition to people that I called out earlier in the slides, and I’ve probably forgotten many
Another issue we are dealing with is expressing the value of environmental data as descriptors of an organism’s niche. For example, right now we use a very general “Habitat” concept to cover the Environments-EOL text mining results as well as NatureServe habitat key words which are based on thorough documentation of habitat use.