This document discusses making phylogenetic data more accessible to non-specialists. It describes current barriers, including technical obstacles in data standards and social obstacles such as data hoarding. The National Evolutionary Synthesis Center (NESCent) aims to address these issues through various initiatives. These include developing ontologies, databases, and software to integrate phylogenetic and phenotypic data, as well as promoting open development practices.
1. Prospects for enabling phylogenetically informed comparative biology on the web
Todd Vision (1,2) & Hilmar Lapp (1)
(1) U.S. National Evolutionary Synthesis Center
(2) Dept. of Biology, University of North Carolina at Chapel Hill
Suppose you have the sequence of a protein-coding gene, and are interested in its function. What is the first thing you would do?
• If it were me, I would search for conserved domains that match records in Pfam and other protein domain databases.
• Are these databases complete?
• Are they infallible?
• Are they still useful?
Why are these data useful?
• You needn’t have mastery of the specialist literature before the search
• A match connects you to a vast interconnected world of information
• Why not worry about completeness?
  - A negative result is not expensive
  - Many broadly useful records are already present
• Why not worry about fallibility?
  - The user can weigh the evidence once a match is found
  - Assertions should be exposed to scrutiny
2. Some observations
• This infrastructure is designed to disseminate data to non-specialists
• The relevant data may be derived from multiple “studies”, not all of which are published
• Data is hoarded neither by the researcher nor by the domain database
• The search service is as widely disseminated as the data
• Semantic-level machine-to-machine communication facilitates human comprehension
The case of phylogenetic data
• There is a broad audience for phylogenetic data
  - Organismal phylogeny (e.g. Encyclopedia of Life)
  - Gene/protein trees
• Many of the available resources are geared toward specialist researchers & students
• Non-specialists turn to taxonomic classifications when they need organismal phylogenetic information
• Few know where to find gene/protein trees at all
TreeBase (screenshot)
Tree of Life Web Project (screenshot)
3. The NCBI taxonomy
• Provides
  - A hierarchy for all species represented by DNA sequences in GenBank
  - Names and IDs for internal nodes
  - An FTP dump
• But does NOT
  - Include unsequenced species
  - Report confidence in topology or monophyly
  - Capture taxonomic nuance (though it does have synonyms & common names)
What if the NCBI taxonomy…
• Listed all taxa, including fossils?
• Allowed one to assess where there are conflicting topologies?
• Reported support values for clades?
• Reported divergence time estimates for nodes (e.g. from TimeTree)?
• Reported the provenance of the data?
Node-oriented web services from the Tree of Life Web Project
• Name
• Description
• Authority
• Date
• Other names
• Completeness of children
• Extinction status
• Confidence of position
• Monophyly
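The node attributes listed for the Tree of Life Web Project services can be pictured as one record per tree node. The following is a minimal sketch of such a record, not the actual web service schema: the class name, field names, field types, and the toy clades are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeNode:
    """One tree node carrying the metadata fields named on the slide.

    Field names and types are hypothetical; the real service may differ.
    """
    name: str
    description: str = ""
    authority: Optional[str] = None
    date: Optional[int] = None
    other_names: List[str] = field(default_factory=list)
    children_complete: bool = False   # "Completeness of children"
    extinct: bool = False             # "Extinction status"
    position_confident: bool = True   # "Confidence of position"
    monophyly_certain: bool = True    # "Monophyly"
    children: List["TreeNode"] = field(default_factory=list)


def uncertain_nodes(root: TreeNode) -> List[str]:
    """Return sorted names of nodes flagged as uncertain in position or monophyly."""
    flagged = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not (node.position_confident and node.monophyly_certain):
            flagged.append(node.name)
        stack.extend(node.children)
    return sorted(flagged)


# Toy data, invented for illustration only.
root = TreeNode("Clade A", children=[
    TreeNode("Clade B", monophyly_certain=False),
    TreeNode("Clade C", extinct=True, position_confident=False),
    TreeNode("Clade D"),
])
print(uncertain_nodes(root))
```

Carrying confidence and monophyly flags on each node is what would let a non-specialist client highlight the parts of a tree that deserve scrutiny, which a bare taxonomic hierarchy cannot do.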
4. Further barriers to dissemination of phylogenetic information
• Technical obstacles
  - Technology for storing and querying trees
  - Difficulties with exchange standards
  - Inference of consensus trees and supertrees
  - Taxonomic intelligence
  - Globally unique identifiers
• Social obstacles
  - Reluctance to provide incomplete or fallible information
Outline
• Informatics @ NESCent
• An example of a phylogenetically-informed semantic web application for phenotype data
• Promoting interoperability and closing technical gaps in phyloinformatics through open development
NESCent sponsored science
• Catalysis Meetings (large, one-time events)
  - To foster new collaborations and synthetic research
• Working Groups
  - Smaller, focused, multiple meetings
• Sabbatical Scholars
• Postdoctoral fellows
• Short-term visitor program
  - 2 weeks to 3 months
  - Encourage collaborative projects
• Application info: http://www.nescent.org
5. Evolutionary Informatics WG
• Organizers: Arlin Stoltzfus and Rutger Vos
• Selected goals:
  - XML serialization of NEXUS
  - Formal grammar for validation and interconversion of NEXUS & other formats
  - A transition model language for evolutionary models used in statistical inference
  - An ontology for evolutionary comparative data analysis
• http://www.nescent.org/wg_evoinfo
NESCent Informatics
• Support for sponsored science and scientists
  - Facilitating electronic collaboration
  - Software/database development
  - Providing HPC and other IT infrastructure
• Cyberinfrastructure for synthetic science
  - Data sharing
  - Software interoperability
  - Training
  - In partnership with major national and international efforts
GeoPhyloBuilder
“Putting the geography into phylogeography”
David Kidd & Xianhua Liu
• Extension for ArcGIS Software that creates a spatiotemporal GIS network model from a tree with georeferenced nodes.
• 3D visualizations are possible through ArcSCENE.
• http://www.nescent.org/informatics/software.php
Phylogenetic cyberinfrastructure to enable comparative biology
• Two traditions in the recording of phenotype data
  - Natural language descriptions and character matrices
  - Statements made using anatomical and trait ontologies, designed to capitalize on the semantic web
• NESCent WG on morphological evolution in fish
  - Organized by Paula Mabee and Monte Westerfield
  - Led to a larger project
• Aim is to integrate
  - Mutant phenotype data for zebrafish
  - Comparative morphology data for the Ostariophysi
6. Ontologies
• Defined terms with defined relationships
  - e.g. Gene Ontology, Cell Ontology
[Diagram: example ontology graph in which the terms cell, membrane, projection, axolemma and axon are linked by is_a and part_of relations, e.g. axolemma part_of axon]
Describing phenotypes using ontologies
• Entity-Quality system (EQ)
• Entity term from an anatomy ontology
  - zebrafish anatomy, cell ontology, etc.
• Quality term from Phenotype and Trait Ontology (PATO)
• e.g. Entity=dorsal fin, Shape=round
Phenotype and Trait Ontology (PATO)
[Diagram: excerpt of the PATO hierarchy, including the terms physical quality, optical quality, chromatic property, buoyancy, color, amplitude, blue, green, bright blue and dark blue]
Evolutionary character matrices
• Common phenotypic data format in evolutionary biology (e.g. NEXUS)
• Characters + character states, similar to EQ
                 dorsal fin shape   character 2
Species one      round              state
Species two      pointed            state
Species three    undulate           state
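The similarity between a character matrix and EQ statements can be made concrete with a small conversion. This is a minimal sketch that uses plain strings in place of ontology term identifiers. The `EQStatement` type and `matrix_to_eq` function are hypothetical names, and only the "dorsal fin shape" column from the slide is used.

```python
from typing import Dict, List, NamedTuple, Tuple


class EQStatement(NamedTuple):
    taxon: str
    entity: str      # would be a term from an anatomy ontology
    attribute: str   # attribute and value would be PATO terms
    value: str


# Each matrix column is a character, split here into (entity, attribute).
characters: List[Tuple[str, str]] = [("dorsal fin", "shape")]

# Rows of the matrix: one list of character states per species.
matrix: Dict[str, List[str]] = {
    "Species one": ["round"],
    "Species two": ["pointed"],
    "Species three": ["undulate"],
}


def matrix_to_eq(characters, matrix) -> List[EQStatement]:
    """Turn every (species, character, state) cell into one EQ-style statement."""
    statements = []
    for taxon, states in matrix.items():
        for (entity, attribute), state in zip(characters, states):
            statements.append(EQStatement(taxon, entity, attribute, state))
    return statements


statements = matrix_to_eq(characters, matrix)
for s in statements:
    print(f"{s.taxon}: Entity={s.entity}, {s.attribute}={s.value}")
```

In a real pipeline the strings would be replaced by ontology term identifiers, which is what makes the statements comparable across studies and species.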
7. Character Matrix vs. EQ
• Character matrix: Character = dorsal fin shape; Character State = round
• Entity-Attribute-Value: Entity = dorsal fin (from AO); Attribute = shape, Value = round (from PATO)
• EQ: Entity = dorsal fin; Quality = attribute + value (shape: round)
A scenario
• A geneticist observes a reduction in the number of a particular bone type (e.g. branchiostegal ray) in a zebrafish mutant of her favorite gene.
• She asks: is this bone variable in number among species in nature?
• She could query the evolutionary phenotype database using:
  - Entity = Branchiostegal ray (from TAO)
  - Qualities pertaining to attribute ‘count’ (from PATO)
• She could examine a visualization of the phylogenetic relationships of the taxa with the relevant character changes mapped.
• She would see that most Ostariophysi have 3 rays, but that reduction has occurred multiple times:
  - solenostomids and syngnathids (ghost pipefishes and pipefishes)
  - giganturids
  - saccopharyngoid (gulper and swallower) eels
• By examining additional changes on these same branches, she sees several parallelisms:
  - loss of the swimbladder, pelvic fins, and scales
  - elongation of the mandibular or hyoid arches
  - reduction or loss of the opercle in syngnathids and saccopharyngoids
  - a variety of other bones and soft tissues are lost or greatly modified
• She might hypothesize that these trait correlations are all due to alterations in the expression of the same suite of morphogens.
• She can select appropriate species from these lineages to follow up experimentally.
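The geneticist's query in this scenario (entity = branchiostegal ray, qualities pertaining to 'count') can be sketched as a filter over EQ-style annotations. The annotation layout, the function names, the taxon labels, and every value below are invented placeholders; this is not the actual database schema or real data.

```python
from collections import defaultdict

# Hypothetical EQ-style annotations; taxa and values are placeholders.
annotations = [
    {"taxon": "Taxon A", "entity": "branchiostegal ray", "attribute": "count", "value": 3},
    {"taxon": "Taxon B", "entity": "branchiostegal ray", "attribute": "count", "value": 3},
    {"taxon": "Taxon C", "entity": "branchiostegal ray", "attribute": "count", "value": 1},
    {"taxon": "Taxon C", "entity": "opercle", "attribute": "size", "value": "reduced"},
]


def query(annotations, entity, attribute):
    """Select annotations for one anatomical entity and one quality attribute."""
    return [a for a in annotations
            if a["entity"] == entity and a["attribute"] == attribute]


def taxa_by_value(hits):
    """Group the taxa in the query result by the value they exhibit."""
    groups = defaultdict(list)
    for a in hits:
        groups[a["value"]].append(a["taxon"])
    return dict(groups)


hits = query(annotations, "branchiostegal ray", "count")
by_count = taxa_by_value(hits)
is_variable = len(by_count) > 1   # more than one distinct count among taxa

print(by_count)
print("variable among taxa:", is_variable)
```

Mapping the grouped values onto a phylogeny of the same taxa would be the next step toward the visualization described in the scenario.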
8. What data are needed to enable this scenario?
• Anatomy and trait ontologies
• Phenotypes in EQ syntax for
  - Zebrafish mutants (already exist)
  - Species/clades of Ostariophysi
• Phylogenetic relationships among the Ostariophysi
  - Taxonomy ontology
Some anatomical ontologies
• Amphibia
• C. elegans
• Fish (zebrafish, medaka, teleosts)
• Insects (Drosophila, Mosquito, Hymenoptera)
• Mammals (mouse, human)
• Plants (Arabidopsis, cereals, maize, all plants)
Phenotype Ontologies for Evolutionary Biology
[Diagram: project organization]
• Participants: NESCent (Vision, Lapp, Software Developers); U. Oregon (Westerfield); USD (Mabee, Data Curator); Tulane U. (Rios/Ontology Curator); morphology collaborators (Arratia, Coburn, Hilton, Lundberg, Mayden); NCBO; OBO (host of TAO, PATO, taxonomy ontology); ichthyology community (DeepFin, Fishbase)
• Components: curator interface, EQSYTE database, EQSYTE public interface, EQSYTE contents, applications (Phenote, OBO-Edit), ontologies (taxonomy, TAO, PATO, homology)
• Data: zebrafish phenotypic & genetic data, Ostariophysan phenotypic data
• Activities: usability testing, working groups, workshops, liaison to ZFIN, liaison to NCBO, liaison to CToL
Preserving published data for future integration efforts
• Sequence alignments (e.g. Treebase)
• Long-term population records (e.g. pedigrees)
• 2D and 3D images
• Collection and locality information
• Behavioral observations
• Numerical tables
• Etc.
• Most of these data are lost upon publication
• These are the stuff of comparative biology
9. Dryad: A digital repository for published data in evolutionary biology
NCSU Digital Library Initiative
Journals and societies involved so far
• American Naturalist (ASN)
• Evolution (SSE)
• Journal of Evolutionary Biology (ESEB)
• Integrative and Comparative Biology (SICB)
• Molecular Biology and Evolution (SMBE)
• Molecular Ecology
• Molecular Phylogenetics and Evolution
• Systematic Biology (SSB)
Open development
• Open source refers only to the licensing of the software code
• At NESCent, we have been experimenting with practices in open development
  - Community contributes to a shared code base
  - Higher barrier to entry
  - Can be a substantial payoff in terms of interoperability, functionality, usability, maintenance
  - Surprisingly rare in academia
2006 Phyloinformatics Hackathon
[Projects and organizations shown: ATV, NCL, NESCent, HyPhy, PAUP*, CIPRES, GARLI, TreeBase, Bio::CDAT, Biojava, BioSQL, JEBL, BioRuby, BioPerl, Biopython]
10. Hackathon mechanics
• Before the meeting
  - Participants and users suggested integrative workflows
• At the meeting
  - Gaps in existing toolkits were identified
  - Subgroups collaborated on high priority targets
  - Followed a “use case” model
  - Subgroups and targets were allowed to be fluid
  - Users were on hand to provide datasets, test code, provide their perspective
  - Dedicated participants tasked with documentation
• All code is open-source and deposited in established repositories
Accomplishments
• Sequence family evolution
  - BioPerl: Support for TribeMCL, QuickTree, ClustalW, Phylip, PAML
  - BioPerl & Biopython: Support for dN/dS-based tests for selection in HyPhy
  - Biojava: Parser for Phylip alignment format
  - BioRuby: Support for T-Coffee, MAFFT, and Phylip
• Reconciling trees
  - BioPerl: Support for NJTree
  - Biopython: Wrapper for Softparsmap
  - BioRuby: Model for phylogenetic trees and networks with graph algorithms
  - BioSQL: Model for phylogenetic trees and networks with optimization methods and topological queries
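To illustrate what a relational tree model with optimization and topological queries can look like, here is a minimal sketch using nested-set indexing in SQLite. Nested sets are one common way to speed up ancestor and descendant queries; the table layout, column names, and helper functions below are assumptions for illustration and are not the BioSQL schema.

```python
import sqlite3

# Toy rooted tree as child -> parent (None marks the root); names are invented.
parent = {"root": None, "n1": "root", "A": "n1", "B": "n1", "C": "root"}


def nested_set_index(parent):
    """Assign each node a (left, right) interval by depth-first traversal."""
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)
    index, counter = {}, 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in sorted(children.get(node, [])):
            visit(child)
        counter += 1
        index[node] = (left, counter)

    for root in children[None]:
        visit(root)
    return index


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (name TEXT PRIMARY KEY, parent TEXT, "
             "left_idx INTEGER, right_idx INTEGER)")
idx = nested_set_index(parent)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)",
                 [(n, parent[n], *idx[n]) for n in parent])


def descendants(conn, name):
    """All nodes whose interval lies strictly inside the interval of `name`."""
    rows = conn.execute(
        "SELECT d.name FROM node a JOIN node d "
        "ON d.left_idx > a.left_idx AND d.right_idx < a.right_idx "
        "WHERE a.name = ? ORDER BY d.left_idx", (name,))
    return [r[0] for r in rows]


def mrca(conn, x, y):
    """Most recent common ancestor: the innermost interval enclosing both nodes."""
    row = conn.execute(
        "SELECT a.name FROM node a, node x, node y "
        "WHERE x.name = ? AND y.name = ? "
        "AND a.left_idx <= MIN(x.left_idx, y.left_idx) "
        "AND a.right_idx >= MAX(x.right_idx, y.right_idx) "
        "ORDER BY a.left_idx DESC LIMIT 1", (x, y)).fetchone()
    return row[0]


print(descendants(conn, "n1"))
print(mrca(conn, "A", "B"), mrca(conn, "A", "C"))
```

The trade-off is that the intervals must be recomputed when the tree changes, which is why such indexing is usually run as a separate optimization step after import.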
11. • Phylogenetic inference on non-molecular characters
  - BioPerl: Interoperability between Bio::Phylo and BioPerl APIs
  - BioRuby: NEXUS-compliant data model and parser for PAUP and TNT results
• Phylogenetic footprinting
  - BioPerl: Support for Footprinter, PhastCons, and using ClustalW over a sliding window
• Estimation of divergence times
  - BioPerl: Draft design of r8s wrapper
• NEXUS compliance
  - Biojava: Interoperability between Biojava and JEBL
  - Biojava & BioRuby: Level II-compliant NEXUS parsers
  - All:
    - Evaluated major APIs
    - Proposed compliance levels
    - Gathered test files exposing common errors
    - Fixed compliance issues in NCL and Bio::NEXUS reference implementations
    - Worked on integrating those into GARLI and BioPerl, respectively
Next hackathon
• Comparative Phylogenetic Methods in R
• December 10-14, 2007
• Organizers: S. Kembel, H. Lapp, B. O'Meara, S. Price, T. Vision, A. Zanne
• http://hackathon.nescent.org/R_Hackathon_1
• Have an idea for a future event? Submit a whitepaper!
Google Summer of Code (GSoC)
• Student internships in open-source software development
  - Students work with any of a large number of established OS projects
  - Students and mentors work & communicate remotely
• NESCent recruited mentors and oversaw student progress
  - Eleven students worked on projects in visualization, usability, interoperability & implementation of new methods
12. NEXML
• Student: Jason Caravas
• Mentor: Rutger Vos
• Flexible serialization of phylogenetic objects
• Perl Bio::Phylo module tools for NEXML parsing and serialization
Command-line BioSQL
• Student: Jamie Estill
• Mentor: Hilmar Lapp
• Commands for
  - Database initialization
  - Bio::TreeIO import
  - Bio::TreeIO export
  - Tree query
  - Tree optimization
  - Tree manipulation
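As an illustration of the NEXML project's goal of serializing phylogenetic objects to XML and parsing them back, here is a minimal round-trip sketch. It is written in Python rather than Perl and does not use Bio::Phylo. The element and attribute names (`tree`, `node`, `edge`, `source`, `target`, `length`) are assumptions chosen for illustration and are not validated against the NeXML schema.

```python
import xml.etree.ElementTree as ET

# Toy tree as (child, parent, branch_length) edges; invented for illustration.
edges = [("n1", "root", 1.0), ("A", "n1", 2.0), ("B", "n1", 3.0), ("C", "root", 4.0)]


def tree_to_xml(edges, tree_id="tree1"):
    """Serialize an edge list as a <tree> element with <node> and <edge> children."""
    tree_el = ET.Element("tree", id=tree_id)
    nodes = []
    for child, par, _ in edges:
        for name in (par, child):
            if name not in nodes:      # keep first-seen order, no duplicates
                nodes.append(name)
    for name in nodes:
        ET.SubElement(tree_el, "node", id=name, label=name)
    for i, (child, par, length) in enumerate(edges, start=1):
        ET.SubElement(tree_el, "edge", id=f"e{i}",
                      source=par, target=child, length=str(length))
    return tree_el


def xml_to_edges(tree_el):
    """Parse the <edge> elements back into (child, parent, branch_length) tuples."""
    return [(e.get("target"), e.get("source"), float(e.get("length")))
            for e in tree_el.findall("edge")]


xml_text = ET.tostring(tree_to_xml(edges), encoding="unicode")
round_trip = xml_to_edges(ET.fromstring(xml_text))
print(xml_text)
print(round_trip == edges)
```

Representing nodes and edges as separate elements, rather than as nested parentheses, is what lets generic XML tools validate and query the tree.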
Conservation of phylogenetic diversity
• Student: Klaas Hartmann
• Mentor: Tobias Thierer
• Implementation of algorithm and GUI for optimal allocation of a finite budget to individual species to maximize phylogenetic diversity.
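The budget allocation problem can be illustrated with a deliberately simplified sketch. It is not the algorithm implemented in the project: each species is either fully protected or not, phylogenetic diversity is taken as the total branch length connecting the protected species to the root, and the best affordable subset is found by brute force. The tree, branch lengths, and costs are invented.

```python
from itertools import combinations

# Toy rooted tree: node -> (parent, branch length to parent). Invented values.
tree = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "A": ("n1", 2.0),
    "B": ("n1", 3.0),
    "C": ("root", 4.0),
}
cost = {"A": 1, "B": 1, "C": 2}   # hypothetical cost of protecting each species


def phylogenetic_diversity(tree, species):
    """Sum the branch lengths on the union of root-to-species paths."""
    covered = set()
    for node in species:
        # Walk toward the root, stopping once we reach an already-counted branch.
        while node is not None and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return sum(tree[node][1] for node in covered)


def best_subset(tree, cost, budget):
    """Brute-force search for the affordable species subset with the highest diversity."""
    best, best_pd = (), 0.0
    for k in range(1, len(cost) + 1):
        for subset in combinations(sorted(cost), k):
            if sum(cost[s] for s in subset) <= budget:
                pd = phylogenetic_diversity(tree, subset)
                if pd > best_pd:
                    best, best_pd = subset, pd
    return best, best_pd


print(best_subset(tree, cost, budget=2))
print(best_subset(tree, cost, budget=3))
```

Brute force is only workable for a handful of species, since the number of subsets doubles with each species added; it is used here only to make the objective explicit.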
13. Bayesian calibration of divergence times
• Student: Michael Nowak
• Mentor: Derrick Zwickl
• Fossil occurrence data is used to construct informative priors on divergence times for Bayesian analysis in, e.g. BEAST
Phyloinformatics Summer Course
• Teaching advanced programming skills to phylogenetic methods developers
• Focus is on software technologies rather than methodology
• First year
  - 10 days in July 2007
  - Organized by Bill Piel of TreeBASE
  - 8 co-instructors
  - 23 students (11 female) in the first year
Conclusions
• The future of web-enabled comparative biology is beginning to become clearer.
  - For a preview, see genomics!
• The facile exchange of phylogenetic data is what will enable it.
• Expect to be using technologies such as ontologies and web services, which are now largely foreign to phylogenetic researchers.
• Also expect a shift toward open development.
  - This will necessitate new modes of training for academic phyloinformaticists.
Additional acknowledgements
• Hackathon participants
• GSoC mentors and students
• Summer course instructors
• Phenotype evolution project
  - Jim Balhoff, Wasila Dahdul, John Lundberg, Paula Mabee, Peter Midford, Monte Westerfield
• Data depository:
  - Ryan Scherle, Jane Greenberg