1. An HPSSB (History,
Philosophy and Social
Studies of Biology) Approach
to Biomedical Ontologies
Sabina Leonelli
ESRC Centre for Genomics in Society
Department of Sociology and
Philosophy
University of Exeter
s.leonelli@exeter.ac.uk
2. An HPSSB Perspective on
the epistemic role of e-Science
Experimental science encompasses a variety of ways of knowing and
communicating beyond what can be formalised:
e.g. modelling, experimental practices, tacit familiarity with instruments
and materials
This awareness needs to carry over to e-Science: the aim is not to replace
laboratory activities, but to complement them (note: pointing to
new directions does not mean guiding research, given the exploratory
quality of experimentation)
• History of biology: ‘big science’ infrastructure since WWII; history of
model organism research in biology, and of relations between
biological and medical research
• Philosophy of biology: the role of data, theories, different types of
models, instruments and materials in experimental practices;
epistemic functions of classification
• Social studies of biology: social organisation of science; forms of
and conditions for cooperation and communication; power relations
among actors; institutional and economic context
3. Case Study: The Gene Ontology
• Arguably the most successful bio-ontology to date
• Developed for use by community databases as
a standard for the annotation of gene products
– History steeped in model organism research
• Good tool for data sharing:
– Choice of terms is based on research interests of
users
– Dynamic system: can be updated to reflect scientific
developments
• Flexibility comes from appropriate curation:
– Manual and labour-intensive (impossible to
automate)
– Research interests vary across epistemic cultures:
• How to choose relevant and intelligible labels?
5. The Classification Problem
stability of classificatory categories
versus
dynamism and diversity of research
practices
Can classification through standard
categories enable collaborative research
without at the same time stifling its
development and pluralism?
6. GO as a Classification System
Making data travel across different epistemic
communities, to facilitate cross-species, integrative
research: classification of both biological phenomena
and data
• Data are associated with biological phenomena via machine-
readable labels
• Users can automatically assess the relevance of data as
evidence for claims about those phenomena
• To re-use data towards new discoveries, users need to assess
their reliability within their own research context: meta-data
enable users to ‘situate’ information through their own
expertise and tacit knowledge
= data are de-contextualised for travel and re-contextualised
for appropriation by a new context
= access is differential: users can choose parameters
for their queries depending on their interests and
expertise
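The label-based retrieval and ‘differential access’ described above can be sketched in Python, with hypothetical annotation records (the gene names, GO identifier and data layout here are illustrative assumptions, not drawn from actual GO files):

```python
# Hypothetical sketch: GO-style annotations link gene products to
# phenomena via machine-readable term labels, plus meta-data (an
# evidence code) that lets users judge reliability in their own context.
annotations = [
    {"gene": "Adh", "species": "D. melanogaster",
     "term": "GO:0004022", "label": "alcohol dehydrogenase activity",
     "evidence": "IDA"},   # Inferred from Direct Assay
    {"gene": "ADH1", "species": "S. cerevisiae",
     "term": "GO:0004022", "label": "alcohol dehydrogenase activity",
     "evidence": "IEA"},   # Inferred from Electronic Annotation
]

def query(term, accepted_evidence):
    """Differential access: users set query parameters according to
    their own interests and expertise."""
    return [a for a in annotations
            if a["term"] == term and a["evidence"] in accepted_evidence]

# A cautious user might accept only experimentally supported annotations:
hits = query("GO:0004022", accepted_evidence={"IDA", "IMP"})
```

The point of the sketch is the `accepted_evidence` parameter: the same classification system serves users with different standards of reliability, because the meta-data travel with the data.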
8. Classification of
data provenance
EVIDENCE CODES
Experimental evidence codes
IMP - Inferred from Mutant Phenotype
IDA - Inferred from Direct Assay
IGI - Inferred from Genetic Interaction
IPI - Inferred from Physical Interaction
IEP - Inferred from Expression Pattern
Computational analysis
IEA - Inferred from Electronic Annotation
RCA - Reviewed Computational Analysis
ISS - Inferred from Sequence Similarity
Author statement
TAS - Traceable Author Statement
NAS - Non-traceable Author Statement
Curatorial statement
IC - Inferred by Curator
ND - No biological Data available
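The provenance classification on this slide can be sketched as a small lookup (this is an illustrative sketch, not official GO software; the function names are my own):

```python
# Sketch: grouping GO evidence codes by provenance class, so that
# annotations can be filtered by how the underlying claim was inferred.
EVIDENCE_CLASSES = {
    "experimental": {"IMP", "IDA", "IGI", "IPI", "IEP"},
    "computational": {"IEA", "RCA", "ISS"},
    "author": {"TAS", "NAS"},
    "curatorial": {"IC", "ND"},
}

def provenance_class(code):
    """Return the provenance class of an evidence code."""
    for cls, codes in EVIDENCE_CLASSES.items():
        if code in codes:
            return cls
    raise ValueError(f"unknown evidence code: {code}")

def is_manually_reviewed(code):
    """IEA is the one code not assigned by a curator; all others
    involve some manual review."""
    return code != "IEA"
```

Filtering out IEA annotations when hand-checked evidence is required is a common use of these codes, and shows how classification of provenance (not just of phenomena) supports data re-use.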
9. GO as an Expert Community
The threat of imperialism vs. GO as ‘service to biology’:
whoever chooses labels and what counts as meta-data
determines nomenclature and protocols used as standard
across biology (and thus interpretation of data as well as
experimental set-ups)
1. De-contextualisation: separating data from information about ‘local’ features of data production
2. Abstraction: simplifying, eliminating or modifying characteristics of data to be standardised
3. Knowledge-stabilisation: defining terms and relations to mirror (what curators see as) the consensus
4. Situating: associating each dataset with a specific term (and thus a specific phenomenon)
Solution: Curator as mediator between requirements of e-
Science (consistency, computability, ease of use and wide
intelligibility) and the diverse practices characterising
experimental biology
• GO curators develop specific expertise to tackle the threat
– Cross-disciplinary training > awareness of diverse epistemic cultures
– Experience ‘at the bench’ > awareness of what users need and look for
• Community involvement (content meetings, feedback,
crowdsourcing, user training at workshops and online)
10. GO as a Scientific Institution
However: emergence of separate expertise is itself an
obstacle to dialogue with users. Curators face two
severe problems:
• Impossible to serve users without consultation, yet
users do not provide feedback: lack of interest, time,
expertise
• Need to minimise duplication/proliferation of labels,
yet each curator and each ontology has a different
perception of, and function in, the field
Solution: Consortia as regulatory centres --
standardisation as a tool to serve diversity in
epistemic practices and interests of users:
• Centralising expertise
• Centralising procedures
11. The Gene Ontology Consortium
• Michael Ashburner 1998: the terms used for data classification
should be the ones used to describe research interests
• July 1998: First meeting of the consortium, members from
Saccharomyces Genome Database, Mouse Genome Informatics,
FlyBase, Berkeley Drosophila Genome Project
• October 1999: funding application to the NIH and AstraZeneca
• 2000-1: Rapid expansion, including the Zebrafish Information
Network, the Rat Genome Database, The Arabidopsis Information
Resource, Gramene.
• 2002: Central office in Cambridge
• Grants from the National Human Genome Research Institute (NHGRI),
NIH, EU, AstraZeneca, Incyte Genomics, the United States Department of
Agriculture Research and Education Service, and the UK Medical
Research Council.
• De facto standard for classification, annotation and dissemination of
genomic data in model organism biology
• In parallel: birth of the Open Biomedical Ontologies
Consortium
12. The Institutional Role of
Consortia: Enforcing
Collaboration
• Encourage feedback loops among curators:
– Rules for bio-ontology development
– Organisation of curator meetings and communication
– Enhancing accountability and clear division of labour
• Encourage dialogue with users:
– ‘Content meetings’
– Experiment with peer review procedures (e.g. Reactome)
– Liaise with industry to align their data sharing practices
• Co-operate with journals (linking data disclosure with
publication)
E.g. Plant Physiology and TAIR: enforcing feedback on GO
• Train users and curators
– Workshops at conferences and elsewhere
– Enforce institutionalisation within universities (e.g. Stanford
Biomedical Informatics; graduate training in UK systems
biology)
13. The multiple identities of GO
• GO plays several epistemic roles in biology at once:
• Classification system
• Expert community
• Regulatory institution
• Exemplifies and regulates epistemic and social relations
between virtual (in silico) and material (wet) practices in
biology
• Despite institutionalisation within biology, GO is still far from
having resolved tensions between curators’ vision of
what technology can do for science and users’ needs and
practices:
• Handling dissent on terms or definitions
• Providing sufficient meta-data to assess data provenance
• Non-overlapping datasets and checking data quality
• Long-term maintenance, strategies for revision and updating
(how has GO actually been revised?)
14. Thanks to ESRC for funding and several bio-ontology
curators (including the GO team at EBI) for their patience
and availability for interviews
• (in preparation) On the Role of Theory in Data-Driven Research:
The Case of Bio-Ontologies.
• (2010) Documenting the Emergence of Bio-Ontologies: Or, Why
Researching Bioinformatics Requires HPSSB. History and
Philosophy of the Life Sciences.
• (2010) Packaging Data for Re-Use: Databases in Model Organism
Biology. In Howlett, P and Morgan, MS (eds) How Well Do ‘Facts’
Travel. CUP.
• (2009) On the Locality of Data and Claims About Phenomena.
Philosophy of Science 76, 5.
• (2009) Centralising Labels to Distribute Data: The Regulatory Role
of Genomic Consortia. In Atkinson et al (eds.) Handbook for
Genetics and Society: Mapping the New Genomic Era. Routledge,
pp. 469-485.
• (2008) Bio-Ontologies as Tools for Integration in Biology.
Biological Theory 3, 1: 8-11.
15. Abstract
This paper reflects on the analytic challenges emerging from the
study of bioinformatic tools recently created to store and
disseminate biological data, such as databases, repositories and
bio-ontologies. I focus my discussion on the Gene Ontology, a
term that defines three entities at once: a classification system
facilitating the distribution and use of genomic data as evidence
towards new insights; an expert community specialised in the
curation of those data; and a scientific institution promoting the
use of this tool among experimental biologists. These three
dimensions of the Gene Ontology can be clearly distinguished
analytically, but are tightly intertwined in practice. I suggest that
this is true of all bioinformatic tools: they need to be understood
simultaneously as epistemic, social and institutional entities,
since they shape the knowledge extracted from data and at the
same time regulate the organisation, development and
communication of research. This viewpoint has one important
implication for the methodologies used to study these tools,
namely the need to integrate historical, philosophical and sociological
approaches. I illustrate this claim through examples of
misunderstandings that may result from a narrowly disciplinary
study of the Gene Ontology, as I experienced them in my own
research.