This document discusses the past, present, and future of knowledge representation in biology. It covers how ontologies have grown significantly in use over time for organizing biological facts and data. However, ontologies only represent part of biological knowledge, and there is potential to do more by connecting different types of knowledge, generating natural language descriptions, and representing knowledge about experiments and workflows in addition to entities and relationships. The document argues that biological knowledge representation has advanced beyond ontologies alone and could benefit from additional types of knowledge representation and reasoning.
The Past, Present and Future of Knowledge in Biology
1. The Past, Present and Future of Knowledge in Biology
Robert Stevens
BioHealth Informatics Group
The University of Manchester
Manchester
United Kingdom
Robert.Stevens@manchester.ac.uk
2. Overview
• A look at the state of play
• For what are we using ontologies?
• What do we count as knowledge?
• Doing so much more with knowledge
• Stopping text being a dead end
3. Text and Ontologies: The Terrible Twins of Knowledge in Biology
6. Data are only as Good as their Metadata
• There is a lot of biology out there…
• How these entities are described in our data varies
• We don’t even agree on what entities there are to
describe in our data
• This makes analysing data hard: You have to know
what your data represent
• …, but also how the entities described in your data
relate to each other
• We need to describe our data – their metadata
7. Creating Woods, not Trees
[Figure labels: Genes, Proteins, Pathways, Interactions; Literature; Complex Machines; Virtual Organism]
… from biological facts, we make a system that is some model of a real organism
9. There’s a Lot of it About
[Figure: Searching for “ontology” in five-year chunks on the ACM digital portal]
[Figure: Searching for “ontology” in five-year chunks on PubMed]
10. It’s all Gruber’s Fault
• “In the context of knowledge sharing, the term ontology means a
specification of a conceptualisation. That is, an ontology is a description
(like a formal specification of a program) of the concepts and relationships
that can exist for an agent or a community of agents. This definition is
consistent with the usage of ontology as set-of-concept-definitions, but
more general. And it is certainly a different sense of the word than its use
in philosophy.” DOI:10.1006/knac.1993.1008 DOI:10.1006/ijhc.1995.1081
12. Everything with a Blob and Line is called an Ontology
• Wide acceptance criteria
• Narrow evaluation criteria
• Different sort of knowledge for different
situations
• Different styles of representation; some
scruffy and some formal
• Representing knowledge in biology is more
than ontologies
• We could stop calling them ontologies
[Figure labels: RDF graph, database schema, thesaurus, OWL ontology, formal ontology, SKOS vocabulary]
14. Knowing What We’ve got is so Useful
• We could computationally handle lots of data,
but we couldn’t do so with what we know
about those data
• Ontologies have so far mainly been used as a common
tongue so that we can compare
• … and it works!
• Still getting lots of mileage from ontology
annotation
• … but there is so much more
16. Classifying a Mouse
Individual Description:
Stops wriggling after 3 sec
Has 3 cm tail
Mass 10g
10 days old (since birth)
Strain C57Bl/6
Class Description:
Class: DepressedMouse
EquivalentTo: Mouse that
(wriggles for <= 30 OR swims for <= 45)
DataTransformation
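The class description above can be read as a test applied to an individual's recorded measurements. The following is a minimal Python sketch of that idea, not an OWL reasoner. The names `MouseObservation` and `is_depressed_mouse` are hypothetical, and the assumption that the thresholds 30 and 45 are in seconds is ours, since the slide gives no unit for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MouseObservation:
    """Measurements for one mouse; times are assumed to be in seconds."""
    wriggle_time: Optional[float] = None
    swim_time: Optional[float] = None


def is_depressed_mouse(obs, wriggle_max=30, swim_max=45):
    """Mirror 'wriggles for <= 30 OR swims for <= 45' from the slide.

    A missing measurement is treated as 'condition not met', which is a
    closed-world simplification; an OWL reasoner would treat it as unknown.
    """
    wriggles_briefly = obs.wriggle_time is not None and obs.wriggle_time <= wriggle_max
    swims_briefly = obs.swim_time is not None and obs.swim_time <= swim_max
    return wriggles_briefly or swims_briefly


# The individual on the slide stops wriggling after 3 sec.
slide_mouse = MouseObservation(wriggle_time=3)
print(is_depressed_mouse(slide_mouse))
```

Only the wriggle time is needed to classify the individual on the slide; its tail length, mass, age and strain play no part in this particular class description.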
17. Short tailed mouse
Class: ShortTailedMouse
EquivalentTo: Mouse that
hasPart EXACTLY 1 (Tail that hasAssay SOME
(LengthAssay that hasValue SOME int[<= 20] and hasUnit
SOME Millimetre))
SubClassOf: Mouse that
hasPart SOME (Tail that hasQuality SOME Short)
• We can recognise an instance of short-
tailed mouse, but we also know that it has
the quality “short”
• Even when the fact isn’t asserted
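The two halves of the short-tailed mouse definition do different jobs: the EquivalentTo part recognises an instance from its measurement, and the SubClassOf part then adds a fact that was never asserted. A minimal Python sketch of that two-step behaviour follows. It only imitates the outcome and is not how an OWL reasoner works; `infer_tail_facts` and the 15 mm example mouse are hypothetical.

```python
SHORT_TAIL_MAX_MM = 20  # threshold from the class definition: int[<= 20] Millimetre


def infer_tail_facts(tail_lengths_mm):
    """Return the facts implied by a mouse's tail length assays (in mm).

    `tail_lengths_mm` holds one assay value per tail. The 'exactly 1' check
    stands in for the qualified cardinality restriction on hasPart.
    """
    facts = set()
    if len(tail_lengths_mm) == 1 and tail_lengths_mm[0] <= SHORT_TAIL_MAX_MM:
        # Recognition: the EquivalentTo conditions are met.
        facts.add("type ShortTailedMouse")
        # Implication: the SubClassOf axiom adds the unasserted quality.
        facts.add("hasPart some (Tail that hasQuality some Short)")
    return facts


print(infer_tail_facts([15]))  # hypothetical mouse with a 15 mm tail
print(infer_tail_facts([30]))  # a 3 cm tail is 30 mm, so nothing is inferred
```

Note that a 3 cm tail, once converted to 30 mm, lies above the 20 mm threshold, so such a mouse would not be recognised as short-tailed under this definition.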
19. OWL’s Automated Reasoners
• Demonstrably useful in:
– Building ontologies
– Querying ontologies
– Automatically annotating data
– Making “discoveries”
But there is more than OWL’s reasoning
20. Separation of Knowledge and Software
• We realised a long time ago that we needed
to separate knowledge from software
• We only recently called this knowledge
component an ontology
• We don’t really need to see the ontology
• We certainly shouldn’t show people OWL; it
“scares the horses”
• Ontology for software not humans (L. Hunter)
21. The Ontology Cottage Industry
• We’ve industrialised data production
• We’ve (to some extent) industrialised data
analysis
• We’ve not really moved away from hand-
crafted, “whittled” ontologies
22. Can we have Mass Editing of Ontologies?
• Probably not
• Computer scientists are in love with synchronous
editing
• … but it is not really necessary (see CSCW)
• Mass gathering of knowledge
23. Mass Gathering of Knowledge and the Application of Patterns or a Metamodel
http://rightfield.org.uk http://www.e-lico.eu/populous
24. There’s so much more to Ontology Building than Editing Axioms
• Gathering knowledge
• Adding labels
• Adding other human-oriented content
• Reviewing, checking, suggesting
• Deploying, using, creating “views”
• Ontology comprehension
25. There’s More to KR than OWL
• OWL and its automated reasoners are useful
• But there is so much more to KR than
ontologies and OWL
• Higher order reasoning
• Rules
• Other sorts of reasoning
26. Generating natural language
Class: HeLa
SubClassOf: Cell,
bearer_of some 'cervical carcinoma',
derives_from some 'Homo sapiens',
derives_from some cervix,
derives_from some 'epithelial cell'
OWL
HeLa is a cell line. A hela is all of the
following: something that is bearer of
a cervical carcinoma, something that
derives from a homo sapiens,
something that derives from an
epithelial cell, and something that
derives from a cervix.
Generated natural language
Experimental Factor Ontology (EFO)
http://www.ebi.ac.uk/efo
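The mapping from the OWL class description to the generated paragraph can be approximated with simple templates. The sketch below is a toy illustration in Python and is not the SWAT verbaliser credited in the acknowledgements. The phrase table, the function names and the spelling-based choice of "a" or "an" are all assumptions made for this example.

```python
# Hypothetical phrase table: how each property name is read aloud.
PROPERTY_PHRASES = {
    "bearer_of": "is bearer of",
    "derives_from": "derives from",
}


def article(noun):
    """Pick 'a' or 'an' from the first letter (a crude spelling-based rule)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def verbalise(name, superclass, restrictions):
    """Turn a class name, a named superclass and 'some' restrictions into English.

    `restrictions` is a list of (property, filler) pairs.
    """
    sentences = [f"{name} is {article(superclass)} {superclass}."]
    clauses = [
        f"something that {PROPERTY_PHRASES[prop]} {article(filler)} {filler}"
        for prop, filler in restrictions
    ]
    subject = f"{article(name).capitalize()} {name}"
    if len(clauses) == 1:
        sentences.append(f"{subject} is {clauses[0]}.")
    elif clauses:
        listed = ", ".join(clauses[:-1]) + ", and " + clauses[-1]
        sentences.append(f"{subject} is all of the following: {listed}.")
    return " ".join(sentences)


hela_text = verbalise(
    "HeLa",
    "cell",
    [
        ("bearer_of", "cervical carcinoma"),
        ("derives_from", "Homo sapiens"),
        ("derives_from", "cervix"),
        ("derives_from", "epithelial cell"),
    ],
)
print(hela_text)
```

The sketch keeps the axioms in the order they are written, so its wording and clause order differ slightly from the generated text shown on the slide.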
27. Ontology as book
Title: Experimental Factor Ontology
Table of Contents
Chapter 1. Cell line
Chapter 2. Cell type
Chapter 3. Chemical Compound
Chapter 4. Organism
HeLa is a cell line. A hela is all of the
following: something that is bearer of
a cervical carcinoma, something
that derives from a homo sapiens,
something that derives from an
epithelial cell, and something that
derives from a cervix.
entry
29. It’s not Just “Things”
• Experiments produce data about things
• Proteins, genes, chemicals, reactions,
diseases, size, shape, speed, ….
• As well as this knowledge we have knowledge
of how it was done
• OBI is still the “things” to do with production
• We still need the methods by which these
“things” were deployed
• The protocol
31. Workflows are knowledge about methods
Get genes in region
Get pathways that
contain genes
Merge data into single files
Get gene descriptions
Get pathway descriptions
Cross-reference ids
Methods:
1. A QTL (region of chromosome) is entered into the
workflow, specified as base pairs. These base pairs
are subsequently used to identify, in the Ensembl
database, any genes that lie within this region.
2. Any genes found within this region are subsequently
annotated with Entrez and UniProt identifiers.
3. The Entrez and UniProt identifiers are then passed
to a KEGG id conversion Web Service, to cross-
reference the input ids to KEGG gene identifiers.
This enables gene descriptions and biological
pathway data to be returned from KEGG.
4. Each KEGG gene id is then used in a search for
KEGG pathways. Any pathways found to contain the
gene are returned as KEGG pathway ids.
5. Both KEGG gene and pathway ids are then sent to
individual services, provided by KEGG, which
provide a description of the gene and pathway.
6. The outputs of the workflow are then combined into
single flat files, which can be saved locally and used
to identify novel pathways and genes within the QTL
region.
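The six method steps above describe a pipeline, which can be sketched in Python as a function that calls one service per step. This is only an outline of the control flow, not the Taverna workflow itself. Every service here is a hypothetical stand-in passed in as a callable, and the identifiers and descriptions in `toy_services` are made-up placeholders, not real Ensembl, Entrez, UniProt or KEGG data.

```python
import csv
import io


def run_qtl_workflow(region, services):
    """Follow steps 1-5 of the Methods text; `services` maps step names to callables."""
    rows = []
    for gene in services["genes_in_region"](region):                  # step 1
        xrefs = services["annotate"](gene)                            # step 2
        for kegg_gene in services["to_kegg_gene_ids"](xrefs):         # step 3
            gene_desc = services["describe_gene"](kegg_gene)          # step 5
            for pathway in services["pathways_for_gene"](kegg_gene):  # step 4
                pathway_desc = services["describe_pathway"](pathway)  # step 5
                rows.append((gene, kegg_gene, gene_desc, pathway, pathway_desc))
    return rows


def to_flat_file(rows):
    """Step 6: merge the results into one tab-separated text table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "kegg_gene", "gene_description", "pathway", "pathway_description"])
    writer.writerows(rows)
    return buffer.getvalue()


# Toy stand-ins for the real services; all values are placeholders.
toy_services = {
    "genes_in_region": lambda region: ["GENE_A", "GENE_B"],
    "annotate": lambda gene: {"entrez": f"entrez:{gene}", "uniprot": f"uniprot:{gene}"},
    "to_kegg_gene_ids": lambda xrefs: [xrefs["entrez"].replace("entrez:", "kegg:")],
    "describe_gene": lambda kegg_gene: f"description of {kegg_gene}",
    "pathways_for_gene": lambda kegg_gene: ["path:TOY1"] if kegg_gene == "kegg:GENE_A" else [],
    "describe_pathway": lambda pathway: f"description of {pathway}",
}

rows = run_qtl_workflow(("chr_toy", 1000, 2000), toy_services)
flat_file = to_flat_file(rows)
print(flat_file)
```

In this sketch a gene with no pathways produces no row, which is a simplification; the workflow described above returns gene descriptions as a separate output.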
35. What Next?
• Ontologies are not the only fruit
• We could stop calling them ontologies
• We need to produce “ontologies” faster
• We need to do more interesting things with our knowledge
• We need to make them pervade our tools
• We need them to be “agile”
• Open to other forms of KR and other forms of reasoning
• Adding to data automatically
• Generating our descriptions of data
36. Acknowledgements
• Simon Jupp for the slides
• Alan Rector and Carole Goble
• SysMO-DB for RightField (Katy Wolstencroft, Stuart Owen, Matt
Horridge)
• Populous – Simon Jupp
• SWAT – Richard Power, Sandra Williams and Allan Third at the
OU
• EFO – James Malone and Helen Parkinson
• Steve Pettifer for the Utopia and MVC
• Paul Fisher and the Taverna team
• The myExperiment team at Southampton and Manchester
Editor's Notes
Slide Title: Literature
Lots of books in a library
Slide Title
Slide contains:
Book on the left with a plus sign
Black and white image, man sat at an old valve-style computer (i.e. manchester baby)
Text saying: genes, proteins, interactions, pathways
Mouse on the right
Text below images says:
(left) Literature
(middle) complex machines
(right) Organism
(bottom) “…. from biological facts, we make a system that is some model of a real thing” - Robert Stevens – 2008
All of which helps build better ontologies. But can we actually apply this computational amenability more directly to biological knowledge? In this example, which is work by Katy Wolstencroft, we have codified community knowledge about protein domains in phosphatases in OWL. We then take unknown protein sequences, pass them through InterPro and stick them into the instance store, which is basically a database and reasoner tied together.
Qualified cardinality!!!