Ewan Birney Biocuration 2013

Curation
Ewan Birney (tweetable)

Who am I?
• Associate Director at European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (> 20 years!)
• Trained as a biochemist – most people think I am CS
• Analysed, and sometimes led, human/mouse/rat/platypus etc. genomes, ENCODE, and others

EBI is in Hinxton, South Cambridgeshire
EBI is part of EMBL, ~like CERN for molecular biology

Molecular Biology
• The study of how life works – at a molecular level

• Key molecules:
• DNA – Information store (Disk)
• RNA – Key information transformer, also does stuff (RAM)
• Proteins – The business end of life (Chip, robotic arms)
• Metabolites – Fuel and signalling molecules (electricity)
• Theories of how these interact, but no theories to predict what they are
• Instead we determine attributes of molecules and store them in
globally accessible, open, databases

Theory vs Observation

[Figure: fields arranged on a spectrum between "Can accurately predict from models" and "Must directly observe", labelled in this order: Molecular Biology; Geology, Astronomy; Climate modelling; High Energy Physics]

This ratio is not well correlated with data size

[Figure: Data Size plotted against Ratio of model predictability. Labels: "~60PB" beside High Energy Physics; Molecular Biology; Astronomy; "~5PB" beside Climate Models]

“Knowing stuff” is critical to biology…

• The bases of the human genome
• … and the Mouse, Rat, Wheat, E. coli, Plasmodium, Cow….
• The functions of proteins
• Enzymes, Transcription Factors, Signalling….
• The types of cells, their lineages and organ composition
• …and all the molecular components in each cell
• Small molecules
• … and their conversions, binding partners
• Structures of molecules, complexes and cells
• … at atomic and higher resolution

Two fundamental types of information

Experimental data
• The result of a specific experiment
• Often an experiment-specific, data-heavy part plus a “meta-data” part
• Might be contradictory
• “Primary paper”

Consensus Knowledge
• Integration of different strands of information on a topic
• Realised as a computationally accessible scheme
• “Review article”
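
The split between the two types of information can be sketched as a data structure. This is only an illustrative sketch, not how any EBI resource stores data: the names `ExperimentalRecord` and `consensus` are hypothetical, and the majority vote is an assumed stand-in for the judgement a curator applies when integrating contradictory experiments.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExperimentalRecord:
    """One experiment: a data-heavy part plus a "meta-data" part."""
    topic: str         # what the experiment is about (hypothetical key)
    observation: str   # stand-in for the data-heavy part
    metadata: dict = field(default_factory=dict)  # e.g. sample, machine


def consensus(records, topic):
    """Integrate all records on one topic into a single summary.

    Majority vote is an assumption made for illustration only.
    """
    relevant = [r for r in records if r.topic == topic]
    counts = Counter(r.observation for r in relevant)
    if not counts:
        return None  # no experimental data on this topic
    value, support = counts.most_common(1)[0]
    return {
        "topic": topic,
        "consensus": value,
        "support": support,
        "total": len(relevant),
        "contradictory": len(counts) > 1,
    }


records = [
    ExperimentalRecord("A-B interaction", "binds", {"sample": "s1"}),
    ExperimentalRecord("A-B interaction", "binds", {"sample": "s2"}),
    ExperimentalRecord("A-B interaction", "does not bind", {"sample": "s3"}),
]
summary = consensus(records, "A-B interaction")
```

Each record keeps its own meta-data and may disagree with the others, while the summary is a single computationally accessible statement that also records how much support it has.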

Experimental Data Entry

• IntAct – Protein:Protein
interactions

• GWAS Catalog –
extraction of summary
statistics

Experimental Meta data capture

• Sample, CDS lines in
ENA
• Sample in MetaboLights,
PRIDE etc
• Machine and analysis
specification in PDB,
PRIDE, ENA

Consensus integration of information

• GENCODE gene models in
human
• Summaries and GO
assignment in UniProt
• Pathway information in
Reactome
• GO assignment and
summaries in MODs (eg,
PomBase, WormBase,
PhytoPathDB etc)

Knowledge frameworks

• The EC classification
• Cell type ontologies
• Cell lineages – Worms!
• SNOMED, HPO etc
• GO ontologies

Knowledge management

• Creation of rules
representing ENA
standards compliance
• Cross-ontology
coordination (eg, EFO) or
tying (GO to ChEBI)
• RuleBase / UniRule
curation processes
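
A rule-based curation process of this kind can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual RuleBase or UniRule format: the class `AnnotationRule`, the function `apply_rules`, the entry layout, and the identifiers `RULE-1`, `RULE-2`, and `SIG0001` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationRule:
    """A hypothetical curated rule: if all conditions hold, add the annotation."""
    rule_id: str
    required_signature: str  # a protein family signature identifier
    required_taxon: str      # taxon that must appear in the entry's lineage
    annotation: str          # annotation text to propagate


def apply_rules(entry, rules):
    """Return (rule_id, annotation) for every rule whose conditions the entry meets."""
    hits = []
    for rule in rules:
        if (rule.required_signature in entry["signatures"]
                and rule.required_taxon in entry["lineage"]):
            hits.append((rule.rule_id, rule.annotation))
    return hits


rules = [
    AnnotationRule("RULE-1", "SIG0001", "Bacteria", "Example function A"),
    AnnotationRule("RULE-2", "SIG0001", "Eukaryota", "Example function B"),
]
entry = {"signatures": {"SIG0001"}, "lineage": ["Bacteria", "Proteobacteria"]}
annotations = apply_rules(entry, rules)
```

The curator's effort goes into writing and maintaining the rules; the same rule can then annotate many entries without direct data entry for each one.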

Data Entry vs Programming

[Diagram: a spectrum from Direct Data Entry to Programmatic Data Entry, with labels: “Messy”; Improved data entry tools; Scripting; RuleBase, Computationally Accessible Standards]

Curation Dilemma

If you do your job well…
• Everyone assumes it’s easy
• People forget about the complexity
• You are ignored

If you do your job badly…
• Everyone assumes it’s easy
• People forget about the complexity
• People complain

Why we need an infrastructure…

Infrastructures are critical…

But we only notice them when they go wrong

Biology already needs an information
infrastructure

• For the human genome
• (…and the mouse, and the rat, and… x 150 now, 1000 in the
future!) - Ensembl
• For the function of genes and proteins
• For all genes, in text and computational – UniProt and GO
• For all 3D structures
• To understand how proteins work – PDBe
• For where things are expressed
• The differences and functionality of cells - Atlas

..But this keeps on going…

• We have to scale across all of (interesting) life
• There are a lot of species out there!
• We have to handle new areas, in particular medicine
• A set of European haplotypes for good imputation
• A set of actionable variants in germline and cancers
• We have to improve our chemical understanding
• Of biological chemicals
• Of chemicals which interfere with Biology

ELIXIR’s mission
To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:
• medicine
• environment
• bioindustries
• society


How?

Fully Centralised
Pros: Stability, reuse, learning ease
Cons: Hard to concentrate expertise across all of life science
Geographic, language placement
Bottlenecks and lack of diversity

Fully Distributed
Pros: Responsive, geographic and language responsive
Cons: Internal communication overhead
Harder for end users to learn
Harder to provide multi-decade stability

Research
• International
• EBI / Elixir
• English
• Low legalities

Healthcare
• National
• Healthcare
• National Language
• Complex legalities


Other infrastructures needed for biology
• EuroBioImaging
• Cellular and whole organism Imaging
• BioBanks (BBMRI)
• We need numbers – European populations – in particular for rare
diseases, but also for specific subtypes of common disease
• Mouse models and phenotypes (Infrafrontier)
• A baseline set of knockouts and phenotypes in our most tractable
mammalian model
• (it’s hard to prove something in human)
• Robust molecular assays in a clinical setting (EATRIS)
• The ability to reliably use state of the art molecular techniques in a
clinical research setting

(you can follow me on twitter @ewanbirney)
I blog and update this on Google Plus publicly
