100,000 Genomes Project.

The 100,000 Genomes Project
David Montaner
Bioinformatics Department
david.montaner@genomicsengland.co.uk
Valencia University, October 6th 2016

Talk Outline
1. Introduction & Background
2. Pipelines
3. Systems and Databases
4. Cancer
5. Rare Diseases

The 100,000 Genomes Project
Genomics England & Partners

Genomics England
• Owned by the Department of Health, UK
• Set up to deliver the 100,000 Genomes Project:
 100,000 whole genome sequences of National Health Service (NHS)
patients with:
• Rare Diseases (and family members)
• Cancer
Aims:
• Create an ethical and transparent programme based on consent
• Establish the infrastructure, human capacity & capability to set up a
genomic medicine service for the NHS and bring benefit to patients.
• Enable new scientific discovery and medical insights, and add to
the already extensive databases on human variation
• Working with the National Health Service (NHS), academics and industry
to make the UK a world leader in Genomic Medicine
Who are we & what are we doing?
Generate health & wealth

• Sequence 100,000 genomes
• Cancer and rare genetic disease
• Capture data delivered
electronically, store it securely
and analyse it
• within an English data centre
(reading library)
• Combine genomes with
extracted clinical information for
analysis, interpretation, and
aggregation
• Create capacity, capability and
legacy in personalised medicine
for the UK
Goals of Genomics England
1. To bring benefit to NHS patients
2. To enable new scientific discovery and medical insights
3. To create an ethical and transparent programme based on consent
4. To kickstart the development of a UK genomics industry

Inception of the 100,000
genomes project (2012, 2014)
“If we get this right, we could
transform how we diagnose and
treat our most complex diseases
not only here but across the world”
(December 2012)
“I am determined to do all I can to
support the health and scientific
sector to unlock the power of DNA,
turning an important scientific
breakthrough into something that will
help deliver better tests, better drugs
and above all better care for
patients.”
(August 2014)

Schedule

2012-2014: consortium creation

2014-2015: pilot studies

2016-2015: main project

Where are we?
London


London:

Management

All data storage


Cambridge:

Software for genomic data storage


Oxford:

Software for clinical data storage and collection

Recruitment and clinical interface
13 “GMCs”, Scotland and Northern Ireland
• Genomic Medicine Centres
• Networks of NHS hospitals
including genomics labs
• 13 “Lead organisation” plus
71 “Local Delivery Partners”
• Contracted by NHS England
• Cover recruitment, data and
return of results
• Scotland
• Doing own sequencing
• Northern Ireland
• Similar to a GMC
• Contracted by NI payer

The Journey of a Genome
Consent & sample collection
DNA extraction
Biorepository
Sequencing
Variant calling
Interpretation
Feedback to clinician
Validation
Treatment

The Journey of a Genome:
Partners
Steps: Consent & sample collection, DNA extraction, Biorepository, Sequencing, Variant calling, Interpretation, Feedback to clinician, Validation, Treatment
Partners:
• Genomic Medicine Centres (GMCs): 13x NHS organisations
• Sequencing platform: HiSeq X Ten
• Genomics England Clinical Interpretation Partnerships (GeCIPs): collaborations of clinicians & academics, > 2,000 researchers
• Clinical interpretation companies: Omicia, Congenica, Nextcode

GENE Consortium
• Working together on a year-long
Industry Trial involving a
selection of whole genome
sequences across cancer and rare
diseases
• Aims to identify most effective and
secure way to accelerate
development of new
diagnostics and treatments for
patients
• Working in a pre-competitive
environment
AbbVie
Alexion Pharmaceuticals
AstraZeneca
Berg Health
Biogen
Dimension Therapeutics
GSK
Helomics
NGM Biopharmaceuticals
Roche
Takeda
Genomics Expert Network for
Enterprises

Simplified Workflow
BAM file from Illumina
Variant calling pipelines: VCF file
QC1, QC2
Variant annotation
Tiering of variants
Dispatch
Clinical interpretation
QC Portal
Reporting portal
Medical review
Validation
Genomic Medicine Centre (GMC)

Bioinformatics Team Role
Steps: Consent & sample collection, DNA extraction, Biorepository, Sequencing, Variant calling, Interpretation, Feedback to clinician, Validation, Treatment

Genomics Education
Health Education England
• MSc in Genomic Medicine
• 10 Universities across the UK
• Online training courses and resources
• The fundamentals of genomics
• Sample handling and DNA
extraction
• Bioinformatics
• How to support patients through
the consent process
Genomics England Communications
Team

Update on numbers:
at about 10%
• >10,000 genomes
received
• >1PB of primary data
• >1.3M files received or
generated and indexed
• 200M germline variants
databased
• 48M somatic variants
databased
• 70,000 HPO terms asserted
• >450,000 hospital episodes

100,000 Genomes
• Rare Disease
• Each Genome: 100GB
• Trio is preferred so 300GB per
participant
• x 50,000 participants =
15,000,000GB total
• Cancer
• Germline: 100GB
• Tumour: 200GB
• 300GB per patient
• x 25,000 participants =
7,500,000GB total
• 10,000,000GB = 10 Petabytes
• Expecting around 30 Petabytes
Huge Amount of Data
10 Billion Photos = 1.5 Petabytes
Data Processed in 1 day = 20 Petabytes
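
The storage estimate can be multiplied out as a quick check. This is a minimal sketch, assuming the per-genome sizes and participant counts quoted on the slide and the decimal convention 1 Petabyte = 1,000,000 GB; the function and constant names are illustrative only.

```python
# Per-genome sizes in gigabytes, as quoted on the slide.
GB_PER_GERMLINE_GENOME = 100
GB_PER_TUMOUR_GENOME = 200
GB_PER_PB = 1_000_000  # decimal convention: 10,000,000 GB = 10 PB


def total_pb(gb_per_participant: int, participants: int) -> float:
    """Return the total storage for a cohort, in petabytes."""
    return gb_per_participant * participants / GB_PER_PB


# Rare disease: a trio (three germline genomes) per participant.
rare_disease_pb = total_pb(3 * GB_PER_GERMLINE_GENOME, 50_000)
# Cancer: one germline genome plus one tumour genome per patient.
cancer_pb = total_pb(GB_PER_GERMLINE_GENOME + GB_PER_TUMOUR_GENOME, 25_000)

print(f"Rare disease: {rare_disease_pb:.1f} PB")
print(f"Cancer:       {cancer_pb:.1f} PB")
print(f"Combined:     {rare_disease_pb + cancer_pb:.1f} PB")
```

The combined figure covers primary sequence data only; derived files and copies would add to it.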

bertha_default 1.1.0
Single Sample QC & Processing
Analysis
Intake QC
Multi Sample QC
Cross Sample Contamination
Single-Sample QC Check Point
Identity by Descent
Mendelian Inconsistency Rate
Sex Check
Somatic VCF re-headering
Tumour Cross Sample Contamination
Cross Species Contamination
Depth of Coverage
Concordance check
Intake QC Check Point
Merge Array Genotypes
Multi-Sample QC Check Point
Consent Check Point
Variant Calling
Variant Normalisation
Tumour Ploidy
Tumour Purity
Tumour Clonality
Mutation Signature
Viral Insertions
Actionable Mutation Coverage
SNV & Indel Refinement
Mutation Burden
Inbreeding Coefficient
Homozygosity Runs
Variant Annotation
Variant Tiering
Interpretation Dispatch
Exomiser
Delivery API
Integrity Check
MD5 Check
Validate BAM Picard
Filtered Bamstats
Unfiltered Bamstats
Q30 Bamstats
VCF QC
Fix Permissions
Plot Filtered Bamstats
Generate Filtered Metrics Bamstats
Plot Unfiltered Bamstats
Generate Q30 Metrics Bamstats
QC Stats Post-processing
Workflow diagram
Data intake
Single Sample QC & Processing
Multi-sample QC
Analysis
Interpretation Request Dispatched
InterpretationAPI
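
The "Integrity Check" and "MD5 Check" steps at data intake can be illustrated with a checksum comparison. This is a minimal sketch using Python's standard `hashlib`, not Bertha's actual implementation; the function names and the idea that the expected checksum arrives alongside the delivered file are assumptions.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading matters here because a BAM file can be ~100 GB
    and must not be loaded into memory at once.
    """
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def passes_md5_check(path: Path, expected_md5: str) -> bool:
    """Return True when the file's checksum matches the expected one."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Hypothetical usage at an intake checkpoint:
#   if not passes_md5_check(Path("sample.bam"), delivered_md5):
#       reject the delivery and request a re-transfer
```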

Bertha
Distributed Workflow Management System
Interpretation Dispatch
Message Broker
Tracking DB
Job Scheduler
Dashboard
DeliveryAPI
Auditor
Orchestrator
Grid
Consumer
Oxford Bus

6 node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response
times for regional queries
• Whole genome filtering
queries for all individuals
within seconds
OpenCGA: storage
Extensive capabilities to query across genotype and phenotype relationships
https://github.com/opencb/opencga

To be fully GA4GH compatible from v1.0
global data standards for Genomics - http://ga4gh.org/

Clinical data
+ 150 tables (+2000 variables)
Administrative & Consent
Clinical / medical reviews
Imaging, blood & non genetic tests
Disease status and phenotype
Family & pedigree
Treatments and clinical history
Security and logs:
CMCs access here
Catalog
Bioinformatics
Oxford

OpenCGA - Catalog
Metadata store and A&A for
OpenCGA
• Manages roles, groups, ACLs
• Audit log
• LDAP integration
• Arbitrary schemas
(annotation sets)

Cellbase: annotation
Reference Genomic data warehouse
• Compared in testing against VEP
• More than 99.999% similarity in Consequence
types
• Phased annotation implemented for
MNVs
• Initial structural variation annotation
• Can annotate 4-5 families per hour
(>8000 variants/s) on a single
database instance
• Will have (very soon) an R package
similar to biomaRt

PanelApp
https://panelapp.extge.co.uk/crowdsourcing/PanelApp

Panel list
https://panelapp.extge.co.uk/crowdsourcing/PanelApp/

● Filter and classify variants
● Well-defined rules, stable across the project
● General: it works for any family configuration
● Implemented using VCF/Cellbase or OpenCGA
● Based on GA4GH variant model
● Uses pedigrees as defined at Genomics England
(based on the phenotips format)
● Uses PanelApp as source of gene panels
Variant Tiering

Variant Tiering
Variant
Filters:
• The variant allele is not commonly found in the general healthy population (set criteria for allele frequency filter)
• Familial segregation
• Allelic state matches known mode of inheritance for the gene and disorder (moi required)
Impact of the variant?
• Known pathogenic (not implemented)
• Expected pathogenic (set criteria: transcript_ablation, splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant)
Is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 1. No: Tier 3.
• Other coding impact (set criteria: inframe_insertion, inframe_deletion, missense_variant, transcript_amplification, splice_region_variant, incomplete_terminal_codon_variant)
Is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 2. No: Tier 3.
• Other: does not fit any of the other criteria
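
The tiering decision can be sketched in a few lines. This is a minimal illustration, not the Genomics England implementation: it assumes the flowchart reads as "expected pathogenic in a green-list gene gives Tier 1, other coding impact in a green-list gene gives Tier 2, either outside the panel gives Tier 3", and that the allele frequency, segregation and mode-of-inheritance filters have already been evaluated. The function name, the `passes_filters` flag and the gene names are hypothetical.

```python
from typing import Optional, Set

# Consequence types listed on the slide.
EXPECTED_PATHOGENIC = {
    "transcript_ablation", "splice_donor_variant", "splice_acceptor_variant",
    "stop_gained", "frameshift_variant", "stop_lost", "initiator_codon_variant",
}
OTHER_CODING_IMPACT = {
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "transcript_amplification", "splice_region_variant",
    "incomplete_terminal_codon_variant",
}


def assign_tier(consequence: str, gene: str, green_list: Set[str],
                passes_filters: bool) -> Optional[str]:
    """Return "Tier 1", "Tier 2", "Tier 3", or None if the variant is untiered.

    passes_filters is assumed to summarise the frequency, segregation and
    mode-of-inheritance checks (an assumption of this sketch).
    """
    if not passes_filters:
        return None
    in_panel = gene in green_list
    if consequence in EXPECTED_PATHOGENIC:
        return "Tier 1" if in_panel else "Tier 3"
    if consequence in OTHER_CODING_IMPACT:
        return "Tier 2" if in_panel else "Tier 3"
    return None  # "Other": does not fit any of the criteria


panel = {"GENE_A", "GENE_B"}  # hypothetical green-list genes
for consequence, gene in [("stop_gained", "GENE_A"),
                          ("missense_variant", "GENE_B"),
                          ("missense_variant", "GENE_C")]:
    print(consequence, gene, assign_tier(consequence, gene, panel, True))
```

Keeping the consequence sets as data rather than code mirrors the "well-defined rules, stable across the project" goal: the rules can be reviewed without reading the logic.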

Cancer
Which cancers?
• Lung
• Breast
• Colon
• Prostate
• Ovary
• Hematological
malignancies (CLL)
• Pediatric Cancers
Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)
Why sequence?
• Disease of disordered
genomes
• >200 driver genes known
• Stratified
Management/targeted
therapy
• Complications:
Heterogeneity

Sequencing cancer genomes
Tumour genome (germline variants + tumour variants)
− Germline genome (germline variants)
= Somatic variation
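
The subtraction on this slide can be written as a set difference. This is a deliberately naive sketch: production somatic callers model tumour and normal reads jointly rather than subtracting two call sets, and the `Variant` type and example coordinates here are hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """A variant identified by position and alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_candidates(tumour: Iterable[Variant],
                       germline: Iterable[Variant]) -> Set[Variant]:
    """Variants seen in the tumour sample but not in the matched germline sample."""
    return set(tumour) - set(germline)


germline_calls = [Variant("chr1", 1000, "A", "G"), Variant("chr2", 500, "C", "T")]
tumour_calls = germline_calls + [Variant("chr7", 4200, "G", "A")]

for variant in sorted(somatic_candidates(tumour_calls, germline_calls)):
    print(variant)
```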

Coverage
Germline Samples: 35x Coverage
• Rare Disease Participants
• Cancer “Normal”
Cancer Samples: 75x Coverage
• Cancer “Tumour” Samples
Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)

Coverage
Why Higher Depth for Cancer?
Normal Contamination
Clonality/Heterogeneity
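
Normal contamination and subclonality both reduce the fraction of reads that carry a somatic variant, which is why depth matters. The sketch below assumes a heterozygous variant in a diploid, copy-neutral region and a simple binomial model of read sampling; the purity, clonal fraction and minimum-read threshold are illustrative values, not project parameters.

```python
from math import comb


def expected_vaf(purity: float, clonal_fraction: float) -> float:
    """Expected variant allele fraction for a heterozygous somatic variant.

    purity: fraction of cells in the sample that are tumour cells.
    clonal_fraction: fraction of tumour cells that carry the variant.
    """
    return 0.5 * purity * clonal_fraction


def prob_at_least(depth: int, vaf: float, min_alt_reads: int) -> float:
    """Binomial probability of seeing at least min_alt_reads variant reads."""
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


vaf = expected_vaf(purity=0.5, clonal_fraction=0.4)  # illustrative values
for depth in (35, 75):
    p = prob_at_least(depth, vaf, min_alt_reads=4)
    print(f"{depth}x: P(at least 4 variant reads) = {p:.2f}")
```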

Cancer Pilot
• Resections/Biopsies are
routinely fixed in formalin and
embedded in paraffin
• Causes DNA damage
• Difficult to extract DNA
• Fresh frozen logistically
difficult & not trusted to
maintain morphology
Fresh Frozen vs Formalin-fixed, paraffin-embedded (FFPE) tumour samples

Cancer Pilot
• Difficulty in obtaining long
fragments
• “Random” DNA damage
• “Cross-links” DNA; the cross-links can be
reversed, but currently only at high
temperatures
• Chimeric fragments in library
preparation
Problems with FFPE
Heat
A T
Repetitive regions re-anneal, causing chimeric reads
GC-rich regions are more robust
FFPE = Formalin-fixed, paraffin-embedded tumour samples

FF Copy Number Data

FFPE Copy Number Data

Fraction of overlapping SNVs
in FF and FFPE samples from 5 trios

Improving FFPE Sequencing
What can we do?
Procedure
Fixation
DNA Extraction
Library Preparation
Cold Ischaemic Time
Storage Conditions
Time of Fixation
Size of Sample
pH of Fixative
Temperature of De-crosslinking
Addition of Salt

Cancer reports
• Quality metrics pre- and post-sequencing
• A small number of clinically actionable mutations
• Germline results which affect cancer development
• Remainder of results are mostly of research interest
for now, but in future may assist:
• Drug development
• Targeted treatment selection
• Prediction of prognosis
• Monitoring of disease progression

The case for whole genomes
• Severe intellectual disability occurs in 0.5% of newborns
• Whole-genome sequencing at 80x of 50 parent-offspring trios in which
the child had no diagnosis for severe intellectual disability.
• Overall 62% increase in diagnostic yield with WGS.
• Most diagnoses were for de-novo dominant mutations, roughly
equally divided in SNVs and CNVs.
Gilissen et al (2014), Nature PMID: 24896178

Why make a genetic diagnosis?
For a patient with
rare disease
• Understand why their
condition happened
• More accurate knowledge of
how it might develop in
future
• Possible treatment avenues
• Early intervention may help
avoid disability
• Contact with others with the
same condition
For the family
• Predict whether family
members will get the
condition
• Offer screening/treatment to
prevent it
• Reproductive decisions
For medical research
• Further our understanding of
disease mechanisms
• Novel drug development or
drug repurposing

Rare disease programme
• Over 200 disorders so far
Data model: describes the clinical
information to be collected for each
disorder
Disorders nominated by the NHS and
academia
Eligibility & exclusion criteria for
recruitment: rare, Mendelian, unmet
clinical diagnostic need, prior genetic
testing
Virtual Gene panel to aid analysis
Challenges
• Equity of diseases for
inclusion
• Tightness of criteria
for patient inclusion
• Equity of WGS
consumption per
phenotype

The biggest challenge?
Interpretation
• ~5-10 million variants in our
genome
• ~3.5 million “known” SNPs
• ~0.5 million “novel” SNPs
• ~0.5 million small indels
• ~1000 large (>500bp) CNVs
• ~20,000-25,000 coding variants
• ~9,000-11,000 non-synonymous
• 92 rare missense variants (MAF
<0.1%)
• 5 rare truncating variants (MAF
<0.1%)
• 0-2 de novo variants

What information is needed?
To aid interpretation of variants
• Allele frequency: How common is the variant in the ‘healthy’
population?
• Familial segregation: Is the variant present in the family
members with the disorder, and not in those without it?
• Mode of inheritance: Does the pattern fit with the
inheritance within the family and what is known about the
gene?
• Likely consequence: Does the variant cause a change in the
protein sequence likely to affect function?
• Gene panel: Is the variant in a gene associated with causing
the disorder?
• Known pathogenicity? Has the variant been seen before in
people with the same disease?

Rare Diseases
Gender
• X chromosome homozygosity, Y chromosome genotyping
rate
• Copy number for X and Y chromosomes
Relatedness
• Mendelian error checking for parent-child pairs
• IBD sharing estimation for all participants
Inbreeding/ excess homozygosity
• Observed vs expected homozygosity
Ancestry
• Multidimensional scaling
Genetic data checks and analyses
herine Smith, Lead Analyst for Rare Disorders (Bioinformatics)
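
Mendelian error checking for a parent-child trio can be sketched as follows. This is a minimal illustration for autosomal, biallelic genotypes only (sex chromosomes and de novo variants need separate handling); the genotype representation as a pair of alleles and the function names are assumptions.

```python
from typing import Iterable, Tuple

Genotype = Tuple[str, str]  # two alleles, e.g. ("A", "G")


def is_mendelian_consistent(child: Genotype, mother: Genotype,
                            father: Genotype) -> bool:
    """True if one child allele can come from each parent."""
    first, second = child
    return ((first in mother and second in father)
            or (second in mother and first in father))


def mendelian_error_rate(sites: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of sites where the (child, mother, father) genotypes are inconsistent."""
    sites = list(sites)
    if not sites:
        raise ValueError("no genotyped sites supplied")
    errors = sum(not is_mendelian_consistent(*site) for site in sites)
    return errors / len(sites)


trio_sites = [
    (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    (("A", "A"), ("A", "G"), ("G", "G")),  # inconsistent: father has no A
]
print(mendelian_error_rate(trio_sites))
```

A high error rate across many sites suggests a sample swap or a misreported relationship rather than true de novo variation.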

Rare Disease Pilot
4800 people
Primary Data
• 4,128 participants
data cleansed
• (15,065 including
family members),
• 149 different
conditions.
• 56,004 HPO terms
used
• 12,966 terms present
• 43,088 terms absent
Secondary Data
• Hospital Episodes
• 250,000 records
• 11,910 - Accident
Dept
• 37,479 - Inpatient
• 199,418 - Outpatients

Rare disease pilot – 4,919 samples

Georgia
Georgia and her family
Image courtesy of Great Ormond
Street Hospital
• Undiagnosed condition that
included physical and mental
developmental delay, a rare eye
condition affecting sight, impaired
kidney function, verbal dyspraxia.
• Through enrolling in the project, a
mutation in a single gene was
found in Georgia’s genome which
is likely to be the cause of her
condition.
• Provides a molecular diagnosis for
her condition for the first time.
Maria Bitner-Glindzicz –
Great Ormond Street Hospital
http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/

Jessica
Jessica and her family.
Image courtesy of Great Ormond
Street Hospital.
“Now that we have this diagnosis there are
things that we can do differently almost
straight away. Her condition is one that has a
high chance of improvement on a special
diet, which means that her medication
dose is likely to decrease and her epilepsy
may be more easily controlled. Hopefully she
might have better balance so she can be
more stable and walk more…”
“…More than anything the outcome of the
project has taken the uncertainty out of life
for us and the worry of not knowing what was
wrong. It has allowed us to feel like we can
take control of things and make positive
changes for Jessica. It may also open doors to
other research projects that we can go on.
These could be more specific to her condition
and we are hopeful that they could one
day find a cure.”
http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/
Mum, Kate Palmer:
