100,000 Genomes Project.



  1. 1. The 100,000 Genomes Project David Montaner Bioinformatics Department david.montaner@genomicsengland.co.uk Valencia University, October 6th 2016
  2. 2. Talk Outline: 1. Introduction & Background 2. Pipelines 3. Systems and Databases 4. Cancer 5. Rare Diseases
  3. 3. The 100,000 Genomes Project: Genomics England & Partners
  4. 4. Genomics England: Who are we & what are we doing? • Owned by the Department of Health, UK • Set up to deliver the 100,000 Genomes Project: 100,000 whole genome sequences of National Health Service (NHS) patients with rare diseases (and family members) or cancer. Aims: • Create an ethical and transparent programme based on consent • Establish the infrastructure, human capacity & capability to set up a genomic medicine service for the NHS and bring benefit to patients • Enable new scientific discovery and medical insights, and add to the already extensive databases on human variation • Work with the NHS, academics and industry to make the UK a world leader in genomic medicine • Generate health & wealth
  5. 5. • Sequence 100,000 genomes • Cancer and rare genetic disease • Capture data delivered electronically, store it securely and analyse it within an English data centre (reading library) • Combine genomes with extracted clinical information for analysis, interpretation, and aggregation • Create capacity, capability and legacy in personalised medicine for the UK. Goals of Genomics England: 1. To bring benefit to NHS patients 2. To enable new scientific discovery and medical insights 3. To create an ethical and transparent programme based on consent 4. To kickstart the development of a UK genomics industry
  6. 6. Inception of the 100,000 genomes project (2012, 2014) “If we get this right, we could transform how we diagnose and treat our most complex diseases not only here but across the world” (December 2012) “I am determined to do all I can to support the health and scientific sector to unlock the power of DNA, turning an important scientific breakthrough into something that will help deliver better tests, better drugs and above all better care for patients.” (August 2014)
  7. 7. Schedule • 2012-2014: consortium creation • 2014-2015: pilot studies • 2016-2015: main project
  8. 8. Where are we? London
  9. 9. Where are we? • London: management; all data storage • Cambridge: software for genomic data storage • Oxford: software for clinical data storage and collection
  10. 10. Recruitment and clinical interface: 13 “GMCs”, Scotland and Northern Ireland • Genomic Medicine Centres (GMCs): networks of NHS hospitals including genomics labs; 13 “Lead Organisations” plus 71 “Local Delivery Partners”; contracted by NHS England; cover recruitment, data and return of results • Scotland: doing own sequencing • Northern Ireland: similar to a GMC; contracted by NI payer
  11. 11. The Journey of a Genome: Consent & sample collection • DNA extraction • Biorepository • Sequencing • Variant calling • Interpretation • Feedback to clinician • Validation • Treatment
  12. 12. The Journey of a Genome: Partners. Stages: Consent & sample collection • DNA extraction • Biorepository • Sequencing • Variant calling • Interpretation • Feedback to clinician • Validation • Treatment. Partners: Genomic Medicine Centres (GMCs), 13 NHS organisations • Genomics England Clinical Interpretation Partnerships (GeCIPs), collaborations of clinicians & academics, > 2,000 researchers • Clinical interpretation companies: Omicia, Congenica, Nextcode • HiSeq X Ten
  13. 13. GENE Consortium (Genomics Expert Network for Enterprises) • Working together on a year-long Industry Trial involving a selection of whole genome sequences across cancer and rare diseases • Aims to identify the most effective and secure way to accelerate development of new diagnostics and treatments for patients • Working in a pre-competitive environment • Members: AbbVie, Alexion Pharmaceuticals, AstraZeneca, Berg Health, Biogen, Dimension Therapeutics, GSK, Helomics, NGM Biopharmaceuticals, Roche, Takeda
  14. 14. Simplified Workflow: BAM file from Illumina • Variant calling pipelines: VCF file • QC1 • QC2 • Variant annotation • Tiering of variants • Dispatch • Clinical interpretation • QC portal • Reporting portal • Medical review • Validation • Genomic Medicine Centre (GMC)
  15. 15. Bioinformatics Team Role: Consent & sample collection • DNA extraction • Biorepository • Sequencing • Variant calling • Interpretation • Feedback to clinician • Validation • Treatment
  16. 16. Genomics Education Health Education England • MSc in Genomic Medicine • 10 Universities across the UK • Online training courses and resources • The fundamentals of genomics • Sample handling and DNA extraction • Bioinformatics • How to support patients through the consent process Genomics England Communications Team
  17. 17. Update on numbers: at about 10% • >10,000 genomes received • >1PB of primary data • >1.3M files received or generated and indexed • 200M germline variants databased • 48M somatic variants databased • 70,000 HPO terms asserted • >450,000 hospital episodes
  18. 18. 100,000 Genomes: Huge Amount of Data • Rare Disease: each genome 100 GB; trio is preferred, so 300 GB per participant; x 50,000 participants = 15,000,000 GB total • Cancer: germline 100 GB; tumour 200 GB; 300 GB per patient; x 25,000 participants = 7,500,000 GB total • 10,000,000 GB = 10 Petabytes • Expecting around 30 Petabytes • For scale: 10 Billion Photos = 1.5 Petabytes; Data Processed in 1 day = 20 Petabytes
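The storage estimate on this slide is simple multiplication, so it can be checked in a few lines. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and reads the slide's "Gb" as gigabytes; the function and variable names are illustrative only and do not come from any Genomics England tooling.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def cohort_storage_gb(gb_per_participant: int, participants: int) -> int:
    """Total storage in GB for a cohort with a fixed per-participant footprint."""
    return gb_per_participant * participants


# Per-participant figures as stated on the slide.
rare_disease_gb = cohort_storage_gb(3 * 100, 50_000)  # trio of 100 GB genomes
cancer_gb = cohort_storage_gb(100 + 200, 25_000)      # germline + tumour

total_pb = (rare_disease_gb + cancer_gb) / GB_PER_PB

print(f"Rare disease: {rare_disease_gb:,} GB")
print(f"Cancer:       {cancer_gb:,} GB")
print(f"Total:        {total_pb:.1f} PB")
```

Multiplying out the stated per-participant figures gives 22.5 PB of primary data, somewhat below the "around 30 Petabytes" the slide says is expected.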
  19. 19. Pipelines
  20. 20. bertha_default 1.1.0 workflow diagram. Stages: Data intake • Single-sample QC & processing • Multi-sample QC • Analysis • Interpretation request dispatched. Steps shown: Cross-sample contamination • Single-sample QC check point • Identity by descent • Mendelian inconsistency rate • Sex check • Somatic VCF re-headering • Tumour cross-sample contamination • Cross-species contamination • Depth of coverage • Concordance check • Intake QC check point • Merge array genotypes • Multi-sample QC check point • Consent check point • Variant calling • Variant normalisation • Tumour ploidy • Tumour purity • Tumour clonality • Mutation signature • Viral insertions • Actionable mutation coverage • SNV & indel refinement • Mutation burden • Inbreeding coefficient • Homozygosity runs • Variant annotation • Variant tiering • Interpretation dispatch • Exomiser • Delivery API • Integrity check • MD5 check • Validate BAM • Picard • Filtered bamstats • Unfiltered bamstats • Q30 bamstats • VCF QC • Fix permissions • Plot filtered bamstats • Generate filtered metrics bamstats • Plot unfiltered bamstats • Generate Q30 metrics bamstats • QC stats post-processing • Interpretation API
  21. 21. Bertha: Distributed Workflow Management System. Components: Interpretation Dispatch • Message Broker • Tracking DB • Job Scheduler • Dashboard • Delivery API • Auditor • Orchestrator • Grid Consumer • Oxford Bus
  22. 22. OpenCGA: storage • Extensive capabilities to query across genotype and phenotype relationships • 6 node Hadoop cluster: Transform: 97 min; Load: 80 sec; Merge: 84 sec • Millisecond response times for regional queries • Whole genome filtering queries for all individuals within seconds • https://github.com/opencb/opencga
  23. 23. To be fully GA4GH compatible from v1.0 • GA4GH: global data standards for genomics - http://ga4gh.org/
  24. 24. Clinical data: + 150 tables (+2000 variables) • Administrative & consent • Clinical / medical reviews • Imaging, blood & non-genetic tests • Disease status and phenotype • Family & pedigree • Treatments and clinical history • Security and logs: CMCs access here • Catalog • Bioinformatics • Oxford
  25. 25. OpenCGA - Catalog: metadata store and authentication & authorisation (A&A) for OpenCGA • Manages roles, groups, ACLs • Audit log • LDAP integration • Arbitrary schemas (annotation sets)
  26. 26. Cellbase: annotation. Reference genomic data warehouse • Compared in testing against VEP: more than 99.999% similarity in consequence types • Phased annotation implemented for MNVs • Initial structural variation annotation • Can annotate 4-5 families per hour (>8000 variants/s) on a single database instance • Will have (very soon) an R package similar to biomaRt
  27. 27. PanelApp https://panelapp.extge.co.uk/crowdsourcing/PanelApp
  28. 28. Panel list https://panelapp.extge.co.uk/crowdsourcing/PanelApp/
  29. 29. Platform for interpretation
  30. 30. Variant Tiering ● Filter and classify variants ● Well-defined rules, stable across the project ● General: it works for any family configuration ● Implemented using VCF/Cellbase or OpenCGA ● Based on GA4GH variant model ● Uses pedigrees as defined at Genomics England (based on PhenoTips format) ● Uses PanelApp as source of gene panels
  31. 31. Variant Tiering (decision flow) • Variant filters: the variant allele is not commonly found in the general healthy population (set criteria for allele frequency filter); familial segregation; allelic state matches known mode of inheritance for the gene and disorder (moi required) • Known pathogenic: not implemented • Impact of the variant? • Expected pathogenic (set criteria: transcript_ablation, splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant): is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 1. No: Tier 3 • Other coding impact (set criteria: inframe_insertion, inframe_deletion, missense_variant, transcript_amplification, splice_region_variant, incomplete_terminal_codon_variant): is the variant in a gene in the Virtual Gene Panel (green list) for that disorder? Yes: Tier 2. No: Tier 3 • Other: does not fit any of the other criteria
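The tiering flowchart on this slide can be read as a small decision function. The following is a minimal sketch of that reading, not the Genomics England implementation: the `Variant` class, its boolean filter fields, the `tier_variant` function and the gene names are all hypothetical, and returning `None` for variants that fail the filters or fall into the "Other" branch is an assumption, since the slide does not say what happens to them.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Consequence terms copied from the slide's two "set criteria" lists.
EXPECTED_PATHOGENIC = {
    "transcript_ablation", "splice_donor_variant", "splice_acceptor_variant",
    "stop_gained", "frameshift_variant", "stop_lost", "initiator_codon_variant",
}
OTHER_CODING_IMPACT = {
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "transcript_amplification", "splice_region_variant",
    "incomplete_terminal_codon_variant",
}


@dataclass
class Variant:
    """Hypothetical record; the three booleans stand in for upstream filters."""
    gene: str
    consequence: str
    is_rare: bool       # passes the allele frequency filter
    segregates: bool    # familial segregation holds
    matches_moi: bool   # allelic state matches the known mode of inheritance


def tier_variant(variant: Variant, green_genes: Set[str]) -> Optional[int]:
    """Return 1, 2 or 3, or None when the variant is left untiered (assumption)."""
    if not (variant.is_rare and variant.segregates and variant.matches_moi):
        return None
    in_panel = variant.gene in green_genes
    if variant.consequence in EXPECTED_PATHOGENIC:
        return 1 if in_panel else 3
    if variant.consequence in OTHER_CODING_IMPACT:
        return 2 if in_panel else 3
    return None  # "Other": does not fit any of the other criteria


panel = {"GENE_A"}  # hypothetical green list for one disorder
print(tier_variant(Variant("GENE_A", "stop_gained", True, True, True), panel))
print(tier_variant(Variant("GENE_B", "missense_variant", True, True, True), panel))
```

Keeping the consequence lists as plain sets makes the rules easy to audit against the slide, which matters for rules that are meant to stay stable across the project.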
  32. 32. The Cancer Programme
  33. 33. Cancer. Which cancers? • Lung • Breast • Colon • Prostate • Ovary • Hematological malignancies (CLL) • Pediatric cancers. Why sequence? • Disease of disordered genomes • >200 driver genes known • Stratified management/targeted therapy • Complications: heterogeneity. Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)
  34. 34. Sequencing cancer genomes • Tumour genome: tumour variants • Germline genome: germline variants • Somatic variation = tumour variants minus germline variants
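The idea on this slide, that somatic variation is what remains of the tumour calls once the germline calls are taken away, can be illustrated as a set difference. This is a deliberately simplified sketch: the `Variant` tuple and the example coordinates are made up for illustration, and production somatic callers typically analyse tumour and normal reads jointly rather than subtracting two call sets.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant key: position plus alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_variants(tumour: Set[Variant], germline: Set[Variant]) -> Set[Variant]:
    """Variants called in the tumour that are absent from the germline sample."""
    return tumour - germline


# Made-up example calls.
germline_calls = {Variant("1", 1000, "A", "G"), Variant("2", 500, "C", "T")}
tumour_calls = germline_calls | {Variant("7", 4200, "G", "A")}

print(somatic_variants(tumour_calls, germline_calls))
```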
  35. 35. Coverage: high depth • Germline samples, 35x coverage: rare disease participants; cancer “normal” samples • Cancer samples, 75x coverage: cancer “tumour” samples. Dr Matthew Parker, Lead Analyst for Cancer (Bioinformatics)
  36. 36. Why Higher Depth for Cancer? • Normal contamination • Clonality/heterogeneity • Coverage
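One way to see why normal contamination and clonality call for deeper sequencing is to estimate how many reads would support a somatic variant. The sketch below rests on simplifying assumptions that are not on the slide: a heterozygous variant in a copy-neutral diploid region, so that the expected variant allele fraction is purity times clonal fraction divided by two. The function names and the example purity and clonality values are hypothetical.

```python
def expected_vaf(purity: float, clonal_fraction: float) -> float:
    """Expected variant allele fraction for a heterozygous somatic variant.

    Assumes a copy-neutral diploid region: only tumour cells (purity) in the
    carrying clone (clonal_fraction) contribute, on one of two alleles.
    """
    return purity * clonal_fraction / 2


def expected_alt_reads(depth: float, purity: float, clonal_fraction: float) -> float:
    """Mean number of reads expected to carry the variant at a given depth."""
    return depth * expected_vaf(purity, clonal_fraction)


# Hypothetical sample: 50% tumour cells, variant present in half of them.
for depth in (35, 75):
    reads = expected_alt_reads(depth, purity=0.5, clonal_fraction=0.5)
    print(f"{depth}x: about {reads:.1f} supporting reads")
```

Under these assumptions the variant sits at an allele fraction of 12.5%, so 35x yields only about four supporting reads on average while 75x yields about nine, which leaves more room to separate a real subclonal variant from sequencing error.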
  37. 37. Cancer Pilot: Fresh Frozen vs Formalin-fixed, paraffin-embedded (FFPE) tumour samples • Resections/biopsies are routinely fixed in formalin and embedded in paraffin • Causes DNA damage • Difficult to extract DNA • Fresh frozen logistically difficult & not trusted to maintain morphology
  38. 38. Cancer Pilot: Problems with FFPE • Difficulty in obtaining long fragments • “Random” DNA damage • “Cross-links” DNA which can be reversed – but currently at high temperatures • Chimeric fragments in library preparation • Figure labels: Heat; AT repetitive regions re-anneal, causing chimeric reads; GC-rich regions are more robust • FFPE = Formalin-fixed, paraffin-embedded tumour samples
  39. 39. Read Alignment
  40. 40. GC Content
  41. 41. FF Copy Number Data
  42. 42. FFPE Copy Number Data
  43. 43. Fraction of overlapping SNVs in FF and FFPE samples from 5 trios
  44. 44. Improving FFPE Sequencing: What can we do? Stages: Procedure • Fixation • DNA extraction • Library preparation. Factors: Cold ischaemic time • Storage conditions • Time of fixation • Size of sample • pH of fixative • Temperature of de-crosslinking • Addition of salt. FFPE = Formalin-fixed, paraffin-embedded tumour samples
  45. 45. Cancer reports • Quality metrics pre- and post-sequencing • A small number of clinically actionable mutations • Germline results which affect cancer development • Remainder of results are mostly of research interest for now, but in future may assist: drug development; targeted treatment selection; prediction of prognosis; monitoring of disease progression
  46. 46. Rare Disease Programme
  48. 48. The case for whole genomes • Severe intellectual disability occurs in 0.5% of newborns • Whole-genome sequencing at 80x in 50 parent-offspring trios with no diagnosis for their severe intellectual disability • Overall 62% increase in diagnostic yield with WGS • Most diagnoses were for de novo dominant mutations, roughly equally divided between SNVs and CNVs • Gilissen et al. (2014), Nature, PMID: 24896178
  49. 49. Why make a genetic diagnosis? For a patient with rare disease: • Understand why their condition happened • More accurate knowledge of how it might develop in future • Possible treatment avenues • Early intervention may help avoid disability • Contact with others with the same condition. For the family: • Predict whether family members will get the condition • Offer screening/treatment to prevent it • Reproductive decisions. For medical research: • Further our understanding of disease mechanisms • Novel drug development or drug repurposing
  50. 50. Rare disease programme • Over 200 disorders so far • Data model: describes the clinical information to be collected for each disorder • Disorders nominated by the NHS and academia • Eligibility & exclusion criteria for recruitment: rare, mendelian, unmet clinical diagnostic need, prior genetic testing • Virtual gene panel to aid analysis. Challenges: • Equity of diseases for inclusion • Tightness of criteria for patient inclusion • Equity of WGS consumption per phenotype
  51. 51. The biggest challenge? Interpretation • ~5-10 million variants in our genome • ~3.5 million “known” SNPs • ~0.5 million “novel” SNPs • ~0.5 million small indels • ~1000 large (>500bp) CNVs • ~20,000-25,000 coding variants • ~9,000-11,000 non-synonymous • 92 rare missense variants (MAF <0.1%) • 5 rare truncating variants (MAF <0.1%) • 0-2 de novo variants
  52. 52. What information is needed to aid interpretation of variants? • Allele frequency: How common is the variant in the ‘healthy’ population? • Familial segregation: Is the variant present in the family members with the disorder, and not in those without it? • Mode of inheritance: Does the pattern fit with the inheritance within the family and what is known about the gene? • Likely consequence: Does the variant cause a change in the protein sequence likely to affect function? • Gene panel: Is the variant in a gene associated with causing the disorder? • Known pathogenicity: Has the variant been seen before in people with the same disease?
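The familial segregation question above translates almost directly into set operations. This is a minimal sketch with hypothetical names and a made-up family; it takes the slide's wording literally (every affected member carries the variant, no unaffected member does), so it ignores incomplete penetrance and unaffected carriers of recessive alleles.

```python
from typing import Set


def segregates(carriers: Set[str], affected: Set[str], unaffected: Set[str]) -> bool:
    """True if all affected relatives carry the variant and no unaffected one does."""
    return affected <= carriers and carriers.isdisjoint(unaffected)


# Made-up family: proband and mother affected, father unaffected.
affected_members = {"proband", "mother"}
unaffected_members = {"father"}

print(segregates({"proband", "mother"}, affected_members, unaffected_members))
print(segregates({"proband", "father"}, affected_members, unaffected_members))
```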
  53. 53. Rare Diseases: Genetic data checks and analyses. Gender: • X chromosome homozygosity, Y chromosome genotyping rate • Copy number for X and Y chromosomes. Relatedness: • Mendelian error checking for parent-child pairs • IBD sharing estimation for all participants. Inbreeding/excess homozygosity: • Observed vs expected homozygosity. Ancestry: • Multidimensional scaling. herine Smith, Lead Analyst for Rare Disorders (Bioinformatics)
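Of the checks listed on this slide, Mendelian error checking is the easiest to show in code. The sketch below is an illustration for autosomal, biallelic-style genotypes in a full trio, with hypothetical function names and made-up genotypes; it is not the project's pipeline code, and it would count a genuine de novo variant as an inconsistency.

```python
from typing import List, Tuple

Genotype = Tuple[str, str]  # two alleles at one autosomal site


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if one child allele can come from the mother and the other from the father."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mendelian_error_rate(sites: List[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, mother, father) sites that violate Mendelian inheritance."""
    if not sites:
        raise ValueError("at least one site is required")
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in sites)
    return errors / len(sites)


# Made-up genotypes: the second site is inconsistent (the mother has no G allele).
trio_sites = [
    (("A", "G"), ("A", "A"), ("G", "G")),
    (("G", "G"), ("A", "A"), ("A", "G")),
]
print(mendelian_error_rate(trio_sites))
```

A high error rate across many sites suggests a sample swap or a mis-recorded pedigree rather than biology, which is why the check sits alongside relatedness estimation.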
  54. 54. Rare Disease Pilot: 4800 people. Primary data: • 4,128 participants data cleansed (15,065 including family members) • 149 different conditions • 56,004 HPO terms used: 12,966 terms present, 43,088 terms absent. Secondary data: • Hospital episodes: 250,000 records • 11,910 Accident Dept • 37,479 Inpatient • 199,418 Outpatients
  55. 55. Rare disease pilot – 4,919 samples
  56. 56. Relatedness checking
  57. 57. Georgia • Undiagnosed condition that included physical and mental developmental delay, a rare eye condition affecting sight, impaired kidney function, verbal dyspraxia. • Through enrolling in the project, a mutation in a single gene was found in Georgia’s genome which is likely to be the cause of her condition. • Provides a molecular diagnosis for her condition for the first time. Maria Bitner-Glindzicz, Great Ormond Street Hospital. Image: Georgia and her family, courtesy of Great Ormond Street Hospital. http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/
  58. 58. Jessica. Mum, Kate Palmer: “Now that we have this diagnosis there are things that we can do differently almost straight away. Her condition is one that has a high chance of improvement on a special diet, which means that her medication dose is likely to decrease and her epilepsy may be more easily controlled. Hopefully she might have better balance so she can be more stable and walk more…” “…More than anything the outcome of the project has taken the uncertainty out of life for us and the worry of not knowing what was wrong. It has allowed us to feel like we can take control of things and make positive changes for Jessica. It may also open doors to other research projects that we can go on. These could be more specific to her condition and we are hopeful that they could one day find a cure.” Image: Jessica and her family, courtesy of Great Ormond Street Hospital. http://www.genomicsengland.co.uk/first-children-recieve-diagnoses-through-100000-genomes-project/
  59. 59. Thank you!