Detecting Somatic Mutation - Ensemble Approach


  1. Detecting Somatic Mutations in Impure Cancer Samples - Ensemble Approach. 김광중, 홍창범, KT GenomeCloud. 2015 Korea Genome Organization (한국유전체학회) Winter Symposium, Workshop on Bioinformatics Analysis Using NGS Data, 2015.2.4~2.5
  2. Overview: Challenge; Literature Search; Mutation callers; Comparison of mutation callers; Simple consensus approach; Integrated/Ensemble Approach; Summary
  3. Challenge (Motivation): Sequencing reads from tumor samples are diluted by reads from normal cells, which lowers the signal-to-noise ratio: SNVs at 5% allele frequency cannot be called with high significance. The genomes of primary tumors are genetically heterogeneous, with frequent rearrangements, copy number alterations, and subclones. Highly sensitive and specific mutation-calling methods are therefore needed.
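To make the dilution problem on the Challenge slide concrete, here is a minimal sketch that is not taken from the slides. It assumes a heterozygous somatic SNV in a copy-neutral diploid region, so the expected variant allele frequency is purity × clonal fraction / 2, and it assumes variant read counts follow a simple binomial model. The function names and the threshold of 5 supporting reads are illustrative assumptions only.

```python
from math import comb


def expected_vaf(purity: float, clonal_fraction: float = 1.0) -> float:
    """Expected VAF of a heterozygous SNV in a diploid, copy-neutral region (assumption)."""
    return purity * clonal_fraction / 2.0


def prob_at_least(depth: int, vaf: float, min_alt_reads: int) -> float:
    """P(at least min_alt_reads variant reads) under a binomial read-sampling model."""
    below = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - below


# A tumor sample with only 10% tumor cells gives an expected VAF of 5%.
vaf_low_purity = expected_vaf(purity=0.10)

# Hypothetical requirement: at least 5 variant-supporting reads.
p_50x = prob_at_least(depth=50, vaf=vaf_low_purity, min_alt_reads=5)
p_500x = prob_at_least(depth=500, vaf=vaf_low_purity, min_alt_reads=5)

print(f"expected VAF: {vaf_low_purity:.3f}")
print(f"P(>=5 alt reads) at 50x:  {p_50x:.3f}")
print(f"P(>=5 alt reads) at 500x: {p_500x:.3f}")
```

Under these assumptions, a 5% variant is rarely supported by enough reads at typical WES or WGS depth, while much deeper coverage makes it routinely visible.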
  4. Terminologies
  5. Literature Search: Comparison of mutation callers
  6. Literature Search: Comparison of mutation callers - Data Sets. Whole genome sequencing (WGS): melanoma sample and matched blood, 90% tumor content, paired-end reads, ~50x coverage. Whole exome sequencing (WES) tumor samples: 18 lung tumor-normal pairs, 70~80% tumor content, paired-end reads, 63x. Cell lines: 7 lung cancer cell lines, paired-end reads, 233x.
  7. Literature Search: Comparison of mutation callers - Experiments. sSNV detection tools. Validation: PCR and direct sequencing of genomic DNA on only deleted, functionally important sSNVs. Simulation: 10 tumor-normal pairs (WES), 100x coverage.
  8. Literature Search: Comparison of mutation callers - Results (1)
  9. Literature Search: Comparison of mutation callers - Results (2). Synthetic data: 10 tumor-normal pairs (WES), 100x coverage.
  10. Literature Search: Comparison of mutation callers - Summary of Findings. VarScan2 performed the best, followed by MuTect. VarScan2 is better at higher allele frequencies; MuTect is more sensitive at low allele frequencies. Strand-bias filtering is useful: it eliminates many false positives, a common problem with Illumina sequencing data. Still a challenge: how to discern sSNVs from normal alternate alleles. To call ultra-rare sSNVs, targeted deep sequencing is recommended over WES or WGS.
  11. Literature Search: Simple consensus approach
  12. Literature Search: Simple consensus approach - Data Sets. Whole exome sequencing (WES): 27 ovarian tumors and their matched germline samples; HiSeq 2000 sequencer, 100 bp paired-end reads; mean coverage ranged from 102~225x in tumors and 119~118x in germlines. Validation: Sanger sequencing. Somatic SNV detection programs: JointSNVMix2, MuTect, and SomaticSniper, which implement sophisticated detection algorithms and are used in major tumor sequencing studies.
  13. Literature Search: Simple consensus approach - Identification of somatic SNVs. MuTect (v1.0.27783): only the default parameter set was applied; calls not labeled 'REJECT'. JointSNVMix2 (v0.7.5): default prior genotype probabilities used for the training set; joint probability of 0.9999 or greater. SomaticSniper (v1.0.0): joint genotyping mode (-J option); default prior probability of a somatic mutation (0.01); reads with a mapping quality of 0 were filtered; predictions with a 'somatic score' of 40 or greater. SAMtools mpileup: mapping qualities, directionality, and depth of reads.
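The per-caller cutoffs on this slide can be sketched as a single predicate. This is a minimal illustration, assuming each call has already been parsed into a small record; the `Call` class and its field names are hypothetical and do not correspond to any caller's actual output format. Only the thresholds (not 'REJECT', joint probability ≥ 0.9999, somatic score ≥ 40) come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    """Hypothetical, already-parsed record for one predicted SNV."""
    caller: str                           # "MuTect", "JointSNVMix2" or "SomaticSniper"
    judgement: Optional[str] = None       # MuTect label, e.g. "KEEP" or "REJECT"
    joint_prob: Optional[float] = None    # JointSNVMix2 joint somatic probability
    somatic_score: Optional[int] = None   # SomaticSniper somatic score


def passes_threshold(call: Call) -> bool:
    """Apply the caller-specific cutoff listed on the slide."""
    if call.caller == "MuTect":
        return call.judgement is not None and call.judgement != "REJECT"
    if call.caller == "JointSNVMix2":
        return call.joint_prob is not None and call.joint_prob >= 0.9999
    if call.caller == "SomaticSniper":
        return call.somatic_score is not None and call.somatic_score >= 40
    raise ValueError(f"unknown caller: {call.caller}")


# Made-up example records, for illustration only.
examples = [
    Call("MuTect", judgement="KEEP"),
    Call("MuTect", judgement="REJECT"),
    Call("JointSNVMix2", joint_prob=0.99995),
    Call("SomaticSniper", somatic_score=35),
]
kept = [c for c in examples if passes_threshold(c)]
print(len(kept), "of", len(examples), "example calls pass")
```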
  14. Literature Search: Simple consensus approach - Prediction Results. Total read depth of 8 or greater in both tumor and normal; mutant allele frequency of ≥20% in tumor and ≥5% germline; mutant allele supported by reads mapping in both forward and reverse orientations; variant call in only one tumor (with the exception of the BRAF V600 and KRAS G12/13 hotspots). Combined total of 9,226 somatic SNV predictions; median of 321 predictions per sample (range 147~695). SomaticSniper and JointSNVMix2 made the most predictions per sample (medians 171 and 173); MuTect was more conservative (median 115).
  15. Literature Search: Simple consensus approach - Properties of Predictions. Non-reference allele frequency in germline: SomaticSniper (S) and JointSNVMix2 (J) calls have a substantial number of reads with non-reference alleles, letting a significant number of germline variants into the call set; MuTect (M) is much more stringent on evidence for non-reference alleles. Non-reference allele frequency in tumor: calls made by only one or two programs have a lower proportion of non-reference reads; they do not have sufficient allelic ratios to be predicted as somatic by every program, but have enough support to rise above the thresholds of at least one program.
  16. Literature Search: Simple consensus approach - Validation Results
  17. Literature Search: Simple consensus approach - Filtering Results. Taking consensus between GATK Unified Genotyper; mate-pair rescue read filtering; minimum read depth of 10 in both the tumor and germline; 2 true positives.
  18. Literature Search: Simple consensus approach - Summary of Findings. A powerful method for increasing the validation rate while maintaining maximum sensitivity. Similar effects are likely to influence other bioinformatics classification problems. The approach may prove effective for a variety of genomics and bioinformatics analyses.
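The idea of a simple consensus can be sketched as vote counting: keep a variant when at least k of the n callers report it. This is a minimal illustration under the assumption that each caller's output has already been reduced to a set of (chrom, pos, ref, alt) tuples; the variants below are made-up toy data, not results from the cited study.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def consensus(calls_by_caller: Dict[str, Set[Variant]], min_callers: int) -> Set[Variant]:
    """Return variants reported by at least min_callers of the callers."""
    votes = Counter(v for calls in calls_by_caller.values() for v in calls)
    return {v for v, n in votes.items() if n >= min_callers}


# Toy data for illustration only.
a = ("17", 100, "A", "T")
b = ("17", 200, "C", "G")
c = ("17", 300, "G", "A")
d = ("17", 400, "T", "C")
toy_calls = {
    "MuTect": {a, b},
    "JointSNVMix2": {a, b, c},
    "SomaticSniper": {a, c, d},
}

all_three = consensus(toy_calls, min_callers=3)     # strictest: every caller agrees
at_least_two = consensus(toy_calls, min_callers=2)  # majority vote
print(len(all_three), len(at_least_two))
```

Raising `min_callers` trades sensitivity for a higher expected validation rate, which is the trade-off the consensus slides discuss.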
  19. Integrated/Ensemble Approach
  20. Integrated/Ensemble Approach. Ensemble: using multiple learning algorithms to obtain better predictive performance (three somatic SNV callers: SomaticSniper, MuTect, and VarScan2). Integrated: for better performance, we use additional filtering. GATK Unified Genotyper: filtering for SNVs predicted in the tumor but not the germline. Scoring system: helps us identify strong and relevant mutation candidates.
  21. Integrated/Ensemble Approach (workflow diagram). Subject: tumor.bam, normal.bam → MuTect somatic.vcf, VarScan2 somatic.vcf (via SAMtools mpileup), SomaticSniper somatic.vcf → Consensus (gatk) somatic.vcf; GATK tumor.vcf, GATK normal.vcf → filtered (GATK) somatic.vcf; COSMIC, CCLE validated somatic list → validated (GATK) somatic.vcf.
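The workflow on this slide can be read as a chain of set operations. The sketch below is one possible reading, not the authors' implementation: it assumes that "consensus" means called by all three somatic callers, that GATK normal calls are subtracted as germline, that the remainder must also appear in the GATK tumor calls, and that validation is an intersection with known COSMIC/CCLE sites. All function and variable names are hypothetical, and the variants are toy data.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def ensemble_filter(
    caller_calls: Dict[str, Set[Variant]],
    gatk_tumor: Set[Variant],
    gatk_normal: Set[Variant],
    known_sites: Set[Variant],
) -> Dict[str, Set[Variant]]:
    """Run the assumed consensus -> germline exclusion -> tumor check -> validation chain."""
    if not caller_calls:
        raise ValueError("at least one caller is required")
    consensus = set.intersection(*caller_calls.values())  # called by every caller
    not_germline = consensus - gatk_normal                # drop GATK germline calls
    in_tumor = not_germline & gatk_tumor                  # keep calls GATK also sees in tumor
    validated = in_tumor & known_sites                    # overlap with the validation list
    return {
        "consensus": consensus,
        "not_germline": not_germline,
        "in_tumor": in_tumor,
        "validated": validated,
    }


# Toy data for illustration only.
a = ("17", 100, "A", "T")
b = ("17", 200, "C", "G")
c = ("17", 300, "G", "A")
d = ("17", 400, "T", "C")
stages = ensemble_filter(
    caller_calls={"MuTect": {a, b, c, d}, "VarScan2": {a, b, c}, "SomaticSniper": {a, b, c}},
    gatk_tumor={a, c},
    gatk_normal={c},
    known_sites={a},
)
for name, variants in stages.items():
    print(name, len(variants))
```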
  22. Integrated/Ensemble Approach: Data Sets (CGHub). https://cghub.ucsc.edu/datasets/benchmark_download.html CGHub (Cancer Genomics Hub): a resource of the National Cancer Institute for the Cancer Genome Atlas (TCGA) consortium and related projects.
  23. Integrated/Ensemble Approach: TCGA Benchmark 4. Three parts to the mutation calling exercise. Cell lines derived from grade 3 breast ductal carcinomas (breast cancer): HCC1143 (50x) vs. HCC1143 BL (60x); HCC1954 (58x) vs. HCC1954 BL (71x). Simulated normal contamination and subclone expansion for both. Total: 28 .bam files, ~4.3 TB.
  24. Integrated/Ensemble Approach: download using GeneTorrent. GeneTorrent: client software for downloading sequence data from CGHub's repository; two main programs: gtdownload and cgquery. Get a key: public key at https://cghub.ucsc.edu/software/downloads/cghub_public.key ; TCGA key: requires approval to access the restricted data from the ICGC-DACO. Download by uuid (xml file).
  25. Integrated/Ensemble Approach: Validation Data Sets. COSMIC (Catalogue of Somatic Mutations in Cancer) Cell Lines Project, Wellcome Trust Sanger Institute: http://cancer.sanger.ac.uk/cancergenome/projects/cell_lines/ ; CCLE (Cancer Cell Line Encyclopedia), Broad Institute and Novartis Institute for Biomedical Research: http://www.broadinstitute.org/ccle/home
  26. Integrated/Ensemble Approach: Validation Data Sets (18): 17:5445207-5445207, 17:7577538-7577538, 17:10411982-10411982, 17:43364293-43364293, 17:47892946-47892946, 17:67538038-67538038, 17:67012449-67012449, 17:48538716-48538716, 17:27936181-27936181, 17:79650824-79650824, 17:79638782-79638782, 17:76528554-76528554, 17:6683197-6683197, 17:73235515-73235515, 17:39505636-39505636, 17:33310040-33310040, 17:56083818-56083818, 17:37374298-37374298
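The validation loci are written as `chrom:start-end` strings. A minimal sketch of parsing them and testing whether a called position falls on a validation site follows; the `Region` type and helper names are illustrative assumptions, and only three of the 18 loci from the slide are used.

```python
from typing import Iterable, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string such as '17:7577538-7577538'."""
    chrom, span = text.strip().split(":")
    start, end = span.split("-")
    return Region(chrom, int(start), int(end))


def is_validated(chrom: str, pos: int, regions: Iterable[Region]) -> bool:
    """True if the position lies inside any validation region."""
    return any(r.chrom == chrom and r.start <= pos <= r.end for r in regions)


# Three of the loci listed on the slide.
validation_regions = [
    parse_region(s)
    for s in ("17:5445207-5445207", "17:7577538-7577538", "17:10411982-10411982")
]

print(is_validated("17", 7577538, validation_regions))
print(is_validated("17", 7577539, validation_regions))
```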
  27. Integrated/Ensemble Approach: Java Application - version. Java 6 and Java 7 are both used in many systems. Select the Java version with "update-alternatives --config java". MuTect runs on Java 6; GATK runs on Java 7 :-(
  28. Integrated/Ensemble Approach: Java Application - running options. -Xmx7g: sets the maximum heap size of the Java program (here 7 GB); the default memory size for running a Java program is 64M, and "java.lang.OutOfMemoryError" occurs when the heap size is insufficient. -Djava.io.tmpdir=/tmp: sets a system property, here the temporary directory Java will use. General form: java [-java_options] -jar jarfile [jarfile_options]. Example: java -Xmx10g -Djava.io.tmpdir=/tmp -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence human_g1k_v37_decoy.fasta
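When several Java-based callers are launched from a script, the `java [-java_options] -jar jarfile [jarfile_options]` form can be assembled programmatically. The helper below is a hypothetical sketch (the function name and defaults are assumptions); it only builds the argument list and does not run anything. The jar name and tool options are the ones shown on the slide.

```python
import shlex
from typing import List, Sequence


def java_command(
    jar: str,
    tool_args: Sequence[str],
    max_heap: str = "7g",
    tmpdir: str = "/tmp",
) -> List[str]:
    """Build: java -Xmx<max_heap> -Djava.io.tmpdir=<tmpdir> -jar <jar> <tool_args...>"""
    return [
        "java",
        f"-Xmx{max_heap}",              # maximum heap size
        f"-Djava.io.tmpdir={tmpdir}",   # temporary directory used by the JVM
        "-jar",
        jar,
        *tool_args,
    ]


mutect_cmd = java_command(
    jar="muTect-1.1.4.jar",
    tool_args=[
        "--analysis_type", "MuTect",
        "--reference_sequence", "human_g1k_v37_decoy.fasta",
    ],
    max_heap="10g",
)
print(shlex.join(mutect_cmd))
```

Keeping the command as a list avoids shell-quoting problems if it is later passed to `subprocess.run`.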
  29. Integrated/Ensemble Approach: SomaticSniper v1.0.4. bam-somaticsniper -J -F vcf -n HCC1143_Normal -t HCC1143_Tumor -f ${gatk_b37} ${input_bam1} ${input_bam2} HCC1143_chr17_somaticsniper.vcf
  30. Integrated/Ensemble Approach: MuTect v1.1.4. java -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence ${gatk_b37} --input_file:normal ${input_bam2} --input_file:tumor ${input_bam1} --out HCC1143_chr17_mutect.out --vcf HCC1143_chr17_mutect.vcf --coverage_file HCC1143_chr17_mutect.cov.wig.txt -nt 7 --normal_sample_name normal --tumor_sample_name tumor -L 17
  31. Integrated/Ensemble Approach: VarScan2 v2.3.7. First: samtools mpileup -f ${gatk_b37} -Q 20 -q 20 -B ${input_bam2} ${input_bam1} > hcc1143_chr17.mpileup Then: java -jar VarScan.v2.3.7.jar somatic hcc1143_chr17.mpileup HCC1143_chr17.varscan --mpileup 1 --output-vcf 1
  32. Integrated/Ensemble Approach: GATK UnifiedGenotyper. java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -L 17 -o hcc1143_chr17.gatk.normal.vcf -I ${input_bam2} --genotype_likelihoods_model BOTH -minIndelFrac 0.2 --min_base_quality_score 17 --standard_min_confidence_threshold_for_calling 30.0 --standard_min_confidence_threshold_for_emitting 30.0 --baq CALCULATE_AS_NECESSARY --baqGapOpenPenalty 30.0 --defaultBaseQualities -1 --validation_strictness STRICT --interval_merging ALL -R ${gatk_b37} -nt 7
  33. Integrated/Ensemble Approach: GATK SelectVariants. Selects variants from a VCF source. discordance: select all calls missed in mycalls but present in Hiscalls. concordance: select all calls made by both mycalls and Hiscalls. selectType MNP/SNP: select only multi-allelic SNPs and MNPs. select, restricting the output VCF to a set of intervals.
  34. Integrated/Ensemble Approach: Ensemble approach - results & rank score. Filtered count per caller (total variant count/filtered count): SomaticSniper 2,381/624; MuTect 132,239/4,318; VarScan2 89,986/1,457. Concordance calls: 460 variants in total; excluding GATK germline calls: 324 variants; including only GATK cancer-sample calls: 204 variants. Validation count (total variant count/validated count): SomaticSniper 2,381/9(+4); MuTect 132,239/13(+8); VarScan2 89,986/6(+1); filtered consensus 204/5. Rank scores shown on the slide: 1, 2, 5, 3, 4.
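The counts on this slide are easier to compare as fractions. The short calculation below uses only the numbers reported on the slide and assumes that the second number in each "total/filtered" pair is the count remaining after filtering; that reading is an assumption, since the slide does not say so explicitly.

```python
# (total variants, count after filtering) per caller, as reported on the slide.
filter_counts = {
    "SomaticSniper": (2381, 624),
    "MuTect": (132239, 4318),
    "VarScan2": (89986, 1457),
}

# Concordance funnel reported on the slide.
concordance_funnel = [
    ("concordant calls", 460),
    ("after excluding GATK germline calls", 324),
    ("also called by GATK in the cancer sample", 204),
]


def retained_fraction(total: int, kept: int) -> float:
    """Fraction of calls that remain after filtering."""
    return kept / total


retained = {name: retained_fraction(t, k) for name, (t, k) in filter_counts.items()}
for name, frac in retained.items():
    print(f"{name}: {frac:.1%} retained")

first = concordance_funnel[0][1]
for label, n in concordance_funnel:
    print(f"{label}: {n} ({n / first:.0%} of concordant calls)")
```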
  35. Summary
  36. Summary. Identifying somatic changes from tumor and matched normal sequence requires accurate detection of somatic point mutations with low allele frequencies in impure and heterogeneous cancer samples. Mutations called by multiple tools are of higher confidence than mutations called by a single tool. Utilizing multiple callers can be a powerful way to construct a list of final calls for one's research. Multiple tools can be run in parallel, giving a faster total run-time.
  37. References
      • Wang, Q., Jia, P., Li, F., Chen, H., Ji, H., Hucks, D., et al. (2013). Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Medicine, 5(10), 91. doi:10.1186/gm495
      • Goode, D. L., Hunter, S. M., Doyle, M. A., Ma, T., Rowley, S. M., Choong, D., et al. (2013). A simple consensus approach improves somatic mutation prediction accuracy. Genome Medicine, 5(9), 90. doi:10.1186/gm494
      • Roberts, N. D., Kortschak, R. D., Parker, W. T., Schreiber, A. W., Branford, S., Scott, H. S., et al. (2013). A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18), 2223–2230. doi:10.1093/bioinformatics/btt375
      • Xu, H., DiCarlo, J., Satya, R. V., Peng, Q., & Wang, Y. (2014). Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics, 15(1), 244. doi:10.1186/1471-2164-15-244
      • Kim, S. Y., Jacob, L., & Speed, T. P. (2014). Combining calls from multiple somatic mutation-callers. BMC Bioinformatics, 15(1), 154. doi:10.1186/1471-2105-15-154
      • Löwer, M., Renard, B. Y., de Graaf, J., Wagner, M., Paret, C., Kneip, C., et al. (2012). Confidence-based Somatic Mutation Evaluation and Prioritization. PLoS Computational Biology, 8(9), e1002714. doi:10.1371/journal.pcbi.1002714
      • Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., et al. (2012). Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics, 28(2), 167–175. doi:10.1093/bioinformatics/btr629
      • Fischer, A., Vázquez-García, I., Illingworth, C. J. R., & Mustonen, V. (2014). High-definition reconstruction of clonal composition in cancer. Cell Reports, 7(5), 1740–1752. doi:10.1016/j.celrep.2014.04.055
      • Frampton, G. M., Fichtenholtz, A., Otto, G. A., Wang, K., Downing, S. R., He, J., et al. (2013). Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nature Biotechnology, 31(11), 1023–1031. doi:10.1038/nbt.2696
      • Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., et al. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3), 213–219. doi:10.1038/nbt.2514
      • Roth, A., Ding, J., Morin, R., Crisan, A., Ha, G., Giuliany, R., et al. (2012). JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7), 907–913. doi:10.1093/bioinformatics/bts053
      • Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., et al. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3), 568–576. doi:10.1101/gr.129684.111
      • Larson, D. E., Harris, C. C., Chen, K., Koboldt, D. C., Abbott, T. E., Dooling, D. J., et al. (2012). SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 28(3), 311–317. doi:10.1093/bioinformatics/btr665
