3. Challenge
Motivation
Sequencing reads from tumor samples are diluted by normal
cells
lower signal-to-noise ratio (e.g. a variant allele frequency of 5%)
SNVs cannot be called with high significance
The genomes of primary tumors are genetically heterogeneous
with frequent rearrangements
copy number alterations
subclones
Need highly sensitive and specific mutation-calling methods
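To make the dilution argument concrete, here is a minimal sketch of the arithmetic. It assumes a heterozygous somatic SNV in a copy-neutral diploid region and ignores sequencing error; the function names and the example purity, subclone fraction, and read-count cutoff are illustrative and not taken from the slides.

```python
from math import comb


def expected_vaf(purity, clonal_fraction=1.0):
    """Expected variant allele frequency of a heterozygous somatic SNV.

    Assumes a diploid, copy-neutral locus: only one of the two alleles in
    the mutated tumor cells carries the variant.
    """
    return purity * clonal_fraction / 2.0


def prob_at_least(k, depth, vaf):
    """P(X >= k) for X ~ Binomial(depth, vaf): chance of seeing >= k variant reads."""
    return sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i)
        for i in range(k, depth + 1)
    )


# Hypothetical sample: 20% tumor purity, mutation present in half of the tumor cells.
vaf = expected_vaf(purity=0.2, clonal_fraction=0.5)
print(f"expected VAF: {vaf:.1%}")
for depth in (30, 100):
    print(f"depth {depth}x: P(>=3 variant reads) = {prob_at_least(3, depth, vaf):.3f}")
```

With a low expected allele frequency, the chance of collecting enough supporting reads depends strongly on depth, which is why low-purity or subclonal mutations are hard to call.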
7. Comparison of mutation callers: Experiments
Literature Search
sSNV detection tools
Validation
PCR and direct sequencing of genomic DNA
on only deleted, functionally important sSNVs
Simulation
10 tumor-normal pairs (WES), 100x coverage
9. Comparison of mutation callers: Results (2)
Synthetic data: 10 tumor-normal pairs (WES), 100x coverage
10. Comparison of mutation callers: Summary of Findings
VarScan2 performed the best; MuTect follows
VarScan2: better at higher allele frequencies
MuTect: more sensitive with low allele frequencies
strand-bias filtering is useful
eliminate many false positives
common problem with Illumina sequencing data
still a challenge: how to distinguish sSNVs from alternate alleles
in the normal sample?
to call ultra-rare sSNVs: targeted deep sequencing
recommended over WES or WGS
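One common way to quantify strand bias is Fisher's exact test on the forward/reverse read counts of the reference and alternate alleles. The sketch below is an illustration of that general technique, not the filter used by any particular caller; the function names and the 0.01 cutoff are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def has_strand_bias(ref_fwd, ref_rev, alt_fwd, alt_rev, alpha=0.01):
    """Flag a site when alternate-allele reads are skewed toward one strand.

    alpha is a hypothetical cutoff chosen for illustration only.
    """
    return fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev) < alpha


print(has_strand_bias(50, 50, 10, 0))  # alt reads on the forward strand only
print(has_strand_bias(50, 50, 5, 5))   # alt reads balanced across strands
```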
12. Simple consensus approach: Data Sets
whole exome sequencing (WES)
27 ovarian tumors and their matched germline samples
HiSeq 2000 sequencer, using 100 bp paired-end reads
mean coverage ranged from 102~225x in tumors and 119~118x in germline samples
validation
Sanger sequencing
somatic SNV detection programs
JointSNVMix2, MuTect, and SomaticSniper
implement sophisticated detection algorithms
used in major tumor sequencing studies
13. Simple consensus approach: Identification of somatic SNVs
MuTect (v 1.0.27783)
only the default parameter set was applied
kept calls not labeled as ‘REJECT’
JointSNVMix2 (v 0.7.5)
default prior genotype probabilities used for training set
joint probability 0.9999 or greater
SomaticSniper (v 1.0.0)
using joint genotyping mode (-J option)
default prior probability of a somatic mutation (0.01)
reads with a mapping quality of 0 were filtered
predictions with a ‘somatic score’ of 40 or greater
SAMtools mpileup
mapping qualities
directionality
depth of reads
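The per-caller thresholds above can be sketched as simple predicates. This is a minimal illustration: each call is assumed to be a dict, and the field names (`judgement`, `p_somatic`, `somatic_score`) are hypothetical stand-ins for whatever the parsed output of each tool provides.

```python
def passes_mutect(call):
    """Keep MuTect calls that are not labeled REJECT."""
    return call["judgement"] != "REJECT"


def passes_jointsnvmix2(call, min_prob=0.9999):
    """Keep JointSNVMix2 calls with joint somatic probability >= 0.9999."""
    return call["p_somatic"] >= min_prob


def passes_somaticsniper(call, min_score=40):
    """Keep SomaticSniper calls with a somatic score of 40 or greater."""
    return call["somatic_score"] >= min_score


# Hypothetical parsed records, for illustration only.
mutect_calls = [{"site": "chr1:100", "judgement": "KEEP"},
                {"site": "chr1:200", "judgement": "REJECT"}]
kept = [c["site"] for c in mutect_calls if passes_mutect(c)]
print(kept)
```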
14. Simple consensus approach: Prediction Results
total read depth of 8 or greater in both tumor and normal (T/N)
mutant allele frequency of ≥20% in tumor and ≤5% in germline
mutant allele supported by reads mapping in both forward and reverse orientations
variant called in only one tumor (except for the BRAF V600 and KRAS G12/13
hotspots)
combined total of 9,226 somatic SNV predictions
median of 321 predictions per sample (range 147~695)
SomaticSniper and JointSNVMix2: most mutations per sample (medians 171 and 173)
MuTect was more conservative (median 115)
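The read-level criteria on this slide (depth, allele frequency, strand support) might be combined as in the sketch below. It assumes the germline allele-frequency threshold is an upper bound, since a somatic variant should be essentially absent from the normal sample; the `SiteCounts` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Per-site read counts (hypothetical structure for illustration)."""
    tumor_depth: int
    tumor_alt: int
    normal_depth: int
    normal_alt: int
    tumor_alt_fwd: int
    tumor_alt_rev: int


def passes_site_filters(s, min_depth=8, min_tumor_vaf=0.20, max_normal_vaf=0.05):
    """Apply depth, allele-frequency, and strand-support filters to one site."""
    if s.tumor_depth < min_depth or s.normal_depth < min_depth:
        return False
    if s.tumor_alt / s.tumor_depth < min_tumor_vaf:
        return False
    # Assumption: the germline threshold is a maximum, not a minimum.
    if s.normal_alt / s.normal_depth > max_normal_vaf:
        return False
    # The mutant allele must be seen on both forward and reverse reads.
    return s.tumor_alt_fwd > 0 and s.tumor_alt_rev > 0


good = SiteCounts(tumor_depth=40, tumor_alt=12, normal_depth=30,
                  normal_alt=0, tumor_alt_fwd=7, tumor_alt_rev=5)
print(passes_site_filters(good))
```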
15. Simple consensus approach: Properties of Predictions
non-reference allele frequency in germline
SomaticSniper and JointSNVMix2: accept a substantial number of reads with non-ref alleles
lets a significant number of germline variants into the call set
MuTect is much more stringent on evidence for non-ref alleles
non-reference allele frequency in tumor
predictions made by only one or two programs have a lower proportion of non-ref reads
not having sufficient allelic ratios to be predicted as somatic
but enough support to rise above the thresholds of at least one program
17. Simple consensus approach: Filtering Results
taking consensus between GATK Unified Genotyper
mate-pair rescue read filtering
minimum read depth of 10 in both the tumor and germline
2 true positive
18. Simple consensus approach: Summary of Findings
Powerful method for increasing the validation rate
while maintaining maximum sensitivity
Similar effects are likely to influence other
bioinformatics classification problems
Likely to prove effective for a variety of genomics and
bioinformatics analyses
20. Integrated/Ensemble Approach
Integrated/Ensemble Approach
Ensemble
Using multiple learning algorithms to obtain better predictive performance
(Three somatic SNV callers: SomaticSniper, MuTect, and VarScan2)
Integrated
For better performance, we will use additional filtering
GATK Unified Genotyper: filtering SNVs predicted in the tumor but not the germline
Scoring system: helps us identify strong and relevant mutation candidates
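A minimal sketch of the ensemble idea: count how many callers report each variant and keep those supported by a minimum number of callers. The vote count here is only a stand-in for a score; the slides do not specify the actual scoring system, and the variant tuples and function names are illustrative.

```python
from collections import Counter


def ensemble_votes(calls_by_caller):
    """Count how many callers report each variant (chrom, pos, ref, alt)."""
    votes = Counter()
    for calls in calls_by_caller.values():
        votes.update(set(calls))  # each caller votes at most once per variant
    return votes


def consensus(calls_by_caller, min_callers=2):
    """Return the variants reported by at least min_callers callers."""
    votes = ensemble_votes(calls_by_caller)
    return {variant for variant, n in votes.items() if n >= min_callers}


# Hypothetical call sets for illustration.
calls = {
    "SomaticSniper": [("chr1", 100, "A", "T"), ("chr2", 50, "G", "C")],
    "MuTect":        [("chr1", 100, "A", "T")],
    "VarScan2":      [("chr1", 100, "A", "T"), ("chr2", 50, "G", "C"),
                      ("chr3", 7, "C", "T")],
}
print(sorted(consensus(calls, min_callers=2)))
print(sorted(consensus(calls, min_callers=3)))
```

Raising `min_callers` trades sensitivity for a higher expected validation rate, which is the trade-off the consensus slides describe.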
22. Integrated/Ensemble Approach: Data Sets (CGHub)
https://cghub.ucsc.edu/datasets/benchmark_download.html
CGHub
Cancer Genomics Hub
a resource of the National Cancer Institute
hosts data from The Cancer Genome Atlas (TCGA) consortium and related projects
23. Integrated/Ensemble Approach: TCGA Benchmark 4
Three parts to mutation calling exercise:
cell lines derived from grade 3 breast ductal carcinomas (breast cancer)
HCC1143 (50x) vs. HCC1143 BL (60x)
HCC1954 (58x) vs. HCC1954 BL (71x)
Simulated normal contamination and subclone expansion for both
Total: 28 .bam files, ~4.3 TB
24. Integrated/Ensemble Approach: download using GeneTorrent
GeneTorrent
client software for downloading sequence data from CGHub’s repository
two main programs: gtdownload and cgquery
get public key
public key: https://cghub.ucsc.edu/software/downloads/cghub_public.key
TCGA key: approval to access the restricted data from the ICGC-DACO
download uuid (xml file)
25. Validation Data Sets
COSMIC
Catalogue Of Somatic Mutations In Cancer: Cell Lines Project
Wellcome Trust Sanger Institute
http://cancer.sanger.ac.uk/cancergenome/projects/cell_lines/
CCLE
Cancer Cell Line Encyclopedia
Broad Institute and Novartis Institutes for BioMedical Research
http://www.broadinstitute.org/ccle/home
27. Java Application: version
Java version
Java 6 and Java 7 are both used in many systems
Select Java version
use “update-alternatives --config java”
MuTect runs on Java 6; GATK runs on Java 7 :-(
28. Java Application: running options
-Xmx7g
sets the maximum heap size of the Java program
the default maximum heap on older JVMs is 64 MB
“java.lang.OutOfMemoryError” is raised when the heap size is insufficient
-Djava.io.tmpdir=/tmp
sets a system property value
sets the temporary directory that Java will use
java [-java_options] -jar jarfile [jarfile_options]
java -Xmx10g -Djava.io.tmpdir=/tmp -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence human_g1k_v37_decoy.fasta
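When several Java tools need different JVMs and memory settings, it can help to assemble the command line programmatically. The helper below is a hypothetical sketch that only rebuilds the command shown above; `build_java_command` and its parameters are illustrative names, and nothing is executed.

```python
import shlex


def build_java_command(jar, tool_args, java_bin="java", max_heap="10g", tmpdir="/tmp"):
    """Assemble: java [-java_options] -jar jarfile [jarfile_options]."""
    return [
        java_bin,
        f"-Xmx{max_heap}",             # maximum heap size
        f"-Djava.io.tmpdir={tmpdir}",  # temporary directory used by the JVM
        "-jar", jar,
        *tool_args,
    ]


cmd = build_java_command(
    "muTect-1.1.4.jar",
    ["--analysis_type", "MuTect",
     "--reference_sequence", "human_g1k_v37_decoy.fasta"],
)
print(shlex.join(cmd))
```

Passing a different `java_bin` (for example, the path to a Java 6 binary on your system) lets MuTect and GATK each run on the JVM they require without switching the system default.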
33. GATK SelectVariants
Select variants from a VCF source
discordance: select all calls missed in mycalls but present in hiscalls
concordance: select all calls made by both mycalls and hiscalls
selectType SNP/MNP: select only variants of the given types (e.g. SNPs and MNPs)
-L: restrict the output VCF to a set of intervals
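The concordance and discordance selections correspond to set intersection and set difference over variant keys. The sketch below only illustrates those semantics as described on this slide; it is not how GATK implements them, and the variant tuples are made up.

```python
def concordance(my_calls, his_calls):
    """Calls made by both call sets."""
    return set(my_calls) & set(his_calls)


def discordance(my_calls, his_calls):
    """Calls present in his_calls but missed in my_calls."""
    return set(his_calls) - set(my_calls)


# Hypothetical variants keyed by (chrom, pos, ref, alt).
my_calls = {("chr1", 100, "A", "T"), ("chr2", 50, "G", "C")}
his_calls = {("chr1", 100, "A", "T"), ("chr3", 7, "C", "T")}

print(concordance(my_calls, his_calls))
print(discordance(my_calls, his_calls))
```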
36. Summary
Summary
Identifying somatic changes from tumor and matched
normal sequence requires accurate detection of somatic
point mutations with low allele frequencies in impure and
heterogeneous cancer samples
Mutations called by multiple tools are of higher confidence
than mutations called by a single tool
Utilizing multiple callers can be a powerful way to construct
a list of final calls for one’s research
Capable of running multiple tools in parallel, providing
faster total run-time
37. References
References
Wang, Q., Jia, P., Li, F., Chen, H., Ji, H., Hucks, D., et al. (2013). Detecting somatic point mutations in cancer genome
sequencing data: a comparison of mutation callers. Genome Medicine, 5(10), 91. doi:10.1186/gm495
Goode, D. L., Hunter, S. M., Doyle, M. A., Ma, T., Rowley, S. M., Choong, D., et al. (2013). A simple consensus approach
improves somatic mutation prediction accuracy. Genome Medicine, 5(9), 90. doi:10.1186/gm494
Roberts, N. D., Kortschak, R. D., Parker, W. T., Schreiber, A. W., Branford, S., Scott, H. S., et al. (2013). A comparative analysis
of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18), 2223–2230. doi:10.1093/bioinformatics/btt375
Xu, H., DiCarlo, J., Satya, R. V., Peng, Q., & Wang, Y. (2014). Comparison of somatic mutation calling methods in amplicon
and whole exome sequence data. BMC Genomics, 15(1), 244. doi:10.1186/1471-2164-15-244
Kim, S. Y., Jacob, L., & Speed, T. P. (2014). Combining calls from multiple somatic mutation-callers. BMC Bioinformatics,
15(1), 154. doi:10.1186/1471-2105-15-154
Löwer, M., Renard, B. Y., de Graaf, J., Wagner, M., Paret, C., Kneip, C., et al. (2012). Confidence-based Somatic Mutation
Evaluation and Prioritization. PLoS Computational Biology, 8(9), e1002714. doi:10.1371/journal.pcbi.1002714
Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., et al. (2012). Feature-based classifiers for somatic mutation
detection in tumour-normal paired sequencing data. Bioinformatics, 28(2), 167–175. doi:10.1093/bioinformatics/btr629
Fischer, A., Vázquez-García, I., Illingworth, C. J. R., & Mustonen, V. (2014). High-definition reconstruction of clonal
composition in cancer. Cell Reports, 7(5), 1740–1752. doi:10.1016/j.celrep.2014.04.055
Frampton, G. M., Fichtenholtz, A., Otto, G. A., Wang, K., Downing, S. R., He, J., et al. (2013). Development and validation of a
clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nature Biotechnology, 31(11), 1023–1031.
doi:10.1038/nbt.2696
Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., et al. (2013). Sensitive detection of somatic
point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3), 213–219. doi:10.1038/nbt.2514
Roth, A., Ding, J., Morin, R., Crisan, A., Ha, G., Giuliany, R., et al. (2012). JointSNVMix: a probabilistic model for accurate
detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7), 907–913.
doi:10.1093/bioinformatics/bts053
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., et al. (2012). VarScan 2: Somatic mutation and copy
number alteration discovery in cancer by exome sequencing. Genome Research, 22(3), 568–576. doi:10.1101/gr.129684.111
Larson, D. E., Harris, C. C., Chen, K., Koboldt, D. C., Abbott, T. E., Dooling, D. J., et al. (2012). SomaticSniper: identification of
somatic point mutations in whole genome sequencing data. Bioinformatics, 28(3), 311–317. doi:10.1093/bioinformatics/btr665