Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach
CB Hong*, KJ Kim
4-5 February 2015
* KT GenomeCloud, hongiiv@gmail.com
Contents
1 TCGA Benchmark 4 Data Set
1.1 Downloading TCGA data with GeneTorrent
1.2 Preparing the sample data set
1.3 Checking the extracted regions
1.4 Checking the practice data
1.5 Summary
2 Somatic Mutation Prediction
2.1 Running SomaticSniper and applying the pre-filter
2.2 Running VarScan2 and applying the pre-filter (10 min)
2.3 Running MuTect and applying the pre-filter (18 min)
2.4 Summary
3 Finding Full Consensus / Partial Consensus sSNVs
3.1 Extracting bi-allelic SNPs only
3.2 Finding the full consensus / partial consensus
3.3 Counting the full consensus / partial consensus calls
3.4 Summary
4 Applying additional filters
4.1 Calling normal and tumor variants with UnifiedGenotyper (8 min)
4.2 Filtering SNVs - full consensus (optional)
4.3 Filtering SNVs - partial consensus (SomaticSniper/MuTect)
4.4 Counting the full consensus / partial consensus calls after the GATK filter
4.5 Summary
5 Validation
5.1 Preparing the COSMIC and CCLE data
5.2 Running the validation - consensus / partial consensus
5.3 Summary
6 Other Somatic Mutation Callers - Strelka, Virmid
6.1 Strelka (1 min 38 s)
6.2 Virmid (33 min)
7 Linux for Bioinformatics Research
7.1 Practice Linux server
7.2 Connecting to the practice Linux server - Windows users
7.3 Connecting to the practice Linux server - Mac or Linux users
7.4 Finding Linux system information
7.5 The Linux file system
7.6 Adding a hard disk in Linux
7.7 File-related commands
7.8 Linux network information
7.9 Understanding compression in Linux
7.10 Installing software in Linux
7.10.1 Installing software with APT
7.10.2 Installing software by compiling source code
1 TCGA Benchmark 4 Data Set
This hands-on session uses the TCGA mutation calling benchmark 4 datasets to show how somatic mutations are detected. The genome sequencing benchmark datasets were generated by artificially mixing a tumor sample with a fixed proportion (5%-95%) of normal sample. Of these, we use n40t60 (mixed with 60% of the tumor and 40% of the normal) and the matching normal sample. The data can be downloaded in BAM format from the TCGA Benchmark home page.
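To make the mixing ratio concrete, the sketch below estimates the variant allele fraction (VAF) one might expect for a somatic SNV in a mixed sample. It rests on simplifying assumptions that the benchmark description does not state: the variant is clonal, heterozygous, and sits in a copy-neutral diploid region. `expected_vaf` is a hypothetical helper name.

```shell
# expected_vaf PURITY
# Rough expected allele fraction of a somatic SNV in a tumor/normal mix.
# Assumptions (not from the benchmark itself): the variant is clonal,
# heterozygous, and lies in a copy-neutral diploid region, so only half of
# the tumor-derived reads carry it.
expected_vaf() {
    awk -v purity="$1" 'BEGIN { printf "%.3f\n", purity * 0.5 }'
}

expected_vaf 0.60   # n40t60: 60% tumor, 40% normal
```

Under these assumptions the n40t60 mix gives about 0.30, and lower tumor fractions push the signal down further, which is where sensitivity at low allelic fraction starts to matter.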
1.1 Downloading TCGA data with GeneTorrent
• Steps: install the download software - download the key/UUID file - download the sample
• First, download the public key for the TCGA Benchmark Data Set
• https://cghub.ucsc.edu/datasets/benchmark download.html
$ cd
$ wget https://cghub.ucsc.edu/software/downloads/cghub_public.key
• Download the UUID (universally unique identifier) file, which holds the download information for a specific file
• TCGA Benchmark cell line: HCC1143 tumor 50x
$ curl "https://cghub.ucsc.edu/cghub/metadata/analysisAttributes?analysis_id=ad3d4757-f358-40a3-9d92-742463a95e88" \
-o uuid.txt
$ more uuid.txt
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<center_name>UCSC</center_name>
<study>TCGA_MUT_BENCHMARK_4</study>
<files>
<file>
<filename>G15511.HCC1143.1.bam</filename>
<filesize>255795959440</filesize>
</file>
• Download the data with gtdownload
$ cd
$ gtdownload -c cghub_public.key -vv -d uuid.txt
1.2 Preparing the sample data set
• Extract part of a BAM file, then sort and index it
Extract by chromosome (-b: output in BAM format)
$ cd
$ samtools view -b in.bam 1 > chr1.bam
$ samtools sort chr1.bam chr1_sorted
$ samtools index chr1_sorted.bam
• Extract specific regions (using a BED file)
$ cd
$ cat chr17.bed
17:5967-6207
17:11197-11389
17:11806-12018
17:13897-14017
17:22307-22427
17:30843-30963
17:31151-31279
17:63618-63738
17:65398-65638
17:69410-69530
17:96838-97108
17:131511-131661
17:169155-169395
17:170984-171254
17:177205-177355
17:260100-260308
17:262897-263257
17:263317-263947
$ cat chr17.bed | xargs samtools view -b in.bam > exome.bam
$ samtools sort exome.bam exome_sorted
$ samtools index exome_sorted.bam
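The chr17.bed file above holds one samtools region string (chr:start-end) per line, not tab-separated BED records. If the targets arrive as a standard BED file, a conversion along these lines may be needed first. This is a sketch only: `bed_to_regions` is a hypothetical helper, and it assumes the usual BED convention of 0-based, end-exclusive intervals against samtools' 1-based, inclusive regions.

```shell
# bed_to_regions [FILE...]
# Convert tab-separated BED records (0-based start, end-exclusive) into
# samtools region strings (1-based, inclusive). Header lines are skipped.
bed_to_regions() {
    awk 'BEGIN { FS = "\t" }
         /^(#|track|browser)/ { next }
         NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' "$@"
}

# Usage (file names are placeholders):
#   bed_to_regions targets.bed | xargs samtools view -b in.bam > exome.bam
```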
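1.3 Checking the extracted regions
• Write the position of each read in BED format. The result can be added as a custom track in the UCSC Genome Browser to inspect the aligned reads.
$ cd
$ bamToBed -i exome_sorted.bam > cov_1.bed
• Write the actual coverage of the BAM file to a BED file; the read depth information can be used to draw a histogram.
$ cd
$ samtools view -b exome_sorted.bam |
genomeCoverageBed -ibam stdin > cov_2.bed
1.4 Checking the practice data
• List of samples, programs, and reference data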
$ cd /somatic_bench
$ pwd
/somatic_bench
$ ls -al
total 176
drwxr-xr-x 7 root root 4096 Jan 21 15:25 .
drwxr-xr-x 25 root root 4096 Jan 20 08:53 ..
drwxr-xr-x 9 root root 4096 Jan 21 08:15 app
drwxr-xr-x 2 root root 4096 Jan 21 14:38 bam
drwxr-xr-x 2 root root 4096 Jan 19 11:43 reference
drwxr-xr-x 2 root root 4096 Jan 21 15:24 script
drwxr-xr-x 2 root root 151552 Jan 21 12:59 tmp
$ more /somatic_bench/script/somatic_call_bench.sh
input_bam1="/somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam"
input_bam2="/somatic_bench/bam/hcc1143.ccle.b.sorted.bam"
gatk_b37="/somatic_bench/reference/human_g1k_v37_decoy.fasta"
temp_dir="/somatic_bench/tmp/"
$ cd
$ ln -s /somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam tumor.bam
$ ln -s /somatic_bench/bam/hcc1143.ccle.b.sorted.bam normal.bam
1.5 Summary
• Programs: wget, curl, gtdownload, samtools, bedtools (bamToBed, genomeCoverageBed)
• Outputs: a .bam containing only the regions of interest, and a .bed showing the coverage of that .bam
2 Somatic Mutation Prediction
Use SomaticSniper, VarScan2, and MuTect to find somatic mutations in the sample data set (tumor and matched normal BAM).
• Commands: https://gist.github.com/hongiiv/06611f189f4c8158edb0
• SAMtools: v0.1.19
• GATK: v2.8.1
• MuTect: v1.1.4
• SomaticSniper: v1.0.4
• Strelka: v1.0.14
• Virmid: v1.1.1
2.1 Running SomaticSniper and applying the pre-filter
SomaticSniper was developed in 2011 by Li Ding at Washington University, where VarScan2 was also created. It uses Bayesian probability with posterior filtering, and its main strength is high computational efficiency.
• -J: joint genotyping mode with default prior probability of a somatic mutation (0.01)
• -n, -t: normal/tumor sample id (for VCF header)
• -F: output format (classic, vcf, bed)
• -f: path to the ref.fasta file
$ cd
$ bam-somaticsniper \
-J \
-F vcf \
-n HCC1143_Normal \
-t HCC1143_Tumor \
-f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
tumor.bam normal.bam \
HCC1143_somaticsniper.vcf
• (Filter options) Reads with a mapping quality of 0 were filtered prior to somatic mutation identification. Predictions with a 'somatic score' of 40 or greater were considered for subsequent downstream validation and analysis steps.
• GATK SelectVariants can be used to extract only the variants of interest.
• The filter uses the SSC (somatic score) and MQ (mapping quality) values in the FORMAT field of the VCF file.
$ cd
$ ln -s /somatic_bench/app/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar ./
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority
------------------------------------------------------------
0 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
1 /usr/lib/jvm/java-6-oracle/jre/bin/java 1
* 2 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
Press enter to keep the current choice [*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-7-oracle/jre/bin/java
$ java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_somaticsniper.vcf \
-o HCC1143_somaticsniper_filter.vcf \
-sn HCC1143_Tumor -sn HCC1143_Normal \
-select 'vc.getGenotype("HCC1143_Tumor").getExtendedAttribute("SSC") >= 40
&& (vc.getGenotype("HCC1143_Tumor").getExtendedAttribute("MQ") > 0 ||
vc.getGenotype("HCC1143_Normal").getExtendedAttribute("MQ") > 0)'
• Compare the number of mutations before and after filtering
$ cd
$ grep -v "#" HCC1143_somaticsniper.vcf | wc -l
583
$ grep -v "#" HCC1143_somaticsniper_filter.vcf | wc -l
161
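The counts above rely on grep -v "#", which drops every line that contains a '#' anywhere. VCF header lines always begin with '#', so anchoring the pattern to the start of the line is slightly safer. A minimal sketch (`vcf_count` is a hypothetical helper name):

```shell
# vcf_count FILE
# Count the variant records of a VCF: every line that does not START with '#'.
vcf_count() {
    grep -vc '^#' "$1"
}

# Usage: vcf_count HCC1143_somaticsniper_filter.vcf
```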
2.2 Running VarScan2 and applying the pre-filter (10 min)
VarScan2 was developed by Li Ding at Washington University in 2012, a year after SomaticSniper. Unlike the other tools, it uses Fisher's exact test with filtering and FDR correction, and its main strength is sensitive detection of high-quality sSNVs. Also unlike the other tools, it takes pileup or mpileup files as input rather than .bam files.
• Use samtools mpileup to generate pileup/mpileup files for the normal and tumor samples.
• At the mpileup step, filter with the options -q 1 (skip alignments with mapQ smaller than INT) and -B (disable BAQ computation).
• When VarScan takes mpileup format as input, pass the '--mpileup 1' option. (Older documentation describes samtools pileup as the default, but pileup was removed in a samtools update and replaced by mpileup. mpileup can still produce a pileup for a single sample. VarScan also supports an mpileup file that contains both normal and tumor.)
$ cd
$ samtools mpileup \
-f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
-q 1 -B normal.bam > HCC1143_n.pileup
$ samtools mpileup \
-f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
-q 1 -B tumor.bam > HCC1143_t.pileup
$ ln -s /somatic_bench/app/VarScan/VarScan.v2.3.3.jar ./
$ java -jar VarScan.v2.3.3.jar \
somatic HCC1143_n.pileup HCC1143_t.pileup \
HCC1143_varscan \
--output-vcf 1
14617150 positions in tumor
14616970 positions shared in normal
13721478 had sufficient coverage for comparison
13700958 were called Reference
0 were mixed SNP-indel calls and filtered
18427 were called Germline
1562 were called LOH
450 were called Somatic
81 were called Unknown
0 were called Variant
• VarScan2 writes its results in VCF format, with INDELs and SNPs in separate files (HCC1143_varscan.indel.vcf, HCC1143_varscan.snp.vcf).
drwxr-xr-x 2 root root 4096 Jan 30 09:52 ./
drwxr-xr-x 5 root root 8192 Jan 30 09:35 ../
-rw-r--r-- 1 root root 402354 Jan 30 09:47 HCC1143_varscan.indel.vcf
-rw-r--r-- 1 root root 2691462 Jan 30 09:47 HCC1143_varscan.snp.vcf
• Filter HCC1143_varscan.snp.vcf from the VarScan2 results with processSomatic and somaticFilter.
• processSomatic: separates high-confidence and low-confidence somatic mutations (by default, a minimum variant allele frequency of 0.1 in tumor and a maximum variant allele frequency of 0.05 in normal).
• somaticFilter: applies user-chosen filter options such as --min-coverage, --p-value, and --indel-file.
$ cd
$ java -jar VarScan.v2.3.3.jar processSomatic -help
USAGE: java -jar VarScan.jar process [status-file] OPTIONS
status-file - The VarScan output file for SNPs or Indels
OPTIONS
--min-tumor-freq - Minimum variant allele frequency in tumor [0.10]
--max-normal-freq - Maximum variant allele frequency in normal [0.05]
--p-value - P-value for high-confidence calling [0.07]
$ java -jar VarScan.v2.3.3.jar processSomatic HCC1143_varscan.snp.vcf
Reading input from HCC1143_varscan.snp.vcf
Opening output files:
17914 VarScan calls processed
382 were Somatic (102 high confidence)
16048 were Germline (15431 high confidence)
1451 were LOH (1447 high confidence)
• For Germline, LOH, and Somatic calls, processSomatic writes result files that list the high-confidence and low-confidence calls.
$ ls
-rw-r--r-- 1 2413169 Jan 30 09:52 HCC1143_varscan.snp.vcf.Germline
-rw-r--r-- 1 2320566 Jan 30 09:52 HCC1143_varscan.snp.vcf.Germline.hc
-rw-r--r-- 1 216574 Jan 30 09:52 HCC1143_varscan.snp.vcf.LOH
-rw-r--r-- 1 215997 Jan 30 09:52 HCC1143_varscan.snp.vcf.LOH.hc
-rw-r--r-- 1 59990 Jan 30 09:52 HCC1143_varscan.snp.vcf.Somatic
-rw-r--r-- 1 17055 Jan 30 09:52 HCC1143_varscan.snp.vcf.Somatic.hc
• In its VCF output, VarScan2 writes ALT alleles in forms such as 'G/T', which causes errors in later analysis steps, so change them to the 'G,T' form.
$ cd
$ perl -pe 's/\tA\//\tA,/' HCC1143_varscan.snp.vcf.Somatic.hc |
perl -pe 's/\tT\//\tT,/' |
perl -pe 's/\tG\//\tG,/' |
perl -pe 's/\tC\//\tC,/' > HCC1143_varscan_filter.vcf
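The perl substitutions above target a slash that follows a single-base allele after a tab. A variant of the same idea, restricted to the ALT column, can be sketched with awk. `fix_alt` is a hypothetical helper, and it assumes a tab-separated VCF with ALT in column 5.

```shell
# fix_alt [FILE...]
# Rewrite VarScan-style ALT values such as "G/T" to the VCF form "G,T".
# Only column 5 (ALT) of data lines is touched; header lines pass through.
fix_alt() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/  { print; next }
               { gsub("/", ",", $5); print }' "$@"
}

# Usage: fix_alt HCC1143_varscan.snp.vcf.Somatic.hc > HCC1143_varscan_filter.vcf
```

Limiting the edit to one column leaves any slash elsewhere on the line, for example in genotype fields, as it was.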
• Number of mutations after filtering
$ cd
$ grep -v "#" HCC1143_varscan_filter.vcf | wc -l
102
2.3 Running MuTect and applying the pre-filter (18 min)
MuTect, developed at the Broad Institute, uses Bayesian probability with pre- and post-filtering and is particularly sensitive for sSNVs at low allelic fraction.
• MuTect runs only on Java 1.6, so check the current Java version and, if needed, switch it with update-alternatives.
$ cd
$ ln -s /somatic_bench/app/mutect/muTect-1.1.4.jar ./
$ samtools index normal.bam
$ samtools index tumor.bam
$ cp /somatic_bench/reference/ccle.gatk.bed ./
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority
------------------------------------------------------------
0 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
1 /usr/lib/jvm/java-6-oracle/jre/bin/java 1
* 2 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
Press enter to keep the current choice [*], or type selection number: 1
update-alternatives: using /usr/lib/jvm/java-6-oracle/jre/bin/java
$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
$ java -jar muTect-1.1.4.jar --analysis_type MuTect \
--reference_sequence /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--cosmic /somatic_bench/reference/b37_cosmic_v54_120711.vcf \
--dbsnp /somatic_bench/reference/dbsnp_132_b37.leftAligned.vcf \
--input_file:normal normal.bam \
--input_file:tumor tumor.bam \
--out HCC1143_mutect.out \
--vcf HCC1143_mutect.vcf \
--coverage_file HCC1143.mutect.cov.wig.txt \
--normal_sample_name HCC1143_Normal \
--tumor_sample_name HCC1143_Tumor \
-L ccle.gatk.bed
• (Filter options) Predictions not labeled as 'REJECT' were accepted as confident somatic mutation predictions and used in subsequent downstream validation and analysis steps.
• The GATK used for filtering requires Java 1.7, so switch the Java version with update-alternatives.
• Use GATK SelectVariants to keep only the variants whose VCF FILTER field is PASS (that is, excluding REJECT).
$ cd
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority
------------------------------------------------------------
0 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
1 /usr/lib/jvm/java-6-oracle/jre/bin/java 1
* 2 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
Press enter to keep the current choice [*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-7-oracle/jre/bin/java
$ java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
$ java -jar GenomeAnalysisTK.jar -T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_mutect.vcf \
-o HCC1143_mutect_filter.vcf \
-sn HCC1143_Tumor -sn HCC1143_Normal \
-select 'vc.isNotFiltered()'
• Alternatively, use the --excludeFiltered option of GATK SelectVariants to keep only the variants whose FILTER field is PASS (excluding REJECT).
$ cd
$ java -jar GenomeAnalysisTK.jar -T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_mutect.vcf \
-o HCC1143_mutect_filter.vcf \
-sn HCC1143_Tumor -sn HCC1143_Normal \
--excludeFiltered
• Number of mutations after filtering
$ cd
$ grep -v "#" HCC1143_mutect_filter.vcf | wc -l
109
2.4 Summary
• Programs: VarScan2, SomaticSniper, MuTect, GATK
• Outputs: somatic mutations after pre-filtering (161, 102, 112)
3 Finding Full Consensus / Partial Consensus sSNVs
Find the full consensus calls of the three SNV detection tools, SomaticSniper, VarScan2, and MuTect. Multi-allelic sites and indels are removed.
3.1 Extracting bi-allelic SNPs only
• From the pre-filtered results, remove multi-allelic sites and keep only SNPs.
• With GATK SelectVariants, set -selectType to SNP (one of INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION) and -restrictAllelesTo to BIALLELIC (MULTIALLELIC or BIALLELIC).
$ cd
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_mutect_filter.vcf \
-o HCC1143_mutect_1.vcf \
-selectType SNP \
-restrictAllelesTo BIALLELIC
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_somaticsniper_filter.vcf \
-o HCC1143_somaticsniper_1.vcf \
-selectType SNP \
-restrictAllelesTo BIALLELIC
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_varscan_filter.vcf \
-o HCC1143_varscan_1.vcf \
-selectType SNP \
-restrictAllelesTo BIALLELIC
3.2 Finding the full consensus / partial consensus
• Find the partial consensus sets (SomaticSniper/MuTect, MuTect/VarScan2, VarScan2/SomaticSniper) and the consensus of all three somatic callers.
$ cd
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_somaticsniper_1.vcf \
--concordance HCC1143_mutect_1.vcf \
-o HCC1143_SM.vcf
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_mutect_1.vcf \
--concordance HCC1143_varscan_1.vcf \
-o HCC1143_MV.vcf
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_varscan_1.vcf \
--concordance HCC1143_somaticsniper_1.vcf \
-o HCC1143_VS.vcf
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SM.vcf \
--concordance HCC1143_varscan_1.vcf \
-o HCC1143_SMV.vcf
3.3 Counting the full consensus / partial consensus calls
• Count the full consensus and partial consensus calls.
$ cd
$ grep -v "#" HCC1143_SM.vcf |wc -l
45
$ grep -v "#" HCC1143_MV.vcf |wc -l
38
$ grep -v "#" HCC1143_VS.vcf |wc -l
42
$ grep -v "#" HCC1143_SMV.vcf |wc -l
32
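As a cross-check on the counts above, the overlap between two call sets can also be computed outside GATK. The sketch below compares records by chromosome, position, REF, and ALT. That matching rule is an assumption and may differ in detail from how SelectVariants --concordance matches records. `vcf_keys` and `vcf_shared` are hypothetical helper names.

```shell
# vcf_keys FILE: one sorted, unique "CHROM:POS:REF>ALT" key per record.
vcf_keys() {
    awk 'BEGIN { FS = "\t" } !/^#/ { print $1 ":" $2 ":" $4 ">" $5 }' "$1" |
        LC_ALL=C sort -u
}

# vcf_shared A.vcf B.vcf: keys present in both files.
vcf_shared() {
    a=$(mktemp) && b=$(mktemp) || return 1
    vcf_keys "$1" > "$a"
    vcf_keys "$2" > "$b"
    LC_ALL=C comm -12 "$a" "$b"   # comm needs both inputs sorted the same way
    rm -f "$a" "$b"
}

# Usage: vcf_shared HCC1143_somaticsniper_1.vcf HCC1143_mutect_1.vcf | wc -l
```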
3.4 Summary
• Programs: GATK
• Outputs: consensus / partial consensus data
4 Applying additional filters
GATK UnifiedGenotyper can be used to further improve specificity.
4.1 Calling normal and tumor variants with UnifiedGenotyper (8 min)
• Use GATK UnifiedGenotyper to call SNPs in the normal and tumor samples.
$ cd
$ java -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-o HCC1143_gatk.tumor.vcf \
-I tumor.bam \
--genotype_likelihoods_model SNP \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
-L ccle.gatk.bed
$ java -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-o HCC1143_gatk.normal.vcf \
-I normal.bam \
--genotype_likelihoods_model SNP \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
-L ccle.gatk.bed
4.2 Filtering SNVs - full consensus (optional)
• Using the normal and tumor variants produced by GATK UnifiedGenotyper, keep the SNVs predicted in the tumor but not in the germline.
$ cd
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SMV.vcf \
--discordance HCC1143_gatk.normal.vcf \
-o HCC1143_SMV_discordance_normal.vcf
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SMV_discordance_normal.vcf \
--concordance HCC1143_gatk.tumor.vcf \
-o HCC1143_final_filter_concordance.vcf
4.3 Filtering SNVs - partial consensus (SomaticSniper/MuTect)
$ cd
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SM.vcf \
--discordance HCC1143_gatk.normal.vcf \
-o HCC1143_SM_discordance_normal.vcf
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SM_discordance_normal.vcf \
--concordance HCC1143_gatk.tumor.vcf \
-o HCC1143_SM_final_filter_concordance.vcf
4.4 Counting the full consensus / partial consensus calls after the GATK filter
• Count the consensus and partial consensus calls that remain after the GATK filter.
$ cd
$ grep -v "#" HCC1143_final_filter_concordance.vcf | wc -l
32
$ grep -v "#" HCC1143_SM_final_filter_concordance.vcf | wc -l
45
4.5 Summary
• Programs: GATK
• Outputs: consensus / partial consensus data after the GATK filter
5 Validation
Check how well the calls agree with the list of variants reported for the HCC1143 sample in COSMIC and CCLE. The validation.list file is stored on the server, or can be downloaded (https://gist.github.com/hongiiv/42194181ce6402d8b629).
5.1 Preparing the COSMIC and CCLE data
• Copy the list of variants for the HCC1143 sample from COSMIC and CCLE (103 in total).
$ cd
$ cp /somatic_bench/reference/validation.list ./
$ cat validation.list | wc -l
103
5.2 Running the validation - consensus / partial consensus
• Check how many of the final filtered consensus / partial consensus (SomaticSniper/MuTect) calls match.
$ cd
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_final_filter_concordance.vcf \
-o all.val.filter.vcf \
-L validation.list
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SM_final_filter_concordance.vcf \
-o sm.val.filter.vcf \
-L validation.list
$ grep -v "#" all.val.filter.vcf | wc -l
6
$ grep -v "#" sm.val.filter.vcf | wc -l
9
• Check how many of the consensus variants match before the additional GATK filter.
$ cd
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SMV.vcf \
-o all.val.vcf \
-L validation.list
$ java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
--variant HCC1143_SM.vcf \
-o sm.val.vcf \
-L validation.list
$ grep -v "#" all.val.vcf | wc -l
6
$ grep -v "#" sm.val.vcf | wc -l
9
• consensus: before GATK filter (32/6) - after GATK filter (32/6)
• partial consensus-SM: before GATK filter (45/9) - after GATK filter (45/9)
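The pairs above read as (number of calls / number of those calls found in the COSMIC and CCLE list). Expressed as a percentage of calls, the arithmetic is as follows. `validated_pct` is a hypothetical helper, and treating the list as a truth set is an assumption, since a cell-line catalogue need not be complete.

```shell
# validated_pct HITS TOTAL
# Percentage of calls that also appear in the validation list.
validated_pct() {
    awk -v hits="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "NA"; exit }
        printf "%.2f\n", 100 * hits / total
    }'
}

validated_pct 6 32   # full consensus
validated_pct 9 45   # partial consensus (SomaticSniper/MuTect)
```

This gives 18.75 for the full consensus and 20.00 for the SomaticSniper/MuTect partial consensus, the same before and after the GATK filter.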
5.3 Summary
• Programs: GATK
• Outputs: the number of final consensus / partial consensus calls that match COSMIC and CCLE
6 Other Somatic Mutation Callers - Strelka, Virmid
6.1 Strelka (1 min 38 s)
Strelka is a somatic mutation caller that uses Bayesian probability with posterior filtering, released by Illumina in 2012. Besides Illumina's own aligners, Isaac and ELAND, it also supports bwa. It is run somewhat differently from most programs, in the same way as Illumina's recently developed Isaac: a job is split into several steps and, to process them efficiently, the steps are driven by the make tool through a Makefile.
• Running Strelka requires a file that holds its options; default option files are provided for three aligners: bwa, eland, and isaac.
• For exome or targeted sequencing, change isSkipDepthFilters to 1 in the default options.
$ ll /somatic_bench/app/strelka-1.0.14/etc/
total 20
drwxrwxr-x 2 viz viz 4096 Jul 10 2014 ./
drwxr-xr-x 7 root root 4096 Jan 30 11:06 ../
-rw-rw-r-- 1 viz viz 3658 Jul 10 2014 strelka_config_bwa_default.ini
-rw-rw-r-- 1 viz viz 3683 Jul 10 2014 strelka_config_eland_default.ini
-rw-rw-r-- 1 viz viz 3821 Jul 10 2014 strelka_config_isaac_default.ini
• Set variables for the Strelka installation directory and the directory where analysis results are stored.
• Copy the default option file and generate the analysis commands with configureStrelkaWorkflow.pl.
• Run the generated analysis with make; the -j option sets the number of threads (CPUs) used for the analysis.
• Results are written in VCF format for INDELs and SNPs separately; with passed and raw somatic calls for each, four result files are produced.
$ STRELKA_INSTALL_DIR=/somatic_bench/app/strelka-1.0.14/
$ echo $STRELKA_INSTALL_DIR
/somatic_bench/app/strelka-1.0.14/
$ WORK_DIR=/root/myWork
$ cp $STRELKA_INSTALL_DIR/etc/strelka_config_isaac_default.ini config.ini
$ $STRELKA_INSTALL_DIR/bin/configureStrelkaWorkflow.pl \
--normal=/root/normal.bam \
--tumor=/root/tumor.bam \
--ref=/somatic_bench/reference/human_g1k_v37_decoy.fasta \
--config=config.ini --output-dir=./myAnalysis
$ cd ./myAnalysis
$ make -j 8
$ ll myAnalysis/results/
total 88
drwxr-xr-x 2 root root 4096 Jan 30 11:39 ./
drwxr-xr-x 5 root root 4096 Jan 30 11:37 ../
-rw-r--r-- 1 root root 13452 Jan 30 11:37 all.somatic.indels.vcf
-rw-r--r-- 1 root root 36736 Jan 30 11:37 all.somatic.snvs.vcf
-rw-r--r-- 1 root root 7098 Jan 30 11:37 passed.somatic.indels.vcf
-rw-r--r-- 1 root root 16070 Jan 30 11:37 passed.somatic.snvs.vcf
• Check the number of somatic SNPs that finally passed.
$ cd myAnalysis/results/
$ grep -v "#" passed.somatic.snvs.vcf | wc -l
62
6.2 Virmid (33 min)
Virmid is software developed in 2013 by a university professor's group. Through sampling, it estimates the proportion of normal sample in the tumor (α).
• Check the number of somatic SNPs that finally passed.
$ java -jar /somatic_bench/app/Virmid-1.1.1/Virmid.jar \
-R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
-D /root/tumor.bam \
-N /root/normal.bam \
-t 8 \
-w /root/virmid
$ cd /root/virmid
$ ls -al
total 98024
drwxr-xr-x 2 root 4096 Jan 30 16:00 ./
drwxr-xr-x 8 root 8192 Jan 30 15:32 ../
-rw-r--r-- 1 root 1252161 Jan 30 16:03 tumor.bam.virmid.germ.all.vcf
-rw-r--r-- 1 root 955213 Jan 30 16:03 tumor.bam.virmid.germ.passed.vcf
-rw-r--r-- 1 root 262 Jan 30 16:00 tumor.bam.virmid.gm
-rw-r--r-- 1 root 36564 Jan 30 16:03 tumor.bam.virmid.loh.all.vcf
-rw-r--r-- 1 root 2233 Jan 30 16:01 tumor.bam.virmid.loh.passed.vcf
-rw-r--r-- 1 root 992 Jan 30 16:03 tumor.bam.virmid.report
-rw-r--r-- 1 root 1364144 Jan 30 15:29 tumor.bam.virmid.sample.control.bai
-rw-r--r-- 1 root 53107377 Jan 30 15:29 tumor.bam.virmid.sample.control.bam
-rw-r--r-- 1 root 1364104 Jan 30 15:29 tumor.bam.virmid.sample.disease.bai
-rw-r--r-- 1 root 41746178 Jan 30 15:29 tumor.bam.virmid.sample.disease.bam
-rw-r--r-- 1 root 84053 Jan 30 16:03 tumor.bam.virmid.som.all.vcf
-rw-r--r-- 1 root 6883 Jan 30 16:03 tumor.bam.virmid.som.passed.vcf
$ grep -v "#" tumor.bam.virmid.som.passed.vcf|wc -l
78
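7 Linux for Bioinformatics Research
7.1 Practice Linux server
• Server address: xxx.xxx.xxx.xxx
• IDs: edu01, edu02
• Password: kogo2015
• Web access: http://xxx.xxx.xxx.xxx:8787
7.2 Connecting to the practice Linux server - Windows users
• Go to http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
• Download putty.exe for Intel x86.
• Host Name: xxx.xxx.xxx.xxx / Port: xx
• When the Security Alert window appears, choose 'Yes (Y)'.
• Login: use the ID and password assigned to you.
7.3 Connecting to the practice Linux server - Mac or Linux users
• On a Mac (OS X), run 'Applications, Utilities, Terminal'. On Linux, run a console, or the terminal from the program menu of your distribution.
$ ssh user_id@host_name
$ ssh root@127.0.0.1
• Use the ssh command to connect to the practice Linux server. On the first connection, answer yes; a password prompt then appears, where you enter the password you were given.
7.4 Finding Linux system information
This document is written for 'Ubuntu', one of the Linux distributions. (Linux distributions fall broadly into the Red Hat family and the Debian family, and each family has a variety of distributions.) Unless marked otherwise, the material applies to every kind of Linux. Linux is an operating system that runs on many distributions and kinds of hardware. You need to know what environment your Linux runs in so that, when installing software, you can install the build that matches it.
• How to identify the Linux distribution you are using. Ubuntu is a Linux operating system distributed by Canonical; the latest version at the time of writing is 14.04 LTS (Long Term Support), code-named Trusty Tahr.
$ cat /etc/issue.net
Ubuntu 12.04.1 LTS
• Linux runs on a variety of hardware, and software that supports Linux provides separate executables for different hardware. You therefore need to know your hardware to download software that fits it. The hardware type of the machine on which the Linux server is installed can be found with the '-m' (machine) option of uname. 'x86' means an Intel-based CPU, and '64' means 64-bit hardware (x64 for short).
$ uname -m
x86_64
• The kernel is the core of the Linux operating system: it takes the user's commands and has the hardware execute them. Different distributions use different kernel versions. The latest Linux kernel at the time of writing is 3.14.3, released on 6 May 2014. Linux distributions are built on kernels released in this way. The kernel version can be identified as follows.
$ uname -r
3.2.0-32-virtual
• The shell is the environment that accepts Linux commands and runs them. 'PATH' is one of the environment variables on Unix-like operating systems; it lists the directories in which executables are stored. export is the command that sets the value of such an environment variable. When you enter a command, Linux first searches the directories listed in PATH, checks whether the command is there, and runs it. So when you install your own software and run it on Linux, the directory must be added to PATH for the program to run from anywhere; otherwise it can be run only from inside the directory where it was installed. The shell's environment variables can be inspected with the 'env' command, and PATH is set with 'export'.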
$ env | grep PATH
MANPATH=/usr/local/texlive/2013/texmf/doc/man:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
INFOPATH=/usr/local/texlive/2013/texmf/doc/info:
$ export PATH=/BIO/app/bwa-0.7.5a/:$PATH
$ env | grep PATH
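Running the export line more than once, for example from a startup file that is sourced repeatedly, keeps lengthening PATH. A small guard, sketched below, prepends a directory only when it is not already listed. `path_prepend` is a hypothetical helper name; the bwa directory in the usage line is the one from the example above.

```shell
# path_prepend DIR
# Put DIR at the front of PATH unless PATH already contains it.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already listed: leave PATH as it is
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Usage: path_prepend /BIO/app/bwa-0.7.5a
```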
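7.5 The Linux file system
Linux divides a single physical disk logically into several areas and creates a file system in each area to manage files and directories.
• A Linux system is shared by many users, and each user has a home directory of their own. Inside the home directory you can create and delete files. The 'cd' command moves to the home directory, and the 'pwd' command shows the path of the current directory.
$ cd
$ pwd
/home/hongiiv
• Create a directory and move into it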
$ cd
$ mkdir sample_data
$ ls -la
total 2203488
drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
drwxr-xr-x 3 root root 4096 May 7 13:14 ..
-rw------- 1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
-rw-r--r-- 1 hongiiv hongiiv 220 May 7 13:14 .bash_logout
-rw-r--r-- 1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
drwxr-xr-x 2 root root 4096 May 29 10:34 sample_data
$ cd sample_data
$ pwd
/home/hongiiv/sample_data
• Delete a directory and files
$ cd
$ rm -rf sample_data
$ ls -la
total 2203488
drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
drwxr-xr-x 3 root root 4096 May 7 13:14 ..
-rw------- 1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
-rw-r--r-- 1 hongiiv hongiiv 220 May 7 13:14 .bash_logout
-rw-r--r-- 1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
$
• View the Linux file systems
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 19G 14G 4.8G 74% /
udev 3.9G 4.0K 3.9G 1% /dev
tmpfs 1.6G 188K 1.6G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 3.9G 0 3.9G 0% /run/shm
/dev/xvdb1 79G 38G 38G 50% /home/hongiiv/test
• <¨ X‹§l X Ù Ù0 - 21.5 GBX <¨ x /dev/xvda X‹§lî vxda1, xvda2 2⌧X
X< l1⇠¥ à<p Linux, Linux swapX |‹§ÑD Ux` ⇠ ൻ‰.
$ fdisk -l
Disk /dev/xvda: 21.5 GB , 21474836480 bytes
255 heads , 63 sectors/track , 2610 cylinders , total 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00034212
Device Boot Start End Blocks Id System
/dev/xvda1 2048 40038399 20018176 83 Linux
/dev/xvda2 40038400 41940991 951296 82 Linux swap / Solaris
Disk /dev/xvdb: 300.6 GB , 300647710720 bytes
171 heads , 35 sectors/track , 98112 cylinders , total 587202560 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x3459a991
Device Boot Start End Blocks Id System
/dev/xvdb1 2048 587202559 293600256 8e Linux LVM
• Check file system mount information
$ cat /etc/fstab
proc /proc proc nodev,noexec,nosuid 0 0
/dev/xvda1 / ext3 errors=remount-ro 0 1
/dev/xvda2 none swap sw 0 0
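7.6 Adding a hard disk in Linux
• After confirming the newly added hard disk with fdisk, the disk is put to use in three steps: partitioning, creating a file system, and mounting. To have Linux recognize a USB storage device, only the mount step is needed.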
$ fdisk /dev/xvdb
$ mkfs.ext3 /dev/xvdb1
$ mkdir /new_hdd
$ mount /dev/xvdb1 /new_hdd
$ cd /new_hdd
$ df -h
7.7 File-related commands
• touch - creates an empty file of size 0, or changes the timestamp of an existing file. It is used often when installing or testing software, so it is worth remembering.
$ touch a
$ ls -al
-rw-r--r-- 1 root root 0 Jun 18 10:04 a
$ date
Wed Jun 18 10:05:10 KST 2014
$ touch -c a
$ ls -al
-rw-r--r-- 1 root root 0 Jun 18 10:05 a
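The `-c` option used above only tells `touch` not to create a file that is missing; on an existing file the timestamp is updated either way. A small sketch of that behaviour follows, with a scratch directory and file names chosen purely for illustration.

```shell
dir=$(mktemp -d)

touch "$dir/a"                         # creates an empty file
size=$(( $(wc -c < "$dir/a") ))        # size in bytes of the new file
echo "size of a: $size"

touch -c "$dir/missing"                # -c: do not create the file if it is absent
if [ -e "$dir/missing" ]; then
    echo "missing was created"
else
    echo "missing was not created"
fi

# Set an explicit modification time instead of "now" ([[CC]YY]MMDDhhmm).
touch -t 201406181005 "$dir/a"
ls -l "$dir"
```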
• cat - displays the contents of a file, and is also used to write short scripts. The command 'cat > test' creates a file named test and writes whatever is typed into it. When finished, press 'ctrl+D' to exit.
$ cat > test
hi there
my name is hong
$ cat test
hi there
my name is hong
$ ls -al
-rw-r--r-- 1 root root 25 Jun 18 10:09 test
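Inside a script there is no keyboard to press ctrl+D on, so the same file is usually written with a here-document. This is a sketch of that alternative; the scratch directory is an assumption made so the example does not overwrite an existing `test` file.

```shell
dir=$(mktemp -d)

# Everything up to the closing EOF marker is written to the file.
cat > "$dir/test" <<'EOF'
hi there
my name is hong
EOF

cat "$dir/test"
bytes=$(( $(wc -c < "$dir/test") ))
echo "$bytes"    # 25, matching the size reported by ls in the session above
```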
• Count the number of files in the current directory
$ ls -l . | grep ^- | wc -l
50
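The `ls -l | grep ^-` pipeline skips hidden files because `ls` is called without `-a`. A sketch comparing it with `find` is shown below; the file names are made up for the demonstration, and `-maxdepth` is a GNU/BSD extension rather than strict POSIX.

```shell
dir=$(mktemp -d)
touch "$dir/one" "$dir/two" "$dir/.hidden"
mkdir "$dir/subdir"

# Regular files as seen by ls -l (hidden files are not listed).
n_ls=$(( $(ls -l "$dir" | grep -c '^-') ))

# Regular files directly inside the directory, hidden ones included.
n_find=$(( $(find "$dir" -maxdepth 1 -type f | wc -l) ))

echo "ls: $n_ls  find: $n_find"    # ls: 2  find: 3
```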
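• Print the lines of a file other than those that start with a particular character. In a file such as a VCF, where lines starting with '#' are comments, this prints the actual list of variants with the comments excluded. Conversely, only the comment lines can be printed.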
$ cd /BIO/data/gatk
$ grep -v "#" dbsnp_138.hg19.vcf | wc -l
8087914
$ grep -F "#" dbsnp_138.hg19.vcf | wc -l
165
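The patterns above match a '#' anywhere on the line, not only at the start. For dbSNP that may make no difference, but a data line whose INFO field happens to contain '#' would be dropped. The sketch below uses a tiny made-up VCF (the `NOTE=a#b` value is invented for illustration) to show the difference when the pattern is anchored with `^`.

```shell
vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\trs1\tA\tG\t.\t.\tNOTE=a#b\nchr1\t200\trs2\tC\tT\t.\t.\t.\n' > "$vcf"

loose=$(( $(grep -vc '#' "$vcf") ))       # drops every line containing '#'
anchored=$(( $(grep -vc '^#' "$vcf") ))   # drops only lines starting with '#'
headers=$(( $(grep -c '^#' "$vcf") ))     # counts the header lines

echo "loose=$loose anchored=$anchored headers=$headers"
# loose=1 anchored=2 headers=2
```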
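• Print only a specific column. The values of that column can then be ordered with sort ('-d' for dictionary order), and the occurrences of each value counted with uniq ('-c').
$ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | more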
chrM
chrM
chrM
chrM
chrM
chrM
chrM
chrM
chrM
$ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | sort -d
chr1
chr2
$ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | uniq -c
475 chrM
4723878 chr1
3363561 chr2
$ grep -v "#" dbsnp_138.hg19.vcf |
awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}'
chrM is: 16390
chrM is: 16391
chrM is: 16429
chrM is: 16445
chrM is: 16499
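`uniq -c` only merges adjacent identical lines, so the per-chromosome counts above come out as one line per chromosome only because the VCF is already grouped by chromosome. For input that is not grouped, sorting first (or counting in awk) is needed. A sketch on a small unsorted sample, whose contents are invented for illustration:

```shell
vcf=$(mktemp)
printf 'chr1\t10\nchr2\t20\nchr1\t30\nchrM\t5\nchr1\t40\n' > "$vcf"

# Without sorting, chr1 shows up as three separate groups.
unsorted=$(( $(cut -f1 "$vcf" | uniq -c | wc -l) ))

# Sorting first brings identical values together.
sorted=$(( $(cut -f1 "$vcf" | sort | uniq -c | wc -l) ))

# The same tally with an awk associative array; no pre-sorting required.
counts=$(awk '{ n[$1]++ } END { for (c in n) print c, n[c] }' "$vcf" | sort)

echo "groups without sort: $unsorted, with sort: $sorted"
echo "$counts"
```

`cut -f1` is used here instead of `awk '{print $1}'`; both select the first tab-separated column of a VCF line.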
• Save the printed output to a file
$ grep -v "#" dbsnp_138.hg19.vcf |
awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}' > ~/chr_pos.txt
$ grep -v "#" dbsnp_138.hg19.vcf |
awk '{if ($1 == "chr1") printf "chr1 is: %s\n", $2}' >> ~/chr_pos.txt
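The two commands differ in their redirection operator: `>` truncates the target file before writing, while `>>` appends to it. A minimal sketch of the difference, using a temporary file in place of `~/chr_pos.txt`:

```shell
out=$(mktemp)

echo "first"  >  "$out"      # file now holds one line
echo "second" >> "$out"      # appended: two lines
lines_after_append=$(( $(wc -l < "$out") ))

echo "third"  >  "$out"      # truncated and rewritten: one line again
lines_after_overwrite=$(( $(wc -l < "$out") ))

echo "$lines_after_append $lines_after_overwrite"    # 2 1
```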
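7.8 Linux network information
• Information about the network interfaces: the inet addr of eth0 is the address at which this Linux machine can currently be reached from outside (see footnote 6).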
$ ifconfig
eth0 Link encap:Ethernet HWaddr 02:00:5b:73:00:33
inet addr:172.27.252.234 Bcast:172.27.255.255
inet6 addr: fe80::5bff:fe73:33/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:501386 errors:0 dropped:0 overruns:0 frame:0
TX packets:346879 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:19357734604 (1 GB) TX bytes:2720265191 (2 GB)
Interrupt:68
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:4337 errors:0 dropped:0 overruns:0 frame:0
TX packets:4337 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2203478 (2.2 MB) TX bytes:2203478 (2.2 MB)
Footnote 6: The Linux server address shown here, 172.27.252.234, will differ depending on each participant's practice environment.
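When only the address is needed, it can be pulled out of the `ifconfig` text with `sed`. The sketch below parses a captured sample in the older `inet addr:` output format shown above rather than calling `ifconfig` itself, since newer distributions print a different layout and often provide `ip addr` or `hostname -I` instead. The `sample` variable is an illustrative stand-in for the live command output.

```shell
sample='eth0 Link encap:Ethernet HWaddr 02:00:5b:73:00:33
 inet addr:172.27.252.234 Bcast:172.27.255.255
 inet6 addr: fe80::5bff:fe73:33/64 Scope:Link'

# Keep only the digits and dots that follow "inet addr:" on the first matching line.
ip=$(printf '%s\n' "$sample" |
     sed -n 's/.*inet addr: *\([0-9.]*\).*/\1/p' |
     head -n 1)

echo "$ip"    # 172.27.252.234
```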
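7.9 Understanding compression in Linux
Linux supports a variety of compression formats, and software and data are usually distributed as compressed files.
• These are the ways to decompress the various formats used in Linux. The decompressed files contain a quiz; a prize goes to the first person to solve it.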
$ cd
$ cp -R /BIO/data/compress ./compress
$ cd compress
$ gzip -d compress01.gz
$ tar xvfz compress02.tar.gz
$ unzip compress03.zip
$ bzip2 -d compress04.bz2
$ tar xvfz compress05.tar.gz
$ tar xvf compress06.tar.bz2
• gzip: Recommended for fast network connections
• bzip2: Recommended for slower network connections (smaller size but takes longer to compress)
• zip: Not recommended but is provided as an option for those who cannot open the above formats
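Since each format needs a different command, a small helper that picks the command from the file extension can save some typing. This is a sketch only; the `extract` function name is an assumption, and the `bzip2` and `unzip` branches require those tools to be installed.

```shell
# Choose the decompression command from the file name.
# The *.tar.* patterns must come before the plain *.gz / *.bz2 ones.
extract() {
    case "$1" in
        *.tar.gz|*.tgz) tar xzf "$1" ;;
        *.tar.bz2)      tar xjf "$1" ;;
        *.gz)           gzip -d "$1" ;;
        *.bz2)          bzip2 -d "$1" ;;
        *.zip)          unzip -q "$1" ;;
        *) echo "extract: unknown archive type: $1" >&2; return 1 ;;
    esac
}

# Demonstration in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1
echo hello > note.txt
tar czf bundle.tar.gz note.txt && rm note.txt
extract bundle.tar.gz
cat note.txt
```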
• A way to preview the contents of a large compressed file without decompressing it to disk. This is useful for checking FASTQ files and similar data.
$ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.gz | more
$ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.tar.gz | tar -tvf -
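The same streaming idea applies to gzipped FASTQ files. The sketch below builds a two-read FASTQ file (read names and sequences are invented) and then previews and counts it without writing an uncompressed copy; the read count assumes the common four-lines-per-record layout.

```shell
dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' |
    gzip -c > "$dir/reads.fastq.gz"

# Preview the first record; the .gz file on disk is left untouched.
gzip -dc "$dir/reads.fastq.gz" | head -n 4

# Four lines per record, so the read count is the line count divided by 4.
reads=$(( $(gzip -dc "$dir/reads.fastq.gz" | wc -l) / 4 ))
echo "$reads"    # 2
```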
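7.10 Installing software in Linux
In general, there are three ways to install software on Linux. The first is to obtain a binary (executable) file distributed as a compressed archive, which can be used simply by decompressing it. The second is to use the packages provided by the Linux distribution; on Ubuntu this is done with the package manager called APT. The third is to build and install from source files.
7.10.1 Installing software using APT
• Updating and installing packages with APT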
$ apt-get update
$ apt-get install bwa
Reading package lists... Done
Building dependency tree
Reading state information... Done
Use 'apt-get autoremove' to remove them.
Suggested packages:
samtools
The following NEW packages will be installed:
bwa
0 upgraded, 1 newly installed, 0 to remove and 153 not upgraded.
Need to get 135 kB of archives.
After this operation, 286 kB of additional disk space will be used.
Fetched 135 kB in 3s (40.1 kB/s)
Selecting previously unselected package bwa.
(Reading database ... 17 files and directories currently installed.)
Unpacking bwa (from .../archives/bwa_0.6.1-1_amd64.deb) ...
Processing triggers for man-db ...
Setting up bwa (0.6.1-1) ...
$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.1-r104
Contact: Heng Li <lh3@sanger.ac.uk>
Usage:   bwa <command> [options]
Command: index         index sequences in the FASTA format
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries
         fastmap       identify super-maximal exact matches
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ
         pac2cspac     convert PAC to color-space PAC
         stdsw         standard SW/NW alignment
• The following packages should be installed in advance as a base for installing NGS-related software.
$ apt-get update -y
$ apt-get install gcc -y
$ apt-get install make -y
$ apt-get install zlib1g-dev -y
$ apt-get install libncurses5-dev -y
$ apt-get install g++ -y
$ apt-get install tcl tk -y
$ apt-get install tcl-dev -y
$ apt-get install unzip -y
$ apt-get install curl -y
$ apt-get install screen -y
$ apt-get install python-dev -y
$ apt-get install python-software-properties -y
$ add-apt-repository ppa:webupd8team/java
$ apt-get update -y
$ apt-get install oracle-java7-installer -y
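After running the installs, it can be useful to confirm that the command-line tools are actually on the PATH. The sketch below uses a hypothetical helper, `missing_tools`, that prints the names of any commands it cannot find; the tool list is taken from the packages above that ship an executable of the same name.

```shell
# Print each requested command that is not available on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools gcc make g++ unzip curl screen)
if [ -n "$missing" ]; then
    echo "not installed yet:"
    echo "$missing"
else
    echo "all tools found"
fi
```

Library packages such as zlib1g-dev provide headers rather than a command, so they are not covered by this check.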
7.10.2 Installing software from source code files
• Installing from source
$ cd
$ cp /BIO/app/bwa-0.7.4.tar.bz2 ./
$ tar xvf bwa-0.7.4.tar.bz2
$ cd bwa-0.7.4
$ make
$ ./bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.4-r385
Contact: Heng Li <lh3@sanger.ac.uk>
Usage:   bwa <command> [options]
Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ
$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.2-r126
Contact: Heng Li <lh3@sanger.ac.uk>
Usage:   bwa <command> [options]