This document summarizes Illumina's efforts to generate a population-based structural variant callset using whole genome sequencing data from 3,000-4,000 samples. Key points include using multiple variant callers and population genetics analysis to generate hypotheses about common structural variants, assembling putative deletions to refine breakpoints, developing a graph-based genotyping tool, and validating variants using depth analysis and Mendelian inheritance checks. The goals are to improve consistency and accuracy in calling common structural variants across any sample.
2. COMPANY CONFIDENTIAL – FOR INTERNAL USE ONLY
Populations as a CNV/SV discovery tool
● Collected high-depth WGS data from 3,000-4,000 samples
● Current SV & CNV callers and SNP information can be used for hypothesis generation
- Manta, Canvas & SNP genotype (GT) information (deviations from Hardy-Weinberg equilibrium, HWE)
● Confirm CNVs and SVs and refine break points
- Combination of depth and assembly
● Create targeted callers for common SVs
- Develop improved methods to call these variants on any sample (improved consistency and accuracy)
Hypothesis generation
● Manta (split read) calls from ~3,000 samples
- Good for deletions up to ~10kb
- These variants should work well for graph-based calling
● Canvas (depth) calls from ~3,000 samples
- Primarily deletions >10kb
- These variants should be callable using targeted depth analysis
● Population genetics (SNP calls from ~2,200 Europeans)
- Good for most size ranges but relies on SNPs overlapping CNV
- May identify variants that split read methods miss
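The SNP-based signal can be illustrated with a small sketch. A hemizygous deletion makes heterozygous carriers look homozygous at overlapping SNPs, so those SNPs tend to show a heterozygote deficit relative to Hardy-Weinberg expectations. The code assumes per-SNP genotype counts are already available. The function name and the example counts are hypothetical, and a 1-df chi-square test is only one of several ways to test HWE; the slides do not say which test was used.

```python
import math

def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Return (chi2, p_value, het_deficit) for a 1-df HWE chi-square test."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Reference allele frequency estimated from the observed genotypes.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    # Skip classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, n_het < expected[1]

# Hypothetical counts: far fewer heterozygotes than HWE predicts.
chi2, p_value, het_deficit = hwe_chi_square(150, 100, 150)
print(chi2, p_value, het_deficit)
```

SNPs with a small p-value and a heterozygote deficit, clustered in one region, would be candidates for an overlapping deletion.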
Small SV validation and boundary resolution
Assembly of putative deletions
● Assemble common deletions using different methods
- K-mer approach (SPAdes)
- String assembly (SGA)
● Large number of samples should improve assembly
- 5% deletion ~ 4,500x depth (in 3,000 sequenced samples)
- Easy if we can identify just the reads associated with the deletion
- Low complexity (e.g. STRs) and variability around break-ends are problematic (but highlight issues that need to be resolved)
● Assembled ~9,800 deletions to break point resolution
- Starting from Manta calls in Platinum Genomes (PG) and the population samples
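The pooled depth estimate for a 5% deletion can be reproduced with simple arithmetic. This sketch reads "5% deletion" as a 5% allele frequency and assumes roughly 30x coverage per sample. The per-sample depth is an assumption, since the slides only say "high-depth". The function name is hypothetical.

```python
def pooled_allele_depth(n_samples, allele_freq, per_sample_depth=30.0):
    """Approximate total read depth supporting one allele across a cohort."""
    # Each diploid sample contributes half its depth per haplotype.
    haplotype_depth = per_sample_depth / 2.0
    # Expected number of haplotypes carrying the allele.
    carrier_haplotypes = 2 * n_samples * allele_freq
    return carrier_haplotypes * haplotype_depth

# 3,000 samples, 5% allele frequency, assumed 30x per sample.
print(pooled_allele_depth(3000, 0.05))
```

Under these assumptions the result is about 4,500x, which matches the figure on the slide.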
Genotyping deletions with graphs
● Graph alignment + genotyping tool
- Create a sequence graph for deletions
- Align + count reads on BP edges and on BP boundaries
● “Genotype” deletions via mixture model on read counts.
● Very preliminary results and some obvious improvements are still in progress
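Genotyping from read counts with a mixture model can be sketched as below. The slides do not describe the actual model, so this is an illustrative stand-in: a three-component binomial mixture over the ALT read fraction, with fixed component means for 0/0, 0/1 and 1/1 and mixture weights fitted by EM across samples. The function names, the error rate `err`, and the example counts are all assumptions.

```python
import math

def _log_binom(k, n, p):
    """Log binomial probability of k ALT reads out of n at ALT fraction p."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)

def genotype_mixture(counts, err=0.02, n_iter=50):
    """counts: list of (ref_reads, alt_reads), one pair per sample.

    Returns (weights, calls), where calls are ALT allele counts 0, 1 or 2.
    """
    if not counts:
        raise ValueError("no samples supplied")
    fractions = (err, 0.5, 1.0 - err)  # expected ALT fraction per genotype
    weights = [1.0 / 3.0] * 3
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities for each sample.
        posteriors = []
        for ref, alt in counts:
            logs = [math.log(w) + _log_binom(alt, ref + alt, f)
                    for w, f in zip(weights, fractions)]
            top = max(logs)
            probs = [math.exp(x - top) for x in logs]
            total = sum(probs)
            posteriors.append([x / total for x in probs])
        # M-step: update mixture weights, floored to keep log() finite.
        raw = [max(sum(r[g] for r in posteriors) / len(counts), 1e-6)
               for g in range(3)]
        weights = [w / sum(raw) for w in raw]
    calls = [max(range(3), key=lambda g: r[g]) for r in posteriors]
    return weights, calls

# Hypothetical (REF, ALT) breakpoint read counts for six samples.
weights, calls = genotype_mixture(
    [(30, 0), (28, 1), (15, 14), (16, 15), (0, 29), (1, 30)])
print(weights, calls)
```

Fitting the weights across the cohort lets common and rare deletions use different genotype priors. A fuller model would also fit the component means and account for mapping bias toward the reference path.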
[Figure: sequence graph for a deletion. Labels: REF+ALT, REF, ALT, REF BP start, REF BP end, ALT BP; 25 bp up / 25 bp down at each breakpoint]
Candidate events in Platinum Genomes:
PG Pedigree Genotyping
● High-level variant statistics
- ~3,700 hom-ref, 830 hom-alt
- ~1,100 variants consistent
- ~2,000 variants probably wrong in <= 2 samples
- ~1,900 need improvement
GIAB Trio Analysis
● GIAB Ashkenazim Trio:
- HG002 – Son
- HG003 – Father
- HG004 – Mother
● Trio Statistics (Total: 9,739)
Outcome                              Count    Percentage
Homref everywhere                    3,425    35.2%
Pass                                 2,131    21.9%
Conflict (5) / Male X is het (28)       33     0.3%
Parents het                          1,045    10.7%
Called in 2                          1,637    16.8%
Called in <= 1                       1,468    15.1%
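The kind of Mendelian inheritance check behind the "Pass" and "Conflict" rows can be sketched as follows for autosomal sites. Genotypes are encoded as ALT allele counts (0, 1, 2). The exact category definitions used for the table are not given in the slides, so this covers only the basic consistency rule, and the function names are hypothetical.

```python
def transmissible(genotype):
    """ALT allele counts (0 or 1) that a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

def is_mendelian_consistent(child, father, mother):
    """True if the child's genotype can be formed from one allele per parent."""
    return any(f + m == child
               for f in transmissible(father)
               for m in transmissible(mother))

# HG002 (son), HG003 (father), HG004 (mother) with hypothetical genotypes.
print(is_mendelian_consistent(child=1, father=0, mother=2))  # consistent
print(is_mendelian_consistent(child=0, father=2, mother=0))  # conflict
```

Sites on chromosome X in a male child need a separate rule, because a heterozygous call there is itself an error signal, as the "Male X is het" row reflects.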
Depth-based validation and boundary resolution
Canvas-identified deletion in the population
● Characterization of a high frequency call
● Use fine-grained depth-map to isolate boundaries of the call
Improving breakpoints – depth pileup
● Characterization of a high frequency call
● Use fine-grained depth-map to isolate boundaries of the call
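One simple way to turn a fine-grained depth map into approximate boundaries is to normalize binned depth so that 1.0 means diploid coverage, then report the longest run of bins below a threshold. This is a sketch only, not the method behind the slides. The bin size, the threshold, the function name, and the example values are assumptions, and the threshold suits a single deletion carrier rather than pooled population depth.

```python
def deletion_bounds(depth, start, bin_size, threshold=0.75):
    """Return half-open (begin, end) coordinates of the longest low-depth run.

    depth: normalized depth per bin (1.0 = diploid), first bin at `start`.
    Returns None if no bin falls below the threshold.
    """
    best = None
    run_start = None
    # A trailing sentinel closes a run that reaches the last bin.
    for i, d in enumerate(list(depth) + [float("inf")]):
        if d < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return (start + best[0] * bin_size, start + best[1] * bin_size)

# Hypothetical 100 bp bins starting at position 1,000.
bins = [1.0, 0.98, 0.5, 0.52, 0.48, 1.01, 0.6, 1.0]
print(deletion_bounds(bins, start=1000, bin_size=100))
```

The resolution is limited to the bin size, so boundaries found this way are a starting point for assembly or split-read refinement.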
[Figure: depth pileup around the call; scale bars: 1 kb]
Improving breakpoints – six double deletions
[Figure: breakpoint detail; scale bars: 180 bp]
● Assembly from 6 homozygous-deletion samples resolves breakpoints and co-inserted sequence
● Expect ~5 double (homozygous) deletions in 3,000 samples for deletions with ~4% allele frequency
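The expected count follows from Hardy-Weinberg proportions: with deletion allele frequency q, the expected number of homozygous deletions among N samples is N * q^2. A minimal sketch, with a hypothetical function name:

```python
def expected_hom_deletions(n_samples, allele_freq):
    """Expected number of homozygous carriers under Hardy-Weinberg equilibrium."""
    return n_samples * allele_freq ** 2

# 3,000 samples at 4% allele frequency.
print(expected_hom_deletions(3000, 0.04))
```

For 3,000 samples and q = 0.04 this gives about 4.8, in line with the ~5 quoted on the slide and the six homozygous samples observed.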
Next steps (for deletions)
● Complete merging of all the putative common deletions
- Currently some parts of this work are proceeding in parallel and we have not merged the analysis/variants yet
● Attempt to validate and refine all deletions
- LD, depth, assembly
● Finish graph-based GT and depth-based GT
- QC these GT callers using both Mendel errors and HWE
● Map calls onto PG and GIAB samples
● Demonstrate other variants on other samples
- Sequencing 150 Coriell samples of diverse ethnicity to provide additional support of calls within (and outside of) GIAB families
Acknowledgements
Mitch Bekritsky (SNP analysis)
Andy Gross (population depth)
Nathan Johnson (assembly)
Peter Krusche (graph genotyping)