Lecture 1: Sequence alignment, data formats, QC,
and data processing
Thomas Keane
Sequence Variation Infrastructure Group
WTSI
Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture1.pdf
WTAC NGS Course, Hinxton 10th
April 2014
Some Background
Established the Vertebrate Resequencing Informatics team in 2008
● Bioinformaticians and software developers
● PIs: David Adams and Richard Durbin
● April 2014: establishing the Sequence Variation Infrastructure group at WTSI
Large scale NGS data processing
● 1000 genomes production and releases
● UK10K production group
● Exome and whole-genome sequencing
Computational methods
● Samtools
○ Widely used software for NGS analysis
● VCF and VCF tools
○ Widely used format and suite of tools for NGS variation analysis
● Structural variation
○ SVMerge
■ Detect structural variants (SVs) by integrating calls from several existing SV callers
○ RetroSeq
■ Detecting non-reference transposable elements
Comparative genomics
● Mouse genomes project – 17 mouse genomes deeply sequenced
● RNA-editing across mouse strains
● Transposable elements evolution and selection in mouse strains
● Human rare diseases
● Isolated human populations
Sequence assembly
● De novo assembly and gene finding of 18 mouse strains
Richard Durbin, David Adams
Zhicheng Liu
➢ NGS Data Formats
➢ Sequence Alignment
➢ QC from Alignments
➢ NGS Data Processing Workflows
➢ Lab Exercises
Primary NGS Data Formats
Fastq
● Unaligned read sequences with base qualities
BAM
● Aligned or unaligned reads
● Text and binary formats
CRAM
● Aligned or unaligned reads
● Advanced compression models
VCF
● Flexible variant call format
● Arbitrary types of sequence variation
● SNPs, indels, structural variations
FASTQ
FASTQ is a simple format for raw unaligned sequencing reads
● Simple extension to the FASTA format
● Sequence and an associated per base quality score
Originally a standard for storing capillary sequencing data
Format
● Qualities are encoded with a subset of the ASCII printable characters
● ASCII 33–126 inclusive with a simple offset mapping (quality = ASCII code - 33)
● perl -w -e "print ( unpack( 'C', '%' ) - 33 );"
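The offset mapping above can be sketched in Python. This is a minimal illustration assuming the Phred+33 encoding described on the slide; the function names are made up for this example.

```python
def phred33_to_scores(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    scores = []
    for char in quality_string:
        code = ord(char)
        if not 33 <= code <= 126:
            raise ValueError(f"character {char!r} is outside ASCII 33-126")
        scores.append(code - 33)
    return scores


def error_probability(score):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


# '%' is ASCII 37, so its quality is 37 - 33 = 4 (same as the perl one-liner)
print(phred33_to_scores("%5I"))
```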
SAM/BAM
SAM (Sequence Alignment/Map) format
● Single unified format for storing read alignments to a reference genome
BAM (Binary Alignment/Map) format
● Binary equivalent of SAM
● Developed for fast processing/indexing
Key features
● Can store alignments from most aligners
● Supports multiple sequencing technologies
● Supports indexing for quick retrieval/viewing
● Compact size (e.g. 112Gbp Illumina = 116Gbytes disk space)
● Reads can be grouped into logical groups e.g. lanes, libraries, samples
● Widely supported by variant calling software packages
Replacement for SRF and FASTQ
SAM/BAM
No. Name Description
1 QNAME Query NAME of the read or the read pair
2 FLAG Bitwise FLAG (pairing, strand, mate strand, etc.)
3 RNAME Reference sequence NAME
4 POS 1-Based leftmost POSition of clipped alignment
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR Extended CIGAR string (operations: MIDNSHP)
7 MRNM Mate Reference NaMe (‘=’ if same as RNAME)
8 MPOS 1-Based leftmost Mate POSition
9 ISIZE Inferred Insert SIZE
10 SEQ Query SEQuence on the same strand as the reference
11 QUAL Query QUALity (ASCII-33=Phred base quality)
Heng Li et al (2009) The Sequence Alignment/Map format and SAMtools, Bioinformatics, 25:2078-2079
HS18_07983:1:2203:5095:109107#36 163 ENA|AJ011856|AJ011856.1 412 60 100M = 471 159
ATAAAATTATTAATAATAATCAATATGAAATTAATAAAAATCTTATAAAAAAGTAATGAATACTCCTTTTTAAAAATAAAAAGGGGTTCGGTCCCCCCCC
9BCDGDEHGEHFHHGFHHJGHFHIGHFIGHFHGGGHGHGHGHJGHHGHHHGGHHHIGGGGGFGGDGGHFHFIGEGHGFGGHFEDGG4GHGGGFHGFHIEF
X0:i:1 X1:i:0 MD:Z:100 RG:Z:1#36.1
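The FLAG field in the record above (163) is a bitwise combination of properties. A small sketch of how it can be unpacked follows; the bit values are those of the SAM specification, while the short names are chosen for this example only.

```python
# (bit value, short description); names are illustrative, bits follow the SAM spec
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# 163 = 1 + 2 + 32 + 128
print(decode_flag(163))
```

So the example read is the second read of a properly paired fragment whose mate maps to the reverse strand.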
Cigar Format
Cigar has been traditionally used as a compact way to represent a
sequence alignment
Operations include
● M - match or mismatch
● I - insertion
● D - deletion
SAM extends these to include
● S - soft clip (ignore these bases)
● H - hard clip (ignore and remove these bases)
E.g.
Read:  ACGCA-TGCAGTtagacgt
Ref:   ACTCAGTG--GT
Cigar: 5M1D2M2I2M7S
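One way to see how the Cigar string is derived is to walk the alignment column by column. The sketch below assumes the slide's notation ('-' for a gap, lower-case read bases at the ends for soft clips); the function name is hypothetical.

```python
def cigar_from_alignment(read, ref):
    """Build a CIGAR string from a gapped read/reference pair.

    Assumed conventions (taken from the slide's notation): '-' marks a gap,
    and lower-case read bases at either end are soft-clipped.
    """
    clip_chars = "acgtn"
    lead = len(read) - len(read.lstrip(clip_chars))
    trail = len(read) - len(read.rstrip(clip_chars))
    core = read[lead:len(read) - trail]
    if len(core) != len(ref):
        raise ValueError("aligned read and reference must have equal length")

    ops = []  # run-length encoded list of [operation, count]

    def add(op, count=1):
        if ops and ops[-1][0] == op:
            ops[-1][1] += count
        else:
            ops.append([op, count])

    if lead:
        add("S", lead)
    for read_base, ref_base in zip(core, ref):
        if read_base == "-":
            add("D")      # base present in the reference only
        elif ref_base == "-":
            add("I")      # base present in the read only
        else:
            add("M")      # match or mismatch
    if trail:
        add("S", trail)
    return "".join(f"{count}{op}" for op, count in ops)


print(cigar_from_alignment("ACGCA-TGCAGTtagacgt", "ACTCAGTG--GT"))
```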
What is the cigar line?
E.g. Read: tgtcgtcACGCATG---CAGTtagacgt
Ref: ACGCATGCGGCAGT
Cigar:
Read Group Tag
Each lane has a unique RG tag that contains meta-data for the lane
RG tags
● ID: SRR/ERR number
● PL: Sequencing platform
● PU: Run name
● LB: Library name
● PI: Insert fragment size
● SM: Individual
● CN: Sequencing center
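In a SAM header each read group is one tab-separated @RG line of TAG:VALUE fields. A minimal parsing sketch follows; the example values (ERR000001, lib1, NA00000, SC) are invented placeholders, not taken from a real file.

```python
def parse_read_group(header_line):
    """Parse one '@RG' SAM header line into a dict of tag -> value."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not a read group header line")
    # Split on the first ':' only, since values may themselves contain ':'
    return dict(field.split(":", 1) for field in fields[1:])


# Hypothetical header line for illustration
example = "@RG\tID:ERR000001\tPL:ILLUMINA\tLB:lib1\tSM:NA00000\tCN:SC"
read_group = parse_read_group(example)
print(read_group["LB"], read_group["SM"])
```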
1000 Genomes BAM File
Command: samtools view -h my.bam | less -S
1000 Genomes BAM File
samtools view -H my.bam | less -S
How is the BAM file sorted?
How many different sequencing centres contributed lanes to this BAM file?
What is the alignment tool used to create this BAM file?
How many different sequencing libraries are there in this BAM? Hint: RG tag
SAM/BAM Tools
Several tools and programming APIs for interacting with SAM/BAM files
Samtools - Sanger/C (http://samtools.sourceforge.net)
● Convert SAM <-> BAM
● Sort and index BAM files
● Flagstat - summary of the mapping flags
● Merge multiple BAM files
● Rmdup - remove PCR duplicates from the library preparation
Picard - Broad Institute/Java (http://picard.sourceforge.net)
● MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary, SamToFastq,
MeanQualityByCycle, FixMateInformation…
Bio-SamTools - Perl (http://search.cpan.org/~lds/Bio-SamTools/)
Pysam - Python (http://code.google.com/p/pysam/)
BAM Visualisation
● BamView, LookSeq, Gap5: http://www.sanger.ac.uk/Software
● IGV: http://www.broadinstitute.org/igv/
● Tablet: http://bioinf.scri.ac.uk/tablet/
CRAM Format
BAM files are too large
● ~1.5-2 bytes per base pair
Increases in disk capacity are being far outstripped by growth in sequencing output
BAM stores all of the data
● Every read base
● Every base quality
● Using conventional compression techniques
CRAM: Two important concepts
● Reference based compression
● Controlled loss of quality information
Widely seen as the sequencing format of the future
● Support for CRAM being actively added to Samtools and Picard
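The idea of reference based compression can be illustrated with a toy example: instead of storing every base of an aligned read, store its position, its length and only the bases that differ from the reference. This is a simplified sketch that assumes reads align without indels; it is not the actual CRAM encoding, which is considerably more elaborate.

```python
def compress_read(reference, pos, read):
    """Encode a read as (position, length, differences from the reference).

    Assumes a 0-based position and an alignment without indels.
    """
    diffs = [(i, base) for i, base in enumerate(read)
             if reference[pos + i] != base]
    return pos, len(read), diffs


def decompress_read(reference, record):
    """Rebuild the read from the reference plus the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
record = compress_read(reference, 2, "GTTCG")   # one mismatch vs. "GTACG"
print(record)
print(decompress_read(reference, record))
```

A read that matches the reference exactly needs no per-base storage at all, which is where most of the saving comes from.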
Thomas Keane, WTSI 2th
April 2014
Reference Based Compression
Reference Based Compression
CRAM: Reference-based sequence data compression
CRAM Support
Currently
● CRAM Java toolkit (EBI)
● Scramble (WTSI)
Coming soon
● Samtools (WTSI) upcoming release
● Picard/GATK (Broad) in development
2014: WTSI aim to put CRAM into full
production pipelines
➢ NGS Data Formats
➢ Sequence Alignment
➢ Data QC
➢ NGS Data Processing Workflows
➢ NGS Visualisation and Inspection
Sequence Alignment
Sequence alignment in NGS is
● Process of determining the most likely source within the reference genome sequence that the
observed DNA sequencing read is derived from
Principles and approaches to sequence alignment have not changed
Basic Local Alignment Search Tool (BLAST)
● ‘Seed and extend’ approach
● Query sequences vs. larger database of sequences
● Split query sequences into short sequences (~10bp) and search for locations where these
cluster in the larger database of sequences
● Nucleotide blast, protein blast, blastx, tblastn, tblastx….
NGS: Nucleotide based alignment
● Very small evolutionary distances (human-human, strains of the reference genome)
● Allows for assumptions about the number of expected mismatches to speed up alignment
programs
NGS has just massively scaled up a challenge that has existed since the inception of bioinformatics
Hash Table Alignment
All hash table based algorithms essentially follow the same seed-and-extend paradigm
K-mer is a short fixed sequence of nucleotides
Typical algorithm
● Build a profile (index) of all possible k-mers of length n and the locations in the reference
genome they occur
○ Several Gbytes in size for human genome
● For each sequence read
○ Split into k-mers of length n
○ Lookup the locations in the reference via the index (seed phase)
○ Pick location on the genome with most k-mer hits
○ Perform Smith-Waterman alignment to fully align the read to the region
○ Output the alignment of each read onto the reference in BAM (or equivalent) format
Hash the reads: MAQ, ELAND, ZOOM and SHRiMP
● Smaller but more variable memory requirements
Hash the reference: SOAP, BFAST and MOSAIK
● Advantage: constant memory cost
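The seed phase of the typical algorithm above can be sketched as follows. This is a minimal illustration that hashes the reference and lets each k-mer hit vote for a read start position; the function names and the toy sequences are assumptions, and the Smith-Waterman extension step is left out.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def best_seed_location(read, index, k):
    """Return the reference start position supported by the most k-mer hits."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            # A hit at ref_pos for a k-mer at 'offset' implies this read start
            votes[ref_pos - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


reference = "TTGACCATGGCATTCAGGAC"
index = build_kmer_index(reference, k=4)
print(best_seed_location("CATGGCAT", index, k=4))
```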
Hash Table Alignment
[Figure: sequencing reads looked up in a k-mer hash of the reference genome]
Suffix/Prefix Tree Based Aligners
Store all possible suffixes or prefixes to enable fast string matching
A suffix trie, or simply a trie, is a data structure that stores all the
suffixes of a string, enabling fast string matching. The FM-index, a data
structure based on the Burrows-Wheeler Transform (BWT), supports the same
kind of matching in far less memory
FM-Index based
● Small memory footprint
Examples
● MUMmer, BWA, bowtie
Still require a final step to generate the local alignment
Delcher et al (1999) NAR
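To make the BWT idea concrete, here is a small sketch of exact matching by backward search over a BWT string. It is a teaching illustration only: the transform is built naively from sorted rotations and occurrence counts are recomputed on every step, whereas real aligners precompute and sample these tables.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text that sort before c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt_string)          # half-open interval of sorted suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + bwt_string[:lo].count(ch)
        hi = first[ch] + bwt_string[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("ACGACGT")
print(transformed, count_occurrences(transformed, "ACG"))
```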
Smith-Waterman Algorithm
Algorithm for generating the optimal local pairwise alignment between two
sequences
Time consuming to carry out for every read
● Only applied to a small subset of the reads that don’t have an exact match
● Important for correctly aligning reads with insertions/deletions
Match: +1
Mismatch: 0
Gap open: -1
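A compact sketch of the Smith-Waterman recurrence with the scores listed above is shown below. It returns only the best local alignment score and, as a simplifying assumption, charges the same -1 penalty for every gap position rather than modelling gap open and gap extension separately.

```python
def smith_waterman_score(read, ref, match=1, mismatch=0, gap=-1):
    """Best local alignment score between two sequences (linear gap penalty)."""
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if read[i - 1] == ref[j - 1] else mismatch)
            score[i][j] = max(
                0,                        # start a new local alignment
                diagonal,                 # match or mismatch
                score[i - 1][j] + gap,    # gap in the reference
                score[i][j - 1] + gap,    # gap in the read
            )
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("ACGT", "ACGT"))
```

The table has len(read) x len(ref) cells, which is why the slide notes it is too slow to run for every read against the whole genome.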
Mapping Qualities
What if there are several possible places in the genome to align your sequencing
read?
Genomes contain many different types of repeated sequences
● Transposable elements (40-50% of vertebrate genomes)
● Low complexity sequence
● Reference errors and gaps
Mapping quality is a measure of how confident the aligner is that the read
corresponds to this location in the reference genome
● Typically represented as a phred score (log scale)
● Q10 = 1 in 10 incorrect
● Q20 = 1 in 100 incorrect
Paired-end sequencing is useful
● One end maps inside a repetitive element and one outside in unique sequence
● Then the combined mapping quality can still be high
● Hence always do paired-end sequencing!
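The link between repeats and low mapping quality can be sketched with a simplified model: given a likelihood for each candidate placement of a read, the chance that the best placement is wrong is one minus its share of the total. This is an illustrative assumption, not the formula used by any particular aligner, and the cap of 60 is arbitrary.

```python
import math


def mapping_quality(likelihoods, cap=60):
    """Phred-scaled confidence that the best candidate placement is correct."""
    total = sum(likelihoods)
    p_wrong = 1 - max(likelihoods) / total
    if p_wrong <= 0:
        return cap                       # a single unambiguous placement
    return min(cap, round(-10 * math.log10(p_wrong)))


print(mapping_quality([0.99, 0.01]))     # one strong placement
print(mapping_quality([0.5, 0.5]))       # two equally good placements (repeat)
```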
Mapping Qualities
Alignment Limitations
Read Length and complexity of the genome
● Very short reads difficult to align confidently to the genome
● Low complexity genomes present difficulties
○ Malaria is 80% AT – lots of low complexity AT stretches
Alignment around indels
● Next-gen alignments tend to accumulate false SNPs near true indel
positions due to misalignment
● Smith-Waterman scoring schemes generally penalise a SNP less than a
gap open
● New tools developed to do a second pass on a BAM and locally realign the
reads around indels and ‘correct’ the read alignments
High density SNP regions
● Seed and extend based aligners can have an upper limit on the number of
consecutive SNPs in seed region of read (e.g. Maq - max of 2 mismatches
in first 28bp of read)
● BWT based aligners work best at low divergence
Read Length vs. Uniqueness
Example Indel
Scaling Up
30-40Gbp per HiSeq lane
● Aligning a single lane of reads can take a long time on a single computer
Parallel computing
● A form of computation in which many calculations are carried out
simultaneously
[Figure: example FASTQ records (@read1, @read2) aligned to produce a BAM file]
Scaling Up
Two main approaches to speeding up read alignment
● Simple parallelism by splitting the data
○ Split lane into 1Gbp chunks and align independently on different processors
■ BWA ~8 hours per 1Gbp chunk
○ Merge chunk BAM files back into single lane BAM
■ ‘samtools merge’ command
● Utilise multiple processors on single computer
○ Modern computers have >1 processing core or CPU
○ Most aligners can use more than one processor on same computer
○ Much easier for user
■ Just supply the number of processors to use (e.g. BWA -t option)
[Figure: Sequencing lane (Fastq, 30-40Gbp) -> Split into 1Gbp Fastq chunks (split1 to split4) -> Align each chunk (BAM1 to BAM4) -> Merge]
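The split step of the first approach can be sketched as below: read FASTQ records four lines at a time and start a new chunk whenever a base budget would be exceeded. The function names and the budget parameter are illustrative assumptions; each chunk would then be aligned separately and the resulting BAMs merged.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines (4 lines per record)."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        name, sequence, _plus, quality = (line.rstrip("\n") for line in record)
        yield name, sequence, quality


def split_by_bases(records, max_bases):
    """Group records into chunks holding at most max_bases bases each."""
    chunk, bases = [], 0
    for record in records:
        if chunk and bases + len(record[1]) > max_bases:
            yield chunk
            chunk, bases = [], 0
        chunk.append(record)
        bases += len(record[1])
    if chunk:
        yield chunk
```

For a real lane, max_bases would be around 1,000,000,000 to match the 1Gbp chunks on the slide.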
➢ NGS Data Formats
➢ Sequence Alignment
➢ QC from alignments
➢ NGS Data Processing Workflows
➢ NGS Visualisation and Inspection
Data QC from Alignments
Several useful metrics to check to assess the quality of your data and
alignments produced
● Number of reads mapped, bases mapped, duplicate fragments, reads
w/adaptor, error rate, fragment size distribution, genotype check
Genotype check – is this the correct sample?
● Use an external set of genotypes for the sample to assess the likelihood
that the sample is the expected sample e.g. genotyping chip
Biases in sequencing
● GC vs. depth
● Indel ratio
● Read cycle vs. base content
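As a small example of one of these metrics, the GC content of a read can be computed as below and then compared against depth or plotted as a distribution. This is a minimal sketch; ignoring uncalled bases (N) is an assumption made here.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


print(gc_fraction("ACGTN"))
```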
Suggested Auto QC
GC of Reads
GC vs. Depth
Fragment Size
Fragment Size
Experiment: 100bp paired-end sequencing.
Can you spot any problems with this library fragment size for this experiment?
Indels per Cycle
➢ NGS Data Formats
➢ Sequence Alignment
➢ Data QC from alignments
➢ NGS Data Processing Workflows
➢ NGS Visualisation and Inspection
NGS Workflows
Next-gen sequencing experiments
● Several, tens or hundreds of samples
● One or more sequencing libraries per sample
● Sample could constitute several libraries
How the data is processed can have consequences on quality of variant calling
Alignment of the reads onto the reference is just the first step
● QC of data is very important for good calls
○ Biases in the library or sequence data will produce unexpected results
or missed variant calls
○ E.g. GC biases
● How the data is processed prior to variant calling is important
○ Certain computational steps should be carried out to improve the
quality of the data and alignments prior to calling
● Mapping -> improvement -> merging -> variant calling
Data Production Workflow
[Figure: per-lane Fastq files -> Alignment (bwa, smalt, bowtie etc) -> lane BAMs -> Improvement -> improved BAMs -> Library merge -> library BAMs -> Sample merge -> Sample/Platform BAMs (e.g. NA34842, NA87465) -> Freeze. The stages are grouped as "Import + Improvement" and "Merge Up".]
Data Production Workflow
[Figure: sample BAMs (NA19294, NA18943, NA19305, ..., NA19309) are merged across samples into cross-sample BAMs per chromosome (Chr1, Chr2, Chr3, ...), with reads labelled by read group (RG:NA19294, RG:NA18943, RG:NA19305, ...). Variant calling tools shown: Samtools, GATK, VQSR, BEAGLE, Impute2 (SNPs/indels); Genome STRiP, SVMerge; VEP Annotation; Final VCF.]
BAM Improvement
Lane level operation carried out after alignment
Input: BAM
Process 1: Local realignment
Process 2: Base quality recalibration
Output: (improved) BAM
Realignment
Short indels in the sample relative to the reference can pose difficulties for
alignment programs
Indels occurring near the ends of the reads are often not aligned correctly
● Aligners produce an excess of SNPs rather than introducing an indel into the alignment
Realignment algorithm
● Input set of known indel sites and a BAM file
● At each site, model the indel haplotype and the reference haplotype
● Given the information on a known indel
○ Which scenario are the reads more likely to be derived from?
● New BAM file produced with read cigar lines modified where indels have been
introduced by the realignment process
Software
● Implemented in GATK from Broad (IndelRealigner function)
What sites?
● Previously published indel sites, dbSNP, 1000 genomes, generate a rough/high
confidence indel set
Longer reads and better aligners (e.g. BWA-MEM) are reducing the need to carry out this step
Realignment
Base Quality Recalibration
Each base call has an associated base call quality
● What is the chance that the base call is incorrect?
○ Illumina evidence: intensity values + cycle
● Phred values (log scale)
○ Q10 = 1 in 10 chance of base call incorrect
○ Q20 = 1 in 100 chance of base call incorrect
● Accurate base qualities are an essential measure in variant calling
Rule of thumb: Anything less than Q20 is not useful data
Illumina sequencing
● Control lane or spiked control used to generate a quality calibration table
● If no control – then use pre-computed calibration tables
Quality recalibration
● 1000 genomes project sequencing carried out on multiple platforms at multiple
different sequencing centres
● Are the quality values comparable across centres/platforms given they have all been
calibrated using different methods?
Base Quality Recalibration
Original recalibration algorithm
● Align subsample of reads from a lane to human reference
● Exclude all known dbSNP+1000G pilot SNP sites
○ Assume all other mismatches are sequencing errors
● Compute a new calibration table based on mismatch rates per position on the
read
Pre-calibration sequence reports Q25 base calls
● After alignment - it may be that these bases actually mismatch the reference at a
1 in 100 rate, so are actually Q20
Recent improvements – GATK package
● Reported/original quality score
● The position within the read
● The preceding and current nucleotide (sequencing chemistry effect) observed by
the sequencing machine
● Probability of mismatching the reference genome
NOTE: requires a reference genome and a catalog of variable sites
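The core of the original recalibration idea can be sketched as follows: group bases by reported quality and read position, count mismatches to the reference at sites not known to be variable, and convert the observed mismatch rate back to a Phred score. This is a simplified illustration; the input format, the function names and the cap for bins with no mismatches are assumptions, and the GATK model uses more covariates.

```python
import math
from collections import defaultdict


def empirical_quality(mismatches, bases, cap=60):
    """Phred-scaled quality implied by an observed mismatch rate."""
    if mismatches == 0:
        return cap
    return min(cap, -10 * math.log10(mismatches / bases))


def recalibration_table(observations):
    """Build {(reported_quality, cycle): empirical quality}.

    observations: iterable of (reported_quality, cycle, is_mismatch) taken
    from aligned bases outside known variant sites.
    """
    counts = defaultdict(lambda: [0, 0])      # [mismatches, total bases]
    for reported_quality, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_quality, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    return {key: empirical_quality(mismatches, total)
            for key, (mismatches, total) in counts.items()}


# 100 bases reported as Q25 at cycle 1, one of which mismatches the reference
observations = [(25, 1, i == 0) for i in range(100)]
print(recalibration_table(observations))
```

This reproduces the slide's example: bases reported as Q25 that mismatch at a 1 in 100 rate are really about Q20.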
Base Quality Recalibration Effects
N.B. Always replot quality values when trying BQSR on a new set of samples or species
Data Production Workflow
[Figure: Fastq files -> Alignment (bwa, smalt etc) -> Lane/Plex BAMs -> Improvement -> improved BAMs -> Library merge -> library BAMs -> Sample merge -> Sample/Platform BAMs]
Library Merge
Library level operation carried out after BAM improvement
Input: Multiple Lane BAMs
Process 1: Merge BAMs (picard - MergeSamFiles)
Process 2: Duplicate fragment identification
Output: BAM
Library Duplicates
All second-gen sequencing platforms are NOT single molecule
sequencing
● PCR amplification step in library preparation
● Can result in duplicate DNA fragments in the final library prep.
● PCR-free protocols do exist – require larger volumes of input DNA
Generally low number of duplicates in good libraries (<5%)
● Align reads to the reference genome
● Identify read-pairs where the outer ends map to the same position on the
genome and remove all but 1 copy
○ Samtools: samtools rmdup or samtools rmdupse
○ Picard/GATK: MarkDuplicates
Can result in false SNP calls
● Duplicates manifest themselves as high read depth support
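The duplicate identification step can be sketched as grouping read pairs by the positions of their outer ends and keeping one representative per group. This is a minimal illustration: the Fragment record and its fields are hypothetical, and keeping the highest-quality copy is an assumption rather than a description of what samtools or Picard do internally.

```python
from collections import namedtuple

# Hypothetical record for one aligned read pair
Fragment = namedtuple("Fragment", "name chrom start end strand quality")


def find_duplicates(fragments):
    """Return names of fragments that duplicate a better copy at the same ends."""
    best = {}
    duplicates = set()
    for fragment in fragments:
        key = (fragment.chrom, fragment.start, fragment.end, fragment.strand)
        kept = best.get(key)
        if kept is None:
            best[key] = fragment
        elif fragment.quality > kept.quality:
            duplicates.add(kept.name)     # the new copy replaces the old one
            best[key] = fragment
        else:
            duplicates.add(fragment.name)
    return duplicates


fragments = [
    Fragment("pairA", "chr1", 100, 400, "+", 30),
    Fragment("pairB", "chr1", 100, 400, "+", 35),   # same outer ends as pairA
    Fragment("pairC", "chr1", 150, 450, "+", 30),
]
print(find_duplicates(fragments))
```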
Library Duplicates
Duplicates and False SNPs
Software Tools
Alignment
● BWA: http://bio-bwa.sourceforge.net/bwa.shtml
● Smalt: http://www.sanger.ac.uk/resources/software/smalt/
● Stampy: http://www.well.ox.ac.uk/project-stampy
BAM Improvement
● Realignment (GATK): http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels
● Recalibration: http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
Library Merging
● BAM Merging (Picard): http://picard.sourceforge.net/command-line-overview.shtml#MergeSamFiles
● Duplicate Marking/removal (Picard): http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
➢ NGS Data Formats
➢ Sequence Alignment
➢ QC from Alignments
➢ NGS Data Processing Workflows
➢ Lab Exercises
Lab Exercises
1. Align two lanes to produce BAM files with BWA
2. Generate some basic QC information from the alignments
3. Carry out the data processing workflow to make merged library
BAM files
4. Visualise the BAM files with IGV

More Related Content

What's hot

NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platformsAllSeq
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGScursoNGS
 
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issuesDongyan Zhao
 
LUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis PipelineLUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis PipelineHai-Wei Yen
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotLi Shen
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolHong ChangBum
 
RNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then someRNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then somebasepairtech
 
DEseq, voom and vst
DEseq, voom and vstDEseq, voom and vst
DEseq, voom and vstQiang Kou
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment DesignYaoyu Wang
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Prof. Wim Van Criekinge
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilChristian Frech
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packageLi Shen
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 

What's hot (20)

NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
 
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues
 
LUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis PipelineLUGM-Update of the Illumina Analysis Pipeline
LUGM-Update of the Illumina Analysis Pipeline
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo Protocol
 
RNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then someRNA-Seq Analysis: Everything You Always Wanted to Know...and then some
RNA-Seq Analysis: Everything You Always Wanted to Know...and then some
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
Macs course
Macs courseMacs course
Macs course
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
DEseq, voom and vst
DEseq, voom and vstDEseq, voom and vst
DEseq, voom and vst
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment Design
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and Anduril
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis package
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 

Viewers also liked

Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data Surya Saha
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsThomas Keane
 
File formats for Next Generation Sequencing
File formats for Next Generation SequencingFile formats for Next Generation Sequencing
File formats for Next Generation SequencingPierre Lindenbaum
 
The server of the Spanish Population Variability
The server of the Spanish Population VariabilityThe server of the Spanish Population Variability
The server of the Spanish Population VariabilityJoaquin Dopazo
 
Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods Mrinal Vashisth
 
Genomics in the Cloud
Genomics in the CloudGenomics in the Cloud
Genomics in the CloudMatt Wood
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryJan Aerts
 
Next generation sequencing in pharmacogenomics
Next generation sequencing in pharmacogenomicsNext generation sequencing in pharmacogenomics
Next generation sequencing in pharmacogenomicsDr. Gerry Higgins
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Sequence alignment, QC, and processing

  • 1. Lecture 1: Sequence alignment, data formats, QC, and data processing
    Thomas Keane
    Sequence Variation Infrastructure Group
    WTSI
    Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture1.pdf
  • 2. WTAC NGS Course, Hinxton 10th April 2014
    Some Background
    Established the Vertebrate Resequencing Informatics team in 2008
    ● Bioinformaticians and software developers
    ● PIs: David Adams and Richard Durbin
    ● April 2014: establishing the Sequence Variation Infrastructure group at WTSI
    Large scale NGS data processing
    ● 1000 genomes production and releases
    ● UK10K production group
    ● Exome and whole-genome sequencing
    Computational methods
    ● Samtools
      ○ Widely used software for NGS analysis
    ● VCF and VCF tools
      ○ Widely used format and suite of tools for NGS variation analysis
    ● Structural variation
      ○ SVMerge
        ■ Detect structural variants (SVs) by integrating calls from several existing SV callers
      ○ RetroSeq
        ■ Detecting non-reference transposable elements
    Comparative genomics
    ● Mouse genomes project – 17 mouse genomes deeply sequenced
    ● RNA-editing across mouse strains
    ● Transposable elements evolution and selection in mouse strains
    ● Human rare diseases
    ● Isolated human populations
    Sequence assembly
    ● De novo assembly and gene finding of 18 mouse strains
    Richard Durbin, David Adams, Zhicheng Liu
  • 3. Lecture 1: Sequence alignment, data formats, QC, and data processing
    ➢ NGS Data Formats
    ➢ Sequence Alignment
    ➢ QC from Alignments
    ➢ NGS Data Processing Workflows
    ➢ Lab Exercises
  • 4. Primary NGS Data Formats
    Fastq
    ● Unaligned read sequences with base qualities
    BAM
    ● Aligned or unaligned reads
    ● Text and binary formats
    CRAM
    ● Aligned or unaligned reads
    ● Advanced compression models
    VCF
    ● Flexible variant call format
    ● Arbitrary types of sequence variation
    ● SNPs, indels, structural variations
  • 5. FASTQ
    FASTQ is a simple format for raw unaligned sequencing reads
    ● Simple extension to the FASTA format
    ● Sequence and an associated per-base quality score
    Originally a standard for storing capillary sequencing data
    Format
    ● Qualities are encoded with a subset of the ASCII printable characters
    ● ASCII 33–126 inclusive with a simple offset mapping (quality = ASCII code - 33)
    ● perl -w -e "print ( unpack( 'C', '%' ) - 33 );"
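The offset mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the function names and the example record are invented, and it assumes the offset of 33 quoted on the slide and the standard Phred definition Q = -10 log10(P).

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    scores = []
    for ch in quality_string:
        code = ord(ch)
        # The slide restricts quality characters to ASCII 33-126 inclusive.
        if not 33 <= code <= 126:
            raise ValueError(f"character {ch!r} is outside ASCII 33-126")
        scores.append(code - offset)
    return scores


def error_probability(q):
    """Phred score -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record: @name, sequence, +, qualities."""
    header, seq, plus, qual = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], seq, phred_scores(qual)


# Hypothetical record, invented for illustration only.
name, seq, quals = parse_fastq_record(["@read1\n", "ACGT\n", "+\n", "!I5%\n"])
```

The character '%' used in the Perl one-liner has ASCII code 37, so `phred_scores("%")` gives a quality of 4, matching what the one-liner prints.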
  • 6. SAM/BAM
    SAM (Sequence Alignment/Map) format
    ● Single unified format for storing read alignments to a reference genome
    BAM (Binary Alignment/Map) format
    ● Binary equivalent of SAM
    ● Developed for fast processing/indexing
    Key features
    ● Can store alignments from most aligners
    ● Supports multiple sequencing technologies
    ● Supports indexing for quick retrieval/viewing
    ● Compact size (e.g. 112Gbp Illumina = 116Gbytes disk space)
    ● Reads can be grouped into logical groups e.g. lanes, libraries, samples
    ● Widely supported by variant calling software packages
    Replacement for SRF & fastq
  • 7. SAM/BAM
    No.  Name   Description
    1    QNAME  Query NAME of the read or the read pair
    2    FLAG   Bitwise FLAG (pairing, strand, mate strand, etc.)
    3    RNAME  Reference sequence NAME
    4    POS    1-Based leftmost POSition of clipped alignment
    5    MAPQ   MAPping Quality (Phred-scaled)
    6    CIGAR  Extended CIGAR string (operations: MIDNSHP)
    7    MRNM   Mate Reference NaMe ('=' if same as RNAME)
    8    MPOS   1-Based leftmost Mate POSition
    9    ISIZE  Inferred Insert SIZE
    10   SEQ    Query SEQuence on the same strand as the reference
    11   QUAL   Query QUALity (ASCII-33=Phred base quality)
    Heng Li et al (2009) The Sequence Alignment/Map format and SAMtools, Bioinformatics, 25:2078-2079
    HS18_07983:1:2203:5095:109107#36 163 ENA|AJ011856|AJ011856.1 412 60 100M = 471 159 ATAAAATTATTAATAATAATCAATATGAAATTAATAAAAATCTTATAAAAAAGTAATGAATACTCCTTTTTAAAAATAAAAAGGGGTTCGGTCCCCCCCC 9BCDGDEHGEHFHHGFHHJGHFHIGHFIGHFHGGGHGHGHGHJGHHGHHHGGHHHIGGGGGFGGDGGHFHFIGEGHGFGGHFEDGG4GHGGGFHGFHIEF X0:i:1 X1:i:0 MD:Z:100 RG:Z:1#36.1
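To make the eleven mandatory columns concrete, here is a small Python sketch that splits one SAM record into named fields and decodes the bitwise FLAG. It is an illustration only: the helper names are invented, it assumes tab-separated columns as in a real SAM file (the transcript shows them space-separated), and it keeps the older column names used on the slide (the current SAM specification calls columns 7 to 9 RNEXT, PNEXT and TLEN).

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}

# Bit meanings as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary alignment"),
]


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of the 11 mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM record needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = cols[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# The example record from the slide, re-joined with tabs.
example = "\t".join([
    "HS18_07983:1:2203:5095:109107#36", "163", "ENA|AJ011856|AJ011856.1",
    "412", "60", "100M", "=", "471", "159",
    "ATAAAATTATTAATAATAATCAATATGAAATTAATAAAAATCTTATAAAAAAGTAATGAATACTCCTTTTTAAAAATAAAAAGGGGTTCGGTCCCCCCCC",
    "9BCDGDEHGEHFHHGFHHJGHFHIGHFIGHFHGGGHGHGHGHJGHHGHHHGGHHHIGGGGGFGGDGGHFHFIGEGHGFGGHFEDGG4GHGGGFHGFHIEF",
    "X0:i:1", "X1:i:0", "MD:Z:100", "RG:Z:1#36.1",
])
record = parse_sam_line(example)
flags = decode_flag(record["FLAG"])
```

For the slide's record, FLAG 163 = 128 + 32 + 2 + 1, so `flags` lists a paired read, in a proper pair, whose mate is on the reverse strand, and which is the second read of its pair.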
  • 8. Cigar Format
    Cigar has traditionally been used as a compact way to represent a sequence alignment
    Operations include
    ● M - match or mismatch
    ● I - insertion
    ● D - deletion
    SAM extends these to include
    ● S - soft clip (ignore these bases)
    ● H - hard clip (ignore and remove these bases)
    E.g.
    Read:  ACGCA-TGCAGTtagacgt
    Ref:   ACTCAGTG--GT
    Cigar: 5M1D2M2I2M7S
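A CIGAR string can be unpacked with a short regular expression. The Python sketch below is illustrative (the function names are invented); it uses the operation letters listed for the SAM CIGAR field (MIDNSHP) and the usual convention that M, I and S consume read bases while M, D and N consume reference bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")
CONSUMES_READ = set("MIS")  # operations that use up bases of the read
CONSUMES_REF = set("MDN")   # operations that use up bases of the reference


def parse_cigar(cigar):
    """Turn a CIGAR string such as '5M1D2M' into [(5, 'M'), (1, 'D'), (2, 'M')]."""
    if not re.fullmatch(r"(\d+[MIDNSHP])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def cigar_lengths(cigar):
    """Return (read bases consumed, reference bases consumed) for a CIGAR string."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


# The worked example from the slide.
ops = parse_cigar("5M1D2M2I2M7S")
read_len, ref_len = cigar_lengths("5M1D2M2I2M7S")
```

For the slide's example this gives 18 read bases (the full read, including the 7 soft-clipped bases) spanning 10 reference bases.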
  • 9. What is the cigar line?
    E.g.
    Read:  tgtcgtcACGCATG---CAGTtagacgt
    Ref:          ACGCATGCGGCAGT
    Cigar:
  • 10. Read Group Tag
    Each lane has a unique RG tag that contains meta-data for the lane
    RG tags
    ● ID: SRR/ERR number
    ● PL: Sequencing platform
    ● PU: Run name
    ● LB: Library name
    ● PI: Insert fragment size
    ● SM: Individual
    ● CN: Sequencing center
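Read-group metadata lives in @RG lines of the SAM/BAM header, one tab-separated line per read group. The Python sketch below shows one way to collect those lines into dictionaries and count distinct libraries or centres. The helper names and the header text are invented for illustration; the header is not taken from a real 1000 Genomes file.

```python
def parse_read_groups(header_text):
    """Collect every @RG header line as a dict of TAG -> value."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = {}
        for item in line.split("\t")[1:]:
            # Split on the first colon only, since values may contain colons.
            tag, _, value = item.partition(":")
            fields[tag] = value
        groups.append(fields)
    return groups


def distinct_values(groups, tag):
    """Sorted list of the distinct values of one tag across all read groups."""
    return sorted({group[tag] for group in groups if tag in group})


# Hypothetical header text, e.g. as printed by `samtools view -H my.bam`.
header = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:ERR000001\tPL:ILLUMINA\tLB:libA\tSM:sample1\tCN:SC",
    "@RG\tID:ERR000002\tPL:ILLUMINA\tLB:libA\tSM:sample1\tCN:BI",
    "@RG\tID:ERR000003\tPL:ILLUMINA\tLB:libB\tSM:sample1\tCN:SC",
])
read_groups = parse_read_groups(header)
libraries = distinct_values(read_groups, "LB")
centres = distinct_values(read_groups, "CN")
```

With this invented header there are three read groups (lanes), two libraries and two sequencing centres.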
  • 11. 1000 Genomes BAM File
    Command: samtools view -h my.bam | less -S
  • 12. 1000 Genomes BAM File
    Command: samtools view -H my.bam | less -S
    ● How is the BAM file sorted?
    ● How many different sequencing centres contributed lanes to this BAM file?
    ● What is the alignment tool used to create this BAM file?
    ● How many different sequencing libraries are there in this BAM?
    Hint: RG tag
  • 13. SAM/BAM Tools
    Several tools and programming APIs for interacting with SAM/BAM files
    Samtools - Sanger/C (http://samtools.sourceforge.net)
    ● Convert SAM <-> BAM
    ● Sort and index BAM files
    ● Flagstat - summary of the mapping flags
    ● Merge multiple BAM files
    ● Rmdup - remove PCR duplicates from the library preparation
    Picard - Broad Institute/Java (http://picard.sourceforge.net)
    ● MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary, SamToFastq, MeanQualityByCycle, FixMateInformation...
    ● Bio-SamTools - Perl (http://search.cpan.org/~lds/Bio-SamTools/)
    ● Pysam - Python (http://code.google.com/p/pysam/)
    BAM Visualisation
    ● BamView, LookSeq, Gap5: http://www.sanger.ac.uk/Software
    ● IGV: http://www.broadinstitute.org/igv/
    ● Tablet: http://bioinf.scri.ac.uk/tablet/
  • 14. CRAM Format
    BAM files are too large
    ● ~1.5-2 bits per base pair
    Increases in disk capacity are being far outstripped by the output of sequencing technologies
    BAM stores all of the data
    ● Every read base
    ● Every base quality
    ● Using conventional compression techniques
    CRAM: Two important concepts
    ● Reference-based compression
    ● Controlled loss of quality information
    Widely seen as the sequencing format of the future
    ● Support for CRAM being actively added to Samtools and Picard
  • 15. Thomas Keane, WTSI 2th April 2014 Reference Based Compression
  • 16. Reference Based Compression
  • 17. CRAM: Reference-based sequence data compression
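The idea behind reference-based compression can be illustrated with a toy Python sketch: instead of storing every base of an aligned read, store where it aligns and only the bases that differ from the reference. This is not the CRAM encoding itself, which is far more elaborate (indels, clipping, quality scores, entropy coding); the function names and sequences are invented, and the sketch assumes ungapped alignments with 0-based positions.

```python
def encode_read(reference, pos, read):
    """Encode an ungapped read as (position, length, [(offset, base), ...]).

    Only bases that differ from the reference are stored explicitly.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs


def decode_read(reference, encoded):
    """Rebuild the original read from the reference and its encoded form."""
    pos, length, diffs = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Hypothetical reference and read with a single mismatch.
reference = "ACGTACGTACGT"
read = "GTACCT"
encoded = encode_read(reference, 2, read)
restored = decode_read(reference, encoded)
```

A read that matches the reference exactly reduces to just a position and a length, which is why this approach pays off when reads are very similar to the reference.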
  • 18. CRAM Support
    Currently
    ● CRAM Java toolkit (EBI)
    ● Scramble (WTSI)
    Coming soon
    ● Samtools (WTSI) upcoming release
    ● Picard/GATK (Broad) in development
    2014: WTSI aims to put CRAM into full production pipelines
  • 19. Lecture 1: Sequence alignment, data formats, QC, and data processing
    ➢ NGS Data Formats
    ➢ Sequence Alignment
    ➢ Data QC
    ➢ NGS Data Processing Workflows
    ➢ NGS Visualisation and Inspection
  • 20. Sequence Alignment
    Sequence alignment in NGS is
    ● The process of determining the most likely source within the reference genome sequence from which the observed DNA sequencing read is derived
    Principles and approaches to sequence alignment have not changed
    Basic Local Alignment Search Tool (BLAST)
    ● ‘Seed and extend’ approach
    ● Query sequences vs. larger database of sequences
    ● Split query sequences into short sequences (~10bp) and search for locations where these cluster in the larger database of sequences
    ● Nucleotide blast, protein blast, blastx, tblastn, tblastx, ...
    NGS: Nucleotide based alignment
    ● Very small evolutionary distances (human-human, strains of the reference genome)
    ● Allows for assumptions about the number of expected mismatches to speed up alignment programs
    NGS has just massively scaled up a challenge that has existed since the inception of bioinformatics
  • 21. Hash Table Alignment
    All hash table based algorithms essentially follow the same seed-and-extend paradigm
    A k-mer is a short fixed-length sequence of nucleotides
    Typical algorithm
    ● Build a profile (index) of all k-mers of a fixed length k and the locations in the reference genome where they occur
    ○ Several Gbytes in size for human genome
    ● For each sequence read
    ○ Split into k-mers of length k
    ○ Look up the locations in the reference via the index (seed phase)
    ○ Pick location on the genome with most k-mer hits
    ○ Perform Smith-Waterman alignment to fully align the read to the region
    ○ Output the alignment of each read onto the reference in BAM (or equivalent) format
    Hash the reads: MAQ, ELAND, ZOOM and SHRiMP
    ● Smaller but more variable memory requirements
    Hash the reference: SOAP, BFAST and MOSAIK
    ● Advantage: constant memory cost
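The seed phase of the typical algorithm above can be sketched in a few lines. This is an illustrative toy that hashes the reference, not the implementation used by any of the named aligners; the reference string, the read and k=4 are invented, and the extension step (Smith-Waterman) is left out.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_read(read, index, k):
    """Return (candidate_start, votes) for the location with most k-mer hits."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            votes[hit - offset] += 1      # reference start implied by this seed
    return votes.most_common(1)[0] if votes else None


REFERENCE = "TTGACCATGCAAGTCGGATCCTAGGCATTA"
index = build_index(REFERENCE, k=4)
# The read comes from position 10 and carries one mismatch at offset 6.
print(seed_read("AAGTCGAATCCT", index, k=4))
```

K-mers that overlap the mismatch find no hit, but the remaining seeds still agree on the same start position, which is then passed to the full alignment step.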
  • 22. Hash Table Alignment
    [Figure labels: Sequencing reads, Kmer hash, Reference Genome]
  • 23. Suffix/Prefix Tree Based Aligners
    Store all possible suffixes or prefixes to enable fast string matching
    ● A suffix trie, or simply a trie, is a data structure that stores all the suffixes of a string, enabling fast string matching
    ● An FM-index is a compact data structure based on the Burrows-Wheeler Transform (BWT) that supports the same kind of exact-match search
    FM-index based
    ● Small memory footprint
    Examples
    ● MUMmer, BWA, bowtie
    Still require a final step to generate local alignment
    Delcher et al (1999) NAR
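As a rough illustration of how a BWT-based index answers exact-match queries, the sketch below builds the BWT of a short string through a naively sorted suffix array and then runs the standard backward search. It is a teaching toy under simplifying assumptions (uncompressed tables, quadratic construction), not how BWA or bowtie are implemented; the function names are hypothetical.

```python
def bwt_index(text):
    """Build suffix array, first-column offsets and occurrence table for text."""
    text += "$"                                   # sentinel, sorts before A/C/G/T
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffixes)  # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):                     # first[c]: characters smaller than c
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] for ch in counts}              # occ[c][i]: count of c in bwt[:i]
    for b in bwt:
        for ch in occ:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return suffixes, first, occ


def find(pattern, index):
    """Backward search: return sorted start positions of exact matches."""
    suffixes, first, occ = index
    lo, hi = 0, len(suffixes)                     # half-open suffix array interval
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffixes[lo:hi])


index = bwt_index("GATTACAGATTACA")
print(find("ATTA", index))
```

Each pattern character narrows the interval in one step, so the search time depends on the read length rather than the genome length.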
  • 24. Smith-Waterman Algorithm
    Algorithm for generating the optimal local pairwise alignment between two sequences
    Time consuming to carry out for every read
    ● Only applied to a small subset of the reads that don’t have an exact match
    ● Important for correctly aligning reads with insertions/deletions
    Example scoring: Match: +1, Mismatch: 0, Gap open: -1
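A compact version of the algorithm, using the scoring values from the slide, might look like the following. One assumption to note: the slide only gives a gap-open score, so this sketch applies a simple linear penalty of -1 per gapped base rather than a separate open/extend scheme.

```python
def smith_waterman(a, b, match=1, mismatch=0, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best cell until the local alignment score drops to 0.
    i, j = best_cell
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Reference segment vs. a read that is missing one T (a 1bp deletion).
score, ref_aln, read_aln = smith_waterman("ACGTTGCA", "ACGTGCA")
print(score)
print(ref_aln)
print(read_aln)
```

The dynamic programming table has one cell per pair of bases, which is why the full algorithm is reserved for reads and candidate regions that need it.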
  • 25. Mapping Qualities
    What if there are several possible places in the genome to align your sequencing read?
    Genomes contain many different types of repeated sequences
    ● Transposable elements (40-50% of vertebrate genomes)
    ● Low complexity sequence
    ● Reference errors and gaps
    Mapping quality is a measure of how confident the aligner is that the read corresponds to this location in the reference genome
    ● Typically represented as a phred score (log scale)
    ● Q10 = 1 in 10 incorrect
    ● Q20 = 1 in 100 incorrect
    Paired-end sequencing is useful
    ● One end maps inside a repetitive element and one outside in unique sequence
    ● Then the combined mapping quality can still be high
    ● Hence always do paired-end sequencing!
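The phred scale above can be tied to repeats with a small sketch. It assumes a deliberately simplified model in which each candidate placement of a read has a relative likelihood, and the mapping quality is the phred-scaled probability that the best placement is wrong. Real aligners use more elaborate models; the cap of 60 is an arbitrary illustrative choice.

```python
import math


def mapping_quality(likelihoods, cap=60):
    """Phred-scaled probability that the best candidate placement is wrong.

    likelihoods: relative likelihoods of every candidate placement of a read
    (a simplified model for illustration, not any specific aligner's formula).
    """
    p_wrong = 1.0 - max(likelihoods) / sum(likelihoods)
    if p_wrong <= 0:
        return cap                      # unique placement: report the maximum
    return min(cap, round(-10 * math.log10(p_wrong)))


print(mapping_quality([1.0]))           # a single candidate location
print(mapping_quality([1.0, 1.0]))      # two equally good repeat copies
print(mapping_quality([0.99, 0.01]))    # one strong and one weak candidate
```

Two identical repeat copies leave a 1 in 2 chance of choosing the wrong one, which is why reads inside repeats receive mapping qualities near zero.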
  • 26. Mapping Qualities
  • 27. Alignment Limitations
    Read length and complexity of the genome
    ● Very short reads difficult to align confidently to the genome
    ● Low complexity genomes present difficulties
    ○ Malaria genome is 80% AT - lots of low complexity AT stretches
    Alignment around indels
    ● Next-gen alignments tend to accumulate false SNPs near true indel positions due to misalignment
    ● Smith-Waterman scoring schemes generally penalise a SNP less than a gap open
    ● New tools developed to do a second pass on a BAM and locally realign the reads around indels and ‘correct’ the read alignments
    High density SNP regions
    ● Seed and extend based aligners can have an upper limit on the number of consecutive SNPs in seed region of read (e.g. Maq - max of 2 mismatches in first 28bp of read)
    ● BWT based aligners work best at low divergence
  • 28. Read Length vs. Uniqueness
  • 29. Example Indel
  • 30. Scaling Up
    30-40Gbp per HiSeq lane
    ● Aligning a single lane of reads can take a long time on a single computer
    Parallel computing
    ● A form of computation in which many calculations are carried out simultaneously
    [Figure: Fastq reads (@read1, @read2) aligned to BAM]
  • 31. Scaling Up
    Two main approaches to speeding up read alignment
    ● Simple parallelism by splitting the data
    ○ Split lane into 1Gbp chunks and align independently on different processors
    ■ BWA ~8 hours per 1Gbp chunk
    ○ Merge chunk BAM files back into single lane BAM
    ■ ‘samtools merge’ command
    ● Utilise multiple processors on single computer
    ○ Modern computers have >1 processing core or CPU
    ○ Most aligners can use more than one processor on same computer
    ○ Much easier for user
    ■ Just supply the number of processors to use (e.g. BWA -t option)
    [Figure: Sequencing Lane (Fastq, 30-40Gbp) → Split (1Gbp) into Fastq split1-4 → Align to BAM1-4 → Merge]
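The splitting step of the first approach can be sketched as a generator that groups Fastq records into chunks of a bounded number of bases. It assumes four-line Fastq records (no wrapped sequence lines) and single-end input; for paired-end data both mate files would need to be split at the same record boundaries. The function name and the tiny `max_bases` value are illustrative only.

```python
def split_fastq(lines, max_bases):
    """Yield chunks (lists of 4-line records) holding at most max_bases bases each."""
    chunk, bases = [], 0
    # Group the input four lines at a time: header, sequence, '+', qualities.
    for record in zip(*[iter(lines)] * 4):
        seq_len = len(record[1].strip())
        if chunk and bases + seq_len > max_bases:
            yield chunk
            chunk, bases = [], 0
        chunk.append(record)
        bases += seq_len
    if chunk:
        yield chunk


# Five invented 10bp reads, split into chunks of at most 20 bases.
lines = []
for i in range(5):
    lines += ["@read%d" % i, "ACGTACGTAC", "+", "IIIIIIIIII"]
for n, chunk in enumerate(split_fastq(lines, max_bases=20), start=1):
    print("split%d: %d reads" % (n, len(chunk)))
```

Each chunk can then be aligned as an independent job and the resulting BAM files merged back into a single lane BAM.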
  • 32. Lecture 1: Sequence alignment, data formats, QC, and data processing
    ➢ NGS Data Formats
    ➢ Sequence Alignment
    ➢ QC from alignments
    ➢ NGS Data Processing Workflows
    ➢ NGS Visualisation and Inspection
  • 33. Data QC from Alignments
    Several useful metrics to check to assess the quality of your data and the alignments produced
    ● Number of reads mapped, bases mapped, duplicate fragments, reads w/adaptor, error rate, fragment size distribution, genotype check
    Genotype check – is this the correct sample?
    ● Use an external set of genotypes for the sample (e.g. from a genotyping chip) to assess the likelihood that the sample is the expected sample
    Biases in sequencing
    ● GC vs. depth
    ● Indel ratio
    ● Read cycle vs. base content
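Two of the bias checks listed above are simple to compute from read sequences alone. The sketch below shows a per-read GC fraction and a per-cycle base composition table; the function names and example reads are invented, and a production QC tool would work from the BAM file and plot the results.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of called bases (A/C/G/T) that are G or C; N calls are ignored."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def base_content_per_cycle(reads):
    """Return one Counter of base calls per sequencing cycle (read position)."""
    cycles = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(cycles):
                cycles.append(Counter())
            cycles[i][base] += 1
    return cycles


reads = ["ACGTTGCA", "ACGGTGCA", "TCGTNGCA"]
print([round(gc_fraction(r), 2) for r in reads])
for cycle, counts in enumerate(base_content_per_cycle(reads), start=1):
    print(cycle, dict(counts))
```

A cycle where one base dominates, or where N calls spike, points to a problem at that position in the run rather than in the sample.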
  • 34. Suggested Auto QC
  • 35. GC of Reads
  • 36. GC vs. Depth
  • 37. Fragment Size
  • 38. Fragment Size
    Experiment: 100bp paired-end sequencing. Can you spot any problems with this library fragment size for this experiment?
  • 39. Indels per Cycle
  • 40. Lecture 1: Sequence alignment, data formats, QC, and data processing
    ➢ NGS Data Formats
    ➢ Sequence Alignment
    ➢ Data QC from alignments
    ➢ NGS Data Processing Workflows
    ➢ NGS Visualisation and Inspection
  • 41. NGS Workflows
    Next-gen sequencing experiments
    ● Several, tens or hundreds of samples
    ● One or more sequencing libraries per sample
    ● Sample could constitute several libraries
    How the data is processed can affect the quality of variant calling
    Alignment of the reads onto the reference is just the first step
    ● QC of data is very important for good calls
    ○ Biases in the library or sequence data will produce unexpected results or missed variant calls
    ○ E.g. GC biases
    ● How the data is processed prior to variant calling is important
    ○ Certain computational steps should be carried out to improve the quality of the data and alignments prior to calling
    ● Mapping -> improvement -> merging -> variant calling
  • 42. Data Production Workflow
    [Figure: Import + Improvement: Fastq → Alignment (bwa, smalt, bowtie etc) → BAM → Improvement → BAM; Merge Up: Library merge → Library BAM → Sample merge → Sample/Platform BAM (NA34842, NA87465) → Freeze]
  • 43. Data Production Workflow
    [Figure: Sample BAMs (NA19294, NA18943, NA19305, ..., NA19309) → Merge across samples into cross-sample BAMs per chromosome (Chr1, Chr2, Chr3, ...) with read groups RG:NA19294, RG:NA18943, RG:NA19305, ... → Variant Calling (SNPs/indels: Samtools, GATK, VQSR, BEAGLE, Impute2; Genome STRiP, SVMerge) → VEP Annotation → Final VCF]
  • 44. BAM Improvement
    Lane level operation carried out after alignment
    Input: BAM
    Process 1: Local realignment
    Process 2: Base quality recalibration
    Output: (improved) BAM
  • 45. Realignment
    Short indels in the sample relative to the reference can pose difficulties for alignment programs
    Indels occurring near the ends of the reads are often not aligned correctly
    ● Aligners produce an excess of SNPs rather than introducing an indel into the alignment
    Realignment algorithm
    ● Input set of known indel sites and a BAM file
    ● At each site, model the indel haplotype and the reference haplotype
    ● Given the information on a known indel
    ○ Which scenario are the reads more likely to be derived from?
    ● New BAM file produced with read CIGAR strings modified where indels have been introduced by the realignment process
    Software
    ● Implemented in GATK from Broad (IndelRealigner function)
    What sites?
    ● Previously published indel sites, dbSNP, 1000 genomes, or generate a rough/high confidence indel set
    Longer reads and better aligners (e.g. BWA-MEM) are reducing the need to carry out this step
  • 46. Realignment
  • 47. Base Quality Recalibration
    Each base call has an associated base call quality
    ● What is the chance that the base call is incorrect?
    ○ Illumina evidence: intensity values + cycle
    ● Phred values (log scale)
    ○ Q10 = 1 in 10 chance of base call incorrect
    ○ Q20 = 1 in 100 chance of base call incorrect
    ● Accurate base qualities are essential for variant calling
    Rule of thumb: Anything less than Q20 is not useful data
    Illumina sequencing
    ● Control lane or spiked control used to generate a quality calibration table
    ● If no control – then use pre-computed calibration tables
    Quality recalibration
    ● 1000 genomes project sequencing carried out on multiple platforms at multiple different sequencing centres
    ● Are the quality values comparable across centres/platforms given they have all been calibrated using different methods?
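The phred values above are stored in Fastq files as one ASCII character per base. The sketch below assumes the Sanger (Phred+33) encoding, in which the quality is the character code minus 33; older Illumina pipelines used a different offset, so treat the default as an assumption. The helper names and the example string are invented.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a Fastq quality string to phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with phred quality q is incorrect."""
    return 10 ** (-q / 10)


def fraction_at_least(quality_string, threshold=20):
    """Fraction of bases at or above the rule-of-thumb quality threshold."""
    quals = decode_qualities(quality_string)
    return sum(q >= threshold for q in quals) / len(quals)


quals = decode_qualities("I5+!")
print(quals)
print([error_probability(q) for q in quals])
print(fraction_at_least("I5+!"))
```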
  • 48. Base Quality Recalibration
    Original recalibration algorithm
    ● Align subsample of reads from a lane to human reference
    ● Exclude all known dbSNP+1000G pilot SNP sites
    ○ Assume all other mismatches are sequencing errors
    ● Compute a new calibration table based on mismatch rates per position on the read
    Example: pre-calibration, the sequencer reports Q25 base calls
    ● After alignment, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20
    Recent improvements – GATK package
    ● Reported/original quality score
    ● The position within the read
    ● The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine
    ● Probability of mismatching the reference genome
    NOTE: requires a reference genome and a catalog of variable sites
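The original recalibration idea can be sketched as a table of empirical qualities keyed by reported quality and read cycle. This is an illustrative toy and not the GATK implementation: the input tuple layout is invented, and the +1/+2 smoothing of the mismatch rate is an assumption added here to avoid taking the log of zero.

```python
import math
from collections import defaultdict


def recalibration_table(observations, known_sites=frozenset()):
    """Map (reported_q, cycle) to an empirical phred quality.

    observations: iterable of (ref_pos, cycle, reported_q, is_mismatch),
    a hypothetical layout chosen for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])    # key -> [mismatches, bases]
    for ref_pos, cycle, reported_q, is_mismatch in observations:
        if ref_pos in known_sites:
            continue                        # may be a real variant, not an error
        cell = counts[(reported_q, cycle)]
        cell[0] += bool(is_mismatch)
        cell[1] += 1
    table = {}
    for key, (mismatches, bases) in counts.items():
        rate = (mismatches + 1) / (bases + 2)   # smoothed mismatch rate (assumption)
        table[key] = round(-10 * math.log10(rate))
    return table


# 998 bases reported as Q25 at cycle 1, of which 9 mismatch the reference,
# plus one mismatch at a known variant site that must be ignored.
observations = [(pos, 1, 25, pos < 9) for pos in range(998)]
observations.append((5000, 1, 25, True))
print(recalibration_table(observations, known_sites={5000}))
```

With these numbers the bases reported as Q25 mismatch at roughly 1 in 100, so the table assigns them Q20, mirroring the example on the slide.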
  • 49. Base Quality Recalibration Effects
    N.B. Always replot quality values when trying BQSR on a new set of samples or species
  • 50. Data Production Workflow
    [Figure: Fastq → Alignment (bwa, smalt etc) → BAM → Improvement → Lane/Plex BAM → Library merge → Library BAM → Sample merge → Sample/Platform BAM]
  • 51. Library Merge
    Library level operation carried out after BAM improvement
    Input: Multiple Lane BAMs
    Process 1: Merge BAMs (picard - MergeSamFiles)
    Process 2: Duplicate fragment identification
    Output: BAM
  • 52. Library Duplicates
    All second-gen sequencing platforms are NOT single molecule sequencing
    ● PCR amplification step in library preparation
    ● Can result in duplicate DNA fragments in the final library prep.
    ● PCR-free protocols do exist – require larger volumes of input DNA
    Generally low number of duplicates in good libraries (<5%)
    ● Align reads to the reference genome
    ● Identify read-pairs where the outer ends map to the same position on the genome and remove all but 1 copy
    ○ Samtools: samtools rmdup or samtools rmdupse
    ○ Picard/GATK: MarkDuplicates
    Duplicates can result in false SNP calls
    ● Duplicates manifest themselves as high read depth support
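The duplicate identification rule above (read-pairs whose outer ends map to the same position) can be sketched as a grouping problem. The tuple layout and function name are hypothetical, and keeping the pair with the highest summed base quality is an assumption made for this sketch; tools differ in how they choose the copy to keep and in how they treat strand and library.

```python
from collections import defaultdict


def find_duplicates(pairs):
    """Return the names of read-pairs flagged as duplicates.

    pairs: iterable of (name, chrom, outer_start, outer_end, quality_sum),
    a hypothetical layout chosen for this sketch.
    """
    groups = defaultdict(list)
    for pair in pairs:
        name, chrom, outer_start, outer_end, quality_sum = pair
        groups[(chrom, outer_start, outer_end)].append(pair)
    duplicates = set()
    for group in groups.values():
        # Keep the copy with the highest summed base quality (an assumption).
        keep = max(group, key=lambda pair: pair[4])
        duplicates.update(pair[0] for pair in group if pair is not keep)
    return duplicates


pairs = [
    ("pairA", "1", 1000, 1350, 7200),
    ("pairB", "1", 1000, 1350, 7500),   # same outer ends as pairA and pairC
    ("pairC", "1", 1000, 1350, 6900),
    ("pairD", "1", 1000, 1400, 7000),   # different outer end: not a duplicate
]
print(sorted(find_duplicates(pairs)))
```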
  • 53. Library Duplicates
  • 54. Duplicates and False SNPs
  • 55. Software Tools
    Alignment
    ● BWA: http://bio-bwa.sourceforge.net/bwa.shtml
    ● Smalt: http://www.sanger.ac.uk/resources/software/smalt/
    ● Stampy: http://www.well.ox.ac.uk/project-stampy
    BAM Improvement
    ● Realignment (GATK): http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels
    ● Recalibration: http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
    Library Merging
    ● BAM Merging (Picard): http://picard.sourceforge.net/command-line-overview.shtml#MergeSamFiles
    ● Duplicate Marking/removal (Picard): http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
  • 56. Lecture 1: Sequence alignment, data formats, QC, and data processing
    ➢ NGS Data Formats
    ➢ Sequence Alignment
    ➢ QC from Alignments
    ➢ NGS Data Processing Workflows
    ➢ Lab Exercises
  • 57. Lab Exercises
    1. Align two lanes to produce BAM files with BWA
    2. Generate some basic QC information from the alignments
    3. Carry out the data processing workflow to make merged library BAM files
    4. Visualise the BAM files with IGV