2. Content
• DNA sequencing
– Targets
• DNA, genes, exons, introns
• RNA
• How do we analyze NGS data?
– Genetic changes
– Mutational Signatures
– RNASeq
3. Sequencing DNA in the modern era
• DNA sequencing converts real-world DNA into digital DNA
• In the 1980s
– Sanger sequencing
– Compare short regions of DNA
• Possible by hand
• In the mid-2000s
– Parallelization of sequencing reactions
– Generates billions of DNA reads
• DNA read: short stretch of DNA
– Compare whole genomes
• Impossible by hand
CACGTCTAAGGGCGAAGAGCTGACTGCTTTTTT
4. Targeting parts of the genome
• Human genome has 3 billion bases
• Be cost-effective:
– Focus on the part of the genome related to your subject
5. What is a gene?
• Human genome: 3 billion bases
– 23 chromosomes
– Certain stretches of DNA code for proteins, which perform a wide variety of functions in your body (~20,000 in total)
9. NGS Principles - Coverage
Sequence the same part many times:
Coverage is the number of times a base is covered by a read
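As a minimal sketch of this definition, assuming reads are already aligned and described by hypothetical (start, end) positions on the reference (0-based, end exclusive), per-base coverage can be counted as follows. The read positions are invented for illustration.

```python
def per_base_coverage(reads, region_length):
    """Count how many reads cover each position of a region.

    reads: list of (start, end) tuples, 0-based, end exclusive (assumed format).
    """
    coverage = [0] * region_length
    for start, end in reads:
        # Clip each read to the region before counting
        for pos in range(max(start, 0), min(end, region_length)):
            coverage[pos] += 1
    return coverage


# Hypothetical example: three short reads over a 10-base region
reads = [(0, 5), (2, 8), (4, 10)]
coverage = per_base_coverage(reads, 10)
mean_coverage = sum(coverage) / len(coverage)
print(coverage)       # [1, 1, 2, 2, 3, 2, 2, 2, 1, 1]
print(mean_coverage)  # 1.7
```

Real pipelines compute this from alignment files rather than from hand-written tuples; the loop above only illustrates what the number means.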
10. NGS Principles - Coverage
• Not all reads retrieved are correct
– Many errors when sequencing
• DNA Library prep protocol
• Sequencing error rate
• Sequencing groups of cells
– Certain genetic changes are present in only a small fraction of cells
• Need to sequence the same part multiple times to gain confidence
– Amount depends on the analysis and expectation
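To illustrate why the required depth depends on what you expect to find, the sketch below uses a simple binomial model: each read independently carries the variant with probability equal to its allelic fraction. This model, the function name, and the example numbers (5% fraction, at least 3 supporting reads) are assumptions for illustration; sequencing errors are ignored.

```python
from math import comb


def prob_at_least_k_variant_reads(depth, allele_fraction, min_reads):
    """P(at least min_reads of depth reads carry the variant), binomial model."""
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# A variant present in a small fraction of cells needs deeper sequencing
for depth in (30, 100, 500):
    p = prob_at_least_k_variant_reads(depth, allele_fraction=0.05, min_reads=3)
    print(f"depth {depth:>3}x: P(>=3 variant reads) = {p:.2f}")
```

The probability of seeing the variant often enough rises with depth, which is the reason rare subclonal changes call for higher coverage than common germline variants.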
11. How to analyze the NGS data?
• Some might guess this is where the bioinformatician comes in…
Too late - the bioinformatician should have been helping you design the experiment
13. How to analyze NGS data?
• Tons of different options
– What is the research question?
• Common analysis: identify genetic changes in the tumor
15. Identify the genetic changes
• Compare against the reference human genome
– Gives both germline and somatic mutations
• How to differentiate?
– Databases of common germline variants miss many
• Somatic mutations
– Take DNA from normal cells and tumor cells
– Filter out mutations also found in the normal
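The tumor/normal filtering step can be sketched as a set difference. This is a simplified illustration, assuming each variant is a hypothetical (chromosome, position, ref, alt) tuple; the coordinates below are made up, and real callers compare read evidence rather than finished variant lists.

```python
def somatic_candidates(tumor_variants, normal_variants):
    """Keep tumor variants that are not also seen in the matched normal."""
    normal = set(normal_variants)
    return [v for v in tumor_variants if v not in normal]


# Hypothetical variants: (chromosome, position, reference base, alternate base)
tumor = [("chr1", 100, "C", "T"), ("chr2", 250, "G", "A"), ("chr7", 42, "T", "G")]
normal = [("chr1", 100, "C", "T")]  # also in the normal, so treated as germline

somatic = somatic_candidates(tumor, normal)
print(somatic)  # [('chr2', 250, 'G', 'A'), ('chr7', 42, 'T', 'G')]
```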
17. Identify mutations
• Automated pipelines do this
– Example: mutation calling tools take into account
• Number of reads with the mutation versus all reads (Mutation Allelic Fraction (MAF))
• Coverage at that position
• Read quality score
• If calling somatic mutations
– Presence of the mutation in the normal
• Every parameter makes assumptions about the data, so communicate the goal of the project
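The criteria above can be combined into a toy filter. This is a sketch under stated assumptions, not how any particular mutation caller works: the field names, the dictionary layout, and every threshold are hypothetical and would need to be chosen for the project at hand.

```python
def allelic_fraction(alt_reads, total_reads):
    """Reads carrying the mutation divided by all reads at the position."""
    return alt_reads / total_reads if total_reads else 0.0


def passes_filters(site, min_coverage=20, min_fraction=0.05,
                   min_quality=20, max_normal_fraction=0.01):
    """Apply illustrative thresholds (all assumed) to one candidate site."""
    tumor_af = allelic_fraction(site["tumor_alt"], site["tumor_depth"])
    normal_af = allelic_fraction(site["normal_alt"], site["normal_depth"])
    return (
        site["tumor_depth"] >= min_coverage        # coverage at that position
        and tumor_af >= min_fraction               # allelic fraction in the tumor
        and site["mean_quality"] >= min_quality    # read quality score
        and normal_af <= max_normal_fraction       # mutation absent in the normal
    )


# Hypothetical candidate sites
site_a = {"tumor_alt": 12, "tumor_depth": 60, "mean_quality": 35,
          "normal_alt": 0, "normal_depth": 50}
site_b = {"tumor_alt": 10, "tumor_depth": 40, "mean_quality": 35,
          "normal_alt": 18, "normal_depth": 40}

print(passes_filters(site_a))  # True: well supported, not seen in the normal
print(passes_filters(site_b))  # False: also present in the normal
```

Each default value encodes an assumption about the data. For example, a minimum fraction of 0.05 would discard mutations present in very small subclones, which is why the goal of the project has to be known before the thresholds are set.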
18. Categorize Mutations
• Silent/Nonsilent
– Does the mutation alter phenotype?
• In exonic region
– Synonymous: the amino acid code stays the same
– Nonsynonymous: changes the amino acid code of the protein
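The synonymous/nonsynonymous distinction can be checked by translating a codon before and after the change. The sketch below builds the standard genetic code and classifies a single-base change inside one codon; the function name and the example codons are chosen for illustration, and the code assumes the codon is given on the coding strand.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(codon, position, alt_base):
    """Classify a single-base change at position 0-2 of a codon."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return "synonymous" if before == after else "nonsynonymous"


print(classify_coding_snv("GGT", 2, "C"))  # synonymous: GGT and GGC both give G
print(classify_coding_snv("GAG", 1, "T"))  # nonsynonymous: E becomes V
```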
19. Categorize Mutations
• Oncogenesis
– Oncogenes (the gas)
• Cell growth
• Activation causes cancer
– Tumor Suppressor Genes (the brakes)
• DNA repair, slow down cell division
• Loss of function causes cancer
– Two Hit Hypothesis (Knudson 1971)
20. Mutational Signatures
• Find activated mutational processes
• Determined from the identified SNVs (single nucleotide variants)
– Use 1 base of context on both the 5’ and 3’ side
• .C. > .T.
• 6 base substitution classes
– C>A, C>G, C>T, T>A, T>C, T>G
• 4 possible bases on both sides
• Total: 6 * 4 * 4 = 96 possible substitution types
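The count of 96 can be reproduced by enumerating every combination. The sketch below also shows one way to assign an SNV to a class; it assumes the common convention of reporting a change at A or G on the opposite strand (reverse complement), which is what reduces 12 possible substitutions to the 6 listed above. The label format, such as A[C>T]G, and the function names are choices made for this illustration.

```python
from itertools import product

BASES = "ACGT"
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def all_contexts():
    """Enumerate every 5' base, substitution, 3' base combination."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def classify_snv(trinucleotide, alt):
    """Label an SNV given its reference trinucleotide (5' base, ref, 3' base)."""
    ref = trinucleotide[1]
    if ref in "AG":
        # Assumed convention: report purine changes on the opposite strand
        trinucleotide = trinucleotide.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
        ref = trinucleotide[1]
    return f"{trinucleotide[0]}[{ref}>{alt}]{trinucleotide[2]}"


contexts = all_contexts()
print(len(contexts))            # 96
print(classify_snv("ACG", "T"))  # A[C>T]G
print(classify_snv("CGT", "A"))  # A[C>T]G, the same change seen from the other strand
```

Counting how many SNVs of a sample fall into each of the 96 classes gives the profile from which signatures are extracted.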
21. Mutational Signatures
Alexandrov, L.B. et al. Nature 2013
Biological processes generating somatic mutations in cancer samples
Dataset: 4,938,362 mutations from 7,042 cancers
[Figure: three example mutational signatures]
• Aging signature: CG>TA transitions at NpCpG
• Defective DNA MMR signature: CG>TA transitions at NpCpG; CG>AT transversions at CpCpC
• POLE signature: C>A transversions at TpCpT; T>G at TpTpT
Figure annotations: I, indels; T, transcriptional strand bias