1. Next Generation Sequencing for
Model and Non-Model Organism
2nd day
Jun Sese and Kentaro Shimizu
sesejun@cs.titech.ac.jp
Ph.D course lecture @
Institute of Plant Biology, Univ. of Zurich
26/05/2011
2. Today’s Menu
• Lecture
• Current RNA-Seq analysis
• Genome and RNA Assembly
• Introduction to AWK
• First step of programming
• Exercise
• Visualization of mapped reads
• RNA-Seq analysis
• Genome assembly
3. Sequencerʼs Output
Genome Sequence
Mapping Program
Mapping Result
RNA-Seq
Visualization Further Analysis
4. RNA-Seq
• Which genes are highly expressed?
• Need to normalize by sequence length
• RPKM (Reads Per Kilobase per Million mapped reads)
[Mortazavi et al. Nature Methods. 2008]
• An initial gene expression counting method
Think about two genes expressed in a cell.
Suppose that one mRNA is expressed from each gene.
Reads counted: short gene 2, long gene 8
The longer gene yields more reads, even though both genes are expressed equally.
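The length normalization can be sketched as follows. This is a minimal illustration of the RPKM formula, not code from the lecture. The read counts (2 and 8) come from the slide; the gene lengths (500 bp and 2,000 bp) and the total of 10 mapped reads are assumed values chosen only for this example.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene length per Million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Read counts are from the slide; lengths and the total are hypothetical.
total_reads = 10
short_gene = rpkm(2, 500, total_reads)
long_gene = rpkm(8, 2_000, total_reads)

print("short gene RPKM:", short_gene)
print("long gene RPKM: ", long_gene)
```

With these assumed lengths the long gene is four times longer and receives four times as many reads, so both genes end up with the same RPKM.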
5. RNA-Seq (contd)
• Some corrections, including multiple-testing and
fragment-bias corrections, will be required.
• Srivastava and Chen. NAR. 2010
• Li, Jiang and Wong. Genome Research. 2010
• No standard method.
• After mapping reads, some tools are available to
count reads.
• Cufflinks
• HTSeq
• R packages
• DEGSeq [Wang et al. 2010]
• edgeR [Robinson, McCarthy and Smyth. 2010]
• DESeq [Anders and Huber. 2010]
6. Sequencerʼs Output Sequencer
Assemble
Genome Sequence
Mapping Program
Mapping Result
Visualization Further Analysis
7. Assembly
• Genome/Gene assembly is a kind of puzzle.
• Assemble a long sequence by combining short reads
Unordered short reads:
ATATGGATG CTAAGCAT TGCCATAT CGAGGCAT GATGCTAAG CATATGCGA GGCATGCC
Reads ordered by overlap:
GATGCTAAG
    CTAAGCAT
         CATATGCGA
               CGAGGCAT
                  GGCATGCC
                      TGCCATAT
                          ATATGGATG
Assembled sequence:
GATGCTAAGCATATGCGAGGCATGCCATATGGATG
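The last step of the puzzle, merging reads that have already been placed in order, can be sketched as below. This is a toy illustration only: it assumes the left-to-right order of the reads is already known, which is exactly the hard part that real assemblers have to solve. The function names are made up for this example.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge_in_order(reads):
    """Merge reads that are already sorted in layout order into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        contig += read[overlap(contig, read):]
    return contig


# The ordered reads from the slide.
ordered_reads = [
    "GATGCTAAG", "CTAAGCAT", "CATATGCGA", "CGAGGCAT",
    "GGCATGCC", "TGCCATAT", "ATATGGATG",
]
assembled = merge_in_order(ordered_reads)
print(assembled)
```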
8. Assembly programs also
depend on sequence length
• Sanger sequence
• Arachne
• Roche 454
• Mira3, Newbler
• Illumina/SOLiD sequencers
• Velvet, ABySS, SOAPdenovo,...
• Recently, gene (RNA) assembly programs have been
developed
• Oases http://www.ebi.ac.uk/~zerbino/oases/
• Trinity [Grabherr et al. Nature Biotech. 2011]
9. Overlap-Layout-Consensus
• Mainly used to assemble Sanger and Roche 454 sequences.
Kasahara and Morishita.
Large-scale genome sequence processing.
2006.
10. de Bruijn Graph approach
• Used in recent short read assemblers
• Velvet, ABySS,...
• Generate k-mer graph (de Bruijn graph), and then find minimum
paths covering all edges
• Originally introduced in Pevzner, Tang and Waterman, PNAS,
2001.
Miller, Koren and Sutton. Genomics. 2010.
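The k-mer graph construction can be sketched as follows. This is a minimal illustration, not how Velvet or ABySS are implemented: it only builds the graph (nodes are (k-1)-mers, each k-mer is an edge) and does not search for paths, correct errors, or handle reverse complements. The toy reads and the function name are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap by four bases.
toy_reads = ["ATGGC", "TGGCA"]
graph = de_bruijn_graph(toy_reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges AT, TG, GG, GC, CA in order spells ATGGCA, the sequence both toy reads come from.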
12. Genome assembly problem
has no correct answer.
• A true genome sequence exists, of course.
• In reality, we cannot know the whole genome sequence
exactly.
• In most genome assembly studies, some indices are
used to judge whether the assembly succeeded.
• Number of contigs
• Total length of contigs
• N50
• If you have EST sequences, you can use them to
check the assembly quality.
• Note: Do not use those ESTs for the genome assembly
itself, so that the training set and the test set stay
independent.
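The three indices can be computed as in the sketch below. N50 is taken here in its usual sense: the contig length L such that contigs of length L or longer contain at least half of the total assembled bases. The contig lengths are made-up numbers for illustration.

```python
def n50(contig_lengths):
    """Smallest length L such that contigs >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs


# Hypothetical contig lengths (bp).
lengths = [80, 70, 50, 40, 30, 20]
print("Number of contigs:", len(lengths))
print("Total length     :", sum(lengths))
print("N50              :", n50(lengths))
```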
13. Assembled sequences vary
between assemblers
• Compare 5 assemblers for RNA assembly with
Roche 454 reads
• Kumar and Blaxter. BMC Genomics. 2010.
• Compare Newbler, SeqMan, CLC (Commercial),
CAP3 and MIRA3 (Free)
• No winner
• Newbler 2.5 generates the longest contigs
• SeqMan is the best at recapturing known genes
• MIRA3 is competitive with Newbler and SeqMan
14. Assembled sequences vary
between assemblers (contd)
• Compare 6 assemblers for genome assembly
• Bao et al. J. Hum Gen. 2010.
• Used 1.5 million reads of human genome resequencing
data. Read length is 76 bp.
• The authors conclude that SOAPdenovo was the best.
• High genome coverage, low memory and fast.
• SSAKE and ABySS generated much longer contigs than
SOAPdenovo.
• Because the number of reads is so small, this comparison is
not practical.
• They subsampled the reads because their machine had only
32GB of memory.
• Genome assembly requires tuning various parameters to get
a “good” result. The authors did not mention
parameter tuning.
15. Change File Format
[Workflow diagram: Sequencer’s output (sequence format) and genome sequence → mapping program (BWA, Bowtie, etc.) → mapping result (output format) → visualization and further analysis. We have to change the file format.]
16. Introduction to AWK
• “grep” is a very useful command, but we may need more
complicated searches.
• e.g., select lines whose third column is “Chr1.”
• ‘grep “Chr1” file’ also selects lines that contain
“Chr1” only in the first column.
• e.g., select lines whose values are less than 100.
• grep cannot compare numerical values.
• Replace a word with another word in a file.
• Editors can do that if the file is small.
• AWK is one of the traditional and simple solutions.
• For more complicated tasks, scripting languages like Perl,
Python and Ruby are useful.
• Here we introduce the “minimum” requirements of AWK.
• You can find many introductory documents about AWK on
the Web.
17. AWK in a nutshell
• Process each line
• $n means n-th column.
• $1 is first column and $2 is second column.
• $0 means whole line
$ cat nums.tab
11.2 13.8
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1
# Print the second column; same as “cut -f2 nums.tab”
$ awk '{print $2}' nums.tab
13.8
7.7
7.0
10.9
9.1
# Print only lines whose second column equals “10.9”
# Compare with ‘grep “10.9” nums.tab’
$ awk '{if($2 == "10.9") print $0}' nums.tab
9.4 10.9
# Compare as numerical values
$ awk '{if($2 > 10) print $0}' nums.tab
11.2 13.8
9.4 10.9
# Combine two conditions with “&&”
$ awk '{if($2 > 10 && $2 < 12) print $0}' nums.tab
9.4 10.9
18. AWK in a nutshell (2)
$ cat nums.tab
11.2 13.8
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1
# Print lines whose second column contains “9”
$ awk '{if($2 ~ /9/) print $0}' nums.tab
9.4 10.9
8.8 9.1
# Print lines whose first column starts with “9”
$ awk '{if($1 ~ /^9/) print $0}' nums.tab
9.4 10.9
# Replace special string
$ awk '{gsub(/10/,"15"); print $0}' nums.tab
11.2 13.8
15.9 7.7
15.2 7.0
9.4 15.9
8.8 9.1
“ ” encloses a plain string, and / / encloses a regular expression.
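For comparison with a scripting language, the same three operations can be sketched in Python. This is only an illustration and assumes nums.tab is tab-separated (as the “cut -f2” example suggests); the file contents are embedded as a string so the example runs on its own.

```python
import re

# Contents of nums.tab from the slide, assumed to be tab-separated.
NUMS_TAB = "11.2\t13.8\n10.9\t7.7\n15.2\t7.0\n9.4\t10.9\n8.8\t9.1\n"
lines = NUMS_TAB.splitlines()
rows = [line.split("\t") for line in lines]

# awk '{if($2 > 10) print $0}'  : numerical comparison on column 2
greater_than_10 = [row for row in rows if float(row[1]) > 10]

# awk '{if($1 ~ /^9/) print $0}'  : regular expression on column 1
starts_with_9 = [row for row in rows if re.search(r"^9", row[0])]

# awk '{gsub(/10/,"15"); print $0}'  : replace every "10" in the line
replaced = [re.sub("10", "15", line) for line in lines]

print(greater_than_10)
print(starts_with_9)
print(replaced)
```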
19. Sequencerʼs Output
Genome Sequence
Mapping Program
Mapping Result
RNA-Seq
Visualization Further Analysis
20. Convert SAM to BAM
• SAM files are very large.
• We convert the SAM file into a BAM file, which is
a computer-friendly (binary) format.
• Install SAMtools
• http://samtools.sourceforge.net/
# $ curl -O http://switch.dl.sourceforge.net/project/samtools/samtools/0.1.16/samtools-0.1.16.tar.bz2
# $ bzip2 -dc samtools-0.1.16.tar.bz2 | tar xvf -
# $ ln -s samtools-0.1.16 samtools
# $ cd samtools
# $ make
# $ cd ..
$ ./samtools/samtools faidx TAIR10_chr_all.fas
# Generates TAIR10_chr_all.fas.fai
# “\” indicates that the command continues on the next line.
# You do not need to type the “\”.
$ ./samtools/samtools view -bt TAIR10_chr_all.fas.fai \
-o tha_reads.bam tha_reads.sam
# Generates tha_reads.bam
$ ./samtools/samtools sort tha_reads.bam tha_reads.sorted
# Sorts reads and generates tha_reads.sorted.bam
$ ./samtools/samtools index tha_reads.sorted.bam tha_reads.sorted.bai
# Writes the index of the BAM file to tha_reads.sorted.bai
21. Visualize mapped result (IGV)
1. Install IGV
$ unzip IGV_1.5.64.zip #install
2. Start IGV
$ java -Xmx1g -jar IGV_1.5.64/igv.jar #start IGV
# Wait a minute. A new window will appear.
3. Select A.thaliana (TAIR10)
4. File > Load from File > Select “tha_reads.sorted.bam”
5. Zoom in, zoom in... but it is difficult to find mapped reads :(
23. Mapped reads on Chr1
• Use SRR038985_chr1.sam
• Include all reads mapped onto Chromosome 1
• Convert the SAM file into BAM, and load it in IGV
# We can skip this step
# $ ./samtools/samtools faidx TAIR10_chr_all.fas
$ ./samtools/samtools view -bt TAIR10_chr_all.fas.fai \
-o SRR038985_chr1.bam SRR038985_chr1.sam
$ ./samtools/samtools sort SRR038985_chr1.bam SRR038985_chr1.sorted
$ ./samtools/samtools index \
SRR038985_chr1.sorted.bam SRR038985_chr1.sorted.bai
24. Visualize mapped result (Ensembl)
• Install BEDTools
• http://code.google.com/p/bedtools/
• Using bamToBed from BEDTools, you can convert
BAM format into BED format.
• BED format can describe simple track information.
The Ensembl and UCSC genome browsers can read this
file and display its contents.
# Skip install process
# $ curl -O http://bedtools.googlecode.com/files/BEDTools.v2.12.0.tar.gz
# $ tar zxvf BEDTools.v2.12.0.tar.gz
# $ ln -s BEDTools-Version-2.12.0 BEDTools
# $ cd BEDTools-Version-2.12.0
# $ make
# $ cd ..
$ ./BEDTools/bin/bamToBed -i SRR038985_chr1.sorted.bam \
> SRR038985_chr1.sorted.bed
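To see what the conversion involves, the sketch below turns one SAM alignment line into a six-column BED line. This is not bamToBed itself, only an illustration under stated assumptions: the BED score column is filled with the mapping quality, and the function name and the example read are made up. The key points are that SAM positions are 1-based while BED starts are 0-based, and that the end coordinate comes from the CIGAR operations that consume reference bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def sam_line_to_bed(line):
    """Convert one SAM alignment line to a BED6 line; return None if unmapped."""
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 0x4 or cigar == "*":
        return None  # unmapped read
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1  # SAM is 1-based, BED is 0-based and half-open
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([chrom, str(start), str(start + ref_len), name, mapq, strand])


# Hypothetical read mapped to the reverse strand with a 2-base deletion.
example = "read1\t16\tChr1\t101\t37\t30M2D20M\t*\t0\t0\tACGT\tIIII"
print(sam_line_to_bed(example))
```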
25. Visualize mapped result (Ensembl)
• Go to http://plants.ensembl.org in
your browser
• Select Arabidopsis thaliana
• Click “Manage your data” in the left
column
• Select “Upload Data” in the left column
• Name for this upload: my_reads
• Data format: BED
• Upload file: select your BED file
• DON’T click Upload yet!!!
26. Problems...
• Two problems
• The BED file is too large to upload. The maximum file size we can
upload to Ensembl Plants is 5MB.
• We have to select a region in the BED file.
• Chromosome names are different
• In our BED file, the chromosome name is like “Chr1,” while in
Ensembl, the name is just “1.”
• We have to convert the names.
• Finally, we can upload the BED file!
• It takes about a minute. Don’t click the “Upload” button
repeatedly.
$ awk '{if($3 < 1000000) print $0}' SRR038985_chr1.sorted.bed \
> SRR038985_chr1_to_1M.sorted.bed
# You can change the region by replacing “$3 < 1000000” with, e.g.,
# “$3 < 100000 && $3 > 50000”
$ awk '{gsub(/^Chr/,""); print $0}' SRR038985_chr1_to_1M.sorted.bed \
> SRR038985_chr1_to_1M.ensembl.bed
$ ls -lh SRR038985_chr1_to_1M.ensembl.bed
# Check that the file size is less than 5MB
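The two fixes (region selection and chromosome renaming) can also be written as one small function. This is a sketch equivalent in spirit to the awk commands, not a replacement for them; the function name and the two example BED lines are made up for illustration.

```python
def trim_bed_for_ensembl(lines, max_end=1_000_000):
    """Keep BED lines whose end (column 3) is below max_end and drop the 'Chr' prefix."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if int(fields[2]) < max_end:
            if fields[0].startswith("Chr"):
                fields[0] = fields[0][3:]  # "Chr1" -> "1"
            kept.append("\t".join(fields))
    return kept


# Hypothetical BED lines: one inside the first 1 Mb, one outside.
bed_lines = [
    "Chr1\t100\t150\tread1\t37\t+",
    "Chr1\t1000000\t1000050\tread2\t37\t-",
]
trimmed = trim_bed_for_ensembl(bed_lines)
print("\n".join(trimmed))
```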
27. Visualize mapped result (Ensembl)
• Click link “1:0-100000”
• You can see your reads on “my_reads” track.
• Only you can see your track
• You have to upload the BED file again after you log out of your
computer.
29. Count tags on each gene
• Most RNA-Seq tools depend on some libraries.
• We have to install several programs to use them.
• Some of them require administrator privileges.
• So we provide a simple Python script that counts the number of tags.
# We skip downloading the GFF file.
# The GFF file contains gene positions on chromosomes.
# $ curl -O ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
$ python count_reads_on_gene.py SRR038985_chr1.sam \
TAIR10_GFF3_genes.gff > SRR038985_chr1.exp
SRR038985_chr1.exp
Gene Name Count
...
AT1G01046 0
AT1G01050 1
AT1G01060 0
...
# Sort by count in reverse order
$ sort -k2 -nr SRR038985_chr1.exp
AT1G18745 59
AT1G16635 47
AT1G21650 27
AT1G75163 16
...
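The idea behind such a counting script can be sketched as follows. count_reads_on_gene.py itself is not shown in the slides, so this is not that script; it is a toy version that assumes a read belongs to a gene when its start position falls inside the gene interval, and it compares every read with every gene, which is far too slow for real data. The gene IDs and reads below are made up.

```python
def load_genes(gff_lines):
    """Collect (chromosome, start, end, gene ID) from 'gene' rows of a GFF3 file."""
    genes = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "gene":
            continue
        attrs = dict(kv.split("=", 1) for kv in f[8].split(";") if "=" in kv)
        genes.append((f[0], int(f[3]), int(f[4]), attrs.get("ID", "unknown")))
    return genes


def count_reads(sam_lines, genes):
    """Count mapped reads whose start position lies inside each gene."""
    counts = {gene_id: 0 for _, _, _, gene_id in genes}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header
        f = line.rstrip("\n").split("\t")
        if int(f[1]) & 0x4:
            continue  # unmapped read
        chrom, pos = f[2], int(f[3])
        for g_chrom, start, end, gene_id in genes:
            if chrom == g_chrom and start <= pos <= end:
                counts[gene_id] += 1
    return counts


# Hypothetical toy input.
gff = [
    "##gff-version 3",
    "Chr1\ttoy\tgene\t100\t200\t.\t+\t.\tID=geneA",
    "Chr1\ttoy\tmRNA\t100\t200\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "Chr1\ttoy\tgene\t300\t400\t.\t-\t.\tID=geneB",
]
sam = [
    "@SQ\tSN:Chr1\tLN:1000",
    "r1\t0\tChr1\t150\t37\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tChr1\t250\t37\t50M\t*\t0\t0\t*\t*",
    "r3\t16\tChr1\t320\t37\t50M\t*\t0\t0\t*\t*",
    "r4\t0\tChr1\t330\t37\t50M\t*\t0\t0\t*\t*",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
counts = count_reads(sam, load_genes(gff))
for gene_id, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(gene_id, n)
```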
30. A.lyrata reads and visualization
• The A.lyrata genome paper was published in April 2011.
• The genome sequence forms small contigs
• This state is similar to that just after sequence assembly
• We map reads onto A.lyrata and visualize the data in IGV.
• In Ensembl Plants, the A.lyrata genome is already available.
However, unpublished genome sequences are not
available on the site.
• This is a limitation of web applications (web sites).
• Here we use IGV again.
• IGV does not contain A.lyrata genome information.
• We start by importing the genome and gene
information.
32. Visualization of Mapped Result
• Load genome and genes. In IGV, File > Import Genome
• Name: A.lyrata (as you like!)
• Sequence File: Select your Alyrata_107_RM.fa
• Cytoband File: [empty]
• Gene File: Select your Alyrata_107_gene.gff3
• IGV needs a moment to check the file contents.
• Then, save.
• Select file to save genome information.
• Load read information. In IGV, File > Load from File.
• Select “lyr_reads.sorted.bam”
34. Assemble reads with velvet
• This is a toy example. We just check the usage.
• Genome/Gene assembly requires huge main memory.
• Velvet requires “AT LEAST” 12GB.
• Requires two steps: velveth and velvetg
• For SOLiD reads, use velveth_de and velvetg_de
• Options are the same.
• Before running velvet, we have to change the format using ABI’s script
called denovo2.0 (SOLiD only)
• http://solidsoftwaretools.com/gf/project/denovo/frs/?action=FrsReleaseBrowse&frs_package_id=65
• After this process (if the reads come from a genome), you can run gene
prediction programs (Fgenesh, EuGene, GenomeThreader etc.).
• Modern assemblers use a de Bruijn graph (k-mer graph). Changing the
parameter k changes the assembly result drastically.
• We have to generate many assemblies with various
parameters to obtain the best one.
35. # Download and install velvet
# $ curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.1.03.tgz
# $ gzip -dc velvet_1.1.03.tgz | tar xvf -
# $ ln -s velvet_1.1.03 velvet
# $ cd velvet
# $ make color
# $ cd ..
# Download ABI’s scripts and extract them
$ gzip -dc denovo2.tgz | tar xvf -
# Preprocessing for velvet
$ perl ./denovo2/utils/solid_denovo_preprocessor_v1.2.pl --run_type \
fragment -output chr1_de --f3_file SRR038985_chr1.csfasta
# Run velvet
$ ./velvet/velveth_de assemble_chr1 17 -fasta -short \
chr1_de/doubleEncoded_input.de
$ ./velvet/velvetg_de assemble_chr1 -exp_cov auto
# assemble_chr1/contigs.fa contains generated contigs
# Show status
$ perl ./denovo2/utils/assembly_stats.pl assemble_chr1/contigs.fa
Sum contig length : 303616
Num contigs : 3796
Mean contig length : 79
Median contig length : 66
N50 value : 79
Max : 583
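Because the result depends strongly on k, the run above is usually repeated for several values. The sketch below only builds the command lines for such a sweep and prints them; it does not run velvet. The commands follow the form used on this slide, while the output directory names (assemble_chr1_k17, ...) and the function name are made up. Odd values of k are used, since velvet works with odd k-mer lengths.

```python
def velvet_commands(reads_file, k_values, prefix="assemble_chr1"):
    """Build velveth_de/velvetg_de command lines, one output directory per k."""
    commands = []
    for k in k_values:
        out_dir = f"{prefix}_k{k}"  # hypothetical naming scheme
        commands.append(
            f"./velvet/velveth_de {out_dir} {k} -fasta -short {reads_file}"
        )
        commands.append(f"./velvet/velvetg_de {out_dir} -exp_cov auto")
    return commands


commands = velvet_commands("chr1_de/doubleEncoded_input.de", [17, 19, 21])
for command in commands:
    print(command)
```

Running assembly_stats.pl on each resulting contigs.fa then lets you compare the number of contigs, the total length, and N50 across the values of k.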
36. For Roche 454 (or IonTorrent)
• Read length is longer than Illumina and SOLiD
• Traditional Sanger sequence analysis can be used
• Homology search with BLAST or BLAT
• Assembly with CAP3 or MIRA3
• Combining Roche 454 with Illumina/SOLiD will produce better
results.
• Recent assemblies of large genomes have used this
combination.
• One problem with BLAST/BLAT is that the
programs do not support modern file formats such as SAM/BAM.
• Some programs such as GMAP support the new formats.
• To solve the problem, we wrote a format-converting script and
use it.
37. Mapping 454 Reads
• We use EST sequences
• EST sequences contain poly-A tails and vector sequences.
• For short reads, we skipped this step because the
sequences are too short to check whether they are vector
sequences.
• Procedure
• Remove these sequences
• Use lucy
• Map trimmed sequences against the genome
• BLAST and BLAT
• Convert the result to SAM format
• Convert the SAM to BAM and check the result in a viewer.
38. # Download and install lucy from http://lucy.sourceforge.net/
# $ curl -O http://jaist.dl.sourceforge.net/project/lucy/lucy/lucy%201.20/lucy1.20.tar.gz
# $ gzip -dc lucy1.20.tar.gz | tar xvf -
# $ cd lucy-1.20p
# $ make; cd ..; ln -s lucy-1.20p lucy
# Download blat executable file (for Mac OS X) and set it up
# $ curl -O http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/blat/blat
# $ chmod 755 blat
# To trim vector sequences, run lucy
$ ./lucy/lucy -vector pDNR.vec pDNRsplice.spl -out \
roche454_test_trim.fasta roche454_test_trim.qual \
roche454_test.fna roche454_test.qual
# Run blat. (About 7 mins required)
# We may need to change the score matrix to get meaningful alignments
$ ./blat -t=dna -q=dna -tileSize=8 -out=blast TAIR10_chr_all.fas \
roche454_test_trim.fasta roche454_test_TAIR10.result
# Convert the result into a SAM file
# The -t option specifies the maximum E-value allowed in the SAM file.
$ ruby blastn2sam.rb -t 0.00001 -s roche454_test_TAIR10.result \
> roche454_test_TAIR10_e5.sam
# After this process, you can follow the same procedure as for short reads
# (convert SAM to BAM and visualize the data in IGV).
39. Concluding Remarks
• The analysis in this lecture is a first step into bioinformatics and
computer science.
• Software and methods for analyzing next generation
sequencer data are still at an early stage.
• Only mapping and assembly software is widely used.
Other processes are under development.
• To use NGS, we have to keep track of software updates
and unpublished information.
• Use mailing lists and Q&A sites.
• Most software in biology has a limited number of users.
• Think about Microsoft Word. Many users, but many...
• Much software has poor documentation.
• Bugs always exist.
• Good software is updated frequently to fix bugs and
catch up with new information.
• If no software exists for your experiment, a simple script may
help your analysis.