Next-Gen Sequencing:
4 years in the trenches


           C. Titus Brown
  Asst Prof, CSE and Microbiology;
         BEACON NSF STC
      Michigan State University
            ctb@msu.edu
These slides are available online.


                  “titus brown slideshare”

           You can also e-mail me: ctb@msu.edu

Also note that these are my opinions and observations, culled
from personal experience, online material, and reading. I’m
      happy to cite/explain further upon request, but:
                   Your Mileage May Vary
Things I won’t talk about
Don’t work on/with/have anything useful to say about:
  Exome sequencing
  Ancient DNA
  ChIP-seq (protein-DNA interactions)


Work on but you’re probably not interested in:
  Metagenomics (sequencing uncultured microbial communities)
  Bioinformatics data structures and algorithms
Overview
 Shotgun sequencing basics


 Things everyone wants to know: how much $$...


 Various current problems & challenges


 Technology, now and future


 Some papers and projects worth looking at; & our own
  experiences
Two specific concepts:
First, sequencing everything at random is very much easier
 than sequencing a specific gene region. (For example, it will
 soon be easier and cheaper to shotgun-sequence all of E. coli
 than it is to get a single good plasmid sequence.)
Second, if you are sequencing on a 2-D substrate (wells, or
 surfaces, or whatnot) then any increase in density (smaller
 wells, or better imaging) leads to a squared increase in the
 number of sequences.

     These two concepts underlie the recent stunning increases
                                       in sequencing capacity.
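
The second concept is just geometry, and a few lines of arithmetic make it concrete. This is a minimal sketch assuming features (wells or clusters) sit on a regular square grid; the function name, substrate size, and pitch values are illustrative and not taken from any instrument specification.

```python
def features_on_substrate(side_um: float, pitch_um: float) -> int:
    """Count features on a square substrate laid out as a regular grid.

    side_um  -- edge length of the substrate in micrometres (assumed)
    pitch_um -- centre-to-centre spacing of features (assumed)
    """
    per_side = int(side_um // pitch_um)
    return per_side * per_side


base = features_on_substrate(10_000, 2.0)   # 2 um spacing
dense = features_on_substrate(10_000, 1.0)  # halve the spacing
print(f"{base:,} -> {dense:,} features ({dense / base:.0f}x)")
```

Halving the spacing doubles the number of features along each edge, so the count per run goes up fourfold.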
What are current costs for
Illumina?
Approximate costs from MSU sequencing center, a few
  months ago, including labor:

RNAseq:
  $200 prep / sample
  Single-ended 1x50 -- $1100/lane -- 100-150 million reads
  Paired-end 2x100 -- $2500/lane -- 200-300 million reads (i.e. 100-150 million pairs)


Barcoding samples, etc, gets complicated.
Discuss biology, etc with a sequencing geek before going
  forward!
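
For rough budgeting, the numbers above can be turned into a cost per million reads and a per-project estimate. This is a back-of-the-envelope sketch only: it assumes samples can be pooled freely within lanes, takes the low end of the lane yield, and ignores everything that makes barcoding complicated. The function names are made up for illustration.

```python
import math

# Approximate figures from the slide above.
LANE_COST = {"1x50": 1100, "2x100": 2500}             # dollars per lane
LANE_READS = {"1x50": (100e6, 150e6),                 # reads per lane (low, high)
              "2x100": (200e6, 300e6)}
PREP_COST = 200                                       # dollars per sample


def cost_per_million_reads(run_type):
    """Return (best case, worst case) dollars per million reads."""
    low, high = LANE_READS[run_type]
    cost = LANE_COST[run_type]
    return cost / (high / 1e6), cost / (low / 1e6)


def project_cost(run_type, n_samples, reads_per_sample):
    """Estimate total cost, assuming the low end of lane yield."""
    low, _ = LANE_READS[run_type]
    lanes = math.ceil(n_samples * reads_per_sample / low)
    return lanes * LANE_COST[run_type] + n_samples * PREP_COST


best, worst = cost_per_million_reads("1x50")
print(f"1x50: ${best:.2f} to ${worst:.2f} per million reads")
print(f"6 samples at 50 million reads each: ${project_cost('1x50', 6, 50e6)}")
```

Under these assumptions, six samples at 50 million single-end reads each need three lanes plus six preps.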
What does this data really give
you??
 With RNAseq, you can do de novo (genome- and gene-annotation-
  independent) gene & isoform discovery and quantification; 50-
  100m reads/sample is probably “enough”
  (see: http://blog.fejes.ca/?p=607 for a good discussion)

 With genome resequencing, you can do variant
  analysis/discovery; I recommend 20x depth.

 De novo assembly of complex vertebrate genomes is not casual:
  Cheap short-read sequencing does not yet deliver good long-range
  contiguity; repeats, heterozygosity get in the way.
  Assembly & scaffolding process itself is still evolving.
Why so much data?
Why do we need 10-20x coverage (resequencing) or 50-
  100m reads (mRNAseq) with Illumina?

Two (linked) reasons:
  Shotgun sequencing is random
  Counting/sampling variation
1. Useful minimum coverage
depends on high average coverage
2. mRNAseq quantitation – must
overcome sampling variation
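
The first point can be illustrated with an idealized model. The sketch below assumes per-base depth follows a Poisson distribution with the average coverage as its mean; real data are less even than this (see the bias slides), so treat the numbers as optimistic. The function names and the minimum depth of 5 are illustrative choices.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for a Poisson random variable with the given mean."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i)
               for i in range(k + 1))


def fraction_at_depth(mean_coverage: float, min_depth: int) -> float:
    """Fraction of bases expected to reach at least min_depth reads."""
    if min_depth <= 0:
        return 1.0
    return 1.0 - poisson_cdf(min_depth - 1, mean_coverage)


for coverage in (5, 10, 20):
    frac = fraction_at_depth(coverage, min_depth=5)
    print(f"{coverage:>2}x average -> {frac:.1%} of bases at >= 5x")
```

Even in this best case, an average of 5x leaves a large share of the genome below 5x, which is why the average has to sit well above the minimum depth you actually need.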
Coverage conclusions
More coverage rarely hurts (you can always discard data, but
  it is harder/more $$ to get more data from an old sample)

Your desired coverage numbers should be driven by
  sensitivity considerations.
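
One way to turn "sensitivity" into a read count for mRNAseq is to ask how noisy the count for a rare transcript will be. The sketch below assumes pure Poisson counting noise and ignores transcript length, mapping ambiguity, and biological variation, so it gives a lower bound at best. The function names and example values are illustrative.

```python
import math


def expected_count(total_reads: float, fraction: float) -> float:
    """Expected reads from a transcript making up `fraction` of the library."""
    return total_reads * fraction


def counting_cv(count: float) -> float:
    """Coefficient of variation of a Poisson count (sd / mean)."""
    return 1.0 / math.sqrt(count)


def reads_needed(fraction: float, target_cv: float) -> int:
    """Total reads so that counting noise alone is at or below target_cv."""
    return math.ceil(1.0 / (fraction * target_cv**2))


rare = 1e-6  # a transcript at one part per million of the library (assumed)
for total in (10e6, 50e6, 100e6):
    n = expected_count(total, rare)
    print(f"{total / 1e6:.0f}M reads -> ~{n:.0f} reads, CV {counting_cv(n):.0%}")
print("reads for 10% CV:", reads_needed(rare, 0.10))
```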
Problems and challenges
Systematic bias in sequencing and software.


Genome assembly: scaffolding and sensitivity


Gene references


mRNAseq isoform construction
Resequencing: bias and error
         Calling SNPs by mapping --




                              U. Colorado
                              http://genomics-course.jasondk.org/?p=395
Both sequencing and bioinformatics
yield many low-frequency artifacts!
“Obvious” things like misalignments to paralogous/repeat
 sequences.
Indels are handled badly by current tools (up to 60% false
 positive rate?!)
Oxidation of DNA during library prep step (acoustic
 shearing) generated 8-oxoguanine “lesions” responsible for
 artifacts involving C>A/G>T triplets.

  => With any data set, especially big ones, there will be both
           random and systematic error and bias.
       http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/
Suggestion: Cortex variant caller




                  Iqbal et al., Nat Genet. 2012, pmid 22231483
Genome assembly: scaffolding &
sensitivity
Everyone wants two things from a genome assembly --

Long/correct scaffolds


               See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon




Complete genome content
Sequence data
[Figure: original DNA is broken into fragments; reads are the sequenced ends of those fragments.]



               http://www.cbcb.umd.edu/research/assembly_primer.shtml
               slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Contigs
Building contigs

                 ACGCGATTCAGGTTACCACG
                   GCGATTCAGGTTACCACGCG
                     GATTCAGGTTACCACGCGTA
                       TTCAGGTTACCACGCGTAGC
                         CAGGTTACCACGCGTAGCGC
  Aligned reads            GGTTACCACGCGTAGCGCAT
                             TTACCACGCGTAGCGCATTA
                               ACCACGCGTAGCGCATTACA
                                 CACGCGTAGCGCATTACACA
                                   CGCGTAGCGCATTACACAGA
                                     CGTAGCGCATTACACAGATT
                                       TAGCGCATTACACAGATTAG
Consensus contig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG




                    slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
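
The consensus step in the figure above can be reproduced with a per-column majority vote. This sketch takes the twelve reads from the slide and assumes their offsets are already known (each read starts two bases after the previous one); finding those offsets by overlap detection is the hard part of assembly and is not shown. The function name is illustrative.

```python
from collections import Counter

# (offset, read) pairs as laid out in the slide; offsets are assumed known.
ALIGNED_READS = [
    (0, "ACGCGATTCAGGTTACCACG"),
    (2, "GCGATTCAGGTTACCACGCG"),
    (4, "GATTCAGGTTACCACGCGTA"),
    (6, "TTCAGGTTACCACGCGTAGC"),
    (8, "CAGGTTACCACGCGTAGCGC"),
    (10, "GGTTACCACGCGTAGCGCAT"),
    (12, "TTACCACGCGTAGCGCATTA"),
    (14, "ACCACGCGTAGCGCATTACA"),
    (16, "CACGCGTAGCGCATTACACA"),
    (18, "CGCGTAGCGCATTACACAGA"),
    (20, "CGTAGCGCATTACACAGATT"),
    (22, "TAGCGCATTACACAGATTAG"),
]


def consensus(placed_reads):
    """Majority base at each column; 'N' where no read covers the column."""
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


print(consensus(ALIGNED_READS))
```

With error-free reads every column is unanimous; with real data the vote is what lets high coverage outweigh random sequencing errors.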
Scaffolds
    Ordered, oriented contigs
[Figure: contigs joined by mate pairs into a scaffold, with a gap size estimate for each gap between contigs.]




                     slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
                        http://dx.doi.org/10.6084/m9.figshare.100940
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt


Longer reads!
[Figure: long reads can span repeats (repeat copy 1, repeat copy 2) and heterozygous regions (polymorphic contigs 2 and 3 as alternatives between contig 1 and contig 4).]
Cod: PacBio results
         Mapping to the published genome
[Figure: three PacBio subreads (11.4 kbp, 10.6 kbp and 10.9 kbp) aligned to the published cod genome.]




          slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Sensitivity – does your genome
include everything?
Generally not!


For example, the chick genome is missing a substantial
  number of genes from microchromosomes:
  723 genes from HSA19q missing from chicken galGal4.
  ESTs and RNAseq transcripts for many or most.
Approach - Digital normalization
(a computational version of library normalization)




                                        Digital normalization “smooths
                                         out” coverage from different
                                          loci, and can “recover” low
                                        coverage regions for assembly.
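
The core idea of digital normalization fits in a few lines: stream through the reads, estimate each read's coverage from the median count of its k-mers among the reads kept so far, and discard the read if that estimate has already reached a cutoff. The sketch below is a toy version and not the khmer implementation: it uses an exact Python dict rather than a memory-efficient probabilistic counter, it ignores reverse complements, and the k and cutoff defaults are illustrative assumptions.

```python
from statistics import median


def kmers(seq: str, k: int):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k: int = 20, cutoff: int = 20):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = {}
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        estimated_coverage = median(counts.get(km, 0) for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept


# Toy data: one locus sampled 100 times, another only twice.
high = "ACGTACGGTCA"
low = "TTTTGGGGCCCC"
kept = digital_normalize([high] * 100 + [low] * 2, k=4, cutoff=5)
print(kept.count(high), "high-coverage reads kept;", kept.count(low), "low-coverage reads kept")
```

Reads from the heavily sampled locus are capped at the cutoff while every read from the thinly sampled locus is retained, which is the "smoothing" described above.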
Applying diginorm to increase
sensitivity
Reassembled chick genome from 70x Illumina ->
 normalized reads in ~24 hours.
Contig assembly contained partial or complete matches to
 70% of previously unmappable transcripts assembled from
 chick mRNAseq

Together with Wes Warren (WUSTL), Hans Cheng (USDA
  ADOL), and Jerry Dodgson (MSU), we are proposing to apply
  PacBio and normalization to improve the chick genome; this
  should be a generalizable approach.
Mapping => mRNAseq quantitation




          Reference transcriptome required.
Existing chick gene models lack exons,
isoforms



[Figure: our data compared with the existing gene models.]



 *This gene contains at least 4 isoforms.
                                            Likit Preeyanon
(Exon detection is pretty good.)




                            Likit Preeyanon
Gene Modeler Pipeline (“gimme”?)
Merge transcripts together based on transcript mapping to
 genome; can include existing gene predictions, iterate.
Construct gene models
Remove redundant sequences
Predict strands and ORFs




                                               Likit Preeyanon
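
To give a feel for the first step, here is a sketch of grouping transcript alignments into loci by genomic overlap. It is illustrative only: how the pipeline itself merges transcripts is not specified in these slides, and the tuple layout, function name, and coordinates below are assumptions. Strand and exon structure are ignored.

```python
def merge_into_loci(alignments):
    """Group transcript alignments that overlap on the same chromosome.

    alignments -- iterable of (chrom, start, end, transcript_id),
                  with half-open [start, end) coordinates (assumed layout)
    """
    loci = []
    for chrom, start, end, tid in sorted(alignments):
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and start < last["end"]:
            # Overlaps the current locus: extend it and record the transcript.
            last["end"] = max(last["end"], end)
            last["transcripts"].append(tid)
        else:
            loci.append({"chrom": chrom, "start": start, "end": end,
                         "transcripts": [tid]})
    return loci


example = [
    ("chr1", 100, 500, "t1"),
    ("chr1", 400, 900, "t2"),
    ("chr1", 1000, 1200, "t3"),
    ("chr2", 100, 200, "t4"),
]
for locus in merge_into_loci(example):
    print(locus)
```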
Some thoughts on bioinfo
Software is evolving very fast. Don’t worry about using the
  latest, but keep an eye on possible artifacts/problems with
  what you do use.

In NGS, online information (seqanswers, biostar, Twitter) is
  generally far more up to date than publications.
Technology – where next?
Most slides taken from Lex Nederbragt:

http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-
technologies-at-the-norwegian-sequencing-centre-and-beyond
High-throughput sequencing
              Phase 1: more is better
        Year  Instrument   Reads/run     Read length  Yield
        2005  GS20         200 000       100 bp       0.02 Gb/run
        2011  GS FLX+      1.2 million   750 bp       0.7 Gb/run
        2006  GA           28 million    25 bp        0.7 Gb/run
        2011  HiSeq 2000   3 billion     2x100 bp     600 Gb/run
         slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
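
The yield column is essentially reads times read length, which is easy to check. In this sketch the HiSeq 2000 entry is treated as 3 billion read pairs of 2x100 bp (200 bp per pair), which is my reading of the slide and the only way the quoted 600 Gb works out. The GS FLX+ product comes to 0.9 Gb against the quoted 0.7 Gb; a plausible explanation, though again an assumption, is that 750 bp is a typical upper length and the mean is shorter.

```python
# (instrument, year, reads per run, bases per read or pair, quoted Gb/run)
PLATFORMS = [
    ("GS20", 2005, 200_000, 100, 0.02),
    ("GS FLX+", 2011, 1_200_000, 750, 0.7),
    ("GA", 2006, 28_000_000, 25, 0.7),
    ("HiSeq 2000", 2011, 3_000_000_000, 200, 600),  # 2x100 bp per pair (assumed)
]


def yield_gb(reads: int, bases_per_read: int) -> float:
    """Nominal yield in gigabases: reads multiplied by read length."""
    return reads * bases_per_read / 1e9


for name, year, reads, length, quoted in PLATFORMS:
    print(f"{year} {name:<11} computed {yield_gb(reads, length):g} Gb, quoted {quoted} Gb")
```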
High-throughput sequencing
                                   Phase 2: smaller is better
          Full-size instrument            Benchtop instrument
          0.7 Gb/run, 700 bp reads        GS Junior from Roche/454: 0.04 Gb/run, 400 bp reads
          600 Gb/run, 2x100 bp reads      MiSeq from Illumina: 4.5 Gb/run, 2x150 bp reads
                                          PGM from Ion Torrent/Life Technologies: 0.01, 0.1 or 1 Gb/run, 100 or 200 bp reads
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt


   High-throughput sequencing
                   Why benchtop sequencing instruments?
   Affordable price per instrument
   Small projects
   Fast turn around time
   Diagnostics

Which instrument to choose?




        slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
High-throughput sequencing
                              Phase 3: single-molecule




C2 (current) chemistry:
Average read length 2500 bp
36 000 reads
90 Mb per ‘run’




                          slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
High-throughput sequencing
Real-time sequencing technology
[Figure: Pacific Biosciences' four-colour real-time sequencing method. Phospholinked hexaphosphate nucleotides (G, A, T, C) enter the limit of detection zone and give a fluorescence pulse, plotted as intensity against time. From Nature Reviews Genetics, Figure 4: Real-time sequencing.]
                          slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Need to combine Illumina + PacBio still.
                                           P_errorCorrection pipeline from

[Figure: inputs labelled 2.7x and 23x are combined to turn raw reads into error-corrected reads; 93% of reads recovered (alignments of at least 1 kb to the published cod assembly). Resources: 24 cpus, 4.5 days, 100 Gb RAM.]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
My perspective on tech:
Illumina HiSeq + benchtop sequencers (MiSeq) currently
  most reliable for data generation: data in hand, decent
  quality.

PacBio data is an excellent add-on for situations where long
  reads are needed (to bridge repeats or het regions).
Two final pieces of advice
Should you work with genome centers? Maybe.
  Genome centers are good at large, well funded projects.
  Their default pipelines are reliable but not always cutting edge.
  “Weird” problems (high heterozygosity, or complex repeats)
   may require more attention than they can give.
  They also have their own schedules and incentives.


Where should you go for contract sequencing?
  I get asked this a lot!
  My best recommendation is UC Davis.
  “Cheaper” is not always “better”; data quality can vary
    immensely.
Advertisement: next-gen sequence course
       http://bioinformatics.msu.edu/ngs-summer-course-2013

       June 10-June 20, Kellogg Biological Station; < $500
            Hands on exposure to data, analysis tools.
Acknowledgements
I showed work from Likit Preeyanon and Alexis Black
 Pyrkosz, in my lab
Hans Cheng is primary collaborator on chick work


USDA funded our technology development.


Lex Nederbragt for his slides :)

2013 pag-equine-workshop

  • 1. Next-Gen Sequencing: 4 years in the trenches C. Titus Brown Asst Prof, CSE and Microbiology; BEACON NSF STC Michigan State University ctb@msu.edu
  • 2. These slides are available online. “titus brown slideshare” You can also e-mail me: ctb@msu.edu Also note that these are my opinions and observations, culled from personal experience, online material, and reading. I’m happy to cite/explain further upon request, but: Your Mileage May Vary
  • 3. Things I won’t talk about Don’t work on/with/have anything useful to say about: Exome sequencing Ancient DNA ChIP-seq (protein-DNA interactions) Work on but you’re probably not interested in: Metagenomics (sequencing uncultured microbial communities) Bioinformatics data structures and algorithms
  • 4. Overview: Shotgun sequencing basics; things everyone wants to know: how much $$...; various current problems & challenges; technology, now and future; some papers and projects worth looking at, & our own experiences
  • 5.
  • 6.
  • 7. Two specific concepts: First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.) Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in linear density (smaller wells, or better imaging) leads to a squared increase in the number of sequences. These two concepts underlie the recent stunning increases in sequencing capacity.
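The "squared increase" on slide 7 is plain geometry, and a rough sketch makes it concrete. This assumes features sit on a square grid; the function name, the 1 mm^2 area, and the spacings are hypothetical numbers chosen for illustration, not values from the talk.

```python
def features_per_area(area_um2: float, pitch_um: float) -> float:
    """Number of sequencing features on a square grid with the given spacing."""
    return area_um2 / (pitch_um ** 2)


# Hypothetical 1 mm^2 imaged area, expressed in square micrometres.
area = 1_000_000.0

coarse = features_per_area(area, pitch_um=2.0)
fine = features_per_area(area, pitch_um=1.0)

# Halving the spacing between features quadruples the number of features.
print(f"coarse grid: {coarse:.0f} features")
print(f"fine grid:   {fine:.0f} features")
print(f"ratio:       {fine / coarse:.1f}x")
```

Real flow cells are not perfect grids, so treat this as the scaling argument only.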
  • 8.
  • 9.
  • 10.
  • 11.
  • 12. What are current costs for Illumina? Approximate costs from MSU sequencing center, a few months ago, including labor: RNAseq: $200 prep / sample; single-ended 1x50: $1100/lane, 100-150 million reads; paired-end 2x100: $2500/lane, 200-300 million reads (/ 2). Barcoding samples, etc, gets complicated. Discuss biology, etc with a sequencing geek before going forward!
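To compare the two lane types on slide 12, it can help to turn the quoted prices into cost per million reads and cost per sample. A minimal sketch follows; the prices and read counts are the ones on the slide, while the function names and the "4 barcoded samples per lane" figure are assumptions made up for the example.

```python
def cost_per_million_reads(lane_cost: float, reads_millions: float) -> float:
    """Lane price divided by lane yield, in dollars per million reads."""
    return lane_cost / reads_millions


def cost_per_sample(prep_cost: float, lane_cost: float, samples_per_lane: int) -> float:
    """Library prep plus an equal share of one lane (assumes even barcoding)."""
    return prep_cost + lane_cost / samples_per_lane


# Slide numbers: 1x50 at $1100/lane (100-150M reads), 2x100 at $2500/lane (200-300M reads).
for label, lane_cost, low, high in (("1x50", 1100, 100, 150), ("2x100", 2500, 200, 300)):
    best = cost_per_million_reads(lane_cost, high)
    worst = cost_per_million_reads(lane_cost, low)
    print(f"{label}: ${best:.2f} to ${worst:.2f} per million reads")

# Hypothetical design: 4 barcoded RNAseq samples sharing one 2x100 lane.
print(f"per sample: ${cost_per_sample(200, 2500, 4):.2f}")
```

Barcoded samples rarely split a lane perfectly evenly, which is part of why the slide says it gets complicated.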
  • 13. What does this data really give you?? With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100m reads/sample is probably “enough” (see http://blog.fejes.ca/?p=607 for a good discussion). With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth. De novo assembly of complex vertebrate genomes is not casual: cheap short-read sequencing does not yet deliver good long-range contiguity; repeats and heterozygosity get in the way. The assembly & scaffolding process itself is still evolving.
  • 14. Why so much data? Why do we need 10-20x coverage (resequencing) or 50-100m reads (mRNAseq) with Illumina? Two (linked) reasons: shotgun sequencing is random; counting/sampling variation.
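Average depth is just total sequenced bases divided by genome size, so the read count for a target depth is easy to estimate. A rough sketch, assuming a hypothetical 1 Gbp genome and 100 bp reads (neither number is from the slides, and the function name is made up):

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, target_depth: float) -> int:
    """Reads required so that reads * read_length / genome_size >= target_depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


genome = 1_000_000_000  # hypothetical 1 Gbp genome
for depth in (10, 20):
    n = reads_needed(genome, read_length_bp=100, target_depth=depth)
    print(f"{depth}x depth: about {n / 1e6:.0f} million 100 bp reads")
```

This is the average only; it ignores duplicates, unmappable reads, and uneven coverage, so real projects usually budget more.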
  • 15. 1. Useful minimum coverage depends on high average coverage
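One way to see why the minimum depends on a high average: if reads land at random, per-base depth is roughly Poisson-distributed around the mean. The sketch below uses that textbook approximation (an assumption, since real coverage is more uneven) to estimate what fraction of bases fall below a chosen depth. The function names and the "at least 5 reads" threshold are illustrative.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


def fraction_below(min_depth: int, mean_coverage: float) -> float:
    """Expected fraction of bases covered by fewer than min_depth reads."""
    if min_depth <= 0:
        return 0.0
    return poisson_cdf(min_depth - 1, mean_coverage)


# How much of the genome has fewer than 5 reads at different average depths?
for mean in (5, 10, 20):
    print(f"mean {mean}x: {fraction_below(5, mean):.5f} of bases below 5x")
```

Under this model the under-covered fraction shrinks very quickly as average depth rises, which is the point of the slide.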
  • 16. 2. mRNAseq quantitation – must overcome sampling variation
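The sampling-variation argument can be put in numbers with a simple counting model. This sketch assumes read counts for a transcript are Poisson-distributed, so the relative noise is 1/sqrt(expected count); the transcript abundance (one read in a million) and the function names are hypothetical.

```python
import math


def expected_count(total_reads: int, fraction: float) -> float:
    """Expected reads from a transcript that makes up `fraction` of the library."""
    return total_reads * fraction


def relative_error(count: float) -> float:
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(count)


rare = 1e-6  # hypothetical transcript: one read in a million
for total in (10_000_000, 50_000_000, 100_000_000):
    n = expected_count(total, rare)
    print(f"{total / 1e6:.0f}M reads: ~{n:.0f} reads, ~{relative_error(n):.0%} counting noise")
```

This covers counting noise only; biological and library-prep variation come on top of it.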
  • 17. Coverage conclusions More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample) Your desired coverage numbers should be driven by sensitivity considerations.
  • 18. Problems and challenges: systematic bias in sequencing and software; genome assembly: scaffolding and sensitivity; gene references; mRNAseq isoform construction.
  • 19. Resequencing: bias and error Calling SNPs by mapping -- U. Colorado http://genomics-course.jasondk.org/?p=395
  • 20. Both sequencing and bioinformatics yield many low-frequency artifacts! “Obvious” things like misalignments to paralogous/repeat sequences. Indels are handled badly by current tools (up to 60% false positive rate?!) Oxidation of DNA during library prep step (acoustic shearing) generated 8-oxoguanine “lesions” responsible for artifacts involving C>A/G>T triplets. => With any data set, especially big ones, there will be both random and systematic error and bias. http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/
  • 21. Suggestion: Cortex variant caller Iqbal et al., Nat Genet. 2012, pmid 22231483
  • 22. Genome assembly: scaffolding & sensitivity. Everyone wants two things from a genome assembly: long/correct scaffolds, and complete genome content. See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon
  • 23. Sequence data. [Figure labels: original DNA fragments; sequenced ends; reads.] http://www.cbcb.umd.edu/research/assembly_primer.shtml slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 24. Contigs: building contigs. Aligned reads: ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG Consensus contig: ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 25. Scaffolds: ordered, oriented contigs. [Figure labels: mate pairs; contigs; gap size estimate; scaffold; contig; gap.] slides from http://slideshare.net/flxlex/ ; Lex Nederbragt http://dx.doi.org/10.6084/m9.figshare.100940
  • 26. Longer reads! Long reads can span repeats and heterozygous regions. [Figure labels: repeat copy 1; repeat copy 2; contig 1; polymorphic contig 2; polymorphic contig 3; contig 4.] slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
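The reads on slide 24 tile the consensus with a 2-base shift, so a toy merge reproduces the contig. The sketch below joins each read to the growing contig at the longest suffix/prefix overlap. It assumes error-free reads that are already in left-to-right order, which real assemblers cannot assume, and the function names are made up for this example.

```python
READS = [
    "ACGCGATTCAGGTTACCACG", "GCGATTCAGGTTACCACGCG", "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC", "CAGGTTACCACGCGTAGCGC", "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA", "ACCACGCGTAGCGCATTACA", "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA", "CGTAGCGCATTACACAGATT", "TAGCGCATTACACAGATTAG",
]


def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(reads: list[str]) -> str:
    """Extend a contig by appending the non-overlapping tail of each read in turn."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read)
        contig += read[k:]
    return contig


contig = merge_reads(READS)
print(contig)
print(len(contig), "bp")
```

With sequencing errors, repeats, or unordered reads this greedy approach breaks down, which is where overlap graphs and de Bruijn graphs come in.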
  • 27. Cod: PacBio results Mapping to the published genome 11.4 kbp subread 10.6 kbp subread 10.9 kbp subread slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 28. Sensitivity – does your genome include everything? Generally not! For example, the chick genome is missing a substantial number of genes from microchromosomes: 723 genes from HSA19q are missing from chicken galGal4. ESTs and RNAseq transcripts exist for many or most of them.
  • 29. Approach - Digital normalization (a computational version of library normalization) Digital normalization “smooths out” coverage from different loci, and can “recover” low coverage regions for assembly.
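The idea behind digital normalization can be sketched in a few lines: go through reads one at a time, estimate each read's coverage from the median count of its k-mers among reads kept so far, and only keep the read if that estimate is below a cutoff. This is a simplified illustration, not the lab's implementation: it uses an exact in-memory `Counter` and toy values for `k` and `cutoff`, and the function names are assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads: list[str], k: int = 4, cutoff: int = 3) -> list[str]:
    """Keep a read only while its median k-mer count so far is below `cutoff`."""
    counts: Counter = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to counts
    return kept


# Toy data: one locus sequenced 10 times, another locus sequenced once.
reads = ["ACGTTGCAAG"] * 10 + ["GGATCCTAGA"]
kept = digital_normalize(reads)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The high-coverage locus is trimmed down to roughly the cutoff while the low-coverage read survives untouched, which is the "smoothing" the slide describes.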
  • 30. Applying diginorm to increase sensitivity Reassembled chick genome from 70x Illumina -> normalized reads in ~24 hours. Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), Jerry Dodgson (MSU) proposing to apply PacBio and normalization to improve chick genome; should be generalizable approach.
  • 31. Mapping => mRNAseq quantitation Reference transcriptome required.
  • 32. Existing chick gene models lack exons, isoforms Our data Models *This gene contains at least 4 isoforms. Likit Preeyanon
  • 33. (Exon detection is pretty good.) Likit Preeyanon
  • 34. Gene Modeler Pipeline (“gimme”?) Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, iterate. Construct gene models Remove redundant sequences Predict strands and ORFs Likit Preeyanon
  • 35. Some thoughts on bioinfo: Software is evolving very fast. Don’t worry about using the latest, but keep an eye on possible artifacts/problems with what you do use. In NGS, online information (seqanswers, biostar, Twitter) is generally far more up to date than publications.
  • 36. Technology – where next? Most slides taken from Lex Nederbragt: http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond
  • 37. High-throughput sequencing. Phase 1: more is better. 2005, GS20: 200 000 reads, 100 bp, 0.02 Gb/run; 2011, GS FLX+: 1.2 million reads, 750 bp, 0.7 Gb/run; 2006, GA: 28 million reads, 25 bp, 0.7 Gb/run; 2011, HiSeq 2000: 3 billion reads, 2x100 bp, 600 Gb/run. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 38. High-throughput sequencing. Phase 2: smaller is better. GS Junior from Roche/454: 0.04 Gb/run, 400 bp reads (compared with 0.7 Gb/run, 700 bp reads); MiSeq from Illumina: 4.5 Gb/run, 2x150 bp reads (compared with 600 Gb/run, 2x100 bp reads); PGM from Ion Torrent/Life Technologies: 0.01, 0.1 or 1 Gb/run, 100 or 200 bp reads. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 39. High-throughput sequencing. Why benchtop sequencing instruments? Diagnostics; affordable price per instrument; small projects; fast turnaround time. Image sources: http://pennystockalerts.com/ http://www.highqualitylinkbuildingservice.com/ http://www.vetlearn.com/ http://vanillajava.blogspot.com slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 40. Which instrument to choose? slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 41. High-throughput sequencing. Phase 3: single-molecule. C2 (current) chemistry: average read length 2500 bp; 36 000 reads; 90 Mb per ‘run’. slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 42. High-throughput sequencing: real-time sequencing technology. [Figure labels: phospholinked hexaphosphate nucleotides (G, A, T, C); limit of detection zone; fluorescence pulse; intensity vs. time.] Figure 4 | Real-time sequencing. Pacific Biosciences’ four-colour real-time sequencing method is shown. (Nature Reviews | Genetics) slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
  • 43. Need to combine Illumina + PacBio still. P_errorCorrection pipeline from [source not legible]. [Figure labels: raw reads; error-corrected reads; 2.7x; 23x; 93% of reads recovered; alignments of at least 1 kb to the published cod assembly; 24 CPUs, 4.5 days, 100 Gb RAM.] slides from http://slideshare.net/flxlex/ ; Lex
  • 44. My perspective on tech: Illumina HiSeq + benchtop sequencers (MiSeq) currently most reliable for data generation: data in hand, decent quality. PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).
  • 45. Two final pieces of advice Should you work with genome centers? Maybe. Genome centers are good at large, well funded projects. Their default pipelines are reliable but not always cutting edge. “Weird” problems (high heterozygosity, or complex repeats) may require more attention than they can give. They also have their own schedules and incentives. Where should you go for contract sequencing? I get asked this a lot! My best recommendation is UC Davis. “Cheaper” is not always “better”; data quality can vary immensely.
  • 46. Advertisement: next-gen sequence course http://bioinformatics.msu.edu/ngs-summer-course-2013 June 10-June 20, Kellogg Biological Station; < $500 Hands on exposure to data, analysis tools.
  • 47. Acknowledgements: I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab. Hans Cheng is primary collaborator on chick work. USDA funded our technology development. Lex Nederbragt for his slides :)