SlideShare uma empresa Scribd logo
1 de 30
Baixar para ler offline
Babelomics
NGS data Preprocessing
             Granada, June 2011


           Javier Santoyo-Lopez
                 jsantoyo@cipf.es
                http://bioinfo.cipf.es
               Genomics Department
   Centro de Investigacion Principe Felipe (CIPF)
                 (Valencia, Spain)
History of DNA Sequencing
                                      Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
                                      Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)



                              1870     Miescher: Discovers DNA

                              1940     Avery: Proposes DNA as ‘Genetic Material’
             Efficiency
                              1953     Watson & Crick: Double Helix Structure of DNA
          (bp/person/year)

            1                 1965     Holley: Sequences Yeast tRNAAla


            15
                              1970     Wu: Sequences λ Cohesive End DNA

            150

                                       Sanger: Dideoxy Chain Termination
                              1977     Gilbert: Chemical Degradation
            1,500

                              1980     Messing: M13 Cloning
            15,000


            25,000
                              1986     Hood et al.: Partial Automation


            50,000            1990      Cycle Sequencing
                                        Improved Sequencing Enzymes
            200,000                     Improved Fluorescent Detection Schemes



            50,000,000        2002

                                       Next Generation Sequencing
                                       Improved enzymes and chemistry
            100,000,000,000    2008    Improved image processing
Sequence Databases Trend
   EMBL database growth (March 2011)
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
NGS platforms comparison




06/14/11
                        Source http://www.clcngs.com/2008/12/ngs-platforms-overview/
Adapted from John McPherson, OICR

Next-gen sequencers
                                                        Adapted from John McPherson, OICR




                                                         AB SOLiDv3
                                                         200Gb, 50 bp reads
                         100 Gb
                                                         Illumina HiSeq
 bases per machine run



                                                         150Gb, 100bp reads
                          10 Gb



                                                               454 GS FLX Titanium
                           1 Gb
                                                             0.4-1 Gb, 100-500 bp reads


                         100 Mb




                          10 Mb                                     ABI capillary sequencer
                                                                    (0.04-0.08 Mb,
                                                                    450-800 bp reads
                          1 Mb




                                  10 bp     100 bp               1,000 bp

                                          read length
Many Gbs of Sequences and...
• Data management becomes a challenge.
  – Moving data across file systems takes time (several hundred Gbs)


• What structure has the data?
  – Different sequencers output different files, but
  – There are some data formats that are being accepted widely (e.g.
    FastQ format)


• Raw sequence data formats
  –   SFF
  –   Fasta, csfasta
  –   Qual file
  –   Fastq
Fasta & Fastq formats
• FastA format (everybody knows about it)
  – Header line starts with “>” followed by a sequence ID
  – Sequence (string of nt).


• FastQ format
  – First is the sequence (like Fasta but starting with “@”)
  – Then “+” and sequence ID (optional) and in the following line are
    QVs encoded as single byte ASCII codes
    • Different quality encode variants


• Nearly all downstream analysis take FastQ as input
  sequence
Sequence to Variation Workflow
                  Raw
                  Data
                            FastQ



       IGV
                                          BWA/
                                         BWASW

                                                        SAM



                                                              SAM
                                                              Tools
                                             Samtools
                BCF         BCF
                                             mPileup
       Filter   Tools Raw   Tools
 GFF                                Pileup              BAM
       VCF            VCF
Sequence to Variation Workflow
                  Raw                        FastX
                                                        FastQ
                  Data                       Tookit
                            FastQ


                                                               BWA/
       IGV                                                    BWASW
                                          BWA/
                                         BWASW

                                                        SAM



                                                              SAM
                                                              Tools
                                             Samtools
                BCF         BCF
                                             mPileup
       Filter   Tools Raw   Tools
 GFF                                Pileup              BAM
       VCF            VCF
Why Quality Control and
        Preprocessing?
   Sequencer output:
           Reads + quality
           Is the quality of my sequenced data OK?
Why Quality Control and
        Preprocessing?
   Sequencer output:
           Reads + quality
           Is the quality of my sequenced data OK?
                 If something is wrong can I fix it?
Why Quality Control and
        Preprocessing?
   Sequencer output:
           Reads + quality
           Is the quality of my sequenced data OK?
                  If something is wrong can I fix it?
   Problem:
           HUGE files...
Why Quality Control and
         Preprocessing?
   Sequencer output:
            Reads + quality
            Is the quality of my sequenced data OK?
                   If something is wrong can I fix it?
   Problem:
            HUGE files... How do they look?
                @HWUSI­EAS460:2:1:368:1089#0/1
                TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
                +HWUSI­EAS460:2:1:368:1089#0/1
                aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB

                @HWUSI­EAS460:2:1:368:528#0/1
                CTATTATAATATGACCGACCAGCTAGATCTACAGTC
                +HWUSI­EAS460:2:1:368:528#0/1
                abbbbaaaabba^aa`Y``aa`aaa``a`a__`[_

   Files are flat files and are big... tens of Gbs (please... don't
    use MS word to see or edit them)
Sequence Quality
Per base Position

                                          Good data
                                         Consistent
                                          High quality along
                                          the read



  * The central red line is the median value
  * The yellow box represents the inter-quartile range (25-75%)
  * The upper and lower whiskers represent the 10% and 90% points
  * The blue line represents the mean quality
Sequence Quality
Per base Position


                  Bad data
                 High variance
                 Quality decrease
                  with length
Per Sequence Quality
     Distribution


                  Good data
                 Most are high-quality
                  sequences
Per Sequence Quality
       Distribution


                        Bad data
                       Not uniform
                        distribution


Low Quality Reads
Nucleotide Content
   per position


                  Good data
                 Smooth over
                  length
                 Organism
                  dependent (GC)
Nucleotide Content
   per position


                  Bad data
                 Sequence
                  position bias
GC Distribution



                  Good data
                 Fits with the
                  expected
                 Organism
                  dependent
Per sequence
GC Distribution


                 Bad data
                It does not fit
                 with expected
                Organism
                 dependent
                 Library
                 contamination?
Per base
GC Distribution


                      Good data
                     No variation
                      across read
                      sequence
Per base
GC Distribution


                      Bad data
                     Variation
                      across
                      read
                      sequence
Per base
N content


            It's not
            good if
            there are
            N bias per
            base
            position
Duplicated Sequences


      I don't expect high number
      of duplicated sequences:
              PCR artifact?
Distribution Length

   This is descriptive.
   Some sequencers
    output sequences of
    different length (e.g.
    454)
Overrepresented Sequences
Question:
       If you obtain the exact same sequence too
          many times      Do you have a problem?


Answer:
       Sometimes!


Examples                PCR primers (Illumina)
      GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT

      CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
K-mer Content



               Helps to detect
                problems
               Adapters?
Practical:
     FastQC and Fastx-toolkit

   Use FastQC to see your starting state.
   Use Fastx-toolkit to optimize different datasets
    and then visualize the result with FastQC to
    prove your success!

    Hints: Try trimming, clipping and quality filtering.



    Go to the tutorial and try the exercises...

Mais conteúdo relacionado

Mais procurados

RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewSean Davis
 
Functional genomics, and tools
Functional genomics, and toolsFunctional genomics, and tools
Functional genomics, and toolsKAUSHAL SAHU
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingSwathi Prabakar
 
Scoring schemes in bioinformatics
Scoring schemes in bioinformaticsScoring schemes in bioinformatics
Scoring schemes in bioinformaticsSumatiHajela
 
Illumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) methodIllumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) methodFekaduKorsa
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
 
Microarray technology and applications
Microarray technology and applicationsMicroarray technology and applications
Microarray technology and applicationsPurnima Kartha
 
Metabolomics
MetabolomicsMetabolomics
MetabolomicsSUVO DAS
 
Introduction to second generation sequencing
Introduction to second generation sequencingIntroduction to second generation sequencing
Introduction to second generation sequencingDenis C. Bauer
 

Mais procurados (20)

NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis Overview
 
Functional genomics, and tools
Functional genomics, and toolsFunctional genomics, and tools
Functional genomics, and tools
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Metabolomics
MetabolomicsMetabolomics
Metabolomics
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
phylogenetics.pdf
phylogenetics.pdfphylogenetics.pdf
phylogenetics.pdf
 
Scoring schemes in bioinformatics
Scoring schemes in bioinformaticsScoring schemes in bioinformatics
Scoring schemes in bioinformatics
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
Illumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) methodIllumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) method
 
Composite and Specialized databases
Composite and Specialized databasesComposite and Specialized databases
Composite and Specialized databases
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
Microarray technology and applications
Microarray technology and applicationsMicroarray technology and applications
Microarray technology and applications
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
ChIP-seq Theory
ChIP-seq TheoryChIP-seq Theory
ChIP-seq Theory
 
Metabolomics
MetabolomicsMetabolomics
Metabolomics
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
 
Introduction to second generation sequencing
Introduction to second generation sequencingIntroduction to second generation sequencing
Introduction to second generation sequencing
 

Destaque

Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGScursoNGS
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotLi Shen
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data Surya Saha
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packageLi Shen
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsAnnelies Haegeman
 
Next-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones InfographicNext-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones InfographicQIAGEN
 
Genomics in the Cloud
Genomics in the CloudGenomics in the Cloud
Genomics in the CloudMatt Wood
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryJan Aerts
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...QBiC_Tue
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Manikhandan Mudaliar
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangementGuy Coates
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsThomas Keane
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Maté Ongenaert
 

Destaque (20)

Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis package
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
 
Next-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones InfographicNext-Generation Sequencing Clinical Research Milestones Infographic
Next-Generation Sequencing Clinical Research Milestones Infographic
 
True Single Molecule Sequencing
True Single Molecule SequencingTrue Single Molecule Sequencing
True Single Molecule Sequencing
 
Clinical Application 2.0
Clinical Application 2.0Clinical Application 2.0
Clinical Application 2.0
 
Genomics in the Cloud
Genomics in the CloudGenomics in the Cloud
Genomics in the Cloud
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discovery
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
 
DNA sequencing
DNA sequencingDNA sequencing
DNA sequencing
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 

Semelhante a NGS Data Preprocessing

Next Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies OverviewNext Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies OverviewPatrick Merel
 
Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraThomas Keane
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platformsAllSeq
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataThomas Keane
 
Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01Nicolas Gobet
 
Ngs microbiome
Ngs microbiomeNgs microbiome
Ngs microbiomejukais
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Lex Nederbragt
 
O connor bosc2010
O connor bosc2010O connor bosc2010
O connor bosc2010BOSC 2010
 
CS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision MedicineCS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision MedicineGabe Rudy
 
Creating a SNP calling pipeline
Creating a SNP calling pipelineCreating a SNP calling pipeline
Creating a SNP calling pipelineDan Bolser
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004xlight
 
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...confluent
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...Thomas Keane
 
Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Miten Jain
 
Guy Coates
Guy CoatesGuy Coates
Guy CoatesEduserv
 
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...Lex Nederbragt
 
How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)James Hadfield
 

Semelhante a NGS Data Preprocessing (20)

Next Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies OverviewNext Gen Sequencing Technologies Overview
Next Gen Sequencing Technologies Overview
 
Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing era
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01Nextgentechnologies 124159213386-phpapp01
Nextgentechnologies 124159213386-phpapp01
 
Hong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptxHong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptx
 
Ngs microbiome
Ngs microbiomeNgs microbiome
Ngs microbiome
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
 
O connor bosc2010
O connor bosc2010O connor bosc2010
O connor bosc2010
 
CS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision MedicineCS Lecture 2017 04-11 from Data to Precision Medicine
CS Lecture 2017 04-11 from Data to Precision Medicine
 
Creating a SNP calling pipeline
Creating a SNP calling pipelineCreating a SNP calling pipeline
Creating a SNP calling pipeline
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004
 
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph...
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
 
Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...
 
Guy Coates
Guy CoatesGuy Coates
Guy Coates
 
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
New High Throughput Sequencing technologies at the Norwegian Sequencing Centr...
 
How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)
 
Xin Zhou - Saturday Closing Plenary
Xin Zhou - Saturday Closing PlenaryXin Zhou - Saturday Closing Plenary
Xin Zhou - Saturday Closing Plenary
 

Mais de cursoNGS

Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemscursoNGS
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanacursoNGS
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNAcursoNGS
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-SeqcursoNGS
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysiscursoNGS
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGScursoNGS
 
Linux for bioinformatics
Linux for bioinformaticsLinux for bioinformatics
Linux for bioinformaticscursoNGS
 
Introduccion a la bioinformatica
Introduccion a la bioinformaticaIntroduccion a la bioinformatica
Introduccion a la bioinformaticacursoNGS
 

Mais de cursoNGS (8)

Towards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systemsTowards an understanding of diversity in biological and biomedical systems
Towards an understanding of diversity in biological and biomedical systems
 
Utilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humanaUtilidad de la genómica en la salud humana
Utilidad de la genómica en la salud humana
 
NGS analysis of micro-RNA
NGS analysis of micro-RNANGS analysis of micro-RNA
NGS analysis of micro-RNA
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-Seq
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
 
Introduction to NGS
Introduction to NGSIntroduction to NGS
Introduction to NGS
 
Linux for bioinformatics
Linux for bioinformaticsLinux for bioinformatics
Linux for bioinformatics
 
Introduccion a la bioinformatica
Introduccion a la bioinformaticaIntroduccion a la bioinformatica
Introduccion a la bioinformatica
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Último (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

NGS Data Preprocessing

  • 1. Babelomics NGS data Preprocessing Granada, June 2011 Javier Santoyo-Lopez jsantoyo@cipf.es http://bioinfo.cipf.es Genomics Department Centro de Investigacion Principe Felipe (CIPF) (Valencia, Spain)
  • 2. History of DNA Sequencing Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998) Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998) 1870 Miescher: Discovers DNA 1940 Avery: Proposes DNA as ‘Genetic Material’ Efficiency 1953 Watson & Crick: Double Helix Structure of DNA (bp/person/year) 1 1965 Holley: Sequences Yeast tRNAAla 15 1970 Wu: Sequences λ Cohesive End DNA 150 Sanger: Dideoxy Chain Termination 1977 Gilbert: Chemical Degradation 1,500 1980 Messing: M13 Cloning 15,000 25,000 1986 Hood et al.: Partial Automation 50,000 1990 Cycle Sequencing Improved Sequencing Enzymes 200,000 Improved Fluorescent Detection Schemes 50,000,000 2002 Next Generation Sequencing Improved enzymes and chemistry 100,000,000,000 2008 Improved image processing
  • 3. Sequence Databases Trend EMBL database growth (March 2011)
  • 4. From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069 From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
  • 5. NGS platforms comparison 06/14/11 Source http://www.clcngs.com/2008/12/ngs-platforms-overview/
  • 6. Adapted from John McPherson, OICR Next-gen sequencers Adapted from John McPherson, OICR AB SOLiDv3 200Gb, 50 bp reads 100 Gb Illumina HiSeq bases per machine run 150Gb, 100bp reads 10 Gb 454 GS FLX Titanium 1 Gb 0.4-1 Gb, 100-500 bp reads 100 Mb 10 Mb ABI capillary sequencer (0.04-0.08 Mb, 450-800 bp reads 1 Mb 10 bp 100 bp 1,000 bp read length
  • 7. Many Gbs of Sequences and... • Data management becomes a challenge. – Moving data across file systems takes time (several hundred Gbs) • What structure has the data? – Different sequencers output different files, but – There are some data formats that are being accepted widely (e.g. FastQ format) • Raw sequence data formats – SFF – Fasta, csfasta – Qual file – Fastq
  • 8. Fasta & Fastq formats • FastA format (everybody knows about it) – Header line starts with “>” followed by a sequence ID – Sequence (string of nt). • FastQ format – First is the sequence (like Fasta but starting with “@”) – Then “+” and sequence ID (optional) and in the following line are QVs encoded as single byte ASCII codes • Different quality encode variants • Nearly all downstream analysis take FastQ as input sequence
  • 9. Sequence to Variation Workflow Raw Data FastQ IGV BWA/ BWASW SAM SAM Tools Samtools BCF BCF mPileup Filter Tools Raw Tools GFF Pileup BAM VCF VCF
  • 10. Sequence to Variation Workflow Raw FastX FastQ Data Tookit FastQ BWA/ IGV BWASW BWA/ BWASW SAM SAM Tools Samtools BCF BCF mPileup Filter Tools Raw Tools GFF Pileup BAM VCF VCF
  • 11. Why Quality Control and Preprocessing?  Sequencer output:  Reads + quality  Is the quality of my sequenced data OK?
  • 12. Why Quality Control and Preprocessing?  Sequencer output:  Reads + quality  Is the quality of my sequenced data OK? If something is wrong can I fix it?
  • 13. Why Quality Control and Preprocessing?  Sequencer output:  Reads + quality  Is the quality of my sequenced data OK? If something is wrong can I fix it?  Problem:  HUGE files...
  • 14. Why Quality Control and Preprocessing?  Sequencer output:  Reads + quality  Is the quality of my sequenced data OK? If something is wrong can I fix it?  Problem:  HUGE files... How do they look? @HWUSI­EAS460:2:1:368:1089#0/1 TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG +HWUSI­EAS460:2:1:368:1089#0/1 aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB @HWUSI­EAS460:2:1:368:528#0/1 CTATTATAATATGACCGACCAGCTAGATCTACAGTC +HWUSI­EAS460:2:1:368:528#0/1 abbbbaaaabba^aa`Y``aa`aaa``a`a__`[_  Files are flat files and are big... tens of Gbs (please... don't use MS word to see or edit them)
  • 15. Sequence Quality Per base Position Good data  Consistent  High quality along the read * The central red line is the median value * The yellow box represents the inter-quartile range (25-75%) * The upper and lower whiskers represent the 10% and 90% points * The blue line represents the mean quality
  • 16. Sequence Quality Per base Position Bad data  High variance  Quality decrease with length
  • 17. Per Sequence Quality Distribution Good data  Most are high-quality sequences
  • 18. Per Sequence Quality Distribution Bad data  Not uniform distribution Low Quality Reads
  • 19. Nucleotide Content per position Good data  Smooth over length  Organism dependent (GC)
  • 20. Nucleotide Content per position Bad data  Sequence position bias
  • 21. GC Distribution Good data  Fits with the expected  Organism dependent
  • 22. Per sequence GC Distribution Bad data  It does not fit with expected  Organism dependent Library contamination?
  • 23. Per base GC Distribution Good data  No variation across read sequence
  • 24. Per base GC Distribution Bad data  Variation across read sequence
  • 25. Per base N content It's not good if there are N bias per base position
  • 26. Duplicated Sequences I don't expect high number of duplicated sequences:  PCR artifact?
  • 27. Distribution Length  This is descriptive.  Some sequencers output sequences of different length (e.g. 454)
  • 28. Overrepresented Sequences Question: If you obtain the exact same sequence too many times Do you have a problem? Answer: Sometimes! Examples PCR primers (Illumina)  GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT  CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
  • 29. K-mer Content  Helps to detect problems  Adapters?
  • 30. Practical: FastQC and Fastx-toolkit  Use FastQC to see your starting state.  Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success! Hints: Try trimming, clipping and quality filtering. Go to the tutorial and try the exercises...