Surya Saha ss2489@cornell.edu
BTI PGRP Summer Internship Program 2014
Slides: https://bitly.com/BioinfoInternEx2014
Quality Control of NGS Data
1. Evaluation
2. Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014 2
Slide credit: Aureliano Bombarely
Evaluation
Goal:
Learn to use read evaluation programs, paying attention to
relevant parameters such as quality score distribution, read
length distribution, and read duplication.
Data:
(Illumina data for two tomato ripening stages)
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
Tools:
tar -zxvf (command line, untar and unzip the files)
head (command line, take a quick look at the files)
mv (command line, rename the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process FASTA/FASTQ files)
FastQC (GUI, calculates several statistics for each file)
Evaluation
Exercise 1:
1. Untar and unzip the file:
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
2. Raw data will be found in two directories: breaker and
immature_fruit. Print the first 10 lines of the files
SRR404331_ch4.fq, SRR404333_ch4.fq,
SRR404334_ch4.fq and SRR404336_ch4.fq.
Question 1.1: Are these files in FASTQ format?
3. Change the extension of the .fq files to .fastq.
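Steps 1 to 3 of Exercise 1 can be sketched at the command line as follows. This is a minimal sketch, not the course's official solution: the helper names `looks_like_fastq` and `rename_fq_to_fastq` are hypothetical, the archive is assumed to unpack into the current directory, and the format check assumes unwrapped four-line FASTQ records (an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence).

```shell
# Hypothetical helper: succeed if the first record of a file looks like
# four-line FASTQ ("@" header, sequence, "+" separator, quality string
# with the same length as the sequence).
looks_like_fastq() {
    head -n 4 "$1" | awk '
        NR == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        NR == 2 { seqlen = length($0) }
        NR == 3 && substr($0, 1, 1) != "+" { bad = 1 }
        NR == 4 && length($0) != seqlen   { bad = 1 }
        END { if (bad || NR < 4) exit 1 }'
}

# Hypothetical helper: rename every .fq file below a directory to .fastq.
rename_fq_to_fastq() {
    find "$1" -type f -name '*.fq' | while IFS= read -r f; do
        mv -- "$f" "${f%.fq}.fastq"
    done
}

DATASET=/home/bioinfo/Data/ch4_demo_dataset.tar.gz   # path given on the slide
if [ -f "$DATASET" ]; then
    tar -zxvf "$DATASET"                              # step 1
    # Step 2: print the first 10 lines of each raw read file and check the format.
    find . -type f -name 'SRR4043*_ch4.fq' | while IFS= read -r fq; do
        echo "== $fq"
        head -n 10 "$fq"
        if looks_like_fastq "$fq"; then
            echo "first record looks like FASTQ"
        else
            echo "first record does not look like FASTQ"
        fi
    done
    rename_fq_to_fastq .                              # step 3
fi
```

The archive step is guarded so that the helpers can be tried on any directory of `.fq` files, for example `rename_fq_to_fastq breaker`.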
Evaluation
Exercise 1 (continued):
4. Count the number of sequences in each FASTQ file using
commands you learned earlier.
5. Convert the FASTQ files to FASTA.
6. Explore other tools in the FASTX toolkit.
7. Now count the number of sequences in each FASTA file and see
if the number of sequences has changed.
Tip: Use ‘grep’
Tip: Use ‘fastq_to_fasta -h’ to see help
Use Google if you are stuck.
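Steps 4, 5 and 7 can be sketched as below. The helper names and the two-read demo file are illustrative assumptions, not course data. The sketch counts FASTQ records as the line count divided by four (assuming unwrapped four-line records), because a plain `grep -c '^@'` can overcount: `@` is also a valid quality character and may start a quality line. By default `fastq_to_fasta` discards reads that contain N unless `-n` is given, which is one possible reason for the counts to differ in step 7. The `-Q33` option tells the FASTX tools that qualities use the Phred+33 offset, which is assumed here.

```shell
# Hypothetical helpers; they assume one record per 4 lines (FASTQ)
# and one ">" header line per sequence (FASTA).
count_fastq() { awk 'END { print NR / 4 }' "$1"; }
count_fasta() { grep -c '^>' "$1"; }

# Tiny made-up FASTQ file: the first quality line starts with "@".
demo=$(mktemp)
printf '%s\n' '@read1' 'ACGT' '+' '@III' \
              '@read2' 'ACGN' '+' 'IIII' > "$demo"

naive=$(grep -c '^@' "$demo")     # counts header lines AND the "@III" quality line
records=$(count_fastq "$demo")    # line count divided by 4
echo "grep '^@': $naive   lines/4: $records"

# Steps 5 and 7: convert and recount, only if the FASTX toolkit is installed.
if command -v fastq_to_fasta >/dev/null 2>&1; then
    fastq_to_fasta -Q33 -i "$demo" -o "$demo.fasta"
    echo "FASTA sequences: $(count_fasta "$demo.fasta")"
fi
```

On real data, a header-specific pattern such as `grep -c '^@SRR'` is a common compromise, assuming every header line starts with the run accession.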
Evaluation: Sequence Quality
[Figures: good Illumina dataset; good vs. poor Illumina dataset; 454 and Pacific Biosciences]
Evaluation: Sequence Content
[Figures: good Illumina dataset; good vs. poor Illumina dataset]
Evaluation: Duplication
[Figures: good Illumina dataset; good vs. poor Illumina dataset]
Evaluation: Overrepresented Sequences
[Figures: good Illumina dataset; good vs. poor Illumina dataset]
Evaluation: Kmer content
[Figures: good Illumina dataset; good vs. poor Illumina dataset; 454 and Pacific Biosciences]
Evaluation
Exercise 2:
1. Type ‘fastqc’ to start the FastQC program. Load the four
FASTQ sequence files into the program.
Question 2.2: How many sequences are there per file according to FastQC?
Question 2.3: What is the length range of these reads?
Question 2.4: What is the quality score range of these reads? Which
dataset looks best quality-wise?
Question 2.5: Do these datasets contain overrepresented reads?
Question 2.6: Looking at the kmer content, do you think that the samples
contain adapter sequences?
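FastQC can also be run without the GUI, which may help when answering Questions 2.2 and 2.3 for several files at once. A minimal sketch, assuming `fastqc` is on the PATH and the renamed `.fastq` files sit somewhere below the current directory. The output directory name `fastqc_out` and the helper `basic_stats` are hypothetical. With `--extract`, FastQC unpacks each report, and the `fastqc_data.txt` file inside is assumed to hold tab-separated `Total Sequences` and `Sequence length` lines in its Basic Statistics module.

```shell
# Hypothetical helper: print the read count and length range recorded in a
# FastQC fastqc_data.txt file (tab-separated "name<TAB>value" lines).
basic_stats() {
    awk -F '\t' '$1 == "Total Sequences" || $1 == "Sequence length" {
        print $1 ": " $2
    }' "$1"
}

if command -v fastqc >/dev/null 2>&1; then
    mkdir -p fastqc_out
    # Run FastQC in batch mode on every .fastq file below the current directory.
    find . -type f -name '*.fastq' -exec fastqc --extract -o fastqc_out {} +
    # One extracted report directory is expected per input file.
    for data in fastqc_out/*_fastqc/fastqc_data.txt; do
        [ -f "$data" ] || continue
        echo "== $data"
        basic_stats "$data"
    done
fi
```

The quality, overrepresentation and kmer questions are easier to judge from the plots, so the GUI or the HTML reports remain the better place to answer Questions 2.4 to 2.6.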
Preprocessing
Goal:
Trim the low-quality ends of the reads and remove
short reads.
Data:
(Illumina data for two tomato ripening stages)
ch4_demo_dataset.tar.gz
Tools:
fastq-mcf (command line tool to process reads)
FastQC (GUI, calculates several statistics for each file)
Preprocessing
Exercise 3:
• Download the file adapters1.fa from
ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa
• Run the read processing program on each of the datasets
using
  • Min. qscore of 30
  • Min. length of 40 bp
• Type ‘fastqc’ to start the FastQC program. Load the four
new FASTQ sequence files. Compare the results with those of the
previous datasets.
Tip: Use ‘fastqc -h’ to see help
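One possible way to run Exercise 3 from the command line is sketched below. It assumes `fastq-mcf` (from ea-utils) and `wget` are installed and that the FTP address is still reachable. The `.trimmed.fastq` output naming and the helper names `trimmed_name` and `trim_reads` are hypothetical. The options `-q 30` and `-l 40` are used here for the minimum quality score and minimum read length; check `fastq-mcf -h` on your installation, since defaults and behavior can differ between versions. A Phred score of 30 corresponds to an estimated error probability of 1 in 1000 per base.

```shell
ADAPTERS_URL=ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa

# Hypothetical naming scheme: reads.fastq -> reads.trimmed.fastq
trimmed_name() { printf '%s\n' "${1%.fastq}.trimmed.fastq"; }

# Trim one file: clip adapters, trim bases below Q30, drop reads shorter than 40 bp.
trim_reads() {
    fastq-mcf -q 30 -l 40 -o "$(trimmed_name "$1")" adapters1.fa "$1"
}

if command -v fastq-mcf >/dev/null 2>&1; then
    # Download the adapter file once, into the current directory.
    [ -f adapters1.fa ] || wget "$ADAPTERS_URL"
    # Process every untrimmed .fastq file below the current directory.
    find . -type f -name '*.fastq' ! -name '*.trimmed.fastq' |
    while IFS= read -r fq; do
        trim_reads "$fq"
    done
fi
```

The resulting `.trimmed.fastq` files can then be loaded into FastQC and compared with the reports for the original files.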
Need Help??
Solutions: https://bitly.com/BioinfoInternExSol2014
