An overview of QBI’s production informatics framework, with an emphasis on which services are provided and how the resulting data are made available: from interactive quality control to integration with external data in the genome browser.
1. QBI’s Centre for Brain Genomics: The informatics side of things [Sprengben [why not get a friend]], September 8, 2011
2. Objectives of QBI’s Centre for Brain Genomics: on-time delivery; reliable data production; convincing data; easy delivery. Reference: Perkel JM. Coding your way out of a problem. Nat Methods. 2011 Jun. PMID: 21716280.
4. Detailed workflow [Diagram; labels: cBot, HiSeq, 30 different programs, CASAVA, raw sequence reads, projects, flowcell, HiSeq, cluster]
5. Overview of Production Informatics framework [Diagram; labels: Automatic, Manual; Processing, Evaluation; Run/, Data/; MakeFastq.sh, trigger.sh armed, trigger.sh, html; Unaligned/, bwa/, reCaAl/, variant/, Summary.html; //clusterstorage; Apache, IGV, R, UCSC; //cluster-vm]
6. Trigger.sh: keeping data separate from scripts; automating verification, quality control and summary HTML generation; rerunning the pipeline from any point.
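The slides do not show trigger.sh itself, so the following is only a minimal sketch of how "rerunning the pipeline from any point" could work while keeping scripts separate from data. It assumes a hypothetical `run_step` helper and a hypothetical `.done/` marker directory inside each project directory; neither name comes from the original framework.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the original trigger.sh.
set -u

# run_step PROJECT_DIR STEP_NAME COMMAND...
# Runs COMMAND unless PROJECT_DIR/.done/STEP_NAME already exists.
# The marker file is only written when COMMAND succeeds.
run_step() {
    local project=$1 step=$2
    shift 2
    local marker="$project/.done/$step"
    if [ -e "$marker" ]; then
        echo "skip $step"
        return 0
    fi
    mkdir -p "$project/.done"
    if "$@"; then
        touch "$marker"
        echo "done $step"
    else
        echo "FAILED $step"
        return 1
    fi
}

# Demo on a throwaway project directory with placeholder commands.
project=$(mktemp -d)
run_step "$project" mapping true
run_step "$project" variant false || true
run_step "$project" mapping true    # already marked, so it is skipped
rm -rf "$project"
```

Because the scripts only receive the project directory as an argument, the same script can serve every project, and deleting a marker file is enough to rerun a single step.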
8. Config.txt
    #********************
    # Tasks
    #********************
    mappingBWA="1"
    recalibrateQualScore="1"
    #********************
    # Paths
    #********************
    FASTA="/clusterdata/resources/hg19/hg19.fasta"
    SEQREG="chr1:229994688-230071581"
    DBSNP="/clusterdata/resources/hg19/snpdb132.vcf"
    #********************
    # PARAMETER
    #********************
    LIBRARY="QBI"
    ADDPARAMBWA="--force single"
    Tasks: specifies what to do, e.g. mapping and recalibration.
    Paths: specifies where to find resources.
    Parameter: customizes the standard scripts for this project.
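How the pipeline reads Config.txt is not shown in the slides. As one plausible sketch, a script could source the file and act on every task flag set to "1". The function name `enabled_tasks` and the default values are assumptions made for illustration; only the flag names `mappingBWA` and `recalibrateQualScore` come from the slide.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of consuming a Config.txt like the one above.

# enabled_tasks CONFIG_FILE
# Sources the config in a subshell (so nothing leaks into the caller)
# and prints the name of each task flag that is set to "1".
enabled_tasks() {
    (
        # Assumed defaults: a task is off unless the config enables it.
        mappingBWA=0
        recalibrateQualScore=0
        . "$1"
        [ "$mappingBWA" = "1" ] && echo "mappingBWA"
        [ "$recalibrateQualScore" = "1" ] && echo "recalibrateQualScore"
        true
    )
}

# Demo with a temporary config file.
config=$(mktemp)
cat > "$config" <<'EOF'
mappingBWA="1"
recalibrateQualScore="0"
FASTA="/clusterdata/resources/hg19/hg19.fasta"
EOF
enabled_tasks "$config"
rm -f "$config"
```

Sourcing executes whatever the file contains, so this approach only suits config files that are written by trusted project members.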
10. Summary.html Project Cards: sequence statistics; run checkpoints; data visualization; mapping stats; download; interesting regions.
11. Scaffold of pbsScripts.sh: Error catching
    Code example for setting up what errors to look out for:
        # QCVARIABLES, loosing reads, unmapped read,no such file,file not found,bwa.sh: line
    Output in Summary.html:
        >>>>>>>>>> Errors
        QC_PASS .. 0 have We are loosing reads/184
        QC_PASS .. 0 have for unmapped read/184
        QC_PASS .. 0 have no such file/184
        QC_PASS .. 0 have file not found/184
        QC_PASS .. 0 have bwa.sh: line/184
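The summary lines above suggest that each error string is counted across all job logs. The code that does this is not part of the slides, so the following is a sketch under assumptions: the function name `check_errors`, the `*.log` file naming, and the `QC_FAIL` label are all hypothetical; only the `QC_PASS .. N have STRING/TOTAL` line format is taken from the slide.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real scaffold is not shown in the slides.

# check_errors LOG_DIR PATTERNS
# PATTERNS is a comma-separated list of fixed strings. For each string,
# count the *.log files in LOG_DIR that contain it and print one line.
check_errors() {
    local log_dir=$1 patterns=$2
    local total=0 hits status pattern f
    local -a list
    for f in "$log_dir"/*.log; do
        if [ -e "$f" ]; then total=$((total + 1)); fi
    done
    IFS=',' read -r -a list <<< "$patterns"
    for pattern in "${list[@]}"; do
        hits=0
        for f in "$log_dir"/*.log; do
            [ -e "$f" ] || continue
            # -F treats the pattern as a literal string, not a regex.
            if grep -qF -- "$pattern" "$f"; then hits=$((hits + 1)); fi
        done
        # QC_FAIL is an assumed label; the slides only show QC_PASS.
        if [ "$hits" -eq 0 ]; then status=QC_PASS; else status=QC_FAIL; fi
        echo "$status .. $hits have $pattern/$total"
    done
}

# Demo: two logs, one of which contains an error string.
logs=$(mktemp -d)
echo "mapping finished" > "$logs/sample1.log"
echo "cat: reads.fastq: file not found" > "$logs/sample2.log"
check_errors "$logs" "no such file,file not found"
rm -rf "$logs"
```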
12. Scaffold of pbsScripts.sh: Checkpoints
    Job submission:
        qsub -b y -j y [PBSOPTIONS] pbsScript.sh -k HISEQINF [PARAMETERS]
    Code example for setting up checkpoints in pbsScript.sh:
        echo "********* mapping"
        $BWA aln -t $THREADS $FASTA $f > $OUT/${n/$FASTQ/sai}
        $BWA aln -t $THREADS $FASTA ${f/$READONE/$READTWO} > $OUT/${n/$READONE.$FASTQ/$READTWO.sai}
    Output in Summary.html:
        >>>>>>>>>> CheckPoints
        QC_PASS .. 184 have mapping/184
        QC_PASS .. 184 have sorting and bam-conversion/184
        QC_PASS .. 184 have mark duplicates/184
        QC_PASS .. 184 have statistics/184
        QC_PASS .. 184 have coverage track/184
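The mapping commands rely on bash pattern substitution (`${var/pattern/replacement}`) to derive the read-two input and the `.sai` output names from the read-one file. The sketch below shows what those expansions produce. The values of `FASTQ`, `READONE` and `READTWO`, the example path, and the assumption that `n` is the basename of `f` are all illustrative; the slides do not show how these variables are set.

```shell
#!/usr/bin/env bash
# Assumed values; the slides do not show how these variables are set.
FASTQ="fastq"
READONE="R1"
READTWO="R2"

# sai_names FASTQ_PATH
# Prints the read-two input path, then the read-one and read-two
# .sai output names, using the same expansions as the slide.
sai_names() {
    local f=$1
    local n=${f##*/}    # assumed: n is the file name without directory
    echo "${f/$READONE/$READTWO}"
    echo "${n/$FASTQ/sai}"
    echo "${n/$READONE.$FASTQ/$READTWO.sai}"
}

sai_names "/data/run1/sampleA_R1.fastq"
# prints:
#   /data/run1/sampleA_R2.fastq
#   sampleA_R1.sai
#   sampleA_R2.sai
```

Only the first match is replaced, so a directory name that happens to contain the read-one tag would be rewritten instead of the file name.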
14. The big picture
    Covering all aspects of design*, set-up*, maintenance* and usage (*except cluster).
    In numbers: 5 TB raw data; 750 GB processed data; 57 GB external data; 7 project cards; 10 projects, 6 HiSeq runs; 40 wiki pages, 250 tasks, 551 h logged; 160 commits; 35 external programs; 41 custom scripts (4197 lines of code).
    [Diagram; labels: Documentation; Project Server (//project); Application; Backup/Version Control; Data Warehousing; Statistical Analysis; HiSeq Output; RStudio; Raw Data; Quality Control; Project Cards; Processed Data; Rsync; Hypothesis Generation; Software (BWA, GATK, samtools, etc.); Custom Scripts; Version Control; Data Processing and Analysis; External Genomic Resources; Cluster; Genomes, Annotation, etc.; Project Server Content; Galaxy; Visualization; IGV; Genome Browser; //cluster-vm; //clusterstorage; //groupshare; //ethan]
15. Three things to remember
    Reliable data production: all projects share a similar structure and are processed in the same way.
    Convincing data: all steps are tightly quality controlled and the QC report is accessible.
    Easy delivery: we tailored data availability to skill levels (webpage, RStudio, console).
    (On-time delivery: production informatics has priority on the cluster.)
16. Next week: NGS Discussion group, Methylation analysis (Kevin Dudley and Danay Baker-Andresen)