32. The First $1,000 Genome
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html
33. Expectation of Data Processing Power for Illumina HiSeq X Ten
• A cluster of 10 HiSeq X instruments
• Capable of sequencing up to 18,000 whole human genomes each year
• Has a run cycle of ~3 days and produces ~150 genomes each run cycle
• Running the industry-standard BWA+GATK analysis pipeline on a
reasonably high-end compute server (dual 12-core Intel Xeon E5-2697 v2
CPUs at 2.7 GHz with 96 GB DRAM) takes ~24 hours per genome.
• To achieve the required throughput of 150 genomes every three days,
at least 50 of these servers are required.
• The target is therefore ~28 minutes per genome (3 days / 150 genomes)
for the completion of the mapping, aligning, sorting, de-duplication and
variant calling.
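The figures on this slide can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation, not a capacity-planning tool: the function names are made up for illustration, and it assumes the sequencers run continuously and that each server processes one genome at a time.

```python
import math

# Figures taken from the slide.
GENOMES_PER_RUN = 150        # genomes produced per run cycle
RUN_DAYS = 3                 # length of one run cycle, in days
HOURS_PER_GENOME = 24        # BWA+GATK time per genome on one server


def servers_needed(genomes_per_run, run_days, hours_per_genome):
    """Servers required so analysis keeps pace with sequencing.

    Assumption: each server handles one genome at a time.
    """
    server_hours_required = genomes_per_run * hours_per_genome
    hours_available_per_server = run_days * 24
    return math.ceil(server_hours_required / hours_available_per_server)


def minutes_per_genome(genomes_per_run, run_days):
    """Time budget per genome if genomes are processed one after another."""
    return run_days * 24 * 60 / genomes_per_run


def genomes_per_year(genomes_per_run, run_days, days_per_year=365):
    """Yearly output, assuming back-to-back run cycles with no downtime."""
    return genomes_per_run * days_per_year / run_days


if __name__ == "__main__":
    print(servers_needed(GENOMES_PER_RUN, RUN_DAYS, HOURS_PER_GENOME))
    print(minutes_per_genome(GENOMES_PER_RUN, RUN_DAYS))
    print(genomes_per_year(GENOMES_PER_RUN, RUN_DAYS))
```

Under these assumptions the script reproduces the 50 servers and the roughly 28-minute budget, and the yearly figure comes out slightly above the quoted "up to 18,000" genomes.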
34. Next-Generation Sequencing (NGS) 101
https://www.broadinstitute.org/gatk/img/cartoon-blackbox-workflow-web-blackblue.png
35. GATK Best Practice
http://cdn.vanillaforums.com/gatk.vanillaforums.com/FileUpload/eb/44f317f8850ba74b64ba47b02d1bae.png
How do we analyze 4 to 5 million variants?
44. Scale-Up vs. Scale-Out
Horizontal Scaling (More Nodes): many commodity nodes
Vertical Scaling (Bigger Nodes): more expensive server (big memory, many CPU cores)
Amniocentesis
For many years, scientists believed that female development was the default programme, and that male development was actively switched on by the presence of a particular gene on the Y chromosome. In 1990, researchers made headlines when they uncovered the identity of this gene, which they called SRY. Just by itself, this gene can switch the gonad from ovarian to testicular development. For example, XX individuals who carry a fragment of the Y chromosome that contains SRY develop as males.
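The Human Genome Project (HGP), a collaboration among six countries that took 13 years and US$3 billion, announced in 2003 that it had completed the initial sequencing of the 3 billion base pairs in the human genome.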
Huge amount of NGS data
-~100 GB per human for whole-genome sequencing (WGS)
-200 TB for the 1000 Genomes Project
Hard to distinguish between mutation and noise
-~1/100 sequencing error rate from Illumina sequencers
-~1/1000 difference between individuals
Hard to identify Variants of Unknown Significance (VUS)
-Exome is only 1.5% of the genome
Database variety and inconsistency
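Two of the figures in these notes lend themselves to a quick calculation. The sketch below is a deliberately simplified model with hypothetical function names. It treats sequencing errors and true differences as independent per-base events seen in a single read, and ignores coverage depth, which is what real variant callers rely on to separate signal from noise. The storage estimate assumes decimal units (1 PB = 1,000,000 GB) and the 18,000 genomes per year quoted for the HiSeq X Ten.

```python
def true_variant_fraction(error_rate, variant_rate):
    """Rough share of observed single-read mismatches that are real differences.

    Simplification: a mismatch is either an error or a true variant,
    and the (tiny) chance of both at the same base is ignored.
    """
    return variant_rate / (variant_rate + error_rate)


def yearly_storage_petabytes(genomes_per_year, gb_per_genome):
    """Raw data volume per year in petabytes (decimal: 1 PB = 1,000,000 GB)."""
    return genomes_per_year * gb_per_genome / 1_000_000


if __name__ == "__main__":
    # ~1/100 sequencing error vs. ~1/1000 difference between individuals.
    fraction = true_variant_fraction(error_rate=1 / 100, variant_rate=1 / 1000)
    print(f"Share of mismatches that are true differences: {fraction:.1%}")

    # ~100 GB per genome at 18,000 genomes per year.
    print(f"Raw data per year: {yearly_storage_petabytes(18_000, 100):.1f} PB")
```

In this model only about one mismatch in eleven seen in a single read reflects a real difference, which is why mutation is hard to tell from noise without multiple overlapping reads.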
Illumina HiSeq X Ten: 18,000 human genomes per year
Twelfth place; the key point is "country"
Drew Conway's Venn diagram of data science, 2010
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html
NGS data
-noise
-huge amount of data
-hard to represent
-hard to explain (VUS)
-hard to correlate