Data of Unusual Size in Metagenomics
C. Titus Brown
ctb@msu.edu
Asst Professor, Michigan State University
(Microbiology, Computer Science, and BEACON)
Openness

• Twit me! @ctitusbrown

• My blog: http://ivory.idyll.org/blog/

• Grants, preprints, etc: http://ged.msu.edu/

• Software: BSD, github.com/ged-lab/.
Thanks

• My lab, esp. Jason Pell, Arend Hintze, Adina
  Chuang Howe, Qingpeng Zhang, and Eric
  McDonald

• Michigan State, USDA and NSF for $$
“Three types of data scientists.”
       (Bob Grossman, U. Chicago, at XLDB 2012)


1. Your data gathering rate is slower than Moore’s Law.
2. Your data gathering rate matches Moore’s Law.
3. Your data gathering rate exceeds Moore’s Law.
Metagenomics
• Randomly sequence DNA from mixed
 microbial communities, e.g. soil.

• DNA sequencing rates (cost/volume) have
 been outpacing Moore’s Law for ~5 years
 now… A terabase for ~$10k today.
Analogy: feeding libraries into a paper shredder, digitizing the shreds, and reconstructing the books.
“Shredding libraries” is a good analogy!


• Lots of copies of Dickens' “A Tale of Two Cities”, SAT
  study guides, etc.
• Not as many copies of <obscure hipster author>.
• Many different editions with minor differences, +
  Reader’s Digest, excerpts, etc.
• (Although for libraries we usually know the language)
Two points:
1. If we feed all of the libraries in the world into a paper shredder and mix, how do we recover the book content!?
2. That’s actually an awful lot of data…
Digression: Data of Unusual Size (aka Big
        Data) in Scientific Research
• Research is already hard enough:

   – Novel, fast moving, heterogeneous data types.

   – Unknown answers.

• Big Data => scaling, requires good engineering

   – Apply or invent new data structures & algorithms.

   – Write usable, functioning, reusable software.

(Hint: academics are not good at one of these things)
The assembly problem
• The N**2 approach: compare every fragment against every
  other fragment to find overlaps.
• The word-based approach: decompose the fragments further
  into fixed-length, overlapping, hashable words.
(Only one of these scales…)
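A minimal sketch of the word-based approach, assuming reads are plain strings and using a tiny k for illustration. The names `kmers` and `index_reads` and the sample reads are hypothetical, not from the talk. Reads that share a word are found by dictionary lookup rather than by comparing every pair of reads.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every overlapping fixed-length word (k-mer) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def index_reads(reads, k):
    """Map each k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for n, read in enumerate(reads):
        for word in kmers(read, k):
            index[word].add(n)
    return index

# Hypothetical reads: the first two overlap, the third is unrelated.
reads = ["ATGGCGTGCA", "GCGTGCAATG", "TTTTCCCCAA"]
index = index_reads(reads, k=5)

# k-mers seen in more than one read mark candidate overlaps.
shared = {word: ids for word, ids in index.items() if len(ids) > 1}
```

Building the index takes one pass over the reads, so the work grows with the total amount of sequence rather than with the square of the number of reads.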
Shotgun sequencing




“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
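The definition of coverage can be sketched in a few lines. The function names and the numbers are made up for illustration; `per_base_coverage` is the "draw a line straight down" count at each position.

```python
def mean_coverage(num_reads, read_length, genome_length):
    """Average number of reads overlapping each base."""
    return num_reads * read_length / genome_length

def per_base_coverage(read_starts, read_length, genome_length):
    """Count how many reads overlap each position of the genome."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth

# Hypothetical numbers: 1000 reads of 100 bp over a 10,000 bp genome.
average = mean_coverage(1000, 100, 10_000)

# Three 4 bp reads starting at positions 0, 2 and 4 of an 8 bp genome.
depth = per_base_coverage([0, 2, 4], read_length=4, genome_length=8)
```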
Reducing to k-mer overlaps

Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.
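A small sketch of k-mer abundance counting, assuming an exact in-memory counter. The toy genome is hypothetical, and ten identical full-length reads stand in for ~10x coverage.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

genome = "ATGGCGTGCAATGC"   # hypothetical toy genome
reads = [genome] * 10       # stand-in for 10x coverage
counts = count_kmers(reads, k=4)
```

With real shotgun reads the counts scatter around the coverage instead of matching it exactly, but the idea is the same: true k-mers are seen roughly as many times as the coverage.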
Errors create new k-mers




Each single-base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once, because errors are random.
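To see where the ~k comes from: a substitution changes every k-mer that overlaps the erroneous base, and away from the read ends there are k of those. A sketch with a hypothetical read and a hypothetical error position:

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 5
true_read = "ATGGCGTGCAATGCCTA"   # hypothetical error-free read
error_pos = 8                     # hypothetical position of a substitution

# Introduce a single-base error (C -> T) at error_pos.
bad_read = true_read[:error_pos] + "T" + true_read[error_pos + 1:]

# k-mers that exist only because of the error.
new_kmers = kmer_set(bad_read, k) - kmer_set(true_read, k)
```

Near the ends of a read fewer k-mers overlap the error, and a new k-mer can occasionally coincide with a true one, which is why the count is ~k rather than exactly k.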
So, our k-mer data contains both true and false k-mers.
Random sampling => deep sampling needed




   Typically 10-100x needed for robust recovery (300 Gbp for human)
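The 300 Gbp figure follows from 100x coverage of a human genome of roughly 3 Gbp (the genome size is an assumption filled in here, not stated on the slide). The second function is a simple Poisson model of uniform random sampling, also an assumption, which hints at why shallow sampling leaves gaps.

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # approximate size, assumed for illustration

def bases_required(genome_bp, coverage):
    """Total bases to sequence for a target average coverage."""
    return genome_bp * coverage

def fraction_unsampled(coverage):
    """Poisson model: chance that a given base is covered by zero reads."""
    return math.exp(-coverage)

total = bases_required(HUMAN_GENOME_BP, 100)   # 100x coverage
missed_at_1x = fraction_unsampled(1)
missed_at_10x = fraction_unsampled(10)
```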
Can we efficiently distinguish true from false?




           Conway T C , Bromage A J Bioinformatics 2011;27:479-486


Uneven representation complicates matters.

Since you’re sequencing at random, you need to sequence deeply in order to be sensitive to rare hipster books.

These rare hipster books may be important to understanding culture: not only best-sellers have influence!
Streaming algorithm to do so: digital normalization
Digital normalization
Streaming algorithm for lossy compression of data sets.
• Converts random sampling to systematic sampling by
  building an assembly graph on the fly
• Can discard up to 99.9% of data set and errors, and still
  retain all information necessary for assembly.
• Acts as a prefilter for assemblers; ~5 lines of Python.
• Each piece of data is only examined once (!)
• Most errors are never collected => low memory.
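A hedged sketch of what such a prefilter could look like. It assumes the keep/discard rule is a cutoff on the median k-mer abundance of each read, which the slides do not spell out. It uses an exact `Counter` where a real implementation would want a fixed-memory counting structure. The function name, `k`, and `cutoff` are illustrative choices.

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k=20, cutoff=20):
    """Stream reads once; keep a read only if its region is not yet well covered."""
    counts = Counter()
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not words:
            continue  # read shorter than k: nothing to estimate coverage from
        # Median abundance of the read's k-mers estimates its coverage so far.
        if median(counts[w] for w in words) < cutoff:
            for w in words:
                counts[w] += 1  # only kept reads contribute k-mers
            yield read

# Hypothetical input: one sequence seen 50 times, one rare sequence seen once.
reads = ["ATGGCGTGCAATGCCTA"] * 50 + ["TTTTTCCCCCGGGGGAA"]
kept = list(normalize_by_median(reads, k=5, cutoff=3))
```

In this toy run the abundant sequence is kept only until its estimated coverage reaches the cutoff, while the rare one passes straight through. Discarded reads never add k-mers to the table, which is where the memory saving comes from.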
Separately, apply Bloom filters to store the information/data.




      “Exact” is for best possible information-theoretical storage.

                                                             Pell et al., PNAS 2012
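For readers who have not met Bloom filters, here is a generic textbook sketch, not the implementation from the paper. The class name, the use of SHA-256 as the hash, and the sizes are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Fixed-size set membership: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Store the k-mers of a hypothetical read in 1024 bits (128 bytes).
kmer_filter = BloomFilter(num_bits=1024, num_hashes=3)
read, k = "ATGGCGTGCAATGCCTA", 5
for i in range(len(read) - k + 1):
    kmer_filter.add(read[i:i + k])
```

Memory is fixed up front no matter how many k-mers are added. The price is that a lookup for a k-mer that was never added can occasionally return true.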
Some details
• This was previously completely intractable.
• Implemented in C++ and Python; “good practice” (?)
• We’ve changed scaling behavior from data to information.
• Practical scaling for ~soil metagenomics is 10-100x: need <
  1 TB of RAM for ~2 TB of data. ~2 weeks.
• Just beginning to explore threading, multicore, etc. (BIG
  DATA grant proposal)
• Goal is to scale to 50 Tbp of data (~5-50 TB RAM currently)
My rules of thumb for Big Data
     (for a better tomorrow)


1. Write well-understood filters and
   components, not monolithic programs.
2. Throw away data as quickly as possible.
3. Scripting is an extremely effective way to
connect serious software to scientists.
4. Streaming/online approaches are worth the
effort to develop them.

     (OK, this is obvious to this audience)
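
One way to read the rules together, sketched with Python generators: each filter is a small component that sees one record at a time and drops data as early as it can. The filter names and the sample input are hypothetical.

```python
def read_sequences(lines):
    """Yield non-empty, stripped sequences one at a time."""
    for line in lines:
        seq = line.strip()
        if seq:
            yield seq

def drop_short(seqs, min_length):
    """Discard sequences shorter than min_length."""
    for seq in seqs:
        if len(seq) >= min_length:
            yield seq

def drop_ambiguous(seqs):
    """Discard sequences containing the ambiguous base 'N'."""
    for seq in seqs:
        if "N" not in seq:
            yield seq

# Hypothetical input; in practice this would be an open file or a pipe.
raw = ["ATGGCGTGCA\n", "\n", "ATG\n", "ATGNCGTGCA\n", "GCGTGCAATG\n"]

# Filters compose like a shell pipeline; nothing is held in memory
# beyond the record currently passing through.
kept = list(drop_ambiguous(drop_short(read_sequences(raw), min_length=5)))
```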
