This document summarizes a talk on using Apache Spark to assemble metagenomes from short-read sequencing data. Metagenomes are the combined genomes of microbial communities containing many species. Spark offers a more efficient and scalable approach than previous MPI- and Hadoop-based methods. The talk demonstrates clustering reads from small test datasets in Spark, evaluates performance on real 20 GB datasets, and reports failed attempts to scale to 100 GB. While Spark is easy to develop for and efficient, challenges remain in robustness at large scales and in optimizing for datasets of differing complexity.
5. Metagenome assembly
[Figure: a library of books → shredded library → "reconstructed" library]
• Genome ~= book
• Metagenome ~= library
• Sequencing ~= sampling the shredded pieces
6. Scale is an enemy
[Chart, log scale from 1 to 1,000,000 gigabytes: dataset sizes for common microbial, human, cow, and soil metagenomes]
7. Complexity is another…
• Data complexity
– Contamination
– Number of microbial species
– Species abundance distribution
– Sequencing errors
• Algorithm complexity
– Multiple steps, each with different time/space characteristics
10. 2010: MP/MPI on supercomputers
• Fast and scalable, but problems:
– Requires experienced software engineers
– Six months of development time
– If one task fails, all tasks fail
MPI version (Rob Egan @JGI): 412 Gb, 4.5B reads in 2.7 hours on 128x24 cores on a NERSC supercomputer
12. Challenges in application
• 2-3 orders of magnitude slower than MPI
• IO optimization needed, e.g., reducing data copying
• Some problems do not easily fit into the map/reduce framework, e.g., graph-based algorithms
• Runs on AWS, but costs $$$ if not optimized
13. Addressing big data: Apache Spark
• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
– In-memory computing primitives (illustrated below)
– General computation graphs
• Improves usability through:
– Rich APIs in Java, Scala, Python
– Interactive shell
☐ Scale to big data
☐ Efficient
☐ Easy to develop
☐ Robust?
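To make the in-memory primitives above concrete, here is a minimal sketch (added here, not from the deck; the input path is hypothetical) showing how a cached RDD is reused by several actions without re-reading the input:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("in-memory-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one sequencing read per line.
    val reads = sc.textFile("hdfs:///data/reads.txt")
      .persist(StorageLevel.MEMORY_AND_DISK) // keep in memory, spill to disk if needed

    // Both actions reuse the cached partitions instead of re-reading the
    // file, which is the efficiency win over disk-based MapReduce.
    val nReads = reads.count()
    val nBases = reads.map(_.length.toLong).reduce(_ + _)
    println(s"$nReads reads, $nBases bases")

    spark.stop()
  }
}
```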
14. Goal: Metagenome read clustering
• Clustering reads based on their genome of origin can reduce the metagenome problem to single-genome problems
• Ideally scales up to TB data sizes
15. Algorithm
• Contamination prediction
– Based on highly frequent k-mers (n-grams)
• K-mer generation and filtering
– Generate k-mers from reads
– Filter out low-frequency k-mers (noise)
• Read graph generation and filtering
– Compute edge weights
– Remove low-weight edges
• Graph partition
– Connected components
– Power-iteration clustering
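The following is a hedged Spark sketch of the k-mer, read-graph, and partitioning stages above (contamination prediction is omitted). It is not the presenters' code: the k-mer length, thresholds, input format, and paths are assumptions for illustration, and MLlib's power-iteration clustering is left out in favor of the simpler connected components.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

object ReadClusteringSketch {
  val K = 21             // assumed k-mer length
  val MinKmerCount = 2L  // k-mers seen fewer times are treated as noise
  val MinEdgeWeight = 3L // read pairs must share at least this many k-mers

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-clustering-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: tab-separated (readId, sequence) pairs.
    val reads = sc.textFile("hdfs:///data/reads.tsv").map { line =>
      val fields = line.split("\t")
      (fields(0).toLong, fields(1))
    }

    // K-mer generation: emit (kmer, readId) for every k-length window.
    val kmerToRead = reads.flatMap { case (id, seq) =>
      seq.sliding(K).map(kmer => (kmer, id))
    }

    // Filter out low-frequency k-mers (likely sequencing errors).
    val frequentKmers = kmerToRead
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .filter { case (_, n) => n >= MinKmerCount }

    // Keep only (kmer, readId) pairs whose k-mer survived filtering.
    val filtered = kmerToRead.join(frequentKmers).map {
      case (kmer, (readId, _)) => (kmer, readId)
    }

    // Read graph generation: connect reads that share a k-mer; edge weight
    // is the number of shared k-mers; drop low-weight edges. NOTE: this
    // groupByKey is the shuffle-heavy step that later slides identify as
    // the scaling bottleneck.
    val edges = filtered.groupByKey().flatMap { case (_, readIds) =>
      val ids = readIds.toSeq.distinct.sorted
      for (i <- ids.indices; j <- (i + 1) until ids.size)
        yield ((ids(i), ids(j)), 1L)
    }.reduceByKey(_ + _)
      .filter { case (_, w) => w >= MinEdgeWeight }
      .map { case ((src, dst), w) => Edge(src, dst, w) }

    // Graph partition: connected components give each read a cluster id.
    val graph = Graph.fromEdges(edges, defaultValue = 0L)
    graph.connectedComponents().vertices.saveAsTextFile("hdfs:///out/read_clusters")

    spark.stop()
  }
}
```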
16. Platforms we run Spark on
• Standalone Spark on a single large-memory server
• On-demand Spark cluster over HPC
• AWS Elastic MapReduce (EMR)
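The same code runs on all three platforms; only the master URL changes. A hedged sketch for the single large-memory server case (the master URL and app name are examples, not the presenters' configuration):

```scala
import org.apache.spark.sql.SparkSession

object LocalServerSession {
  def main(args: Array[String]): Unit = {
    // On one big-memory server, Spark can run in local mode, using all
    // cores as worker threads ("local[*]"); on a standalone cluster or
    // EMR the master URL would differ (e.g. "spark://host:7077", "yarn").
    val spark = SparkSession.builder()
      .appName("local-server")
      .master("local[*]")
      .getOrCreate()

    println(s"Running with ${spark.sparkContext.defaultParallelism} threads")
    spark.stop()
  }
}
```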
17. Testing the accuracy of the algorithm with a small toy test dataset
• Species:
– 6 bacterial species (10 kb from each)
– Synthetic communities with random proportions of each genome; reads drawn from single-genome sequencing projects (noisy)
– Ideal situation (no shared sequences between genomes, sufficient sequencing coverage)
[Figure: reads of the same color belong to the same genome]
18. Real-world datasets

Dataset                                    Number of species   Sampling depth
Soil metagenome                            High                Low
Cow rumen metagenome                       Medium              Low to medium
Maize transcriptome ("fake metagenome")    Low                 High
19. Data grows during analysis
[Chart: intermediate data sizes for k-mers, the full read graph, and the filtered graph]
The graph is ~200x larger than the input data
21. Tuning parallelism for performance
Optimizing parallelism parameters can reduce total running time from >90 min to ~20 min
[Chart: effect of decreasing partition size for graph construction]
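For context, the partition count is exposed directly on Spark's shuffle operators. This is an illustrative fragment added here (the value 4000, the paths, and the input format are made up), not the tuning the presenters used:

```scala
import org.apache.spark.sql.SparkSession

object PartitionTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-tuning").getOrCreate()
    val sc = spark.sparkContext
    val numPartitions = 4000 // example value; more partitions -> smaller partitions

    // Hypothetical input: "kmer \t readId" per line.
    val pairs = sc.textFile("hdfs:///data/kmer_read_pairs.tsv")
      .map { line => (line.split("\t")(0), 1L) }

    // Shuffle operators accept an explicit partition count, which controls
    // per-partition size during graph construction...
    val counts = pairs.reduceByKey(_ + _, numPartitions)

    // ...and the cluster-wide default can be set at submit time:
    //   spark-submit --conf spark.default.parallelism=4000 ...
    counts.saveAsTextFile("hdfs:///out/kmer_counts")
    spark.stop()
  }
}
```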
22. Scales well on small data
[Chart: walltime on 50 r3.xlarge instances for 1, 2, 4, and 8 GB inputs, broken down into contamination prediction, k-mer generation and filtering, read graph generation and filtering, connected component computation, and total elapsed time]
23. Performance over different datasets
Data complexity is a big driving factor for compute time: three 20 GB datasets took from <0.5 hours to ~5 hours.
24. Cost on AWS EMR spot instances
Measured on 20 GB datasets; projected cost for a typical 1 TB dataset: ~$500
25. Scaling up to 100 GB: failed
[Figure: failures during read graph generation/filtering, and the effect of reducing partition numbers for graph partitioning]
26. Potential solutions
• Avoid shuffling:
– Generate the graph, save it to disk, then merge partitions outside of Spark (sketched below)
• Size-specific parameters:
– Larger datasets may not work with parallelism parameters optimized for smaller ones
• Your inputs…
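A minimal sketch of the first idea, assuming the unmerged edge list already exists; the paths, format, and the external merge step are illustrative assumptions, not the presenters' implementation:

```scala
import org.apache.spark.sql.SparkSession

object SpillGraphToDisk {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spill-graph").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical unmerged edges: "src \t dst \t weight" per line.
    val edges = sc.textFile("hdfs:///data/edges_unmerged.tsv")

    // Write one output file per partition with no shuffle inside Spark;
    // duplicate edges are then merged outside Spark, e.g. with an
    // external sort/aggregate pipeline such as:
    //   hdfs dfs -cat /out/edges/part-* | sort | <aggregate weights> > edges.tsv
    edges.saveAsTextFile("hdfs:///out/edges")

    spark.stop()
  }
}
```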
27. Overall impression of Spark
✓ Easy to develop
– Scala/Python API
– Databricks notebook
✓ Efficient
– Much, much faster than Hadoop
– High cluster utilization rate
? Robust
– Platform dependent
– 30% failure rate on AWS
? Scale
– Problem specific
– Intermediate data size may change during a run
– Problem complexity may grow with scale