The document discusses using Hadoop for scientific workloads and summarizes early results from benchmarking Hadoop. It explores using Hadoop and MapReduce for data-intensive scientific applications like BLAST sequence analysis. Performance results show that Hadoop can provide comparable performance to existing parallel file systems. Challenges include lack of turn-key solutions, managing data formats, and performance tuning. The research aims to understand the unique needs of science clouds and how to effectively support data-intensive scientific applications on cloud platforms.
HADOOP-SCIENCE
5. Magellan Cloud at NERSC: 720 nodes, 5760 cores in 9 Scalable Units (SUs), 61.9 Teraflops. SU = IBM iDataplex rack with 640 Intel Nehalem cores. 8G FC; 10G Ethernet; 14 I/O nodes (shared); 18 login/network nodes; 1 Petabyte with GPFS. [Diagram labels: nine SUs, Load Balancer, I/O nodes, NERSC Global Filesystem, Network/Login nodes, QDR IB Fabric, HPSS (15PB), Internet, 100-G Router, ANI]
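The headline figures on the Magellan slide are mutually consistent, which a quick check makes explicit. The sketch below uses only numbers stated on the slide; the variable names are illustrative, and the per-core figure is simply peak Teraflops divided by core count, not a measured value.

```python
# Figures as stated on the Magellan slide; names are illustrative.
TOTAL_NODES = 720
TOTAL_CORES = 5760
SCALABLE_UNITS = 9
CORES_PER_SU = 640
PEAK_TFLOPS = 61.9

# Derived quantities (simple division, no measured data involved).
cores_per_node = TOTAL_CORES // TOTAL_NODES
nodes_per_su = TOTAL_NODES // SCALABLE_UNITS
peak_gflops_per_core = PEAK_TFLOPS * 1000 / TOTAL_CORES

# Nine SUs of 640 cores each should account for every core.
assert SCALABLE_UNITS * CORES_PER_SU == TOTAL_CORES

print(f"{cores_per_node} cores/node, {nodes_per_su} nodes/SU, "
      f"{peak_gflops_per_core:.2f} peak GFlops/core")
```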
Range of application classes with different models. Part of the IMG family of systems hosted at the DOE Joint Genome Institute (JGI); supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. Data pipeline, task-parallel workflow, and image-matching algorithms that should work. Might be heavy on the I/O side, but the other advantages might outweigh the performance cost. Data integration challenges: ~35 science data products, including atmospheric and land products; products are in different projections, resolutions (spatial and temporal), and times; data volume and processing requirements exceed desktop capacity.
There is a huge spectrum of scientific applications at LBL: high-energy physics, eco-sciences, bioinformatics. These have a varied set of requirements and a need for unlimited compute cycles and data storage. NERSC and IT provide infrastructure and resources for these applications. Other groups in CRD work closely with the scientists to explore and develop the user interfaces, middleware, grid tools, data support, and monitoring infrastructure tools required to facilitate scientific exploration. Cloud computing brings in a new resource model of delivering “on-demand cycles at a cost” and a new set of programming models and tools. Many groups at LBL are interested in seeing how the different features of cloud computing would help them in their scientific explorations. In general, we need to explore the big question of how we work closely with scientists to deliver a more diverse set of services that do not target only the traditional HPC applications. Can cloud computing make it easier for us to do what we have traditionally been doing? Help us do things differently than before? Bring other users in?
Why Hadoop? What are the implications?
Part of the IMG family of systems hosted at the DOE Joint Genome Institute (JGI). Supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. Content maintenance consists of two tasks. (1) Integrating new metagenome datasets with the reference genomes every 2-4 weeks, which involves running BLAST to identify pair-wise gene similarities between the new metagenomes and the reference genomes. (2) Updating the reference genome baseline with new (~500) genomes every 4 months, which involves running BLAST to refresh pair-wise gene similarities between reference genomes, and between metagenomes and reference genomes; this takes about 3 weeks on a Linux cluster with 256 cores. The take-away point is that the databases are growing and BLAST is used heavily in the pipeline.
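Because each query gene can be compared against the reference set independently, one common way to map this kind of BLAST workload onto a task-parallel or MapReduce framework is to split the query sequences into chunks and hand each chunk to a separate task. The sketch below illustrates only that splitting step. It is an assumption about how such a split could look, not the IMG pipeline's actual code; `parse_fasta`, `split_fasta`, and the chunk size are hypothetical names and values, and no BLAST invocation is shown.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', concatenated sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    seq: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into chunks of at most chunk_size; each chunk is one task's input."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


# Toy input with made-up gene names, for illustration only.
example = ">g1\nACGT\nAC\n>g2\nTTTT\n>g3\nGG\n".splitlines()
for i, chunk in enumerate(split_fasta(parse_fasta(example), chunk_size=2)):
    print(i, [name for name, _ in chunk])
```

Chunk size trades scheduling overhead against load balance: smaller chunks spread work more evenly but create more tasks.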
Hard limits in the Hadoop configuration (3 GB ulimit, but the DB is larger than 3 GB). Thrashing occurs because the DB does not fit in available memory: in the first iteration the job took 3.5 to 4.5 hrs to finish, whereas a run with 80% of the DB, which fits into memory, takes half the time. Hadoop does not guarantee simultaneous availability of resources, so time to solution is hard to predict.
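One way to reason about the memory limit is to ask how many pieces a database would have to be cut into for each piece to fit under the per-task limit. The notes do not say that the database was partitioned, so the sketch below is purely illustrative: `shards_needed`, the `headroom` fraction, and the 4 GiB database size are assumptions. Only the 3 GB limit comes from the notes.

```python
import math

GIB = 1024 ** 3


def shards_needed(db_bytes: int, limit_bytes: int, headroom: float = 0.9) -> int:
    """Minimum number of equal shards so each fits within headroom * limit.

    headroom is a hypothetical safety factor that leaves memory for the
    task's own working set; it is not a Hadoop setting.
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable = int(limit_bytes * headroom)
    if usable <= 0:
        raise ValueError("limit_bytes is too small")
    return max(1, math.ceil(db_bytes / usable))


# Hypothetical 4 GiB database against the 3 GB per-task limit from the notes.
print(shards_needed(4 * GIB, 3 * GIB))
```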
Here are the different features of cloud computing; each is attractive to a class of users. a. Who doesn’t want free cycles? The on-demand aspect is appealing: getting 10 CPUs for 1 hr now or 5 CPUs for 2 hrs has the same cost. Combined with not having to wait for CPUs, this is very attractive for batch queue users. b. Virtual environments, which now seem commonplace, tend to impose some overhead, but for large parametric studies such as BLAST the overhead might be acceptable. c. Users bear the brunt of OS and software upgrades; for example, the Supernova Factory has a code base that works only on 32-bit systems, and as 64-bit systems become more commonplace they are restricted in where they can run. d. Science problems are exceeding current systems.
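The cost point in (a) follows from billing by CPU-hours rather than by wall-clock time. A minimal sketch, assuming a flat per-CPU-hour billing model (the function name is illustrative and no real price is implied):

```python
def cpu_hours(cpus: int, hours: float) -> float:
    """CPU-hours consumed, assuming flat per-CPU-hour billing."""
    return cpus * hours


wide_and_short = cpu_hours(10, 1)   # 10 CPUs for 1 hr
narrow_and_long = cpu_hours(5, 2)   # 5 CPUs for 2 hrs

# Same CPU-hours, so the same cost under this billing assumption.
print(wide_and_short == narrow_and_long)  # True
```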