HADOOP-SCIENCE


The document discusses using Hadoop for scientific workloads and summarizes early results from benchmarking Hadoop. It explores using Hadoop and MapReduce for data-intensive scientific applications like BLAST sequence analysis. Performance results show that Hadoop can provide comparable performance to existing parallel file systems. Challenges include lack of turn-key solutions, managing data formats, and performance tuning. The research aims to understand the unique needs of science clouds and how to effectively support data-intensive scientific applications on cloud platforms.


Hadoop for Scientific Workloads
Lawrence Berkeley National Lab

Example Scientific Applications

Supporting Science at LBL
Diagram labels: Scientists; HPC and IT resources; user interfaces, grid middleware, workflow tools, data management, etc.

Magellan – Exploring Cloud Computing

Magellan Cloud at NERSC
720 nodes, 5760 cores in 9 Scalable Units (SUs); 61.9 Teraflops. SU = IBM iDataplex rack with 640 Intel Nehalem cores. 14 I/O nodes (shared). 18 login/network nodes. 1 Petabyte with GPFS.
Diagram labels: QDR IB fabric; 8G FC; 10G Ethernet; load balancer; NERSC Global Filesystem; HPSS (15PB); Internet; 100-G router; ANI.
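The node and core counts quoted for Magellan are mutually consistent, and a short check makes that explicit. This is only a sketch: the per-node core count and the nodes-per-SU figure are derived here and are not stated on the slide, and the variable names are hypothetical.

```python
# Figures quoted on the slide.
TOTAL_NODES = 720
TOTAL_CORES = 5760
SCALABLE_UNITS = 9
CORES_PER_SU = 640

# Derived values (assumptions inferred from the figures above, not stated on the slide).
cores_per_node = TOTAL_CORES // TOTAL_NODES    # cores in each node
nodes_per_su = TOTAL_NODES // SCALABLE_UNITS   # nodes in each Scalable Unit

# The per-SU and whole-system figures should agree with each other.
assert SCALABLE_UNITS * CORES_PER_SU == TOTAL_CORES
assert nodes_per_su * cores_per_node == CORES_PER_SU

print(cores_per_node, nodes_per_su)
```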

Magellan Research Agenda

Hadoop for Science

Hadoop Benchmarking: Early Results

IMG Systems: Genome & Metagenome Data Flow
Diagram labels: 5,115 genomes, 6.5 Mil genes; 65 samples (21 studies), IMG +2.6 Mil genes, 9.1 Mil total; +287 samples (~105 studies), +12.5 Mil genes, 19 Mil genes. Update cadences shown: monthly, every 4 months, on demand.

BLAST on Hadoop

Hardware Platforms

BLAST on Yahoo! M45 Hadoop

HBase for Metagenomics

Magellan Application: De-novo assembly
Private/public cloud. Memory requirements: ~500 GB (de Bruijn graph). CPU hours (single assembly): velveth: ~23h, velvetg: ~21h. Source: Karan Bhatia

Summary

Acknowledgements

Cloud Usage Model

NERSC Magellan Software Strategy
Diagram label: ANI Magellan Cluster

BLAST Performance


Editor's Notes

Range of application classes with different models. Part of the IMG family of systems hosted at the DOE Joint Genome Institute (JGI); supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. Data pipeline, task-parallel workflow, image matching algorithms that should work. Might be heavy on the I/O side, but the other advantages might outweigh the performance cost. Data integration challenges: ~35 science data products, including atmospheric and land products; products are in different projections, resolutions (spatial and temporal), and times; data volume and processing requirements exceed desktop capacity.
There is a huge spectrum of scientific applications at LBL: high energy physics, eco-sciences, and bioinformatics. These have varied requirements and a need for unlimited compute cycles and data storage. NERSC and IT provide infrastructure and resources for these applications. Other groups in CRD work closely with the scientists to explore and develop the user interfaces, middleware, grid tools, data support, and monitoring infrastructure required to facilitate scientific exploration. Cloud computing brings a new resource model of delivering “on-demand cycles at a cost” and a new set of programming models and tools. Many groups at LBL are interested in seeing how the different features of cloud computing would help them in their scientific explorations. In general, we need to explore the big question of how to work closely with scientists to deliver a more diverse set of services that do not target only traditional HPC applications. Can cloud make it easier for us to do what we have traditionally been doing? Help us do things differently than before? Bring other users in?
Why Hadoop? What are the implications?
Part of the IMG family of systems hosted at the DOE Joint Genome Institute (JGI); supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. Content maintenance consists of two tasks. First, new metagenome datasets are integrated with the reference genomes every 2-4 weeks, which involves running BLAST to identify pair-wise gene similarities between the new metagenome and the reference genomes. Second, the reference genome baseline is updated with new (~500) genomes every 4 months, which involves running BLAST to refresh pair-wise gene similarities between reference genomes, and between metagenome and reference genomes; this takes about 3 weeks on a Linux cluster with 256 cores. The take-away point is that the databases are growing and BLAST is used heavily in the pipeline.
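Pair-wise BLAST comparisons of this kind are task parallel: the query genes can be divided into independent chunks, one per map task. The sketch below shows only that splitting step, assuming FASTA-formatted input; the function names are hypothetical, and the notes do not say how the queries were actually partitioned in the Hadoop runs.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">") and record:
            # A new header closes the record collected so far.
            yield "\n".join(record)
            record = []
        record.append(line)
    if record:
        yield "\n".join(record)


def split_into_chunks(records: Iterable[str], n_chunks: int) -> List[List[str]]:
    """Distribute records round-robin so each chunk can feed one task."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return chunks


# Tiny made-up input: three short records split across two tasks.
example = [">gene1", "MKT", "AAV", ">gene2", "MLS", ">gene3", "MQR"]
chunks = split_into_chunks(read_fasta_records(example), 2)
```

Round-robin assignment keeps the chunks close to equal in record count, but not necessarily in sequence length, so per-task run time can still vary.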
Hard limits in the Hadoop config (3GB ulimit, but DB > 3GB). Thrashing occurs when the DB does not fit in available memory: in the first iteration the job took 3.5 to 4.5 hrs to finish, but with 80% of the DB, which fits into memory, the job takes half the time. Hadoop does not guarantee simultaneous availability of resources, so time to solution is hard to predict.
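The memory constraint in this note reduces to a simple question: does the database fit under the per-task limit, and if not, how many equal pieces would? The sketch below computes that, assuming sizes are given in bytes. The function name is hypothetical, and the notes do not say whether the database was actually split in these runs.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def partitions_needed(db_bytes: int, limit_bytes: int) -> int:
    """Smallest number of equal partitions such that each fits within the limit."""
    if db_bytes < 0 or limit_bytes <= 0:
        raise ValueError("db_bytes must be >= 0 and limit_bytes must be > 0")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-db_bytes // limit_bytes))


# Illustrative sizes only: a 4 GiB database against a 3 GiB limit.
print(partitions_needed(4 * GIB, 3 * GIB))
```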
Here are the different features of cloud; each attracts a class of users. a. Who doesn't want free cycles? The on-demand aspect is also appealing: getting 10 CPUs for 1 hr now or 5 CPUs for 2 hrs has the same cost. Combined with not having to wait for CPUs, this is very attractive for batch queue users. b. Virtual environments, which now seem commonplace, tend to impose some overhead, but for large parametric studies such as BLAST the overhead might be acceptable. c. Users bear the brunt of OS and software upgrades; e.g., the Supernova Factory has a code base that works only on 32-bit systems, and as 64-bit systems become more commonplace it is restricted in where it can run. d. Science problems are exceeding current systems.
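The cost equivalence in point a. follows from billing by CPU-hour. A minimal sketch, assuming a flat per-CPU-hour rate (the rate value and function name are hypothetical, and real pricing may not be flat):

```python
def cpu_hour_cost(cpus: int, hours: float, rate_per_cpu_hour: float = 1.0) -> float:
    """Cost under a flat rate: proportional to the CPU-hours consumed."""
    return cpus * hours * rate_per_cpu_hour


# 10 CPUs for 1 hour and 5 CPUs for 2 hours both consume 10 CPU-hours.
wide_and_short = cpu_hour_cost(10, 1)
narrow_and_long = cpu_hour_cost(5, 2)
assert wide_and_short == narrow_and_long
```

Under this model the only difference between the two options is time to solution, which is why the wider, shorter allocation appeals to batch queue users.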
Science gateways; S3 storage; phasing.
