Using R on High Performance Computers

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Carregando em…3
×

Confira estes a seguir

1 de 20 Anúncio

Mais Conteúdo rRelacionado

Diapositivos para si (20)

Semelhante a Using R on High Performance Computers (20)

Anúncio

Mais recentes (20)

Using R on High Performance Computers

1. using R on High Performance Computers: an overview by Dave Hiltbrand
2. talking points
   ● why HPC?
   ● R environment tips
   ● staging R scripts for HPC
   ● purrr::map functions
3. what to do if the computation is too big for your desktop/laptop?
   • a common user question: "I have an existing R pipeline for my research work, but the data is growing too big. Now my R program runs for days (or weeks), or simply runs out of memory."
   • 3 strategies:
     – move to bigger hardware
     – use advanced libraries / C++
     – implement the code using parallel packages
4. trends in HPC
   ➔ processors are not getting faster
   ➔ to increase performance, vendors cram more cores onto each chip
   ➔ that requires reducing clock speed (power + heat)
   ➔ single-threaded applications will run SLOWER on these new resources, so you must start thinking in parallel
   https://www.quora.com/Why-havent-CPU-clock-speeds-increased-in-the-last-5-years
5. strategy 1: powerful hardware
   Stampede2 - HPC
   ● KNL: 68 cores (4x hyperthreading = 272 threads), 96 GB mem, 4,200 nodes
   ● SKX: 48 cores (2x hyperthreading = 96 threads), 192 GB mem, 1,736 nodes
   Maverick - Vis
   ● vis queue: 20 cores, 256 GB mem, 132 nodes
     ○ RStudio / Jupyter Notebooks
   ● gpu queue: 132 NVIDIA Tesla K40 GPUs
   Wrangler - Data
   ● Hadoop / Spark
   ● reservations last up to a month
6. allocations
   ● open to the national researcher community (do you work in industry?)
   ● XSEDE: the national organization providing computation resources; it allocates ~90% of the cycles on Stampede2
   tip: if you need more power, all you have to do is ask
   https://portal.xsede.org/allocations/resource-info
7. HPCs are:
   ➔ typically run with Linux
   ➔ more command-line driven
   ➔ daunting to Windows-only users
   ➔ RStudio helps the transition
8. login nodes
   ➔ always log into the login nodes
   ➔ shared nodes with limited resources
   ➔ ok to edit, compile, and move files
   ➔ for R, it is ok to install packages from the login nodes (see the sketch after this slide)
   ➔ !!! don't run R scripts !!!
   compute nodes
   ➔ dedicated nodes for each job
   ➔ only accessible via the job scheduler
   ➔ once you have a job running on a node, you can ssh into that node
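a minimal sketch of the install step on a login node; the library path falls back to R's per-user default, and the package name is only an example:

   # run in R on a login node: installation only, no heavy computation
   user_lib <- Sys.getenv("R_LIBS_USER", file.path(Sys.getenv("HOME"), "R", "library"))
   dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)
   install.packages("data.table", lib = user_lib,          # example package
                    repos = "https://cloud.r-project.org")
   .libPaths(user_lib)                                     # make the library visible in this session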
9. access
   R command line
   ● useful for installing packages on the login nodes
   ● with interactive development jobs you can request compute resources, log in straight to a compute node, and use R via the command line
   RStudio
   ● availability depends on the structure of the HPC cluster
   ● at TACC the window to use RStudio through the visualization portal is only 4 hours
   batch jobs
   ● the best method for using R on HPCs
   ● relies on a job scheduler to fill your request
   ● can run multiple R scripts on multiple compute nodes
10. sample batch script
   #!/bin/bash
   #----------------------------------------------------
   #SBATCH -J myjob                        # Job name
   #SBATCH -o myjob.o%j                    # Name of stdout output file
   #SBATCH -e myjob.e%j                    # Name of stderr error file
   #SBATCH -p skx-normal                   # Queue (partition) name
   #SBATCH -N 1                            # Total # of nodes (must be 1 for serial)
   #SBATCH -n 1                            # Total # of mpi tasks (should be 1 for serial)
   #SBATCH -t 01:30:00                     # Run time (hh:mm:ss)
   #SBATCH --mail-user=myname@myschool.edu
   #SBATCH --mail-type=all                 # Send email at begin and end of job
   #SBATCH -A myproject                    # Allocation name (req'd if you have more than 1)

   # Other commands must follow all #SBATCH directives...
   module list
   pwd
   date

   # Launch serial code (stdout to output.Rout, stderr to error.Rerr)...
   Rscript ./my_analysis.R > output.Rout 2> error.Rerr
   #----------------------------------------------------
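for completeness, a minimal sketch of what the my_analysis.R launched above might contain; the input file name and the lm() call are placeholders, not part of the original deck:

   # my_analysis.R -- hypothetical contents of the script launched by the batch job
   args   <- commandArgs(trailingOnly = TRUE)       # optional arguments after the script name
   infile <- if (length(args) >= 1) args[[1]] else "input.csv"

   dat <- read.csv(infile)                          # data already staged on the cluster
   fit <- lm(y ~ x, data = dat)                     # placeholder analysis

   saveRDS(fit, "fit.rds")                          # leave an artifact on disk
   print(summary(fit))                              # goes to output.Rout via the redirect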
11. .libPaths() and .Rprofile
   using your Rprofile.site or .Rprofile files along with the .libPaths() command allows you to install packages in your user folder and have them load when you start R on the HPC.
   in R, a library is the location on disk where you install your packages. R creates a different library for each dot-version of R itself.
   when R starts, it performs a series of steps to initialize the session. you can modify the startup sequence by changing the contents of a number of files; the following sequence is somewhat simplified:
   ● first, R reads the file Rprofile.site in the R_HOME/etc folder, where R_HOME is the location where you installed R.
     ○ for example, on Windows this file could live at C:\R\R-3.2.2\etc\Rprofile.site.
     ○ making changes to this file affects all R sessions that use this version of R.
     ○ this might be a good place to define your preferred CRAN mirror, for example.
   ● next, R reads the file ~/.Rprofile in the user's home folder.
   ● lastly, R reads the file .Rprofile in the project folder.
   tip: i like to make a .Rprofile for each GitHub project repo which loads my most commonly used libraries by default (see the sketch after this slide).
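a minimal sketch of a user-level ~/.Rprofile along these lines; the library path and the tidyverse default are assumptions, so adjust them to your cluster and habits:

   ## ~/.Rprofile -- hypothetical example
   local({
     # per-user library keyed to the running version of R
     user_lib <- file.path(Sys.getenv("HOME"), "R", "library",
                           paste(R.version$major, R.version$minor, sep = "."))
     dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)
     .libPaths(c(user_lib, .libPaths()))

     # default CRAN mirror so install.packages() works non-interactively
     options(repos = c(CRAN = "https://cloud.r-project.org"))
   })

   # load commonly used libraries, but only in interactive sessions
   if (interactive()) suppressMessages(require(tidyverse))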
12. going parallel
   often you need to convert your code into parallel form to get the most out of HPC. the foreach and doMC packages let you convert loops from sequential to parallel operation. you can even use multiple nodes, if you have a really complex data set, with the snow package.

   require( foreach )
   require( doMC )
   registerDoMC( cores = 4 )   # register the backend before %dopar%
   result <- foreach( i = 1:10, .combine = c ) %dopar% { myProc() }

   require( foreach )
   require( doSNOW )
   # Get backend hostnames
   hostnames <- scan( "nodelist.txt", what = "", sep = "\n" )
   # Set reps to match core count
   num.cores <- 4
   hostnames <- rep( hostnames, each = num.cores )
   cluster <- makeSOCKcluster( hostnames )
   registerDoSNOW( cluster )
   result <- foreach( i = 1:10, .combine = c ) %dopar% { myProc() }
   stopCluster( cluster )
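a runnable single-node sketch of the doMC pattern above; myProc() here is a toy stand-in for your own computation, and reading the core count from SLURM_CPUS_ON_NODE assumes a SLURM scheduler:

   library(foreach)
   library(doMC)
   library(parallel)

   # toy stand-in for an expensive computation
   myProc <- function(size = 1e6) mean(rnorm(size))

   # use the cores the scheduler granted, falling back to what the machine reports
   n_cores <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE", parallel::detectCores()))
   registerDoMC(cores = n_cores)

   result <- foreach(i = 1:10, .combine = c) %dopar% myProc()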
13. profiling
   ➔ simple procedure checks with the tictoc package
   ➔ use more advanced packages like microbenchmark to compare multiple procedures
   ➔ for an easy-to-read graphic output, use the profvis package to create flame graphs
   checkpointing
   ➔ when writing your script, think about procedure runtime
   ➔ you can save objects in your workflow as checkpoints
     ◆ library(readr)
     ◆ write_rds(obj, "obj.rds")
   ➔ if you want to run post hoc analyses, it is easier to have all the parts on disk (see the sketch after this slide)
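a small sketch combining the two ideas, timing one step with tictoc and checkpointing the result with readr; the lm() on mtcars is only a placeholder for a long-running procedure:

   library(tictoc)
   library(readr)

   tic("model fit")                          # start a timer around the slow step
   fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder for your long-running procedure
   toc()                                     # prints the elapsed time

   write_rds(fit, "fit_checkpoint.rds")      # checkpoint the finished object
   fit <- read_rds("fit_checkpoint.rds")     # a later session can pick up from here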
14. always start small
   build a toy dataset ("i'm quick"): find your typos, and it is easier to rerun
   run the real data ("i'm slow"): request the right resources
   once you have run the small dataset, you can benchmark the resources the real run needs
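a sketch of the idea, assuming a hypothetical full_data: subsample it, time the pipeline on the slice, then size the real resource request from that benchmark:

   set.seed(1)
   full_data <- data.frame(x = rnorm(1e6), y = rnorm(1e6))   # stand-in for the real data
   toy_data  <- full_data[sample(nrow(full_data), 1000), ]   # 1,000-row toy dataset

   # time the pipeline on the toy data first, then scale the request for the full run
   system.time(fit <- lm(y ~ x, data = toy_data))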
15. if you don't already, you need to Git
   Git is a command-line tool, but the center around which all things involving Git revolve is the hub, GitHub.com, where developers store their projects and network with like-minded people.
   use RStudio and all its advanced IDE tools on your local machine, then push and pull to GitHub to run your job. RStudio features built-in version control (VCS) support.
   track changes in your analysis: git lets you go back in time to a previous version of your file.
16. the purrr package
   map functions apply a function iteratively to each element of a list or vector
17. the purrr map functions are an optional replacement for the lapply functions. they are not technically faster (although the speed difference is on the order of nanoseconds). the main advantage is a uniform syntax shared with other tidyverse packages such as dplyr, tidyr, readr, and stringr, as well as the helper functions.
   map( .x, .f, ... )
   map( vector_or_list_input, function_to_apply, optional_other_stuff )
   modify( .x, .f, ... )
   ex. modify_at( my.data, 1:5, as.numeric )
   https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf
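a few runnable examples of the syntax above; the small data frame is only for illustration:

   library(purrr)

   map(1:3, ~ .x^2)                 # returns a list: 1, 4, 9
   map_dbl(1:3, ~ .x^2)             # typed variant returns a numeric vector

   # modify_at() keeps the original structure, changing only the selected elements
   df <- data.frame(a = c("1", "2"), b = c("3", "4"), c = c("x", "y"),
                    stringsAsFactors = FALSE)
   modify_at(df, 1:2, as.numeric)   # columns 1 and 2 become numeric, c stays character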
18. map in parallel
   another key advantage of purrr is its support for lambda (anonymous) functions, which has been crucial for analyses involving multiple columns of a data frame. using the same basic syntax, we create an anonymous function that maps over many lists simultaneously.

   my.data %<>% mutate( var5 = map2_dbl( .$var3, .$var4, ~ ( .x + .y ) / 2 ))
   my.data %<>% mutate( var6 = pmap_dbl( list( .$var3, .$var4, .$var5 ), ~ ( ..1 + ..2 + ..3 ) / 3 ))

   tip: combining the grammars of graphics, data, and lists through tidyverse packages builds a strong workflow
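the same calls in a self-contained form, using a toy tibble and the standard %>% pipe instead of %<>%; the column names simply mirror the slide:

   library(dplyr)
   library(purrr)

   my.data <- tibble(var3 = c(1, 2, 3), var4 = c(4, 5, 6))   # toy stand-in

   my.data <- my.data %>%
     mutate(var5 = map2_dbl(var3, var4, ~ (.x + .y) / 2)) %>%                   # two inputs in lockstep
     mutate(var6 = pmap_dbl(list(var3, var4, var5), ~ (..1 + ..2 + ..3) / 3))   # any number of inputs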
19. closing
   unburden your personal device
   ➔ learn the basic linux cli: using batch job submissions gives you the most flexibility
   ➔ profile / checkpoint / test: resources are not without limits
   ➔ share your code: don't hold onto code until it's perfect. use GitHub and get feedback early and often
20. $ questions -h
   refs:
   1. https://jennybc.github.io/purrr-tutorial/
   2. https://portal.tacc.utexas.edu/user-guides/stampede2#running-jobs-on-the-stampede2-compute-nodes
   3. https://learn.tacc.utexas.edu/mod/page/view.php?id=24
   4. http://blog.revolutionanalytics.com/2015/11/r-projects.html
