I/O is hard.
Interpreting I/O results is harder.
 Countless techniques for doing I/O operations
    Varying difficulty
    Varying efficiency
 All techniques suffer from the same phenomenon: eventually throughput turns over.
    Limited disk bandwidth
    Limited interconnect bandwidth
    Limited filesystem parallelism
 Respect the limits or suffer the consequences.

[Chart: FPP Write (MB/s), 0–35000 MB/s scale, vs. number of clients, 0–10000]
 Lustre is parallel, not paranormal
 Striping is critical and often overlooked
 Writing many-to-one requires a large stripe count
 Writing many-to-many requires a single stripe

[Chart: many-to-one write performance, 0–16 scale, comparing Default Stripe vs. Stripe All]
 Files inherit the striping of the parent directory
    Input directory must be striped before copying in data
    Output directory must be striped before running
 May also “touch” a file using the lfs command
 An API to stripe a file programmatically is often requested; here’s how to do it.
    Call from only one processor
 New support in xt-mpt for striping hints
    striping_factor
    striping_size

#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <lustre/lustre_user.h>

int open_striped(char *filename, int mode, int stripe_size,
                 int stripe_offset, int stripe_count)
{
  int fd;
  struct lov_user_md opts = {0};

  opts.lmm_magic = LOV_USER_MAGIC;
  opts.lmm_stripe_size = stripe_size;
  opts.lmm_stripe_offset = stripe_offset;
  opts.lmm_stripe_count = stripe_count;

  /* Delay Lustre object creation until the stripe settings are applied */
  fd = open64(filename, O_CREAT | O_EXCL | O_LOV_DELAY_CREATE | mode, 0644);
  if (fd >= 0)
    ioctl(fd, LL_IOC_LOV_SETSTRIPE, &opts);

  return fd;
}
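Striping a directory with lfs, as mentioned above, looks like the sketch below. The flag spellings are those of recent Lustre releases; older versions used positional arguments or -s for the stripe size, so check `lfs help setstripe` on your system. Directory names are placeholders.

```shell
# Stripe a directory before data is copied in or written out;
# files created inside it inherit the layout.
lfs setstripe -c 8 -S 4m output_dir    # stripe count 8, 4 MB stripe size
lfs setstripe -c -1 shared_file_dir    # -1 = stripe across all OSTs
lfs getstripe output_dir               # verify the resulting layout
```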
 We know that large writes perform better, so we buffer
 We can control our buffer size
 We can ALSO control our stripe size
 Misaligning buffer and stripe sizes can hurt your performance

[Diagram: 5 MB buffers laid over 4 MB stripes straddle stripe boundaries, while 4 MB buffers line up one-to-one]
 In order to use O_DIRECT, data buffers must be aligned to page boundaries
    O_DIRECT is rarely a good idea
 Memory alignment can be done by:
    C: posix_memalign instead of malloc
    FORTRAN: over-allocation and the loc function
 Aligning I/O buffers on page boundaries can improve I/O performance.

[Chart: write performance, 0–40 scale, vs. number of clients, 0–15000, comparing Shared vs. Shared Aligned]
 This method is simple to implement and can utilize more OSTs than the 160-OST shared-file limit
 This method is also very stressful on the FS and inconvenient with thousands of clients
    Too many opens at once floods the MDS
    Too many concurrent writers can stress the OSTs
    Too-small writes kill performance
    Too many files stresses users
Performance Results / Open Time

[Charts: file-per-process write performance (0–40 scale) and open time (0–9 scale) vs. number of clients, 0–15000]
 Slightly more difficult to implement than FPP
    still fairly easy
 Generally slightly less efficient than FPP
 More convenient than many files
 Nicer to the MDS? Maybe marginally.
 Still can overload OSTs from many writers
 Try to make sure that two processors don’t need to write to the same stripe
POSIX Shared

fd = open64("test.dat", mode, 0644);
/* Seek to start place for rank */
ierr64 = lseek64(fd, commrank*iosize, SEEK_SET);
remaining = iosize;
/* Write by buffers to the file */
while (remaining > 0)
{
  i = (remaining < buffersize) ? remaining : buffersize;
  /* Copy from data to buffer */
  memcpy(tmpbuf, dbuf, i);
  ierr = write(fd, tmpbuf, i);
  if (ierr >= 0) {
    remaining -= ierr;
    dbuf += ierr;
  } else {
    MPI_Abort(MPI_COMM_WORLD, ierr);
  }
}
close(fd);

Fortran Direct

! Establish sizes
reclength = 8*1024*1024
iosize = reclength * 10
! Starting record for rank
recnum = (iosize * myrank)/reclength
recs = iosize/8
numwords = recs/10

open(fid, file='output/test.dat', &
     status='replace', form='unformatted', &
     access='direct', recl=reclength, iostat=ierr)
! Write a record at a time to the file
do i=1,recs,numwords
  write(fid, rec=recnum, iostat=ierr) writebuf(i:i+numwords-1)
  recnum = recnum + 1
end do
close(fid)
Performance Results / Open Time

[Charts: write performance (0–30 scale) and open time (0–9 scale) vs. number of clients, 0–15000, comparing Shared vs. FPP]
 I/O Scaling Limitations
    Turns over above some number of clients
    Shared files are limited to 160 OSTs, but some filesystems have more
 Can we use this knowledge to improve I/O performance?
 Aggregate I/O via sub-grouping to
    Reduce number of clients using the FS
    Aggregate into larger I/O buffers
    Potentially cover > 160 OSTs via multiple shared files
 We can do this
    Via MPI-IO Collective Buffering
    By hand (many different ways)
 MPI-IO provides a way to handle buffering and grouping behind the scenes
    Advantage: Little or No code changes
    Disadvantage: Little or No knowledge of what’s actually done
 Use Collective file access
    MPI_File_write_all – Specify file view first
    MPI_File_write_at_all – Calculate offset for each write
 Set the cb_* hints
    cb_nodes – number of I/O aggregators
    cb_buffer_size – size of collective buffer
    romio_cb_write – enable/disable collective buffering
 No need to split comms, gather data, etc.
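Putting the pieces above together, the cb_* hints can be attached through an MPI_Info object at file-open time. A minimal sketch, assuming MPI is initialized normally; the file name, hint values, and 4 MB per-rank transfer size are placeholders, not recommendations:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank;
  const MPI_Offset iosize = 4 * 1024 * 1024;  /* bytes per rank */
  char *buf;
  MPI_File fh;
  MPI_Info info;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  buf = malloc((size_t)iosize);

  /* Collective-buffering hints: aggregator count, buffer size, enable */
  MPI_Info_create(&info);
  MPI_Info_set(info, "cb_nodes", "8");
  MPI_Info_set(info, "cb_buffer_size", "4194304");
  MPI_Info_set(info, "romio_cb_write", "enable");

  MPI_File_open(MPI_COMM_WORLD, "test.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

  /* Collective write: every rank calls this, each at its own offset */
  MPI_File_write_at_all(fh, rank * iosize, buf, (int)iosize,
                        MPI_BYTE, MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  MPI_Info_free(&info);
  free(buf);
  MPI_Finalize();
  return 0;
}
```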
Yes! See Mark Pagel’s talk.
 Lose ease-of-use, gain control
 Countless methods to implement
    Simple gathering
    Serialized sends within group
    Write token
    Double buffered
    Bucket brigade
    …
 Look for existing groups in your code
 Even the simplest solutions often seem to work.
    Try to keep the pipeline full
    Always be doing I/O

“I find your lack of faith in ROMIO disturbing.”

 Now we can think about multiple shared files!
 Every code uses these very differently
 Follow as many of the same rules as possible
 It is very possible to get good results, but also possible to get bad ones
 Because Parallel HDF5 is written over MPI-IO, it’s possible to use hints

[Chart: IOR shared-file writes, 0–8000 MB/s vs. 0–10000 clients, comparing POSIX, MPIIO, and HDF5]
Related CUG Talks/Papers

 Performance Characteristics of the Lustre File System on the Cray XT5 with Regard to Application I/O Patterns, Lonnie Crosby
 Petascale I/O Using The Adaptable I/O System, Jay Lofstead, Scott Klasky, et al.
 Scaling MPT and Other Features, Mark Pagel
 MPI-IO Whitepaper, David Knaak, ftp://ftp.cray.com/pub/pe/download/MPI-IO_White_Paper.pdf

Thank You

 Lonnie Crosby, UT/NICS
 Mark Fahey, UT/NICS
 Scott Klasky, ORNL/NCCS
 Mike Booth, Lustre COE
 Galen Shipman, ORNL
 David Knaak, Cray
 Mark Pagel, Cray
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)

More Related Content

Similar to Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)

Replication Tips & Tricks
Replication Tips & TricksReplication Tips & Tricks
Replication Tips & TricksMats Kindahl
 
05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR mattersAlexandre Moneger
 
When To Use Ruby On Rails
When To Use Ruby On RailsWhen To Use Ruby On Rails
When To Use Ruby On Railsdosire
 
Perl Memory Use 201207 (OUTDATED, see 201209 )
Perl Memory Use 201207 (OUTDATED, see 201209 )Perl Memory Use 201207 (OUTDATED, see 201209 )
Perl Memory Use 201207 (OUTDATED, see 201209 )Tim Bunce
 
DPC2007 PHP And Oracle (Kuassi Mensah)
DPC2007 PHP And Oracle (Kuassi Mensah)DPC2007 PHP And Oracle (Kuassi Mensah)
DPC2007 PHP And Oracle (Kuassi Mensah)dpc
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Centralized Logging with syslog
Centralized Logging with syslogCentralized Logging with syslog
Centralized Logging with syslogamiable_indian
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningFromDual GmbH
 
RubyEnRails2007 - Dr Nic Williams - Keynote
RubyEnRails2007 - Dr Nic Williams - KeynoteRubyEnRails2007 - Dr Nic Williams - Keynote
RubyEnRails2007 - Dr Nic Williams - KeynoteDr Nic Williams
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Replication Tips & Trick for SMUG
Replication Tips & Trick for SMUGReplication Tips & Trick for SMUG
Replication Tips & Trick for SMUGMats Kindahl
 
All About Storeconfigs
All About StoreconfigsAll About Storeconfigs
All About StoreconfigsBrice Figureau
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Ontico
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
OSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyOSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyNETWAYS
 
Caching and tuning fun for high scalability @ 4Developers
Caching and tuning fun for high scalability @ 4DevelopersCaching and tuning fun for high scalability @ 4Developers
Caching and tuning fun for high scalability @ 4DevelopersWim Godden
 

Similar to Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009) (20)

Replication Tips & Tricks
Replication Tips & TricksReplication Tips & Tricks
Replication Tips & Tricks
 
05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters
 
When To Use Ruby On Rails
When To Use Ruby On RailsWhen To Use Ruby On Rails
When To Use Ruby On Rails
 
Perl Memory Use 201207 (OUTDATED, see 201209 )
Perl Memory Use 201207 (OUTDATED, see 201209 )Perl Memory Use 201207 (OUTDATED, see 201209 )
Perl Memory Use 201207 (OUTDATED, see 201209 )
 
DPC2007 PHP And Oracle (Kuassi Mensah)
DPC2007 PHP And Oracle (Kuassi Mensah)DPC2007 PHP And Oracle (Kuassi Mensah)
DPC2007 PHP And Oracle (Kuassi Mensah)
 
Os Edwards
Os EdwardsOs Edwards
Os Edwards
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Centralized Logging with syslog
Centralized Logging with syslogCentralized Logging with syslog
Centralized Logging with syslog
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL Tuning
 
RubyEnRails2007 - Dr Nic Williams - Keynote
RubyEnRails2007 - Dr Nic Williams - KeynoteRubyEnRails2007 - Dr Nic Williams - Keynote
RubyEnRails2007 - Dr Nic Williams - Keynote
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Vidoop CouchDB Talk
Vidoop CouchDB TalkVidoop CouchDB Talk
Vidoop CouchDB Talk
 
Replication Tips & Trick for SMUG
Replication Tips & Trick for SMUGReplication Tips & Trick for SMUG
Replication Tips & Trick for SMUG
 
All About Storeconfigs
All About StoreconfigsAll About Storeconfigs
All About Storeconfigs
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Securefile LOBs
Securefile LOBsSecurefile LOBs
Securefile LOBs
 
OSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyOSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross Lawley
 
Caching and tuning fun for high scalability @ 4Developers
Caching and tuning fun for high scalability @ 4DevelopersCaching and tuning fun for high scalability @ 4Developers
Caching and tuning fun for high scalability @ 4Developers
 

More from Jeff Larkin

Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupJeff Larkin
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesJeff Larkin
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
 
Performance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismPerformance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismJeff Larkin
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5Jeff Larkin
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5Jeff Larkin
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIAJeff Larkin
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesJeff Larkin
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Jeff Larkin
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEJeff Larkin
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialJeff Larkin
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingJeff Larkin
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsJeff Larkin
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best PracticesJeff Larkin
 

More from Jeff Larkin (15)

Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
 
Performance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismPerformance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive Parallelism
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIA
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid Architectures
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SE
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorial
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU Computing
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming Models
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best Practices
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)

  • 1.
  • 4.  Countless techniques for doing I/O operations  Varying difficulty  Varying efficiency  All techniques suffer from the same phenomenon, eventually it will turn over.  Limited Disk Bandwidth  Limited Interconnect FPP Write (MB/s) 35000 Bandwidth 30000  Limited Filesystem Parallelism 25000  Respect the limits or suffer the 20000 consequences. 15000 10000 5000 0 0 2000 4000 6000 8000 10000
  • 5.
     Lustre is parallel, not paranormal
     Striping is critical and often overlooked
     Writing many-to-one requires a large stripe count
     Writing many-to-many requires a single stripe
    [Chart: write performance (0–16) with Default Stripe vs. Stripe All]
  • 6.
     Files inherit the striping of the parent directory
       Input directory must be striped before copying in data
       Output directory must be striped before running
     May also “touch” a file using the lfs command
     An API to stripe a file programmatically is often requested; here’s how to do it.
     Call from only one processor
     New support in xt-mpt for striping hints
       striping_factor
       striping_size

    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <lustre/lustre_user.h>

    int open_striped(char *filename, int mode, int stripe_size,
                     int stripe_offset, int stripe_count)
    {
      int fd;
      struct lov_user_md opts = {0};

      opts.lmm_magic = LOV_USER_MAGIC;
      opts.lmm_stripe_size = stripe_size;
      opts.lmm_stripe_offset = stripe_offset;
      opts.lmm_stripe_count = stripe_count;

      fd = open64(filename, O_CREAT | O_EXCL
                  | O_LOV_DELAY_CREATE | mode, 0644);
      if (fd >= 0)
        ioctl(fd, LL_IOC_LOV_SETSTRIPE, &opts);

      return fd;
    }
  • 7.
     We know that large writes perform better, so we buffer
     We can control our buffer size
     We can ALSO control our stripe size
     Misaligning buffer and stripe sizes can hurt your performance
    [Diagram: 5 MB buffers written over 4 MB stripes — each buffered write straddles a stripe boundary]
  • 8.
     In order to use O_DIRECT, data buffers must be aligned to page boundaries
     O_DIRECT is rarely a good idea
     Memory alignment can be done by:
       C: posix_memalign instead of malloc
       FORTRAN: over-allocation and the loc function
     Aligning I/O buffers on page boundaries can improve I/O performance.
    [Chart: write performance (0–40) vs. number of clients (0–15000), Shared vs. Shared Aligned]
  • 9.
     This method is simple to implement and can utilize more than the 160-OST shared-file limit
     This method is also very stressful on the FS and inconvenient with thousands of clients
       Too many opens at once floods the MDS
       Too many concurrent writers can stress the OSTs
       Too-small writes kill performance
       Too many files stress the user
  • 10. Performance Results
    [Charts: write performance (0–40) and open time (0–9) vs. number of clients (0–15000)]
  • 11.
     Slightly more difficult to implement than file-per-process, but still fairly easy
     Generally slightly less efficient than file-per-process
     More convenient than many files
     Nicer to the MDS? Maybe marginally.
     Can still overload OSTs with many writers
     Try to make sure that two processors don’t need to write to the same stripe
  • 12.
    POSIX Shared

    fd = open64("test.dat", mode, 0644);
    /* Seek to start place for rank */
    ierr64 = lseek64(fd, commrank*iosize, SEEK_SET);
    remaining = iosize;
    /* Write by buffers to the file */
    while (remaining > 0)
    {
      i = (remaining < buffersize) ? remaining : buffersize;
      /* Copy from data to buffer */
      memcpy(tmpbuf, dbuf, i);
      ierr = write(fd, tmpbuf, i);
      if (ierr >= 0) {
        remaining -= ierr;
        dbuf += ierr;
      } else {
        MPI_Abort(MPI_COMM_WORLD, ierr);
      }
    }
    close(fd);

    Fortran Direct

    ! Establish sizes
    reclength = 8*1024*1024
    iosize = reclength * 10
    ! Starting record for rank
    recnum = (iosize * myrank)/reclength
    recs = iosize/8
    numwords = recs/10
    open(fid, file='output/test.dat', status='replace', &
         form='unformatted', access='direct', &
         recl=reclength, iostat=ierr)
    ! Write a record at a time to the file
    do i=1,recs,numwords
      write(fid, rec=recnum, iostat=ierr) writebuf(i:i+numwords-1)
      recnum = recnum + 1
    end do
    close(fid)
  • 13. Performance Results
    [Charts: write performance (0–30) and open time (0–9) vs. number of clients (0–15000), Shared vs. FPP]
  • 14.
     I/O scaling limitations:
       Turns over above some number of clients
       Shared files are limited to 160 OSTs, but some filesystems have more
     Can we use this knowledge to improve I/O performance?
     Aggregate I/O via sub-grouping to:
       Reduce the number of clients using the FS
       Aggregate into larger I/O buffers
       Potentially cover > 160 OSTs via multiple shared files
     We can do this:
       Via MPI-IO collective buffering
       By hand (many different ways)
  • 15.
     MPI-IO provides a way to handle buffering and grouping behind the scenes
       Advantage: little or no code changes
       Disadvantage: little or no knowledge of what’s actually done
     Use collective file access
       MPI_File_write_all – specify file view first
       MPI_File_write_at_all – calculate offset for each write
     Set the cb_* hints
       cb_nodes – number of I/O aggregators
       cb_buffer_size – size of collective buffer
       romio_cb_write – enable/disable collective buffering
     No need to split comms, gather data, etc.
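On Cray MPT these ROMIO hints can typically also be supplied without touching the code, through the MPICH_MPIIO_HINTS environment variable (the variable name and syntax are an assumption about the MPT version in use; the hint names and values below are illustrative, matching the slide's cb_* list):

```shell
# Apply hints to all files opened via MPI-IO ("*"):
#   8 aggregator nodes, 4 MB collective buffers, collective writes enabled.
export MPICH_MPIIO_HINTS="*:cb_nodes=8:cb_buffer_size=4194304:romio_cb_write=enable"
```

The same hints can be set portably in code with MPI_Info_set before MPI_File_open.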
  • 17.
     Lose ease-of-use, gain control
     Countless methods to implement:
       Simple gathering
       Serialized sends within group
       Write token
       Double buffered
       Bucket brigade
       …
     Look for existing groups in your code
     Even the simplest solutions often seem to work.
     Try to keep the pipeline full
       Always be doing I/O
     Now we can think about multiple shared files!
    [Image caption: “I find your lack of faith in ROMIO disturbing.”]
  • 18.
     Every code uses these very differently
     Follow as many of the same rules as possible
     It is very possible to get good results, but also possible to get bad
     Because Parallel HDF5 is written over MPI-IO, it’s possible to use hints
    [Chart: IOR shared-file writes (MB/s, 0–8000) vs. number of clients (0–10000): POSIX, MPIIO, HDF5]
  • 19. Related CUG Talks/Papers
     Performance Characteristics of the Lustre File System on the Cray XT5 with Regard to Application I/O Patterns, Lonnie Crosby
     Petascale I/O Using The Adaptable I/O System, Jay Lofstead, Scott Klasky, et al.
     Scaling MPT and Other Features, Mark Pagel
     MPI-IO Whitepaper, David Knaak, ftp://ftp.cray.com/pub/pe/download/MPI-IO_White_Paper.pdf

    Thank You
     Lonnie Crosby, UT/NICS
     Mark Fahey, UT/NICS
     Scott Klasky, ORNL/NCCS
     Mike Booth, Lustre COE
     Galen Shipman, ORNL
     David Knaak, Cray
     Mark Pagel, Cray