Robert Ross, ANL

Application

  • Darshan collects concise I/O access pattern information from large-scale applications

Goal

  • Users: improve the performance of critical scientific applications
  • Administrators: gain insight into storage system deployments and usage
  • Researchers: guide future research directions in HPC I/O

Requirement

  • Negligible impact on production applications
    • Transparency at leadership scale
    • Enable automatically for all users
  • Rapid feedback on the behavior of applications
    • User-friendly analysis tools for users to inspect I/O performance
    • Capture information from multiple levels of the I/O software stack

 

Histogram of I/O access sizes in a FLASH plot file. These data revealed unexpected small write operations. The file layout was optimized accordingly, reducing I/O time by one third [1].

[1] Latham et al. A case study for scientific I/O: improving the FLASH astrophysics code. Computational Science & Discovery, 2012.
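
To make the idea of an access-size histogram concrete, here is a minimal Python sketch that bins a list of access sizes (in bytes) into size ranges. It is not taken from Darshan: the function name `access_size_histogram`, the bucket edges, and the sample sizes are assumptions chosen for illustration only.

```python
from bisect import bisect_right
from collections import Counter

# Upper bounds (in bytes) of every bucket except the last, open-ended one.
# These edges are an illustrative assumption, not a documented format.
BUCKET_EDGES = [100, 1024, 10 * 1024, 100 * 1024, 1024**2,
                4 * 1024**2, 10 * 1024**2, 100 * 1024**2, 1024**3]
BUCKET_LABELS = ["0-100", "100-1K", "1K-10K", "10K-100K", "100K-1M",
                 "1M-4M", "4M-10M", "10M-100M", "100M-1G", "1G+"]


def access_size_histogram(sizes):
    """Count how many accesses fall into each size bucket.

    Each bucket includes its lower bound and excludes its upper bound.
    """
    counts = Counter()
    for size in sizes:
        # bisect_right returns the index of the first edge greater than size.
        counts[BUCKET_LABELS[bisect_right(BUCKET_EDGES, size)]] += 1
    return [(label, counts[label]) for label in BUCKET_LABELS]


if __name__ == "__main__":
    # Hypothetical write sizes: many tiny writes next to one large write.
    sample = [8, 8, 8, 64, 512, 4 * 1024**2]
    for label, count in access_size_histogram(sample):
        print(f"{label:>9}: {count}")
```

A histogram dominated by the smallest buckets, as in the hypothetical sample, is the kind of signal that would prompt a closer look at file layout.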

Results

  • Darshan is deployed and automatically enabled for all users at the ALCF and NERSC facilities.
  • Darshan has been used to successfully tune a wide variety of scientific applications.
  • The example shown here is from the FLASH application. FLASH can be used to solve a wide range of astrophysical, fluid dynamics, and plasma physics problems. Improvements in FLASH’s I/O performance help enhance scientific productivity.

Notes:

It is well understood that computational science codes exhibit storage access patterns that differ greatly from those seen in non-HPC settings. However, the community’s understanding of these patterns has historically been based on single-application studies and limited, high-overhead tracing of specific applications. Darshan (http://www.mcs.anl.gov/darshan), developed under DOE ASCR funding, is a tool for transparently capturing the I/O behavior of computational science codes in production with virtually no overhead. Darshan provides information that is regularly used for performance debugging of applications, but the value of this data goes beyond its use in tuning specific codes.

Darshan has been deployed at NERSC and ALCF as part of joint activities between SciDAC SDAV personnel and the two facilities. It is regularly used at both for rapid performance debugging of critical applications. In addition, over 150,000 unique log files have been anonymized and shared with the community to help inform research directions and future facility design. Thus, this SDAV activity will have a lasting impact on HPC storage architectures, including future exascale platforms.