Efficiently reading from and writing to storage systems is necessary for most scientific simulations to achieve good performance at scale. Many software solutions have been developed to reduce the I/O bottleneck. One well-known strategy, in the context of collective I/O operations, is the two-phase I/O scheme. This strategy consists of selecting a subset of processes to aggregate contiguous pieces of data before performing reads/writes. In this paper, we present TAPIOCA, an MPI-based library implementing an efficient topology-aware two-phase I/O algorithm. We show how TAPIOCA can take advantage of double-buffering and one-sided communication to minimize idle time during data aggregation. We also introduce our cost model, which leads to a topology-aware aggregator placement that optimizes data movement. We validate our approach at large scale on two leadership-class supercomputers: Mira (IBM BG/Q) and Theta (Cray XC40). We present the results obtained with TAPIOCA on a micro-benchmark and on the I/O kernel of a large-scale simulation. On both architectures, we show a substantial improvement in I/O performance compared with the default MPI I/O implementation. On BG/Q+GPFS, for instance, our algorithm improves performance by a factor of twelve, while on the Cray XC40 system with a Lustre filesystem, we achieve an improvement by a factor of four.
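To make the two-phase idea concrete, here is a minimal sketch in plain Python that simulates a collective write without MPI. It is not TAPIOCA's API or algorithm: the function name `two_phase_write`, the `Piece` type, and the equal-sized region split are assumptions made for illustration, and the sketch leaves out topology-aware placement, double-buffering, and one-sided communication entirely. It only shows how scattered pieces held by many processes can be gathered into a few contiguous buffers so that each aggregator issues a single large write.

```python
from typing import Dict, List, Tuple

# (file_offset, payload): one small piece a process wants to write.
Piece = Tuple[int, bytes]


def two_phase_write(pieces_by_rank: Dict[int, List[Piece]],
                    file_size: int,
                    num_aggregators: int) -> Tuple[bytearray, int]:
    """Simulate a two-phase collective write.

    Hypothetical helper for illustration only. Assumes every piece lies
    inside [0, file_size). Returns (file contents, number of write calls).
    """
    # Each aggregator owns one contiguous file region (ceiling division).
    region = -(-file_size // num_aggregators)
    buffers = [bytearray(max(0, min(region, file_size - a * region)))
               for a in range(num_aggregators)]

    # Phase 1 (aggregation): every process sends each piece to the
    # aggregator that owns its offset, splitting at region boundaries.
    for rank, pieces in pieces_by_rank.items():
        for offset, data in pieces:
            pos = 0
            while pos < len(data):
                agg, local = divmod(offset + pos, region)
                n = min(len(data) - pos, region - local)
                buffers[agg][local:local + n] = data[pos:pos + n]
                pos += n

    # Phase 2 (I/O): each aggregator writes its buffer in one contiguous call.
    storage = bytearray(file_size)
    writes = 0
    for agg, buf in enumerate(buffers):
        if buf:
            start = agg * region
            storage[start:start + len(buf)] = buf
            writes += 1
    return storage, writes


if __name__ == "__main__":
    # Four processes, each holding two 2-byte pieces interleaved in the file.
    pieces = {r: [(2 * r, bytes([65 + r]) * 2), (8 + 2 * r, bytes([97 + r]) * 2)]
              for r in range(4)}
    contents, writes = two_phase_write(pieces, file_size=16, num_aggregators=2)
    print(contents.decode(), writes)
```

In this toy setup the eight small pieces would cost eight separate writes if each process accessed the file directly; with two aggregators the same data reaches storage in two contiguous writes. Which processes act as aggregators is exactly what a placement strategy has to decide, and the sketch deliberately leaves that choice out.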
- François Tessier, Venkatram Vishwanath, and Emmanuel Jeannot, "TAPIOCA: An I/O library for optimized topology-aware data aggregation on large-scale supercomputers", IEEE Cluster Conference 2017
- François Tessier, Preeti Malakar, Venkatram Vishwanath, Emmanuel Jeannot, and Florin Isaila, "Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers", 1st Workshop on Optimization of Communication in HPC runtime systems (IEEE COM-HPC16), held in conjunction with ACM/IEEE SuperComputing'16 Conference