Examining Job Exit Status Reveals Surprising Information about Computer Operations
Investigators: A Sim, W Yoo
Source Data
NERSC genepool cluster batch queue job records: exit status, memory usage, CPU usage, and other per-job measurements
What We Found
- Clustering reveals outliers, shown as red circles on the right. These outliers can point to performance anomalies.
- Job exit status could be predicted with 99% accuracy
- Dips in exit status prediction rate coincide with known system-wide failures
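The link between prediction-rate dips and system-wide failures can be illustrated with a minimal sketch: compute the fraction of correctly predicted exit statuses per day, then flag days that fall below a cutoff. The record layout, the day labels, the sample values, the 0.9 threshold, and the function names are all illustrative assumptions, not details of the study or of its prediction model.

```python
from collections import defaultdict


def accuracy_by_day(records):
    """Return {day: fraction of jobs whose predicted exit status matched the actual one}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for day, predicted, actual in records:
        totals[day] += 1
        if predicted == actual:
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in totals}


def find_dips(daily_accuracy, threshold=0.9):
    """Return, in sorted order, the days whose accuracy is below the threshold.

    The threshold is an assumed cutoff chosen for illustration only.
    """
    return sorted(day for day, acc in daily_accuracy.items() if acc < threshold)


# Hypothetical records: (day label, predicted exit status, actual exit status).
records = [
    ("day-1", 0, 0), ("day-1", 0, 0), ("day-1", 1, 1),
    ("day-2", 0, 1), ("day-2", 0, 137), ("day-2", 0, 0),
    ("day-3", 0, 0), ("day-3", 1, 1),
]

daily = accuracy_by_day(records)
dips = find_dips(daily)
print(dips)
```

In this made-up sample only day-2 is flagged, because two of its three predictions miss. Days flagged this way could then be compared against the log of known system-wide failures.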
What's Next
- Performance prediction (time, I/O, resource consumption) by job characterization and modeling
- Online failure detection using runtime measurements, enabling early action on jobs expected to fail