


























Continuous availability of HPC systems built from commodity components has become a primary concern as system size grows to thousands of processors. In this paper, we present an analysis of 8-24 months of real failure data collected from three HPC systems at the National Center for Supercomputing Applications (NCSA) during 2001-2004. The results show that availability is 98.7-99.8% and that most outages are due to software halts. Most of the downtime, on the other hand, is caused by hardware halts or scheduled maintenance. We also used failure clustering analysis to identify several correlated failures.