























Amid rapid advances in large machine learning (ML) models, universities worldwide are investing substantial funds and effort in GPU clusters. However, managing a shared GPU cluster poses a pyramid of challenges, from hardware configuration to resource allocation among users. This paper introduces SING, a full-stack solution designed to streamline the management of shared GPU clusters in academic institutions. Motivated by the pressing need for efficient resource sharing and by the constraints of limited staffing, we present a comprehensive view of SING's architecture and design choices, which together achieve operational efficiency (i.e., low maintenance cost and high resource utilization). We also share experience and insights from the real-world operation of SING, including an analysis of its usage patterns and its management of incidents and failures. This paper is part of our ongoing effort to improve the management of shared ML clusters. We open-source relevant resources to facilitate the development and operation of similar ML clusters.