























Abstract: In large-model distributed training, especially large language model workloads, gradient All-Reduce increasingly stresses the memory and communication path. This paper asks whether a Compute Express Link (CXL) memory controller can aggregate low-bit gradient signals as gradient cache lines pass through it, while preserving a 32-bit floating-point (FP32) path for workloads, layers, or phases that should not use low-bit approximation. We present NEURON-Fabric, a CXL-side controller architecture that performs packed gradient-binary (G-Binary) sign-count aggregation and gradient-ternary (G-Ternary) gated aggregation near CXL memory, with a control interface for selecting low-bit or FP32 paths. Cycle-level timing experiments show that the measured five-cycle low-bit aggregation datapath adds at most 1.67 percent exposed runtime overhead in the full last-level-cache miss regime; under bandwidth pressure, the same compute stage is hidden by CXL service time. Functional tests confirm byte-exact identity read-back, G-Binary sign-count aggregation, and G-Ternary gating. Training checks quantify the communication and accuracy tradeoff: low-bit aggregation remains close to FP32 on CIFAR-10/ResNet-18 and SST-2/DistilBERT, while full-path low-bit aggregation fails on CIFAR-100/ResNet-18. Layer-aware admission identifies the classifier head as sensitive; keeping the head on FP32 while applying low-bit aggregation to the backbone recovers most accuracy and reduces gradient traffic to 3.6-5.4 percent of the FP32 baseline. Hardware synthesis and FPGA place-and-route estimates suggest that the 512-bit aggregation datapath is small enough to be treated as a near-memory datapath extension rather than a separate accelerator-scale block.
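To make the two low-bit modes named in the abstract concrete, the following is a minimal software sketch of one plausible reading of them. It is not the NEURON-Fabric datapath: the abstract does not define the encodings, so the bit conventions (bit set means positive gradient), the magnitude-threshold gate, the tie-breaking rule, and every function name here (`pack_signs`, `pack_ternary`, `sign_count_aggregate`, `majority_sign`, `ternary_gated_aggregate`) are assumptions made for illustration.

```python
from typing import List, Tuple


def pack_signs(grads: List[float]) -> int:
    """Pack one sign bit per element; bit i is 1 when grads[i] > 0 (assumed convention)."""
    word = 0
    for i, g in enumerate(grads):
        if g > 0:
            word |= 1 << i
    return word


def pack_ternary(grads: List[float], threshold: float) -> Tuple[int, int]:
    """Return (sign_word, gate_word); the gate bit is set when |g| >= threshold (assumed rule)."""
    gate = 0
    for i, g in enumerate(grads):
        if abs(g) >= threshold:
            gate |= 1 << i
    return pack_signs(grads), gate


def sign_count_aggregate(words: List[int], width: int) -> List[int]:
    """G-Binary style: per-position count of positive sign bits across workers."""
    return [sum((w >> i) & 1 for w in words) for i in range(width)]


def majority_sign(counts: List[int], n_workers: int) -> List[int]:
    """Turn counts into +1/-1 by majority vote; exact ties map to 0 (assumed)."""
    out = []
    for c in counts:
        diff = 2 * c - n_workers  # positives minus negatives
        out.append((diff > 0) - (diff < 0))
    return out


def ternary_gated_aggregate(signs: List[int], gates: List[int], width: int) -> List[int]:
    """G-Ternary style: sum +1/-1 votes only where a worker's gate bit is set."""
    totals = []
    for i in range(width):
        total = 0
        for s, g in zip(signs, gates):
            if (g >> i) & 1:
                total += 1 if (s >> i) & 1 else -1
        totals.append(total)
    return totals


# Three hypothetical workers, four gradient elements each.
workers = [
    [0.5, -0.2, 0.1, -0.9],
    [0.3, 0.4, -0.1, -0.2],
    [-0.7, 0.6, 0.2, -0.1],
]
binary_words = [pack_signs(w) for w in workers]
counts = sign_count_aggregate(binary_words, width=4)
votes = majority_sign(counts, n_workers=len(workers))

ternary = [pack_ternary(w, threshold=0.25) for w in workers]
gated = ternary_gated_aggregate([s for s, _ in ternary], [g for _, g in ternary], width=4)
```

In this sketch each worker contributes one bit per gradient element in binary mode, and the aggregator only needs per-position counters; a 512-bit word, the datapath width the abstract mentions, would carry 512 such sign bits under this assumed packing. The gated variant lets a worker abstain on small-magnitude elements, so positions where no worker votes aggregate to zero.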
From: Ziqiang Wang
[v1] Sat, 13 Jun 2026 01:17:58 UTC (104 KB)