





















Abstract:Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5 x 5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pair-wise sinusoidal coupling into neighbor accumulation independent of the center phase followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel only, compared with software execution of the same kernel on the processor core in the same SoC platform, SA-Kura reduces latency and energy by 193x and 69.4x, respectively. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57x faster and achieves approximately 46.0x lower energy per pixel.
| Comments: | 8 pages, 6 figures, 1 table; ACM/IEEE ISLPED 2026 accepted paper |
| Subjects: | Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.24016 [cs.AR] |
| (or arXiv:2605.24016v1 [cs.AR] for this version) | |
| https://doi.org/10.48550/arXiv.2605.24016 arXiv-issued DOI via DataCite (pending registration) |
From: Woojoo Lee [view email]
[v1]
Wed, 20 May 2026 04:39:11 UTC (1,031 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。