Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks
Aarush Gupta · 2026-06-10 · via Hacker News

07 Jun 2026

This post is a high-level explainer for my Master’s thesis, which involves designing hardware architectures for ultrafast inference and online learning using the Kolmogorov-Arnold Network (KAN) architecture. I’ll assume familiarity with standard machine learning concepts, as well as some understanding of hardware and digital circuits; read my previous post here for the latter.

Please read the two papers below for more information, particularly for details on benchmarks and notable results.

[FPGA 2026 Best Paper]
Duc Hoang*, Aarush Gupta*, and Philip C. Harris. “KANELÉ: Kolmogorov–Arnold Networks for Efficient LUT-based Evaluation.” Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2026. https://dx.doi.org/10.1145/3748173.3779202

[ICML 2026]
Duc Hoang*, Aarush Gupta*, and Philip Harris. “Ultrafast on-FPGA Online Learning via Spline Locality in Kolmogorov-Arnold Networks.” arXiv preprint arXiv:2602.02056, 2026. https://arxiv.org/abs/2602.02056

*equal contribution

The case for machine learning on FPGAs

Most modern machine learning workloads, whether training or inference, run on graphics processing units (GPUs). Through hardware architectures that support a highly parallel execution model, GPUs can perform simple operations on large amounts of data with extremely high throughput. This makes them ideal for machine learning problems involving large architectures or batch-style training and inference.

However, complex GPU architectures cannot meet the demands of applications that require ultra-low latency (e.g. sub-microsecond) and high hardware efficiency. Processors such as CPUs and GPUs incur significant overhead from scheduling and optimizing instructions, dynamically accessing memory, and so on. Extremely specialized workloads with ultra-low latency (on the order of nanoseconds) and efficiency requirements are instead better served by custom hardware accelerators.

Field-programmable gate arrays, or FPGAs, are reconfigurable digital logic devices that are extremely well-suited for this style of custom hardware acceleration. FPGAs contain lookup tables (LUTs), which represent digital functions by enumerating the output value for every combination of binary inputs; flip-flops (FFs), which store state; and other memory and computation primitives. These components and the connections between them are reconfigured to design a custom digital circuit, allowing for low-level hardware architecture and algorithm co-design that enables ultrafast machine learning. Importantly, neural networks are implemented directly as digital logic, rather than as instructions that are sequentially executed on a processor.

Background

Fixed-point quantization

FPGAs and other digital devices fundamentally operate on bits rather than continuous values. However, we often think about arithmetic operations in neural networks (e.g. $\times, +$) as happening over the real numbers $\mathbb R$. We thus need to encode real numbers as bitstrings (sequences of bits), a process known as quantization. Operations like addition and multiplication then become binary functions.

One method for doing this is fixed-point quantization.

Fixed-point quantization represents numbers in base 2, where some bits (the fractional bits) come after the binary point. To illustrate, if we use 8 bits total (signed, two's complement) with 4 fractional bits, we can represent $2^8$ values from $(-2^7) / 2^4 = -8$ to $(2^7 - 1) / 2^4 = 7.9375$, spaced evenly in increments of $1/2^4 = 0.0625$. We will assume here that the representable range is (approximately) symmetric about zero.

In a fixed-point quantization scheme, we can only represent a discrete set of values in some fixed range, which will lead to approximation error when trying to represent real values. One focus of resource-efficient machine learning is minimizing this approximation error, or quantization error, to enable stable training and inference.
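The 8-bit example above can be sketched in a few lines of Python. This is a minimal illustration that assumes round-to-nearest and saturation at the range limits; the function name `quantize` and its parameters are hypothetical, not taken from the papers.

```python
def quantize(x: float, total_bits: int = 8, frac_bits: int = 4) -> float:
    """Round x to the nearest signed fixed-point value, saturating at the limits."""
    scale = 1 << frac_bits                 # 2**frac_bits codes per unit
    lo = -(1 << (total_bits - 1))          # most negative integer code (-128)
    hi = (1 << (total_bits - 1)) - 1       # most positive integer code (127)
    code = round(x * scale)                # nearest integer code
    code = max(lo, min(hi, code))          # saturate instead of wrapping around
    return code / scale


print(quantize(0.3))     # 0.3125 (quantization error of 0.0125)
print(quantize(100.0))   # 7.9375 (saturated at the top of the range)
print(quantize(-100.0))  # -8.0   (saturated at the bottom of the range)
```

Away from saturation, the error is at most half a step, i.e. $0.03125$ for 4 fractional bits.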

Lookup-table neural networks (LUT-NNs)

FPGAs implement digital logic primarily through lookup tables (LUTs), which are small components that represent arbitrary binary functions by storing their output for each combination of binary inputs. For example, $\text{AND} : \{0, 1\}^2 \to \{0, 1\}$ is represented with the lookup table

| Input ($x, y$) | $x \text{ AND } y$ |
|---|---|
| 00 | 0 |
| 01 | 0 |
| 10 | 0 |
| 11 | 1 |

It then makes sense to learn these binary functions, represented as lookup tables, as core primitives of a neural network: such a network is known as a lookup-table neural network (LUT-NN). However, learning lookup tables through gradient descent or similar approaches is difficult.

To address this issue, recall that we can learn real-valued functions $f: \mathbb R \to \mathbb R$ through gradient descent. If we perform fixed-point quantization with $b_i$ input bits and $b_o$ output bits, $f$ becomes a binary function $f_Q : \{0, 1\}^{b_i} \to \{0, 1\}^{b_o}$. We can then learn continuous $f$ and quantize to get our desired lookup tables!

To convert $f$ into a LUT, we discretize the function domain and range into $N_i = 2^{b_i}$ and $N_o = 2^{b_o}$ values. The lookup table of $f_Q$ stores, for each of the input values $I \in \{I_0, I_1, \ldots, I_{N_i-1}\}$, the corresponding output.

[Figure: LUT quantization process. Quantizing a continuous $f$ (dashed line) to a binary function $f_Q$ (orange dots).]

The example function $f$ above, where $q_{l-1}$ and $q_{l}$ represent quantization of the inputs and outputs, produces a binary function $f_Q$ with LUT:

| $q_{l-1}(x_l)$ | $q_l(x_{l+1})$ |
|---|---|
| 00 | 000 |
| 01 | 011 |
| 10 | 100 |
| 11 | 111 |
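The quantization procedure can be sketched as follows. This is an illustrative sketch only: `build_lut` is a hypothetical helper, `tanh` stands in for a learned function, and the uniform mapping of codes onto $[-1, 1]$ is an assumption, so the resulting table differs from the one in the figure.

```python
import math


def build_lut(f, b_in, b_out, x_range=(-1.0, 1.0), y_range=(-1.0, 1.0)):
    """Tabulate f as integer output codes, one entry per input code."""
    n_in, n_out = 1 << b_in, 1 << b_out
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    lut = []
    for code in range(n_in):
        x = x_lo + (x_hi - x_lo) * code / (n_in - 1)    # input level for this code
        y = min(max(f(x), y_lo), y_hi)                  # clip to the output range
        lut.append(round((y - y_lo) / (y_hi - y_lo) * (n_out - 1)))
    return lut


lut = build_lut(math.tanh, b_in=2, b_out=3)
for code, out in enumerate(lut):
    print(f"{code:02b} -> {out:03b}")
```

Each row pairs a 2-bit input code with a 3-bit output code, the same layout as the table above.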

We could also extend this approach to represent multivariate functions as lookup tables, where $f_m: \mathbb{R}^{d_i} \to \mathbb{R}^{d_o}$ would turn into a binary function with $d_i b_i$ input bits and $d_o b_o$ output bits.

In this lookup table, we store entries of size $d_o b_o$ for each of the $2^{d_i b_i}$ possible combinations of inputs.

This is clearly impractical for any reasonable $d_i$: with just $d_i = 8$ inputs of $b_i = 8$ bits each, the table would already need $2^{64}$ entries. LUT-based networks therefore combine smaller lookup tables with arithmetic operations to enable architectures that are expressive, resource efficient, and easily trainable.

Kolmogorov-Arnold Networks (KANs)

KANs replace the learnable weights and fixed activation functions in MLP (multi-layer perceptron) architectures with learnable activation functions. In this work, we demonstrate that KANs are a natural architecture for efficient, expressive LUT-NNs.

In a KAN layer, each edge carries a learnable univariate function instead of a scalar weight. For a KAN layer with $n_{\mathrm{in}}$ inputs and $n_{\mathrm{out}}$ outputs, the activation of the $q$-th output is

\[y_q = \sum_{p=1}^{n_{\mathrm{in}}} \phi_{q,p}(x_p),\]

where $\phi_{q,p} \colon \mathbb{R} \to \mathbb{R}$ are the learned edge activations. Compared to an MLP, which uses

\[y_q = \sigma\left( \sum_{p=1}^{n_{\mathrm{in}}} w_{q,p} x_p + b_q \right)\]

with fixed $\sigma$, the KAN places the nonlinearity in the edge functions $\{ \phi_{q,p} \}$ and keeps the node operation as a simple sum.
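To make the contrast concrete, here is a minimal pure-Python sketch of both layer types. The edge functions in `phi` are arbitrary stand-ins chosen for illustration; in a real KAN they would be learned.

```python
import math


def kan_layer(x, phi):
    """y_q = sum_p phi[q][p](x_p); phi is an n_out x n_in grid of univariate callables."""
    return [sum(phi_qp(x_p) for phi_qp, x_p in zip(row, x)) for row in phi]


def mlp_layer(x, w, b, sigma=math.tanh):
    """y_q = sigma(sum_p w[q][p] * x_p + b_q) with a fixed nonlinearity sigma."""
    return [sigma(sum(w_qp * x_p for w_qp, x_p in zip(row, x)) + b_q)
            for row, b_q in zip(w, b)]


# Hypothetical edge activations for a 2-input, 2-output KAN layer.
phi = [[math.sin, lambda t: t * t],
       [abs, math.exp]]
x = [0.0, 2.0]
print(kan_layer(x, phi))                              # first output: sin(0) + 2**2 = 4.0
print(mlp_layer(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

In the KAN layer the only operation at each node is the sum, which is what later maps onto an adder tree in hardware.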

The next question is how to learn the KAN activations $\{ \phi_{q,p} \}$. To do so, we parametrize them as linear combinations of some functional basis:

\[\phi_{q,p}(x) = \sum_{i=1}^n c_{q,p,i}B_i(x),\]

which allows us to treat the coefficients $c_{q,p,i}$ as trainable parameters for gradient descent. The original KAN paper uses B-splines, which form a piecewise-polynomial basis that is smooth and also local, i.e. only a small subset of basis functions is nonzero for any given input. Additionally, B-splines, and therefore the activations $\{ \phi_{q,p} \}$, are defined over a small, finite domain (e.g. $[-1, 1]$), which turns out to be important.

Terminology: B-splines refer to the basis functions $\{ B_i \}$, whereas activations refer to the learned $\phi_{q,p}(x) = \sum_{i=1}^n c_{q,p,i}B_i(x)$.
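As an illustration of this parametrization, the sketch below evaluates a B-spline basis with the standard Cox-de Boor recursion and forms an activation as a linear combination. It assumes a uniform grid of `G` cells on $[-1, 1]$ extended by `S` knots on each side, which yields `G + S` basis functions of degree `S`; the function names and coefficient values are hypothetical.

```python
def bspline_basis(x, G, S, lo=-1.0, hi=1.0):
    """Return the G + S degree-S B-spline basis values at x (uniform extended grid)."""
    h = (hi - lo) / G
    knots = [lo + (j - S) * h for j in range(G + 2 * S + 1)]
    # Degree 0: indicator function of each knot span.
    B = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
         for j in range(len(knots) - 1)]
    # Cox-de Boor recursion: raise the degree one step at a time.
    for d in range(1, S + 1):
        B = [
            (x - knots[j]) / (knots[j + d] - knots[j]) * B[j]
            + (knots[j + d + 1] - x) / (knots[j + d + 1] - knots[j + 1]) * B[j + 1]
            for j in range(len(B) - 1)
        ]
    return B


def activation(x, coeffs, G, S):
    """phi(x) = sum_i c_i * B_i(x)."""
    return sum(c * b for c, b in zip(coeffs, bspline_basis(x, G, S)))


basis = bspline_basis(0.1, G=3, S=2)
print(len(basis), sum(1 for b in basis if b > 0))   # 5 basis functions, 3 of them active
print(activation(0.1, [0.0, 1.0, -1.0, 0.5, 2.0], G=3, S=2))
```

Only the coefficients are trainable; the basis itself is fixed once `G` and `S` are chosen.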

While the complete behavior of the KAN architecture has yet to be fully explored, it offers potential improvements over MLPs in scaling laws, parameter efficiency, and interpretability. For ultrafast machine learning, the first two characteristics are especially relevant.

KANs as trainable LUT-NNs

Approach

The key idea of our first paper is to use KANs as a principled way to build trainable LUT-NNs. We represent each activation function in the network with a separate LUT, using the procedure discussed in the LUT-NN background section. We now demonstrate that KAN activations are particularly well-suited for LUT representation.

Many other LUT-NN schemes represent multivariate functions as lookup tables, which is inefficient: as we saw earlier, the number of LUT entries scales exponentially with input dimension. In contrast, the core property of KANs is that they sum univariate activations, avoiding exponential scaling and enabling straightforward pruning (i.e. removing unimportant network components to reduce resource usage). Additionally, because each activation is defined over a small finite domain (e.g. $[-1,1]$), we can cover the entire input range when quantizing the function!
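A rough storage count shows the gap. The numbers below are a back-of-the-envelope sketch with hypothetical layer sizes; they count table bits only and ignore adders and any bit growth in the sums.

```python
def multivariate_lut_bits(d_in, d_out, b_in, b_out):
    """One table over all inputs: 2**(d_in * b_in) entries of d_out * b_out bits."""
    return (2 ** (d_in * b_in)) * d_out * b_out


def kan_lut_bits(d_in, d_out, b_in, b_out):
    """One univariate table per edge: d_in * d_out tables of 2**b_in entries."""
    return d_in * d_out * (2 ** b_in) * b_out


# Hypothetical layer: 4 inputs, 2 outputs, 6-bit inputs and outputs.
print(multivariate_lut_bits(4, 2, 6, 6))  # 201326592 bits
print(kan_lut_bits(4, 2, 6, 6))           # 3072 bits
```

The univariate count grows linearly in the number of edges, while the multivariate count doubles with every extra input bit.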

Implementation

Here, we train KANs in software (e.g. PyTorch on CPUs/GPUs) using standard techniques and then deploy fixed models for inference on FPGAs.

To perform on-FPGA inference for a trained KAN layer, we adopt a fixed-point quantization scheme and compute the activations $\{ \phi_{q,p} \}$ in parallel using lookup tables. Then, we perform the sum $\sum_{p=1}^{n_\text{in}} \phi_{q,p}(x_p)$ using $n_\text{in} - 1$ pairwise additions (arranged in an adder tree).

We convert each complete activation $\phi_{q,p}(x_p)$ to a lookup table, not the B-splines $\{ B_i \}$ themselves. This is because the model is pretrained and activations are fixed at inference time.

For multi-layer networks, we construct the circuits described above for each layer (whose LUTs store their respective learned activations) and connect each layer’s outputs to the next layer’s inputs.
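A software model of one such layer might look like the sketch below. It is not the hardware implementation from the paper: the bit widths `B_IN` and `FRAC_BITS`, the helper names, and the uniform input mapping are all assumptions, and the edge functions stand in for trained activations.

```python
B_IN = 4        # hypothetical number of input bits per activation
FRAC_BITS = 4   # hypothetical number of fractional bits in stored outputs


def to_index(x, lo=-1.0, hi=1.0):
    """Quantize an input in [lo, hi] to a LUT address."""
    n = 1 << B_IN
    return max(0, min(n - 1, round((x - lo) / (hi - lo) * (n - 1))))


def make_lut(phi, lo=-1.0, hi=1.0):
    """Precompute integer output codes of phi for every input address."""
    n = 1 << B_IN
    return [round(phi(lo + (hi - lo) * k / (n - 1)) * (1 << FRAC_BITS))
            for k in range(n)]


def adder_tree(values):
    """Sum values pairwise, level by level, using len(values) - 1 additions."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])      # odd element passes through unchanged
        level = nxt
    return level[0]


def kan_layer_lut(x, luts):
    """luts[q][p] is the table for the edge from input p to output q."""
    return [adder_tree([lut[to_index(x_p)] for lut, x_p in zip(row, x)])
            for row in luts]


luts = [[make_lut(lambda t: t), make_lut(lambda t: t * t)]]
codes = kan_layer_lut([1.0, -1.0], luts)
print(codes, [c / (1 << FRAC_BITS) for c in codes])   # [32] [2.0]
```

In hardware, all lookups in a layer happen in parallel and the tree depth grows only logarithmically with $n_\text{in}$.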

[Figure: KAN LUT-based inference architecture. Overview of the architecture for efficient LUT-based KAN inference on FPGAs.]

Results

This framework matches or surpasses state-of-the-art neural network FPGA accelerators on metrics including latency and resource usage, with a 2700x speedup over prior KAN-FPGA implementations. Check out the paper for more details if you’re interested!

Real-time on-FPGA learning

Motivation

Training a model in software and deploying it to an FPGA provides extremely fast inference, but the model itself is still fixed after deployment. In many real-time settings, however, the system being modeled is not static: its state or properties can evolve at high frequencies. In applications such as quantum control and nuclear fusion, a model may therefore need to adapt its behavior within a fraction of a microsecond while still performing ultrafast inference.

This is the motivation for online learning: instead of treating the FPGA as an inference-only device for pretrained models, we also update the model in real time as new data arrives. We stream in inputs, run the model, compare each prediction against feedback or a target value, and use that error to update the model parameters. In other words, the forward pass, backward pass, and gradient update all run on the FPGA itself, rather than only the forward pass.

Unlike a CPU/GPU, which fetches weights from memory, calculates gradients, and writes them back, the FPGA implements the gradient update logic as a dedicated, parallel circuit that directly modifies the FPGA memory storing the coefficients.

Although the concept of real-time gradient updates on FPGAs has been underexplored and historically considered impractical, we demonstrate that the ideas of LUT-based KAN inference can be extended to enable this form of online learning at sub-microsecond timescales.

Approach

Because we now want to train models on the FPGA rather than only run inference with static, pre-trained ones, we store the B-spline basis functions $\{ B_i \}$ in LUTs rather than the learned activations $\{ \phi_{q,p} \}$.

The reason is that the coefficients $c_{q,p,i}$ in

\[\phi_{q,p}(x) = \sum_{i=1}^n c_{q,p,i}B_i(x)\]

are updated as training progresses on-FPGA: since the activations $\{ \phi_{q,p} \}$ keep changing, we cannot precompute and store them.

Instead, we must look up B-spline values and multiply them by each activation’s coefficients to compute it.

We now demonstrate certain properties of the B-spline basis $\{ B_i \}$ that make gradient updates both sparse and stable under fixed-point quantization, enabling gradient-based learning with extremely low latency and a small resource footprint compared to prior approaches.

Local basis functions

B-spline basis functions have the property that only a small subset of them are nonzero (“active”) for any given input value. To achieve this, the input range is split into $G$ intervals, or “grid cells”. For B-splines of degree $S$ (the spline “order” in KAN terminology), $G+S$ basis functions form a complete basis, but only $S+1$ of them are nonzero over any particular interval, and $S+1 \ll G+S$ when $G$ is large.

[Figure: Local basis functions illustration. B-spline basis functions for $G=3, S=2$. Note that only $S+1=3$ basis functions are nonzero over each grid cell.]

We exploit this locality as follows. Within every interval, the $S+1$ active basis functions have the same shape (just shifted from one interval to the next); only their coefficients differ between intervals. To evaluate an activation, we compute those fixed $S+1$ basis functions, multiply them by the coefficients for the current interval, and sum the results.

As a result, the hardware logic required for the forward and backward passes of a KAN spline scales with $S+1$, not $G$. In particular, we can “horizontally” scale KANs by increasing the grid size $G$ to improve model expressivity without scaling up $S$.
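As a concrete illustration (an assumed example, not taken from the papers), take $S = 2$ on a uniform grid and let $u \in [0, 1)$ be the offset within a grid cell, normalized by the cell width. The three active basis functions on every cell are then the standard uniform quadratic B-spline pieces

\[b_1(u) = \tfrac{1}{2}(1-u)^2, \quad b_2(u) = \tfrac{1}{2}\left(-2u^2 + 2u + 1\right), \quad b_3(u) = \tfrac{1}{2}u^2,\]

so, numbering cells and coefficients such that cell $i$ uses $c_i, c_{i+1}, c_{i+2}$, the activation on cell $i$ reduces to

\[\phi(x) = c_{i}\, b_1(u) + c_{i+1}\, b_2(u) + c_{i+2}\, b_3(u).\]

Only the three coefficients change from cell to cell, so increasing $G$ adds stored coefficients without adding arithmetic.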

Stable fixed-point training

A common issue with FPGA-based training is that neural network training often produces weights and gradients that vary widely in magnitude. However, with a fixed number of bits, there is a tradeoff between representing small values precisely (fine precision) and covering a large range of magnitudes.

Furthermore, multi-layer perceptrons (MLPs) primarily consist of matrix multiplications whose output magnitudes scale with the input activation magnitudes, widening the range of values that quantization must cover. Even with bounded activation functions like sigmoid or tanh, the intermediate values fed into those activations suffer from the same issue when quantized.

However, it can be shown that the B-splines in KANs are nonnegative and satisfy $\sum_i B_i(x) = 1$. For any given input $x$, the output

\[f(x) = c_1 B_1(x) + c_2 B_2(x) + \ldots + c_{S+1} B_{S+1}(x)\]

is therefore a weighted average of the active coefficients and is always bounded between the smallest and largest of them: $\min_i c_i \leq f(x) \leq \max_i c_i$. The gradients follow similar bounds; for instance, $\partial f / \partial c_i = B_i(x) \in [0, 1]$. This means that, in KANs, both activations and gradients remain within predictable, input-independent ranges. This makes it easier to select an appropriate quantization range, which reduces quantization error for gradient updates and significantly improves learning stability.
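The bound can be derived in one line, assuming the standard B-spline properties $B_i(x) \geq 0$ and $\sum_i B_i(x) = 1$:

\[f(x) = \sum_{i=1}^{S+1} c_i B_i(x) \leq \left(\max_j c_j\right) \sum_{i=1}^{S+1} B_i(x) = \max_j c_j,\]

and symmetrically $f(x) \geq \min_j c_j$. As a quick numeric check with hypothetical values $c = (-1, 2, 0.5)$ and $B = (0.125, 0.75, 0.125)$: $f = -0.125 + 1.5 + 0.0625 = 1.4375$, which lies in $[-1, 2]$.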

Implementation

Concretely, the forward pass for a KAN spline works as follows. For a given input $x$, we first compute its interval index $i$ and its offset $x_o = x - x_\text{interval start}$ within that interval. We then look up the values of the $S+1$ shared basis shapes, $f_1(x_o), f_2(x_o), \ldots, f_{S+1}(x_o)$ (here $f_j$ denotes the $j$-th local basis shape, not the activation output), and take their linear combination using interval $i$’s coefficients (which are themselves retrieved dynamically).
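The forward pass can be modeled in software roughly as follows. This is a floating-point sketch under several assumptions: degree $S = 2$, a uniform grid on $[-1, 1]$, an offset normalized by the cell width, and a hypothetical 6-bit shape table. None of these constants come from the paper.

```python
S = 2              # spline degree (assumed for this sketch)
G = 8              # number of grid cells on [LO, HI]
LO, HI = -1.0, 1.0
OFFSET_BITS = 6    # hypothetical resolution of the within-cell offset


def shapes(u):
    """The S + 1 = 3 quadratic B-spline pieces on a unit cell, u in [0, 1]."""
    return ((1 - u) ** 2 / 2, (-2 * u * u + 2 * u + 1) / 2, u * u / 2)


# One shape table shared by every cell: a row of S + 1 values per quantized offset.
SHAPE_LUT = [shapes(k / (1 << OFFSET_BITS)) for k in range(1 << OFFSET_BITS)]


def forward(x, coeffs):
    """Evaluate one activation; coeffs holds G + S trainable values."""
    h = (HI - LO) / G
    x = min(max(x, LO), HI)                       # clamp to the grid
    i = min(int((x - LO) / h), G - 1)             # interval index
    u = (x - (LO + i * h)) / h                    # normalized offset in the cell
    k = min(int(u * (1 << OFFSET_BITS)), (1 << OFFSET_BITS) - 1)  # LUT address
    basis = SHAPE_LUT[k]
    window = coeffs[i:i + S + 1]                  # coefficients active on cell i
    return sum(c * b for c, b in zip(window, basis)), i


y, cell = forward(0.3, [0.5] * (G + S))
print(y, cell)   # constant coefficients give a constant output; x = 0.3 is in cell 5
```

The work per evaluation is one table read, $S+1$ coefficient reads, and $S+1$ multiply-adds, regardless of `G`.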

To compute parameter gradients during the backward pass, we use standard backpropagation techniques. Specifically, we precompute the B-spline derivatives in LUTs and reuse the activations stored during the forward pass. Gradients are then directly multiplied by a fixed learning rate and added to model parameters.
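A floating-point sketch of one online update for a single activation is shown below. It assumes degree $S = 2$, a uniform grid, a squared-error loss, and plain SGD; the derivative values are computed directly here, whereas the text describes reading them from precomputed LUTs. All names and constants are hypothetical.

```python
S, G, LO, HI = 2, 8, -1.0, 1.0
H = (HI - LO) / G   # cell width


def locate(x):
    """Return the interval index and normalized offset of x."""
    x = min(max(x, LO), HI)
    i = min(int((x - LO) / H), G - 1)
    return i, (x - (LO + i * H)) / H


def shapes(u):
    """Quadratic B-spline pieces on a unit cell."""
    return ((1 - u) ** 2 / 2, (-2 * u * u + 2 * u + 1) / 2, u * u / 2)


def shape_derivs(u):
    """Derivatives of the pieces with respect to x (chain rule: du/dx = 1 / H)."""
    return (-(1 - u) / H, (1 - 2 * u) / H, u / H)


def train_step(x, target, coeffs, lr=0.5):
    """One forward pass, backward pass, and in-place SGD update."""
    i, u = locate(x)
    basis = shapes(u)
    window = coeffs[i:i + S + 1]
    y = sum(c * b for c, b in zip(window, basis))
    err = y - target                     # dL/dy for L = 0.5 * (y - target)**2
    grad_x = err * sum(c * d for c, d in zip(window, shape_derivs(u)))
    for j, b in enumerate(basis):        # only S + 1 coefficients are touched
        coeffs[i + j] -= lr * err * b    # dL/dc_{i+j} = err * b_j(u)
    return y, grad_x                     # grad_x would feed the previous layer


coeffs = [0.0] * (G + S)
for _ in range(50):
    y, _ = train_step(0.3, 1.0, coeffs)
print(y, sum(1 for c in coeffs if c != 0.0))   # y approaches 1.0; 3 coefficients changed
```

Because each sample touches only $S+1$ coefficients, the update logic stays the same size as `G` grows.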

[Figure: ECLAIR on-chip learning architecture. Only a small fraction of basis functions are computed for the KAN activation, compared to dense matmuls for MLPs.]

Results

In the paper, we demonstrate that KAN-based online learners can scale up to 50,000+ parameters while performing forward and backward passes at sub-microsecond latencies, which has, to our knowledge, not been achieved prior to this work for gradient-based learning. Additionally, compared to MLPs, KANs demonstrate significantly better hardware scaling (near-constant resource usage as we scale up $G$) and convergence on a variety of benchmarks, including function approximation, qubit readout, and non-stationary control.

Conclusion

This work shows that activations in Kolmogorov-Arnold Networks map naturally to lookup tables on FPGAs, allowing for extremely efficient nanosecond-latency inference. Additionally, properties of KAN activations, including B-spline locality and boundedness, allow for stable and sparse gradient updates on-FPGA. More broadly, we conclude that properties of Kolmogorov-Arnold Networks and their learned activations, while hard to exploit on GPUs, should be explored further on custom hardware accelerators.