ai_research_assistant_rnd_projects/proxy_block_cage_attention_introduction.md at main · iqbal1980/ai_research_assistant_rnd_projects
iqbal1980 · 2026-06-24 · via Show HN

Proxy Block-CAGE: Working with ChatGPT as a research assistant to create a new type of Sparse Attention algorithm

I started this project with a question that was half technical and half personal:

Could I use ChatGPT as a serious research partner, not just as a code generator, to help me explore a core AIML research problem?

I am Iqbal Addou, a PhD student in Bioinformatics and Computational Biology at George Mason University. I have about 23 years of software engineering experience and about 5 years of applied AIML engineering experience, mostly in computer vision. I know PyTorch, training workflows, and how to use modern AI frameworks, but I do not come from a long background in core model research. I am trying to pivot toward that world: efficient inference, sparse attention, long-context systems, and the mathematical/computational side of Transformers.

In my PhD work I have used genetic programming, genetic algorithms, optimization techniques, and physics-informed neural networks. Those methods are not always fashionable in modern LLM work, but I have learned to respect them because they force you to think in terms of search spaces, objectives, constraints, and failure modes. So I wanted to see if some of that mindset could be useful in a Transformer problem.

The first prompt was intentionally broad and a little unusual. I asked ChatGPT to help me find a computationally and mathematically intensive problem in LLMs and Transformers, and to explore out-of-the-box methods such as genetic programming, genetic algorithms, classical AI/ML, and older optimization ideas. I wanted something that could be formulated mathematically, implemented locally, and tested on my own hardware.

The project eventually became Proxy Block-CAGE, a small sparse-attention prototype for long-context prefill. The useful version is simple:

Split Q/K/V into blocks, use cheap pooled query/key block summaries to select a few key blocks per query block, then compute exact local attention only inside those selected key/value blocks.

The strongest result I got was on my laptop GPU: an NVIDIA RTX 2080 Max-Q 8GB. In a tiny 2-layer character-level Transformer trained on a custom text corpus, at 16,384 tokens, Proxy Block-CAGE ran generation-style prefill 2.61x faster than dense PyTorch attention, while letting each query attend to at most 64 of the 16,384 key positions (0.39% attention density) and preserving 100% final-token top-1 agreement and 100% top-5 containment on that small eval run.

This is not a production LLM claim. It is a reproducible research prototype and a request for honest feedback.


The result in one table

The most important runs were on a trained 2-layer character-level Transformer, using CUDA fp16 on the RTX 2080 Max-Q.

| Context length | Dense prefill | Proxy Block-CAGE, capacity 64 | Speedup | Attention density | Final-token top-1 agreement | Final-token top-5 containment |
| --- | --- | --- | --- | --- | --- | --- |
| 8,192 | 7.925 ms | 5.263 ms | 1.51x | 0.781% | 95% | 100% |
| 12,288 | 14.980 ms | 7.018 ms | 2.13x | 0.521% | 100% | 100% |
| 16,384 | 24.301 ms | 9.323 ms | 2.61x | 0.391% | 100% | 100% |

For the 16,384-token run:

dense prefill_last_ms_median:        24.3009 ms
Proxy Block-CAGE 64 prefill median:   9.3229 ms
speedup:                              2.61x
density:                              0.00390625 = 0.390625%
final-token top-1 agreement:          100%
final-token top-5 containment:        100%

The model was tiny, the corpus was small, and the eval sample count was small. I am treating this as a stronger sanity check than random tensors, not as evidence that this works for production LLMs.
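One plausible reading of the two quality columns is sketched below in plain Python. Top-1 agreement checks whether the sparse model's most likely final token matches the dense model's, and top-5 containment checks whether the dense top-1 token appears among the sparse model's five most likely tokens. The function names and these exact definitions are illustrative assumptions, not taken from the script.

```python
def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def final_token_metrics(dense_logits, sparse_logits, k=5):
    """Compare final-token logits of a dense and a sparse run (assumed definitions)."""
    dense_top1 = top_k_indices(dense_logits, 1)[0]
    sparse_topk = top_k_indices(sparse_logits, k)
    return {
        "top1_agree": dense_top1 == sparse_topk[0],
        "topk_contains": dense_top1 in sparse_topk,
    }


def summarize(pairs, k=5):
    """Fraction of eval samples with top-1 agreement and top-k containment."""
    results = [final_token_metrics(d, s, k) for d, s in pairs]
    n = len(results)
    agree = sum(r["top1_agree"] for r in results) / n
    contain = sum(r["topk_contains"] for r in results) / n
    return agree, contain
```

With only a handful of eval samples, each sample moves these fractions by several percentage points, which is one reason to read them as sanity checks.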


Why attention is expensive

After projection, a Transformer layer has:

$$ Q,K,V \in \mathbb{R}^{B \times H \times n \times d} $$

where:

B = batch size
H = number of attention heads
n = sequence length
d = head dimension

Dense causal attention computes:

$$ s_{ij} = \frac{q_i^\top k_j}{\sqrt d} $$

for valid causal pairs:

$$ j \le i $$

Then:

$$ a_{ij} = \frac{\exp(s_{ij})}{\sum_{\ell \le i}\exp(s_{i\ell})} $$

and:

$$ y_i = \sum_{j \le i} a_{ij}v_j $$

The compute is approximately:

$$ O(BHn^2d) $$
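As a reference point, the formulas above can be written as a deliberately slow single-head implementation in plain Python. This is a minimal sketch for checking small cases by hand, not the PyTorch code used in the experiments; the function name is hypothetical.

```python
import math


def dense_causal_attention(q, k, v):
    """Single-head causal attention on lists of vectors (each n x d)."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # s_ij = q_i . k_j / sqrt(d) for the causal range j <= i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Subtract the row maximum before exp for numerical stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        # y_i = sum_j a_ij v_j
        y = [
            sum(w / z * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ]
        out.append(y)
    return out
```

The two nested loops over `i` and `j` are where the quadratic cost in the sequence length comes from.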

At 16,384 tokens, one head has:

$$ \frac{n(n+1)}{2} = \frac{16384 \cdot 16385}{2} = 134{,}225{,}920 $$

causal token-token score locations. With 4 heads:

$$ 4 \cdot 134{,}225{,}920 = 536{,}903{,}680 $$

That is about 537 million query-key score locations for one layer.
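The counts above are easy to reproduce. A small sketch (the helper name is hypothetical):

```python
def causal_score_locations(n, heads=1):
    """Number of (query, key) pairs with key index <= query index."""
    return heads * n * (n + 1) // 2


for n in (8192, 12288, 16384):
    print(n, causal_score_locations(n, heads=4))
```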

Dense attention is extremely optimized in PyTorch and FlashAttention-style kernels, so reducing arithmetic does not automatically mean a faster wall-clock runtime. But the arithmetic shows why long context is a natural place to look for sparsity.


The first idea: evolve attention cages

The original CAGE idea was not block routing. It was genetic/evolutionary attention-mask search.

For a query token $i$, instead of attending to every previous key, define a small selected set:

$$ C_i \subseteq \{0,1,\dots,i\} $$

Then compute attention only inside that selected set:

$$ y_i^{\text{CAGE}} = \sum_{j\in C_i} \operatorname{softmax}_{j\in C_i} \left( \frac{q_i^\top k_j}{\sqrt d} \right) v_j $$

In the genetic algorithm version, an individual chromosome was a list of selected key indices. For example:

query token i = 100
candidate cage C_i = [2, 7, 19, 44, 88, 91, 97, 100]

That candidate means token 100 is only allowed to attend to those key/value positions.

A simple fitness function was:

$$ f(C_i) = \log \sum_{j\in C_i} \exp\left(\frac{q_i^\top k_j}{\sqrt d}\right) $$

The GA operations were ordinary evolutionary operations:

selection: keep better cages
elitism: preserve the best cages
crossover: mix two cages
mutation: replace some indices
repair: remove duplicates and enforce causal validity

This was useful conceptually and terrible computationally. Token-level CAGE required one GA search per query token. At 512 tokens, batch 1, 4 heads, that means:

$$ 1 \cdot 4 \cdot 512 = 2048 $$

separate GA units.

In one early benchmark, at only 512 tokens, token-level CAGE took about 7.29 seconds, while dense PyTorch attention took about 0.25 ms. That failure was important. It told me that the idea had to move from token-level search to something coarser.
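The GA operations listed above can be sketched for a single query token as follows. This is an illustrative reconstruction, not the original prototype: the function names, population size, mutation rate, and generation count are all assumptions, and `scores[j]` stands for the already scaled score of key `j` for the current query.

```python
import math
import random


def fitness(cage, scores):
    """Log-sum-exp of the scores inside the cage (higher is better)."""
    top = max(scores[j] for j in cage)
    return top + math.log(sum(math.exp(scores[j] - top) for j in cage))


def repair(cage, i, size, rng):
    """Remove duplicates and non-causal indices, then refill to `size`."""
    valid = sorted({j for j in cage if 0 <= j <= i})
    rng.shuffle(valid)
    valid = valid[:size]
    chosen = set(valid)
    pool = [j for j in range(i + 1) if j not in chosen]
    rng.shuffle(pool)
    while len(valid) < size and pool:
        valid.append(pool.pop())
    return sorted(valid)


def crossover(a, b, rng):
    """Child takes each slot from one of the two parents at random."""
    return [rng.choice(pair) for pair in zip(a, b)]


def mutate(cage, i, rate, rng):
    """Replace some indices with random causal positions."""
    return [rng.randint(0, i) if rng.random() < rate else j for j in cage]


def evolve_cage(scores, i, size, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [repair([], i, size, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, scores), reverse=True)
        elite = pop[: pop_size // 4]  # selection and elitism
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = mutate(crossover(a, b, rng), i, 0.2, rng)
            children.append(repair(child, i, size, rng))
        pop = elite + children
    return max(pop, key=lambda c: fitness(c, scores))
```

Even this toy version evaluates hundreds of candidate cages per query token, which makes the gap to a single fused dense kernel unsurprising.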


The second idea: evolve key blocks instead of tokens

The next step was to group the sequence into blocks.

A key block is just a fixed-size contiguous group of key vectors after projection. A Transformer forms:

$$ Q=XW_Q, \qquad K=XW_K, \qquad V=XW_V $$

For each token position $j$, there is a key vector $k_j$ and a value vector $v_j$. With:

key_block_size = 32

we get:

key block 0 = token positions 0..31
key block 1 = token positions 32..63
key block 2 = token positions 64..95
...

Mathematically:

$$ R_s = \{sb_k, sb_k+1, \dots, (s+1)b_k-1\} $$

where $b_k$ is the key block size.

A key block is not a paragraph, sentence, document chunk, or learned memory slot. In this prototype it is simply a contiguous range of key/value vectors.

Instead of evolving token cages, the block version evolves:

$$ C_r \subseteq \{0,1,\dots,N_k-1\} $$

where $C_r$ is the set of selected key blocks for query block $r$, and $N_k$ is the number of key blocks.

Example:

query block = tokens 9600..9631
candidate block cage = [41, 299]

If each key block has 32 tokens, selecting 2 key blocks gives each query token at most:

$$ 2 \cdot 32 = 64 $$

candidate key/value positions.

This reduced the number of GA searches from one per query token to one per query block. It was a better direction, but it was still too slow in Python.
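The block indexing in the example above can be written out directly. A small sketch with hypothetical helper names, using the block size of 32 from the text:

```python
def key_block_range(s, block_size=32):
    """Token positions covered by key block s: s*b_k .. (s+1)*b_k - 1."""
    return range(s * block_size, (s + 1) * block_size)


def block_of(token, block_size=32):
    """Index of the block that contains a token position."""
    return token // block_size


def cage_positions(block_cage, block_size=32):
    """Expand a list of selected key blocks into key/value token positions."""
    return [j for s in sorted(block_cage) for j in key_block_range(s, block_size)]


positions = cage_positions([41, 299])
print(block_of(9600), len(positions), positions[0], positions[-1])
```

For the query block covering tokens 9600..9631, the cage `[41, 299]` expands to positions 1312..1343 and 9568..9599, all of which lie in the past of every query token in that block.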


The mathematical correction: the GA objective collapsed to top-k

The block-level objective I used was approximately:

$$ F(C_r) = \log\sum_{s\in C_r}\exp(A_{r,s}) $$

where $A_{r,s}$ is the score for key block $s$ with respect to query block $r$.

For a fixed cage size $m$, this objective does not need a genetic algorithm. The exact solution is:

$$ C_r^* = \operatorname{TopK}_s(A_{r,s}, m) $$

The proof is simple. Suppose a selected block $a\in C_r$ has lower score than an unselected block $b\notin C_r$:

$$ A_{r,b} > A_{r,a} $$

Then:

$$ \exp(A_{r,b}) > \exp(A_{r,a}) $$

Replacing $a$ with $b$ increases:

$$ \sum_{s\in C_r}\exp(A_{r,s}) $$

and therefore increases:

$$ \log\sum_{s\in C_r}\exp(A_{r,s}) $$

So any optimum must consist of the $m$ highest-scoring blocks (up to ties).

This was one of the most useful moments in the project. The genetic algorithm helped me frame the problem, but the math killed the GA for this simple objective.

That changed the story from:

evolve attention cages with GA

to:

use the cage formulation, but select blocks with exact top-k
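The top-k claim can also be checked by brute force on a small case: enumerate every cage of size $m$ and compare the best one with the top-$m$ scores. The sketch below uses made-up random scores and hypothetical helper names.

```python
import itertools
import math
import random


def lse(scores, cage):
    """Block-level objective: log-sum-exp of the selected block scores."""
    return math.log(sum(math.exp(scores[s]) for s in cage))


def top_k(scores, m):
    """Indices of the m largest scores."""
    return sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)[:m]


rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(10)]
m = 3

# Exhaustive search over all C(10, 3) = 120 cages of size m.
best_by_search = max(
    itertools.combinations(range(len(scores)), m),
    key=lambda cage: lse(scores, cage),
)
best_by_topk = top_k(scores, m)
print(sorted(best_by_search), sorted(best_by_topk))
```

The two index sets coincide, which is the exchange argument from the proof in executable form.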

The third idea: avoid the full token-score matrix

Exact top-k block selection helped, but the early version still computed full token-pair scores before selecting blocks. That defeats the purpose.

The next refinement was Proxy Block-CAGE.

Instead of scoring all token pairs first, I pool each query block and key block into summaries:

$$ \bar q_r = \frac{1}{|U_r|}\sum_{i\in U_r} q_i $$

$$ \bar k_s = \frac{1}{|R_s|}\sum_{j\in R_s} k_j $$

Then score query block $r$ against key block $s$:

$$ S_{r,s} = \frac{\bar q_r^\top \bar k_s}{\sqrt d} $$

Then select:

$$ C_r = \operatorname{TopK}_s(S_{r,s}, m) $$

where $m$ is the number of selected key blocks.

In the best default configuration:

query_block_size = 32
key_block_size   = 32
block_cage_size  = 2
selected capacity = 64 tokens
pool             = mean

At 16,384 tokens:

$$ N_q = N_k = \frac{16384}{32} = 512 $$

The number of proxy block scores across 4 heads is:

$$ 4 \cdot 512 \cdot 512 = 1{,}048{,}576 $$

The local selected token-score upper bound is:

$$ 4 \cdot 16384 \cdot 64 = 4{,}194{,}304 $$

So the approximate score locations are:

$$ 1{,}048{,}576 + 4{,}194{,}304 = 5{,}242{,}880 $$

Dense causal attention at the same size has about:

$$ 536{,}903{,}680 $$

score locations across 4 heads.

The ratio is:

$$ \frac{536{,}903{,}680}{5{,}242{,}880} \approx 102.4 $$

So at 16K context, this configuration touches roughly 100x fewer score locations than dense causal attention. That does not translate to 100x wall-clock speedup because dense attention kernels are highly optimized and this prototype is not. But it explains why the sparse method begins to win at long context.
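The score-location arithmetic for this configuration can be reproduced with a few lines of Python. The helper names are hypothetical, and the local term is the same upper bound used above (it ignores causal trimming inside the selected blocks).

```python
def proxy_block_cage_score_locations(n, heads, q_block, k_block, cage_size):
    """Return (proxy block scores, upper bound on local token scores)."""
    n_q, n_k = n // q_block, n // k_block
    proxy = heads * n_q * n_k
    local = heads * n * cage_size * k_block
    return proxy, local


def dense_causal_score_locations(n, heads):
    """Causal (query, key) pairs across all heads."""
    return heads * n * (n + 1) // 2


proxy, local = proxy_block_cage_score_locations(16384, 4, 32, 32, 2)
dense = dense_causal_score_locations(16384, 4)
ratio = dense / (proxy + local)
print(proxy, local, dense, round(ratio, 1))
```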


Proxy Block-CAGE algorithm

For each layer, batch item, and attention head:

  1. Compute $Q, K, V$.

  2. Split query positions into query blocks $U_r$.

  3. Split key/value positions into key blocks $R_s$.

  4. Mean-pool each query block and key block:

    $$ \bar q_r = \frac{1}{|U_r|}\sum_{i\in U_r}q_i $$

    $$ \bar k_s = \frac{1}{|R_s|}\sum_{j\in R_s}k_j $$

  5. Compute proxy block scores:

    $$ S_{r,s}=\frac{\bar q_r^\top \bar k_s}{\sqrt d} $$

  6. Apply a causal block mask so query blocks cannot route to future-only key blocks.

  7. Select the top $m$ key blocks:

    $$ C_r = \operatorname{TopK}_s(S_{r,s},m) $$

  8. Expand selected key blocks into selected key/value token positions.

  9. For every query token $i\in U_r$, compute exact causal attention only over selected positions:

    $$ J_i = \{j \in \cup_{s\in C_r}R_s : j \le i\} $$

    $$ a_{ij}^{\text{CAGE}} = \frac{ \exp\left(q_i^\top k_j / \sqrt d\right) }{ \sum_{\ell\in J_i} \exp\left(q_i^\top k_\ell / \sqrt d\right) } $$

    $$ y_i^{\text{CAGE}} = \sum_{j\in J_i} a_{ij}^{\text{CAGE}}v_j $$

The approximation is in the selection of which blocks to include. Once blocks are selected, the local attention inside the selected cage is exact softmax attention.
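The nine steps can be sketched for a single head in plain Python. This is a slow, list-based illustration rather than the PyTorch implementation in the script; it assumes equal query and key block sizes and a sequence length that is a multiple of the block size, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(rows):
    """Column-wise mean of a block of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def softmax_attend(q_i, keys, values, d):
    """Exact softmax attention of one query over the given key/value rows."""
    scores = [dot(q_i, k_j) / math.sqrt(d) for k_j in keys]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    z = sum(w)
    return [
        sum(w_j / z * v_j[c] for w_j, v_j in zip(w, values))
        for c in range(len(values[0]))
    ]


def proxy_block_cage(q, k, v, block=32, cage_size=2):
    n, d = len(q), len(q[0])
    n_blocks = n // block
    # Steps 2-4: split into blocks and mean-pool each block.
    q_bar = [mean_pool(q[r * block:(r + 1) * block]) for r in range(n_blocks)]
    k_bar = [mean_pool(k[s * block:(s + 1) * block]) for s in range(n_blocks)]
    out = []
    for r in range(n_blocks):
        # Step 6: with equal block sizes, key blocks s > r are future-only.
        allowed = range(r + 1)
        # Steps 5 and 7: rank allowed key blocks by proxy score, keep the top m.
        ranked = sorted(
            allowed,
            key=lambda s: dot(q_bar[r], k_bar[s]) / math.sqrt(d),
            reverse=True,
        )
        cage = sorted(ranked[:cage_size])
        # Step 8: expand the selected blocks into token positions.
        positions = [j for s in cage for j in range(s * block, (s + 1) * block)]
        # Step 9: exact causal attention restricted to the cage.
        for i in range(r * block, (r + 1) * block):
            j_i = [j for j in positions if j <= i]
            out.append(
                softmax_attend(q[i], [k[j] for j in j_i], [v[j] for j in j_i], d)
            )
    return out
```

One useful sanity check on a sketch like this: when `cage_size` is at least the number of blocks, every allowed block is selected and the output matches dense causal attention.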


Complexity

Dense attention is:

$$ O(BHn^2d) $$

Proxy Block-CAGE has two main costs.

First, all block-pair proxy scoring:

$$ O\left(BH d \frac{n^2}{b_qb_k}\right) $$

Second, exact local attention inside selected blocks:

$$ O(BHnmb_kd) $$

where:

b_q = query block size
b_k = key block size
m   = number of selected key blocks

So the total is approximately:

$$ O\left( BHd\left[ \frac{n^2}{b_qb_k} + nmb_k \right] \right) $$

With:

b_q = 32
b_k = 32
m   = 2

this becomes:

$$ O\left(BHd\left[\frac{n^2}{1024}+64n\right]\right) $$

This is important: with fixed block size, Proxy Block-CAGE is still technically quadratic because of the all-block-pair proxy score matrix. It is not truly linear attention. It is block-reduced quadratic routing plus sparse exact local attention.

That is why I am not claiming this beats systems like SubQ/SSA or other large-scale sparse attention approaches. The current method is a simpler, reproducible prototype with a clear speed/quality tradeoff on consumer hardware.
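To make the "still quadratic" point concrete, the two bracketed terms can be evaluated for the default configuration. This is an operation-count sketch only, with a hypothetical helper name; it says nothing about wall-clock time.

```python
def cost_terms(n, q_block=32, k_block=32, cage_size=2):
    """Per-head, per-dimension cost terms from the complexity bracket."""
    routing = n * n / (q_block * k_block)  # all block-pair proxy scores
    local = n * cage_size * k_block        # exact attention inside the cage
    return routing, local


for n in (16384, 65536, 262144):
    routing, local = cost_terms(n)
    print(n, routing, local, routing / local)
```

Setting $n^2/(b_q b_k) = n m b_k$ gives $n = m b_q b_k^2 = 65{,}536$ for this configuration. Below that length the local term is the larger of the two counts; above it, the quadratic routing term is.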


The prompting loop that mattered

The useful part of ChatGPT was not one magic prompt. It was the iterative loop.

The pattern was:

1. Formulate a strange research direction.
2. Turn it into math.
3. Implement a runnable PyTorch prototype.
4. Run it locally on my RTX 2080.
5. Paste the benchmark output back.
6. Interpret the result honestly.
7. Kill or simplify what failed.
8. Design the next benchmark.

The most important instruction I kept giving was essentially:

Do not just defend the idea. Tell me what the benchmark proves, what it falsifies, and what to try next.

That mattered because many things failed:

Token-level GA attention: too slow.
Block-level GA attention: still too slow.
Exact full-score top-k: cleaner, but too memory-heavy.
Proxy block selection: finally promising.
Training/backward: not ready.
GP local kernels: mathematically interesting, not faster in PyTorch yet.

The process felt less like asking for answers and more like having a tireless research engineer who could help me keep reformulating the problem.


The experiment progression

1. Token-level genetic attention failed

The first prototype evolved a cage of key tokens for every query token. It worked conceptually, but it was absurdly slow.

At 512 tokens:

dense PyTorch attention: ~0.25 ms
token-level CAGE:       ~7289 ms

This killed token-level GA as a practical implementation.

2. Block-level GA reduced the search count but was still too slow

Moving from tokens to blocks helped because the number of GA searches dropped from one per query token to one per query block.

But the Python/GA implementation still took seconds at modest context lengths.

3. Top-k replaced GA for the simple block objective

The block-level log-sum-exp objective had an exact top-k solution. This was a useful negative result: GA was unnecessary for that objective.

4. Proxy block selection avoided full token-pair scoring

The key practical step was to score pooled query/key block summaries rather than materializing the full token-token score matrix before selection.

In an attention-only benchmark, Proxy Block-CAGE crossed over around 6K context and became faster at longer contexts.

At 16,384 tokens:

dense attention:       ~14.98 ms
Proxy Block-CAGE 64:    ~3.97 ms

5. The speedup survived inside a Transformer block

A more realistic block benchmark included embedding, positional embedding, LayerNorm, QKV projection, attention, output projection, MLP, and residuals.

At 16,384 tokens:

dense block:           ~18.78 ms
Proxy Block-CAGE 64:    ~5.74 ms
hidden-state MSE:       ~0.001

6. The speedup survived a trained real-text sanity check

The final sanity check trained a tiny 2-layer character-level Transformer on a custom corpus, copied the same trained weights into Proxy Block-CAGE variants, and measured generation-style prefill.

The best 16,384-token real-text result was:

dense prefill:         ~24.30 ms
Proxy Block-CAGE 64:    ~9.32 ms
speedup:                2.61x
density:                0.39%
top-1 agreement:        100%
top-5 containment:      100%

Again: this is not production LLM evidence. It is a tiny trained model sanity check.


Genetic programming for local attention kernels

After Proxy Block-CAGE started working, I tried another out-of-the-box idea: use genetic programming to search for cheaper local attention kernels.

Standard softmax uses:

$$ g(x)=\exp(x) $$

where:

$$ x_{ij}=s_{ij}-\max_{\ell\in J_i} s_{i\ell} $$

The GP search evolved scalar functions $g(x)$ using primitives such as:

+ - * / max min avg exp log cos sin x^2 x^3 1/x 1/x^2

The normalized GP attention was:

$$ a_{ij}^{GP} = \frac{g(x_{ij})}{\sum_{\ell\in J_i}g(x_{i\ell})} $$

The search rediscovered $\exp(x)$ as the accuracy optimum, which is expected because the target was softmax attention. More interestingly, it found simple non-exp approximations such as:

$$ g(x)=\frac{1}{(1-x)^2} $$

$$ g(x)=\max(1+0.25x,0.125)^2 $$

and a polynomial/max candidate:

$$ g(x)=\max\left(1.5+x,\max(1+0.25x,0.125)^2\right) $$

These are mathematically interesting because they avoid expensive transcendental functions. But when inserted into Proxy Block-CAGE in unfused PyTorch, they were not faster than local softmax. At 16,384 tokens, Proxy Block-CAGE with local softmax remained the best version:

dense:                  ~23.55 ms
CAGE + local softmax:     ~9.91 ms
CAGE + rational kernel:  ~11.46 ms
CAGE + clipquad kernel:  ~11.34 ms
CAGE + polymax kernel:   ~11.89 ms

I am treating the GP-kernel work as an early negative result and a possible future direction. It may only become useful in a fused Triton/CUDA kernel where the operation count matters more directly.
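For readers who want to compare the kernels numerically, the normalized weights from the formula above can be computed in a few lines of plain Python. The three non-exp functions are the ones listed in this section; the dictionary, the helper name, and the example scores are illustrative assumptions.

```python
import math

KERNELS = {
    "softmax": lambda x: math.exp(x),
    "rational": lambda x: 1.0 / (1.0 - x) ** 2,
    "clipquad": lambda x: max(1.0 + 0.25 * x, 0.125) ** 2,
    "polymax": lambda x: max(1.5 + x, max(1.0 + 0.25 * x, 0.125) ** 2),
}


def kernel_weights(scores, g):
    """Weights a_j = g(x_j) / sum_l g(x_l), with x_j = s_j - max(s)."""
    top = max(scores)
    # x <= 0 after the shift, so 1 - x >= 1 and the rational kernel is finite.
    raw = [g(s - top) for s in scores]
    z = sum(raw)
    return [r / z for r in raw]


scores = [2.0, 1.0, 0.0, -3.0]
for name, g in KERNELS.items():
    print(name, [round(w, 3) for w in kernel_weights(scores, g)])
```

All four kernels are positive and non-decreasing on $x \le 0$, so they preserve the ranking of the scores; they differ in how quickly weight decays for low-scoring keys.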


How to run the final script

This article is paired with a single Python file:

proxy_block_cage_final.py

No separate vocabulary file or token file is needed.

The script builds a character-level vocabulary directly from the text corpus. It also includes an embedded fallback corpus, so it can run even if no text file is provided.

Requirements

Python 3.10+
PyTorch with CUDA recommended

The experiments above used:

Python: 3.14.6
PyTorch: 2.12.1+cu126
GPU: NVIDIA GeForce RTX 2080 with Max-Q Design
CUDA from PyTorch: 12.6
dtype: fp16

Optional: extract the embedded corpus

python proxy_block_cage_final.py --write-embedded-corpus my_corpus.txt

This writes the embedded corpus to my_corpus.txt and exits.

Quick smoke test

python proxy_block_cage_final.py \
  --device cuda \
  --dtype fp16 \
  --seq-len 512 \
  --batch 4 \
  --train-steps 50 \
  --eval-iters 5 \
  --model-dim 128 \
  --heads 4 \
  --layers 1 \
  --block-cage-sizes 2 \
  --kernels softmax \
  --output-dir quick_cage_run

Reproduce the 16K-style run

python proxy_block_cage_final.py \
  --device cuda \
  --dtype fp16 \
  --seq-len 16384 \
  --batch 1 \
  --train-steps 200 \
  --eval-iters 8 \
  --model-dim 256 \
  --heads 4 \
  --layers 2 \
  --block-cage-sizes 2 \
  --kernels softmax \
  --warmups 2 \
  --repeats 20 \
  --output-dir cage_text_quality_16384_repeat

On Windows Command Prompt, use ^ instead of \ for line continuation:

python proxy_block_cage_final.py ^
  --device cuda ^
  --dtype fp16 ^
  --seq-len 16384 ^
  --batch 1 ^
  --train-steps 200 ^
  --eval-iters 8 ^
  --model-dim 256 ^
  --heads 4 ^
  --layers 2 ^
  --block-cage-sizes 2 ^
  --kernels softmax ^
  --warmups 2 ^
  --repeats 20 ^
  --output-dir cage_text_quality_16384_repeat

Use your own text file

python proxy_block_cage_final.py \
  --device cuda \
  --dtype fp16 \
  --text-file my_corpus.txt \
  --seq-len 8192 \
  --batch 1 \
  --train-steps 300 \
  --eval-iters 20 \
  --model-dim 256 \
  --heads 4 \
  --layers 2 \
  --block-cage-sizes 2 \
  --kernels softmax \
  --warmups 2 \
  --repeats 20 \
  --output-dir cage_text_quality_8192

The script writes:

cage_gp_kernel_quality.csv
cage_gp_kernel_quality.json

The JSON contains the environment, config, training history, and result rows.


What the code compares

The final script trains a tiny dense character-level Transformer, then evaluates:

dense
proxy_topk_softmax
proxy_topk_rational
proxy_topk_clipquad
proxy_topk_polymax

The main method is:

proxy_topk_softmax

That means Proxy Block-CAGE uses top-k block routing and exact local softmax inside the selected blocks.

The GP kernels are included as experimental variants:

rational: g(x) = 1 / (1 - x)^2
clipquad: g(x) = max(1 + 0.25x, 0.125)^2
polymax:  g(x) = max(1.5 + x, max(1 + 0.25x, 0.125)^2)

They are not the current speed winner.


What is novel here?

I do not want to overclaim novelty. Sparse attention is a large area. Related ideas include block-sparse attention, routing attention, local/global attention, token pruning, approximate attention, hierarchical sparse attention, and newer long-context systems.

The narrow contribution here is:

An independently developed, ChatGPT-assisted research prototype that starts from an evolutionary attention-mask formulation, discovers that the simple GA objective collapses to top-k, and converges to a training-free pooled block-routing sparse attention method with reproducible consumer-GPU long-context experiments.

I am not claiming to have invented sparse attention. I am claiming that this specific research path, implementation, and set of small but reproducible experiments are worth discussing.


Limitations

This project has many limitations.

First, the strongest trained-model experiments use a tiny character-level Transformer, not a pretrained LLM.

Second, the eval sample counts are small. The top-1/top-5 agreement numbers should be interpreted as sanity-check signals, not definitive language-model quality results.

Third, the implementation is a PyTorch prototype, not a fused CUDA/Triton kernel.

Fourth, memory usage is worse than dense attention in the current implementation. At 16K, the sparse version was faster, but peak memory was higher. This is likely due to gather/local-attention intermediate tensors.

Fifth, training/backward is not solved. In backward tests, dense attention was faster and used much less memory.

Sixth, the current routing step still scores all block pairs. With fixed block size, the algorithm is not truly linear-time attention. It is reduced quadratic block routing plus sparse local attention.

Seventh, the GP-discovered local kernels are not a current speed win in PyTorch. They are included as a negative/early result and possible future direction.


What I would like help with

I am sharing this because I want honest feedback and collaboration.

I am especially interested in hearing from people working on:

sparse attention
long-context inference
LLM systems
CUDA / Triton kernels
routing attention
approximate attention
GPU benchmarking
Transformer architecture research
AI-assisted scientific workflows

The next serious steps would be:

  1. Compare against stronger sparse-attention baselines.
  2. Test on a real pretrained model rather than a tiny character model.
  3. Add multiple seeds and stronger quality metrics.
  4. Implement the sparse local attention path as a fused Triton/CUDA kernel.
  5. Fix training/backward memory.
  6. Explore learned or cached routing.
  7. Revisit genetic algorithms only for richer non-top-k objectives, such as diversity, layout, locality, cache reuse, or latency-aware routing.

I am a GMU PhD student trying to pivot from applied AIML engineering into core AIML research. I would welcome academic collaboration, prior-art feedback, and serious technical criticism to refine and improve this work.

You can reach me at:

iaddou@gmu.edu
cto@ezducate.ai
https://www.linkedin.com/in/iqbaladdou/
https://x.com/IqbalAddou