GitHub - rayanht/alloy: Kernel authoring DSL, torch.compile backend and LLM serving for Apple Silicon.
rayanht · 2026-06-21 · via Show HN

Kernel authoring DSL, torch.compile backend and LLM serving for Apple Silicon.

Alloy is a compiler and runtime for GPU compute kernels on Apple Silicon. You write kernels in Python, and Alloy compiles them to Metal through a tile IR pipeline. It covers everything from per-thread scalar kernels to cooperative tiled GEMM with simdgroup MMA, plus automatic operator fusion for multi-kernel pipelines.

Status: technical preview. Requires Apple Silicon (M1+) and macOS 13+. The Python packages need Python 3.10–3.12.

Contents

  • Install
  • Inference server - Quickstart
    • Features
  • torch.compile backend
    • Training preview
  • Benchmarks
    • Causal LM Inference
    • Multimodal Inference
    • Embeddings Inference
  • Writing kernels
    • Tiled GEMM
    • Built-in ops
    • Kernel primitives
    • Automatic fusion
    • Framework interop
    • Inspect generated code
  • Why Alloy
  • Contributing
  • License

Install

Python (pip / uv)

pip install 'alloy-kit[serve]'   # local LLM server + CLI + torch.compile backend
pip install alloy-kit            # lean: just the GPU kernel compiler (no torch)
pip install 'alloy-kit[all]'     # + training / vision / audio research extras

# import alloy as al

The PyPI distribution is alloy-kit. The brackets are optional dependency groups: the lean base provides @al.kernel with the tile IR, MSL emitter and Metal dispatch machinery, and [serve] adds everything needed to run the server and the alloy CLI.

Standalone (no Python required):

curl -fsSL https://raw.githubusercontent.com/rayanht/alloy/main/installer/install.sh | sh

Installs a self-contained alloy CLI into /usr/local.

From source (contributors): see Contributing.

Inference server - Quickstart

Alloy serves a loopback HTTP API that's drop-in compatible with the OpenAI, Anthropic and Ollama clients.

Important

Run alloy tune <model> before serving for optimal performance.

# Start the server in the foreground; loads the model
# from a local Ollama cache or Hugging Face if present.

alloy serve -m qwen3:0.6b                                   # Ollama tag
alloy serve -m bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M  # HF model
# OpenAI:
curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"qwen3:0.6b","messages":[{"role":"user","content":"hi"}]}'

# Ollama:
curl http://127.0.0.1:11434/api/chat \
  -d '{"model":"qwen3:0.6b","messages":[{"role":"user","content":"hi"}]}'
# Claude Code
alloy launch claude

The default port is 11434. Pass --port 11435 to alloy serve (or set ALLOY_PORT) to override.
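The curl call to the OpenAI-style endpoint can also be made from Python with only the standard library. This is a minimal sketch, not part of Alloy: the helper names `build_chat_request` and `chat` are hypothetical, having the client read ALLOY_PORT is an assumption, and the response is assumed to follow the usual OpenAI chat-completions shape.

```python
import json
import os
import urllib.request


def build_chat_request(prompt, model="qwen3:0.6b", host="127.0.0.1", port=None):
    """Build (but do not send) a POST to /v1/chat/completions."""
    # Assumption: the client reuses ALLOY_PORT when it is set.
    # 11434 is the documented default port.
    port = int(port or os.environ.get("ALLOY_PORT", "11434"))
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first choice's text.

    Assumes an OpenAI-shaped reply: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With `alloy serve -m qwen3:0.6b` running, `chat("hi")` should return the model's reply as a string. Pass `port=11435` if the server was started on a non-default port.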

Features

| Feature | Status |
| --- | --- |
| Warm-prefix KV reuse (bookmarks + branching) | Stable |
| On-GPU sampling (temp / top-p / top-k / min-p / seed) | Stable |
| Constrained decoding (xgrammar JSON + tool grammars) | Stable |
| Tool calling (OpenAI / Anthropic / Ollama, per-family parsers) | Stable |
| Reasoning / thinking split | Stable |
| MoE inference | Stable (Qwen3.5-MoE) |
| Vision input | Stable (gemma4) |
| Audio input | Stable (gemma4) |
| Embeddings | Stable (nomic-embed-text) |
| Speculative decoding: PLD (prompt lookup) | Opt-in (--spec pld) |
| Speculative decoding: MTP | Opt-in (--spec mtp, Qwen3.5) |
| Speculative decoding: DFlash (block diffusion) | Opt-in (--spec dflash) |
| Paged KV cache | Opt-in (ALLOY_KV=paged) |
| KV cache quantization (int8 + fp16 scales) | Opt-in (--kv-quant q8_0) |
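For readers unfamiliar with prompt lookup decoding (the PLD option in the feature list), the idea can be sketched in a few lines. This illustrates the general technique, not Alloy's implementation: the function names and the `ngram` / `max_draft` parameters are hypothetical. The drafter looks for an earlier occurrence of the most recent n-gram in the context and proposes the tokens that followed it. The target model then verifies the draft in one pass and keeps the matching prefix.

```python
def propose_draft(tokens, ngram=3, max_draft=5):
    """Copy up to max_draft tokens that followed the latest earlier
    occurrence of the trailing n-gram; return [] when nothing matches."""
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left so the most recent match wins; skip the tail itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + max_draft]
    return []


def accepted_prefix(draft, verified):
    """Count the leading draft tokens that the target model agreed with."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


context = [1, 2, 3, 4, 5, 9, 1, 2, 3]
draft = propose_draft(context, ngram=3, max_draft=2)
print(draft)
```

Drafting costs no extra model calls, so the speed-up depends entirely on how often the output repeats spans of the prompt (code edits, quoting, summarization).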
Supported quantizations

Model weights

source format supported
GGUF Q4_K (Q4_K_M / Q4_K_S)
GGUF Q5_0
GGUF Q6_K
GGUF Q8_0
GGUF F16 / BF16 / F32
GGUF Q2_K / Q3_K / Q5_K
GGUF Q4_0 / Q4_1 / Q5_1
GGUF IQ1 / IQ2 / IQ3 / IQ4 (IQ4_XS, IQ4_NL)
MLX 4-bit affine (group size 64 / 128)
MLX 2-bit / 3-bit / 6-bit / 8-bit

KV cache

format supported
fp16 (default)
q8_0
q4 / other
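To make "int8 + fp16 scales" concrete, here is a NumPy sketch of block-wise 8-bit quantization in the style of GGUF Q8_0. The block size of 32 and the function names are assumptions for illustration. Alloy's actual KV layout and kernels are not shown in this README and may differ.

```python
import numpy as np

BLOCK = 32  # assumption: GGUF-style Q8_0 block size; Alloy's KV layout may differ


def quantize_q8_0(x):
    """Quantize a float array (size divisible by BLOCK) to int8 + fp16 scales."""
    blocks = np.asarray(x, dtype=np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = (amax / 127.0).astype(np.float16)
    # Divide by the rounded fp16 scale so dequantization uses the same value.
    # All-zero blocks get a divisor of 1 to avoid dividing by zero.
    safe = np.where(scales == 0, np.float16(1), scales).astype(np.float32)
    q = np.clip(np.rint(blocks / safe), -127, 127).astype(np.int8)
    return q, scales


def dequantize_q8_0(q, scales):
    """Recover float32 values: one multiply per element."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)
```

Under these assumptions each block of 32 values takes 34 bytes (32 int8 values plus one fp16 scale) instead of 64 bytes in fp16, and the round-trip error per element stays within about half a scale step.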

torch.compile backend

Alloy includes a torch.compile backend that compiles covered PyTorch FX graphs to fused Metal compute kernels.

import torch
import transformers
import alloy_torch  # registers the "alloy" backend

model = transformers.AutoModelForCausalLM.from_pretrained("gpt2").eval()
compiled = torch.compile(model, backend="alloy")

input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
output = compiled(input_ids=input_ids)

The backend handles: FX graph decomposition, operator fusion (RMSNorm, RoPE, GELU, batched QKV, GEMM+LayerNorm, scalar broadcast), GQA-native attention, compiled dispatch plans, and tuning.

Runnable model examples live in examples/torch/:

  • mlp.py — multi-layer perceptron (Linear / LayerNorm / GELU)
  • resnet.py — GroupNorm ResNet (Conv2d + residual blocks)
  • transformer.py — pre-norm encoder block (SDPA + GELU MLP)

Training preview

A full torch.compile training step (forward, backward, and the optimizer update) runs end to end through Alloy and matches PyTorch eager within floating-point tolerance for dense transformer-style models: embeddings, linear layers, normalization, residual blocks, attention, cross-entropy, and the common optimizers (SGD, Adam, AdamW, RMSprop). A small language model trains end to end, and LoRA fine-tuning of a pretrained transformer works in model.train(). Enable it before torch.compile:

import torch
import torch.nn as nn
import torch.nn.functional as F
import alloy_torch  # registers the "alloy" backend
from alloy_torch.training import set_training_mode

set_training_mode(True)  # before torch.compile

model = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 1))
step = torch.compile(model, backend="alloy")
opt = torch.optim.AdamW(model.parameters(), lr=0.05)

x, y = torch.randn(32, 64), torch.randn(32, 1)
for _ in range(20):
    opt.zero_grad()
    loss = F.mse_loss(step(x), y)
    loss.backward()
    opt.step()

Fine-tuning a pretrained transformer with PEFT LoRA is the same shape:

import peft
import transformers

model = peft.get_peft_model(
    transformers.AutoModelForCausalLM.from_pretrained("gpt2"),
    peft.LoraConfig(target_modules=["c_attn"], task_type="CAUSAL_LM"),
)
step = torch.compile(model, backend="alloy")
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=5e-3)

model.train()
for input_ids in batches:
    opt.zero_grad()
    loss = step(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    opt.step()

Runnable training examples live in examples/torch/.

Training is still a preview. The backward pass does not yet cover convolutions or pooling, so CNN training is not supported. Inference is the primary, fully validated path.

Benchmarks

Reproduce with alloy bench <HF_OR_OLLAMA_TAG>

Causal LM Inference

HF Models

| model | quant | pp512 | tg128 |
| --- | --- | --- | --- |
| LFM2.5-1.2B-Instruct-GGUF | Q4_K_M | 4222 | 508 |
| bartowski/Llama-3.2-3B-Instruct-GGUF | Q4_K_M | 2061 | 198 |
| Qwen_Qwen3-0.6B-GGUF | Q4_K_M | 8311 | 612 |

Ollama Models

| model | quant | pp512 | tg128 |
| --- | --- | --- | --- |
| qwen2.5:0.5b | Q4_K_M | 12102 | 505 |
| qwen3:0.6b | Q4_K_M | 10077 | 584 |
| llama3.2:1b | Q8_0 | 5653 | 324 |
| qwen3.5:0.8b | Q8_0 | 6141 | 349 |
| deepseek-r1:1.5b | Q4_K_M | 3295 | 274 |
| qwen2.5:1.5b | Q4_K_M | 3295 | 270 |
| qwen3.5:2b | Q8_0 | 3247 | 187 |
| gemma4:e2b | Q4_K_M | 2121 | 175 |
| qwen2.5:3b | Q4_K_M | 1617 | 185 |
| gemma4:e4b | Q4_K_M | 1079 | 115 |
| qwen3.5:4b | Q4_K_M | 1098 | 122 |
| qwen3.5:9b | Q4_K_M | 598 | 78.6 |
| qwen3.6:35b | Q4_K_M | 988 | 121 |

MLX Models

| model | quant | pp512 | tg128 |
| --- | --- | --- | --- |
| Qwen/Qwen3-0.6B-MLX-4bit | 4-bit g128 | 10063 | 710 |
| LiquidAI/LFM2.5-1.2B-Instruct-MLX-4bit | 4-bit g64 | 5688 | 589 |
| mlx-community/Llama-3.2-3B-Instruct-4bit | 4-bit g64 | 2173 | 220 |
| mlx-community/Qwen3-4B-4bit | 4-bit g64 | 1673 | 174 |
| mlx-community/Qwen3-8B-4bit | 4-bit g64 | 866 | 102 |
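The two throughput columns can be turned into a rough end-to-end latency estimate. The sketch below assumes that pp512 and tg128 are tokens per second for a 512-token prompt and a 128-token generation, following the llama-bench naming convention. The README does not state the unit for these tables, so treat that as an assumption. The helper name is hypothetical.

```python
def estimated_latency_s(prompt_tokens, new_tokens, pp_tok_s, tg_tok_s):
    """Prefill time plus decode time, ignoring model load and other overhead."""
    return prompt_tokens / pp_tok_s + new_tokens / tg_tok_s


# bartowski/Llama-3.2-3B-Instruct-GGUF (Q4_K_M): pp512 = 2061, tg128 = 198
seconds = estimated_latency_s(512, 128, pp_tok_s=2061, tg_tok_s=198)
print(f"{seconds:.2f} s")
```

For that row the estimate is a little under 0.9 s, with roughly three quarters of it spent in decode, which is why tg128 usually matters more than pp512 for short interactive prompts.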

Multimodal Inference

| model | vision ms | alloy TTFT | alloy dec | alloy wall |
| --- | --- | --- | --- | --- |
| gemma4:e2b | 229 | 455 | 172 | 1193 |
| gemma4:e4b | 257 | 665 | 99.0 | 1949 |

Embeddings Inference

Per-regime encoder tok/s from alloy bench nomic-embed-text --dataset embeddings.

| regime | batch | seq | tok/s |
| --- | --- | --- | --- |
| q_short | 1 | 10 | 5094 |
| q_long | 1 | 256 | 19161 |
| b8_short | 8 | 10 | 14142 |
| b8_long | 8 | 128 | 11840 |

Writing kernels

import numpy as np
import alloy as al

@al.kernel
def blur(src, dst: al.output, W: al.constexpr, H: al.constexpr):
    x = al.program_id(0)
    y = al.program_id(1)
    acc = 0.0
    count = 0
    for dy in range(-1, 2):
        for dx in range(-1, 2):
            nx = x + dx
            ny = y + dy
            if nx >= 0:
                if nx < W:
                    if ny >= 0:
                        if ny < H:
                            acc = acc + al.load(src + ny * W + nx)
                            count = count + 1
    al.store(dst + y * W + x, acc / count)

W, H = 1920, 1080
img = np.random.rand(H, W).astype(np.float32)

out = blur[W, H](img.ravel(), W=W, H=H)
print(np.asarray(out).reshape(H, W))

NumPy and PyTorch arrays can be bound directly as inputs for covered contiguous host-memory paths. The kernel's al.output is allocated automatically and returned as an AlloyBuffer — convert with np.asarray(...) or .numpy(). Some interop paths allocate Alloy-owned shared buffers or require layout copies, so this is not a blanket promise that every input type and view is no-copy.
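A CPU reference is handy for checking the blur kernel's output. The NumPy sketch below mirrors the kernel's logic (a 3x3 box filter that averages only the in-bounds neighbours). The name `blur_reference` is hypothetical and not part of Alloy.

```python
import numpy as np


def blur_reference(img):
    """3x3 box blur that averages only in-bounds neighbours, like the kernel."""
    h, w = img.shape
    padded = np.pad(img.astype(np.float32), 1)            # zero border
    valid = np.pad(np.ones((h, w), dtype=np.float32), 1)  # 1 inside, 0 outside
    acc = np.zeros((h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            # Shifted views: offset (dy, dx) here is (dy - 1, dx - 1) in the kernel.
            acc += padded[dy:dy + h, dx:dx + w]
            count += valid[dy:dy + h, dx:dx + w]
    return acc / count
```

The GPU result can then be compared with something like `np.allclose(np.asarray(out).reshape(H, W), blur_reference(img), atol=1e-5)`. The tolerance is a guess and may need loosening.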

More runnable examples live in examples/kernel/.

Tiled GEMM

@al.kernel
def matmul(A, B_T, C: al.output,
           BLOCK_M: al.constexpr = 64, BLOCK_N: al.constexpr = 64, BLOCK_K: al.constexpr = 16):
    M, K = A.shape
    N = B_T.shape[0]
    pm = al.program_id(0)
    pn = al.program_id(1)
    rm = pm * BLOCK_M + al.arange(0, BLOCK_M)
    rn = pn * BLOCK_N + al.arange(0, BLOCK_N)
    rk = al.arange(0, BLOCK_K)
    a_ptrs = A + rm[:, None] * K + rk[None, :]
    b_ptrs = B_T + rn[:, None] * K + rk[None, :]
    acc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)
    for k in range(0, K, BLOCK_K):
        # Mask the K tail against the current offset k so the last partial block stays in bounds.
        a = al.load(a_ptrs, mask=(rm[:, None] < M) & (k + rk[None, :] < K))
        b = al.load(b_ptrs, mask=(rn[:, None] < N) & (k + rk[None, :] < K))
        acc += al.tile_dot(a, b, transpose_rhs=True)
        a_ptrs += BLOCK_K
        b_ptrs += BLOCK_K
    al.store(C + rm[:, None] * N + rn[None, :], acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

This compiles to Metal with simdgroup matrix multiply-accumulate (MMA), cooperative tile loads, threadgroup shared memory staging, and optional double buffering, all generated automatically from the tile IR.
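The blocking scheme of the matmul kernel can be followed on the CPU with plain NumPy. The sketch below is a reference for what each program instance computes, not generated code. The function name is hypothetical, and array slicing stands in for the kernel's load and store masks.

```python
import numpy as np


def tiled_matmul_reference(A, B_T, block_m=64, block_n=64, block_k=16):
    """Compute C = A @ B_T.T one (block_m, block_n) output tile at a time."""
    M, K = A.shape
    N = B_T.shape[0]
    C = np.zeros((M, N), dtype=np.float32)
    for m0 in range(0, M, block_m):          # plays the role of program_id(0)
        for n0 in range(0, N, block_n):      # plays the role of program_id(1)
            rows = min(block_m, M - m0)
            cols = min(block_n, N - n0)
            acc = np.zeros((rows, cols), dtype=np.float32)
            for k0 in range(0, K, block_k):  # the kernel's K loop
                # Slices clip at the array edge, standing in for the load masks.
                a = A[m0:m0 + block_m, k0:k0 + block_k]
                b = B_T[n0:n0 + block_n, k0:k0 + block_k]
                acc += a @ b.T               # tile_dot(a, b, transpose_rhs=True)
            C[m0:m0 + block_m, n0:n0 + block_n] = acc
    return C
```

Shapes that are not multiples of the block sizes (for example M=70, K=37) exercise the ragged edges, which is where masking mistakes usually show up.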

Built-in ops

High-performance implementations of common operations.

C = al.dot_transpose_rhs(A, B)               # tiled GEMM with autotuning
s = al.softmax(x)                            # fused row-wise softmax
y = al.layernorm(x, gamma, beta)             # fused layer normalization
y, _ = al.rms_norm(x, weight)                # fused RMS normalization (+ per-row 1/rms)
L = al.cross_entropy(logits, labels)         # fused cross-entropy loss kernel

Builtins infer output shapes and constexpr values from input arrays. They also compose with fusion: for example, al.dot followed by an elementwise kernel automatically fuses the elementwise op as an epilogue.
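When validating the built-ins it helps to have plain NumPy references to compare against. The two below are sketches of the standard definitions, not Alloy code. The `eps` default is an assumption, and the second return value mirrors the per-row 1/rms mentioned in the comment on al.rms_norm.

```python
import numpy as np


def softmax_reference(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def rms_norm_reference(x, weight, eps=1e-6):
    """Return (normalized rows, per-row 1/rms). eps is an assumed default."""
    inv_rms = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x * inv_rms * weight, inv_rms[..., 0]
```

A typical check would compare `np.asarray(al.softmax(x))` against `softmax_reference(x)` with `np.allclose`, choosing a tolerance that suits the dtype.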

Kernel primitives

# Grid and thread indexing
pid = al.program_id(0)                  # threadgroup position (block index)
tid = al.thread_id()                    # thread position within threadgroup
offs = pid * 1024 + al.arange(0, 1024)  # block-level offsets

# Memory
x = al.load(ptr + offs, mask=mask)       # masked global load
al.store(ptr + offs, val, mask=mask)     # masked global store
buf = al.shared(256)                     # threadgroup shared memory
loc = al.local(8)                        # per-thread register array
al.barrier()                             # threadgroup memory barrier
al.coop_load(buf, src_ptr, size)         # cooperative threadgroup load + barrier
al.copy4(dst, offset, src_ptr)           # vectorized 4-element load

# Tile operations (2D blocks for GEMM, attention, etc.)
acc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)
acc += al.tile_dot(a, b, transpose_rhs=True)  # simdgroup MMA
reduced = al.simd_reduce(val)                 # cross-lane reduction

# Simdgroup (warp-level)
al.simd_shuffle_xor(val, offset)         # butterfly shuffle
al.simd_shuffle(val, lane)               # read from specific lane
acc = al.simd_matrix()                   # 8x8 matrix accumulator
al.simd_load(src, offset, stride)        # load into simd matrix
al.simd_mma(acc, a, b)                   # matrix multiply-accumulate

# Atomics
al.atomic_add(ptr, idx, val)                # atomic fetch-and-add (int32)
al.atomic_max(ptr, idx, val)
al.atomic_cas(ptr, idx, expected, desired)  # compare-and-swap

# Control flow — plain Python
if cond: ...
for i in range(N): ...
while cond: ...
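The butterfly pattern behind al.simd_shuffle_xor can be simulated on the CPU. In this sketch a list stands in for the lanes of one simdgroup (32 lanes on Apple GPUs). In each round every lane adds the value held by lane `i ^ offset`, so after log2(n) rounds all lanes hold the full sum. The function name is hypothetical.

```python
def butterfly_sum(lanes):
    """Simulate a shuffle-xor sum reduction across one simdgroup.

    `lanes` holds one value per lane; its length must be a power of two.
    Returns the per-lane values after the final round.
    """
    n = len(lanes)
    assert n > 0 and (n & (n - 1)) == 0, "lane count must be a power of two"
    vals = list(lanes)
    offset = n // 2
    while offset >= 1:
        # All lanes read their partner at the same time, so build the new
        # list from the old one instead of updating in place.
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals


print(butterfly_sum(range(32))[0])
```

Inside a kernel the same loop would be written with `val = val + al.simd_shuffle_xor(val, offset)` for offsets 16, 8, 4, 2, 1. That form is inferred from the primitive's signature above and is not confirmed by the README.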

Automatic fusion

# These three kernels fuse into one — no intermediate buffers allocated.
# Each call returns a lazy AlloyBuffer; feed it straight into the next:
t1 = scale[grid](x, N=N)          # t1 = x * 2.0
t2 = bias[grid](t1, N=N)          # t2 = t1 + 1.0
result = activate[grid](t2, N=N)  # result = relu(t2)

# Reading the result triggers one fused GPU submission:
print(result[0])
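The lazy-buffer behaviour described above can be modelled with a toy class in pure Python. This is a conceptual sketch only: `LazyArray` is hypothetical and says nothing about how AlloyBuffer is implemented. Each call records an elementwise op, and reading an element runs the whole chain in a single pass with no intermediate arrays.

```python
class LazyArray:
    """Toy lazy buffer: records elementwise ops, runs them in one pass on read."""

    passes = 0  # how many times the data has actually been traversed

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)

    def map(self, fn):
        # No work happens here; the op is appended to the pending chain.
        return LazyArray(self._data, self._ops + (fn,))

    def materialize(self):
        LazyArray.passes += 1
        out = []
        for v in self._data:      # single traversal, no intermediate lists
            for fn in self._ops:
                v = fn(v)
            out.append(v)
        return out

    def __getitem__(self, i):
        return self.materialize()[i]


x = LazyArray([-2.0, -0.25, 1.5])
t1 = x.map(lambda v: v * 2.0)            # scale
t2 = t1.map(lambda v: v + 1.0)           # bias
result = t2.map(lambda v: max(v, 0.0))   # relu
print(result[0], LazyArray.passes)
```

Three ops are queued but the data is walked once, which is the property that lets a real runtime emit one fused kernel instead of three.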

Framework interop

Pass PyTorch tensors or MLX arrays directly when their storage layout is supported:

import torch
x = torch.randn(32, 128)                  # CPU tensor, lives in unified memory
result = my_kernel[grid](x, M=32, N=128)  # x bound directly; result returned as an AlloyBuffer

Alloy's compiled plans may convert PyTorch input storage to Alloy-owned shared memory on first execution so that later dispatches can resolve Metal buffers by handle. As long as the storage stays stable, those dispatches avoid per-call input copies.

Inspect generated code

al.inspect(my_kernel, N=8192)                      # prints MSL source
al.inspect(my_kernel, level="tile-ir", N=8192)     # prints tile IR

Why Alloy

The problem. Metal compute is powerful but painful to program. You write MSL in a C++ dialect, manually manage buffer bindings, compile pipeline state objects, and set up command encoders. There's no equivalent of Triton, Numba, or CuPy for Metal.

What Alloy does. Python → tile IR → MSL, with a runtime that handles dispatch, caching, and optimization:

  • Shared-memory execution — Apple Silicon CPU and GPU share physical memory. Alloy binds caller buffers directly where the storage layout supports it, and uses Alloy-owned shared buffers when plan safety or alignment requires it.
  • Tile IR compiler — Python kernel source → AST → tile IR (loads, stores, reductions, MMA, barriers) → Metal Shading Language. Handles threadgroup sizing, shared memory allocation, simdgroup decomposition, and barrier placement automatically.
  • Automatic dispatch — builtins return lazy buffers that queue GPU work automatically. Reading results triggers a single fused Metal command buffer commit. No manual batch management needed.
  • Operator fusion — adjacent elementwise kernels fuse automatically, eliminating intermediate buffers. Elementwise ops fuse as prologues and epilogues into reductions, GEMM, softmax, and layernorm. Transposes fuse via stride absorption.
  • Tuning — exhaustive search over tile sizes, loop unrolling, double buffering, and matvec strategies.
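The tuning bullet describes an exhaustive search over a configuration space. A generic version of that loop is sketched below. The parameter names in the example space (`block_m`, `block_k`, `double_buffer`) and both helper functions are hypothetical, and Alloy's tuner may prune or cache in ways this sketch does not.

```python
import itertools
import time


def time_call(fn, repeats=3):
    """Best-of-N wall-clock time for fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def exhaustive_tune(measure, space):
    """Try every combination in `space` and return the config with the lowest cost.

    `space` maps parameter names to candidate values; `measure(config)`
    returns a cost such as seconds per launch.
    """
    keys = list(space)
    candidates = (
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    )
    return min(candidates, key=measure)
```

In practice `measure` would launch the kernel with the given config and return `time_call(...)`. The search cost grows with the product of the candidate counts, which is why results are normally computed once and reused.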

Contributing

See CONTRIBUTING.md for dev setup, test commands, and PR conventions.

License

Apache License 2.0.