GitHub - rayanht/alloy: Kernel authoring DSL, torch.compile backend and LLM serving for Apple Silicon.
rayanht · 2026-06-21 · via Hacker News: Show HN

Kernel authoring DSL, torch.compile backend and LLM serving for Apple Silicon.

Alloy is a compiler and runtime for GPU compute kernels on Apple Silicon. You write kernels in Python, and Alloy compiles them to Metal through a tile IR pipeline. It covers everything from per-thread scalar kernels to cooperative tiled GEMM with simdgroup MMA, along with automatic operator fusion for multi-kernel pipelines.

Status: technical preview. Requires Apple Silicon (M1+) and macOS 13+. The Python packages need Python 3.10–3.12.

Contents

  • Install
  • Inference server - Quickstart
    • Features
  • torch.compile backend
    • Training preview
  • Benchmarks
    • Causal LM Inference
    • Multimodal Inference
    • Embeddings Inference
  • Writing kernels
    • Tiled GEMM
    • Built-in ops
    • Kernel primitives
    • Automatic fusion
    • Framework interop
    • Inspect generated code
  • Why Alloy
  • Contributing
  • License

Install

Python (pip / uv)

pip install 'alloy-kit[serve]'   # local LLM server + CLI + torch.compile backend
pip install alloy-kit            # lean: just the GPU kernel compiler (no torch)
pip install 'alloy-kit[all]'     # + training / vision / audio research extras

# import alloy as al

The PyPI distribution is alloy-kit. The bracketed names are optional dependency groups: the lean base provides @al.kernel with the tile IR, MSL emitter, and Metal dispatch machinery, and [serve] adds everything needed to run the server and the alloy CLI.

Standalone (no Python required):

curl -fsSL https://raw.githubusercontent.com/rayanht/alloy/main/installer/install.sh | sh

Installs a self-contained alloy CLI into /usr/local.

From source (contributors): see Contributing.

Inference server - Quickstart

Alloy serves a loopback HTTP API that's drop-in compatible with the OpenAI, Anthropic and Ollama clients.

Important

Run alloy tune <model> before serving for best performance.

# Start the server in the foreground; loads the model
# from a local Ollama cache or Hugging Face if present.

alloy serve -m qwen3:0.6b                                   # Ollama tag
alloy serve -m bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M  # HF model

# OpenAI:
curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"qwen3:0.6b","messages":[{"role":"user","content":"hi"}]}'

# Ollama:
curl http://127.0.0.1:11434/api/chat \
  -d '{"model":"qwen3:0.6b","messages":[{"role":"user","content":"hi"}]}'

# Claude Code:
alloy launch claude

The default port is 11434. Pass --port 11435 to alloy serve (or set ALLOY_PORT) to override.
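The same OpenAI-style request can be issued from Python with only the standard library. This is a minimal sketch, not part of Alloy: the helper names build_chat_request and chat are hypothetical, reading ALLOY_PORT on the client side is a convenience chosen here, and the response parsing assumes the standard OpenAI chat completion shape.

```python
import json
import os
import urllib.request


def build_chat_request(prompt, model="qwen3:0.6b", port=None):
    """Build (but do not send) an OpenAI-style chat completion request."""
    # The 11434 default and the ALLOY_PORT variable come from the text above;
    # honoring ALLOY_PORT in the client is an assumption made for convenience.
    port = port or int(os.environ.get("ALLOY_PORT", "11434"))
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request; needs a running `alloy serve` on the chosen port."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumes the usual OpenAI response layout: choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("hi") should return the model's reply as a string. Passing port=11435 matches a server started with --port 11435.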

Features

Feature                                                         Status
Warm-prefix KV reuse (bookmarks + branching)                    Stable
On-GPU sampling (temp / top-p / top-k / min-p / seed)           Stable
Constrained decoding (xgrammar JSON + tool grammars)            Stable
Tool calling (OpenAI / Anthropic / Ollama, per-family parsers)  Stable
Reasoning / thinking split                                      Stable
MoE inference                                                   Stable (Qwen3.5-MoE)
Vision input                                                    Stable (gemma4)
Audio input                                                     Stable (gemma4)
Embeddings                                                      Stable (nomic-embed-text)
Speculative decoding: PLD (prompt lookup)                       Opt-in (--spec pld)
Speculative decoding: MTP                                       Opt-in (--spec mtp, Qwen3.5)
Speculative decoding: DFlash (block diffusion)                  Opt-in (--spec dflash)
Paged KV cache                                                  Opt-in (ALLOY_KV=paged)
KV cache quantization (int8 + fp16 scales)                      Opt-in (--kv-quant q8_0)

Supported quantizations

Model weights

source format supported
GGUF Q4_K (Q4_K_M / Q4_K_S)
GGUF Q5_0
GGUF Q6_K
GGUF Q8_0
GGUF F16 / BF16 / F32
GGUF Q2_K / Q3_K / Q5_K
GGUF Q4_0 / Q4_1 / Q5_1
GGUF IQ1 / IQ2 / IQ3 / IQ4 (IQ4_XS, IQ4_NL)
MLX 4-bit affine (group size 64 / 128)
MLX 2-bit / 3-bit / 6-bit / 8-bit

KV cache

format supported
fp16 (default)
q8_0
q4 / other
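To make "int8 + fp16 scales" concrete, here is a minimal sketch of block-wise 8-bit quantization. It is not Alloy's implementation: the block size of 32 is an assumption borrowed from GGUF's Q8_0 layout, and the function names are hypothetical.

```python
import struct

BLOCK = 32  # values per scale; assumed, following GGUF's Q8_0 layout


def to_fp16(x):
    """Round a Python float to the nearest fp16-representable value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_block(values):
    """Return (fp16 scale, int8 codes) for one block of floats."""
    amax = max(abs(v) for v in values)
    scale = to_fp16(amax / 127.0) if amax > 0 else 0.0
    if scale == 0.0:  # all-zero block, or a scale that underflows fp16
        return 0.0, [0] * len(values)
    # Clamp because rounding the scale to fp16 can push a code just past 127.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a scale and its int8 codes."""
    return [scale * c for c in codes]


block = [i / 10 - 1.6 for i in range(BLOCK)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f} worst-case error={worst:.6f}")
```

Each block stores 32 one-byte codes plus a two-byte scale, so the cache shrinks to a little over half of its fp16 size, and the reconstruction error stays near half a quantization step.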

torch.compile backend

Alloy includes a torch.compile backend that compiles covered PyTorch FX graphs to fused Metal compute kernels.

import torch
import transformers
import alloy_torch  # registers the "alloy" backend

model = transformers.AutoModelForCausalLM.from_pretrained("gpt2").eval()
compiled = torch.compile(model, backend="alloy")

input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
output = compiled(input_ids=input_ids)

The backend handles: FX graph decomposition, operator fusion (RMSNorm, RoPE, GELU, batched QKV, GEMM+LayerNorm, scalar broadcast), GQA-native attention, compiled dispatch plans, and tuning.

Runnable model examples live in examples/torch/:

  • mlp.py — multi-layer perceptron (Linear / LayerNorm / GELU)
  • resnet.py — GroupNorm ResNet (Conv2d + residual blocks)
  • transformer.py — pre-norm encoder block (SDPA + GELU MLP)

Training preview

A full torch.compile training step (forward, backward, and the optimizer update) runs end to end through Alloy and matches PyTorch eager within floating-point tolerance for dense transformer-style models: embeddings, linear layers, normalization, residual blocks, attention, cross-entropy, and the common optimizers (SGD, Adam, AdamW, RMSprop). A small language model trains end to end, and LoRA fine-tuning of a pretrained transformer works with the model in model.train() mode. Enable training mode before calling torch.compile:

import torch
import torch.nn as nn
import torch.nn.functional as F
import alloy_torch  # registers the "alloy" backend
from alloy_torch.training import set_training_mode

set_training_mode(True)  # before torch.compile

model = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 1))
step = torch.compile(model, backend="alloy")
opt = torch.optim.AdamW(model.parameters(), lr=0.05)

x, y = torch.randn(32, 64), torch.randn(32, 1)
for _ in range(20):
    opt.zero_grad()
    loss = F.mse_loss(step(x), y)
    loss.backward()
    opt.step()

Fine-tuning a pretrained transformer with PEFT LoRA is the same shape:

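import peft
import torch
import transformers
import alloy_torch  # registers the "alloy" backend
from alloy_torch.training import set_training_mode

set_training_mode(True)  # before torch.compile

model = peft.get_peft_model(
    transformers.AutoModelForCausalLM.from_pretrained("gpt2"),
    peft.LoraConfig(target_modules=["c_attn"], task_type="CAUSAL_LM"),
)
step = torch.compile(model, backend="alloy")
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=5e-3)

# Placeholder data: `batches` can be any iterable of (batch, seq) token-id tensors.
batches = [torch.randint(0, model.config.vocab_size, (4, 32)) for _ in range(8)]

model.train()
for input_ids in batches:
    opt.zero_grad()
    loss = step(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    opt.step()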

Runnable training examples live in examples/torch/.

Training is still a preview. The backward pass does not yet cover convolutions or pooling, so CNN training is not supported. Inference is the primary, fully validated path.

Benchmarks

Reproduce with alloy bench <HF_OR_OLLAMA_TAG>

Causal LM Inference

HF Models

model                                 quant   pp512  tg128
LFM2.5-1.2B-Instruct-GGUF             Q4_K_M  4222   508
bartowski/Llama-3.2-3B-Instruct-GGUF  Q4_K_M  2061   198
Qwen_Qwen3-0.6B-GGUF                  Q4_K_M  8311   612

Ollama Models

model             quant   pp512  tg128
qwen2.5:0.5b      Q4_K_M  12102  505
qwen3:0.6b        Q4_K_M  10077  584
llama3.2:1b       Q8_0    5653   324
qwen3.5:0.8b      Q8_0    6141   349
deepseek-r1:1.5b  Q4_K_M  3295   274
qwen2.5:1.5b      Q4_K_M  3295   270
qwen3.5:2b        Q8_0    3247   187
gemma4:e2b        Q4_K_M  2121   175
qwen2.5:3b        Q4_K_M  1617   185
gemma4:e4b        Q4_K_M  1079   115
qwen3.5:4b        Q4_K_M  1098   122
qwen3.5:9b        Q4_K_M  598    78.6
qwen3.6:35b       Q4_K_M  988    121

MLX Models

model                                     quant       pp512  tg128
Qwen/Qwen3-0.6B-MLX-4bit                  4-bit g128  10063  710
LiquidAI/LFM2.5-1.2B-Instruct-MLX-4bit    4-bit g64   5688   589
mlx-community/Llama-3.2-3B-Instruct-4bit  4-bit g64   2173   220
mlx-community/Qwen3-4B-4bit               4-bit g64   1673   174
mlx-community/Qwen3-8B-4bit               4-bit g64   866    102
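The tables do not state units. Assuming pp512 and tg128 follow the llama-bench convention (tokens per second while processing a 512-token prompt and while generating 128 tokens), a throughput figure converts to wall time as tokens divided by rate. The sketch below applies that to the qwen3:0.6b Ollama row; the helper name phase_ms is hypothetical.

```python
def phase_ms(tokens, tok_per_s):
    """Milliseconds needed to process `tokens` at a steady `tok_per_s`."""
    return 1000.0 * tokens / tok_per_s


# qwen3:0.6b (Ollama, Q4_K_M): pp512 = 10077, tg128 = 584
prefill_ms = phase_ms(512, 10077)
decode_ms = phase_ms(128, 584)
print(f"prefill ~ {prefill_ms:.1f} ms, decode ~ {decode_ms:.1f} ms")
# prints "prefill ~ 50.8 ms, decode ~ 219.2 ms"
```

Under that assumption, generation dominates the total time for short prompts even though the prefill rate is much higher.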

Multimodal Inference

model       vision ms  alloy TTFT  alloy dec  alloy wall
gemma4:e2b  229        455         172        1193
gemma4:e4b  257        665         99.0       1949

Embeddings Inference

Per-regime encoder tok/s from alloy bench nomic-embed-text --dataset embeddings.

regime    batch  seq  tok/s
q_short   1      10   5094
q_long    1      256  19161
b8_short  8      10   14142
b8_long   8      128  11840

Writing kernels

import numpy as np
import alloy as al

@al.kernel
def blur(src, dst: al.output, W: al.constexpr, H: al.constexpr):
    x = al.program_id(0)
    y = al.program_id(1)
    acc = 0.0
    count = 0
    for dy in range(-1, 2):
        for dx in range(-1, 2):
            nx = x + dx
            ny = y + dy
            if nx >= 0:
                if nx < W:
                    if ny >= 0:
                        if ny < H:
                            acc = acc + al.load(src + ny * W + nx)
                            count = count + 1
    al.store(dst + y * W + x, acc / count)

W, H = 1920, 1080
img = np.random.rand(H, W).astype(np.float32)

out = blur[W, H](img.ravel(), W=W, H=H)
print(np.asarray(out).reshape(H, W))

NumPy and PyTorch arrays can be bound directly as inputs for covered contiguous host-memory paths. The kernel's al.output is allocated automatically and returned as an AlloyBuffer — convert with np.asarray(...) or .numpy(). Some interop paths allocate Alloy-owned shared buffers or require layout copies, so this is not a blanket promise that every input type and view is no-copy.
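A CPU reference is useful for checking the kernel's output on a small image. The sketch below mirrors the same 3x3 box blur and bounds checks in plain Python; blur_reference is a hypothetical helper, not part of Alloy.

```python
def blur_reference(src, W, H):
    """3x3 box blur over a flat row-major list.

    Pixels near the border average only the neighbors that are in bounds,
    matching the nested bounds checks in the kernel above.
    """
    dst = [0.0] * (W * H)
    for y in range(H):
        for x in range(W):
            acc, count = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < W and 0 <= ny < H:
                        acc += src[ny * W + nx]
                        count += 1
            dst[y * W + x] = acc / count
    return dst


img = [float(v) for v in range(9)]  # 3x3 image holding 0..8
out = blur_reference(img, W=3, H=3)
print(out[4])  # center pixel, mean of 0..8: prints 4.0
print(out[0])  # corner pixel, mean of 0, 1, 3, 4: prints 2.0
```

Comparing np.asarray(out) from the GPU kernel against this reference with a small tolerance is one way to validate a kernel before scaling up to full-size images.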

More runnable examples live in examples/kernel/.

Tiled GEMM

@al.kernel
def matmul(A, B_T, C: al.output,
           BLOCK_M: al.constexpr = 64, BLOCK_N: al.constexpr = 64, BLOCK_K: al.constexpr = 16):
    M, K = A.shape
    N = B_T.shape[0]
    pm = al.program_id(0)
    pn = al.program_id(1)
    rm = pm * BLOCK_M + al.arange(0, BLOCK_M)
    rn = pn * BLOCK_N + al.arange(0, BLOCK_N)
    rk = al.arange(0, BLOCK_K)
    a_ptrs = A + rm[:, None] * K + rk[None, :]
    b_ptrs = B_T + rn[:, None] * K + rk[None, :]
    acc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)
    for k in range(0, K, BLOCK_K):
        # Mask the K tail against the remaining extent so a partial last tile stays in bounds.
        a = al.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] < K - k))
        b = al.load(b_ptrs, mask=(rn[:, None] < N) & (rk[None, :] < K - k))
        acc += al.tile_dot(a, b, transpose_rhs=True)
        a_ptrs += BLOCK_K
        b_ptrs += BLOCK_K
    al.store(C + rm[:, None] * N + rn[None, :], acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

This compiles to Metal with simdgroup matrix multiply-accumulate (MMA), cooperative tile loads, threadgroup shared memory staging, and optional double buffering, all generated automatically from the tile IR.
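The tiling logic itself can be checked on the CPU. The following plain-Python sketch computes the same product, C = A times the transpose of B_T, tile by tile, with the loop bounds playing the role of the kernel's masks when M, N, or K is not a multiple of the block size. The function name tiled_matmul is hypothetical, and nothing here reflects how Alloy schedules work on the GPU.

```python
def tiled_matmul(A, B_T, M, N, K, BLOCK_M=4, BLOCK_N=4, BLOCK_K=4):
    """Compute C[m][n] = sum over k of A[m][k] * B_T[n][k], one tile at a time."""
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, BLOCK_M):          # one (m0, n0) pair per program id
        for n0 in range(0, N, BLOCK_N):
            for k0 in range(0, K, BLOCK_K):  # the kernel's loop over K tiles
                # min(...) clips partial tiles, like the load and store masks.
                for m in range(m0, min(m0 + BLOCK_M, M)):
                    for n in range(n0, min(n0 + BLOCK_N, N)):
                        for k in range(k0, min(k0 + BLOCK_K, K)):
                            C[m][n] += A[m][k] * B_T[n][k]
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # 2x3
B_T = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # B transposed, 2x3
print(tiled_matmul(A, B_T, M=2, N=2, K=3, BLOCK_M=2, BLOCK_N=2, BLOCK_K=2))
# prints [[4.0, 5.0], [10.0, 11.0]]
```

K=3 with BLOCK_K=2 exercises the partial last tile, which is the case where an incorrect K mask would read past the end of a row.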

Built-in ops

High-performance implementations of common operations.

C = al.dot_transpose_rhs(A, B)               # tiled GEMM with autotuning
s = al.softmax(x)                            # fused row-wise softmax
y = al.layernorm(x, gamma, beta)             # fused layer normalization
y, _ = al.rms_norm(x, weight)                # fused RMS normalization (+ per-row 1/rms)
L = al.cross_entropy(logits, labels)         # fused cross-entropy loss kernel

Builtins infer output shapes and constexpr values from input arrays. They also compose with fusion: for example, al.dot followed by an elementwise kernel automatically fuses the elementwise op as an epilogue.
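For checking results, the row-wise operations have simple textbook definitions. The sketch below gives plain-Python references for softmax and RMS normalization, with the latter also returning the per-row 1/rms value mentioned above. The function names and the eps default are assumptions, and Alloy's kernels may differ in details such as epsilon handling.

```python
import math


def softmax_row(row):
    """Numerically stable softmax of one row."""
    m = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rms_norm_row(row, weight, eps=1e-6):
    """Return (normalized row, 1/rms). The eps default is an assumed value."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(row, weight)], inv_rms


print(softmax_row([0.0, 0.0]))  # prints [0.5, 0.5]
y, inv_rms = rms_norm_row([3.0, 4.0], [1.0, 1.0], eps=0.0)
print([round(v, 4) for v in y], round(inv_rms, 4))
```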

Kernel primitives

# Grid and thread indexing
pid = al.program_id(0)                  # threadgroup position (block index)
tid = al.thread_id()                    # thread position within threadgroup
offs = pid * 1024 + al.arange(0, 1024)  # block-level offsets

# Memory
x = al.load(ptr + offs, mask=mask)       # masked global load
al.store(ptr + offs, val, mask=mask)     # masked global store
buf = al.shared(256)                     # threadgroup shared memory
loc = al.local(8)                        # per-thread register array
al.barrier()                             # threadgroup memory barrier
al.coop_load(buf, src_ptr, size)         # cooperative threadgroup load + barrier
al.copy4(dst, offset, src_ptr)           # vectorized 4-element load

# Tile operations (2D blocks for GEMM, attention, etc.)
acc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)
acc += al.tile_dot(a, b, transpose_rhs=True)  # simdgroup MMA
reduced = al.simd_reduce(val)                 # cross-lane reduction

# Simdgroup (warp-level)
al.simd_shuffle_xor(val, offset)         # butterfly shuffle
al.simd_shuffle(val, lane)               # read from specific lane
acc = al.simd_matrix()                   # 8x8 matrix accumulator
al.simd_load(src, offset, stride)        # load into simd matrix
al.simd_mma(acc, a, b)                   # matrix multiply-accumulate

# Atomics
al.atomic_add(ptr, idx, val)                # atomic fetch-and-add (int32)
al.atomic_max(ptr, idx, val)
al.atomic_cas(ptr, idx, expected, desired)  # compare-and-swap

# Control flow — plain Python
if cond: ...
for i in range(N): ...
while cond: ...
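The butterfly shuffle is the building block for cross-lane reductions: at each step, lane i combines its value with the value held by lane i XOR offset, and the offset halves until every lane holds the full result. The simulation below illustrates the pattern on a Python list. It is a sketch of the general technique, not a description of how al.simd_reduce is implemented, and butterfly_sum is a hypothetical name.

```python
def butterfly_sum(lanes):
    """Simulate an XOR-shuffle all-reduce over a power-of-two number of lanes."""
    vals = list(lanes)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    offset = n // 2
    while offset >= 1:
        # Lane i reads lane i ^ offset (the shuffle) and adds it to its own value.
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals


out = butterfly_sum(range(32))  # 32 lanes, the usual SIMD-group width on Apple GPUs
print(out[0], out[31])  # prints "496 496"
```

After log2(32) = 5 steps every lane holds the sum, so no separate broadcast is needed.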

Automatic fusion

# These three kernels fuse into one — no intermediate buffers allocated.
# Each call returns a lazy AlloyBuffer; feed it straight into the next:
t1 = scale[grid](x, N=N)          # t1 = x * 2.0
t2 = bias[grid](t1, N=N)          # t2 = t1 + 1.0
result = activate[grid](t2, N=N)  # result = relu(t2)

# Reading the result triggers one fused GPU submission:
print(result[0])
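The idea behind lazy fusion can be shown without a GPU. In this toy sketch each call only records an elementwise function, and reading an element runs all recorded functions in a single pass over the data. LazyBuffer here is a hypothetical stand-in written for illustration, not Alloy's AlloyBuffer.

```python
class LazyBuffer:
    """Records elementwise ops and applies them in one pass on first read."""

    passes = 0  # number of full passes over the data, for illustration

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)
        self._cache = None

    def map(self, fn):
        # Queue the op; nothing is computed and no intermediate list is built.
        return LazyBuffer(self._data, self._ops + (fn,))

    def _materialize(self):
        if self._cache is None:
            LazyBuffer.passes += 1
            out = []
            for v in self._data:
                for fn in self._ops:  # fused: every op is applied per element
                    v = fn(v)
                out.append(v)
            self._cache = out
        return self._cache

    def __getitem__(self, i):
        return self._materialize()[i]


x = LazyBuffer([-3.0, 0.5, 2.0])
t1 = x.map(lambda v: v * 2.0)           # scale
t2 = t1.map(lambda v: v + 1.0)          # bias
result = t2.map(lambda v: max(v, 0.0))  # relu
print(result[0], result[2], LazyBuffer.passes)  # prints "0.0 5.0 1"
```

Three logical kernels produce one pass over the data, and the intermediate values t1 and t2 are never stored.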

Framework interop

Pass PyTorch tensors or MLX arrays directly when their storage layout is supported:

import torch
x = torch.randn(32, 128)                  # CPU tensor, lives in unified memory
result = my_kernel[grid](x, M=32, N=128)  # x bound directly; result returned as an AlloyBuffer

Alloy's compiled plans may convert PyTorch input storage to Alloy-owned shared memory on first execution so that later dispatches can resolve Metal buffers by handle. This avoids per-call input copies as long as the underlying storage stays stable.

Inspect generated code

al.inspect(my_kernel, N=8192)                      # prints MSL source
al.inspect(my_kernel, level="tile-ir", N=8192)     # prints tile IR

Why Alloy

The problem. Metal compute is powerful but painful to program. You write MSL in a C++ dialect, manually manage buffer bindings, compile pipeline state objects, and set up command encoders. There's no equivalent of Triton, Numba, or CuPy for Metal.

What Alloy does. Python → tile IR → MSL, with a runtime that handles dispatch, caching, and optimization:

  • Shared-memory execution — Apple Silicon CPU and GPU share physical memory. Alloy binds caller buffers directly where the storage layout supports it, and uses Alloy-owned shared buffers when plan safety or alignment requires it.
  • Tile IR compiler — Python kernel source → AST → tile IR (loads, stores, reductions, MMA, barriers) → Metal Shading Language. Handles threadgroup sizing, shared memory allocation, simdgroup decomposition, and barrier placement automatically.
  • Automatic dispatch — builtins return lazy buffers that queue GPU work automatically. Reading results triggers a single fused Metal command buffer commit. No manual batch management needed.
  • Operator fusion — adjacent elementwise kernels fuse automatically, eliminating intermediate buffers. Elementwise ops fuse as prologues and epilogues into reductions, GEMM, softmax, and layernorm. Transposes fuse via stride absorption.
  • Tuning — exhaustive search over tile sizes, loop unrolling, double buffering, and matvec strategies.
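Exhaustive tuning, as described in the last bullet, amounts to measuring every combination in a search space and keeping the fastest. The sketch below shows that loop in plain Python. The names autotune and time_call, the parameter names in the example search space, and the toy cost function are all hypothetical and do not reflect Alloy's tuner.

```python
import itertools
import time


def time_call(fn, repeats=3):
    """Best-of-N wall time for fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def autotune(measure, search_space):
    """Evaluate every configuration and return the one with the lowest cost.

    `measure` maps a config dict to a cost, for example the time_call result
    of launching a kernel compiled with that config.
    """
    keys = list(search_space)
    candidates = (
        dict(zip(keys, combo))
        for combo in itertools.product(*(search_space[k] for k in keys))
    )
    return min(candidates, key=measure)


space = {"block_m": [32, 64, 128], "block_k": [8, 16, 32], "double_buffer": [False, True]}


def toy_cost(cfg):
    # Stand-in for a real measurement so the example is deterministic.
    return abs(cfg["block_m"] - 64) + abs(cfg["block_k"] - 16) + (0 if cfg["double_buffer"] else 1)


print(autotune(toy_cost, space))
# prints {'block_m': 64, 'block_k': 16, 'double_buffer': True}
```

The search cost grows with the product of the option counts (18 configurations here), which is why tuning is typically run once ahead of time and its result cached.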

Contributing

See CONTRIBUTING.md for dev setup, test commands, and PR conventions.

License

Apache License 2.0.