What is OpenAI's Parameter Golf Challenge, and why I spent a month on it
Swapnil Sawa · 2026-05-02 · via DEV Community

Parameter Golf — 16 MB. 10 min. 8×H100. One number to beat.

March 18th, 2026. OpenAI posts the rules for something called Parameter Golf.

Train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s.

I read it twice and almost closed the tab. I trained a GAN a long time ago. I'd picked up transformer basics here and there. I had never trained a language model. I had never rented a GPU by the hour. Every word in the announcement was slightly beyond my reach.

I decided to try anyway.

A month later, my submission is open on OpenAI's repository at openai/parameter-golf#1747 — a hair behind the world record on the public leaderboard.

What follows is that month, told as a map of the ideas I had to understand along the way: partial rotary embeddings, quantized weights, test-time training, and a handful of things I'd never heard of on March 17th. The rest of this series unpacks them one at a time. This post is about why the challenge itself became the best curriculum I've ever found for getting into modern language modeling.


The rules, in plain language

Parameter Golf is a narrow, beautifully specified contest. You submit a single Python file — train_gpt.py — plus a compressed model blob. Together, they have to weigh less than 16 megabytes.

Submission anatomy — train_gpt.py + model.pkl.lzma, ≤ 16 MB total, runs on 8×H100 in 10 min

The file is run on a rented 8-GPU machine (8×H100, which costs about twenty dollars an hour), and your training script has ten minutes to turn random weights into a working language model. Then your model is scored on a held-out slice of the web — the FineWeb validation set — using a measure called bits-per-byte.

Bits-per-byte is a tidy idea once it clicks. Your model is given real text it has never seen, and for each next token it predicts, it assigns some probability. If the model thought the real next token was likely, it "pays" few bits; if the real next token surprises it, it pays many (the cost is -log2 of the probability it assigned). Sum the bill across the entire validation set, divide by the number of bytes in that text (bytes, not tokens), and you have the average number of bits your model needed to describe each byte of real English. Lower is better. The naive baseline OpenAI shipped with the repo scored 1.2244. A month later, the top of the leaderboard was 1.0810.

The gap between those two numbers is the whole game. And because the measure is tokenizer-agnostic, nobody can cheat by picking a friendly vocabulary — all scores are on the same, absolute scale.
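A minimal sketch of that scoring arithmetic in plain Python. The function name `bits_per_byte`, the sample string, and the two tokenizations with their probabilities are invented for illustration; the official evaluation code may differ in detail.

```python
import math

def bits_per_byte(true_token_probs, text):
    """Average bits paid per UTF-8 byte of `text`.

    true_token_probs: the probability the model gave to each token
    that actually came next (one entry per token).
    """
    total_bits = sum(-math.log2(p) for p in true_token_probs)
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes

text = "hello world"  # 11 bytes

# Hypothetical tokenizer A: 2 large tokens, each predicted with p = 1/16 (4 bits each).
coarse = bits_per_byte([1 / 16, 1 / 16], text)

# Hypothetical tokenizer B: 4 small tokens, each predicted with p = 1/4 (2 bits each).
fine = bits_per_byte([0.25] * 4, text)

print(f"coarse={coarse:.4f}  fine={fine:.4f}")
```

Both made-up tokenizations pay 8 bits over the same 11 bytes, so they land on the same score. Dividing by bytes rather than tokens is what puts different vocabularies on one scale.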

What makes the rule set unusual isn't any single constraint. It's the combination of all three.

The three-constraint squeeze: 16 MB, 10 min, fixed eval, all hitting one submission at once

Sixteen megabytes means you can't have many parameters; ten minutes means you can't train for long; a fixed evaluation means you can't tune against the test. Every trick someone uses is squeezed through all three at once. Which is, it turns out, a phenomenal way to force you to actually understand what each trick does.


The best curriculum I've ever found

I've tried to "learn modern transformers" three times in the last few years. Each time I read a handful of papers, felt smarter for a week, and forgot most of it. Parameter Golf broke that pattern, and I think I know why.

The rules are public, and so is the code. Every record submission on the leaderboard is a pull request. You can click through and read the exact working Python that beat every previous entry. Papers describe ideas — often after the fact, often in selective detail. This challenge gives you running implementations, in a repository you can clone, with discussions attached to the PRs explaining why each change was made.

The surface area is small enough to hold in your head. A training script in Parameter Golf is about a thousand lines. You can read it in one sitting. Compare that with trying to learn from a production training codebase: you drown before you learn anything.

The feedback loop is honest and quick. You train for ten minutes. You get a number. The number goes on a leaderboard. A better number or a worse number is the only arbiter. There's no benchmark gaming, no cherry-picked evaluation. Either your idea works or it doesn't, and you find out before lunch.

The incentive structure is generous. OpenAI put a million dollars of compute credits on the table so that newcomers — me included — could afford to try. I wasn't expected to show up with my own cluster.

All of that adds up to something I've never had before in machine learning: a concrete, bounded, self-scoring problem where every successful idea is already in front of me as working code.


The month, roughly

The arc of my month wasn't week-by-week. It was concept-by-concept. Four ideas I didn't understand on March 17th, each of which unlocked the next phase.

Four-phase descent: baseline 1.2244 → LoRA TTT 1.1573 → SP8192+GPTQ+SDClip 1.08563 → Partial RoPE 1.0820 (SOTA at writing: 1.0810)

Phase 1: tokenization is a compression scheme

I started by reading the naive baseline train_gpt.py line by line. I didn't type a thing for two days. I just wanted to understand what a small transformer (nine layers, 512 hidden dimensions, a 1024-token vocabulary) actually looked like as code.

The first real decision I made was to make the vocabulary bigger. The intuition, which I picked up from the SentencePiece paper and a Hugging Face tutorial, is that the vocabulary is itself a compression scheme: a bigger vocabulary breaks sentences into fewer, larger tokens, which means each step of training sees more context. The catch is that the embedding table — one learned vector per vocabulary entry — scales linearly with vocab size, and it has to fit in that 16MB budget. At 4096 entries it was already a quarter of the artifact. I ended up at 8192, the sweet spot where the token reduction paid for the extra bytes.
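A back-of-the-envelope sketch of that linear scaling. The 512-wide embedding comes from the baseline described earlier; storing 2 bytes per weight and treating the budget as 16,000,000 bytes are my assumptions, and `embedding_share` is a made-up helper. Quantization and compression shrink the real figures.

```python
def embedding_share(vocab_size, d_model, bytes_per_weight, budget_bytes=16_000_000):
    """Fraction of the artifact budget used by an uncompressed embedding table."""
    table_bytes = vocab_size * d_model * bytes_per_weight
    return table_bytes / budget_bytes

# Assumed: d_model=512 and 2 bytes per weight, before any quantization.
for vocab in (1024, 4096, 8192):
    share = embedding_share(vocab, d_model=512, bytes_per_weight=2)
    print(f"vocab={vocab:5d}  embedding share of budget = {share:.1%}")

share_4096 = embedding_share(4096, 512, 2)
share_8192 = embedding_share(8192, 512, 2)
```

Under those assumptions a 4096-entry table is roughly a quarter of the budget, and doubling the vocabulary doubles the cost. That is why the bigger vocabulary only pays off once the weights are stored more compactly.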

One stumble worth naming: I tried to retokenize the dataset the obvious way — load a shard into memory, run it through the new tokenizer, write it out — and my machine died. A single training shard is 191 megabytes of raw tokens. I had to rewrite the pipeline using memory-mapped files. I spent a weekend learning what np.memmap does.
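A sketch of the streaming pattern, not the pipeline I actually ran. It uses the standard-library `mmap` module (the same OS facility np.memmap is built on) so it runs without dependencies. Treating a shard as a flat file of uint16 token ids is an assumption, and `iter_token_chunks` is a made-up name.

```python
import mmap
import os
import tempfile
from array import array

def iter_token_chunks(path, chunk_tokens=1 << 20):
    """Yield array('H') chunks of token ids, copying one chunk at a time."""
    item = array("H").itemsize  # 2 bytes on common platforms
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        step = chunk_tokens * item
        for offset in range(0, len(mm), step):
            chunk = array("H")
            chunk.frombytes(mm[offset:offset + step])  # copies only this slice
            yield chunk

# Demo on a small temporary "shard" of 10,000 fake token ids.
tokens = array("H", range(10_000))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tokens.tofile(tmp)

recovered = array("H")
for chunk in iter_token_chunks(tmp.name, chunk_tokens=1024):
    recovered.extend(chunk)  # a real pipeline would retokenize and write out here
os.remove(tmp.name)
```

Memory use is bounded by `chunk_tokens` rather than by the shard size, which is the property that matters when a single shard does not fit comfortably in RAM.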

Phase 2: test-time training is not cheating

My first working submission was a non-record — val_bpb 1.1573, mid-table at the time. But it was on OpenAI's repository, under my GitHub handle, and that mattered more than the score. I got it by adding a trick called LoRA test-time training: at evaluation time, after the model scores each chunk of text, it briefly fine-tunes itself on that chunk before predicting the next one. The first time I read about it I was sure it was cheating. It isn't — you're only training on tokens the model has already been graded on, not on tokens it will be scored against. The research lineage is well worth a read on its own.
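The score-first ordering is easier to trust once you see it as a loop. In this toy sketch a smoothed byte-frequency counter stands in for the language model, and updating the counts stands in for the LoRA fine-tuning step; none of it is the submission's code.

```python
import math
from collections import Counter

def score_stream(chunks, adapt, alpha=1.0):
    """Score chunks in order; optionally learn from each chunk AFTER scoring it."""
    counts = Counter()  # toy "model": smoothed byte frequencies
    seen = 0
    total_bits, total_bytes = 0.0, 0
    for chunk in chunks:
        # 1) Grade the chunk with the model as it stands right now.
        for byte in chunk:
            p = (counts[byte] + alpha) / (seen + 256 * alpha)
            total_bits += -math.log2(p)
        total_bytes += len(chunk)
        # 2) Only then adapt on the bytes that have already been graded.
        if adapt:
            counts.update(chunk)
            seen += len(chunk)
    return total_bits / total_bytes

data = b"abab" * 64
chunks = [data[i:i + 32] for i in range(0, len(data), 32)]
frozen_bpb = score_stream(chunks, adapt=False)
adaptive_bpb = score_stream(chunks, adapt=True)
print(f"frozen={frozen_bpb:.3f}  adaptive={adaptive_bpb:.3f}")
```

The frozen model never leaves the uniform 8 bits per byte. The adaptive one pays full price on the first chunk and less on every later chunk, without ever seeing a byte before it is scored.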

One stumble: my first pull request included all my local experiment files, because I hadn't yet learned how to keep a clean submission branch. I had to rebuild it from scratch off openai/parameter-golf:main and re-submit. Nobody on the team made me feel dumb about it, which I still appreciate.

Phase 3: the leaderboard is a curriculum

The most counter-intuitive thing about this challenge is that every record submission is public, working code. I wrote a forty-line Python script that could pull down a winning submission, decompress the packed model blob (the submissions use LZMA plus base85 encoding), and leave me with the full train_gpt.py someone had used to beat everyone else.
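The unpacking step is small enough to sketch with the standard library. The layering here (LZMA-compress, then base85-encode) is my assumption about the order; the function names and sample payload are invented, and a real submission wraps the blob in its own loader.

```python
import base64
import lzma

def pack(payload: bytes) -> bytes:
    """Compress with LZMA, then encode as base85 text (assumed order)."""
    return base64.b85encode(lzma.compress(payload))

def unpack(blob: bytes) -> bytes:
    """Reverse of pack: decode base85, then decompress."""
    return lzma.decompress(base64.b85decode(blob))

source = b"print('hello from train_gpt.py')\n" * 200
blob = pack(source)
restored = unpack(blob)
print(len(source), "bytes ->", len(blob), "bytes packed")
```

Base85 costs about 25% over raw bytes, which is why it only makes sense after compression.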

Then I read them. Over and over. The same three or four ideas kept recurring at the top of the leaderboard: a quantization scheme called GPTQ, score-first test-time training, partial rotary position embeddings, and depth recurrence. Once I'd seen a name three times, I went and read the paper.

The single biggest unlock from this phase wasn't an attention trick; it was on the compression side. My naive quantization minimized weight reconstruction error: for each weight matrix I picked a scale, divided everything by it, and rounded to the nearest integer. That's the obvious thing. It's also the wrong objective. We don't actually care if the quantized weights are close to the originals. We care if the quantized model's predictions are close to the originals'. GPTQ flips the objective: it quantizes each weight matrix one column at a time and, after rounding a column, nudges the not-yet-quantized weights to cancel the output error that rounding introduced, using a Hessian estimated from a calibration pass through the model. Same int6/int8 bit width, dramatically smarter rounding.
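A toy that shows the change of objective. This is not GPTQ: it brute-forces every up/down rounding choice for a two-weight row, with no Hessian involved. The weights and calibration inputs are made up.

```python
import itertools
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def output_error(w, q, scale, calib_inputs):
    """Squared error between original and quantized outputs on calibration inputs."""
    return sum((dot(w, x) - scale * dot(q, x)) ** 2 for x in calib_inputs)

def round_to_nearest(w, scale):
    """Naive objective: each weight goes to its closest grid point."""
    return [round(x / scale) for x in w]

def output_aware_round(w, scale, calib_inputs):
    """Try every floor/ceil combination and keep the one with the least output error."""
    choices = [(math.floor(x / scale), math.ceil(x / scale)) for x in w]
    best = min(itertools.product(*choices),
               key=lambda q: output_error(w, q, scale, calib_inputs))
    return list(best)

w = [0.4, 0.4]
scale = 1.0
calib = [[1.0, 2.0], [1.0, 1.0]]  # made-up calibration activations

rtn = round_to_nearest(w, scale)
aware = output_aware_round(w, scale, calib)
err_rtn = output_error(w, rtn, scale, calib)
err_aware = output_error(w, aware, scale, calib)
print(rtn, err_rtn, aware, err_aware)
```

Round-to-nearest sends both weights to zero. The output-aware choice rounds one weight the "wrong" way, ends up further from the original weights, and still produces outputs much closer to the original ones.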

The other half of the compression unlock was SDClip. Plain quantization clips weights at max(abs(row)). SDClip clips at k × std(row) instead — k=12.85 for the int6 layers, k=20 for the int8 embeddings. Same bit width again, just a smarter clipping threshold that produces lower-entropy quantized values. The compressed blob ended up at 0.455 bytes per parameter, down from 0.661 with naive max-clipping. That's the difference between fitting 24M params and fitting 35M into the same 16 MB. Suddenly the 11-layer SOTA-class architecture was on the table.
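A sketch of why the clipping threshold changes compressibility, on a synthetic Gaussian row. Mapping the threshold to the int6 grid as `threshold / 31` is my assumption about how the scale is derived. The helper names are invented, and the last lines redo the budget arithmetic assuming 16,000,000 bytes and ignoring the script's own size.

```python
import math
import random
from collections import Counter
from statistics import pstdev

def quantize_row(row, threshold, qmax=31):
    """Symmetric quantization: threshold maps to +/-qmax, values beyond it are clipped."""
    scale = threshold / qmax
    return [max(-qmax, min(qmax, round(w / scale))) for w in row]

def entropy_bits(values):
    """Empirical entropy in bits per value, a proxy for how well the values compress."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights

max_clip = max(abs(w) for w in row)  # plain max-abs clipping
sd_clip = 12.85 * pstdev(row)        # k * std with the int6 k from the text

h_max = entropy_bits(quantize_row(row, max_clip))
h_sd = entropy_bits(quantize_row(row, sd_clip))
print(f"entropy: max-clip={h_max:.2f} bits, k*std={h_sd:.2f} bits")

budget = 16_000_000  # assumed byte budget
params_naive = budget / 0.661
params_sdclip = budget / 0.455
print(f"{params_naive / 1e6:.1f}M vs {params_sdclip / 1e6:.1f}M parameters")
```

In this toy row the k × std threshold sits well beyond the largest weight, so nothing is clipped. The grid is simply coarser, the quantized values pile up on fewer integers, and the entropy coder has less to encode.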

The other shift in Phase 3 wasn't about writing new code. It was about stopping being a spectator. I ran ablations: take one idea from the leading submission, turn it off, retrain, measure the hit. Four ablations cost me about fifty dollars of GPU time. They told me, clearly, which single idea was worth porting into my own stack.

Phase 4: partial RoPE is obvious, once you see it

The change that got me near the top of the leaderboard was partial rotary embeddings. Rotary embeddings (RoPE) are how many modern transformers encode position: instead of adding a position vector to each token, you rotate the query and key vectors by a position-dependent angle, and the dot product between them ends up depending on position only through their relative distance. It's elegant.

What I didn't know before reading the SOTA submission is that you don't have to rotate every dimension. You can rotate only 16 out of 64 head dimensions and leave the other 48 untouched.

Here's the cleaner version of why that works. RoPE splits each 64-dimensional head into 32 dimension-pairs and rotates each pair at its own rate. Pair 0 rotates fastest: it wraps around a full circle roughly every six tokens. Pair 16 is much slower: across an entire 2048-token training window it accumulates only about 12 degrees of rotation. Pair 24 accumulates 0.005 radians, which is essentially zero. Those slow pairs are "rotated" on paper and content dimensions in practice; the model can't really learn to interpret a phase shift it never sees. Partial RoPE just makes that explicit: rotate only the fastest dimensions, the ones that genuinely cycle (16 of the 64 in the SOTA submission), and let the other 48 dimensions be pure content. The two effects compound: sharper attention from a cleaner position signal, and more capacity freed for what tokens actually mean.
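A minimal pure-Python sketch of partial RoPE on a single head vector. The adjacent-pair layout, the base of 10000, and deriving frequencies from the rotated slice are common conventions that I am assuming here; the SOTA submission's `train_gpt.py` may differ on all three.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vec as RoPE pairs; pass the rest through."""
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2 * i / rot_dims)  # pair i turns slower as i grows
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

# Deterministic stand-ins for a 64-dim query and key.
q = [math.sin(i + 1.0) for i in range(64)]
k = [math.cos(0.5 * i) for i in range(64)]

# Same relative distance (3 tokens) at two very different absolute positions.
near = dot(partial_rope(q, 5), partial_rope(k, 2))
far = dot(partial_rope(q, 105), partial_rope(k, 102))
untouched = partial_rope(q, 7)[16:]
print(f"near={near:.6f}  far={far:.6f}")
```

The two scores agree because only the offset between positions survives the dot product, while the last 48 entries come out exactly as they went in.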

Before retraining, I ran a sanity ablation: I took my already-trained model, monkey-patched apply_rotary_emb to identity (no rotation at all), and ran the eval. The score got 82% worse — bpb went from 1.26 to 2.30, which is a +1.03 hit. That convinced me the position signal mattered enormously and the rotation work wasn't a no-op. Then I retrained with partial RoPE instead of full RoPE, and the number went down.

The ablation discipline matters here. Three-seed standard deviation on a stable submission is around 0.0002 bpb. So a delta of 0.001 is real signal, about five times that spread; anything much smaller risks being seed noise, and you're fooling yourself. The threshold I used was: any single feature that bought ≥ 0.001 bpb was a candidate to port; ≥ 0.0015 was a high-confidence path to close the gap to SOTA. Each ablation ends up being a single env-var flip in the SOTA's train_gpt.py, which is nice: the experimental cost is the GPU time, not the engineering.
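That decision rule is simple enough to write down. The thresholds are the ones quoted above; the function name and every bpb value in the demo are hypothetical, not measurements from my runs.

```python
from statistics import mean

def classify_gain(bpb_without, bpb_with, port=0.001, high=0.0015):
    """Compare mean bpb across seeds with a feature off vs. on (lower is better)."""
    gain = mean(bpb_without) - mean(bpb_with)
    if gain >= high:
        return gain, "high-confidence"
    if gain >= port:
        return gain, "candidate"
    return gain, "noise"

# Hypothetical three-seed results for three different features.
with_feature = [1.0820, 1.0822, 1.0818]
strong = classify_gain([1.0840, 1.0842, 1.0838], with_feature)  # gain near 0.0020
medium = classify_gain([1.0832, 1.0834, 1.0830], with_feature)  # gain near 0.0012
weak = classify_gain([1.0824, 1.0826, 1.0822], with_feature)    # gain near 0.0004

for label, (gain, verdict) in (("strong", strong), ("medium", medium), ("weak", weak)):
    print(f"{label}: gain={gain:.4f} -> {verdict}")
```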

I trained three seeds to prove it wasn't luck. I shipped the submission. The PR is open right now.

Four ideas I didn't know a month earlier, one pull request on a very widely watched ML repo, and a leaderboard entry that sits a hair behind the record.

The 0.0010 bpb between me and the current SOTA is one or more of three known things. We share a lot — same SP8192 vocab, same 11-layer architecture, same MLP 4x, same LeakyReLU², same GPTQ + SDClip, same EMA, same warmdown. We differ on three structural pieces: I rotate all 64 RoPE dims, they rotate 16. I run sequential residuals, they run parallel residuals starting at layer 7 (a GPT-J trick). I have no depth recurrence, they loop layers 3–5 twice and switch it on at 35% of training. The closing experiment is to ablate each of those one at a time on the SOTA stack. I haven't run it yet. Post 5 of this series will, with the data.


What the rest of this series will cover

This post is the entry point. The rest of the series covers what I learned doing this challenge:

  • Post 2
  • Post 3
  • Post 4

Anyone can — because the on-ramp is there

I'm not going to pretend this was effortless. I burned real money on failed runs. I re-did my first pull request from scratch because I'd committed garbage. I learned what DevToolsActivePort is for reasons unrelated to the challenge and what np.memmap is for reasons very related. I had a weekend where nothing I tried moved the score and I wondered if I was one of those people who was going to flame out before shipping anything.

I kept going because the on-ramp was there, and I'd like to make the case that it is here for you too. The rules are bounded and public. The code is working and readable. The leaderboard doesn't care who you are. A thousand-dollar GPU grant can be requested on OpenAI's site. If you've been feeling like modern machine learning is a field that moved on without you, Parameter Golf is a concrete, unambiguous way to walk yourself back in.

The barrier to getting started on hard things is almost never intelligence. It's the absence of a clear on-ramp. Here, for once, the on-ramp is obvious. I'm going to spend the next four posts walking you up it.