What is OpenAI's Parameter Golf Challenge, and why I spent a month on it

March 18th, 2026. OpenAI posts the rules for something called Parameter Golf.

Train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s.

I read it twice and almost closed the tab. I trained a GAN a long time ago. I'd picked up transformer basics here and there. I had never trained a language model. I had never rented a GPU by the hour. Every word in the announcement was slightly beyond my reach.

I decided to try anyway.

A month later, my submission is open on OpenAI's repository at openai/parameter-golf#1747 — a hair behind the world record on the public leaderboard.

What follows is that month, told as a map of the ideas I had to understand along the way: partial rotary embeddings, quantized weights, test-time training, and a handful of things I'd never heard of on March 17th. The rest of this series unpacks them one at a time. This post is about why the challenge itself became the best curriculum I've ever found for getting into modern language modeling.

The rules, in plain language

Parameter Golf is a narrow, beautifully specified contest. You submit a single Python file — train_gpt.py — plus a compressed model blob. Together, they have to weigh less than 16 megabytes.

The file is run on a rented 8-GPU machine (8×H100, which costs about twenty dollars an hour), and your training script has ten minutes to turn random weights into a working language model. Then your model is scored on a held-out slice of the web — the FineWeb validation set — using a measure called bits-per-byte.

Bits-per-byte is a tidy idea once it clicks. Your model is given real text it has never seen, and for each next token it assigns a probability to what actually comes next. If the model thought the real next token was likely, it "pays" few bits; if the real next token surprises it, it pays many. Sum the bill across the entire validation set, divide by the number of bytes in that text, and you have the average number of bits your model needed to describe each byte of real text. Lower is better. The naive baseline OpenAI shipped with the repo scored 1.2244. A month later, the top of the leaderboard was 1.0810.
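A minimal sketch of that bookkeeping, assuming the model hands back the probability it assigned to each actual next token. The function name and the toy inputs are illustrative, not taken from the challenge code.

```python
import math

def bits_per_byte(token_probs, text):
    """Average bits paid per UTF-8 byte of `text`.

    token_probs: probability the model gave to each token that actually occurred.
    text: the string those tokens decode to.
    """
    total_bits = sum(-math.log2(p) for p in token_probs)
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes

# Toy case: three tokens cover the 11 bytes of "hello world".
# The model was fairly sure of the first and less sure of the other two.
probs = [0.5, 0.25, 0.125]          # costs 1 + 2 + 3 = 6 bits
bpb = bits_per_byte(probs, "hello world")
print(bpb)                          # 6 bits spread over 11 bytes
```

Because the denominator counts bytes rather than tokens, a tokenizer that emits fewer, longer tokens gets no free discount: the total bill is still divided by the same number of bytes.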

The gap between those two numbers is the whole game. And because the measure is tokenizer-agnostic, nobody can cheat by picking a friendly vocabulary — all scores are on the same, absolute scale.

What makes the rule set unusual isn't any single constraint. It's the combination of all three.

Sixteen megabytes means you can't have many parameters; ten minutes means you can't train for long; a fixed evaluation means you can't tune against the test. Every trick someone uses is squeezed through all three at once. Which is, it turns out, a phenomenal way to force you to actually understand what each trick does.

The best curriculum I've ever found

I've tried to "learn modern transformers" three times in the last few years. Each time I read a handful of papers, felt smarter for a week, and forgot most of it. Parameter Golf broke that pattern, and I think I know why.

The rules are public, and so is the code. Every record submission on the leaderboard is a pull request. You can click through and read the exact working Python that beat every previous entry. Papers describe ideas — often after the fact, often in selective detail. This challenge gives you running implementations, in a repository you can clone, with discussions attached to the PRs explaining why each change was made.

The surface area is small enough to hold in your head. A training script in Parameter Golf is about a thousand lines. You can read it in one sitting. Compare that with trying to learn from a production training codebase: you drown before you learn anything.

The feedback loop is honest and quick. You train for ten minutes. You get a number. The number goes on a leaderboard. A better number or a worse number is the only arbiter. There's no benchmark gaming, no cherry-picked evaluation. Either your idea works or it doesn't, and you find out before lunch.

The incentive structure is generous. OpenAI put a million dollars of compute credits on the table so that newcomers — me included — could afford to try. I wasn't expected to show up with my own cluster.

All of that adds up to something I've never had before in machine learning: a concrete, bounded, self-scoring problem where every successful idea is already in front of me as working code.

The month, roughly

The arc of my month wasn't week-by-week. It was concept-by-concept. Four ideas I didn't understand on March 17th, each of which unlocked the next phase.

Phase 1: tokenization is a compression scheme

I started by reading the naive baseline train_gpt.py line by line. I didn't type a thing for two days. I just wanted to understand what a small transformer (nine layers, 512 hidden dimensions, a 1024-token vocabulary) actually looked like as code.

The first real decision I made was to make the vocabulary bigger. The intuition, which I picked up from the SentencePiece paper and a Hugging Face tutorial, is that the vocabulary is itself a compression scheme: a bigger vocabulary breaks sentences into fewer, larger tokens, which means each step of training sees more context. The catch is that the embedding table — one learned vector per vocabulary entry — scales linearly with vocab size, and it has to fit in that 16MB budget. At 4096 entries it was already a quarter of the artifact. I ended up at 8192, the sweet spot where the token reduction paid for the extra bytes.
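The linear scaling is easy to check by hand. The sketch below assumes a 512-wide embedding, 16-bit weights, and a decimal 16 MB budget; all three are assumptions for illustration, and a quantized, compressed artifact will come out smaller than these raw figures.

```python
def embedding_table_bytes(vocab_size, hidden_dim, bytes_per_weight):
    """Raw size of an embedding table: one vector of hidden_dim weights per vocab entry."""
    return vocab_size * hidden_dim * bytes_per_weight

BUDGET_BYTES = 16_000_000  # assumption: 16 MB counted as decimal megabytes

for vocab in (1024, 4096, 8192):
    # Assumption: 2 bytes per weight (16-bit storage before any quantization).
    size = embedding_table_bytes(vocab, 512, bytes_per_weight=2)
    print(f"vocab={vocab:5d}  table={size:>9,d} bytes  share={size / BUDGET_BYTES:.0%}")
```

Doubling the vocabulary doubles the table, so every step up has to earn its bytes back through shorter token sequences.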

One stumble worth naming: I tried to retokenize the dataset the obvious way — load a shard into memory, run it through the new tokenizer, write it out — and my machine died. A single training shard is 191 megabytes of raw tokens. I had to rewrite the pipeline using memory-mapped files. I spent a weekend learning what np.memmap does.
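For readers who have not met np.memmap either, here is a minimal sketch of the streaming pattern. It assumes a shard is a flat file of uint16 token ids, which may not match the real shard layout, and `retokenize_chunk` is a hypothetical callback standing in for the actual decode-and-re-encode step.

```python
import os
import tempfile
import numpy as np

def iter_chunks(path, dtype=np.uint16, chunk_tokens=1_000_000):
    """Yield slices of a flat token file without loading the whole file into RAM."""
    tokens = np.memmap(path, dtype=dtype, mode="r")
    for start in range(0, len(tokens), chunk_tokens):
        # np.array copies only this slice out of the mapped file.
        yield np.array(tokens[start:start + chunk_tokens])

def retokenize_file(src, dst, retokenize_chunk, chunk_tokens=1_000_000):
    """Stream src through retokenize_chunk and append each result to dst."""
    with open(dst, "wb") as out:
        for chunk in iter_chunks(src, chunk_tokens=chunk_tokens):
            retokenize_chunk(chunk).astype(np.uint16).tofile(out)

# Tiny demo on a ten-token file. The lambda is a placeholder, not a tokenizer.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "shard.bin")
dst = os.path.join(tmpdir, "shard_new.bin")
np.arange(10, dtype=np.uint16).tofile(src)
retokenize_file(src, dst, retokenize_chunk=lambda ids: ids[::2], chunk_tokens=4)
result = np.fromfile(dst, dtype=np.uint16)
print(result)
```

A real pipeline also has to handle text that straddles a chunk boundary, which this sketch ignores.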

Phase 2: test-time training is not cheating

My first working submission was a non-record — val_bpb 1.1573, mid-table at the time. But it was on OpenAI's repository, under my GitHub handle, and that mattered more than the score. I got it by adding a trick called LoRA test-time training: at evaluation time, after the model scores each chunk of text, it briefly fine-tunes itself on that chunk before predicting the next one. The first time I read about it I was sure it was cheating. It isn't — you're only training on tokens the model has already been graded on, not on tokens it will be scored against. The research lineage is well worth a read on its own.
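The ordering is the whole argument, so it helps to see it in isolation. In the sketch below a simple byte-count model stands in for the transformer and its LoRA update; that substitution is an assumption made to keep the example dependency-free, and only the score-then-adapt order is the point.

```python
import math
from collections import Counter

def score_first_bits(chunks, alphabet_size=256):
    """Total bits paid when every chunk is scored before the model adapts to it."""
    counts = Counter()   # the adaptable state: how often each byte has been seen
    seen = 0
    total_bits = 0.0
    for chunk in chunks:
        # 1) Score the chunk with the model exactly as it stood beforehand.
        for byte in chunk:
            p = (counts[byte] + 1) / (seen + alphabet_size)  # add-one smoothing
            total_bits += -math.log2(p)
        # 2) Only now adapt, using text that has already been graded.
        counts.update(chunk)
        seen += len(chunk)
    return total_bits

chunks = [b"aaaa", b"aaaa"]
adaptive_bits = score_first_bits(chunks)
# A model that never adapts keeps paying the uniform price of 8 bits per byte.
frozen_bits = 8.0 * sum(len(c) for c in chunks)
print(adaptive_bits, frozen_bits)
```

The first chunk costs the same either way. The saving appears only on later chunks, which is why no score is ever computed with knowledge of the text being scored.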

One stumble: my first pull request included all my local experiment files, because I hadn't yet learned how to keep a clean submission branch. I had to rebuild it from scratch off openai/parameter-golf:main and re-submit. Nobody on the team made me feel dumb about it, which I still appreciate.

Phase 3: the leaderboard is a curriculum

The most counter-intuitive thing about this challenge is that every record submission is public, working code. I wrote a forty-line Python script that could pull down a winning submission, decompress the packed model blob (the submissions use LZMA plus base85 encoding), and leave me with the full train_gpt.py someone had used to beat everyone else.
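The decoding step itself needs only the standard library. This sketch assumes the blob is LZMA-compressed first and base85-encoded second, and that the encoding is the `b85` variant; how a given submission actually embeds its blob is not specified here, so treat the order and variant as assumptions.

```python
import base64
import lzma

def pack(raw: bytes) -> str:
    """Compress, then encode as printable text (assumed order)."""
    return base64.b85encode(lzma.compress(raw)).decode("ascii")

def unpack(text: str) -> bytes:
    """Reverse of pack: decode the text, then decompress."""
    return lzma.decompress(base64.b85decode(text))

# Round-trip a stand-in for a training script.
source = b"print('hello from train_gpt.py')\n" * 50
blob = pack(source)
recovered = unpack(blob)
print(len(source), len(blob))
```

Base85 spends five printable characters on every four bytes, so the text form is about 25% larger than the compressed bytes it carries.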

Then I read them. Over and over. The same three or four ideas kept recurring at the top of the leaderboard: a quantization scheme called GPTQ, score-first test-time training, partial rotary position embeddings, and depth recurrence. Once I'd seen a name three times, I went and read the paper.

The single biggest unlock from this phase wasn't an attention trick. It was on the compression side. My naive quantization minimized weight reconstruction error: for each weight matrix I picked a scale, divided everything by it, rounded to int. That's the obvious thing. It's also the wrong objective. We don't actually care if the quantized weights are close to the originals. We care if the quantized model's predictions are close to the originals'. GPTQ flips the objective: it quantizes a weight matrix one column at a time and, after rounding each column, adjusts the columns not yet quantized to compensate, so that the layer's output error stays small. The compensation uses a Hessian estimated from a calibration pass through the model. Same int6/int8 bit width, dramatically smarter rounding.
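A two-weight toy makes the gap between the two objectives concrete. This is not GPTQ, only an illustration that the rounding closest to the weights can be far from the rounding closest to the outputs; the weights, inputs, and integer grid are made up for the example.

```python
import numpy as np

def weight_error(w, w_q):
    """Squared distance between original and quantized weights."""
    return float(np.sum((w - w_q) ** 2))

def output_error(w, w_q, x):
    """Squared distance between the layer outputs on calibration inputs x."""
    return float(np.sum((x @ w - x @ w_q) ** 2))

w = np.array([0.4, 0.4])                  # two weights feeding one output
x = np.array([[1.0, 1.0], [2.0, 2.0]])    # calibration inputs: both features move together

nearest = np.round(w)                     # [0, 0], the closest grid point to w
compensated = np.array([1.0, 0.0])        # one weight deliberately rounded up

print("weight error:", weight_error(w, nearest), weight_error(w, compensated))
print("output error:", output_error(w, nearest, x), output_error(w, compensated, x))
```

Rounding both weights down is best by the weight metric but wipes out the output entirely. Rounding one of them up costs more in weight space and recovers most of the output, because the inputs are correlated. Calibration data is what reveals that correlation.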

The other half of the compression unlock was SDClip. Plain quantization clips weights at max(abs(row)). SDClip clips at k × std(row) instead — k=12.85 for the int6 layers, k=20 for the int8 embeddings. Same bit width again, just a smarter clipping threshold that produces lower-entropy quantized values. The compressed blob ended up at 0.455 bytes per parameter, down from 0.661 with naive max-clipping. That's the difference between fitting 24M params and fitting 35M into the same 16 MB. Suddenly the 11-layer SOTA-class architecture was on the table.
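The entropy effect can be reproduced on synthetic weights. The sketch below assumes symmetric per-row quantization and uses Gaussian noise as a stand-in for a trained matrix; real weights have heavier tails, so the exact numbers will differ. The helper names are illustrative.

```python
import numpy as np

def quantize_rows(w, bits, threshold_fn):
    """Symmetric per-row quantization; threshold_fn returns one clip value per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = threshold_fn(w) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def entropy_bits(q):
    """Empirical entropy of the integer codes, in bits per weight."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def max_clip(w):
    return np.abs(w).max(axis=1, keepdims=True)

def sd_clip(w, k=12.85):
    return k * w.std(axis=1, keepdims=True)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 512))   # assumption: synthetic stand-in for real weights

q_max, _ = quantize_rows(w, bits=6, threshold_fn=max_clip)
q_sd, _ = quantize_rows(w, bits=6, threshold_fn=sd_clip)
print(entropy_bits(q_max), entropy_bits(q_sd))
```

A wider threshold means coarser steps, so more weights land on the same few codes near zero and a general-purpose compressor has less to store. The price is more rounding error per weight, which is the trade-off the choice of k has to balance.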

The other shift in Phase 3 wasn't about writing new code. It was about stopping being a spectator. I ran ablations: take one idea from the leading submission, turn it off, retrain, measure the hit. Four ablations cost me about fifty dollars of GPU time. They told me, clearly, which single idea was worth porting into my own stack.

Phase 4: partial RoPE is obvious, once you see it

The change that got me near the top of the leaderboard was partial rotary embeddings. Rotary embeddings (RoPE) are one way transformers encode position: instead of adding a position vector to each token, you rotate the query and key vectors by a position-dependent angle, so that the dot product between them depends on position only through their relative distance. It's elegant.
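The relative-distance property can be checked numerically. This sketch assumes the interleaved pairing convention (dimensions 0 and 1 form a pair, then 2 and 3, and so on) and a base of 10000; implementations differ on both, so treat them as assumptions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation rate per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same query and key content, same gap of 3 tokens, very different absolute positions.
score_near_start = rope(q, 10) @ rope(k, 7)
score_far_along = rope(q, 1010) @ rope(k, 1007)
print(score_near_start, score_far_along)
```

The two scores agree up to floating-point error: shifting both positions by the same amount rotates query and key together, and the dot product cannot tell.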

What I didn't know before reading the SOTA submission is that you don't have to rotate every dimension. You can rotate only 16 out of 64 head dimensions and leave the other 48 untouched.

Here's the cleaner version of why that works. RoPE splits each 64-dimension head into 32 dimension-pairs and rotates each pair at its own rate. Pair 0 rotates fastest: it wraps around a full circle roughly every six tokens. Pair 16 is much slower: across an entire 2048-token training window it accumulates only about 12 degrees of rotation. Pair 24 accumulates 0.005 radians, which is essentially zero. Those slow pairs are "rotated" on paper and content dimensions in practice, because the model can't really learn to interpret a phase shift it never sees. Partial RoPE just makes that explicit: rotate the 16 fastest dimensions (the 8 pairs that genuinely cycle), and let the other 48 dimensions (24 pairs) be pure content. The two effects compound: sharper attention from a cleaner position signal, and more capacity freed for what tokens actually mean.
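A sketch of rotating 16 of 64 head dimensions, plus a helper for checking how far each pair turns across a window. Two assumptions are baked in: the base is 10000, and the rotated dimensions keep the fastest rates of the full-head schedule (some implementations recompute the schedule over the rotated dimensions only). The slow-pair totals depend strongly on the base, so the helper's output need not match figures quoted for a different configuration.

```python
import numpy as np

def pair_frequencies(head_dim=64, base=10000.0):
    """Rotation rate in radians per token for each dimension-pair, fastest first."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def total_rotation(window, head_dim=64, base=10000.0):
    """Radians each pair turns through between the first and last token of a window."""
    return pair_frequencies(head_dim, base) * (window - 1)

def partial_rope(x, pos, rotary_dims=16, base=10000.0):
    """Rotate only the first rotary_dims dimensions; the rest pass through unchanged."""
    head_dim = x.shape[-1]
    # Assumption: reuse the fastest rotary_dims // 2 rates of the full-head schedule.
    freqs = pair_frequencies(head_dim, base)[: rotary_dims // 2]
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0:rotary_dims:2], x[..., 1:rotary_dims:2]
    out = x.copy()
    out[..., 0:rotary_dims:2] = x_even * cos - x_odd * sin
    out[..., 1:rotary_dims:2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
rotated = partial_rope(q, pos=5)

print("tokens per full turn, pair 0:", 2 * np.pi / pair_frequencies()[0])
print("untouched dims identical:", np.array_equal(rotated[16:], q[16:]))
```

The last 48 dimensions contribute a plain content dot product to the attention score, while the first 16 still carry the relative-position signal.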

Before retraining, I ran a sanity ablation: I took my already-trained model, monkey-patched apply_rotary_emb to identity (no rotation at all), and ran the eval. The score got 82% worse — bpb went from 1.26 to 2.30, which is a +1.03 hit. That convinced me the position signal mattered enormously and the rotation work wasn't a no-op. Then I retrained with partial RoPE instead of full RoPE, and the number went down.
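The patching pattern looks roughly like this. The namespace, the toy rotation, and `forward` are hypothetical stand-ins for the real training module; only `unittest.mock.patch.object` is a real library call.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the module that defines apply_rotary_emb.
# The toy "rotation" just shifts values by the position.
model_code = SimpleNamespace(apply_rotary_emb=lambda x, pos: [v + pos for v in x])

def forward(x, pos):
    # The function is looked up on the module at call time, so a patch takes effect.
    return model_code.apply_rotary_emb(x, pos)

def identity(x, pos):
    """Ablation: ignore position entirely."""
    return x

baseline = forward([1, 2], pos=3)
with mock.patch.object(model_code, "apply_rotary_emb", identity):
    ablated = forward([1, 2], pos=3)
restored = forward([1, 2], pos=3)   # the original is back once the block exits
print(baseline, ablated, restored)
```

One practical caveat: if the calling code imported the function by name, the patch has to target the name in the module where it is used, not the one where it is defined.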

The ablation discipline matters here. Three-seed standard deviation on a stable submission is around 0.0002 bpb. So a delta of 0.001 is real signal, and anything smaller is seed noise and you're fooling yourself. The threshold I used was: any single feature that bought ≥ 0.001 bpb was a candidate to port; ≥ 0.0015 was a high-confidence path to close the gap to SOTA. Each ablation ends up being a single env-var flip in the SOTA's train_gpt.py, which is nice: the experimental cost is the GPU time, not the engineering.
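The decision rule is small enough to write down. The run values below are hypothetical, chosen only to illustrate the arithmetic, and are not measured results from any submission.

```python
import statistics

def feature_gain(with_feature, without_feature, threshold=0.001):
    """Mean bpb a feature buys, and whether that clears the porting threshold."""
    gain = statistics.mean(without_feature) - statistics.mean(with_feature)
    return gain, gain >= threshold

# Hypothetical three-seed results (lower bpb is better).
with_feature = [1.0820, 1.0822, 1.0818]
without_feature = [1.0834, 1.0836, 1.0832]

gain, worth_porting = feature_gain(with_feature, without_feature)
seed_noise = statistics.stdev(with_feature)
print(f"gain={gain:.4f}  seed std={seed_noise:.4f}  port it: {worth_porting}")
```

Comparing means over several seeds, rather than single runs, is what keeps a lucky seed from passing for a real improvement.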

I trained three seeds to prove it wasn't luck. I shipped the submission. The PR is open right now.

Four ideas I didn't know a month earlier, one pull request on the world's most-watched ML repo, and a leaderboard entry that sits a hair behind the record.

The 0.0010 bpb between me and the current SOTA is one or more of three known things. We share a lot — same SP8192 vocab, same 11-layer architecture, same MLP 4x, same LeakyReLU², same GPTQ + SDClip, same EMA, same warmdown. We differ on three structural pieces: I rotate all 64 RoPE dims, they rotate 16. I run sequential residuals, they run parallel residuals starting at layer 7 (a GPT-J trick). I have no depth recurrence, they loop layers 3–5 twice and switch it on at 35% of training. The closing experiment is to ablate each of those one at a time on the SOTA stack. I haven't run it yet. Post 5 of this series will, with the data.
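Two of those structural differences are easy to state in code. The sketch below leaves out layer norms and uses toy stand-ins for the attention and MLP branches. It also reads "loop layers 3–5 twice" as two passes in total, which is an assumption.

```python
def sequential_block(x, attn, mlp):
    """Standard residual block: the MLP reads the attention output."""
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp):
    """Parallel residual block: both branches read the same input."""
    return x + attn(x) + mlp(x)

def layer_schedule(n_layers, loop_start, loop_end, passes):
    """Order in which layer indices run when loop_start..loop_end repeat `passes` times."""
    before = list(range(loop_start))
    looped = list(range(loop_start, loop_end + 1)) * passes
    after = list(range(loop_end + 1, n_layers))
    return before + looped + after

# Toy branches, chosen so the two block styles give visibly different results.
attn = lambda x: 2 * x
mlp = lambda x: x + 1

print(sequential_block(1, attn, mlp), parallel_block(1, attn, mlp))
print(layer_schedule(11, loop_start=3, loop_end=5, passes=2))
```

Depth recurrence reuses the same weights on the second pass, so the model gets 14 layers of computation out of 11 layers of parameters, which is exactly the trade a byte budget rewards.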

What the rest of this series will cover

This post is the entry point. The remaining posts in the series cover what I learned along the way:

Post 2
Post 3
Post 4

Anyone can — because the on-ramp is there

I'm not going to pretend this was effortless. I burned real money on failed runs. I re-did my first pull request from scratch because I'd committed garbage. I learned what DevToolsActivePort is for reasons unrelated to the challenge and what np.memmap is for reasons very related. I had a weekend where nothing I tried moved the score and I wondered if I was one of those people who was going to flame out before shipping anything.

I kept going because the on-ramp was there, and I'd like to make the case that it is here for you too. The rules are bounded and public. The code is working and readable. The leaderboard doesn't care who you are. A thousand-dollar GPU grant can be requested on OpenAI's site. If you've been feeling like modern machine learning is a field that moved on without you, Parameter Golf is a concrete, unambiguous way to walk yourself back in.

The barrier to getting started on hard things is almost never intelligence. It's the absence of a clear on-ramp. Here, for once, the on-ramp is obvious. I'm going to spend the next four posts walking you up it.
