Building a Language Transformer Step by Step
Yi · 2026-01-29 · via Yi's Blog
  • Preparation: Standing on the Shoulders of Giants
  • The Experiment
  • Results - Architecture Comparison (5,000 steps)
  • Results - Scaling Up
  • Generated Names
  • Key Findings
    • Depth beats width
    • Data handling matters most
    • MLP adds capacity but needs regularization
    • LayerNorm and RoPE help incrementally
    • GELU vs ReLU is negligible at small scale
    • Scaling up helps significantly
    • Dropout trades speed for generalization
    • Training improvements compound
  • Architecture Summary
  • What the Loss Means
  • Conclusion

After months of reading about transformers and LLMs, I finally decided to build one from scratch. Not by copy-pasting code, but by incrementally adding each architectural component and measuring its impact. The result was a character-level name generator trained on 32,033 names, and the journey taught me more than any paper or tutorial could.

Preparation: Standing on the Shoulders of Giants

Before diving into code, I spent time building intuition through two excellent resources:

“Build a Large Language Model (From Scratch)” by Sebastian Raschka was my theoretical foundation. The book walks through every component of a transformer with clear explanations and diagrams. Reading it gave me a mental model of how attention, embeddings, and layer normalization fit together — knowledge that proved essential when debugging my own implementation.

Andrej Karpathy’s YouTube series (Neural Networks: Zero to Hero) was equally valuable. His “Let’s build GPT” video demystified the architecture by building it live on screen. Watching someone think through the design decisions — why we use residual connections, how attention matrices work, what LayerNorm actually does — made the concepts stick in a way that reading alone couldn’t. His makemore repository became the dataset and benchmark for my experiments.

With this foundation, I was ready to build.

The Experiment

I incrementally built a character-level transformer for name generation. Each step adds one architectural improvement. All models were trained with batch size 32, AdamW optimizer, and per-name padding with masked loss.

Results - Architecture Comparison (5,000 steps)

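| Config      | N_EMBD | Heads | Layers | Params | Train loss | Test loss |
|-------------|--------|-------|--------|--------|------------|-----------|
| baseline    | 32     | 1     | 1      | 2,908  | 2.35       | 2.35      |
| double embd | 64     | 1     | 1      | 8,860  | 2.34       | 2.34      |
| 2 heads     | 32     | 2     | 1      | 5,948  | 2.25       | 2.23      |
| 4 layers    | 32     | 2     | 4      | 18,332 | 2.00       | 2.04      |
| + MLP       | 32     | 2     | 4      | 51,740 | 1.97       | 2.02      |
| + LayerNorm | 32     | 2     | 4      | 52,252 | 1.96       | 1.99      |
| + RoPE      | 32     | 2     | 4      | 52,252 | 1.94       | 1.98      |
| + GELU      | 32     | 2     | 4      | 52,252 | 1.94       | 1.94      |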

Results - Scaling Up

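| Config                                 | Steps  | Train loss | Test loss | Notes                         |
|----------------------------------------|--------|------------|-----------|-------------------------------|
| N_EMBD=32, 2 heads                     | 5,000  | 1.94       | 1.94      | Baseline final model          |
| N_EMBD=64, 4 heads                     | 5,000  | 1.84       | 1.92      | Matches makemore architecture |
| N_EMBD=64, 4 heads + dropout           | 5,000  | 1.95       | 2.00      | Dropout slows convergence     |
| N_EMBD=64, 4 heads + dropout           | 20,000 | 1.75       | 1.85      | Longer training helps         |
| + LR schedule, weight decay, grad clip | 20,000 | 1.72       | 1.86      | Training improvements         |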

Makemore’s default transformer achieves ~1.92 test loss with N_EMBD=64, 4 heads, 4 layers.

Generated Names

Sample outputs from the final model (N_EMBD=64, 4 heads, 20k steps with all training improvements):

kaelynn, aileigh, elyce, yadi, ovani, derella, nyailee, ranyah, niaa, sett

Key Findings

Depth beats width

Doubling embedding size from 32 to 64 (3x params) gave almost no improvement (2.35 -> 2.34). Adding a second attention head with fewer total params (5,948 vs 8,860) dropped test loss by 0.12 (2.35 -> 2.23). Stacking 4 layers was the single biggest architectural improvement, dropping test loss from 2.23 to 2.04. The model benefits far more from multiple layers of processing than from wider representations at a single layer.

Data handling matters most

Before adding per-name padding, my best model achieved 2.36 test loss. After switching to per-name padding with masked loss (same architecture), it dropped to 1.94. That 0.42 drop was a larger improvement than all architectural changes combined. The reason: without padding, the model wasted capacity trying to predict across name boundaries, an impossible task that added noise to every gradient update.
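
To make the idea concrete, here is a minimal sketch of per-name padding with a masked loss in plain Python. It is an illustration, not the code used for the experiments: the token ids, the `PAD` and `BOUNDARY` values, and the helper names `encode_name` and `masked_mean_loss` are all assumptions. Frameworks such as PyTorch expose the same idea through the `ignore_index` argument of `cross_entropy`.

```python
import math

# Hypothetical vocabulary: PAD = 0, name boundary = 1, then a-z (28 tokens).
PAD = 0
BOUNDARY = 1
STOI = {ch: i + 2 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}


def encode_name(name, block_size):
    """Return (inputs, targets) for a single name, right-padded to block_size."""
    ids = [BOUNDARY] + [STOI[ch] for ch in name] + [BOUNDARY]
    inputs, targets = ids[:-1], ids[1:]  # targets are inputs shifted by one
    if len(inputs) > block_size:
        raise ValueError("name is longer than block_size")
    pad = [PAD] * (block_size - len(inputs))
    return inputs + pad, targets + pad


def masked_mean_loss(target_probs, targets):
    """Average -log p(target) over positions whose target is not PAD."""
    losses = [-math.log(p) for p, t in zip(target_probs, targets) if t != PAD]
    return sum(losses) / len(losses)


inputs, targets = encode_name("emma", block_size=8)
print(inputs)   # [1, 6, 14, 14, 2, 0, 0, 0]
print(targets)  # [6, 14, 14, 2, 1, 0, 0, 0]
```

Each row holds exactly one name, so no target ever asks the model to guess the first letter of an unrelated name, and the padded positions contribute nothing to the loss.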

MLP adds capacity but needs regularization

Adding the feed-forward network (MLP) to each layer nearly tripled the parameter count (18,332 -> 51,740) but only modestly improved results (test loss 2.04 -> 2.02). It also widened the train-test gap slightly (2.00/2.04 -> 1.97/2.02, a gap of 0.04 -> 0.05), suggesting mild overfitting. The MLP lets the model transform representations nonlinearly after attention gathers information, but at this small scale the effect is limited.
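
The parameter jump can be checked with a little arithmetic. The sketch below assumes each MLP is two linear layers with biases and a 4x hidden expansion (32 -> 128 -> 32 at N_EMBD=32, by analogy with the 64 -> 256 -> 64 MLP in the architecture summary); the helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def mlp_params(n_embd, expansion=4):
    """Parameters of an n_embd -> expansion*n_embd -> n_embd MLP."""
    hidden = expansion * n_embd
    return linear_params(n_embd, hidden) + linear_params(hidden, n_embd)


N_EMBD, N_LAYER = 32, 4
added = N_LAYER * mlp_params(N_EMBD)
print(added)           # 33408
print(18_332 + added)  # 51740
```

Under those assumptions the four MLPs account for exactly the difference between the "4 layers" and "+ MLP" rows of the table.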

LayerNorm and RoPE help incrementally

LayerNorm stabilized training and narrowed the train-test gap slightly (0.05 -> 0.03) for only 512 extra parameters. RoPE (Rotary Position Embeddings) gave the model awareness of character positions without adding any parameters. Neither was dramatic at this scale, but both matter more for larger models: LayerNorm helps make deep networks trainable, and RoPE encodes relative position, which helps when sequences get longer.
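
As a rough sketch of why RoPE needs no parameters: it rotates each pair of query and key features by an angle proportional to the token's position, so the attention score ends up depending only on the distance between two positions. The version below pairs adjacent features and uses a base of 10000; both are common conventions assumed here, not details taken from the experiments, and the function names are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate pairs (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (2 positions apart) at different absolute positions.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 10), rope(k, 8))
print(abs(near - far) < 1e-9)  # True
```

Because the rotation is applied to queries and keys on the fly, nothing is learned or stored, which matches the unchanged parameter count in the table.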

GELU vs ReLU is negligible at small scale

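Switching from ReLU to GELU activation in the MLP left train loss unchanged at 1.94. Test loss moved from 1.98 to 1.94, a difference small enough that it may be run-to-run noise. The smoother gradient flow matters more when networks are deeper and wider.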
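
For reference, the two activations differ only near zero. The sketch below uses the exact GELU definition, x * Phi(x) with Phi the standard normal CDF; whether the experiments used this form or the tanh approximation is not stated, so treat that choice as an assumption.

```python
import math


def relu(x):
    """Hard cutoff: negative inputs become exactly zero."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

GELU lets small negative inputs through with a small negative output instead of clamping them, and converges to ReLU for large positive or negative inputs.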

Scaling up helps significantly

Doubling N_EMBD to 64 and using 4 heads (matching makemore’s architecture) dropped test loss from 1.94 to 1.92 at 5k steps. With dropout and longer training (20k steps), the model reached 1.85 test loss, surpassing makemore’s default of ~1.92.

Dropout trades speed for generalization

Adding 20% dropout slowed convergence. At 5k steps it hurt both train and test loss (1.84/1.92 -> 1.95/2.00), even though the train-test gap narrowed from 0.08 to 0.05. The benefit shows up in longer training runs, where dropout limits overfitting and lets the model keep improving past the point where it would otherwise plateau.
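
A minimal sketch of what dropout does, assuming the usual "inverted" formulation: each activation is zeroed with probability p during training and the survivors are scaled by 1/(1-p), so the expected value is unchanged and nothing needs to be rescaled at inference time. The function name and the explicit `rng` argument are illustrative choices.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout over a list of activations."""
    if not training or p == 0.0:
        return list(values)  # inference: pass activations through untouched
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
acts = [1.0] * 10_000
dropped = dropout(acts, p=0.2, training=True, rng=rng)

kept = sum(1 for v in dropped if v != 0.0)
print(kept / len(dropped))          # close to 0.8
print(sum(dropped) / len(dropped))  # close to 1.0
```

The injected noise is why the training loss is higher at the same step count: the model is being fit under a harder objective than the one it is evaluated on.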

Training improvements compound

Learning rate scheduling (warmup + cosine decay), weight decay (0.01), and gradient clipping (max_norm=1.0) together produced smoother training curves. In the table they lowered train loss (1.75 -> 1.72) while test loss stayed essentially flat (1.85 -> 1.86). The cosine decay keeps the learning rate from being too high in later steps, when the model only needs small adjustments. Weight decay acts as regularization. Gradient clipping prevents instability from occasional large gradients.
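
Gradient clipping by global norm is simple enough to sketch in a few lines. This is an illustrative stand-in that works on a flat list of gradient values, similar in spirit to PyTorch's `torch.nn.utils.clip_grad_norm_`; the function name and the `eps` constant are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + eps))
    return [g * scale for g in grads], total


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.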

Architecture Summary

The final model is a proper transformer decoder:

Input tokens
    -> Token Embedding (28 vocab -> 64 dim)
    -> 4x Transformer Blocks:
        -> LayerNorm -> Multi-Head Attention (4 heads, RoPE, dropout) -> Residual
        -> LayerNorm -> MLP (64 -> 256 -> 64, GELU, dropout) -> Residual
    -> Linear (64 -> 28 vocab)
    -> Cross-entropy loss (masked on PAD tokens)
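
The wiring of one block (normalize, apply the sublayer, add the result back to the input) can be sketched in plain Python. This is a structural illustration only: `attn` and `mlp` are placeholder callables supplied by the caller, and `ln1` and `ln2` are hypothetical `(gamma, beta)` pairs.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv + b for v, g, b in zip(x, gamma, beta)]


def pre_norm_block(x, attn, mlp, ln1, ln2):
    """x is a list of per-position vectors.

    attn maps the whole normalized sequence to a sequence of updates;
    mlp maps a single normalized vector to a single update.
    """
    normed = [layer_norm(v, *ln1) for v in x]
    # Residual 1: x + Attention(LayerNorm(x))
    x = [[a + b for a, b in zip(v, d)] for v, d in zip(x, attn(normed))]
    # Residual 2: x + MLP(LayerNorm(x)), applied position by position
    x = [[a + b for a, b in zip(v, mlp(layer_norm(v, *ln2)))] for v in x]
    return x
```

With sublayers that return zeros the block is an identity function, which is the point of the residual connections: each layer starts out as a small correction to its input rather than a full replacement.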

Training config:

  • 20,000 steps
  • Batch size 32
  • AdamW optimizer with weight decay 0.01
  • Learning rate: warmup to 1e-3 over 200 steps, cosine decay to 1e-4
  • Gradient clipping: max_norm=1.0
  • Dropout: 0.2
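
The learning rate schedule in this config can be written as a small function. The sketch assumes a linear warmup and a half-cosine decay that ends exactly at the last step; the exact warmup shape is not specified above, so treat those details (and the name `lr_at`) as assumptions.

```python
import math


def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=200, total_steps=20_000):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine goes from 1 to -1, so the bracket goes from 2 to 0.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 100, 200, 10_100, 20_000):
    print(step, f"{lr_at(step):.6f}")
```

In a training loop the returned value would be written into the optimizer's learning rate before each update.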

What the Loss Means

[Figure: Cross Entropy Loss]

A loss of 1.86 means the model assigns about 15.6% probability to the correct next character, in the geometric-mean sense (e^(-1.86) ≈ 0.156). Random guessing over 27 characters would give ~3.7% (loss = ln 27 ≈ 3.30). Perfect prediction is impossible because many positions are genuinely ambiguous: after “ma”, the next character could be r, d, k, x, t, and many others.

Progress through this project:

  • Start: 2.35 test loss (~9.5% confidence)
  • Final: 1.86 test loss (~15.6% confidence)
  • Improvement: ~1.6x more confident on the correct character
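
These conversions are easy to reproduce. The helper names below are hypothetical; the only facts used are that cross-entropy is the average negative log-probability of the correct token, and that a uniform guess over n classes has loss ln(n).

```python
import math


def loss_to_prob(loss):
    """Geometric-mean probability assigned to the correct token."""
    return math.exp(-loss)


def uniform_loss(n_classes):
    """Cross-entropy of guessing uniformly over n_classes."""
    return math.log(n_classes)


print(f"{loss_to_prob(2.35):.1%}")                      # 9.5%
print(f"{loss_to_prob(1.86):.1%}")                      # 15.6%
print(f"{uniform_loss(27):.2f}")                        # 3.30
print(f"{loss_to_prob(1.86) / loss_to_prob(2.35):.1f}") # 1.6
```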

Conclusion

Building a transformer incrementally taught me that the magic isn’t in any single component — it’s in how they work together. Data preprocessing had the biggest impact. Depth mattered more than width. And the “modern” improvements (LayerNorm, RoPE, GELU) are less about dramatic gains and more about enabling scale.