When More Layers Meant Worse Models... The Birth of Residuals
Aviral Singh · 2026-05-28 · via DEV Community

import torch
import torch.nn as nn


class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers and constants needed throughout training.
        self.char_embedding = nn.Embedding(65, 64)  # 65 characters -> 64-dim vectors
        self.pos_embedding = nn.Embedding(64, 64)   # one vector per position (context length 64)
        self.query = nn.Linear(64, 64)
        self.key = nn.Linear(64, 64)
        self.value = nn.Linear(64, 64)
        # Causal mask: position i may only attend to positions <= i.
        # Registered as a buffer so it follows the model across devices.
        self.register_buffer("mask", torch.tril(torch.ones(64, 64)))
        # Feed-forward expansion: widen 64 -> 128 to give the data and weights
        # a higher-resolution intermediate representation.
        self.ff1 = nn.Linear(64, 128)
        # Project back down to 64 dims.
        self.ff2 = nn.Linear(128, 64)
        self.output_head = nn.Linear(64, 65)
        self.norm1 = nn.LayerNorm(64)
        self.norm2 = nn.LayerNorm(64)
        self.out_proj = nn.Linear(64, 64)

    def forward(self, x):
        # x: (batch, seq_len) tensor of character indices, seq_len <= 64.
        B, T = x.shape
        x = self.char_embedding(x) + self.pos_embedding(torch.arange(T, device=x.device))

        # ---- attention starts here: 2 heads of size 32 ----
        Q = self.query(x).view(B, T, 2, 32).transpose(1, 2)
        K = self.key(x).view(B, T, 2, 32).transpose(1, 2)
        V = self.value(x).view(B, T, 2, 32).transpose(1, 2)
        A = (Q @ K.transpose(-2, -1)) / 32**0.5

        A = A.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        # dim=-1 tells softmax to normalize over the last axis (the key positions).
        At = A.softmax(dim=-1)
        output = At @ V
        output = output.transpose(1, 2).contiguous().view(B, T, 64)
        output = self.out_proj(output)
        # ---- attention ends here; the feed-forward part that produces the predictions follows ----

        # Added another normalization step below to improve accuracy: the loss
        # previously bottomed out around 1.8, and after adding it reached about 1.5.
        x = x + output
        x = self.norm1(x)

        output = self.ff1(x)
        output = torch.relu(output)
        output = self.ff2(output)

        x = x + output  # ← merge back into main flow
        x = self.norm2(x)
        x = self.output_head(x)
        return x

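This code is by now pretty much the standard template for training a Transformer, familiar to anyone in the AI field.

I just want to understand one small line in it, one with a very interesting history behind it.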

 x = x + output

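Why do we do this: x = x + output?

Neural networks learn through back propagation, which basically means they look for the changes that bring us closer to the correct prediction and make them by adjusting the weights. But there is a specific problem: as the gradient travels back from one layer to the one before it, it gets smaller and smaller. That is serious, because the layers far from the output barely get updated, and training becomes slower and more expensive. It mostly comes down to the chain rule for partial derivatives, which multiplies in one factor per layer. So how does this line solve that?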
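
To make the chain-rule effect concrete, here is a toy calculation. It is only a sketch under loud assumptions: every layer is taken to contribute the same local derivative, the activation is a sigmoid evaluated at 0 (where its derivative peaks at 0.25), and the weights are ignored. The function names are made up for this illustration and are not part of the model above.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(num_layers, z=0.0):
    """Toy chain rule: multiply num_layers identical local derivatives."""
    return sigmoid_derivative(z) ** num_layers


# The factor reaching the first layer shrinks geometrically with depth.
for depth in (1, 5, 20, 50):
    print(f"{depth:>2} layers -> gradient scale {gradient_scale(depth):.3e}")
```

Even in the best case for a sigmoid, each extra layer multiplies the signal by at most a quarter, which is why depth alone can starve the early layers of updates.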

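*The history of this problem*

Read the paper here --

The paper comes from a Microsoft Research team and is mainly about how they solved the "more is not necessarily better" problem. Before this paper, when training deep learning models, adding depth (that is, more layers) past a certain point made the error go up rather than down. This was a major problem, and people did not know how to fix it. On the one hand, deeper models promise better understanding; on the other, they ran into rising error.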

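The solution

Now, we might think the solution is simply to add the original embedding vectors (in my case) straight onto the context matrix we get once all the computation is done, and you would be right. But the reason may not be the one you expect: the paper itself says that vanishing gradients are not the real cause of this problem.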
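
Written as a formula, the addition looks like this. The symbols follow the paper's notation rather than the code above: $x$ is the input to a block, $\mathcal{F}$ is whatever the block computes (here, the attention or the feed-forward layers), and $\mathcal{H}$ is the mapping we would like the block to represent.

```latex
y = \mathcal{F}(x) + x,
\qquad
\mathcal{F}(x) := \mathcal{H}(x) - x
```

The block only has to learn the residual $\mathcal{F}$, the difference between the desired output and its input. If the best thing a block can do is nothing, it just has to push $\mathcal{F}$ toward zero and $y$ stays equal to $x$. In the code above a LayerNorm is applied right after the sum.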

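In the paper's words: "We argue that this optimization difficulty is unlikely to be caused by vanishing gradients."

Why? The reason for ruling out vanishing gradients is that measures to minimize and stop them were already in place: batch normalization and, in this case, ReLU. Here is how the same thing looks in our code: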

    x = self.norm1(x)  # LayerNorm, the transformer counterpart of batch normalization

    output = self.ff1(x)
    output = torch.relu(output)  # ReLU, another guard against vanishing gradients
    output = self.ff2(output)

    x = x + output  # ← merge back into main flow
    x = self.norm2(x)
    x = self.output_head(x)

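As you can see, normalization and ReLU do take care of the vanishing gradient problem, yet if we remove the x = x + output line the results still get worse. You know what, let's just try it.

This is what we get when we run the model as is, with no changes. Now change one thing, remove the x = x + output line, and see what it does to the loss.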
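
One way to run this kind of comparison without editing the model by hand is to put the skip connection behind a flag. The block below is a minimal sketch of that idea for the feed-forward half only; `FeedForwardBlock` and `use_residual` are hypothetical names introduced here, not part of the original model.

```python
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Feed-forward sub-block with a switchable skip connection (illustrative)."""

    def __init__(self, dim=64, hidden=128, use_residual=True):
        super().__init__()
        self.ff1 = nn.Linear(dim, hidden)
        self.ff2 = nn.Linear(hidden, dim)
        self.norm = nn.LayerNorm(dim)
        self.use_residual = use_residual

    def forward(self, x):
        output = self.ff2(torch.relu(self.ff1(x)))
        if self.use_residual:
            x = x + output  # the line under test
        else:
            x = output      # ablation: keep the sub-layer, drop only the shortcut
        return self.norm(x)


torch.manual_seed(0)
x = torch.randn(32, 64, 64)
with_skip = FeedForwardBlock(use_residual=True)
without_skip = FeedForwardBlock(use_residual=False)
without_skip.load_state_dict(with_skip.state_dict())  # same weights in both blocks

y_skip = with_skip(x)
y_plain = without_skip(x)
print(y_skip.shape, y_plain.shape)
```

Note that deleting the line outright leaves `output` unused, so the whole sub-layer is bypassed rather than just its shortcut. Replacing it with `x = output`, as in the `else` branch, is the variant that removes only the skip connection.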

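So the loss jumped from 1.70 to 2.47 because of that one line alone. That may not sound like much, but remember that this is a single-layer model kept small for simplicity, and the more layers we add, the more the error rises. To back up my point, here are the live gradients, logged with a small tweak to the training loop: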

step 44500, loss: 2.5130
  char_embedding.weight          grad_norm: 0.006248
  pos_embedding.weight           grad_norm: 0.005838
  ff1.weight                     grad_norm: 0.024721
  ff2.weight                     grad_norm: 0.053932
  output_head.weight             grad_norm: 0.163109
  norm1.weight                   grad_norm: 0.007271
  norm2.weight                   grad_norm: 0.024594
step 45000, loss: 2.4751
  char_embedding.weight          grad_norm: 0.005574
  pos_embedding.weight           grad_norm: 0.005913
  ff1.weight                     grad_norm: 0.023506
  ff2.weight                     grad_norm: 0.056331
  output_head.weight             grad_norm: 0.161182
  norm1.weight                   grad_norm: 0.007898
  norm2.weight                   grad_norm: 0.020992
step 45500, loss: 2.4623
  char_embedding.weight          grad_norm: 0.006224
  pos_embedding.weight           grad_norm: 0.006075
  ff1.weight                     grad_norm: 0.025461
  ff2.weight                     grad_norm: 0.051210
  output_head.weight             grad_norm: 0.145062
  norm1.weight                   grad_norm: 0.008452
  norm2.weight                   grad_norm: 0.018521
step 46000, loss: 2.4764
  char_embedding.weight          grad_norm: 0.006709
  pos_embedding.weight           grad_norm: 0.006148
  ff1.weight                     grad_norm: 0.026940
  ff2.weight                     grad_norm: 0.057071
  output_head.weight             grad_norm: 0.163159
  norm1.weight                   grad_norm: 0.008988
  norm2.weight                   grad_norm: 0.025112
step 46500, loss: 2.4746
  char_embedding.weight          grad_norm: 0.006127
  pos_embedding.weight           grad_norm: 0.006181
  ff1.weight                     grad_norm: 0.025931
  ff2.weight                     grad_norm: 0.056799
  output_head.weight             grad_norm: 0.158272
  norm1.weight                   grad_norm: 0.008369
  norm2.weight                   grad_norm: 0.025981
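
A log in this format can be produced by reading `.grad` on every parameter right after `loss.backward()`. The helper below is a minimal sketch of that; `log_grad_norms` and the small stand-in model are hypothetical, not the author's actual training code.

```python
import torch
import torch.nn as nn


def log_grad_norms(model, step, loss):
    """Print the L2 norm of each weight gradient, one line per parameter."""
    lines = [f"step {step}, loss: {loss:.4f}"]
    for name, param in model.named_parameters():
        # Skip biases and any parameter that received no gradient this step.
        if param.grad is None or not name.endswith("weight"):
            continue
        lines.append(f"  {name:<30} grad_norm: {param.grad.norm().item():.6f}")
    print("\n".join(lines))
    return lines


# Usage on a stand-in model; any nn.Module works the same way.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
loss = model(torch.randn(5, 8)).pow(2).mean()
loss.backward()
lines = log_grad_norms(model, step=0, loss=loss.item())
```

Parameters that are not connected to the loss have `.grad` set to `None` and therefore do not show up in the output at all.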

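So what I am saying is that although the line looks like a fix for vanishing gradients, it really is not.
What it actually does is far more interesting.
Every layer and every nonlinearity changes the values a little, and those changes pile up fast, really fast. With 20 layers things may still work, but go from 20 layers to 50 and the complexity blows up further. Those "tiny" changes, however small each one is on its own, can shift the values a lot, and what comes out at the end can be completely different from what went in. For example:
[5, 3, 8] ---> [5, 3, 8] ---> [0.3, 0.01, 0.2]. Notice that none of this comes from anything like vanishing gradients; it comes from the small changes we make along the way. So what happens if we add the original values back in?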
[5, 3, 8] + [0.3, 0.01, 0.2] = [5.3, 3.01, 8.2]
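The resulting vector is very close to the original one, right? That is the main idea behind residuals. There are plenty of other residual schemes, but for simplicity we will stick with plain addition, which honestly works better anyway.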
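
The arithmetic above can be checked in a few lines. This is a small sketch using the numbers from the example; the `distance` helper is introduced here only to put a number on "close to the original".

```python
import math

original = [5.0, 3.0, 8.0]
layer_output = [0.3, 0.01, 0.2]  # what the layers produced, from the example above

# Residual connection: add the original values back onto the layer output.
with_skip = [x + f for x, f in zip(original, layer_output)]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


print("with skip:", with_skip)
print("distance from original, no skip:  ", distance(original, layer_output))
print("distance from original, with skip:", distance(original, with_skip))
```

Without the addition, the next layer sees a vector that has drifted far from the input. With it, the next layer sees the input plus a small correction.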