Demystifying Softmax Loss: A Step-by-Step Derivation for Linear Classifiers
2026-03-28 · via Louis C Deng's Blog

If you are diving into the mechanics of neural networks, you will inevitably encounter the backpropagation of the Softmax and Cross-Entropy loss. At first glance, the matrix calculus can feel a bit intimidating. However, once you break it down step-by-step using the chain rule, you will discover that the final gradients are incredibly elegant and intuitive.

In this post, we will walk through the complete mathematical derivation of the gradients for a linear classifier, moving from single variables to full matrix vectorization.

1. The Setup: Forward Propagation

Let’s define our variables for a single training sample. Assume our input features have dimension $D$, and we are classifying them into $C$ distinct classes.

  • Input $x$: A $D \times 1$ column vector.
  • Weights $W$: A $C \times D$ matrix.
  • Bias $b$: A $C \times 1$ column vector.
  • True Label $y$: The true class index (or a $C \times 1$ one-hot encoded vector where only the true class index is $1$).

The Linear Layer (Logits):
First, we compute the raw scores (logits) $z$ for each class:

$$z = Wx + b$$

The Softmax Layer:
We convert these raw scores into a valid probability distribution. The probability of the sample belonging to class $i$ is:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}}$$
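As a minimal sketch of the formula above, the following NumPy snippet computes the softmax of a logit vector. The function name `softmax`, the example logits, and the max-subtraction step are assumptions added for illustration; the shift is a common numerical-stability trick that does not change the result, because the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D logit vector z; returns probabilities that sum to 1."""
    # Subtracting max(z) leaves the ratio unchanged but keeps exp() from
    # overflowing when the logits are large.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

# Illustrative logits for C = 3 classes (values chosen arbitrarily).
z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())
```

The largest logit receives the largest probability, and adding the same constant to every logit leaves `p` unchanged.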

The Cross-Entropy Loss:
For a single sample where the true class is $y$, the loss only cares about the predicted probability assigned to that true class:

$$L = -\log(p_y)$$

By substituting the Softmax formula into the loss function, we get:

$$L = -z_y + \log\left(\sum_{j=1}^C e^{z_j}\right)$$
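To make the substitution concrete, here is a small sketch that evaluates the loss both ways, as $-\log(p_y)$ and as $-z_y + \log\sum_j e^{z_j}$, and confirms that they agree. The helper name `cross_entropy_from_logits` and the sample values are hypothetical; the log-sum-exp is computed with a max shift, which is an added stability measure rather than part of the derivation.

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    """L = -z_y + log(sum_j exp(z_j)) for logits z and true class index y."""
    m = np.max(z)
    # log(sum(exp(z))) = m + log(sum(exp(z - m))), which avoids overflow.
    log_sum_exp = m + np.log(np.sum(np.exp(z - m)))
    return -z[y] + log_sum_exp

z = np.array([2.0, 1.0, 0.1])   # illustrative logits
y = 0                           # illustrative true class index

# Direct form: softmax first, then -log of the true-class probability.
p = np.exp(z) / np.sum(np.exp(z))
loss_direct = -np.log(p[y])

# Substituted form: works on the logits without forming p.
loss_logits = cross_entropy_from_logits(z, y)

print(loss_direct, loss_logits)
```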

2. The Core Derivation: Gradient with respect to Logits ($z$)

To perform backpropagation, we first need to find how the loss changes with respect to each logit $z_i$. We denote this gradient as $\frac{\partial L}{\partial z_i}$.

We must split this into two scenarios: when $i$ is the true class, and when $i$ is any other class.

Case 1: Deriving for the true class ($i = y$)

$$\frac{\partial L}{\partial z_y} = \frac{\partial}{\partial z_y} \left( -z_y + \log\left(\sum_{j=1}^C e^{z_j}\right) \right)$$

Applying the chain rule to the logarithm term:

$$\frac{\partial L}{\partial z_y} = -1 + \frac{1}{\sum_{j=1}^C e^{z_j}} \cdot e^{z_y}$$

Notice that the fractional term is exactly our definition of $p_y$!

$$\frac{\partial L}{\partial z_y} = p_y - 1$$

Case 2: Deriving for an incorrect class ($i \neq y$)
Because $z_y$ does not depend on $z_i$ when $i \neq y$, the derivative of the first term is $0$.

$$\frac{\partial L}{\partial z_i} = 0 + \frac{1}{\sum_{j=1}^C e^{z_j}} \cdot e^{z_i}$$

Again, the fractional term is $p_i$:

$$\frac{\partial L}{\partial z_i} = p_i$$

The Vectorized Form:
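We can combine these two cases using an indicator function $\mathbb{I}(i=y)$ (which is $1$ if $i$ is the true class, and $0$ otherwise):

$$\frac{\partial L}{\partial z_i} = p_i - \mathbb{I}(i=y)$$

Let $dz$ be the gradient vector $\frac{\partial L}{\partial z}$. Stacking the $C$ components, the indicator values form exactly the one-hot label vector, so with $y$ read as that one-hot vector the vectorized form is simply:

$$dz = p - y$$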

Intuition Check: This result is very natural. The gradient is simply the Predicted Probability minus the True Probability. If the model is confident and correct ($p_y \approx 1$), the gradient is approximately $0$, and the weights barely change. The larger the error, the larger the gradient pushing the model to learn.
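One way to sanity-check the result $dz = p - y$ is to compare it against a numerical gradient. The sketch below does this with central finite differences; the function names (`grad_logits`, `numerical_grad`), the step size `eps`, and the sample logits are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # max shift for numerical stability
    return e / e.sum()

def loss(z, y):
    """Cross-entropy loss for logits z and true class index y."""
    return -np.log(softmax(z)[y])

def grad_logits(z, y):
    """Analytic gradient: p minus the one-hot encoding of y."""
    dz = softmax(z)             # fresh array, safe to modify in place
    dz[y] -= 1.0
    return dz

def numerical_grad(z, y, eps=1e-6):
    """Central finite-difference approximation of dL/dz."""
    g = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        g[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)
    return g

z = np.array([2.0, 1.0, 0.1])   # illustrative logits
y = 1                           # illustrative true class index
analytic = grad_logits(z, y)
numeric = numerical_grad(z, y)
print(analytic)
print(numeric)
```

The two vectors should agree to several decimal places. The components of `analytic` also sum to zero, since both $p$ and the one-hot vector sum to $1$.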

3. Gradients with respect to Weights ($W$) and Bias ($b$)

Now that we have the gradient of the loss with respect to the logits ($dz$), we use the chain rule to pass this error signal back to our parameters $W$ and $b$.

Deriving for $W$:
Let’s look at a single weight element $W_{ij}$. It only influences the loss through the specific logit $z_i$.

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}}$$

Since $z_i = \sum_{k=1}^D W_{ik} x_k + b_i$, the local gradient $\frac{\partial z_i}{\partial W_{ij}}$ is simply $x_j$. Therefore:

$$\frac{\partial L}{\partial W_{ij}} = dz_i \cdot x_j$$

To vectorize this back into a $C \times D$ matrix (the same shape as $W$), we take the outer product of the column vector $dz$ and the row vector $x^T$:

$$\frac{\partial L}{\partial W} = dz \cdot x^T$$

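Deriving for $b$:
Since $z = Wx + b$, each $b_i$ enters only $z_i$ with coefficient $1$, so the local derivative $\frac{\partial z_i}{\partial b_i}$ is $1$ (the Jacobian $\frac{\partial z}{\partial b}$ is the identity). Thus, the error signal passes directly through:

$$\frac{\partial L}{\partial b} = dz$$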
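The single-sample gradients for $W$ and $b$ can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the function names, the random test data, and the use of 1-D arrays in place of explicit column vectors are assumptions. One weight entry is checked against a finite-difference estimate.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # max shift for numerical stability
    return e / e.sum()

def loss(W, b, x, y):
    return -np.log(softmax(W @ x + b)[y])

def gradients(W, b, x, y):
    """Return (dW, db) for one sample with true class index y."""
    dz = softmax(W @ x + b)     # p
    dz[y] -= 1.0                # p - one_hot(y)
    dW = np.outer(dz, x)        # shape (C, D): entry (i, j) is dz_i * x_j
    db = dz.copy()              # shape (C,): the signal passes straight through
    return dW, db

rng = np.random.default_rng(0)  # arbitrary seed, illustrative data only
C, D = 3, 4
W = rng.normal(size=(C, D))
b = rng.normal(size=C)
x = rng.normal(size=D)
y = 2

dW, db = gradients(W, b, x, y)

# Finite-difference check of a single weight entry W[i, j].
eps = 1e-6
i, j = 1, 3
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
numeric_dW_ij = (loss(W_plus, b, x, y) - loss(W_minus, b, x, y)) / (2 * eps)
print(dW[i, j], numeric_dW_ij)
```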

4. Scaling up: The Mini-Batch Form

In real-world training, we process $N$ samples at a time to stabilize gradients and utilize parallel computing.

  • $X$ becomes a $D \times N$ matrix, with one sample per column.
  • $dZ = P - Y$ becomes a $C \times N$ matrix, where column $i$ holds $p - y$ for sample $i$.

To find the average gradient across the entire batch, we perform a matrix multiplication and divide by $N$:

Batch Weight Gradient:

$$\frac{\partial L}{\partial W} = \frac{1}{N} dZ \cdot X^T$$

Batch Bias Gradient:
Sum the errors across all $N$ samples for each class, then average (here $dZ^{(i)}$ is the $i$-th column of $dZ$, and $L$ is the mean loss over the batch):

$$\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^N dZ^{(i)}$$
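The batch formulas translate almost directly into matrix code. The sketch below assumes the column-per-sample layout used above ($X$ is $D \times N$) and integer class labels rather than a one-hot matrix $Y$. The function name `batch_gradients` and the random data are hypothetical.

```python
import numpy as np

def batch_gradients(W, b, X, labels):
    """X: (D, N) inputs, labels: (N,) integer classes.

    Returns the mean loss, dW with shape (C, D), and db with shape (C,).
    """
    N = X.shape[1]
    Z = W @ X + b[:, None]                     # logits, shape (C, N)
    Z = Z - Z.max(axis=0, keepdims=True)       # per-column shift for stability
    expZ = np.exp(Z)
    P = expZ / expZ.sum(axis=0, keepdims=True) # softmax over each column
    cols = np.arange(N)
    mean_loss = -np.mean(np.log(P[labels, cols]))

    dZ = P.copy()
    dZ[labels, cols] -= 1.0                    # P - Y without building Y
    dW = dZ @ X.T / N                          # (C, N) @ (N, D) -> (C, D)
    db = dZ.sum(axis=1) / N                    # average over the N columns
    return mean_loss, dW, db

rng = np.random.default_rng(0)                 # arbitrary seed, illustrative data
C, D, N = 3, 4, 5
W = rng.normal(size=(C, D))
b = rng.normal(size=C)
X = rng.normal(size=(D, N))
labels = rng.integers(0, C, size=N)

mean_loss, dW, db = batch_gradients(W, b, X, labels)
print(mean_loss, dW.shape, db.shape)
```

Subtracting $1$ at the positions `(labels, cols)` is equivalent to forming $P - Y$ with a one-hot $Y$, and avoids allocating that matrix.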

Summary

By systematically applying the chain rule, we transformed a seemingly complex matrix calculus problem into clean, highly efficient linear algebra operations. Understanding this $dz = p - y$ dynamic is the fundamental key to grasping how classification networks “learn” from their mistakes.