CS231n Lecture Note VII: Recurrent Neural Networks
2026-04-07 · via Louis C Deng's Blog

Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data. Unlike standard feedforward networks, RNNs maintain an internal state (memory) that is updated as each element of a sequence is processed, allowing information to persist across time steps.

Sequence Architectures

RNNs are flexible and can be adapted to various input-output mapping structures depending on the task:

  • One-to-Many: A single input produces a sequence of outputs.
    • Example: Image Captioning (Input: Image $\rightarrow$ Output: Sequence of words).
  • Many-to-One: A sequence of inputs produces a single output.
    • Example: Action Prediction or Sentiment Analysis (Input: Sequence of video frames/words $\rightarrow$ Output: Label).
  • Many-to-Many: A sequence of inputs produces a sequence of outputs.
    • Example: Video Captioning (Input: Sequence of frames $\rightarrow$ Output: Sequence of words) or Video Classification (Frame-by-frame labeling).

The Recurrence Mechanism

The “key idea” behind an RNN is the use of a recurrence formula applied at every time step $t$. This allows the network to process a sequence of vectors $x$ by maintaining a hidden state $h$.

A. Hidden State Update

The hidden state is updated by combining the previous state with the current input.

$$h_t = f_W(h_{t-1}, x_t)$$

Parameters:

  • $h_t$: New state (current hidden state).
  • $h_{t-1}$: Old state (hidden state from the previous time step).
  • $x_t$: Input vector at the current time step.
  • $f_W$: A function with trainable parameters $W$, typically an affine transformation of $h_{t-1}$ and $x_t$ followed by a non-linear activation such as $\tanh$ or ReLU.

Note: the same function and the same set of parameters are used at every time step.
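The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the common "vanilla" form $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$; the names `W_hh`, `W_xh`, `b_h` and all sizes are illustrative choices, not something fixed by the formula above.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 3, 5

# Hypothetical parameters; the SAME arrays are reused at every time step.
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))  # a toy input sequence
h = np.zeros(hidden_size)                    # initial hidden state h_0
for x_t in xs:
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)

print(h.shape)  # (4,)
```

The loop makes the weight sharing explicit: the sequence could be any length, and the parameter count would not change.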

B. Output Generation

After the hidden state is updated, the network can produce an output at that specific time step.

$$y_t = f_{W_{hy}}(h_t)$$

Parameters:

  • $y_t$: Output at time $t$.
  • $h_t$: New state (the updated hidden state).
  • $f_{W_{hy}}$: A separate function with its own trainable parameters $W_{hy}$, used to map the hidden state to the output space.
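Putting the state update and the output step together gives a full forward pass. The sketch below assumes a $\tanh$ recurrence and a plain linear readout $y_t = W_{hy} h_t + b_y$; the bias terms, the names `W_hh` and `W_xh`, and the helper `rnn_forward` are assumptions made for illustration.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, b_h, W_hy, b_y):
    """Run a vanilla RNN over a sequence and emit one output per time step."""
    h = h0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)  # hidden state update
        y_t = W_hy @ h + b_y                      # output read out from h_t
        hs.append(h)
        ys.append(y_t)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
H, D, O, T = 4, 3, 2, 6  # hidden, input, output sizes and sequence length
hs, ys = rnn_forward(
    xs=rng.normal(size=(T, D)),
    h0=np.zeros(H),
    W_hh=rng.normal(scale=0.1, size=(H, H)),
    W_xh=rng.normal(scale=0.1, size=(H, D)),
    b_h=np.zeros(H),
    W_hy=rng.normal(scale=0.1, size=(O, H)),
    b_y=np.zeros(O),
)
print(hs.shape, ys.shape)  # (6, 4) (6, 2)
```

A many-to-many task would use every row of `ys`, while a many-to-one task would typically keep only `ys[-1]`.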

Backpropagation Through Time

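Training an RNN relies on Backpropagation Through Time (BPTT): ordinary backpropagation applied to the network unrolled across time steps, with the resulting gradients used by a standard optimizer such as gradient descent.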

In its standard form, the model performs a complete forward pass through the entire sequence to calculate a global loss. During the subsequent backward pass, gradients are propagated from the final loss all the way back to the first time step.

While this yields the exact gradient, including contributions from long-range dependencies, it is often computationally prohibitive for long sequences: every intermediate hidden state must be stored for the backward pass, so memory grows linearly with sequence length.

To mitigate these resource constraints, researchers often employ Truncated BPTT. This technique involves partitioning the sequence into smaller, manageable chunks.

The model performs a forward and backward pass on a specific chunk to update its weights before moving to the next. Crucially, while the hidden state is carried forward into the next chunk to maintain continuity, the gradient flow is “truncated” at the chunk boundary.

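This approximation significantly reduces memory overhead and makes training on very long sequences feasible. The tradeoff is that gradients never cross a chunk boundary, so dependencies longer than one chunk are not learned through the gradient directly, although the carried-over hidden state still conveys earlier context in the forward pass.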
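The chunked procedure can be sketched with a small hand-written backward pass. This is an illustrative toy, assuming a $\tanh$ RNN without biases, a linear readout, a squared-error loss at every step, and plain SGD; the function name `chunk_forward_backward`, the sizes, and the learning rate are all hypothetical.

```python
import numpy as np

def chunk_forward_backward(xs, targets, h0, params):
    """Forward and backward pass over ONE chunk; gradients stop at h0."""
    W_hh, W_xh, W_hy = params
    hs, loss = [h0], 0.0
    for x_t, tgt in zip(xs, targets):
        h = np.tanh(W_hh @ hs[-1] + W_xh @ x_t)
        hs.append(h)
        loss += 0.5 * np.sum((W_hy @ h - tgt) ** 2)

    grads = [np.zeros_like(p) for p in params]
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        dy = W_hy @ h - targets[t]          # d loss / d y_t
        grads[2] += np.outer(dy, h)         # gradient for W_hy
        dh = W_hy.T @ dy + dh_next          # from the output and from step t+1
        dz = dh * (1.0 - h ** 2)            # backprop through tanh
        grads[0] += np.outer(dz, h_prev)    # gradient for W_hh
        grads[1] += np.outer(dz, xs[t])     # gradient for W_xh
        dh_next = W_hh.T @ dz
    # dh_next (the gradient w.r.t. h0) is discarded: this is the truncation.
    return loss, grads, hs[-1]

rng = np.random.default_rng(0)
H, D, O, T, chunk_len, lr = 8, 3, 2, 40, 10, 0.01
params = [rng.normal(scale=0.1, size=s) for s in [(H, H), (H, D), (O, H)]]
xs = rng.normal(size=(T, D))
targets = rng.normal(size=(T, O))

h = np.zeros(H)
for start in range(0, T, chunk_len):
    stop = start + chunk_len
    loss, grads, h = chunk_forward_backward(xs[start:stop], targets[start:stop], h, params)
    for p, g in zip(params, grads):
        p -= lr * g                         # in-place SGD step after each chunk
    print(f"chunk starting at t={start}: loss={loss:.3f}")
```

Only `chunk_len` hidden states are held in memory at any time, and the returned final state `h` is what carries context into the next chunk.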

RNN Tradeoffs

RNNs can process input sequences of arbitrary length, and their model size remains constant regardless of input length, since the same weights are shared across all time steps. This weight sharing also means every time step is processed in the same way. In principle, they can leverage information from arbitrarily distant time steps.

However, in practice, recurrent computation tends to be slow due to its sequential nature, and capturing long-range dependencies remains challenging because relevant information from many steps back often becomes inaccessible or diluted over time.
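One way to see why distant information fades is to track how strongly the current hidden state depends on the initial one. The sketch below assumes a $\tanh$ RNN and deliberately rescales a hypothetical recurrent matrix `W_hh` so that its largest singular value is 0.9; under that assumption the Jacobian $\partial h_t / \partial h_0$ shrinks at least geometrically with $t$.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 16, 3, 50

W_hh = rng.normal(size=(H, H))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)   # assumption: largest singular value is 0.9
W_xh = rng.normal(scale=0.1, size=(H, D))

h = np.zeros(H)
J = np.eye(H)                            # accumulates d h_t / d h_0
norms = []
for _ in range(T):
    h = np.tanh(W_hh @ h + W_xh @ rng.normal(size=D))
    J = np.diag(1.0 - h ** 2) @ W_hh @ J  # chain rule through one time step
    norms.append(np.linalg.norm(J, 2))

# Spectral norm of the Jacobian after 1, 10, and 50 steps.
print(norms[0], norms[9], norms[-1])
```

Each step multiplies by a $\tanh$ derivative (at most 1) and by `W_hh`, so the norm after $t$ steps is bounded by $0.9^t$. With a larger spectral norm the same product can grow instead, which is the exploding case.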

Long Short Term Memory

Long Short-Term Memory networks (LSTMs) use a gated architecture that regulates the flow of information, which alleviates the vanishing gradient problem when training on long sequences. (Exploding gradients are typically handled separately, for example with gradient clipping.)

LSTM Mathematics

The operations of an LSTM cell at time step $t$ are defined by three primary equations:

1. The Gate Vector
LSTMs compute four internal vectors (three gates and one candidate) simultaneously by applying a single weight matrix $W$ to the concatenation of the previous hidden state $h_{t-1}$ and the current input $x_t$:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}$$

  • $i$ (Input Gate): Decides which new information to store in the cell state.
  • $f$ (Forget Gate): Decides which information to discard from the previous state.
  • $o$ (Output Gate): Decides which part of the cell state to output.
  • $g$ (Cell Candidate): Creates a vector of new candidate values (using $\tanh$) to be added to the state.
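The stacked form of the gate equation can be checked numerically. This sketch assumes a hidden size `H`, an input size `D`, and a weight matrix `W` of shape `(4H, H + D)` whose row blocks are ordered $i, f, o, g$; the sizes and random values are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
H, D = 4, 3
h_prev, x_t = rng.normal(size=H), rng.normal(size=D)
hx = np.concatenate([h_prev, x_t])        # stacked [h_{t-1}; x_t], shape (H + D,)

# One stacked weight matrix: the first H rows belong to i, then f, o, g.
W = rng.normal(scale=0.1, size=(4 * H, H + D))
a = W @ hx                                 # all four pre-activations at once
a_i, a_f, a_o, a_g = np.split(a, 4)
i, f, o = sigmoid(a_i), sigmoid(a_f), sigmoid(a_o)
g = np.tanh(a_g)

# Same result as four separate matrix-vector products.
W_i, W_f, W_o, W_g = np.split(W, 4, axis=0)
assert np.allclose(f, sigmoid(W_f @ hx))
assert np.allclose(g, np.tanh(W_g @ hx))
```

Stacking is a convenience: one large matrix multiply replaces four smaller ones, and the sigmoid outputs land in $(0, 1)$ so they can act as soft switches.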

2. The Cell State ($c_t$)
This is the “long-term memory” of the network. It is updated via a linear combination of the old state and the new candidates:

$$c_t = f \odot c_{t-1} + i \odot g$$

The Hadamard product ($\odot$, element-wise multiplication) allows the forget gate to “zero out” specific memories while the input gate adds new ones. Because the update is additive, gradients can flow back through $c_t$ without being repeatedly multiplied by a weight matrix, which is the main reason LSTMs mitigate vanishing gradients.
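To see why the additive form helps, it can be useful to differentiate the cell update directly. The following is a simplified sketch rather than a full derivation: it follows only the direct path from $c_{t-1}$ to $c_t$ and treats the gates as constants (in reality they also depend on $h_{t-1}$, which adds indirect paths). The time index on $f_k$ is added here for bookkeeping and is not part of the notation above.

$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t), \qquad \frac{\partial c_T}{\partial c_t} = \prod_{k=t+1}^{T} \operatorname{diag}(f_k)$$

Along this path the gradient is rescaled element-wise by forget-gate values in $(0, 1)$, with no repeated multiplication by a weight matrix and no $\tanh$ derivative. When the forget gate stays close to 1 for some component, the gradient for that component passes back across many steps almost unchanged.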

3. The Hidden State ($h_t$)
The hidden state is the “working memory” passed to the next cell and the higher layers. It is a filtered version of the cell state:

$$h_t = o \odot \tanh(c_t)$$
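The three equations combine into a single cell update. The sketch below is a minimal NumPy version under a few assumptions: the bias `b` is an addition not present in the equations above, the rows of `W` are ordered $i, f, o, g$, and the helper name `lstm_step` and all sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM update; W has shape (4H, H + D), b has shape (4H,)."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x_t]) + b      # 1. gate pre-activations
    i = sigmoid(a[:H])
    f = sigmoid(a[H:2 * H])
    o = sigmoid(a[2 * H:3 * H])
    g = np.tanh(a[3 * H:])
    c_t = f * c_prev + i * g                       # 2. cell state (element-wise)
    h_t = o * np.tanh(c_t)                         # 3. hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
H, D, T = 4, 3, 6
W = rng.normal(scale=0.1, size=(4 * H, H + D))     # hypothetical parameters
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):
    h, c = lstm_step(h, c, x_t, W, b)

print(h.shape, c.shape)  # (4,) (4,)
```

Note that two states are carried between steps: `c` holds the long-term memory, while `h` is the filtered view passed to the next step and to any higher layers.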