惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

宝玉的分享
宝玉的分享
T
Tailwind CSS Blog
人人都是产品经理
人人都是产品经理
Last Week in AI
Last Week in AI
Hugging Face - Blog
Hugging Face - Blog
月光博客
月光博客
T
Troy Hunt's Blog
N
Netflix TechBlog - Medium
H
Heimdal Security Blog
Google DeepMind News
Google DeepMind News
N
News and Events Feed by Topic
SecWiki News
SecWiki News
Application and Cybersecurity Blog
Application and Cybersecurity Blog
B
Blog
Attack and Defense Labs
Attack and Defense Labs
O
OpenAI News
爱范儿
爱范儿
酷 壳 – CoolShell
酷 壳 – CoolShell
N
News | PayPal Newsroom
S
Secure Thoughts
Recent Announcements
Recent Announcements
aimingoo的专栏
aimingoo的专栏
Forbes - Security
Forbes - Security
博客园 - 聂微东
L
LINUX DO - 最新话题
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
F
Fortinet All Blogs
博客园 - 【当耐特】
WordPress大学
WordPress大学
TaoSecurity Blog
TaoSecurity Blog
Hacker News: Ask HN
Hacker News: Ask HN
H
Hacker News: Front Page
Recorded Future
Recorded Future
E
Exploit-DB.com RSS Feed
博客园 - 叶小钗
C
Comments on: Blog
Blog — PlanetScale
Blog — PlanetScale
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
D
Darknet – Hacking Tools, Hacker News & Cyber Security
T
Threatpost
The Last Watchdog
The Last Watchdog
GbyAI
GbyAI
博客园 - 三生石上(FineUI控件)
云风的 BLOG
云风的 BLOG
Know Your Adversary
Know Your Adversary
阮一峰的网络日志
阮一峰的网络日志
Hacker News - Newest:
Hacker News - Newest: "LLM"
T
Tenable Blog
Scott Helme
Scott Helme
The Cloudflare Blog

DEV Community

Authentication Security Deep Dive: From Brute Force to Salted Hashing (With Java Examples) Why AI Systems Don’t Fail — They Drift Spilling beans for how i learn for exam😁"Reinforcement Learning Cheat Sheet" I Replaced Chrome with Safari for AI Browser Automation. Here's What Broke (and What Finally Worked) How Python Borrows Other People's Work The $40 Architecture: Processing 1 Billion API Requests with 99.99% Uptime Vibe Coding: A Workflow Guide (From Zero to SaaS) Most webhook security guides protect the wrong side. The scary part is delivery. Headless CMS for TanStack Start: Build a Blog with Cosmic EU Age Verification App "Hacked in 2 Minutes" — What Actually Happened Comfy Cloud’s delete function does not actually remove files Running AI Models on GPU Cloud Servers: A Beginner Guide Event-driven media intelligence with AWS Step Functions and Bedrock I scored 500 AI prompts across 8 quality dimensions — here's what broke How to Call Google Gemini API from Next.js (Free Tier, No Backend Needed) The Portal Protocol: Reclaiming Human Connection in the Age of AI How to Fix Your Team's Scattered Knowledge Problem With a Self-Hosted Forum Intro to tc Cloud Functors: A Graph-First Mental Model for the Modern Cloud Designing Multi-Tenant Backends With Both Ownership and Team Access I Built a Neumorphic CSS Library with 77+ Components — Here's What I Learned PostgreSQL Performance Optimization: Why Connection Pooling Is Critical at Scale Cómo construí un SaaS multi-rubro para gestionar expensas en Argentina con FastAPI + Vue 3 🚀 I Built an Ethical Hacking Scanner Tool – Open Source Project I Replaced /usage and /context in Claude Code With a Single Statusline A Pythonic Way to Handle Emails (IMAP/SMTP) with Auto-Discovery and AI-Ready Design I Collected 8.9 Million Polymarket Price Points — Here's What I Found About How Markets Really Move EcoTrack AI — Carbon Footprint Tracker & Dashboard Everyone's Using AI. No One Agrees How. 5 self-hosted ebook managers worth trying in 2026 Building Your First AI Agent with LangChain: From Chatbot to Autonomous Assistant Common SOC 2 Failures (Real World) Stop Vibe-Checking Your AI App: A Practical Guide to Evals How to Use SonarQube and SonarScanner Locally to Level Up Your Code Quality Your Next To-Do App Is Dead — I Replaced Mine with an OpenClaw AI Sign a Nostr event in 60 lines of Python using coincurve — no nostr-sdk, no nbxplorer, no rust toolchain ITGC Audit Explained Like You’re in Big 4 Patch Tuesday abril 2026: Microsoft parcha 163 vulnerabilidades y un zero-day en SharePoint Stop scraping everything: a better way to track competitor price changes Listing on MCPize + the Official MCP Registry while routing payments OUTSIDE the marketplace — how I kept 100% of my x402 revenue Building an AI-Powered Risk Intelligence System Using Serverless Architecture Why We Ripped Function Overloading Out of Our AI Toolchain Testing AI-Generated Code: How to Actually Know If It Works SaaS Churn Is Killing Your Business. Here Is What to Do About It (Without a Support Team) The Speed of AI Is No Longer Linear - And Self-Improving Models Are Why How to Implement RBAC for MCP Tools: A Practical Guide for Engineering Teams From Standard Quote to Persuasive Proposal: AI Automation for Arborists I built a CLI that scaffolds complete multi-tenant SaaS apps Axios CVE-2025–62718: The Silent SSRF Bug That Could Be Hiding in Your Node.js App Right Now The dashboard that ended our friendship Data Pipelines Explained Simply (and How to Build Them with Python) The Hidden Cost of AI Systems Nobody Talks About. undefined vs undeclared, and how typeof behaves Switching from file-based jobs to NATS/Kafka in Rust without changing code io_uring Adventures: Rust Servers That Love Syscalls Why Agentic AI is Killing the Traditional Database The POUR principles of web accessibility for developers and designers Quantum Neural Network 3D — A Deep Dive into Interactive WebGL Visualization How To Install Caveman In Codex On macOS And Windows Automation Pipeline Reliability: Why Your Workflow Breaks When Nobody Is Watching I Built an 'Open World' AI Coding Agent — It Works From ANY Folder From Freelancing to Product: A Tech Service Company's SaaS Transformation China's AI Giants: Adding Tencent Hunyuan & ByteDance Doubao to AI University (74 Providers) On the Vibe Coders and Their Lies clerk: Auto-Summarize Your Claude Code Sessions AI Weekly — 2026/04/10–04/17 | The Model Lockdown Is Here, but the Toolchain Is the Real Battleground AI 週報 — 2026/04/10–2026/04/17 模型封鎖潮來了,但工具鏈才是真戰場 Maybe this is how Open-Source apps are born... 🚀 Fine-Tune LLMs with LoRA and QLoRA: 2026 Guide tRPC v11 + Next.js App Router: End-to-End Type Safety Without the Boilerplate ShadCN UI in 2026: Why I Stopped Installing Component Libraries and Started Owning My Components SaaS Billing in React Server Components: Stripe + Supabase Without a Single `useEffect` Join our DEV Weekend Challenge — $1,000 in Prizes Across TEN winners! Submissions Due April 20 at 6:59 AM UTC. Implementing FSRS Spaced Repetition in Flutter + Supabase — Adding Memory Science to an AI Learning App "I Texted My Localhost From the Train — Claude Code Fixed the Bug Before I Got Home" I Built a Sales Prep AI and It Went Deeper Than Expected Design to Code #2: One JSON, Eleven Outputs Solving the 100M-Row Problem: A Summary Table Pattern for High-Volume Push Notification Logs Flutter Web With Wasm: What Actually Changes For Developers I Built 50 Royalty-Free Soundtracks for My Side Project in a Weekend Using AI Music Generation The Vibe Coding Security Checklist: 7 Things to Check Before You Ship Stop Letting Googlebot Guess Fix Your React App's SEO Right Desconstruindo o Streaming do LinkedIn: Como Criar um Engine de Extração de Vídeo de Alta Performance com HLS e FFmpeg (EDA Part-1) EDA (Exploratory Data Analysis) Explained With Real Life — Why Looking at Your Data Is the Most Important Step in Machine Learning Brand Relationship Management at Scale: Our 4-Touch Outreach System for 200+ Brands Why String.fromEnvironment() Might Return an Empty String in Dart JGuardrails 1.0.0 — Hardening Java LLM Apps Against Jailbreaks, Toxicity, and Prompt Injection Plan and Schedule a Full Week of Threads Content From One Claude Conversation Coding Cat Oran Ep3, Five Tables Changed Everything BFF模式详解:构建前后端协同的中间层 I'm done watching freelancers get buried by 200 proposals. So I'm building the alternative. This is my first post BFS Algorithm in Java Step by Step Tutorial with Examples Tracking LLM Pricing Monthly: An Open Dataset for 22 AI Models How We Measure Content ROI on a Comparison Site: Revenue Attribution Without Perfect Data Introducing Nova AI Ops: The AI-Native Operating System for SRE Teams I built a free desktop video downloader for Windows — Grabbit How Talkie OCR Helps Vision-Impaired & Dyslexic Users Read the World Around Them VRCFaceTracking安装和iPhone面捕配置教程,有bug Even CrowdStrike Can't See Your Agents The Automation Gold Rush: What n8n Workflows and Claude Are Opening Up for Developers Right Now
Time When More Layers Meant Worse Model ... Birth Of Residual
Aviral Singh · 2026-05-28 · via DEV Community

Aviral Singh

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # setting the constructor for the initial values that we are every gonna need for the training of the data
        self.char_embedding = nn.Embedding(65, 64)
        self.pos_embedding = nn.Embedding(64, 64)
        self.query = nn.Linear(64, 64)
        self.key = nn.Linear(64, 64)
        self.value = nn.Linear(64, 64)
        self.mask = torch.tril(torch.ones(64, 64))
        # these are for changing the dimensions we are doing this to enlarge the matrix as to make it of higher resolution so as to make the 
        # data and weights more refined 
        self.ff1 = nn.Linear(64, 128)
        # this is to join them back again 
        self.ff2 = nn.Linear(128, 64)
        self.output_head = nn.Linear(64, 65)
        self.norm1 = nn.LayerNorm(64)
        self.norm2 = nn.LayerNorm(64)
        self.out_proj = nn.Linear(64, 64)

    def forward(self, x):
        # feed forward function
        x = self.char_embedding(x) + self.pos_embedding(torch.arange(64))
        # this is the start of the attention stuff i am writing this as a way to separate the code in section inside a functions
        #
        Q = self.query(x)
        Q = Q.view(32, 64, 2, 32)
        Q = Q.transpose(1, 2)
        K = self.key(x)
        K = K.view(32, 64, 2, 32)
        K = K.transpose(1, 2)
        V = self.value(x)
        V = V.view(32, 64, 2, 32)
        V = V.transpose(1, 2)
        A = (Q @ K.transpose(-2, -1)) / 32**0.5

        A = A.masked_fill(self.mask == 0, float("-inf"))
        At = A.softmax(dim=-1)
        # the -1 this is just to tell the
        output = At @ V
        output = output.transpose(1, 2).contiguous().view(32, 64, 64)
        output = self.out_proj(output)
        # this is where the attention ends and we start with the feed forward thing that will give us the predictions
        # added another form of normalization below to improve accuracy the first time the loss function reached 1.8 max now after adding the
        # below line it reached to like 1.5 something
        x = x + output
        x = self.norm1(x)

        output = self.ff1(x)
        output = torch.relu(output)
        output = self.ff2(output)

        x = x + output  # ← merge back into main flow
        x = self.norm2(x)
        x = self.output_head(x)
        return x

Enter fullscreen mode Exit fullscreen mode

this code is basically boilerplate at this point for training a transformer to anyone in the ai space .

i just want to understand one little line that is here and that has a history behind it that is really interesting .

 x = x + output

Enter fullscreen mode Exit fullscreen mode

why are we doing this - X=X+output ?

the neural networks learn through the process of back propagation which basically means that they are essentially looking for the change that moves us closer to the correct predictions by changing the filters that is it now there is a particular problem with this and that is that as we move from one layer to the other the gradient becomes smaller and smaller and this is huge cause the computation would also become harder and harder and more computationally expensive . this happens basically due to the chain rule of the the partial derivative .. but how does this thing solve that ?

*History of this problem *

read this paper here --

this paper is done by the microsoft research team and this is basically about how they solved the problem that more is not always better . in the case of training a deep learning models before this paper the more depth the model had i.e the layers the more error it produced too and that is a huge problem and people didnt know how to solve it cause on one hand you had the depth of better understanding and on the other hand you were having this problem of getting more errors too .

solution

now we might think that the solution is to just add the original embeddings vector ( for my case ) to the context matrix we got after all the computations and you would be right to think that but not for the reason that you might think here in the paper itself it says that its not the reason for this problem .

We argue that this optimization difficulty is unlikely to
be caused by vanishing gradients.

why ? - the reason for removing our suspicion from the diminishing gradient is because there are stuff done to minimize and stop the diminishing gradient problem these are done with the help of stuff like batch normalization and in this case ReLu here are the ways we do it in out code -

    x = self.norm1(x) # the batch normalization equivalent in transformers 

    output = self.ff1(x)
    output = torch.relu(output) # another way to solve the vanishing gradient problem 
    output = self.ff2(output)

    x = x + output  # ← merge back into main flow
    x = self.norm2(x)
    x = self.output_head(x)

Enter fullscreen mode Exit fullscreen mode

as you can see this that these does solve the problem of vanishing gradient and yet if we remove the x=x+output the result would be worse you know what lets try it alright --

this is when we do this normally and dont change anything now lets change one thing and that is we remove the line x=x+output that is it and see how it affects the loss function .

so the loss function jumped from 1.70 to 2.47 by just this one line and it might not seem a lot but , remember that this is just a 1 layer model for simplicity and more layers we add the more we move up in the errors too . to solidify my point i want to show the gradient that live by making some of the small adjustments here -

step 44500, loss: 2.5130
  char_embedding.weight          grad_norm: 0.006248
  pos_embedding.weight           grad_norm: 0.005838
  ff1.weight                     grad_norm: 0.024721
  ff2.weight                     grad_norm: 0.053932
  output_head.weight             grad_norm: 0.163109
  norm1.weight                   grad_norm: 0.007271
  norm2.weight                   grad_norm: 0.024594
step 45000, loss: 2.4751
  char_embedding.weight          grad_norm: 0.005574
  pos_embedding.weight           grad_norm: 0.005913
  ff1.weight                     grad_norm: 0.023506
  ff2.weight                     grad_norm: 0.056331
  output_head.weight             grad_norm: 0.161182
  norm1.weight                   grad_norm: 0.007898
  norm2.weight                   grad_norm: 0.020992
step 45500, loss: 2.4623
  char_embedding.weight          grad_norm: 0.006224
  pos_embedding.weight           grad_norm: 0.006075
  ff1.weight                     grad_norm: 0.025461
  ff2.weight                     grad_norm: 0.051210
  output_head.weight             grad_norm: 0.145062
  norm1.weight                   grad_norm: 0.008452
  norm2.weight                   grad_norm: 0.018521
step 46000, loss: 2.4764
  char_embedding.weight          grad_norm: 0.006709
  pos_embedding.weight           grad_norm: 0.006148
  ff1.weight                     grad_norm: 0.026940
  ff2.weight                     grad_norm: 0.057071
  output_head.weight             grad_norm: 0.163159
  norm1.weight                   grad_norm: 0.008988
  norm2.weight                   grad_norm: 0.025112
step 46500, loss: 2.4746
  char_embedding.weight          grad_norm: 0.006127
  pos_embedding.weight           grad_norm: 0.006181
  ff1.weight                     grad_norm: 0.025931
  ff2.weight                     grad_norm: 0.056799
  output_head.weight             grad_norm: 0.158272
  norm1.weight                   grad_norm: 0.008369
  norm2.weight                   grad_norm: 0.025981

Enter fullscreen mode Exit fullscreen mode

so here is the thing that i was saying even though it looks like it solves the diminishing gradient but in fact it doesnt at all .
the true thing that it does is something way more interesting .-
every layer and non linear does some changes and these change compound fast like really fast and for like 20 layers it might work cause even though its a large number of layers the complexity further shoots when we go from this to something like 50 layers and these "small" changes may change the values a lot even though these changes themselves are very very small and the values that is creates after maybe completely different from the original like way to different . here is an example -
5, 3, 8 --->5, 3, 8--->0.3, 0.01, 0.2 and notice something that these are not due to something like diminishing gradient at all these are due to the small changes that we do in between and so what would happen if we add the original in this ? -
[5, 3, 8] + [0.3, 0.01, 0.2] = [5.3, 3.01, 8.2]
so this resultant one is very close to the original right ? that is the main idea of the residual and there are many residual algorithms too but for simplicity we are gonna just stick with the good old addition and frankly its better this way .