750,000 Chips, 140 Trillion Tokens: The Math Behind DeepSeek's Permanent Price Cut
keeper · 2026-05-23 · via DEV Community

DeepSeek made its V4-Pro 75% price cut permanent on May 22. The conventional read: "they got cheaper hardware." The real story is more interesting — and it's about a gap that's not closing fast enough.


What Happened

On May 22, 2026, DeepSeek announced that the 75% discount on its V4-Pro API would become permanent. The new pricing:

| Metric | Before | After | Cut |
| --- | --- | --- | --- |
| Input (cache miss) | ¥12 / 1M tokens | ¥3 / 1M tokens | 75% |
| Output | ¥24 / 1M tokens | ¥6 / 1M tokens | 75% |
| Input (cache hit) | ¥0.1 / 1M tokens | ¥0.025 / 1M tokens | 75% |

At current exchange rates, that's roughly $0.44 per 1M input tokens (cache miss) and $0.87 per 1M output tokens. That makes V4-Pro one of the cheapest frontier-class models on the market: priced on par with DeepSeek's own V4-Flash, but significantly more capable.
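To see what the new price sheet means for a real bill, here is a minimal sketch that prices one workload under the old and new rates. The prices come from the table above; the exchange rate and the token counts are illustrative assumptions, not figures from DeepSeek.

```python
# Prices in CNY per 1M tokens, taken from the pricing table.
OLD_PRICES = {"input_miss": 12.0, "input_hit": 0.1, "output": 24.0}
NEW_PRICES = {"input_miss": 3.0, "input_hit": 0.025, "output": 6.0}

# Assumption: an illustrative exchange rate, not an official figure.
CNY_PER_USD = 6.9


def workload_cost_cny(prices, miss_tokens, hit_tokens, output_tokens):
    """Cost in CNY of a workload, given token counts per billing category."""
    return (
        miss_tokens * prices["input_miss"]
        + hit_tokens * prices["input_hit"]
        + output_tokens * prices["output"]
    ) / 1_000_000


def to_usd(cny):
    """Convert CNY to USD at the assumed exchange rate."""
    return cny / CNY_PER_USD


# Hypothetical workload: 2M uncached input, 8M cached input, 1M output tokens.
old = workload_cost_cny(OLD_PRICES, 2_000_000, 8_000_000, 1_000_000)
new = workload_cost_cny(NEW_PRICES, 2_000_000, 8_000_000, 1_000_000)

print(f"old: CNY {old:.2f}, new: CNY {new:.2f}, cut: {1 - new / old:.0%}")
print(f"new bill in USD: {to_usd(new):.2f}")
```

Because every line item drops by the same 75%, the total bill drops by 75% regardless of the mix of cached, uncached, and output tokens.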

The move came exactly four weeks after V4's launch on April 24, and coincided with growing user frustration over rate limits at Google Gemini and Anthropic Claude.


The Standard Narrative

The surface-level story has three parts:

1. Architectural efficiency. V4 uses a Mixture-of-Experts architecture with 1.6 trillion parameters, but only activates a fraction per token. This gives it a structural cost advantage over dense models of comparable capability — roughly 30% of the gap.

2. Supply chain scaling. Huawei's Ascend 950PR entered mass production in April 2026. Huawei plans to ship ~750,000 units through the year — a 2.5x increase over 2025's 910C output. DeepSeek specifically optimized V4 for the Ascend architecture. More chips → lower unit cost → lower API pricing.

3. Competitive positioning. Western AI providers (Google, Anthropic) have been quietly tightening rate limits as demand overwhelms their GPU supply. DeepSeek is exploiting the backlash, offering unlimited usage at a fraction of the cost to capture disgruntled developers.

All three are true. But none of them fully explains the magnitude of the cut — or why it's permanent rather than promotional.


The Math That Changes Everything

Let's check the numbers.

Demand Side

China's daily token consumption hit 140 trillion in March 2026, according to the National Data Administration. The growth trajectory:

  • Early 2024: 0.1 trillion/day
  • End of 2025: 100 trillion/day
  • March 2026: 140 trillion/day

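That's a 1,000x increase in two years, and a 40% jump in just the last quarter. Compounded, 40% per quarter is roughly 12% month-over-month; dividing it evenly gives about 13%, which is the monthly rate the projections below are consistent with.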
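The growth arithmetic is easy to check. The sketch below converts the quarterly jump into a monthly rate two ways and projects demand six months out; the function names are illustrative, and the only inputs are the 140 trillion and 40% figures quoted above.

```python
def monthly_rate_from_quarterly(quarterly_growth):
    """Compounded monthly growth rate equivalent to a quarterly growth rate."""
    return (1 + quarterly_growth) ** (1 / 3) - 1


def project(start, monthly_rate, months):
    """Value after `months` of compounding at `monthly_rate`."""
    return start * (1 + monthly_rate) ** months


compounded = monthly_rate_from_quarterly(0.40)  # compounding-consistent rate
simple = 0.40 / 3                               # quarterly gain split evenly

print(f"compounded monthly rate: {compounded:.1%}")
print(f"simple monthly rate:     {simple:.1%}")

# Six-month projections from 140 trillion tokens/day.
print(f"at 40% per quarter: {project(140, compounded, 6):.0f}T tokens/day")
print(f"at 13% per month:   {project(140, 0.13, 6):.0f}T tokens/day")
```

The two readings differ by about a point per month, which compounds to roughly 274T versus 291T tokens/day after six months. Either way, demand about doubles.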

Supply Side

Huawei's mass-produced chip for 2026 is the Ascend 950PR (Prefill-optimized, 1 PFLOPS FP8), with the higher-end 950DT (2 PFLOPS FP8) coming in Q4. The numbers:

| Chip | FP8 | Memory | Bandwidth | Inference Throughput (est.) |
| --- | --- | --- | --- | --- |
| 950PR | 1 PFLOPS | 128GB HBM | 1.6 TB/s | ~1,200 tokens/sec |
| 950DT | 2 PFLOPS | 144GB HBM | 4 TB/s | ~2,400 tokens/sec |

(Throughput derived from Huawei's published Atlas 950 SuperNode benchmark: 19.6M tokens/sec across 8,192 cards, or about 2,400 tokens/sec per card.)

Now the arithmetic:

| Item | Value |
| --- | --- |
| Total chips (2026 target) | 750,000 (70% PR + 30% DT) |
| Raw daily throughput | 85.7 trillion tokens/day |
| Inference-allocated (60%) | 51.4 trillion tokens/day |
| vs. current demand (140T) | 37% coverage |
| vs. demand in 6 months (~291T) | 18% coverage |

Even in the most optimistic scenario — every single chip dedicated to inference at 100% utilization:

| Scenario | vs. current | vs. +6 months |
| --- | --- | --- |
| 100% inference, 100% utilization | 61% coverage | 29% coverage |

The conclusion is stark: 750,000 Ascend 950 chips can't cover today's demand — let alone the demand in six months.
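For readers who want to rerun the supply arithmetic, here is a rough sketch. The chip count, mix, per-chip rates, and demand figures come from the tables above; the variable and function names are illustrative. One caveat: multiplying the table's per-chip rates through the 70/30 mix gives about 101 trillion tokens/day at full utilization, which is higher than the 85.7 trillion quoted. The quoted figure presumably bakes in a derating or a different mix that isn't stated, so the sketch reports both.

```python
SECONDS_PER_DAY = 86_400

# Per-card rate implied by the quoted benchmark: 19.6M tokens/sec over 8,192 cards.
per_card_benchmark = 19.6e6 / 8_192

# Figures from the tables above (per-chip rates are the article's estimates).
TOTAL_CHIPS = 750_000
MIX = {"950PR": 0.70, "950DT": 0.30}
TOKENS_PER_SEC = {"950PR": 1_200, "950DT": 2_400}


def fleet_tokens_per_day(total_chips, mix, tokens_per_sec):
    """Daily throughput of the whole fleet at 100% utilization."""
    per_second = sum(
        total_chips * share * tokens_per_sec[chip] for chip, share in mix.items()
    )
    return per_second * SECONDS_PER_DAY


def coverage(supply_per_day, demand_per_day):
    """Fraction of daily demand that the given supply can serve."""
    return supply_per_day / demand_per_day


computed_raw = fleet_tokens_per_day(TOTAL_CHIPS, MIX, TOKENS_PER_SEC)
article_raw = 85.7e12                   # raw throughput quoted in the table
article_inference = 0.60 * article_raw  # 60% allocated to inference

print(f"per-card benchmark rate: {per_card_benchmark:,.0f} tokens/sec")
print(f"computed raw throughput: {computed_raw / 1e12:.1f}T tokens/day")
for label, supply in [
    ("quoted figure, 60% inference", article_inference),
    ("quoted figure, 100% inference", article_raw),
    ("computed from mix, 100% inference", computed_raw),
]:
    print(f"{label}: {coverage(supply, 140e12):.0%} of current demand, "
          f"{coverage(supply, 291e12):.0%} of demand in six months")
```

Even the more generous computed figure covers well under 100% of today's 140 trillion tokens/day, so the conclusion does not depend on which throughput number you prefer.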


So Why Cut Prices?

If supply is still a fraction of demand, permanent price cuts don't make sense in a normal market. But this is not a normal market.

The Real Logic: Pre-Commitment, Not Surplus

DeepSeek is not cutting prices because it has spare compute. It's cutting prices to lock in routing commitments before the hardware arrives.

Here's the timeline:

April 24:  V4 launched, optimized for Ascend
April 24+: ByteDance orders 350,000 Ascend 950 chips (~¥40B)
May 4:     Ascend 950PR mass production confirmed
May 22:    DeepSeek makes V4-Pro 75% cut permanent


The critical insight: DeepSeek's price cut is not a cost pass-through. It's a market share pre-commitment — using the promise of future Ascend supply to grab developer mindshare now, before Western competitors can resolve their own capacity issues.

The Numbers Behind the Strategy

Western providers are capacity-constrained:

| Provider | Constraint | Signal |
| --- | --- | --- |
| Google Gemini | TSMC CoWoS capacity | Rate limits tightened, user backlash |
| Anthropic Claude | H100/B200 availability | API throttling, compute-use monitoring |
| OpenAI | Inference cluster rollout | Delayed GPT-5 token limits |

DeepSeek's bet: "Spend the next 6 months building developer dependency on V4-Pro's API — by the time Ascend supply catches up in H2 2026, those developers won't switch back."

This is AWS in 2006. AWS wasn't cheaper than running your own servers in 2006. But it would be once scale kicked in. AWS priced for the scale it planned to have, not the scale it had. DeepSeek is doing the same.


What 750,000 Chips Actually Buys

The popular framing in Chinese media is "a capacity explosion of 750,000 Ascend 950 chips" (75万颗昇腾950产能大爆发). But as the math shows, 750,000 chips isn't abundance. It's barely adequacy.

Think of it this way: China's daily token demand rose by 40 trillion over the last quarter, or roughly 13 trillion tokens per day added every single month (the monthly increment itself is larger than the entire market 18 months ago). By year-end, demand will be 300-400+ trillion tokens per day. Against that, 750K chips at the 950PR/DT mix buy roughly 50-85T/day of inference capacity.

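| Timeframe | Demand (est., tokens/day) | Inference Supply (tokens/day) | Gap (tokens/day) |
| --- | --- | --- | --- |
| March 2026 | 140T | ~50T | 90T |
| June 2026 | ~200T | ~50T | 150T |
| September 2026 | ~290T | ~55T (DT ramp) | 235T |
| December 2026 | ~420T | ~65T | 355T |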

The gap is growing, not shrinking. Even with 750,000 chips fully deployed, the supply-demand deficit more than triples over nine months.
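The gap table can be reproduced with a few lines. This sketch assumes demand compounds at 13% per month from the March figure, which is the rate that matches the demand column, and takes the supply estimates directly from the table. The names are illustrative.

```python
# All quantities are in trillions of tokens per day.
BASE_DEMAND = 140        # March 2026 demand, as quoted
MONTHLY_GROWTH = 0.13    # assumption: the rate that matches the demand column
SUPPLY = {0: 50, 3: 50, 6: 55, 9: 65}  # months after March 2026 -> supply estimate


def gap_rows(base, growth, supply):
    """Return (months, demand, supply, gap) tuples for each checkpoint."""
    rows = []
    for months, supplied in sorted(supply.items()):
        demand = base * (1 + growth) ** months
        rows.append((months, demand, supplied, demand - supplied))
    return rows


rows = gap_rows(BASE_DEMAND, MONTHLY_GROWTH, SUPPLY)
for months, demand, supplied, gap in rows:
    print(f"+{months} months: demand {demand:6.1f}T, "
          f"supply {supplied}T, gap {gap:6.1f}T")

growth_of_gap = rows[-1][3] / rows[0][3]
print(f"gap multiple over nine months: {growth_of_gap:.1f}x")
```

The supply column would have to grow several times faster than the table's estimates for the gap to stop widening at this demand growth rate.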

This means DeepSeek's price cut isn't a sign of market saturation. It's a sign of exactly the opposite: a market so unsaturated that the winner gets to define the default API for an entire generation of developers, if they can lock them in before the hardware arrives.


Three Counter-Arguments (And Why They're Weak)

"But cache hits reduce the effective compute needed"

True: at list prices, cache-hit input tokens cost roughly 1/120th of cache-miss tokens (¥0.025 vs ¥3 per 1M). And DeepSeek's cache hit rates can be high for workloads with stable system prompts. But cache hits apply only to input tokens. Output tokens, the expensive ones, still need full compute. And as agentic workloads grow (multi-turn, chain-of-thought), output-to-input ratios increase, making the cache less effective.
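A quick way to see why the cache helps less as output grows: the sketch below uses the new list prices as a stand-in for compute cost (an assumption, since prices and compute need not track each other) and measures how much of the total bill a 90% cache hit rate saves at different output volumes. The workloads are hypothetical.

```python
# New V4-Pro prices in CNY per 1M tokens, from the pricing table.
PRICE_MISS, PRICE_HIT, PRICE_OUTPUT = 3.0, 0.025, 6.0


def bill(input_tokens, output_tokens, hit_rate):
    """Total cost in CNY for a workload with the given cache hit rate."""
    hit = input_tokens * hit_rate
    miss = input_tokens - hit
    input_cost = (miss * PRICE_MISS + hit * PRICE_HIT) / 1e6
    output_cost = output_tokens * PRICE_OUTPUT / 1e6
    return input_cost + output_cost


def cache_savings(input_tokens, output_tokens, hit_rate):
    """Fraction of the total bill saved relative to a 0% hit rate."""
    cached = bill(input_tokens, output_tokens, hit_rate)
    uncached = bill(input_tokens, output_tokens, 0.0)
    return 1 - cached / uncached


# Hypothetical workloads: 10M input tokens, 90% hit rate, growing output volume.
for output_tokens in (1_000_000, 5_000_000, 20_000_000):
    saved = cache_savings(10_000_000, output_tokens, 0.9)
    print(f"{output_tokens / 1e6:.0f}M output tokens: {saved:.0%} of the bill saved")
```

With these assumed workloads, the savings fall from roughly three quarters of the bill to under a fifth as output climbs from 1M to 20M tokens, which is the shape of the argument above.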

"But not all 140T tokens need 950-class inference"

Also true. Many tokens are generated by smaller models (Flash variants, Qwen, etc.) that don't need 950-level compute. But the growth is in the frontier-class tokens — longer context, more complex reasoning, higher quality requirements. That's exactly where 950-class chips are needed.

"But they can still buy H20 / smuggled H100"

H20 is less capable than 950PR per chip (Nvidia designed it to be weaker so that it would comply with US export rules). And export controls have made H100 procurement increasingly difficult. Relying on smuggled hardware is not a supply chain strategy.


What This Means

For Developers

Your inference costs are likely going down over the next 12 months, not up — even though demand is exploding. That's unprecedented in any computing market. The driver isn't efficiency gains or manufacturing scale. It's a strategic subsidy by Chinese AI firms betting that locking in your API calls today is worth negative margins for a year.

Take the subsidy. But don't assume today's prices reflect tomorrow's costs — they reflect tomorrow's hopes.

For the Industry

The AI API market has entered a phase that looks like a price war but functions like an infrastructure land-grab. The playbook is AWS 2006, DoorDash 2019, Uber 2015: lose money on every transaction to own the default routing.

When the hardware does catch up — when Ascend 960 (2027) or 970 (2028) ships with 3-5x the throughput — the providers with the largest captive developer bases will convert negative margins to positive ones. Everyone else will be competing on price against incumbents they can't dislodge.


The Bottom Line

DeepSeek's permanent price cut is not evidence that Chinese AI compute supply has caught up with demand. The math shows it hasn't — and won't for at least 12-18 months. It's evidence that DeepSeek is playing the long game: use today's negative margins to own tomorrow's default inference route, and trust that Huawei's future chips will eventually close a gap that's currently 3-5x wider than headlines suggest.

The 75% cut isn't a cost breakthrough. It's a bet that developer lock-in is worth more than current margins, and that the 750,000 Ascend 950 chips shipping this year are just the beginning.


Numbers sourced from: National Data Administration (China daily token data, March 2026), Huawei Connect 2025 (Ascend 950 specs and roadmap), SCMP/DW (ByteDance order volume), DeepSeek official pricing page (May 2026). Throughput calculations based on published Atlas 950 SuperNode benchmarks. Growth projections assume continuation of 40%/quarter rate per published data.