AI hit the memory wall — now it needs a new context tier
VB Staff · 2026-06-22 · via VentureBeat

Presented by Solidigm


As inference workloads evolve from discrete question-and-answer exchanges into persistent, multi-step agentic systems, GPU availability is no longer the most critical AI bottleneck. Instead, the bottleneck has migrated from compute to context, says Jeff Harthorn, AI applied research lead at Solidigm.

"Why context management has become a primary bottleneck, more than GPU availability or compute efficiency, is the question of 2026," says Harthorn. "GPUs have gotten dramatically cheaper per FLOP. Model architectures and inference serving engines have all gotten much more efficient. But the thing that's grown faster than both of those is context. The persistent state that has to live between sessions has grown even faster than context itself."

Three trends are converging. Context windows are growing dramatically, making individual inputs far larger than before. Agentic AI systems chain dozens or hundreds of model calls together, each generating state that must be tracked. And enterprises are requiring that inference state persist across sessions for audit, governance, and reuse. These trends compound each other, pushing context volumes beyond what any existing memory tier was designed to handle.

"Those three things are all happening at the same time, all of which are pushing context data and context memory into the stratosphere much more quickly than we're used to seeing," adds Ace Stryker, director of AI and ecosystem marketing at Solidigm.

The solution is a dedicated context tier emerging between GPU memory and bulk network storage: a layer of high-performance, high-density flash designed to hold and serve two kinds of data at inference speed. The first is key-value (KV) cache, the inference data that allows models to retain and reuse context; the second is retrieval data. Nvidia has formalized this architecture under the term CMX. Storage companies including Solidigm are building SSD products optimized for this workload.

"Storage has not been the first thing folks have thought about when they've been planning their enterprise infrastructure buildout," Stryker says. "In a lot of ways, it was a relatively small cost compared to compute, and it was a commodity. You just shopped around for the lowest dollar per gigabyte and called it good. But now, if your storage is not up to snuff, your ROI suffers, and it directly impacts your bottom line.”

Why AI inference requires a different storage architecture than training

The storage architecture that AI systems rely on today was largely inherited from training workflows. Training is sequential and write-dominated, with data moving in large blocks to and from bulk object storage. The tier structure, with high-bandwidth memory on the GPU, fast NVMe in the server, and bulk storage over the network, serves that use case reasonably well.

However, inference is a different animal. Its I/O signature is fine-grained, latency-sensitive, and increasingly stateful. KV cache data and retrieval data each have distinct access patterns, but both need to be served quickly and reused across interactions. Neither fits cleanly within GPU high-bandwidth memory, which is expensive and physically constrained, nor within traditional bulk storage, which was never designed for active inference workloads.

"The architectural gap that's interesting to me right now isn't at the top of the stack or the bottom, it's right in the middle," Harthon says. "A lot of what sits below the GPU HBM is being asked to do things it wasn't really designed for, which is where the most interesting systems work today is happening."

One of the most visible symptoms of this gap is recomputation. In inference, the pre-fill stage processes all of the context relevant to a given session before token generation can begin. When KV cache state isn't available in a fast, accessible tier, the system recomputes it — burning GPU cycles that produce no new value.
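The recompute-on-miss behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `TieredKVCache`, its method names, and the `fake_prefill` stand-in are hypothetical and do not correspond to any vendor's API or to a real inference engine. It simply counts how often session state has to be rebuilt when an intermediate tier is, or is not, available.

```python
from typing import Callable, Dict


class TieredKVCache:
    """Toy two-tier lookup with a recompute fallback (all names hypothetical)."""

    def __init__(self, prefill: Callable[[str], bytes]) -> None:
        self.prefill = prefill                    # stand-in for the expensive GPU pre-fill
        self.hbm: Dict[str, bytes] = {}           # small, fast tier on the accelerator
        self.context_tier: Dict[str, bytes] = {}  # larger flash-backed tier
        self.recomputes = 0                       # how many times state was rebuilt

    def get(self, session_id: str) -> bytes:
        # 1. Fastest path: state is already resident in HBM.
        if session_id in self.hbm:
            return self.hbm[session_id]
        # 2. Next: fetch from the context tier and promote it to HBM.
        if session_id in self.context_tier:
            state = self.context_tier[session_id]
            self.hbm[session_id] = state
            return state
        # 3. Miss everywhere: recompute, which produces no new output.
        self.recomputes += 1
        state = self.prefill(session_id)
        self.hbm[session_id] = state
        return state

    def evict(self, session_id: str, keep_in_context_tier: bool) -> None:
        # Free HBM; optionally preserve the state in the context tier.
        state = self.hbm.pop(session_id)
        if keep_in_context_tier:
            self.context_tier[session_id] = state


def fake_prefill(session_id: str) -> bytes:
    # Placeholder for real pre-fill work; returns dummy bytes.
    return f"kv-state-for-{session_id}".encode()


with_tier = TieredKVCache(fake_prefill)
without_tier = TieredKVCache(fake_prefill)
for cache, keep in ((with_tier, True), (without_tier, False)):
    cache.get("session-a")                            # first request: pre-fill
    cache.evict("session-a", keep_in_context_tier=keep)
    cache.get("session-a")                            # session resumes later

print(with_tier.recomputes, without_tier.recomputes)
```

In this toy run, the cache that preserves evicted state rebuilds it once, while the cache without an intermediate tier rebuilds it on every resume.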

"A meaningful share of GPU cycles end up going to re-pre-filling," Harthon explains. "During all of that calculated context, that's potentially compute that's being spent reproducing state, rather than doing new work. When you start looking at the problem that way, GPU utilization starts looking like it's partly a storage problem."

This reframing is driving renewed interest in a metric borrowed from networking: goodput, or useful tokens per dollar, rather than raw tokens per dollar.
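A short worked example may make the distinction concrete. The sketch below assumes one possible reading of the metric: "useful" tokens are all tokens processed minus those spent re-pre-filling state that already existed. The function names and every number are illustrative assumptions, not figures from Solidigm or from any benchmark.

```python
def raw_tokens_per_dollar(tokens_processed: int, cost_usd: float) -> float:
    """Every token the GPUs processed, divided by spend."""
    return tokens_processed / cost_usd


def goodput_tokens_per_dollar(
    tokens_processed: int, recomputed_tokens: int, cost_usd: float
) -> float:
    """Only tokens that represent new work, divided by the same spend."""
    useful_tokens = tokens_processed - recomputed_tokens
    return useful_tokens / cost_usd


# Illustrative numbers only (assumptions, not measurements).
tokens_processed = 1_000_000   # total tokens pushed through the GPUs
recomputed_tokens = 400_000    # tokens re-pre-filled because state was lost
cost_usd = 10.0

raw = raw_tokens_per_dollar(tokens_processed, cost_usd)
good = goodput_tokens_per_dollar(tokens_processed, recomputed_tokens, cost_usd)
print(f"raw: {raw:,.0f} tokens/$  goodput: {good:,.0f} tokens/$")
```

With these made-up inputs, raw throughput is 100,000 tokens per dollar while goodput is 60,000: the spend is identical, but 40% of it reproduced existing state.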

The AI context memory tier and how it works

The industry's response is taking structural form. A new tier is emerging between GPU memory and traditional network storage, designed specifically to hold and serve inference context. It is distinct from both the drives inside GPU servers (G3) and the storage servers reached over the network (G4), and it is engineered to serve context data back to accelerators as rapidly as possible.

"If you're building a data center starting in the second half of this year, or the beginning of next year, you can't think about storage only living in two places," Stryker says. "Storage has to live in at least three places to handle the context memory tier, and that's likely to be a permanent fixture in how the infrastructure gets built going forward."

It's analogous to the emergence of object storage as a category, which didn't exist until enough workloads needed it. Once the category formed, it developed its own primitives, SLAs, cost models, and an ecosystem of vendors.

"The context tier looks like it might be on a similar arc," Harthorn says. "That volumetric pressure is causing the category to form, rather than any one vendor's road map."

For infrastructure leaders, this means actively planning for the new tier rather than treating it as optional. Deploying additional NAND at this layer reduces dependency on DRAM, which is orders of magnitude more expensive per gigabyte and constrained in both availability and thermal headroom.

"In terms of your investment effectiveness, you're laying out less cash to do it if you rely on the SSD layer in the way that Nvidia is now recommending and prescribing for a lot of use cases," Stryker adds.

What flash needs to deliver to support AI inference

Participating meaningfully in the inference stack places new demands on SSD technology. Tail latency, the worst-case performance of a drive, must be predictable, not just fast on average. An orchestration system that allocates GPU resources based on expected storage response times cannot tolerate unexpected multi-second delays. Consistent, observable performance matters more here than peak throughput.
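To see why an average can hide the problem, consider a minimal sketch that compares the mean, median, and 99th-percentile latency of a set of read times. The sample values are invented for illustration, and `percentile` is a simple nearest-rank helper written for this example rather than a function from any storage tool.

```python
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # ceil(pct * n / 100) computed with integers to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical read latencies in milliseconds: 98 fast reads, 2 multi-second stalls.
latencies_ms = [0.2] * 98 + [2000.0] * 2

print(f"mean   : {mean(latencies_ms):.3f} ms")
print(f"median : {percentile(latencies_ms, 50):.1f} ms")
print(f"p99    : {percentile(latencies_ms, 99):.1f} ms")
```

In this invented sample the median read takes 0.2 ms, yet the 99th percentile is a full two seconds. A scheduler that planned around the median would stall on exactly the requests the paragraph above warns about.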

Beyond latency, density becomes a critical concern, especially at hyperscale. In data centers where power, not cost, is the binding constraint, watts per petabyte becomes the operative metric. Floating gate NAND, the manufacturing approach at the core of Solidigm's products, is suited to that calculation. Network integration via NVMe over Fabrics, RDMA, and eventual CXL support is also essential, given the tight latency budgets of active inference pipelines.

"The drives have to have reliable performance characteristics, beyond the throughput side and being able to transfer as much data as possible as fast as possible, the way that training needed," Harthon says. "Now it's about being able to do it very consistently, in a way that's very observable to the people operating and orchestrating these systems."

How enterprise AI leaders should plan for the context tier

The standards, software primitives, and best practices being established now will define how AI inference infrastructure operates for years to come. Solidigm is engaged in that process through standards bodies, partner lab collaborations, and published research, which is critical precisely because the category is still forming.

"The interesting question for the next couple of years isn't whether AI infrastructure needs more compute," Harthorn says. "It's whether it can use what it has more efficiently. A lot of that answer runs through this tier that is being built today."


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.