MiniMax teases M3 model with new sparse attention mechanism, 15.6X long-context response speed boost
carl.franzen · 2026-05-28 · via VentureBeat

Among the many Chinese AI companies and laboratories vying for market share and attention (no pun intended) on the global marketplace, MiniMax stands out for its commitment to providing frontier-level intelligence across a range of modalities, including text, coding, and video (through its Hailuo model series) — often under permissive, enterprise-friendly, standard open source licenses.

Now, MiniMax is again raising the eyebrows of AI power users and developers around the world by releasing a new, in-depth technical report on the making of its popular M2 series of language models (M2, M2.5, and M2.7), shedding light on its numerous engineering innovations and clever approaches. At the same time, the company and its leaders teased a new sparse attention approach for the upcoming MiniMax M3 series of models, which MiniMax says yields up to 15.6 times faster decoding (or LLM response) speed at long contexts (a million tokens) by adopting a custom sub-quadratic framework. In so doing, MiniMax has designed M3 to make ultra-long-context AI agent deployment economically viable.

The M2 report is noteworthy for any enterprise working with AI models, and especially those looking to fine-tune and train their own in-house. After all, MiniMax's M2 series models often topped the benchmarks for open source AI performance at the time they were released.

While the title has since been eclipsed by several other Chinese labs including DeepSeek and Xiaomi, MiniMax's new report offers a blueprint that can be used to improve AI model and agent performance by enterprises around the world.

As Adina Yakup of Hugging Face observed on X, "Beyond the benchmarks, they’ve done some really solid work on MoE efficiency and agent oriented design. Excited to see where M3 goes next!"

The attention dilemma

The core technical architecture of the M2 series relies on a sparse Mixture-of-Experts (MoE) decoder-only Transformer layout used by numerous other state-of-the-art LLMs.

The foundational backbone houses 229.9 billion total parameters, yet maintains a remarkably lean operational footprint by activating just 9.8 billion parameters per token across 256 fine-grained experts.

To optimize routing and avoid standard load-balancing issues, MiniMax implemented sigmoid gating paired with learnable, expert-specific bias terms, heavily reducing reliance on restrictive auxiliary losses.
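To make the routing idea concrete, here is a minimal sketch of sigmoid gating with a per-expert bias. The article only states that M2 pairs sigmoid gating with learnable, expert-specific bias terms; the details below (the bias affecting only which experts are selected, and the mixing weights being renormalized over the chosen experts) are assumptions made for illustration, and `route_token` is a hypothetical name.

```python
import math

def route_token(logits, bias, top_k):
    """Pick top_k experts for one token using sigmoid gating plus a selection bias."""
    # Sigmoid scores each expert independently (no softmax competition).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Assumption: the bias only influences WHICH experts are chosen.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: mixing weights use the unbiased scores, renormalized over the chosen experts.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}

logits = [2.0, 1.5, -0.5, 0.1]
no_bias = route_token(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
# Suppose expert 0 is overloaded: a negative bias steers traffic away from it.
with_bias = route_token(logits, [-1.0, 0.0, 0.0, 0.0], top_k=2)
print(sorted(no_bias), sorted(with_bias))
```

In this toy case the bias moves the token from experts 0 and 1 to experts 1 and 3 without adding any extra loss term, which is the kind of load balancing the bias terms are meant to provide.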

The most consequential engineering decision documented in the M2 paper was the strict adherence to full attention, implemented with Grouped Query Attention (GQA), across all 62 layers.

In large language models, "quadratic scaling" refers to the computationally expensive reality of standard full attention mechanisms, where every token in a sequence must mathematically connect to every other token. To use a real-world analogy, it is akin to attending a networking event and being forced to have a deep conversation with every single person in the room while simultaneously monitoring all other ongoing conversations.

While this approach yields incredibly thorough context, the processing power and memory required grow with the square of the input length (doubling the input roughly quadruples the cost), creating a severe hardware bottleneck as models attempt to ingest hundreds of thousands of words.
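The arithmetic behind that growth is easy to check. The sketch below simply counts query-key comparisons for full causal attention, where each token attends to itself and every earlier token; `attention_pairs` is a hypothetical helper, and real costs also depend on head count, layer count, and hardware.

```python
def attention_pairs(n_tokens):
    """Number of query-key score computations for full causal attention."""
    # Token i attends to itself and all earlier tokens: 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2

for n in (1_000, 10_000, 100_000):
    print(n, attention_pairs(n))
```

Each tenfold increase in input length raises the count roughly a hundredfold.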

The problem with sub-quadratic scaling

"Sub-quadratic" scaling introduces architectural shortcuts designed to bypass this quadratic computational load. Instead of mapping every possible connection, sub-quadratic methods, such as Sliding Window Attention or compressed linear attention, might only analyze a localized window of nearby words or generate a compressed summary of the broader text.

These efficient methods drastically reduce hardware costs and allow models to process massive documents at high speeds, but they historically introduce severe trade-offs in accuracy, often causing the AI to miss the "big picture" or lose track of distant context.
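A toy example shows how a sliding window can lose distant context. It models a single attention layer only (in a deep model, information can still travel across layers), and `visible_positions` is a hypothetical helper rather than any library's API.

```python
def visible_positions(query_pos, window=None):
    """Positions a causal query may attend to; window=None means full attention."""
    start = 0 if window is None else max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))

# A "clue" sits at position 2; the question is asked at position 9.
clue_pos, query_pos = 2, 9
full = visible_positions(query_pos)
windowed = visible_positions(query_pos, window=4)
print(clue_pos in full, clue_pos in windowed)
```

With full attention the clue is directly visible to the query; with a window of four tokens it has already slid out of view.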

This mathematical dilemma defines the architectural evolution from MiniMax's M2 to its upcoming M3 series. During M2's development, researchers rigorously tested sub-quadratic shortcuts but found they crippled the model's "multi-hop reasoning"—its ability to connect disparate clues across a long document—forcing the team to absorb the massive computational cost of full quadratic attention to maintain frontier-level intelligence.

Indeed, they aggressively benchmarked efficient attention alternatives during pre-training but intentionally threw them out. They experimented extensively with hybrid setups, interleaving full attention with sub-quadratic architectures like Lightning Attention or hybrid Sliding Window Attention (SWA) configurations.

The empirical results were definitive: at a larger scale, linear and windowed attention variants exhibited severe reasoning deficits.

On evaluations exceeding 32K context windows, SWA variants performed significantly worse than full attention, dropping from a baseline score of 90.0 to 72.0 on the RULER 128K complex word extraction task.

Sub-quadratic configurations proved prone to memory-bound constraints during training, lacked native prefix caching support, and failed to smoothly align with Multi-Token Prediction (MTP) modules used for speculative decoding. Full attention was deemed necessary to preserve multi-hop reasoning capability.

However, recognizing that physical hardware limits cannot sustain quadratic scaling indefinitely, MiniMax is designing the M3 series around a novel sub-quadratic framework to finally deliver both high-speed processing and uncompromised reasoning.

MiniMax Sparse Attention (MSA) and sub-quadratic scaling incoming

The upcoming MiniMax-M3 breaks away from the compute-heavy constraints of its predecessor. As disclosed by MiniMax’s engineering team under the banner "Something BIG is coming," M3 introduces "MiniMax Sparse Attention" (MSA).

Unlike DeepSeek’s Multi-head Latent Attention (MLA), which compresses keys and values into a low-dimensional latent space, MSA operates on a standard GQA backbone but utilizes block-level selection on real, uncompressed Key-Values.

Elie Bakouch at AI training infrastructure and platform lab Prime Intellect posted on X noting that the main changes feature "block level selection like in CSA but attention is done on the real KV, not in [compressed space]."

This solves the precision loss and prefix-caching obstacles noted in the M2 paper. By filtering and selecting block-level sequences dynamically, MSA delivers an architectural leap: early hardware profiling indicates a 9.7x speedup in prefilling latency and a massive 15.6x speedup during decoding phases at a 1-million token sequence length compared to the full-attention M2 architecture.
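MiniMax has not yet published the details of MSA, so the following is only a generic sketch of block-level selection over uncompressed keys and values, not the company's method. The selection rule (scoring each block by the query's dot product with the block's mean key) and the names `dot` and `block_sparse_attention` are assumptions chosen for simplicity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(q, keys, values, block_size, top_blocks):
    """Attend only over the top-scoring key blocks, using the original K/V rows."""
    n = len(keys)
    blocks = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]

    def block_score(block):
        # Hypothetical selection rule: query dotted with the block's mean key.
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    idx = [i for b in sorted(ranked[:top_blocks]) for i in blocks[b]]
    # Ordinary softmax attention, restricted to the selected, uncompressed rows.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total for d in range(dim)]
    return out, idx

keys = [[0, 1], [0, 1], [1, 0], [1, 0], [-1, 0], [-1, 0], [0.5, 0], [0.5, 0]]
values = [[float(i)] for i in range(len(keys))]
out, idx = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_blocks=2)
print(idx)
```

Only four of the eight cached rows are read, yet the rows that are read are the real keys and values rather than a compressed summary, which is the distinction the article draws between MSA and latent-space approaches.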

To understand why a speedup in the "decoding phase" is so significant, it helps to break down how an AI actually reads and writes information. When you interact with an AI, the processing happens in two distinct steps: prefilling and decoding.

When you hand an AI a prompt—whether it’s a short sentence or a massive 1,000-page document—it processes that entire chunk of text all at once in parallel, known as "prefilling." It essentially "reads" the input in one big gulp to build its initial understanding and establish context.

In order to generate a response, the AI must enter a "decoding phase." To predict the first word of its response, it looks at the prompt. To predict the second word, it has to look at the prompt plus the first word. To predict the hundredth word, it must look back over the prompt and the previous 99 words it just wrote. So each new word requires more work than the last, with the final words requiring a review of everything that came before.

For a layperson, imagine reading a dense legal brief (prefilling) and then being forced to write a summary report where, before writing every single new word, you must rapidly reread the entire brief plus everything you've written so far to ensure your next word makes sense (decoding).

Because the AI must constantly and repetitively look backward to generate each new step forward, the decoding phase is the most severe computational bottleneck in generating text. It is why AI models often type out their answers word-by-word, and why they slow down significantly as conversations get longer.
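A rough way to see the decoding cost is to count how many earlier positions are read for each generated token. The sketch below assumes a key/value cache, so earlier tokens are read rather than recomputed; the `budget` parameter is a hypothetical cap on positions read per step, and the value used is arbitrary and unrelated to MiniMax's reported figures.

```python
def decode_reads(prompt_len, new_tokens, budget=None):
    """Total cached key/value positions read while generating new_tokens tokens."""
    total = 0
    for step in range(new_tokens):
        context = prompt_len + step  # everything before the token being generated
        total += context if budget is None else min(context, budget)
    return total

full = decode_reads(prompt_len=1_000_000, new_tokens=100)
sparse = decode_reads(prompt_len=1_000_000, new_tokens=100, budget=100_000)
print(full, sparse)
```

With full attention, every one of the 100 new tokens reads about a million cached positions; capping the number read per step shrinks the total proportionally.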

Therefore, when MiniMax reports that the new architecture achieves a 15.6x speedup during the decoding phase at a 1-million token sequence length, it means the model generates its answer, token by token, nearly 16 times faster than the full-attention M2 architecture at that length. That targets the exact bottleneck that normally makes AI chatbots slow down or stutter when handling massive amounts of information.

The evolution of the MiniMax M series and the creation of 'Forge'

On a product level, MiniMax has consistently evolved its models from simple text generation interfaces into autonomous workers.

The M2 series pioneered an "interleaved thinking" protocol where the model alternates between natural-language planning traces and explicit tool invocations inside a single trajectory. Rather than dropping the intermediate chain-of-thought blocks between execution turns, M2 appends the full thinking history directly into the conversation context. This planning persistence prevents state drift, allowing the model to recover gracefully from runtime errors and revise its strategies based on environment feedback.
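The difference between keeping and dropping the thinking history can be sketched with a plain list of messages. The message format and the `build_context` helper are hypothetical and are not MiniMax's actual API or wire format.

```python
def build_context(trajectory, keep_thinking=True):
    """Return the messages fed back to the model on the next turn."""
    if keep_thinking:
        return list(trajectory)
    return [m for m in trajectory if m["type"] != "thinking"]

trajectory = [
    {"type": "user", "text": "Fix the failing test."},
    {"type": "thinking", "text": "Plan: run tests first, then inspect the traceback."},
    {"type": "tool_call", "text": "run_tests()"},
    {"type": "tool_result", "text": "1 failed: test_parse raises KeyError"},
    {"type": "thinking", "text": "KeyError suggests a missing default; revise plan."},
]
kept = build_context(trajectory)
dropped = build_context(trajectory, keep_thinking=False)
print(len(kept), len(dropped))
```

When the thinking entries are dropped, the next turn sees the tool call and its result but not the plan that motivated them, which is the state drift the interleaved approach is meant to avoid.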

To train these long-horizon workflows, MiniMax built "Forge," a scalable agent-native reinforcement learning system. Forge decouples execution into three independent modules—the Agent Side, the middleware abstraction layer (Gateway Server and Data Pool), and the Training/Inference engines.

As MiniMax engineer Olive Song explained on the ThursdAI podcast, "What we realized is that there's a lot of potential with a small model like this if we train reinforcement learning on it with a large amount of environments and agents... But it's not a very easy thing to do," adding that this environmental training was where the team spent a significant portion of their development timeline.

To absorb the extreme trajectory-length variance common in multi-step agent environments, Forge implements two vital engineering solutions:

  1. Windowed FIFO Scheduling: A training scheduler that maps a sliding window over the generation queue. It permits greedy, high-throughput fetching of completed tasks within the window to prevent cluster idle time, while strictly enforcing FIFO boundaries to maintain distributional stability and avoid gradient oscillation.

  2. Prefix Tree Merging: An optimization that restructures batch training into tree computation. Completions sharing identical conversation prefixes are calculated exactly once in the forward pass before branching. This eliminates redundant calculations, generating up to a 40x training speedup with zero approximation error.
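Both ideas can be illustrated with small stand-alone functions. These are one possible reading of the descriptions above, not Forge's implementation: the window here is anchored at the oldest task not yet consumed, and the prefix tree only counts how many token positions need a forward pass. All function names are hypothetical.

```python
def fetch_completed(queue, done, consumed, window):
    """Return finished, unconsumed tasks inside the window at the head of the queue."""
    # The window starts at the oldest task that has not been consumed yet.
    head = next((i for i, t in enumerate(queue) if t not in consumed), len(queue))
    return [t for t in queue[head:head + window] if t in done and t not in consumed]

def tokens_naive(sequences):
    """Token positions computed when every sequence is processed independently."""
    return sum(len(s) for s in sequences)

def tokens_with_prefix_tree(sequences):
    """Token positions computed when shared prefixes are merged into a trie."""
    root, count = {}, 0
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}
                count += 1  # a new trie node means one new computation
            node = node[tok]
    return count

# Windowed FIFO: task 5 is finished but sits outside the window, so it waits.
queue = [0, 1, 2, 3, 4, 5]
first = fetch_completed(queue, done={1, 2, 5}, consumed=set(), window=4)
later = fetch_completed(queue, done={0, 1, 2, 5}, consumed={1, 2}, window=4)
after = fetch_completed(queue, done={0, 1, 2, 5}, consumed={0, 1, 2}, window=4)

# Prefix tree: eight completions share one 100-token conversation prefix.
prefix = list(range(100))
rollouts = [prefix + [1000 + i] for i in range(8)]
print(first, later, after, tokens_naive(rollouts), tokens_with_prefix_tree(rollouts))
```

In the scheduling example, finished tasks are taken out of order within the window, but the window cannot move past task 0 until that task is consumed. In the prefix example, the shared 100 tokens are counted once instead of eight times; the saving in a real run depends on how much of each trajectory is shared.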

This reinforcement infrastructure directly spawned the M2.7 checkpoint, moving the series toward "self-evolution". Operating inside an automated agent harness, M2.7 functions as an independent machine learning engineer. The model profiles its own active training runs, diagnoses anomalies, reads logs, and automatically modifies its own codebase and configurations.

According to MiniMax, M2.7 successfully handled between 30% and 50% of its own development workflow.

On OpenAI’s rigorous MLE Bench Lite suite, which tests autonomous ML research capability, M2.7 achieved a 66.6% medal rate across independent 24-hour trials, effectively tying Google’s closed-weight Gemini 3.1 Pro.

The continuous cadence from M2 to M2.5, which MiniMax said completed 30% of internal tasks and generated 80% of newly committed code at MiniMax HQ, underlines a broader vision.

As the MiniMax team noted during that phase of deployment, "we believe that M2.5 provides virtually limitless possibilities for the development and operation of agents in the economy."

With the technical report codifying the M2 generation's successes and the MSA tech blog on the horizon, MiniMax is signaling that the next frontier of AI is explicitly about translating a mini-activation footprint into maximum real-world intelligence.