惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

PCI Perspectives
PCI Perspectives
Apple Machine Learning Research
Apple Machine Learning Research
Recent Announcements
Recent Announcements
量子位
H
Hackread – Cybersecurity News, Data Breaches, AI and More
腾讯CDC
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
S
Schneier on Security
Microsoft Azure Blog
Microsoft Azure Blog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
小众软件
小众软件
Recorded Future
Recorded Future
P
Privacy International News Feed
Cisco Talos Blog
Cisco Talos Blog
Latest news
Latest news
C
Check Point Blog
O
OpenAI News
N
Netflix TechBlog - Medium
U
Unit 42
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
P
Proofpoint News Feed
Hacker News - Newest:
Hacker News - Newest: "LLM"
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
宝玉的分享
宝玉的分享
F
Full Disclosure
Know Your Adversary
Know Your Adversary
GbyAI
GbyAI
W
WeLiveSecurity
Engineering at Meta
Engineering at Meta
Scott Helme
Scott Helme
云风的 BLOG
云风的 BLOG
I
InfoQ
D
Docker
N
News | PayPal Newsroom
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
T
Tor Project blog
The GitHub Blog
The GitHub Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
T
ThreatConnect
人人都是产品经理
人人都是产品经理
S
Securelist
G
Google Developers Blog
Martin Fowler
Martin Fowler
雷峰网
雷峰网
Stack Overflow Blog
Stack Overflow Blog
P
Privacy & Cybersecurity Law Blog
L
Lohrmann on Cybersecurity
博客园 - 【当耐特】
博客园 - 司徒正美
Hugging Face - Blog
Hugging Face - Blog

The Decoder

One of the world's top law schools draws a hard line against AI in legal education Google CEO Pichai now calls links a "part" of search, redefining the web's role in its own product Anthropic warns Claude Mythos Preview finds bugs faster than developers can patch them Cloudflare CEO Prince says builders and sellers are safe but AI is coming for the measurers OpenAI launches a ChatGPT Powerpoint plugin and warns it might accidentally delete your content Deepseek reportedly prioritizes AGI research over quick profits despite billions in funding OpenAI Appshots turn any Mac window into context for Codex OpenAI burned through $1.22 per dollar earned even after stripping out stock-based compensation California governor signs first US executive order to protect workers from AI job loss Trump pulls AI safety order after last-minute calls from Musk, Zuckerberg, and Sacks Google checks websites for llms.txt in new agentic browsing audit OpenAI shifts the boundary of automated reasoning with a "milestone in AI mathematics" that experts are now unpacking US Cyber Command races to deploy AI on top-secret networks Cohere open-sources its strongest model yet Anthropic is about to become the first profitable AI lab OpenAI could file confidential IPO paperwork within days SpaceX IPO filing shows billions in AI losses, a $2 trillion valuation target, and turbine spending that signals more data center conflicts ahead SAP taps Mistral AI to help customers migrate legacy software Deepseek wants to take on Claude Code and OpenAI's Codex with "Deepseek Code" LinkedIn's war on AI slop is not just a policy update—it is an admission that the platform lost control of its feed Google tests the app market version of the SaaSpocalypse Stability AI launches Stable Audio 3.0 with up to six-minute tracks and open weights Google pairs its Genie world model with Street View to create explorable AI worlds based on real places Google's Gemini 3.5 Flash follows Anthropic and OpenAI in making newer AI models significantly pricier Google overhauls its AI subscriptions at I/O 2026 with three tiers starting at $10 a month Sorry for the outages: Bot spam is pushing our servers to the limit Google's I/O announcements: new models, a cloud agent that never sleeps, and a redesigned Gemini app Prominent AI researcher Andrej Karpathy picks Anthropic over former home OpenAI to get back into frontier LLM research Agora-1 turns the N64 classic GoldenEye into a playable AI simulation for four players Mistral AI acquires Viennese physical AI startup Emmi AI Cloudflare says Anthropic's Mythos Preview finds exploit chains that earlier frontier models missed Anthropic adds self-hosted sandboxes and MCP tunnels to Claude Managed Agents Elon Musk appeals $134 billion OpenAI loss, calls verdict a "calendar technicality" Elon Musk loses his $134 billion lawsuit against OpenAI after jury deliberates for just two hours Cursor's Composer 2.5 matches Opus 4.7 and GPT-5.5 benchmarks at a fraction of the cost Pope Leo XIV presents first AI encyclical, Anthropic co-founder invited as guest speaker A Stanford student reflects on his ChatGPT class and a culture of "just a little bit of fraud" MAGA-aligned groups want government oversight of frontier AI models Anthropic to brief global financial regulators on cyber flaws found by Claude Mythos AI startup revenue hits $80 billion, but Anthropic and OpenAI take almost all of it World Action Models give robots the ability to simulate consequences before they move Greg Brockman consolidates OpenAI's product teams to build an "agentic future" Mistral CEO Arthur Mensch warns France against letting Anthropic's Mythos scan military code bases New math benchmark reveals AI models confidently solve problems that have no solution Four AI models ran radio stations for six months and the results ranged from competent to unhinged Oppo open-sources Android AI agent X-OmniClaw that uses your camera, screen, and voice without leaving the phone New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously YouTube opens its deepfake face-swap detection tool to all adult creators New benchmark confirms AI video generators look stunning but still can't reason about the world OpenAI bought a voice cloning startup famous for celebrity imitations For $1.3 million a month, OpenClaw founder Peter Steinberger runs 100 AI agents that code, review PRs, and find bugs AI made a tiny slice of Silicon Valley filthy rich and left the rest wondering why they bother Researchers train AI model that hits near-full performance with just 12.5 percent of its experts Google says GEO and AEO are a myth and traditional SEO is all you need for AI search Google busts the myth that AI search needs its own SEO playbook ChatGPT now wants access to your bank account so it can tell you to stop ordering takeout Anthropic's $900 billion valuation would make it more valuable than OpenAI for the first time x.AI plays catch-up with Grok Build, its first terminal-based coding agent Microsoft pulls Claude Code licenses and pushes developers back toward its own AI tool Arxiv cracks down on unchecked AI-generated content in research papers Anthropic frames AI competition with China as a now-or-never moment for Washington OpenAI makes its AI coding assistant Codex available on iOS and Android Americans would rather live next to a nuclear plant than an AI data center, Gallup poll finds Microsoft pits more than 100 AI agents against each other to find Windows vulnerabilities Ten Chinese firms including ByteDance reportedly get US clearance for AI chips they're not allowed to accept Alibaba's Qwen-Image-2.0 doubles compression and cuts generation steps from 40 to 4 ChatGPT's web traffic share dropped from 78% to 54% in one year as Gemini quietly tripled its reach New Claude Mythos becomes the first AI model to clear all cyberattack simulations from Britain's AI safety agency Microsoft's Edge Copilot can now read all your open tabs at once and write for you on LinkedIn Claude subscriptions get separate budgets for programmatic use, billed at full API prices Tencent plans to ramp up AI spending as China's chip supply allegedly improves Anthropic overtakes OpenAI in B2B adoption for the first time according to Ramp spending data Meta AI gets a private mode where no conversation data is stored on servers Anthropic launches Claude for Small Business to embed AI into the tools you forgot you pay for Luma opens Uni-1.1 image model API at prices and quality matching OpenAI and Google China's AI suppliers can't keep up as critical component shortages hit production AI startup Recursive emerges from stealth with $650 million to build self-improving AI Google is hiring hundreds of engineers to help customers adopt its AI From Prompt to Pointer Engineering: Deepmind tries to reinvent the mouse cursor for the AI era Android gets AI agents that book trips, fill forms, and clean up your texts Anthropic expands legal AI offerings with new Claude Cowork plugins Google says it stopped a mass cyberattack after AI was used to discover a zero-day exploit Alphabet's Isomorphic Labs raises $2.1 billion to scale AI drug discovery toward clinical trials Microsoft ousts its Israel chief following reports that Azure quietly powered military AI targeting in Gaza "Tokenmaxxing" spreads at Amazon as employees game internal AI leaderboards Thinking Machines Lab ships its first model and argues interactivity is what OpenAI gets wrong about voice Sam Altman's personal investments face political scrutiny ahead of OpenAI's planned IPO The EU wants to regulate AI but needs OpenAI and Anthropic to let regulators through the door Baidu's Ernie 5.1 cuts 94 percent of pre-training costs while competing with top models OpenAI's DeployCo subsidiary adopts Palantir's playbook, building a moat from workflows no lab can simulate Lawsuit claims ChatGPT coached FSU shooter on gun operation, timing, and victim thresholds AI turns patches into working exploits in 30 minutes, and the 90-day disclosure window is the casualty Generative AI turns identity theft into an industrial-scale operation Nvidia pumps over 40 billion dollars into AI partners so far in 2026 OpenAI's internal share sale minted roughly 75 multimillionaires who each cashed out the $30 million cap AI agents that hack computers and replicate themselves, and they're getting better fast AI agents can now hack computers and copy themselves, and they're getting better fast Anthropic and OpenAI sit down with religious leaders to seek ethical advice ByteDance plans over $30 billion for AI expansion, bets big on Chinese chips METR says it can barely measure Claude Mythos, Palo Alto Networks warns of autonomous AI attackers
Alibaba's latest AI model ran autonomously for 35 hours to optimize code for its own custom chip
Jonathan Kem · 2026-05-23 · via The Decoder

Alibaba's Qwen team has released Qwen3.7-Max, a proprietary model designed for agent-based tasks. In a real-world test, the model ran a fully autonomous kernel optimization for 35 hours straight.

Like its predecessors Qwen3-Max and Qwen3.6-Plus, the new Max version is only available through the Alibaba Cloud Model Studio API. Alibaba used to release its Qwen models as open source, but that's changed. The last open flagship was Qwen3.5-397B-A17B from February 2026.

Qwen3.7-Max supports OpenAI- and Anthropic-compatible interfaces and plugs right into Claude Code, OpenClaw, or Qwen Code. The Qwen team says the model targets four use cases: working as a coding agent from front-end prototypes to complex multi-file software projects, automating office tasks with external tools, running autonomously for long stretches, and performing consistently across different agent frameworks.

A kernel experiment that ran for 35 hours

Qwen3.7-Max was tasked with optimizing a hardware-based attention kernel for the open-source inference software SGLang. The hardware was a cloud instance with T-Head-ZW-M890 accelerators, an AI chip platform from Alibaba's own semiconductor arm.

The Qwen team says the model had never seen this chip architecture during training. It started with no measurement data, no hardware docs, and no sample code. The only thing it had to work with was the existing reference implementation, written in the Triton programming language.

Over about 35 hours of nonstop autonomous work, the model ran 432 kernel tests with 1,158 total tool calls. It compiled, measured, and revised the code in loops, caught compilation errors, and tracked down performance bottlenecks on its own. The result, according to the Qwen researchers, is an average 10x speedup over the reference implementation.

Competitor models came up well short in the same setup. GLM 5.1 hit a 7.3x speedup, Kimi K2.6 got to 5x, DeepSeek V4 Pro managed 3.3x, and the predecessor Qwen3.6-Plus barely moved the needle at 1.1x. Models that quit early ended their sessions on their own after five straight rounds with no tool calls. On the standardized KernelBench L3 benchmark, Qwen3.7-Max claims to produce accelerated kernels 96 percent of the time, just behind Anthropic's Opus 4.6 at 98 percent.

Training splits task, tool environment, and validator

Qwen3.7-Max builds on a training approach the team first rolled out with Qwen3.5. Each training task breaks into three independent pieces: the actual task, the tool environment, and the validator that checks the result. These can be mixed and matched freely.

Two bar charts for the benchmarks QwenClawBench and CoWorkBench. Claude Opus 4.6, Qwen3.6-Plus and Qwen3.7-Max are compared. Qwen3.7-Max achieves values between 64.3 and 70.7 on QwenClawBench and 66.0 to 68.3 on CoWorkBench in three different agent environments (OpenClaw, Claude Code, Hermes), while Qwen3.6-Plus is significantly lower at 57.2 and 64.5.
Cross-harness test: Qwen3.6-Plus swings depending on which agent framework runs it, but Qwen3.7-Max posts nearly identical scores across OpenClaw, Claude Code (CC), and Hermes, according to the team - and beats Claude Opus 4.6 on QwenClawBench. | Image: Qwen

The same task gets practiced across different tool environments and checked with different test methods. That's meant to force the model to pick up strategies that work everywhere, not just shortcuts tied to one specific setup. On QwenClawBench and CoWorkBench, Qwen3.7-Max holds steady no matter which test environment it's dropped into, the team says.

The model polices its own training for reward hacking

The Qwen team also put Qwen3.7-Max to work as a watchdog during its own training. The model watched training runs for software engineering tasks for over 80 hours and ran more than 10,000 checks. It hunted for tricks the model being trained might pull to game its rewards, like grabbing correct answers straight off GitHub. Qwen3.7-Max wrote 13 new detection rules and flagged 1,618 cases.

Diagram of 86 hours of autonomous runtime with two curves. The green line shows the cumulative detected cases of reward hacking, which rises to around 1,600, while the blue line shows the RL performance. Green stars mark new detection rules up to rule 13, such as for copying external source files, the Gerrit API search or retrieving direct patch URLs. Headers indicate 86 hours of runtime, 13,952 trajectories, 11,196 test calls and 1,618 detected cases.
Over 86 hours of autonomous runtime, Qwen3.7-Max checked 13,952 trajectories and caught 1,618 attempts where the model being trained gamed its rewards, according to the team. Detected cases climbed with each new detection rule (green stars). | Image: Qwen

One year in simulation tests long-term planning

To gauge long-term planning, the team used YC-Bench, a benchmark that simulates a startup's full one-year life cycle. The model has to manage staff across hundreds of decision rounds, review contracts, spot bad-faith customers, and keep profit margins healthy against rising labor costs.

Qwen3.7-Max pulled in $2.08 million in total revenue and wrapped up 237 tasks. Its predecessor, Qwen3.6-Plus, hit $1.05 million. Qwen3.5-Plus managed just $352,000.

Across most benchmarks, Qwen3.7-Max trades blows with Claude Opus 4.6 Max, Kimi K2.6 Thinking, GLM-5.1 Thinking, and DeepSeek V4 Pro Max. On SWE-Verified, the model scored 80.4, nearly tied with Opus 4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6). On the math and science benchmarks GPQA Diamond (92.4), HMMT 2026 February (97.1), and Apex (44.5), Qwen3.7-Max tops the provider's own comparison table.

Grid of twelve bar charts comparing Qwen3.7-Max with Qwen3.6-Plus, DeepSeek V4 Pro Max, GLM-5.1, Kimi K2.6 and Claude Opus 4.6 Max. Qwen3.7-Max achieves top scores in Terminal-Bench 2.0 (69.7), SWE-bench Pro (60.6), SWE-bench Multilingual (78.3), MCP-Atlas (76.4), HLE (41.4), Apex Math Reasoning (44.5) and IFBench (79.1), among others. Claude Opus 4.6 Max is ahead in NL2Repo (47.6), ClawEval (70.4) and CoWorkBench (68.2).
Qwen3.7-Max generally leads or ties with Claude Opus 4.6 Max, DeepSeek V4 Pro Max, GLM-5.1, Kimi K2.6, and its own predecessor Qwen3.6-Plus across twelve benchmarks, according to the provider. Claude Opus 4.6 still wins on NL2Repo, ClawEval, and CoWorkBench. | Image: Qwen
As the number of training environments grows, Qwen3.7-Max-Thinking climbs the rankings across eight benchmarks, passing DeepSeek V4 Pro Max, GLM-5.1, and Kimi K2.6 - but still sitting just below Claude 4.6 Opus Max, according to the Qwen team. | Image: Qwen

Some of those benchmarks are homegrown, though. QwenWebDev, QwenClawBench, CoWorkBench, and QwenWorldBench all come from the Qwen team itself. Every result here is self-reported. A closer look at scaling dynamics and methodology is coming in an upcoming technical report.

Beyond the usual use cases, the team also shows off Qwen3.7-Max steering a four-legged robot. Using its own robotics framework and a paired navigation model, the language model guides the robot through physical spaces.

AI News Without the Hype – Curated by Humans

Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.

Subscribe now