AI models often give the right answers but point to the wrong sources
Jonathan Kem · 2026-05-25 · via The Decoder

Just because a language model nails a question about a PDF doesn't mean it actually found the answer where it claims to.

Researchers at Peking University and the Shanghai Artificial Intelligence Laboratory built a new benchmark called CiteVQA to expose this gap between getting the right answer and pointing to the right source. They call it "attribution hallucination."

Diagram of the CiteVQA benchmark with a sample question about a PDF, two scoring rows for SAA score 0 and SAA score 1, and a bar chart comparing answer accuracy and Strict Attributed Accuracy for five models.
CiteVQA checks both the answer and the source location. A correct answer paired with a wrong citation gets an SAA score of 0; a correct answer counts only when its citation is correct too. | Image: Ma et al.

Standard document analysis tests like DocVQA or MMLongBench-Doc only grade the final answer. They can't tell whether a model actually pulled information from the document or just guessed based on what it already knew. In law, financial audits, or medicine, though, traceability is what makes an AI output usable in the first place, the paper argues.

Pinpointing evidence

CiteVQA makes models back up every statement with a precise marker in the document. They have to point to the exact paragraph, table, or figure. A page number alone won't do. The dataset covers 1,897 questions across 711 PDFs (451 in English, 260 in Chinese) from seven subject areas. The documents average 40.6 pages each, far longer than the documents in most benchmarks.

Rather than hand-labeling everything, the team built an automated pipeline. It breaks documents into individual elements, has models like Gemini 3.0 Flash trace the chain of evidence, and then checks which pieces are truly needed. Each document gets pulled out on a trial basis. If the model can't answer the question without it, that document counts as essential.

CiteVQA's four-stage automated pipeline with the steps multi-document linking, evidence package extraction, QA construction, and quality control including evidence ablation.
The dataset is built fully automatically. In the last step, the pipeline removes each document one by one to check which ones are truly needed. | Image: Ma et al.
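The leave-one-out check described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' pipeline: `essential_documents` and `toy_can_answer` are hypothetical names, and the toy function stands in for an actual model call that would judge whether the question is still answerable.

```python
from typing import Callable, Sequence


def essential_documents(
    question: str,
    documents: Sequence[str],
    can_answer: Callable[[str, Sequence[str]], bool],
) -> list[str]:
    """Return the documents whose removal makes the question unanswerable."""
    essential = []
    for i, doc in enumerate(documents):
        # Remove one document on a trial basis and keep all the others.
        remaining = [d for j, d in enumerate(documents) if j != i]
        if not can_answer(question, remaining):
            essential.append(doc)
    return essential


def toy_can_answer(question: str, docs: Sequence[str]) -> bool:
    # Hypothetical stand-in for a model call: the question is only
    # answerable when both annual reports are available.
    return {"report_2023", "report_2024"} <= set(docs)


docs = ["report_2023", "report_2024", "press_release"]
needed = essential_documents("How did revenue change?", docs, toy_can_answer)
print(needed)  # ['report_2023', 'report_2024']
```

In this toy setup the press release can be removed without consequence, so only the two reports count as essential.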

The core metric is called Strict Attributed Accuracy. A model only gets points when the answer is correct and the citation lands on the right spot. Twenty current models were put through the test.
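To make the gap between the two scores concrete, here is a minimal sketch of how plain answer accuracy and a strict attributed score could be computed side by side. The element IDs and the rule for what counts as a correct citation (at least one cited element, and none outside the annotated evidence) are assumptions for illustration; the paper's exact matching rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    answer_correct: bool  # did the answer match the reference answer?
    cited: frozenset      # element IDs the model cited (hypothetical IDs)
    gold: frozenset       # element IDs annotated as the true evidence


def citation_correct(p: Prediction) -> bool:
    # Assumption: a citation is correct when the model cites something
    # and every cited element belongs to the annotated evidence.
    return bool(p.cited) and p.cited <= p.gold


def answer_accuracy(preds) -> float:
    return 100 * sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds) -> float:
    # A prediction scores only if answer AND citation are both correct.
    hits = sum(p.answer_correct and citation_correct(p) for p in preds)
    return 100 * hits / len(preds)


preds = [
    # right answer, right source
    Prediction(True, frozenset({"p3:table1"}), frozenset({"p3:table1"})),
    # right answer, wrong source: counts for accuracy, not for SAA
    Prediction(True, frozenset({"p7:para2"}), frozenset({"p3:table1"})),
    # right answer, cites part of the annotated evidence
    Prediction(True, frozenset({"p5:fig2"}), frozenset({"p5:fig2", "p6:para1"})),
    # wrong answer, right source
    Prediction(False, frozenset({"p1:para1"}), frozenset({"p1:para1"})),
]

print(answer_accuracy(preds))             # 75.0
print(strict_attributed_accuracy(preds))  # 50.0
```

The second prediction is the "attribution hallucination" case: it lifts answer accuracy but contributes nothing to the strict score.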

The best performer, Gemini-3.1-Pro-Preview, scored just 76 out of 100. GPT-5.4 often knew the right answer but couldn't show its work: 87.1 for raw answer quality, just 59 once correct citations were required.

Open-source models fared much worse. Qwen3-VL-235B-A22B, the strongest freely available system, managed 22.5 points. Smaller open models mostly landed below 10, making them "extremely risky" for regulated industries, the researchers say.

Most models can't even land on the right page

Many models already fail at the page level. The Gemini 3 series finds the correct page in over 87 percent of cases, while Qwen3-VL-235B-A22B manages just under 58 percent. Harder tasks make things worse. Single-document questions still go reasonably well, but when a model has to combine information from multiple documents, recall for Gemini 3.1 Pro Preview drops from around 69 to 55 percent.
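As a rough illustration of why multi-document questions are harder on recall, the sketch below uses the standard set-based definition: the share of annotated evidence locations that the model actually cited. The `(document, page)` tuples and the function names are hypothetical, and the benchmark may define its recall figures differently.

```python
def evidence_recall(cited: set, gold: set) -> float:
    """Share of annotated evidence locations that the model cited."""
    if not gold:
        raise ValueError("gold evidence must not be empty")
    return len(cited & gold) / len(gold)


def page_hit(cited: set, gold: set) -> bool:
    """True if at least one cited location matches an annotated one."""
    return bool(cited & gold)


# Single-document question: one evidence page, and the model finds it.
single_gold = {("doc_a", 4)}
single_cited = {("doc_a", 4)}

# Multi-document question: evidence sits in two documents,
# but the model only cites one of them.
multi_gold = {("doc_a", 4), ("doc_b", 12)}
multi_cited = {("doc_a", 4)}

print(evidence_recall(single_cited, single_gold))  # 1.0
print(evidence_recall(multi_cited, multi_gold))    # 0.5
print(page_hit(multi_cited, multi_gold))           # True
```

Every additional document adds evidence the model can miss, so recall can fall even when the model still lands on one correct page.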

Radar chart of SAA scores for five models across seven document types, peaking at 85.0 points for Academic Tech and bottoming out at 63.3 points for Publishing & Media.
Academic papers with clean, standard layouts score best. Newspapers and magazines with messy layouts cap even the top models at around 63 points. | Image: Ma et al.

Math tasks do fairly well because the logic demands obvious evidence. Things fall apart when a model first has to spot a document element by its color, position, or heading, then figure out what it means. Academic papers with tidy layouts score best. Newspapers and magazines with busy designs hold even the top models to around 63 points.

Locating the source is the bottleneck

In an ablation study, the researchers narrowed the search space on purpose, feeding models only the relevant pages or the right document. Scores jumped sharply, by more than 13 points for Qwen3-VL-8B.

The not-so-surprising takeaway: models that know where to look also give better answers. Accurate source information directly improves answer quality and is not just about transparency. This also points to why context engineering matters so much: an AI model performs best when it gets exactly the information it needs for the task.

Line chart with evidence quality on the x-axis and answer accuracy on the y-axis for five models; accuracy tends to rise with better evidence quality.
The more precisely a model pinpoints its source, the better its answers get. Good citations aren't just about trust - they boost accuracy too. | Image: Ma et al.

The researchers posted their code and details on GitHub, and the dataset is up for download on Hugging Face.

A different benchmark from the same group, the Shanghai AI Laboratory, showed back in 2024 that language models struggle with long documents across the board. Their bilingual NeedleBench tests how well models retrieve relevant information from lengthy English and Chinese texts, with similarly grim results.

Google DeepMind goes after a related problem with FACTS Grounding, which measures whether answers come strictly from the provided document or whether the model sneaks in outside knowledge. Even Gemini 3 Pro and GPT-5.1 don't come close to reliable scores.

OpenAI recently looked at why models guess instead of saying "I don't know." In an analysis, the company framed hallucinations as a systemic incentive problem. Training and evaluation reward confident answers and punish hedging. That same dynamic likely fuels the "attribution hallucination" that CiteVQA now catches in source citations.
