How we evaluate our LLM judge: a perturbation-based approach
abeinstein · 2026-06-17 · via Hacker News - Newest: "LLM"

The problem

LLM-as-judge systems have a well-known failure mode: they tend to be overzealous, unnecessarily flagging correct answers. In a clinical context this is a serious problem, as every extraneous flag leads to additional steps that could delay a patient receiving a critical medication.

Insurance plans require prior authorizations (PAs) to decide whether to cover specialty medications for conditions like autoimmune diseases, cancer, and multiple sclerosis, based on detailed questions providers answer about patients’ clinical history. At Forus, we developed an LLM judge to verify generated PA answers against patient clinical records and specialty-specific guidance, ensuring claims are grounded in medical evidence. Because our platform manages PA submission end-to-end, the calibration of this judge is critical—a hand-wavy "it looks good on a few examples" approach would be an insufficient evaluation strategy.

This over-flagging tendency is consistent with industry findings: LLM judges tend to lean conservative when verifying claims against evidence, especially in incomplete or ambiguous cases [1]. Our system faced a calibration problem. A high false positive rate means the judge can't be trusted to act autonomously, since every flagged case requires human intervention. Calibrating against that bias became our focus.

However, standard eval options were all unsatisfying: human auditing of live outputs is slow and expensive, using another LLM to judge the judge is circular, and academic benchmarks are insufficient for the level of intricacy we see. Popular medical QA benchmarks like MedQA [2], PubMedQA [3], and the MultiMedQA [4] suite evaluate medical knowledge through USMLE-style multiple-choice questions and biomedical research abstracts. These are useful signals for general clinical reasoning, but don’t exactly align with answering PA form questions against a specific patient's medical history. Emerging work like MedAgentBench [5] is more targeted, and evaluates LLM agents in simulated EHR environments. Still, no standardized benchmark exists for PA question answering.

We needed something we could run during development (fast, repeatable, grounded in real clinical data), not something we'd discover during a production incident. Fortunately, we'd already done a lot of the hard work, building a rigorous gold standard dataset for our PA evals through a multi-annotator consensus workflow followed by expert clinical review, sampled to be representative of our real data distribution across drugs, payers, and question types. That investment gave us high-quality ground truth, but we still needed to figure out how to use it to evaluate the judge.

How the judge works

Our verification pipeline has four stages, adapted for the PA domain from the decompose-retrieve-judge paradigm used in factual verification work like VeriFact [6] and FActScore [7].


Atomic decomposition

PA form questions often ask the provider to confirm multiple clinical conditions in a single field: a diagnosis, the therapies tried, the patient's response to each, and the timing of those trials. We break each answer into atomic claims, the smallest standalone statements that can be independently checked.


Query expansion and retrieval

Clinical language is full of synonyms and phrasing variations: a claim about "atopic dermatitis" needs to match a chart note saying "eczema," and "inadequate response" might be documented as "failed therapy" or "discontinued due to lack of efficacy." We expand each claim into multiple semantic search queries before retrieving relevant evidence from the patient's record.
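To make the expansion step concrete, here is a minimal sketch. It assumes a small hand-written synonym table (`SYNONYMS`) and a hypothetical helper `expand_claim`; neither comes from the pipeline described here, which uses semantic search queries and may generate variants quite differently.

```python
# Hypothetical synonym table for illustration only. A real system would more
# likely rely on an LLM or a clinical vocabulary than a hand-written dict.
SYNONYMS = {
    "atopic dermatitis": ["eczema"],
    "inadequate response": [
        "failed therapy",
        "discontinued due to lack of efficacy",
    ],
}


def expand_claim(claim: str) -> list[str]:
    """Return the original claim plus one variant per known synonym."""
    queries = [claim]
    lowered = claim.lower()
    for term, alternatives in SYNONYMS.items():
        if term in lowered:
            for alt in alternatives:
                # Each variant swaps exactly one term, keeping the rest intact.
                queries.append(lowered.replace(term, alt))
    return queries


queries = expand_claim("Patient has atopic dermatitis")
```

Each query would then be sent to retrieval separately, and the evidence sets merged before the judgment step.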


Claim-level judgment

For each claim, the judge produces one of three verdicts.

  • Supported: the evidence affirms the claim.
  • Not Supported: the claim conflicts with the evidence, either through direct contradiction or through a significant omission or fabrication.
  • Not Addressed: the evidence is silent on the question. This category is critical because healthcare records are routinely incomplete, and a missing lab result doesn't mean the lab was never drawn. Collapsing silence into "not supported" would make the judge unfairly punitive.


Question-level aggregation

Claim verdicts roll up to an answer-level result using worst-verdict aggregation, so if any claim is unsupported, the whole answer is flagged.
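Worst-verdict aggregation can be sketched in a few lines. The ordering below (Supported < Not Addressed < Not Supported) and the rule that only Not Supported triggers a flag are assumptions made for illustration; the text only states that any unsupported claim flags the whole answer.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Higher value = worse. Where NOT_ADDRESSED sits is an assumption.
    SUPPORTED = 0
    NOT_ADDRESSED = 1
    NOT_SUPPORTED = 2


def aggregate(claim_verdicts: list[Verdict]) -> Verdict:
    """Roll claim-level verdicts up to a single answer-level verdict."""
    if not claim_verdicts:
        raise ValueError("an answer must decompose into at least one claim")
    return max(claim_verdicts)


def is_flagged(answer_verdict: Verdict) -> bool:
    # Assumption: only NOT_SUPPORTED flags the answer for review.
    return answer_verdict is Verdict.NOT_SUPPORTED


answer = aggregate(
    [Verdict.SUPPORTED, Verdict.NOT_ADDRESSED, Verdict.NOT_SUPPORTED]
)
```

Taking the maximum means a single bad claim dominates any number of supported ones, which matches the conservative stance the aggregation rule is meant to express.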

The eval idea: perturbation

Our gold standard tells us whether the judge correctly approves accurate answers. If we run the judge over a set of correct PA answers and count how often it flags them anyway, we get the false positive rate. A high rate makes the judge unhelpful. But the more dangerous failure is the opposite—a judge that misses a real error, quietly validating a claim the evidence doesn't support, can affect a patient's access to medication and undermine trust in the process.

However, gold standards are built to show what right looks like. We needed the opposite: examples of what wrong looks like, but on the same patients, with the same evidence. Our idea was to take a subset of correct answers and synthetically perturb them into plausible-but-wrong versions, then evaluate whether the judge flags the errors. This is essentially mutation testing but for LLM evals: we introduce known bugs and check if our tests catch them.

This gives us two evaluation runs per gold standard PA, with a design resembling a randomized controlled trial: the clinical data is identical across both runs, and only the PA form answers change. Any difference in judge behavior between the runs is attributable to the answer itself, not to noise in the underlying record. The clean run measures the false positive rate; the perturbed run measures the detection rate. We review both, but the clean-run false positive rate is the one we anchor on, because detection rate is only meaningful once we know the judge isn't excessively flagging.
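The two metrics reduce to the same computation over different runs. The sketch below uses made-up boolean flags purely to show the arithmetic; the values are toy inputs, not results from the evaluation described here, and `flag_rate` is a hypothetical helper.

```python
def flag_rate(flags: list[bool]) -> float:
    """Fraction of judged answers that were flagged."""
    if not flags:
        raise ValueError("need at least one judged answer")
    return sum(flags) / len(flags)


# Clean run: every answer is known to be correct, so any flag is a false positive.
clean_flags = [False, False, True, False]  # toy data
false_positive_rate = flag_rate(clean_flags)

# Perturbed run: restricted to answers that were deliberately made wrong,
# so every flag here is a successful detection.
perturbed_flags = [True, True, False, True]  # toy data
detection_rate = flag_rate(perturbed_flags)
```

Because both runs share the same patient records, the two rates can be compared directly without worrying that one run happened to draw harder charts.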

A concrete example

Consider a PA for a biologic drug where one question asks for the patient's diagnosis and another asks for the supporting clinical features. The original answers list "Crohn's disease" as the diagnosis, alongside features like strictures, fistulas, and transmural inflammation, which are characteristic of Crohn's. We perturb the diagnosis to "ulcerative colitis" (UC) but leave the supporting features untouched. The judge flags two things. First, the perturbed diagnosis: the chart doesn't support UC, so this is the error we introduced and expected to catch. But the judge also flags the unchanged supporting-features answer, because strictures and fistulas point to Crohn's, not UC, so the features no longer fit the diagnosis they're meant to support. One perturbation, two flags, and the second one is, interestingly, a behavior we hadn't explicitly designed the judge to produce.

Synthetic perturbations are cheap to produce, so we can run a proper train/test split: tune prompts on one set, validate on a held-out set. Hand-labeled eval data usually forces a choice between a tiny test set or no test set at all; synthetic perturbation sidesteps the tradeoff.
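One way to make such a split reproducible is to assign each PA to a set by hashing its identifier, so the clean and perturbed versions of the same PA always land on the same side. This is an assumption about how a split could be done, not a description of the actual setup; `split_for`, the ID format, and the 20% test fraction are all hypothetical.

```python
import hashlib


def split_for(pa_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a PA to 'train' or 'test' by hashing its ID."""
    digest = hashlib.sha256(pa_id.encode("utf-8")).digest()
    # Map the first 4 bytes to [0, 1); 32 bits is exactly representable
    # as a float, so the bucket can never round up to 1.0.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "test" if bucket < test_fraction else "train"


assignment = split_for("pa-001")
```

Splitting at the PA level rather than the answer level avoids tuning prompts on one perturbation of a chart and then validating on another perturbation of the same chart.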

The hard part: making good perturbations

Our initial instinct was to start simple by flipping yes-no answers, swapping multiple-choice options, and using regular expressions to alter free-text answers. But naive perturbations only test robustness to shallow changes, not clinical judgment. The harder problem became clear: how do you generate plausible wrong answers, ones that appear sound but are factually inaccurate for a specific patient? This requires the model to invent errors that respect the surrounding clinical context, including the patient's diagnoses, medication history, and therapies actually tried. An error that is incoherent with the rest of the chart is too easy to catch, so it says little about how the judge handles realistic mistakes.


Our approach was to build a perturbation model, seeded with structured context from the patient's record, so its wrong answers stay anchored to the patient's actual medical history. For free-text questions, the model generates a plausible wrong answer. For multiple-choice questions, it selects a different option that's clinically coherent in context but factually unsupported for the patient.


Three failure modes surfaced during this work, each one teaching us something about what makes a perturbation actually useful:


Together these constraints shaped the perturbation design: definitive assertions only, specific and falsifiable claims, no attestations, and around 30% of questions perturbed per form. That last number is a calibration: enough perturbations to measure detection cleanly, few enough that the form still resembles a realistic submission with isolated errors rather than a uniformly suspicious document.
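The 30% budget can be sketched as a seeded sample over a form's questions. The function name, the rounding rule, and the minimum of one perturbation per form are assumptions for illustration; the text only gives the approximate fraction.

```python
import random


def choose_questions_to_perturb(
    question_ids: list[str], fraction: float = 0.3, seed: int = 0
) -> list[str]:
    """Pick roughly `fraction` of a form's questions to perturb."""
    if not question_ids:
        return []
    rng = random.Random(seed)  # seeded so the eval set is reproducible
    # Assumption: always perturb at least one question per form.
    k = max(1, round(len(question_ids) * fraction))
    return sorted(rng.sample(list(question_ids), k))


form_questions = [f"q{i}" for i in range(10)]
to_perturb = choose_questions_to_perturb(form_questions)
```

The caller is expected to pass only questions that are eligible for perturbation; any filtering would happen before this step.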

Wrapping up

For systems with real consequences, evaluating an LLM judge takes more than spot-checks and confusion matrices. The alternative we built taught us two things worth highlighting.

The first is that a single gold standard can do double duty. By generating known-wrong answers anchored to real patient records, we got both directions of measurement out of an asset we'd already built, without paying twice for annotation.

The second is that a well-designed eval doesn't just measure a model, it surfaces what the model is doing that you didn't ask it to do. The Crohn's-to-UC perturbation earlier is one example of a broader pattern, where one perturbation causes the judge to flag a different, unperturbed answer because the two no longer fit together. In production, real errors often cascade through related questions, and we want the judge to catch that. In eval, we have to separate direct detection (the judge flags the answer we perturbed) from consistency-based detection (the judge flags a different, unperturbed answer because it conflicts with the perturbed one), since conflating them would overstate how well the judge catches the specific errors we introduced.
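The separation can be expressed with set operations over question IDs. The sketch below also uses the clean run to set aside flags that would have fired anyway; that extra step, the function name, and the bucket labels are assumptions rather than the procedure used here.

```python
def classify_flags(
    flagged_in_perturbed_run: set[str],
    flagged_in_clean_run: set[str],
    perturbed: set[str],
) -> dict[str, set[str]]:
    """Split flags from the perturbed run into detection categories."""
    unperturbed_flags = flagged_in_perturbed_run - perturbed
    return {
        # The judge flagged an answer we deliberately made wrong.
        "direct": flagged_in_perturbed_run & perturbed,
        # The judge flagged an untouched answer that was fine in the clean run.
        "consistency": unperturbed_flags - flagged_in_clean_run,
        # Flagged even without any perturbation: an ordinary false positive.
        "baseline_false_positive": unperturbed_flags & flagged_in_clean_run,
        # A perturbed answer the judge let through.
        "missed": perturbed - flagged_in_perturbed_run,
    }


# The Crohn's-to-UC example: one perturbation, two flags.
result = classify_flags(
    flagged_in_perturbed_run={"diagnosis", "supporting_features"},
    flagged_in_clean_run=set(),
    perturbed={"diagnosis"},
)
```

Under this accounting, only the `direct` bucket counts toward the detection rate for introduced errors, while `consistency` is tracked as its own signal.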

The harder problem, the one we're still working on, is the gap between synthetic errors and real ones. Production mistakes are subtler, more contextual, and more idiosyncratic than what any perturbation model generates today.

We're exploring a few directions to close that gap. One is mining production data for confirmed errors and using them to seed more realistic perturbations. Another is building an error taxonomy from real cases so that perturbations cover the actual distribution of mistakes rather than the ones a model finds easy to invent. A third is adversarial generation, where the perturbation model is trained to produce errors the judge misses, so that the eval gets harder as the judge gets better.

There's also a deeper question we don't have a clean answer to yet: how do you know when an eval is good enough to trust? Detection rate on synthetic perturbations and agreement with human reviewers on production cases are both proxies. Neither represents what we actually care about, which is whether the judge is making the right call on the next patient's PA. We're increasingly convinced this isn't a problem any single eval solves, and that the right answer involves layering signals, with development-time evals like this one, production monitoring, and ongoing clinical review each catching what the others miss.

If you're thinking about the same problems, we'd love to compare notes.