Fight Evils with Evals!

Dan Levy · 2026-06-18 · via Hacker News - Newest: "LLM"

Benchmarks measure benchmarks. Your system needs its own measures.

Every new model arrives wearing a tuxedo of benchmarks.

MMLU: 92.4%. HumanEval: 87.2%. LLeMU: 88.7%. MATH: 73.6%. AGI: 127%!

Yet, for 99% of businesses building process & product with AI, none of it matters.

What matters? How are YOUR workloads doing? Getting better or worse? The only sane way to know that is to write Evals (tests for LLMs) that reflect the specific tasks, data, and failure modes of your system.

The benchmarks are not lying. They are answering someone else’s question.

What “Vibes-Based Evaluation” Actually Costs

The standard approach: ship a model change, watch the complaint channels, roll back if the room gets loud.

That misses almost everything interesting:

You only catch loud failures. Users who get a confidently wrong answer and don’t realize it? Silent. Users who get a worse answer and abandon the feature? Silent. Support tickets and error rates capture only a fraction of quality regressions.

You can’t distinguish regressions from improvements. If the new model is better at task A and worse at task B, complaints about B look identical to generic “the AI got worse” feedback. You don’t know what to fix.

You’re using your users as test infrastructure. They didn’t sign up for that.

The Eval Spectrum (and Where Most Teams Get It Wrong)

Evaluation approaches sit on a spectrum from “fast but flimsy” to “expensive but valid.”

Figure: a spectrum comparing deterministic checks, LLM-as-judge, and human evaluation by speed, cost, and validity. Use the cheapest evaluation method that can honestly catch the failure.

LLM-as-judge is the current darling: ask a powerful model to grade another model’s outputs. Fast, scalable, cheap. The problem: it bakes in the grader model’s biases, can be gamed, and creates a circular dependency. If you use GPT-5 to grade GPT-5’s outputs, you’re measuring something like “how much does GPT-5 agree with GPT-5.” That’s not nothing, but it’s not what you think.

Human eval is the gold standard everyone tries to skip. Getting humans to evaluate outputs is expensive, slow, inconsistent across evaluators, and annoying to schedule. But it is the only thing that validates whether your system is useful to real humans.

Task-specific automated checks are where most teams should spend more time. They are not glamorous, but they are fast, deterministic, and tied to what matters in your system.

What Actually Works

1. Define Failure Before You Ship

Before changing a model or prompt, write down what bad looks like. Specifically.

Not “the output should be accurate.” That’s not a test. More like:

Structured JSON output must parse without errors
All citations in the response must appear verbatim in the retrieved context
Responses must not mention competitor product names
SQL queries must be syntactically valid and reference only tables that exist in the schema
Sentiment classification must not flip from positive to negative more than 3% of the time on the existing test set

You can check these programmatically. No judge model required.

Eval harness: deterministic checks

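type EvalResult = { passed: boolean; reason?: string };
// Context handed to every check; retrievedChunks holds the passages given to the model.
type EvalContext = { retrievedChunks: string[] };
// Assumed helper, not defined in this article: pulls cited passages out of the output.
declare function extractCitations(output: string): string[];
const evals: Record<string, (output: string, context: EvalContext) => EvalResult> = {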
  // JSON must parse
  validJson: (output) => {
    try {
      JSON.parse(output);
      return { passed: true };
    } catch (e) {
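      // Under strict settings the caught value is `unknown`, so narrow it before reading .message
      return { passed: false, reason: `Invalid JSON: ${e instanceof Error ? e.message : String(e)}` };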
    }
  },
  // No hallucinated citations — every claim must appear in context
  groundedCitations: (output, { retrievedChunks }) => {
    const claims = extractCitations(output);
    const ungrounded = claims.filter(
      (claim) => !retrievedChunks.some((chunk) => chunk.includes(claim))
    );
    return ungrounded.length === 0
      ? { passed: true }
      : { passed: false, reason: `Ungrounded claims: ${ungrounded.join(', ')}` };
  },
  // Response length sanity check — catch truncation or runaway generation
  reasonableLength: (output) => {
    // trim + filter so empty output counts as 0 words rather than 1
    const words = output.trim().split(/\s+/).filter(Boolean).length;
    return words >= 10 && words <= 2000
      ? { passed: true }
      : { passed: false, reason: `Word count ${words} out of bounds` };
  },
};
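Two checks from the list above are not covered by this harness: competitor mentions and the sentiment flip rate. A minimal sketch of both follows. It assumes a hypothetical competitor list, and it reads "3% of the time on the existing test set" as flips divided by the total number of test cases. The names `COMPETITORS`, `noCompetitorMentions`, `sentimentFlipRate`, and `sentimentStable` are illustrative and not from any library.

```typescript
type EvalResult = { passed: boolean; reason?: string };
type Sentiment = 'positive' | 'negative' | 'neutral';

// Hypothetical competitor names; replace with your own list.
const COMPETITORS = ['Acme Assist', 'Globex AI'];

// Escape regex metacharacters so names are matched literally.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Fails if any competitor name appears as a standalone phrase, case-insensitively.
function noCompetitorMentions(output: string, names: string[] = COMPETITORS): EvalResult {
  const found = names.filter((name) =>
    new RegExp(`(^|\\W)${escapeRegExp(name)}($|\\W)`, 'i').test(output)
  );
  return found.length === 0
    ? { passed: true }
    : { passed: false, reason: `Mentions competitors: ${found.join(', ')}` };
}

// Share of test cases where the baseline said positive and the candidate says negative.
function sentimentFlipRate(baseline: Sentiment[], candidate: Sentiment[]): number {
  if (baseline.length !== candidate.length) {
    throw new Error('Baseline and candidate label arrays must be the same length');
  }
  if (baseline.length === 0) return 0;
  const flips = baseline.filter(
    (label, i) => label === 'positive' && candidate[i] === 'negative'
  ).length;
  return flips / baseline.length;
}

// Assumed threshold of 3%, taken from the failure definition above.
function sentimentStable(
  baseline: Sentiment[],
  candidate: Sentiment[],
  maxFlipRate = 0.03
): EvalResult {
  const rate = sentimentFlipRate(baseline, candidate);
  return rate <= maxFlipRate
    ? { passed: true }
    : { passed: false, reason: `Flip rate ${(rate * 100).toFixed(1)}% exceeds limit` };
}
```

The flip-rate check differs from the others in that it runs over a whole labeled test set rather than a single output, so it likely belongs in a batch step rather than the per-response `evals` map.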

2. Build a Golden Set From Your Worst Days

Your best evaluation data is the embarrassing stuff: the outputs that made someone file a ticket, screenshot a hallucination, or quietly stop using the feature.

Every time a user reports a bad output, flags a hallucination, or you notice a failure manually, add it to your golden set: the input, the context, and the correct behavior. Keep 50-100 cases and run them on every model change.

This feels manual at first. After six months, you have a test suite no public benchmark can game, because every case came from your own failure history.

Figure: a workflow in which bad production incidents become golden cases, then CI eval runs, then blocked regressions or approved releases. A golden set turns the embarrassing stuff into a regression suite.

Golden case shape

interface GoldenCase {
  id: string;
  input: string;
  context: Record<string, unknown>;
  expectedBehavior: {
    mustContain?: string[];
    mustNotContain?: string[];
    structureCheck?: (output: string) => boolean;
    minSimilarityToReference?: number; // cosine similarity to a reference answer
  };
  sourceIncident?: string; // link back to the bug report or ticket
}
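A golden case is only useful if something executes it. One way to do that is sketched below, on the assumption that `mustContain` and `mustNotContain` are case-insensitive substring checks. The function name `checkGoldenCase` is hypothetical, and `minSimilarityToReference` is left out because it needs an embedding model.

```typescript
// Mirrors the expectedBehavior shape of GoldenCase, minus the similarity threshold.
interface ExpectedBehavior {
  mustContain?: string[];
  mustNotContain?: string[];
  structureCheck?: (output: string) => boolean;
}

interface CaseResult {
  passed: boolean;
  failures: string[]; // one human-readable line per violated expectation
}

function checkGoldenCase(output: string, expected: ExpectedBehavior): CaseResult {
  const haystack = output.toLowerCase();
  const failures: string[] = [];

  for (const needle of expected.mustContain ?? []) {
    if (!haystack.includes(needle.toLowerCase())) {
      failures.push(`missing required text: "${needle}"`);
    }
  }
  for (const needle of expected.mustNotContain ?? []) {
    if (haystack.includes(needle.toLowerCase())) {
      failures.push(`contains forbidden text: "${needle}"`);
    }
  }
  if (expected.structureCheck && !expected.structureCheck(output)) {
    failures.push('structure check failed');
  }
  return { passed: failures.length === 0, failures };
}

// Usage: a refund-policy case of the kind a support ticket might produce (made-up data).
const refundCase = checkGoldenCase('Refunds are available within 30 days.', {
  mustContain: ['30 days'],
  mustNotContain: ['lifetime guarantee'],
  structureCheck: (out) => out.trim().endsWith('.'),
});
```

Returning the list of failures rather than a bare boolean keeps the report readable when a case breaks. The `passed` field is the value a pass/fail comparison would consume.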

3. Regression Testing, Not Just Acceptance Testing

Most teams run evals only when considering a model change. That’s acceptance testing: “is this new thing good enough?”

You also need regression testing: “did this break something that used to work?”

Run your golden set on every prompt change, not just model changes. A prompt that was working fine can silently degrade when you add a new tool, change a RAG retrieval strategy, or update your context template. You won’t know without a baseline. Tools like Langfuse attach eval scores to production traces so regressions show up in dashboards, not just in incident reports.

Eval harness: baseline vs candidate comparison

async function compareModelVersions(
  goldenCases: GoldenCase[],
  baselinePipeline: Pipeline,
  candidatePipeline: Pipeline
) {
  const results = await Promise.all(
    goldenCases.map(async (tc) => {
      const [baseline, candidate] = await Promise.all([
        baselinePipeline.run(tc.input, tc.context),
        candidatePipeline.run(tc.input, tc.context),
      ]);
      // runEvals is assumed to return true when the output satisfies expectedBehavior
      const baselinePassed = runEvals(baseline, tc.expectedBehavior);
      const candidatePassed = runEvals(candidate, tc.expectedBehavior);
      return {
        id: tc.id,
        baselinePassed,
        candidatePassed,
        regression: baselinePassed && !candidatePassed,
        improvement: !baselinePassed && candidatePassed,
      };
    })
  );
  const regressions = results.filter((r) => r.regression);
  const improvements = results.filter((r) => r.improvement);
  console.log(`Regressions: ${regressions.length} / ${goldenCases.length}`);
  console.log(`Improvements: ${improvements.length} / ${goldenCases.length}`);
  if (regressions.length > 0) {
    console.error('Blocking regressions found:');
    regressions.forEach((r) => console.error(` - ${r.id}`));
  }
  return { regressions, improvements };
}

If a candidate regresses on known failures, the upgrade conversation gets wonderfully specific: which cases improved, which cases broke, and whether the trade is worth it.
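To make that trade explicit in CI, the comparison rows can be reduced to a single approve or block decision. The sketch below assumes a policy where any regression blocks the release unless its case id has been explicitly waived. `gateRelease` and the waiver list are hypothetical and not part of the harness above.

```typescript
interface ComparisonRow {
  id: string;
  regression: boolean;
  improvement: boolean;
}

interface GateDecision {
  approved: boolean;
  blocking: string[]; // case ids that regressed and were not waived
  summary: string;
}

function gateRelease(rows: ComparisonRow[], waivedIds: string[] = []): GateDecision {
  const waived = new Set(waivedIds);
  const regressions = rows.filter((r) => r.regression).map((r) => r.id);
  const improvements = rows.filter((r) => r.improvement).length;
  // A waived regression is still counted in the summary, but it does not block.
  const blocking = regressions.filter((id) => !waived.has(id));
  return {
    approved: blocking.length === 0,
    blocking,
    summary: `${improvements} improved, ${regressions.length} regressed, ${blocking.length} blocking`,
  };
}
```

A CI job could fail the build whenever `approved` is false. Requiring a named waiver per case forces someone to record that the trade was accepted on purpose.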

4. Use LLM-as-Judge for Exactly One Thing

LLM-as-judge is useful for open-ended outputs where there is no deterministic right answer: “is this response helpful?”, “does this summary preserve the key points?”, “is this explanation right for a beginner?”

Use it there. Don’t use it where a deterministic check can give you the answer. When you do use it, make the grading rubric explicit:

Eval harness: rubric-based judge

async function judgeHelpfulness(
  userQuery: string,
  modelResponse: string
): Promise<{ score: number; reasoning: string }> {
  const judgePrompt = `
You are evaluating a customer support response.
User question: ${userQuery}
Response: ${modelResponse}
Rate the response on a scale of 1-5:
5 = Directly answers the question with accurate, actionable information
4 = Answers the question but could be more specific or actionable
3 = Partially addresses the question; key information is missing
2 = Tangentially related but doesn't answer the question
1 = Off-topic, factually wrong, or harmful
Respond with JSON: {"score": <number>, "reasoning": "<one sentence>"}
`;
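  const result = await judgeModel.generate(judgePrompt);
  // Judges can return malformed JSON or out-of-range scores; fail loudly instead of trusting them.
  const parsed = JSON.parse(result) as { score?: unknown; reasoning?: unknown };
  if (typeof parsed.score !== 'number' || parsed.score < 1 || parsed.score > 5) {
    throw new Error(`Judge returned an invalid score: ${result}`);
  }
  return { score: parsed.score, reasoning: String(parsed.reasoning ?? '') };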
}

An explicit rubric reduces evaluator variance, gives you interpretable output, and makes it easier to audit when the judge is wrong. Libraries like Autoevals and Braintrust ship prebuilt rubrics for common tasks — worth stealing before writing your own from scratch.
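Auditing the judge means comparing its scores with human scores on the same outputs. A sketch of that comparison follows, assuming a small sample that humans have already graded on the same 1-5 rubric. The `JudgedSample` shape and the `auditJudge` function are illustrative, and the gap of 2 used to flag a disagreement is an arbitrary choice.

```typescript
interface JudgedSample {
  id: string;
  judgeScore: number; // 1-5 from the judge model
  humanScore: number; // 1-5 from a human reviewer
}

interface JudgeAudit {
  exactAgreement: number; // fraction of samples with identical scores
  withinOne: number; // fraction where the scores differ by at most 1
  meanAbsError: number; // average absolute score difference
  worstDisagreements: string[]; // ids with a gap of 2 or more, largest gap first
}

function auditJudge(samples: JudgedSample[]): JudgeAudit {
  if (samples.length === 0) {
    throw new Error('Need at least one human-graded sample');
  }
  const gaps = samples.map((s) => ({
    id: s.id,
    gap: Math.abs(s.judgeScore - s.humanScore),
  }));
  const n = gaps.length;
  return {
    exactAgreement: gaps.filter((g) => g.gap === 0).length / n,
    withinOne: gaps.filter((g) => g.gap <= 1).length / n,
    meanAbsError: gaps.reduce((sum, g) => sum + g.gap, 0) / n,
    worstDisagreements: gaps
      .filter((g) => g.gap >= 2)
      .sort((a, b) => b.gap - a.gap)
      .map((g) => g.id),
  };
}
```

The ids in `worstDisagreements` are the cases worth reading by hand, since that is where either the rubric is ambiguous or the judge is wrong.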

You don’t have to build all of this from scratch. Several tools have made serious progress on the eval infrastructure problem:

Braintrust — Full eval platform with experiment tracking, dataset management, and scoring functions. Organizes eval runs by prompt, model, and deployment so you can diff quality over time, not just across releases. Pairs well with their open-source Autoevals library, which ships prebuilt model-graded scoring functions for common tasks (factual accuracy, helpfulness, toxicity, semantic similarity).

Langfuse — Open-source LLM observability that sits between your app and your models. Traces every call, attaches eval scores (human or automated) to individual spans, and surfaces quality trends over production traffic. Good choice if you want observability and evals in the same tool rather than a separate eval harness.

Evalite — TypeScript-native eval framework by Matt Pocock. Low ceremony: define a task, define a scorer, run it in your existing test setup. Targets teams who want evals that feel like unit tests rather than a separate ML experiment platform.

promptfoo — CLI-first eval runner focused on prompt comparison and red-teaming. Easy to configure via YAML, integrates with most model providers, and has built-in support for detecting prompt injection and other adversarial inputs.

deepeval — Python eval framework with a large library of built-in metrics (G-Eval, RAG faithfulness, answer relevancy, hallucination detection). Useful for RAG pipelines where you want specific grading for retrieval quality, not just generation quality.

The right tool depends on your stack and where you’re starting from. What matters more than the choice of framework is the discipline of running evals at all — consistently, on every significant change.

The Uncomfortable Part

Most teams skip this because it asks an irritating question early: what would “good” look like here?

That is genuinely hard for a new AI feature. It is also non-optional if you care about reliability. Teams that ship trustworthy AI are doing the same thing they’d do for any critical code path: define expected behavior, test it, and run those tests continuously.

The benchmarks are not lying. They are answering someone else’s question. Stop reading them as product roadmaps and start writing tests that match your system.

Your users will notice before your dashboards do. Build the test suite first.
