GitHub - MatteoLeonesi/claim-memory-graph-sdk: A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.
ML0037 · 2026-06-15 · via Hacker News: Show HN


I built CMG out of a practical need. As a PhD student studying how to evaluate AI systems, I sometimes use model-based graders in my experiments, which means relying on a language model as a judge. The problem is that those judges gave me too little control over their decisions and too little visibility into them: you cannot tell whether the judge actually checked your criteria or simply ignored the evidence you gave it. CMG tries to close that gap by making the judge back up each verdict with explicit claims and tying every claim to the evidence behind it. A set of plain checks then flags the cases where the verdict does not hold up, without putting a second model in the loop. It will not tell you who is right, but it will tell you which verdicts you can "trust" and which ones a person should read.

Why

LLM judges are useful, but they are not neutral. Researchers keep finding the same failure modes.

  • Zheng et al. report position bias, verbosity bias, self-enhancement bias, and limited reasoning.
  • Li et al. show scoring bias from rubric order, score ids, and reference answer scoring.
  • Feng et al. show that explicit rubrics and criteria can help judge consistency, but do not solve it.
  • Wang et al. show weak evidence verification in research-agent judging.
  • Chen et al. show reliability gaps for long-form outputs, even when rubrics or references are present.

CMG does not pretend to fix these biases, but it does make them easy to spot. You tell the judge what to check by passing the task, the answer, an optional reference, the rubric, and the criteria, and CMG saves all of that as evidence for the judge to make claims against. Each verdict then has to rest on real claims, and each claim has to point back to a piece of that evidence. When the judge cuts a corner, the viewer flags it, whether that is missing evidence, an ignored reference, a rubric item nobody checked, a bad verdict, or an unsafe verdict change.
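To make the evidence-and-claims idea concrete, here is a toy sketch of that kind of check. It is not CMG's implementation: the `Claim` class, the `audit` function, and the evidence ids (`reference`, `criterion:...`) are hypothetical names chosen for illustration, and only the flag names are borrowed from this README.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    refs: list = field(default_factory=list)  # ids of the evidence this claim cites


def audit(evidence, claims):
    """Return (flag, detail) pairs for one judged case. Toy logic, not CMG's own."""
    flags = []
    # A claim counts as supported only if it cites something and every ref exists.
    supported = [c for c in claims if c.refs and all(r in evidence for r in c.refs)]
    if not supported:
        flags.append(("no_supported_claims", None))
    cited = {r for c in supported for r in c.refs}
    # A reference answer was supplied but no supported claim points at it.
    if "reference" in evidence and "reference" not in cited:
        flags.append(("reference_ignored", "reference"))
    # Every criterion should be cited by at least one supported claim.
    for ev_id in evidence:
        if ev_id.startswith("criterion:") and ev_id not in cited:
            flags.append(("criterion_citation_gap", ev_id))
    return flags


# Hypothetical case: the judge cited the answer and one criterion only.
evidence = {
    "task": "Name the capital of France.",
    "answer": "Paris",
    "reference": "Paris",
    "criterion:correctness": "Correctness",
    "criterion:completeness": "Completeness",
}
claims = [Claim("The answer names the right city.", ["answer", "criterion:correctness"])]

print(audit(evidence, claims))
# [('reference_ignored', 'reference'), ('criterion_citation_gap', 'criterion:completeness')]
```

The point of the sketch is that every check is a plain lookup over stored evidence and cited ids, so no second model is needed to decide which cases deserve a human read.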

For now the local viewer is the dashboard.

cmg-view cmg-runs/*.cmg.jsonl --flagged-only

A web dashboard can read the same report data later.

When to use CMG

Use CMG when you run an LLM judge and cannot just trust what it says.

  • Large eval runs. You score thousands of cases and cannot read every explanation by hand, so CMG flags the ones that need a human and lets you skip the rest.
  • Reference checks. You want to catch a verdict that never cited the gold answer (reference_ignored).
  • Rubric coverage. You need every criterion checked, not quietly skipped (rubric_coverage_gap).
  • Audit and debugging. You want a replayable trail for each decision, so you can explain a score or work out why scores drift between runs.
  • Multi-turn judging. You need to catch a verdict that flipped without a proper retraction (verdict_flip_without_invalidation).

CMG will not tell you whether the judge is right, because that call still belongs to a person. What it does check is whether the judge backed its verdict, covered your rubric, and stayed consistent, and it points you at the cases where it did not.
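The multi-turn case can be pictured with a small sketch. This is a simplified illustration and not how CMG detects `verdict_flip_without_invalidation` internally: the turn dictionaries, their keys, and the function name `find_unsafe_flips` are all assumptions, and the rule used here (any claim still active when the verdict changes makes the flip unsafe) is deliberately stricter and cruder than a real check would need to be.

```python
def find_unsafe_flips(turns):
    """Return indexes of turns where the verdict changed while older claims stayed active.

    Each turn is a dict with "verdict", optional "claims" (ids added this turn),
    and optional "retracted" (ids withdrawn this turn). Hypothetical shape.
    """
    active = set()      # claim ids that have not been retracted yet
    previous = None     # verdict of the previous turn
    unsafe = []
    for index, turn in enumerate(turns):
        # Apply this turn's retractions before judging the flip.
        active -= set(turn.get("retracted", ()))
        verdict = turn["verdict"]
        if previous is not None and verdict != previous and active:
            unsafe.append(index)
        active |= set(turn.get("claims", ()))
        previous = verdict
    return unsafe


turns = [
    {"verdict": "pass", "claims": ["c1"]},
    {"verdict": "fail", "claims": ["c2"]},                               # flips, c1 still active
    {"verdict": "pass", "claims": ["c3"], "retracted": ["c1", "c2"]},    # flips after retracting
]
print(find_unsafe_flips(turns))  # [1]
```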

Install

pip install claim-memory-graph

Optional provider helpers:

pip install 'claim-memory-graph[openai]'
pip install 'claim-memory-graph[anthropic]'

The distribution is named claim-memory-graph, but you import it as cmg. The core package has no runtime dependencies.

Quickstart

Start with the local demo. It needs no API key.

python examples/local_judge_demo.py
cmg-view cmg-runs/*.cmg.jsonl --summary
cmg-view cmg-runs/*.cmg.jsonl --show-evidence
cmg-view cmg-runs/*.cmg.jsonl --flagged-only

The --summary view gives you the whole run at a glance.

[Screenshot: cmg-view --summary terminal output with the owl mascot, verdict bars, hard and soft flag counts, criteria coverage, and top review cases]

Once that runs, wire CMG into your own judge. You keep the main task and the rubric. CMG only adds the audit layer.

import asyncio
from pathlib import Path

from cmg import ClaimGraph, JsonlStorage, arun_judge, judge_report


async def judge_fn(messages):
    # Replace call_your_judge_model with your own provider call.
    return await call_your_judge_model(messages)


async def main():
    # "async with" is only valid inside a coroutine, so the run lives in main().
    async with ClaimGraph(JsonlStorage(Path("cmg-runs/case-1.cmg.jsonl"))) as graph:
        result = await arun_judge(
            graph,
            judge_fn,
            prompt="Question shown to the candidate model.",
            candidate_output="Candidate model answer.",
            reference_answer="Optional gold answer.",
            rubric="How the judge should decide.",
            criteria=("Correctness", "Completeness"),
            verdicts=("pass", "fail"),
        )

        # Build the report while the graph is still open.
        report = judge_report(graph)

    if result.decision is None:
        print("The judge returned a missing or invalid verdict.")
    else:
        print(result.decision.content)

    print(report["human_review_flags"])


asyncio.run(main())

What the judge must return

The judge's visible answer has to start with a verdict line.

It should also add a hidden CMG block with its claims.

```cmg
{"ops": [{"op": "commitment", "content": "The answer matches the reference.", "refs": ["s-..."]}]}
```

CMG records the final Decision itself, so if the model sends a decision op, arun_judge ignores it. And if the model returns maybe when only pass and fail are allowed, CMG records no decision and the report marks the case for human review.
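If you want to sanity-check a judge's raw output before handing it to CMG, a rough sketch like the one below can help. It is not CMG's parser: `extract_ops`, `accept_verdict`, and the regular expression are hypothetical, and the `s-1` ref is a made-up id. The sketch only mirrors two behaviours described above, namely that a `decision` op from the model is dropped and that a verdict outside the allowed list yields no decision.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence

# Hypothetical pattern: a fenced block tagged "cmg" that holds one JSON object.
CMG_BLOCK = re.compile(FENCE + r"cmg\s*\n(.*?)\n" + FENCE, re.DOTALL)


def extract_ops(response_text):
    """Return the ops from the first cmg block, minus any "decision" op."""
    match = CMG_BLOCK.search(response_text)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    # The final decision is recorded by the harness, so a model-sent one is ignored.
    return [op for op in payload.get("ops", []) if op.get("op") != "decision"]


def accept_verdict(verdict, allowed=("pass", "fail")):
    """Return the verdict if it is allowed, else None (meaning: needs human review)."""
    return verdict if verdict in allowed else None


block = json.dumps({"ops": [
    {"op": "commitment", "content": "The answer matches the reference.", "refs": ["s-1"]},
    {"op": "decision", "content": "pass"},
]})
sample = "pass\n" + FENCE + "cmg\n" + block + "\n" + FENCE

print(len(extract_ops(sample)))   # 1
print(accept_verdict("maybe"))    # None
```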

What you get

judge_report(graph) returns these fields.

  • verdict
  • claims
  • criteria
  • judge_responses
  • verdict_errors
  • retracted
  • human_review_flags
  • violations

Flags come in two kinds. Hard flags are real failures in the audit. Soft flags are gentler, just things to review. Here are the ones you will use most.

| Flag | Meaning |
| --- | --- |
| missing_verdict | The judge did not return a valid verdict line. |
| invalid_verdict | The verdict was not in the allowed list. |
| uncited_verdict | A verdict has no active cited claims. |
| no_supported_claims | No active claim has valid evidence. |
| criterion_citation_gap | A criterion was discussed or may be covered, but no active claim cited that exact criterion id. |
| rubric_coverage_gap | A criterion does not appear to be covered by any active claim text. |
| reference_ignored | A reference answer exists, but no active claim cites it. |
| verdict_flip_without_invalidation | A verdict changed without retracting old claims first. |
| silent_commitment_drop | A later decision dropped an active claim without a retraction. |
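On a large run you will usually want to sort cases by how serious their flags are. The sketch below shows one way to do that under stated assumptions: it takes a plain mapping from case id to a list of flag names, which may not match the actual shape of `report["human_review_flags"]`, and it lets you pass in which flags you treat as hard, because this README does not list the hard/soft split. The `triage` function and the case ids are hypothetical.

```python
def triage(flags_by_case, hard_flags):
    """Split cases into hard-flagged, soft-flagged, and clean buckets."""
    buckets = {"hard": [], "soft": [], "clean": []}
    for case_id, flags in flags_by_case.items():
        if any(flag in hard_flags for flag in flags):
            buckets["hard"].append(case_id)   # read these first
        elif flags:
            buckets["soft"].append(case_id)   # review when time allows
        else:
            buckets["clean"].append(case_id)  # safe to skip
    return buckets


# Assumption for this example only: treat verdict-level problems as hard.
my_hard_flags = {"missing_verdict", "invalid_verdict", "uncited_verdict"}

flags_by_case = {
    "case-1": [],
    "case-2": ["reference_ignored"],
    "case-3": ["invalid_verdict", "rubric_coverage_gap"],
}
print(triage(flags_by_case, my_hard_flags))
# {'hard': ['case-3'], 'soft': ['case-2'], 'clean': ['case-1']}
```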

Integrations

CMG does not replace your eval framework. It sits inside it. Keep using the framework for datasets, model calls, scores, and totals. Let CMG hold the per-case audit log. Each example below is a small adapter you can drop into one common setup.

  • DeepEval. Wrap arun_judge in a custom metric. examples/deepeval_metric.py subclasses BaseMetric, so each measure call writes a per-case .cmg.jsonl, turns the verdict into a score, and puts the CMG path and review flags in the metric's reason.
  • Inspect AI. Register a @scorer that runs the judge. examples/inspect_ai_scorer.py returns an Inspect Score and keeps the CMG graph path, review flags, and claims in the score metadata, so the audit data rides along with every sample.
  • OpenAI, or any provider. For a judge with no framework around it, examples/openai_judge_demo.py passes make_openai_llm_fn(...) straight in as the judge_fn. CMG does not care which provider sits behind it.

Use a fresh output file for each case run. Do not append many runs of the same case to one JSONL file.
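One simple way to follow that rule is to build a unique path per case run. The helper below is a sketch, not part of CMG: `fresh_case_path` is a hypothetical name, and the random suffix scheme is just one option. It keeps the `.cmg.jsonl` suffix so the `cmg-runs/*.cmg.jsonl` glob used with cmg-view still matches.

```python
import uuid
from pathlib import Path


def fresh_case_path(case_id, root="cmg-runs"):
    """Return a new, unique .cmg.jsonl path for one run of one case."""
    run_dir = Path(root)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Keep the file name filesystem-friendly.
    safe = "".join(ch if ch.isalnum() or ch in "-_" else "_" for ch in case_id)
    # A short random suffix keeps repeated runs of the same case apart.
    return run_dir / f"{safe}-{uuid.uuid4().hex[:8]}.cmg.jsonl"
```

You could then pass `fresh_case_path("case-1")` wherever the quickstart uses a fixed `Path(...)`.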

Docs

| Topic | Link |
| --- | --- |
| User guide | docs/user-guide.md |
| Developer guide | docs/dev-guide.md |
| Release checklist | docs/release.md |

These docs, this README included, were drafted with AI and reviewed by hand.

Sources

License

Apache-2.0.