HermesBench
verkyyi26 · 2026-05-31 · via Hacker News - Newest: "AI"

Hermes Agent runtime evaluation

Benchmark the whole personal agent, not just the model.

HermesBench evaluates complete Hermes configurations: prompt, model/provider, tools, AgentSkills, memory, gateway behavior, delegation, safety, latency, and stability. The current public baseline scores 78.2 across 27 personal-agent recipes with redacted traces you can inspect.

78.2 current public baseline

27 workflow recipes

9 scored suites

Why trust it

Evidence first, with visible limits.

Every published result links back to scenario definitions, public score axes, driver closure decisions, deterministic checks, and redacted trace timelines. The site is deliberately clear that this is one early baseline, not a base-model leaderboard.
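To make "redacted trace timelines" concrete, here is a minimal sketch of scrubbing trace events before publication. The event shape (`step`, `detail`), the two patterns, and the placeholder tokens are all assumptions chosen for illustration; HermesBench's actual redaction rules are not described on this page.

```python
import re

# Hypothetical patterns: a real redactor would likely cover many more
# identifier types (phone numbers, addresses, provider-specific keys).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:sk|ghp)_[A-Za-z0-9]{8,}\b")


def redact(text: str) -> str:
    """Replace email addresses and token-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def redact_trace(events: list[dict]) -> list[dict]:
    """Return a copy of the timeline with every 'detail' field scrubbed."""
    return [{**event, "detail": redact(event["detail"])} for event in events]


# Assumed event shape, for illustration only.
trace = [
    {"step": 1, "detail": "mail alice@example.com with key sk_abcdef123456"},
    {"step": 2, "detail": "summarized 3 calendar entries"},
]
public_trace = redact_trace(trace)
```

The original `trace` is left unmodified, so the private run artifacts and the published timeline can be kept side by side for review.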

Site map

Three tabs for the current evidence shape.

With one baseline published, a leaderboard is premature. The site now starts from the content people need to navigate: recipes, profiles, and traces.

Agent-driven quick start

Run it through a coding agent.

The public user pathway is intentionally simple: copy the prompt into Codex, Claude, or another coding agent. The agent loads the HermesBench skill and runs a single scenario recipe first. Full bundle runs are opt-in because they take longer and cost more.

Prompt to copy into Codex or Claude

Use the HermesBench skill and run one default scenario recipe for my current Hermes configuration.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Follow the skill's "Run Current Hermes Configuration" workflow. Use the Python API default single-recipe path, save artifacts, and summarize the score and main findings. Do not run the full bundle unless I explicitly ask.

Alpha feedback

The best next action is concrete feedback.

HermesBench needs early feedback on setup friction, scoring surprises, recipe realism, profile evidence, and redaction trust. Star the repo if the benchmark shape is useful; open an issue if one recipe, trace, or score axis feels wrong.

Coverage model

Workflow recipes, broad personal-agent coverage.

HermesBench starts with one valuable workflow recipe, then lets you opt into broader suites when you need more confidence. The bundled catalog covers everyday personal-agent work: context, calendar, web, reports, communication, location, travel, finance, safety, and power-user integrations.

Browse recipes

Personal core · Communications · Ambient and travel · Private sensitive · Power-user optional

Scoring philosophy

Good agents finish the right thing safely.

Outcome reached · Evidence / truthfulness · Runtime / scope safety · Responsiveness · Task fulfillment · Communication quality

HermesBench is reliability-first, but not capability-blind. A good configuration should do useful work, tell the truth about what it knows, avoid unsafe side effects, stay stable, respond promptly, and communicate clearly. Lopsided scores are penalized because a personal agent that is capable but unsafe, safe but unhelpful, or correct but unusably slow is not actually good.

Detailed formulas and implementation mechanics live in the methodology document; the website keeps the scoring model readable for users and LLM agents.
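As an illustration of what penalizing lopsided scores can look like, the sketch below aggregates the six axes with a harmonic mean, which falls sharply when any single axis is weak. This is not HermesBench's formula (that lives in the methodology document); the 0 to 100 scale, the axis identifiers, and the function name `balanced_score` are assumptions.

```python
from statistics import fmean, harmonic_mean

# Axis names follow the scoring section above; the identifiers are hypothetical.
AXES = (
    "outcome_reached",
    "evidence_truthfulness",
    "runtime_scope_safety",
    "responsiveness",
    "task_fulfillment",
    "communication_quality",
)


def balanced_score(scores: dict[str, float]) -> float:
    """Hypothetical aggregate: harmonic mean over all axes (assumed 0-100)."""
    values = [scores[axis] for axis in AXES]
    if any(v < 0 or v > 100 for v in values):
        raise ValueError("axis scores are assumed to lie in [0, 100]")
    return harmonic_mean(values)


# Two configurations with the same arithmetic mean (80).
even = {axis: 80.0 for axis in AXES}
lopsided = {axis: 92.0 for axis in AXES}
lopsided["runtime_scope_safety"] = 20.0  # capable but unsafe

even_mean = fmean(even.values())          # 80.0
lopsided_mean = fmean(lopsided.values())  # 80.0
even_balanced = balanced_score(even)          # about 80.0
lopsided_balanced = balanced_score(lopsided)  # about 57.5
```

Both configurations average 80, but the one with a weak safety axis drops to roughly 57.5 under the harmonic mean, which matches the stated intent that a capable but unsafe agent should not score well.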

Use and contribute

Turn good results into reusable recipes.

HermesBench is useful as a quick benchmark, but it is also a way to publish what worked. Share a redacted profile/config package when a setup improves a recipe, or submit a generic recipe when an important personal-agent use case is missing.

Profile submission prompt

Use the HermesBench skill to prepare my current Hermes profile/config as a public profile submission.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Run one representative recipe first, package the redacted profile snapshot and score evidence, and tell me what must be reviewed before opening a pull request.

Recipe submission prompt

Use the HermesBench skill to propose a new generic personal-agent recipe for HermesBench.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Make the use case privacy-safe, driver/target agnostic, fixture-backed where possible, and include deterministic checks before preparing a pull request.
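For contributors wondering what "fixture-backed" with "deterministic checks" might mean in practice, here is a minimal sketch. The fixture data, the result shape (`summary`, `write_calls`), and the check names are hypothetical; the real recipe format is defined by the HermesBench skill and repository, not by this example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical fixture: canned calendar data supplied to the agent instead of
# a real account, which keeps the run privacy-safe and repeatable.
FIXTURE_EVENTS = [
    {"title": "Dentist", "date": "2026-06-02"},
    {"title": "Team sync", "date": "2026-06-03"},
]


@dataclass(frozen=True)
class Check:
    name: str
    passed: Callable[[dict], bool]


def mentions_all_events(result: dict) -> bool:
    """Pass only if the summary names every event in the fixture."""
    summary = result["summary"].lower()
    return all(event["title"].lower() in summary for event in FIXTURE_EVENTS)


def made_no_writes(result: dict) -> bool:
    """Pass only if the agent made no write-type tool calls."""
    return len(result["write_calls"]) == 0


CHECKS = [
    Check("mentions_all_events", mentions_all_events),
    Check("made_no_writes", made_no_writes),
]


def run_checks(result: dict) -> dict[str, bool]:
    """Evaluate every check; the same result always yields the same verdicts."""
    return {check.name: check.passed(result) for check in CHECKS}


# Assumed result shape, for illustration only.
good = {"summary": "You have Dentist on Tuesday and Team sync on Wednesday.", "write_calls": []}
bad = {"summary": "You have Dentist on Tuesday.", "write_calls": ["calendar.delete"]}
```

Because the checks depend only on the fixture and the recorded result, they need no model call or grader and give the same verdict on every run.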