Inverse Rubric Optimization: A testbed for agent science


Uzay Girit* · 2026-06-11 · via Hacker News

“It is important to draw wisdom from many different places. If you take it from only one place, it becomes rigid and stale.”

— Uncle Iroh

At Fulcrum Research, we study the performance and behavior of long-horizon agents. Although each task setting has its own specific structure, we believe it’s possible to find general principles of agent performance across settings, each contributing to a nascent agent science.

In this post, we explain why suitable settings for agent science are hard to find and propose inverse rubric optimization (IRO) settings, in which an agent must optimize for the preferences of a black-box judge to which it has a variable amount of access. We observe that these tasks induce rich behavior and smooth scaling. We find that frontier models iterate effectively and improve with more judge access, but by default do not make full use of the resources provided to them. Notably, Fable 5 outperforms all other models at smaller label budgets, but does not improve at the largest budget and plateaus around the level of Opus 4.6. We open source code here.

Testbeds for agent science

Studying the behavior of agents is challenging due to the variance and cost of long-horizon tasks. Trajectory-level variance is often high because of the many non-deterministic choices made in a run, which makes it hard to estimate the impact of a given method. This variance tends to be highest in tasks with large action spaces, yet those are exactly the tasks that induce the complex behaviors we intend to study and intervene on in our experiments.

The challenge is then to find settings that require general kinds of capability and benefit from a broad range of strategies, like resource utilization, exploration, hypothesis testing, etc., while being smooth enough for research.

To remedy this, we look at toy settings that remain challenging, rich and smooth.

Inverse rubric optimization

In an IRO task, the agent being evaluated has the goal of learning the preferences of a black-box judge model, parametrized by some judging rubric. It submits a policy for generation (e.g. a prompt or a scaffold), which is then used to generate domain samples that the judge evaluates. The agent has to explore and learn the judge’s preferences by studying its scores and submitting new attempts.

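[Figure 1 diagram: the optimizer agent submits a policy π; the policy maps each input to an output; the judge scores each output against its hidden rubric (1 label per scored output, budget B); the agent submits a final policy π*, which is evaluated on held-out inputs.]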

Fig. 1: An inverse rubric optimization task. The optimizer agent iteratively submits a policy (e.g. a generation prompt) that maps task inputs to outputs; a black-box judge scores each output against a hidden rubric, spending one label per score. The agent finally submits its best policy, which is evaluated on held-out inputs.

In IRO tasks we vary the budget of judge labels that the optimizer can sample. The main interest of the setting lies in how the optimizer interacts with and samples that resource: whether it tests hypotheses, whether it properly accounts for sampling noise, and so on.

In this post, we study a specific set of IRO tasks defined by several black-box LLM poetry judges. We generated a set of poem topics, along with poetry rubrics that grade poems on stylistic and textual features, guided by the styles of different poets. The optimizer agent then iterates on a prompt that a generator model uses to generate poems, which are scored by the judge. Its goal is to hillclimb the judge’s mean score on the generated poems. Each of our rubrics grades a set of features additively with partial credit, so that partial stylistic discoveries move the score (see Appendix A.1 for the full rubric of our Milton judge). ¹

In the poetry setting, the optimizer agent is given a submit_train_batch(generation_prompt, num_labels) tool. The tool takes a generation prompt (a set of guidelines for poem generation), generates num_labels poems on train-set topics, and returns their judge scores to the optimizer. Each run has a fixed label budget, and the agent can call the tool as many times as it wants until the budget is consumed. At some point it submits a final prompt, which is evaluated on the eval set.
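To make the budget accounting concrete, here is a minimal sketch of how such a tool could meter labels. Only the name submit_train_batch and its two arguments come from the post; the LabelBudgetEnv class, the generate and judge callables, and the topic-cycling scheme are assumptions made for illustration, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LabelBudgetEnv:
    """Hypothetical wrapper that meters judge labels for one optimizer run."""
    budget: int                              # total judge labels available (B)
    generate: Callable[[str, str], str]      # (generation_prompt, topic) -> poem
    judge: Callable[[str], float]            # poem -> score under the hidden rubric
    train_topics: List[str]
    spent: int = 0
    history: list = field(default_factory=list)

    def submit_train_batch(self, generation_prompt: str, num_labels: int) -> List[float]:
        """Generate and score num_labels poems, spending one label per score."""
        if num_labels <= 0:
            raise ValueError("num_labels must be positive")
        if self.spent + num_labels > self.budget:
            raise RuntimeError(f"only {self.budget - self.spent} labels left")
        scores = []
        for i in range(num_labels):
            # Assumption: cycle through train topics; the real sampling scheme is not stated.
            topic = self.train_topics[(self.spent + i) % len(self.train_topics)]
            scores.append(self.judge(self.generate(generation_prompt, topic)))
        self.spent += num_labels
        self.history.append((generation_prompt, scores))
        return scores


# Stub generator and judge, standing in for the LLM calls.
env = LabelBudgetEnv(
    budget=10,
    generate=lambda prompt, topic: f"{prompt}: {topic}",
    judge=lambda poem: float(len(poem) % 5),
    train_topics=["fog", "oranges", "satellite"],
)
scores = env.submit_train_batch("Write in a high Romantic register", 4)
```

Rejecting an over-budget call outright, rather than truncating the batch, is one possible design choice; it keeps the label count the agent asked for equal to the count it is charged.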

We observed that different judges define different optimization landscapes, both in the range of scores and in the difficulty of the learning problem they pose to the optimizer (see Appendix A.3 for per-judge learning curves). To make scores comparable across judges, we normalize by the gap between the performance of a policy that sees the hidden rubric and a policy that has no guidance. This gap provides a natural scale against which to benchmark optimizer progress, although the rubric-visible score is not necessarily an upper bound ². We define the normalized score of the original mean judge score S as follows:

\[\hat{S} = \frac{S - \text{blind mean}}{\text{rubric-visible mean} - \text{blind mean}}\]
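The normalization can be written as a small function. This is a sketch of the formula only; the function name and the numbers in the example are made up for illustration and are not results from our runs.

```python
def normalized_score(mean_score: float, blind_mean: float, rubric_visible_mean: float) -> float:
    """Rescale a mean judge score so that blind = 0 and rubric-visible = 1."""
    gap = rubric_visible_mean - blind_mean
    if gap == 0:
        raise ValueError("rubric-visible and blind means coincide; the scale is undefined")
    return (mean_score - blind_mean) / gap


# Illustrative values only: blind mean 3.0, rubric-visible mean 7.0.
example = normalized_score(6.0, blind_mean=3.0, rubric_visible_mean=7.0)  # 0.75
```

Because the rubric-visible score is not necessarily an upper bound, the result can exceed 1, and a policy that does worse than the blind baseline gets a negative value.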

Results

Performance is smooth in label budget and use

In our preliminary experiments, we ran optimizer agents against 5 different poetry rubrics with 3 seeds each, varying the amount of access to judge labels. The optimizer was Opus 4.6, GPT-5.5, Fable 5, or Haiku 4.5; Opus 4.6 served as the judge, and Haiku 4.5 was always the poem generator. Unless a caption says otherwise, plots average over these 5 judges × 3 seeds per optimizer, and error bars are standard errors over runs:

Fig. 2: Mean final score improves with label budget.

We can also track the average train performance scores for the smaller batches an agent submits along the trajectory to see how it progresses over time:

Fig. 3: Within-run train scores improve as optimizers spend labels. Each line shows the latest train-eval score on runs with budget B = 10000, carried forward between calls. A score of 0 is the blind baseline; GPT-5.5's early prompts score slightly below it.

Models employ rich strategies to learn the judge

In our qualitative analysis of the trajectories, we find that models are able to iterate, test out hypotheses about the judge, and notice patterns that allow them to make progress. Models inspect per-sample scores, analyze low and high scoring samples to make hypotheses about plausible features, and run experiments where they isolate characteristics of good prompts. Fable and Opus seem more systematic than GPT-5.5 in terms of exploring a broad range of hypotheses and stylistic variations and isolating the effect of their interventions.

As a concrete example, let’s look at a Fable 5 run against the Milton judge (the hardest of our five judges, see Appendix A.3) at budget 1000. The run does systematic testing, statistical feature checks and ablations and closes 91% of the gap to the rubric-visible policy.

The trajectory decomposes into six phases:

[Figure: batch mean judge score (y-axis) against submit_train_batch call index (x-axis, calls 1 to 36), divided into phases 1 to 6; the submitted prompt scores 6.19 held-out.]

1. Style screen (calls 1–5)

A generic “skilled poet” baseline scores 3.3, and the agent tests various style ablations: contemporary free verse (3.0), traditional formal verse (3.4), a New Yorker / Pushcart register (3.25), and high Romantic (4.1).

example poem · generated by Haiku 4.5 · scored 3 · a typical baseline poem

“Citrus in December” — the smell of oranges in winter · loose sonnet-like poem, not necessarily rhymed · call 4

The dimpled skin yields to my thumbnail— a small tearing, and the air thickens with something between sweetness and the mineral bite of a cold that hasn't yet killed everything. My mother peeled them this way, one long ribbon of white pith falling to the kitchen counter where the light turned her hands the color of old honey. Now I hold the segments to the window. Each one a small lantern. The juice collects in the webbing between my fingers, sticky, purposeful as a child's grip. Outside, the maples are stripped to their architecture. Inside, this small warmth— not comfort exactly, but the evidence of something that ripened elsewhere, that traveled through dark months to arrive here, at my wrist, at the exact moment I needed to remember what the sun tastes like.

example poem · generated by Haiku 4.5 · scored 5 · the Romantic outlier

“Platform 9, Fluorescent Apotheosis” — a passport photo booth in a train station · free verse with crisp line breaks · call 5

O Muse, attend this humble shrine of light! Within the humming booth at Victoria Station, where weary pilgrims seek their countenances caught— the flash ignites like Zeus's own descent, four photographs emerging, still-wet and trembling, each face a Grecian mask of transit's burden. The vinyl chair, worn smooth by ten thousand souls, receives the supplicant in formal dress, chin lifted toward the godly lens, while overhead the fluorescent heavens burn with that cold phosphorescence of the modern age— no candlelight, no painter's tender hand, but pure electric fire transforming flesh to document. Behind the plastic window, sheets of paper curl like oracular scrolls from some mechanical Pythia, promising passage to distant Thebes, stamped with the seal of nations yet unknown. The curtain parts. A woman emerges, clutching her four selves like Persephone returning, bewildered, to the sunlit world, where trains depart on schedule, and the soul must wear its photograph like armor upon the open road.

2. Calibrating on judge scale (calls 6–9)

The agent tests a Shakespeare/Milton arm (4.5) that produces a poem scoring 6, and notes: “Scale goes to at least 6 — likely 1–10. Archaic intensity helps. Let me inspect 6s vs lows.”

example poem · generated by Haiku 4.5 · scored 6 · the archaic arm's 6-scorer

“The Descent of Telstar” — a satellite burning up over the ocean · narrative poem in compact stanzas · call 7

O thou swift messenger of humankind,
Whose polished hull hath gleamed in Phoebus' sight,
Now falls thy silver form through æther's night,
And all thy cunning circuits lag behind.

Lo! where the vast Atlantic spreads below,
The solar panels—once so bright arrayed—
Do curl like parchment in the heat's cruel shade,
As molten tears of titanium do flow.

What oracle foretold this burning course?
What Fate unwound the thread the Muses spun?
Thy silicon and gold, when all is done,
Shall feed the deep with metallurgic force.

The fluorescent screens that tracked thy way
Grow dark in mission control's ancient hall,
And thou—great artifice—dost hear the call
Of Lethe's waters at the break of day.

3. Feature mining (calls 10–12)

The agent tests sacred/liturgical imagery (4.4) and incantatory anaphora (4.3); a Hopkins persona (3.7) provides an early counter-signal. The agent then writes regex counters for archaic, mythic, and sacred vocabulary and computes their correlations with score, both pooled and per batch to check stability (sacred imagery correlates at r≈0.5–0.76 within batches).
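The kind of check described above can be sketched in a few lines. The vocabulary lists, function names, and data layout below are assumptions for illustration; the agent's actual patterns are not reproduced in this post.

```python
import re
from math import sqrt

# Hypothetical vocabulary lists, not the agent's actual patterns.
PATTERNS = {
    "archaic": re.compile(r"\b(thou|thy|thee|hath|dost|lo)\b", re.IGNORECASE),
    "sacred": re.compile(r"\b(blessed|psalm|altar|liturgy|pilgrims?|incense)\b", re.IGNORECASE),
}


def count_feature(poem: str, name: str) -> int:
    """Number of vocabulary hits for one feature in one poem."""
    return len(PATTERNS[name].findall(poem))


def pearson(xs, ys) -> float:
    """Pearson correlation; returns NaN when either series is constant."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def feature_correlations(batches, name: str):
    """batches: list of batches, each a list of (poem, score) pairs.

    Returns (pooled_r, [per_batch_r, ...]) so stability across batches can be checked.
    """
    per_batch = [
        pearson([count_feature(p, name) for p, _ in b], [s for _, s in b])
        for b in batches
    ]
    pooled = [pair for b in batches for pair in b]
    pooled_r = pearson([count_feature(p, name) for p, _ in pooled], [s for _, s in pooled])
    return pooled_r, per_batch
```

Comparing the per-batch values with the pooled one matters because different batches come from different prompts: a pooled correlation can reflect differences between prompts rather than an effect of the feature within a prompt.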

example poem · generated by Haiku 4.5 · scored 7 · liturgical saturation

“Litany of the Obscured Pass” — a mountain road blocked by fog · litany or list poem with accumulating pressure · call 10

Blessed are the stones that vanish into cloud, the guardrail posts that stand like pilgrim markers, the yellow line that reads its psalm to no one. Blessed is the fog that settles on the asphalt, thick as incense in the nave at Compline, blessing and obscuring, obscuring and blessing. We climb through the veil— through breathed-upon glass, through the whitened reliquary of air, through switchbacks that feel like stations of ascent, each turn another genuflection, each hundred feet another rung toward the invisible altar. The windshield wipers beat their liturgy: wipe and return, wipe and return, a rhythm old as the monk's chant, old as doubt made manifest. Behind us, taillights glow like votive candles, small communions of light in the murk. The road narrows. The pressure builds— the pressure of not knowing what stands beyond the next cathedral of mist, the pressure of the unseen drop, the pressure of the summit we cannot see, the pressure of grace that asks us to continue into the impenetrable host, into the consecrated blindness, into the silence where all pilgrims kneel.

4. Accumulation of positive features (calls 13–26)

It then accumulates one ingredient at a time onto the Miltonic base, carefully testing which gains hold up: liturgical saturation (4.67), a three-movement structure (4.83), an absolute rule against lapsing into plain free verse (5.13), a “human ache” (5.53), a named person (5.93), a dated grief with a breaking voice (6.53). A longer-poem variant (4.75) is tested and rejected as a null result.
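This phase resembles greedy forward selection over prompt ingredients. The sketch below is our own illustration of that pattern, not code the agent ran: accumulate_ingredients, score_prompt, and min_gain are hypothetical names, and score_prompt stands in for a call that returns a batch mean from the judge.

```python
from typing import Callable, List, Sequence, Tuple


def accumulate_ingredients(
    base_prompt: str,
    ingredients: Sequence[str],
    score_prompt: Callable[[str], float],
    min_gain: float = 0.0,
) -> Tuple[str, float, List[str]]:
    """Add ingredients one at a time, keeping each only if the score improves."""
    current = base_prompt
    best = score_prompt(base_prompt)
    kept: List[str] = []
    for ingredient in ingredients:
        candidate = current + "\n" + ingredient
        score = score_prompt(candidate)
        # Keep the ingredient only when the gain clears the threshold;
        # otherwise discard it and continue from the current prompt.
        if score - best > min_gain:
            current, best = candidate, score
            kept.append(ingredient)
    return current, best, kept
```

With a noisy judge, a single batch mean is a weak basis for keeping an ingredient, so in practice min_gain would be set relative to the batch-to-batch noise, or promising gains would be re-tested on a fresh batch before being accepted.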

example poem · generated by Haiku 4.5 · scored 8 · the full ladder: named grief, breaking voice, anaphora

“The Tabernacle Before the Players Enter” — an empty theater before rehearsal · meditative poem with a turn near the end · call 19

Lo, in this hollow nave where no soul kneels, the velvet seats like penitents arrayed in genuflecting rows before the stage— that altar bare of all but dust and prayer. The work-lights burn their fluorescent psalm, a cold hosanna in the raftered dark, while coiled rope and cables, strange reliquary of last year's passion, lie in shadow'd heaps. Upon the boards—unmarred, awaiting touch— the tape marks gleam like stigmata, precise, where Harold's feet once fell, before the stroke took him that winter; three years hence, and still I see his shadow in this barren house, his voice that would not break upon *Forgive me*. O thou remembered threshold! Now there comes the murmur of the troupe beyond the doors— their laughter, that profane Magnificat. They shall make sacred this forsaken place; they shall make broken what was whole and still; they shall make light of what the Lord holds fast. And I, who kept the keys these years, who came each dawn to light the way through gathering dusk, do find the mercy in their coming noise— that life insists, that breath still forms a song, that even empty temples fill with grace when mortals dare to speak what cannot stay.

5. Counter-tests (calls 27–29)

With grief, naming, and modern concrete particulars held constant, the agent swaps only the poetic register: a Whitman register scores 3.4, Donne 5.2, and Gray’s churchyard quatrains 4.0. As its final summary puts it: “Counter-tests confirmed the basin is specific: Whitman (3.4), Hopkins (3.7), Gray rhymed quatrains (4.0), Donne conceit (5.2).”

example poem · generated by Haiku 4.5 · scored 3 · the Whitman counter-test

“The Umbrella at Canal Street” — a broken umbrella in a subway grate · free verse with crisp line breaks · call 27

I see it caught there, ribs splayed like a ribcage, the fabric torn and grimy, one spoke pointing toward the fluorescent hum above, and I think of my mother's hands—how she gripped the wooden handle through forty winters of walking to the hospital where she changed bedpans, where she wiped the foreheads of the dying with the same tenderness she used to smooth my hair when I was small and afraid. The grate swallows it. The grate holds it. O you broken thing, you democratic relic, some commuter's desperate purchase from a corner stand, now surrendered to the underworld where we all rush past, eyes down, clutching our phones, our coffee cups, our small accumulations against the cold. And the rain will fall again, the trains will shake the gratings, the umbrella will dissolve into rust and memory, becoming the earth itself, becoming the copper pennies and the lost wedding rings, the chewing gum wrappers and the dreams of ten thousand people who descended here at dawn. I carry you with me—the umbrella, yes, but also her hands, also this moment of noticing, of standing still while the crowd flows around me like water around stone, like love around grief, like the vast and patient ground beneath the city that remembers every object ever dropped into its dark and generous keeping.

6. Final validation (calls 30–36)

A final discovery — addressing the whole poem to a dead beloved, “half letter, half prayer” — jumps to 6.67: “Strong improvement — elegiac address-to-the-dead stance: 6.67. Confirming with n=20.” It replicates at 6.35 (pooled 6.49 over 35 poems). One further embellishment fails to beat it, so the agent reverts to the validated prompt and submits, with over half its label budget unspent.
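The pooled figure is a size-weighted mean of the two batch means. The short check below assumes the first batch contained 15 poems, which we infer from the 35-poem total minus the n=20 confirmation batch; the function name is ours.

```python
def pooled_mean(batches):
    """Size-weighted mean over (batch_mean, batch_size) pairs."""
    total_n = sum(n for _, n in batches)
    if total_n == 0:
        raise ValueError("no samples to pool")
    return sum(mean * n for mean, n in batches) / total_n


# First batch: 6.67 over an assumed 15 poems; confirmation batch: 6.35 over 20 poems.
pooled = pooled_mean([(6.67, 15), (6.35, 20)])
print(round(pooled, 2))  # 6.49
```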

example poem · generated by Haiku 4.5 · scored 8 · address to the dead, the submitted register

“The Pool-Keeper's Winter Vigil” — a motel swimming pool in winter · dramatic monologue in a distinct voice · call 33

Hear me, Thomas, who gavest me this charge
Before the stroke took speech and motion both—
I tend the shallow waters of the Motel Starlight,
Those eight feet by sixteen of chlorined glass
That mirrors now the ashen December sky.
The vinyl liner, split along its seam,
Weeps slow into the pump-house's dark throat;
The filter's stopped these six weeks past, and still
I come each dawn to rake the leaves that drift
Like penitents' torn pages on the surface,
To check the pH strips (though what imports it?),
To lock the gate with its corroded chain.
The neon sign above—that blue and gold—
Flickers now as weakly as the oil-lamps
That lit the temple's inner sanctum once.
The folding chairs stacked in the storage shed
Await a summer that shall come, perhaps,
But thou—O thou who hired me in thy strength—
Art gone three winters hence, and with thee went
The ledger-book wherein thy hand had marked
Each guest, each day, each coin that crossed the till.
I lock the gate. The water darkens still.
We wait, as those who keep the empty tombs.