


















Language models are getting better at learning from feedback during post-training. In reinforcement learning with verifiable rewards (RLVR), a model tries a problem, a verifier checks the answer, and the policy is updated based on the scalar reward. Recent self-distillation methods go further by using feedback or successful sibling rollouts to create a stronger teacher signal. But most of these methods are still episode-local. A rollout is sampled, scored, used for one update, and then mostly discarded, which wastes signal. Across training, a model repeatedly sees related problems under a changing policy, and those attempts reveal more than just right or wrong answers. They show which strategies keep working and which reasoning patterns transfer to new problems. Procedural Memory Distillation (PMD) is built around this idea: self-improvement should be cumulative.
Knowing whether an answer was correct is only part of what a model can learn from an attempt. The reasoning that produced the answer matters too, along with the patterns that are worth holding onto.
PMD converts the model’s own training-time attempts into procedural memory, uses that memory to condition a stronger self-teacher, and distills the resulting guidance into the model’s weights. The memory is used only during training. At inference time, the final model runs normally, without external retrieval or a memory bank. PMD uses memory as a training scaffold, not a deployment dependency.
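PMD creates an online loop: the learner generates experience, experience is distilled into memory, memory strengthens the teacher, and the teacher updates the learner.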
This is the central design principle: policy and memory co-evolve.

Hierarchy of procedural memory
PMD organizes memory into three levels.

Experience memory stores the raw trajectories, including successful rollouts, failed rollouts, rewards, and feedback. It is faithful, but local and verbose.
Insight memory reflects on attempts for the same problem and extracts strategies and lessons. Strategies describe what led to correct solutions. Lessons describe recurring mistakes and why they failed. When both successes and failures are available, PMD compares them directly to identify what separates correct reasoning from incorrect reasoning.
Behavior memory abstracts across related problems. PMD clusters semantically similar questions and distills their insights into reusable behaviors such as general skills or mistakes to avoid.
This hierarchy creates a fidelity-transfer trade-off. Experience is concrete but hard to transfer, behavior is transferable but more abstract, and insight sits in between. In our experiments, combining problem-level insights with cross-problem behaviors gives the strongest internalized policy.
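The three levels can be pictured as plain data containers. The following is a minimal sketch only: the class names, field names, and the `split_by_outcome` helper are hypothetical illustrations, not the authors' implementation, and the reward threshold of 1.0 assumes a binary verifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Experience:
    """One raw rollout: faithful, but local and verbose."""
    problem_id: str
    rollout: str
    reward: float
    feedback: str = ""


@dataclass
class Insight:
    """Problem-level reflection over several attempts at the same problem."""
    problem_id: str
    strategies: List[str] = field(default_factory=list)  # what led to correct solutions
    lessons: List[str] = field(default_factory=list)     # recurring mistakes and why they failed


@dataclass
class Behavior:
    """Cross-problem abstraction distilled from a cluster of related problems."""
    cluster_id: str
    description: str
    kind: str  # assumed labels: "skill" or "mistake_to_avoid"


@dataclass
class ProceduralMemory:
    """Hypothetical container holding all three memory levels."""
    experiences: Dict[str, List[Experience]] = field(default_factory=dict)
    insights: Dict[str, Insight] = field(default_factory=dict)
    behaviors: Dict[str, List[Behavior]] = field(default_factory=dict)

    def add_experience(self, exp: Experience) -> None:
        self.experiences.setdefault(exp.problem_id, []).append(exp)

    def split_by_outcome(
        self, problem_id: str, threshold: float = 1.0
    ) -> Tuple[List[Experience], List[Experience]]:
        """Separate successes from failures so reflection can contrast them."""
        attempts = self.experiences.get(problem_id, [])
        wins = [e for e in attempts if e.reward >= threshold]
        losses = [e for e in attempts if e.reward < threshold]
        return wins, losses
```

Keeping experiences keyed by problem makes the insight step (contrasting successes and failures on the same problem) a simple lookup, while behaviors are keyed by cluster because they span problems.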
PMD builds on self-distillation, where the current model plays two roles: a student that generates rollouts, and a teacher that sees additional context.
Standard self-distillation gives the teacher episode-local context, such as feedback from the current attempt. PMD gives the teacher a broader view that includes strategies from prior correct attempts, lessons from prior failures, and behaviors retrieved from related problems.
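One way to picture the difference between the two roles is in how their inputs are assembled. The sketch below is an assumption-laden illustration: the function names, section titles, and prompt layout are hypothetical and are not taken from the PMD implementation.

```python
from typing import List, Sequence


def build_student_prompt(question: str) -> str:
    """The student sees only the problem, with no memory attached."""
    return f"Question:\n{question}\n\nAnswer:"


def build_teacher_prompt(
    question: str,
    strategies: Sequence[str] = (),
    lessons: Sequence[str] = (),
    behaviors: Sequence[str] = (),
) -> str:
    """The teacher sees the same problem plus retrieved procedural memory."""
    sections: List[str] = []
    for title, items in (
        ("Strategies from prior correct attempts", strategies),
        ("Lessons from prior failures", lessons),
        ("Behaviors from related problems", behaviors),
    ):
        if items:  # skip memory levels with nothing retrieved
            bullets = "\n".join(f"- {item}" for item in items)
            sections.append(f"{title}:\n{bullets}")
    memory_block = "\n\n".join(sections)
    student_part = build_student_prompt(question)
    return f"{memory_block}\n\n{student_part}" if memory_block else student_part
```

With no memory retrieved, the teacher prompt collapses to the student prompt, which matches the idea that the only difference between the two roles is the extra context.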
The student then learns from this memory-conditioned teacher on its own rollout states. This keeps training on-policy while letting the teacher use knowledge accumulated across earlier attempts.
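As a rough illustration of what distilling on the student's own rollout states could look like, the sketch below computes a per-token KL divergence between teacher and student next-token distributions and averages it over a rollout. The exact PMD objective is not given here, so the KL direction, the averaging, and the function names are all assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) at a single token position (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distill_loss(
    teacher_seq: Sequence[Sequence[float]],
    student_seq: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL over the positions of one student-generated rollout.

    Both sequences are scored on the SAME tokens (the student's rollout);
    only the conditioning context differs between teacher and student.
    """
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student must score the same positions")
    if not teacher_seq:
        return 0.0
    total = sum(token_kl(t, s) for t, s in zip(teacher_seq, student_seq))
    return total / len(teacher_seq)
```

Because the positions come from the student's own samples, the loss is evaluated on states the student actually visits, which is the sense in which training stays on-policy.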
At inference time, the student receives no memory prompt. If performance improves, the useful procedural knowledge must have been internalized into the model weights.

We evaluate PMD on two verifiable domains: SciKnowEval, a science reasoning benchmark covering biology, chemistry, physics, and materials science, and LiveCodeBench, a code-generation benchmark with execution-based unit-test feedback.

Across both benchmarks and two model families, PMD improves over GRPO and SDPO.
With Qwen3-8B, PMD improves SciKnowEval average accuracy from 74.4 with SDPO to 77.2, and LiveCodeBench from 47.9 to 51.7. With OLMo3-Instruct-7B, PMD improves SciKnowEval from 69.5 to 73.3, and LiveCodeBench from 45.0 to 51.1. Measured against the SDPO scores, these are relative gains of 3.8–5.5% on SciKnowEval and 7.9–13.6% on LiveCodeBench (2.8–3.8 and 3.8–6.1 absolute points, respectively). PMD uses the same self-distillation backbone as SDPO, but conditions the teacher on online procedural memory. The model’s own training history appears to contain useful signal that episode-local updates miss.
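The relative gains follow directly from the reported scores. The short check below recomputes them; the scores are the ones quoted above, while the dictionary layout and the `relative_gain` helper are just illustrative.

```python
# (model, benchmark) -> (SDPO score, PMD score), as reported in the text
results = {
    ("Qwen3-8B", "SciKnowEval"): (74.4, 77.2),
    ("Qwen3-8B", "LiveCodeBench"): (47.9, 51.7),
    ("OLMo3-Instruct-7B", "SciKnowEval"): (69.5, 73.3),
    ("OLMo3-Instruct-7B", "LiveCodeBench"): (45.0, 51.1),
}


def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline


for (model, bench), (sdpo, pmd) in results.items():
    print(f"{model:18s} {bench:14s} +{pmd - sdpo:.1f} pts  ({relative_gain(sdpo, pmd):.1f}% relative)")
```

Rounded to one decimal, the relative gains come out to 3.8% and 5.5% on SciKnowEval and 7.9% and 13.6% on LiveCodeBench.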
PMD’s gain depends on three pieces working together: reflection on past attempts, persistence of memory across training steps, and co-evolution between policy and memory.

Reflection alone provides some of the lift. A variant called PMD-Transient builds memory from the current batch and discards it after one step. It still improves over SDPO, showing that structured strategies and lessons are useful even without long-term memory.
Persistence adds more on top of that. Full PMD keeps memory across training steps, allowing useful patterns to accumulate. This is especially important for code generation, where correct patterns may be rare and need many attempts to consolidate.
Memory by itself, though, isn’t enough. In the Evolving Memory + Frozen Policy condition, the memory keeps improving while the policy weights stay fixed. Performance remains much weaker than full PMD, because the model never internalizes the memory.
The reverse arrangement falls short for a different reason. In the Frozen Memory + Evolving Policy condition, the policy updates but the memory bank is fixed, and memory written by an earlier policy can become stale as the learner changes.
The takeaway: the learner creates experience, experience becomes memory, memory strengthens the teacher, and the teacher updates the learner. Freezing either the policy or the memory breaks this loop.
We also test whether learned memory is useful beyond the exact model that produced it. Memories learned from Qwen3-8B are transferred to models ranging from Qwen3-1.7B to Qwen3-32B.
Across model sizes, memory-augmented inference outperforms the no-memory baseline. PMD’s co-evolved memory also transfers better than memory built with a frozen policy. Retrieving more memory entries improves performance, which suggests the behaviors are adding signal rather than noise.

A stronger model should improve its top answer and also preserve useful candidate diversity when sampled multiple times. Candidate diversity matters because methods like majority voting, verifier reranking, and best-of-N selection all depend on good candidates being present in the sample set.
On SciKnowEval, PMD continues to improve as the rollout budget increases, while SDPO saturates earlier. PMD also leaves more verifier headroom: more cases where a correct answer exists among sampled candidates even if majority voting does not pick it.
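Verifier headroom can be made concrete as the gap between oracle accuracy (a correct answer appears anywhere in the sample set) and majority-vote accuracy. The sketch below shows one way to compute it, assuming answers are exact-match strings; the function names and the exact definition of headroom are illustrative assumptions rather than the evaluation code used for PMD.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote(answers: Sequence[str]) -> str:
    """Most frequent sampled answer (ties go to the earliest-seen answer)."""
    return Counter(answers).most_common(1)[0][0]


def headroom(
    samples_per_problem: Sequence[Sequence[str]],
    gold: Sequence[str],
) -> Tuple[float, float, float]:
    """Return (majority-vote accuracy, oracle accuracy, headroom).

    Headroom is the fraction of problems where a correct candidate was
    sampled but majority voting picked something else.
    """
    n = len(gold)
    maj_correct = 0
    oracle_correct = 0
    for answers, target in zip(samples_per_problem, gold):
        maj_correct += majority_vote(answers) == target
        oracle_correct += target in answers
    maj_acc = maj_correct / n
    oracle_acc = oracle_correct / n
    return maj_acc, oracle_acc, oracle_acc - maj_acc


# Toy example with three problems and three samples each.
samples: List[List[str]] = [["A", "A", "B"], ["C", "D", "D"], ["E", "F", "F"]]
targets = ["A", "C", "G"]
maj_acc, oracle_acc, gap = headroom(samples, targets)
```

In the toy example, the second problem contributes headroom: the correct answer "C" was sampled, but the majority chose "D". A better verifier or reranker could recover that case, whereas the third problem offers nothing to recover.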


PMD does not simply accumulate more and more text. Experience and insight memories grow quickly early in training, then plateau as per-problem memory reaches capacity. Behavior memory acts more like a consolidation layer: it keeps reusable skills while avoiding redundant behaviors.
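A capacity-limited store is one simple way to get this plateau behavior. The sketch below is hypothetical: the class names, the oldest-first eviction policy, and the use of normalized string matching as a stand-in for semantic deduplication are all assumptions, not details of PMD.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class BoundedInsightStore:
    """Per-problem memory with a fixed capacity (assumed oldest-first eviction)."""

    def __init__(self, capacity_per_problem: int = 4) -> None:
        self._store: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=capacity_per_problem)
        )

    def add(self, problem_id: str, insight: str) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full
        self._store[problem_id].append(insight)

    def size(self) -> int:
        return sum(len(q) for q in self._store.values())


class BehaviorStore:
    """Cross-problem behaviors, skipping redundant entries.

    Redundancy is detected here by case- and whitespace-normalized text,
    a deliberately crude placeholder for a semantic similarity check.
    """

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def add(self, behavior: str) -> bool:
        key = " ".join(behavior.lower().split())
        if key in self._seen:
            return False  # redundant, not stored
        self._seen[key] = behavior
        return True

    def __len__(self) -> int:
        return len(self._seen)
```

Under this sketch, total insight memory is bounded by the number of problems times the per-problem capacity, so growth flattens once every problem has been attempted often enough to fill its slot.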

Language models already learn from feedback. PMD shows they can also learn from the history behind that feedback. The point isn’t to give the model an external notebook forever, but to let it use one while learning and then absorb the useful lessons into its own behavior. A self-improving model needs to track its own history and carry the useful parts of it forward. That is the central idea behind Procedural Memory Distillation.