Can Language Models Remember What They Learn?
Ye Liu, Srij · 2026-05-29 · via Salesforce

Post-training methods (RLVR, on-policy distillation) are episode-local

Language models are getting better at learning from feedback during post-training. In reinforcement learning with verifiable rewards (RLVR), a model tries a problem, a verifier checks the answer, and the policy is updated based on the scalar reward. Recent self-distillation methods go further by using feedback or successful sibling rollouts to create a stronger teacher signal. But most of these methods are still episode-local. A rollout is sampled, scored, used for one update, and then mostly discarded, which wastes signal. Across training, a model repeatedly sees related problems under a changing policy, and those attempts reveal more than just right or wrong answers. They show which strategies keep working and which reasoning patterns transfer to new problems. Procedural Memory Distillation (PMD) is built around this idea: self-improvement should be cumulative.

PMD turns model attempts into reusable training memory

Knowing whether an answer was correct is only part of what a model can learn from an attempt. The reasoning that produced the answer matters too, along with the patterns that are worth holding onto.

PMD converts the model’s own training-time attempts into procedural memory, uses that memory to condition a stronger self-teacher, and distills the resulting guidance into the model’s weights. The memory is used only during training. At inference time, the final model runs normally, without external retrieval or a memory bank. PMD uses memory as a training scaffold, not a deployment dependency.

PMD creates an online loop:

  1. The current policy attempts problems.
  2. A verifier or environment scores the attempts.
  3. PMD stores the attempts as memory.
  4. The model reflects on successes and failures.
  5. Reusable strategies, lessons, and behaviors are extracted.
  6. A memory-conditioned self-teacher trains the next policy.
  7. The updated policy produces new attempts, refreshing the memory.

This is the central design principle: policy and memory co-evolve.
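
The seven steps can be sketched as a toy loop. This is only an illustration under loud assumptions: the "policy" is a single success probability rather than a language model, `reflect` is a placeholder for model-written reflection, and the update rule in step 6 is invented to show where distillation would happen. None of the function names come from PMD itself.

```python
import random
from collections import defaultdict

def attempt(policy_skill, problem, rng):
    """Toy policy (assumption): answers correctly with probability policy_skill."""
    return problem["answer"] if rng.random() < policy_skill else "wrong"

def verify(problem, answer):
    """Verifier: scalar reward, 1.0 for a correct answer and 0.0 otherwise."""
    return 1.0 if answer == problem["answer"] else 0.0

def reflect(attempts):
    """Placeholder reflection: one note per kind of outcome seen so far."""
    notes = []
    if any(reward == 1.0 for _, reward in attempts):
        notes.append("strategy: reuse the approach behind a verified answer")
    if any(reward == 0.0 for _, reward in attempts):
        notes.append("lesson: avoid the pattern the verifier rejected")
    return notes

def run_loop(problems, steps=5, seed=0):
    rng = random.Random(seed)
    policy_skill = 0.3               # stand-in for model weights
    memory = defaultdict(list)       # problem id -> [(answer, reward), ...]
    insights = {}                    # problem id -> extracted notes
    for _ in range(steps):
        for p in problems:
            answer = attempt(policy_skill, p, rng)         # 1. attempt
            reward = verify(p, answer)                     # 2. score
            memory[p["id"]].append((answer, reward))       # 3. store
            insights[p["id"]] = reflect(memory[p["id"]])   # 4-5. reflect, extract
        # 6. invented update: more available guidance -> larger improvement
        n_notes = sum(len(notes) for notes in insights.values())
        policy_skill = min(0.95, policy_skill + 0.02 * n_notes)
        # 7. the next iteration samples from the updated policy
    return policy_skill, memory, insights

if __name__ == "__main__":
    toy_problems = [{"id": "a", "answer": "4"}, {"id": "b", "answer": "7"}]
    skill, mem, notes = run_loop(toy_problems, steps=3)
    print(skill, {k: len(v) for k, v in mem.items()})
```

The point of the sketch is the data flow: memory outlives a single update, and each pass both reads from it and writes to it.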

Figure 1. PMD turns repeated attempts into online memory, uses that memory to condition a self-teacher, and distills the guidance into the student.

Hierarchy of procedural memory

PMD organizes memory into three levels.

Experience memory stores the raw trajectories including successful rollouts, failed rollouts, rewards, and feedback. It is faithful, but local and verbose.

Insight memory reflects on attempts for the same problem and extracts strategies and lessons. Strategies describe what led to correct solutions. Lessons describe recurring mistakes and why they failed. When both successes and failures are available, PMD compares them directly to identify what separates correct reasoning from incorrect reasoning.

Behavior memory abstracts across related problems. PMD clusters semantically similar questions and distills their insights into reusable behaviors such as general skills or mistakes to avoid.

This hierarchy creates a fidelity-transfer trade-off. Experience is concrete but hard to transfer, behavior is transferable but more abstract, and insight sits in between. In our experiments, combining problem-level insights with cross-problem behaviors gives the strongest internalized policy.
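
One way to picture the three levels is as three record types. The sketch below is hypothetical: the field names, the `build_insight` helper, and the rule "reward above zero means success" are assumptions made for illustration, not the schema PMD uses.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    """Raw trajectory: faithful, but local and verbose."""
    problem_id: str
    rollout: str
    reward: float
    feedback: str = ""

@dataclass
class Insight:
    """Per-problem reflection over several experiences."""
    problem_id: str
    strategies: List[str] = field(default_factory=list)  # what led to correct solutions
    lessons: List[str] = field(default_factory=list)     # recurring mistakes and why

@dataclass
class Behavior:
    """Cross-problem abstraction over a cluster of related questions."""
    cluster_id: str
    description: str
    source_problem_ids: List[str] = field(default_factory=list)

def build_insight(problem_id: str, experiences: List[Experience]) -> Insight:
    """Hypothetical helper: split one problem's experiences by outcome."""
    insight = Insight(problem_id)
    for e in experiences:
        if e.problem_id != problem_id:
            continue
        note = e.feedback or e.rollout
        if e.reward > 0:          # assumption: positive reward marks a success
            insight.strategies.append(note)
        else:
            insight.lessons.append(note)
    return insight

if __name__ == "__main__":
    exps = [
        Experience("p1", "used unit conversion first", 1.0),
        Experience("p1", "skipped unit conversion", 0.0, "units mismatched"),
    ]
    print(build_insight("p1", exps))
```

In a real system the insight text would come from the model reflecting on the attempts; here it is copied from the feedback so the example stays self-contained.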

How PMD trains the model

PMD builds on self-distillation where the current model plays two roles:

  • The student generates rollouts from the current policy.
  • The self-teacher produces a target distribution after seeing additional context.

Standard self-distillation gives the teacher episode-local context, such as feedback from the current attempt. PMD gives the teacher a broader view that includes strategies from prior correct attempts, lessons from prior failures, and behaviors retrieved from related problems.

The student then learns from this memory-conditioned teacher on its own rollout states. This keeps training on-policy while letting the teacher use knowledge accumulated across earlier attempts.
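
A minimal numeric sketch of that training signal follows. It assumes the loss is a mean per-token KL divergence from the memory-conditioned teacher to the memory-free student, evaluated at the positions of the student's own rollout. The post does not specify the divergence or its direction, so treat both as assumptions. The logits are made-up numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) along the student's rollout.

    teacher_logits: logits computed WITH the memory prompt (assumption).
    student_logits: logits computed WITHOUT memory, same rollout positions.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

if __name__ == "__main__":
    student = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
    teacher = [[3.0, 0.5, 0.1], [0.2, 1.5, 0.2]]  # sharper after seeing memory
    print(distillation_loss(teacher, student))
```

Because both sets of logits are evaluated on tokens the student sampled, the sketch stays on-policy: only the teacher's context differs.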

At inference time, the student receives no memory prompt. If performance improves, the useful procedural knowledge must have been internalized into the model weights.

Figure 2. During PMD training, memory-related concepts such as “strategy,” “lesson,” and “behavior” appear more often in memory-free student rollouts, suggesting that teacher-side guidance is being absorbed into the policy.

What we found

We evaluate PMD on two verifiable domains: SciKnowEval, a science reasoning benchmark covering biology, chemistry, physics, and materials science, and LiveCodeBench, a code-generation benchmark with execution-based unit-test feedback.

Across both benchmarks and two model families, PMD improves over GRPO and SDPO.

With Qwen3-8B, PMD raises SciKnowEval average accuracy from 74.4 (SDPO) to 77.2 and LiveCodeBench from 47.9 to 51.7. With OLMo3-Instruct-7B, it raises SciKnowEval from 69.5 to 73.3 and LiveCodeBench from 45.0 to 51.1. Relative to the SDPO baseline, these are relative gains of 3.8–5.5% on SciKnowEval and 7.9–13.6% on LiveCodeBench. PMD uses the same self-distillation backbone as SDPO, but conditions the teacher on online procedural memory. The model’s own training history appears to contain useful signal that episode-local updates miss.
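
The percentage ranges are relative improvements over SDPO, not absolute point differences. The short check below recomputes them from the scores reported above. The dictionary layout and function name are illustrative only.

```python
# (SDPO score, PMD score) pairs taken from the numbers reported in the text.
results = {
    ("Qwen3-8B", "SciKnowEval"): (74.4, 77.2),
    ("Qwen3-8B", "LiveCodeBench"): (47.9, 51.7),
    ("OLMo3-Instruct-7B", "SciKnowEval"): (69.5, 73.3),
    ("OLMo3-Instruct-7B", "LiveCodeBench"): (45.0, 51.1),
}

def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline

if __name__ == "__main__":
    for (model, bench), (sdpo, pmd) in results.items():
        print(f"{model:>18} {bench:<14} "
              f"+{pmd - sdpo:.1f} points, {relative_gain(sdpo, pmd):.1f}% relative")
```

Rounded to one decimal place, the four relative gains come out to 3.8%, 7.9%, 5.5%, and 13.6%, matching the quoted ranges.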

Why the full loop matters

PMD’s gain depends on three pieces working together: reflection on past attempts, persistence of memory across training steps, and co-evolution between policy and memory.

Reflection alone provides some of the lift. A variant called PMD-Transient builds memory from the current batch and discards it after one step. It still improves over SDPO, showing that structured strategies and lessons are useful even without long-term memory.

Persistence adds more on top of that. Full PMD keeps memory across training steps, allowing useful patterns to accumulate. This is especially important for code generation, where correct patterns may be rare and need many attempts to consolidate.

Memory by itself, though, isn’t enough. In the Evolving Memory + Frozen Policy condition, the memory keeps improving while the policy weights stay fixed. Performance remains much weaker than full PMD, because the model never internalizes the memory.

The reverse arrangement falls short for a different reason. In the Frozen Memory + Evolving Policy condition, the policy updates but the memory bank is fixed, and memory written by an earlier policy can become stale as the learner changes.

The takeaway: The learner creates experience, experience becomes memory, memory strengthens the teacher, and the teacher updates the learner. Freezing either the policy or the memory breaks this loop.
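
The four conditions can be laid out as a small configuration table. The flag names below are one reading of the descriptions in this section, not identifiers from the PMD codebase, and the flag values for the two frozen conditions are inferred from the prose.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    name: str
    reflect: bool         # extract strategies and lessons from attempts
    persist_memory: bool  # keep memory across training steps
    update_memory: bool   # refresh memory with rollouts from the current policy
    update_policy: bool   # distill teacher guidance into the weights

# Assumed flag assignments, inferred from the text.
CONDITIONS = [
    Condition("PMD (full)", True, True, True, True),
    Condition("PMD-Transient", True, False, True, True),
    Condition("Evolving Memory + Frozen Policy", True, True, True, False),
    Condition("Frozen Memory + Evolving Policy", True, True, False, True),
]

def loop_is_closed(c: Condition) -> bool:
    """True only when reflection, persistence, and co-evolution are all on."""
    return c.reflect and c.persist_memory and c.update_memory and c.update_policy

if __name__ == "__main__":
    for c in CONDITIONS:
        print(f"{c.name:<34} closed loop: {loop_is_closed(c)}")
```

Under this reading, only the full configuration closes the loop; each ablation switches off exactly one piece.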

Memory transfers across model scales

We also test whether learned memory is useful beyond the exact model that produced it. Memories learned from Qwen3-8B are transferred to models ranging from Qwen3-1.7B to Qwen3-32B.

Across model sizes, memory-augmented inference outperforms the no-memory baseline. PMD co-evolved memory also transfers better than memory built with a frozen policy. Retrieving more memory entries improves performance, which suggests the behaviors are adding signal rather than noise.
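
Memory-augmented inference in this transfer test amounts to retrieving the top-k entries for a question and placing them in the prompt. The sketch below uses word-overlap (Jaccard) similarity as a stand-in for semantic similarity. The post does not say how retrieval is scored, so the scoring function, the entry format, and the prompt template are all assumptions.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a crude stand-in for embeddings)."""
    return set(text.lower().split())

def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query, memory_entries, k):
    """Return the k entries whose stored question best overlaps the query."""
    q = tokenize(query)
    ranked = sorted(
        memory_entries,
        key=lambda e: jaccard(q, tokenize(e["question"])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, entries):
    """Hypothetical prompt template: retrieved behaviors, then the question."""
    lines = [f"- {e['behavior']}" for e in entries]
    return "Relevant behaviors:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"

if __name__ == "__main__":
    memory = [
        {"question": "balance a chemical equation", "behavior": "count atoms on both sides"},
        {"question": "reverse a linked list", "behavior": "track prev and next pointers"},
        {"question": "balance redox equation in acid", "behavior": "split into half reactions"},
    ]
    query = "how to balance this chemical equation"
    print(build_prompt(query, retrieve(query, memory, k=2)))
```

Raising `k` adds more behaviors to the prompt, which is the knob varied in the transfer experiment.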

Figure 3. Learned memory transfers across model scales. PMD co-evolved memory outperforms frozen-policy memory, and retrieving more memory entries improves performance.

PMD improves test-time scaling

A stronger model should improve its top answer and also preserve useful candidate diversity when sampled multiple times. Candidate diversity matters because methods like majority voting, verifier reranking, and best-of-N selection all depend on good candidates being present in the sample set.

On SciKnowEval, PMD continues to improve as the rollout budget increases, while SDPO saturates earlier. PMD also leaves more verifier headroom: more cases where a correct answer exists among sampled candidates even if majority voting does not pick it.
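
Verifier headroom can be made concrete with a few lines of code: it is the gap between best-of-k coverage (a correct answer appears anywhere in the samples) and majority-vote accuracy. The sample data below is invented purely to show the computation.

```python
from collections import Counter

def majority_vote(candidates):
    """Most frequent answer among the sampled candidates."""
    return Counter(candidates).most_common(1)[0][0]

def evaluate(samples_per_problem, gold_answers):
    """Return (majority-vote accuracy, best-of-k coverage, headroom)."""
    n = len(gold_answers)
    maj_hits = sum(
        majority_vote(c) == g for c, g in zip(samples_per_problem, gold_answers)
    )
    cover_hits = sum(g in c for c, g in zip(samples_per_problem, gold_answers))
    maj_acc = maj_hits / n
    coverage = cover_hits / n
    return maj_acc, coverage, coverage - maj_acc

if __name__ == "__main__":
    # Made-up samples: three problems, three rollouts each.
    samples = [["a", "a", "b"], ["x", "y", "x"], ["p", "q", "q"]]
    gold = ["a", "y", "q"]
    maj_acc, coverage, headroom = evaluate(samples, gold)
    print(f"majority={maj_acc:.2f} best-of-k={coverage:.2f} headroom={headroom:.2f}")
```

In the second toy problem the correct answer `y` is sampled but outvoted, which is exactly the kind of case a verifier or reranker could recover.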

Figure 4. PMD preserves answer-space coverage as rollout budget increases. The wider gap between majority voting and best-of-k indicates more recoverable accuracy for verifier reranking.

Figure 5. PMD solves a larger set of problems than SDPO across SciKnowEval subjects, suggesting that memory-augmented training expands useful coverage rather than merely shifting which examples are solved.

What happens to the memory bank?

PMD does not simply accumulate more and more text. Experience and insight memories grow quickly early in training, then plateau as per-problem memory reaches capacity. Behavior memory acts more like a consolidation layer: it keeps reusable skills while avoiding redundant behaviors.
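
The plateau follows naturally from a per-problem capacity limit. The sketch below is a guess at the mechanics: the capacity value, the oldest-first eviction, and the exact-match deduplication of behaviors are assumptions chosen to keep the example small, not details taken from PMD.

```python
from collections import defaultdict, deque

class BoundedMemory:
    """Per-problem store that stops growing once each slot is full."""

    def __init__(self, capacity_per_problem):
        # Assumption: oldest entries are evicted first when a slot is full.
        self._slots = defaultdict(lambda: deque(maxlen=capacity_per_problem))

    def add(self, problem_id, entry):
        self._slots[problem_id].append(entry)

    def size(self):
        return sum(len(slot) for slot in self._slots.values())

def consolidate(behaviors, new_behavior):
    """Append a behavior unless a whitespace/case-normalized duplicate exists."""
    def normalize(text):
        return " ".join(text.lower().split())
    if all(normalize(b) != normalize(new_behavior) for b in behaviors):
        behaviors.append(new_behavior)
    return behaviors

if __name__ == "__main__":
    memory = BoundedMemory(capacity_per_problem=2)
    sizes = []
    for step in range(5):
        for pid in ("p1", "p2"):
            memory.add(pid, f"attempt at step {step}")
        sizes.append(memory.size())
    print(sizes)  # growth stops once both slots hold 2 entries

    skills = []
    for b in ["Check units first", "check  units first", "Test edge cases"]:
        consolidate(skills, b)
    print(skills)
```

With two problems and a capacity of two, the total size reaches four after the second step and stays there, while the behavior list keeps only distinct entries.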

Figure 6. Experience and insight memories accumulate problem-specific evidence, while behavior memory shows subject-dependent consolidation dynamics.

The Bottom Line

Language models already learn from feedback. PMD shows they can also learn from the history behind that feedback. The point isn’t to give the model an external notebook forever, but to let it use one while learning and then absorb the useful lessons into its own behavior. A self-improving model needs to track its own history and carry the useful parts of it forward. That is the central idea behind Procedural Memory Distillation.