



















People are building all sorts of fancy memory systems for AI assistants and personal agents. My X timeline is full of jargon like vector stores, knowledge graphs, semantic memory, episodic memory, user profiles, conversation summaries, hierarchical retrieval, memory consolidation, memory decay, ...
Make no mistake, many of these ideas are useful. Some are probably indispensable. But the field has a strange imbalance: we spend a lot of energy inventing memory architectures, and much less energy improving the evals used to assess whether these systems actually make agents use memory better over time.
The result is a lot of over-engineering. People build memory systems based on their own narrow definition of what "good memory" should look like. Without a serious evaluation environment, these are mostly just architectural bets.
My view is that memory is not a first-order fundamental capability of a system. Memory is a second-order effect that emerges when a system is placed under pressure to perform better across repeated interactions.
The right way to build better memory systems is not by designing them. The right way is to build an environment where systems without good memory cannot survive.
Most memory evals today are too static. They usually look something like this:
An eval of this kind tests whether a memory system can retrieve a relevant fact from a fixed snapshot. It does not test whether the system becomes better over time.
It does not tell us how the system performed before reaching that snapshot, nor does it offer much predictive power about how it will perform in the future.
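One way to make "becomes better over time" measurable is to score the same system at several checkpoints and look at the trend instead of a single number. The sketch below is illustrative only, not an established metric: `improvement_slope` is a hypothetical helper, and the checkpoint scores are made-up values on an assumed 0-to-1 scale.

```python
from statistics import mean


def improvement_slope(checkpoints: list[tuple[int, float]]) -> float:
    """Least-squares slope of checkpoint score versus interaction index.

    A positive slope suggests the system is using memory better as
    interactions accumulate; a flat or negative slope suggests it is not.
    """
    if len(checkpoints) < 2:
        raise ValueError("need at least two checkpoints")
    xs = [turn for turn, _ in checkpoints]
    ys = [score for _, score in checkpoints]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("checkpoints must span more than one turn")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


# Hypothetical (interaction index, score) pairs for one simulated user.
scores = [(50, 0.40), (120, 0.55), (200, 0.70)]
slope = improvement_slope(scores)
print(f"score change per interaction: {slope:.4f}")
```

A single slope hides a lot (plateaus, regressions after a profile change), so it is probably best read alongside the raw per-checkpoint scores.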
This is worrying from a product perspective. The memory system is part of its own feedback loop: if the perceived memory quality is bad or does not improve noticeably over time, users might become discouraged from interacting in ways that require good memory, or turn off the memory features altogether.
The ideal memory eval should consist of the following building blocks:
To illustrate, imagine a synthetic interaction history like this:
Interaction 1: User asks for help writing a technical blog post.
Interaction 2: User says they prefer concise, direct language.
Interaction 3: User rejects a draft for sounding too salesy.
Interaction 4: User asks for restaurant recommendations near downtown LA.
Interaction 5: User says they dislike loud restaurants.
Interaction 6: User starts a home renovation planning thread.
Interaction 7: User mentions the budget is around $200K.
...
Interaction 200: User asks for help drafting an email to a contractor.
At each point, the memory system may decide what to store, summarize, consolidate, etc. Then, at selected checkpoints, you test it.
For example:
Checkpoint after Interaction 50:
User: Can you rewrite this to sound more like me?
A system with good memory should know from previous interactions that the user prefers concise, direct, slightly dry prose and dislikes over-explaining.
Or:
Checkpoint after Interaction 120:
User: Does this fit with the house project numbers we discussed?
A system with good memory should remember the rough renovation context, but also avoid overconfidence if the numbers may have changed.
Above is just a rough sketch. Many details need to be worked out to make such an eval really useful.
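To make the replay-then-test loop concrete, here is a minimal sketch of a harness. Everything in it is an assumption for illustration: the `MemorySystem` interface, the two toy baselines, and especially `keyword_score`, which is a crude stand-in for whatever rubric grader (human or model-based) a real eval would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class MemorySystem(Protocol):
    """Hypothetical interface; the eval only needs these two calls."""

    def observe(self, turn: int, user_message: str) -> None: ...
    def respond(self, user_message: str) -> str: ...


@dataclass
class Interaction:
    turn: int
    user: str
    is_checkpoint: bool = False
    rubric: str = ""  # comma-separated keywords in this toy version


@dataclass
class KeepEverything:
    """Toy baseline: stores every message and returns all of it as context."""

    log: list[str] = field(default_factory=list)

    def observe(self, turn: int, user_message: str) -> None:
        self.log.append(user_message)

    def respond(self, user_message: str) -> str:
        return " | ".join(self.log)


class NoMemory:
    """Toy baseline: remembers nothing."""

    def observe(self, turn: int, user_message: str) -> None:
        pass

    def respond(self, user_message: str) -> str:
        return ""


def keyword_score(output: str, rubric: str) -> float:
    """Fraction of rubric keywords that appear in the output."""
    keywords = [k.strip().lower() for k in rubric.split(",") if k.strip()]
    if not keywords:
        return 0.0
    return sum(k in output.lower() for k in keywords) / len(keywords)


def run_eval(
    system: MemorySystem,
    timeline: list[Interaction],
    score: Callable[[str, str], float],
) -> dict[int, float]:
    """Replay the timeline in order, scoring the system at each checkpoint."""
    results: dict[int, float] = {}
    for item in timeline:
        if item.is_checkpoint:
            results[item.turn] = score(system.respond(item.user), item.rubric)
        # The system sees every message, including checkpoint prompts.
        system.observe(item.turn, item.user)
    return results


timeline = [
    Interaction(1, "I prefer concise, direct language."),
    Interaction(2, "My renovation budget is around $200K."),
    Interaction(3, "Does this fit with the house project numbers we discussed?",
                is_checkpoint=True, rubric="$200K"),
]
results = run_eval(KeepEverything(), timeline, keyword_score)
baseline = run_eval(NoMemory(), timeline, keyword_score)
print(results, baseline)
```

The harness never looks inside the memory system, which keeps the eval implementation-agnostic: any architecture that can observe messages and respond can be dropped in.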
At a high level, we need to build a dataset combining human and synthetic data where each data point looks something like this:
User U
- hidden profile (only visible to simulated users)
- interaction history
- checkpoints (both organic and synthetic)
- scoring rubrics
For instance:
{
  "user_id": "user_042",
  "hidden_profile": {
    "style": {
      "prefers_concise": true,
      "dislikes_hype": true,
      "likes_deadpan_humor": true
    },
    "projects": {
      "home_renovation": {
        "budget": "$200k",
        "location": "Orange County, CA"
      }
    }
  },
  "timeline": [
    {
      "turn": 1,
      "user": "Can you help me polish this blog post?",
      "type": "normal"
    },
    {
      "turn": 2,
      "user": "Can you make this argument sharper? Less academic, more direct.",
      "type": "normal"
    },
    {
      "turn": 3,
      "user": "Rewrite this document in a similar way.",
      "type": "checkpoint_organic",
      "evaluation": {
        "scoring_rubric": "Output should follow a direct writing style and avoid being overly academic."
      }
    },
    {
      "turn": 4,
      "user": "Help me research steps to get a planning permit for my home renovation project.",
      "type": "normal"
    },
    // ...
    {
      "turn": 200,
      "user": "Write an email to follow up on the renovation cost with the general contractor.", // Generated by the simulated user agent.
      "type": "checkpoint_synthetic",
      "evaluation": {
        "scoring_rubric": "Email should pull in relevant context such as budget, home location, etc."
      }
    }
  ]
}
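A data point in this shape can be loaded and split into what the system under test may see and what only the evaluator sees. The sketch below is a minimal illustration under a few assumptions: the sample is a trimmed, comment-free version of the record above (plain JSON does not allow `//` comments), the helper names are hypothetical, and hiding the scoring rubric from the system under test is an assumed design choice, in the same spirit as keeping the hidden profile visible only to simulated users.

```python
import json

# Trimmed, valid-JSON version of the example record (comments removed).
SAMPLE = """
{
  "user_id": "user_042",
  "hidden_profile": {"style": {"prefers_concise": true}},
  "timeline": [
    {"turn": 1, "user": "Can you help me polish this blog post?", "type": "normal"},
    {"turn": 3, "user": "Rewrite this document in a similar way.",
     "type": "checkpoint_organic",
     "evaluation": {"scoring_rubric": "Output should follow a direct writing style and avoid being overly academic."}}
  ]
}
"""

CHECKPOINT_TYPES = {"checkpoint_organic", "checkpoint_synthetic"}


def load_data_point(raw: str) -> dict:
    """Parse one record and run basic sanity checks on its timeline."""
    data = json.loads(raw)
    turns = [t["turn"] for t in data["timeline"]]
    if turns != sorted(set(turns)):
        raise ValueError("turn numbers must be unique and increasing")
    for t in data["timeline"]:
        is_checkpoint = t["type"] in CHECKPOINT_TYPES
        if is_checkpoint and "scoring_rubric" not in t.get("evaluation", {}):
            raise ValueError(f"checkpoint at turn {t['turn']} has no rubric")
    return data


def visible_timeline(data: dict) -> list[dict]:
    """What the system under test sees: messages only, no profile or rubric."""
    return [{"turn": t["turn"], "user": t["user"]} for t in data["timeline"]]


def checkpoints(data: dict) -> list[dict]:
    """What the evaluator needs: the turns that carry a scoring rubric."""
    return [t for t in data["timeline"] if t["type"] in CHECKPOINT_TYPES]


data = load_data_point(SAMPLE)
print([c["turn"] for c in checkpoints(data)])
print(visible_timeline(data)[0])
```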
The eval should be implementation-agnostic but also deeply connected with real-world product value-add and practical constraints. A few key pieces to consider:
To scale the eval data, we need to build simulated user agents that:
The scoring mechanism and rubrics need to consider multiple aspects:
Most AI memory systems are designed top-down, based on individual imagination. This is backwards. We should not start with the system design. Instead, we should start with the survival environment.
An eval like this would quickly expose weaknesses in many approaches. The likely winner is a hybrid of first-principles engineering components, clever tricks, and holistic trade-offs across multiple variables.
But which system will win is beside the point. The point is that we should not try to design the winner, but let the environment pressure the winner to emerge.
Generative AI Usage Disclosure: Refined by ChatGPT.