How We Built an AI That Evolves Alongside a Creator Through Memory

Let me tell you about the moment I knew we had a problem. We'd just shipped our content repurposing tool. A fitness YouTuber pasted in a video URL. Out came a LinkedIn post that opened with "In today's fast-paced digital landscape..." The man deadlifts 200kg for a living. That's when we decided our AI needed to actually learn who it was writing for, not just parrot generic marketing speak into a different shaped box.

Most AI tools for content creators work like a photocopier with a thesaurus. You paste text in, pick a platform, and out comes something that sounds like it was written by a committee of people who've never watched a YouTube video. We wanted something different. We wanted an AI that gets better at sounding like you the more you use it. Not because we fine-tuned a model (we don't have that kind of GPU budget, and frankly, neither do you), but because the system actually remembers what you do and why.

This is the story of how we built that system, what broke along the way (spoiler: a lot), and why the combination of Hindsight's agent memory and cascadeflow's multi-model orchestration turned out to be the two pieces we didn't know we needed until 3 AM on a Tuesday when everything else had failed.

What the System Actually Does

The elevator pitch: you give our system a YouTube URL. It downloads the video, transcribes it, finds the strongest 30 to 90 second moments, and generates production-ready briefs for Instagram Reels, YouTube Shorts, and LinkedIn. Each brief includes a hook, a full spoken script with editing cues like [CUT] and [PAUSE], a platform-native caption, and even a visual prompt for AI-generated b-roll. Basically, it does what a $5,000/month content team does, except it doesn't take Fridays off.

But here's the thing. Plenty of tools can chop a video into clips. That's table stakes. The interesting part is what happens after the creator starts reviewing those briefs. Every edit, every approval, every rejection, every "regenerate this but make it less corporate" gets silently observed and stored as a memory. The next time the system generates content, it recalls those memories and adjusts.

The creator never fills out a settings page and checks boxes next to adjectives like "casual" and "bold." (We all know nobody reads those forms honestly anyway. Everyone thinks they're "authentic and relatable.") The system just watches what they actually do and converges on their personality over time.

Think of it like hiring a new editor. Day one, they're guessing. They write your hooks like a BuzzFeed intern from 2014. By week three, they know you hate exclamation marks, you always cut filler words, and your hooks work best when they open with a question, not a command. That's the loop we built. Except this editor doesn't ask for a raise and doesn't have opinions about your Slack status.

What you're looking at above: The pipeline flows left to right, from a YouTube URL through seven stages (Ingest, Transcribe, Recall, Extract, Clip, Generate, Review). The key detail is the feedback loop at the bottom, the red arrow that makes the whole thing worth building. Every editing action in the Review stage feeds observations back into Hindsight's memory bank. On the next pipeline run, the Recall stage pulls those memories and injects them into both moment extraction and content generation. The system literally gets smarter with each review cycle. It's like git for personality, except you never have to write a commit message.

The Memory Loop: Where Hindsight Fits

Here's a confession. The first version of our system had no memory at all. Zero. Zilch. The kind of amnesia that makes Memento look like a documentary about a guy with excellent recall. It generated the same generic hooks regardless of who was using it. A fitness creator and a fintech newsletter writer got output in the same tone. That's obviously wrong, but the fix isn't obvious.

The naive approach: build a preferences form. Let the creator pick "casual" or "professional," choose their platforms, list words to avoid. We built that too (it helps with the cold start problem, and honestly it makes the onboarding screen look impressive in screenshots). But it turns out people are terrible at describing their own style. A creator will tell you "I'm casual and direct" and then consistently reject every draft that doesn't open with a specific data point. Ask anyone what kind of music they like. Now watch what they actually play in the car. Two completely different playlists.

Their stated preferences and their revealed preferences are two different universes. We needed to observe the second one.

That's where Hindsight enters the picture. Instead of asking creators to describe themselves (an exercise roughly as accurate as asking a cat to describe its relationship with furniture), we observe what they do and store those observations as memories.

The loop in plain English: The creator reviews drafts and edits/approves/rejects them. Each action fires retain_diff_observation(), which stores a tagged observation in Hindsight's memory bank. On the next pipeline run, recall() fetches relevant memories and reflect() synthesizes them into a compact paragraph. That paragraph gets injected into Claude's generation prompt. The drafts come back better. The creator edits less. The loop tightens. Rinse and repeat until the AI sounds more like the creator than the creator does on a Monday morning.

Here's the function that fires every time someone saves an edited draft. It's not clever. It doesn't need to be:

async def retain_diff_observation(
    before: str, after: str, platform: str, content_type: str
) -> None:
    if not before or not after or before.strip() == after.strip():
        return

    before_words = before.split()
    after_words  = after.split()
    delta        = len(after_words) - len(before_words)
    pct          = abs(delta) / max(len(before_words), 1) * 100

    if delta < -10 or pct > 30:
        action = "significantly shortened"
    elif delta < 0:
        action = "trimmed"
    elif delta > 10 or pct > 30:
        action = "significantly expanded"
    else:
        action = "rewrote (same length)"

    observation = (
        f"Creator {action} a {platform} {content_type} draft. "
        f'Before ({len(before_words)} words): "{before[:160]}" → '
        f'After ({len(after_words)} words): "{after[:160]}"'
    )

    await retain_observation(
        observation,
        tags=["editing-behaviour", "draft-edit", platform, content_type],
    )

Nothing fancy happening here. No PhD required. We compute a rough diff, classify whether it was a trim, expansion, or rewrite, and store a human-readable observation via Hindsight's retain API. The tags are important, though. They let us query memories by type later without building our own taxonomy from scratch. (We tried building our own taxonomy once. It lasted two days before we set it on fire.)

The real magic shows up on the next pipeline run. Before we generate any content, we call two functions that sound like they belong in a therapist's office:

recall_result = await recall_memories(
    query="How does this creator prefer their content? "
          "Hook styles, editing preferences."
)
reflection = await reflect_on_creator(
    query="Summarise this creator's content preferences, "
          "voice, and style."
)

recall fetches the raw observations. reflect synthesizes them into a compact paragraph, like a friend who can summarize your entire personality in three sentences (and is somehow right about all of them). That paragraph gets injected straight into the generation prompt as a ## Creator Voice & Preferences section. Claude doesn't need fine-tuning. It just needs good context, and Hindsight provides exactly that.

The result is genuinely satisfying to watch. After three or four review sessions, the system stops generating hooks with exclamation marks for creators who always delete them. It starts opening LinkedIn posts with data points for creators who approve those. It learns that one creator shortens every tweet to under 120 characters and begins generating tighter drafts automatically. No retraining, no config files, no "please describe your brand voice in 500 words" forms. Just memory doing what memory does.

Keeping It Cheap: Where cascadeflow Fits

Here's a problem we didn't anticipate, which in retrospect we absolutely should have. After a creator completes a review session, we need to analyze all their editing events and extract style observations. That's an LLM call. We also use LLM calls during moment extraction from transcripts, and for the actual brief generation. Each project can easily rack up 15 to 20 LLM calls. Our moment extraction needs to be accurate (you can't miss the best 45 seconds of a 30-minute video, that's literally the whole point), but our synthesis calls are simple text classification that a slightly motivated intern could do.

Using the same expensive model for everything felt like hiring a Michelin-star chef to make toast. Sure, the toast would be excellent. But your budget would be gone by Tuesday. That's where cascadeflow saved us real money and possibly our sanity.

cascadeflow lets you define a drafter model (fast and cheap) and a verifier model (slower and accurate), and it routes each call to whichever model meets a quality threshold you set. The key insight is that different parts of the pipeline have different accuracy requirements:

def get_extraction_agent() -> CascadeAgent:
    """Higher quality: moment extraction needs accuracy."""
    return build_agent(quality_threshold=0.8)

def get_generation_agent() -> CascadeAgent:
    """Lower threshold: drafter handles most generation cheaply."""
    return build_agent(quality_threshold=0.65)

def get_synthesis_agent() -> CascadeAgent:
    """Synthesis uses drafter only: observations are simple text."""
    return build_agent(quality_threshold=0.5)

Three agents, three thresholds, same two underlying models. It's like having a junior dev and a senior dev, and only paging the senior when the junior says "I'm not sure about this one." The synthesis agent (which extracts observations like "creator always removes filler words") runs on the drafter almost exclusively because the task is straightforward. The extraction agent, which needs to identify the strongest 30-second moments from a transcript, escalates to the verifier more often because getting that wrong means the whole project is useless. We didn't have to think about routing logic or write a single if/else. We just set a quality number and cascadeflow handles the rest.

We also built a CostAccumulator that tracks every call and computes what we would have spent if everything went to the expensive model. Think of it as the "what if we hadn't been smart about this" meter:

@dataclass
class CostAccumulator:
    total_calls: int = 0
    drafter_calls: int = 0
    verifier_calls: int = 0
    total_cost_usd: float = 0.0

    def record(self, result: Any) -> None:
        self.total_calls += 1
        model_used = getattr(result, "model_used", "") or ""
        cost = getattr(result, "total_cost", 0.0) or 0.0
        self.total_cost_usd += cost
        if settings.cascade_drafter_model in model_used:
            self.drafter_calls += 1
        else:
            self.verifier_calls += 1

This surfaces in the UI as a cost breakdown per project. Creators don't care about it (they shouldn't have to), but we stare at it like it's a stock ticker. It tells us that roughly 70 to 80 percent of calls get handled by the drafter, which means cascadeflow is doing exactly what we hoped: using the expensive model only when it actually matters. The other 20 to 30 percent? Those are the calls where quality genuinely required the bigger model. We sleep better knowing we're not burning money on tasks a smaller model handles perfectly.

The Seven-Stage Pipeline

The full pipeline is a background task triggered when a creator submits a YouTube URL. Each stage updates the project status in Supabase so the frontend can show a live progress stream (because nothing says "your money's being well spent" like a progress bar that actually moves). Here's the sequence:

Ingest: Detect input type, try YouTube auto-captions first (faster than transcription).
Transcribe: If no captions exist, download audio via yt-dlp and run Groq Whisper. Files over 25 MB get automatically chunked.
Recall: Pull creator memory from Hindsight. If no memories exist yet (cold start), the system falls back to universal heuristics.
Extract: Claude Haiku 4.5 analyzes the transcript in parallel chunks and identifies the 3 to 5 strongest moments. Each moment is scored on hook quality, narrative arc, and standalone clarity.
Clip: yt-dlp downloads just the relevant segments (no full-video download), ffmpeg crops to 9:16 vertical. All clips extracted in parallel.
Generate: Claude Sonnet 4.5 produces a full production brief per moment per platform, with the creator's memory reflection injected into every prompt.
Review: The creator edits, approves, or rejects. Every action feeds back into Hindsight. The loop closes.

The pipeline runs entirely as an async background task. If clip extraction fails for a particular moment (network issues, YouTube deciding it doesn't like you today), it falls back to embedding a YouTube player with start/end timestamps. The frontend never shows a broken state. This is important. Nothing kills user trust faster than a loading spinner that never stops.

The Intelligence Graph: Making Memory Visible

Here's something we learned the hard way: if the system is learning from you, you have to let the user see what it learned. Nobody trusts a black box. Especially not creators who've spent years building a personal brand and have strong opinions about whether they're "witty" or "sarcastic" (it's always sarcastic, by the way).

So we built a knowledge graph visualization. It runs three parallel Hindsight recall queries across different memory domains (style/tone, editing behavior, profile preferences), deduplicates by text, and classifies each memory into a node type using the tags that Hindsight already stores:

def _classify_memory_with_tags(text: str, tags: list[str]) -> tuple[str, str, float]:
    """Classify using real Hindsight tags first, fall back to keyword heuristic."""
    for tag in (tags or []):
        kind = _TAG_KIND.get(tag.lower().strip())
        if kind:
            return kind, _short_label(text), 0.80
    return _classify_memory(text)

Tags from Hindsight act as a first-class classification signal. We only fall back to keyword heuristics when tags are missing (which is like using a map when your GPS dies, except the map is made of regex and tears). The graph renders with five node types: root (the creator identity), traits (tone, style), platforms, preferences (editing patterns), and topics (niche). Edges encode semantic similarity, temporal proximity, and causal relationships (platform preferences causing editing patterns).

The creator can hover over any node and see the full observation text. It's the difference between "the AI is learning" and "here's exactly what the AI thinks it knows about you, and you can see it evolving in real time." One tester told us it felt like reading their own therapy notes. We're choosing to take that as a compliment.

Lessons Learned (a.k.a. Things We Wish Someone Had Told Us)

Memory is more useful than configuration. We have a preferences page. Creators fill it out during onboarding. But the observations extracted from actual editing behaviour are consistently more specific and more accurate than what people self-report. "I prefer casual tone" is less useful than "Creator consistently removes the word 'essentially' and shortens hooks to under 8 words." If you're building a personalization system, observe behavior first and ask questions second. People don't know what they want until they see what they don't want.

Quality thresholds beat manual routing. We initially wrote if task_type == "synthesis": use_cheap_model() branching logic. It was ugly. It was fragile. It was the kind of code that makes future-you send angry Slack messages to past-you. Replacing that with cascadeflow's quality threshold was simpler and more robust. The threshold is a single number, and the system figures out when to escalate. We spent less time debugging routing decisions and more time tuning the threshold values themselves, which is the right knob to turn.

Tag your memories at write time. We initially stored observations as plain text and tried to classify them at read time using keyword heuristics. It worked about as well as you'd expect, which is to say it didn't. Switching to Hindsight's tag parameter (e.g., tags=["editing-behaviour", "draft-edit", "linkedin"]) meant that recall queries and graph construction could use structured metadata instead of parsing free text. The 30 seconds of effort at write time saved us hours of heuristic maintenance and approximately three existential crises.

Show the user what you learned. The intelligence graph isn't a gimmick. In our early testing, creators would see observations they disagreed with ("Creator prefers formal tone" when they thought they were casual) and immediately edit a few more drafts to correct the signal. The memory system self-corrected because the creator could see the model's assumptions and naturally generated counter-evidence. Transparency creates a better feedback loop than any amount of prompt engineering. Your users will train your system for free if you just show them what it thinks.

Fail gracefully at every stage. YouTube throttles downloads. Whisper sometimes hallucinates timestamps (it once confidently transcribed silence as a TED talk). Claude occasionally returns malformed JSON that would make a parser weep. Every stage of our pipeline has a fallback: failed clips become YouTube embeds, missing memories trigger universal heuristics, broken JSON gets regex-parsed as a last resort. The system never shows a blank screen. It always shows something, and that something improves as the infrastructure cooperates. Ship the 80% solution. The remaining 20% will fix itself when you're not looking.

推荐订阅源

DEV Community