You can't benchmark an AI notetaker against a real meeting — you don't know the right answer. So I generated the meeting.

I wanted to know which AI notetaker transcribes most accurately — Granola, Fathom, or Otter. So I did the obvious thing: I recorded a real meeting, ran it through all three, and compared the transcripts.

That experiment is worthless, and it took me one afternoon to see why. To score a transcript you need the correct transcript to score it against. But the only record of what was actually said in my meeting was… the transcripts I was trying to grade. I was marking the exam with the students' own answers. There was no answer key.

The fix turned out to be the interesting part, and it's a trick worth stealing for any speech-to-text evaluation: if you don't have ground truth, manufacture it. Write the script first, synthesize the audio from it, and now the exact words are something you typed — not something you have to reconstruct after the fact.

Generate the meeting, keep the answer key

I wrote an 80-second, two-speaker product meeting and deliberately stuffed it with the tokens that actually matter in a work call and that ASR engines love to fumble: quarter labels (Q3, Q2), percentages (5.2%, 6.8%, 41%, 58%), dollar figures ($16 → $19), jargon (churn, cohort, activation, SSO, deep links, p95, P1), names (Sarah, David, Priya, Marcus), and a few real action items with deadlines.

Then I gave each speaker a distinct ElevenLabs voice and rendered the turns through the API. The whole harness is a bash script — two voice IDs, one text-to-speech call per line, and ffmpeg to stitch the turns together with a beat of silence between them so the diarizers have a clean boundary to find:

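# WORK (scratch dir), SIL (silence clip) and LIST (concat list) are assumed to be set earlier in the script
SARAH=21m00Tcm4TlvDq8ikWAM   # Rachel (female)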
DAVID=pNInz6obpgDQGcFmaJgB   # Adam (male)
MODEL=eleven_multilingual_v2

gen () { # $1 voice  $2 outfile  $3 text
  curl -sS -X POST "https://api.elevenlabs.io/v1/text-to-speech/$1" \
    -H "xi-api-key: $ELEVENLABS_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(python3 -c 'import json,sys; print(json.dumps({"text":sys.argv[1],"model_id":sys.argv[2]}))' "$3" "$MODEL")" \
    --output "$WORK/$2"
  # fail loud if the API returned a JSON error instead of audio
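  # -b drops the filename from file's output so a path containing "text" can't trigger a false alarm
  if file -b "$WORK/$2" | grep -Eqi 'json|text'; then echo "ERROR in $2:" >&2; cat "$WORK/$2" >&2; exit 1; fi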
}

gen $SARAH t00.mp3 "Morning, David. Before we start, did the Q3 churn numbers come in?"
gen $DAVID t01.mp3 "They did. We closed at five point two percent monthly churn, down from six point eight in Q2..."
# ...eight more turns...

# concat in order with a short silence between turns for cleaner diarization
ffmpeg -y -f lavfi -i anullsrc=r=44100:cl=mono -t 0.4 -q:a 9 "$SIL" >/dev/null 2>&1
: > "$LIST"   # start from an empty list so re-runs don't append duplicate entries
for i in 00 01 02 03 04 05 06 07 08 09; do
  echo "file '$WORK/t$i.mp3'" >> "$LIST"; echo "file '$SIL'" >> "$LIST"
done
ffmpeg -y -f concat -safe 0 -i "$LIST" -c copy "$WORK/meeting.mp3"

A couple of small things that matter in practice:

Spell numbers as words in the input ("five point two percent"), or the TTS will read "5.2%" inconsistently. You want the audio to be unambiguous; the notetaker's job is to turn it back into "5.2%", and whether it does is exactly what you're testing.
Fail loud. When something's wrong (bad voice ID, exhausted quota), the ElevenLabs endpoint answers with a JSON error body instead of audio, and curl without --fail writes that body to the output file just as it would an MP3. Stitch it into meeting.mp3 and you'll "transcribe" a corrupt file and not know why. The file check above bails instead of silently producing garbage.
Insert real silence between turns. Diarizers lean on pauses to find speaker boundaries. A 0.4s gap is realistic and gives every tool a fair shot at separating Sarah from David.
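
One consequence of spelling numbers out is that the text sent to the TTS engine is no longer the text a good transcript should contain. Here is a minimal sketch, in Python, of deriving the written-form answer key from the spoken-form lines. The `SPOKEN_TO_WRITTEN` table and the `to_written` helper are hypothetical names for illustration, not part of the original harness:

```python
# Hypothetical mapping from the spoken form fed to TTS to the written form
# a transcript is expected to contain. Extend it with every number in your script.
SPOKEN_TO_WRITTEN = {
    "five point two percent": "5.2%",
    "six point eight": "6.8",
}

def to_written(spoken: str, table: dict[str, str] = SPOKEN_TO_WRITTEN) -> str:
    """Return the answer-key (written) form of a spoken-form line."""
    written = spoken
    # Replace longer phrases first so a short phrase never clobbers part of a longer one.
    for phrase in sorted(table, key=len, reverse=True):
        written = written.replace(phrase, table[phrase])
    return written

line = "We closed at five point two percent monthly churn, down from six point eight in Q2"
print(to_written(line))  # We closed at 5.2% monthly churn, down from 6.8 in Q2
```

The match is case-sensitive and purely literal, which is fine for a script you typed yourself. It would not be a safe way to normalize arbitrary text.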

Now I had a clip and a text file of exactly what's in it. The answer key is just the script I wrote. Same clip, three tools, one rubric.
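
With an answer key in hand, the rubric's headline number is word error rate: word-level edit distance divided by the length of the reference. Below is a minimal sketch in Python. It assumes each tool's transcript has been exported to plain text and that the reference is in written form ("5.2%", not "five point two percent"). The `normalize` rules and both function names are my own illustrative choices, not something the tools or any standard prescribes:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase and tokenize, keeping numbers like 5.2%, $16 and q3 as single tokens."""
    return re.findall(r"\$?\d+(?:\.\d+)?%?|[a-z0-9']+", text.lower())

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from the hypothesis
                curr[j - 1] + 1,         # extra word in the hypothesis
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "Before we start, did the Q3 churn numbers come in?"
hypothesis = "Before we start, did the Q churn numbers come in?"
print(f"{wer(reference, hypothesis):.3f}")  # 0.100
```

One dropped digit in a ten-word sentence costs the same as any other single-word slip, which is why this number alone says little about the tokens that matter.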

The surprise: clean-audio accuracy is a wash

Here's the result I didn't expect. On clean, two-voice audio with no crosstalk, all three are essentially excellent. Otter clocked in around 99% word accuracy. Fathom's transcript was the most accurate of the three. Granola kept the substance and garbled maybe a line or two. If you rank these tools by raw word error rate (WER) on a clean clip, you basically can't tell them apart: they're all near the ceiling.

Which means raw accuracy is the wrong thing to benchmark. It's table stakes. The differences that decide which tool you should actually use live in two places the overall WER number hides: what each one does with the handful of tokens that carry the meeting's meaning, and whether it tells you who said what.

Scoring against the answer key, only on cells I could verify from the actual transcripts:

Tool	Clean-audio accuracy	Quarter labels (Q3 / Q2)	Speaker labels
Fathom	most accurate of the three; summary numbers all correct	kept "Q3 churn", "6.8 in Q2"	tracks owners in the summary (David, Priya)
Granola	near-perfect; garbled a line or two	kept "Q2"; wrote "P95" and "$16 → $19" cleanly	none on an ad-hoc capture — one unlabeled stream
Otter	~99% word accuracy	collapsed "Q3"/"Q2" to "Q"; "tag it P1" → "tag at p1"	correct diarization — the only one that labeled the two speakers

Read that table and the headline accuracy number inverts in importance. Otter, at roughly 99% word accuracy, is the one that quietly collapsed both "Q3" and "Q2" into a bare "Q" and turned "tag it P1" into "tag at p1". In a business meeting the quarter is not a throwaway word; it's the thing the whole churn number is anchored to, just as "P1" is the difference between a ticket that pages someone and one that doesn't. Meanwhile Otter is the only one that reliably told me David said the latency line and Sarah owned the pricing test, because it's the only one that diarizes. Granola, bot-free and great on the numbers, handed me one unlabeled stream of text for an ad-hoc capture: fine for a solo note, useless if you need attribution.
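
The per-token comparison is easy to automate separately from the overall accuracy number. Here is a sketch in Python, assuming plain-text transcripts. `CRITICAL_TOKENS` is an illustrative subset of the tokens planted in the script, and `token_report` is a hypothetical helper, not part of the original harness:

```python
import re

# Illustrative subset of the planted tokens; use the ones from your own script.
CRITICAL_TOKENS = ["Q3", "Q2", "5.2%", "$16", "$19", "p95", "P1"]

def token_report(transcript: str, tokens: list[str] = CRITICAL_TOKENS) -> dict[str, bool]:
    """Map each planted token to whether it appears as a standalone token (case-insensitive)."""
    report = {}
    for tok in tokens:
        # Lookarounds instead of \b so tokens that start with "$" or end with "%" still match,
        # while "Q" alone never counts as "Q3" and "15.2%" never counts as "5.2%".
        pattern = r"(?<!\w)" + re.escape(tok) + r"(?!\w)"
        report[tok] = re.search(pattern, transcript, flags=re.IGNORECASE) is not None
    return report

sample = "did the Q churn numbers come in, and tag at p1"
missing = [tok for tok, found in token_report(sample, ["Q3", "P1"]).items() if not found]
print(missing)  # ['Q3']
```

Note what this check does not catch: "tag at p1" still contains "P1", so the "it" to "at" slip only shows up in the word-level score. The two measures complement each other.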

So the "best" tool flatly depends on what your meetings need, and it isn't the WER ranking:

If you need speaker attribution, only one of these reliably gives it to you on a casual capture.
If you need the numbers and jargon dead-on, a ~99% word-accuracy score is no guarantee: the tool that posted it is the one that fumbled the quarter labels.
If you take sensitive client calls, the bot-free tools change the calculus entirely (no visible recorder joining the call), and that's a different axis again.

I put the full head-to-head — the per-tool summaries, the privacy and free-tier trade-offs, and which one I'd hand to which kind of user — in our tested comparison of the best AI notetakers, because the short version ("they're all accurate") is true and also completely unhelpful for choosing one.

What the synthetic-ground-truth trick is good for (and not)

The methodology generalizes well beyond notetakers:

Any speech-to-text eval. Captions, voice-command parsers, call-center QA — if you script the audio, you get a free, exact reference transcript and a repeatable test you can re-run after every model update.
Adversarial token design. Because you write the script, you can plant the things that break engines on purpose: homophones, acronyms, numbers, code-switching, overlapping names. Real meetings rarely stress all of those in 80 seconds; a synthetic one can.
Regression tracking. The clip is deterministic input. Re-run it next quarter and you'll see whether a vendor's "improved" model actually improved on your hard tokens or just the easy ones.
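
For the regression case, it helps to persist each run's per-token results and diff them against the previous run. Here is a small sketch in Python, assuming each run produces a dict mapping token to found/not-found. The file layout and the function names are assumptions for illustration, and the values in the usage lines are made up:

```python
import json
from pathlib import Path

def save_report(path: str, report: dict[str, bool]) -> None:
    """Write one run's token results to disk as JSON."""
    Path(path).write_text(json.dumps(report, indent=2, sort_keys=True))

def load_report(path: str) -> dict[str, bool]:
    """Read a previously saved run."""
    return json.loads(Path(path).read_text())

def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Tokens the tool got right in the previous run but misses in the current one."""
    return sorted(tok for tok, ok in previous.items() if ok and not current.get(tok, False))

# Made-up values, purely to show the shape of the comparison.
last_quarter = {"Q3": True, "Q2": True, "P1": False}
this_quarter = {"Q3": False, "Q2": True, "P1": True}
print(regressions(last_quarter, this_quarter))  # ['Q3']
```

A token that is absent from the newer report counts as a regression here, on the theory that silently dropping a check should be as visible as failing it.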

The honest caveat, stated plainly: synthetic clean audio is the friendly case. Two distinct TTS voices with no crosstalk, no accents, no room noise, no people talking over each other is the easiest input these tools will ever see — which is exactly why it's good for isolating the meaning-carrying-token and diarization differences, and exactly why you should not read a 99% here as 99% in a real four-person standup. It's a controlled benchmark for comparing tools against each other on identical input, not a prediction of real-world accuracy. For that, you still have to test on your own messy calls — but now you have a clean baseline to measure the messiness against.

The whole harness is ~40 lines of bash. Swap in your own script, plant your own landmines, point it at whatever transcribers you're weighing, and you'll have an answer key nobody can argue with — because you wrote it first. If you've got a better way to get ground truth for an ASR comparison without hand-transcribing hours of audio, I'd genuinely like to hear it in the comments.
