Steven · 6 min read

Why AI Transcription Mishears Technical Terms (and How We Fixed It)

A live session heard "what is the pointer in C++" as "what is the point in life". Here is the forensic trail from that transcript to GeekBye v2.0.11 — keyterm biasing, a connection-dropping race, and the day our own fix backfired.

Transcription
Reliability
Engineering
GeekBye Releases

On July 2nd we ran a test session and asked GeekBye a simple question out loud: "What is the pointer in C++?"

The live transcript answered with poetry:

[23:16:37] You: Tell me, what is the point in life?
[23:16:52] You: Handy Plus.
[23:17:02] You: What the pointer in Plus Plus?
[23:17:09] You: C.

In the same session, the health metrics told the rest of the story: 3 dropped transcription connections in 163 seconds and a 51-second hole in the transcript. One more clue turned out to matter most. Our post-session recovery pass, which re-transcribes locally saved audio to fill gaps, got the sentence almost right: "a pointer in plus, plus? What the pointer in plus, plus C++."

The audio was fine. The live model just had no reason to expect C++.

This is the story of GeekBye v2.0.11, told from the actual transcripts and production logs.

Why speech models mishear your vocabulary

Speech recognition is a prediction problem. Given ambiguous audio, the model picks the likeliest words — and for a general-purpose model, "point in life" is a far likelier phrase than "pointer in C++". Every engineer who has watched a meeting transcript render Kubernetes as "cube and eddies" has met this failure.

The fix is not a better microphone. It's keyterm biasing: telling the model, before the session starts, which unlikely words are likely for you. Our speech provider supports up to 50 biasing terms per session. Here's the embarrassing part: the plumbing for those terms existed end-to-end in our stack — client, backend, provider — and nothing had ever populated it. Every session ran with zero domain help.
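A toy way to see what biasing does, in Python: rescore a short list of candidate transcripts and add a bonus for every biasing term a candidate contains. This is an illustration under stated assumptions, not how our provider works internally; real providers typically apply biasing inside decoding, and the scores, the `boost` value, and `rescore` itself are hypothetical. Only the 50-term cap comes from our provider's limit.

```python
MAX_KEYTERMS = 50  # per-session biasing cap supported by our provider


def rescore(candidates: dict[str, float], keyterms: list[str], boost: float = 2.0) -> str:
    """Return the best candidate after a toy keyterm bonus.

    `candidates` maps each hypothesis to a hypothetical log-probability.
    Every keyterm found in a hypothesis (case-insensitive substring match)
    adds `boost` to its score.
    """
    terms = [t.lower() for t in keyterms[:MAX_KEYTERMS]]  # respect the cap

    def biased(text: str) -> float:
        lowered = text.lower()
        return candidates[text] + boost * sum(term in lowered for term in terms)

    return max(candidates, key=biased)


# Hypothetical scores: a general-purpose prior prefers the everyday phrase.
hyps = {
    "what is the point in life": -4.0,
    "what is the pointer in C++": -5.5,
}
print(rescore(hyps, []))       # no biasing: the likelier everyday phrase wins
print(rescore(hyps, ["C++"]))  # the "C++" bonus lifts the technical reading
```

In this toy, a common word like "For" in the list would hand the same bonus to ordinary phrases, which is why the list should hold only distinctive terms.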

Fix 1: your profile becomes the model's vocabulary

GeekBye already knows your domain — it's in your active profile. v2.0.11 derives biasing keyterms from the profile's name and description: terms with symbols (C++, Node.js), acronyms (SQL, AWS), camel-case names (TypeScript, PostgreSQL), and proper nouns. A profile mentioning your stack now makes that stack expected rather than exotic.
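A minimal sketch of the three mechanical categories in Python, assuming plain regular expressions over the profile text. `derive_keyterms`, the patterns, and the sample profile are hypothetical, and proper nouns are deliberately left out here because, as the next section shows, they needed a stricter rule.

```python
import re

# Hypothetical patterns for the mechanical categories.
PATTERNS = [
    re.compile(r"\b[A-Za-z]\w*(?:\+\+|#|\.[A-Za-z]+)"),  # symbols: C++, C#, Node.js
    re.compile(r"\b[A-Z]{2,}\b"),                         # acronyms: SQL, AWS
    re.compile(r"\b[A-Za-z]*[a-z][A-Z][A-Za-z]*\b"),      # camel-case: TypeScript, PostgreSQL
]


def derive_keyterms(profile_text: str, limit: int = 50) -> list[str]:
    """Collect distinctive terms in order of first appearance, capped at `limit`."""
    hits = []
    for pattern in PATTERNS:
        hits.extend((m.start(), m.group()) for m in pattern.finditer(profile_text))
    ordered = [term for _, term in sorted(hits)]  # sort by position in the text
    return list(dict.fromkeys(ordered))[:limit]   # dedupe, keep first occurrence


profile = "Backend engineer: C++ services, Node.js tooling, SQL on AWS, TypeScript and PostgreSQL."
print(derive_keyterms(profile))
# ['C++', 'Node.js', 'SQL', 'AWS', 'TypeScript', 'PostgreSQL']
```

The patterns are deliberately narrow but not perfect: an abbreviation such as "e.g." would also match the symbol pattern, so a production list benefits from a stoplist.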

The day the fix made everything worse

Our first version treated every capitalized word as a proper noun. On an internal test build (this never reached customers), a profile written in prose shipped this biasing list to the model:

Senior, Writing, Direct, For, Includes, Write, Role, Intent…

Biasing a speech model toward the word "For" is worse than not biasing it at all. In the very next test session, the word "speak", spoken plainly and multiple times, came back as "Clicky", "Hey, Vicky", and "Peter Paderty".

The lesson cost us one afternoon: bias only with distinctive terms. Capitalized words now count only when they appear mid-sentence (a genuine proper-noun signal); markdown headings, where every word is capitalized, never contribute. That same profile now derives exactly four terms: LinkedIn, AI, CEO, and MCP. The validation session transcribed multilingual, fast-switching audio correctly for 199 straight seconds: 189 transcript segments, zero errors.
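The refined proper-noun rule, sketched in Python under simple assumptions: sentences split on terminal punctuation, a markdown heading is any line starting with `#`, and all-caps words are left to the acronym rule. `proper_noun_candidates` and the sample profile are hypothetical; the `naive` line reproduces the first version's mistake for comparison.

```python
import re

HEADING = re.compile(r"^\s*#{1,6}\s")           # markdown heading line
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary
WORD = re.compile(r"[A-Za-z][\w'-]*")


def proper_noun_candidates(profile_text: str) -> list[str]:
    """Capitalized words that appear mid-sentence, outside markdown headings."""
    found = []
    for line in profile_text.splitlines():
        if HEADING.match(line):
            continue  # headings capitalize every word, so they carry no signal
        for sentence in SENTENCE_SPLIT.split(line):
            for word in WORD.findall(sentence)[1:]:  # skip the sentence-initial word
                if word[0].isupper() and not word.isupper():
                    found.append(word)
    return list(dict.fromkeys(found))  # dedupe, keep first occurrence


profile = (
    "# Senior Writing Role\n"
    "Write direct posts for LinkedIn readers.\n"
    "Includes drafts on Kubernetes upgrades."
)
naive = [w for w in WORD.findall(profile) if w[0].isupper()]
print(naive)                            # ['Senior', 'Writing', 'Role', 'Write', 'LinkedIn', 'Includes', 'Kubernetes']
print(proper_noun_candidates(profile))  # ['LinkedIn', 'Kubernetes']
```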

Fix 2: the race that was dropping connections

The keyterms explained the misheard words. They did not explain the three dropped connections.

That trail led somewhere subtler. Our provider commits (finalizes) transcription on its own voice-activity detection, about one second into silence. Our client also sends a safety commit 250 milliseconds into silence, to flush any hanging partial sentence. The provider's confirmation that it has committed takes one to three seconds to travel back. Put those three numbers together: the client's view of the provider's buffer was always one to three seconds out of date, far longer than its own 250-millisecond timer. So whenever the provider committed first, our safety commit fired against an almost-empty buffer, and the provider's response was not a polite rejection. It dropped the connection. Every pause in speech was a coin flip.
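A back-of-the-envelope model of that window in Python. It assumes a shared clock, a fixed confirmation latency, and audio streaming continuously, so that right after the provider's own commit its buffer holds only what was sent since. `client_commit_outcome` is hypothetical and illustrates the timing, not our client code.

```python
def client_commit_outcome(provider_commit_ms: int, client_commit_ms: int,
                          confirm_latency_ms: int) -> tuple[bool, int]:
    """Return (client_knew_provider_committed, audio_ms_left_in_provider_buffer).

    Assumes a shared clock and continuous audio streaming, so after the
    provider's VAD commit its buffer holds only audio sent since that commit.
    """
    confirmation_arrives_ms = provider_commit_ms + confirm_latency_ms
    client_knew = client_commit_ms >= confirmation_arrives_ms
    buffered_ms = max(0, client_commit_ms - provider_commit_ms)
    return client_knew, buffered_ms


# Provider commits at t=0; our safety commit fires 178 ms later.
for latency_ms in (1000, 2000, 3000):
    print(latency_ms, client_commit_outcome(0, 178, latency_ms))
# Every line shows (False, 178): the client commits blind against a nearly empty buffer.
```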

v2.0.11 ships two layers against this:

  1. In the app: when a committed transcript arrives, the client now knows the provider's buffer was just flushed and skips the redundant safety commit.
  2. In our backend, the same day: the proxy that sits between the app and the provider mirrors the provider's audio accounting exactly — it sees every audio frame and every commit confirmation with zero latency — and simply refuses to forward any commit the provider would reject. This one protects every client version at once, including users who haven't updated.
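Both layers can be sketched in a few lines of Python. The class names, method names, and especially `min_commit_ms` are hypothetical: the provider's real acceptance rule is its own, and this sketch treats the proxy-to-provider hop as effectively instantaneous, as described above. The guard counts audio forwarded since the provider last confirmed a commit and declines to forward a commit that would land on too little audio.

```python
from dataclasses import dataclass


@dataclass
class SafetyCommitPolicy:
    """Layer 1 (client): skip the safety commit right after a committed transcript."""
    provider_just_flushed: bool = False

    def on_committed_transcript(self) -> None:
        self.provider_just_flushed = True   # the provider's buffer was just emptied

    def on_speech(self) -> None:
        self.provider_just_flushed = False  # new speech refills the buffer

    def should_send_safety_commit(self) -> bool:
        return not self.provider_just_flushed


@dataclass
class CommitGuard:
    """Layer 2 (proxy): mirror the provider's uncommitted audio, block doomed commits."""
    min_commit_ms: int     # hypothetical stand-in for the provider's real rule
    buffered_ms: int = 0   # audio forwarded since the provider's last commit
    intercepted: int = 0

    def on_audio_frame(self, frame_ms: int) -> None:
        self.buffered_ms += frame_ms  # every frame passes through the proxy

    def on_provider_committed(self) -> None:
        self.buffered_ms = 0  # confirmation of any commit, VAD-initiated or forwarded

    def should_forward_commit(self) -> bool:
        if self.buffered_ms < self.min_commit_ms:
            self.intercepted += 1
            return False      # swallow it here instead of losing the connection
        return True


guard = CommitGuard(min_commit_ms=300)  # hypothetical threshold
guard.on_audio_frame(2000)
guard.on_provider_committed()           # the provider's VAD commit lands first
guard.on_audio_frame(178)               # a sliver of audio follows
print(guard.should_forward_commit(), guard.intercepted)  # False 1
```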

We watched it work in production within the hour. The guard intercepted doomed commits carrying 178ms and 256ms of buffered audio — each one, before that day, a guaranteed dropped connection and a gap in someone's meeting notes. A 60-minute continuous session that afternoon recorded five interceptions and zero drops. Before the fix, a real user that same morning had restarted their recording five times in six minutes fighting exactly this bug.

Two smaller fixes that ride along

AI insights now wait for substance. Those garbled early fragments used to feed GeekBye's live suggestion chips, which confidently produced topics like "Defining Life's Ultimate Purpose" from a misheard C++ question. Suggestions now wait until the session has real conversational mass.
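One minimal way to express a substance threshold, sketched in Python, is a gate on word and segment counts. `InsightGate` and both thresholds are hypothetical placeholders rather than the values GeekBye uses; the fragments fed in are the four garbled lines from our test session.

```python
from dataclasses import dataclass


@dataclass
class InsightGate:
    """Hold suggestion chips until the transcript carries enough substance."""
    min_words: int = 60     # hypothetical threshold
    min_segments: int = 5   # hypothetical threshold
    words: int = 0
    segments: int = 0

    def add_segment(self, text: str) -> None:
        self.segments += 1
        self.words += len(text.split())

    def ready(self) -> bool:
        return self.words >= self.min_words and self.segments >= self.min_segments


gate = InsightGate()
for fragment in ["Tell me, what is the point in life?", "Handy Plus.",
                 "What the pointer in Plus Plus?", "C."]:
    gate.add_segment(fragment)
print(gate.words, gate.segments, gate.ready())  # 17 4 False
```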

Recovered text gets the right speaker. The recovery pass that transcribed our C++ question correctly had attributed it to "Them". The locally saved audio timeline now records who was speaking, so recovered segments attribute correctly to You or Them.
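A minimal sketch of that attribution in Python, assuming the timeline stores speaker turns as start and end times and that a recovered segment goes to whichever speaker's turns overlap it most. `Turn`, `attribute`, and the sample times are hypothetical.

```python
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    start_s: float
    end_s: float
    speaker: str  # "You" or "Them"


def attribute(seg_start_s: float, seg_end_s: float,
              timeline: list[Turn]) -> Optional[str]:
    """Return the speaker with the most overlap, or None if no turn overlaps."""
    overlap: dict[str, float] = {}
    for turn in timeline:
        shared = min(seg_end_s, turn.end_s) - max(seg_start_s, turn.start_s)
        if shared > 0:
            overlap[turn.speaker] = overlap.get(turn.speaker, 0.0) + shared
    if not overlap:
        return None
    return max(overlap, key=lambda speaker: overlap[speaker])


timeline = [Turn(0.0, 12.0, "Them"), Turn(12.0, 30.0, "You")]
print(attribute(14.0, 28.0, timeline))  # You
```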

The scoreboard

Metric (measured, not estimated)       | Before          | After (v2.0.11 + backend guard)
Connection drops in the test session   | 3 in 163s       | 0
Longest transcript hole                | 51s             | ~6s (worst gap in validation)
How "pointer in C++" was transcribed   | "point in life" | correct, with biased vocabulary
Doomed commits reaching the provider   | all of them     | 0 (intercepted at the backend)

If you're building on realtime speech APIs

Three transferable lessons from this release:

  1. Feed the biasing feature. If your STT provider supports keyterms/phrase hints, populating it with a small, distinctive vocabulary is the cheapest accuracy win available — and populating it with common words is an accuracy loss.
  2. Never race the provider's own state machine from the wrong side of a network round-trip. Our client could not win a 250ms-vs-3s information race. The guard belongs where both signals converge — for us, the backend proxy.
  3. Validate on a live build before publishing. The keyterms regression was caught because every GeekBye release is tested as a signed, notarized build against production before it ships. The bad version existed for a few hours on one internal machine, not on your Mac.

GeekBye v2.0.11 is live now — if you're on v2, you already have it via auto-update. For the release before this one — the meeting-ending idle timer and the desktop-locking crash — see why your AI notetaker stops recording mid-meeting. For the reliability groundwork both build on, see why your AI notetaker stops on bad Wi-Fi and what changed in GeekBye v2. For how live transcription works day to day, start with real-time transcription in GeekBye.