Steven · 5 min read

The Day Our App DDoSed Itself

A backlog of pending uploads, released all at once on startup, turned every GeekBye client into a small denial-of-service attack on our own servers. The fix — and the connection-liveness ladder it forced us to build — is one of the most useful things v2 taught us.


Every distributed-systems engineer eventually meets the thundering herd: a mass of clients all doing the same thing at the same instant, and the shared resource behind them buckling. Usually it's someone else's clients. In GeekBye v2.0.1 it was ours, and the attacker was our own app.

How a notetaker attacks itself

GeekBye can upload recordings to your Google Drive. If a recording can't upload immediately — you were offline, the app closed, Drive hiccuped — it goes into a backlog to retry later. Sensible.

The failure was in when "later" happened. On the next launch, the app would try to drain the whole backlog at once. One user with a week of pending recordings became a burst of simultaneous upload requests the instant they opened the app. Multiply by every user launching in the morning, and our backend was being asked to authenticate, rate-check, and process a wall of traffic in the same few seconds — from clients that were all, technically, behaving.

Our backend did the right thing and rate-limited the flood. And that's where the second-order damage started. A rate-limit response is an HTTP 429, and 429 looks a lot like other failures if you're not careful:

  • The startup profile fetch got a 429 and the app treated it as auth failed — briefly logging users out of a perfectly valid session.
  • The transcription connection got a generic 429 and showed an audio-limit-reached upgrade prompt — telling paying users they'd hit a quota they hadn't.

So one root cause — an unpaced backlog — produced three visible symptoms: server strain, false logouts, and false upsells. The classic shape of a self-DoS: the load is bad, but the misinterpretation of the load's symptoms is what users actually feel.

The fix, in two layers

Stop the stampede. The upload backlog is now paced — requests spread out with backoff, and a cooldown after any 429 so the client backs away from a strained server instead of leaning harder on it. A thundering herd becomes an orderly queue.
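To make the pacing idea concrete, here is a minimal sketch of a paced backlog drain. It is not GeekBye's actual code: the class name `UploadPacer`, the delay values, the jitter factor, and the `upload` callback (assumed to return an HTTP status code) are all hypothetical choices for illustration.

```python
import random
import time
from typing import Callable, List, Sequence


class UploadPacer:
    """Drains a backlog one upload at a time, backing off when the server pushes back."""

    def __init__(
        self,
        base_delay: float = 1.0,     # hypothetical gap between uploads, in seconds
        max_delay: float = 60.0,     # hypothetical ceiling for exponential backoff
        cooldown_429: float = 30.0,  # hypothetical minimum pause after a 429
        sleep: Callable[[float], None] = time.sleep,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.cooldown_429 = cooldown_429
        self.sleep = sleep
        self.rng = rng
        self.failures = 0  # consecutive non-2xx responses

    def next_delay(self, status: int) -> float:
        """Return how long to wait before the next request, given the last status."""
        if 200 <= status < 300:
            self.failures = 0
            delay = self.base_delay
        else:
            self.failures += 1
            # Exponential backoff, capped at max_delay.
            delay = min(self.max_delay, self.base_delay * 2 ** self.failures)
            if status == 429:
                # The server asked us to slow down: never wait less than the cooldown.
                delay = max(delay, self.cooldown_429)
        # Up to +25% jitter so clients that launched together drift apart.
        return delay * (1.0 + 0.25 * self.rng())

    def drain(
        self,
        backlog: Sequence[str],
        upload: Callable[[str], int],
        max_attempts: int = 50,
    ) -> List[str]:
        """Upload items in order; return whatever is still pending for a later run."""
        pending = list(backlog)
        attempts = 0
        while pending and attempts < max_attempts:
            attempts += 1
            status = upload(pending[0])
            if 200 <= status < 300:
                pending.pop(0)
            delay = self.next_delay(status)
            if pending:
                self.sleep(delay)
        return pending
```

The point of the sketch is the shape, not the numbers: one request in flight at a time, a gap between requests even when things go well, and a longer pause after a 429 than after any other failure.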

Stop the misinterpretation. A transient 429 or 5xx during startup no longer logs you out — the client distinguishes "the server is briefly busy" from "your session is invalid." And a generic rate-limit 429 on the transcription path no longer masquerades as an audio-quota error; only a genuine quota response shows the upgrade prompt. The lesson that stuck: an error code is not an error meaning. 429 means "slow down," not "you're unauthorized" and not "you're out of quota" — and the client has to know the difference.
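One way to keep code and meaning apart is to route every response through a single classifier before any UI logic sees it. The sketch below is an illustration under assumptions, not our actual client: the `Outcome` names, the use of 401 for an invalid session, and the `audio_quota_exceeded` error code are hypothetical stand-ins for however your backend marks a genuine quota response.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "ok"
    RETRY_LATER = "retry_later"          # server is briefly busy: back off, keep the session
    SESSION_INVALID = "session_invalid"  # the only outcome that may log the user out
    QUOTA_EXCEEDED = "quota_exceeded"    # the only outcome that may show an upgrade prompt
    FAILED = "failed"                    # anything else: surface a generic error


# Hypothetical machine-readable code that the server would attach to a real quota error.
QUOTA_ERROR_CODE = "audio_quota_exceeded"


def classify(status: int, error_code: Optional[str] = None) -> Outcome:
    """Map an HTTP status (plus an optional body error code) to what it means."""
    if 200 <= status < 300:
        return Outcome.OK
    if status == 401:
        # Assumption: the backend uses 401 for an expired or invalid session.
        return Outcome.SESSION_INVALID
    if status == 429:
        # A bare 429 only means "slow down". It becomes a quota error
        # only when the server says so explicitly.
        if error_code == QUOTA_ERROR_CODE:
            return Outcome.QUOTA_EXCEEDED
        return Outcome.RETRY_LATER
    if 500 <= status < 600:
        return Outcome.RETRY_LATER
    return Outcome.FAILED
```

With a table like this, the startup profile fetch and the transcription path share one interpretation of 429, and neither can invent a logout or an upsell on its own.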

The liveness ladder that followed

Pacing the herd exposed a quieter problem: when a connection did go bad under load, how fast did we notice? Slower than we wanted. So v2.0.4 built out a connection-liveness ladder for real-time transcription:

  • Heartbeats — the transcription socket is pinged on a fixed interval, so a silently-dead connection is detected in seconds instead of waiting for the next chunk of audio to fail.
  • A network-up reconnect kick — when the OS reports the network is back, the app proactively re-establishes the connection instead of waiting to stumble into the discovery.
  • Control ping/pong liveness — an application-level round-trip that confirms not just "the socket is open" but "the other end is actually answering."
  • A service-owned reconnect gate — one place decides whether to reconnect, instead of several parts of the app racing to do it and stepping on each other.
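Two rungs of the ladder above, ping/pong liveness and the reconnect gate, are small enough to sketch. This is a simplified illustration, not the shipped implementation: the class names, the 5-second ping interval, and the 12-second pong timeout are hypothetical, and timestamps are passed in explicitly so the logic stays easy to test.

```python
import threading


class LivenessMonitor:
    """Tracks whether the far end has answered recently, not just whether the socket is open."""

    def __init__(self, ping_interval: float = 5.0, pong_timeout: float = 12.0, now: float = 0.0) -> None:
        self.ping_interval = ping_interval  # hypothetical: seconds between pings
        self.pong_timeout = pong_timeout    # hypothetical: silence tolerated before "dead"
        self.last_ping_sent = now
        self.last_pong_seen = now

    def ping_due(self, now: float) -> bool:
        return now - self.last_ping_sent >= self.ping_interval

    def record_ping(self, now: float) -> None:
        self.last_ping_sent = now

    def record_pong(self, now: float) -> None:
        self.last_pong_seen = now

    def is_alive(self, now: float) -> bool:
        # Alive means the other end answered within the timeout window.
        return now - self.last_pong_seen < self.pong_timeout


class ReconnectGate:
    """Single owner of the reconnect decision: at most one reconnect runs at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._reconnecting = False

    def request(self, reason: str) -> bool:
        """Return True if the caller may reconnect; False if a reconnect is already running."""
        with self._lock:
            if self._reconnecting:
                return False
            self._reconnecting = True
            return True

    def done(self) -> None:
        """Mark the current reconnect attempt as finished, whether it succeeded or not."""
        with self._lock:
            self._reconnecting = False
```

In this arrangement, a heartbeat timeout and a network-up event would both call `gate.request(...)`. Whichever arrives first performs the reconnect; the other sees `False` and stands down instead of racing.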

None of this is glamorous. All of it is why a v2 session feels like it just... stays connected.

It happened again this week — one tier down

Here's the part that makes this more than a war story. This same failure class resurfaced days ago, one level below us. A single client fired 21 transcription-session starts in under 400 milliseconds — and tripped our speech provider's account-wide limit of 15 concurrent requests. A stampede against shared infrastructure, exactly like the upload backlog, just aimed at a dependency instead of our own backend.

The shape is identical, and so is the fix: the stampeding client needs a single-flight guard so it can't fire 21 starts for one intent, and the shared resource needs a per-user cap so one client can't consume the pool. We've built the client-side version of this before: the upload pacer is this pattern. Now we apply it to session starts. The thundering herd is not a bug you fix once; it's a shape you learn to recognize everywhere clients meet a shared limit.
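Both halves of that fix can be sketched in a few lines. What follows is a minimal illustration using `asyncio`, not the code we shipped: `SingleFlight`, `PerUserCap`, the key format, and the cap value are hypothetical names and numbers.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class SingleFlight:
    """Client side: concurrent calls for the same key share one underlying operation."""

    def __init__(self) -> None:
        self._inflight: Dict[str, "asyncio.Task[Any]"] = {}

    async def run(self, key: str, start: Callable[[], Awaitable[Any]]) -> Any:
        existing = self._inflight.get(key)
        if existing is not None:
            # Someone already started this; wait for their result instead of firing again.
            return await existing
        task = asyncio.ensure_future(start())
        self._inflight[key] = task
        try:
            return await task
        finally:
            # Clear the slot so a later, genuinely new intent can start again.
            self._inflight.pop(key, None)


class PerUserCap:
    """Server side: limits how many concurrent sessions one user may hold."""

    def __init__(self, limit: int) -> None:
        self.limit = limit  # hypothetical per-user ceiling, well below the shared pool size
        self._active: Dict[str, int] = {}

    def try_acquire(self, user_id: str) -> bool:
        count = self._active.get(user_id, 0)
        if count >= self.limit:
            return False
        self._active[user_id] = count + 1
        return True

    def release(self, user_id: str) -> None:
        count = self._active.get(user_id, 0)
        if count <= 1:
            self._active.pop(user_id, None)
        else:
            self._active[user_id] = count - 1
```

With the guard in place, 21 calls to `run("meeting-42", start_session)` made while the first is still pending result in one call to `start_session`, and all 21 callers receive the same result. The cap is the backstop for clients that lack the guard: one user can be refused without the shared pool ever filling up.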

Three things to take away

  1. Your own clients are a load test you didn't schedule. Anything that batches-on-startup, retries-on-launch, or reconnects-on-wake is a thundering herd waiting for enough users. Pace it before you have them.
  2. Distinguish the code from the meaning. 429 is the most misread status in existence. "Slow down" is not "log out" and not "pay us." Route each failure to what it actually means.
  3. Liveness is a ladder, not a flag. "Is the connection alive?" has several honest answers at different layers — socket open, bytes flowing, other end answering, network present. A robust app checks more than one.

This is the unglamorous machinery under GeekBye v2's calm, and the same discipline that drove the whole rewrite described in "what a version 2 actually takes: 206 commits of honest states" (v2.0.0). For the reliability features it enabled, see "why your AI notetaker stops on bad Wi-Fi" and "live transcription when the firewall blocks WebSockets" (v2.0.8). For a neighboring release in this series, see "why your AI notetaker stops recording mid-meeting" (v2.0.9).