I test in prod


We’re a startup. Startups don’t tend to fail because they moved too fast. They tend to fail because they obsess over trivialities that don’t actually provide business value.

Artifacts that are untestable outside prod:

  • Many concurrent connections
  • A specific network stack with specific tunables, firmware, and NICs
  • Iffy or nonexistent serializability within a connection
  • Race conditions
  • Services loosely coupled over networks
  • Network flakiness
  • Ephemeral runtimes
  • Specific CPUs and their bugs; multiprocessors
  • Specific hardware RAM and memory bugs
  • Specific distro, kernel, and OS versions
  • Specific library versions for all dependencies
  • Build environment
  • Deployment code and process
  • Runtime restarts
  • Cache hits or misses
  • Specific containers or VMs and their bugs
  • Specific schedulers and their quirks
  • Clients with their own specific back-offs, retries, and time-outs
  • The internet at large
  • Specific times of day, week, month, year, and decade
  • Noisy neighbors
  • Thundering herds
  • Queues
  • Human operators and debuggers
  • Environment settings
  • Deaths, trials, and other real-world events
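
To make one of these concrete, here is a minimal sketch of how a race condition can pass a one-request-at-a-time test and still lose updates under concurrent traffic. It is an illustration only, not anything from the article: the `store` dict, the `hits` counter, and the function names are all hypothetical, and generators stand in for real concurrent requests so that the interleaving is deterministic.

```python
# Hypothetical example: a counter update written as read-then-write.
# Each handler is a generator so the interleaving can be controlled exactly.

def increment(store):
    """Read-modify-write split into two steps, pausing between them."""
    value = store["hits"]      # step 1: read the current value
    yield                      # another request may run at this point
    store["hits"] = value + 1  # step 2: write back the stale value plus one


def run_sequential(n):
    """Pre-prod style test: requests run one at a time, start to finish."""
    store = {"hits": 0}
    for _ in range(n):
        for _ in increment(store):
            pass
    return store["hits"]


def run_interleaved(n):
    """Prod-like traffic: all n requests read before any of them writes."""
    store = {"hits": 0}
    handlers = [increment(store) for _ in range(n)]
    for handler in handlers:
        next(handler)          # every handler performs its read
    for handler in handlers:
        for _ in handler:      # resume each handler so it performs its write
            pass
    return store["hits"]


if __name__ == "__main__":
    print("sequential: ", run_sequential(5))
    print("interleaved:", run_interleaved(5))
```

The sequential run counts every request, so a test written that way passes. The interleaved run loses all but one update, because every handler wrote back a value based on the same stale read. Real production interleavings are messier and less predictable than this worst case, which is the reason they are hard to reproduce anywhere else.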

The phrase “I don’t always test, but when I do, I test in production” seems to insinuate that you can only do one or the other: test before production or test in production. But that’s a false dichotomy. All responsible teams perform both kinds of tests.

Honeycomb is an observability service, so there is definitely some bias in this argument, though I don't disagree with it.

The argument also may not hold in cases where prod is out of reach, such as on-prem or mobile deployments.