The Myth of the Server’s Terrible, Horrible, No Good, Very Bad Day

Have you ever been on call when a service had a temporary latency spike that might have been just a network blip? Or maybe the service was “unhappy” and went into a GC spiral? Sometimes the problem goes away on its own; sometimes you restart the service and things start to look better. It’s all too common these days to talk as if the systems we operate have emotions like “unhappiness” rather than behaving deterministically.

We already know that software is deterministic: it does what you program it to do. So why does it sometimes feel like it isn’t? Programs don’t get tired, so why does restarting do anything at all, let alone magically fix the problem? You wouldn’t expect a Fibonacci generator to randomly give up or slow down, so why do large distributed systems sometimes seem to?