Fixing On-Line Systems
Shit breaks. And often. Especially at high-load, high-complexity sites. And in ways not easily ‘solved’ with auto-scaling, more containers, restarting services, or fancy scheduling systems. While all of those are useful and have their place, they are not where the real work happens, by the grown-up girls.
Fixing these things is made harder by shiny new objects like micro-services, server-less, infinitely-divisible, loosely-connected pieces and parts spread out over everywhere.
Dao is the Way . . .
This leads us to the Dao of Troubleshooting complex systems.
First, Model All the Things. Know what is where, how it’s connected, how it’s configured, and hopefully how it behaves. Have, and actually look at, logical diagrams and, if needed, physical or network diagrams, with layers and groupings that make sense at any scale.
Second, Know All the Knowables. This means knowing the status and configurations of everything, and I assure you this is not exactly what is checked into your code, config, .env, and infrastructure-as-code systems, let alone all the dynamic pieces and parts floating around. Like it or not, the source of truth is what’s really running right now.
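One way to make the gap between what is checked in and what is really running visible is a plain diff of the two. This is a minimal sketch, not a real tool: the config keys and values are invented for illustration, and how you collect the running values (an admin endpoint, a process inspection, a cloud API) is left as an assumption.

```python
ABSENT = "<absent>"  # marker for a key that exists on only one side


def config_drift(declared, running):
    """Return {key: (declared_value, running_value)} for every key that differs."""
    drift = {}
    for key in sorted(set(declared) | set(running)):
        want = declared.get(key, ABSENT)
        have = running.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical values: what the repo says vs. what the live process reports.
declared = {"pool_size": 20, "timeout_s": 5, "log_level": "info"}
running = {"pool_size": 50, "timeout_s": 5, "debug_port": 9229}

for key, (want, have) in config_drift(declared, running).items():
    print(f"{key}: declared={want!r} running={have!r}")
```

Anything this prints is a place where the checked-in story and the running system disagree, and the running system wins.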
Third, Rue the Changes. What has changed in the last relevant time period, by whom, when, to what, and to what effect? Who logged into the server, who pushed code, who changed config, who modified the cloud, etc.?
Then, what behaviors changed, e.g. whose latency changed, whose correlation dynamics changed, did error rates change, what resource loading or availability changed? And which of these changes mattered?
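The "last relevant time period" question can be sketched as a filter over a change log. This assumes you already gather logins, deploys, and config changes into one list; the Change record, its fields, and the sample entries below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    when: datetime
    who: str
    what: str    # e.g. "deploy", "config", "login"
    target: str  # the thing that was touched


def changes_before(changes, incident_start, lookback):
    """Changes inside [incident_start - lookback, incident_start], newest first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(recent, key=lambda c: c.when, reverse=True)


# Hypothetical incident time and change log.
incident = datetime(2024, 1, 10, 14, 0)
log = [
    Change(datetime(2024, 1, 10, 13, 40), "alice", "deploy", "checkout-api"),
    Change(datetime(2024, 1, 9, 9, 0), "bob", "config", "load-balancer"),
    Change(datetime(2024, 1, 10, 12, 15), "carol", "login", "db-primary"),
]

for c in changes_before(log, incident, timedelta(hours=3)):
    print(c.when.isoformat(), c.who, c.what, c.target)
```

Newest first is a deliberate choice here: the change closest to the incident is usually the first suspect, though not always the guilty one.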
Fourth, Exploit Expertise. Directly or indirectly apply knowledge and experience of how all the things, their relationships, dependencies, and especially their dynamics and failure modes interconnect. Apply expertise directly via real live experts, on-site, on-line, or via Ouija. If you can, also apply that experience indirectly, 7x24, via Expert Systems and Rule Engines with encoded expertise.
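"Encoded expertise" can be as small as a list of conditions paired with advice. The sketch below is a toy, not a real expert system: the rule names, thresholds, and observation keys are all made up for illustration, and a production rule engine would add priorities, chaining, and explanations.

```python
def evaluate(rules, observations):
    """Return (rule_name, advice) for every rule whose condition holds."""
    return [(name, advice) for name, condition, advice in rules
            if condition(observations)]


# Each rule: (name, condition over the observations, what an expert would say).
rules = [
    ("pool-exhaustion",
     lambda o: o.get("db_connections_used", 0) >= o.get("db_connections_max", float("inf")),
     "Connection pool is exhausted; look for leaked connections or slow queries."),
    ("disk-full",
     lambda o: o.get("disk_used_pct", 0) >= 95,
     "Disk is nearly full; check log rotation and temp files."),
    ("recent-deploy",
     lambda o: o.get("error_rate", 0) > 0.05 and o.get("deploy_age_min", float("inf")) < 30,
     "Errors rose right after a deploy; consider rolling back first."),
]

# Hypothetical snapshot of the system.
observations = {
    "db_connections_used": 100,
    "db_connections_max": 100,
    "disk_used_pct": 40,
    "error_rate": 0.2,
    "deploy_age_min": 12,
}

for name, advice in evaluate(rules, observations):
    print(f"[{name}] {advice}")
```

The defaults in each condition are chosen so that a rule stays silent when its data is missing, which keeps the engine from guessing.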
Fifth, Seek Clarity. Always ponder additional observations to boost the rule engines and expert brains, especially with low risk, quick-answer info that ideally can be automated by the rule engines. There is never enough data, and never time to get it all, but bringing balance brings answers.
Sixth, Explore Effects by making changes or adjustments to the system to observe how they affect things. Especially useful to increase your exclusion list or uncover previously unknown relationships and stuff that never worked anyway.
Seventh, Exclude Exclusives, by not wasting time on problems you cannot have, as they can suck enormous energy, focus, and resources because they weren’t sufficiently excluded early on. Never lose sight of what the problem is not and rigorously exclude by logic and experience.
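Keeping an explicit exclusion list can be sketched as checking each hypothesis against the facts you have actually observed. This is only an illustration under simple assumptions: the hypotheses, the fact names, and the idea that each hypothesis boils down to a few required true/false facts are all invented here.

```python
def still_possible(hypotheses, facts):
    """Split hypothesis names into (possible, excluded) given observed facts."""
    possible, excluded = [], []
    for name, requires in hypotheses.items():
        # A hypothesis is excluded as soon as one known fact contradicts it.
        # Facts we have not observed yet neither confirm nor exclude anything.
        contradicted = any(
            key in facts and facts[key] != value
            for key, value in requires.items()
        )
        (excluded if contradicted else possible).append(name)
    return possible, excluded


# Hypothetical hypotheses: what must be true for each one to be the problem.
hypotheses = {
    "dns outage": {"dns_resolves": False},
    "database down": {"db_accepts_connections": False},
    "bad deploy": {"deployed_in_last_hour": True},
}
facts = {"dns_resolves": True, "deployed_in_last_hour": True}

possible, excluded = still_possible(hypotheses, facts)
print("still possible:", possible)
print("excluded:", excluded)
```

Note that "database down" stays on the possible list only because nobody has checked the database yet, which is exactly the kind of low-risk, quick-answer observation worth making next.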
Eighth, Test Truths, as Late Stage Troubleshooting can end in contradictions and conundrums, where something that appears true must not be so — to paraphrase Mark Twain, “The problem ain’t what you don’t know, it’s what you know that just ain’t so.” Always be willing to challenge your most basic assumptions, facts, and truths for therein often lies something you know that just ain’t so.
Ninth, Seek Solace as this stuff is hard, there is never enough time nor tools, and the pressure is always high. Continually step back, revisit what you know and think you know, looking at how it’s all connected, cause and effect, and the truth will often reveal itself, often in mysterious ways . . .
Shit Breaks — Follow the nine truths and solutions will come your way …