Operational Strategies for Cascading System Failure

I have, I believe, never suggested hurting more than feelings. I always recommend friendship, honesty, peace and love as the beginning of solving problems. This is not idealism; it is a hard, practical understanding of how to get along as a person and a group in a world with other persons, groups and peoples.

From a systems view, the view of someone who has spent a fair amount of time poking through core dumps trying to find where the cascade of problems started, I find that when we do find the problem, very nearly every time it is because someone didn’t love their program enough.

Professional programmers will know what I mean. The more time I spend making my programs pretty in code, expressive in names, elegant in design, and friendly and open in comments, the better they work. If I leave anything sloppy ‘because it doesn’t matter’, too often it turns out to matter: I didn’t take the time to find the bugs, or I didn’t read the code often enough to see the flaw.

Programmers have individual styles that show through a company’s coding standards. Our files carry our identity even if they don’t include our name, and we want to be proud of our work. We do that by making it easy for other programmers to see how very clever we were. Programmers who love their programs work at making them easy for other programmers to grasp, and consequently have programs that work, and continue to work even after others assume responsibility for them.

It is the inverse of the QA maxim that if you find some first thing wrong, keep looking because there will be more. Quality code is correct in every detail, because anything less hasn’t been worked over enough to see all of the possible problems. Correcting a comment makes you check the bit of code it refers to. Changing the name of a variable or function to make it more orthogonal to another makes a possible interaction obvious, and so on. Making test cases more complete, usually because you were working on comments and realized there was yet another possible case, exposes other issues in the code. This is an iterative, creative, evolutionary process, and it continues for the life of the code.

When code stops improving, a common situation as programmers move on in life, it may begin degrading as juniors take over and begin creating their own bugs.

An inevitable sign is comments that no longer match the code. The next reader of the code has a tripling of difficulty: which one is correct? Or are both wrong? That increases the probability that any fix to the code will be wrong, and is a positive feedback loop degrading the quality. Code maintained by programmers of less expertise than the creator often degrades rapidly, and the effort required to reverse the trend and return it to previous levels quickly runs over any engineering budget that could leave the product profitable. That product will die without further investment, and realistic estimates of the cost often show the ROI is too low.
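One hedge against comments drifting away from the code is to make the comment executable, so a mismatch fails loudly instead of waiting for the next reader. This is only a minimal sketch, using Python’s standard doctest module; the function clamp and the helper docs_match_code are hypothetical names invented for the illustration, not anything from a real code base.

```python
import doctest


def clamp(value, low, high):
    """Return value limited to the closed range [low, high].

    The examples below are part of the comment AND are checked by doctest,
    so they cannot silently stop matching the code.

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(42, 0, 10)
    10
    """
    return max(low, min(value, high))


def docs_match_code():
    """Run clamp's docstring examples; True when every example still holds."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failed = 0
    for test in finder.find(clamp, "clamp", globs={"clamp": clamp}):
        failed += runner.run(test).failed
    return failed == 0
```

If someone later changes clamp without touching the docstring, docs_match_code() returns False and the reader no longer has to guess which of the two is wrong.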

Failing to measure quality well and to prioritize quality above new features is the standard way technology companies have destroyed their products and themselves. There are dozens of electronics and software companies that have become living dead and miserable places to work because they let the quality of their products slip, and now can’t ever catch up with the market because they spend all their time fixing bugs.

Software is just another way of expressing engineering, like the blueprints of machines, from factories through ships and airplanes, that are modified as they are repaired and maintained. Engineering is a subset of organizing knowledge and material: instruction manuals, books and specifications, through warehouse layouts and the wiring of home, factory, and city electrical grid.

Other examples are databases, including your city’s property records, which have been FUBARed by the banks, which have not processed proper documentation for title transfers. The number of times your name is misspelled on mail and documents is a measure of the low quality of databases in the business world.*

You might think that the software example is encouraging: all we need to do is be careful. Sorry, the problems are not analogous and use incommensurate tools. The unmatched advantage we have in the computer world is that we can stop the system and execute it one instruction at a time, or rerun it to check a previous condition. The computer executes the instructions the same way every time, so we eventually find the bug, even when it results from a random write millions of instructions previously. Tedious at worst, but new tools are developed to deal with new problems, e.g. DTrace.
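To make the ‘rerun it to check a previous condition’ advantage concrete, here is a toy sketch, not a real debugger. The program run, its planted stray write at step 1337, and the helpers invariant_holds and first_bad_step are all made up for the illustration. Because the toy program is deterministic, and because the damage persists once done, we can rerun it as often as we like and bisect to the step that did it.

```python
def run(steps):
    """Replay a toy deterministic program for `steps` steps; return its memory."""
    memory = [0] * 8
    for i in range(steps):
        memory[i % 4] += 1      # normal work, confined to cells 0..3
        if i == 1337:           # the planted bug: one stray write to cell 6
            memory[6] = 99
    return memory


def invariant_holds(memory):
    """Cells 4..7 are never supposed to be written."""
    return all(cell == 0 for cell in memory[4:])


def first_bad_step(total_steps):
    """Bisect over reruns for the smallest step count that breaks the invariant.

    Assumes the invariant holds after 0 steps, is broken after total_steps,
    and stays broken once broken. Only works because run() is repeatable.
    """
    low, high = 0, total_steps  # holds after `low` steps, broken after `high`
    while high - low > 1:
        mid = (low + high) // 2
        if invariant_holds(run(mid)):
            low = mid
        else:
            high = mid
    return high
```

Calling first_bad_step(100_000) needs only about seventeen reruns to pin the corruption on the 1338th step executed (loop index 1337). The whole search depends on run behaving identically on every replay, which is exactly what people and organizations do not do.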

Tracing problems to root causes is difficult for anything complex, e.g. an assembly line or power train or … That is why working engineers spend so much of their time debugging, wtf this time?  Those share the programmer’s advantages of being able to stop them, reset to a known state, and continue.

That is the root reason we have working systems: we can trace every problem back to a cause, eventually. If we can understand the cause, there is usually a fix for the problem.

But human individuals, groups, companies and institutions, and the total societies composed of those, are open, evolving, complex systems that have no static mode; the only static things in life are dead. Worse, they don’t execute the same rules or check sequence the same way every time. People change from day to day, influenced by weather, a spouse’s mood, lousy coffee at breakfast. You too.

Yet worser, the best next step won’t be the same today as tomorrow, because systems are path-dependent. That is, a system ‘changes state’: for example, depositing the lottery check versus tearing up the losing ticket produces different emotional states and different interpretations of any accompanying tears. ‘It depends’ is the answer to what you should do at any point for a crying woman, or an upset co-worker, subordinate or boss. Hard rules are not possible for real events in complex systems.

So any idea of tracing social or management problems to an ultimate root cause and correcting them is conceptually wrong and impossible in practice. Open, evolving complex systems and deterministic machines** are incommensurate in too many dimensions.

This is why any approach to controlling complex systems using rules, laws, regulations, employee handbooks, or careful management oversight (micromanagement) must fail. Instead, you must evolve them, beginning from what you have and making it better. There is no choice at all, and it is what we do, though we don’t normally think of Continuous Revision as the process of evolution.

Instead, think of us all as doing impromptu performance art as the control system of an open, evolving, complex system, of which we are both a part and the mechanism for working through problems as a group of people. An authentic performance, one that displays your abilities as you enhance your fellow performers’, already has many possible interpretations. Add the problems of people interacting in groups, and QA on social organizations is conceptually and practically entirely different from engineering problems.

I strongly recommend positive-sum: friendship, honesty, peace and love in your performance art.

**We call computer systems ‘systems’ because they have many distinct components that the engineers working on them consider ‘subsystems’, and because the total hardware + software does exhibit system behavior. But the hardware side is totally deterministic; otherwise the programs couldn’t be deterministic. A non-deterministic program means it has a bug.
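A small sketch of what ‘deterministic’ buys you in practice, assuming nothing beyond Python’s standard random module. The names shuffle_deck and is_repeatable are hypothetical, made up for this note. Even ‘random’ behavior is repeatable when the generator is seeded and kept private, and repeatability is something you can simply test for.

```python
import random


def shuffle_deck(seed):
    """Shuffle a ten-card deck with a private, seeded generator.

    random.Random(seed) keeps the generator's state local, so the same
    seed always produces the same shuffle.
    """
    rng = random.Random(seed)
    deck = list(range(10))
    rng.shuffle(deck)
    return deck


def is_repeatable(func, *args, runs=5):
    """Call func(*args) several times; True if every result is identical."""
    first = func(*args)
    return all(func(*args) == first for _ in range(runs - 1))
```

is_repeatable(shuffle_deck, 42) holds on every run. A version that drew on the unseeded global generator would usually fail the same check, and a bug that depended on it could not be replayed.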

Added later: I was thinking about this in the night, and realized I had not really addressed “how to address cascading system failures”, as perhaps promised in the title.

Implicit in the above is the best answer: “avoid the beginning of the cascade, because you might not be able to recover”. Beyond that, I never thought about it, because there is no such strategy unless it is a designed system. If I have the blueprints, all the modifications, … and I know that designed system very well, and have an engineering team that really knows all of the detail for every subsystem and the interactions, then maybe, but just maybe. Systems are complex beasts; ones as simple as nuclear power plants turn out to have unanticipated failure modes. The most experienced captains and crews have trouble handling unanticipated problems. There are famous examples of teamwork that survived, and probably a lot that didn’t. Those are systems that are hard to perform a root cause analysis on, because they aren’t stoppable and restartable. The simplicity of airplanes, their maintenance, and airline operations is why they have gotten so good.

 

For genuinely complex systems, there may be wisdom, but that question is a subject for the blog, not easy to summarize.

Partial recovery.
