If Windows Crashes, WWIII

Modern weapons systems more complex than a rifle are automated. Larger weapons, e.g. ships and airplanes, have extensive computer systems running complex software.

Complex software ==> bugs ==> crashes. Every piece of complex software crashes.  I don’t know anything about military software, but I know that every non-fail-safe system still crashes, and so do many fail-safe systems, mostly because of software bugs but sometimes because of hardware failures.

That happens at a certain rate, and the rate depends on use: the pattern of different actions and the rate at which data and requests arrive. That pattern is substantially different in real operations in the vicinity of a genuine potential enemy, so the failure rate should be expected to be higher, perhaps markedly higher.  Both conceptually and practically, there is no way to predict the rate of failure under usage levels that have never been tried.

Now add a small news item, and this later one, that have never been explained to those of us in the public who pay their salaries.  Was that the Russians testing a weapon?  If a weapon, did it take out the electronics directly or indirectly? If it did so by triggering a bug in some system’s code, with an indirect cascade of errors from there, we think ‘normal bug, no additional danger’. But even in that best case, it could happen again.  If it was not that best case, we certainly should conclude that loss of electronics is an act of war. The meaning of ‘crash’ under wartime conditions is not at all clear.

So here you are, the Captain of a weapon of war, and suddenly your ship’s weapons systems all go down, and the first reboot cycles fail.  While the crew is dealing with that, and the ship is proceeding on more primitive control and information systems, you get a message from the Morse operator that another ship in the group has the same problem.

What is the joint probability of random failure rather than enemy action?  It makes a lot of difference in whether the captains execute the emergency launch orders, the ones for when you are under attack and don’t expect to survive.

Assume the boat’s software is as reliable as Microsoft Windows.  It isn’t: it was built under military contracts to imprecise specs, ran way the hell over budget, was tested and patched until it passed the acceptance test, was shipped with many known bugs, and hasn’t had any improvements made since. The supposed support organization doesn’t even have valid builds of any prior version, so there is no way they can support it, and many new bugs have been filed against that version. Just last quarter the weapon system had a 3rd layer of bus translator added, as the ship’s weapons system control network was upgraded.  Each layer has its own bugs, of course.

Assume the two ships have different versions of software, and that they think to check that fact.  Did you think of that and adjust your estimate of probabilities in the correct direction? Assume the weapons systems do not both have the 3rd network translator.  Did you think of that, and realize that it works in the opposite direction on the probabilities? Ships’ systems exchange information via protocols, so is there a possibility of ship-to-ship cascades of failure? All these little facts adjust your assessment of the probability of needing to press that red button.  Pressure.
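The captain's question can be framed with Bayes' rule. What follows is only a sketch: every number is a made-up placeholder, not an estimate for any real system, and the function and parameter names are mine. The one thing it shows is how much the answer moves once a bug shared between ships becomes possible.

```python
def p_attack_given_failures(prior_attack, p_down_if_attack,
                            p_random_crash, p_common_bug, n_ships=2):
    """Posterior probability of enemy action, given that n_ships all went down.

    All inputs are hypothetical placeholders chosen for illustration.
    """
    # Likelihood of seeing every ship down if an attack is under way.
    like_attack = p_down_if_attack ** n_ships
    # Likelihood with no attack: either one shared bug hits every ship,
    # or each ship crashes independently of the others.
    like_no_attack = p_common_bug + (1 - p_common_bug) * p_random_crash ** n_ships
    numerator = prior_attack * like_attack
    return numerator / (numerator + (1 - prior_attack) * like_no_attack)


# Same made-up inputs, with and without a possible shared bug.
independent = p_attack_given_failures(0.001, 0.9, 0.01, p_common_bug=0.0)
shared = p_attack_given_failures(0.001, 0.9, 0.01, p_common_bug=0.005)
print(f"no shared bug possible: {independent:.3f}")
print(f"shared bug possible:    {shared:.3f}")
```

With these placeholders, a small chance of a common-mode bug pulls the posterior for 'attack' down sharply. The real inputs are exactly the ones nobody on the bridge knows.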

Not even as good as Microsoft Windows on an average day, but MS Windows when you keep your hard drive whirring with applications and internet traffic and … So a lot of activity, and a lot of opportunity for errors to be revealed.

You might make it through the day.

Consider the reasons for the World Wars of the last century: the best minds of the civilization made errors particular to their times, mental sets and technology.  The decisions for mobilization in the First World War, for example, were driven by required lead times and train schedules, and by the fact that a nation that mobilized one day earlier could roll over the laggard. We should expect more modern mistakes, e.g. software, to control our future now.

There, hope I assisted you to understand your future, adjusted the restfulness of your sleep. You really have not looked worried enough.

Added later : A bit of background.  I work with systems, including fail-safe systems.  A fail-safe system is an entirely different class of system compared to an ordinary control for a router or machine tool.  Military weapons systems have part of what is required, e.g. dual independent networks for subsystems connected to the ship’s or plane’s networks.  Larger systems may have a manual fail-over system that gives a 1-switch replacement. Some no doubt are entirely fail-safe, meaning you can kill components in ways that leave shorts between signals, between power lines, etc. without affecting the other unit that automatically takes over function. Systems that require concurrence of 2 of 3 systems voting, and can continue with just one, are a level of reliability above that.
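For readers who have not met 2-of-3 voting, here is a toy sketch of the idea. It assumes three redundant units each report a value, with None standing for a dead unit; the function name and the convention are mine, not taken from any real weapons or control system.

```python
from collections import Counter


def vote(readings):
    """Pick an output from redundant units; None marks a unit that is down.

    Returns the agreed value, or None when no trustworthy value exists.
    """
    live = [r for r in readings if r is not None]
    if not live:
        return None          # every unit is down
    if len(live) == 1:
        return live[0]       # degraded mode: carry on with the lone survivor
    value, count = Counter(live).most_common(1)[0]
    # With two or more live units, require at least two in agreement.
    return value if count >= 2 else None


print(vote([1, 1, 0]))           # two units outvote the faulty third
print(vote([None, None, 7]))     # one survivor keeps the function alive
print(vote([1, 2, 3]))           # no concurrence, so no output
```

The voter itself is a single piece of logic that all three units depend on, so it too can have bugs.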

Engineering a fail-safe system is a much larger effort than a mere control system for the same machinery, and the testing effort is an order of magnitude larger.  Add in all of the upgrades that happen in the 30-year life of a ship, …  That total set of connected electronics is a system, and probably the ship that appeared to be knocked out by the plane’s electronics had a cascade of failures through a system because of interacting bugs.  Probably.

That is a too-frequent occurrence for large-scale electronic systems, and the pressure on the administrators when it happens is unbelievable. I bet Captains hate it too, being impotent and vulnerable.  It even happens at Google, which runs the most bug-free software at that scale ever, but also the only system at that scale, ever.

Every ship echoes that situation of greatest scale ever with every voyage, because of all the upgrades that are always happening. The situation is far worse than I can imagine, guaranteed, as there is no way to test complex electronics and software like that.  “It works” means “It passed the acceptance test”, and many, many bugs are still known and not fixed, probably not even isolated, and many, many, many, very very very many combinations of conditions have not been tested, because the age of the universe would not be enough time to do so exhaustively.

Bugs lurk in the combinations.
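The 'age of the universe' remark is easy to check with back-of-envelope arithmetic. A sketch, assuming the conditions are simple independent on/off flags (real ones are far richer) and a test rig that runs a billion tests per second (a made-up, generous figure):

```python
AGE_OF_UNIVERSE_YEARS = 13.8e9            # commonly quoted estimate
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def flags_beyond_universe(tests_per_second):
    """Smallest number of on/off conditions whose 2**n combinations
    cannot all be tested within the age of the universe."""
    budget = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR * tests_per_second
    n = 0
    while 2 ** n <= budget:
        n += 1
    return n


# Hypothetical rig: one billion tests every second since the Big Bang.
print(flags_beyond_universe(1e9))
```

Under those assumptions fewer than a hundred independent on/off conditions are already out of reach, and a ship has vastly more state than that.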

One other thing : the combinations of factors I list above would not be unusual; I have found all of them on different projects.  It takes a lot of management discipline to maintain the information systems and test units that allow engineering to happen reliably, and it is easy to short-cut when budgets or schedules are tight.  All complex pieces of software and hardware are shipped with known bugs, and if not, their testing is inadequate, because all software and hardware have bugs you have not yet found. Not being able to build and test a shipped version of hardware/software is quite common. Ditto bugs in bus adapters.

Added yet later. Today, I have been slow to catch on to the significance of what I write. The real significance is that the military of every country makes errors that could cause wars at a certain rate. That error rate increases when more than 2 militaries are involved in a potential conflict situation.

Thus, if you have a military long enough, you will be involved in a war by accident.  Quite a profound argument against having any military, imho.
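The arithmetic behind 'long enough' is the usual at-least-once formula, 1 - (1 - p)**n. A sketch, assuming a constant and independent chance each year; the 1% figure is purely a placeholder, not an estimate:

```python
def p_at_least_one(p_per_year, years):
    """Chance of at least one accidental war in `years`, assuming a
    constant, independent per-year probability (a strong simplification)."""
    return 1 - (1 - p_per_year) ** years


# Placeholder rate of 1% per year, over increasing horizons.
for years in (10, 50, 100, 500):
    print(years, round(p_at_least_one(0.01, years), 3))
```

Whatever the real rate is, as long as it is above zero the result creeps toward 1 as the years add up.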

Added yet later : This discussion of the electronic records systems in hospitals adds the concepts of ‘mode error’ and ‘latent errors’. Good article.

Added much later. I just recalled that I once wrote a general test of a software system in Python. All of the dimensions were loops, with the values defined in a configuration file. The loops were 11 levels deep, meaning the number of test cases was the product of the number of values at each of the 11 levels: 2**11 for the two extremes alone, and far more once the middle points are counted. The values were an extreme at each end of the possible range and a few points in the middle, hardly exhaustive. It would take a long time to run that test, even splitting the work across many machines.
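I no longer have that test, so this is a reconstruction of the shape of it rather than the original. The dimension names and values are invented; the assumption is 11 dimensions with two extremes and three middle points each.

```python
import itertools
import math

# Hypothetical configuration: 11 dimensions, 5 values apiece.
config = {f"dim_{i}": [0, 25, 50, 75, 100] for i in range(11)}


def count_cases(cfg):
    """Total test cases is the product of the value counts per dimension."""
    return math.prod(len(values) for values in cfg.values())


def iter_cases(cfg):
    """Yield one dict per combination, equivalent to 11 nested for-loops."""
    names = list(cfg)
    for combo in itertools.product(*cfg.values()):
        yield dict(zip(names, combo))


print(count_cases(config))        # 5**11 combinations
print(next(iter_cases(config)))   # the first case: every dimension at its minimum
```

Even this thin sampling gives tens of millions of cases. Adding one more value to each dimension multiplies the total rather than adding to it.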

Validation of EE designs has a similar problem, as exhaustive testing of a bus controller module requires every combination-in-time of signals to be sure that none of them produce hardware-level ‘crashes’: lockups of logic, states of the machine that have no exit. There may be 100s of constants in firmware and logic controlling such a controller. They can’t be tested exhaustively, so every SoC has hardware problems. Most can be ‘fixed’, that is avoided, by software changes, after much work.

I add that as more evidence that no system can be tested enough to know it won’t crash, including ultra-failsafe systems.

The same argument controls the possibility of security: no amount of testing can guarantee it. Security at least has the advantage that you can combine measures, each independently well-tested and having no interactions with each other beyond running on the same system. Each level combinatorially complexifies the problem for the attacker, with only a linear increase in work for the defender.

NSA could have led that research effort, which would have much improved our civilization, but instead went with the dark side. The Bastards.


9 thoughts on “If Windows Crashes, WWIII”

  1. That does not consider the faulty, outright criminally designed and installed CHINESE COMPUTER CHIPS and RUSSIAN SOFTWARE found in American weapon systems and missiles! If I did not know better, I would think American Forces are being set up for defeat!

