Monday, February 23, 2015

Errors Versus Quality


Are Errors a Moral Issue?


NOTE: This essay is the second of a series of blogs adapted from my Quality Software, Volume 2, Why Software Gets in Trouble: Fault Patterns

    

Errors in software used to be a moral issue for me, and still are for many writers. Perhaps that's why these writers have asserted that "quality is the absence of errors." 

It must be a moral issue for them, because otherwise it would be a grave error in reasoning. Here's how their reasoning might have gone wrong. Perhaps they observed that when their work is interrupted by numerous software errors, they can't appreciate any other good software qualities. From this observation, they can conclude that many errors will make software worthless—i.e., zero quality.
But here's the fallacy: Though copious errors guarantee worthlessness, zero errors guarantee nothing at all about the value of software.
Let's take one example. Would you offer me $100 for a zero defect program to compute the horoscope of Philip Amberly Warblemaxon, who died in 1927 after a 37-year career as a filing clerk in a hat factory in Akron? I doubt you would even offer me fifty cents, because to have value, software must be more than perfect. It must be useful to someone.
Still, I would never deny the importance of errors. First of all, if I did, people in Routine, Oblivious and Variable organizations would stop reading this book. To them, chasing errors is as natural as chasing sheep is to a German Shepherd Dog. And, as we've seen, when they are shown the rather different life of a Steering organization, they simply don't believe it.
Second of all, I do know that when errors run away from us, that's one of the ways to lose quality. Perhaps our customers will tolerate 10,000 errors, but, as Tom DeMarco asked me, will they tolerate 10,000,000,000,000,000,000,000,000,000? In this sense, errors are a matter of quality. Therefore, we must train people to make fewer errors, while at the same time managing the errors they do make, to keep them from running away.

The Terminology of Error

I've sometimes found it hard to talk about the dynamics of error in software because there are many different ways of talking about errors themselves. One of the best ways for a consultant to assess the software engineering maturity of an organization is by the language they use, particularly the language they use to discuss error. To take an obvious example, those who call everything "bugs" are a long way from taking responsibility for controlling their own process. Until they start using precise and accurate language, there's little sense in teaching such people about basic dynamics.

Faults and failures. First of all, it pays to distinguish between failures (the symptoms) and faults (the diseases). Musa gives these definitions:

A failure "is the departure of the external results of program operation from requirements." A fault "is the defect in the program that, when executed under particular conditions, causes a failure." For example:
An accounting program had an incorrect instruction (fault) in the formatting routine that inserts commas in large numbers such as "$4,500,000". Any time a user prints a number of more than six digits, a comma may be missing (a failure). Many failures resulted from this one fault.
How many failures result from a single fault? That depends on
• the location of the fault
• how long the fault remains before it is removed
• how many people are using the software.
The comma-insertion fault led to millions of failures because it was in a frequently used piece of code, in software that has thousands of users, and it remained unresolved for more than a year.
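To make the distinction concrete, here is a minimal Python sketch of one fault producing several failures. The function names and sample amounts are my own illustrative assumptions, not the actual accounting program; the hypothetical faulty routine simply inserts only one comma.

```python
def format_amount_faulty(amount: int) -> str:
    """Hypothetical formatting routine containing ONE fault:
    it inserts only the last comma, so longer numbers lose the others."""
    digits = str(amount)
    if len(digits) > 3:
        digits = digits[:-3] + "," + digits[-3:]
    return "$" + digits


def format_amount_correct(amount: int) -> str:
    """Reference behavior: a comma between every group of three digits."""
    return "$" + format(amount, ",")


def count_failures(printed_amounts: list[int]) -> int:
    """Each wrongly printed amount is one failure, all traced to the same fault."""
    return sum(
        1
        for amount in printed_amounts
        if format_amount_faulty(amount) != format_amount_correct(amount)
    )


# Hypothetical amounts printed by users on one day.
printed = [950, 4_500, 125_000, 4_500_000, 12_345_678]
failures = count_failures(printed)
print(f"1 fault, {failures} failures")
```

Only the amounts with more than six digits come out wrong, so the failure count depends on how often such amounts are printed, not on how many faults the code contains.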
When studying error reports at various clients, I often find that they mix failures and faults in the same statistics, because they don't understand the distinction. If these two different measures are mixed into one, it will be difficult for them to understand their own experiences. For instance, because a single fault can lead to many failures, it would be impossible to compare failures between two organizations that aren't careful in making this "semantic" distinction.
Organization A has 100,000 customers who use their software product, each for an average of 3 hours a day. Organization B has a single customer who uses one software system once a month. Organization A produces 1 fault per thousand lines of code, and receives over 100 complaints a day. Organization B produces 100 faults per thousand lines of code, but receives only one complaint a month.
Organization A claims they are better software developers than Organization B. Organization B claims they are better software developers than Organization A. Perhaps they're both right.
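The arithmetic behind "perhaps they're both right" can be sketched in a few lines of Python. The figures come from the example above; the 30-day month, the use of 100 complaints a day as a lower bound, and all names are my assumptions for illustration.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30  # assumption, used only to put both complaint rates on one scale


@dataclass
class Organization:
    name: str
    faults_per_kloc: float       # a fault measure
    complaints_per_month: float  # a failure measure (reported failures)


# Organization A: "over 100 complaints a day", taken here as exactly 100.
org_a = Organization("A", faults_per_kloc=1, complaints_per_month=100 * DAYS_PER_MONTH)
org_b = Organization("B", faults_per_kloc=100, complaints_per_month=1)

fewer_faults = min((org_a, org_b), key=lambda o: o.faults_per_kloc)
fewer_complaints = min((org_a, org_b), key=lambda o: o.complaints_per_month)

print(f"Fewer faults per thousand lines: Organization {fewer_faults.name}")
print(f"Fewer complaints per month: Organization {fewer_complaints.name}")
```

Each organization wins on one measure and loses on the other, which is why a statistic that mixes faults and failures cannot settle the argument.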
The System Trouble Incident (STI). Because of the important distinction between faults and failures, I encourage my clients to keep at least two different statistics. The first of these is a data base of "system trouble incidents," or STIs. In my books, I mean an STI to be an "incident report of one failure as experienced by a customer or simulated customer (such as a tester)."
I know of no industry standard nomenclature for these reports—except that they invariably take the form of TLAs (Three Letter Acronyms). The TLAs I have encountered include:
- STR, for "software trouble report"
- SIR, for "software incident report," or "system incident report"
- SPR, for "software problem report," or "software problem record"
- MDR, for "malfunction detection report"
- CPI, for "customer problem incident"
- SEC, for "significant error case"
- SIR, for "software issue report"
- DBR, for "detailed bug report," or "detailed bug record"
- SFD, for "system failure description"
- STD, for "software trouble description," or "software trouble detail"
I generally try to follow my client's naming conventions, but try hard to find out exactly what is meant. I encourage them to use unique, descriptive names. It tells me a lot about a software organization when they use more than one TLA for the same item. Workers in that organization are confused, just as my readers would be confused if I kept switching among ten TLAs for STIs. The reasons I prefer STI to some of the above are as follows:
1. It makes no prejudgment about the fault that led to the failure. For instance, it might have been a misreading of the manual, or a mistyping that wasn't noticed. Calling it a bug, an error, a failure, or a problem, tends to mislead.
2. Calling it a "trouble incident" implies that once upon a time, somebody, somewhere, was sufficiently troubled by something that they happened to bother making a report. Since our definition of quality is "value to some person," someone being troubled implies that it's worth something to look at the STI.
3. The words "software" and "code" also contain a presumption of guilt, which may unnecessarily restrict location and correction activities. We might correct an STI with a code fix, but we might also change a manual, upgrade a training program, change our ads or our sales pitch, furnish a help message, change the design, or let it stand unchanged. The word "system" says to me that any part of the overall system may contain the fault, and any part (or parts) may receive the corrective activity.
4. The word "customer" excludes troubled people who don't happen to be customers, such as programmers, analysts, salespeople, managers, hardware engineers, or testers. We should be so happy to receive reports of troublesome incidents before they get to customers that we wouldn't want to discourage anybody.
Similar principles of semantic precision might guide your own design of TLAs, to remove one more source of error, or one more impediment to their removal. Pattern 3 organizations always use TLAs more precisely than do Pattern 1 and 2 organizations.
System Fault Analysis (SFA). The second statistic is a database of information on faults, which I call SFA, for System Fault Analysis. Few of my clients initially keep such a database separate from their STIs, so I haven't found such a diversity of TLAs. Ed Ely tells me, however, that he has seen the name RCA, for "Root Cause Analysis." Since RCA would never do, the name SFA is a helpful alternative because:
1. It clearly speaks about faults, not failures. This is an important distinction. No SFA is created until a fault has been identified. When an SFA is created, it is tied back to as many STIs as possible. The time lag between the earliest STI and the SFA that clears it up can be an important dynamic measure.
2. It clearly speaks about the system, so the database can contain fault reports for faults found anywhere in the system.
3. The word "analysis" correctly implies that data is the result of careful thought, and is not to be completed unless and until someone is quite sure of their reasoning.
"Fault" does not imply blame. One deficiency with the semantics of the term "fault" is the possible implication of blame, as opposed to information. In an SFA, we must be careful to distinguish two places associated with a fault, neither of which implies anything about whose "fault" it was:
a. origin: at what stage in our process the fault originated
b. correction: what part(s) of the system will be changed to remedy the fault
Routine, Oblivious and Variable organizations tend to equate these two notions, but the motto, "you broke it, you fix it," often leads to an unproductive "blame game." "Correction" tells us where it was wisest, under the circumstances, to make the changes, regardless of what put the fault there in the first place. For example, we might decide to change the documentation—not because the documentation was bad, but because the design is so poor it needs more documenting and the code is so tangled we don't dare try to fix it there.

If Steering organizations are not heavily into blaming, why would they want to record "origin" of a fault? To these organizations, "Origin" merely suggests where action might be taken to prevent a similar fault in the future, not which employee is to be taken out and crucified. Analyzing origins, however, requires skill and experience to determine the earliest possible prevention moment in our process. For instance, an error in the code might have been prevented if the requirements document were more clearly written. In that case, we should say that the "origin" was in the requirements stage.
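The two databases, and the separation of origin from correction, can be sketched as a small data model. This is a minimal Python illustration, not a schema from the book: every class name, field name, stage name, and sample record below is hypothetical. It shows SFAs tied back to STIs, the time lag between the earliest STI and the SFA that clears it up, and origin and correction recorded as separate fields, with no field for a person.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class STI:
    """System Trouble Incident: one failure as experienced by one reporter."""
    sti_id: int
    reported_on: date
    description: str


@dataclass
class SFA:
    """System Fault Analysis: created only after a fault has been identified."""
    sfa_id: int
    identified_on: date
    summary: str
    origin: str             # process stage where the fault could first have been prevented
    corrections: list[str]  # part(s) of the system changed to remedy the fault
    sti_ids: list[int] = field(default_factory=list)  # STIs this fault explains


def lag_in_days(sfa: SFA, stis: dict[int, STI]) -> int:
    """Days from the earliest linked STI to the SFA that clears it up."""
    earliest = min(stis[i].reported_on for i in sfa.sti_ids)
    return (sfa.identified_on - earliest).days


# Hypothetical sample data.
stis = {
    1: STI(1, date(2015, 1, 5), "Comma missing in a printed total"),
    2: STI(2, date(2015, 1, 19), "Large amount printed without all commas"),
    3: STI(3, date(2015, 2, 2), "Report option behaves differently from manual"),
}
sfas = [
    SFA(1, date(2015, 2, 23), "Formatting routine drops commas",
        origin="code", corrections=["code"], sti_ids=[1, 2]),
    SFA(2, date(2015, 2, 24), "Report option underspecified",
        origin="requirements", corrections=["documentation"], sti_ids=[3]),
]

for sfa in sfas:
    print(f"SFA {sfa.sfa_id}: {lag_in_days(sfa, stis)} days after earliest STI")

# Tally origins to see where prevention effort might pay off.
origin_counts = Counter(sfa.origin for sfa in sfas)
print(origin_counts)
```

In the second sample record the origin is "requirements" while the correction is made in the documentation, which is exactly the kind of case where equating the two fields would lose information.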


(to be continued)

Tuesday, February 17, 2015

Why You Should Love Errors

Observing and Reasoning About Errors


NOTE: This essay is the first of a series of blogs adapted from my Quality Software, Volume 2, Why Software Gets in Trouble: Fault Patterns
Three of the great discoveries of our time have to do with programming: programming computers, the programming of inheritance through DNA, and programming of the human mind (psychoanalysis). In each, the idea of error is central.
The first of these discoveries was psychoanalysis, with which Sigmund Freud opened the Twentieth Century and set a tone for the other two. In his introductory lectures, Freud opened the human mind to inspection through the use of errors—what we now call "Freudian slips."
The second of these discoveries was DNA. Once again, key clues to the workings of inheritance were offered by the study of errors, such as mutations, which were mistakes in transcribing the genetic code from one generation to the next.
The third of these discoveries was the stored program computer. From the first, the pioneers considered error a central concern. von Neumann noted that the largest part of natural organisms was devoted to the problem of survival in the face of error, and that the programmer of a computer need be similarly concerned.
In each of these great discoveries, errors were treated in a new way: not as lapses in intelligence, or moral failures, or insignificant trivialities—all common attitudes in the past. Instead, errors were treated as sources of valuable information, on which great discoveries could be based.
The treatment of error as a source of valuable information is precisely what distinguishes the feedback (error-controlled) system from its less capable predecessors—and thus distinguishes Pattern 3 (Steering) software cultures from Patterns 0 (Oblivious), 1 (Variable), and 2 (Routine). Organizations in those patterns have more traditional—and less productive—attitudes about the role of errors in software development, attitudes that they will have to change if they are to transform themselves into Steering organizations.
So, in the following blog entries, we'll explore what happens to people in Oblivious, Routine, and especially Variable organizations as they battle those "inevitable" errors in their software. After reading these chapters, perhaps they'll appreciate that they can never move to a Steering pattern until they learn how to use the information in the errors they make.
One of my editors complained that the first sections of this essay spend "an inordinate amount of time on semantics, relative to the thorny issues of software failures and their detection."
What I wanted to say to her, and what I will say to you, is that such "semantics" form one of the roots of "the thorny issues of software failures and their detection." Therefore, to build on a solid foundation, I need to start this discussion by clearing up some of the most subversive ideas and definitions about failure. If you already have a perfect understanding of software failure, then skim quickly, and please forgive me.

Errors Are Not A Moral Issue

"What do you do with a person who is 900 pounds overweight that approaches the problem without even wondering how a person gets to be 900 pounds overweight?"—Tom DeMarco
This is the question Tom asked when he read an early version of this blog. He was exasperated about clients who were having trouble managing more than 10,000 error reports per product. So was I.
Over fifty years ago, in my first book on computer programming, Herb Leeds and I emphasized what we then considered the first principle of programming:
The best way to deal with errors is not to make them in the first place.
Not all wisdom was born in the Computer Age. Thousands of years before computers, Epictetus said, "Men are not moved by things, but by the views which they take of them."
Like many hotshot programmers half a century ago, my view of "best" was then a moral stance:
Those of us who don't make errors are better programmers than those of you who do.
I still consider this a statement of the first principle of programming, but somehow I no longer apply any moral sense to the principle. Instead, I mean "best" only in an economic sense, because,
Most errors cost more to handle than they cost to prevent.
This, I believe, is part of what Crosby means when he says "quality is free." But even if it were a moral question, I don't think that Steering cultures, which do a great deal to prevent errors, can claim any moral superiority over Oblivious, Routine and Variable cultures, which do not. You cannot say that someone is morally inferior because they don't do something they cannot do. These days, most programmers operate in Oblivious, Routine and Variable organizations, which are culturally incapable of preventing large numbers of errors. Why incapable? Let me put Tom's question another way:
"What do you do with a person who is rich, admired by thousands, overloaded with exciting work, 900 pounds overweight, and has 'no problem' except for occasional work lost because of back problems?"
Tom's question presumes that the thousand pound person perceives a weight problem, but what if they perceive a back problem instead? My Oblivious, Routine or Variable clients with tens of thousands of errors in their software do not perceive they have a serious problem with errors. Why not? First of all, they are making money. Secondly, they are winning the praise of their customers. Customer complaints are generally at a tolerable level on every two products out of three. With their rate of profit, why should they care if a third of their projects have to be written off as total losses?
If I attempt to discuss these mountains of errors with Oblivious, Routine or Variable clients, they reply, "In programming, errors are inevitable. Even so, we've got our errors more or less under control. So don't worry about errors. We want you to help us get things out on schedule."
Such clients see no more connection between enormous error rates and two-year schedule slippages than the obese person sees between 900 pounds of body fat and pains in the back. Would it do any good to accuse them of having the wrong moral attitude about errors? Not likely. I might just as well accuse a blind person of having the wrong moral attitude about rainbows.

But their errors do create a moral question—for me, their consultant. If my thousand-pound client is happy, it's not my business to tell him how to lose weight. If he comes to me complaining of back problems, I can step him through a diagram of effects showing how weight affects his back. Then it's up to him to decide how much pain is worth how many chocolate cakes.
Similarly, if he comes to me complaining about error problems, I can ... (you finish the sentence)
(to be continued)

Tuesday, February 10, 2015

The Eight Fs of Software Failure

It doesn't have to be that way

Disaster stories always make good news, but as observations, they distort reality. If we consider only software engineering disasters, we omit all those organizations that are managing effectively. But good management is so boring! Nothing ever happens worth putting in the paper. Or almost nothing. Fortunately, we occasionally get a heart-warming story, such as the one Financial World told about Charles T. Fisher III of NBD Corporation, one of their award-winning CEOs for the Eighties:

"When Comerica's computers began spewing out erroneous statements to its customers, Fisher introduced Guaranteed Performance Checking, promising $10 for any error in an NBD customer's monthly statement. Within two months, NBD claimed 15,000 new customers and more than $32 million in new accounts."

What the story doesn't tell is what happened inside the Information Systems department when they realized that their CEO, Charles T. Fisher III, had put a value on their work. I wasn't present, but I could guess the effect of knowing each prevented failure was worth $10 cash.

The Second Rule of Failure Prevention

One moral of the NBD story is that those other organizations do not know how to assign meaning to their losses, even when they finally observe them. It's as if they went to school, paid a large tuition, and failed to learn the one important lesson, the First Principle of Financial Management, which is also the Second Rule of Failure Prevention:

A loss of X dollars is always the responsibility of an executive whose financial responsibility exceeds X dollars.

Will these other firms ever realize that exposure to a potential billion dollar loss has to be the responsibility of their highest ranking officer? A programmer who is not even authorized to make a long distance phone call can never be responsible for a loss of a billion dollars. Because of the potential for billion dollar losses, reliable performance of the firm's information systems is a CEO level responsibility.

Of course I don't expect Charles T. Fisher III or any other CEO to touch even one digit of a COBOL program. But I do expect that when the CEOs realize the value of trouble-free operation, they'll take the right CEO-action. Once this happens, this message will then trickle down to the levels that can do something about it—along with the resources to do something about it.

Learning from others

Another moral of all these stories is that by the time you observe failures, it's much later than you think. Hopefully, your CEO will read about your exposure in these case studies, not in a disaster report from your office. Better to find ways of preventing failures before they get out of the office.

Here's a question to test your software engineering knowledge:

What is the earliest, cheapest, easiest, and most practical way to detect failures?

And here's the answer that you may not have been expecting:

The earliest, cheapest, easiest, and most practical way to detect failures is in the other guy's organization.

Over more than a half-century in the information systems business, there have been many unsolved mysteries. For instance, why don't we do what we know how to do? Or, why don't we learn from our mistakes? But the one mystery that beats all the others is why don't we learn from the mistakes of others?

Cases such as those cited above are in the news every week, with strong impact on the general public's attitudes about computers. But they seem to have no impact at all on the attitudes of software engineering professionals. Is it because they are such enormous losses that the only safe psychological reaction is, "It can't happen here (because if it did, I would lose my job, and I can't afford to lose my job, therefore I won't think about it)."

The Significance of Failure Sources

If we're to prevent failures, then we must observe the conditions that generate them. In searching out conditions that breed failures, I find it useful to consider that failures may come from the following eight F's: frailty, folly, fatuousness, fun, fraud, fanaticism, failure, and fate. The following is a brief discussion of each source of failure, along with ways of interpreting its significance when observed.

But before getting into this subject, a warning. You can read these sources of failure as passing judgment on human beings, or you can read them as simply describing human beings. For instance, when a perfectionist says "people aren't perfect," that's a condemnation, with the hidden implication that "people should be perfect." Frankly, I don't think I'd enjoy being around a perfect person, though I don't know, because I've never met one. So, when I say, "people aren't perfect," I really mean two things:

"People aren't perfect, which is a great relief to me, because I'm not perfect."

"People aren't perfect, which can be rather annoying when I'm trying to build an information system. But it will be even more annoying if I build my information system without taking this wonderful imperfection into account."

It may help you, when reading the following sections, to do what I did when writing them. For each source, ask yourself, "When have I done the same stupid thing?" I was able to find many examples of times when I made mistakes, made foolish blunders, made fatuous boo boos, had fun playing with a system and caused it to fail, did something fraudulent (though not, I hope, illegal or immoral), acted with fanaticism, or blamed fate for my problems. Once, I actually even experienced a hardware failure when I hadn't backed up my data. If you haven't done these things yourself (or can't remember or admit doing them), I'd suggest that you stay out of the business of managing other people until you've been around the real world a bit longer.

Frailty
Frailty means that people aren't perfect. They can't do what the design calls for, whether it's the program design or the process design. Frailty is the ultimate source of software failure. The Second Law of Thermodynamics says nothing can be perfect. Therefore, the observation that someone has made a mistake is no observation at all. It was already predicted by the physicists.

It was also measured by the psychologists. Recall case history 5, the buying club statement with the incorrect telephone number. When copying a phone number, the programmer got one digit incorrect. Simple psychological studies demonstrate that when people copy 10-digit numbers, they invariably make mistakes. But everybody knows this. Haven't you ever copied a phone number incorrectly?

The direct observation of a mistake has no significance, but the meta-observation of how people prepare for mistakes does. It's a management job to design procedures for updating code, acknowledging facts of nature, and seeing that the procedures are carried out. The significant observation in this case, then, is that the managers of the mail-order company failed to establish or enforce such procedures.

In Pattern 1 and Pattern 2 organizations, for instance, most of the hullaballoo in failure prevention is directed at imploring or threatening people not to make mistakes. This is equivalent to trying to build a kind of perpetual motion machine—which is impossible. Trying to do what you know is impossible is fatuousness, which we will discuss in a moment.

After a mistake happens, the meta-observation of the way people respond to it can also be highly significant. In Pattern 1 and Pattern 2 organizations, most of the response is devoted to establishing blame and then punishing the identified "culprit." This reaction has several disadvantages:

• It creates an environment in which people hide mistakes, rather than airing them out.

• It wastes energy searching for culprits that could be put to better use.

• It distracts attention from management responsibility for procedures that catch failures early and prevent dire consequences.

The third point, of course, is the reason many managers favor this way of dealing with failure. As the Chinese sage said,

When you point a finger at somebody, notice where the other three fingers are pointing.

Folly

Frailty is failing to do what you intended to do. Folly is doing what you intended, but intending the wrong thing. People not only make mistakes, they also do dumb things. For example, it's not a mistake to hard code numerical billing constants into a program as was done in the public utility billing cases. The programs may indeed work perfectly. It's not a mistake, but it is ignorant because it may cause mistakes later on.

Folly is based on ignorance, not stupidity. Folly is correctable, whereas frailty is not. For instance, it is folly to pretend not to be frail, that is, to be perfect. Either theoretical physics or experience in the world can teach you that nobody is perfect.

In the same way, program design courses can teach you not to hard code numerical constants. Or, you can learn this practice as an apprentice to a mentor, or from participating in code reviews where you can observe good coding practices. But it's management's job to establish and support training, mentoring, and technical review programs. If these aren't done, or aren't done effectively, then you have a significant meta-observation about the management of failure.
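For readers who haven't seen the practice, here is a minimal Python sketch of the difference between hard-coding billing constants and naming them in one place. The rate, the fee, and all names are hypothetical; they are not taken from the public utility cases.

```python
from dataclasses import dataclass


def bill_hard_coded(kwh_used: float) -> float:
    """Works perfectly today, but the rate and fee are buried in the arithmetic."""
    return round(kwh_used * 0.12 + 4.50, 2)


@dataclass(frozen=True)
class Tariff:
    """Hypothetical billing constants, named and kept in one place."""
    rate_per_kwh: float
    monthly_fee: float


def bill(kwh_used: float, tariff: Tariff) -> float:
    """Same computation, but a rate change touches data, not code."""
    return round(kwh_used * tariff.rate_per_kwh + tariff.monthly_fee, 2)


current = Tariff(rate_per_kwh=0.12, monthly_fee=4.50)
revised = Tariff(rate_per_kwh=0.13, monthly_fee=4.50)

# Both versions agree today; only one of them survives a rate change untouched.
print(bill_hard_coded(500), bill(500, current), bill(500, revised))
```

Neither version contains a mistake. The hard-coded one merely invites a mistake later, when someone has to hunt through the code for every place the old rate appears.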

Fatuousness
It is worse than folly to manage a foolish person and not provide the training or experience needed to eradicate the foolishness. We call such behavior "fatuousness." ("Utter stupidity" would be better, but it doesn't start with F.) Fatuousness is utter stupidity, or being incapable of learning. Fatuous people—which occasionally includes each of us—actively do stupid things and continue to do them, time after time. For example,

Ralston, a programmer, figures out how to bypass the configuration control system and zaps the "platinum" version of an about-to-be-released system. The zap corrects the situation he was working on, but results in a side-effect that costs the company several hundred thousand dollars.

The loophole in configuration control is fixed, but on the next release, Ralston figures out a new way to beat it. He zaps the platinum code again, producing another 6-figure side effect.

Once again, the new loophole is fixed. Then, on the third release, Ralston beats it again, although this time the cost is only $45,000.

The moral of this story is clear. The fatuous person will work very hard to beat any "idiot-proof" system. Indeed, there is no such thing as an "idiot-proof" system, because some of the idiots out there are unbelievably intelligent.

There's no protection against fatuous people in a software engineering organization except to move them into some other profession. What significance do you make of this typical situation?

Suppose you were Ralston's manager's manager. Hunt, his immediate manager, complains to you, "This wouldn't have happened if Ralston hadn't covertly bypassed our configuration control system. I don't know what to do about Ralston. He goes out of his way to do the wrong thing, beating all our systems of protection. And he's done this three times before, at least."

What was the significant part of this story? Ralston, of course, has to be moved out, but that's only the second most important part. Hunt—who has identified a fatuous employee and hasn't done anything about it—is doubly fatuous. Hunt needs to be recycled out of management into some profession where his utter stupidity doesn't carry such risk. If you delay in removing Hunt until he's done this with three employees, what does that make you?

Fun

Ralston's story also brings up the subject of fun. Some readers will rise to the defense of poor Ralston, saying, "He was only trying to have a little fun by beating the configuration control system." Well, I'm certainly not against fun, and if Ralston wants to have fun, he's entitled to it. But the question Ralston's manager has to ask is, "What business are we in?" If you're in the business of entertaining your employees at the cost of millions, then Ralston should stay. Otherwise, he'll have to have his fun hacking somewhere else.

In the actual situation, Ralston wasn't trying to have fun, or at least that wasn't his primary motivation. He was, in fact, trying to be helpful by putting in a last minute "fix." Well-intentioned, but fatuous, people like Ralston are not as dangerous as people who are just trying to have a good time. Hunt could have predicted what Ralston was going to do to be helpful, but

Nobody can predict what somebody else will consider "fun."

Here are some items from my collection of "fun" things that people have done, each of which has resulted in costs greater than their annual salary:

• created a subroutine that flashed all the lights on the mainframe console for 20 seconds, then shut down the entire operating system.

• created a virus that displayed a screen with Alfred E. Neuman saying "What, me worry?" in every program that was infected.

• altered the pointing finger in a Macintosh application to point with the second finger, rather than the index finger. The testers didn't notice this obscene gesture, but thousands of customers did.

• diddled the print spooler so that in December, "Merry Christmas" was printed across a few tens of thousands of customer bills, as well as all other reports. The sentiment was nice, but happened to obliterate the amount due, so that customers had to call individually to find out how much to pay.

The list is endless and unpredictable, which is why fun is the most dangerous of all sources of failure. There are only two preventives: open, visible systems and work that is sufficient fun in and of itself. That's why fun is primarily a problem of Pattern 2 organizations, which seldom meet either of those conditions.

Fraud

Although fun costs more, software engineering managers are far more afraid of fraud. Fraud occurs when someone illegally extracts personal gain from a system. Although I don't mean to minimize fraud as a source of failure, it's an easier problem to solve than either fun or fatuousness. That's because it's clear what kind of thing people are after. There are an infinite number of ways to have fun with a system, but only a few things worth stealing.

I suggest that any software engineering manager be well read on the subject of information systems fraud, and take all reasonable precautions to prevent it. The subject has been well covered in other places, so I will not cover it further.

I will confess, however, to a little fraud of my own. I have often used the (very real but minimal) threat of fraud to motivate managers to introduce systematic technical reviews. I generally do this after failing to motivate them using the (very real and significant) threat of failure, folly, fatuousness, or fun.

Fanaticism

Very infrequently, people try to destroy or disrupt a system, but not for direct gain. Sometimes they are seeking revenge against the company, the industry, or the country for real or imagined wrongs done to them. Fanaticism like this is very hard to stop, if the fanatic is determined, especially because, like "fun," you never know what someone will think is an offense that requires revenge.

Fanaticism, like fraud, is a way of getting the attention of management. With reasonable precautions, however, the threat of terrorism can be reduced far below that of frailty. Frailty, however, lacks drama. In any case, many of the actions that protect you against frailty will also reduce the impact of terrorism. Besides, I cannot offer you any useful advice on how to observe potential terrorists in your organization. That would be "profiling."

Failure (of Hardware)

When the hardware in a system doesn't do what it's designed to do, failures may result. To a great extent, these can be overcome by software, but that is beyond the scope of this book. Fifty years ago, when programmers complained about hardware failures, they had a 50/50 chance of being right. Not today. So, if you hear people blaming hardware failures for their problems, this is significant information. What it signifies can be chosen from this list, for starters:

1. There really aren't significant hardware failures, but your programmers need an alibi. Where there's an alibi, start looking for what it's trying to conceal.

2. There really are hardware failures, but they are within the normally expected range. Your programmers, however, may not be taking the proper precautions, such as backing up their source code and test scripts.

3. There really are hardware failures, and you are not doing a good job managing your relationship with your hardware supplier.

4. Failure attributed to hardware may actually be caused by human error—unexpected actions on the part of the user. These are really system failures.

Fate

This is what most bad managers think is happening to them. It isn't. When you hear a manager talking about "bad luck," substitute the word "manager" for "luck." As they say in the Army,


There are no bad soldiers, only bad officers.

What's Next?

This three-part essay is now finished, but the topic is far from complete. If you want more, note that the essay is adapted from a portion of Chapter 2 from Responding to Significant Software Events. 

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).

I'm sure you can figure out what to do next. Good luck!