Myth of the nines
In information technology, the myth of the nines is the idea that standard measurements of availability can be misleading. Availability is sometimes described in units of nines, as in "five nines", or 99.999%. A computer system with 99.999% availability is considered highly available, delivering its service to the user 99.999% of the time it is needed.
How to calculate five nines
The number N of nines describing a system that is available a fraction A of the time is

N = -\log_{10}(1 - A)
In general, the number of nines is not often used by engineers when modelling and measuring availability, because it is hard to apply in formulae. More often, the unavailability is quoted, expressed either as a probability (such as 0.00001) or as a downtime per year. Availability specified as a number of nines is more often seen in marketing documents, presumably because it looks impressive.
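A minimal sketch in Python (not part of the original article) of both conventions, converting an availability figure into a number of nines with the formula above and into an unavailability:

```python
import math

def nines(availability):
    """Number of nines N for a system available a fraction A of the time."""
    return -math.log10(1 - availability)

print(nines(0.99999))   # ≈ 5.0, i.e. "five nines"
print(nines(0.9999))    # ≈ 4.0
print(1 - 0.99999)      # unavailability as a probability, ≈ 0.00001
```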
The following table shows the downtime allowed for a given percentage of availability, assuming the system is required to operate continuously. Service Level Agreements often refer to monthly downtime in order to calculate service credits to match monthly billing cycles.
Availability % | Downtime per year | Downtime per month* | Downtime per week |
---|---|---|---|
98% | 7.30 days | 14.4 hours | 3.36 hours |
99% | 3.65 days | 7.20 hours | 1.68 hours |
99.5% | 1.83 days | 3.60 hours | 50.4 min |
99.9% | 8.76 hours | 43.2 min | 10.1 min |
99.99% | 52.6 min | 4.32 min | 1.01 min |
99.999% | 5.26 min | 25.9 s | 6.05 s |
99.9999% | 31.5 s | 2.59 s | 0.605 s |
Note that for monthly calculations, a 30-day month is used. This model does not take into account the impact that an outage would have on business if it occurred at a critical moment.
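The figures in the table can be reproduced directly from the unavailability (1 − A). A short Python sketch (not part of the original article) that prints the same values in minutes:

```python
def downtime_minutes(availability, period_days):
    """Allowed downtime in minutes over a period of the given length,
    assuming the system is required to operate continuously."""
    return (1 - availability) * period_days * 24 * 60

# Periods used in the table above: a year, a 30-day month, and a week.
for a in (0.98, 0.99, 0.995, 0.999, 0.9999, 0.99999, 0.999999):
    print(f"{a:.4%}: "
          f"{downtime_minutes(a, 365):10.2f} min/year, "
          f"{downtime_minutes(a, 30):8.2f} min/month, "
          f"{downtime_minutes(a, 7):7.2f} min/week")
```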
The myth explained
The myth of the nines is the implicit assumption that if the computer is operating 0.99999 of the time, then the user's business is operating 0.99999 of the time. In fact, this is often far from the truth. After an outage, the humans using the computer have to scramble to catch up, perhaps apologising to customers, calling them back, entering data written down with ink and paper during the outage, and other unfamiliar chores. A computer outage of a minute might cause a business outage of hours.
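As a purely illustrative sketch (the outage and recovery figures below are assumptions, not taken from the article), the gap between the computer's availability and the business's availability can be made concrete:

```python
# Hypothetical numbers: each computer outage also costs the business a fixed
# recovery period for catch-up work such as re-entering data and calling customers.
outages_per_year = 5
outage_minutes = 1            # computer downtime per outage
recovery_minutes = 120        # assumed business recovery time per outage

minutes_per_year = 365 * 24 * 60
computer_downtime = outages_per_year * outage_minutes
business_downtime = outages_per_year * (outage_minutes + recovery_minutes)

print(1 - computer_downtime / minutes_per_year)   # ≈ 0.99999 (about five nines)
print(1 - business_downtime / minutes_per_year)   # ≈ 0.99885 (under three nines)
```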
A further assumption in this model is that ten outages of one minute each have the same effect on the user as one outage of ten minutes. Again, this is not usually true. If a system is experiencing repeated outages, the user is justified in believing that the system cannot be trusted. There must be something wrong with it that nobody can fix. In this case, the user may regard the computer as a liability. The user may measure ten one-minute outages over a period of six months as a downtime of six months, while the computer's manufacturer measures it as a downtime of ten minutes.
Also, failure probabilities accumulate, so a system made up of five-nine components does not itself have five-nine availability. For example, a system of ten components in series (e.g. disks, motherboard, PSU, RAM, mains power, network), each with 99.999% availability, has only about 99.99% overall availability (i.e. ten times the unavailability). A fault-tolerant system, such as RAID, instead multiplies the unavailabilities: a highly redundant set of ten five-nine disks, of which only one is required, would in theory have an unavailability of (10^-5)^10 = 10^-50, assuming failures are independent.
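A small Python sketch (not from the article) of both cases, assuming independent failures: when every component is required the availabilities multiply, and when only one of a redundant set is required the unavailabilities multiply:

```python
from functools import reduce

def series_availability(availabilities):
    """Every component is required: the availabilities multiply."""
    return reduce(lambda acc, a: acc * a, availabilities, 1.0)

def redundant_availability(availabilities):
    """Only one component is required: the unavailabilities multiply."""
    unavailability = reduce(lambda acc, a: acc * (1 - a), availabilities, 1.0)
    return 1 - unavailability

five_nines = 0.99999
print(series_availability([five_nines] * 10))    # ≈ 0.9999 (four nines)
print(redundant_availability([five_nines] * 2))  # ≈ 0.9999999999 (ten nines)
```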
Lastly, in many cases, "scheduled maintenance" is not included in the reliability calculation. So if the computer must be taken down to replace a failing disk, but the downtime is announced a week in advance, it "doesn't count".
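As a sketch with assumed figures (not from the article), excluding planned maintenance from the calculation can make the quoted availability look much better than what the user actually experiences:

```python
# Hypothetical figures illustrating how excluding scheduled maintenance
# inflates the reported availability.
hours_per_year = 365 * 24
unscheduled_outage_hours = 1          # assumed unplanned downtime per year
scheduled_maintenance_hours = 24      # assumed planned maintenance per year

reported = 1 - unscheduled_outage_hours / (hours_per_year - scheduled_maintenance_hours)
experienced = 1 - (unscheduled_outage_hours + scheduled_maintenance_hours) / hours_per_year

print(f"{reported:.5%}")     # ≈ 99.98855%, what the vendor might quote
print(f"{experienced:.5%}")  # ≈ 99.71461%, what the user actually sees
```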