100% uptime claims considered harmful.

Filed By: Robert Moir

Wild Uptime Claims

One of the strangest things I've seen is people claiming their "system" has 100% uptime. This is most often thrown about during OS Wars, but it also makes an appearance in advertising material: "Yes, you too can improve the uptime of your Beowulf cluster of 386s with just the addition of our new improved power over Ethernet connector", and is also seen in job adverts: "The successful candidate will be able to deliver a system that has 100% uptime, using our budget of $2500 p.a. and an old Leatherman we found in the last person's desk drawer to deliver this world class system required by our team of part time Internet e-filing clerks".

Of course, 100% uptime is a good goal to aim at, as long as everyone realizes that it is impossible to achieve. It's much like a pro athlete who wants to set a record every time they run. It's just not going to happen, but they still try. And when they fail they don't get broken up about it, as long as they know they did their best.

This thought came about after some friends were discussing a job one of us is applying for. The advert demanded 100% uptime from the system the successful applicants would build, though to be fair they promised rather more resources than I did in my first paragraph here. But still. 100% uptime. Is that possible?

Maybe this company with the job were using it as a bullsh*t filter. If you tell them politely why the person who made that statement must have squirrels living in their head where the brains should be, then maybe you get the job. If you nod and smile wisely and say you can do it with just a bit of extra help from your favorite consultancy maybe that's their key to sling you out. That might work, and of course you can use it as a BS filter yourself from the other side of the table: All employers expect computer staff to do the impossible but the good ones have the decency to let you get started and find your desk and figure out where to hang your coat up before hitting you with it!

Some Maths

Ok, what is 100% uptime? What's this "2 nines, 3 nines, 4 nines, 5 nines" I hear about?

First of all, let's decide how many hours are in our year. We tend to talk about 24x7x365 for continuous operations, meaning of course 24 hours a day, 7 days a week, 365 days a year. Don't forget that this is shorthand for a nice turn of phrase and most certainly not the correct formula to figure out how many hours are in a year.

A correct formula might be 24x7x52... hmm, sounds right until you work it out. But it's not (ok, it helps that I already know the answer): 52 weeks is only 364 days, so this comes to 8736 hours. This sum contains too much rounding for my needs and it's more cumbersome than it needs to be anyway.

How about 24 hours x 365 days? 24x365 is 8760. Better, but not better enough when we start worrying about minutes and seconds of downtime later on (trust me, when you see what it takes to get "99.999% uptime" you'd be pissed if you had let me throw away a quarter of a day here). Ok, there's a leap year every 4 years, so add 1 day / 4 to each year. 24x365.25=8766 hours.

If you want to be more precise then go ask Google's calculator, but for our purposes let's use 8766, and feel free to flame me later if you disagree, because the principles apply even if you want to whine about my arithmetic.
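For anyone who would rather check the sums than flame me, here is a minimal sketch of the candidate formulas side by side. The last entry uses the Gregorian calendar's average year of 365.2425 days, which is my own addition for comparison and not a figure used anywhere else in this article.

```python
# Candidate ways of counting the hours in a year, as discussed above.
HOURS_PER_DAY = 24

candidates = {
    "24x7x52 (52 whole weeks)": HOURS_PER_DAY * 7 * 52,
    "24x365 (ordinary year)": HOURS_PER_DAY * 365,
    "24x365.25 (leap day averaged over 4 years)": HOURS_PER_DAY * 365.25,
    # Assumption: Gregorian average year, shown only for comparison.
    "24x365.2425 (Gregorian average)": HOURS_PER_DAY * 365.2425,
}

for label, hours in candidates.items():
    print(f"{label}: {hours:.2f} hours")
```

The gap between 8766 and the Gregorian figure is about 11 minutes a year, which is why I'm happy to stick with 8766.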

Anyway, once we have our figure for hours in a year we know our system has to be up 8766 hours a year to score a perfect 100. We can also start to look at how much downtime is actually behind those "4 nines" claims people make. Let's start with the simple stuff and place our 8766 over 100 to get 1% downtime (or 99% uptime if you like). 8766/100 gives us 87.66 hours downtime. Put that over 24 hours to get our time in days... and we have 3.6525. Let's be nice to ourselves and round that up to 3.7 days. And let's use and abuse that figure a bit further too, to get the rest of our numbers.

Availability class   Availability measurement   Annual downtime
Two nines            99%                        3.7 days
Three nines          99.9%                      8.8 hours
Four nines           99.99%                     53 minutes
Five nines           99.999%                    5.3 minutes
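The figures in the table can be reproduced with a short sketch like the one below, assuming the 8766 hour year settled on earlier. The function names are made up for this illustration.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766, the figure this article settles on


def annual_downtime_hours(nines: int) -> float:
    """Hours of downtime allowed per year at the given number of nines."""
    # Two nines is 99%, i.e. a downtime fraction of 10**-2, and so on.
    return HOURS_PER_YEAR * 10 ** -nines


def humanize(hours: float) -> str:
    """Express a duration in the largest unit that keeps it at 1 or more."""
    if hours >= 24:
        return f"{hours / 24:.1f} days"
    if hours >= 1:
        return f"{hours:.1f} hours"
    return f"{hours * 60:.1f} minutes"


for nines in range(2, 6):
    availability = 100 * (1 - 10 ** -nines)
    downtime = humanize(annual_downtime_hours(nines))
    print(f"{nines} nines  {availability:.3f}%  {downtime}")
```

Four nines works out at 52.6 minutes, which the table rounds up to 53.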

Why is this an Operational Issue?

Actually, between 99% and 99.9% should be attainable for most reasonable system administrators. At this level, hitting the target is simply a matter of having the following:

  1. Well trained staff.
  2. Well specified networking and server equipment that is designed to handle the load it is carrying.
  3. A secure air conditioned alarmed server room so that the servers and switches are not disturbed by "environmental" issues such as some tool switching your air conditioning off *grr* or a cleaner who hasn't been told to stay out of the server room unplugging something so they can plug in the vacuum cleaner to tidy the place up a bit.
  4. A UPS for all the servers under this "contract". Oh and the switches too.
  5. A change-log for all servers, managed switches and routers.
  6. Some redundancy in the key 'choke' points of your network design.

Hmm. Did I say "Simply a matter of"? Did I mention that this sort of thing is going to cost a bomb and will certainly require at least one full time member of staff who knows how to conduct themselves in a server room?

Of course if you want to go for the real specialist stuff then you had better have a real specialist budget too. Hot spares, "live" backup sites, a testing lab that replicates your production network fairly closely, custom hardware and software support and clustering for everything don't come cheap. Nor will the staff for an advanced setup, because you don't recruit the right sort of people for a job like this by hanging around MCSE boot camps trying to hire people who are looking for their first computing job after a career change.

What you need to do at this point is take a hard look at what you want vs the resources you can set aside for going after it, and keep a realistic goal in mind. Also keep in mind that the improvements get harder and more expensive as you get closer to the unattainable. For an already reasonably well equipped system it's probably free in regards to capital outlay and easy in regards to effort needed to go from "somewhere in the 80 to 90% bracket" to 99%; no well run and reasonably well equipped system should be offline for more than 3.7 days in a year without extenuating circumstances. Working at this end of the scale will have a lot of benefit in the efficiency of the day to day operation of the system for all users, and you'll probably find it easy to get the budget for modest improvements here as the users will easily see where the money went.

Going from 99% to 99.9% is going to take a real commitment to pushing up availability. At this level you start needing to work through my list above carefully, redesigning your network to provide redundancy in key areas, buying extra equipment, etc. Here's where you start spending money "just in case". At this point it's easy to rationalize it as insurance, which after all is exactly what it is. When you buy insurance you are taking out a bet that you'll need it, and the broker is betting that you won't. Same here. When you buy something you are betting that you need it. When the boss or the accountant asks you if you are made of money and crosses stuff off your list then they are betting that you won't need it.

By the time we get to sitting at 99.99% and wanting to get to 99.999% you are looking at a major effort for absolutely no day to day return; you'll be spending a lot of money on very specialized equipment and configurations that will only get called into play in the event of weird and esoteric disasters. You've gone from taking out general disaster insurance to writing different insurance policies based on whether the disaster is a tsunami, an invasion by annoyed dolphins or an attack of the two-headed aliens from Alpha Centauri. At this end of the spectrum you'll even be spending a lot of money on "hot testing" of all this specialist stuff, just because the "real thing" is so unlikely to happen that you can't be sure the equipment is working otherwise.

Still, having said all that, it's an interesting technical problem and I'd love to have the budget to set something like that up myself.

Why is this a security problem?

Well, if you are shooting for this moon, how can you apply patches? Even on systems that don't require reboots for patches, it's clear that patching a particular subsystem will require at least a brief interruption of that subsystem while you restart it after applying the patch... goodbye 100%. In fact, look at the last couple of figures. 53 minutes downtime a year for 99.99% uptime? Or for 99.999%, a mere 5.3 minutes?
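To put rough numbers on that, here is a small sketch comparing a patching schedule with the annual downtime budget. The schedule is purely an assumption for illustration (one patch window a month, each costing a 5 minute restart); plug in your own figures.

```python
MINUTES_PER_YEAR = 24 * 365.25 * 60  # based on the 8766 hour year used above


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime allowed per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Assumptions for illustration only, not measurements from any real system.
PATCHES_PER_YEAR = 12
RESTART_MINUTES = 5

patch_downtime = PATCHES_PER_YEAR * RESTART_MINUTES

for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    verdict = "fits" if patch_downtime <= budget else "blows the budget"
    print(f"{target}%: budget {budget:.1f} min, "
          f"patching {patch_downtime} min -> {verdict}")
```

Under those assumptions, routine patching alone fits comfortably inside three nines but already overshoots four nines, before anything has actually gone wrong.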

If you are in a job that demands 100% uptime, what do you do when your machines may be compromised? Do you shut them down, as you would correctly want to do when investigating a security breach, or do you keep them all online, as you would correctly want to do when attempting to get as close as possible to 100% uptime? Do you blow your "uptime" or do you blow your network security and data integrity? There's only one answer to that, isn't there?

Many thanks to the bucketeers, without whom this article would never have occurred to me.
