"Damn it! Why the hell isn’t my <insert application or device> working?" screams the irate <customer, manager, CIO>. "What do you get paid to do? Play video games all day?"
Ouch.
Let’s look at this objectively. What exactly is your job? Simple, right? Keep the IT systems running.
What exactly does that mean? Let’s be good engineers and model it. You land in Dumbonia and they have a brand new IT system: a stone tablet, a chisel, and a hammer. Your job is to make sure the system doesn’t grind to a halt. They have already figured out that if the tablet breaks, everything goes offline. If the hammer handle breaks, same result. If the chisel gets really dull, everything gets really slow. So you whip out your iPad (scratching your head over the stone tablet thing, but they are the customer and the customer is ALWAYS right), do some calculations, and decide that you can monitor the tablet and hammer handle for micro-fractures and the chisel point for sharpness. You set up some thresholds, and any time a threshold is breached, it is a fail state and you swap in a new piece of hardware before everything crashes. Swapping in new hardware is relatively quick and painless. You keep some hammers, tablets, and chisels in the back room. Set up some change management protocols. Institute a communication chain.
Problem solved.
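For the engineers who want to see it, here is a minimal sketch of that threshold check. Everything in it is made up for illustration: the metric names, the limits, and the readings are hypothetical, since Dumbonia has not published a spec.

```python
# Hypothetical thresholds; none of these numbers come from a real spec.
THRESHOLDS = {
    "tablet_fracture_mm": 0.5,  # longest tolerated micro-fracture in the tablet
    "hammer_fracture_mm": 0.3,  # longest tolerated micro-fracture in the handle
    "chisel_dullness": 0.7,     # 0.0 = razor sharp, 1.0 = butter knife
}


def breached(readings):
    """Return the name of every metric whose reading exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items() if readings[name] > limit]


# Made-up readings: the hammer handle is past its limit, the rest are fine.
readings = {
    "tablet_fracture_mm": 0.1,
    "hammer_fracture_mm": 0.4,
    "chisel_dullness": 0.2,
}
print(breached(readings))  # ['hammer_fracture_mm']
```

Anything in that list means it is time to walk to the back room and swap in a spare.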
The killer here is the fail states. How many good states are there? How many fail states? Well, let’s map it out. Three devices is easy.
Tablet | Hammer | Chisel | Result
-------|--------|--------|-------
accept | accept | accept | accept
accept | accept | fail   | fail
accept | fail   | accept | fail
accept | fail   | fail   | fail
fail   | accept | accept | fail
fail   | accept | fail   | fail
fail   | fail   | accept | fail
fail   | fail   | fail   | fail
Because only an engineer will have read this far, you should see where this is going. First, it is obvious that ALL OF YOUR EFFORTS are directed towards maintaining ONE accept state out of all the possible states in the system. You should also see that this neatly maps onto a binary number: one bit per device, 0 for accept and 1 for fail.
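If you would rather let a machine do the mapping, here is a small sketch that enumerates every state of the three-device system as a binary number. The encoding (0 for a healthy device, 1 for a failed one) and the function name are assumptions chosen for illustration.

```python
from itertools import product

DEVICES = ("tablet", "hammer", "chisel")


def system_states(devices):
    """Yield (bits, result) for every combination of device states.

    Assumed encoding: 0 = device is fine, 1 = device has failed.
    The system as a whole is accepted only when every bit is 0.
    """
    for bits in product((0, 1), repeat=len(devices)):
        yield bits, "fail" if any(bits) else "accept"


states = list(system_states(DEVICES))
for bits, result in states:
    print("".join(map(str, bits)), result)

accepts = sum(1 for _, result in states if result == "accept")
print(f"{accepts} accept state, {len(states) - accepts} fail states")
# last line printed: 1 accept state, 7 fail states
```

Add a device and the number of rows doubles, but the number of accept rows stays stuck at one.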
Now use that massive brain of yours and extrapolate.
If you have a moderately sized IT system with only 500 devices, how many accept states are there? 500 really isn’t that many, especially if the granularity is useful at all. A server has memory, hard drives, a NIC – that alone is 3 devices. 10 servers and 90 workstations is 300 separate monitoring points looking at just those three things. Add a power supply and a CPU fan to each machine and you are at 500.
So again, how many accept states?
Only 1.
Repeat that to yourself. ONLY ONE!!!
Any other state gets a phone call from a pissed off person. My hard drive is making funny noises! I don’t have internet! I keep getting blue screens with funny numbers on them!
How many fail states?
That’s easy. 2^500-1. That’s a big number. How big? Well, some bright physicist estimated the number of atoms in the known universe and came up with roughly 10^80. That’s also a big number.
It’s easy to convert base 2 to base 10: multiply the exponent by log10(2), which is about 0.301. 2^500 is approximately 10^150. That’s 10 raised to the 150th power. That means there are more fail states in your 500 devices than there are atoms in the universe. By 70 orders of magnitude. You have more fail states in your moderately sized network than there are atoms in 10 raised to the 70th power UNIVERSES.
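Don’t take the arithmetic on faith. Python integers are arbitrary precision, so the sketch below counts the fail states exactly and reads off the power of ten. The function names are hypothetical, and the atom estimate is passed in as an exponent rather than hard-coded, since published estimates vary.

```python
import math


def fail_states(devices):
    """Exact count of fail states: every combination except the all-good one."""
    return 2 ** devices - 1


def power_of_ten(n):
    """Exponent of the largest power of ten that does not exceed n (n >= 1)."""
    return len(str(n)) - 1


def gap_in_orders(devices, atoms_exponent):
    """Orders of magnitude between the fail-state count and 10**atoms_exponent.

    atoms_exponent is an assumption supplied by the caller.
    """
    return power_of_ten(fail_states(devices)) - atoms_exponent


print(power_of_ten(fail_states(500)))  # 150
print(round(500 * math.log10(2), 2))   # 150.51, the same answer via logarithms
```

Call gap_in_orders(500, your_favorite_exponent) with whatever atom estimate you trust. The universe loses either way.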
Your job is to find the ONE accept state in all of that and KEEP THE ENTIRE SYSTEM IN THAT STATE.
Your job sucks because everyone thinks that there is nothing else in the universe but "it works or it doesn’t work". But what is actually in the universe is 10^150 ways for it to break and 1 way for it to work.
Reality bites.
Brought to you by your friendly neighborhood idiot savant, Bob Castleman