Computer software, like every other human creation, is seldom perfect. There are many, many reasons for this. Some problems may stem from a design flaw. (Trivial example: once I was asked to write a small piece of code to send customized email to two or three dozen people in a class. It worked fine — until the person I wrote it for gave it to someone else who used it to send email to two or three thousand. It wasn’t pretty.) Problems can stem from errors in logic. Problems can stem from unexpected circumstances. Problems can stem from…. The list is endless (well, maybe not endless, but it sure is long).
One thing that’s common to them all (if the software is still used, that is) is that problems need to be fixed. Somehow. And, once fixed, the fixes need to be ‘pushed out’ to all the copies that are still in use. That’s sometimes easy, but if the software is complex or widely distributed (or both) it’s usually not. A lot of very clever people have thought about ways to do this over the years.
Fixes can be distributed in a variety of ways. Modifications to the source code can be distributed, which isn’t all that uncommon for (duh) open source software. Fixes can also be distributed in compiled (‘binary’) form; these are usually drop-in replacements for existing programs. Whatever the scheme, the fix is often called a patch.
This is, of course, a grossly oversimplified discussion of the issues, but it provides enough background to understand the events I’m going to describe.
Some years back, I was part of a team (can two people be called a team?) that was responsible for the care and feeding of a network of (in round numbers) a couple hundred computers, most running some flavour of un*x; we had several (four or five) different flavours of un*x running on several (four or five) different hardware platforms.
It was a little chaotic.
One day a moderately important but not absolutely essential program that ran on some of these machines choked, gasped, and died. Being not completely unobservant (the phone rang and someone said “I can’t log on!”) we tracked the problem down and restarted the dead programs. Inevitably, they died again, this time overacting like a bad actor in a high school drama.
Repeat a couple of times and we were eventually convinced that there was a Problem and we should phone the vendor. (“Not The Vendor. Room 101! Aieeee!”) After the requisite time navigating telephone menus (“For a list of the ways that technology has failed to improve our quality of life, press 2.”) we were eventually put in touch with someone who transferred us to someone else who consulted with yet another someone else and eventually announced “This is a known problem. We have a patch.” (See, the preamble wasn’t a total waste of time.) “The patch is XXX-01 for hardware platform A and YYY-01 for hardware platform B.”
So I downloaded them and applied them to the affected systems. I started with XXX-01; everything worked like a charm and the problems went away — no more teenaged histrionics. This is how it’s supposed to work.
Feeling optimistic — even cocky — I moved on to YYY-01. Of course you’ve figured out that if it had worked, I wouldn’t be writing this. It failed. It’s not that the problem didn’t go away, rather, the patch wouldn’t install. The ‘installpatch’ utility coughed up an alphabet soup of cryptic errors and exited.
That’s not the kind of thing that fills one with confidence.
I investigated and figured out what was wrong. There’s a lot of stuff in each patch kit, but each one has a file in it that sort of acts as a manifest. It contains a list of the files in the kit along with a variety of information for each one, including a checksum. Together, these let the ‘installpatch’ utility verify the kit before it touches anything: Is this patch for this version of the operating system? For this hardware platform? Are all the files present? Have any of them been damaged or altered? After all, if a file has been corrupted in some way, you really don’t want to install it on your system.
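For the curious, here is roughly what that kind of check looks like. This is a sketch in Python, not the vendor’s actual code: the manifest layout, the field names (`os_version`, `platform`, `files`), the choice of SHA-256, and the `verify_kit` function are all assumptions made up for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex digest for a file's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_kit(manifest: dict, kit_files: dict, os_version: str, platform: str) -> list:
    """Return a list of problems; an empty list means the kit looks installable."""
    problems = []
    # Is this patch for this version of the operating system?
    if manifest["os_version"] != os_version:
        problems.append(f"kit is for OS version {manifest['os_version']}, not {os_version}")
    # For this hardware platform?
    if manifest["platform"] != platform:
        problems.append(f"kit is for platform {manifest['platform']}, not {platform}")
    for name, expected in manifest["files"].items():
        if name not in kit_files:
            # Are all the files present?
            problems.append(f"missing file: {name}")
        elif checksum(kit_files[name]) != expected:
            # Have any of them been damaged or altered?
            problems.append(f"checksum mismatch: {name}")
    return problems


# A made-up kit with two files, and two made-up manifests describing it.
files = {"bin/daemon": b"fixed program", "README": b"notes"}
good = {
    "os_version": "5.1",
    "platform": "B",
    "files": {name: checksum(data) for name, data in files.items()},
}
bad = {**good, "files": {**good["files"], "README": checksum(b"something else")}}

print(verify_kit(good, files, "5.1", "B"))  # []
print(verify_kit(bad, files, "5.1", "B"))   # ['checksum mismatch: README']
```

The point of collecting problems in a list rather than stopping at the first one is simply that it makes the failure easier to diagnose; a real installer may well behave differently.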
What I found was that the ‘manifest’ file had incorrect checksums for some of the files in the kit. This meant that the kit would not — could not — ever install. It also meant that no one had ever tried to install it before making it available to The Customer. (That would be me if you’re following along at home.)
That’s really not a QA department you want on your side.
More practically, what was I going to do about it? I could have fixed the incorrect checksums, but that would have changed the checksum of the manifest so the installation would still fail. I was confident that I could eventually have made things work, but did I want to be the first one on my block (heck, the first one anywhere) to try this?
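To see why hand-editing the manifest doesn’t help, here is a small Python sketch. It assumes the manifest is serialized to bytes and that a checksum of those bytes is recorded somewhere the installer trusts; the JSON serialization, the SHA-256 digest, and every name below are illustrative assumptions, not the vendor’s actual format.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Hex digest of some bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def manifest_bytes(manifest: dict) -> bytes:
    """Serialize a manifest deterministically so it can be checksummed."""
    return json.dumps(manifest, sort_keys=True).encode("utf-8")


# The manifest as shipped, containing a wrong checksum for one file.
shipped = {"files": {"bin/daemon": "not-the-right-checksum"}}
# The checksum of the manifest itself, as recorded when the kit was built.
recorded = digest(manifest_bytes(shipped))

# "Fixing" the per-file checksum by hand changes the manifest's contents...
repaired = {"files": {"bin/daemon": digest(b"fixed program")}}

# ...so the manifest no longer matches the checksum recorded for it.
print(digest(manifest_bytes(shipped)) == recorded)   # True
print(digest(manifest_bytes(repaired)) == recorded)  # False
```

Getting past that would mean updating the recorded checksum as well, and whatever else depends on it, which is exactly the sort of chain I didn’t feel like being the first to follow.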
Call me unadventuresome, but no, I didn’t.
So I called the vendor again, navigated the never-ending menus, and tried to log another call. Note the word ‘tried.’
First of all, they didn’t know who to assign it to. The operating system group? While they might have been responsible for the program being fixed, they weren’t responsible for the way the fix was packaged. They suggested assigning the call to the group responsible for the corporate website, since that’s where the patch kit ‘lived.’ Except that group didn’t want it either; they claimed to be responsible for the website, but not for any of the actual content. Eventually the call was assigned to someone (I don’t remember who).
Some time later, that ‘someone’ phoned me. I explained everything to him as best I could and he told me that this wasn’t their department and could we reassign this ticket?
Repeat this process for a couple of weeks; I’ve long since forgotten how many times I explained the details of the problem. Eventually, though, there was success! I finally spoke to someone who seemed to understand and said that he’d get the problem fixed ASAP.
More time passed and lo! A new patch appeared on the website, one named YYY-02. A new version! Finally! Yay! With trembling hands (well, not really, but they always say that) I downloaded it, copied it to one of the machines that had been malfunctioning for weeks, and ran the ‘installpatch’ utility.
It failed. Again, I investigated. Remember how I mentioned that it (the ‘installpatch’ utility) looked at operating system version, hardware platform, stuff like that? Well, the manifest in this version of the patch claimed it was for a hardware platform that didn’t actually exist. This meant that, again, the kit would never install. It also meant that, again, no one had ever tried to install it before releasing it. Which meant that…. I picked up the phone and started to dial.
Take the above process and repeat it with longer and somewhat more turgid explanations at every step. (I seem to remember explaining everything to at least five different people, but I don’t remember the details. By this time I really had no expectation of ever actually seeing a fix, but I was fascinated by how long this might take.)
More weeks passed. And one day, there it was: YYY-03. This time for sure! They had finally listened and done the TEN MINUTES of work needed to fix the problem. Surely this time they would have tested it before releasing it. Surely this time it would work. Surely this time the sun would shine, the rain would stop, the Ewoks would dance and all would be happiness and joy.
It failed again. I investigated and found that YYY-03 was exactly the same as YYY-01. Exactly. They had replaced a defective kit with another kit that was known to be defective. This stroke of genius took weeks.
I gave up and uninstalled the defective program from all the machines it was installed on.
At around this time I was speaking to someone in the locker room at the curling club. He was thinking about making some investment decisions regarding the vendor that I have mentioned (but not by name) above. He asked me if I had invested any money in this company. I hadn’t. Why not? Well… “You know how some people that work in a restaurant won’t eat there?” “Yes.”
“It’s like that.”