A box of pagers. How cool is that?
The answer is, of course, ‘not very’ but that wasn’t clear at the time. At the time it was a box full of unfamiliar technology — hence ‘cool.’
For many years the computer room had been growing and it was pretty clear that eventually it would need to be formally monitored. Fortunately or unfortunately, though, nobody wanted to open that particular can of worms. At the time of the box-full-o-pagers, we had at least half a dozen mail servers of various sizes, four or five login servers and various file servers, web servers, DNS servers, time servers, database servers and lord knows what else. A lot of stuff. One of the guys named Mike (his lovely and talented wife and daughter were also named Mike) liked to claim that we had at least one of every model his company made. (We even had stuff his company hadn’t made in a decade up and running and doing useful work.) While his statement wasn’t strictly accurate, one look at the room and it was clear that his hyperbole was understandable. At the time we kept track of things (I won’t use the word ‘monitor’ because that implies a level of organization and rigor that simply wasn’t present) via an array of ad hoc procedures.
I remember, for example, my first attempt to monitor (There’s that word. Sorry.) a new type of disk array that had come through the door. We (they, actually — the hardware belonged to, not ‘us’ but a semi-external group that had space in ‘our’ computer room. We just helped take care of their stuff. Sort of.) had several of them and for quite a while they would blow disk drives like popcorn so it was important to track failing (and failed) disk drives. (Eventually the ridiculous rate of hardware failure was traced to a software bug, but that’s beyond the scope of this story. Rant. Whatever.) For a while we just wandered by daily and looked for red lights but that quickly became annoying. And then we got a message from The Vendor: “Look!” they said, “we have this snazzy all-singing all-dancing tool that will monitor this hardware and other stuff too! And best of all — for you, today, it’s free!”
Now, the word ‘free’ carries a lot of weight in certain environments so I gave it a try: I downloaded the installation kit, read the installation guide, ran the installation script and….
It failed. It’s not that it didn’t work properly — it didn’t work at all. It didn’t even install. If I can be forgiven for stealing from myself, the installation script coughed up an alphabet soup of cryptic errors and exited.
I repeated the procedure, not because I’m insane but to make sure that it wasn’t due to finger troubles or other obvious PEBKAC issues. The result was the same so I gave up, mostly. (By that I mean that I investigated a bit but not a whole lot. I probably could have asked the vendor to help me but I’ve described elsewhere that dealing with their support arm could be a little… problematic.) So what I did was put the installation kit somewhere and write a program to do a ‘good enough’ job.
Despite how this may look, it wasn’t a case of “If you want something done right, do it yourself.”
Well, maybe a little.
But only a little: I don’t claim that what I did was ‘doing it right’. It was, as mentioned, ‘good enough.’ (See ‘worse is better’ above.) It ran every hour or two and scanned various system log files for evidence of hardware failure. If it found any it would send an email message, once per day, to alert folks to that fact along with pertinent information like how bad the problem seemed to be, what hardware was having trouble, when it happened, that sort of thing. The program was already half-written because we already did something similar on several (most?) other machines. (I mentioned above that we kept track of things by an array, more of a patchwork, really, of ad hoc procedures. By that I mean things like this. We had little homegrown (some of them were stolen but it’s not what you think) tools to keep track of disk drives (The Vendor kept that ‘simple’ by having at least five different disk management software packages which didn’t work together or share a common command set. Sometimes, though, they’d try to be helpful: once they gave wildly different packages the same name.), file space, services and, well, a lot of things.) This stuff wasn’t pretty but it was compact, unobtrusive and it worked.
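The shape of a scan-the-logs, mail-once-a-day tool like that can be sketched roughly as follows. This is not the original program, just a minimal illustration: the failure patterns, the function names and the sample log lines are all invented for the example, and real log formats vary a lot from vendor to vendor.

```python
import re
from datetime import date

# Hypothetical patterns; these are assumptions, not taken from any real vendor's logs.
FAILURE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"disk.*fail", r"i/o error", r"media error")
]


def find_failures(lines):
    """Return the log lines that match any of the failure patterns."""
    return [
        ln.rstrip("\n")
        for ln in lines
        if any(p.search(ln) for p in FAILURE_PATTERNS)
    ]


def should_alert(failures, last_alert_day, today):
    """Alert only if there is something to report and nothing was sent today."""
    return bool(failures) and last_alert_day != today


if __name__ == "__main__":
    sample = [
        "Jan  5 03:12:44 array1 kernel: disk c1t2d0 failed\n",
        "Jan  5 03:13:01 array1 sshd: session opened\n",
    ]
    hits = find_failures(sample)
    if should_alert(hits, last_alert_day=None, today=date.today()):
        # A real tool would send mail here; printing keeps the sketch self-contained.
        print("Would send one email listing:", hits)
```

The scan can run as often as you like (every hour or two, say); the `should_alert` check is what keeps the mailbox down to one message per day.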
There were holes in the coverage, of course. Fortunately, we had thousands of people monitoring things for us so if something important stopped working the way it was ‘supposed’ to, the phone would ring. Usually within seconds, although in one notable exception (a problem resulted in a ‘pay’ service becoming ‘free’) it took months. Funny thing, that.
So the can of worms was opened; it was a fairly large can. The first step was a centralized monitoring system. A software package (open source from a very clever individual) was selected and a server was obtained to run it. A lot of things could be monitored ‘out of the box’ but some things required writing code. All of this took time but was fairly non-contentious.
Those were the easy steps. Next came the formulation of a list (and prioritizing it!) of everything that ‘needed’ monitoring; that part got a little heated at times. Next came the box-o-pagers which came with its own issues: What hours of the day would things be monitored? What days of the week? Who would do it? How would they be compensated?
Along the way we learned some things. For example, we learned that the pager manufacturer gave you a choice of several different sound effects and every single one of them was annoying and intrusive: when the box-o-pagers was distributed you could hear all the recipients learning this particular fact. Over and over again — very little work got done that afternoon. We also learned fairly early on that if your central monitoring system has problems, the ‘on call’ guy can get six hundred pages. We also learned some unexpected things about some of the machines being monitored. In particular:
We had a locally created application for managing installed software on Windows-based PCs. It allowed people to select applications from a menu that they’d like installed and the applications would be downloaded and installed in a standardized way in a standardized location — the idea was to help manage unmanaged computers and make supporting them easier. As far as I know, it worked pretty well.
It had two components: a client installed on the PC and a server which stored the installation kits. The server typically didn’t work all that hard — most of the work was done by the client, after all; all the server had to do was ‘hand out’ the kits when asked. Because of this, as the server grew older it was never considered a high priority for an upgrade. Most of the time that wasn’t a problem. Except for the first week in September when the students came back. Oh, and Friday evenings at 8.
It turned out that every Friday evening, just after 8 PM, the server would page ‘down’ and then, ten minutes later, would page ‘up’. But when it was looked at afterwards, nothing was wrong. Nothing had happened except for a massive spike in its workload. Huh?
It took a while but eventually we found out that the folks who wrote the application wanted to make sure that clients would ‘check in’ from time to time to see if there were any new packages (or new versions of old packages) that the PC’s owner might want. That’s clearly a good idea. The trouble arose because all the clients had been told to check in at — you guessed it — around 8 PM on Fridays. So the server — the kinda old, kinda slow server — would be sitting there, sleepily minding its own business when suddenly several thousand computers would connect to it and say ‘Wassup?’. It didn’t react particularly well — or quickly — to that. So when the central monitoring system asked ‘are you there’ the response would be slow — so slow that the monitor would decide that the server wasn’t there and would label it ‘down’. Things would recover, of course, so the next monitoring check would work fine — hence ‘up.’
Simple, really, when you think about it. The solution was fairly simple too: tell the clients to ‘check in’ at different times. This was done but it took quite a while for the changes to trickle out to ‘enough’ clients. In the meantime, the on-call person would get two pages every Friday night. Everyone involved learned to ignore these.
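For the curious, one common way to get ‘different times’ is to derive each client’s check-in slot from something stable, like its name, so the load spreads out without anyone having to hand out a schedule. The sketch below is only an illustration of that idea, not how the real application did it: the function name, the client names and the choice of a one-week window are all assumptions made for the example.

```python
import hashlib

WEEK_MINUTES = 7 * 24 * 60  # assumed window: spread check-ins across a whole week


def checkin_offset_minutes(client_id: str, window: int = WEEK_MINUTES) -> int:
    """Map a client ID to a stable minute offset within the window.

    Hashing the ID gives every client the same slot each week while
    scattering different clients roughly evenly across the window.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window


if __name__ == "__main__":
    # Hypothetical client names, purely for demonstration.
    for name in ("pc-0001", "pc-0002", "pc-0003"):
        offset = checkin_offset_minutes(name)
        day, rem = divmod(offset, 24 * 60)
        hour, minute = divmod(rem, 60)
        print(f"{name}: day {day}, {hour:02d}:{minute:02d}")
```

With several thousand clients and about ten thousand one-minute slots, the server sees a trickle of check-ins all week instead of one stampede on Friday evening.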
I mentioned above that the pagers we were using had annoying sound effects — really annoying. Despite this, everyone who carried one had to find something that he could live with. Me? I settled on a beep (the manufacturer called it a ‘chirp’) combined with vibration. Taken together they were hard to miss while being not-terribly-annoying. Win.
The last piece of the story is that Friday evening is a curling night. I tend to curl in sweatpants: I find that the curling delivery is less uncomfortable in baggy pants and sweatpants fit the bill pretty well in addition to being warm. (All that ice is cold, after all.) I tended to carry my pager in the left pocket of my sweatpants and I throw right-handed (even though I’m left-handed; it’s a long story). Just after 8 PM I was in the middle of delivering a stone when something in my pocket went “Beep! Bzzzzzzzz.” I had forgotten what time it was so I was surprised by the… stimulus. I sort of flopped around a bit in an ungraceful fashion. My sweepers looked at me with ‘What the hell is wrong with you?’ looks on their faces.
But I made the shot.