Thursday 20 April 2017

Sys-Admin/DevOps: Assumption is Danger

As a systems admin, a DevOps engineer, or whatever your job title might be, never, ever assume that the person you're handing a system to has a clue.  This might seem harsh, but it's true, and it proves itself true time and time again.

"Assumption is the mother of all f**k ups"

About a year ago I deployed a system which automatically sent requests to remote machines (via SMS), getting those machines to report their status or send back error information, and also to gather some basic information.
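For the curious, the request-and-reply shape of it was roughly as sketched below. This is only an illustrative outline under my own assumptions, not the real code: the machine list, the send_sms_request() stub and the "STATUS?" message are hypothetical stand-ins for whatever SMS gateway and report format is actually in use.

    #!/usr/bin/env python3
    # Illustrative outline only: ask each remote machine for a status report over SMS.
    # send_sms_request() is a stub standing in for a real GSM modem / SMS gateway call.

    import time

    REMOTE_MACHINES = {                     # hypothetical fleet: site name -> SIM number
        "site-alpha": "+440000000001",
        "site-bravo": "+440000000002",
    }

    def send_sms_request(number, message):
        """Stub: send `message` to `number` and return the reply text.

        The real system would talk to a GSM modem or SMS gateway here and wait
        for the machine's status/error report to come back.
        """
        raise NotImplementedError("wire this up to your SMS gateway")

    def poll_fleet():
        results = {}
        for name, number in REMOTE_MACHINES.items():
            try:
                results[name] = send_sms_request(number, "STATUS?")
            except Exception as exc:        # gateway down, timeout, not wired up...
                results[name] = f"NO REPLY ({exc})"
            time.sleep(1)                   # be gentle with the gateway
        return results

    if __name__ == "__main__":
        for name, status in poll_fleet().items():
            print(f"{name}: {status}")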

It has run happily for a whole year, and it has all been pretty plain sailing. The hours and hours of work I put into automating it and keeping it self-sustained have paid off: zero faults, zero downtime. Self-regulation is the way forward for me; even if it took slightly longer to put the system in place, it has needed no human input for nearing a year!

However, about a week ago the unit needed to move: it needed to be physically picked up, taken out of my small server room and into the official server room, basically a dark cupboard controlled not by myself or my cohort, but by the IT boffins.

Fine. I notified the customers, went off to the IT area, sorted out who I was to hand it to and physically delivered it to the chap. I watched him start to plug it all back together: power, wires, boot, fine...

I assumed he'd do this seamlessly....

Until this morning... well, a morning last week, as I post these with a date in the future.  That morning was hell: I walked into a wall of customers unable to get to their machines. The Easter weekend was looming, performance needed to be monitored, customer sites didn't have their regular staff, and explaining to temporary cover staff that the system would be off was not a prospect I relished.

To be frank, there was a lot of flapping going on, more than I expected... IT reported the system back online, but the customers didn't stop flapping... Indeed, none of the estate seemed to be able to connect in... One hour, two hours, I've asked the boffins to check it time and again. "It's fine," they tell me.

I look locally: I can't see the controller machine on the network, and I can't see it through the remote management console... Where the hell is the machine?

I assure the customers I'll have answers within the hour, and I hit social media with the same; this is going very public. I'm rather annoyed, as for a whole year things have run seamlessly but been ignored; now it's offline for a scheduled reason and everyone is complaining. I do not want my success wiped away in a flood of negative press.

I call the IT boffins... "We'll look into it"... No, no, no, you'll get onto it right now. Not look, not glance: answers are needed.  Action from you is needed before my reaction goes nuclear.

I wait five minutes; I was willing to give them ten... My phone rings...

Them > "Hello?"...
Me > "Answers?"...
Them > "Yeah, you know when you brought it back?"...
Me > "The Machine?"....
Them > "Yes"...
Me > "I remember, why?"...
Them > "Well, it has power"...
Me > "Good"....
Them > "Not really"...
Me > "Why not?"...
Them > "Because that's all it has, it's not been plugged into the network"

I hung up.  They plugged it into the network, and a slew of data came through... The customers were pacified.

I however was not.

I've had an on-the-spot review: firstly, the IT bod who did this was held to account; secondly, I was held to account for not noticing.

On the not noticing, I admitted that, having had it run cleanly for a year, I had turned off the performance reports, and I admitted I had assumed a networked machine being handed to an IT bod would be plugged into the network.  People were not happy, least of all me, but that was the fallout.

However, I then had to do a tertiary clean-up, and after the Easter break I spoke to three of my main customers, trusted operators, the actual folk who should have been using the machines at the remote sites (not temporary staff), and asked them why they had not noticed.  The replies...  "Because it had worked for so long without an issue", "like, you make it work, so we just guess it always is" and "we didn't notice it was offline".

They were very much putting everything back in my court; assumption on the part of all parties was to blame.

The lessons learned for me are to keep checking, keep monitoring, and use my automation to report status, to fix faults and, if human errors creep in, to let me know.

I'm now off to spec up a service I can run on one of my own servers, just to ping the network machine which went AWOL and receive a report from it to let me know what it's up to. This might be a bit of Python or just bash on a cron task, but it's going to be something rather than nothing.
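Something along these lines, perhaps; a minimal sketch, assuming the controller answers on a known TCP port. The hostname, port and log path below are placeholders, not the real values.

    #!/usr/bin/env python3
    # Minimal "is it actually on the network?" check, meant to run from cron,
    # e.g.:  */15 * * * *  /usr/local/bin/check_controller.py

    import datetime
    import socket
    import sys

    HOST = "controller.example.internal"   # placeholder for the controller's address
    PORT = 22                              # any port the machine is known to answer on
    TIMEOUT = 5                            # seconds before we call it unreachable
    LOGFILE = "/var/log/controller_watch.log"

    def reachable(host, port, timeout):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def main():
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        up = reachable(HOST, PORT, TIMEOUT)
        with open(LOGFILE, "a") as log:
            log.write(f"{stamp} {HOST}:{PORT} {'UP' if up else 'DOWN'}\n")
        # Exit non-zero when the box is unreachable, so cron's MAILTO (or any
        # wrapper script) can do the shouting.
        sys.exit(0 if up else 1)

    if __name__ == "__main__":
        main()

I've gone with a TCP connect rather than a true ICMP ping, because raw ICMP generally needs root (or shelling out to the ping binary); a connection to a port the machine already listens on tells me the same thing, namely that it really is on the network.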

I will NOT assume again.
