Friday 4 September 2020

Great Rack Mount Mistakes #7

Today's story in the annals of IT problems comes from a guest editor... Mr B.... And Mr B (no relation to anyone in other stories given monogrammed names) works as the sysadmin and developer for his employer's whole set of systems; unfortunately this means "it's all his fault".

So what went wrong?  Well, overnight the site had a power cut, and though they have a nice server, they don't have a power backup, so that server went off.

The server is essentially a Java host, specifically hosting Tomcat, and it reaches out to connect to a set of third-party endpoints via a RESTful API.
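For the curious, the outbound side of that is nothing exotic.  A minimal sketch of the sort of RESTful call his Tomcat-hosted services make, assuming Java 11+ and a made-up endpoint URL (neither confirmed by Mr B), might look like this:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ThirdPartyClient {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        // Perform a plain GET against one of the third party's RESTful calls
        // and return the response body as a string.
        public static String call(String endpointUrl) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(endpointUrl))   // e.g. "https://api.example.com/v1/orders" (made up)
                    .header("Accept", "application/json")
                    .GET()
                    .build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        }
    }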

You'd think no big deal: start up, get running and keep running.  Except the third party don't force a disconnect upon a new edition of their interface API; if you're connected to version 1.0 they will happily leave you connected to version 1.0, even if they release interim updates, add new calls and, which is exactly what got Mr B today, remove a call or two.

Your session ending, and then I presume all remaining sessions on that old version, frees their server provisioning to de-allocate the old version.  But to force users to migrate up the chain, their published API declaration (so think the endpoint here, in whatever flavour you wish) changes, such that you re-download it upon re-connection and that's your new flavour-of-the-month API.
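In practice, then, every fresh connection just pulls down whatever declaration is currently published and builds its list of calls from that.  A sketch of that step, assuming (purely for illustration) an OpenAPI-style JSON declaration at a made-up URL and the Jackson library on the classpath:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class SpecLoader {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // On (re)connection, download the currently published declaration and pull
        // out the calls it defines.  There is no negotiation: whatever is published
        // right now is the version a fresh session gets.
        public static List<String> definedCalls(String specUrl) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(specUrl))       // e.g. "https://api.example.com/spec.json" (made up)
                    .GET()
                    .build();
            String spec = CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
            List<String> calls = new ArrayList<>();
            JsonNode paths = MAPPER.readTree(spec).path("paths");
            for (Iterator<String> names = paths.fieldNames(); names.hasNext(); ) {
                calls.add(names.next());
            }
            return calls;
        }
    }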

The problem?  It didn't work.

So Mr B had to set about debugging this on the fly, in a live environment, which was down.  And he went through the three stages of technological grief....

1) Denial:    "This is completely illogical, my code pulls down their interface definition, which is the only thing we connect to, it must be right, they can't mismatch them, so this must be my side or the gods are against me".

2) Investigation:  "Read the logs, make a change, nothing seems to work, the gods are definitely snickering behind that cloud of steam now".

3) Realisation:   "If it's not me, and it's not the system here, it must be their side, the huge multi-billion dollar international must have published their API spec with a mistake or mismatch.... click.... YOU FUCKERS!"

What was the actual problem?  Well, the third party's published API spec was actually wrong: the downloaded specification still contained several calls which had been removed.  When the services Mr B had written came up, they checked each endpoint and found several calls defined which did not respond, and so his software, correctly, reported that the endpoint was offline.  Those calls were offline; they didn't exist any more.
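As a sketch of that startup check (hypothetical class and method names, building on the spec-loading sketch above): every call listed in the downloaded declaration gets probed, and if any defined call fails to answer, the whole endpoint is reported offline.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class EndpointCheck {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        // Probe every call the downloaded spec claims to exist.  A call the spec
        // still lists but the provider has removed comes back 404 (or not at all),
        // so the endpoint as a whole is treated as offline.
        public static boolean endpointOnline(String baseUrl, List<String> callsFromSpec) {
            for (String call : callsFromSpec) {
                try {
                    URI target = URI.create(baseUrl).resolve(call);   // e.g. base + "/v1/orders"
                    HttpRequest probe = HttpRequest.newBuilder().uri(target).GET().build();
                    HttpResponse<Void> response = CLIENT.send(probe, HttpResponse.BodyHandlers.discarding());
                    if (response.statusCode() >= 400) {
                        return false;
                    }
                } catch (Exception e) {
                    return false;      // no response at all: also offline
                }
            }
            return true;
        }
    }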

His fix was to literally tell his software to ignore the multi-billion dollar international service provider's API spec and to "download" a copy which he hosted locally, with his own edits to it.
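A sketch of that workaround, with made-up property name and URLs (the real provider and hosting weren't named): the spec location is read from local configuration, so the service "downloads" a locally hosted, hand-edited copy rather than the provider's published declaration.

    public class SpecLocation {
        // Both URLs are made up for illustration.
        private static final String PUBLISHED_SPEC = "https://api.example.com/spec.json";
        private static final String LOCAL_SPEC = "http://localhost:8080/specs/corrected-spec.json";

        // If the override property is set (e.g. -Dapi.spec.override=true on the
        // Tomcat JVM), feed the spec loader the locally hosted, corrected copy
        // instead of the provider's published one.
        public static String specUrl() {
            if (Boolean.getBoolean("api.spec.override")) {
                return LOCAL_SPEC;
            }
            return PUBLISHED_SPEC;
        }
    }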

Now, he's a tiny fish in a huge pond here; even if he reports this mismatch, said multi-billion dollar international isn't going to hear him, and by the time they do it may be several months down the line and other folks may have spotted the problem already.  He may be listened to, but he essentially doubts his voice will be heard.

The problem, of course, being how to abate this issue in the future.  How to avoid this stress?  For at one point he did say "the company is done for", because literally everything was offline, all their services were down....  And of course everyone will blame the little guy doing all the IT; they won't think that the multi-billion dollar behemoth could possibly publish a wonky API spec, and most of those shouting at Mr B with mouths frothing wouldn't even know what he meant when he explained this to them...

The fact that he identified this issue, resolved it, and had everything back up within two hours won't be remembered; the glass will remain half-empty, and so all that will be remembered is that on the 3rd of September 2020 Mr B's IT suite went offline.
