Monday, 26 February 2024

Tech Tribulations #1: Smartcard Release Drama

It has been a very long time since story time, so I thought I'd go over one about a software system I wrote from the ground up to secure the service of a machine. I worked for a company which sold whole machines to customers (or leased them); for as long as the buyer had the machines, they would run.

In late 2014 higher management realized this was an untapped revenue stream and, much to the annoyance of the customers, it was decided that a system update would go out which the customer had to take to get any new content. As part of this update they would also have to have a smartcard reader installed and a card inserted, which would count down time until it ran out.

Metering, essentially, but "metering" already had a whole other meaning for this system, so it was just called the "Smartcard" system.

Really it was a subsystem, bolted into the main runtime as a thread, which would wake at intervals, query whether there was a card reader on the USB at all, and check it was the exact brand of card reader we supplied (because we wanted to stop the customer from just putting any reader in; they had to buy our pack).

And then it would query the card and deduct credit/count down if we were beyond a day.

We tried a bunch of time spans, hours, minutes and so on, but in the end we settled on deducting after we accumulated 24 hours of on time: every 5 minutes an encrypted file on the disk would be marked, and once 24 hours' worth of accumulations had built up, the deduction would happen.
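For the curious, the shape of that loop was roughly the following. This is only a sketch from memory, with hypothetical CardReader and Tally stand-ins; the real system wrote an encrypted file and talked to one specific USB reader, neither of which is reproduced here.

    # Rough sketch of the accumulate-then-deduct loop (hypothetical names).
    import time
    from dataclasses import dataclass

    TICK_SECONDS = 5 * 60      # mark the tally every 5 minutes
    TICKS_PER_DAY = 24 * 12    # 24 hours of on time = 288 five-minute marks

    @dataclass
    class Tally:
        ticks: int = 0

        def save(self) -> None:
            # Stand-in for writing the encrypted accumulation file to disk.
            pass

    class CardReader:
        def is_trusted(self) -> bool:
            # Stand-in for "reader is present on USB and is our approved brand".
            return True

        def deduct_one_day(self) -> None:
            # Stand-in for decrementing the credit counter held on the card.
            pass

    def metering_loop(reader: CardReader, tally: Tally) -> None:
        while True:
            time.sleep(TICK_SECONDS)
            if not reader.is_trusted():
                continue                  # never count time against an unknown reader
            tally.ticks += 1
            tally.save()                  # persist so a power cut can't skip time
            if tally.ticks >= TICKS_PER_DAY:
                reader.deduct_one_day()   # burn one day of credit from the card
                tally.ticks -= TICKS_PER_DAY
                tally.save()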

We tested this for months, absolutely months, and to be honest we thought it was really robust.

Until it actually went into the customers' hands; we suddenly had a slew of calls and returns, folks unhappy that they were inserting the card, "testing their machines", and suddenly all the credit was gone, and they were asking for a new card all the time.

At first we simply could not explain this anomaly. We had the written information about the service calls, replicated what the folks were saying, and it all checked out fine: we got our increments, and we could inspect the encrypted file and see we were accumulating and deducting normally.

I worked on this for days on end; we had to test for real, so we did all sorts of things: power drop tests, card-pull tests, the lot.

The machine checked out, end of.

What had we missed?  What were the customers doing?  Testing the machine, okay, what were they testing?  The content; how does the content work for this update?  Well, it seems the customers didn't trust their testers or engineers, so instead of testing for real they were doing what we called "open door testing".

You see, when you close the door you accumulate and deduct; the machine is operating normally, just as any user at their end would have it operate....

Door open mode, however, was intended to be used by service engineers when the machine was deployed; the machine is still in operation, out in the field, but the door is briefly open to check things.

But these customers didn't trust their engineers in their warehouse, so they were not giving them credit to check the machines properly; they therefore tested in door open mode... for days....

They accumulated massive operational debt with the machines in door open mode for days.

The moment they turned them off, happy they were working, and shipped them to sites, they'd arrive on site, finally be turned on in proper door closed operation after so long, and instantly deduct the massive debt the warehouse team had accrued.

That deduction was intentional.... But their use of door open mode was an abuse, and one we had not even thought about.  We didn't even clock how long a machine sat in door open or door closed mode; worse still, when in door open mode testing things, the machine ran at an accelerated update rate, ticking over 10x faster to allow quicker testing... The result was that in just 3 days of warehouse door open testing they could accrue 30 days of operational debt.
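To put numbers on that (the 10x factor is the only figure taken from the real system; the little helper below is just my illustration):

    ACCELERATION = 10   # door open mode ticked over 10x faster for quicker testing

    def operational_debt_days(days_door_open: float) -> float:
        """Days of deductible 'on time' accrued while testing with the door open."""
        return days_door_open * ACCELERATION

    print(operational_debt_days(3))   # 3 days of warehouse testing comes out as 30 days of debt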

That was a fault, one I could tackle with the team.  But changing the user habit of leaving the door open was harder...

We had to work with the users and their patterns; we suspended the system for a short while and issued a new update, but the customers' first taste of this "pay as you go" approach was a sour one.

Then things got bad....

Yes, you might think they were already bad, but they got worse.

A month later, all the above was resolved, and we thought things were settling down... Until suddenly all hell broke loose.

EVERY MACHINE WAS LOCKED.

There were dozens of reports of machines just not working; they had done their daily reboot and all of them reported a security fail on the smartcard....

All hands on deck, are our machines in the test pool doing the same?  Nope.

Is there something special going on?  Have clocks changed, is it a leap year, has the sky fallen on Chicken Little?

We honestly had no idea; there was no repeat in any of our test pool, no repeat on our personal engineering rigs, and essentially no reason for this failure.

The only answer in such a situation is to observe or return one of the machines exhibiting the problem.

A lorry was sent and a machine brought back, under the explicit instruction not to open it nor change it, and the customer was not to keep their smartcard (it was theirs, but we would credit them a whole new card for the inconvenience).

Several hours were spent staring at the code and running checks, lowering cards so they would expire, or pulling the reader out and inserting it again; we had no answer.

Before that arrived back with us, however, let's just think about the "smartcards" we use in our daily lives. Our bank cards go into a machine, we enter our PIN, and we remove them again.  Then there are cards like your gas meter card, or the one your GP has, inserted into a machine where it stays all the time to validate they are the GP or to keep your meter in operation; if you have Sky TV and a viewing card, same thing, it is always in the device.

These machines were the latter kind.... Those cards are rated to have power applied to them for long periods of time, and as a consequence they cost more than a card you only insert transiently...

And this company I worked for had very canny buyers, too canny.  Because they spotted a smartcard which used the same protocol... but cost significantly less to buy!

The difference?  You guessed it, it was the transient use variant.

The broken machine arrived; we powered it on: fail.  We opened the door, removed the smartcard, and sure enough, on the rear of the plastic behind the chip, the plastic was brown, burned.

The card could not be electrically trusted!

We highlighted this and sent it back to the buying department: they had fouled up. They had changed the hardware after we certified it, essentially sending an uncertified machine out.

A huge issue ensued about this, as it wasn't well understood that we had specified, and been provided with, one particular card type for the update set. The buyers would not accept it wasn't the same card until we literally had the specifications side by side: a single digit difference in the part number, and the datasheet clearly stating that the transient card was only rated to remain in a machine for 10 minutes.  More than enough for an ATM.  But a "security gating" card, as we wanted, is rated to be inserted continually for 36 months.

Thursday, 20 April 2017

Sys-Admin/Dev Ops : Assumption is Danger

As a systems admin, or dev ops, or whatever your job title might be, never ever assume that the person you're handing a system to has a clue.  This might seem harsh, but it's true, and proves itself true time and time again.

"Assumption is the mother of all f**k ups"

About a year ago I deployed a system which automatically sent requests to remote machines (via SMS), getting those machines to report their status or send back error information, and also to gather some basic information.

It has run happily for a whole year; it has all been pretty plain sailing. The hours and hours of work I put into automating it and keeping it self-sustained have paid off: zero faults, zero downtime. Self-regulation is the way forward for me; even if it took slightly longer to put the system in place, it has needed no human input for nearly a year!

However, about a week ago the unit needed to move; it had to be physically picked up and taken out of my small server room and into the official server room, basically a dark cupboard controlled not by myself or my cohort, but by the IT boffins.

Fine. I notified the customers, went off to the IT area, sorted out who I was to hand it to and physically delivered it to the chap. I watched him start to plug it all back together: power, wires, boot, fine....

I assumed he'd do this seamlessly....

Until this morning, well, a morning last week, as I post these with a date in the future.  That morning was hell: I walked into a wall of customers not being able to get to their machines. The Easter weekend was looming, performance needed to be monitored, customer sites didn't have regular staff, and explaining to temporary cover staff that the system would be off was not a prospect I relished.

To be frank, there was a lot of flapping going on, more than I expected... IT reported the system back online, but the customers didn't stop flapping... Indeed, none of the estate seemed able to connect in... One hour, two hours, I've asked the boffins to check it time and again: "It's fine", they tell me.

I look locally, I can't see the controller machine on the network, I can't see it through the remote management console... Where the hell is the machine?

I assure the customers I'll have answers within the hour, and I hit social media with the same; this is going very public. I'm rather annoyed, as for a whole year things have run seamlessly but been ignored; now it's offline for a scheduled purpose and everyone is complaining. I do not want my success wiped away in a flood of negative press.

I call the IT boffins... "we'll look into it"... No, no no, you'll get onto it right now, not look, not glance, answers are needed.  Action from you is needed before my Re-Action goes nuclear.

I wait, five minutes, I was willing to give them ten.... My phone rings...

Them > "Hello?"...
Me > "Answers?"...
Them > "Yeah, you know when you brought it back?"...
Me > "The Machine?"....
Them > "Yes"...
Me > "I remember, why?"...
Them > "Well, it has power"...
Me > "Good"....
Them > "Not really"...
Me > "Why not?"...
Them > "Because that's all it has, it's not been plugged into the network"

I hung up.  They plugged it into the network, and a slew of data came through... The customers were pacified.

I however was not.

There was an on-the-spot review: firstly, the IT bod who did this was held to account; secondly, I was held to account for not noticing.

On not noticing: I admit that, having had it run cleanly for a year, I had turned off the performance reports, and I admitted I had assumed a network machine being handed to an IT bod would be plugged into the network.  People were not happy, least of all me, but that was the fallout.

However, I then had to do a tertiary clean-up, and after the Easter break I spoke to three of my main customers, trusted operators, the actual folk who should have been using the machines at the remote sites, not temporary staff, and asked them why they had not noticed.  The replies...  "Because it had worked for so long without an issue", "like, you make it work, so we just guess it always is" and "we didn't notice it was offline".

They were very much putting everything in my court; assumption on the part of all parties was to blame.

The lessons learned for me are to keep checking, keep monitoring, and to use my automation to report status, to fix faults, and, if human errors creep in, to let me know.

I'm now off to spec up a service I can run on one of my own servers, just to ping the network machine which went AWOL and receive a report from it to let me know what it's up to; this might be a bit of Python or just bash on a cron task, but it's going to be something rather than nothing.
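Something along these lines is what I have in mind; a minimal Python sketch, assuming a plain ping check and a local mail relay for the alert. The host name and addresses are placeholders, not the real estate details.

    # heartbeat.py - hypothetical cron-driven check on the machine that went AWOL.
    import smtplib
    import subprocess
    from email.message import EmailMessage

    TARGET_HOST = "controller.example.internal"   # placeholder for the real machine
    ALERT_TO = "me@example.com"                    # placeholder alert address

    def host_is_up(host: str) -> bool:
        """Return True if a single ping gets a reply within 5 seconds."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "5", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    def send_alert(host: str) -> None:
        # Assumes a mail relay on localhost; swap for whatever alerting you trust.
        msg = EmailMessage()
        msg["Subject"] = f"Heartbeat FAILED for {host}"
        msg["From"] = "monitor@example.com"
        msg["To"] = ALERT_TO
        msg.set_content(f"{host} did not answer a ping; go and shout at someone.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        # Run from cron, e.g. every 15 minutes:
        #   */15 * * * * /usr/bin/python3 /opt/monitor/heartbeat.py
        if not host_is_up(TARGET_HOST):
            send_alert(TARGET_HOST)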

I will NOT assume again.