Monday 26 February 2024

Tech Tribulations #1 : Smartcard Release Drama

It has been a very long time since a story time, so I thought I'd go over one about a software system I wrote from the ground up to secure the service to a machine. I worked for a company which sold whole machines to the customer (or leased them), and for as long as the buyer had the machines they would run.

In late 2014 the higher management realized this was an untapped revenue stream and, much to the annoyance of the customers, it was decided that a system update would go out, which the customer had to take to get any new content. With this update they would also have to have a smart card reader installed, and a card inserted which would count down time until it ran out.

Metering essentially, but "metering" had a whole other meaning for this system already, so it was just called the "Smartcard" system.

Really it was a subsystem, bolted into the main runtime as a thread check, which would wake at intervals, query if there was a card reader on the USB at all, and check it was the exact brand of card reader (because we wanted to stop the customer just being able to put any reader in, they had to buy our pack).

And then it would query the card and deduct credit/count down if we were beyond a day.

We tried a bunch of time spans, hours, minutes etc, but it was decided the deduction would happen after we had accumulated 24 hours of on time. Every 5 minutes an encrypted file on the disk would be marked, and after 24 hours' worth of accumulations the deduct would happen.
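As a rough sketch of that accumulate-then-deduct logic (this is not the original code; the names `Meter`, `tick` and `TICKS_PER_DAY` are made up for illustration, a plain counter stands in for the encrypted file, and an integer stands in for the credit on the card):

```python
from dataclasses import dataclass

TICK_MINUTES = 5                           # one mark every 5 minutes, per the story
TICKS_PER_DAY = 24 * 60 // TICK_MINUTES    # 288 marks make up 24 hours of on time


@dataclass
class Meter:
    """Hypothetical stand-in for the smartcard subsystem's bookkeeping."""
    ticks: int = 0          # assumption: plays the role of the encrypted file on disk
    credit_days: int = 0    # assumption: plays the role of the balance on the card

    def tick(self) -> bool:
        """Record one 5-minute mark.

        Returns True when 24 hours of on time have accumulated and a
        day of credit has been deducted, otherwise False.
        """
        self.ticks += 1
        if self.ticks < TICKS_PER_DAY:
            return False
        # A full day has accumulated: reset the accumulator and deduct.
        self.ticks -= TICKS_PER_DAY
        self.credit_days = max(0, self.credit_days - 1)
        return True
```

Calling `tick()` 287 times on a `Meter(credit_days=30)` leaves the credit alone; the 288th call deducts a day. What the real system did when the credit hit zero is not modelled here.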

We tested this for months, absolutely months and to be honest we thought it was really robust.

Until it actually went into the customers' hands. We suddenly had a slew of calls and returns, folks unhappy that they were inserting the card, "testing their machines" and suddenly all the credit was gone, and they were asking for a new card all the time.

At first we simply could not explain this anomaly. We had the written information about the service calls and replicated what the folks were saying, and it all checked out fine: we got our increments, we could inspect the encrypted file and see we were accumulating normally and deducting normally.

I worked on this for days on end, we had to test for real, we did all sorts of things, power drop tests, pulling the card tests, all sorts.

The machine checked out, end of.

What had we missed?  What are the customers doing?  Testing the machine, okay, what are they testing?  The content, how does the content work for this update?  Well, seems that the customers didn't trust their testers or engineers, so what was happening was instead of testing for real they were doing what we called "Open door testing".

You see, when you close the door you accumulate and deduct, the machine is in operation normally as any user their end would have it operate....

Door open mode, however, was intended to be used by service engineers when the machine was deployed; it is still in operation, the machine is in the field, but the door is briefly open to check things.

But these customers didn't trust their engineers in their warehouse, so they were not giving them credit to check the machine properly, they therefore tested in open mode... for days....

They accumulated massive operation debt with the machines in door open mode for days.

The moment they turned them off, happy they were working, and shipped them to sites, they'd arrive on site, be turned on, finally in proper door closed operation after so long, and they'd instantly deduct the massive debt the warehouse team had accrued.

This was intentional.... But their use of the door open mode was an abuse, and one we had not even thought about.  We didn't even clock how long a machine sat in door open or door closed mode. Worse still, in door open mode the test features on the machine ran at an accelerated update rate, we ticked over 10x faster to allow faster testing... The result was that in just 3 days of warehouse door open mode testing they could accrue 30 days of operational debt.
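To put numbers on it, here is a back-of-envelope sketch. It assumes the debt is simply the days spent in door open mode multiplied by the 10x rate, which is what the "3 days gives 30 days" figure implies; the function name `settle_on_door_close` and the idea of settling everything in one go are illustrative, not the real code:

```python
DOOR_OPEN_ACCELERATION = 10  # door open mode ticked over 10x faster, per the story


def settle_on_door_close(credit_days: int,
                         door_open_days: int,
                         acceleration: int = DOOR_OPEN_ACCELERATION) -> tuple[int, int]:
    """Return (debt_days, remaining_credit_days) once the door is finally closed.

    Assumption: debt accrues linearly at `acceleration` days of operation
    per real day spent in door open mode, and is deducted all at once.
    """
    debt_days = door_open_days * acceleration
    remaining = max(0, credit_days - debt_days)
    return debt_days, remaining
```

So `settle_on_door_close(30, 3)` gives a debt of 30 days and nothing left on a 30 day card, which matches what the customers saw: card in, credit gone.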

That was a fault, one I could tackle with the team.  But changing the user habit of leaving the door open was harder...

We had to work with the users and their patterns. We suspended the system for a short while and issued a new update, but the first customer taste of this "pay as you go" approach was a sour one.

Then things got bad....

Yes, you might think they were already bad, but they got worse.

A month later, all the above was resolved, and we thought things were settling down... Until suddenly all hell broke loose.

EVERY MACHINE WAS LOCKED.

There were dozens of reports of them just not working; they had done their daily reboot and all of them reported a security fail on the smartcard....

All hands on deck, are our machines in the test pool doing the same?  Nope.

Is there something special going on?  Have clocks changed, is it a leap year, has the sky fallen on Chicken Little?

We honestly had no idea, there was no repeat in any of our test pool, no repeat on our personal engineering rigs, there was essentially no reason for this failure.

The only answer in such a situation is to observe or return one of the machines exhibiting the problem.

A lorry was sent and a machine brought back, under the explicit instruction not to open it nor change it, and the customer was not to keep their smartcard (it was theirs, but we would credit them a whole new card for the inconvenience).

After several hours spent staring at the code, and running checks by lowering cards so they would expire, or pulling the reader out and inserting it again, we had no answer.

Before that machine arrives back with us, however, let's just think about the "smartcards" we use in our daily lives. Our bank cards go into a machine, we enter our PIN and we remove them again.  Then how about cards like the one for your gas meter, or when you go see your GP: they have a card they insert into a machine and it stays there all the time to validate they are the GP, or to keep your meter in operation. If you have Sky TV and a viewing card, same thing, it is always in the device.

These machines are the latter kind.... Those cards are rated to have power on to them for long periods of time, as a consequence they cost more money than a card you only insert transiently...

And this company I worked for had very canny buyers, too canny.  Because they spotted a smartcard which used the same protocol... but was significantly less money to buy!

The difference?  You guessed it, it was the transient use variant.

The broken machine arrived, we powered it on, fail.  We open the door, remove the smartcard and sure enough on the rear of the plastic behind the chip the plastic is brown, burned.

The card cannot be electrically trusted!

We highlight this and send it back to the buying department, they fouled up, they changed the hardware after we certified it, essentially sending an uncertified machine out.

A huge issue ensued about this, as it wasn't well understood that we had been provided and advised one card type for the update set. Of course the buyers would not accept it wasn't the same card, until we literally had the specifications of the cards side by side and could see a digit of difference in the part number. We looked up the datasheet, where it clearly said that the transient card was only rated to remain in a machine for 10 minutes.  More than enough for an ATM.  But a "security gating" card, as we wanted, is rated to be inserted continually for 36 months.