Saturday, 24 April 2021

Bad Files and Smart Cards in a Project from Long Ago

I need to anonymize this code, so we'll be doing it in a pseudo C# style. One of the last tasks I had at my prior employer was to inherit the entire code base for a project I had been bitting and bobbing about in for years; I'd seen this project start, release (many times), mutate and ultimately age.

As I took control it needed replacing, which is a whole other story involving C++ and dragging people kicking and screaming into touch.

This product though was like your grandad: it sat quietly on its own, sucking a Werther's Original, waiting for a war film or Columbo to come on the telly.

The difficulty was the fault rate: between 9 and 14% of machines were off in the morning, and if a pack of updates was ever sent out (for content) then that rose to around 46%... Imagine the calls there, the service manager and his oppo having to field a 46% fault rate because of your update.  Indeed, on one occasion I remember driving to a customer's site and physically handing them a good update DVD rather than leaving them to wait.

So what was so bad?  Well, it all came down to.... Let's look at a piece of code that is seared in my memory:

FileStream file = new FileStream("C:\\SomeFile.txt", FileMode.Open, FileAccess.Read, FileShare.None);
byte[] buffer = new byte[file.Length];
int bytesRead = file.Read(buffer, 0, (int)file.Length);
file.Close();
// Do something with buffer to give us a new buffer
int newDataLength = 64;
byte[] newBuffer = new byte[buffer.Length + newDataLength];

file = new FileStream("C:\\SomeFile.txt", FileMode.OpenOrCreate, FileAccess.Write, FileShare.None);
file.Write(newBuffer, 0, newBuffer.Length);
file.Close();

This is part of an update sequence: the existing file would be opened, the new update delta calculated, and the intention was to append it onto the end of the file. This was fine for years; it worked, it got shipped. It went wrong about five years later... can you see how, maybe?

A hint is that this was a 32-bit machine.

Did you spot it?.... it's line 2...

"file.Length" returns a long, but then all the following file operations work on int.  The file started to go wrong after it was two gigabytes in size, because the range of int being 2,147,483,647 if we divide by 1024 three times we get kilobytes, then megabytes, then gigabytes and we see this is roughly 1.99 gigabytes.

But then think about that, this is a 2 GIGABYTE file being loaded into a buffer in RAM!?!?!?

It just makes a pure RAM copy of itself, then opens the file and starts to write over the original from zero to the end.

YEAH, so it's overwriting the whole original file.


It's so wrong in so many ways: the massive buffer, the overwriting of existing data already safe on disk, and the fact that this all took time too. This operation happened at a reconcile phase and it was all asynchronous, so whilst this portion of the system was doing its mental tossing about, another part of the system had already changed the screen... to say "Please Power off or Reboot".


So people did, they literally pulled the power.  So they lost their 2+ gigabytes of data, and it was exactly when these files were getting large that people were nuking them by pulling the power too!

The solution is simple: open the file for append, or just seek to the end and add the new data on.

int newDataLength = 64;
byte[] buffer = new byte[newDataLength];
// Get the new data into the buffer
FileStream file = new FileStream("C:\\SomeFile.txt", FileMode.OpenOrCreate, FileAccess.Write, FileShare.None);
file.Seek(file.Length, SeekOrigin.Begin);
file.Write(buffer, 0, buffer.Length);
file.Close();
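
For what it's worth, the same thing can also be done with FileMode.Append and a using block, which skips the manual seek entirely; this is only a sketch of that variant, not the code that actually shipped:

int newDataLength = 64;
byte[] buffer = new byte[newDataLength];
// Get the new data into the buffer

using (FileStream file = new FileStream("C:\\SomeFile.txt", FileMode.Append, FileAccess.Write, FileShare.None))
{
    // FileMode.Append creates the file if needed and positions the stream at the
    // end for us; the using block closes the handle even if the write throws.
    file.Write(buffer, 0, buffer.Length);
}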

This was only part of the problem: the functions using the data from this file took it as one whole byte array, so there was literally no way to chunk the file. I can't go into the details, but I had to break that up and start to stream the data through that system, which then let me add the resulting new delta array (which was always smaller than 2MB) to the end of the file.
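
I can't show the real functions, but the shape of that change was roughly this kind of chunked read; StreamFile, processChunk and the 64KB buffer size are illustrative stand-ins, not the real system's API:

using System;
using System.IO;

// A sketch of streaming a file through in chunks rather than loading it whole.
static void StreamFile(string path, Action<byte[], int> processChunk)
{
    byte[] chunk = new byte[64 * 1024];

    using (FileStream file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        int bytesRead;
        while ((bytesRead = file.Read(chunk, 0, chunk.Length)) > 0)
        {
            // Hand each chunk on immediately, so the file never has to fit
            // in RAM no matter how large it grows.
            processChunk(chunk, bytesRead);
        }
    }
}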

That was only one part of the system which kept me awake. Another good one, used a lot, was a pattern to also overwrite small files, mostly the JSON files which controlled the settings. Remember, the users would often turn these machines off by simply pulling the power out of the back.

Whenever it was saving a file it would basically be doing:

File.WriteAllBytes(thePath, allTheBytes);

Yep, it'd just write over the file.

My fix?  Simple: when opening the file at a time when we didn't expect the users to just pull the power - or at least when it was less common - make a backup of the file with "File.Copy(source, dest)". These destination files were numbered 1, 2, 3, and the count was configurable... so on sites where we knew they had a high fault rate we could stack up 5 or 7 backups of these files, but on machines with better hardware, or SSDs, we'd only need 3.
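
The exact numbering logic isn't worth reproducing, but in spirit it was nothing more than this; TakeSettingsBackup, maxBackups and the rotation order here are mine, for illustration:

using System.IO;

// A sketch of the numbered-backup idea; simplified from the real thing.
static void TakeSettingsBackup(string settingsPath, int maxBackups)
{
    // Shuffle the existing backups up a slot: .2 becomes .3, .1 becomes .2,
    // and the oldest slot simply gets overwritten.
    for (int i = maxBackups - 1; i >= 1; i--)
    {
        string older = settingsPath + "." + i;
        if (File.Exists(older))
            File.Copy(older, settingsPath + "." + (i + 1), true);
    }

    // Slot .1 is always the most recent known-good copy of the settings file.
    if (File.Exists(settingsPath))
        File.Copy(settingsPath, settingsPath + ".1", true);
}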

I don't even think the service manager knew about this "fix".

But armed with these backups we could then leave the original code alone (which was quite convoluted and I didn't want to fix, to be honest). On the next load, if opening the file failed, I'd have it nuke the backup it had just taken and then use the last, best-aged backup. And if there were now more backups than we should have, we'd delete the oldest.
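
The recovery side, again only a sketch of the idea rather than the original loader, looked something like this; LoadSettings and TryParseSettings are hypothetical stand-ins for the real loading and validation code:

using System.IO;

// A sketch of load-with-fallback across the numbered backups.
static byte[] LoadSettings(string settingsPath, int maxBackups)
{
    // Try the live file first, then each backup from newest (.1) to oldest.
    for (int i = 0; i <= maxBackups; i++)
    {
        string candidate = (i == 0) ? settingsPath : settingsPath + "." + i;
        if (!File.Exists(candidate))
            continue;

        byte[] data = File.ReadAllBytes(candidate);
        if (TryParseSettings(data))
            return data;                 // first readable, parseable copy wins

        // A corrupt copy is no use to anyone; drop it so it isn't tried again.
        File.Delete(candidate);
    }

    return null;                         // nothing usable left; fall back to defaults
}

// Stand-in for whatever validation the real settings loader did.
static bool TryParseSettings(byte[] data) => data != null && data.Length > 0;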

Settings didn't change very often, but this did let us solve this issue.

The final worst piece of this system was the licensing system, which used a USB-connected smart card reader and a custom decrementing secure card format to license the machine time. This was fine for years: it used a nice Gemalto reader and cards, and all was well in testing.

The machine tested the card once every five minutes whilst in operation, so no big deal. When in service mode it checked the card every 10 seconds to update the license level display, but service mode was never intended to be left open for more than a few minutes.... So what happened?

Yeah, a customer opened a machine and left it open for a week.... And their machine went out of operation. When we got this particular machine back I just opened the door, took the card out and pointed to the literally charred, burned back of the smart card chip... It was a white plastic card, and the back was deformed and light brown... I did chuckle; it sucked for the customer, but we never worked out why they had the door open in service mode for so long; they weren't meant to.

But worse than that isolated incident was a new tranche of machines released in 2015, which suddenly all had faults: there were machines out of order, machines not allowing play, machines rebooting... Nothing seemed to clear them, and some were reporting "Out of Licensing", despite people having paid for brand new cards.

They were issued a new card... The old cards came back and were reworked... so, at random, once-working sites got either a new card or a reconditioned card from any other random site.

New machines had a new brand of card reader, old machines had the Gemalto. New cards were all this new brand of card, and the old cards were the white Gemalto ones... this mix just went on... and soon we had a rising fault rate.

The diagnostic view was at first a little mixed: sometimes a new reader was fine, sometimes a new reader was bad... all customers reported "my new card"; they had no idea that the brand had changed under the hood... and in fact nor did I.

You see, to save a few pence per card (12p per card to be precise) they hadn't gone with the grand 34p Gemalto cards, they'd gone with 22p Chinese copies... Inferior copies, as it turned out: they had around 1/8th the lifespan, so over time ALL these "new" cards failed.

But then, in the Gemalto reader they were all fine... So the new reader?... Oh, that was ALSO a cheap Chinese knock-off, and these things had strange problems. I suspected they were sometimes putting the full 5V of the USB supply through the cards (rated at 3V), killing them. And I was proven right.


This unholy quartet of products caused havoc, but I eventually found that new readers could kill either new or old cards, so they had to be recalled... Then new cards could die randomly even in old, reliable readers, so they had to be recalled too. Which meant we slowly struggled to find old readers and old cards.

All of this was a purchasing foul-up; unfortunately, managers saw it as an engineering problem, and so one had to code around poor hardware.

The first thing we did was add two toggles, one of which was for "old card", which I could detect from the card chip type being read on reader access. This slowed the reading of the card down... from every 5 minutes to every 30 minutes, so we risked giving customers longer before an unlicensed machine went out of action, but it was accepted to give us a much longer read life for the card cell.

Then we deferred the first read of the card: on boot-up we literally leave the USB device completely alone, let Windows start and everything settle on the desktop-driven system, and only after 5 minutes do we start our licensing check. It was accepted that a user could technically receive 4m59s of unlicensed use and then reboot to get more time, but that would be a little impractical in this usage scenario.
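
Reduced to its bones, the polling change was no more than this kind of timer setup; LicenceMonitor and CheckCard are illustrative names, not the product's real classes:

using System;
using System.Threading;

// A sketch of the deferred, slowed-down licence polling.
class LicenceMonitor
{
    private Timer _timer;

    // pollInterval was chosen from the card type detected on reader access:
    // every 5 minutes originally, every 30 minutes where card wear was the worry.
    public void Start(TimeSpan pollInterval)
    {
        // Leave the USB reader completely alone for the first 5 minutes so
        // Windows and the desktop-driven system can settle before we touch it.
        TimeSpan firstCheck = TimeSpan.FromMinutes(5);

        _timer = new Timer(_ => CheckCard(), null, firstCheck, pollInterval);
    }

    private void CheckCard()
    {
        // Talk to the smart card reader here and update the licence state.
    }
}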

Doing these two things we could just about use the new readers...

But the new cards were just so utterly terrible that we did eventually have to buy better cards. I never heard whether there was a refund on the originals, but I can assure you my time alone cost more than the £120 they saved going with these cheap cards.