Monday 4 March 2024

Tech Tribulations #2: There's a New System in Town


Time for another story from my tech history. This time I point my torch of memory at one of the first larger systems I worked on from start to finish (nearly 10 years).

I think I can say the product name of this particular system: it was called Phoenix, and it wasn't a bad idea. To be honest, as with nearly every project I've ever worked on, we needed more documentation and design; those two continuously neglected facets of software engineering. But other than that, the initial design stood up.

Before we dive into that though, we need to understand the reason Phoenix was thrown onto a whiteboard and born.  What was the direct predecessor?

Yes, there was one; this was not a blue-sky-thinking system. Phoenix was to directly replace a system which had itself been engineered over the prior two years and had not really lived up to expectations.

I shan't name that system, but anyone who knows me and knows Phoenix may know the code name it was given. It was a hybrid of two languages: a C wrapper around the physical hardware, bound with JNI to a Java core, which itself used the ODBC driver to talk to a Marathon database backend.

The idea was that the transactional nature of the database allowed for the representation of perfect state; on any error it was simply a case of rolling the transaction off, and so we had RORI behaviour in a very complex system.
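
For anyone who hasn't poked at ODBC, the shape of that roll-the-transaction-off-on-error idea looks roughly like this against the raw ODBC C API. This is just a minimal sketch of the pattern, not the real code: the real thing sat behind JNI and a Java core, and the function and parameter names here are mine.

    /* Minimal sketch of rollback-on-error using the raw ODBC C API.
     * Illustrative only; the real system drove the database from Java via JNI. */
    #include <sql.h>
    #include <sqlext.h>

    /* Apply one state change inside a transaction; roll back on any failure. */
    int apply_event(SQLHDBC dbc, const char *update_sql)
    {
        SQLHSTMT stmt = SQL_NULL_HSTMT;

        /* Run everything in an explicit transaction. */
        SQLSetConnectAttr(dbc, SQL_ATTR_AUTOCOMMIT,
                          (SQLPOINTER)SQL_AUTOCOMMIT_OFF, 0);

        if (!SQL_SUCCEEDED(SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt)))
            goto fail;
        if (!SQL_SUCCEEDED(SQLExecDirect(stmt, (SQLCHAR *)update_sql, SQL_NTS)))
            goto fail;

        SQLFreeHandle(SQL_HANDLE_STMT, stmt);
        SQLEndTran(SQL_HANDLE_DBC, dbc, SQL_COMMIT);
        return 0;

    fail:
        if (stmt != SQL_NULL_HSTMT)
            SQLFreeHandle(SQL_HANDLE_STMT, stmt);
        /* Any error: roll the whole transaction off; the stored state is untouched. */
        SQLEndTran(SQL_HANDLE_DBC, dbc, SQL_ROLLBACK);
        return -1;
    }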

The problem?

It was SSSSLLLLOOOOWWWWWW.

The host machine was (and I think I'm right here) a Pentium II class machine (it may have been a Pentium III later) with 16MB of RAM (and I believe it went up to 64MB by the end). I am talking about early 2005 here; we were working with legacy machines in the field which were being refitted from a system itself all written in C but not well understood.

This new modular Java, database, transactional system was envisioned to be the future!

The problem?  Still it was SLLOOOOOOWWWWWW......

I've said that twice now, so for effect let's skip forward to about 2006. Two years in and this database Java monstrosity is not doing well: it doesn't perform well, the team has literally stopped even trying to work in an agile scrum manner, and the lead developer didn't take well to critique, amongst other issues (like it's slow; have I said that three times now?)

Happily, however, there were two separate development teams: one on the old, old system and a whole other on the Java junk. So it was as part of the C group that I was called into a meeting with my manager, the manager of the whole department.

He had a punch bag doll in front of a whiteboard and asked me to close the door. Furtively expecting my P45 (to be fired), I was instead asked to take a look at the board... he had asked two other engineers to take a look through the day, so he was canvassing for opinions.

In the dry-wipe ink was a block diagram of eight boxes: a hardware module to talk to the hardware, a module to do the accounting, a module to host the menu, a module to talk to the content program, and so on; one for logging, one for sending reports back to HQ... All simple systems.

Each would talk to the others with a message-passing mechanism, drawn as a line between each of these eight boxes. Little did I know then that this message-passing system would be the biggest part of my debugging life for the next ten years; that, however, is another story.
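
The whiteboard obviously didn't specify the wire format, but to give a flavour of what those lines between the boxes turn into in code, here is a toy, in-process version of a module message bus. Every name and the ring-buffer transport are my own invention for illustration; Phoenix's real mechanism was a good deal more involved (hence the ten years of debugging).

    /* Toy message-passing layer between modules, purely to show the shape
     * of the design; none of these names come from the real Phoenix code. */
    #include <stdio.h>
    #include <string.h>

    enum module_id { MOD_HARDWARE, MOD_ACCOUNTS, MOD_MENU, MOD_CONTENT,
                     MOD_LOGGER, MOD_REPORTS, MOD_COUNT };

    struct message {
        enum module_id from;
        enum module_id to;
        int            type;          /* application-defined message type */
        char           payload[64];   /* small fixed payload for the sketch */
    };

    /* A fixed-size ring buffer standing in for the real transport. */
    static struct message queue[32];
    static int head, tail;

    static int post_message(const struct message *m)
    {
        int next = (tail + 1) % 32;
        if (next == head)
            return -1;                /* queue full; caller must retry */
        queue[tail] = *m;
        tail = next;
        return 0;
    }

    /* Each module registers a handler; the dispatcher routes by 'to'. */
    typedef void (*handler_fn)(const struct message *m);
    static handler_fn handlers[MOD_COUNT];

    static void dispatch_pending(void)
    {
        while (head != tail) {
            const struct message *m = &queue[head];
            if (handlers[m->to])
                handlers[m->to](m);
            head = (head + 1) % 32;
        }
    }

    /* Example: the menu module asking the hardware module to run a lamp test. */
    static void hardware_handler(const struct message *m)
    {
        printf("hardware: type %d from module %d: %s\n",
               m->type, m->from, m->payload);
    }

    int main(void)
    {
        handlers[MOD_HARDWARE] = hardware_handler;

        struct message m = { MOD_MENU, MOD_HARDWARE, 1, "" };
        strcpy(m.payload, "lamp-test");
        post_message(&m);
        dispatch_pending();
        return 0;
    }

The appeal on the whiteboard was exactly this: each box only needs to know the message format, not the internals of any other box.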

Right there and then this new system seemed like a much better proposition.

The manager liked the feedback from the various folks he had talked to, and it was decided upon. He would work on the platform: starting up the system, securing it with a manifest, and preventing unauthorized execution of random code and content.

I would be working on the main attract module; this is the main menu from which content is categorized and launched.

The content would talk to the Game Interface module, which was owned by a guy called Darren.

The accounting module was handled by the previous lead on the Java system, and later a guy called Conrad. Conrad, at first though, would be doing some other systems.

And there was a smattering of other work, like a logger module, a build stack and all sorts.

Work began, but not before the lead on the then-current system had to be told her baby was being canned. She and her main ally were duly summoned, shown the block diagram and where they would be working. She did not take it well.

As she left the manager's office, he was heard to exclaim, and his stapler flew across the space and nailed a window barrier, sticking into it with a vibrating twang.

She loved her system and she did not like Phoenix, not least as its name was a direct reference to something rising from the ashes, her system being declared to have self-combusted in failure.

Monday 26 February 2024

Tech Tribulations #1 : Smartcard Release Drama

It has been a very long time since a story time, so I thought I'd go over one about a software system I wrote from the ground up to secure the service to a machine. I worked for a company which sold whole machines to the customer (or leased them), and for as long as the buyer had the machines they would run.

In late 2014 the higher management realized this was an untapped revenue stream and, much to the annoyance of the customers, it was decided that a system update would go out, which the customer had to take to get any new content. With this update they would also have to have a smartcard reader installed and a card inserted, which would count down time until it ran out.

Metering essentially, but "metering" had a whole other meaning for this system already, so it was just called the "Smartcard" system.

Really it was a subsystem, bolted into the main runtime as a checking thread, which would wake at intervals, query whether there was a card reader on the USB at all, and check it was the exact brand of card reader (because we wanted to stop the customer just being able to put any old reader in; they had to buy our pack).

And then it would query the card and deduct credit/count down if we were beyond a day.

We tried a bunch of time spans (hours, minutes, etc.), but deducting was decided to happen after we had accumulated 24 hours of on time: every 5 minutes an encrypted file on the disk would be marked, and after 24 hours' worth of accumulations the deduction would happen.
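
Stripped right down, the checking thread boiled down to something like the sketch below, assuming the 5-minute mark and the 24-hour deduct window described above; the reader, card and encrypted-file calls are stand-in stubs of my own, not the product's actual functions.

    /* Bare-bones sketch of the smartcard checking thread: wake every 5 minutes,
     * record an accumulation mark, and deduct one unit of credit once 24 hours'
     * worth of marks have built up.  All the hardware/crypto calls are stand-ins. */
    #include <stdbool.h>
    #include <unistd.h>

    #define TICK_SECONDS     (5 * 60)            /* wake every 5 minutes */
    #define TICKS_PER_DEDUCT ((24 * 60) / 5)     /* 24 hours of 5-minute marks */

    /* Stand-ins for the real reader check, card I/O and encrypted on-disk file. */
    static bool reader_present_and_trusted(void) { return true; }
    static bool card_deduct_one_credit(void)     { return true; }
    static int  stored_ticks = 0;
    static int  load_encrypted_tick_count(void)  { return stored_ticks; }
    static void store_encrypted_tick_count(int t){ stored_ticks = t; }

    void smartcard_thread(void)
    {
        int ticks = load_encrypted_tick_count();

        for (;;) {
            sleep(TICK_SECONDS);

            /* Exact-brand USB reader check; no trusted reader, no accumulation. */
            if (!reader_present_and_trusted())
                continue;

            ticks++;
            store_encrypted_tick_count(ticks);

            /* Work off however many whole days of on time have accumulated. */
            while (ticks >= TICKS_PER_DEDUCT) {
                if (!card_deduct_one_credit())
                    break;               /* deduct failed: keep the debt, retry later */
                ticks -= TICKS_PER_DEDUCT;
                store_encrypted_tick_count(ticks);
            }
        }
    }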

We tested this for months, absolutely months, and to be honest we thought it was really robust.

Until it actually went into the customers' hands. We suddenly had a slew of calls and returns: folks unhappy that they were inserting the card, "testing their machines", and suddenly all the credit was gone, and they were asking for new cards all the time.

At first we simply could not explain this anomaly. We had the written information from the service calls and replicated what the folks were saying; it all checked out fine, we got our increments, and we could inspect the encrypted file and see we were accumulating normally and deducting normally.

I worked on this for days on end; we had to test for real, and we did all sorts of things: power-drop tests, pulling-the-card tests, all sorts.

The machine checked out, end of.

What had we missed? What are the customers doing? Testing the machine, okay, so what are they testing? The content; how does the content work for this update? Well, it seems the customers didn't trust their testers or engineers, so what was happening was that instead of testing for real they were doing what we called "Open door testing".

You see, when you close the door you accumulate and deduct; the machine is in operation normally, as any user at their end would have it operate....

Door-open mode, however, was intended to be used by service engineers when the machine was deployed; so it is still in operation, the machine is in the field, but the door is briefly open to check things.

But these customers didn't trust their engineers in their warehouse, so they were not giving them credit to check the machines properly; they therefore tested in door-open mode... for days....

They accumulated massive operational debt with the machines in door-open mode for days.

The moment they turned them off, happy they were working, and shipped them to sites, they'd arrive on site, finally be turned on in proper door-closed operation after so long, and instantly deduct the massive debt the warehouse team had accrued.

This was intentional.... But their use of door-open mode was an abuse, and one we had not even thought about. We didn't even clock how long a machine sat in door-open or door-closed mode; worse still, when in door-open mode and testing things, the machine ran at an accelerated update rate, ticking over 10x faster to allow faster testing... The result was that in just 3 days of warehouse door-open testing they could accrue 30 days of operational debt.
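
The arithmetic is easy to check. A back-of-the-envelope version, where the 10x door-open speed-up and the 5-minute mark come from the story and everything else is just illustration:

    /* Back-of-the-envelope check of the door-open accrual described above. */
    #include <stdio.h>

    int main(void)
    {
        const int tick_minutes      = 5;    /* one accumulation mark */
        const int door_open_speedup = 10;   /* door-open mode ticked 10x faster */
        const int days_in_warehouse = 3;

        int real_minutes  = days_in_warehouse * 24 * 60;
        int ticks_accrued = (real_minutes / tick_minutes) * door_open_speedup;
        int debt_days     = ticks_accrued * tick_minutes / (24 * 60);

        printf("%d days of door-open testing accrued %d days of debt\n",
               days_in_warehouse, debt_days);   /* prints: 3 ... 30 */
        return 0;
    }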

That was a fault, one I could tackle with the team.  But changing the user habit of leaving the door open was harder...

We had to work with the users and their patterns; we suspended the system for a short while and issued a new update, but the customers' first taste of this "pay as you go" approach was a sour one.

Then things got bad....

Yes, you might think they were already bad, but they got worse.

A month later, all the above was resolved, and we thought things were settling down... Until suddenly all hell broke loose.

EVERY MACHINE WAS LOCKED.

There were dozens of reports of machines just not working; they had done their daily reboot and all of them reported a security fail on the smartcard....

All hands on deck, are our machines in the test pool doing the same?  Nope.

Is there something special going on?  Have clocks changed, is it a leap year, has the sky fallen on Chicken Little?

We honestly had no idea: there was no repeat in any of our test pool, no repeat on our personal engineering rigs; there was essentially no reason for this failure.

The only answer in such a situation is to observe or return one of the machines exhibiting the problem.

A lorry was sent and a machine brought back, under the explicit instruction not to open it nor change it, and the customer was not to keep their smartcard (it was theirs, but we would credit them a whole new card for the inconvenience).

Several hours were spent staring at the code and running checks, lowering cards so they would expire, or pulling the reader out and inserting it again; we had no answer.

Before that arrived back with us, however, let's just think about the "smartcards" we use in our daily lives. Our bank cards go into a machine, we enter our PIN, and we remove them again. Then how about cards like your gas meter card, or when you go to see your GP: there is a card inserted into a machine which stays there all the time, to validate they are the GP or to keep your meter in operation. If you have Sky TV and a viewing card, same thing; it is always in the device.

These machines are the latter kind.... Those cards are rated to have power applied to them for long periods of time; as a consequence they cost more money than a card you only insert transiently...

And this company I worked for had very canny buyers; too canny, because they spotted a smartcard which used the same protocol... but was significantly less money to buy!

The difference?  You guessed it, it was the transient use variant.

The broken machine arrived, we powered it on: fail. We opened the door, removed the smartcard and, sure enough, on the rear of the plastic behind the chip the plastic was brown, burned.

The card could not be electrically trusted!

We highlighted this and sent it back to the buying department: they had fouled up, changing the hardware after we certified it and essentially sending an uncertified machine out.

A huge issue ensued about this, as it wasn't well understood that we had been provided with, and had advised, one card type for the update set. Of course the buyers would not accept it wasn't the same card until we literally had the specifications side by side: we could see a one-digit difference in the part number, and the datasheet clearly said that the transient card was only rated to remain in a machine for 10 minutes. More than enough for an ATM. But a "security gating" card, as we wanted, is rated to be inserted continually for 36 months.

Monday 8 January 2024

The 2023 Great Laptop Shop...

It is that time of life again.... Time to shop for a new laptop... And I am finding a new scourge in the mix, and that is "Soldered".

I am so absolutely sick of seeing "Soldered" next to the RAM specifications, like I get it for budget machines, I get it for tiny integrations, but for actual working laptops?  For laptops marketed as "Business" or as "performance"... oh my word, why are they doing this?

I really hate this whole phase. I have hated laptops' hidden gotchas for years, to be honest, and I'm upset Lenovo have joined the price-crunch, unupgradable, e-waste train.

My current machine is a Lenovo E480; I can and have upgraded the RAM, it has two drive slots (one SATA and one NVMe), you can change the battery, and it has plenty of USB and aux IO ports all around. The issues I have with it are that one RAM slot has gone bad (I simply doubled the RAM in the one remaining slot, so really I'm not at end of life) and that the CPU, being an 8th gen, is struggling with my workloads.

My options are to soldier on with the E480, maybe even move all my work into remote virtual machines and use this simply as a terminal into them; but then I would always require network access and that's not always possible.

Or get something new and locked down.

Or a Framework offering, which lets me upgrade to a point.

Or keep shopping... But the ease of finding what I want is very much on the low side, and I'm a technical user; finding what I want should be simple. But the deck seems stacked, with both retailers and system integrators wanting to pass off this shovelware crap on me.