Saturday, 24 April 2021
Bad Files and Smart Cards in a Project from Long Ago
I need to anonymize this code, so we'll be doing it in a pseudo C# style. One of the last tasks I had at my prior employer was to inherit the entire code base for a project I had been bitting and bobbing in for years; I'd seen this project start, release (many times), mutate and ultimately age.
As I took control it needed replacing, which is a whole other story involving C++ and dragging people kicking and screaming into touch.
This product though was like your grandad: it sat quietly on its own, sucking a Werther's Original, waiting for a war film or Columbo to come on the telly.
The difficulty was the fault rate: between 9 and 14% of machines were off in the morning, and if a pack of updates was ever sent out (for content) then that rose to around 46%... Imagine the calls, the service manager and his oppo having to field a 46% fault rate because of your update. Indeed, on one occasion I remember driving to a customer's site and physically handing them a good update DVD rather than leaving them to wait.
So what was so bad? Well, it all came down to.... Let's look at a piece of code that is seared into my memory:
FileStream file = new FileStream("C:\\SomeFile.txt", FileMode.Open, FileAccess.Read, FileShare.None);
byte[] buffer = new byte[file.Length];
int bytesRead = file.Read(buffer, 0, (int)file.Length);
file.Close();
// Do something with buffer to give us a new buffer
int newDataLength = 64;
byte[] newBuffer = new byte[buffer.Length + newDataLength];
file = new FileStream("C:\\SomeFile.txt", FileMode.OpenOrCreate, FileAccess.Write, FileShare.None);
file.Write(newBuffer, 0, newBuffer.Length);
file.Close();
This is part of an update sequence: the existing file would be opened, the new update delta calculated, and the intention was to append that delta onto the end of the file. This was fine for years, it worked, it got shipped. It went wrong about five years later; can you see how?
A hint is that this was a 32-bit machine.
Did you spot it?.... it's line 2...
"file.Length" returns a long, but all the following file operations work on int. The file started to go wrong once it passed two gigabytes in size, because the range of int tops out at 2,147,483,647; divide that by 1024 three times (bytes to kilobytes, to megabytes, to gigabytes) and you get roughly 1.99 gigabytes.
But then think about that: this is a 2 gigabyte file being loaded into a buffer in RAM!?!?!?
It just makes a pure RAM copy of the whole file, then reopens the file and starts to write over the original from zero to the end.
YEAH, so it's overwriting the whole original file.
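Back to that cast for a second - here's a tiny sketch of the wrap-around itself, not the product code, just the arithmetic of pushing a long past the int limit:
using System;

long fileLength = 3L * 1024 * 1024 * 1024;   // pretend the file has grown to 3GB
int castLength = (int)fileLength;            // the explicit cast silently wraps (unchecked by default)
Console.WriteLine(castLength);               // prints -1073741824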
It's so wrong in so many ways: the massive buffer, the overwriting of existing data already safe on disk, the fact that this all took time too... This operation happened at a reconcile phase, it was all asynchronous, and whilst this portion of the system was doing its mental tossing about, another part of the system had changed the screen... to say "Please Power off or Reboot".
So people did, they literally pulled the power. So they lost their 2+ gigabytes of data; and it was exactly when these files were getting large, and the rewrite taking longer, that pulling the power was most likely to nuke them!
The solution is simple: open the file for append, or just seek to the end and add the new data on.
int newDataLength = 64;
byte[] buffer = new byte[newDataLength];
// Get the new data into the buffer
FileStream file = new FileStream("C:\\SomeFile.txt", FileMode.OpenOrCreate, FileAccess.Write, FileShare.None);
file.Seek(file.Length, SeekOrigin.Begin);
file.Write(buffer, 0, buffer.Length);
file.Close();
This was only part of the problem though: the functions using the data from this file took it as one whole byte array, so there was literally no way to chunk the file. I can't go into the details, but I had to break that up and start to stream the data through that system, which then let me add the resulting new delta array (which was always smaller than 2MB) to the end of the file.
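I can't show the real reconcile code, but the shape of the change was roughly the sketch below: read the existing file in small chunks, feed each chunk through the delta calculation, then append only the small result. The chunk size and the ProcessChunk helper are invented for the illustration.
using System.IO;

const int ChunkSize = 64 * 1024;                      // hypothetical chunk size
byte[] chunk = new byte[ChunkSize];

using (FileStream input = new FileStream("C:\\SomeFile.txt", FileMode.Open, FileAccess.Read, FileShare.None))
{
    int read;
    while ((read = input.Read(chunk, 0, chunk.Length)) > 0)
    {
        ProcessChunk(chunk, read);                    // hypothetical: feed only this chunk to the delta calculation
    }
}

byte[] delta = new byte[64];                          // stand-in for the small delta produced above
using (FileStream output = new FileStream("C:\\SomeFile.txt", FileMode.Append, FileAccess.Write, FileShare.None))
{
    output.Write(delta, 0, delta.Length);             // append only the delta, never rewrite the whole file
}

void ProcessChunk(byte[] data, int count) { /* delta calculation elided */ }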
That was only one part of the system which kept me awake. Another good one, used a lot, was a pattern that also overwrote small files, mostly the JSON files which controlled the settings. Remember, the users would often turn these machines off by simply pulling the power out of the back.
Whenever it was saving a file it would basically be doing:
File.WriteAllBytes(thePath, allTheBytes);
Yep, it'd just write over the file.
My fix? Simple: when opening the file at a time when we didn't expect users to just pull the power - or at least when it was less common - make a backup with "File.Copy(source, dest)", with the destination files numbered 1, 2, 3... The number of backups was configurable, so on sites we knew had a high fault rate we could stack up 5 or 7 backups of these files, but on machines with better hardware, or SSDs, we'd only need 3.
I don't even think the service manager knew about this "fix".
But armed with these backups we could leave the original code alone (which was quite convoluted and, to be honest, I didn't want to fix). On the next load, if opening the file failed, I'd have it nuke the backup it had just taken and fall back to the newest good backup; and if there were now more backups than we should have, we'd delete the oldest.
Settings didn't change very often, but this did let us solve this issue.
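The real code was tangled into the settings loader, but the backup side of it amounted to something like the sketch below - the path, file naming and backup depth are all invented here:
using System.IO;

string settingsPath = "C:\\Settings\\config.json";    // hypothetical settings file
int maxBackups = 3;                                    // configurable per site (3, 5 or 7)

// Shuffle the existing backups up one slot, overwriting the oldest...
for (int i = maxBackups - 1; i >= 1; --i)
{
    string older = settingsPath + "." + i;
    if (File.Exists(older))
    {
        File.Copy(older, settingsPath + "." + (i + 1), true);
    }
}

// ...then take a fresh copy of the current file as backup number 1.
if (File.Exists(settingsPath))
{
    File.Copy(settingsPath, settingsPath + ".1", true);
}

// On the next load, if reading the settings fails, the loader deletes the
// backup it just took and falls back to the newest remaining copy (elided).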
The final, and worst, piece of this system was the licensing, which used a USB-connected smart card reader and a custom decrementing secure card format to license the machine time. This was fine for years: it used a nice Gemalto reader and cards, and all was fine in testing.
The machine tested the card whilst in operation once every five minutes, so no big deal. When in service mode it checked the card every 10 seconds to update the license level display, but the service mode was never intended to be left for more than a few minutes.... So what happened?
Yeah, a customer opened a machine and left it open for a week.... and their machine went out of operation. When we got this particular machine back I just opened the door, took the card out and pointed to the literally charred, burned back of the smart card chip... It was a white plastic card, and the back was deformed and light brown... I did chuckle; it sucked for the customer, but we never worked out why they had the door open in service mode for so long; they weren't meant to.
But worse than that isolated incident was a new tranche of machines released in 2015: suddenly all had faults, there were machines out of order, machines not allowing play, machines rebooting... Nothing seemed to clear them, and some were reporting "Out of Licensing", despite people having paid for brand new cards.
They were issued a new card... The old cards came back and were reworked... so once-working sites randomly got either a new card or a reconditioned card from some other random site.
New machines had a new brand of card reader, old machines had the Gemalto. New cards were all this new brand, and the old cards were the white Gemalto ones... this mix just went on... and soon we had a rising fault rate.
The diagnostic view was at first a little mixed, sometimes a new reader was fine, sometimes a new reader was bad... all customers reported "my new card", they had no idea that the brand had changed under the hood... and in fact nor did I.
You see, to save a few pence per card (12p per card to be precise) they hadn't gone with the grand 34p Gemalto cards, they'd gone with 22p Chinese copies... Inferior copies, as it turned out; they had around 1/8th the life span, so over time ALL these "new" cards failed.
But then, in the Gemalto reader they were all fine... So the new reader?... Oh, that was ALSO a cheap Chinese knock-off, and these things had strange problems; I suspected they were sometimes putting the full 5V from the USB supply through the cards (rated at 3V), killing them. And I was proven right.
This unholy quartet of products caused havoc. I eventually found that new readers could kill either new or old cards, so they had to be recalled... Then new cards could die randomly even in the old reliable readers, so they had to be recalled too. Which meant we slowly struggled to find old readers and old cards.
All of this was a purchasing foul-up; unfortunately managers saw it as an engineering problem, and so one had to code around poor hardware.
The first thing we did was add two toggles, one being "old card", which I could detect from the card chip type being read on reader access. This slowed the reading of the card down... from every 5 minutes to every 30 minutes, so we risked giving customers longer before an unlicensed machine went out of action, but it was accepted to give us a much longer read life for the card cell.
Then we deferred the first read of the card: on boot up we literally leave the USB device completely alone, let Windows start and everything settle on the desktop-driven system, and only after 5 minutes would we start our licensing check. It was accepted that a user could technically receive 4m59s of unlicensed use and then reboot to get more time, but that would be a little impractical in this usage scenario.
Doing these two things we could just about use the new readers... But the new cards were just so utterly terrible that we did eventually have to buy better cards. I never heard if there was a refund on the originals, but I can assure you my time alone cost more than the £120 they saved going with these cheap cards.
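In spirit neither change was anything clever - a deferred first check plus a configurable poll interval - roughly like this little sketch, where the intervals, the toggle and the check itself are all stand-ins:
using System;
using System.Threading;

bool slowPoll = true;                                 // set from the card-type toggle
void CheckLicence() => Console.WriteLine("licence checked at " + DateTime.Now);

// Leave the USB reader completely alone for the first 5 minutes after boot,
// then poll at whichever interval the toggle selects.
TimeSpan firstCheck = TimeSpan.FromMinutes(5);
TimeSpan interval = TimeSpan.FromMinutes(slowPoll ? 30 : 5);

using Timer licenceTimer = new Timer(_ => CheckLicence(), null, firstCheck, interval);
Console.ReadLine();                                   // keep the demo process alive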
Friday, 23 June 2017
Development : My Top Three Testing Tips
I've said before, and I'll say again, I'm not a fan of Test Driven Development. Tests and test frameworks have their place, but they should not, in my opinion, be the driving force behind a project's development stream - even if it does give managers above the dev team a warm fuzzy sense of security, or allows blame to be apportioned later. You're a team, work as a team, and use tests on a per-developer basis as a tool, not as a business rule.
*cough* I do go off topic at the start of posts don't I... *heerrhum*, right... Top Three Automated Testing Tips... From my years of experience...
1. Do not test items which are tested by masses of other developers... I'm talking about when you're using a framework or library: certainly ensure you are using it correctly, do this in training or with your coding standard, but do not labour the point by re-testing it... Let's take a good example of this, the C++ Standard Library.
The Standard Library contains many collection classes, and these classes have iterators within; let's look at a vector:
#include <vector>
std::vector<int> g_SomeNumbers { 1, 3, 5, 7, 9 };
We could iterate over the collection and output it thus:
int g_Sum(0);
for (int i (0); i < g_SomeNumbers.size(); ++i)
{
g_Sum += g_SomeNumbers[i];
}
However, this is not leveraging the STL properly; you are introducing the need to test the start point "int i(0);", the end condition "i < g_SomeNumbers.size();" and the step "++i" - three tests, slowing your system down and complicating your code base.
int g_Sum(0);
-- TEST SUM START
-- TEST i COUNT START
-- TEST RANGE CONDITION LIMIT
for (int i (0); i < g_SomeNumbers.size(); ++i)
{
-- TEST ITERATION
g_Sum += g_SomeNumbers[i];
-- TEST SUM CALCULATION - THE ACTUAL WORK DONE
}
-- REPORT TESTS
Using the iterator, we leverage all the testing of the STL, we remove the need to range test the count variable, we remove the need to test the condition and leave only the step as a test to carry out...
int g_Sum(0);
for (auto i(g_SomeNumbers.cbegin()); i != g_SomeNumbers.cend(); ++i)
{
g_Sum += (*i);
}
Our code looks a little more alien to oldé timé programmers; however, it's far more robust and requires fewer tests, simply because we can trust the STL implementation. If we could not, then hundreds of thousands of developers with billions of other lines of code would have noticed the issue; our trivial tests gain us nothing, so long as we've written the code to a standard which uses the interface correctly...
int g_Sum(0);
-- TEST SUM START
for (auto i(g_SomeNumbers.cbegin()); i != g_SomeNumbers.cend(); ++i)
{
-- TEST ITERATION
g_Sum += (*i);
-- TEST SUM CALCULATION - THE ACTUAL WORK DONE
}
-- REPORT TESTS
2. Do not allow values which have been tested to change unexpectedly... I'm of course talking about "const", which I have covered before on these pages; constness in programming is key. The C family of languages allow constness at the variable level, and you may notice in the previous point I used a const iterator (with cbegin and cend) as I do not want the loop to change the values within the vector... Constness removes, utterly, the need to perform any tests upon the integrity of your data.
If it's constant, if the access to it is constant, you do not need to test for mutations of the values.
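The examples in this post are C++, but to borrow the pseudo-C# style used elsewhere on this blog, the same idea might look like this minimal sketch (the type and its properties are invented for illustration):
// Values are fixed at construction; there is simply nothing to test
// for unexpected mutation later, so those tests never need writing.
public sealed class MachineSettings
{
    public string SiteName { get; }
    public int BackupCount { get; }

    public MachineSettings(string siteName, int backupCount)
    {
        SiteName = siteName;
        BackupCount = backupCount;
    }
}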
Your coding standard, automated scripts upon source control submissions, and peer review are your key allies in maintaining this discipline. However, its roots stretch back into the system design and analysis stages of the project, before code was cut: when discussing and laying out the development pathway you should identify your data, consider it constant, lock it down, and only write code allowing access to mutable references of it as and when necessary.
The key is removing the need to retest mutable calls, and removing the need to log whenever a mutable value is touched, because you trust the code.
In languages, such as Python, which do not directly offer constness, you have to build it in. One convention is to declare members of classes with underscores to intimate they are members; I still prefer my "m_" for members and "c_" for constants, so my post-repository submit hooks run scripts which check for assigning to, or manipulation of, "c_" variables. Very useful, but identified by the coding standard, enforced by peer review, and therefore removed from the burden of the test phase.
3. Remove foreign code from your base... I'm referring to code in another language, any scripting, any SQL for instance, anything which is not the pure language you are working within should be removed from the inline code.
This may mean a stored procedure to hold the physical SQL, rather than inline queries throughout your code; it may be the shifting of JavaScript functions to a separate file which is imported within the header of an HTML page.
But it also includes the words we ourselves use - error messages, internationalisation - everything except code comments which is in whatever human language you use (English, French, etc.) should be abstracted away and out of your code.
Your ways of working, coding standards, analysis and design have to take this into account; constness plays its part as well, as does mutability. Wherever you move this language to, and whatever form it takes, test it ahead of time, and then redact that test from your system level tests - trust you did it right based on the abstraction you've performed, and avoid burdening your system throughout the remaining development cycle.
One could expand this to say "any in-house libraries you utilise, trust their testing" just as I stated with the STL in my first point, however, I am not talking about code, I am talking about things which are not code, which are uniquely humanly interpretable.
The advantage of removing them and pre-testing the access to them is that you retain one location at which you have an interlink, one place at which a value appears, one place where they all reside. So you can easily convert your program's language, you can easily correct a spelling mistake, and all without needing to change your system code - perhaps without needing to even re-release or re-build the software itself (depending on how you link to the lingual elements).
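As a minimal sketch of that "one place where they all reside" idea, again borrowing pseudo-C#, the user-facing text could live in a single key/value file loaded at start-up; the file name and key below are invented:
using System;
using System.Collections.Generic;
using System.IO;

// messages.en.txt holds lines such as:  UpdateFailed=The update could not be applied.
// Correcting a spelling mistake or switching language touches only this file.
var messages = new Dictionary<string, string>();
foreach (string line in File.ReadAllLines("messages.en.txt"))
{
    int split = line.IndexOf('=');
    if (split > 0)
    {
        messages[line.Substring(0, split)] = line.Substring(split + 1);
    }
}

Console.WriteLine(messages["UpdateFailed"]);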
Ultimately reducing the amount of testing required.
Labels: automated, C#, C++, Code, Coding, Development, engineering, example, HTML, JavaScript, lead, life-cycle, programming, project, Python, system, testing, tests
Wednesday, 8 February 2017
Development : The Art of Fungibility
Fungibility is about being able to substitute one thing for another seamlessly. Let us say you have a paint sprayer which contains red paint today, but tomorrow you want to paint the wall blue: you simply swap the paint... The unit remains unchanged... The paint is fungible.
When it comes to software, especially in the past, it was of paramount importance to write your code to conform to the system, you had to write code which was compiled for one processor only, you had to conform to the calling conventions of the underlying architecture and most importantly the OS... Software was generally not very portable.
High level languages, like C, were created to get over this problem, and indeed the Unix system as created by Dennis Ritchie could be recompiled from source on various platforms. However, things were still very specific, and the "portability" was a case of recompiling and overcoming the localisation quirks of each underlying platform.
Then, with the advent of portable interpreted languages like Java or Python, and manageable tool-chain assistants like Docker and Kivy, you can write your code once and have it move between systems without effort. The first of these which I really encountered was Java; I had it run on both a Sun SPARCstation and a PC, and when I wrote my degree dissertation (Parallel Computing in an Open Environment), in order to open up that platform the interop module was all Java, running on a Windows PC or any other Java 1.1.4 (AWP) functional system.
Little did I understand, at undergraduate level, how important a trend this cross platform context for an application would become. Smart Phones, Tablets, Consoles, various PC Hardware, Linux, Windows... At one point I was even writing work in Personal Pascal on the Atari ST, then porting it to Turbo Pascal on the PC for profit, when I didn't own a PC myself, so one had to be careful and thoughtful. There were not so many avenues for a system to truly be the same on one platform or the next.
So, as I learned to program, you had to sit down with pencil & paper and a process flow chart stencil and work out how you wanted to lay the system out, to order what was being written when.
This led to being frugal with your time, and ultimately led me to see fungibility as a key component of my way of working, because it was such hard work to cover all the bases and to make your code truly portable across machines.
As great as Java was, and Python is now, or as technology like Docker & Kivy is, I believe they are pushing fungibility into decline, sometimes with it not even being considered. Take my post of a few weeks prior, replacing an older system with a new one: I simply looked at the API being exposed, boiled it down to a few function calls, then re-implemented them and the system came back up. Replacing the code inside the functions was hard work, but the API swapped over for me.
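Borrowing pseudo-C# once more, the shape of that swap is roughly the sketch below: boil the old system down to the handful of calls actually used, then slot a new implementation in behind the same interface. All the names here are invented.
// The minimal surface the rest of the system actually used.
public interface IStorageBackend
{
    byte[] Load(string key);
    void Save(string key, byte[] data);
}

// Old and new systems implement the same small interface, so callers
// never change - the backend itself is fungible.
public sealed class LegacyBackend : IStorageBackend
{
    public byte[] Load(string key) { /* talk to the old system */ return new byte[0]; }
    public void Save(string key, byte[] data) { /* talk to the old system */ }
}

public sealed class ReplacementBackend : IStorageBackend
{
    public byte[] Load(string key) { /* talk to the new system */ return new byte[0]; }
    public void Save(string key, byte[] data) { /* talk to the new system */ }
}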
I've since learned that other, younger, developers would simply have re-written all the code; they'd not have cut out what wasn't needed, because they have the tools and the machine power (RAM and processor) to support carrying the added load. They don't pare things down, they don't binary chop what's not required, they only create an ever-expanding gaggle of code (if you are young and don't do this, please don't think I'm painting all with the same brush!).
So as you start to consider scalability, on-demand computing and cloud computing, keep fungibility in mind; think about and design for like-for-like replacement of systems. Can you swap the service back end from an instance of Ubuntu on AWS to running on your local XEN host? Can your developers in Canada send their data to a dedicated local machine today, but deploy to a virtual instance hosted by yourself in the UK tomorrow?
Consider your file system: advanced RAID arrays offer a form of fungibility, in that you can swap bad disks out and replace them with new ones; ZFS does this in software in the same manner, so long as you add your disks to the pool by disk-id.
Fungibility, in code, in provision, in hardware... It's an art which is sadly not often mentioned, indeed it's not even in my spellchecker's dictionary.
Friday, 1 July 2016
Software Engineering : My History with Revision Control (Issues with Git)
I'm sure most of you can tell I'm one of those developers who has been using revision control systems for a long time... So long, in fact, that I wrote a program for my Atari ST, whilst at college, which would span files across multiple floppy disks and use a basic form of LZH to compress them.
Later, when I graduated, I worked for a company using a home-brew revision control system, imaginatively called "RCS", which basically zipped the whole folder up and posted it to their server, or unzipped it and passed it back to you. There was no way to merge changes between developers; it was a one task, one worker, one at a time system, almost as lacking in use as my floppy based solution from six years prior.
During my years at university, revision control was not a huge issue; it was NEVER mentioned, never even thought about. Yet today we, sometimes happily, live in a world where software engineers need to use revision control - not only to ensure we keep our code safe, but to facilitate collaborative working, and to control the ever growing spread of files and the expanding scope of almost all projects beyond the control of a single person.
Now, I came to using professional grade revision control with subversion, in early 2004. I think we were a very early adopter of Subversion in fact, and we spent a lot of time working with it.
If you've ever taken a look around my blog posts you will see Subversion is mentioned and tutorials exist for it, befitting nearly twelve years of working with it. And, unlike the comments made by Linus Torvalds, I totally believe Subversion works, and works well. It is not perfect, but I find it fits my ways of working pretty well.
Perhaps after twelve years my ways of working have evolved to adopt subversion and vice versa, but whatever the situation, I'm currently being forced down the route of using git a lot more.
Now, I have no issues with git when working with it locally; ALL my issues are with using git remotely. Firstly, the person who (in the office) elected to use git started off working alone: he was creating a compiler project in C#, so he just had it all locally and used Visual Studio plug-ins to push to a local repo, and all was fine.
I've used git with local repos without problem.
All the problems come with pulling and pushing, with remotes, and controlling that access. Git intrinsically fails to protect access to the repo easily, relying instead on the underlying operating system. Which is fine when you have a controlled, easy to manage user base as with a Linux server; however, with the minefield of integrating with Active Directories, domains and whatever else on Windows-based infrastructure, nothing but problems come up.
The next problem I've had with Git has been the handling of non-mergeable files. We have lots of digital files - movies, sounds and plenty of graphics. As such we've had to work around git by having people work on files one at a time, and cross reference which files they are responsible for. With an art crew of five people, this means a flip chart or white board constantly lists the media files with someone's initials next to each, just to help control access.
"Surely git should be able to lock these files", they constantly cry. No - how can it? How can a distributed control system manage locks across five or more repos which are not talking to one another? And if you did elect one to be considered the master, how do you then transmit out to the passive clients every time you lock or release a file? You can't; the artists would each have to remember to pull, or shout to each other to pull now! It simply doesn't work.
And as a way of working the white board is pretty poor, but it's all we have right now.
The next problem we had was the massive amount of disk space being used by the repos. We boot our machines off very small (128GB) drives, then use either NAS or SAN for our main storage. This was fine, and efficient, and critically it was all well backed up on the infrastructure we use, and it worked for twelve years with Subversion. However, with Git our huge files are constantly being snapshotted, and this growth in the size of the overall repo replicates those files over and over and over.
In short, despite someone else, and the world at large turning its back on Subversion, we here in my area are strongly drifting back to Subversion.
Trouble is, it feels as though we're swimming against the tide: despite all these slight deficiencies in Git, the overall organisation, and even external projects I'm working on, are pushing Git. Torvalds himself calls people still working on Subversion "brain dead". But has he thought about the shortcomings? Or these case studies we can give where Subversion is a better fit for our working style?
Above all this internal wrangling has been my problem of expressing our situation with Git to both the initiated and the uninitiated. When talking to advocates of Git, all sorts of acronyms, actions and comments are thrown about: "use git this", "use git that". The problem being, there are something like 130+ commands in Git; that's a huge number of things you can work with. But we can break down what we've actually done to "git init", "git checkout", "git add", "git commit", "git push", "git pull" and "git status" (as I've said, merging utterly failed, so I'll gloss over that right now).
Given this huge scope of possible usage, and such a small exposure to it, it's hard to put words against why things were not a good fit with Git; the initiated always seem to argue "you didn't give it a good crack of the whip". But we don't work in an environment where we can try one thing and then another: it's an old working structure, which has evolved over time, people are used to it, and I'm nearly 40 yet I'm the youngest guy here! Training those around me in new ways of working is very much an uphill struggle. So, when introducing something as alien to their mindset as Git, it was always a losing battle.
Expressing this to the uninitiated is even harder: they don't know what an RCS does, nor what we mean by centralised or distributed control; they just want to see our work kept safe and released to the customer. Gripes about Git versus Subversion make no inroads with them; they're just unimpressed when you explain that these solutions are both open source and have no support. The fact that they're free has been wildly ignored, yet I could - for the price of the support contract of another system here - easily buy and operate a whole new SAN just for our needs!
Lucky for me though, after struggling with this issue, I ran across Peter Lundgren's post on the same topic, of expressing what's wrong with Git. He doesn't advocate Subversion, or anything, over Git, he just lists the problems he had with Git, and he crosses much of the same ground I have had to.
Check out his post here: http://www.peterlundgren.com/blog/on-gits-shortcomings/
Labels: Code, coders, control, developers, Development, engineering, file, git, github, Linux, office, revision, software, Subversion, SVN, system, version, Windows