
Wednesday, 22 January 2025

Why does XVE look so basic?

I've received a message about my XVE project, specifically calling out why I have the temerity to describe it as an "engine" when it looks so basic.
My oversimplification in reply being that an engine can be anything you please; the engine is simply the framework within which the content can execute.
 
This applies to Unreal, Unity, Godot, all of them. If you have something which will load your models, compile/apply your shaders, and render to the screen, you have a graphics engine.  If you have a threading model, memory and allocation handling, logging, input, user interface... all these modular pieces form an engine.

The reason I am investing time in my own such project is to explore those modules in which I find my experience or knowledge needs expanding, where I wish to trial alternative approaches, or to keep up with emergent techniques and processes.

And of course, for the content itself, to explore the content creation tools; for instance I am very much a Blender beginner, and learning fast.  I could not have said this just three months ago.
 
The major advantage to me of performing this kind of exploratory work at home, in my own time, is of course that I take confidence back to my real work; as the leader of a small team I feel I am equipped to jump in, pair program, and just help.  I feel equipped going over a diverse or new code base; for instance recently I explored the Dagor engine used by War Thunder and the Little Vulkan Engine by Brendan Galea.
 
Chronologically, XVE has passed many internal milestones, long before I began posting anything to YouTube or this blog about it, including:
  • CMake & Project Structure
  • Coding Standards
  • Git & Commit Hook integration for Jenkins Continuous Integration Builds
  • Test Framework
  • Module Design
    • Base
    • Renderer
    • Data loading
      • XML
      • CSV
      • Wavefront Obj
      • Generated type definitions (a whole generator suite written in Python)
    • Entity Component System
    • Threading Model
      • Thread pool
      • Anonymous Async Dispatch
      • Promise/Future Standard Exploration
    • Memory/Allocator Models
      • Slab Allocators
      • Dynamic Allocators
      • RPMalloc
      • Standard Allocator API (Custom)
    • Input Framework
      • XInput/DirectInput
      • SDL2 Input
      • SDL3 Input
      • GLFW Input
      • Custom personal exploration
And only after all this did I begin exploring actual game content and systems, and the component and entity relationships to represent the game design concepts.
 
The engine becomes a tool, a tool to deliver the experience we want our players to enjoy, so they return time & again to our game.
 
The game is key, and if done right the player should never know what engine or framework is being used to deliver that experience to them.

Tuesday, 9 July 2024

Code Locality & Concurrent Systems

Let us talk Software Design; let us talk about code locality.  What do I mean by locality?  Well, in Software Engineering we often talk about code being self-documenting, a fabulous place to be if the code performing the work is right in front of you.  But to be honest, systems get pretty big pretty quickly.  So you're much more likely to be making calls to APIs, or just bunches of functions you have to blindly trust, unless you have the leisure of digging into them.

And there's usually precious little time in development, which is invariably taken up with writing new code, not going over the old (unless you're very lucky - but that's a conversation for another day).

Now, these APIs can encapsulate large features, and they don't always achieve the amount of descriptive power we'd like at the call site we're working in.  I therefore advocate for a comment around that point.

And so we are immediately at odds: we wish our code to be as accessible, as local, as it can be, while also encapsulating the large unrelated tasks elsewhere.  For me this is the dichotomy of code locality in a nutshell.

Where can it come unstuck?  Well, back in the day (and, to be honest, with most engineers still thinking linearly even today) you could be forgiven for writing out your big system block diagram, connecting things with events or signals, and just going about your business; when your code called "BigFooFunction" in the "BarBarBlackBox" library you didn't pay it much mind.  Today, however, as Moore's Law runs aground on the rocks of pumping ever more cores into a system, we have to think in concurrent terms, and it is in just such a scenario that I want to pick up thinking about systems design and code locality.

Let us perform a thought experiment.  We have a system which progresses from eggs to birds; it controls the state of the entity: laid as an egg, incubated, then hatched, fed and tended until it starts to fledge and ultimately turns into a bird.  All this transmogrification of state from egg to hatchling to fledgling to bird happens asynchronously in the background, in a big slab of system you do not have to worry about.

All you worry about is a signal coming in to you saying "predator", and when this signal arrives you need to stimulate all the little birds you have into action; they all need to take flight.

for (auto& bird : flock)
{
    bird.TakeFlight();
}

This is the locality of our change to the flight property on each bird; we are absolutely unaware of each individual possible state the bird can be in, and so we rely on the API and the code backing our call to "TakeFlight".

Now, let us assume the members of "flock" are all of the base type "Bird", so they all have a TakeFlight function?  Well, they might, if Bird looked something like this:

class Bird
{
    public:
        virtual void TakeFlight() = 0;
};

Then at compile time all the derived classes would have to implement "TakeFlight".

class Egg : public Bird
{
    private:
        uint64_t mTimeLaid;
        float mTemperature;

    public:
        void Incubate();
        void TakeFlight() override;
};

And because we know this is an Egg, we know it can't fly, and so this override of the function will do nothing.

This is perfect code locality for that derived type, but for Bird itself it leaves us an open question: well, what does it do?  Does it do anything?  Should "TakeFlight" return some code to indicate the bird took flight, and an error if not?  But an egg not flying is not an error, so must it return some compound result?

Now, I am straying into API design somewhat, and they are related fields, but really here we're thinking about where the active code sits, what is its locality compared to our callsite in the loop?

And for a function, you can see, we can define this and know what it does.
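As a sketch of that compound return idea (the enum and its names here are hypothetical illustrations, not an API from any real engine), one option is a tri-state result, so the caller can distinguish "it flew" from "it cannot fly" without treating the egg as a failure:

enum class FlightResult
{
    TookFlight,    // state changed, the bird is now airborne
    AlreadyFlying, // no change needed, and not an error
    NotCapable     // e.g. an egg; also not an error, it simply cannot fly
};

class Bird
{
    public:
        virtual ~Bird() = default;

        virtual FlightResult TakeFlight() = 0;
};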

Now let us change our example:

class Bird
{
    public:
        virtual void TakeFlight() = 0; // as before, so Egg's override still compiles

        bool mInFlight { false };
};

and

class Egg : public Bird
{
    // As above

    public:
        void TakeFlight() override
        {
            // Intentionally Empty
        }
};

Our egg can not fly, or can it?  Now, with a public member, we're flying somewhat in the wind of a contrived example, but anything can set the value of mInFlight; our for-loop, upon the predator signal, can now achieve its aim:

for (auto& bird : flock)
{
    bird.mInFlight = true;
}

And this is correct; this loop in review would pass, who can argue?  The bird took flight, it did, it is true.  And this is code locality in action, for at the location we needed it the functionality was available to mutate the value and achieve our goals, no matter what the state of the rest of the system was.

This is a very dangerous place to be.

Especially with an asynchronous system.  Let's say this is not a trivial call; let's say the call site is a loop through a series of resources starting them loading, and upon that call each is pending load, but not yet loaded:

for (auto& item : objects)
{
    item.StartLoad();
}

We can assume this code is starting the load, but now let's package this into some context; we like our code to be self-documenting, after all.

class Loader
{
    private:
        bool mEverythingReady { false };

        std::vector<Object> mObjects;

    public:
        void Load()
        {
            for (auto& item : mObjects)
            {
                item.StartLoad();
            }

            mEverythingReady = true;
        }
};

Can you already spot the problem?  The local code here is communicating EVERYTHING IS READY, when it is anything but.  You have simply started some other action elsewhere; you have not checked the state, you have not deferred until ready.  You have started the load and that is all you know, but your code here, locally, is communicating something subtly different.

And in huge systems you must not fall foul of this kind of behaviour; you need your code locally to communicate what it intends, and to do as it intends.  And if you spot silly public interfaces like this, do not be afraid to fix them; the bravery to address an issue, if only to raise it to the owner, is a step in the right direction with massive software systems.
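As a minimal sketch of one possible repair (the callback-taking StartLoad and the trivial Object type here are my assumptions, not the original API): derive readiness from the completions you actually observe, rather than asserting it at the point you merely started the loads.

#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical asynchronous resource; it invokes the supplied callback
// once its load completes (immediately here, purely to keep the sketch short).
struct Object
{
    void StartLoad(std::function<void()> onLoaded) { onLoaded(); }
};

class Loader
{
    private:
        std::atomic<std::size_t> mPendingLoads { 0 };
        std::vector<Object> mObjects;

    public:
        void Load()
        {
            mPendingLoads = mObjects.size();
            for (auto& item : mObjects)
            {
                // Each completion, whenever it arrives, decrements the count.
                item.StartLoad([this]() { --mPendingLoads; });
            }
        }

        // Readiness is now answered by the asynchronous system itself,
        // not asserted hopefully by the caller.
        bool EverythingReady() const { return mPendingLoads == 0; }
};

Note the pending count is set before any load starts, so a fast completion cannot make us report ready prematurely.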

Monday, 26 February 2024

Tech Tribulations #1 : Smartcard Release Drama

It has been a very long time since a story time, so I thought I'd go over one about a software system I wrote from the ground up to secure the service to a machine.  I worked for a company which sold whole machines to the customer (or leased them); for as long as the buyer had the machines, they would run.

In late 2014 the higher management realized this was an untapped revenue stream and, much to the annoyance of the customers, it was decided that a system update would go out, which the customer had to take to get any new content.  In this update they would also have to have a smartcard reader installed and a card inserted, which would count down time until it ran out.

Metering essentially, but "metering" had a whole other meaning for this system already, so it was just called the "Smartcard" system.

Really it was a subsystem, bolted into the main runtime as a checking thread, which would wake at intervals, query if there was a card reader on the USB at all, and check it was the exact brand of card reader (because we wanted to stop the customer just being able to put any old reader in; they had to buy our pack).

And then it would query the card and deduct credit/count down if we were beyond a day.

We tried a bunch of time spans, hours, minutes, etc., but it was decided the deduction would happen after we accumulated 24 hours of on time: every 5 minutes an encrypted file on the disk would be marked, and after 24 hours' worth of accumulations the deduction would happen.
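As a rough sketch of that rule (the class and its names are mine for illustration; the real system persisted each mark to an encrypted file on disk rather than a counter in memory):

#include <cstdint>

class Meter
{
    private:
        // 24 hours of on time, marked in 5 minute intervals.
        static constexpr uint32_t kMarksPerDeduction = (24 * 60) / 5; // 288 marks

        uint32_t mMarks { 0 };
        uint32_t mCredit { 0 };

    public:
        explicit Meter(uint32_t credit) : mCredit(credit) {}

        // Called by the subsystem's thread every 5 minutes of on time.
        void Mark()
        {
            if (++mMarks >= kMarksPerDeduction)
            {
                mMarks = 0;
                if (mCredit > 0)
                {
                    --mCredit;
                }
            }
        }

        uint32_t Credit() const { return mCredit; }
};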

We tested this for months, absolutely months, and to be honest we thought it was really robust.

Until it actually went into the customers' hands.  We suddenly had a slew of calls and returns, folks unhappy that they were inserting the card, "testing their machines", and suddenly all the credit was gone, and they were asking for a new card all the time.

At first we simply could not explain this anomaly.  We had the written information from the service calls, we replicated what the folks were saying, and it all checked out fine: we got our increments, we could inspect the encrypted file and see we were accumulating normally and deducting normally.

I worked on this for days on end, we had to test for real, we did all sorts of things, power drop tests, pulling the card tests, all sorts.

The machine checked out, end of.

What had we missed?  What were the customers doing?  Testing the machine, okay, what were they testing?  The content.  How does the content work for this update?  Well, it seems that the customers didn't trust their testers or engineers, so instead of testing for real they were doing what we called "open door testing".

You see, when you close the door you accumulate and deduct; the machine is in operation normally, as any user their end would have it operate....

Door open mode, however, was intended to be used by service engineers once the machine was deployed; it is still in operation, the machine is in the field, but the door is briefly open to check things.

But these customers didn't trust their engineers in their warehouse, so they were not giving them credit to check the machines properly; they therefore tested in open mode... for days....

They accumulated massive operational debt with the machines in door open mode for days.

The moment they turned them off, happy they were working, and shipped them to sites, they'd arrive on site, be turned on, finally, after so long, in proper door-closed operation, and instantly deduct the massive debt the warehouse team had accrued.

This was intentional.... but their use of the door open mode was an abuse, and one we had not even thought about.  We didn't even clock how long a machine sat in door open or door closed mode.  Worse still, when in door open mode, to allow faster testing of things the machine ran at an accelerated update rate, ticking over 10x faster... The result was that in just 3 days of warehouse door-open testing they could accrue 30 days of operational debt.

That was a fault, one I could tackle with the team.  But changing the user habit of leaving the door open was harder...

We had to work with the user, and their patterns, we suspended the system for a short while and issued a new update, but the first customer taste of this "pay as you go" approach was a sour one.

Then things got bad....

Yes, you might think they were already bad, but they got worse.

A month later, all the above was resolved, and we thought things were settling down... until suddenly all hell broke loose.

EVERY MACHINE WAS LOCKED.

There were dozens of reports of machines just not working; they had done their daily reboot and all of them reported a security fail on the smartcard....

All hands on deck, are our machines in the test pool doing the same?  Nope.

Is there something special going on?  Have clocks changed, is it a leap year, has the sky fallen on Chicken Little?

We honestly had no idea, there was no repeat in any of our test pool, no repeat on our personal engineering rigs, there was essentially no reason for this failure.

The only answer in such a situation is to observe or return one of the machines exhibiting the problem.

A lorry was sent and a machine brought back, under the explicit instruction not to open it nor change it, and the customer was not to keep their smartcard (it was theirs, but we would credit them a whole new card for the inconvenience).

Several hours were spent staring at the code, and running checks by running cards down so they would expire, or pulling the reader out and inserting it again; we had no answer.

Before it arrived back with us, however, let's just think about the "smartcards" we use in our daily lives.  Our bank cards go into a machine, we enter our PIN, and we remove them again.  Then how about cards like your gas meter's, or when you go to see your GP: they have a card inserted into a machine which stays there all the time, to validate they are the GP or to keep your meter in operation.  If you have Sky TV and a viewing card, same thing, it is always in the device.

These machines are the latter kind.... Those cards are rated to have power on them for long periods of time; as a consequence they cost more money than a card you only insert transiently...

And this company I worked for had very canny buyers, too canny.  Because they spotted a smartcard which used the same protocol... but was significantly less money to buy!

The difference?  You guessed it, it was the transient use variant.

The broken machine arrived, we powered it on: fail.  We opened the door, removed the smartcard, and sure enough, on the rear of the plastic behind the chip, the plastic was brown, burned.

The card can not be electrically trusted!

We highlighted this and sent it back to the buying department: they had fouled up, they had changed the hardware after we certified it, essentially sending an uncertified machine out.

A huge issue ensued about this, as it wasn't well understood that we had been provided, and had advised, one card type for the update set.  Of course the buyers would not accept it wasn't the same card until we literally had the specifications side by side: we could see a one-digit difference in the part number, and looked up the datasheet, where it clearly said that the transient card was only rated to remain in a machine for 10 minutes.  More than enough for an ATM.  But a "security gating" card, as we wanted, is rated to be inserted continually for 36 months.

Thursday, 25 May 2023

Just Stand It Up: About Premature Pessimization

Engineers often talk about premature optimization, but today I'm going to just talk briefly about the opposite, premature pessimization.

I currently work on a very large code base; it has been developed over four years from scratch.  One of the first things performed was a series of investigations into the "best performing" data structures, such as maps, lists and so forth.

Now of course one has to totally accept one can optimize any data structure that little bit more for a specific use case.  One also accepts that in C++ the standard library lends itself to being replaced: by defining the standard operators and iterators, the algorithms going with all this use those standard exposed APIs, and so you can implement your own.
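A minimal sketch of that point (the container name and shape are mine, purely for illustration): expose begin() and end(), and the standard algorithms will happily operate on your own type, since raw pointers already satisfy the iterator requirements.

#include <algorithm>
#include <cstddef>

template <typename T, std::size_t N>
class FixedBuffer
{
    private:
        T mData[N] {};

    public:
        T* begin() { return mData; }
        T* end() { return mData + N; }
        const T* begin() const { return mData; }
        const T* end() const { return mData + N; }
};

// Usage, with the standard algorithms accepting it like any other range:
//     FixedBuffer<int, 8> buffer;
//     std::fill(buffer.begin(), buffer.end(), 42);
//     std::sort(buffer.begin(), buffer.end());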

I just want you to stop though and think... Do you want to?

Too early in a project and you can start to introduce bloat: in terms of slight differences in the optimized cases, in the code from whatever third-party "best" you picked, and even in your build chain, as you are bringing in dependencies on someone else.

The standard library doesn't do any of this; its dependencies are guaranteed by the ABI.

So why not just use standard map, or standard vector and standard string, or standard formatting?

Quite often I'm finding it is due to premature pessimization: that developer's voice which cries out about some issue they had, either when some technology was new and emerging, or late in an earlier project's life where they had to optimize for that specific case I mention, where the standard version did prove itself to be a detriment.

These engineers carry with them the experience, sometimes scars, from such exposure to edge cases and bugs they had to quash.  Rightly and understandably they do not want to experience these selfsame issues again; their minds are therefore almost averted from just standing it up with the standard version.  They immediately seek, and even proactively nay-say the standard versions in favour of, domain-specific "best" versions.

This is, in my opinion, the very definition of premature pessimization.  The standard library is wonderful, diverse, and full of very well tested code, and will have nearly zero overhead to add and use in your C++ project.

I would therefore coach any developer with such mental anguish over just using the standard library to simply stand it up: just get across that line of things both building and running, then extend it to remain maintainable.  And finally, as you think you're getting close to stable, well, then you can expend more time looking at, profiling, and understanding the edge cases.

Friday, 18 January 2019

C++: The Dilemma of using Exceptions

As a C++ developer I've recently been in the midst of the exceptions debate.  Exceptions as a feature of the C++ language have long divided the community; naively I've often put this down to people coming to the language from elsewhere (C perhaps) where exceptions were not the norm.

This was a mistake, for in the projects I've previously worked upon the ramifications of an error resulting in a crash have often totally outweighed the overhead of an exception, making deciding to employ an exception exceptionally easy (*bum bum*).  Exceptions were the norm, still are, and should be to you too (according to the C++ Core Guidelines https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Ri-except).

The difficulty, the dilemma if you will, comes not from the feature but the cost.  If your application domain demands high performance you cannot risk using exceptions.  The argument carries on, therefore, whether in not using them to avoid that performance overhead you are actually cutting off your nose to spite your face... What are the downstream costs if an unexpected, untested-for error occurs and some strange behaviour ensues, or, worse, it propagates up and out of your application to crash?!

In the modern era of social media such feedback is near instantaneous, can you afford the egg on face?

And so you spin yourself around and dig your problems further into the ground.

Voices from industry do little to alleviate the problem.  Google, for instance (https://google.github.io/styleguide/cppguide.html#Exceptions), almost seems to advocate using exceptions, if from the ground up you can build them in; you stand the cost of retraining developers and change your ways of working around them (e.g. embrace RAII).

They even go so far as to state "the benefits of exceptions outweigh the costs", but then add the caveat "in new projects".  How old or new your project is should not define your error handling mechanisms, surely?  Plus, from experience, I've always found a code base is as old or new as the minds working upon it.  If you have an engineer firmly rooted in C without exceptions they will continue to churn code in that style through your compiler (which will happily accept it); not because it's a better way of working, nor because it's a worse way of working, quite often it's continued along simply under the inertia of the experience that developer has, and they've not had time to stop and take stock.

Even given that time, which the project I've just begun working upon has, there's weight behind the past: the old way of working, what worked before.  And if that was non-exception-handling code, people stick with it.  Tell those selfsame people "you really should use these, oh, and you have to retrain your minds, plus it'll take longer to compile and more time to run" and they're going to stare at you like you're mental.
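To make the "embrace RAII" point above concrete, a minimal sketch (the parsing function is hypothetical): with a resource's lifetime tied to scope, an exception unwinding the stack cannot leak it, which is precisely what makes exceptions viable as the primary error channel.

#include <cstdio>
#include <memory>
#include <stdexcept>

void Parse(const char* path)
{
    // The unique_ptr closes the file on every exit path, thrown or not.
    std::unique_ptr<std::FILE, int (*)(std::FILE*)> file(
        std::fopen(path, "rb"), &std::fclose);

    if (!file)
    {
        throw std::runtime_error("could not open file");
    }

    // ... parsing work which may itself throw; the file still closes ...
}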


The question I'm left with is therefore not whether to use exceptions or not (well, I am left with this problem, and this post is no help); instead we're left with the question of when will the benefits of exceptions, their ease of use, their simplicity to write and visualise, outweigh the retraining, compile time, and runtime costs?... I don't think that's any time soon.  It should be, after so many years, but it's simply not.
