Tuesday, 7 October 2025

Over the Fence.... And Far Away....

No, I'm not talking about horses this time.... Let's talk engineering: what is a fence?  A memory fence, or barrier, is used to order reads and writes around certain operations, most notably atomics.

This ordering ensures that the value held in a register on one core is correctly communicated to all the other cores/threads running, at a minimum via the cache within the chip, and it may even lead to a full flush out to main memory (though that takes far longer in cycles).

These can take tens to hundreds of cycles to leap the value over the barrier.
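
To make that ordering concrete, here's a minimal sketch (the names are mine, nothing from real code) of the acquire/release pairing that standard C++ atomics give you; the release store is the fence that publishes the plain write before the flag becomes visible:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                       // plain data we want to publish
std::atomic<bool> ready { false };     // flag whose ordering "carries" the payload

void producer()
{
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // release: payload is visible before the flag
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // acquire: pairs with the release above
        ;                                          // spin until published
    assert(payload == 42);                         // guaranteed by the ordering
}

int main()
{
    std::thread a { producer };
    std::thread b { consumer };
    a.join();
    b.join();
}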

In C++ the two most common, or certainly the two I encounter most often, are fetch_add and fetch_sub, used to control everything from the shared count on a resource to the gating on fixed-size "high performance" containers.
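
As a minimal sketch of that first use, assuming a hand-rolled shared count rather than any particular container of mine, the shape is roughly:

#include <atomic>
#include <cstdint>

struct RefCounted
{
    std::atomic<uint64_t> count { 1 };

    void addRef()
    {
        count.fetch_add(1, std::memory_order_relaxed);    // the increment itself can be relaxed
    }

    bool release()
    {
        // acq_rel so the final releaser observes all prior writes before tearing down
        return count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};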

And therein lies the rub: these operations cost a lot of cycles just to increment or decrement a counter in many of my use cases, so why use them?

Well, that barrier within the chip, between the memory, is very expensive if we compare it with the very cheap increment or decrement of a register itself: the value in a register can change in a single instruction.  Sure, it took other instructions to load the register and it'll take yet more to store the result off, just as it would with the atomic; but beyond that you have no overhead in comparison with the atomic....

Until... Until you try to synchronize that flat increment or decrement.  Sure, that code is going to be far faster, however it's not thread safe, not at all, while the atomic already is (when ordered correctly)...

In order to protect a flat operation one therefore has to wrap a separate lock, or mutex, around it, which is far, far more costly than the atomic operation.  The difference comes down to the "contention cost"; the contention cost of an atomic is simply in the number of steps involved, so let's look at code:

For the atomic addition, the CPU itself will execute something like:

lock xadd [or similar]

This itself is a single instruction; it may take multiple CPU cycles to complete, but it is a single instruction.  It ensures unique ownership of the cache line (usually 64 bytes) within which this variable resides, and means if you perform operations elsewhere in that same cache line you will be making near-optimal use of it, as the CPU can perform all the atomic updates in that 64-byte block without having to fetch another line.  This is really useful when there are a few cores (2-8 on average) accessing an area of memory and even holds up when scaling out to more cores.
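
One way to lean on that behaviour, and this is just a sketch with an assumed 64-byte line rather than anything queried from the hardware (std::hardware_destructive_interference_size is the portable spelling where available), is to keep counters which are always updated together inside one cache-line-sized block:

#include <atomic>
#include <cstdint>

struct alignas(64) HotCounters          // 64 is an assumption about the target's line size
{
    std::atomic<uint32_t> produced { 0 };
    std::atomic<uint32_t> consumed { 0 };
    std::atomic<uint32_t> dropped  { 0 };
};

HotCounters stats;

void onProduce()
{
    // lands in the same cache line as its siblings, so the line only has to be owned once
    stats.produced.fetch_add(1, std::memory_order_relaxed);
}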

A mutex, however, has to be controlled wholly separately from the increment, so we may end up with C++ such as this (fleshed out just enough to compile):

#include <cstdint>
#include <mutex>

std::mutex lock;
uint64_t count { 0 };

void increment()
{
    std::lock_guard<std::mutex> lockGuard { lock };
    ++count;    // the flat increment, now protected by the mutex
}

The execution here has to acquire the mutex before it can touch the counter; internally that acquisition is itself an atomic.  If the path is uncontended, then the atomic operation underlying the mutex is the only added cost.  However, and this is a HUGE however, if there is contention, if someone else already has the lock, then this lock_guard has to spin and wait... And it's the contention, the other thread holding the mutex locked, which adds the cost.

So with a mutex you're essentially gambling on whether the value is contested before the lock or not, and in either case you still take on the cost of at least one atomic operation; for my codebase and its uses across sub-32-core machines, that means a plain atomic is much more efficient in almost all my use cases.
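
If you want to feel that gamble for yourself, a rough sketch along these lines (the thread and iteration counts are arbitrary, and this is nothing like a rigorous benchmark) shows the gap open up as contention rises:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr int kIterations = 1'000'000;

// Run the given increment body on kThreads threads and report milliseconds taken
template <typename Fn>
long long timeIt(Fn&& body)
{
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
        workers.emplace_back([&] { for (int i = 0; i < kIterations; ++i) body(); });
    for (auto& w : workers)
        w.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
}

int main()
{
    std::atomic<uint64_t> atomicCount { 0 };
    std::mutex lock;
    uint64_t lockedCount { 0 };

    std::cout << "atomic: " << timeIt([&] { atomicCount.fetch_add(1, std::memory_order_relaxed); }) << "ms\n";
    std::cout << "mutex:  " << timeIt([&] { std::lock_guard<std::mutex> g { lock }; ++lockedCount; }) << "ms\n";
}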

A mutex, however, is far more useful when protecting more than a single register; when protecting a whole block of memory, a shared physical resource (like a disk) or just a complex structure, a mutex around it is really your only option.
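
A sketch of that "whole structure" case, with names invented purely for illustration: a queue of log lines which several threads append to, where no single atomic could protect the vector's internals:

#include <mutex>
#include <string>
#include <utility>
#include <vector>

class LogQueue
{
public:
    void push(std::string line)
    {
        std::lock_guard<std::mutex> guard { m_lock };   // every touch of the vector is guarded
        m_lines.push_back(std::move(line));
    }

    std::vector<std::string> drain()
    {
        std::lock_guard<std::mutex> guard { m_lock };
        return std::exchange(m_lines, {});              // hand back the batch, leave it empty
    }

private:
    std::mutex m_lock;
    std::vector<std::string> m_lines;
};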

All this sort of eluded me earlier this evening; I was in the middle of a technical conversation, I brought up the atomics backing a shared_ptr in C++ and immediately just sort of lost it, my memory drifted far far away and I have to admit to waffling some.

I even forgot about weak_ptr being created from a shared_ptr and its use to "lock" a new copy of the shared_ptr, and so regaining ownership through a weak pointer only when you need it.
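
For my own memory as much as anything, the shape of that dance in standard C++ is roughly this:

#include <iostream>
#include <memory>

int main()
{
    auto resource = std::make_shared<int>(42);   // strong count: 1
    std::weak_ptr<int> observer { resource };    // non-owning view, strong count still 1

    if (auto pinned = observer.lock())           // atomically takes a new strong reference
        std::cout << "still alive: " << *pinned << '\n';

    resource.reset();                            // last strong reference gone, object destroyed

    if (observer.expired())                      // the weak pointer can tell us it's gone
        std::cout << "resource has been released\n";
}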

But it came from a very specific conversation about Unreal engine, about TSharedPtr... Not a structure I myself have used, and for the life of me I could not think why not; I just knew not to, having been told...

And of course here I sit a few hours later and I know why: it's not thread safe... TSharedPtr in Unreal is not thread safe, and why is it not?  Well, because it does not protect its internal reference count with an atomic; no, it's just a flat increment and decrement of a plain integer count, "for performance purposes".

So sure, if you're inside one Unreal system, on one thread, then yeah, you can use the TSharedPtr, but its utility is much reduced to my eye, and you might want to look at other ways to hold your resources in that thread, even in thread local storage rather than in the engine heap.

The moment that TSharedPtr crosses a thread boundary, you're far, far away from thread safe.
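
To show the class of problem, and only the class of problem (I haven't pasted Unreal's internals here), a flat non-atomic count torn between two threads looks something like this:

#include <atomic>
#include <iostream>
#include <thread>

int flatCount = 0;                     // the flat counter: a data race, undefined behaviour
std::atomic<int> safeCount { 0 };      // the atomic counter: a single indivisible update

void hammer()
{
    for (int i = 0; i < 100000; ++i)
    {
        ++flatCount;                                        // read-modify-write can interleave and lose updates
        safeCount.fetch_add(1, std::memory_order_relaxed);  // never loses an update
    }
}

int main()
{
    std::thread a { hammer };
    std::thread b { hammer };
    a.join();
    b.join();
    std::cout << "flat:   " << flatCount << '\n';   // very likely short of 200000
    std::cout << "atomic: " << safeCount << '\n';   // always exactly 200000
}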

So what do you use a TSharedPtr for?  The documentation says "To avoid heap allocation for a control block where possible"... Yet it contains an explicit reference count, which is in a control block, and underlying it is a flat "operator new" and it uses only default deletion via the flat operator delete.... So my non-Unreal-expert brain says "Why use it at all?".

Hence when asked earlier today my memory was over the fence and far far away.... Of course now, it's all returned and... yeah I just sat through two hours of TV mulling this over.... and had to come write something down before I went a little bit odder than usual.

Tomorrow let me regale you with my story about forgetting how to load a crash dump and the symbols files and explain myself properly, despite doing that particular duty about a hundred thousand times in my life.  Hey ho, a technical conversation in which I fouled up, one is only human.
