I have had a really excellently interesting sprint with the work I'm currently doing, like you know one of those sprints where you get a real technical tooth into the pie of problems.
That however is neither an interesting story, nor one I can actually tell.
Instead I will fling my mind back a score of years and we'll discuss THE WORST DEVELOPMENT ARGUMENT I HAVE EVER HAD!
The problem started with the platform I was then working on. It was a quite low-powered single-core Celeron PC base, running a flavour of Windows for embedded systems (I think it was Windows 2000 Embedded), but it was basically used as a host for a C# application stack, which itself was more or less a wrapper around a C hardware driver talking to the various light, button and money handling mechanisms in the machine.
This system of ours would then simply launch, at the shell level, a child process which was the actual game content. And there were lots of different games we could fire up.
Our menu and the hardware polling would all back off. It was actually only polling for button and light IO, and every second polling for cash changes from money physically being entered; so that was slow to update, but otherwise it was fine and backed off.
Being a Win32 environment it was fairly easy to back off most all the threading and just launch the child process.
Our testing showed that we had a 92/8% CPU split: the system would take on average 8% CPU whilst in this active child mode, and that included Windows itself. It had a few spikes and it had a few troughs, but otherwise pretty much all the CPU was available to the child process.
It was therefore with some scepticism that we had the development manager in the content area wander over, complaining that one of his guys was having issues getting smooth frame rates from the platform. We accepted the install and ran the child process, and indeed we immediately saw the issue.
A popular UK TV show game, with a target circling the board...
The target would stagger and slew around. This was immediately explained to us as the system being so busy that the game was being forced to drop frames; the frame rate icon the game had programmed into it went up and down like a pair of kangaroos in the mating season. This was the whole argument: our game drops frames, your system takes too much time on the CPU.
It has to be pointed out that our manual actually said we only make 75% of the system time available to the game. Our demonstrating we were making over 90% available was well within tolerance, and this was the only game showing this kind of issue. We were therefore sceptical about the claims being made, especially about the high quality and tested nature of their content executable.
After an amount of head scratching we decided to measure how often the "Present" function was being called in DirectX, a little hackery later and we had a measure... Was Present being called consistently and then the system not presenting, or was the game itself changing the length of time a frame took, staggering when it presented?
Yes, the game itself was almost immediately measured to be staggering how often it called Present. So the question came up "When you move these items are you interpolating between the position and so smoothing where the target is? Or are you moving them a fixed distance each frame?"
"The frames are a fixed length"... The developer and their manager said. We had just measured and shown that the frames were changing length, we couldn't look at their code, so we'd had to hack about but we'd shown their code was calling the present with the same staggering it was not a fixed frame rate as they said it should be.
We video recorded the screen (literally with a camera on a tripod), measured the frames on the camera and calculated the stutter. Then we measured with our test harness when Present was being called and saw a direct match: when they called Present, the present happened. Our conclusion, the only conclusion really, was that the content executable needed to be calling Present more consistently, or their DirectX present needed to be set up to sync with the screen or something.
The argument then began. They insisted that their game was presenting at a fixed interval, that it was fixed to 30 FPS. They refused to turn on VSync; their platform did not allow this setting to be part of the DirectX initialise... It really should have been. But we had explained the staggering and the delay in Present being called; neither was to do with the system, both appeared to be in their code.
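For anyone wanting to do the same sort of measurement, here is a minimal sketch of the arithmetic, not our actual harness. It assumes you have already captured a timestamp in milliseconds each time Present was called; the names `IntervalStats` and `presentIntervalStats` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of the gaps between consecutive Present calls.
struct IntervalStats {
    double minMs;   // shortest gap between two Present calls
    double maxMs;   // longest gap between two Present calls
    double meanMs;  // average gap
};

// Takes the timestamps (in milliseconds) at which Present was called and
// reports how much the gap between calls varies. A steady 30 FPS game would
// show min, max and mean all close to 33 ms.
IntervalStats presentIntervalStats(const std::vector<double>& presentTimesMs) {
    assert(presentTimesMs.size() >= 2);  // need at least one interval

    double minMs = presentTimesMs[1] - presentTimesMs[0];
    double maxMs = minMs;
    double sumMs = 0.0;

    for (std::size_t i = 1; i < presentTimesMs.size(); ++i) {
        const double gapMs = presentTimesMs[i] - presentTimesMs[i - 1];
        if (gapMs < minMs) minMs = gapMs;
        if (gapMs > maxMs) maxMs = gapMs;
        sumMs += gapMs;
    }

    const double intervals = static_cast<double>(presentTimesMs.size() - 1);
    return IntervalStats{minMs, maxMs, sumMs / intervals};
}
```

A wide spread between `minMs` and `maxMs` is the stutter you can see on the camera footage, just expressed as numbers.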
"But the system is taking 100% CPU"
Yes, at this point three members of my department all pulled this face:
It has to be said the manager in charge of this developer is perhaps the worst development manager I've ever met, a man so inept he alienated and lost developers at a prolific rate, who spent £40,000 on migrating to Perforce rather than paying his team and just using git. Who seemed to think he could pick up modern development techniques by osmosis.
The developer himself? Well, let's see how he handled all this evidence from the camera footage....
That's right, he lost his shit.
Despite myself, a coworker and our manager all confirming our observations, that the hitching and stuttering were coming from the child process itself seeming to go idle and so calling Present at differing intervals, he clearly took this to be a personal attack. He stood in our development lab and basically tore into our system; our whole team took a verbal berating from this junior's gob and his manager relished it (this was one of the earliest examples I saw of this manager relishing his minions going to town on others instead of reining them in).
Rocked by their insistence, and being in the unhappy situation of having to prove our innocence, our manager asked for us to be able to review their code. They refused. There were some issues here: our company had just conjoined with the place this developer worked, and they still saw us as interlopers.
My colleague was known to be somewhat more volatile than myself, so our manager left me on this horrible issue, trying to figure out what was wrong.
Come the Friday, and a little exasperated after two days, I started to decompile their executable. It appeared to me that their main loop was misbehaving; it seemed to be just calling Sleep with a fixed value on top of the work it had carried out each loop. And Sleep takes a variable amount of time.... It was not taking a check of how long their update took and then sleeping for the difference to meet the frame rate target, and then it dawned on me.... Did you spot it too?
They were calling SLEEP.
The Win32 API documentation literally says Sleep is not a fixed interval: it relinquishes your thread for the remainder of your time slice, for up to or more than the sleep time given, returning when the Windows scheduler next wishes to make your thread active.
It still says similar today, the time slept is not guaranteed.
This client process main function seemed to be calling Sleep with a fixed value each pass; it was hardcoded into the executable. If you sleep for X+Unknown, you are going to see exactly this staggering. Our demo child application had a busy wait. It never slept; it yielded by passing a sleep of zero (which makes it give up its time slice but remain ready, and it would more or less always be rescheduled before the next fixed frame rate time point, plus we interpolated our animation example).
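To give a flavour of that busy-wait-with-yield idea, here is a rough sketch in portable C++ rather than Win32. It is not our demo application's code: `yieldUntil` and `runFrames` are names invented for this example, and `std::this_thread::yield()` is standing in for the `Sleep(0)` call described above.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Wait for an absolute deadline by repeatedly giving up the time slice
// instead of asking to sleep for a duration. The thread stays ready to run,
// so it is normally rescheduled well before the deadline passes.
void yieldUntil(Clock::time_point deadline) {
    while (Clock::now() < deadline) {
        std::this_thread::yield();  // stand-in for Sleep(0)
    }
}

// Runs `frames` frames, pacing each one against a deadline measured from the
// start of the run, so one late frame does not push every later frame back.
// Returns the total time the run took.
template <typename FrameWork>
Clock::duration runFrames(int frames, Clock::duration period, FrameWork doFrame) {
    assert(frames > 0);

    const Clock::time_point start = Clock::now();
    Clock::time_point deadline = start;

    for (int i = 0; i < frames; ++i) {
        doFrame();            // update and render would go here
        deadline += period;   // next frame boundary
        yieldUntil(deadline); // spin politely until that boundary
    }
    return Clock::now() - start;
}
```

The trade-off is that the process never really goes idle, which is fine on a box dedicated to running one game but not something you would want in a desktop application.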
This child process was just proving more and more to not be to specification.
Their manager insisted it was to spec... Which was very frustrating, as he grandly declared "I have read the code, I know it is correct" and "when we run it, we do not see this issue".
It felt like a fundamental misunderstanding. None of the folks on the team I was within were being treated with any respect, nor was our collective experience and understanding being acknowledged.
I stewed on this all Friday, come the Monday things were getting fractious. This game had to be out! It came up in the master development round up that this game was held up by our system. This started the direct antagonism.
Everything I did, everything other members of the team did, all showed this content application was just staggering and stuttering of its own volition; by design, by intent it staggered. It was not our system!
To prove this I therefore set about writing a harness, which would just give the client process the same DLL to load, but with all the calls stubbed out, and it would run WITHOUT our system. The content could be tested locally and checked. We got one of the same Celeron PCs, just a flat install of the OS, and double clicked my harness.... Sure enough the game staggered about in the exact same way!
I presented an A4 page example and showed how the loop in our example application worked, explaining that Sleep is not a fixed time interval and that a busy loop should be used with a yield, not a time span, anyway. I recorded this with my harness and their game, and I presented both recordings too.
The developer went apoplectic....
He literally shouted at me that Sleep was guaranteed to come back after that amount of time, and when he ran my DLL shim locally it ran really smoothly.
He was right, it did, but he had a much different dual core machine with 4GB of RAM and a graphics card. Our platform was a single core Celeron 1GB of RAM and built-in Intel graphics shared vram, in short very different.
At this point the director who sat between my manager and this development manager ordered that I be allowed to look at their code.
The worst development argument I have ever had then hit a peak, as I walked over, flanked by my manager and their manager. I leant over his shoulder, pointed to the Sleep function and said that is not going to be a fixed period.
That was all I said, he never let me explain any further, he just went MENTAL! He started shouting, screaming, and called me a few choice names. He would not accept all the evidence that his loop was not a fixed length, that it was changing from frame to frame, he just could not figure out that:
{
    UpdateStuff();
    Render();
    Sleep(33);
}
Was not going to always take a fixed amount of time. First of all, I pointed out that doing anything and then sleeping like this means each pass takes the time the work takes plus at least 33 ms, and then there was no guarantee that the sleep would come back immediately. The Windows scheduler would decide when you can come back, after at minimum that amount of time rounded to the nearest platform tick.
His manager immediately backed him up and agreed with him, and they both talked down to me. They insisted loudly and angrily that Sleep was fixed and that the functions they had took such a trivial amount of time they were not worth measuring....
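If you want to see this for yourself, a minimal sketch along these lines will do it. It uses portable C++ (`std::this_thread::sleep_for`) rather than the Win32 `Sleep` call, so the exact numbers will differ from what we saw on that Celeron, and `naiveFramePeriod` is a name made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Times one pass of the "do the work, then sleep a fixed amount" loop body.
// The result is the work time, plus the requested sleep, plus however long
// the scheduler takes to wake the thread up again afterwards.
template <typename Work>
Clock::duration naiveFramePeriod(Work doWork, std::chrono::milliseconds sleepFor) {
    assert(sleepFor.count() >= 0);

    const Clock::time_point start = Clock::now();
    doWork();                               // stands in for update and render
    std::this_thread::sleep_for(sleepFor);  // blocks for at least sleepFor
    return Clock::now() - start;
}
```

Call it a few hundred times with a 33 ms sleep and print the results: the period is never less than the work plus 33 ms, and how far over it lands varies from pass to pass.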
Even my manager backed me up here, of course doing a function call, any function, will take some amount of time and they need to take that into account.
I had just had these two idiots literally shouting at me, whilst I had to stay so calm. It took an icy handful of minutes for them to accept the argument that 33+N is > 33 where N is non-zero. It was just fundamental and they were not having it.
Their code became:
{
    startTime = Now();
    UpdateStuff();
    Render();
    endTime = Now();
    Sleep(33 - (endTime - startTime));
}
Slightly better, but we still saw hitches and stutters, they were far far less frequent now.
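As an aside, a loop like that has one more trap in it which is separate from the stutter we were chasing: if the work ever takes longer than 33 ms the subtraction goes negative, and Win32 `Sleep` takes an unsigned `DWORD`, so a negative value turns into an enormous sleep. I don't know what types their `Now()` returned, so treat this as a sketch of the guard under that assumption; `remainingFrameMs` is a name invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Milliseconds left in the frame budget, clamped at zero so that an overrun
// frame asks for no sleep at all instead of wrapping round to a huge value.
std::uint32_t remainingFrameMs(std::uint32_t budgetMs, std::uint32_t elapsedMs) {
    assert(budgetMs > 0);
    return elapsedMs >= budgetMs ? 0u : budgetMs - elapsedMs;
}
```

The loop would then sleep for `remainingFrameMs(33, endTime - startTime)`, which still overshoots by whatever the scheduler adds, but at least never stalls for an absurd length of time.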
This massive drop in frequency immediately, and without changing my argument, pointed me to the Sleep. As I had said, Sleep is not a fixed time; it was not going to sleep for X and come immediately back, that's not how Windows worked.
Their argument was that Sleep was fixed, that it was guaranteed to return after X.
They were very loud, very obnoxious and very adamant.
We returned to the developer's desk: "Show me why you think Sleep is fixed".
I expected him to bring up some code, some harness, some proof of his thinking. Instead he opened Internet Explorer, went to MSDN and showed me the Sleep function documentation.
Sure enough it said "fixed interval". He was so smug. So infuriatingly smug. His manager was ultra smug too.
I reached down, scrolled the mouse and pointed to the screen....
He was reading Sleep in the Windows Mobile SDK. He's right, on Windows Mobile Sleep is a fixed interval. However, we're not on Windows Mobile, are we?
My manager looked at the screen, I looked at the screen, they looked at the screen. And immediately the developer called me a horrible name, yup, just straight up called me a name.
I have to admit I didn't react well. Fisticuffs didn't happen, though the way he erupted out of his seat raging I expected the guy to swing for me.
He could not take it, and his manager still argued he was right; so invested in their mistake were they that they could not admit their misunderstanding. The manager always claimed he only hired the best minds, this guy was quite arrogant, and the whole lot of them were generally very dismissive of both myself and the department to which I belonged.
I walked away with my head up high. My manager stood and pair programmed with both their manager and their developer for maybe twenty minutes and a new version of the content executable quietly appeared without any stutter; even when we ran obrut!
It was a horrible moment in my time with that employer, I remember how the guy never apologised, that development manager never apologised and the game went out without any further delay, but they never received any censure for the episode.
Our department also never shook this kind of effect either, for some reason because their manager had gone to bat for them from the off every following time a performance issue arose we had to prove everything to the Nth degree without ever seeing the other side doing the same. Very rarely was it ever truly our issue.
I have never forgotten, I have never forgiven.
When you foul up, just admit it, owning it and learning from it is far more wholesome than being uptight and obtuse.