Showing posts with label stl. Show all posts
Showing posts with label stl. Show all posts

Wednesday, 25 June 2025

XVE Game Engine : Ecs Containers & Update

This post title might look mightily like the post I made in January it is related, but we're going to discuss a different part of my Entity Component System and that is the containers backing my type registration and the actual values stored per component.

I've seen a few different entity component systems and I'll draw parallels with a few well known public domain ones.

The first is EnTT, it uses the same principles as my system which are sparse sets to store the metadata in regards the components per entity, my implementation also uses sparse sets however inspecting their code repository file names alone you can see maybe three container types a dense set, a dense map and a table.

I have known maps used to point to vectors of stored components such that the entity handle or id is the index into the components it contains, and then all the other vector entries are nulls (or otherwise empty).

And then in yet other systems I have seen a fixed number of components (64) possible being represented by binary flagging or bitsets to indicate the presence (or absence) of a component on an entity, then a linear storage of each component in a fixef size array.  This was a curious implementation, relying on Frozen to give constexpr & consteval Ecs ability to a certain extent at compile time.

My implementation started with the need to map the types (just any type) and then to allow those types to assigned to the entities, very much the sparse set model of EnTT and its cousins out there.

Why not just use EnTT?  Well, where's the fun and understanding built out of that?

So where did I begin?  Well, I began by just using a std::map, std::unordered_map and std::vector and a piece of hashing code to convert the type to a compile time evaluated id code; so I can for any type get a 64bit number which is its unique (or nearly unique) hash.  I am not using RTTI and infact compile with this disabled for speed.

With my types mapped and components going into the entities and being able to see which types were assigned to an entity (see January's post) I set about looking at more specialised containers; and finally we reach the meat of this post... Or should I say Paper?

As my self-review, redesign and frank improvements began by the careful selection of the AVL Tree as my container of choice, a self balacing binary search tree, often lumped in with Red-Black trees.  I had spent a little time in 2019 invaluating a red-black tree implementation for work for an ex-explorer, and watched on as a collegue did the same evaluation for my current.  I had not been best impressed with the lob-sided effects in the former, and the latter went with an off the shelf implementation from EASTL.

 

 My reading began with a few white papers, a couple of older text books, the STL guide and finally a read through the second edition of Grokking Algorithms by Aditya Bhargava.

This final read cemented the AVL Tree as my target container, and I set about implementation from the books examples.

My first addition was to include a visualization pass outputting .DOT format diagrams for passing through GraphViz... And so it was the first output, inserting an "A" into a new tree appeared:


 

I find these tree diagrams most full-filling, and had to research the trick to lay the diagrams out as an actual tree like this, but soon I had test cases galore.


Inserting a second item and then a third, where we see the tree rebalance itself:

 

This rebalancing is the strength, I can take all the inserts of new types at the start up and then balance once; cost once, to get much faster look up than hashing every time and traversion a tree.


 
If you can't spot the re-balancing in the leaves, just watch for the root changing, once it does take all the left arms to find the lowest level index (zero) containing 'A' and then march parent, right, common parent and so on and so forth.

My code to flatten and allow this traveral is not very optimal, it can only really allow you to pre-allocate the correct number of keys to then follow.  But again if the structure is not changing often (which is it now with type registrations) then it becomes the most optimal runtime solution I found.

Only the Frozen in compile time version, with its inherant more complex compile & lack of debuggabilty was faster.  But then anything you can pre-compute is going to be faster.

I did however find an edge case, depending on the order of insertions, and hence when rebalacing happens, the same data can result in variants on the shape of the tree.  Take the name of my wife and some of my pets....

Compare this with the same data inserted in a different order:

These two trees contain the exact same data and same key set; but they are different in form.

I had to decide what constituted equality, this is where I resorted to a left-right linear traversal per node and this results in an equiality check; it's not quick, and not really intended for production code to use, it is simply part of my test suite.

This difference, or cardinality one might say, is a temporary effect on the data, and I wondered whether it would have a knock on effect for determinism in my code.

As you may guess, one tree in traversal (key) order and then reloading could dramatically change the order of the tree node relationships, as such I could not rely on the direct pointer (or dereference) to any node over another, no instead the key value was literally key.

I found this effect of saving and reloading to "rebalance" the tree had the benefit when documenting my registered entity types that having gone through this save & load the graphs output where the same, I termed this a sanitzation pass on some of my data and I'm sure I'll return to it.

The last feature I added to the AVL tree implementation was the removal of nodes, this may sound trivial, and to an extent it is.

However, I had programmed the use of my AVL Tree with "forward intent" this is a rather short sighted way of working that you produce code only to do the exact features you need, you either ignore, or ticket and backlog the "nice to haves".

So it was after a fortnight of working on the Containers, the speed, the hashing within my engine, not to mention the test suite & graph viz pass I had dozens of types registrations, thousands (probably hundreds of thousands of each entity type being created)... And then I needed to destroy an entity.

Quite simply it never came up, I'd forgotten it as a feature and test case, because sometimes when you're working at near midnight tinkering with tech, you do forget things.

So there you go, my little sojourn into AVL Trees for my home engine Ecs implementation, and I learned a lot along the way in just this one little case of my own making.

Anyone out there looking to take up programming, do it, just start simple and move up along the way.  I'd say start simple, with a language which can just run on any box, without too much hassle, just a text editor and a language to use; like python.  Progress in small steps.


Thursday, 25 May 2023

Just Stand It Up: About Premature Pessimization

Engineers often talk about premature optimization, but today I'm going to just talk briefly about the opposite, premature pessimization.

I currently work on a very large code base, it has been developed over four years from scratch.  One of the first things performed were a series of investigations into "best performing" data structures, such as maps, lists and so forth.

Now of course one has to totally accept one can optimize any data structure that little bit more for a specific use case.  One also accepts that when in C++ the standard library lends itself to being replaced bu defining standard operators, iterators and the algorithms going with all this use those standard exposed APIs, so you can implement your own.

I just want you to stop though and think... Do you want to?

Too early in a project and you can start to introduce bloat, either in terms of slight differences in the optimized cases, the code from whatever third party "best" you picked and even from your build chain, as you are bringing in dependencies on someone else.

The standard library doesn't do any of this, its dependencies are guaranteed by the ABI.

So why not just use standard map, or standard vector and standard string, or standard formatting?

Quite often I'm finding it is due to premature pessimization, that developers voice who cries out about some issue they had, either when some technology was new and emerging or late in an earlier project's life where they had to optimize for that specific case I mention, where the standard version did prove itself to be a detriment.

These engineers carry with them the experience, sometimes scars, from such exposure to edge cases and bugs they had to quash.  Rightly and understandably they do not want to experience these self same issues again; their minds therefore are almost averted from just standing it up with the standard version.  They immediately seek and even proactively nay-say the standard versions in favour of domain specific "best" versions.

This is in my opinion the very definition of premature pessimization, the standard library is wonderful, diverse and full of very well tested code and will have nearly zero overhead in adding and using to your project in C++.

I would therefore coach any developer with such mental anguish over just using the standard library to simply stand it up, just get across that line of things both building and running, then extend it to remain maintainable.  And finally as you think you're getting close to stable, well then you can expend more time looking at, profiling, and understanding the edge cases.

Wednesday, 29 March 2023

A mini-break in C++ Optimization (Pt 1)

I have had one of those evenings, in reviewing some code for a friend I came across a repeated pattern where he would iterate over collections, strings and other linear items in a loop and mutate them.  We're of course talking C++ here, and he had this whole myriad of loops all over sanitizing, processing, event summing values.

I asked whether he had access to the STL he immediately got a bit uppity, he and I have never seen eye to eye about using the STL.

I advocate it be used, even just standing something up which you go back and make more performant or more memory efficient later, just stand up your project, don't get bogged down in details of the best string, or the best vector.

He however stands by his having "long tested examples I use all the time, just ignore them"...

No.

Simply put his examples, used in a smattering of programs, peer reviewed very infrequently by a miniscule handful of people is not good enough, use the STL, it is seen by literally millions of engineers, it is very portable and well adopted, use it to your advantage.

We sparred a little about this, but then I simply asked whether he had any parallelism in his algorithms?

Nope, of course STL can just use them.... You can provide the execution policy.

Lets take one of his examples, a simple function to give him the time & date as a character array.

char* GetDateTimeStringHis()
{
time_t now = time(0);
char* buffer = new char[64];
memset(buffer, 0, 64);
ctime_s(buffer, 63, &now);
for (int i = 0; i < 64; ++i)
{
if (buffer[i] == '\n') buffer[i] = 0;
}
return buffer;
}

This returns a character array buffer raw pointer.... No, he interjected, it returns a string.  A raw pointer to memory is NOT a string folks, this is where my argument on that point begins and ends.  This code is very unsafe, I don't like it and I certainly insist it can easily leak memory, for whom is going to delete that buffer?

So I erase that buffer with an actual standard string:

        std::string GetDateTimeStringV1()
{
time_t now = time(0);
std::string buffer(64, 0);
ctime_s(buffer.data(), 63, &now);
for (int i = 0; i < 64; ++i)
{
if (buffer[i] == '\n') buffer[i] = 0;
}
return buffer;
}

he didn't like this, but accepted perhaps that was better, he argued about the string constructor taking a length and a character (zero) to initialize itself with, but went with it.  It's safer, simpler, easier to read code.

I'm still not happy, I don't like that loop:

        std::string GetDateTimeStringV1()
{
time_t now = time(0);
std::string buffer(64, 0);
ctime_s(buffer.data(), 63, &now);
std::transform(
buffer.begin(), buffer.end(), buffer.begin(),
[](const char character) -> char
{
if (character == '\n')
{
return 0;
}
else
{
return character;
}});
return buffer;
}

He hated this, and I have to be honest I don't like the lambda either, but this is correct, using the transform algorithm and the lambda could be expressed in a location it could be reused, or just as a free function somewhere, it doesn't have to be inline here making a mess of reading the code.

What however is it doing?  It changes any new line to a zero in the buffer string, why?  Well ctime_s puts a new line at the end, so if we reverse the iteration we may get a much better performance we can early out of the loop.  But we can even just do away with the whole transform, we know it's the last character so just set it to zero as an optimization and simplification.  My friend can get behind this now:

        std::string GetDateTimeStringV2()
{
time_t now = time(0);
std::string buffer(64, 0);
ctime_s(buffer.data(), 63, &now);
buffer[std::strlen(buffer.data())-1] = 0;
return buffer;
}

We can even further think about this code, what will strlen do?  It might count from 0 to N to find the first zero character as the value, could we therefore count backwards from the 64th position in the buffer?  Well, we know the format of the data and time we get is known:

        Wed Mar 29 23:36:01 2023

Twenty five characters with the new line, adding another for the null termination character, so we can always just set the twenty fourth to zero and reduce the buffer size at the same time to perfectly fit!

        std::string GetDateTimeStringV3()
{
time_t now = time(0);
std::string buffer(26, 0);
ctime_s(buffer.data(), 26, &now);
buffer[24] = 0;
return buffer;
}

I feel I'm getting somewhere, I just want to make sure we use the STL version of the time functions now and use brace initialization for the "now".

        std::string GetDateTimeStringV4()
{
std::time_t now{ std::time(0) };
std::string buffer(26, 0);
ctime_s(buffer.data(), 26, &now);
buffer[24] = 0;
return buffer;
}

This ended my concerns and improved the performance of the code, he was calling this function nearly every time he logged, nearly every time he sent a network message and for very many of his database transactions.  This was 10 minutes of just thinking about the problem.

Now... Back to <algorithm> and execution policy with him...

Monday, 21 November 2022

It's a C++ STL Rant : Constness and Vector Erase

I believe I am right and correct to be seethingly annoyed that this code does not compile...


#include <vector>
#include <iostream>
#include <algorithm>

struct Test
{
    const float value;
    int changable;
};

int main()
{
    std::vector<Test> items;
    items.push_back({42.5f, 0});
    items.push_back({1.23f, 1});
    items.erase(items.begin());
}

It just really annoys me this won't compile, do you know why?  Don't cheat, just have a look why does this PERFECTLY REASONABLE AND ACTUALLY I THINK CORRECT CODE NOT COMPILE?

Yes, because of the ISO Standard, and it's really annoying.

ISO/IEC 14882:2003 23.1 [lib.container.requirements] / 3:

The type of objects stored in these components must meet the requirements of CopyConstructible types (20.1.3), and the additional requirements of Assignable types.

The compiler could have been told this far more simply if the std::vector had an earlier and cleaner static_assert that the type is copy constructible before it got into the weeds of the real message we get.

For can't compile because it seems the standard implementation of erase results in a bunch of move semantics in the vector, which then devolve into copies.  And to my simple monkey brain it's all just babbling horrible template junk that comes out the compiler as well.

You can of course move a struct of type Test perfectly well, or at least mark it for move (cast it thus):

#include <vector>
#include <iostream>
#include <algorithm>

struct Test
{
    const float value;
    int changable;
};

int main()
{
    std::vector<Test> items;
    items.push_back({42.5f, 0});
    items.push_back({1.23f, 1});

    Test bar { 123.456f, 2};
    items.push_back(std::move(bar));    
}

This compiles and runs just fine!

But it's not copy assignable.  Let us take a look at the compiler spew:

In file included from <source>:1:
In file included from /opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/vector:60:
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/stl_algobase.h:427:4: error: static assertion failed due to requirement 'is_move_assignable<Test>::value': type must be assignable
          static_assert( __assignable::value, "type must be assignable" );
          ^              ~~~~~~~~~~~~~~~~~~~
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/stl_algobase.h:495:22: note: in instantiation of function template specialization 'std::__copy_move<true, true, std::random_access_iterator_tag>::__copy_m<Test>' requested here
                              _Category>::__copy_m(__first, __last, __result);
                                          ^
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/stl_algobase.h:522:19: note: in instantiation of function template specialization 'std::__copy_move_a2<true, Test *, Test *>' requested here
    { return std::__copy_move_a2<_IsMove>(__first, __last, __result); }
                  ^
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/stl_algobase.h:530:8: note: in instantiation of function template specialization 'std::__copy_move_a1<true, Test *, Test *>' requested here
                std::__copy_move_a1<_IsMove>(std::__niter_base(__first),
                     ^
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/stl_algobase.h:652:19: note: in instantiation of function template specialization 'std::__copy_move_a<true, __gnu_cxx::__normal_iterator<Test *, std::vector<Test>>, __gnu_cxx::__normal_iterator<Test *, std::vector<Test>>>' requested here
      return std::__copy_move_a<true>(std::__miter_base(__first),
                  ^
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/vector.tcc:179:2: note: in instantiation of function template specialization 'std::move<__gnu_cxx::__normal_iterator<Test *, std::vector<Test>>, __gnu_cxx::__normal_iterator<Test *, std::vector<Test>>>' requested here
        _GLIBCXX_MOVE3(__position + 1, end(), __position);
        ^
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/stl_algobase.h:656:44: note: expanded from macro '_GLIBCXX_MOVE3'
#define _GLIBCXX_MOVE3(_Tp, _Up, _Vp) std::move(_Tp, _Up, _Vp)
                                           ^
/opt/compiler-explorer/gcc-12.2.0/lib/gcc/x86_64-linux-gnu/12.2.0/../../../../include/c++/12.2.0/bits/stl_vector.h:1530:16: note: in instantiation of member function 'std::vector<Test>::_M_erase' requested here
      { return _M_erase(begin() + (__position - cbegin())); }
               ^
<source>:20:11: note: in instantiation of member function 'std::vector<Test>::erase' requested here
    items.erase(items.begin());
          ^
1 error generated.
Compiler returned: 1

We read this from the bottom up, first erase needs to call "_M_erase(begin() + (__position - cbegin())) what on earth?  Well it appears this is trying to perform a copy-move of the rest of the vector; the rest of the stack all the way to the top shows us various iterator types to the top two messages where we have __copy_m and then a compile-time static assert that the type must be assignable.

With the intentionally const member we can't compile we therefore have two options, first we make the member non-constant.  This isn't very practical because the type is derived from a bunch of runtime values, they're not known upfront and can be different for each instance of the struct.

The second option therefore is even less appealing and looks like this hideous mess:

#include <vector>
#include <iostream>
#include <algorithm>

struct Test
{
    const float value;
    int changable;

    Test& operator=(Test pOther)
    {
        float& mutableV { const_cast<float&>(this->value) };
        mutableV = pOther.value;
        this->changable = pOther.changable;
        return *this;
    }
};

int main()
{
    std::vector<Test> items;
    items.push_back({42.5f, 0});
    items.push_back({1.23f, 1});

    Test bar { 123.456f, 2};
    items.push_back(std::move(bar));

    items.erase(items.begin());
}

We define this assignment operator performing this const reference magic and just.... ergh, don't do this folks, just don't.  Not only is this an excellent lesson in const member flagging in STL and it's pitfalls.  But then it's also a lesson that const doesn't really mean const.  AND it's a lesson on whether you want to keep a constness or nor given such awful hacks to get around it.

I could even go further than this, the operator= could be made private and the STL type "std::vector<Test>" could be made a friend, or the __copy_m could be made a friend; though in that last case the MSVC implementation of the STL will be different, so we'll have to work around platform differences too.

No, the most annoying thing here is picking open quite why __copy_m wants to end up assigning.  I am still digging into why, but oh boy am I annoyed by this one.  For in a module I unironically had to repeat at University solved this in 1996.... I would use a linked list, and simply pluck out the value... or I would skip to the index of the item to remove and copy those after (if they were host memory contiguous then memcpy speed achieved) and just write over and this erase is achieved.

To just see this wanting to copy assign at the element level is just hideous, and I hate it, and I with the ISO standard would change in this regard,

Thursday, 30 November 2017

C++ : Trust the STL

One sore lesson to teach some developers is when to trust the compiler, once you've gotten that across you have to start teaching folks to stop re-inventing the wheel.

If someone has already implemented a file handler, or a serial port abstraction, or a wrapper for some obscure feature, you need to evaluate that offering...

To evaluate whether a library is worth using, firstly see if it works, then see how many folks actually use it, the more that use it then the more likely bugs will be flushed out and the whole thing has been tested.

Leveraging this kind of mature code within your projects assists in bootstrapping the startup phase of new projects.

Boost is a note worthy example of what I'm talking about here, many software shops (at least the ones I know) resist using open-source or third party libraries, they prefer to stick to in-house developed niche implementations until the very last moment, this of course slows development and completely symies innovation.

Boost however is one step further than the problem I'm going to tackle today... The Standard Template Library...

The STL is often commented upon negatively, this is despite it being a hugely available resource, vastly and deeply tested throughout and constantly incorporating new innovations.  Whole books have been written on the topic, and yet one can still find projects and individuals resisting using the STL.

STL nay-sayers will quote "no need for an STL requirement", "uses less memory than an STL implementation" or "faster than the STL"...

The problem with this attitude is, are such attitudes going to sufficiently tackle testing of their bespoke solution, is that bespoke solution going to be as robust or as easily maintained as something using the STL?

Probably not, and this is a hard one for die hard "purist" developers to swallow, we want to write all our own code, we want to be gods in our domain, the trouble is for the vast number of us, god has already been there and he wrote a decent enough library to do the task we need doing... So leverage this!

I came across one such niche item the other day, with an algorithm to see if a string starts with...

They hadn't used boost, or the STL, to do the searching, yet perversely had used an std::string... Their code, looked a little like this:

const bool StartsWith(
const std::string& p_Text,
const std::string& p_Pattern)
{
bool l_result(true);

if ( p_Text.length() >= p_Pattern.length() )
{
for (unsigned int i(0);
i < p_Pattern.length();
++i)
{
if ( p_Text[i] != p_Pattern[i] )
{
l_result = false;
break;
}
}
}
else
{
l_result = false;
}

return l_result;
}

It is fairly logical code, they're looking at the length of the presented parameters, to avoid looping when not required, then they only loop from the start and only return a fail when the character is a miss-match, looking at this with programming eyes from 1996, I'd say this is fine.

Looking with eyes well aware of the STL, I cringe a little, and replaced this whole function like this...

const bool StartsWith(
const std::string& p_Text,
const std::string& p_Pattern)
{
return (p_Text.find(p_Pattern) == 0);
}

One line, of very much more maintainable, vastly more readable and easy to comprehend code...

The developer of the original however was not happy... "you're wasting resources, this will find any instance and tell you the input".... he's right it will, but the STL will still be faster than his code.

I demonstrated this by plugging both into CompilerExplorer... He still refused to listen.

Therefore, I've written this little helper project, to run the two functions side by side, threaded three tests, looking for the match, a long match and a negative match at the start of the string (Code on Github).


The results of this are interesting, you see the project itself favours cases where it's highly likely the string being searched for is present and therefore we don't need to worry too much about the odd test not finding a match taking longer... This is exactly the behaviour seen in the STL based find example.


The Short search time, for the same data, on the same processor went from 28358 microseconds to just 5234... That's about 81% faster.  The longer search is more stark, falling from 185966 microseconds to just 6884, just over 96% faster!

The rub is the negative case took longer, rising from 19765 in the hand-crafted search to 25695, just over 30% slower.  Some of this increase can be explained perhaps by the hand-crafted version using the lengths to quickly skip too short an input, otherwise it is simply that the STL find has to iterate over the whole string when no match is found.  A hybrid to not perform the find at all, when there is insufficient data maybe in order; however this may add to our maintenance burden and lower code clarity, swings and roundabouts.

However, clearly in the case of this project, dismissing the STL resulted in slower code, we have a system propensity for matches, they're quite short, and all target platforms have the STL built in, use it.

Never be affraid to ask questions of what you're working with, ever.

Sunday, 16 October 2016

Software Engineering : RegEx in C++ 11/14 with STL

I want to show you how the STL regular expressions in C++ work.... Note, complete source code & makefile at the bottom.

First we'll need a make file, I'm using Ubuntu Linux with GNU G++ v5.4.0, opening a terminal I get a text editor up and we create this makefile:

CC=g++
STD=c++14
WARNINGS=-Wall -Wfatal-errors
FLAGS=-pedantic
OUTPUT=application
FILES=main.cpp

all:
$(CC) -std=$(STD) $(WARNINGS) $(FLAGS) $(FILES) -o $(OUTPUT)
clean:
rm $(OUTPUT)
And I save that as "makefile"... Next we need a simple "main.cpp" file to test this with, so go a head and write:

#include <iostream>
#include <string>

int main ()
{
std::cout << "Hello World" << std::endl;
}

Save that and we can be in the folder and simply type "make".  Everything should complete cleanly, and you can then type "./application" to run the resulting "Hello World" application.

Now lets go back into the main.cpp, and we'll write a function... I'm not going to show you this as proper C++, I am not going to teach you about classes, so just go with me, we'll write this function above the "int main()" we just created, and then we'll define a function to split a string into words whenever it finds a space....

#include <regex>

std::vector<std::string> SplitString (const std::string& p_Source)
{
std::vector<std::string> l_result;
// The actual regular expression
std::regex l_regularExpression ("(\\S+)");
// Process the whole source string through the filter
auto l_regularExpressionResult = std::sregex_iterator(
p_Source.begin(),
p_Source.end(),
l_regularExpression);
// Use the result iterator to get all the individual strings
// into the result vector of strings
for (auto i = l_regularExpressionResult;
i != std::sregex_iterator();
++i)
{
auto l_item = (*i);
std::string l_TheString = l_item.str();
l_result.push_back(l_TheString);
}
// Return the result
return l_result;
}

Lets just take a look at this working, into your main and do this:

int main ()
{
const std::string l_SourceString ("Mary Had a Little Lamb");
std::vector<std::string> l_words = SplitString(l_SourceString);
for (auto i = l_words.cbegin();
i != l_words.cend();
++i)
{
std::cout << (*i) << std::endl;
}
}

We can save, exit and build the program again, running it we see this:

Mary
Had
a
Little
Lamb

So what did our new "SplitString" function do?  Well, lets first of all hope you're comfortable, with STL iterators, because we use one to go through the source string and then another to go through the expression result.

Our important lines of code are, std::regex l_regularExpression ("(\\S+)");  where we define the regular expression string, no I'm not going to teach you all the ins and outs of creating those strings, this expression however just gets individual strings.

The next important line is: auto l_regularExpressionResult = std::sregex_iterator(  where we are going to use the sregex_iterator constructor to actually apply the filter we created on the previous line, and we apply it to the span of the whole source string "begin()" to "end()" on the std::string::iterator there.

We could try to use the std::string::const_iterator too, by simply substituting with "cbegin()" and "cend()".

The final parameter is passing the actual filtering regular expression into place.

The result, and we don't need to worry about the type as we're leveraging auto there, is a copy of the iterator.  Depending on the STL implementation you have will define when the processing takes place, some versions will process as you iterate over the sregex_iterator, making you process the input on the fly, whilst others pre-process everything, holding off your code moving to the next line of code (when you step through) until the complete source has been processed through the regular expression.  This can be a performance trap for some, as they either think it will process, when it does not, or it does not until you iterate, and confusion ensues.  Especially when you are writing cross platform code and the platforms express different behaviours.

The last important piece of code is actually going through the result to see if there is anything in the resulting iterator.

The awkward piece of using auto shows up here, because on some platforms when you try to iterate through the result and get each string you might want to do "(*i).str()" rather than assigning the dereference (*i) to an auto first.  However, some compilers (especially when using -pedantic, GCC on this one) don't like this, so to make the code more maintainable and pre-empt it being on any platform where the dereference of the iterator is reported to "not contain a definition for "str()", I simply assign the dereference to an auto called "l_item" and then use "l_item.str()"... That's a lesson in maintainable code right there folks.

That is a very basic introduction to regular expressions, you can see why I have gone through this below.

Right now through, lets use a more complex regualr expresion, and avoid the complexity of the interator stuff, lets just validate a string as a UK Postcode:

const bool ValidatePostcode (const std::string& p_UKPostcode)
{
std::regex l_Validate ("^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$");

return std::regex_match(p_UKPostcode.c_str(), l_Validate);
}

This might look a little cramped, but I never wanted to make a mistake with the reg-ex.  This isn't a perfect solution btw, I'm still writing a test routine to check it against a full list of UK postcodes online, I think it will let some stranger codes through as valid, but they are edge cases, this will work for 99.5% of addresses, and 100% of those I've tested so far.

There you go, good luck!


=== WHY DOES THIS EXISTS ===
Today I've been using regular expressions in C++, some might consider this dark magic, however, I assure you it is all above board, the problem was the validation of a UK postcode, there was a quite terrible function:

bool Validate(char *Code);

Defined, which had all manner of hackery and trouble within, not least it could not handle some London Postcodes, we'll come back to postcodes later, however I replaced all the functionality of this code with two lines of code... Literally two, it went from around 500 lines of un-maintainable junk, to the two lines of active code to manage which you see above, I in fact could have placed the regular expression string into our master list of "strings" to yet further minimise where constants are defined, but I left that to him, left him a small victory to coerce acceptance of my drastically demonstrating his not thinking about the code changes needed, and spending all week on something which took me two lines and about 10 minutes to make sure the regex was right!

Handing it back to the owner, after my peer review, I think they wanted to cry, instead they rushed off to our common Director, avoiding all code managerial level input from fellow programmers, and said I had "shown them up by using a third party library".

I had used STL, something we use elsewhere, I had also followed the coding standards which exist, so the function had become:

const bool ValidatePostcode(const std::string& p_UKPostcode) const;

This, I think you must agree, is more informative as to what it does, it tells us we can't edit the values, we're still passing everything by reference but we're not changing the type of our system string handing from "std::string" to "char*" and we also define that the function changes nothing in the class it is within with a trailing const.

All these rules are in the coding standard, folks before you go around a peer to complain; a more senior peer at that; please check you are in fact on the right track.

So, having validated my changing the function prototype, I had to explain why I had used a third party library (as all such libraries need formal evaluation)... "Regular Expressions are in the standard library".... Was my simply reply... "Only in the latest technical release!"... Was the mouth frothing reply from the hurt chap.  "No, they've in C++11, we use STL all over the code, it is formally evaluated and signed off by everyone, including yourself".

The guy looked extremely crest fallen, and whatever his motivations for having a go at myself, I realised he just didn't know, he'd not read the books I had, he's not used the code as I have, and he'd simply always used regular expressions from third party sources, and that's fine, but please folks just check  your coding standard and have at least a look on google, before you go shouting to those above in an unprofessional manner.


---- THE COMPLETE SOURCE (main.cpp) ----

#include <iostream>
#include <string>
#include <regex>

std::vector<std::string> SplitString (const std::string& p_Source)
{
std::vector<std::string> l_result;
// The actual regular expression
std::regex l_regularExpression ("(\\S+)");
// Process the whole source string through the filter
auto l_regularExpressionResult = std::sregex_iterator(
p_Source.begin(),
p_Source.end(),
l_regularExpression);
// Use the result iterator to get all the individual strings
// into the result vector of strings
for (auto i = l_regularExpressionResult;
i != std::sregex_iterator();
++i)
{
auto l_item = (*i);
std::string l_TheString = l_item.str();
l_result.push_back(l_TheString);
}
// Return the result
return l_result;
}

const bool ValidatePostcode (const std::string& p_UKPostcode)
{
std::regex l_Validate ("^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$");

return std::regex_match(p_UKPostcode.c_str(), l_Validate);
}

int main ()
{
const std::string l_SourceString ("Mary Had a Little Lamb");
std::vector<std::string> l_words = SplitString(l_SourceString);
for (auto i = l_words.cbegin();
i != l_words.cend();
++i)
{
std::cout << (*i) << std::endl;
}

// Postcodes
std::cout << "--- Postcodes ---" << std::endl;
std::cout << ValidatePostcode("NG16 5BP") << std::endl;
std::cout << ValidatePostcode("NG10 1NQ") << std::endl;
std::cout << ValidatePostcode("Robert") << std::endl;
std::cout << ValidatePostcode("FP52 JTY") << std::endl;
}

---- makefile ----

CC=g++
STD=c++14
WARNINGS=-Wall -Wfatal-errors
FLAGS=-pedantic
OUTPUT=application
FILES=main.cpp

all:
$(CC) -std=$(STD) $(WARNINGS) $(FLAGS) $(FILES) -o $(OUTPUT)
clean:
rm $(OUTPUT)


P.S. Yes this will all work with "STD=c++11" in the make file!