Xavyiy wrote:Btw, how well SoA arranging can perform in the rest of operations? For example, the following code will be far slower due to non-contiguous memory access:
Code: Select all
const Ogre::Vector3 Node::getPosition() const
{
    return Ogre::Vector3( (*mPositions)[mChunk][0][mOffset],
                          (*mPositions)[mChunk][1][mOffset],
                          (*mPositions)[mChunk][2][mOffset] );
}
There's an easy fix, because that's not the memory layout I was thinking of when using SoA.
The memory layout you're thinking is:
A: XXXXXXXXXXXXXXXXXXXXX
B: YYYYYYYYYYYYYYYYYYYYY
C: ZZZZZZZZZZZZZZZZZZZZZ
The memory layout I'm thinking is:
A: XXXXYYYYZZZZXXXXYYYYZZZZ
This way, each node's X & Z components are only 32 bytes apart*, which is small enough to fit in a single cache line. As far as the caches are concerned, the memory is still "contiguous".
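As an aside, the index math for that block-interleaved layout can be sketched like this (hypothetical code, not Ogre's; the names xIndex/yIndex/zIndex are mine, chosen for illustration):

```cpp
#include <cstddef>

// Hypothetical sketch of the block-interleaved layout XXXXYYYYZZZZ...
// ELEMENTS_TOGETHER floats of each component are packed per chunk.
const std::size_t ELEMENTS_TOGETHER = 4;
const std::size_t FLOATS_PER_CHUNK  = ELEMENTS_TOGETHER * 3; // X, Y & Z blocks

// Flat-array index of node i's X component.
std::size_t xIndex( std::size_t i )
{
    return ( i / ELEMENTS_TOGETHER ) * FLOATS_PER_CHUNK + ( i % ELEMENTS_TOGETHER );
}
std::size_t yIndex( std::size_t i ) { return xIndex( i ) + ELEMENTS_TOGETHER;     }
std::size_t zIndex( std::size_t i ) { return xIndex( i ) + ELEMENTS_TOGETHER * 2; }
```

Note how X and Z of the same node are always 8 floats (32 bytes) apart, regardless of which chunk the node lives in.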
The memory allocator may want to take dynamic layouts into account, since AVX now supports 256-bit registers. Whether it's faster to place 8 "X" components together or just 4 is something we'll have to profile.
A fixed memory layout, though, has the advantage that we can infer the memory locations of Y & Z just by knowing X (a dynamic layout means an extra variable to hold how many bytes apart each component is).
The only performance penalty I can see is that such a function can't return a reference; the vector must always be constructed. We can, however, build a special Vector3_SoA class that inter-operates with Vector3. Retrieving a Vector3_SoA would then be fast, and it may not even need conversion to Vector3 when the operations are trivial (e.g. calculating the length of the vector or the dot product).
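To illustrate the "no conversion needed" point with a hedged sketch (these helpers are not Ogre's API, and the fixed 4-float stride is an assumption): a dot product can read the three components straight out of the SoA memory, using the known distance between component blocks:

```cpp
// Hypothetical helper, assuming the stride between the X, Y & Z blocks is
// ELEMENTS_TOGETHER floats (4 here). 'a' and 'b' each point at a node's X.
const int ELEMENTS_TOGETHER = 4;

float dotSoA( const float *a, const float *b )
{
    // No Vector3 is ever constructed; we index straight into the SoA blocks.
    return a[0]                     * b[0]
         + a[ELEMENTS_TOGETHER]     * b[ELEMENTS_TOGETHER]
         + a[ELEMENTS_TOGETHER * 2] * b[ELEMENTS_TOGETHER * 2];
}
```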
CABAListic wrote:Hm, not sure I like the idea of having two functions for essentially the same thing. Would it really be that bad to return a const reference and write in the comment that the reference is not guaranteed to remain valid? We wouldn't be the first lib to do that; heck, even the standard library has more than enough ways to invalidate previously acquired iterators. Such subtleties are the price you pay for using C++. Just don't store a reference if you don't want to deal with them
That's a very good point. I agree. Proper documentation on how the return value can be invalidated beats any "duplicate function" pattern.
We should, however, think of a debug mode that asserts when an invalidated Vector3_SoA is used, just like iterator debug checks do.
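One possible shape for that debug check (purely a sketch; the names SoAManager and DebugSoARef are invented for illustration): the SoA memory manager bumps a generation counter whenever it relocates its buffers, and a debug reference remembers the generation it was created in and asserts it still matches on access:

```cpp
#include <cassert>

// Hypothetical sketch: detect use of invalidated SoA references in debug builds.
struct SoAManager
{
    unsigned generation;
    SoAManager() : generation( 0 ) {}
    void relocateBuffers() { ++generation; } // any reallocation invalidates refs
};

struct DebugSoARef
{
    const SoAManager *mgr;
    unsigned          bornIn; // generation when this reference was handed out
    DebugSoARef( const SoAManager &m ) : mgr( &m ), bornIn( m.generation ) {}
    bool isValid() const { return bornIn == mgr->generation; }
    void checkValid() const
        { assert( isValid() && "Vector3_SoA reference was invalidated" ); }
};
```

In release builds the whole structure would compile down to a raw pointer, so the check costs nothing where it matters.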
Edit:
* That means the "XXXXYYYYZ" group needed for a working Vector takes 36 bytes. From the P4 to the current-generation Core i7, cache lines are 64 bytes in size, which is enough to hold the float version (but not the double version, when using OGRE_DOUBLE_PRECISION) before hitting a cache miss. I don't know about AMD; but
here it says 32, 64 & 128 bytes are common cache line sizes.
As for the space needed between scene nodes to avoid false sharing across multiple threads, common sense says 64 bytes (the size of a cache line), but it is actually 128 bytes for the reasons
explained here (long story short, the P4 & Pentium D fetch two cache lines in a row instead of one).
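A minimal sketch of that padding idea (assuming the 128-byte figure above; not Ogre code, and PaddedNodeSlot is an invented name):

```cpp
// Pad each per-thread slot to 128 bytes so two threads never touch the same
// cache line, nor the same prefetched pair of lines on P4/Pentium D.
struct PaddedNodeSlot
{
    float data[9];                        // e.g. one XXXXYYYYZ group, 36 bytes
    char  pad[128 - sizeof( float ) * 9]; // fill the remainder of 128 bytes
};
```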
Edit 2:
This is the easiest Vector3_SoA implementation using a fixed layout (in terms of readability of the inline functions to follow):
Code: Select all
#define ELEMENTS_TOGETHER 4 //Increase to 8 when compiling for AVX?
#define PAD_COUNT (ELEMENTS_TOGETHER - 1)
class Vector3_SoA
{
public:
	float x;
private:
	float pad0[PAD_COUNT];
public:
	float y;
private:
	float pad1[PAD_COUNT];
public:
	float z;
private:
	//Prevent stack allocation and copying. This structure must always be accessed
	//through a pointer or reference. Declared but intentionally never defined.
	Vector3_SoA();
	Vector3_SoA( const Vector3_SoA& );
	Vector3_SoA& operator = ( const Vector3_SoA& rkVector );
public:
	//"Vector3_SoA( void* )" is explicit, hence compiler errors arise if users inadvertently try to use a
	//Vector3_SoA in stack memory (which is totally pointless & wasteful; they should use Vector3 instead).
	explicit Vector3_SoA( void *dummy ) {}
	//We can copy-paste Vector3's inline implementations, or even use the preprocessor
	//so that we write those inlines only once and keep them permanently in sync
};
Where Vector3_SoA is properly cast from the memory pointer:
Code: Select all
float *MEM; //Points to valid memory
Vector3_SoA *o = reinterpret_cast<Vector3_SoA*>(MEM);
Alternatively, Vector3_SoA can be constructed like this:
Code: Select all
class Vector3_SoA
{
private:
	Real *base;
	int elementsTogether; //Normally 4 or 8
public:
	Real x() const { return *base; }
	Real y() const { return *(base + elementsTogether); }
	Real z() const { return *(base + elementsTogether * 2); }
	Vector3_SoA( Real *_base, int _elementsTogether ) : base(_base), elementsTogether(_elementsTogether) {}
	//Implementing Vector3's inlines is not as straightforward because x() is now a function, not a variable.
};
The first method feels more natural to users, but can be trickier for Ogre devs because a Vector3_SoA must always be a pointer or a reference and never live in stack memory. The second one doesn't have that restriction, but you have to call x() instead of accessing a plain member.
Edit 3: Updated Vector3_SoA's code (changed constructors)