'Tindalos' (Ogre v2.0) SceneManager redesign

User avatar
Kojack
OGRE Moderator
Posts: 7157
Joined: Sun Jan 25, 2004 7:35 am
Location: Brisbane, Australia
x 534

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by Kojack »

Forgive me if my question sounds stupid, but after reading dark_sylinc's idea I was wondering whether some ideas, like software culling and optimising for SSE instructions, could have a detrimental effect on speed on the mobile or older devices that Ogre is targeting.
But for any regular computer up to 8 years old this will probably lead to a huge speed increase.
Some ARM chips (all Cortex A8, some Cortex A9) have NEON, which is a 128-bit SIMD instruction set like SSE.
So any SSE optimisations can probably be ported to NEON too, giving mobiles a boost.

For older devices, SSE has been available in Pentiums for 13 years. I'm pretty sure Ogre isn't targeted to support hardware before that. :)
User avatar
masterfalcon
OGRE Team Member
Posts: 4270
Joined: Sun Feb 25, 2007 4:56 am
Location: Bloomington, MN
x 126
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by masterfalcon »

Thanks Kojack, I was just about to say that. And the software culling would use some CPU cycles to greatly reduce GPU load. So it's a win-win.
User avatar
Klaim
Old One
Posts: 2565
Joined: Sun Sep 11, 2005 1:04 am
Location: Paris, France
x 56
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by Klaim »

Looks like a plan? :D


Everything that has been said looks to be going in a really good direction.

I just wanted to ask: what will access to a node's position/matrices look like when you only have a pointer to a node?
Do you have to get the position through the manager, or does the node hold a pointer to the position?

I'm trying to understand the changes from the user's point of view. Everything looks simpler the way it's described here, but on this point I was asking myself whether it would make arbitrary data access different.
User avatar
Xavyiy
OGRE Expert User
Posts: 847
Joined: Tue Apr 12, 2005 2:35 pm
Location: Albacete - Spain
x 87

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by Xavyiy »

@Klaim
One option is to store the position vector ptr (which is managed by the SM), and then access the correct chunk (node hierarchy level) and the correct position. Something like this:

Code: Select all

const Ogre::Vector3& Node::getPosition() const
{
      return (*mPositions)[mChunk][mPositionOffset];
}
This way, from the user's point of view it'll be the same as now, while internally the update process will be quite a bit faster due to fewer cache misses.
The same goes for matrices.

IMO it would be a good idea to redesign the current node system, introducing this data-oriented management, for Ogre 1.9, since in the end it's not a very big change (the big redesign will be done for 2.0) and it'll be easy to keep almost the whole current Node interface (get/setPosition, Orientation, _update(...), etc.).
Dunno what the community thinks, but maybe a GSoC slot could be used for that. That would give us an idea of the speed-up gained and also be a robust starting point for the big redesign.

Xavier
Last edited by Xavyiy on Sun Mar 25, 2012 11:35 am, edited 2 times in total.
User avatar
Jabberwocky
OGRE Moderator
Posts: 2819
Joined: Mon Mar 05, 2007 11:17 pm
Location: Canada
x 218
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by Jabberwocky »

About removing features:
If a feature meets these criteria:
  • not commonly used
  • impacts performance
  • relatively easy to (re)implement if an ogre user needs it
Then it should be yanked.

Ogre is open source - if I need Ogre to support some uncommon requirement of my project, I can modify the code. Not everything needs to go into the trunk. If I make changes to ogre, I can maintain my own code branch. Ideally, any useful ogre modifications will be shared with others who have need of the same feature - maybe we could have a wiki section for this kind of code sharing.
User avatar
duststorm
Minaton
Posts: 921
Joined: Sat Jul 31, 2010 6:29 pm
Location: Belgium
x 80
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by duststorm »

If it's not too much of a hassle to maintain them, all these obscure features could be #ifdef'ed out of the default build, but enabled if people require them. This could keep some customers happy.

If things are going to be yanked out, it might be a good idea to verify why they were added in the first place, and who's using them. It wouldn't be a sign of good faith to totally drop support for custom enhancements that were paid for by companies, companies that might have agreed to donate useful code to Ogre.
Developer @ MakeHuman.org
User avatar
dark_sylinc
OGRE Team Member
Posts: 5296
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1278
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by dark_sylinc »

Xavyiy wrote:@Klaim
One option is to store the position vector ptr (which is managed by the SM), and then access the correct chunk (node hierarchy level) and the correct position. Something like this:

Code: Select all

const Ogre::Vector3& Node::getPosition() const
{
      return (*mPositions)[mChunk][mPositionOffset];
}
I'm in a hurry, so I'll make it short.
I was thinking along similar lines (management on the inside takes care of the hassles for the users). Although beware of this:
If mChunk & mPositionOffset are allowed to change (e.g. because the node got detached and attached to something different, so that memory remains contiguous), then it's dangerous to return a const reference, since the user may try to save that reference. The memory may still be valid in the future (though we can enforce that it won't be), but it will be pointing to the wrong values.

Issues like this must be taken care of. I do believe we should return a const reference, though, but explain to users that they can't keep that reference beyond its scope or across attachment changes (and list whatever other actions may cause a memory reassignment).
Maybe the change would be:

Code: Select all

Ogre::Vector3 Node::getPosition() const;
const Ogre::Vector3& Node::_getPosition() const;
The name could be getPositionRef() instead.
User avatar
Xavyiy
OGRE Expert User
Posts: 847
Joined: Tue Apr 12, 2005 2:35 pm
Location: Albacete - Spain
x 87

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by Xavyiy »

dark_sylinc wrote:Although beware of this:
If mChunk & mPositionOffset are allowed to change (e.g. because the node got detached and attached to something different, so that memory remains contiguous), then it's dangerous to return a const reference, since the user may try to save that reference. The memory may still be valid in the future (though we can enforce that it won't be), but it will be pointing to the wrong values.

Issues like this must be taken care of. I do believe we should return a const reference, though, but explain to users that they can't keep that reference beyond its scope or across attachment changes (and list whatever other actions may cause a memory reassignment).
Maybe the change would be:

Code: Select all

Ogre::Vector3 Node::getPosition() const;
const Ogre::Vector3& Node::_getPosition() const;
The name could be getPositionRef() instead.
Completely agree. I personally prefer the getPositionRef naming style; I think it's more intuitive, and underscore-prefixed functions are generally private or "only for internal use".

Also, the offset inside the chunks is the same for each data member (position, orientation, scale and globals), so in the end we only have to store the chunk and the offset in the node itself, plus the pointers.

Btw, how well can the SoA arrangement perform in the rest of the operations? For example, the following code will be far slower due to non-contiguous memory access:

Code: Select all

const Ogre::Vector3 Node::getPosition() const
{
      return Ogre::Vector3((*mPositions)[mChunk][0][mOffset],  (*mPositions)[mChunk][1][mOffset],  (*mPositions)[mChunk][2][mOffset]);
}
One option is to store in the node itself an up-to-date copy of the vectors that are stored in SoA form, in order to avoid these memory access problems, but then we're duplicating data (anyway, memory is cheap).

Edit: Forget the last line, that's not a good option since it would lead to updating non-contiguous vectors (well, at least the nodes are contiguous). So a better option is just creating an AoS containing the same data as the SoA version but stored as contiguous XYZ floats.
CABAListic
OGRE Retired Team Member
Posts: 2903
Joined: Thu Jan 18, 2007 2:48 pm
x 58
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by CABAListic »

Hm, not sure I like the idea of having two functions for essentially the same thing. Would it really be that bad to return a const reference and write in the comment that the reference is not guaranteed to remain valid? We wouldn't be the first lib to do that; heck, even the standard library has more than enough ways to invalidate previously acquired iterators. Such subtleties are the price you pay for using C++. Just don't store a reference if you don't want to deal with them :)
User avatar
dark_sylinc
OGRE Team Member
Posts: 5296
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1278
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by dark_sylinc »

Xavyiy wrote:Btw, how well SoA arranging can perform in the rest of operations? For example, the following code will be far slower due to non-contiguous memory access:

Code: Select all

const Ogre::Vector3 Node::getPosition() const
{
      return Ogre::Vector3((*mPositions)[mChunk][0][mOffset],  (*mPositions)[mChunk][1][mOffset],  (*mPositions)[mChunk][2][mOffset]);
}
There's an easy fix, because that's not the memory layout I was thinking of when using SoA.
The memory layout you're thinking of is:
A: XXXXXXXXXXXXXXXXXXXXX
B: YYYYYYYYYYYYYYYYYYYYY
C: ZZZZZZZZZZZZZZZZZZZZZ

The memory layout I'm thinking is:
A: XXXXYYYYZZZZXXXXYYYYZZZZ

This way, each node's X & Z components are only 32 bytes apart*, which is small enough to fit in a cache line. As far as the caches are concerned, the memory is still "contiguous".
The memory allocator may want to take dynamic layouts into account, because I've heard AVX now supports 256-bit registers. Whether it's faster to place 8 "X" together or just 4, we'll have to profile.
A fixed memory layout, though, has the advantage that we can infer the memory locations of Y & Z by knowing X (a dynamic layout means an extra variable to hold how many bytes apart each component is).

The only performance penalty I can see is that such a function can't return a reference; the vector must always be constructed. We can, however, build a special Vector3_SoA class that inter-operates with Vector3. Retrieving a Vector3_SoA would then be fast, and may not even need conversion to Vector3 if the operations are trivial (i.e. calculating the length of the vector or the dot product).
CABAListic wrote:Hm, not sure I like the idea of having two functions for essentially the same thing. Would it really be that bad to return a const reference and write in the comment that the reference is not guaranteed to remain valid? We wouldn't be the first lib to do that; heck, even the standard library has more than enough ways to invalidate previously acquired iterators. Such subtleties are the price you pay for using C++. Just don't store a reference if you don't want to deal with them :)
That's a very good point. I agree. Proper documentation on how the return value can be invalidated beats any "duplicate function" pattern.
We should, however, think of a debug function that panics when an invalidated Vector3_SoA is used, just like iterator debug asserts do.

Edit:
* That means the "XXXXYYYYZ" span necessary for a working Vector needs 36 bytes. From the P4 to the current-generation Core i7, cache lines are 64 bytes in size, which is enough to hold the float version (but not double, when using OGRE_DOUBLE_PRECISION) before hitting a cache miss. I don't know about AMD; but here it says 32, 64 & 128 bytes are common cache line sizes.

As for the necessary space between scene nodes to avoid false sharing across multiple threads, common sense dictates 64 bytes (the size of a cache line). But it is actually 128 bytes, for the reasons explained here (long story short, the P4 & Pentium D fetch two lines in a row instead of one).

Edit 2:
This is the easiest Vector3_SoA implementation (in terms of readability of the inline functions to follow) using a fixed layout:

Code: Select all

#define ELEMENTS_TOGETHER 4 //Increase to 8 when compiling for AVX?
#define PAD_COUNT (ELEMENTS_TOGETHER-1)
class Vector3_SoA
{
	float x;
private:
	float pad0[PAD_COUNT];
public:
	float y;
private:
	float pad1[PAD_COUNT];
public:
	float z;

private:
	//Prevent stack allocation and copy operators. This structure always must be passed by reference
	Vector3_SoA() {}
	Vector3_SoA( Vector3_SoA& ) {}
	inline Vector3_SoA& operator = ( const Vector3_SoA& rkVector ) { return *this; }
public:
	//The dummy-pointer constructor is explicit, hence compiler errors arise if users inadvertently try to create a
	//Vector3_SoA in stack memory (which is totally pointless & wasteful; Vector3 should be used instead).
	explicit Vector3_SoA( void *dummy ) {}

	//We can copy-paste Vector3's inline implementations, or even use preprocessors
	//so that we only write those inlines once and kept them permanently in sync
};
Where the Vector3_SoA is properly cast from the memory pointer:

Code: Select all

float *MEM; //Points to valid memory
Vector3_SoA *o = reinterpret_cast<Vector3_SoA*>(MEM);
Alternatively, Vector3_SoA can be constructed like this:

Code: Select all

class Vector3_SoA
{
private:
 Real *base;
 int    elementsTogether; //Normally 4 or 8
public:
 Real x() const { return *base; }
 Real y() const { return *(base + elementsTogether);  }
 Real z() const { return *(base + elementsTogether * 2);  }

 Vector3_SoA( Real *_base, int _elementsTogether ) : base(_base), elementsTogether(_elementsTogether) {}

 //Implementing Vector3's inlines is not as straightforward because x() is now a function, not a variable.
};
The first method seems more natural for users, but can be trickier for Ogre devs because a Vector3_SoA must always be a pointer or a reference, never living in stack memory. The 2nd one doesn't have that restriction, but you have to call x().

Edit 3: Updated Vector3_SoA's code (changed constructors)
User avatar
Kojack
OGRE Moderator
Posts: 7157
Joined: Sun Jan 25, 2004 7:35 am
Location: Brisbane, Australia
x 534

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by Kojack »

dark_sylinc wrote:From the P4 to the current generation Core i7, the cache lines are 64-byte in size,
ARM Cortex A8 (OMAP3, iPad, iPhone 4, etc.) has 64-byte cache lines. Cortex A9 (OMAP4, Tegra 2, Tegra 3, iPad 2, iPad 3, iPhone 4S, etc.) has only 32-byte lines.
User avatar
m2codeGEN
Halfling
Posts: 52
Joined: Tue Apr 26, 2011 9:13 am
Location: Russia, Tver
x 2

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by m2codeGEN »

I don't see anything bad in returning a Vector3 by value (well, 12 bytes for float and 24 for double, while on x64 platforms the size of a pointer is 8 bytes).
It would only increase the overhead slightly, since writes to memory don't happen byte by byte anyway.
User avatar
dark_sylinc
OGRE Team Member
Posts: 5296
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1278
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by dark_sylinc »

Returning by reference means the user can still use a copy constructor; if we return by value, the user can't get a reference.
Returning by reference gives both options; returning by value only one.

BTW, returning by ref. means a "mov eax, [memory]", while by value means a stack push & 3 movs. How fast the code that uses the return value runs depends on compiler optimizations and strict aliasing rules; but if the compiler isn't smart, the user can just make a copy and problem solved.

As for the Cortex A9, that's too bad. [s]I guess iPhone 4 owners will be happy their games run faster than on the iPhone 4S[/s]. I can think of:
  • SoA becomes AoS by playing with the memory layout: XYZXYZXYZ. Use scalar instructions
  • Group 3 components together (XXXYYYZZZ) rather than 4, and use bitwise logic to prevent overwriting the 4th value. Use SIMD instructions
User avatar
Xavyiy
OGRE Expert User
Posts: 847
Joined: Tue Apr 12, 2005 2:35 pm
Location: Albacete - Spain
x 87

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by Xavyiy »

dark_sylinc wrote:There's an easy fix, because that's not the memory layout I was thinking of when using SoA.
Great, indeed I was misunderstanding your idea. That solves all the cache miss problems, at least with 64-byte+ cache lines.
dark_sylinc wrote:If it's faster to place 8 "X" together or just 4, we'll have to profile.
Well, 8 "X" together means 68 bytes for a working vector, which will indeed lead to an extra cache miss per vector retrieval. But processing 8 transforms at a time through AVX does sound pretty fast.

In the end it's the same question as the Cortex A9's 32-byte cache lines with 4 "X" together. We'll have to profile it in a real app (although results will be very app- and scene-complexity-dependent) and then go for the best option (or let the developer select the desired memory layout with preprocessor defines, but that will increase code complexity and also make it less clear/maintainable).

Also, we have to think about how to deal with OGRE_DOUBLE_PRECISION here.
dark_sylinc wrote:This is the easiest Vector3_SoA implementation (in terms of readability of the inline functions to follow) using a fixed layout:
Well, even though I like the idea of a Vector3_SoA class (interoperable with Vector3 and implementing the same functions), I still think returning these SoA Vector3 versions from nodes would mess up the code too much.
IMO a smarter solution is to let the user get the SoA version through the SceneManager, for example something like const Vector3_SoA& SceneManager::_getNodePosition(Node* n), instead of dealing with that in the Node class.
That way the common user (better said, common usage) will just use the copy version, and the expert user who cares about performance will be able to get the SoA version.

Btw, I prefer the first implementation even if it always has to be a ptr/ref. Also, IMO anyone who uses the SoA version will have the proper knowledge of how to use it.

Xavier
User avatar
dark_sylinc
OGRE Team Member
Posts: 5296
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1278
Contact:

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by dark_sylinc »

Xavyiy wrote:
If it's faster to place 8 "X" together or just 4, we'll have to profile.
Well, 8 "X" together means 68 bytes for a working vector, which will indeed lead to an extra cache miss per vector retrieval. But processing 8 transforms at a time through AVX does sound pretty fast.

In the end it's the same question as the Cortex A9's 32-byte cache lines with 4 "X" together. We'll have to profile it in a real app (although results will be very app- and scene-complexity-dependent) and then go for the best option (or let the developer select the desired memory layout with preprocessor defines, but that will increase code complexity and also make it less clear/maintainable).
Yes, indeed. My idea, at least, is that inside Ogre we squeeze the SoA layout to the max, which means internally we would rarely use Vector3 getPosition().

It's outside Ogre where it can become messy quickly. However, in my experience most apps don't use Vector3 thaaaat much; it's not really the bottleneck. Furthermore, advanced games tend to use other third-party libraries to store the real position & orientation data (hkVector4 in Havok, btVector4 in Bullet, the list goes on) and then update the SceneNode to match the physics data.

Simple games are probably going to be more affected, but again, simple games rarely have a bottleneck in the internal game loop (except for RTS games).
Xavyiy wrote:Also, we have to think about how to deal with OGRE_DOUBLE_PRECISION here.
Hehe, that's an issue. I'm thinking that on PCs, OoOE (out-of-order execution) may mitigate the problem because the cache only needs to read two cache lines from contiguous memory. But it's dubious. And handhelds & consoles don't have that advantage (although do they use double precision?)

Again, we can try both the 3-elements-together layout + bitwise logic and the SoA-to-AoS conversion, and compare. If we're clever, we can build the code so that switching between them is straightforward.
Worst case scenario, we end up with an AoS layout, in which case performance will already improve because of the cache locality of the nodes. The challenge then is whether we can further improve it with SIMD.
Xavyiy wrote:Well, even if I like the idea of a Vector3_SoA class(interoperable with Vector3 and implementing the same functions), I still think it'll mess too much the code by returning these SoA Vector3 version in nodes.
IMO a smart solution is allowing the user getting these SoA through the SceneManager, for example something like: const Vector3_SoA& SceneManager::_getNodePosition(Node* n) instead of dealing with that in the Node class.
That way the common user(better said, for a common usage) will just use the copy version, and the expert user which cares about performance will be able to get the SoA version.
Ok, that's possible (I don't think there's a need for the underscore, since the function is placed in the SceneManager, not in the Node...). Personally I like this more:

Code: Select all

Ogre::Vector3 myVec = node->getPosition();
const Ogre::Vector3_SoA &mySoAVec = node->getPosition();
This is possible because we add an additional Vector3 constructor:

Code: Select all

Vector3( const Vector3_SoA &ref ) : x( ref.x ), y( ref.y ), z( ref.z ) {}
So, people with less experience should be encouraged to convert to Vector3, while advanced users are encouraged to use the SoA version directly.
Xavyiy wrote: Btw, I like more the first implementation even if it always has to be a ptr/ref. Also, IMO the person who uses the SoA version will have the proper knowledge about how to use it.
Me too.
We could add a debug function:

Code: Select all

bool Vector3_SoA::checkValidity( const Node *node );
so that a user can check every now and then that its pointer still belongs to the same Node. I can't think of an easy way to add automatic debug checks without cluttering the whole Vector3_SoA code & its usage. It's slightly easier with the non-ptr/ref version, as we have local data that can be stored on the stack to check against, but it's still messy.

One more thing: mixing SSE with x87 FPU code is a bad idea. There are three options:
  • Don't use x87. It's deprecated in x64. Compile with SSE/SSE2. I can't remember where I read that 98% of surveyed computers supported SSE2, including office PCs (no, I'm not talking about Steam's survey)
  • Allow x87, but mark x, y, z as private members and use a function to access them (i.e. Vector3 Vector3_SoA::_getVector3() const;). By keeping the members private, we prevent the compiler from using x, y & z directly in the FPU registers, which will only happen if the user explicitly makes that call.
  • Use a different code path that is x87-only. This may be feasible, as we'll probably need to implement an AoS + scalar version of the code for handhelds anyway (AoS is an instant switch; writing & maintaining the scalar code shouldn't be hard, as it's the same as the SIMD code but using different instructions/intrinsics and a loop that iterates 4x more times)
I'm between options 1 & 3.
Option 1 is simple but egocentric (I like simple); option 3 is broader (the beauty of open source is that there's always someone who needs it & can code it).
AgentC
Kobold
Posts: 33
Joined: Tue Apr 24, 2012 11:24 am
x 5

Re: 'Tindalos' (Ogre v2.0) SceneManager redesign

Post by AgentC »

Hi,
I'm new here, but a long-time Ogre user (I've been using it at work since 2007). In the meantime I've also been writing my own 3D engine which, like my 3D programming skills in general, owes a lot to Ogre for inspiration. I'd like to weigh in regarding a possible Ogre reorganization for performance. Based on my own experiences & profiling, I believe Ogre could gain substantially (something like 3x-5x) in raw CPU performance in culling, draw call collection and draw call submission, *even* without resorting to data-oriented programming, SoA optimizations or multithreading, though doing so might require sacrificing some of its flexibility:

- To avoid cache misses, keep subclassing to a minimum and put the critical and often queried data (like a scene node's world transform matrix) as near to the object's start address as possible.

- Avoid whole scene graph traversals. Wouldn't a dirtying / lazy-update mechanism for nodes (i.e. when the world transform is queried, update it from the parent if necessary) also work on its own, without needing an explicit update step?

- Because culling is in the end about the renderables, maybe scene nodes themselves shouldn't have bounds; only the renderables would. Also, only the renderables, not the scene nodes, would be inserted into a SceneManager-specific acceleration structure, such as an octree, for example by having an ISpatialLeaf (from which OctreeLeaf would subclass) pointer in each renderable. This would ideally allow varying SM and acceleration structure implementations without needing to subclass the scene node itself.

- Keep the route of a renderable from the culling phase to draw calls as simple as possible, with minimal function calls and indirection in between, and without firing a lot of events / listeners. Preferably have a fast path for the case of no listeners having been registered.

- Perhaps most importantly, optimize the use of the underlying rendering API. If you look at Ogre with PIX or gDebugger, you see a lot of repeated state-setting calls that wouldn't need to be there. Ideally rendering several objects with same material and lighting would just result in (pseudocode) SetShaderParameter(worldMatrix); DrawPrimitive(); repeated over and over. Due to Ogre's flexibility in materials, and supporting both shaders & fixed function, I'd imagine this would also be the hardest change to make.
Last edited by AgentC on Wed Apr 25, 2012 9:05 am, edited 1 time in total.