I'm here after seeing David Rogers' tweet. Like him, I'm concerned the SM is getting old.
Before reading my post, reading these papers is a must:
I said in
another post:
Ogre's performance is below the standards of other AAA engines (Anvil engine, CryEngine, Frostbite 2).
I'm struggling to reach 1,000 render calls @ 20 fps, while the Anvil engine is doing three times that many render calls at the same frame rate (in both cases without being GPU bound).
Profiling reveals the compositor wastes a lot of time parsing the scene manager multiple times (when using something other than render_quad) and there are
A LOT of cache misses.
The lack of threaded culling makes this even worse. Furthermore, with DirectX 11 threading model, it's possible to process a scene and batch render calls in multiple threads in a very concurrent way:
- One thread handles shadow rendering.
- One thread handles main scene
- One thread handles environment mapping (i.e. reflections)
Ogre is already struggling to handle a high number of entities when doing the main scene's & shadows' rendering in the same thread. When I add env. mapping (one pass, not six), the number of cache misses inside the scene manager is gigantic.
I'm afraid that, as someone suggested, fixing this may require a strong redesign of the Ogre core. For instance, automatic reference counting of pointers goes against a concurrent system. Singletons don't help either (it's very easy for programmers to make the mistake of accessing a singleton when it isn't safe to do so).
This thread was started by Sinbad as "the SM needs 2.0 because of lack of features"; I'm going to say "the SM needs 2.0 because of low performance".
Let's analyze the problems:
- Lack of multi-threading: Reference counting is troublesome
- Lots of cache misses: in 2001, an octree algorithm was very efficient. Today it incurs too many branches, and a more "brute force" approach outperforms allegedly "efficient" culling algorithms.
- Lots of cache misses: the way SceneNode creation is handled is just very memory non-local.
Solutions proposed:
SceneNode, creation:
SceneNodes go into an array. Plain old std::vector<SceneNode>. SceneManager plugins can send their own queue, but the principle is one large array, or chunks of arrays similar to this:
Code: Select all
std::vector<std::vector<SceneNode>> mChunks;
mChunks.resize( 2 );        // create the chunks before indexing them
mChunks[0].reserve( 1000 ); // each chunk holds its nodes contiguously
mChunks[1].reserve( 1000 );
A list/set/map with custom allocators to ensure data locality is possible, but they don't thread well.
An array of std::vector<SceneNode*> could be used if the mem. allocator places them contiguously in memory, and is probably a better option (keep reading).
The vector could be kept ordered for faster finds.
Child scene nodes must go in different arrays than the parent scene nodes.
First all parent nodes are updated, then all their children, then the children's children. Then, bottom up, their bounding boxes are updated; just as suggested in the Pitfalls of Object Oriented Programming paper.
SceneNode, updates:
Complete removal of the "if( mDirty ) update()" idiom; unless update() has a loooooot of code to execute inside.
SceneNode, position's & Matrix4 memory layout:
Rather than using Vector3 to store position, I would suggest (optionally) using arrays of X, Y & Z in an SoA arrangement, in which adjacent elements belong to the next SceneNode to parse.
In other words:
Code: Select all
SceneNode0
float *posX = mem[0][0];
float *posY = mem[1][0];
float *posZ = mem[2][0];
SceneNode1
float *posX = mem[0][1];
float *posY = mem[1][1];
float *posZ = mem[2][1];
SceneNode2
float *posX = mem[0][2];
float *posY = mem[1][2];
float *posZ = mem[2][2];
SceneNode3
float *posX = mem[0][3];
float *posY = mem[1][3];
float *posZ = mem[2][3];
This way we can, using SSE, update four scene nodes' transforms at the same time.
Multithreading
Each thread would handle a certain amount of SceneNodes to parse for updating.
For example:
- Thread 1: SceneNodes[0 - 99]
- Thread 2: SceneNodes[100 - 199]
Singletons: for those who like them, unfortunately they have to go away. Not because they can't coexist, but because they encourage non-advanced users to, well, use them; and they kill performance in an MT environment, or are simply not safe to access.
An exception is if we implement something like:
I've seen that Boost callbacks have been suggested. My reaction is "hell no!". MT is complex, MT is prone to thread unsafety, and MT is hard to make scalable. Debugging MT bugs is a major PITA. The simpler it stays, the better it works.
Updating SceneNodes in multiple threads is dead easy once they're properly stored in a vector. Upon creation, a threadID is sent to each thread, which is used to index the portion of the vector that thread has to update. That's pretty much how OpenCL, CUDA & shaders work in general.
Proper care must be taken for:
- Memory barriers (Hardware out of order execution)
- Volatile variables in the right places (Compiler reordering execution of key variables)
- Ensuring children SceneNodes & parent SceneNodes are all updated in the same thread
Culling
Like I said in my quote, software occlusion culling is very popular today. Occlusion rasterizers can be highly parallelized (heck, that's why GPUs are so good at rasterization), are extremely cache friendly, and are very fast.
With SSE, and the proper care described above (use one or multiple large SceneNode arrays, use SoA, remove the 'if dirty', etc.), writing a SW rasterizer that only writes depth and outputs an array of visible SceneNodes (or Renderables?) should be a three-week job, tops. Not to mention Hi-Z & Z compression can be implemented by storing the triangle's plane equation for blocks of pixels; that's how GPUs do it.
Multithreading part 2
Updating the SceneNodes in parallel is one task. But parsing a Scene can be parallelized differently:
- The "main" scene
- The "shadow texture" scene
- The "alternate" scenes (i.e. environment mapping for reflections)
All of them can have their culling and render queues parsed in different threads. The compositor should be updated for specifying which render_scene passes are independent of each other so they can be parallelized.
Furthermore, D3D11 allows grouping all batches for each "scene" in different threads and then dispatching them to the main thread. A fallback emulating this behavior ourselves must be implemented for D3D9 (I don't know the status of OpenGL & GL ES regarding this).
Memory & Resource management
There are different options to consider. All of them mean the removal of automatic reference counting. It's a very lazy & handy solution, but I've seen ref. counts going way high (e.g. because an object was passed down multiple levels of the stack). And that's even happening inside Ogre, when the render queues query for camera visibility. Not to mention Ogre's ref. count implementation contains a level of indirection (= cache misses).
- Use explicit reference counting: obj->addReference() / obj->removeReference(). Havok goes this way. I don't particularly like this method.
- Use a load-remove pattern. Specify a set of rules saying when & where an object can or must be loaded/created, and when & where it must be deleted. For example, if the user wants to destroy a Material, offer a utility function that helps him track down all the Entities using that material so he can decide what to do (don't destroy the material, destroy the entities, or change their material). Careful developers know that whenever you explicitly write SceneMgr::destroy( xx ) you must be sure you don't have an instance of 'xx' wandering around. I like this idea better; some don't.
D3D11 Readiness (watch out)
One of the key elements holding back D3D11 is that Microsoft changed the behavior of immutable GPU memory: static buffers must be filled with their data at creation time.
Ogre assumes that static buffers can be filled after being created, but D3D11 doesn't support this.
It's normal in Ogre code to see this:
Code: Select all
VertexBuffer vb = createVB( staticTrue );
float *ptr = vb->lock();
..
*ptr++ = myPos[i];
..
vb->unlock();
While D3D11 expects this:
Code: Select all
VertexBuffer vb = createVB( staticTrue, myPos );
So, if anyone decides to accomplish this task of major SM overhaul and sees this code, please change it to the new behavior so D3D11 plugin development can be accelerated.
Reading this post can cause headaches, I know. I could implement the SM from scratch, reusing old code where necessary, to account for these changes; but the truth is, I don't have the time & money to do that.
I'll be waiting for feedback.
Cheers
Dark Sylinc
PS: I estimate that an update as large as this would take an experienced programmer around 4-6 months + testing, which is more than what GSoC provides.