Here are the results of a few experiments I ran to see how much the /arch:SSE and /arch:SSE2 switches improve Horde3D performance. As you have already suggested, software skinning should benefit from such optimizations, so let's enable SWSkinning in the Chicago sample:
Code:
// Chicago Sample - crowd.cpp
void CrowdSim::init()
{
    ...
    // Add characters
    for( unsigned int i = 0; i < 200; ++i )
    {
        Particle p;
        // Add character to scene and apply animation
        p.node = h3dAddNodes( H3DRootNode, characterRes );
        h3dSetNodeParamI( p.node, H3DModel::SWSkinningI, 1 );
        h3dSetupModelAnimStage( p.node, 0, characterWalkRes, 0, "", false );
        // Characters start in a circle formation
        p.px = sinf( (i / 100.0f) * 6.28f ) * 10.0f;
        p.pz = cosf( (i / 100.0f) * 6.28f ) * 10.0f;
        chooseDestination( p );
        h3dSetNodeTransform( p.node, p.px, 0.02f, p.pz, 0, 0, 0, 1, 1, 1 );
        _particles.push_back( p );
    }
}
Now it's time to compile and compare the time spent on Geo Updates:
Code:
Normal code : about 60ms
/arch:SSE : about 60ms
/arch:SSE2 : about 130ms
Well, the results are interesting! There is no real difference between the normal and SSE builds, but why is the SSE2 build about 2x slower? The profiler says that most of the time is spent in ModelNode::updateGeometry(), so I decided to compare the assembly generated for the SSE and SSE2 builds:
Here is the SSE disassembly and here is the SSE2 disassembly.
So what? The first noticeable thing is that with /arch:SSE enabled, the compiler failed to optimize the code and generated the normal code instead, which is why there is no difference between the SSE and non-SSE builds. The second is that with /arch:SSE2 enabled, the compiler did a horrible job. Why? If you look closely, the compiler converts all of the floats to doubles to perform the SSE2 operations on them, and there are more problems besides ...
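For illustration only: one source-level way that float arithmetic ends up in double precision is an unsuffixed literal such as 1.0, which has type double in C++. Whether anything like this contributes to the conversions seen in the SSE2 disassembly was not checked here. The helpers below (blendPromoted, blendSingle) are hypothetical and are not Horde3D code.

```cpp
#include <type_traits>

// Hypothetical helpers for illustration; not taken from Horde3D.

// 1.0 is a double literal, so (1.0 - w) and the whole sum have type double
// and are narrowed back to float on return.
inline float blendPromoted( float a, float b, float w )
{
    return a * w + b * (1.0 - w);
}

// 1.0f keeps every operand and the result typed as float.
inline float blendSingle( float a, float b, float w )
{
    return a * w + b * (1.0f - w);
}

// The expression types make the difference visible at compile time.
static_assert( std::is_same<decltype( 1.0f * (1.0 - 1.0f) ), double>::value,
               "mixing float with a double literal yields double" );
static_assert( std::is_same<decltype( 1.0f * (1.0f - 1.0f) ), float>::value,
               "float-only arithmetic stays float" );
```

Building both helpers with /arch:SSE2 and comparing their disassembly would be one way to check whether the extra conversions show up only in the first one.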
All of the tests were done with MSVC 2010.