All times are UTC + 1 hour

MSVC poor generated code performance

Siavash

Post subject: MSVC poor generated code performance

Posted: 26.09.2010, 15:04

Joined: 21.08.2008, 11:44
Posts: 354

Here are the results of a few experiments I ran to see how much the /arch:SSE and /arch:SSE2 switches improve Horde3D performance. As has already been suggested, software skinning should benefit from such optimizations, so let's enable SWSkinning in the Chicago sample:

Code:

// Chicago Sample - crowd.cpp

void CrowdSim::init()
{
   ...
   
   // Add characters
   for( unsigned int i = 0; i < 200; ++i )
   {
      Particle p;
      
      // Add character to scene and apply animation
      p.node = h3dAddNodes( H3DRootNode, characterRes );
      // Enable software (CPU) skinning for this model
      h3dSetNodeParamI( p.node, H3DModel::SWSkinningI, 1 );
      h3dSetupModelAnimStage( p.node, 0, characterWalkRes, 0, "", false );
      
      // Characters start in a circle formation
      p.px = sinf( (i / 100.0f) * 6.28f ) * 10.0f;
      p.pz = cosf( (i / 100.0f) * 6.28f ) * 10.0f;

      chooseDestination( p );

      h3dSetNodeTransform( p.node, p.px, 0.02f, p.pz, 0, 0, 0, 1, 1, 1 );

      _particles.push_back( p );
   }
}

Now compile each configuration and compare the time spent on Geo Updates:

Code:

Normal code : about 60ms
/arch:SSE   : about 60ms
/arch:SSE2  : about 130ms

Well, the results are very interesting! There is no real difference between the normal and SSE builds, but why is the SSE2 build 2x slower? The profiler says that most of the time is spent in ModelNode::updateGeometry(), so I decided to compare the assembly generated for the SSE and SSE2 builds:
Here is the SSE disassembly and here is the SSE2 disassembly

So what? The first noticeable thing is that with /arch:SSE enabled, the compiler failed to optimize the code and generated the normal code instead, which is why there is no difference between the SSE and non-SSE builds. The second is that with /arch:SSE2 enabled, the compiler has done a horrible job. Why? If you look closely, the compiler converts all of the floats to doubles to perform the SSE2 operations on them, and there are a lot more problems ...

All of the tests were done with MSVC 2010.

swiftcoder

Post subject: Re: MSVC poor generated code performance

Posted: 26.09.2010, 16:08

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA

What other compiler optimisation flags are set?

_________________
Tristam MacDonald - [swiftcoding]

Siavash

Post subject: Re: MSVC poor generated code performance

Posted: 26.09.2010, 16:25

Joined: 21.08.2008, 11:44
Posts: 354

The default options of a freshly CMake-generated project + /arch:SSE2:

Code:

/I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/." /I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/../Shared" /I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/../../Bindings/C++" /I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/../../.." /I"D:/Development/SDK/Horde3D/Horde3D SF.net/CM_BIN_x86" /Zi /nologo /W3 /WX- /O2 /Ob1 /Oy- /D "WIN32" /D "_WINDOWS" /D "NDEBUG" /D "CMAKE" /D "CMAKE_INTDIR=\"RelWithDebInfo\"" /D "Horde3D_EXPORTS" /D "_WINDLL" /D "_MBCS" /Gm- /EHsc /MD /GS /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /GR /Fp"Horde3D.dir\RelWithDebInfo\Horde3D.pch" /Fa"RelWithDebInfo" /Fo"Horde3D.dir\RelWithDebInfo\" /Fd"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Binaries/RelWithDebInfo/Horde3D.pdb" /Gd /TP /analyze- /errorReport:queue 

swiftcoder

Post subject: Re: MSVC poor generated code performance

Posted: 26.09.2010, 17:11

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA

I would try swapping /fp:precise for /fp:fast, as that should let the compiler generate significantly faster floating-point code. You generally only need /fp:precise if you need to make strict maximum accuracy guarantees.
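
To illustrate why /fp:precise holds the optimizer back, here is a minimal sketch (not Horde3D code; the function names are made up for this example). Float addition is not associative, so a compiler that has to preserve exact results cannot regroup a chain of additions, while /fp:fast is allowed to.

```cpp
// Sums three floats as (a + b) + c.
// The volatile temporaries force each intermediate result to be rounded
// to single precision, whatever the compiler or FPU would otherwise do.
float sumLeftFirst( float a, float b, float c )
{
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

// Sums the same three floats as a + (b + c).
float sumRightFirst( float a, float b, float c )
{
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With a = 1.0e8f, b = -1.0e8f and c = 1.0f, sumLeftFirst returns 1 and sumRightFirst returns 0, because the 1.0f is lost when it is added to -1.0e8f in single precision. A strict floating-point model has to keep the grouping that is written in the source.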

Siavash

Post subject: Re: MSVC poor generated code performance

Posted: 26.09.2010, 17:24

Joined: 21.08.2008, 11:44
Posts: 354

Now it's a bit faster :

Code:

Normal code : about 60ms
SSE code    : about 53ms
SSE2 code   : about 53ms

EDIT: The SSE and SSE2 generated code is pretty similar; most (maybe all?) of the operations are done on single values (scalar) instead of being SIMD, and the results are still very far from hand-tuned code.
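
For comparison, this is roughly what a hand-written SSE version of the core skinning operation (transforming a position by a 4x4 matrix) could look like. It is only a sketch: Mat4, transformScalar and transformSSE are hypothetical names, not Horde3D's types, and it assumes a column-major, 16-byte aligned matrix.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical column-major 4x4 matrix, c[column][row].
// alignas(16) makes the aligned loads below legal.
struct alignas( 16 ) Mat4
{
    float c[4][4];
};

// Scalar reference: out = m * (in.x, in.y, in.z, 1), one float at a time.
void transformScalar( const Mat4 &m, const float in[3], float out[3] )
{
    for( int row = 0; row < 3; ++row )
        out[row] = m.c[0][row] * in[0] + m.c[1][row] * in[1] +
                   m.c[2][row] * in[2] + m.c[3][row];
}

// SSE version: each matrix column is one register, so all rows of the
// result are computed together with packed multiplies and adds.
void transformSSE( const Mat4 &m, const float in[3], float out[3] )
{
    const __m128 col0 = _mm_load_ps( m.c[0] );
    const __m128 col1 = _mm_load_ps( m.c[1] );
    const __m128 col2 = _mm_load_ps( m.c[2] );
    const __m128 col3 = _mm_load_ps( m.c[3] );

    __m128 r = _mm_mul_ps( col0, _mm_set1_ps( in[0] ) );
    r = _mm_add_ps( r, _mm_mul_ps( col1, _mm_set1_ps( in[1] ) ) );
    r = _mm_add_ps( r, _mm_mul_ps( col2, _mm_set1_ps( in[2] ) ) );
    r = _mm_add_ps( r, col3 );  // translation column (w = 1)

    alignas( 16 ) float tmp[4];
    _mm_store_ps( tmp, r );
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
}
```

The point is that the packed version needs only a handful of mulps/addps instructions per vertex, which is the kind of code the compiler did not produce on its own in these tests.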

Siavash

Post subject: Re: MSVC poor generated code performance

Posted: 26.09.2010, 19:04

Joined: 21.08.2008, 11:44
Posts: 354

Round 2 of the experiments: 64-bit generated code

Code:

x64 code  : about 53ms
SSE code  : about 53ms
SSE2 code : about 53ms

The normal generated code is faster in 64-bit mode, but wait, why is there no difference between the x64, SSE and SSE2 builds? The answer is here: MSVC generates exactly the same code for all three. SSE optimizations are applied by default, and if you compare the x64 and x86 (SSE) code you will notice that it uses the movaps instruction to load 4 (aligned) floating-point values at once in x64 mode, and movss to load 1 floating-point value in x86 mode. BTW, IMHO it won't make much difference.
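
In source terms, the movss vs. movaps difference looks roughly like the sketch below. The function names are made up for this example. _mm_load_ps is the intrinsic for an aligned 4-float load (typically emitted as movaps), while the plain loop reads one float per iteration unless the compiler vectorizes it on its own.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// One float per iteration: scalar loads and adds.
float sumScalar( const float *v, std::size_t n )
{
    float sum = 0.0f;
    for( std::size_t i = 0; i < n; ++i )
        sum += v[i];
    return sum;
}

// Four floats per iteration: aligned packed loads and adds.
// Requires v to be 16-byte aligned and n to be a multiple of 4.
float sumPacked( const float *v, std::size_t n )
{
    assert( n % 4 == 0 );
    assert( reinterpret_cast< std::uintptr_t >( v ) % 16 == 0 );

    __m128 acc = _mm_setzero_ps();
    for( std::size_t i = 0; i < n; i += 4 )
        acc = _mm_add_ps( acc, _mm_load_ps( v + i ) );

    // Add the four partial sums held in the register lanes.
    alignas( 16 ) float lanes[4];
    _mm_store_ps( lanes, acc );
    return ( lanes[0] + lanes[1] ) + ( lanes[2] + lanes[3] );
}
```

The packed version only pays off when the data is laid out so that four useful values sit next to each other in memory, which is probably why a different load instruction alone does not change the timings much.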

ZONER

Post subject: Re: MSVC poor generated code performance

Posted: 27.09.2010, 07:26

Joined: 17.01.2010, 13:30
Posts: 7

Siavash, the FPU, SSE and SSE2 builds are the same under x64 because of the MSVC compiler. It has limitations when targeting x64: the /arch:SSE and /arch:SSE2 switches and inline ASM are available only when you compile x86 code (x64 code always uses SSE2 for floating point).
Also, don't use Maximize Speed (/O2) or Full Optimization (/Ox); just use Custom. I mean, do not always trust the MSVC optimizer.
Another piece of advice is to use __forceinline instead of inline in code (not everywhere, but algebra/math code is where __forceinline pays off most).
My compiler options are: x86 code, /arch:SSE2, /fp:precise, Custom optimization, Favor Fast Code (/Ot).

One more piece of advice: don't use the built-in memory allocator; use a custom memory allocator instead (mine is the TBB allocator, and I am experimenting with NedAlloc).
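
A minimal sketch of the __forceinline advice applied to small math helpers. SKETCH_FORCE_INLINE and Vec3 are hypothetical names for this example, not Horde3D identifiers; the macro falls back to the GCC/Clang always_inline attribute so the same code builds outside MSVC.

```cpp
// Hypothetical portability macro: __forceinline on MSVC,
// the always_inline attribute on GCC/Clang, plain inline elsewhere.
#if defined( _MSC_VER )
    #define SKETCH_FORCE_INLINE __forceinline
#elif defined( __GNUC__ )
    #define SKETCH_FORCE_INLINE inline __attribute__(( always_inline ))
#else
    #define SKETCH_FORCE_INLINE inline
#endif

// Hypothetical 3-component vector used only for this example.
struct Vec3
{
    float x, y, z;
};

// Small, hot math helpers are the typical candidates for forced inlining.
SKETCH_FORCE_INLINE float dot( const Vec3 &a, const Vec3 &b )
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

SKETCH_FORCE_INLINE Vec3 cross( const Vec3 &a, const Vec3 &b )
{
    Vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}
```

Whether forcing the inline actually helps should be checked in the profiler for each case, since larger functions can get slower from the extra code size.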
