Horde3D

Next-Generation Graphics Engine

All times are UTC + 1 hour




Posted: 18.10.2008, 04:20

Joined: 21.08.2008, 11:44
Posts: 354
Codepoet wrote:
What do you think about integrating that library - or parts of it - into Horde after doing real world tests instead of writing your own version?
The math library provided by Sony is an all-in-one library and somewhat complex. We could integrate roughly 20-30% of it into our utmath.

But that would require changes across the whole structure of the engine [everywhere vec3f and matrix4f are used]. With the current utmath_rcx there is no need to change the structure, but we can't get the real power of SSE, because utmath_rcx wastes a lot of CPU cycles on simple operations such as loading floats into __m128 and storing __m128 back into floats.

If you take a closer look at other libraries such as Sony's or the Nebula Device's, you will see that they load floats into __m128 once, when the vector and matrix objects are constructed. After that, everything is performed on __m128 types without wasting CPU cycles, which gains them another ~20% in performance.

There are several ways to deal with this problem:

A. Keep the current structure of utmath_rcx, so there is no need to change the engine.

B. Use unions [m128] to load and store the data, then copy the results into the float types [vectors and ...]. This is slower than A because of the current structure of the engine.

C. Change the whole structure of the engine to remove the cycle-wasting load and store operations, as the Sony and Nebula math libraries do. This way we can get the real power of SSE.

IMHO it's better to choose solution A if you don't want to change the whole structure of the engine; performance will then stay at about 1:1 [FPU|SSE]. But if you really want to enjoy the full power of your ~$200 Core 2 Duo and expensive quad cores, it's better to choose solution C.

I don't know what to do, but I prefer solution C. Everything depends on you, the community and the main developers :!:


Posted: 18.10.2008, 05:04

Joined: 21.08.2008, 11:44
Posts: 354
Here is a simple example:
Code:
Vec3f ivec1,ivec2,ivec3,ivec4;

//Initialize the vectors
...
//Now doing some calculations on them

ivec1=(ivec1*ivec2)/(ivec3*ivec4);

ivec1=sqrt_sse(ivec1);

//Now storing the results into the final destination

SSE_ALIGNED ( float dest[4] );
ivec1.store(dest);


With utmath_rcx, the following operations happen:
-loading the x, y and z of ivec1, ivec2, ivec3 and ivec4 into m128
-multiplying their x, y and z
-storing m128 into vec3f
-reloading them from vec3f into m128
-performing the division
-storing m128 into vec3f
-reloading the x, y and z of ivec1 into m128
-performing sqrt on m128
-storing m128 into vec3f

With a library similar to Sony's, the following operations happen:
-loading x, y and z into the m128 union at initialization time
-performing the multiplication
-performing the division
-performing sqrt
-finally storing m128 to the destination

This is why the current utmath_rcx is slower than the Sony and Nebula libraries: the extra loads and stores are forced by the current structure of the engine :idea:
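To make the second list concrete, here is a minimal sketch of a vector type that keeps its data in an __m128 from construction until the final store. SseVec and chainExample are hypothetical names used only for illustration; they are not part of utmath, utmath_rcx or the Sony library, and the input values are made up. Whether the intermediate results really stay in registers depends on the compiler.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps, ...
#include <cassert>
#include <cstdint>

// Hypothetical SSE-native vector (not the engine's Vec3f): the data lives in
// an __m128 for its whole lifetime, so chained operations need no float
// round trips between them.
struct SseVec {
    __m128 v;

    // _mm_set_ps takes its arguments from the highest lane to the lowest.
    // The unused fourth lane is set to 1 so the division never sees 0/0.
    SseVec(float x, float y, float z) : v(_mm_set_ps(1.0f, z, y, x)) {}
    explicit SseVec(__m128 m) : v(m) {}

    SseVec operator*(const SseVec& o) const { return SseVec(_mm_mul_ps(v, o.v)); }
    SseVec operator/(const SseVec& o) const { return SseVec(_mm_div_ps(v, o.v)); }

    // The only store: dst must point to 4 floats aligned to 16 bytes.
    void store(float* dst) const {
        assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
        _mm_store_ps(dst, v);
    }
};

inline SseVec sqrt_sse(const SseVec& a) { return SseVec(_mm_sqrt_ps(a.v)); }

// Same chain as the example above: sqrt((a * b) / (c * d)), stored once.
inline void chainExample(float* dst) {
    SseVec a(8.0f, 18.0f, 32.0f), b(2.0f, 2.0f, 2.0f);
    SseVec c(1.0f, 1.0f, 2.0f), d(4.0f, 1.0f, 2.0f);
    a = sqrt_sse((a * b) / (c * d));
    a.store(dst);
}
```

The destination passed to chainExample has to be 16-byte aligned, for example `alignas(16) float dest[4];`, which plays the role of SSE_ALIGNED in the example above.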


Posted: 18.10.2008, 05:23

Joined: 21.08.2008, 11:44
Posts: 354
I'm just a young and inexperienced n00b programmer, but IMHO it's better to change the current structure of the engine. I know this is a nightmare for the engine's main developers, but if you really want to turn the engine from a university project into a professional, cutting-edge engine like C4 and other well-known commercial engines, some changes are needed :|

The current structure of the engine is too simple. Changing it to use m128 types would prevent future nightmares when integrating SSE2, SSE3, SSE4.x, AltiVec or multithreading, or when porting the engine to consoles such as the PS3 and Xbox :idea:

Do you have any silver bullets?


Posted: 19.10.2008, 14:53

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
Obviously there's no such thing as a silver bullet here...


But I did some profiling with the Chicago sample using the default view, without moving the mouse.
All tests were done under Linux on an AMD 64 X2 using gcc / g++ and oprofile, AFTER warm-up in the app. Data source: CPU_CLK_UNHALTED.
GPU: NV7600GT


software skinning, -O2 (inlining only "small" functions)
76.01% ModelNode::updateGeometry
16.88% external library: libxcb-xlib.so
1.89% MeshNode::onPreUpdate
1.86% SceneManager::updateQueuesRec
0.74% JointNode::onPostUpdate
0.63% SceneNode::update
0.6% ModelNode::onPostUpdate
0.36% SceneNode::markChildrenDirty
0.31% CrowdSim::update
0.12% Renderer::drawModels


hardware skinning, -O2 (inlining only "small" functions)
74.31% external library: libXext.so
9.83% SceneManager::updateQueuesRec
3.65% ModelNode::onPostUpdate
3.06% JointNode::onPostUpdate
2.51% SceneNode::update
2.12% CrowdSim::update
1.29% SceneNode::markChildrenDirty
1.08% Frustum::cullBox
0.51% Renderer::drawModels
0.35% SceneNode::setTransform
0.22% ".plt" (unknown / anonymous functions)
0.10% Renderer::calcLightMat

How much could these functions be improved by moving to a SIMD approach? Are there easier / other ways to make it faster?


Last edited by Codepoet on 20.10.2008, 14:05, edited 1 time in total.

Posted: 19.10.2008, 17:39

Joined: 21.08.2008, 11:44
Posts: 354
Thanks a lot, dear Codepoet, for performing those great and accurate tests.
I'm not yet very familiar with the other parts of the engine, but I took a quick look at ModelNode::updateGeometry in egModel.
Code:
      for( uint32 i = 0; i < _morphers.size(); ++i )
      {
         if( _morphers[i].weight > Math::Epsilon )
         {
            MorphTarget &mt = _geometryRes->_morphTargets[_morphers[i].index];
            float weight = _morphers[i].weight;
            
            for( uint32 j = 0; j < mt.diffs.size(); ++j )
            {
               MorphDiff &md = mt.diffs[j];
               VertexData &vd = *_geometryRes->getVertData();
               
               vd.positions[md.vertIndex] += md.posDiff * weight;
               vd.normals[md.vertIndex] += md.normDiff * weight;
               vd.tangents[md.vertIndex] += md.tanDiff * weight;
               vd.bitangents[md.vertIndex] += md.bitanDiff * weight;
            }
         }
      }
Optimizing utmath will affect this for loop; as you can see, it uses many Vec3f objects.
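As a rough sketch of what the inner morph loop could look like with SSE, assuming the vectors were padded to four floats (the engine's real Vec3f has only x, y and z, so this layout is an assumption), one load, one multiply-add and one store would replace the three scalar operations per vector. Vec4, Diff and applyMorph are hypothetical names, and only the position stream is shown; normals, tangents and bitangents would follow the same pattern.

```cpp
#include <xmmintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical padded vector; the fourth float is unused padding.
struct Vec4 { float v[4]; };

// Hypothetical stand-in for MorphDiff, reduced to the position difference.
struct Diff {
    std::uint32_t vertIndex;
    Vec4 posDiff;
};

// Computes positions[d.vertIndex] += d.posDiff * weight for every diff.
inline void applyMorph(Vec4* positions, const Diff* diffs,
                       std::size_t count, float weight) {
    // Broadcast the weight into all four lanes once, outside the loop.
    const __m128 w = _mm_set1_ps(weight);
    for (std::size_t i = 0; i < count; ++i) {
        float* dst = positions[diffs[i].vertIndex].v;
        // Unaligned loads and stores, so no alignment assumption is needed.
        const __m128 p = _mm_loadu_ps(dst);
        const __m128 d = _mm_loadu_ps(diffs[i].posDiff.v);
        _mm_storeu_ps(dst, _mm_add_ps(p, _mm_mul_ps(d, w)));
    }
}
```

Because the diffs are indexed by vertIndex, the memory accesses are scattered, so the gain depends on the data; it would have to be measured.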
Code:
         for( uint32 j = 0; j < 4; ++j )
         {
            uint32 ind0 = (uint32)vd.staticData[i].jointVec[j] * 3 + 0;
            uint32 ind1 = ind0 + 1, ind2 = ind0 + 2;
            
            mat.x[0] = _skinMatRows[ind0].x;
            mat.x[1] = _skinMatRows[ind1].x;
            mat.x[2] = _skinMatRows[ind2].x;
            mat.x[4] = _skinMatRows[ind0].y;
            mat.x[5] = _skinMatRows[ind1].y;
            mat.x[6] = _skinMatRows[ind2].y;
            mat.x[8] = _skinMatRows[ind0].z;
            mat.x[9] = _skinMatRows[ind1].z;
            mat.x[10] = _skinMatRows[ind2].z;
            mat.x[12] = _skinMatRows[ind0].w;
            mat.x[13] = _skinMatRows[ind1].w;
            mat.x[14] = _skinMatRows[ind2].w;

            if( j == 0) skinningMat = mat * vd.staticData[i].weightVec[j];
            else skinningMat += mat * vd.staticData[i].weightVec[j];
         }
And here are some Matrix4f operations again!
Overall, the cost of this function comes almost entirely from utmath, and there is not much left to optimize in the function body itself.
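The weighted sum of joint matrices in that loop could be sketched with SSE roughly as follows. Mat4 is a hypothetical stand-in for Matrix4f (16 floats, as suggested by the mat.x[0..14] accesses above), and blendMatrices is an illustrative name, not an engine function. Each matrix is treated as four groups of four floats, so one multiply and one add per group replace sixteen scalar multiply-adds.

```cpp
#include <xmmintrin.h>

// Hypothetical stand-in for Matrix4f: 16 contiguous floats.
struct Mat4 { float x[16]; };

// Returns sum over j of mats[j] * weights[j], accumulated in four registers.
inline Mat4 blendMatrices(const Mat4* mats, const float* weights, int count) {
    __m128 c0 = _mm_setzero_ps();
    __m128 c1 = _mm_setzero_ps();
    __m128 c2 = _mm_setzero_ps();
    __m128 c3 = _mm_setzero_ps();

    for (int j = 0; j < count; ++j) {
        // Broadcast the joint weight into all four lanes.
        const __m128 w = _mm_set1_ps(weights[j]);
        c0 = _mm_add_ps(c0, _mm_mul_ps(_mm_loadu_ps(mats[j].x + 0), w));
        c1 = _mm_add_ps(c1, _mm_mul_ps(_mm_loadu_ps(mats[j].x + 4), w));
        c2 = _mm_add_ps(c2, _mm_mul_ps(_mm_loadu_ps(mats[j].x + 8), w));
        c3 = _mm_add_ps(c3, _mm_mul_ps(_mm_loadu_ps(mats[j].x + 12), w));
    }

    // Store once, after all joints have been accumulated.
    Mat4 out;
    _mm_storeu_ps(out.x + 0, c0);
    _mm_storeu_ps(out.x + 4, c1);
    _mm_storeu_ps(out.x + 8, c2);
    _mm_storeu_ps(out.x + 12, c3);
    return out;
}
```

Starting from zero and always adding gives the same sum as the `j == 0` special case in the original loop. The per-element copies from _skinMatRows into mat are not addressed by this sketch.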

The engine has to change the way it performs operations on vectors. Their usage should change to something like this:

Code:
Vec3f ivec;
ivec+=sqrt(ivec);
instead of :
Code:
Vec3f ivec;
ivec.x+=sqrt(ivec.x);
ivec.y+=sqrt(ivec.y);
ivec.z+=sqrt(ivec.z);
The first form can be implemented with m128 unions.

I'm only free on Wednesdays, Thursdays and Fridays; on the other days of the week I'm busy with university.


Posted: 20.10.2008, 14:47

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
After some more profiling with valgrind / callgrind, I realised that gcc DOES inline at -O2, but only "small" functions.

With hardware skinning and the compile options -O2 -g -fno-inline, it looks like this:
Times are the percentage of time spent in each function itself, excluding the functions it calls.
Code:
 14.81   Horde3D/Source/Horde3DEngine/egScene.cpp:SceneManager::updateQueuesRec(Frustum const&, Frustum const*, bool, SceneNode&, bool, bool)'2
  8.54   Horde3D/Source/Horde3DEngine/../Shared/utMath.h:Matrix4f::fastMult(Matrix4f const&, Matrix4f const&)
  4.98   Horde3D/Source/Horde3DUtils/../Shared/utMath.h:Matrix4f::operator*(Matrix4f const&) const
  4.26   Horde3D/Samples/Chicago/crowd.cpp:CrowdSim::update(float)
  2.44   Horde3D/Source/Horde3DEngine/egPrimitives.cpp:Frustum::cullBox(BoundingBox&) const
  1.64   Horde3D/Source/Horde3DEngine/egScene.cpp:SceneNode::update()'2
  1.41   Horde3D/Source/Horde3DEngine/egModel.cpp:ModelNode::onPostUpdate()
  1.24   /usr/include/c++/4.2/bits/stl_vector.h:std::vector<SceneNode*, std::allocator<SceneNode*> >::size() const
  1.17   /usr/include/c++/4.2/bits/stl_vector.h:std::vector<SceneNode*, std::allocator<SceneNode*> >::operator
  1.06   Horde3D/Source/Horde3DEngine/egModel.h:ModelNode::setSkinningMat(unsigned, Matrix4f const&)
  1.02   Horde3D/Source/Horde3DEngine/../Shared/utMath.h:Matrix4f::getRow(unsigned) const
  0.92   Horde3D/Source/Horde3DEngine/egScene.cpp:SceneNode::markChildrenDirty()'2
  0.88   Horde3D/Source/Horde3DEngine/egAnimatables.cpp:JointNode::onPostUpdate()
  0.59   Horde3D/Source/Horde3DEngine/../Shared/utMath.h:Vec3f::operator*(Vec3f const&) const
  0.58   /usr/include/c++/4.2/bits/stl_vector.h:std::vector<Particle, std::allocator<Particle> >::size() const
  0.56   Horde3D/Source/Horde3DEngine/egRenderer.cpp:Renderer::drawModels(std::string const&, std::string const&, bool, Frustum const*, Frustum const*, RenderingOrder::List, int)
  0.56   Horde3D/Source/Horde3DUtils/../Shared/utMath.h:Matrix4f::inverted() const
  0.39   Horde3D/Source/Horde3DEngine/../Shared/utMath.h:Plane::distToPoint(Vec3f const&) const
  0.38   /usr/include/c++/4.2/bits/stl_vector.h:std::vector<Particle, std::allocator<Particle> >::operator
  0.33   /usr/include/c++/4.2/bits/stl_iterator.h:bool __gnu_cxx::operator==<Frame const*, std::vector<Frame, std::allocator<Frame> > >(__gnu_cxx::__normal_iterator<Frame const*, std::vector<Frame, std::allocator<Frame> > > const&, __gnu_cxx::__normal_iterator<Frame const*, std::vector<Frame, std::allocator<Frame> > > const&)
  0.33   /usr/include/c++/4.2/bits/stl_vector.h:std::vector<Frame, std::allocator<Frame> >::empty() const
  0.33   Horde3D/Source/Horde3DEngine/egAnimatables.cpp:AnimatableSceneNode::onPreUpdate()
  0.32   Horde3D/Source/Horde3DUtils/../Shared/utMath.h:Vec4f::Vec4f(float, float, float, float)
  0.30   Horde3D/Source/Horde3DEngine/egPrimitives.h:BoundingBox::transform(Matrix4f const&)
  0.27   Horde3D/Source/Horde3DEngine/egModel.h:ModelNode::jointExists(unsigned)
  0.27   /usr/include/c++/4.2/bits/stl_vector.h:std::vector<Frame, std::allocator<Frame> >::size() const
  0.26   /usr/include/c++/4.2/bits/stl_vector.h:std::vector<Vec4f, std::allocator<Vec4f> >::operator


Posted: 20.10.2008, 16:32

Joined: 21.08.2008, 11:44
Posts: 354
Thanks a lot, dear Codepoet. I'll have a look at these CPU-consuming functions and get familiar with the other parts of the engine, to find a silver bullet to pop the dragon's heart :wink:

