I was working on a similar shader at work recently (in HLSL, but I'll use Horde's GLSL to illustrate).
Our skinning shader worked the same as Horde's, in that it contained a uniform array of individual columns. When I took over maintaining the shader, I decided to convert it to store 3x4 matrices instead (in HLSL you've got the column_major and row_major keywords to control packing, so that both mat3x4 and mat4x3 can be stored in 3 registers).
So I started out with something like:
Code:
return mat4( skinMatRows[jointIndex * 3],
skinMatRows[jointIndex * 3 + 1],
skinMatRows[jointIndex * 3 + 2],
vec4( 0, 0, 0, 1 ) );
And ended up with something like:
Code:
return mat4( skinMats[jointIndex],
vec4( 0, 0, 0, 1 ) );
When I compiled both versions and checked the assembly code, they were practically identical: the compiler had taken the second (matrix array) version and added the "*3", "+1" and "+2" into it, in order to access the individual registers storing the parts of the matrix anyway.
Assuming that GLSL compilers are as good as the HLSL one, it's probably not a concern performance-wise.
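In the end, I kept the first version of the code (the one indexing individual vec4s), but I stripped out the "* 3" part and edited our model exporter to pre-multiply all joint indices by 3 when they're exported, which saves a few cycles per vertex. The shader then just reads skinMatRows[jointIndex], skinMatRows[jointIndex + 1] and skinMatRows[jointIndex + 2].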
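To illustrate the exporter side, here is a minimal sketch of the pre-multiplication step. Our actual exporter isn't shown here, so the function name, the constant and the list-of-ints layout are all made up for the example; the only thing taken from the above is the idea of scaling each joint index by 3 at export time.

```python
# Hypothetical exporter step (names and data layout are assumptions):
# scale each joint index by 3 so the shader can index the vec4 array
# directly, reading rows at i, i + 1 and i + 2 without a multiply.
ROWS_PER_JOINT = 3  # a 3x4 skinning matrix occupies three vec4 registers


def premultiply_joint_indices(joint_indices, rows_per_joint=ROWS_PER_JOINT):
    """Return the joint indices converted to offsets into the row array."""
    return [index * rows_per_joint for index in joint_indices]


# Example: one vertex influenced by joints 0, 2, 5 and 7.
vertex_joints = [0, 2, 5, 7]
print(premultiply_joint_indices(vertex_joints))  # [0, 6, 15, 21]
```

With indices stored this way, joint 2 starts at element 6 of the row array, which is where the "* 3" in the original shader would have landed.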
