Speeding up transform calculations

Question

I am programming an OpenGL 3 2D engine and am currently trying to resolve a bottleneck. Please see the following output of the AMD profiler: http://h7.abload.de/img/profilerausa.png

The data was made using several thousand sprites.

However, at 50,000 sprites the test app is already unusable at 5 fps.

This shows that my bottleneck is the transform function I use. This is the corresponding function: http://code.google.com/p/nightlight2d/source/browse/NightLightDLL/NLBoundingBox.cpp#130

void NLBoundingBox::applyTransform(NLVertexData* vertices) 
{
    if ( needsTransform() )
    {
        // Apply Matrix
        for ( int i=0; i<6; i++ )
        {
            glm::vec4 transformed = m_rotation * m_translation * glm::vec4(vertices[i].x, vertices[i].y, 0, 1.0f);
            vertices[i].x = transformed.x;
            vertices[i].y = transformed.y;
        }
        m_translation = glm::mat4(1);
        m_rotation    = glm::mat4(1);
        m_needsTransform = false;
    }
}

I can't do that in the shader because I am batching all sprites at once. That means I have to use the CPU to calculate transforms.

My question is: what is the best way to solve this bottleneck?

I don't use any threads at the moment, so when I enable vsync I take an extra performance hit, because it waits for the screen to finish. That tells me I should use threading.

Another option might be OpenCL. I want to avoid CUDA because, as far as I know, it only runs on NVIDIA cards. Is that right?

post scriptum:

You can download a demo here, if you like:

http://www63.zippyshare.com/v/45025690/file.html

Please note that this requires VC++ 2008 to be installed, because it is a debug build for running a profiler.

Of course you can use a shader, batching will not prevent this. — datenwolf, Aug 02 '11 at 19:52
Please elaborate on that. What I mean is that I cannot calculate the transform for every sprite in the shader, since I draw all sprites at once with one draw call. — , Aug 02 '11 at 20:02
As an aside, the if block checking needsTransform() is a bit of a code smell. Whether or not you *should* transform is a high level concern, distinct from the low level concern of its implementation. — Tom Kerr, Aug 02 '11 at 20:17
In the above code you have exactly one common transform that's applied to all sprites. If the transformation is per sprite, it could either be treated as an additional vertex attribute, with the transformation value applied to all vertices of a sprite, or you could use a transformation index (again a vertex attribute) into a uniform buffer. Instancing makes things even more concise. — datenwolf, Aug 02 '11 at 20:32
What happens if you don't call this function at all, or just make it a straight copy? Even for 200,000 vertices, I don't see this being your major bottleneck. — Nicol Bolas, Aug 02 '11 at 22:08
If I render 50k sprites without calling this, I get 50 fps, so 45 fps more. VSYNC off. — , Aug 02 '11 at 22:15
I see you have accepted an answer. However, I noticed you are using vec4 and mat4. As you are working in 2D, you only need vec2 (or vec3 if you want a basic z-depth), and mat3 is enough for 2D transformations. See [Why do 2D transformations need 3x3 matrices?](http://stackoverflow.com/questions/10698962/why-do-2d-transformations-need-3x3-matrices), which explains what I mean. A 3x3 matrix may not actually be faster, due to SSE optimization of vec4 and mat4, but it might be worth testing if you are still having issues. — Crog, Sep 04 '13 at 15:51

score 4 · Accepted Answer · edited May 23 '17 at 09:58

The first thing I would do is concatenate your rotation and translation matrices into one matrix before you enter the for-loop. That way you aren't computing a matrix-matrix multiplication followed by a matrix-vector multiplication on every iteration; instead you would only be multiplying a single vector by a single matrix. Secondly, you may want to look into unrolling your loop and then compiling with a higher optimization level (on g++ I would use at least -O2, but I'm not familiar with MSVC, so you'll have to translate that optimization level yourself). That would avoid any overhead that branches in the code might incur, especially from mispredicted branches. Lastly, if you haven't already looked into it, check into doing some SSE optimizations, since you're dealing with vectors.
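To make the first suggestion concrete, here is a minimal sketch of hoisting the matrix product out of the loop. It does not use GLM: `Vec4`, `Mat4`, `Vertex`, `multiply`, `makeTranslation` and `makeRotationZ` are hypothetical stand-ins invented for this example so that it compiles on its own, and the storage is row-major, which differs from GLM's column-major layout. The multiplication order matches the question's `m_rotation * m_translation`.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical stand-ins for glm::vec4 / glm::mat4 so the sketch builds alone.
struct Vec4 { float x, y, z, w; };
struct Mat4 { std::array<float, 16> m; };  // row-major: m[row * 4 + col]

// Matrix * matrix: 64 multiplications, so it should run once per sprite.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};  // zero-initialized accumulator
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r.m[row * 4 + col] += a.m[row * 4 + k] * b.m[k * 4 + col];
    return r;
}

// Matrix * column vector: 16 multiplications per vertex.
Vec4 multiply(const Mat4& a, const Vec4& v) {
    const float in[4] = {v.x, v.y, v.z, v.w};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int row = 0; row < 4; ++row)
        for (int k = 0; k < 4; ++k)
            out[row] += a.m[row * 4 + k] * in[k];
    return Vec4{out[0], out[1], out[2], out[3]};
}

Mat4 makeTranslation(float tx, float ty) {
    return Mat4{{1.0f, 0.0f, 0.0f, tx,
                 0.0f, 1.0f, 0.0f, ty,
                 0.0f, 0.0f, 1.0f, 0.0f,
                 0.0f, 0.0f, 0.0f, 1.0f}};
}

Mat4 makeRotationZ(float radians) {
    const float c = std::cos(radians);
    const float s = std::sin(radians);
    return Mat4{{c,    -s,   0.0f, 0.0f,
                 s,    c,    0.0f, 0.0f,
                 0.0f, 0.0f, 1.0f, 0.0f,
                 0.0f, 0.0f, 0.0f, 1.0f}};
}

struct Vertex { float x, y; };  // hypothetical stand-in for NLVertexData

void applyTransform(Vertex* vertices, std::size_t count,
                    const Mat4& rotation, const Mat4& translation) {
    // Combine once, outside the loop (same order as the question's code).
    const Mat4 combined = multiply(rotation, translation);
    for (std::size_t i = 0; i < count; ++i) {
        const Vec4 t = multiply(combined,
                                Vec4{vertices[i].x, vertices[i].y, 0.0f, 1.0f});
        vertices[i].x = t.x;
        vertices[i].y = t.y;
    }
}
```

With six vertices per sprite this replaces six matrix-matrix products with one. As the comments under this answer report, that alone gained only about one frame per second in the asker's test.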

UPDATE: I'm going to add one last idea that involves threading: pipeline your vertices. For instance, say you have a machine with eight available CPU threads (i.e., quad-core with hyper-threading). Set up six threads for the vertex pipeline processing, and use non-locking single-producer/single-consumer queues to pass messages between stages of the pipeline. Each stage will transform a single member of your six-member vertex array. I'm guessing there are a bunch of these six-member vertex arrays, so if they are set up as a stream that is passed through the pipeline, you can process the stream very efficiently and avoid mutexes and other locking primitives. For more info on a fast non-locking single-producer/consumer queue, see my answer here.
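For readers unfamiliar with the queue type mentioned above, the following is a minimal sketch of a bounded single-producer/single-consumer ring buffer. It is not the implementation from the linked answer; the class name `SpscQueue` and its interface are assumptions made for this example, and it relies on C++11 `std::atomic`, which the VC++ 2008 toolchain from the question does not provide.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Bounded lock-free queue that is safe for exactly one producer thread
// calling push() and exactly one consumer thread calling pop().
// One slot is kept empty to tell "full" from "empty", so it holds at most
// Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    // Producer side. Returns false when the queue is full.
    bool push(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // full
        slots_[head] = item;
        // Release: the slot write above becomes visible before the new head.
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the queue is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // empty
        out = slots_[tail];
        // Release: the slot read above finishes before the slot is reused.
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write (producer-owned)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer-owned)
};
```

In a pipeline, each stage would own one such queue as its input: the previous stage is the only producer and the stage itself is the only consumer, which is what lets the queue get by without a mutex.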

UPDATE 2: You only have a dual-core processor ... so dump the pipeline idea since it's going to run into bottlenecks as each thread contends for CPU resources.

answered Aug 02 '11 at 19:12 by Jason · last edited by Community

This is a profile from the optimized binary: http://h3.abload.de/img/opt_profile1d84.png It does not really bring the time down. Also, I have moved the multiplication of the two matrices out of the loop (how could I overlook that -.-), but it still does not really help. With 50k sprites, all moving and rotating every frame, I have gained 1 frame and a couple of ms. Nothing big. I also disabled RTTI. – Aug 02 '11 at 19:38
Sorry to hear you only got a single fps out of this ... what kind of machine are you running this on? Also, are your rotation and translation matrices static enough that you can actually cache the product of the two matrices (i.e., you only multiply them once in your actual `NLBoundingBox` class instance)? Can you cache any other values, such as the transforms themselves? – Jason Aug 02 '11 at 20:10
AMD5200 DualCore with an ATI890HD 1GB RAM. No, I cannot cache it further. m_rotation and m_translation are the matrices for, well, rotation and movement, and in this test case every sprite is moved and rotated to stress test it. And every sprite has its own transform and position ofc. When they are not moving or rotating at all, I have about 50 fps without vsync. http://tiny.cc/dt5f9 – Aug 02 '11 at 20:18
What video card is an ATI890HD? ... BTW, with only a dual-core, you could try the pipeline, but there would be some contention, i.e., you'd end up with a lot of stalls – Jason Aug 02 '11 at 20:23
HD4890, my 4 is not working well anymore >_>. Need a new keyboard. – Aug 02 '11 at 21:02
So yeah, since it seems the CPU optimization route isn't working too well, I would guess your only option for some type of order-of-magnitude speed-up would be to use OpenCL. This kernel is pretty tiny and shouldn't be an issue to port over. Unfortunately my understanding of OpenCL is a bit limited, so I can't tell you what type of contention you may see at the driver level between OpenGL and OpenCL. Simply doing a transform on 50K objects, though, should be pretty straightforward for a GPU, I would think. – Jason Aug 02 '11 at 21:22
BTW, I was reading datenwolf's comment ... OpenCL may be too heavy a hammer for this ... I'm no expert on shaders, but his comments do seem like an attractive solution. – Jason Aug 02 '11 at 21:28

score 2 · Answer 2 · answered Aug 02 '11 at 22:17

I can't do that in the shader because I am batching all sprites at once. That means I have to use the CPU to calculate transforms.

That sounds suspiciously like a premature optimization you made, under the assumption that batching is the most important thing you can do, and you therefore structured your renderer around making the fewest number of draw calls. And now it's coming back to bite you.

What you need to do is not have fewer batches. You need to have the right number of batches. You know you've gone too far with batching when you forgo GPU vertex transforms in favor of CPU transforms.

As Datenwolf suggested, you need to get some instancing happening to get the transformation back on the GPU. But even then, you need to undo some of the over-batching you've got here. You haven't spoken much about what kind of scene you're rendering (tilemaps with sprites on top, a large particle system, etc), so it's hard to know what to suggest.
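As a rough illustration of what moving the transform back to the GPU with instancing could look like, the sketch below keeps one position and one angle per sprite and lets the vertex shader rotate and translate a shared quad. Everything named here is an assumption made for the example: the `SpriteInstance` struct, the attribute locations, the `projection` uniform and the helper `packInstanceData` are not taken from the asker's engine. The GL calls are only described in comments so that the block compiles without GL headers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-sprite state: only 3 floats are uploaded per sprite
// instead of 6 CPU-transformed vertices.
struct SpriteInstance { float x, y, angle; };

// GLSL 3.30 vertex shader (assumed attribute layout and uniform name).
// 'corner' advances per vertex; 'spritePos' and 'spriteAngle' advance per
// instance once glVertexAttribDivisor(location, 1) is set for them.
const char* const kInstancedSpriteVS = R"glsl(
#version 330 core
layout(location = 0) in vec2 corner;
layout(location = 1) in vec2 spritePos;
layout(location = 2) in float spriteAngle;
uniform mat4 projection;
void main() {
    float c = cos(spriteAngle);
    float s = sin(spriteAngle);
    vec2 rotated = vec2(c * corner.x - s * corner.y,
                        s * corner.x + c * corner.y);
    gl_Position = projection * vec4(rotated + spritePos, 0.0, 1.0);
}
)glsl";

// Flattens the per-sprite data into the float array that would be uploaded
// to the instance buffer (e.g. with glBufferData) each frame.
std::vector<float> packInstanceData(const std::vector<SpriteInstance>& sprites) {
    std::vector<float> packed;
    packed.reserve(sprites.size() * 3);
    for (const SpriteInstance& s : sprites) {
        packed.push_back(s.x);
        packed.push_back(s.y);
        packed.push_back(s.angle);
    }
    return packed;
}

// Drawing would then be a single call such as
//   glDrawArraysInstanced(GL_TRIANGLES, 0, 6, spriteCount);
// which requires OpenGL 3.1 (3.3 for glVertexAttribDivisor in core).
```

This keeps a single draw call per batch while the per-frame CPU work shrinks to writing three floats per sprite. Whether it fits depends on details the question does not give, such as per-sprite texture coordinates and sprite sizes.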

Also, GLM is a fine math library, but it is not designed for maximum performance. It generally isn't what I would use if I needed to transform 300,000 vertices on the CPU every frame.

Answer 3 · answered Aug 02 '11 at 20:34 · Tom Kerr

The assignment inside the loop could be a problem, though I'm not familiar with the library. Moving the declaration outside the for loop and doing the field assignments manually might help. Moving the matrix multiplication outside the loop would help as well.

Edit:

This is more along the lines of what I was thinking.

// Apply Matrix
glm::vec4 transformed;
// Combine the two matrices once, outside the loop
glm::mat4 combined = m_rotation * m_translation;
for ( int i=0; i<6; i++ )
{
    transformed.x = vertices[i].x;
    transformed.y = vertices[i].y;
    transformed.z = 0.f; // 2D sprite, as in the question's code
    transformed.w = 1.f; // ?
    // I can't find docs, but assume they have an in-place multiply;
    // until then, operator* does the same job
    transformed = combined * transformed;
    vertices[i].x = transformed.x;
    vertices[i].y = transformed.y;
}

Maybe, just maybe, the assignment is keeping the compiler from inlining or unrolling something. I kind of guess that the multiply is hefty enough to bump this out of the instruction cache though. And really, if you start talking about the sizes of caches, you aren't going to be resilient across many platforms.

You could try to duplicate some stack and make more, smaller loops.

glm::vec4 transformed[6];
// Loop 1: copy the inputs
for (size_t i = 0; i < 6; i++) {
    transformed[i].x = vertices[i].x;
    transformed[i].y = vertices[i].y;
    transformed[i].z = 0.f; // 2D sprite, as in the question's code
    transformed[i].w = 1.f; // ?
}
// Combine the two matrices once
glm::mat4 combined = m_rotation * m_translation;
// Loop 2: transform
for (size_t i = 0; i < 6; i++) {
    // I can't find docs, but assume they have an in-place multiply;
    // until then, operator* does the same job
    transformed[i] = combined * transformed[i];
}
// Loop 3: write the results back
for (size_t i = 0; i < 6; i++) {
    vertices[i].x = transformed[i].x;
    vertices[i].y = transformed[i].y;
}

As Jason mentioned, unrolling these loops manually could be interesting.

I really don't think that you'll see an order of magnitude improvement on any of these changes, though.

I suspect that calling this function less is more important than making this function faster. The fact that you have this needsTransform check inside of this function makes me think that this is probably relevant.

When you have high level concerns like this in your low level code, you end up just blindly calling this method over and over, thinking that it is free. Your assumptions about how often needsTransform is true could be wildly incorrect.

The reality is that you should just be calling this method once. You should applyTransform when you want to applyTransform. You shouldn't call applyTransform when you might want to applyTransform. Interfaces should be a contract; treat them as such.
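One possible way to read this advice, sketched under assumptions: the decision about which sprites need work lives in the caller, which keeps a list of sprites that actually changed, and the low level transform runs unconditionally on exactly those. The `Sprite` and `SpriteBatch` types below are hypothetical and use a plain translation to stay short; they are not the asker's classes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sprite with a pending move that has not been applied yet.
struct Sprite {
    float x = 0.0f, y = 0.0f;
    float pendingDx = 0.0f, pendingDy = 0.0f;
    bool queued = false;  // already in the dirty list?
};

class SpriteBatch {
public:
    std::size_t add(float x, float y) {
        Sprite s;
        s.x = x;
        s.y = y;
        sprites_.push_back(s);
        return sprites_.size() - 1;
    }

    // High level: record the change and remember that this sprite needs work.
    void move(std::size_t index, float dx, float dy) {
        Sprite& s = sprites_[index];
        s.pendingDx += dx;
        s.pendingDy += dy;
        if (!s.queued) {
            s.queued = true;
            dirty_.push_back(index);
        }
    }

    // Called once per frame. Visits only sprites that changed and returns
    // how many transforms were applied.
    std::size_t applyPendingTransforms() {
        const std::size_t applied = dirty_.size();
        for (std::size_t index : dirty_)
            applyTransform(sprites_[index]);
        dirty_.clear();
        return applied;
    }

    const Sprite& sprite(std::size_t index) const { return sprites_[index]; }

private:
    // Low level: no "do I need to run?" check, it always does the work.
    static void applyTransform(Sprite& s) {
        s.x += s.pendingDx;
        s.y += s.pendingDy;
        s.pendingDx = 0.0f;
        s.pendingDy = 0.0f;
        s.queued = false;
    }

    std::vector<Sprite> sprites_;
    std::vector<std::size_t> dirty_;
};
```

In the asker's stress test every sprite moves every frame, so a scheme like this would not reduce the work there; it only helps when most sprites are idle.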

I have moved the 2 matrices calculations out of the loop, m_rotation * m_translation, but still only 1 frame won. — , Aug 02 '11 at 19:39
"I can't find docs, but assume they have an in-place multiply" An in-place vector/matrix multiply would have to create a temporary to store the result, then copy it back into the original. Or it would copy the original and multiply into the original. Either way, you gain nothing compared to using operator*. Copy-elision should save you from any excess copying of the returned temporary. Both methods would have a temporary and would do a copy, so both methods are effectively equivalent. — Nicol Bolas, Aug 02 '11 at 22:15
@Nicol I would expect that to be the case, honestly. It's hard to do anything with this but guess at things to try since we only have a snippet. :) — Tom Kerr, Aug 03 '11 at 14:13

score 1 · Answer 4 · answered Aug 03 '11 at 00:18

If you insist on doing your calculations on the CPU, you should do the math yourself.

Right now, you're using 4x4 matrices in a 2D environment, where one 2x2 matrix for rotation and a simple vector for translation should suffice. That's 4 multiplications and 2 additions for rotation, plus 2 additions for translation.

If you absolutely need two matrices (because you need to combine translation and rotation), it'll still be a lot less than what you have now. But you can also "manually" combine these two by moving the position of the vector, rotating, and then moving it back again, which might be a little faster than the multiplications, although I'm not sure about that.

Compared to the operations that those 4x4 matrices do right now, that's a lot less.
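A minimal sketch of doing the math by hand, assuming each sprite is rotated about the origin and then translated. Note that this order is an assumption and differs from the question's `m_rotation * m_translation`, which translates first. The `Vertex2D` struct and the function name are hypothetical.

```cpp
#include <cmath>
#include <cstddef>

struct Vertex2D { float x, y; };  // hypothetical stand-in for NLVertexData

// Rotates each vertex about the origin by 'radians', then moves it by
// (tx, ty). sin/cos are evaluated once per sprite; each vertex then costs
// four multiplications and four additions.
void rotateThenTranslate(Vertex2D* vertices, std::size_t count,
                         float radians, float tx, float ty) {
    const float c = std::cos(radians);
    const float s = std::sin(radians);
    for (std::size_t i = 0; i < count; ++i) {
        const float x = vertices[i].x;
        const float y = vertices[i].y;
        vertices[i].x = c * x - s * y + tx;
        vertices[i].y = s * x + c * y + ty;
    }
}
```

For comparison, a general 4x4 matrix times a 4-component vector takes 16 multiplications and 12 additions, before counting any matrix-matrix product.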

+1 ... this is a good idea, although rather than 2x2 matrices and a transform vector, it would probably be better to use a 3x3 matrix with homogeneous coordinates to maintain the linearity of rotations and translations. In other words if you use a 2x2 rotation matrix and a translation vector, you would have to use a very specific ordering to reverse transforms, and you lose the ability to concatenate rotations and translations. 3x3 homogeneous matrices will maintain the linearity of both translation and rotations, and you can concatenate any series of transforms into a single matrix. — Jason, Aug 04 '11 at 01:52
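The 3x3 homogeneous approach from the comment above can be sketched as follows. `Mat3`, `Point2` and the helper functions are hypothetical, written only so the example is self-contained (GLM's own `glm::mat3` would normally fill this role), and the storage is row-major with column vectors.

```cpp
#include <cmath>

struct Mat3 { float m[3][3]; };  // row-major, used with column vectors
struct Point2 { float x, y; };

// Concatenates two transforms: the result applies b first, then a.
Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 r{};  // zero-initialized accumulator
    for (int row = 0; row < 3; ++row)
        for (int col = 0; col < 3; ++col)
            for (int k = 0; k < 3; ++k)
                r.m[row][col] += a.m[row][k] * b.m[k][col];
    return r;
}

Mat3 makeRotation(float radians) {
    const float c = std::cos(radians);
    const float s = std::sin(radians);
    return Mat3{{{c, -s, 0.0f}, {s, c, 0.0f}, {0.0f, 0.0f, 1.0f}}};
}

Mat3 makeTranslation(float tx, float ty) {
    return Mat3{{{1.0f, 0.0f, tx}, {0.0f, 1.0f, ty}, {0.0f, 0.0f, 1.0f}}};
}

// Applies an affine matrix to the point (x, y, 1). The last row of an affine
// matrix is (0, 0, 1), so only the top two rows need to be evaluated.
Point2 transformPoint(const Mat3& a, Point2 p) {
    return Point2{a.m[0][0] * p.x + a.m[0][1] * p.y + a.m[0][2],
                  a.m[1][0] * p.x + a.m[1][1] * p.y + a.m[1][2]};
}
```

Any chain of rotations and translations collapses into one `Mat3` through repeated `multiply` calls, and the order of the arguments decides which transform is applied first.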
