Be warned: this is highly technical stuff and probably only useful to other devs, if at all. But it took me so long to figure out that it would be a shame not to preserve it somewhere for later reference. A wiki would be the perfect place for stuff like this, but oh well.
To understand this post, some familiarity with GPUView is required. It's a great tool but not exactly intuitive to use. Here's an introduction by its programmer: http://graphics.stan...er/GPUView.html
So here's a GPUView capture of VP9_DX9_test7 when starting a particular table:
[Attached screenshot: Screenshot 2014-03-07 22.54.14.png]
The red vertical lines mark the Present() calls, i.e., basically the times when a frame is rendered. We want the intervals between them to be as even as possible in order to have constant FPS and smooth gameplay. In the left part of the screenshot, this is not the case: there are often gaps between the regular intervals. In the right part, everything is completely even and smooth.
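If you'd rather check this numerically than eyeball it, something like the following could work. It's just a rough sketch: the function name present_gaps, the 50% tolerance and the timestamps are all made up for illustration, and it assumes you have somehow exported the Present() times as a list of milliseconds.

```python
from statistics import median

def present_gaps(present_times_ms, tolerance=0.5):
    """Return (intervals, late) for a list of Present() timestamps in ms.

    'late' holds the indices of intervals that are more than
    'tolerance' (as a fraction) longer than the median interval.
    Both the name and the threshold are illustrative assumptions.
    """
    # Interval i is the time between Present call i and call i+1.
    intervals = [b - a for a, b in zip(present_times_ms, present_times_ms[1:])]
    if not intervals:
        return [], []
    typical = median(intervals)
    late = [i for i, dt in enumerate(intervals) if dt > typical * (1 + tolerance)]
    return intervals, late

# Hypothetical timestamps: roughly 60 FPS with one skipped beat.
times = [0.0, 16.7, 33.4, 66.8, 83.5, 100.2]
intervals, late = present_gaps(times)
print(late)  # indices of the intervals felt as stutter
```

For the "bad" left part of the capture you would expect a handful of late intervals; for the "good" right part the list should be empty.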
So what happened between the "bad" and the "good" part? The answer is: a ball was launched onto the table.
To understand why this smooths out the framerate, we have to look at the reddish/purple peaks and valleys. These are the command buffers VP submits to the D3D runtime. Since VP is GPU-bound in its DX9 incarnation, the CPU has an easy time completely filling the existing buffer, so up to 4 frames are pre-rendered at the start of the above chart. This manifests itself in the high reddish peak in the left part. Now, this part I still don't understand completely, but I assume that somehow when this buffer gets too congested, either VP or the D3D runtime skips a beat and a laggy frame is introduced. These are the gaps between Present calls which are felt as stutter.
In Test7, balls still used a static vertex buffer which was locked every frame, something you shouldn't do in general. The reason is that this stalls the CPU until the vertex buffer in question isn't needed by the GPU anymore. In our graph, however, this kind of turns out to be a good thing: the GPU can catch up while the CPU stalls; only one frame's worth of drawahead is produced; the command buffer queue isn't congested anymore; and we get a stable framerate without stutter.
In Test8, I changed the ball rendering to use a dynamic vertex buffer. This eliminated the pipeline stall and increased the FPS, but now the whole plot would look like the left part above, and we get stutter even after the ball is launched.
So for the upcoming Test9, I introduced a new mechanism which artificially creates a partial pipeline stall and consistently produces behavior similar to the right part of the above chart, but without decreasing the framerate unnecessarily when multiple balls are in play. As an added benefit, this also reduces the latency (input lag).
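To make the tradeoff a bit more concrete, here is a toy model of a GPU-bound CPU/GPU pipeline. To be clear, this is not VP's code and not how Test9 actually does it; the function simulate, the parameter max_in_flight and the timings are assumptions made up for illustration. It only shows why capping the number of frames the CPU may run ahead can cut latency without costing FPS, while a full stall does cost FPS.

```python
def simulate(cpu_ms, gpu_ms, max_in_flight, frames=50):
    """Toy model: return (steady-state frame time, steady-state latency) in ms.

    max_in_flight is the number of frames that may be submitted by the CPU
    but not yet finished by the GPU (an assumed stand-in for the driver's
    pre-rendered frame queue). Latency is measured from the moment the CPU
    starts a frame to the moment the GPU finishes it.
    """
    cpu_done, gpu_done, latencies = [], [], []
    for i in range(frames):
        # The CPU starts the next frame as soon as it finished the last one...
        cpu_start = cpu_done[-1] if cpu_done else 0
        # ...unless the queue is full: then it stalls until an older frame retires.
        if i >= max_in_flight:
            cpu_start = max(cpu_start, gpu_done[i - max_in_flight])
        cpu_end = cpu_start + cpu_ms
        # The GPU needs the frame's commands and must be done with the previous frame.
        gpu_start = max(cpu_end, gpu_done[-1] if gpu_done else 0)
        gpu_end = gpu_start + gpu_ms
        cpu_done.append(cpu_end)
        gpu_done.append(gpu_end)
        latencies.append(gpu_end - cpu_start)
    return gpu_done[-1] - gpu_done[-2], latencies[-1]

# Hypothetical GPU-bound workload: 2 ms of CPU work, 10 ms of GPU work per frame.
for limit in (4, 2, 1):
    print(limit, simulate(2, 10, limit))
```

With these made-up numbers, a limit of 4 gives a 10 ms frame time but 40 ms of latency; a limit of 2 keeps the 10 ms frame time and halves the latency to 20 ms; a limit of 1 (a full stall every frame) brings latency down to 12 ms but also stretches the frame time to 12 ms. The model says nothing about the stutter itself, which I still don't fully understand, only about drawahead and latency.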
All of this is tested only on Nvidia and Windows 7. I have no idea if and how this is relevant on AMD cards. So this is a very experimental change at this point and might well make things worse for some people.
Thanks to jimmyfingers for first noticing this particular form of stutter and sending me on this wild goose chase in the first place.
Edited by mukuste, 07 March 2014 - 09:39 PM.