Moving towards GPU-driven rendering

We were using a traditional for_each-style loop to draw the G-buffer and shadow passes in Vulkan. With over 2.5 million triangles and 25,000+ objects, I started to see my GTX 1650 having a hard time keeping up. You can pre-record command buffers in Vulkan to reduce CPU time, but you still end up with a very large command buffer to submit, and you potentially miss the driver optimizations available with indirect draws. These days, GPUs are getting more powerful and more complex, with tons of new features, and they promise to draw millions or even billions more triangles than before. The cost is that the programming paradigm changes completely. If you want to take advantage of new hardware, chances are you need to rewrite the rendering code.

The so-called "GPU-driven" term is a catchy marketing phrase. It stands in contrast to the traditional "CPU-driven" paradigm, and what it really refers to is the introduction of the indirect draw commands. With them, much of the CPU work of issuing draw commands gets offloaded to the GPU. In Vulkan, we have

VKAPI_ATTR void VKAPI_CALL vkCmdDrawIndirect(
    VkCommandBuffer                          commandBuffer,
    VkBuffer                                 buffer,
    VkDeviceSize                             offset,
    uint32_t                                 drawCount,
    uint32_t                                 stride);

for non-indexed draws, and

VKAPI_ATTR void VKAPI_CALL vkCmdDrawIndexedIndirect(
    VkCommandBuffer                                 commandBuffer,
    VkBuffer                                        buffer,
    VkDeviceSize                                    offset,
    uint32_t                                        drawCount,
    uint32_t                                        stride);

for indexed draws. With these two commands and the use of compute shaders, we can play a lot of interesting tricks such as instancing and culling. Performance is also much better than issuing individual draw commands from the CPU. In this post I am going to lay out the steps for converting a CPU-driven renderer into a GPU-driven one.
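To make the drawCount and stride parameters concrete, here is a rough CPU-side picture of how an indirect buffer is consumed. It is only a sketch: decodeIndirectBuffer is a hypothetical helper, not a Vulkan function, and DrawIndexedIndirectCommand is a local mirror of VkDrawIndexedIndirectCommand so the snippet compiles without the Vulkan headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Local mirror of VkDrawIndexedIndirectCommand (five 32-bit fields).
struct DrawIndexedIndirectCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// Hypothetical stand-in for the GPU side of vkCmdDrawIndexedIndirect:
// read drawCount commands, stride bytes apart, starting at offset.
std::vector<DrawIndexedIndirectCommand> decodeIndirectBuffer(
    const std::vector<uint8_t>& buffer, std::size_t offset,
    uint32_t drawCount, uint32_t stride)
{
    // With more than one draw, consecutive commands must not overlap.
    assert(drawCount <= 1 || stride >= sizeof(DrawIndexedIndirectCommand));

    std::vector<DrawIndexedIndirectCommand> draws;
    for (uint32_t i = 0; i < drawCount; ++i) {
        const std::size_t at = offset + std::size_t(i) * stride;
        assert(at + sizeof(DrawIndexedIndirectCommand) <= buffer.size());

        DrawIndexedIndirectCommand cmd;
        std::memcpy(&cmd, buffer.data() + at, sizeof(cmd));
        draws.push_back(cmd); // the GPU would run one indexed draw here
    }
    return draws;
}
```

Because the commands live in an ordinary buffer, anything that can write to that buffer, including a compute shader, can decide what gets drawn.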

Step 1: Changing the data structure

The hardest part comes right at the beginning, not because of the indirect draw commands, but because of the vertex buffer bindings. In the traditional "CPU-driven" method, we bind the vertices of every mesh using vkCmdBindVertexBuffers and vkCmdBindIndexBuffer before drawing it. Since we now want to merge the draw calls into one command, we can bind the vertex and index buffers only once (or only a few times) per indirect draw. This requires us to merge the different meshes into a single buffer. Note that you can extend this to newer cluster-based techniques like mesh shaders and Nanite, but instead of using a static vertex buffer (which is the case in this tutorial) you would be dynamically filling the vertex buffer with data from different LoDs.

Here we look at the simplest case: static meshes. It is very simple: just append one mesh after another in the vertex and index buffers. Then we need an additional draw command buffer to track each mesh.
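The appending step could be sketched roughly as follows. The Vertex, Mesh, MeshRange and MergedScene types and the mergeMeshes function are hypothetical names made up for this illustration; a real engine would also upload the merged arrays into VkBuffers, which is left out here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side types; the real vertex layout is up to the engine.
struct Vertex { float position[3]; };

struct Mesh {
    std::vector<Vertex>   vertices;
    std::vector<uint32_t> indices;   // relative to this mesh's own vertices
};

// Where one mesh ended up inside the merged buffers.
struct MeshRange {
    uint32_t firstIndex;    // number of indices in all previous meshes
    uint32_t indexCount;
    uint32_t vertexOffset;  // number of vertices in all previous meshes
    uint32_t vertexCount;
};

struct MergedScene {
    std::vector<Vertex>    vertices;
    std::vector<uint32_t>  indices;
    std::vector<MeshRange> ranges;
};

MergedScene mergeMeshes(const std::vector<Mesh>& meshes)
{
    MergedScene scene;
    for (const Mesh& mesh : meshes) {
        for (uint32_t index : mesh.indices) {
            assert(index < mesh.vertices.size()); // indices stay inside the mesh
            (void)index;
        }

        MeshRange range;
        range.firstIndex   = static_cast<uint32_t>(scene.indices.size());
        range.indexCount   = static_cast<uint32_t>(mesh.indices.size());
        range.vertexOffset = static_cast<uint32_t>(scene.vertices.size());
        range.vertexCount  = static_cast<uint32_t>(mesh.vertices.size());
        scene.ranges.push_back(range);

        // Indices are copied unchanged; the draw's vertexOffset rebases them.
        scene.vertices.insert(scene.vertices.end(),
                              mesh.vertices.begin(), mesh.vertices.end());
        scene.indices.insert(scene.indices.end(),
                             mesh.indices.begin(), mesh.indices.end());
    }
    return scene;
}
```

Leaving the indices mesh-relative keeps the merge a plain copy and lets the same mesh data be reused without rewriting its indices.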

Building draw commands

For indirect draw, we need to fill in the draw command:

struct VkDrawIndexedIndirectCommand {
    uint32_t    indexCount;
    uint32_t    instanceCount;
    uint32_t    firstIndex;
    int32_t     vertexOffset;
    uint32_t    firstInstance;
};

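indexCount is the number of indices in the mesh. firstIndex is the offset of the mesh's first index within the merged index buffer (the total index count of all previous meshes). vertexOffset plays the same role for the merged vertex buffer: it is the total vertex count of all previous meshes, and it is added to every index before the vertex is fetched. We can generate one command per mesh, then increase its instanceCount later for instancing.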
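As a minimal sketch of that bookkeeping, the commands can be produced with two running sums. MeshCounts, buildDrawCommands and addInstance are hypothetical names, and DrawIndexedIndirectCommand is a local mirror of VkDrawIndexedIndirectCommand so the snippet builds without the Vulkan headers. Leaving firstInstance at 0 is a simplification; how instances map to per-instance data is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local mirror of VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// Hypothetical per-mesh sizes, in the order the meshes were appended.
struct MeshCounts {
    uint32_t indexCount;
    uint32_t vertexCount;
};

std::vector<DrawIndexedIndirectCommand> buildDrawCommands(
    const std::vector<MeshCounts>& meshes)
{
    std::vector<DrawIndexedIndirectCommand> commands;
    uint32_t firstIndex   = 0; // indices consumed by previous meshes
    int32_t  vertexOffset = 0; // vertices consumed by previous meshes
    for (const MeshCounts& mesh : meshes) {
        // instanceCount starts at 0 and grows as instances are registered.
        commands.push_back({mesh.indexCount, 0u, firstIndex, vertexOffset, 0u});
        firstIndex   += mesh.indexCount;
        vertexOffset += static_cast<int32_t>(mesh.vertexCount);
    }
    return commands;
}

// Register one more instance of the mesh at meshIndex.
void addInstance(std::vector<DrawIndexedIndirectCommand>& commands,
                 std::size_t meshIndex)
{
    assert(meshIndex < commands.size());
    ++commands[meshIndex].instanceCount;
}
```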

What about skinned meshes?

Skinned meshes need additional bone_weights and bone_indices vertex attributes, so combining them with normal static meshes would be quite annoying. One solution is to simply leave skinned meshes out of indirect draw. You can also pad the VkDrawIndexedIndirectCommand with additional skinning info such as bone_weight_offset, so that bone_weights and bone_indices can be read from a dedicated buffer. The solution I found most elegant is compute skinning: a compute shader skins the vertices into our gigantic vertex buffer, after which the vertex buffer contains only skinned vertices and can be treated the same as normal static meshes. Note that you probably need to duplicate the vertices of a skinned mesh if you need to compute motion vectors. In this case we can pad the VkDrawIndexedIndirectCommand into:

struct CustomizedIndirectDrawCommand {
    uint32_t    indexCount;
    uint32_t    instanceCount;
    uint32_t    firstIndex;
    int32_t     vertexOffset;
    uint32_t    firstInstance;
    uint32_t    vertexCount; // skinned meshes only (0 otherwise): vertexOffset + vertexCount
                             // is where the previous frame's copy of the vertices starts
};
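The padding works because the GPU reads only the leading five fields of each command and skips ahead by the stride we pass. A small sketch of the layout checks, plus the previous-frame lookup implied by the comment above, might look like this. previousFrameVertex is a hypothetical helper, and storing the previous frame's vertices directly after the current ones is the assumption taken from that comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct CustomizedIndirectDrawCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
    uint32_t vertexCount; // 0 for static meshes
};

// The extra field must sit after the five fields the GPU consumes, and the
// struct size (used as the stride) must be a multiple of 4 bytes.
static_assert(offsetof(CustomizedIndirectDrawCommand, vertexCount) == 20,
              "vertexCount must follow the standard fields");
static_assert(sizeof(CustomizedIndirectDrawCommand) == 24,
              "unexpected padding in the command struct");
static_assert(sizeof(CustomizedIndirectDrawCommand) % 4 == 0,
              "stride must be a multiple of 4");

// vertexIndex is the absolute position in the merged vertex buffer
// (index value + vertexOffset). Static meshes have no previous-frame copy.
uint32_t previousFrameVertex(const CustomizedIndirectDrawCommand& cmd,
                             uint32_t vertexIndex)
{
    if (cmd.vertexCount == 0)
        return vertexIndex;

    const uint32_t first = static_cast<uint32_t>(cmd.vertexOffset);
    assert(vertexIndex >= first && vertexIndex < first + cmd.vertexCount);
    return vertexIndex + cmd.vertexCount;
}
```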

Step 2: Indirect draw

Once all the hard work of preparing the buffers is done, it’s just a matter of issuing the indirect draw calls. On the C++ side we have:

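// Assumes scene_vertex_buffer and scene_index_buffer are VkBuffer handles.
VkDeviceSize vertex_buffer_offset = 0;
vkCmdBindVertexBuffers(cmd_buf, 0, 1, &scene_vertex_buffer, &vertex_buffer_offset);
vkCmdBindIndexBuffer(cmd_buf, scene_index_buffer, 0, VK_INDEX_TYPE_UINT32);

const uint32_t draw_count = static_cast<uint32_t>(instances.size());
const uint32_t stride = sizeof(CustomizedIndirectDrawCommand);

if (device->features.multiDrawIndirect)
{
  // One call consumes every command in the buffer.
  vkCmdDrawIndexedIndirect(cmd_buf, indirectCommandsBuffer.buffer, 0, draw_count, stride);
}
else
{
  // Without multiDrawIndirect, drawCount must be 0 or 1: one call per command.
  for (uint32_t i = 0; i < draw_count; ++i)
    vkCmdDrawIndexedIndirect(cmd_buf, indirectCommandsBuffer.buffer,
                             static_cast<VkDeviceSize>(i) * stride, 1, stride);
}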

In the vertex shader we could have something like

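layout(location = 0) in vec3 Position;
//... other layouts

layout(location = 3) out vec4 PrevPosition;

struct Vertex
{
    vec4 position;
    vec4 tex_coord;
    vec4 normal;
    vec4 tangent;
    vec4 bitangent;
};

// Must match CustomizedIndirectDrawCommand on the C++ side.
struct CustomDrawCommand
{
    uint indexCount;
    uint instanceCount;
    uint firstIndex;
    int  vertexOffset;
    uint firstInstance;
    uint vertexCount;
};

layout(set = 0, binding = 0, std430) readonly buffer Vertices
{
    Vertex data[];
} u_vertices;

layout(set = 0, binding = 1, std430) readonly buffer DrawCommands
{
    CustomDrawCommand commands[];
} u_draw_commands;

layout(push_constant) uniform PushConstants
{
    mat4 mvp;
    mat4 prev_mvp;
};

//somewhere later in the shader

gl_Position = mvp * vec4(Position, 1.0);

// Vulkan GLSL provides gl_InstanceIndex / gl_VertexIndex in place of
// gl_InstanceID / gl_VertexID. Indexing the commands by gl_InstanceIndex
// assumes each command's firstInstance holds that command's own index.
uint vertex_count = u_draw_commands.commands[gl_InstanceIndex].vertexCount;
if (vertex_count > 0) //skinned mesh
{
    // The previous frame's copy sits vertex_count entries after the current one.
    uint vertex_id = uint(gl_VertexIndex) + vertex_count;
    PrevPosition = prev_mvp * u_vertices.data[vertex_id].position;
}
else
{
    PrevPosition = prev_mvp * vec4(Position, 1.0);
}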

Step 3: Culling

The full details of culling (mostly occlusion culling) are out of scope here. Good material on it can be found here. What we need from the culling shader is for it to set the draw commands:

struct VkDrawIndexedIndirectCommand {
    uint32_t    indexCount;    //-----> 0 if culled
    uint32_t    instanceCount; //-----> 0 if culled
    ...
};

Because the same mesh can have a mix of visible and invisible instances, we need a copy of the command for every instance. With the help of culling, the number of triangles drawn is no longer bound by the size of the scene, which gives sufficient performance in many cases.
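A CPU reference of what such a culling pass writes could look like the sketch below. It is not occlusion culling: a plain bounding-sphere versus plane test is swapped in as the visibility check to keep the example short, and on the GPU the loop body would be one compute shader invocation per command. Plane, Sphere, sphereVisible and cullInstances are hypothetical names, and only instanceCount is zeroed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local mirror of VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// A point p is on the inside of the plane when dot(n, p) + d >= 0.
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, radius; };

bool sphereVisible(const Sphere& s, const std::vector<Plane>& planes)
{
    for (const Plane& p : planes) {
        const float distance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (distance < -s.radius)
            return false; // entirely on the outside of this plane
    }
    return true;
}

// One command and one bounding sphere per instance.
void cullInstances(std::vector<DrawIndexedIndirectCommand>& commands,
                   const std::vector<Sphere>& bounds,
                   const std::vector<Plane>& planes)
{
    assert(commands.size() == bounds.size());
    for (std::size_t i = 0; i < commands.size(); ++i)
        commands[i].instanceCount = sphereVisible(bounds[i], planes) ? 1u : 0u;
}
```

A command with instanceCount set to 0 still occupies its slot in the buffer, so drawCount and stride on the C++ side do not need to change when instances are culled.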

Conclusion

Now you can see the complexity of modern GPUs and the work programmers need to do to take advantage of the hardware. GPU-driven rendering may be a cool technique right now, but it will soon be obsolete once ray tracing takes over.