I was using a traditional for_each style of drawing for the G-buffer and shadow passes in Vulkan. With over 2.5 million triangles and 25,000+ objects, my GTX 1650 started having a hard time keeping up. You can pre-record command buffers in Vulkan to reduce CPU time, but you still end up with a very large command buffer to submit, and you potentially miss the driver optimizations that come with indirect draws. GPUs keep getting more powerful and more complex, with tons of new features, and they promise to draw millions or even billions more triangles than before. The cost is that the programming paradigm has changed completely: if you want to take advantage of the new hardware, chances are you need to rewrite your rendering code.
The so-called ‘‘GPU-driven’’ term is a catchy marketing phrase. It contrasts with the traditional ‘‘CPU-driven’’ paradigm, and what it really refers to is the introduction of the new drawIndirect commands. With the indirect commands, much of the CPU work of issuing draw commands is offloaded to the GPU. In Vulkan, we have
VKAPI_ATTR void VKAPI_CALL vkCmdDrawIndirect(
VkCommandBuffer commandBuffer,
VkBuffer buffer,
VkDeviceSize offset,
uint32_t drawCount,
uint32_t stride);
for non-indexed draws, and
VKAPI_ATTR void VKAPI_CALL vkCmdDrawIndexedIndirect(
VkCommandBuffer commandBuffer,
VkBuffer buffer,
VkDeviceSize offset,
uint32_t drawCount,
uint32_t stride);
for indexed draws. With these two commands and compute shaders, we can play a lot of interesting tricks such as instancing and culling. Performance is also typically much better than issuing individual draw calls from the CPU. In this post I am going to lay out the steps for converting a CPU-driven renderer into a GPU-driven one.
Step 1: Changing the data structure
The hardest part comes right at the beginning, not because of the indirect draw commands themselves, but because of how vertex buffers are bound in the command buffer. In the traditional ‘‘CPU-driven’’ method, we bind the vertex and index buffers of every mesh with vkCmdBindVertexBuffers and vkCmdBindIndexBuffer before drawing it. Since we now want to merge the draw calls into one command, we can bind the vertex and index buffers only once (or only a few times) per indirect draw. That requires merging the different meshes into a single buffer. Note that you can extend this to newer cluster-based drawing techniques like mesh shaders and Nanite, but instead of a static vertex buffer (which is the case in this tutorial) you will be dynamically filling the vertex buffer with data from different LoDs.
Here we look at the simplest case: static meshes. We simply append one mesh after another in the vertex and index buffers. Then we need an additional buffer of draw commands to keep track of each mesh.
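As a rough illustration of the appending step, here is a minimal CPU-side sketch. The Vertex, Mesh, MeshRange, and MergedScene types and the appendMesh function are hypothetical names made up for this example, and the vertex layout is an assumption; a real engine would upload the merged arrays to GPU buffers afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side types; the names and vertex layout are illustrative only.
struct Vertex { float position[3]; float normal[3]; float uv[2]; };

struct Mesh {
    std::vector<Vertex>   vertices;
    std::vector<uint32_t> indices;   // local to this mesh, starting at 0
};

// Where one mesh lives inside the merged buffers.
struct MeshRange {
    uint32_t firstIndex;    // offset into the merged index buffer, in indices
    uint32_t indexCount;    // number of indices in this mesh
    int32_t  vertexOffset;  // offset into the merged vertex buffer, in vertices
    uint32_t vertexCount;   // number of vertices in this mesh
};

struct MergedScene {
    std::vector<Vertex>    vertices;
    std::vector<uint32_t>  indices;
    std::vector<MeshRange> ranges;
};

// Appends a mesh to the merged buffers and returns its slot in `ranges`.
// Indices are copied unchanged: the recorded vertexOffset rebases them at draw time.
uint32_t appendMesh(MergedScene& scene, const Mesh& mesh) {
    assert(!mesh.vertices.empty() && !mesh.indices.empty());
    MeshRange range{};
    range.firstIndex   = static_cast<uint32_t>(scene.indices.size());
    range.indexCount   = static_cast<uint32_t>(mesh.indices.size());
    range.vertexOffset = static_cast<int32_t>(scene.vertices.size());
    range.vertexCount  = static_cast<uint32_t>(mesh.vertices.size());
    scene.vertices.insert(scene.vertices.end(), mesh.vertices.begin(), mesh.vertices.end());
    scene.indices.insert(scene.indices.end(), mesh.indices.begin(), mesh.indices.end());
    scene.ranges.push_back(range);
    return static_cast<uint32_t>(scene.ranges.size() - 1);
}
```

Each MeshRange holds the numbers that later go into a draw command for that mesh, so nothing about the individual meshes needs to be remembered beyond this table.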
Building draw commands
For indirect draw, we need to fill in the draw command:
struct VkDrawIndexedIndirectCommand {
uint32_t indexCount;
uint32_t instanceCount;
uint32_t firstIndex;
int32_t vertexOffset;
uint32_t firstInstance;
};
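indexCount is the number of indices in the mesh. firstIndex is the mesh's offset into the merged index buffer (the total index count of all previous meshes). vertexOffset plays the same role as firstIndex, but for the merged vertex buffer: it is the total vertex count of all previous meshes, and it is added to every index before the vertex is fetched. We can generate this command for every mesh and then increase instanceCount later for instancing.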
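To make the meaning of these fields concrete, the sketch below emulates on the CPU which vertex slots one instance of a command would read. The DrawIndexedIndirectCommand struct is a local mirror of the Vulkan layout (so the example builds without the Vulkan headers), and resolveVertexFetches is a hypothetical helper written only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Local mirror of VkDrawIndexedIndirectCommand's field layout.
struct DrawIndexedIndirectCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};
static_assert(sizeof(DrawIndexedIndirectCommand) == 20, "five tightly packed 32-bit fields");

// Returns, in draw order, the merged-vertex-buffer slots that one instance
// of `cmd` would fetch. Hypothetical helper for illustration only.
std::vector<uint32_t> resolveVertexFetches(const DrawIndexedIndirectCommand& cmd,
                                           const std::vector<uint32_t>& mergedIndices) {
    assert(cmd.firstIndex + cmd.indexCount <= mergedIndices.size());
    std::vector<uint32_t> fetched;
    fetched.reserve(cmd.indexCount);
    for (uint32_t i = 0; i < cmd.indexCount; ++i) {
        // Read the mesh-local index, then rebase it by vertexOffset.
        const uint32_t localIndex = mergedIndices[cmd.firstIndex + i];
        fetched.push_back(static_cast<uint32_t>(static_cast<int64_t>(localIndex) + cmd.vertexOffset));
    }
    return fetched;
}
```

Because vertexOffset is applied at fetch time, the indices of each mesh can stay mesh-local when they are copied into the merged index buffer.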
What about skinned meshes?
Skinned meshes need additional bone_weights and bone_indices vertex attributes, which makes them quite annoying to combine with normal static meshes. One solution is simply to leave skinned meshes out of the indirect draw. You can also pad VkDrawIndexedIndirectCommand with additional skinning info such as bone_weight_offset, so that the shader can look up bone_weights and bone_indices in a dedicated buffer; this works because the stride parameter of the indirect draw allows each record to be larger than the standard command. The solution I found most elegant is compute skinning: a compute shader skins the vertices into our gigantic vertex buffer, after which the buffer contains only skinned vertices and they can be treated the same as normal static meshes. Note that you probably need to keep a second copy of the vertices of each skinned mesh if you need to compute motion vectors. In this case we can pad VkDrawIndexedIndirectCommand into:
struct CustomizedIndirectDrawCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t vertexOffset;
    uint32_t firstInstance;
    uint32_t vertexCount; // skinned meshes only: vertexOffset + vertexCount is where
                          // the previous frame's vertices start; 0 for static meshes
};
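The padded layout can be sanity-checked at compile time. The following sketch assumes the two copies of a skinned mesh sit back to back in the merged vertex buffer (current frame first, previous frame second), which is how the comment on vertexCount reads; previousFrameVertex is a hypothetical helper, not part of any API.

```cpp
#include <cassert>
#include <cstdint>

// Same fields as the padded command in the text.
struct CustomizedIndirectDrawCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
    uint32_t vertexCount;   // 0 for static meshes
};

// The GPU reads the standard 20-byte command at the start of each record and
// steps to the next record by the stride, so the stride must be a multiple
// of 4 and at least as large as the standard command.
constexpr uint32_t kStandardCommandSize = 5 * sizeof(uint32_t);
static_assert(sizeof(CustomizedIndirectDrawCommand) == 24, "unexpected padding");
static_assert(sizeof(CustomizedIndirectDrawCommand) % 4 == 0, "stride must be a multiple of 4");
static_assert(sizeof(CustomizedIndirectDrawCommand) >= kStandardCommandSize, "stride too small");

// Slot of a vertex's previous-frame copy, assuming the layout
// [current frame vertices][previous frame vertices] for each skinned mesh.
uint32_t previousFrameVertex(const CustomizedIndirectDrawCommand& cmd, uint32_t localVertex) {
    assert(cmd.vertexCount > 0 && localVertex < cmd.vertexCount);
    return static_cast<uint32_t>(cmd.vertexOffset) + cmd.vertexCount + localVertex;
}
```

With this layout a skinned mesh occupies twice its vertex count in the merged buffer, which is the duplication mentioned above.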
Step 2: Indirect draw
Once all the hard work of preparing the buffers is done, it is just a matter of issuing the indirect draw calls. On the C++ side we have:
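// Bind the merged vertex buffer to binding 0 (assumes scene_vertex_buffer is a VkBuffer).
VkDeviceSize vertex_buffer_offset = 0;
vkCmdBindVertexBuffers(cmd_buf, 0, 1, &scene_vertex_buffer, &vertex_buffer_offset);
vkCmdBindIndexBuffer(cmd_buf, scene_index_buffer, 0, VK_INDEX_TYPE_UINT32);
if (device->features.multiDrawIndirect)
{
    // One call consumes every command in the buffer.
    vkCmdDrawIndexedIndirect(cmd_buf,
                             indirectCommandsBuffer.buffer,
                             0,
                             static_cast<uint32_t>(instances.size()),
                             sizeof(CustomizedIndirectDrawCommand));
}
else
{
    // Without multiDrawIndirect, drawCount must be 0 or 1: issue one call per command.
    for (uint32_t i = 0; i < instances.size(); ++i)
        vkCmdDrawIndexedIndirect(cmd_buf, indirectCommandsBuffer.buffer,
                                 i * sizeof(CustomizedIndirectDrawCommand), 1,
                                 sizeof(CustomizedIndirectDrawCommand));
}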
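The draw count above is instances.size(), which implies one command per instance. One way to get there from per-mesh commands is sketched below; the Instance struct and buildPerInstanceCommands are hypothetical names for this example, and setting firstInstance to the instance's position in the list is an assumption of the sketch rather than a requirement.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct CustomizedIndirectDrawCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
    uint32_t vertexCount;
};

// Hypothetical instance record: only the mesh it draws matters here.
struct Instance { uint32_t meshIndex; };

// Copies the command of each instance's mesh, so that every instance owns
// a record that can later be switched on or off independently.
std::vector<CustomizedIndirectDrawCommand> buildPerInstanceCommands(
        const std::vector<CustomizedIndirectDrawCommand>& meshCommands,
        const std::vector<Instance>& instances) {
    std::vector<CustomizedIndirectDrawCommand> out;
    out.reserve(instances.size());
    for (uint32_t i = 0; i < instances.size(); ++i) {
        assert(instances[i].meshIndex < meshCommands.size());
        CustomizedIndirectDrawCommand cmd = meshCommands[instances[i].meshIndex];
        cmd.instanceCount = 1;
        cmd.firstInstance = i;  // a shader can read this back through gl_InstanceIndex
        out.push_back(cmd);
    }
    return out;
}
```

Note that a non-zero firstInstance in an indirect command requires the drawIndirectFirstInstance device feature; without it, the instance would have to be identified some other way.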
In the vertex shader we could have something like
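layout(location = 0) in vec3 Position;
//... other layouts
layout(location = 3) out vec4 PrevPosition;
struct Vertex
{
    vec4 position;
    vec4 tex_coord;
    vec4 normal;
    vec4 tangent;
    vec4 bitangent;
};
// Must match CustomizedIndirectDrawCommand on the C++ side.
struct CustomDrawCommand
{
    uint indexCount;
    uint instanceCount;
    uint firstIndex;
    int vertexOffset;
    uint firstInstance;
    uint vertexCount;
};
layout(set = 0, binding = 0, std430) readonly buffer Vertices
{
    Vertex data[];
} u_vertices;
layout(set = 0, binding = 1, std430) readonly buffer drawCommand
{
    CustomDrawCommand commands[];
} u_draw_commands;
layout(push_constant) uniform PushConstants
{
    mat4 mvp;
    mat4 prev_mvp;
};
//somewhere later in the shader
gl_Position = mvp * vec4(Position, 1.0);
// Vulkan GLSL provides gl_InstanceIndex and gl_VertexIndex instead of gl_InstanceID and gl_VertexID.
// Indexing by gl_InstanceIndex assumes one command per instance, with firstInstance set to the command's index.
uint vertex_count = u_draw_commands.commands[gl_InstanceIndex].vertexCount;
if (vertex_count > 0) //skinned mesh
{
    // gl_VertexIndex already includes vertexOffset, so adding vertex_count
    // lands in the previous frame's copy of the vertex.
    uint vertex_id = uint(gl_VertexIndex) + vertex_count;
    PrevPosition = prev_mvp * vec4(u_vertices.data[vertex_id].position.xyz, 1.0);
}
else
{
    PrevPosition = prev_mvp * vec4(Position, 1.0);
}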
Step 3: Culling
The full details of culling (mostly occlusion culling) are out of scope here; there is good material on it elsewhere. What we need from the culling shader is for it to update the draw commands:
struct VkDrawIndexedIndirectCommand {
uint32_t indexCount; // set to 0 if culled
uint32_t instanceCount; // set to 0 if culled
...
};
Because the same mesh can have a mix of visible and invisible instances, we need a copy of the command for every instance. With the help of culling, the cost of drawing triangles is no longer bound by the size of the scene, which gives sufficient performance in many cases.
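To show what "setting the draw commands" amounts to, here is a CPU reference of the per-instance update a culling compute shader could perform. It swaps in a plain sphere-versus-frustum test as a stand-in for the occlusion culling mentioned above; the Plane and BoundingSphere types, the plane convention, and the function names are all assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local mirror of VkDrawIndexedIndirectCommand's field layout.
struct DrawIndexedIndirectCommand {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// Assumed convention: a point is inside when nx*x + ny*y + nz*z + d >= 0,
// with (nx, ny, nz) normalized.
struct Plane { float nx, ny, nz, d; };
struct BoundingSphere { float x, y, z, radius; };

// A sphere is rejected only when it lies entirely behind some plane.
bool sphereInFrustum(const BoundingSphere& s, const std::vector<Plane>& planes) {
    for (const Plane& p : planes) {
        const float distance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (distance < -s.radius) return false;
    }
    return true;
}

// One command per instance: culled instances get instanceCount = 0, so the
// GPU skips them without the CPU rebuilding the command list.
// Returns the number of instances left visible.
uint32_t cullCommands(std::vector<DrawIndexedIndirectCommand>& commands,
                      const std::vector<BoundingSphere>& bounds,
                      const std::vector<Plane>& planes) {
    assert(commands.size() == bounds.size());
    uint32_t visible = 0;
    for (std::size_t i = 0; i < commands.size(); ++i) {
        const bool keep = sphereInFrustum(bounds[i], planes);
        commands[i].instanceCount = keep ? 1u : 0u;
        visible += keep ? 1u : 0u;
    }
    return visible;
}
```

On the GPU, each loop iteration would be one compute shader invocation writing into the indirect command buffer, followed by a barrier before the indirect draw reads it.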
Conclusion
Now you can see the complexity of modern GPUs and the work programmers need to do to take advantage of the hardware. GPU-driven rendering may be a cool technique right now, but it will soon be obsolete once ray tracing takes over.