Automated Vulkan Synchronization

Recently I finally decided to solve the synchronization problem in my Vulkan renderer. It is something I have wanted to do for a long time: to me it should be part of the FrameGraph design in the application, where you declare your individual render passes and the resources they read and write. The synchronization system then handles the resource transitions: which layout an image should change from and to, whether we need to set up a semaphore, and so on.

Vulkan Synchronization API

The Vulkan synchronization API is really powerful, but it is not very friendly if you want to use it directly in the application. It becomes quite awkward because you need to know both the current and the previous access pattern of every resource (VkImage and VkBuffer). Then there is the multi-queue situation: things get considerably more complex once you decide to work with more than one queue.

Barriers

Barriers are the most lightweight synchronization mechanism you can use, and they are probably the primary tool you should reach for. You inject them between your draw/dispatch/copy commands to ensure caches are flushed, image layouts are correct, and so on.

void vkCmdPipelineBarrier2(
    VkCommandBuffer                             commandBuffer,
    const VkDependencyInfo*                     pDependencyInfo);

typedef struct VkDependencyInfo {
    VkStructureType                  sType;
    const void*                      pNext;
    VkDependencyFlags                dependencyFlags;
    uint32_t                         memoryBarrierCount;
    const VkMemoryBarrier2*          pMemoryBarriers;
    uint32_t                         bufferMemoryBarrierCount;
    const VkBufferMemoryBarrier2*    pBufferMemoryBarriers;
    uint32_t                         imageMemoryBarrierCount;
    const VkImageMemoryBarrier2*     pImageMemoryBarriers;
} VkDependencyInfo;

This requires VK_KHR_synchronization2, which is core in Vulkan 1.3; the concept is quite similar to Vulkan 1.0, with some minor differences. The meat here is that you record your buffer memory barriers and image memory barriers (for buffers you can use either VkMemoryBarrier2 or VkBufferMemoryBarrier2). They have the following fields in common:

typedef struct VkCOMMONMemoryBarrier2 {
    VkStructureType          sType;
    const void*              pNext;
    VkPipelineStageFlags2    srcStageMask;
    VkAccessFlags2           srcAccessMask;
    VkPipelineStageFlags2    dstStageMask;
    VkAccessFlags2           dstAccessMask;
    uint32_t                 srcQueueFamilyIndex;
    uint32_t                 dstQueueFamilyIndex;
};

// then buffer has
typedef struct VkBufferMemoryBarrier2 {
    // common stuff
    VkBuffer                 buffer;
    VkDeviceSize             offset;
    VkDeviceSize             size;
};

// image has
typedef struct VkImageMemoryBarrier2 {
    // common stuff
    VkImageLayout              oldLayout;
    VkImageLayout              newLayout;
    VkImage                    image;
    VkImageSubresourceRange    subresourceRange;
};

You need to know both the source stageMask/accessMask (what you are waiting for) and the destination stageMask/accessMask (what you need right now). The mental model is to imagine your application writing all the commands for the frame in one big static function, where you obviously know how each resource was previously accessed.
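To make this concrete, here is a sketch (placeholder handles, not code from my renderer) of a barrier that waits for a compute shader write to an image and makes it visible to a fragment shader read, including a layout transition:

// a compute pass wrote `image` through a storage image; the next draw
// samples it in the fragment shader
VkImageMemoryBarrier2 barrier = {
    .sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2,
    .srcStageMask        = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,   // what we are waiting for
    .srcAccessMask       = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT,
    .dstStageMask        = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT,  // what we need right now
    .dstAccessMask       = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT,
    .oldLayout           = VK_IMAGE_LAYOUT_GENERAL,
    .newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image               = image,
    .subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};

VkDependencyInfo dep_info = {
    .sType                   = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
    .imageMemoryBarrierCount = 1,
    .pImageMemoryBarriers    = &barrier,
};
vkCmdPipelineBarrier2(command_buffer, &dep_info);

Reading it back: the src half describes the write we are waiting for, the dst half describes the read we are about to do, and the layout pair describes the transition in between.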

Barriers are for synchronizing within a queue.

Initially I thought barriers were only for synchronization inside a single submission, but that is actually not the case: you can use them across submissions. This is because Vulkan guarantees that commands start in exactly the order you submitted them, even though they may finish in a different order.

//1st submission
vkCmdDispatch();
vkCmdDispatch();
vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER);
vkQueueSubmit();

//2nd submission
vkCmdCopy();
vkQueueSubmit();

//3rd submission
vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = DRAW);
vkCmdDraw();
vkQueueSubmit();

So in this example we have three submissions, and the barriers are injected either before or after a submission boundary; that is fine. In reality, though, you want to minimize the number of submissions to keep the GPU busy.

Semaphores

Semaphores are the primary GPU synchronization mechanism across queues, so they are most useful once you decide to use multiple queues. The primary use case: if you use the same resource on different queues, you need to define the order of execution between them using semaphores.

// queue 1
vkCmdDispatch();
vkQueueSubmit(signal = my_semaphore, wait = null);

// queue 2
vkCmdBeginRenderPass();
vkCmdDraw();
vkCmdEndRenderPass();
vkQueueSubmit(wait = my_semaphore, pWaitDstStageMask = FRAGMENT_SHADER);

That is, in queue 2 I want the fragment shaders of the vkCmdDraw() to wait for the previous vkCmdDispatch() in queue 1 to finish.
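With synchronization2 the wait stage moves from pWaitDstStageMask into VkSemaphoreSubmitInfo. Here is a sketch of the queue 2 submit (command_buffer, queue2 and my_semaphore are placeholders; queue 1 would pass the same semaphore in pSignalSemaphoreInfos with stageMask = COMPUTE_SHADER):

// queue 2: wait on my_semaphore before the fragment shader stage runs
VkSemaphoreSubmitInfo wait_info = {
    .sType     = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO,
    .semaphore = my_semaphore,
    .stageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT,
};
VkCommandBufferSubmitInfo cmd_info = {
    .sType         = VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO,
    .commandBuffer = command_buffer,
};
VkSubmitInfo2 submit = {
    .sType                  = VK_STRUCTURE_TYPE_SUBMIT_INFO_2,
    .waitSemaphoreInfoCount = 1,
    .pWaitSemaphoreInfos    = &wait_info,
    .commandBufferInfoCount = 1,
    .pCommandBufferInfos    = &cmd_info,
};
vkQueueSubmit2(queue2, 1, &submit, VK_NULL_HANDLE);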

Implicit memory guarantees when waiting on a semaphore.

Note that waiting on a semaphore alleviates the need for an extra memory barrier: the semaphore signal operation makes all previous writes available, and the wait operation makes them visible to subsequent commands.

Queue ownership transfer

Normally that is all you need, but Vulkan always finds a way to make your life harder, right? To get the best performance you create your VkImage and VkBuffer resources with VK_SHARING_MODE_EXCLUSIVE, meaning they can only be accessed by a single queue family at a time. Then, if you want to take advantage of async compute for example, you need to perform a queue family ownership transfer. This is what srcQueueFamilyIndex and dstQueueFamilyIndex are for.

For this to work you need to record an almost identical memory barrier on both the previous queue and the new queue.

//in queue 1
vkCmdDraw();
vkCmdPipelineBarrier2(resource, srcQFI=0, dstQFI=1, oldLayout=ATTACHMENT, newLayout=READ);
vkQueueSubmit(signal=my_semaphore);

//in queue 2, we inject the same barrier
vkCmdPipelineBarrier2(resource, srcQFI=0, dstQFI=1, oldLayout=ATTACHMENT, newLayout=READ);
vkCmdDraw();
vkQueueSubmit(wait=my_semaphore);

The first barrier is called an ownership release operation. The second, "identical" barrier is called an ownership acquire operation. Although it looks like you applied the image layout transition twice, it is actually applied only once, per the spec:

Although the image layout transition is submitted twice, it will only be executed once. A layout transition specified in this way happens-after the release operation and happens-before the acquire operation.
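Sketched with the synchronization2 structs, a release/acquire pair could look roughly like this (queue family indices 0/1, the image handle and the layouts are placeholder assumptions). The two barriers must match in image, subresource range, layouts and queue family indices; a common convention is to leave the dst half empty on the release and the src half empty on the acquire, since each side only needs its own half:

// release, recorded on a queue of family 0 (graphics): wait for the
// attachment write, then hand the image over to family 1 (compute)
VkImageMemoryBarrier2 release = {
    .sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2,
    .srcStageMask        = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask       = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT,
    .dstStageMask        = VK_PIPELINE_STAGE_2_NONE,   // destination half handled by the acquire
    .dstAccessMask       = VK_ACCESS_2_NONE,
    .oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    .srcQueueFamilyIndex = 0,
    .dstQueueFamilyIndex = 1,
    .image               = image,
    .subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};

// acquire, recorded on a queue of family 1: same image, layouts and queue
// family indices, but now the src half is empty and the dst half describes
// the compute read
VkImageMemoryBarrier2 acquire = release;
acquire.srcStageMask  = VK_PIPELINE_STAGE_2_NONE;
acquire.srcAccessMask = VK_ACCESS_2_NONE;
acquire.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
acquire.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;

// each barrier goes through its own VkDependencyInfo + vkCmdPipelineBarrier2
// on its own command buffer, with a semaphore ordering the two submissions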

Application designs

So that is the API part of the story; the rest is up to how the application applies it in practice. (Note that this is mainly my design, your mileage may vary.) When it comes to synchronization, we mainly need answers to the following questions:

  1. When/where do we inject barriers?
  2. When/where do we inject semaphores?
  3. When do we submit the queue?
  4. How do we manage the src/dst access masks and stages?

In my design, I took advantage of the following features (or guarantees) that Vulkan offers:

  1. Barriers can be used across submissions.
  2. Commands start in exactly the order you submit them.
  3. Semaphores are only needed when a resource is used on different queues.
  4. Queue ownership transfers are applied only where appropriate.

Global submission state

Because of features 1 and 2, I decided to have a global submission state for every command buffer I record. To use it, you add your command buffer to this submission state with:

struct dependency
{
    struct timeline_t
    {
        vk::Semaphore           semaphore;
        uint64_t                wait_value;
        vk::PipelineStageFlags2 wait_stage;
    } timeline;

    // a list of records which is stored somewhere else
    std::span<const access_record> accesses;
};

void submissions::add(vk::CommandBuffer cmd, vk::Queue queue, dependency const& dep = {});

So a render pass only cares about the command buffer it records and which queue that command buffer targets; there is no explicit submission from the user's point of view.

The answer to question 3 becomes apparent now: we only want to submit when the next command buffer no longer targets the current VkQueue. Inside the submission state there is a pending state that accumulates command buffers and their dependencies, and I keep a (timeline) semaphore in the dependency in case we need a vkQueueSubmit. Question 2 is sort of answered as well: semaphores are only injected when the queue switches.
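To give an idea, here is a simplified sketch (not my exact code) of what the pending state and its flush could look like, assuming the semaphore in it was created as a timeline semaphore:

// hypothetical pending state: command buffers accumulated for one queue,
// plus a timeline semaphore whose value grows with every flush
struct pending_t
{
    vk::Queue                      queue;
    std::vector<vk::CommandBuffer> cmds;
    vk::Semaphore                  timeline;  // created as a timeline semaphore
    uint64_t                       value = 0; // last signaled value
};

void flush(pending_t& pending)
{
    if (pending.cmds.empty())
        return;

    std::vector<vk::CommandBufferSubmitInfo> cmd_infos;
    for (auto cmd : pending.cmds)
        cmd_infos.push_back(vk::CommandBufferSubmitInfo{}.setCommandBuffer(cmd));

    // signal the next timeline value; a later submission on another queue
    // waits on this value to order the two queues
    vk::SemaphoreSubmitInfo signal{};
    signal.semaphore = pending.timeline;
    signal.value     = ++pending.value;
    signal.stageMask = vk::PipelineStageFlagBits2::eAllCommands;

    vk::SubmitInfo2 submit{};
    submit.setCommandBufferInfos(cmd_infos)
          .setSignalSemaphoreInfos(signal);

    pending.queue.submit2(submit);
    pending.cmds.clear();
}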

Access records

The access_record in the dependency looks like the following:

struct access_record
{
    enum pattern_t { READ, WRITE };
    using resource_t = std::variant<vk::Image, vk::Buffer>;
    struct key_t
    {
        pattern_t  pattern;
        resource_t resource;
    } key;

    struct dst_t
    {
        vk::PipelineStageFlagBits2 stage;
        vk::AccessFlags2           access;
        // image specific
        vk::ImageLayout           layout;
        vk::ImageSubresourceRange subresource;
    } dst;
};

You have the READ/WRITE pattern, the resource it touches, and the dstStage/dstAccess flags. You only specify the access of the current command, which is purely local knowledge.
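For instance, a pass that writes an image as a color attachment might declare something like this (purely illustrative values):

// hypothetical record: "this pass WRITEs `target` as a color attachment"
access_record record{
    .key = { access_record::WRITE, target /* a vk::Image */ },
    .dst = {
        .stage       = vk::PipelineStageFlagBits2::eColorAttachmentOutput,
        .access      = vk::AccessFlagBits2::eColorAttachmentWrite,
        .layout      = vk::ImageLayout::eColorAttachmentOptimal,
        .subresource = { vk::ImageAspectFlagBits::eColor, 0, 1, 0, 1 },
    },
};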

Up to now we have an answer to question 1: we inject barriers for every command buffer added to the submission state (not to be confused with the individual commands), provided it specifies access_records. This is usually not a problem, because you typically record multiple commands into a command buffer before adding it:

vkCmdDraw(cmd1);
vkCmdDraw(cmd1);
submissions::add(cmd1, queue1, dep={{DRAW, FRAGMENT}, {DRAW, VERTEX}});

vkCmdDispatch(cmd2);
vkCmdDispatch(cmd2);
//still accumulating since on the same queue.
submissions::add(cmd2, queue1, dep={READ, COMPUTE});

The answer to question 4 is ready as well: since we have a global submission state, we can always look up the previous (source) access pattern of a resource.
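One possible shape for that lookup, as a sketch: keep the last access per resource together with the queue and queue family it happened on, and hand it back as the src side. The tracked_access type here is hypothetical:

// hypothetical tracked state: the last access that touched a resource,
// which becomes the src side of the next barrier on that resource
struct tracked_access
{
    vk::PipelineStageFlagBits2 stage        = vk::PipelineStageFlagBits2::eNone;
    vk::AccessFlags2           access       = vk::AccessFlagBits2::eNone;
    vk::ImageLayout            layout       = vk::ImageLayout::eUndefined;
    vk::Queue                  queue        = {};
    uint32_t                   queue_family = VK_QUEUE_FAMILY_IGNORED;
    access_record::pattern_t   pattern      = access_record::READ;
};

// last access per resource; a small vector with a linear scan is enough here
std::vector<std::pair<access_record::resource_t, tracked_access>> history;

tracked_access find_prev_access(access_record::key_t const& key)
{
    for (auto const& [resource, last] : history)
        if (resource == key.resource)
            return last;
    return {}; // first use: no previous access, src masks stay eNone
}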

Management logic

Now that we are done with the interface, how do we handle the internals? It may already be apparent, since we have concrete answers to all four questions. The full function is probably too much to show here, but the pseudocode is actually quite simple:

void submissions::add(vk::CommandBuffer cmd, vk::Queue queue, dependency const& dep)
{
    if (queue != prev_queue)
    {
        flush(queue);
    }

    for (auto access : dep.accesses) {
        auto src_access = find_prev_access(access.key);
        auto dst_access = new_access(access.dst, queue);

        if (require_barrier(src_access, dst_access)) {
            if (require_semaphore(src_access, dst_access)) {
                // semaphore implicit guarantee: previous writes are already visible
                src_access.access = vk::AccessFlagBits2::eNone;
                src_access.stage  = vk::PipelineStageFlagBits2::eNone;
                if (require_queue_transfer(src_access, dst_access)) {
                    src_access = transfer_queue_access(src_access, dst_access);
                }
            }
            insert_barrier(cmd, src_access, dst_access);
        }
    }
}

The algorithm is relatively simple. First we check whether we need to flush the queue (vkQueueSubmit). Then we loop over the access_records and look up each one's previous access; only this way can we find the srcAccessMask/srcStageMask we need. Then we compare, and there are three progressively more complex scenarios. The best case is that we do not need a barrier at all, such as read-after-read, and nothing happens. If we do need a barrier, we further check whether we need a semaphore and a queue ownership transfer, and adapt to that scenario. The worst case is the queue ownership transfer, where we have to do an additional submit inside transfer_queue_access().
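For completeness, here is roughly how the decision helpers can be kept small; this sketch reuses the hypothetical tracked_access from above and is intentionally simplified:

// src is the tracked previous access, dst the access the current command declares
bool require_barrier(tracked_access const& src, tracked_access const& dst)
{
    // read-after-read with no layout change is the only case that needs nothing
    bool both_read     = src.pattern == access_record::READ &&
                         dst.pattern == access_record::READ;
    bool layout_change = src.layout != dst.layout;
    return !both_read || layout_change;
}

bool require_semaphore(tracked_access const& src, tracked_access const& dst)
{
    // two different queues can only be ordered with a semaphore
    return src.queue && src.queue != dst.queue;
}

bool require_queue_transfer(tracked_access const& src, tracked_access const& dst)
{
    // exclusive-sharing resources need a release/acquire pair when the
    // queue family changes; first use never needs a transfer
    return src.queue_family != VK_QUEUE_FAMILY_IGNORED &&
           src.queue_family != dst.queue_family;
}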

Conclusions

There you go: a complete solution for Vulkan synchronization that only requires local knowledge of your access patterns. With this, each render pass you write is decoupled from the others and works as a module. I hope you find it useful.
