WebGPU Learning (11): Two Optimizations: "reuse render command buffer" and "dynamic uniform buffer offset"
Hello everyone. This post introduces two optimizations, "reuse render command buffer" and "dynamic uniform buffer offset", along with the benchmark that the animometer sample (Chrome -> webgpu-samples -> animometer) runs against them.
Previous post:
WebGPU Learning (10): Introducing "Particle Effects on the GPU"
Optimization: reuse render command buffer
The problem
Each frame is drawn through the following steps:
- Create a command buffer
- Begin a render pass
- Record multiple render commands into the command buffer
- End the render pass
The relevant code is:
return function frame() {
...
const commandEncoder = device.createCommandEncoder();
...
const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
passEncoder.setPipeline(pipeline);
passEncoder.setVertexBuffer(0, verticesBuffer);
passEncoder.setBindGroup(0, uniformBindGroup1);
passEncoder.draw(36, 1, 0, 0);
passEncoder.endPass();
...
}
As we can see, the commands recorded into the command buffer are generally the same every frame, so recording them again each frame is repeated work. The overhead has two parts:
- JS binding overhead
such as converting descriptor objects (for example the GPURenderPipelineDescriptor passed in when creating a render pipeline) and strings, handling bounds, and validating data
- The overhead of creating render commands and recording them into the command buffer
Optimization
WebGPU provides GPURenderBundle: render commands are recorded into a render bundle once, and the bundle is then executed every frame, which achieves reuse of the command buffer.
WebGPU also supports creating multiple bundles, so different render commands can be recorded into their respective render bundles.
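The record-once, replay-every-frame idea can be illustrated without a GPU. The sketch below uses a hypothetical `MockEncoder` class (a made-up name, not a WebGPU class) that only counts how many calls JavaScript makes into the encoder. It assumes nothing about how a browser actually implements GPURenderBundle.

```javascript
// Hypothetical mock encoder (NOT the WebGPU API): it only counts JS-side calls.
class MockEncoder {
  constructor() {
    this.calls = 0;      // number of calls made from JS into this encoder
    this.commands = [];  // commands recorded or replayed
  }
  setPipeline(name) { this.calls++; this.commands.push(`setPipeline:${name}`); }
  setBindGroup(slot, name) { this.calls++; this.commands.push(`setBindGroup:${slot}:${name}`); }
  draw(vertexCount) { this.calls++; this.commands.push(`draw:${vertexCount}`); }
  // Bundle side: snapshot what was recorded.
  finish() { return this.commands.slice(); }
  // Pass side: replay prerecorded commands with a single JS call.
  executeBundles(bundles) {
    this.calls++;
    for (const bundle of bundles) this.commands.push(...bundle);
  }
}

// The same five commands as the two-draw-call example in this post.
function recordCommands(encoder) {
  encoder.setPipeline("pipeline");
  encoder.setBindGroup(0, "uniformBindGroup1");
  encoder.draw(36);
  encoder.setBindGroup(0, "uniformBindGroup2");
  encoder.draw(36);
}

const FRAMES = 3;

// Without a bundle: every frame records all five commands again.
let callsWithoutBundle = 0;
for (let i = 0; i < FRAMES; i++) {
  const pass = new MockEncoder();
  recordCommands(pass);
  callsWithoutBundle += pass.calls;
}

// With a bundle: record once up front, then one executeBundles call per frame.
const bundleEncoder = new MockEncoder();
recordCommands(bundleEncoder);
const bundle = bundleEncoder.finish();
let callsWithBundle = bundleEncoder.calls;
let lastPassCommands = [];
for (let i = 0; i < FRAMES; i++) {
  const pass = new MockEncoder();
  pass.executeBundles([bundle]);
  callsWithBundle += pass.calls;
  lastPassCommands = pass.commands;
}

console.log(callsWithoutBundle, callsWithBundle); // 15 8
```

The gap widens with more frames and more draw calls, since the per-frame cost with a bundle stays at one call in this model.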
Sample code
Notes on the sample code:
1. Two draw calls are issued, each with its own bind group.
Here are the original sample code and the optimized sample code, for reference:
- Original sample code: no bundle
Code:
return function frame() {
...
const commandEncoder = device.createCommandEncoder();
...
const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
passEncoder.setPipeline(pipeline);
passEncoder.setVertexBuffer(0, verticesBuffer);
passEncoder.setBindGroup(0, uniformBindGroup1);
passEncoder.draw(36, 1, 0, 0);
passEncoder.setBindGroup(0, uniformBindGroup2);
passEncoder.draw(36, 1, 0, 0);
passEncoder.endPass();
...
}
- Optimized sample code: one bundle
Code:
function recordRenderPass(passEncoder) {
passEncoder.setPipeline(pipeline);
passEncoder.setVertexBuffer(0, verticesBuffer);
passEncoder.setBindGroup(0, uniformBindGroup1);
passEncoder.draw(36, 1, 0, 0);
passEncoder.setBindGroup(0, uniformBindGroup2);
passEncoder.draw(36, 1, 0, 0);
}
const renderBundleEncoder = device.createRenderBundleEncoder({
colorFormats: [swapChainFormat],
});
recordRenderPass(renderBundleEncoder);
const renderBundle = renderBundleEncoder.finish();
return function frame(timestamp) {
...
const commandEncoder = device.createCommandEncoder();
...
const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
passEncoder.executeBundles([renderBundle]);
passEncoder.endPass();
...
}
- Optimized sample code: two bundles
Code:
function recordRenderPass1(passEncoder) {
passEncoder.setPipeline(pipeline);
passEncoder.setVertexBuffer(0, verticesBuffer);
passEncoder.setBindGroup(0, uniformBindGroup1);
passEncoder.draw(36, 1, 0, 0);
}
function recordRenderPass2(passEncoder) {
passEncoder.setPipeline(pipeline);
passEncoder.setVertexBuffer(0, verticesBuffer);
passEncoder.setBindGroup(0, uniformBindGroup2);
passEncoder.draw(36, 1, 0, 0);
}
const renderBundleEncoder1 = device.createRenderBundleEncoder({
colorFormats: [swapChainFormat],
});
recordRenderPass1(renderBundleEncoder1);
const renderBundle1 = renderBundleEncoder1.finish();
const renderBundleEncoder2 = device.createRenderBundleEncoder({
colorFormats: [swapChainFormat],
});
recordRenderPass2(renderBundleEncoder2);
const renderBundle2 = renderBundleEncoder2.finish();
return function frame(timestamp) {
...
const commandEncoder = device.createCommandEncoder();
...
const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
passEncoder.executeBundles([renderBundle1, renderBundle2]);
passEncoder.endPass();
...
}
Further analysis
Let's look at the definitions related to bundles and render passes:
interface GPUDevice : EventTarget {
...
GPURenderBundleEncoder createRenderBundleEncoder(GPURenderBundleEncoderDescriptor descriptor);
...
}
dictionary GPURenderBundleEncoderDescriptor : GPUObjectDescriptorBase {
required sequence<GPUTextureFormat> colorFormats;
GPUTextureFormat depthStencilFormat;
unsigned long sampleCount = 1;
};
...
interface GPUCommandEncoder {
...
GPURenderPassEncoder beginRenderPass(GPURenderPassDescriptor descriptor);
...
}
...
dictionary GPURenderPassDescriptor : GPUObjectDescriptorBase {
required sequence<GPURenderPassColorAttachmentDescriptor> colorAttachments;
GPURenderPassDepthStencilAttachmentDescriptor depthStencilAttachment;
};
Note: when creating a bundle, you must specify the same formats as the color attachments and the depth/stencil attachment of the render pass the bundle will be executed in.
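To make the matching rule concrete, here is a minimal sketch of the comparison. `isBundleCompatible` is a hypothetical helper written for illustration; WebGPU performs this validation internally and does not expose such a function. The format strings are example values, and the `sampleCount` comparison is an assumption based on the descriptor above listing it with a default of 1.

```javascript
// Hypothetical helper (NOT a WebGPU API): compares the formats a bundle was
// created with against the formats of the render pass that will execute it.
function isBundleCompatible(bundleDesc, passFormats) {
  const bundleColors = bundleDesc.colorFormats;
  const passColors = passFormats.colorFormats;
  if (bundleColors.length !== passColors.length) return false;
  for (let i = 0; i < bundleColors.length; i++) {
    if (bundleColors[i] !== passColors[i]) return false;
  }
  // A missing depth/stencil format must be missing on both sides.
  if (bundleDesc.depthStencilFormat !== passFormats.depthStencilFormat) return false;
  // Assumption: sampleCount must match too; it defaults to 1 when omitted.
  const bundleSamples = bundleDesc.sampleCount === undefined ? 1 : bundleDesc.sampleCount;
  const passSamples = passFormats.sampleCount === undefined ? 1 : passFormats.sampleCount;
  return bundleSamples === passSamples;
}

const bundleDesc = { colorFormats: ["bgra8unorm"] };

// Same color format, no depth/stencil on either side: compatible.
const okPass = { colorFormats: ["bgra8unorm"] };
// The pass has a depth/stencil attachment the bundle did not declare: incompatible.
const depthPass = { colorFormats: ["bgra8unorm"], depthStencilFormat: "depth24plus-stencil8" };

console.log(isBundleCompatible(bundleDesc, okPass));    // true
console.log(isBundleCompatible(bundleDesc, depthPass)); // false
```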
References
Encoder results reuse
Add GPURenderBundle
How do people reuse command buffers? (may require a VPN to access)
Optimization: dynamic uniform buffer offset
The problem
In most applications, each draw call needs different uniform values, which live in different uniform buffers. Because uniform buffers are bound through a bind group, this means creating and setting a bind group for every draw call in every frame.
Creating a bind group costs more than a draw call. From the performance tests in "Proposal: Dynamic uniform and storage buffer offsets", we know that modern graphics APIs can create only a limited number of bind groups per frame (and since WebGPU is implemented on top of those APIs, the same limit applies to WebGPU):
This means, in a single frame, the Metal devices can create 285 bind groups, the D3D12 devices can create 7270 bind groups, and the Vulkan devices can create 18561 bind groups.
Optimization
- Create all the bind groups once as a cache, then on each draw call simply set the corresponding bind group, which removes the cost of creating bind groups at draw time.
- Use a dynamic uniform buffer
In addition, because WebGPU supports "dynamic uniform buffer offset", we can also optimize as follows:
Create only one bind group and mark it as using a dynamic offset;
On each draw call, set that same bind group with the corresponding offset.
Compared with the first optimization, the second is simpler: only one bind group is created and no cache has to be maintained.
According to Proposal: Dynamic uniform and storage buffer offsets:
I believe we said:
We need at least one of the two for the MVP
Having both causes more complication because they will fight for root table space so we might have to introduce a combined limit for pushConstantSize + N * DynamicBufferCount.
The MVP version of WebGPU will probably not support dynamic storage buffer offsets. In other words, a bind group marked as using dynamic offsets can contain one or more uniform buffers, but not storage buffers.
Sample code
Notes on the sample code:
1. Every bind group refers to the same uniform buffer; only the offset differs
The uniform buffer holds these uniform variables:
float scale;
float offsetX;
float offsetY;
float scalar;
float scalarOffset;
2. There are 100 gameObjects in total, corresponding to 100 draw calls and 100 sets of uniform data (stored in uniformBufferData)
3. In the sample code for the second optimization, the uniform buffer offset used by each draw call's bind group must be a multiple of 256
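The alignment arithmetic behind note 3 can be worked through on its own. The sketch below assumes the 256-byte multiple stated above and the five float uniforms listed earlier; `alignTo` is a helper name introduced here for illustration.

```javascript
const UNIFORM_FLOATS = 5;     // scale, offsetX, offsetY, scalar, scalarOffset
const OFFSET_ALIGNMENT = 256; // required multiple for each offset, per note 3
const GAME_OBJECTS = 100;

// Round `bytes` up to the next multiple of `alignment`.
function alignTo(bytes, alignment) {
  return Math.ceil(bytes / alignment) * alignment;
}

const uniformBytes = UNIFORM_FLOATS * Float32Array.BYTES_PER_ELEMENT;               // 20
const alignedUniformBytes = alignTo(uniformBytes, OFFSET_ALIGNMENT);                // 256
const alignedUniformFloats = alignedUniformBytes / Float32Array.BYTES_PER_ELEMENT;  // 64

// Byte offset of each gameObject's uniform block inside the shared buffer.
const offsets = Array.from({ length: GAME_OBJECTS }, (_, i) => i * alignedUniformBytes);

console.log(uniformBytes, alignedUniformBytes, alignedUniformFloats); // 20 256 64
console.log(offsets[1], offsets[99]);                                  // 256 25344
```

Each object uses only 20 of its 256 bytes, so the padding is the price paid for being able to address every object's data by offset.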
Here are the sample code for the first optimization and the sample code for the second optimization, for reference:
- Sample code using the first optimization
Code:
const bindGroupLayout = device.createBindGroupLayout({
bindings: [
{ binding: 0, visibility: GPUShaderStage.VERTEX, type: "uniform-buffer" },
],
});
const pipelineLayout = device.createPipelineLayout({ bindGroupLayouts: [bindGroupLayout] });
const pipeline = device.createRenderPipeline({
layout: pipelineLayout,
...
});
const gameObjects = 100;
const uniformBytes = 5 * Float32Array.BYTES_PER_ELEMENT;
const alignedUniformBytes = Math.ceil(uniformBytes / 256) * 256;
const alignedUniformFloats = alignedUniformBytes / Float32Array.BYTES_PER_ELEMENT;
const uniformBuffer = device.createBuffer({
size: gameObjects * alignedUniformBytes + Float32Array.BYTES_PER_ELEMENT,
usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.UNIFORM
});
const uniformBufferData = new Float32Array(gameObjects * alignedUniformFloats);
// Cache array of bind groups
const bindGroups = new Array(gameObjects);
function setUniformBufferData(i) {
uniformBufferData[alignedUniformFloats * i + 0] = Math.random() * 0.2 + 0.2; // scale
uniformBufferData[alignedUniformFloats * i + 1] = 0.9 * 2 * (Math.random() - 0.5); // offsetX
uniformBufferData[alignedUniformFloats * i + 2] = 0.9 * 2 * (Math.random() - 0.5); // offsetY
uniformBufferData[alignedUniformFloats * i + 3] = Math.random() * 1.5 + 0.5; // scalar
uniformBufferData[alignedUniformFloats * i + 4] = Math.random() * 10; // scalarOffset
}
for (let i = 0; i < gameObjects; ++i) {
setUniformBufferData(i);
bindGroups[i] = device.createBindGroup({
layout: bindGroupLayout,
bindings: [{
binding: 0,
resource: {
buffer: uniformBuffer,
offset: i * alignedUniformBytes,
size: 5 * Float32Array.BYTES_PER_ELEMENT,
}
}]
});
}
uniformBuffer.setSubData(0, uniformBufferData);
return function frame() {
...
const commandEncoder = device.createCommandEncoder();
...
const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
passEncoder.setPipeline(pipeline);
passEncoder.setVertexBuffer(0, verticesBuffer);
for (let i = 0; i < gameObjects; ++i) {
passEncoder.setBindGroup(0, bindGroups[i]);
passEncoder.draw(3, 1, 0, 0);
}
passEncoder.endPass();
...
}
- Sample code using the second optimization
Code:
// Set hasDynamicOffset to true
const dynamicBindGroupLayout = device.createBindGroupLayout({
bindings: [
{ binding: 0, visibility: GPUShaderStage.VERTEX, type: "uniform-buffer", hasDynamicOffset: true },
],
});
const dynamicPipelineLayout = device.createPipelineLayout({ bindGroupLayouts: [dynamicBindGroupLayout] });
const dynamicPipeline = device.createRenderPipeline({
layout: dynamicPipelineLayout,
...
});
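// The definitions of gameObjects etc. are the same as in the sample code for the first optimization, so they are omitted
...
for (let i = 0; i < gameObjects; ++i) {
// setUniformBufferData is the same function as in the sample code for the first optimization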
setUniformBufferData(i);
}
const dynamicBindGroup = device.createBindGroup({
layout: dynamicBindGroupLayout,
bindings: [{
binding: 0,
resource: {
buffer: uniformBuffer,
offset: 0,
size: 5 * Float32Array.BYTES_PER_ELEMENT,
},
}],
});
uniformBuffer.setSubData(0, uniformBufferData);
const dynamicOffsets = [0];
return function frame() {
...
const commandEncoder = device.createCommandEncoder();
...
const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
passEncoder.setPipeline(dynamicPipeline);
passEncoder.setVertexBuffer(0, verticesBuffer);
for (let i = 0; i < gameObjects; ++i) {
// A small optimization: the dynamicOffsets array is created ahead of time and its element is updated here, rather than calling "passEncoder.setBindGroup(0, dynamicBindGroup, [i * alignedUniformBytes]);" directly, because this avoids allocating a new array "[i * alignedUniformBytes]" on every draw call
dynamicOffsets[0] = i * alignedUniformBytes;
passEncoder.setBindGroup(0, dynamicBindGroup, dynamicOffsets);
passEncoder.draw(3, 1, 0, 0);
}
passEncoder.endPass();
...
}
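The array-reuse trick in the loop above can be checked in isolation. The sketch below uses a hypothetical `MockPassEncoder` (not the WebGPU class) that reads the offset during the call. That models the assumption that the real encoder consumes the offsets immediately, which is what makes mutating one shared array safe.

```javascript
// Hypothetical mock (NOT the WebGPU API). It copies the offset during the
// call, modelling the assumption that the encoder reads offsets immediately.
class MockPassEncoder {
  constructor() { this.recordedOffsets = []; }
  setBindGroup(slot, bindGroup, dynamicOffsets) {
    this.recordedOffsets.push(dynamicOffsets[0]);
  }
}

const gameObjects = 4;           // small count to keep the output short
const alignedUniformBytes = 256; // same alignment as in the sample code
const passEncoder = new MockPassEncoder();

// Allocated once and reused for every draw call.
const dynamicOffsets = [0];

for (let i = 0; i < gameObjects; ++i) {
  dynamicOffsets[0] = i * alignedUniformBytes;
  passEncoder.setBindGroup(0, "dynamicBindGroup", dynamicOffsets);
}

console.log(passEncoder.recordedOffsets); // [0, 256, 512, 768]
```

If an encoder instead kept a reference to the array and read it later, every draw would see the last offset, so the trick depends on that read-at-call-time behaviour.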
References
Proposal: Dynamic uniform and storage buffer offsets
Performance test
The animometer sample benchmarks these two optimizations.
(Note that the sample's "size: 6 * Float32Array.BYTES_PER_ELEMENT" should be changed to "size: 5 * Float32Array.BYTES_PER_ELEMENT")
A screenshot of the running sample is shown below:

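Selecting the buttons in the red circle on the right enables the corresponding optimization;
The purple circle in the upper right sets the number of triangles drawn;
In the blue circle in the upper left, the first line shows the time each frame spends on the CPU side, which mainly consists of the JS binding time of the render pass; the second line shows the total time per frame, which equals the CPU time plus the GPU time.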
Test data
Results for drawing 40,000 triangles on my machine (Mac Pro 2014, macOS Catalina 10.15.1, Chrome Canary 80.0.3977.4):
- Bundle only, compared with no optimization
JS binding time dropped sharply, from 14 ms to ?.2 ms;
Total time per frame dropped by only 20%.
- Bundle plus offset, compared with bundle only
JS binding time and total time per frame were almost unchanged
- Offset only, compared with no optimization
JS binding time rose sharply, by 60%;
Total time per frame rose only slightly, by 10%.
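Conclusion
The offset optimization increases CPU-side cost but reduces GPU-side cost, so the total time per frame rises only a little. It also makes the code simpler (only one bind group is created) and may reduce memory usage (I have not tested this; it is only a guess), so I recommend using it.
The bundle optimization greatly reduces CPU-side cost but increases GPU-side cost. Still, the total time per frame drops by 20%, and there is room for browsers to optimize it further (see Encoder results reuse), so I recommend using it.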