An In-Depth Look at MSAA

2018-01-08 21:37, 風戀殘雪

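This article takes an in-depth look at MSAA (multisample anti-aliasing), covering the basic principles and a comparison of implementations across platforms, mainly PC and mobile. I wrote it to get a better understanding of MSAA myself. It will inevitably contain mistakes; if you spot any, please point them out so that others are not misled. With that out of the way, let's get started.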

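How MSAA Works

Aliasing

Before explaining how MSAA works, here is a brief introduction to aliasing. In signal processing and related fields, aliasing is the effect that makes different signals indistinguishable once they are sampled. It also refers to the artifacts that appear when a signal reconstructed from its samples does not match the original signal. It comes in two forms: temporal aliasing (for example in digital music, or wheels that appear to spin backwards in films) and spatial aliasing (moiré patterns). We will not go into detail here.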
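The definition above (different signals becoming identical once sampled) can be checked numerically. The following is a minimal sketch, not taken from the original article: the sample rate and the two frequencies are illustrative values chosen so that the second frequency equals the first plus the sample rate.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, count):
    """Sample sin(2*pi*f*t) at t = n / sample_rate for n = 0..count-1."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]

SAMPLE_RATE = 10.0  # samples per second (illustrative value)

low = sample_sine(1.0, SAMPLE_RATE, 20)    # 1 Hz, below the 5 Hz Nyquist limit
high = sample_sine(11.0, SAMPLE_RATE, 20)  # 11 Hz = 1 Hz + sample rate

# The two sample sequences differ only by floating-point rounding,
# so the 11 Hz signal is indistinguishable from the 1 Hz one after sampling.
max_difference = max(abs(a - b) for a, b in zip(low, high))
print(max_difference)
```

Raising the sample rate above twice the highest frequency present is what removes the ambiguity, which is the same idea supersampling applies to images.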

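In real-time rendering specifically, there are three kinds of aliasing:

Geometric aliasing (jagged edges on geometry), caused by undersampling geometric edges.
Shading aliasing, caused by undersampling the shading function (the rendering equation) evaluated in the shader. The most noticeable symptom is flickering specular highlights.

The upper image shows the shading aliasing produced by undersampling a high-frequency specular BRDF combined with a high-frequency normal map. The lower image shows the result with 4x supersampling.
Temporal aliasing, caused mainly by undersampling fast-moving objects, for example animations in a game that appear to jump.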

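SSAA (Supersampling Anti-Aliasing)

As the name suggests, supersampling renders the scene at a higher resolution and then filters neighboring pixel values (for example by averaging them) to produce the final image (the resolve). Because it raises the sampling rate, it is effective against both the geometric aliasing and the shading aliasing described above. As the figure below shows, each pixel is first given n sub-samples, shading is computed for every sub-sample, and the final image is then composed from the sub-sample values.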
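The procedure described above (shade every sub-sample, then average) can be sketched in a few lines. This is an illustrative software model, not GPU code: the `shade` function is a hypothetical scene with a single vertical edge at x = 1.5, and the sub-samples are placed on a regular grid.

```python
def shade(x, y):
    """Hypothetical scene: white left of the line x = 1.5, black elsewhere."""
    return 1.0 if x < 1.5 else 0.0

def render_ssaa(width, height, factor):
    """Shade factor*factor sub-samples per pixel, then box-filter them."""
    image = []
    shade_calls = 0
    for py in range(height):
        row = []
        for px in range(width):
            total = 0.0
            for sy in range(factor):
                for sx in range(factor):
                    # Center of each sub-sample on a regular grid.
                    x = px + (sx + 0.5) / factor
                    y = py + (sy + 0.5) / factor
                    total += shade(x, y)
                    shade_calls += 1
            # Resolve: average the sub-samples of this pixel.
            row.append(total / (factor * factor))
        image.append(row)
    return image, shade_calls

image, shade_calls = render_ssaa(width=3, height=1, factor=2)
print(image, shade_calls)
```

The pixel that straddles the edge resolves to an intermediate gray, at the price of one `shade` call per sub-sample.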

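Although SSAA deals effectively with geometric and shading aliasing, it needs more video memory and more shading work (lighting is computed for every sub-sample), so it is rarely used. Following the same line of thought, if we shade once per final pixel rather than once per sub-sample, the memory footprint stays the same but the amount of shading drops, and performance improves considerably. This is MSAA, the main subject of this article.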
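To put rough numbers on this trade-off, here is a back-of-the-envelope sketch. The resolution, the 8 bytes per sample (RGBA8 color plus a 32-bit depth/stencil value) and the single full-screen layer of geometry are all assumptions made for illustration. The MSAA invocation count is a best case, since edge pixels touched by several triangles are shaded once per triangle.

```python
def frame_cost(width, height, samples, bytes_per_sample, shade_per_sample):
    """Return (buffer bytes, fragment shader invocations) for one full-screen layer."""
    buffer_bytes = width * height * samples * bytes_per_sample
    invocations = width * height * (samples if shade_per_sample else 1)
    return buffer_bytes, invocations

W, H = 1920, 1080
BYTES = 4 + 4  # assumed: RGBA8 color + 32-bit depth/stencil per sample

no_aa = frame_cost(W, H, 1, BYTES, shade_per_sample=False)
ssaa4 = frame_cost(W, H, 4, BYTES, shade_per_sample=True)
msaa4 = frame_cost(W, H, 4, BYTES, shade_per_sample=False)

for name, (size, calls) in [("no AA", no_aa), ("4x SSAA", ssaa4), ("4x MSAA", msaa4)]:
    print(f"{name}: {size / 2**20:.1f} MiB, {calls} shader invocations")
```

Under these assumptions 4x MSAA needs the same storage as 4x SSAA but only as many shader invocations as rendering without anti-aliasing.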

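MSAA (Multisample Anti-Aliasing)

In SSAA, every sub-sample is shaded separately, which is still expensive when the fragment (pixel) shader is complex. Could we instead compute the color once per pixel and, for the sub-samples, compute only coverage and occlusion information that decides which sub-samples receive the pixel's color? A reconstruction filter then downsamples the colors stored in the sub-samples to produce the target image. That is the principle of MSAA. One point is very important here: every sub-sample has its own color, depth and stencil values, and every sub-sample must pass the depth and stencil tests before the pixel's color is written to it; a coverage test alone is not enough. Later in the article I cite several sources that back this up. For now, let us finish explaining how MSAA works.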

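Coverage and Occlusion

A GPU that supports D3D11 can render points, lines and triangles through rasterization. The rasterization pipeline takes the primitive's vertices as input, with positions in homogeneous clip space after the perspective transform. These positions determine which pixels of the current render target the triangle lands on. Whether a pixel is visible is decided by two factors: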

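Coverage: coverage is determined by testing whether a primitive overlaps a given pixel. On a GPU, this is done by testing whether a sample point, located at the center of the pixel, lies inside the primitive. The following figure illustrates the process.

Coverage of a triangle. Blue dots are sample points, each located at a pixel center. Red dots are sample points covered by the triangle.

Occlusion: occlusion tells us whether a pixel covered by a primitive is also covered by other primitives. This should be familiar: it is the z-buffer depth test.

Together, coverage and occlusion determine a primitive's visibility.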
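The pixel-center coverage test can be sketched with edge functions, a common way to test a point against a triangle. This is a simplified software model rather than a description of any particular GPU: it assumes a counter-clockwise triangle and treats points exactly on an edge as covered, whereas real rasterizers apply a fill rule for such ties. The triangle coordinates are illustrative.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area term: positive when P is to the left of the edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def covers(tri, px, py):
    """True when point P lies inside a counter-clockwise triangle."""
    (ax, ay), (bx, by), (cx, cy) = tri
    return (edge(ax, ay, bx, by, px, py) >= 0 and
            edge(bx, by, cx, cy, px, py) >= 0 and
            edge(cx, cy, ax, ay, px, py) >= 0)

def covered_pixels(tri, width, height):
    """Test the center (x + 0.5, y + 0.5) of every pixel against the triangle."""
    return [(x, y) for y in range(height) for x in range(width)
            if covers(tri, x + 0.5, y + 0.5)]

triangle = [(0.0, 0.0), (3.8, 0.0), (0.0, 3.8)]  # counter-clockwise
pixels = covered_pixels(triangle, 4, 4)
print(pixels)
```

With one sample per pixel a pixel is either fully in or fully out, which is exactly what produces the staircase along the diagonal edge.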

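As far as rasterization is concerned, MSAA works much like SSAA: coverage and occlusion are both evaluated at a higher resolution. For coverage, the hardware generates n sub-samples for each pixel according to a sampling pattern. The following figure shows the sub-sample positions of a rotated-grid pattern.

The triangle is tested for coverage against every sub-sample of the pixel, producing a binary coverage mask that represents the proportion of the pixel covered by the triangle. For the occlusion test, the triangle's depth is interpolated at each covered sub-sample position and compared with the depth stored in the z-buffer. Because the depth test runs per sub-sample rather than per pixel, the depth buffer has to grow to store the extra depth values. In practice the depth buffer is n times the size of the non-MSAA one.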
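Building the coverage mask can be sketched as follows. The sub-sample offsets are an assumption: they follow the commonly documented Direct3D standard 4x pattern (in sixteenths of a pixel), and actual hardware may use different positions. The triangle is a hypothetical one whose right edge runs vertically through x = 1.5.

```python
# Assumed 4x rotated-grid offsets from the pixel center, in pixel units.
SAMPLE_OFFSETS = [(-2 / 16, -6 / 16), (6 / 16, -2 / 16),
                  (-6 / 16, 2 / 16), (2 / 16, 6 / 16)]

def edge(ax, ay, bx, by, px, py):
    """Signed area term: positive when P is to the left of the edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def covers(tri, px, py):
    """True when point P lies inside a counter-clockwise triangle."""
    (ax, ay), (bx, by), (cx, cy) = tri
    return (edge(ax, ay, bx, by, px, py) >= 0 and
            edge(bx, by, cx, cy, px, py) >= 0 and
            edge(cx, cy, ax, ay, px, py) >= 0)

def coverage_mask(tri, px, py):
    """Bit i is set when sub-sample i of pixel (px, py) is inside the triangle."""
    mask = 0
    for i, (ox, oy) in enumerate(SAMPLE_OFFSETS):
        if covers(tri, px + 0.5 + ox, py + 0.5 + oy):
            mask |= 1 << i
    return mask

triangle = [(-10.0, -10.0), (1.5, -10.0), (1.5, 20.0)]  # counter-clockwise
masks = [coverage_mask(triangle, px, 0) for px in range(3)]
fractions = [bin(m).count("1") / len(SAMPLE_OFFSETS) for m in masks]
print([format(m, "04b") for m in masks], fractions)
```

The pixel cut by the edge gets a mask with two of four bits set, so the mask doubles as an estimate of how much of the pixel the triangle covers.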

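Where MSAA differs from SSAA is that SSAA shades every sub-sample, whereas MSAA runs the fragment program only once for each pixel whose coverage mask is non-zero, with vertex attributes interpolated at the pixel center. This is MSAA's biggest advantage over SSAA.

Although shading runs only once per pixel, that does not mean a single color value is stored. A color value is stored for every sub-sample, so extra space is needed, and the color buffer is also n times the size of the non-MSAA one. When the fragment program outputs a value, only the sub-samples that passed both the coverage test and the occlusion test are written. So if a triangle covers half of the sub-samples of a 4x pattern, half of the sub-samples receive the new value; if all sub-samples are covered, all of them receive it. The following figure illustrates the idea:

Because the coverage mask decides which sub-samples are updated, a pixel can end up holding n different values from n triangles that each partially cover its sub-samples. The following image shows 4x MSAA rasterization.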
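The per-sample write described above can be modeled as follows. This is a minimal sketch under stated assumptions: `MsaaPixel` and `write_fragment` are hypothetical names, the depth test is a plain less-than comparison, stencil is omitted, and each triangle's color is assumed to have been shaded once for the pixel before the call.

```python
class MsaaPixel:
    """Color and depth storage for one pixel with n sub-samples."""
    def __init__(self, samples, clear_color, clear_depth=1.0):
        self.color = [clear_color] * samples
        self.depth = [clear_depth] * samples

def write_fragment(pixel, coverage_mask, sample_depths, shaded_color):
    """Write one shaded color to every covered sample that passes the depth test.

    Returns the number of samples written.
    """
    written = 0
    for i in range(len(pixel.color)):
        covered = (coverage_mask >> i) & 1
        if covered and sample_depths[i] < pixel.depth[i]:
            pixel.color[i] = shaded_color
            pixel.depth[i] = sample_depths[i]
            written += 1
    return written

RED, BLUE, BLACK = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)
pixel = MsaaPixel(samples=4, clear_color=BLACK)

# A red triangle at depth 0.5 covers samples 0 and 2 (mask 0b0101).
red_written = write_fragment(pixel, 0b0101, [0.5] * 4, RED)
# A blue triangle at depth 0.8 covers every sample but lies behind the red one.
blue_written = write_fragment(pixel, 0b1111, [0.8] * 4, BLUE)
print(pixel.color)
```

Two fragment shader results end up stored in one pixel: the blue fragment is rejected per sample where red is nearer, which a coverage-only scheme could not do.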

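MSAA Resolve

As with supersampling, the oversampled signal has to be resampled to the output resolution before it can be displayed.

This process is called resolving. In its earliest form, the resolve was performed by fixed-function hardware on the GPU, usually with a one-pixel-wide box filter. For fully covered pixels this filter gives the same result as rendering without MSAA. Whether that is good or bad depends on your point of view (good because no detail is lost to blurring, bad because a box filter introduces postaliasing). For pixels on triangle edges you get a trademark gradient of color values, with a number of steps equal to the number of sub-samples. The following figure shows the effect:

Different hardware vendors may use different algorithms, for example NVIDIA's "Quincunx" AA. As GPUs have evolved, the MSAA resolve can now be done in a custom shader.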
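The one-pixel-wide box filter mentioned above amounts to averaging the sub-samples of each pixel. The sketch below is a plain-Python illustration of that idea (the function name `resolve_box` is hypothetical) and reproduces the edge gradient for a white triangle over a black background with 4 sub-samples.

```python
def resolve_box(sample_colors):
    """Average all sub-sample colors of a pixel (one-pixel-wide box filter)."""
    n = len(sample_colors)
    channels = len(sample_colors[0])
    return tuple(sum(c[k] for c in sample_colors) / n for k in range(channels))

WHITE, BLACK = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)

# k of the four samples are covered by the white triangle, for k = 0..4.
gradient = [resolve_box([WHITE] * k + [BLACK] * (4 - k))[0] for k in range(5)]
print(gradient)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

A custom resolve shader would replace the plain average with a different weighting, for example a wider filter that also reads samples from neighboring pixels.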

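Summary

As the explanation above shows, MSAA is not completed in the rasterization stage alone; that stage only generates the coverage information. The pixel color is then computed and written to the sub-samples according to the coverage and depth information. Once rendering is finished, a filter downsamples the result into the final image. The overall flow is as follows:

PC vs. Mobile

Having covered the basic principles of MSAA, are there differences between implementations from different GPU vendors and on different platforms? Here is a brief comparison. Since the algorithm itself is fixed, the differences mostly come down to how details are handled and to differences in GPU architecture.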

API version	MSAA supported	Custom shader resolve	Needs larger color/depth buffers
Direct3D 9	Yes	No	Yes
Direct3D 11	Yes	Yes	Yes
Direct3D 12	Yes	Yes	Yes
OpenGL ES 2.0	(Multisample rasterization cannot be enabled or disabled after a GL context is created. It is enabled if the value of SAMPLE_BUFFERS is one, and disabled otherwise) Multisample texture: use the GL_EXT_multisampled_render_to_texture extension. Apple: APPLE_framebuffer_multisample. Android: configured through EGL.	No	Depends on the GPU architecture. TBR (Mali, Qualcomm Adreno before the 300 series) and TBDR (PowerVR): no. IMR (NVIDIA Tegra; Qualcomm Adreno 300 series and later can switch between IMR and TBR): yes. Also yes when using GL_EXT_multisampled_render_to_texture (depends on the hardware implementation; see "enabling MSAA the right way in OpenGL ES").
OpenGL ES 3.0	Yes (The technique is to sample all primitives multiple times at each pixel. The color sample values are resolved to a single, displayable color. For window system-provided framebuffers, this occurs each time a pixel is updated, so the antialiasing appears to be automatic at the application level. For application-created framebuffers, this must be requested by calling the BlitFramebuffer command (see section 4.3.3).) When rendering textures, emphasis is placed on multisample anti-aliasing (MSAA), which earlier hardware generations could only run against the framebuffer. OpenGL ES 3.0 can presently support MSAA-type rendering for a texture.	No	Window-system-provided framebuffer: same as OpenGL ES 2.0. Application-created framebuffer: extra video memory is needed (possibly hardware-dependent; unconfirmed).
OpenGL ES 3.1	Yes	Yes (sampler2DMS)	Window-system-provided framebuffer: same as OpenGL ES 2.0. Application-created framebuffer: extra video memory is needed (possibly hardware-dependent; unconfirmed).

IMR vs TBR vs TBDR

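IMR (Immediate Mode Rendering)

PC GPUs are almost all immediate-mode renderers today: the CPU submits render data and commands, and the GPU starts executing them. What gets drawn has little to do with what has already been drawn or what will be drawn later (Early Z aside). The flow is shown in the figure below:

TBR (Tile-Based Rendering)

TBR splits the screen into small tiles and processes each one separately, so tiles can be processed in parallel. Because the GPU needs only part of the scene's data at any one time, that data (color, depth and so on) is small enough to be kept on-chip, which greatly reduces the number of accesses to system memory. The benefits are lower power consumption and lower bandwidth use, which in turn give higher performance.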
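The tiling step can be sketched as a binning pass that records, for every tile, which triangles may touch it. This is a simplified illustration, not any vendor's algorithm: it bins by bounding box, which is conservative (a tile may list a triangle that only touches its corner), and the tile size and triangle coordinates are made up.

```python
import math

def bin_triangle(tri, tile_size, tiles_x, tiles_y):
    """Return the (tx, ty) tiles overlapped by the triangle's bounding box."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    tx0 = max(0, math.floor(min(xs) / tile_size))
    ty0 = max(0, math.floor(min(ys) / tile_size))
    tx1 = min(tiles_x - 1, math.floor(max(xs) / tile_size))
    ty1 = min(tiles_y - 1, math.floor(max(ys) / tile_size))
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]

def build_tile_lists(triangles, tile_size, tiles_x, tiles_y):
    """Map every tile to the indices of the triangles binned into it."""
    tile_lists = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for index, tri in enumerate(triangles):
        for tile in bin_triangle(tri, tile_size, tiles_x, tiles_y):
            tile_lists[tile].append(index)
    return tile_lists

triangles = [
    [(2.0, 2.0), (10.0, 2.0), (2.0, 10.0)],      # entirely inside tile (0, 0)
    [(12.0, 12.0), (20.0, 12.0), (12.0, 20.0)],  # bounding box spans four tiles
]
tile_lists = build_tile_lists(triangles, tile_size=16, tiles_x=2, tiles_y=2)
print(tile_lists)
```

Each tile can then be rendered on its own using only its list, with color and depth for that tile held in on-chip memory until the tile is finished.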

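Tiling

TBDR (Tile-Based Deferred Rendering)

TBDR is similar to TBR: it also tiles the screen and keeps data (color, depth and so on) in on-chip memory. In addition it uses a deferral technique called Hidden Surface Removal (HSR), which postpones texturing and shading until the visibility of every pixel in the tile has been determined, so only pixels that end up visible consume processing resources. Unnecessary work on hidden pixels is eliminated, which keeps bandwidth use and processing cycles per frame as low as possible and yields higher performance and lower power consumption.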
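The effect of deferring shading until visibility is known can be illustrated by counting shader invocations for a single pixel. This is a toy model with assumptions spelled out: all fragments are opaque, the immediate-mode path has an early depth test, and the fragment list and names are invented for the example.

```python
def shade_immediate(fragments):
    """Shade in submission order with an early depth test.

    fragments: list of (depth, color) for one pixel; smaller depth is nearer.
    Returns (number of fragments shaded, final color).
    """
    nearest_depth = float("inf")
    color = None
    shaded = 0
    for depth, frag_color in fragments:
        if depth >= nearest_depth:
            continue  # hidden by something already drawn, rejected before shading
        shaded += 1   # the fragment shader runs here
        nearest_depth, color = depth, frag_color
    return shaded, color

def shade_deferred(fragments):
    """Resolve visibility first, then shade only the nearest opaque fragment."""
    if not fragments:
        return 0, None
    _depth, frag_color = min(fragments, key=lambda f: f[0])
    return 1, frag_color

back_to_front = [(0.9, "sky"), (0.5, "wall"), (0.1, "player")]
immediate = shade_immediate(back_to_front)
deferred = shade_deferred(back_to_front)
print(immediate, deferred)
```

Both paths produce the same final color, but the immediate path shades every fragment when geometry arrives back to front, while the deferred path shades once regardless of submission order.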

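A simple comparison of a traditional GPU and TBDR

MSAA on Mobile Platforms

With that brief background on mobile GPU architectures, let's look at how MSAA is handled on mobile platforms, as shown in the figure below:

Compared with an IMR GPU, a TBR or TBDR implementation of MSAA is much cheaper, because most of the work is done on-chip. Two costs remain:

4x MSAA needs four times the tile buffer memory. Because on-chip tile memory is expensive, the GPU works around this by reducing the tile size. A smaller tile size affects performance, but halving the tile size does not halve performance; workloads bottlenecked on the fragment program are affected only slightly.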
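The tile-size trade-off can be worked through with simple arithmetic. The sketch below assumes a fixed on-chip budget that holds a 32 x 32 tile without MSAA and halves one tile dimension each time the sample count doubles; the halving order and the 1920 x 1080 target are illustrative choices, and real hardware may behave differently.

```python
import math

def tile_dimensions(base_width, base_height, samples):
    """Shrink the tile so that `samples` values per pixel fit the same memory.

    Assumes `samples` is a power of two and halves the larger dimension
    (height first on ties) once per doubling of the sample count.
    """
    width, height = base_width, base_height
    remaining = samples
    while remaining > 1:
        if height >= width:
            height //= 2
        else:
            width //= 2
        remaining //= 2
    return width, height

def tile_count(screen_width, screen_height, tile_width, tile_height):
    """Number of tiles needed to cover the screen."""
    return math.ceil(screen_width / tile_width) * math.ceil(screen_height / tile_height)

for samples in (1, 2, 4):
    w, h = tile_dimensions(32, 32, samples)
    print(samples, (w, h), tile_count(1920, 1080, w, h))
```

Under these assumptions 4x MSAA quarters the tile area and roughly quadruples the number of tiles, which raises per-tile overhead without touching system memory bandwidth.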

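The second cost is that more fragments are generated along object edges, which also happens on IMR hardware. Each polygon touches more pixels, as the figure below shows. In addition, where background and foreground primitives both contribute to the same edge pixel, both fragments have to be shaded, so hardware hidden surface removal culls fewer pixels. The cost of these extra fragments depends on how many edges the scene contains, but 10% is a reasonable guess.

Implementation Details of Mainstream Mobile GPUs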

Mali:

JUST22 - Multisampled resolve on-tile is supported in hardware with no bandwidth hit. Mali GPUs support resolving multisampled framebuffers on-tile. Combined with tile-buffer support for full throughput in 4x MSAA makes 4x MSAA a very compelling way of improving quality with minimal speed hit.

In GLES on Mali GPUs, the simplest case for 4xMSAA would be to render directly to the window surface (FB0), having set EGL_SAMPLES to 4. This will do all multisampling and resolving in the GPU registers, and will only flush the resolved buffer to memory. This is the most efficient way to implement MSAA on a Mali GPU, and comes at almost no performance cost compared to rendering to a normal window surface. Note that this does not expose the sample buffers themselves to you, and does not require an explicit resolve.

Qualcomm Adreno:

Anti-aliasing is an important technique for improving the quality of generated images. It reduces the visual artifacts of rendering into discrete pixels. Among the various techniques for reducing aliasing effects, multisampling is efficiently supported by Adreno 4xx. Multisampling divides every pixel into a set of samples, each of which is treated like a "mini-pixel" during rasterization. Each sample has its own color, depth, and stencil value. And those values are preserved until the image is ready for display. When it is time to compose the final image, the samples are resolved into the final pixel color. Adreno 4xx supports the use of two or four samples per pixel.

PowerVR:

Another benefit of the SGX and SGX-MP architecture is the ability to perform efficient 4x Multi-Sample Anti-Aliasing (MSAA). MSAA is performed entirely on-chip, which keeps performance high without introducing a system memory bandwidth overhead (as would be seen when performing anti-aliasing in some other architectures). To achieve this, the tile size is effectively quartered and 4 sample positions are taken for each fragment (e.g., if the tile size is 16x16, an 8x8 tile will be processed when MSAA is enabled). The reduction in tile size ensures the hardware has sufficient memory to process and store colour, depth and stencil data for all of the sample positions. When the ISP operates on each tile, HSR and depth tests are performed for all sample positions. Additionally, the ISP uses a 1 bit flag to indicate if a fragment contains an edge. This flag is used to optimize blending operations later in the render. When the subsamples are submitted to the TSP, texturing and shading operations are executed on a per-fragment basis, and the resultant colour is set for all visible subsamples. This means that the fragment workload will only slightly increase when MSAA is enabled, as the subsamples within a given fragment may be coloured by different primitives when the fragment contains an edge. When performing blending, the edge flag set by the ISP indicates if the standard blend path needs to be taken, or if the optimized path can be used. If the destination fragment contains an edge, then the blend needs to be performed individually for each visible subsample to give the correct resultant colour (standard blend). If the destination fragment does not contain an edge, then the blend operation is performed once and the colour is set for all visible subsamples (optimized blend). Once a tile has been rendered, the Pixel Back End (PBE) combines the subsample colours for each fragment into a single colour value that can be written to the frame buffer in system memory. 
As this combination is done on the hardware before the colour data is sent, the system memory bandwidth required for the tile flush is identical to the amount that would be required when MSAA is not enabled.

On PowerVR hardware Multi-Sampled Anti-Aliasing (MSAA) can be performed directly in on-chip memory before being written out to system memory, which saves valuable memory bandwidth. In general, MSAA is considered to cost relatively little performance. This is true for typical games and UIs, which have low geometry counts but very complex shaders. The complex shaders typically hide the cost of MSAA and have a reduced blend workload. 2x MSAA is virtually free on most PowerVR graphics cores (Rogue onwards), while 4x MSAA+ will noticeably impact performance. This is partly due to the increased on-chip memory footprint, which results in a reduction in tile dimensions (for instance 32 x 32 -> 32 x 16 -> 16 x 16 pixels) as the number of samples taken increases. This in turn results in an increased number of tiles that need to be processed by the tile accelerator hardware, which then increases the vertex stages overall processing cost. The concept of "good enough" should be followed in determining how much anti-aliasing is enough. An application may only require 2x MSAA to look "good enough", while performing comfortably at a consistent 60 FPS. In some cases there may be no need for anti-aliasing to be used at all e.g. when the target device's display has high PPI (pixels per-inch). Performing MSAA becomes more costly when there is an alpha blended edge, resulting in the graphics core marking the pixels on the edge to "on edge blend". On edge blend is a costly operation, as the blending is performed for each sample by a shader (i.e. in software). In contrast, on opaque edge is performed by dedicated hardware, and is a much cheaper operation as a result. On edge blend is also "sticky", which means that once an on-screen pixel is marked, all subsequent blended pixels are blended by a shader, rather than by dedicated hardware. In order to mitigate these costs, submit all opaque geometry first, which keeps the pixels "off edge" for as long as possible.
Also, developers should be extremely reserved with the use of blending, as blending has lots of performance implications, not just for MSAA.

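Conclusion

This article covered how MSAA works and how architectural differences between PC and mobile platforms lead to different implementation details. MSAA affects the rasterization, fragment program and raster operations stages of the GPU pipeline (every sub-sample has to be depth tested). Every sub-sample has its own color and depth storage, and every sub-sample is depth tested. On mobile platforms, whether extra memory is needed for color and depth depends on the OpenGL ES version and on the specific hardware implementation. In the common case (no extra memory for color and depth, sub-samples processed entirely on-chip and resolved straight to the framebuffer), MSAA is more efficient on mobile than on PC because it avoids the large bandwidth cost. Hardware implementations differ widely, though, so measure on the target device. My knowledge is limited and mistakes are inevitable; if you find any, please point them out so that others are not misled.

References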

https://en.wikipedia.org/wiki/Aliasing
https://en.wikipedia.org/wiki/Moir%C3%A9_pattern
https://mynameismjp.wordpress.com/2012/10/21/applying-sampling-theory-to-real-time-graphics/
https://en.wikipedia.org/wiki/Supersampling
https://mynameismjp.wordpress.com/2012/10/24/msaa-overview/
https://mynameismjp.wordpress.com/2012/10/28/msaa-resolve-filters/
http://graphics.stanford.edu/courses/cs248-07/lectures/2007.10.11%20CS248-06%20Multisample%20Antialiasing/2007.10.11%20CS248-06%20Multisample%20Antialiasing.ppt
https://msdn.microsoft.com/en-us/library/windows/desktop/cc627092(v=vs.85).aspx
https://www.khronos.org/registry/OpenGL/specs/es/2.0/es_full_spec_2.0.pdf
https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_multisampled_render_to_texture.txt
https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/WorkingwithEAGLContexts/WorkingwithEAGLContexts.html#//apple_ref/doc/uid/TP40008793-CH103-SW4
https://stackoverflow.com/questions/27035893/antialiasing-in-opengl-es-2-0
https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/
https://www.imgtec.com/blog/understanding-powervr-series5xt-powervr-tbdr-and-architecture-efficiency-part-4/
https://en.wikipedia.org/wiki/Tiled_rendering
https://www.qualcomm.com/media/documents/files/the-rise-of-mobile-gaming-on-android-qualcomm-snapdragon-technology-leadership.pdf
https://static.docs.arm.com/100019/0100/arm_mali_application_developer_best_practices_developer_guide_100019_0100_00_en2.pdf
https://www.imgtec.com/blog/introducing-the-brand-new-opengl-es-3-0/
https://www.khronos.org/assets/uploads/developers/library/2014-gdc/Khronos-OpenGL-ES-GDC-Mar14.pdf
https://android.googlesource.com/platform/external/deqp/+/193f598/modules/gles31/functional/es31fMultisampleShaderRenderCase.cpp
https://www.anandtech.com/show/4686/samsung-galaxy-s-2-international-review-the-best-redefined/15
http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-TileBasedArchitectures.pdf
https://community.arm.com/graphics/f/discussions/4426/multisample-antialiasing-using-multisample-fbo
http://cdn.imgtec.com/sdk-documentation/PowerVR+Series5.Architecture+Guide+for+Developers.pdf
http://cdn.imgtec.com/sdk-documentation/PowerVR.Performance+Recommendations.pdf
