zyl910


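[C#] A cross-platform SIMD hardware-accelerated vector algorithm for horizontally flipping (FlipX) 32-bit images (using the YShuffleKernel method of VectorTraits to work around the shortcomings of Shuffle)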

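The previous article covered the vertical flip (FlipY) algorithm, so this article looks at the horizontal flip (FlipX). It starts with the easier case, flipping 32-bit images, which prepares the ground for a later article on the more complex 24-bit case.
Besides a scalar algorithm, this article also presents vector algorithms. They are cross-platform: the same source code runs on X86 (Sse, Avx and other instruction sets) and on Arm (AdvSimd and other instruction sets), with SIMD hardware acceleration on each.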

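1. Scalar algorithm

1.1 Approach

A horizontal flip, also called a left-right flip, mirrors the image about its vertical center axis.
Suppose src[x, y] accesses a pixel of the source image, dst[x, y] accesses a pixel of the destination image, and width is the image width in pixels. The formula for the horizontal flip is then: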

dst[x, y] = src[width - 1 - x, y]

Note that pixel coordinates start at 0, so the x coordinate of the rightmost pixel is width - 1.

Put simply, every pixel in a row is copied in the opposite direction.
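As a minimal sketch of this formula, assume one row is held as a span of 32-bit values. The class and method names here are illustrative only and do not come from the benchmark source.

```csharp
using System;

static class FlipRowDemo {
    // Copies one row in reverse order: dst[x] = src[width - 1 - x].
    public static void FlipRow(ReadOnlySpan<uint> src, Span<uint> dst) {
        int width = src.Length;
        for (int x = 0; x < width; x++) {
            dst[x] = src[width - 1 - x];
        }
    }
}
```

The destination span must be at least as long as the source row; otherwise the indexer throws.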

Because the pixels have to be handled one at a time, the algorithm must be written for a specific pixel size.

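1.1.1 About 32-bit pixels

32-bit pixels are easy to handle: 32 bits is 4 bytes, a power of 2, which is convenient to process. This is why 32-bit images are the most commonly used.

Depending on details such as the RGB channel order and whether an alpha channel is present, there are many 32-bit pixel formats:

Bgr32. Also known as BGRX8888 or B8G8R8X8. GDI+ calls it "Format32bppRgb".
Bgra32. Also known as BGRA8888 or B8G8R8A8. GDI+ calls it "Format32bppArgb".
Pbgra32. Also known as premultiplied BGRA8888 or B8G8R8A8. GDI+ calls it "Format32bppPArgb".
Rgb32. Also known as RGBX8888 or R8G8B8X8.
Rgba32. Also known as RGBA8888 or R8G8B8A8.
Prgba32. Also known as premultiplied RGBA8888 or R8G8B8A8.

A horizontal flip does not need to look at individual color channels; each pixel can be handled as a whole. The algorithm in this article therefore works for every 32-bit pixel format: not only Bgr32 and the other formats listed above, but also other 32-bit formats such as Cmyk32.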
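To illustrate why the channel layout does not matter, the sketch below reinterprets a row of raw bytes as 32-bit values with MemoryMarshal.Cast and moves whole pixels. It is an illustrative variant with hypothetical names, not the code used in the benchmark.

```csharp
using System;
using System.Runtime.InteropServices;

static class WholePixelDemo {
    // Flips one row stored as raw bytes, moving each 4-byte pixel as a single uint.
    // The channel layout (BGRA, RGBA, CMYK, ...) is never inspected.
    public static void FlipRowBytes(ReadOnlySpan<byte> srcRow, Span<byte> dstRow) {
        ReadOnlySpan<uint> src = MemoryMarshal.Cast<byte, uint>(srcRow);
        Span<uint> dst = MemoryMarshal.Cast<byte, uint>(dstRow);
        for (int x = 0; x < src.Length; x++) {
            dst[x] = src[src.Length - 1 - x];
        }
    }
}
```

The bytes inside each pixel keep their order; only the pixel positions change.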

1.2 Implementation

Once the number of bytes per pixel (cbPixel) is known, pixels can be copied accordingly. 32 bits is 4 bytes.
The source code is as follows.

public static unsafe void ScalarDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) {
    const int cbPixel = 4; // 32 bit: Bgr32, Bgra32, Rgb32, Rgba32.
    byte* pRow = pSrc;
    byte* qRow = pDst;
    for (int i = 0; i < height; i++) {
        byte* p = pRow + (width - 1) * cbPixel;
        byte* q = qRow;
        for (int j = 0; j < width; j++) {
            for (int k = 0; k < cbPixel; k++) {
                q[k] = p[k];
            }
            p -= cbPixel;
            q += cbPixel;
        }
        pRow += strideSrc;
        qRow += strideDst;
    }
}

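When writing a horizontal flip with pointers, the key is getting the address calculations right. Since this is a horizontal flip, the focus is on the address calculations for the pixels within a row (inner loop j).

The inner loop uses a "read in reverse order, write in forward order" strategy. Specifically:

Reading starts at the last pixel and moves to the previous pixel after each iteration. In the source code above, the initial value of p is therefore pRow + (width - 1) * cbPixel (the address of the last pixel in the current row of the source bitmap), and cbPixel is subtracted from p after each iteration.
Writing starts at pixel 0 and moves to the next pixel after each iteration. In the source code above, the initial value of q is therefore qRow, and cbPixel is added to q after each iteration.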

1.3 Benchmark code

BenchmarkDotNet is used for benchmarking.

[Benchmark(Baseline = true)]
public void Scalar() {
    ScalarDo(_sourceBitmapData, _destinationBitmapData, false);
}

//[Benchmark]
public void ScalarParallel() {
    ScalarDo(_sourceBitmapData, _destinationBitmapData, true);
}

public static unsafe void ScalarDo(BitmapData src, BitmapData dst, bool useParallel = false) {
    int width = src.Width;
    int height = src.Height;
    int strideSrc = src.Stride;
    int strideDst = dst.Stride;
    byte* pSrc = (byte*)src.Scan0.ToPointer();
    byte* pDst = (byte*)dst.Scan0.ToPointer();
    bool allowParallel = useParallel && (height > 16) && (Environment.ProcessorCount > 1);
    if (allowParallel) {
        Parallel.For(0, height, i => {
            int start = i;
            int len = 1;
            byte* pSrc2 = pSrc + start * (long)strideSrc;
            byte* pDst2 = pDst + start * (long)strideDst;
            ScalarDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst);
        });
    } else {
        ScalarDoBatch(pSrc, strideSrc, width, height, pDst, strideDst);
    }
}

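A horizontal flip processes the pixels within a row, while the rows of the outer loop can simply be handled one after another. The address calculation for parallel processing (allowParallel) is therefore straightforward.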
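The per-row address calculation can also be shown without pointers. The following sketch assumes the image lives in managed byte arrays and that both strides are positive; the names are hypothetical and it is not the benchmark code.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelRowsDemo {
    // Flips a 32-bit image held in managed arrays; each row is an independent work item.
    public static void FlipX(byte[] src, int strideSrc, byte[] dst, int strideDst, int width, int height) {
        const int cbPixel = 4;
        Parallel.For(0, height, y => {
            // Row start offsets, widened to long before multiplying to avoid int overflow.
            long srcRow = (long)y * strideSrc;
            long dstRow = (long)y * strideDst;
            for (int x = 0; x < width; x++) {
                long s = srcRow + (long)(width - 1 - x) * cbPixel;
                long d = dstRow + (long)x * cbPixel;
                for (int k = 0; k < cbPixel; k++) {
                    dst[d + k] = src[s + k];
                }
            }
        });
    }
}
```

Because every row writes to its own region of the destination, the rows need no synchronization. Padding bytes at the end of each row (when the stride is larger than width * 4) are left untouched.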

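2. Vector algorithm

2.1 Approach

2.1.1 How to flip within one vector

The scalar algorithm above copies 1 byte at a time, whereas a vector algorithm can copy 1 vector at a time.

This brings the first difficulty: the granularity of a vector is too coarse. The scalar algorithm copies byte by byte and can address every byte precisely, which makes it very flexible. With vector types, each operation handles at least 16 bytes at once, which is much more cumbersome.

The smallest Vector size is 128 bits, i.e. 16 bytes. For 32-bit pixels, one Vector then holds 4 pixels.

So the first problem to solve is how to flip within a single vector.

2.1.1.1 The `.NET 7.0` solution

Before .NET 7.0 there was no good way to do this.

Starting with .NET 7.0, vector types such as Vector128 gained a Shuffle method, which rearranges the elements within a vector. To support different element types, the method has overloads such as: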

public static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices);
public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices);
...

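The parameters are as follows.

vector: the source vector.
indices: the indices.
Return value: a new vector containing the values selected from vector according to indices. Its i-th element is the element of vector at position indices[i], i.e. vector[indices[i]]. If an index is out of range, the corresponding element is set to 0.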
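The behaviour described above can be written down as a scalar reference model. The sketch below works on plain arrays instead of vectors and only mirrors the documented semantics; it is not how the runtime implements Shuffle, and the class name is hypothetical.

```csharp
using System;

static class ShuffleModel {
    // Scalar model of the documented behaviour: result[i] = vector[indices[i]],
    // or 0 when indices[i] is outside [0, vector.Length).
    // Assumes indices has the same length as vector.
    public static int[] Shuffle(int[] vector, int[] indices) {
        int[] result = new int[vector.Length];
        for (int i = 0; i < result.Length; i++) {
            int idx = indices[i];
            // The unsigned comparison rejects both negative and too-large indices.
            result[i] = ((uint)idx < (uint)vector.Length) ? vector[idx] : 0;
        }
        return result;
    }
}
```

For example, shuffling {10, 20, 30, 40} with indices {3, 2, 1, 0} gives {40, 30, 20, 10}, while indices {0, 4, -1, 1} give {10, 0, 0, 20}.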

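The first idea is to use the byte overload of Shuffle to flip within the vector, since that follows the approach of the scalar algorithm.

It is not the best choice, though. We are processing 32-bit pixels, so a whole pixel can be moved at once. int is a 32-bit integer, so the int overload of Shuffle can be used. (Because whole pixels are moved, details such as the sign bit do not matter, so both int and uint work; int is simply more concise.)

A Vector128 holds four 32-bit pixels, so the following code performs the flip.

// Vector128<int> src = ...; // Load the source value.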
Vector128<int> indices = Vector128.Create((int)3, 2, 1, 0);
Vector128<int> dst = Vector128.Shuffle(src, indices);

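The code above works correctly. In practice, however, it has a major drawback: it is too slow.

Disassembling it shows that, up to and including .NET 8.0, Shuffle is not hardware accelerated here and uses scalar fallback code instead.

Besides the lack of hardware acceleration, Shuffle has these drawbacks:

Only the fixed-size vector types (such as Vector128 and Vector256) provide Shuffle; the auto-sized vector type (Vector) does not yet.
Shuffle is only available from .NET 7.0 onward. Earlier versions of .NET lack it, which makes many algorithms hard to implement.

2.1.1.2 Using VectorTraits to work around the shortcomings of Shuffle

To solve the lack of hardware acceleration in Shuffle, I developed the VectorTraits library. It uses the shuffle-class instructions of each architecture so that Shuffle becomes hardware accelerated. Specifically, it uses the following instructions.

X86: uses _mm_shuffle_epi8 and similar instructions.
Arm: uses the vqtbl1q_u8 instruction.
Wasm: uses the i8x16.swizzle instruction.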
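To give a feel for what such a dispatch might look like, the sketch below calls the byte-level shuffle instruction of X86 or Arm64 through the .NET intrinsics API and falls back to a scalar loop elsewhere. It is an illustration under stated assumptions (.NET 5 or later, indices limited to 0..15), not the actual VectorTraits implementation, and all names in it are hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static class ByteShuffleSketch {
    // Byte indices that reverse four 32-bit lanes while keeping the bytes inside each lane in order.
    public static readonly Vector128<byte> ReverseInt32Lanes =
        Vector128.Create((byte)12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);

    // Valid only for indices in 0..15, where PSHUFB and TBL behave identically.
    public static Vector128<byte> ShuffleBytes(Vector128<byte> data, Vector128<byte> indices) {
        if (Ssse3.IsSupported) {
            return Ssse3.Shuffle(data, indices); // PSHUFB (_mm_shuffle_epi8)
        }
        if (AdvSimd.Arm64.IsSupported) {
            return AdvSimd.Arm64.VectorTableLookup(data, indices); // TBL table lookup
        }
        // Scalar fallback for other platforms.
        Vector128<byte> result = Vector128<byte>.Zero;
        for (int i = 0; i < 16; i++) {
            result = result.WithElement(i, data.GetElement(indices.GetElement(i) & 15));
        }
        return result;
    }

    // Reverses the order of the four 32-bit pixels in a vector.
    public static Vector128<int> ReverseLanes(Vector128<int> pixels) {
        return ShuffleBytes(pixels.AsByte(), ReverseInt32Lanes).AsInt32();
    }
}
```

Because these instructions work on bytes, reversing 32-bit pixels requires expanding each pixel index into four consecutive byte indices, as the ReverseInt32Lanes constant does.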

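VectorTraits provides Shuffle not only for the fixed-size vector types (such as Vector128) but also for the auto-sized vector type (Vector).

VectorTraits also supports earlier versions of .NET. The current VectorTraits 3.0 supports the following .NET versions.

.NET: 5.0 - 8.0.
.NET Core: 2.0 - 3.1.
.NET Framework: 4.5 - 4.8.1.
.NET Standard: 1.1 - 2.1.

With VectorTraits, cross-platform SIMD hardware-accelerated vector algorithms are easy to write.

For each vector type, VectorTraits provides a corresponding static class named "original name + s". For Vector128, for example, it provides the Vector128s class. So appending the letter "s" to Vector128.Shuffle in the code above, turning it into Vector128s.Shuffle, is enough to get SIMD hardware acceleration.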

using Zyl.VectorTraits;

// Vector128<int> src = ...; // Load the source value.
Vector128<int> indices = Vector128.Create((int)3, 2, 1, 0);
Vector128<int> dst = Vector128s.Shuffle(src, indices);

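2.1.1.3 Using the auto-sized vector type Vector

Fixed-size vector types such as Vector128 are only available from .NET Core 3.0 onward, so the code above requires .NET Core 3.0 or later.

What about earlier versions of .NET?

The answer is to switch to the auto-sized vector type Vector.

Starting with .NET Framework 4.5, the auto-sized vector type Vector becomes available after installing the System.Numerics.Vectors package from NuGet.

The size of the Vector type is not fixed. Generally it is the largest vector size of the local CPU.

X86: 256 bits when the Avx and Avx2 instruction sets are supported; otherwise (for example when only the Sse family is supported) 128 bits. (Up to .NET 8.0, Vector does not support 512 bits: even if the CPU supports Avx512, Vector is at most 256 bits.)
Arm: currently fixed at 128 bits.
Wasm: currently fixed at 128 bits.

The Vector type provides a Count property that returns the number of elements in the vector.

When Vector is 128 bits, Vector<int>.Count is 4.
When Vector is 256 bits, Vector<int>.Count is 8.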
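A quick way to see which size the runtime picked on a given machine is to query these properties directly. The helper below is a small illustrative sketch with a hypothetical name.

```csharp
using System;
using System.Numerics;

static class VectorSizeDemo {
    // Describes the Vector<T> configuration chosen by the runtime on this machine.
    public static string Describe() {
        int bits = Vector<byte>.Count * 8; // Vector<byte>.Count equals the vector size in bytes
        return $"Vector<T>: {bits} bits, Vector<int>.Count = {Vector<int>.Count}, " +
               $"hardware accelerated = {Vector.IsHardwareAccelerated}";
    }
}
```

Calling Console.WriteLine(VectorSizeDemo.Describe()) prints the vector width, the int element count, and whether Vector<T> operations are hardware accelerated in the current process.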

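Because the size of Vector is not fixed, using Shuffle becomes a little more awkward. When initializing indices for Vector128, the element count is fixed, so every value can simply be written out. With the auto-sized Vector, indices cannot be initialized by listing the values directly.

The documentation shows that the Vector constructor accepts an array. So we can create an array first, fill it in a loop, and finally create the vector with the Vector constructor.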
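A minimal sketch of the array-based approach, using only the BCL array constructor of Vector<int> (the class and method names are illustrative):

```csharp
using System.Numerics;

static class ReverseIndicesDemo {
    // Builds {Count-1, ..., 1, 0} for whatever Vector<int>.Count is at run time.
    public static Vector<int> CreateReverseIndices() {
        int[] buf = new int[Vector<int>.Count];
        for (int i = 0; i < buf.Length; i++) {
            buf[i] = buf.Length - 1 - i;
        }
        return new Vector<int>(buf);
    }
}
```

This version allocates a temporary array on the heap each time it is called.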

Starting with .NET Core 3.0, the Vector constructor also accepts a Span, so stack allocation can be used to reduce memory allocation overhead. The source code is as follows.

Span<int> buf = stackalloc int[Vector<int>.Count];
for (int i = 0; i < Vector<int>.Count; i++) {
    buf[i] = Vector<int>.Count - 1 - i;
}
Vector<int> indices = Vectors.Create(buf);

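The expression Vector<int>.Count - 1 - i in the code above computes the index of each element in reversed order.

Note also that the code above does not create the Vector with a constructor but with the Vectors.Create method provided by VectorTraits. This is to support versions before .NET Core 3.0, such as .NET Framework 4.5.

After the program has started, the Count property of Vector is fixed at its actual value. There is therefore no need to recompute indices every time; the computation can be moved into the static constructor of the class.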

private static readonly Vector<int> _shuffleIndices;

static ImageFlipXOn32bitBenchmark() {
    bool AllowCreateByDoubleLoop = true;
    if (AllowCreateByDoubleLoop) {
        _shuffleIndices = Vectors.CreateByDoubleLoop<int>(Vector<int>.Count - 1, -1);
    } else {
        Span<int> buf = stackalloc int[Vector<int>.Count];
        for (int i = 0;i< Vector<int>.Count; i++) {
            buf[i] = Vector<int>.Count - 1 - i;
        }
        _shuffleIndices = Vectors.Create(buf);
    }
}

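As the source code above shows, Vectors also provides a CreateByDoubleLoop method that simplifies initializing vectors such as indices. It is much more convenient than a for loop.

With the indices value (actually _shuffleIndices), the Shuffle method of Vectors can rearrange the elements of the auto-sized vector type Vector. The source code is as follows.

// Vector<int> src = ...; // Load the source value.
Vector<int> indices = _shuffleIndices;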
Vector<int> dst = Vectors.Shuffle(src, indices);

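The usage is the same as for Vector128 earlier.

2.1.1.4 Further optimization with the YShuffleKernel method

Shuffle also zeroes elements: if an index is out of range, the corresponding element is set to 0.

For a flip, the indices are always within the valid range. Shuffle can therefore be replaced by YShuffleKernel, which does not check whether the indices are in range and so generally performs better.

// Vector<int> src = ...; // Load the source value.
Vector<int> indices = _shuffleIndices;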
Vector<int> dst = Vectors.YShuffleKernel(src, indices);

To distinguish them from BCL method names, the methods added by the VectorTraits library all start with the letter "Y".

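2.1.2 Flipping one row

Building on the in-vector flip, a whole row of pixels can be flipped. The same "read in reverse order, write in forward order" strategy applies.

The simplest case is when the number of bytes in a row is an exact multiple of the vector size. The steps are:

1. Initialize the source pointer p to the address of the last block of data in the current row of the source bitmap: Vector<int>* p = (Vector<int>*)(pRow + maxX * cbPixel).
2. Initialize the destination pointer q to the start address of the current row of the destination bitmap: Vector<int>* q = (Vector<int>*)qRow.
3. Load the data at the source pointer p from memory into a vector: data = *p.
4. Flip the vector: temp = Vectors.YShuffleKernel(data, indices).
5. Write the vector to memory at the destination pointer q: *q = temp.
6. Check whether all data has been processed. If so, jump to step 9.
7. Move the pointers to the next vector: --p; ++q;.
8. Jump to step 3 and continue the loop.
9. Done.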
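The loop above can be sketched with safe spans instead of pointers. This version assumes .NET 7 or later and substitutes the BCL Vector128.Shuffle for Vectors.YShuffleKernel so that it has no external dependency; it handles only rows whose pixel count is an exact multiple of 4, and its names are hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;

static class FlipRowVector128Demo {
    // Flips one row whose pixel count is an exact multiple of 4 (one Vector128<int>).
    // Uses the BCL Vector128.Shuffle (.NET 7+) in place of Vectors.YShuffleKernel.
    public static void FlipRow(ReadOnlySpan<int> src, Span<int> dst) {
        int lanes = Vector128<int>.Count; // 4 pixels per vector
        if (src.Length % lanes != 0 || dst.Length < src.Length) {
            throw new ArgumentException("Row length must be a multiple of 4 and fit in dst.");
        }
        Vector128<int> indices = Vector128.Create(3, 2, 1, 0);
        int p = src.Length - lanes; // read position: last block of the source row
        int q = 0;                  // write position: first block of the destination row
        while (p >= 0) {
            Vector128<int> data = Vector128.Create(src.Slice(p, lanes)); // load
            Vector128<int> temp = Vector128.Shuffle(data, indices);      // flip within the vector
            temp.CopyTo(dst.Slice(q, lanes));                            // store
            p -= lanes;                                                  // previous source block
            q += lanes;                                                  // next destination block
        }
    }
}
```

Reading blocks from the end of the row and writing them from the start reverses the block order, while the shuffle reverses the pixels inside each block; together they reverse the whole row.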

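Assume a vector size of 128 bits, so that one vector holds 4 pixels. The following shows what happens for various multiples:

1x: a row has 4*1=4 pixels. The horizontal flip turns {x0, x1, x2, x3} into {x3, x2, x1, x0}.
2x: a row has 4*2=8 pixels. The horizontal flip turns {x0, x1, x2, x3, x4, x5, x6, x7} into {x7, x6, x5, x4, x3, x2, x1, x0}.
3x: a row has 4*3=12 pixels. The horizontal flip turns {x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11} into {x11, x10, x9, x8, x7, x6, x5, x4, x3, x2, x1, x0}.
...

The data above shows that flipping with the "read in reverse order, write in forward order" strategy completes the horizontal flip of the image.

In practice, the number of bytes in a row is usually not an exact multiple of the vector size, which makes processing a little more complex. It can be handled with the "tail pointer" approach described in the previous article.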
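To make the tail handling concrete, the sketch below lists which blocks would be read and written for one row when the last block is clamped to the row boundary. It mirrors the clamping used in the implementation in section 2.2, but works on pixel offsets instead of pointers; the helper name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class TailBlockDemo {
    // Lists the (source, destination) pixel offsets of every vector-sized block for one row.
    // Requires width >= vectorWidth; the final block is clamped, so it may overlap the previous one.
    public static List<(int Src, int Dst)> BlockOffsets(int width, int vectorWidth) {
        if (width < vectorWidth) throw new ArgumentException("Row is narrower than one vector.");
        int maxX = width - vectorWidth;
        var blocks = new List<(int Src, int Dst)>();
        int p = maxX; // reading starts at the last full block
        int q = 0;    // writing starts at the first block
        for (; ; ) {
            blocks.Add((p, q));
            if (p <= 0) break;
            p = Math.Max(p - vectorWidth, 0);    // clamp to the start of the source row
            q = Math.Min(q + vectorWidth, maxX); // clamp to the last full block of the destination row
        }
        return blocks;
    }
}
```

For width 10 and a vector of 4 pixels, the blocks are (6, 0), (2, 4) and (0, 6). The last block overlaps the second by two pixels, but the overlapping pixels are written with the same values both times, so the result is still correct.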

2.2 Implementation

Following the approach above, the source code is as follows.

public static unsafe void UseVectorsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) {
    const int cbPixel = 4; // 32 bit: Bgr32, Bgra32, Rgb32, Rgba32.
    Vector<int> indices = _shuffleIndices;
    int vectorWidth = Vector<int>.Count;
    int maxX = width - vectorWidth;
    byte* pRow = pSrc;
    byte* qRow = pDst;
    for (int i = 0; i < height; i++) {
        Vector<int>* pLast = (Vector<int>*)pRow;
        Vector<int>* qLast = (Vector<int>*)(qRow + maxX * cbPixel);
        Vector<int>* p = (Vector<int>*)(pRow + maxX * cbPixel);
        Vector<int>* q = (Vector<int>*)qRow;
        for (; ; ) {
            Vector<int> data, temp;
            // Load.
            data = *p;
            // FlipX.
            //temp = Vectors.Shuffle(data, indices);
            temp = Vectors.YShuffleKernel(data, indices);
            // Store.
            *q = temp;
            // Next.
            if (p <= pLast) break;
            --p;
            ++q;
            if (p < pLast) p = pLast; // The last block is also processed as a full vector.
            if (q > qLast) q = qLast;
        }
        pRow += strideSrc;
        qRow += strideDst;
    }
}

2.3 Benchmark code

Next, benchmark code is written for this algorithm.

[Benchmark]
public void UseVectors() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, false);
}

//[Benchmark]
public void UseVectorsParallel() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, true);
}

public static unsafe void UseVectorsDo(BitmapData src, BitmapData dst, bool useParallel = false) {
    int vectorWidth = Vector<byte>.Count;
    int width = src.Width;
    int height = src.Height;
    if (width <= vectorWidth) {
        ScalarDo(src, dst, useParallel);
        return;
    }
    int strideSrc = src.Stride;
    int strideDst = dst.Stride;
    byte* pSrc = (byte*)src.Scan0.ToPointer();
    byte* pDst = (byte*)dst.Scan0.ToPointer();
    bool allowParallel = useParallel && (height > 16) && (Environment.ProcessorCount > 1);
    if (allowParallel) {
        Parallel.For(0, height, i => {
            int start = i;
            int len = 1;
            byte* pSrc2 = pSrc + start * (long)strideSrc;
            byte* pDst2 = pDst + start * (long)strideDst;
            UseVectorsDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst);
        });
    } else {
        UseVectorsDoBatch(pSrc, strideSrc, width, height, pDst, strideDst);
    }
}

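2.4 Further optimization with YShuffleKernel_Args

Performance can be improved further by using YShuffleKernel_Args and YShuffleKernel_Core.

If a loop contains repeated computations, moving them outside the loop improves performance. Methods with the Args and Core suffixes are intended for this situation.

Args: argument computation, for example checking and converting arguments. After the arguments have been converted with this method, the Core method can be called. Generally used before the loop.
Core: core computation. The Args method must be called first. Generally used inside the loop.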
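The idea behind the Args/Core split can be illustrated with a scalar analogy. In the sketch below, a hypothetical "Args" step expands per-pixel indices into per-byte indices once, and a hypothetical "Core" step only performs the lookup. This shows the general pattern of hoisting loop-invariant work; it is an assumption-laden illustration and not how VectorTraits implements YShuffleKernel_Args or YShuffleKernel_Core.

```csharp
using System;

static class ArgsCoreSketch {
    // "Args" step: run once before the loop. Expands per-pixel (32-bit) indices
    // into per-byte indices so that the loop body does not repeat this work.
    public static byte[] ShuffleArgs(ReadOnlySpan<int> indices) {
        byte[] byteIndices = new byte[indices.Length * 4];
        for (int i = 0; i < indices.Length; i++) {
            for (int k = 0; k < 4; k++) {
                byteIndices[i * 4 + k] = (byte)(indices[i] * 4 + k);
            }
        }
        return byteIndices;
    }

    // "Core" step: run inside the loop. Only performs the byte-level lookup.
    public static void ShuffleCore(ReadOnlySpan<byte> data, ReadOnlySpan<byte> byteIndices, Span<byte> result) {
        for (int i = 0; i < byteIndices.Length; i++) {
            result[i] = data[byteIndices[i]];
        }
    }
}
```

A caller would invoke ShuffleArgs once with {3, 2, 1, 0}, obtaining the byte indices {12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3}, and then call ShuffleCore for every block in the row.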

So YShuffleKernel can be replaced with YShuffleKernel_Args and YShuffleKernel_Core. The source code is as follows.

public static unsafe void UseVectorsArgsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) {
    const int cbPixel = 4; // 32 bit: Bgr32, Bgra32, Rgb32, Rgba32.
    Vector<int> indices = _shuffleIndices;
    Vector<int> args0, args1;
    Vectors.YShuffleKernel_Args(indices, out args0, out args1);
    int vectorWidth = Vector<int>.Count;
    int maxX = width - vectorWidth;
    byte* pRow = pSrc;
    byte* qRow = pDst;
    for (int i = 0; i < height; i++) {
        Vector<int>* pLast = (Vector<int>*)pRow;
        Vector<int>* qLast = (Vector<int>*)(qRow + maxX * cbPixel);
        Vector<int>* p = (Vector<int>*)(pRow + maxX * cbPixel);
        Vector<int>* q = (Vector<int>*)qRow;
        for (; ; ) {
            Vector<int> data, temp;
            // Load.
            data = *p;
            // FlipX.
            //temp = Vectors.YShuffleKernel(data, indices);
            temp = Vectors.YShuffleKernel_Core(data, args0, args1);
            // Store.
            *q = temp;
            // Next.
            if (p <= pLast) break;
            --p;
            ++q;
            if (p < pLast) p = pLast; // The last block is also processed as a full vector.
            if (q > qLast) q = qLast;
        }
        pRow += strideSrc;
        qRow += strideDst;
    }
}

3. Benchmark results

3.1 X86 architecture

The benchmark results on X86 are as follows.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4541/23H2/2023Update/SunValley3)
AMD Ryzen 7 7840H w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.403
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method         | Width | Mean        | Error     | StdDev    | Ratio | RatioSD |
|--------------- |------ |------------:|----------:|----------:|------:|--------:|
| Scalar         | 1024  |    784.7 us |  14.56 us |  14.30 us |  1.00 |    0.03 |
| UseVectors     | 1024  |    106.4 us |   2.12 us |   4.96 us |  0.14 |    0.01 |
| UseVectorsArgs | 1024  |    101.4 us |   2.03 us |   3.85 us |  0.13 |    0.01 |
|                |       |             |           |           |       |         |
| Scalar         | 2048  |  3,453.5 us |  25.88 us |  22.94 us |  1.00 |    0.01 |
| UseVectors     | 2048  |  1,520.8 us |  15.11 us |  14.13 us |  0.44 |    0.00 |
| UseVectorsArgs | 2048  |  1,412.9 us |  27.96 us |  47.48 us |  0.41 |    0.01 |
|                |       |             |           |           |       |         |
| Scalar         | 4096  | 12,932.8 us | 177.40 us | 165.94 us |  1.00 |    0.02 |
| UseVectors     | 4096  |  6,113.0 us |  43.35 us |  40.55 us |  0.47 |    0.01 |
| UseVectorsArgs | 4096  |  6,270.9 us |  56.80 us |  50.35 us |  0.48 |    0.01 |

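Scalar: the scalar algorithm.
UseVectors: the vector algorithm.
UseVectorsArgs: the vector algorithm that uses Args to move part of the computation ahead of the loop.

Taking the results at width 1024 as an example, UseVectorsArgs is about 7.74 times as fast as Scalar. In other words, the vectorized algorithm delivers about 7.74 times the performance of the scalar algorithm.

Note: 784.7 / 101.4 ≈ 7.74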

3.2 Arm architecture

The same source code runs on Arm. The benchmark results are as follows.

BenchmarkDotNet v0.14.0, macOS Sequoia 15.1.1 (24B91) [Darwin 24.1.0]
Apple M2, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
  [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD [AttachedDebugger]
  DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD


| Method         | Width | Mean        | Error    | StdDev   | Ratio |
|--------------- |------ |------------:|---------:|---------:|------:|
| Scalar         | 1024  |    625.8 us |  0.81 us |  0.68 us |  1.00 |
| UseVectors     | 1024  |    151.9 us |  0.32 us |  0.27 us |  0.24 |
| UseVectorsArgs | 1024  |    151.2 us |  0.13 us |  0.12 us |  0.24 |
|                |       |             |          |          |       |
| Scalar         | 2048  |  2,522.4 us |  1.28 us |  1.14 us |  1.00 |
| UseVectors     | 2048  |    666.9 us |  0.55 us |  0.51 us |  0.26 |
| UseVectorsArgs | 2048  |    663.8 us |  0.80 us |  0.67 us |  0.26 |
|                |       |             |          |          |       |
| Scalar         | 4096  | 10,797.2 us | 11.21 us | 10.48 us |  1.00 |
| UseVectors     | 4096  |  3,349.0 us | 39.67 us | 37.11 us |  0.31 |
| UseVectorsArgs | 4096  |  3,339.6 us | 20.76 us | 16.21 us |  0.31 |

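Taking the results at width 1024 as an example, UseVectorsArgs is about 4.14 times as fast as Scalar. In other words, the vectorized algorithm delivers about 4.14 times the performance of the scalar algorithm.

Note: 625.8 / 151.2 ≈ 4.14

At this point many readers will notice that UseVectors and UseVectorsArgs perform almost the same, so the Args methods do not seem to help much.

This is because, starting with .NET 7.0, the just-in-time (JIT) compiler automatically hoists part of the computation out of the loop, which makes the difference small. On earlier versions of .NET the difference is more noticeable.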

3.2.1 `.NET 6.0` results on Arm

The program is compiled for .NET 6.0 and run on Arm. The benchmark results are as follows.

BenchmarkDotNet v0.14.0, macOS Sequoia 15.1.1 (24B91) [Darwin 24.1.0]
Apple M2, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
  [Host]     : .NET 6.0.33 (6.0.3324.36610), Arm64 RyuJIT AdvSIMD [AttachedDebugger]
  DefaultJob : .NET 6.0.33 (6.0.3324.36610), Arm64 RyuJIT AdvSIMD


| Method         | Width | Mean        | Error    | StdDev   | Ratio |
|--------------- |------ |------------:|---------:|---------:|------:|
| Scalar         | 1024  |  1,805.2 us |  0.72 us |  0.60 us |  1.00 |
| UseVectors     | 1024  |    454.5 us |  5.45 us |  5.10 us |  0.25 |
| UseVectorsArgs | 1024  |    158.4 us |  0.05 us |  0.04 us |  0.09 |
|                |       |             |          |          |       |
| Scalar         | 2048  |  7,229.0 us |  2.88 us |  2.69 us |  1.00 |
| UseVectors     | 2048  |  1,857.4 us |  2.73 us |  2.56 us |  0.26 |
| UseVectorsArgs | 2048  |    656.2 us |  0.26 us |  0.23 us |  0.09 |
|                |       |             |          |          |       |
| Scalar         | 4096  | 29,574.1 us | 13.21 us | 11.03 us |  1.00 |
| UseVectors     | 4096  |  8,117.2 us | 28.06 us | 26.25 us |  0.27 |
| UseVectorsArgs | 4096  |  4,671.7 us |  2.50 us |  2.21 us |  0.16 |

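Taking the results at width 1024 as an example, the speedup of the vectorized algorithms over the scalar algorithm is:

UseVectors: 1,805.2/454.5 ≈ 3.97, i.e. about 3.97 times the scalar performance.
UseVectorsArgs: 1,805.2/158.4 ≈ 11.40, i.e. about 11.40 times the scalar performance.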

3.3 .NET Framework

The same source code runs on .NET Framework. The benchmark results are as follows.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4541/23H2/2023Update/SunValley3)
AMD Ryzen 7 7840H w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
  [Host]     : .NET Framework 4.8.1 (4.8.9282.0), X64 RyuJIT VectorSize=256
  DefaultJob : .NET Framework 4.8.1 (4.8.9282.0), X64 RyuJIT VectorSize=256


| Method         | Width | Mean        | Error     | StdDev    | Ratio | RatioSD | Code Size |
|--------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| Scalar         | 1024  |  1,315.2 us |  26.06 us |  25.59 us |  1.00 |    0.03 |   2,718 B |
| UseVectors     | 1024  |    968.2 us |  17.55 us |  16.42 us |  0.74 |    0.02 |   3,507 B |
| UseVectorsArgs | 1024  |    887.0 us |   9.91 us |   8.78 us |  0.67 |    0.01 |   3,507 B |
|                |       |             |           |           |       |         |           |
| Scalar         | 2048  |  5,259.4 us |  85.87 us |  80.32 us |  1.00 |    0.02 |   2,718 B |
| UseVectors     | 2048  |  3,696.0 us |  29.64 us |  27.72 us |  0.70 |    0.01 |   3,507 B |
| UseVectorsArgs | 2048  |  3,722.9 us |  39.36 us |  34.90 us |  0.71 |    0.01 |   3,507 B |
|                |       |             |           |           |       |         |           |
| Scalar         | 4096  | 19,763.1 us | 300.29 us | 266.20 us |  1.00 |    0.02 |   2,718 B |
| UseVectors     | 4096  | 14,303.8 us |  62.36 us |  55.28 us |  0.72 |    0.01 |   3,507 B |
| UseVectorsArgs | 4096  | 14,988.7 us | 286.49 us | 281.37 us |  0.76 |    0.02 |   3,507 B |

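Taking the results at width 1024 as an example, UseVectorsArgs is about 1.48 times as fast as Scalar.

Note: 1,315.2 / 887.0 ≈ 1.48

In fact, .NET Framework does not support hardware intrinsics such as Sse, so Vectors uses scalar fallback code there. That fallback code is also highly optimized and works on int elements, which is why it still outperforms the byte-based scalar algorithm.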

Appendix

Complete source code: https://github.com/zyl910/VectorTraits.Sample.Benchmarks/blob/main/VectorTraits.Sample.Benchmarks.Inc/Image/ImageFlipXOn32bitBenchmark.cs
YShuffleKernel documentation: https://zyl910.github.io/VectorTraits_doc/api/Zyl.VectorTraits.Vectors.YShuffleKernel.html
Microsoft official documentation, Vector128.Shuffle method: https://learn.microsoft.com/zh-cn/dotnet/api/system.runtime.intrinsics.vector128.shuffle?view=net-9.0
VectorTraits NuGet package: https://www.nuget.org/packages/VectorTraits
VectorTraits online documentation: https://zyl910.github.io/VectorTraits_doc/
VectorTraits source code: https://github.com/zyl910/VectorTraits
C# Using SIMD vector types to accelerate summing a floating-point array (1): using Vector4 and Vector<T>

posted on 2024-12-01 22:01 zyl910
