Operator Development - Ascend CANN Training Camp - Ascend C - Mmad Operator Migration
1. Project Overview
- Source repo: https://gitee.com/ascend/samples/tree/master/operator/ascendc/0_introduction/20_mmad_kernellaunch
- Target repo: https://gitee.com/ascend/cann-ops/tree/master/src/matmul/mmad
- PR title: Contribution identity: activity----Mmad operator contribution
- PR link: https://gitee.com/ascend/cann-ops/pulls/431
- About the mmad operator:
C = A * B + Bias. A and B are the source operands: A is the left matrix with shape [M, K], and B is the right matrix with shape [K, N].
C is the destination operand holding the matmul result, with shape [M, N].
Bias is the matmul bias, with shape [N]; it is added to every row of the A * B result.
- Source repo code: the existing code contains two projects, one with and one without the bias parameter. Both provide a decoupled-architecture implementation (mmad_custom_cube_only.h) and a coupled-architecture implementation (mmad_custom.h), built on low-level Ascend C APIs with manually programmed data movement.
- Migration goal: add tiling-based partitioning of the data movement to the decoupled-architecture implementation with bias.
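The operator semantics above can be pinned down with a small host-side reference (plain C++, not Ascend C; `MatmulBiasRef` is an illustrative name): it computes C = A * B + Bias with the bias broadcast across the rows of the result, which is what the migrated kernel's output is compared against.

```cpp
#include <array>
#include <cstddef>

// Reference semantics of C = A * B + Bias, for illustration only.
// A is M x K, B is K x N, bias has length N, C is M x N; all row-major.
template <std::size_t M, std::size_t K, std::size_t N>
std::array<float, M * N> MatmulBiasRef(const std::array<float, M * K>& a,
                                       const std::array<float, K * N>& b,
                                       const std::array<float, N>& bias) {
    std::array<float, M * N> c{};
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = bias[j];  // every row is offset by the same bias vector
            for (std::size_t p = 0; p < K; ++p) {
                acc += a[i * K + p] * b[p * N + j];
            }
            c[i * N + j] = acc;
        }
    }
    return c;
}
```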
2. Implementation Plan
- Tiling strategy: tile matrix B along its columns. Because the cube unit computes on 16×16 fractals, the column-direction shape of the tiling parameter is set to 16; the data computed by a single core should not be split into overly large tiles. In this test each cube core computes 32×32 matrices, which is little data, so matrix A is read in all at once. Tiling of A is avoided mainly so that partial results from multiple passes do not have to be summed at the end.
- Strategy for aggregating the per-tile results on a single cube core: on the decoupled architecture the data paths are GM -> L1 -> L0A/L0B -> Cube -> L0C -> Fixpipe -> GM and GM -> L1 -> L0A/L0B -> Cube -> L0C -> L1. The idea considered was to stage each tile's result, aggregate once the core finishes, and move the aggregate to GM through the Fixpipe path.
- The latter path stages data in L1, which would take L1 space away from the inputs; not feasible.
- For matmul scratch space there is the TSCM TPosition, but according to the documentation TSCM only supports VECIN/VECOUT/VECCALC -> TSCM for scalar data, so matrix computation cannot use it; not feasible.
- Conclusion: staging and aggregating the results is not feasible; after each tile is computed, its result must still go out through the L0C -> Fixpipe -> GM path.
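The tiling arithmetic that falls out of this plan can be sketched as follows, assuming the shapes used in this test (m = k = n = 32, tile width 16). `TilePlan` and `PlanColumnTiling` are illustrative names, but the iteration count mirrors the kernel's `n / cubeBlockShape` loop bound:

```cpp
#include <cstdint>

// Per-core plan for column-tiling matrix B (illustrative sketch).
struct TilePlan {
    uint32_t iterations;  // number of column tiles of B processed by the core
    uint32_t bTileElems;  // half elements staged for B per iteration (k x tileN)
    uint32_t cTileElems;  // float elements produced in L0C per iteration (m x tileN)
};

inline TilePlan PlanColumnTiling(uint32_t m, uint32_t k, uint32_t n, uint32_t tileN) {
    // A is loaded once in full; only B (and the bias) are split along N.
    return TilePlan{n / tileN, k * tileN, m * tileN};
}
```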
3. Core Code
- op_host\mmad_tiling.h
#ifndef MMAD_TILING_H
#define MMAD_TILING_H
#include "register/tilingdata_base.h"
namespace optiling {
BEGIN_TILING_DATA_DEF(TilingData)
// Tiling parameter for matrix B: split along N
TILING_DATA_FIELD_DEF(uint32_t, tileBBlockShape);
END_TILING_DATA_DEF;
REGISTER_TILING_DATA_CLASS(Mmad, TilingData)
}
#endif //MMAD_TILING_H
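On the host side it may be worth validating the tiling field before launch. `IsValidTileBBlockShape` below is a hypothetical helper, not part of the repo: it encodes the two constraints implied by the kernel, namely that the tile width is a multiple of the 16-element cube fractal and divides n evenly so the `n / cubeBlockShape` loop covers B exactly.

```cpp
#include <cstdint>

// Hypothetical host-side sanity check for the tileBBlockShape tiling field.
inline bool IsValidTileBBlockShape(uint32_t n, uint32_t tile) {
    return tile != 0 &&       // avoid division by zero in the kernel loop
           tile % 16 == 0 &&  // must align to the 16x16 cube fractal
           n % tile == 0;     // column tiles must cover B with no remainder
}
```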
- op_kernel\mmad.cpp
#include "kernel_operator.h"
class KernelMmad {
public:
__aicore__ inline KernelMmad()
{
aSize = m * k;
bSize = k * n;
cSize = m * n;
}
__aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR bias, GM_ADDR c, uint32_t tileBBlockShape)
{
// set cube only
KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIC_ONLY);
cubeBlockShape = tileBBlockShape;
CubeBlockSize = cubeBlockShape * cubeBlockShape;
aGM.SetGlobalBuffer((__gm__ half *)a);
bGM.SetGlobalBuffer((__gm__ half *)b);
cGM.SetGlobalBuffer((__gm__ float *)c);
biasGM.SetGlobalBuffer((__gm__ float *)bias);
pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
pipe.InitBuffer(inQueueB2, 1, k * cubeBlockShape * sizeof(half));
pipe.InitBuffer(outQueueCO1, 1, m * cubeBlockShape * sizeof(float));
pipe.InitBuffer(inQueueC1, 1, n * sizeof(float));
pipe.InitBuffer(inQueueC2, 1, cubeBlockShape * sizeof(float));
}
__aicore__ inline void Process()
{
CopyIn();
SplitA();
AscendC::LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
AscendC::LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
AscendC::LocalTensor<float> bias1Local = inQueueC1.DeQue<float>();
for (int i = 0; i < n / cubeBlockShape; i++)
{
SplitB(b1Local, i);
SplitBias(bias1Local, i);
Compute(a2Local);
CopyOut(i);
}
inQueueA2.FreeTensor(a2Local);
inQueueB1.FreeTensor(b1Local);
inQueueC1.FreeTensor(bias1Local);
}
private:
__aicore__ inline uint32_t CeilCubeBlock(uint32_t len) {
return (len + cubeBlockShape - 1) / cubeBlockShape;
}
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
AscendC::LocalTensor<float> bias1Local = inQueueC1.AllocTensor<float>();
AscendC::Nd2NzParams nd2nzA1Params;
nd2nzA1Params.ndNum = 1;
nd2nzA1Params.nValue = m;
nd2nzA1Params.dValue = k;
nd2nzA1Params.srcNdMatrixStride = 0;
nd2nzA1Params.srcDValue = k;
nd2nzA1Params.dstNzC0Stride = CeilCubeBlock(m) * cubeBlockShape;
nd2nzA1Params.dstNzNStride = 1;
nd2nzA1Params.dstNzMatrixStride = 0;
AscendC::DataCopy(a1Local, aGM, nd2nzA1Params);
AscendC::Nd2NzParams nd2nzB1Params;
nd2nzB1Params.ndNum = 1;
nd2nzB1Params.nValue = k;
nd2nzB1Params.dValue = n;
nd2nzB1Params.srcNdMatrixStride = 0;
nd2nzB1Params.srcDValue = n;
nd2nzB1Params.dstNzC0Stride = CeilCubeBlock(k) * cubeBlockShape;
nd2nzB1Params.dstNzNStride = 1;
nd2nzB1Params.dstNzMatrixStride = 0;
AscendC::DataCopy(b1Local, bGM, nd2nzB1Params);
AscendC::DataCopy(bias1Local, biasGM, n);
inQueueA1.EnQue(a1Local);
inQueueB1.EnQue(b1Local);
inQueueC1.EnQue(bias1Local);
}
__aicore__ inline void SplitA()
{
AscendC::LocalTensor<half> a1Local = inQueueA1.DeQue<half>();
AscendC::LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>();
uint32_t dstOffset = CeilCubeBlock(k) * CubeBlockSize;
uint32_t srcOffset = CubeBlockSize;
// NZ -> ZZ (A: L1 -> L0A)
AscendC::LoadData2DParams loadDataParams;
loadDataParams.repeatTimes = CeilCubeBlock(k);
loadDataParams.srcStride = CeilCubeBlock(m);
loadDataParams.dstGap = 0;
loadDataParams.ifTranspose = false;
for (int i = 0; i < CeilCubeBlock(m); ++i) {
AscendC::LoadData(a2Local[i * dstOffset], a1Local[i * srcOffset], loadDataParams);
}
inQueueA2.EnQue<half>(a2Local);
inQueueA1.FreeTensor(a1Local);
}
__aicore__ inline void SplitB(const AscendC::LocalTensor<half>& b1Local, const uint32_t bSplitIdx)
{
AscendC::LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>();
// Nz -> Zn
AscendC::LoadData2DParams loadDataParams;
loadDataParams.repeatTimes = CeilCubeBlock(k);
loadDataParams.srcStride = 1;
loadDataParams.ifTranspose = true;
AscendC::LoadData(b2Local, b1Local[bSplitIdx * CeilCubeBlock(n) * CubeBlockSize], loadDataParams);
inQueueB2.EnQue<half>(b2Local);
}
__aicore__ inline void SplitBias(const AscendC::LocalTensor<float>& bias1Local, const uint32_t bSplitIdx)
{
AscendC::LocalTensor<float> bias2Local = inQueueC2.AllocTensor<float>();
AscendC::DataCopy(bias2Local, bias1Local[bSplitIdx * cubeBlockShape], cubeBlockShape);
inQueueC2.EnQue<float>(bias2Local);
}
__aicore__ inline void Compute(const AscendC::LocalTensor<half>& a2Local)
{
AscendC::LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
AscendC::LocalTensor<float> bias2Local = inQueueC2.DeQue<float>();
AscendC::LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
AscendC::MmadParams mmadParams;
mmadParams.m = m;
mmadParams.n = cubeBlockShape;
mmadParams.k = k;
mmadParams.cmatrixInitVal = false;
AscendC::Mmad(c1Local, a2Local, b2Local, bias2Local, mmadParams);
outQueueCO1.EnQue<float>(c1Local);
inQueueB2.FreeTensor(b2Local);
inQueueC2.FreeTensor(bias2Local);
}
__aicore__ inline void CopyOut(const uint32_t bSplitIdx)
{
AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
// FixpipeParamsV220: CO1 -> GM
AscendC::FixpipeParamsV220 fixpipeParams;
fixpipeParams.nSize = cubeBlockShape;
fixpipeParams.mSize = m;
fixpipeParams.srcStride = cubeBlockShape * sizeof(float); // start-address offset between adjacent Z fractals in the source NZ matrix
fixpipeParams.dstStride = n;
fixpipeParams.ndNum = 1;
fixpipeParams.srcNdStride = 0;
fixpipeParams.dstNdStride = 0;
// default layout conversion: NZ -> ND
AscendC::Fixpipe(cGM[bSplitIdx * cubeBlockShape], c1Local, fixpipeParams);
outQueueCO1.FreeTensor(c1Local);
}
private:
AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
// AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1_; // the decoupled architecture has no CO2
AscendC::TQue<AscendC::TPosition::C1, 1> inQueueC1;
AscendC::TQue<AscendC::TPosition::C2, 1> inQueueC2;
AscendC::GlobalTensor<half> aGM;
AscendC::GlobalTensor<half> bGM;
AscendC::GlobalTensor<float> cGM;
AscendC::GlobalTensor<float> biasGM;
uint16_t m = 32, k = 32, n = 32;
uint16_t aSize, bSize, cSize;
uint32_t cubeBlockShape, CubeBlockSize;
};
extern "C" __global__ __aicore__ void mmad(GM_ADDR a, GM_ADDR b, GM_ADDR bias, GM_ADDR c, GM_ADDR workspace, GM_ADDR tiling) {
GET_TILING_DATA(tiling_data, tiling);
KernelMmad op;
op.Init(a, b, bias, c, tiling_data.tileBBlockShape);
op.Process();
}
4. Problems Encountered and Solutions
- Runtime environment issue: when running the environment setup script init_env.sh on the Huawei Cloud test server to install cann-toolkit, network proxy errors often made the package download fail.
Solution: do not run the environment script right away; download the cann-toolkit package locally, upload it to the server, then run the setup script.
- Execution permission issue
Permission denied
Solution: set the permission bits of the project files for other users to 0.
- Tiling registration failure
/home/user/cann/03mmad/mmad-2/MmadCustom/op_kernel/mmad.cpp:12:21: error: use of undeclared identifier 'tiling_data'
GET_TILING_DATA(tiling_data,tiling);
^
/home/user/cann/03mmad/mmad-2/MmadCustom/op_kernel/mmad.cpp:14:24: error: use of undeclared identifier 'tiling_data'
op.Init(a,b,bias,c,tiling_data.tileBBlockShape);
...
[ERROR] [ascend910b] Mmad do not registe tiling struct!!!
Analysis: the errors above occur during local compilation; presumably the cause is that the local software environment is not installed on Ascend hardware.
Solution: compilation on the server works fine.
- ACLNN operator invocation runtime error
aclnnMmadmGetWorkspaceSize failed. ERROR: 161002
Analysis: appeared to be related to the buffer space allocated for the data.
Solution: inspecting the ACLNN invocation sample code showed that the bias input had the wrong data type; fixing it resolved the error.
- ACLNN operator invocation runtime error
aclrtSynchronizeStream failed. ERROR: 507015
Analysis: the error code indicates an AI Core runtime error. The log files under ~/ascend/log/debug/plog contained rtKernelLaunchWithHandleV2:ErrCode=107000, which is a UB memory error.
Solution: the code turned out to use the CO2 position, which the decoupled architecture does not support; the decoupled data path L0C -> Fixpipe -> GM should be used instead.
- ACLNN operator invocation produced wrong test results
Analysis: the test results did not match the golden data; the suspected cause was incorrect data organization during the NZ2ND move from Fixpipe to GM after each tile's computation.
Solution: the destination GM position should be the index of the first element of the tile's first row.
AscendC::Fixpipe(cGM[bSplitIdx*m*cubeBlockShape], c1Local, fixpipeParams);
corrected to
AscendC::Fixpipe(cGM[bSplitIdx*cubeBlockShape], c1Local, fixpipeParams);
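The corrected offset follows directly from row-major ND addressing: column tile j of the M × N output begins at linear offset j * tileN, and Fixpipe's dstStride = n then advances to the same columns of the next row. A sketch with illustrative helper names:

```cpp
#include <cstdint>

// Start of column tile j in a row-major M x N matrix (illustrative helpers).
inline uint32_t TileDstOffset(uint32_t j, uint32_t tileN) {
    return j * tileN;  // first element of the tile's first row
}

// Linear offset of element (row, col) inside column tile j.
inline uint32_t TileElemOffset(uint32_t j, uint32_t tileN, uint32_t n,
                               uint32_t row, uint32_t col) {
    return TileDstOffset(j, tileN) + row * n + col;  // rows are n elements apart
}
```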
- ST test could not find the Mmad op
b"ATC run failed, Please check the detail log, Try 'atc --help' for more information"
b'EZ3003: [PID: 174824] 2025-05-22-16:46:44.846.249 No supported Ops kernel and engine are found for [Mmad], optype [Mmad].'
b'Possible Cause: The operator is not supported by the system. Therefore, no hit is found in any operator information library.'
Analysis: compiling and installing other single operators from the cann-ops repo in the same way also produced errors (different ones), so the environment itself was likely broken.
Solution: the repository's CI gate runs ST automatically, so the CI ST environment was used for testing.
- CI gate x86 build error
Analysis: the administrators determined it was a server environment configuration issue.
Solution: the CI team resolved it about a month later, and the code has been merged.
