Operator Development - Ascend CANN Training Camp - Ascend C - Mmad Operator Migration
1. Project Overview
- Source repo: https://gitee.com/ascend/samples/tree/master/operator/ascendc/0_introduction/20_mmad_kernellaunch
- Target repo: https://gitee.com/ascend/cann-ops/tree/master/src/matmul/mmad
- PR title: Contribution identity: activity----Mmad operator contribution
- PR link: https://gitee.com/ascend/cann-ops/pulls/431
- About the mmad operator:
C = A * B + Bias. A and B are the source operands: A is the left matrix with shape [M, K], and B is the right matrix with shape [K, N].
C is the destination operand holding the matmul result, with shape [M, N].
Bias is the matmul bias, with shape [N]; it is added to every row of the A * B result.
- Source repo code: the existing code contains two projects, one with and one without the bias parameter. Both provide a decoupled-architecture implementation (mmad_custom_cube_only.h) and a coupled-architecture implementation (mmad_custom.h), built on low-level Ascend C APIs with manually programmed data movement.
- Migration goal: add tiling-based partitioning of the data movement to the decoupled-architecture implementation with bias.
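The operator semantics above can be pinned down with a small host-side reference (plain C++, not Ascend C; `MatmulBiasRef` is an illustrative name): it computes C = A * B + Bias with the bias broadcast across the rows of the result, which is what the migrated kernel's output is compared against.

```cpp
#include <array>
#include <cstddef>

// Reference semantics of C = A * B + Bias, for illustration only.
// A is M x K, B is K x N, bias has length N, C is M x N; all row-major.
template <std::size_t M, std::size_t K, std::size_t N>
std::array<float, M * N> MatmulBiasRef(const std::array<float, M * K>& a,
                                       const std::array<float, K * N>& b,
                                       const std::array<float, N>& bias) {
    std::array<float, M * N> c{};
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = bias[j];  // every row is offset by the same bias vector
            for (std::size_t p = 0; p < K; ++p) {
                acc += a[i * K + p] * b[p * N + j];
            }
            c[i * N + j] = acc;
        }
    }
    return c;
}
```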
2. Implementation Plan
- Tiling strategy: tile matrix B along its columns. Because the cube unit computes on 16×16 fractals, the column-direction shape of the tiling parameter is set to 16; the data computed by a single core should not be split into overly large tiles. In this test each cube core computes 32×32 matrices, which is little data, so matrix A is read in all at once. Tiling of A is avoided mainly so that partial results from multiple passes do not have to be summed at the end.
- Strategy for aggregating the per-tile results on a single cube core: on the decoupled architecture the data paths are GM -> L1 -> L0A/L0B -> Cube -> L0C -> Fixpipe -> GM and GM -> L1 -> L0A/L0B -> Cube -> L0C -> L1. The idea considered was to stage each tile's result, aggregate once the core finishes, and move the aggregate to GM through the Fixpipe path.
- The latter path stages data in L1, which would take L1 space away from the inputs; not feasible.
- For matmul scratch space there is the TSCM TPosition, but according to the documentation TSCM only supports VECIN/VECOUT/VECCALC -> TSCM for scalar data, so matrix computation cannot use it; not feasible.
- Conclusion: staging and aggregating the results is not feasible; after each tile is computed, its result must still go out through the L0C -> Fixpipe -> GM path.
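The tiling arithmetic that falls out of this plan can be sketched as follows, assuming the shapes used in this test (m = k = n = 32, tile width 16). `TilePlan` and `PlanColumnTiling` are illustrative names, but the iteration count mirrors the kernel's `n / cubeBlockShape` loop bound:

```cpp
#include <cstdint>

// Per-core plan for column-tiling matrix B (illustrative sketch).
struct TilePlan {
    uint32_t iterations;  // number of column tiles of B processed by the core
    uint32_t bTileElems;  // half elements staged for B per iteration (k x tileN)
    uint32_t cTileElems;  // float elements produced in L0C per iteration (m x tileN)
};

inline TilePlan PlanColumnTiling(uint32_t m, uint32_t k, uint32_t n, uint32_t tileN) {
    // A is loaded once in full; only B (and the bias) are split along N.
    return TilePlan{n / tileN, k * tileN, m * tileN};
}
```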
3. Core Code
- op_host\mmad_tiling.h
#ifndef MMAD_TILING_H
#define MMAD_TILING_H
#include "register/tilingdata_base.h"
namespace optiling {
BEGIN_TILING_DATA_DEF(TilingData)
// Tiling parameter for matrix B: split along N
TILING_DATA_FIELD_DEF(uint32_t, tileBBlockShape);
END_TILING_DATA_DEF;
REGISTER_TILING_DATA_CLASS(Mmad, TilingData)
}
#endif //MMAD_TILING_H
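On the host side it may be worth validating the tiling field before launch. `IsValidTileBBlockShape` below is a hypothetical helper, not part of the repo: it encodes the two constraints implied by the kernel, namely that the tile width is a multiple of the 16-element cube fractal and divides n evenly so the `n / cubeBlockShape` loop covers B exactly.

```cpp
#include <cstdint>

// Hypothetical host-side sanity check for the tileBBlockShape tiling field.
inline bool IsValidTileBBlockShape(uint32_t n, uint32_t tile) {
    return tile != 0 &&       // avoid division by zero in the kernel loop
           tile % 16 == 0 &&  // must align to the 16x16 cube fractal
           n % tile == 0;     // column tiles must cover B with no remainder
}
```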
- op_kernel\mmad.cpp
#include "kernel_operator.h"
class KernelMmad {
public:
__aicore__ inline KernelMmad()
{
aSize = m * k;
bSize = k * n;
cSize = m * n;
}
__aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR bias, GM_ADDR c, uint32_t tileBBlockShape)
{
// set cube only
KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIC_ONLY);
cubeBlockShape = tileBBlockShape;
CubeBlockSize = cubeBlockShape * cubeBlockShape;
aGM.SetGlobalBuffer((__gm__ half *)a);
bGM.SetGlobalBuffer((__gm__ half *)b);
cGM.SetGlobalBuffer((__gm__ float *)c);
biasGM.SetGlobalBuffer((__gm__ float *)bias);
pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
pipe.InitBuffer(inQueueB2, 1, k * cubeBlockShape * sizeof(half));
pipe.InitBuffer(outQueueCO1, 1, m * cubeBlockShape * sizeof(float));
pipe.InitBuffer(inQueueC1, 1, n * sizeof(float));
pipe.InitBuffer(inQueueC2, 1, cubeBlockShape * sizeof(float));
}
__aicore__ inline void Process()
{
CopyIn();
SplitA();
AscendC::LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
AscendC::LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
AscendC::LocalTensor<float> bias1Local = inQueueC1.DeQue<float>();
for (int i = 0; i < n / cubeBlockShape; i++)
{
SplitB(b1Local, i);
SplitBias(bias1Local, i);
Compute(a2Local);
CopyOut(i);
}
inQueueA2.FreeTensor(a2Local);
inQueueB1.FreeTensor(b1Local);
inQueueC1.FreeTensor(bias1Local);
}
private:
__aicore__ inline uint32_t CeilCubeBlock(uint32_t len) {
return (len + cubeBlockShape - 1) / cubeBlockShape;
}
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
AscendC::LocalTensor<float> bias1Local = inQueueC1.AllocTensor<float>();
AscendC::Nd2NzParams nd2nzA1Params;
nd2nzA1Params.ndNum = 1;
nd2nzA1Params.nValue = m;
nd2nzA1Params.dValue = k;
nd2nzA1Params.srcNdMatrixStride = 0;
nd2nzA1Params.srcDValue = k;
nd2nzA1Params.dstNzC0Stride = CeilCubeBlock(m) * cubeBlockShape;
nd2nzA1Params.dstNzNStride = 1;
nd2nzA1Params.dstNzMatrixStride = 0;
AscendC::DataCopy(a1Local, aGM, nd2nzA1Params);
AscendC::Nd2NzParams nd2nzB1Params;
nd2nzB1Params.ndNum = 1;
nd2nzB1Params.nValue = k;
nd2nzB1Params.dValue = n;
nd2nzB1Params.srcNdMatrixStride = 0;
nd2nzB1Params.srcDValue = n;
nd2nzB1Params.dstNzC0Stride = CeilCubeBlock(k) * cubeBlockShape;
nd2nzB1Params.dstNzNStride = 1;
nd2nzB1Params.dstNzMatrixStride = 0;
AscendC::DataCopy(b1Local, bGM, nd2nzB1Params);
AscendC::DataCopy(bias1Local, biasGM, n);
inQueueA1.EnQue(a1Local);
inQueueB1.EnQue(b1Local);
inQueueC1.EnQue(bias1Local);
}
__aicore__ inline void SplitA()
{
AscendC::LocalTensor<half> a1Local = inQueueA1.DeQue<half>();
AscendC::LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>();
uint32_t dstOffset = CeilCubeBlock(k) * CubeBlockSize;
uint32_t srcOffset = CubeBlockSize;
// NZ -> ZZ (A: L1 -> L0A)
AscendC::LoadData2DParams loadDataParams;
loadDataParams.repeatTimes = CeilCubeBlock(k);
loadDataParams.srcStride = CeilCubeBlock(m);
loadDataParams.dstGap = 0;
loadDataParams.ifTranspose = false;
for (int i = 0; i < CeilCubeBlock(m); ++i) {
AscendC::LoadData(a2Local[i * dstOffset], a1Local[i * srcOffset], loadDataParams);
}
inQueueA2.EnQue<half>(a2Local);
inQueueA1.FreeTensor(a1Local);
}
__aicore__ inline void SplitB(const AscendC::LocalTensor<half>& b1Local, const uint32_t bSplitIdx)
{
AscendC::LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>();
// Nz -> Zn
AscendC::LoadData2DParams loadDataParams;
loadDataParams.repeatTimes = CeilCubeBlock(k);
loadDataParams.srcStride = 1;
loadDataParams.ifTranspose = true;
AscendC::LoadData(b2Local, b1Local[bSplitIdx * CeilCubeBlock(n) * CubeBlockSize], loadDataParams);
inQueueB2.EnQue<half>(b2Local);
}
__aicore__ inline void SplitBias(const AscendC::LocalTensor<float>& bias1Local, const uint32_t bSplitIdx)
{
AscendC::LocalTensor<float> bias2Local = inQueueC2.AllocTensor<float>();
AscendC::DataCopy(bias2Local, bias1Local[bSplitIdx * cubeBlockShape], cubeBlockShape);
inQueueC2.EnQue<float>(bias2Local);
}
__aicore__ inline void Compute(const AscendC::LocalTensor<half>& a2Local)
{
AscendC::LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
AscendC::LocalTensor<float> bias2Local = inQueueC2.DeQue<float>();
AscendC::LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
AscendC::MmadParams mmadParams;
mmadParams.m = m;
mmadParams.n = cubeBlockShape;
mmadParams.k = k;
mmadParams.cmatrixInitVal = false;
AscendC::Mmad(c1Local, a2Local, b2Local, bias2Local, mmadParams);
outQueueCO1.EnQue<float>(c1Local);
inQueueB2.FreeTensor(b2Local);
inQueueC2.FreeTensor(bias2Local);
}
__aicore__ inline void CopyOut(const uint32_t bSplitIdx)
{
AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
// FixpipeParamsV220: CO1 -> GM
AscendC::FixpipeParamsV220 fixpipeParams;
fixpipeParams.nSize = cubeBlockShape;
fixpipeParams.mSize = m;
fixpipeParams.srcStride = cubeBlockShape * sizeof(float); // start-address offset between adjacent Z fractals in the source NZ matrix
fixpipeParams.dstStride = n;
fixpipeParams.ndNum = 1;
fixpipeParams.srcNdStride = 0;
fixpipeParams.dstNdStride = 0;
// default layout conversion: NZ -> ND
AscendC::Fixpipe(cGM[bSplitIdx * cubeBlockShape], c1Local, fixpipeParams);
outQueueCO1.FreeTensor(c1Local);
}
private:
AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
// AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1_; // the decoupled architecture has no CO2
AscendC::TQue<AscendC::TPosition::C1, 1> inQueueC1;
AscendC::TQue<AscendC::TPosition::C2, 1> inQueueC2;
AscendC::GlobalTensor<half> aGM;
AscendC::GlobalTensor<half> bGM;
AscendC::GlobalTensor<float> cGM;
AscendC::GlobalTensor<float> biasGM;
uint16_t m = 32, k = 32, n = 32;
uint16_t aSize, bSize, cSize;
uint32_t cubeBlockShape, CubeBlockSize;
};
extern "C" __global__ __aicore__ void mmad(GM_ADDR a, GM_ADDR b, GM_ADDR bias, GM_ADDR c, GM_ADDR workspace, GM_ADDR tiling) {
GET_TILING_DATA(tiling_data, tiling);
KernelMmad op;
op.Init(a, b, bias, c, tiling_data.tileBBlockShape);
op.Process();
}
4. Problems Encountered and Solutions
- Runtime environment issue: when running the environment setup script init_env.sh on the Huawei Cloud test server to install cann-toolkit, network proxy errors often made the package download fail.
Solution: do not run the environment script right away; download the cann-toolkit package locally, upload it to the server, then run the setup script.
- Execution permission issue
Permission denied
Solution: set the permission bits of the project files for other users to 0.
- Tiling registration failure
/home/user/cann/03mmad/mmad-2/MmadCustom/op_kernel/mmad.cpp:12:21: error: use of undeclared identifier 'tiling_data'
GET_TILING_DATA(tiling_data,tiling);
^
/home/user/cann/03mmad/mmad-2/MmadCustom/op_kernel/mmad.cpp:14:24: error: use of undeclared identifier 'tiling_data'
op.Init(a,b,bias,c,tiling_data.tileBBlockShape);
...
[ERROR] [ascend910b] Mmad do not registe tiling struct!!!
Analysis: the errors above occur during local compilation; presumably the cause is that the local software environment is not installed on Ascend hardware.
Solution: compilation on the server works fine.
- ACLNN operator invocation runtime error
aclnnMmadmGetWorkspaceSize failed. ERROR: 161002
Analysis: appeared to be related to the buffer space allocated for the data.
Solution: inspecting the ACLNN invocation sample code showed that the bias input had the wrong data type; fixing it resolved the error.
- ACLNN operator invocation runtime error
aclrtSynchronizeStream failed. ERROR: 507015
Analysis: the error code indicates an AI Core runtime error. The log files under ~/ascend/log/debug/plog contained rtKernelLaunchWithHandleV2:ErrCode=107000, which is a UB memory error.
Solution: the code turned out to use the CO2 position, which the decoupled architecture does not support; the decoupled data path L0C -> Fixpipe -> GM should be used instead.
- ACLNN operator invocation produced wrong test results
Analysis: the test results did not match the golden data; the suspected cause was incorrect data organization during the NZ2ND move from Fixpipe to GM after each tile's computation.
Solution: the destination GM position should be the index of the first element of the tile's first row.
AscendC::Fixpipe(cGM[bSplitIdx*m*cubeBlockShape], c1Local, fixpipeParams);
corrected to
AscendC::Fixpipe(cGM[bSplitIdx*cubeBlockShape], c1Local, fixpipeParams);
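The corrected offset follows directly from row-major ND addressing: column tile j of the M × N output begins at linear offset j * tileN, and Fixpipe's dstStride = n then advances to the same columns of the next row. A sketch with illustrative helper names:

```cpp
#include <cstdint>

// Start of column tile j in a row-major M x N matrix (illustrative helpers).
inline uint32_t TileDstOffset(uint32_t j, uint32_t tileN) {
    return j * tileN;  // first element of the tile's first row
}

// Linear offset of element (row, col) inside column tile j.
inline uint32_t TileElemOffset(uint32_t j, uint32_t tileN, uint32_t n,
                               uint32_t row, uint32_t col) {
    return TileDstOffset(j, tileN) + row * n + col;  // rows are n elements apart
}
```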
- ST test could not find the Mmad op
b"ATC run failed, Please check the detail log, Try 'atc --help' for more information"
b'EZ3003: [PID: 174824] 2025-05-22-16:46:44.846.249 No supported Ops kernel and engine are found for [Mmad], optype [Mmad].'
b'Possible Cause: The operator is not supported by the system. Therefore, no hit is found in any operator information library.'
Analysis: compiling and installing other single operators from the cann-ops repo in the same way also produced errors (different ones), so the environment itself was likely broken.
Solution: the repository's CI gate runs ST automatically, so the CI ST environment was used for testing.
- CI gate x86 build error
Analysis: the administrators determined it was a server environment configuration issue.
Solution: the CI team resolved it about a month later, and the code has been merged.
