高性能計算-Intel IPP庫ARM移植示例(20)

1. 簡介

(1) Intel? Integrated Performance Primitives，即英特爾集成性能基元（簡稱IPP），為信號、數據和圖像處理特定應用領域，提供simd優化的一組全面的函數庫。

(2) 本項目將對 exp、cos、sin、tone、Triangle函數用NEON向量化指令實現ARM移植版本，有串行和向量化兩個版本。

(3)計算使用泰勒展開，使用前后加和項的比例關系，加速計算。達到接口要求的精度即可。

(4)項目地址：https://github.com/libo-0379/IPP_ARM

2. 函數及接口介紹

(1) exp

Parameters
pSrc Pointer to the source vector.
pDst Pointer to the destination vector.
len Number of elements in the vectors.

IppStatus ippsExp_32f_A24 (const Ipp32f* pSrc, Ipp32f* pDst, Ipp32s len);

(2) cos

Parameters
pSrc Pointer to the source vector.
pDst Pointer to the destination vector.
len Number of elements in the vectors.

IppStatus ippsCos_32f_A24 (const Ipp32f* pSrc, Ipp32f* pDst, Ipp32s len);

(3) sin

Parameters
pSrc Pointer to the source vector.
pDst Pointer to the destination vector.
len Number of elements in the vectors.

IppStatus ippsSin_32f_A24 (const Ipp32f* pSrc, Ipp32f* pDst, Ipp32s len);

(4) Tone

Description：
This function generates the tone with the specified frequency rFreq, phase pPhase, and magnitude magn.
The function computes len samples of the tone, and stores them in the array pDst. For real tones, each
generated value x[n] is defined as:
x[n] = magn * cos(2πn*rFreq + phase)
For complex tones, x[n] is defined as:
x[n] = magn * (cos(2πn*rFreq + phase)+j* sin(2πn*rFreq + phase))
The parameter hint suggests using specific code, which provides for either fast but less accurate calculation,
or more accurate but slower execution.

Parameters：
magn Magnitude of the tone, that is, the maximum value attained by
the wave.
pPhase Pointer to the phase of the tone relative to a cosine wave. It
must be in range [0.0, 2π). You can use the returned value to
compute the next continuous data block.
rFreq Frequency of the tone relative to the sampling frequency. It
must be in the interval [0.0, 0.5) for real tone and in [0.0, 1.0)
for complex tone.
pDst Pointer to the array that stores the samples.
len Number of samples to be computed.

IppStatus ippsTone_32f(Ipp32f* pDst, int len, Ipp32f magn, Ipp32f rFreq, Ipp32f*
pPhase, IppHintAlgorithm hint);

函數表達式

x[n] = magn * cos(2πn*rFreq + phase)

函數圖像:

(5) Triangle

Description：
This function generates the triangle with the specified frequency rFreq, phase pointed by pPhase, and
magnitude magn. The function computes len samples of the triangle, and stores them in the array pDst. For
real triangle, x[n] is defined as:
x[n] = magn * cth(2π* rFreq*n + phase), n = 0, 1, 2,...
For complex triangles, x[n] is defined as:
x[n] = magn * [cth(2π* rFreq*n + phase) + j * sth(2π* rFreq*n + phase)], n = 0, 1, 2,...
See Triangle-Generating Functions for the definition of functions cth and sth.
The cth () function is determined as follows:
H = π + h

Parameters：
rFreq Frequency of the triangle relative to the sampling frequency. It
must be in range [0.0, 0.5).
pPhase Pointer to the phase of the triangle relative to a cosine
triangular analog wave. It must be in range [0.0, 2π). You can
use the returned value to compute the next continuous data
block.
magn Magnitude of the triangle, that is, the maximum value attained
by the wave.
asym Asymmetry h of a triangle. It must be in range [-π , π ). If h=0,
then the triangle is symmetric and a direct analog of a tone.
pDst Pointer to the array that stores the samples.
len Number of samples to be computed

IppStatus ippsTriangle_32fc(Ipp32fc* pDst, int len, Ipp32f magn, Ipp32f rFreq, Ipp32f
asym, Ipp32f* pPhase);

函數實部和虛部

函數圖像

3. 核心代碼

#include <stdio.h>
#include <math.h>
#include "../include/ipp.h"

IppStatus ippsExp_32f_A24(const Ipp32f *pSrc, Ipp32f *dst, Ipp32s len)
{
    for(int i=0;i<len/4*4;i+=4)
    {   
        //每個元素泰勒展開項計數，初始化為0
        float32x4_t vn = vdupq_n_f32(0);
        float32x4_t vx = vld1q_f32(pSrc+i);
        //記錄當前項的展開值，初始化為第一項的值
        float32x4_t vprior = vdupq_n_f32(1.0);
        //保存每個元素展開的求和
        float32x4_t vsum = vprior;
        while(1)
        {
            //計算當前展開項
            // 1. 計算 prior * x / ++n
            vn = vaddq_f32(vn,vdupq_n_f32(1));
            // 自己實現的 vdivq_f32_neon 對 -20.055 計算誤差太大，改用系統提供接口 vdivq_f32
            // float32x4_t vcur = vdivq_f32_neon(vmulq_f32(vprior,vx),vn);
            float32x4_t vcur = vdivq_f32(vmulq_f32(vprior,vx),vn);
            // 2. 將當前項賦值為前項            
            vprior = vcur;
            // printf("0 %f\n",vgetq_lane_f32(vcur,0));
            // 3. 如果所有元素的最大展開項小于精度控制，不再計算
            if(vmaxvq_f32(vabsq_f32(vcur)) - EPSILON < 0)
                break;
            // 4. 將當前展開項加和
            vsum = vaddq_f32(vsum,vcur);
        }
        // printf("n %f\n",vgetq_lane_f32(vn,0));
        vst1q_f32(dst+i,vsum);
    }

    //處理尾部數據
    for(int i=len/4*4;i<len;i++)
    {
        // 初始化第一個值
        int n = 0;
        double prior = 1.0; //前一項數值初始化為第一項的數值
        double sum = prior; //求和保存結果
        Ipp32f x = *(pSrc+i);
        while (1)
        {
            double cur = prior * x / ++n;   //當前加項的數值 = 前項*(src[i]/n)
            prior = cur;
            // printf("%f\n",cur);
            if (fabs(cur)-EPSILON <= 0) //如果小于精度值停止計算
                break;
            sum += cur;
        }
        // printf("n %d\n",n);
        *(dst+i) = sum;
    }
    return ippStsNoErr;
}

IppStatus ippsCos_32f_A24(const Ipp32f *pSrc, Ipp32f *dst, Ipp32s len)
{
    float32x4_t v2PI = vdupq_n_f32(IPP_2PI);
    for(int i=0;i<len/4*4;i+=4)
    {
        //每個元素泰勒展開項計數，初始化為0
        float32x4_t vn = vdupq_n_f32(0);
        float32x4_t vx = vld1q_f32(pSrc+i);
        //記錄當前項的展開值，初始化為第一項的值
        float32x4_t vprior = vdupq_n_f32(1.0);
        //保存每個元素展開的求和
        float32x4_t vsum = vprior;
        //將x轉換到 [-2π,2π]
        //類型轉換向下取整,注意這里是浮點數除法
        float32x4_t vk = vcvtq_f32_s32(vcvtq_s32_f32(vdivq_f32_neon(vx,v2PI)));   
        vx = vfmsq_f32(vx,v2PI,vk);
        while(1)
        {
            //計算當前展開項
            // 1. 計算 prior * (x^2) / (2(++n)*(2n-1))
            vn = vaddq_f32(vn,vdupq_n_f32(1));
            float32x4_t vcur = vmulq_n_f32(vn,2);
            vcur = vdivq_f32_neon(vmulq_f32(vprior,vmulq_f32(vx,vx)),vmulq_f32(vcur,vsubq_f32(vcur,vdupq_n_f32(1.0))));
            //將當前項賦值為前項,不帶符號          
            vprior = vcur;
            // 2. 添加正負號
            float32_t n = vgetq_lane_f32(vn,0);
            vcur = vmulq_n_f32(vcur,(int)n%2==0?1:-1);
            // 3. 如果所有元素的展開項絕對值最大項小于精度控制，不再計算
            if(vmaxvq_f32(vabsq_f32(vcur)) - EPSILON <= 0)
                break;
            // 4. 將當前展開項加和
            vsum = vaddq_f32(vsum,vcur);
        }
        printf("n %f\n",vgetq_lane_f32(vn,0));
        vst1q_f32(dst+i,vsum);
    }
    //處理尾部數據
    for(int i=len/4*4;i<len;i++)
    {
        // 初始化第一個值
        int n = 0;
        double prior = 1.0; //前一項數值初始化為第一項的數值
        double sum = prior; //求和保存結果
        Ipp64f x = *(pSrc+i);
        //對x轉換到 [-2π,2π]
        Ipp32s k = x/IPP_2PI;
        x -= k*IPP_2PI;
        while (1)
        {
            int32_t temp1 = 2* ++n;
            double cur = prior * (x * x) /(temp1-- * temp1) ;   //當前加項的數值 = 前項*(x^2/(2n*(2n-1)))
            prior = cur;
            if (fabs(cur) - EPSILON <= 0) //如果小于精度值停止計算
                break;
            sum += (n%2==0?1:-1)*cur;   //符號位
        }
        // printf("n %d\n",n);
        *(dst+i) = sum;
    }
    return ippStsNoErr;
}

IppStatus ippsSin_32f_A24(const Ipp32f *pSrc, Ipp32f *dst, Ipp32s len)
{    
    float32x4_t v2PI = vdupq_n_f32(IPP_2PI);
    for(int i=0;i<len/4*4;i+=4)
    {
        //每個元素泰勒展開項計數，初始化為0
        float32x4_t vn = vdupq_n_f32(0);
        float32x4_t vx = vld1q_f32(pSrc+i);
        //將x轉換到 [-2π,2π]
        //類型轉換向下取整,注意這里是浮點數除法
        float32x4_t vk = vcvtq_f32_s32(vcvtq_s32_f32(vdivq_f32_neon(vx,v2PI)));   
        vx = vfmsq_f32(vx,v2PI,vk);
        // !! 注意這里易錯點，應在區間轉換后再賦值給 vprior 
        //記錄當前項的展開值，初始化為第一項的值
        float32x4_t vprior = vx;
        //保存每個元素展開的求和
        float32x4_t vsum = vprior;
        printf("x0 %f\n",vgetq_lane_f32(vx,0));
        while(1)
        {
            // 1. 計算 prior * (x^2) / (2(++n)*(2n+1))
            vn = vaddq_f32(vn,vdupq_n_f32(1));
            float32x4_t vcur = vmulq_n_f32(vn,2);
            vcur = vdivq_f32_neon(vmulq_f32(vprior,vmulq_f32(vx,vx)),vmulq_f32(vcur,vaddq_f32(vcur,vdupq_n_f32(1))));
            // printf("cur0 %f\n",vgetq_lane_f32(vcur,0));
            //將當前項賦值為前項,不帶符號          
            vprior = vcur;
            // 2. 添加正負號
            float32_t n = vgetq_lane_f32(vn,0);
            vcur = vmulq_n_f32(vcur,(int)n%2==0?1:-1);
            // 3.如果所有元素的展開項絕對值最大項小于精度控制，不再計算
            if(vmaxvq_f32(vabsq_f32(vcur)) - EPSILON <= 0)
                break;
            // 4.將當前展開項加和
            vsum = vaddq_f32(vsum,vcur);
        }
        printf("n %f\n",vgetq_lane_f32(vn,0));
        vst1q_f32(dst+i,vsum);
    }
    //處理尾部數據
    for(int i=len/4*4;i<len;i++)
    {
        // 初始化第一個值
        int n = 0;
        Ipp64f x = *(pSrc+i);
        //將x轉換到 [-2π,2π]
        Ipp32s k = x/IPP_2PI;
        x -= k*IPP_2PI;
        double prior = x; //前一項數值初始化為第一項的數值
        double sum = prior; //求和保存結果
        while (1)
        {
            int32_t temp1 = 2* ++n;
            double cur = prior * (x * x) /(temp1++ * temp1) ;   //當前加項的數值 = 前項*(x^2/(2n*(2n+1)))
            prior = cur;
            if (fabs(cur) - EPSILON <= 0) //如果小于精度值停止計算
                break;
            sum += (n%2==0?1:-1)*cur;   //符號位
        }
        // printf("n %d\n",n);
        *(dst+i) = sum;
    }
    return ippStsNoErr;
}

IppStatus ippsTone_32f(Ipp32f* pDst, int len, Ipp32f magn, Ipp32f rFreq,
Ipp32f* pPhase, IppHintAlgorithm hint)
{
    if(rFreq<0.0 || (rFreq-0.5)>EPSILON)
        return ippStsToneFreqErr;
    if(*pPhase<0.0 || (*pPhase-IPP_2PI)>=EPSILON)
        return ippStsTonePhaseErr;
    if(magn<=0.0)
        return ippStsToneMagnErr;
    
    float32x4_t vphase = vdupq_n_f32(*pPhase);
    float32x4_t vrFreq= vdupq_n_f32(rFreq);
    for(int i=0;i<len/4*4;i+=4)
    {
        //保存結果
        float32_t result[4] = {0.0};
        float32x4_t vresult = {0.0};
        // IPP_2PI*rFreq*i + phase
        float32_t x[4] = {i,i+1,i+2,i+3};
        float32x4_t vx = vld1q_f32(x);
        vx = vaddq_f32(vphase,vmulq_n_f32(vmulq_f32(vrFreq,vx),IPP_2PI));
        vst1q_f32(x,vx);
        ippsCos_32f_A24(x,result,4);
        vresult = vld1q_f32(result);
        vst1q_f32(pDst+i,vmulq_n_f32(vresult,magn));
    }
    for(int i=len/4*4;i<len;i++)
    {
        float32_t result = 0.0;
        Ipp32f x = IPP_2PI*rFreq*i + (*pPhase);
        ippsCos_32f_A24_s(&x,&result,1);
        *(pDst+i) = result * magn;  //類型強轉
    }
    return ippStsNoErr;
}

IppStatus ippsTriangle_32fc(Ipp32fc* pDst, int len, Ipp32f magn, Ipp32f rFreq, 
Ipp32f asym, Ipp32f* pPhase)
{
    if(rFreq<0.0 || (rFreq-0.5)>EPSILON)
        return ippStsToneFreqErr;
    if(*pPhase<0.0 || (*pPhase-IPP_2PI)>=EPSILON)
        return ippStsTonePhaseErr;
    if(magn<=0.0)
        return ippStsToneMagnErr;
    
    float32x4_t vH = vdupq_n_f32(IPP_PI+asym);
    printf("h %f\n",IPP_PI+asym);

    float32x4_t vPI = vdupq_n_f32(IPP_PI);
    float32x4_t v2PI = vdupq_n_f32(IPP_2PI);
    float32x4_t v2PI_H = vsubq_f32(v2PI,vH);
    float32x4_t vphase = vdupq_n_f32(*pPhase);
    float32x4_t vrFreq= vdupq_n_f32(rFreq);
    for(int i=0;i<len/4*4;i+=4)
    {
        float32_t x[4] = {i,i+1,i+2,i+3};
        float32x4_t vx = vld1q_f32(x);
        // 計算位置 x = 2*PI*rFreq*n + phase
        vx = vaddq_f32(vphase,vmulq_n_f32(vmulq_f32(vrFreq,vx),IPP_2PI));
        //因為x>0 可以轉換到 [0,2π]
        //類型轉換向下取整,注意這里是浮點數除法
        float32x4_t vk = vcvtq_f32_s32(vcvtq_s32_f32(vdivq_f32_neon(vx,v2PI)));   
        vx = vfmsq_f32(vx,v2PI,vk);

        printf("x %f\n",vgetq_lane_f32(vx,0));
        
        vst1q_f32(x,vx);
        // for(int i=0;i<4;i++)
        //     printf("x %f\n",x[i]);
        //保存結果包含實部虛部
        float32x4x2_t vresult;
        //實部
        float32x4_t vresultRE;
        float32x4_t vresultRE1;
        float32x4_t vresultRE2;
        //虛部
        float32x4_t vresultIM;
        float32x4_t vresultIM_1;
        float32x4_t vresultIM1;
        float32x4_t vresultIM2;
        float32x4_t vresultIM3;
        //對實部和虛部所有區間都計算最后根據 x 的范圍取舍
        //計算實部
        //[0,H] -2*x/H + 1 
        vresultRE1 = vaddq_f32(vdupq_n_f32(1.0),vdivq_f32_neon(vmulq_n_f32(vx,-2),vH));
        //[H,2π] (2(x-PI)-H)/(2PI-H)
        vresultRE2 = vdivq_f32_neon(vsubq_f32(vmulq_n_f32(vsubq_f32(vx,vPI),2),vH),v2PI_H);

        //計算虛部
        //[0,π-H/2] 2*x/(2PI-H)
        vresultIM1 = vdivq_f32_neon(vmulq_n_f32(vx,2),v2PI_H);
        //[π-H/2,π+H/2] -2*(x-PI)/H
        vresultIM2 = vdivq_f32_neon(vmulq_n_f32(vsubq_f32(vx,vPI),-2),vH);
        //[π+H/2,2π] 2(x-2PI)/(2PI-H)
        vresultIM3 = vdivq_f32_neon(vmulq_n_f32(vsubq_f32(vx,v2PI),2),v2PI_H);

        //實部取舍
        uint32x4_t vcompare = vcltq_f32(vx,vH);
    
        printf("c %u\n",vgetq_lane_u32(vcompare,0));

        vresultRE = vbslq_f32(vcompare,vresultRE1,vresultRE2);
        vresultRE = vmulq_n_f32(vresultRE,magn);
        //虛部取舍
        float32x4_t v2_H = vdivq_f32_neon(vH,vdupq_n_f32(2));
        // x< π-H/2 取 vresultIM1,否則取 vresultIM2
        vcompare = vcltq_f32(vx,vsubq_f32(vPI,v2_H));
        vresultIM_1 = vbslq_f32(vcompare,vresultIM1,vresultIM2);
        // x> π+H/2 取 vresultIM3,否則取 vresultIM
        vcompare = vcgtq_f32(vx,vaddq_f32(vPI,v2_H));
        vresultIM = vbslq_f32(vcompare,vresultIM3,vresultIM_1);
        vresultIM = vmulq_n_f32(vresultIM,magn);

        vresult.val[0] = vresultRE;
        vresult.val[1] = vresultIM;
        printf("re %f\n",vgetq_lane_f32(vresultRE,0));
        printf("im %f\n",vgetq_lane_f32(vresultIM,0));
        vst2q_f32((float32_t*)(pDst+i),vresult);
    }
    Ipp32f H = IPP_PI+asym;
    for(int i=len/4*4;i<len;i++)
    {
        Ipp32f x = IPP_2PI*rFreq*i + *pPhase;
        //x 需要轉換到 [0,2π]
        Ipp32s k = x/IPP_2PI;
        x -= k*IPP_2PI;
        Ipp32fc result;
        // printf("x %f\n",x);
        //計算實部
        if(x>=0 && (H-x)>=EPSILON)  
             result.re = -2*x/H + 1.0;  // -2*x/H + 1 
        else    
            result.re = (2*(x-IPP_PI)-H)/(IPP_2PI-H);// (2(x-PI)-H)/(2PI-H)
        //計算虛部
        if(x>=0.0 && ((IPP_2PI-H)/2)-x >= EPSILON)  
            result.im = 2*x/(IPP_2PI-H);    // 2*x/(2PI-H)
        else if(x-(IPP_2PI-H)/2 >=EPSILON && (IPP_2PI+H)/2-x>=EPSILON)
            result.im = -2*(x-IPP_PI)/H;    //-2*(x-PI)/H
        else
            result.im = 2*(x-IPP_2PI)/(IPP_2PI-H);  //2(x-2PI)/(2PI-H)
        result.re *= magn;
        result.im *= magn;
        *(pDst+i) = result;
    }

    return ippStsNoErr;
}

測試代碼

#include <stdio.h>
#include "../include/ipp.h"
#include "../include/ipp_s.h"

#define N 5

int main()
{
    Ipp32f src[N] = {-20.055, -5.456, 7.835, 11.89, 18.76};
    Ipp32f dst[N] ={0.0};
    for(int i=0;i<N;i++)
    {
        ippsExp_32f_A24_s(&src[i],&dst[i],1);
        printf("result %f\n",dst[i]);
    }
    printf("======\n");
    ippsExp_32f_A24(src,dst,N);
    for(int i=0;i<N;i++)
        printf("result %f\n",dst[i]);
    printf("======\n");

    for(int i=0;i<N;i++)
    {
        ippsCos_32f_A24_s(&src[i],&dst[i],1);
        printf("result %f\n",dst[i]);
    }
    printf("======\n");
    ippsCos_32f_A24(src,dst,N);
    for(int i=0;i<N;i++)
        printf("result %f\n",dst[i]);
    printf("======\n");

    for(int i=0;i<N;i++)
    {
        ippsSin_32f_A24_s(&src[i],&dst[i],1);
        printf("result %f\n",dst[i]);
    }
    printf("======\n");
    ippsSin_32f_A24(src,dst,N);
    for(int i=0;i<N;i++)
        printf("result %f\n",dst[i]);
    printf("======\n");

    Ipp32f phase = 5.3865;
    ippsTone_32f_s(dst,N,3,0.2456,&phase,ippAlgHintFast);
    for(int i=0;i<N;i++)
        printf("result %f\n",dst[i]);
    printf("======\n");
    ippsTone_32f(dst,N,3,0.2456,&phase,ippAlgHintFast);
    for(int i=0;i<N;i++)
        printf("result %f\n",dst[i]);
    printf("======\n");

    Ipp32fc dst1[N];
    ippsTriangle_32fc_s(dst1,N,3,0.2456,1.47,&phase);
    for(int i=0;i<N;i++)
        printf("result %f %f\n",dst1[i].re,dst1[i].im);
    printf("======\n");

    ippsTriangle_32fc(dst1,N,3,0.2456,1.47,&phase);
    for(int i=0;i<N;i++)
        printf("result %f %f\n",dst1[i].re,dst1[i].im);
    printf("======\n");

    return 0;
}

4. Intel IPP庫測試

官網安裝教程：https://www.intel.cn/content/www/cn/zh/developer/tools/oneapi/hpc-toolkit-download.html?packages=hpc-toolkit&hpc-toolkit-os=linux&hpc-toolkit-lin=offline

運行前設置環境變量

source /opt/intel/oneapi/setvars.sh

編譯命令

icx test.c -o test -lippvm -lippcore -lipps

5. 測試數據

函數	串行單精度	FIELD3	串行雙精度	FIELD5	向量化單精度	FIELD7	intel IPP	FIELD9
ippsExp_32f_A24	1.976372		0		1.976372		0
	0.004269		0.00427		0.004269		0.004271
	2527.535645		2527.535645		2527.535645		2527.535645
	145801.3281		145801.3438		145801.3281		145801.3438
	140399216		140399184		140399184		140399184
ippsCos_32f_A24	0.357278		0.357278		0.357279		0.357278
	0.676947		0.67695		0.676952		0.67695
	0.01898		0.01898		0.018981		0.01898
	0.77985		0.77985		0.779854		0.77985
	0.995996		0.995993		0.995993		0.995993
ippsSin_32f_A24	-0.933998		-0.933998		-0.933998		-0.933998
	0.73603		0.736029		0.73603		0.736029
	0.99982		0.99982		0.99982		0.99982
	-0.625968		-0.625967		-0.625965		-0.625966
	-0.089438		-0.089436		-0.089436		-0.089436
ippsTone_32f	1.872611		1.872609		1.872612		1.872609
	2.394652		2.394654		2.394655		2.394655
	-1.740218		-1.74022		-1.74022		-1.74022
	-2.490863		-2.490862		-2.490861		-2.490863
	1.602509		1.602513		1.602509		1.602513
ippsTriangle_32fc	實部	虛部	實部	虛部	實部	虛部	實部	虛部
	-0.218555	-2.920779	-0.218555	-2.920779			-0.218555	-2.920779
	2.158905	2.320416	2.158905	2.320415			2.158905	2.320415
	0.15116	1.238589	0.15116	1.238589			0.15116	1.238588
	-1.856586	-0.769157	-1.856585	-0.769157			-1.856585	-0.769157
	-0.615485	-2.776901	-0.615485	-2.776901			-0.615484	-2.776901

6. 結果分析

(1) 用單精度計算的結果可能誤差較大，無法保證接口A24要求單精度十進制6位數的準確，建議使用雙精度計算轉換成單精度結果；

(2) NEON 提供有 vdiv 除法接口，比使用倒數的自定義版本精度更高。

posted @ 2024-12-17 17:34 安洛8 閱讀(308) 評論(0) 收藏舉報

刷新頁面返回頂部

anluo8

高性能計算-Intel IPP庫ARM移植示例(20)

1. 簡介

(1) Intel? Integrated Performance Primitives，即英特爾集成性能基元（簡稱IPP），為信號、數據和圖像處理特定應用領域，提供simd優化的一組全面的函數庫。

(2) 本項目將對 exp、cos、sin、tone、Triangle函數用NEON向量化指令實現ARM移植版本，有串行和向量化兩個版本。

(3)計算使用泰勒展開，使用前后加和項的比例關系，加速計算。達到接口要求的精度即可。

(4)項目地址：https://github.com/libo-0379/IPP_ARM

2. 函數及接口介紹

(1) exp

(2) cos

(3) sin

(4) Tone

函數表達式

函數圖像:

(5) Triangle

函數實部和虛部

函數圖像

3. 核心代碼

測試代碼

4. Intel IPP庫測試

官網安裝教程：https://www.intel.cn/content/www/cn/zh/developer/tools/oneapi/hpc-toolkit-download.html?packages=hpc-toolkit&hpc-toolkit-os=linux&hpc-toolkit-lin=offline

運行前設置環境變量

編譯命令

5. 測試數據

6. 結果分析

(1) 用單精度計算的結果可能誤差較大，無法保證接口A24要求單精度十進制6位數的準確，建議使用雙精度計算轉換成單精度結果；

(2) NEON 提供有 vdiv 除法接口，比使用倒數的自定義版本精度更高。

公告