Intel HDSLB High-Performance Layer-4 Load Balancer: Code Walkthrough and Advanced Features

2024-06-16 16:09 云物互聯

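Preface

The previous two articles covered Intel HDSLB from four angles: quick start, use cases, basic principles, and deployment and configuration. They introduced the background, solution design, and performance advantages of HDSLB as a new-generation high-performance Layer-4 load balancer, and walked step by step through getting it running on a development machine. This article goes further: it walks through the code of the open-source HDSLB-DPVS release and introduces some of its more interesting advanced features.

"Intel HDSLB High-Performance Layer-4 Load Balancer: Quick Start and Use Cases"
"Intel HDSLB High-Performance Layer-4 Load Balancer: Basic Principles and Deployment Configuration"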

Code Walkthrough

Download the code:

git clone https://github.com/intel/high-density-scalable-load-balancer.git

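Software Architecture

The HDSLB software architecture diagram can be read top to bottom as the following five layers.

Control Plane layer: reuses the LVS control plane and provides the following three CLI tools, which communicate over a local Unix socket:
1. ipvsadm: manages the ipvs configuration.
2. dpip: configures VIPs, RIPs, and the related route rules.
3. keepalived: handles HA and RS health checks.
Load Balancer layer: implements the scheduler (traffic scheduling), proto (L4 protocols), conn (connection tracking), and FastPath modules. FastPath, which separates the fast and slow paths, is one of HDSLB's core optimizations over DPVS.
Lite IP-Stack layer: implements the L2/L3 protocols ARP, IPv4/IPv6, and ICMP, plus modules such as inetaddr and L3 routing.
Net Devices layer: implements physical NIC management, bonding, VLAN, KNI, TC traffic control, and the hw-addr-list address list.
Hardware Acceleration layer: provides hardware-level acceleration from Intel CPUs and NICs, including FDIR mark, FDIR to queue, RSS, checksum offload, AVX512, DLB, and SR-IOV.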

Directory Structure

$ high-density-scalable-load-balancer git:(main) tree -L 2
.
├── Makefile
├── conf  # sample configuration files
│   ├── hdslb.bond.conf.sample
│   ├── hdslb.conf.items
│   ├── hdslb.conf.sample
│   ├── hdslb.conf.single-bond.sample
│   └── hdslb.conf.single-nic.sample
├── include  # header files
│   ├── cfgfile.h
│   ├── common.h
│   ├── conf
│   ├── ctrl.h
│   ├── dpdk.h
│   ├── flow.h
│   ├── global_conf.h
│   ├── icmp.h
│   ├── icmp6.h
│   ├── inet.h
│   ├── inetaddr.h
│   ├── ip_tunnel.h
│   ├── ipset.h
│   ├── ipv4.h
│   ├── ipv4_frag.h
│   ├── ipv6.h
│   ├── ipvs
│   ├── kni.h
│   ├── laddr_multiply.h
│   ├── lb
│   ├── linux_ipv6.h
│   ├── list.h
│   ├── log.h
│   ├── match.h
│   ├── mbuf.h
│   ├── md5.h
│   ├── mempool.h
│   ├── ndisc.h
│   ├── neigh.h
│   ├── netif.h
│   ├── netif_addr.h
│   ├── parser
│   ├── pidfile.h
│   ├── route.h
│   ├── route6.h
│   ├── route6_hlist.h
│   ├── route6_lpm.h
│   ├── sa_pool.h
│   ├── sys_time.h
│   ├── tc
│   ├── timer.h
│   ├── uoa.h
│   └── vlan.h
├── patch  # DPDK patches
│   ├── dpdk-16.07
│   ├── dpdk-20.08
│   ├── dpdk-stable-17.05.2
│   ├── dpdk-stable-17.11.2
│   └── dpdk-stable-19.11
├── scripts  # operations scripts
│   ├── ipvs-tunnel.rs.deploy.sh
│   ├── setup.fnat.two-arm.sample.sh
│   ├── setup.snat-gre.sample.sh
│   ├── setup.snat.sample.sh
│   └── setup.tc.sample.sh
├── src  # core source code
│   ├── Makefile
│   ├── VERSION
│   ├── cfgfile.c
│   ├── common.c
│   ├── config.mk
│   ├── ctrl.c
│   ├── dpdk.mk
│   ├── global_conf.c
│   ├── icmp.c
│   ├── inet.c
│   ├── inetaddr.c
│   ├── ip_gre.c
│   ├── ip_tunnel.c
│   ├── ipip.c
│   ├── ipset.c
│   ├── ipv4.c
│   ├── ipv4_frag.c
│   ├── ipv6
│   ├── ipvs  # ipvs business logic
│   ├── kni.c
│   ├── laddr_multiply.c
│   ├── lb    # lb forwarding logic
│   ├── log.c
│   ├── main.c
│   ├── mbuf.c
│   ├── mempool.c
│   ├── neigh.c
│   ├── netif.c
│   ├── netif_addr.c
│   ├── parser.c
│   ├── pidfile.c
│   ├── route.c
│   ├── sa_pool.c
│   ├── sys_time.c
│   ├── tc
│   ├── timer.c
│   └── vlan.c
└── tools  # tools
    ├── Makefile
    ├── dpip
    ├── ipvsadm
    ├── keepalived
    └── lbdebug

Configuration Walkthrough

global_defs, global settings: log level, log path, and so on.

! global config
global_defs {
    log_level   DEBUG # convenient for debugging
    ! log_file    /var/log/hdslb.log
    ! log_async_mode    on
}

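netif_defs, NIC device settings:
1. pktpool sets the size of the DPDK memory pool and its cache.
2. device declares a DPDK NIC device.
3. rx/tx set the number of hardware queues on the device.
4. bonding declares DPDK NIC bonding.
5. kni names the KNI virtual network interface that corresponds to the DPDK device. A DPDK program takes over the physical NIC and its traffic bypasses the kernel, so other programs that need the interface, ssh for example, must go through a virtual interface provided by the KNI module; the DPDK program hands traffic it is not interested in to the kernel.
6. RSS (Receive Side Scaling): sets how the device's multiple receive queues map to CPU cores, so that multi-queue NICs and multi-core CPUs are both put to use and throughput and per-packet efficiency improve.
7. FDIR (Flow Director): sets the device's flow identification and classification mode, which speeds up the handling of specific traffic such as mice flows and elephant flows.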

! netif config
netif_defs {
    <init> pktpool_size     1048575
    <init> pktpool_cache    256
    <init> device dpdk0 {
        rx {
            queue_number        3
            descriptor_number   1024
            ! rss                 all
        }
        tx {
            queue_number        3
            descriptor_number   1024
        }
        fdir {
            mode                perfect
            pballoc             64k
            status              matched
        }
        ! promisc_mode
        kni_name                dpdk0.kni
    }
    ! <init> bonding bond0 {
    !    mode        0
    !    slave       dpdk0
    !    slave       dpdk1
    !    primary     dpdk0
    !    kni_name    bond0.kni
    !}
}
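
To make the RSS item above concrete, here is a minimal sketch of how a receive queue might be chosen for a flow. Everything in it is illustrative rather than taken from HDSLB: `flow_tuple`, `flow_hash`, `reta_init`, `rss_select_queue`, and `RETA_SIZE` are hypothetical names, and FNV-1a stands in for the Toeplitz hash that NICs typically compute in hardware.

```c
#include <stdint.h>
#include <stddef.h>

#define RETA_SIZE 128   /* hypothetical indirection-table size */

struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a over the tuple fields; a software stand-in for the NIC hash. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
    uint32_t words[4] = { t->src_ip, t->dst_ip,
                          ((uint32_t)t->src_port << 16) | t->dst_port,
                          t->proto };
    const uint8_t *p = (const uint8_t *)words;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(words); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Fill the indirection table round-robin over nb_queues RX queues. */
static void reta_init(uint16_t reta[RETA_SIZE], uint16_t nb_queues)
{
    for (size_t i = 0; i < RETA_SIZE; i++)
        reta[i] = (uint16_t)(i % nb_queues);
}

/* Pick the RX queue for a flow: the hash indexes the indirection table. */
static uint16_t rss_select_queue(const uint16_t reta[RETA_SIZE],
                                 const struct flow_tuple *t)
{
    return reta[flow_hash(t) % RETA_SIZE];
}
```

Because the queue depends only on the 5-tuple, every packet of a flow lands on the same queue and therefore the same core. That keeps per-flow state local, but it also means a single very heavy flow cannot be spread out by RSS alone.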

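worker_defs, worker core settings: DPDK abstracts a CPU as an lcore, of type master or slave. Typically the master handles the Control Plane and the slaves handle the Data Plane. Each lcore can serve multiple queues on multiple NIC devices. DPDK abstracts a NIC device as a Port; rx_queue_ids and tx_queue_ids are the receive and transmit queue IDs. isol_rx_cpu_ids assigns an lcore that does nothing but receive packets, and isol_rxq_ring_sz sets the size of the ring buffer used by that dedicated receiver.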

! worker config (lcores)
worker_defs {
    # control plane CPU
    <init> worker cpu0 {
        type    master
        cpu_id  0
    }
    # data plane CPU
    # the same RX/TX queue of the two Ports dpdk0 and dpdk1 shares one CPU
    <init> worker cpu1 {
        type    slave
        cpu_id  1
        port    dpdk0 {
            rx_queue_ids     0
            tx_queue_ids     0
            ! isol_rx_cpu_ids  9
            ! isol_rxq_ring_sz 1048576
        }
        port    dpdk1 {
            rx_queue_ids     0
            tx_queue_ids     0
            ! isol_rx_cpu_ids  9
            ! isol_rxq_ring_sz 1048576
        }
    }
}

timer_defs, timer settings:

! timer config
timer_defs {
    # cpu job loops to schedule dpdk timer management
    schedule_interval    500
}

neigh_defs, neighbor subsystem settings: the Lite IP-Stack includes an L3 Route subsystem and an L2 Neighbor subsystem.

! hdslb neighbor config
neigh_defs {
    <init> unres_queue_length  128
    <init> timeout             60
}

ipv4_defs / ipv6_defs, L3 network settings:

! hdslb ipv4 config
ipv4_defs {
    forwarding                 off
    <init> default_ttl         64
    fragment {
        <init> bucket_number   4096
        <init> bucket_entries  16
        <init> max_entries     4096
        <init> ttl             1
    }
}

! hdslb ipv6 config
ipv6_defs {
    disable                     off
    forwarding                  off
    route6 {
        <init> method           hlist
        recycle_time            10
    }
}

ctrl_defs, control plane settings: communication uses a local Unix socket.

! control plane config
ctrl_defs {
    lcore_msg {
        <init> ring_size                4096
        sync_msg_timeout_us             30000000
        priority_level                  low
    }
    ipc_msg {
        <init> unix_domain /var/run/hdslb_ctrl
    }
}

ipvs_defs, core settings:
1. conn configures the resources of the connection tracking (conntrack) table.
2. udp/tcp configure protocol handling.
3. synproxy holds the settings related to TCP SYN floods.

! ipvs config
ipvs_defs {
    conn {
        <init> conn_pool_size       2097152
        <init> conn_pool_cache      256
        conn_init_timeout           30
        ! expire_quiescent_template
        ! fast_xmit_close
        ! <init> redirect           off
    }

    udp {
        ! defence_udp_drop
        uoa_mode        opp
        uoa_max_trail   3
        timeout {
            normal      300
            last        3
        }
    }

    tcp {
        ! defence_tcp_drop
        timeout {
            none        2
            established 90
            syn_sent    3
            syn_recv    30
            fin_wait    7
            time_wait   7
            close       3
            close_wait  7
            last_ack    7
            listen      120
            synack      30
            last        2
        }
        synproxy {
            synack_options {
                mss             1452
                ttl             63
                sack
                ! wscale
                ! timestamp
            }
            ! defer_rs_syn
            rs_syn_max_retry    3
            ack_storm_thresh    10
            max_ack_saved       3
            conn_reuse_state {
                close
                time_wait
                ! fin_wait
                ! close_wait
                ! last_ack
           }
        }
    }
}

sa_pool, FDIR sa_pool settings:

! sa_pool config
sa_pool {
    pool_hash_size   16
}

Startup Flow

Program entry point:

high-density-scalable-load-balancer/src/main.c main(int argc, char *argv[])

Check the number of NUMA nodes:

    if (get_numa_nodes() > DPVS_MAX_SOCKET) {
        fprintf(stderr, "DPVS_MAX_SOCKET is smaller than system numa nodes!\n");
        return -1;
    }

Set CPU affinity: NUMA affinity and CPU pinning are hallmark features of DPDK programs.

    if (set_all_thread_affinity() != 0) {
        fprintf(stderr, "set_all_thread_affinity failed\n");
        exit(EXIT_FAILURE);
    }

Initialize the RTE runtime: this sets up the basic DPDK runtime environment. For details, see "DPDK: EAL Environment Abstraction Layer".

    err = rte_eal_init(argc, argv);
    if (err < 0)
        rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n");
    argc -= err, argv += err;

Enter the core HDSLB logic:

    RTE_LOG(INFO, DPVS, "HDSLB version: %s, build on %s\n", HDSLB_VERSION, HDSLB_BUILD_DATE);

Initialize the configuration parser: load and parse the hdslb.conf configuration file.

    if ((err = cfgfile_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail init configuration file: %s\n",
                 dpvs_strerror(err));

Configure bond virtual interfaces: nothing happens if the configuration file defines no bond.

    if ((err = netif_virtual_devices_add()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail add virtual devices:%s\n",
                 dpvs_strerror(err));

Initialize lcore timers: each lcore has its own timer, initialized underneath by timer_lcore_init and used for logic such as conn aging.

    if ((err = dpvs_timer_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail init timer on %s\n", dpvs_strerror(err));

Initialize the traffic control (tc) module:

    if ((err = tc_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init traffic control: %s\n",
                 dpvs_strerror(err));

Initialize DPDK NIC devices: the core Data Plane job handlers are registered here. netif_init->netif_lcore_init registers three NETIF_LCORE_JOB_LOOP jobs.

    if ((err = netif_init(NULL)) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init netif: %s\n", dpvs_strerror(err));
    /* Default lcore conf and port conf are used and may be changed here
     * with "netif_port_conf_update" and "netif_lcore_conf_set" */

Initialize the ctrl and tc_ctrl control plane interfaces: ctrl_init->msg_init registers one NETIF_LCORE_JOB_LOOP job.

    if ((err = ctrl_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init ctrl plane: %s\n",
                 dpvs_strerror(err));

    if ((err = tc_ctrl_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init tc control plane: %s\n",
                 dpvs_strerror(err));

Initialize L2 VLAN networking:

    if ((err = vlan_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init vlan: %s\n", dpvs_strerror(err));

Initialize the inet (IPv4) network stack: inet_init registers a series of NETIF_LCORE_JOB_XXX jobs and is central to the L4 LB.

    if ((err = inet_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init inet: %s\n", dpvs_strerror(err));

Initialize the FDIR sa_pool:

    if ((err = sa_pool_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init sa_pool: %s\n", dpvs_strerror(err));

Initialize tunnels: nothing happens if the configuration file does not enable IP tunnel mode.

    if ((err = ip_tunnel_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init tunnel: %s\n", dpvs_strerror(err));

Initialize the original LVS functionality, including registering dp_vs_in and dp_vs_pre_routing, the hook functions that process IPv4 packets.

    if ((err = dp_vs_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init ipvs: %s\n", dpvs_strerror(err));

Initialize the netif_ctrl control plane interface:

    if ((err = netif_ctrl_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init netif_ctrl: %s\n",
                 dpvs_strerror(err));

Start the DPDK network devices: Ports, rx/tx queues, CPU binding, and so on.

    /* config and start all available dpdk ports */
    nports = rte_eth_dev_count_avail();
    for (pid = 0; pid < nports; pid++) {
        dev = netif_port_get(pid);
        if (!dev) {
            RTE_LOG(WARNING, DPVS, "port %d not found\n", pid);
            continue;
        }

        err = netif_port_start(dev);
        if (err != EDPVS_OK)
            rte_exit(EXIT_FAILURE, "Start %s failed, skipping ...\n",
                    dev->name);

    }

Start Data Plane processing: rte_eal_mp_remote_launch invokes netif_loop on every slave lcore, which then enters the worker thread's infinite loop.

// src/main.c
    /* start data plane threads */
    netif_lcore_start();

// src/netif.c
int netif_lcore_start(void)
{
    rte_eal_mp_remote_launch(netif_loop, NULL, SKIP_MASTER);
    return EDPVS_OK;
}

// src/netif.c
static int netif_loop(void *dummy)

Start Control Plane processing: the master thread enters its infinite loop.

    /* start control plane thread */
    while (1) {
        /* reload configuations if reload flag is set */
        try_reload();
        /* IPC loop */
        sockopt_ctl(NULL);
        /* msg loop */
        msg_master_process(0);
        /* timer */
        now_cycles = rte_get_timer_cycles();
        if ((now_cycles - prev_cycles) * 1000000 / cycles_per_sec > timer_sched_interval_us) {
            rte_timer_manage();
            prev_cycles = now_cycles;
        }
        /* kni */
        kni_process_on_master();

        /* process mac ring on master */
        neigh_process_ring(NULL, 0);

        /* increase loop counts */
        netif_update_master_loop_cnt();
    }
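
The timer check in the loop above converts a cycle delta to microseconds before comparing it with the interval. A minimal sketch of that arithmetic follows; `timer_due` is a hypothetical helper name, and its parameters simply mirror the variables in the excerpt.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * True when strictly more than interval_us microseconds separate the two
 * cycle readings. Unsigned subtraction keeps the delta correct even if the
 * cycle counter wrapped between the readings.
 */
static bool timer_due(uint64_t now_cycles, uint64_t prev_cycles,
                      uint64_t cycles_per_sec, uint64_t interval_us)
{
    uint64_t elapsed_us =
        (now_cycles - prev_cycles) * 1000000 / cycles_per_sec;

    return elapsed_us > interval_us;
}
```

With a 2 GHz timer and a 500 us interval, a delta of 1002000 cycles (501 us) triggers the timer, while a delta of exactly 1000000 cycles (500 us) does not, because the comparison is strict. The multiplication by 1000000 only overflows 64 bits when the delta exceeds roughly 1.8e13 cycles, which a continuously spinning loop is unlikely to reach.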

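Data Plane Job Registration

As noted above, main's initialization sequence registers a series of Data Plane jobs and stores them in the global variable netif_lcore_jobs. Each job is essentially a function pointer that is called at a particular stage of packet processing.

main->netif_init->netif_lcore_init registers three NETIF_LCORE_JOB_LOOP jobs: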
1. lcore_job_recv_fwd
2. lcore_job_xmit
3. lcore_job_timer_manage

static void netif_lcore_init(void)
{
......
    /* register lcore jobs*/
    snprintf(netif_jobs[0].name, sizeof(netif_jobs[0].name) - 1, "%s", "recv_fwd");
    netif_jobs[0].func = lcore_job_recv_fwd;
    netif_jobs[0].data = NULL;
    netif_jobs[0].type = NETIF_LCORE_JOB_LOOP;
    
    snprintf(netif_jobs[1].name, sizeof(netif_jobs[1].name) - 1, "%s", "xmit");
    netif_jobs[1].func = lcore_job_xmit;
    netif_jobs[1].data = NULL;
    netif_jobs[1].type = NETIF_LCORE_JOB_LOOP;
    
    snprintf(netif_jobs[2].name, sizeof(netif_jobs[2].name) - 1, "%s", "timer_manage");
    netif_jobs[2].func = lcore_job_timer_manage;
    netif_jobs[2].data = NULL;
    netif_jobs[2].type = NETIF_LCORE_JOB_LOOP;
}

main->ctrl_init->msg_init registers one NETIF_LCORE_JOB_LOOP job:
1. slave_lcore_loop_func

    /* register netif-lcore-loop-job for Slaves */
    snprintf(ctrl_lcore_job.name, sizeof(ctrl_lcore_job.name) - 1, "%s", "slave_ctrl_plane");
    ctrl_lcore_job.func = slave_lcore_loop_func;
    ctrl_lcore_job.data = NULL;
    ctrl_lcore_job.type = NETIF_LCORE_JOB_LOOP;

main->inet_init->ipv4_init->ipv4_frag_init registers one NETIF_LCORE_JOB_SLOW job:
1. ipv4_frag_job

    snprintf(frag_job.name, sizeof(frag_job.name) - 1, "%s", "ipv4_frag");
    frag_job.func = ipv4_frag_job;
    frag_job.data = NULL;
    frag_job.type = NETIF_LCORE_JOB_SLOW;
    frag_job.skip_loops = IP4_FRAG_FREE_DEATH_ROW_INTERVAL;

main->inet_init->neigh_init->arp_init also registers a NETIF_LCORE_JOB_SLOW job:
1. neigh_process_ring

    /*get static arp entry from master*/
    snprintf(neigh_sync_job.name, sizeof(neigh_sync_job.name) - 1, "%s", "neigh_sync");
    neigh_sync_job.func = neigh_process_ring;
    neigh_sync_job.data = NULL;
    neigh_sync_job.type = NETIF_LCORE_JOB_SLOW;
    neigh_sync_job.skip_loops = NEIGH_PROCESS_MAC_RING_INTERVAL;

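There are other NETIF_LCORE_JOB_XXX jobs that are not listed here; they are left aside for now. All of these jobs are called from netif_loop to carry out Data Plane packet reception, processing, and forwarding.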
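
The registration pattern above, filling in a job descriptor and adding it to a per-type list, can be sketched in a few lines. This is an illustration under assumptions, not HDSLB's code: `lcore_job`, `job_lists`, `job_register`, `job_count`, and `count_call` are hypothetical names, and a plain singly linked list replaces the kernel-style list that the real netif_lcore_jobs appears to use.

```c
#include <stddef.h>

enum job_type { JOB_INIT, JOB_LOOP, JOB_SLOW, JOB_TYPE_MAX };

struct lcore_job {
    const char       *name;
    void            (*func)(void *data);
    void             *data;
    enum job_type     type;
    unsigned          skip_loops;   /* used by JOB_SLOW only */
    struct lcore_job *next;
};

/* One list per job type (hypothetical stand-in for netif_lcore_jobs). */
static struct lcore_job *job_lists[JOB_TYPE_MAX];

/* Append job to the list for its type. Returns 0, or -1 on a bad or
 * already registered job. */
static int job_register(struct lcore_job *job)
{
    struct lcore_job **pp;

    if (!job || !job->func || (unsigned)job->type >= JOB_TYPE_MAX)
        return -1;
    for (pp = &job_lists[job->type]; *pp; pp = &(*pp)->next)
        if (*pp == job)
            return -1;
    job->next = NULL;
    *pp = job;
    return 0;
}

/* Count the jobs registered under one type. */
static size_t job_count(enum job_type type)
{
    size_t n = 0;

    for (struct lcore_job *j = job_lists[type]; j; j = j->next)
        n++;
    return n;
}

/* Example job body: bumps the counter passed through data. */
static void count_call(void *data)
{
    ++*(unsigned *)data;
}
```

Appending at the tail preserves registration order, so jobs of the same type run in the order in which the init functions registered them.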

Data Plane Job Execution

netif_lcore_jobs holds the Data Plane processing logic and is run as a chain, similar to the chains in kernel netfilter. Within netif_loop the job functions are called in this order: lcore_job_recv_fwd -> lcore_job_xmit -> lcore_job_timer_manage -> slave_lcore_loop_func -> ipv4_frag_job -> neigh_process_ring.

static int netif_loop(void *dummy)
{
    struct netif_lcore_loop_job *job;
    // get the current lcore id
    lcoreid_t cid = rte_lcore_id();
    enum netif_principal_status stat = NETIF_PRINCIPAL_STAT_IDLE;
    lb_cycle_t deadline;
    int ret;

    assert(LCORE_ID_ANY != cid);

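    // Is this lcore configured through isol_rx_cpu_ids as a dedicated receiver? If so it loops forever here; otherwise continue.
    // The design idea is presumably to separate the cores that receive packets from those that process them, raising NIC throughput.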
    try_isol_rxq_lcore_loop();
    if (0 == lcore_conf[lcore2index[cid]].nports) {
        // no port is assigned to this lcore
        RTE_LOG(INFO, NETIF, "[%s] Lcore %d has nothing to do.\n", __func__, cid);
        return EDPVS_IDLE;
    }

    // first, run the NETIF_LCORE_JOB_INIT jobs registered on the lcore right away
    list_for_each_entry(job, &netif_lcore_jobs[NETIF_LCORE_JOB_INIT], list) {
        do_lcore_job(job, 0);
    }

    while (1) {
        lcore_stats[cid].lcore_loop++;
        deadline = lcore_timer_hz_cycle() + rte_rdtsc();

        // run the NETIF_LCORE_JOB_HIGH jobs registered on the lcore
        do {       
            list_for_each_entry(job, &netif_lcore_jobs[NETIF_LCORE_JOB_HIGH], list) {
                ret = do_lcore_job(job, 0);
                if (ret) {
                    stat = ret;
                }
            }
        } while (stat == NETIF_PRINCIPAL_STAT_FULL && rte_rdtsc() < deadline);

        // run the NETIF_LCORE_JOB_LOOP jobs registered on the lcore
        list_for_each_entry(job, &netif_lcore_jobs[NETIF_LCORE_JOB_LOOP], list) {
            do_lcore_job(job, stat);
        }
        
        // at intervals, run the NETIF_LCORE_JOB_SLOW jobs registered on the lcore
        ++netif_loop_tick[cid];
        list_for_each_entry(job, &netif_lcore_jobs[NETIF_LCORE_JOB_SLOW], list) {
            if (netif_loop_tick[cid] % job->skip_loops == 0) {
                do_lcore_job(job, stat);
                //netif_loop_tick[cid] = 0;
            }
        }

        // run the NETIF_LCORE_JOB_IDLE jobs registered on the lcore
        if (is_lcore_idle(cid)) {
            /* TODO NETIF_PRINCIPAL_STAT_IDLE == stat is strict, consider lcore_stats[cid].opackets */
            list_for_each_entry(job, &netif_lcore_jobs[NETIF_LCORE_JOB_IDLE], list) {
                do_lcore_job(job, stat);
            }
            rec_lcore_tx_idle_credit(cid);
        } else {
            inc_lcore_tx_idle_credit(cid);
        }
    }
    return EDPVS_OK;
}
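
The difference between LOOP and SLOW jobs in netif_loop comes down to the tick check `netif_loop_tick[cid] % job->skip_loops == 0`. The sketch below isolates that gating. It is a simplification under assumptions: `loop_job`, `run_loop`, and `count_call` are hypothetical names, both job kinds share one array, and a `skip_loops` of 0 is treated as "run on every pass" so the modulo never divides by zero.

```c
#include <stddef.h>
#include <stdint.h>

struct loop_job {
    void   (*func)(void *data);
    void    *data;
    uint32_t skip_loops;   /* 0: run on every pass; N: run on every Nth pass */
};

/* Example job body: bumps the counter passed through data. */
static void count_call(void *data)
{
    ++*(unsigned *)data;
}

/* Make `passes` trips through the job array, gating slow jobs on the tick. */
static void run_loop(const struct loop_job *jobs, size_t n, uint64_t passes)
{
    uint64_t tick = 0;

    for (uint64_t p = 0; p < passes; p++) {
        ++tick;
        for (size_t i = 0; i < n; i++) {
            if (jobs[i].skip_loops == 0 || tick % jobs[i].skip_loops == 0)
                jobs[i].func(jobs[i].data);
        }
    }
}
```

Over 10 passes, a job with `skip_loops` of 0 runs 10 times and a job with `skip_loops` of 4 runs twice, on ticks 4 and 8. Housekeeping work such as fragment cleanup can therefore stay off the per-packet path.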

Forwarding Flow

The following uses the lcore_job_recv_fwd job to trace the Data Plane forwarding path.

Receive Stage

lcore_job_recv_fwd receives packets.

static int lcore_job_recv_fwd(void *arg __rte_unused, int high_stat __rte_unused)
{
    int i, j;
    portid_t pid;
    lcoreid_t cid;
    uint32_t nic_type;
    struct netif_queue_conf *qconf;
    enum netif_principal_status stat = NETIF_PRINCIPAL_STAT_IDLE;

    cid = rte_lcore_id();
    assert((LCORE_ID_ANY != cid) && cid < DPVS_MAX_LCORE);

    for (i = 0; i < lcore_conf[lcore2index[cid]].nports; i++) {
        pid = lcore_conf[lcore2index[cid]].pqs[i].id;
        nic_type = lcore_conf[lcore2index[cid]].pqs[i].nic_type;
        assert(pid <= bond_pid_end);

        for (j = 0; j < lcore_conf[lcore2index[cid]].pqs[i].nrxq; j++) {
            qconf = &lcore_conf[lcore2index[cid]].pqs[i].rxqs[j];

            // fetch ARP packets from arp_ring
            lcore_process_arp_ring(qconf, cid);
            lcore_process_redirect_ring(qconf, cid);
            qconf->len = netif_rx_burst(pid, qconf, nic_type);

            stat = lcore_update_rx_principal_status(qconf->len, stat);
            lcore_stats_burst(&lcore_stats[cid], qconf->len);

            lcore_process_marked_flow(qconf);
            lcore_stats[cid].impackets += qconf->marked_cnt;
            lb_redirect_ring_proc(qconf, cid);

            // after reading from the NIC queue, call lcore_process_packets to process the packets
            lcore_process_packets(qconf, qconf->mbufs, cid, qconf->len, 0);
            kni_send2kern_loop(pid, qconf);
        }
    }

    return stat;
}

L2 Processing Stage

lcore_process_packets handles L2 frames.

void lcore_process_packets(struct netif_queue_conf *qconf, struct rte_mbuf **mbufs,
                      lcoreid_t cid, uint16_t count, bool pkts_from_ring)
{
......
    /* L2 filter */
    for (i = 0; i < count; i++) {
        struct rte_mbuf *mbuf = mbufs[i];
        struct netif_port *dev;

        if (t < count) {
            rte_prefetch0(rte_pktmbuf_mtod(mbufs[t], void *));
            t++;
        }

        dev = netif_port_get(mbuf->port);
        if (unlikely(!dev)) {
            rte_pktmbuf_free(mbuf);
            lcore_stats[cid].dropped++;
            continue;
        }
        if (dev->type == PORT_TYPE_BOND_SLAVE) {
            dev = dev->bond->slave.master;
            mbuf->port = dev->id;
        }

        eth_hdr = rte_pktmbuf_mtod(mbuf, struct rte_ether_hdr *);
        /* reuse mbuf.packet_type, it was RTE_PTYPE_XXX */
        mbuf->packet_type = eth_type_parse(eth_hdr, dev);

        /*
         * In NETIF_PORT_FLAG_FORWARD2KNI mode.
         * All packets received are deep copied and sent to  KNI
         * for the purpose of capturing forwarding packets.Since the
         * rte_mbuf will be modified in the following procedure,
         * we should use mbuf_copy instead of rte_pktmbuf_clone.
         */
        if (dev->flag & NETIF_PORT_FLAG_FORWARD2KNI) {
            if (likely(NULL != (mbuf_copied = mbuf_copy(mbuf,
                                pktmbuf_pool[dev->socket]))))
                kni_ingress(mbuf_copied, dev, qconf);
            else
                RTE_LOG(WARNING, NETIF, "%s: Failed to copy mbuf\n",
                        __func__);
        }

        /*
         * do not drop pkt to other hosts (ETH_PKT_OTHERHOST)
         * since virtual devices may have different MAC with
         * underlying device.
         */

        /*
         * handle VLAN
         * if HW offload vlan strip, it's still need vlan module
         * to act as VLAN filter.
         */
        if (eth_hdr->ether_type == htons(ETH_P_8021Q) ||
            mbuf->ol_flags & PKT_RX_VLAN_STRIPPED) {

            if (vlan_rcv(mbuf, netif_port_get(mbuf->port)) != EDPVS_OK) {
                rte_pktmbuf_free(mbuf);
                lcore_stats[cid].dropped++;
                continue;
            }

            dev = netif_port_get(mbuf->port);
            if (unlikely(!dev)) {
                rte_pktmbuf_free(mbuf);
                lcore_stats[cid].dropped++;
                continue;
            }

            eth_hdr = rte_pktmbuf_mtod(mbuf, struct rte_ether_hdr *);
        }

        if (lb_sync_lcore_is_backup() && lb_sync_process_message(mbuf) == 0) {
            /*mbuf is freed in lb_sync_process_message */
            deliver_mbuf = false;
        } else {
            deliver_mbuf = true;
        }

        // once link-layer filtering is done, call netif_deliver_mbuf to enter the IP layer
        if (likely(deliver_mbuf)) {
            /* handler should free mbuf */
            netif_deliver_mbuf(mbuf, eth_hdr->ether_type, dev, qconf,
                (dev->flag & NETIF_PORT_FLAG_FORWARD2KNI) ? true : false,
                cid, pkts_from_ring);
        }

        lcore_stats[cid].ibytes += mbuf->pkt_len;
        lcore_stats[cid].ipackets++;
    }
}

L3 Processing Stage

netif_deliver_mbuf handles L3 IP packets:

static inline int netif_deliver_mbuf(struct rte_mbuf *mbuf,
                                     uint16_t eth_type,
                                     struct netif_port *dev,
                                     struct netif_queue_conf *qconf,
                                     bool forward2kni,
                                     lcoreid_t cid,
                                     bool pkts_from_ring)
{
......
    /* Remove ether_hdr at the beginning of an mbuf */
    data_off = mbuf->data_off;
    if (unlikely(NULL == rte_pktmbuf_adj(mbuf, sizeof(struct rte_ether_hdr))))
        return EDPVS_INVPKT;

    // Only two pkt_type handlers are registered at the IP layer: ip4_pkt_type and arp_pkt_type.
    // ip4_pkt_type is registered in ipv4_init.
    // arp_pkt_type is registered in arp_init.
    err = pt->func(mbuf, dev);
......

// For IPv4 packets, pt->func is in fact ipv4_rcv.
static struct pkt_type ip4_pkt_type = {
    //.type       = rte_cpu_to_be_16(ETHER_TYPE_IPv4),
    .func       = ipv4_rcv,
    .port       = NULL,
};

static struct pkt_type arp_pkt_type = {
    //.type       = rte_cpu_to_be_16(ETHER_TYPE_ARP),
    .func       = neigh_resolve_input,
    .port       = NULL,
};
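
The `pt->func(mbuf, dev)` call above is a dispatch on the Ethernet type. A minimal sketch of that idea follows, with all names (`packet`, `pkt_handler`, `handle_ipv4`, `handle_arp`, `deliver`) hypothetical and the Ethernet type kept in host byte order for simplicity. What the real code does with an unmatched type is not shown in the excerpt, so the sketch just returns -1.

```c
#include <stdint.h>
#include <stddef.h>

#define ETH_TYPE_IPV4 0x0800
#define ETH_TYPE_ARP  0x0806

enum { HANDLED_IPV4 = 1, HANDLED_ARP = 2 };

struct packet {
    uint16_t eth_type;   /* host byte order in this sketch */
};

struct pkt_handler {
    uint16_t type;
    int    (*func)(struct packet *pkt);
};

/* Stand-ins for ipv4_rcv and neigh_resolve_input. */
static int handle_ipv4(struct packet *pkt) { (void)pkt; return HANDLED_IPV4; }
static int handle_arp(struct packet *pkt)  { (void)pkt; return HANDLED_ARP; }

static const struct pkt_handler handlers[] = {
    { ETH_TYPE_IPV4, handle_ipv4 },
    { ETH_TYPE_ARP,  handle_arp  },
};

/* Call the handler registered for the packet's Ethernet type.
 * Returns the handler's result, or -1 when no handler matches. */
static int deliver(struct packet *pkt)
{
    for (size_t i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++)
        if (handlers[i].type == pkt->eth_type)
            return handlers[i].func(pkt);
    return -1;
}
```

With only two registered types a linear scan is enough; a larger handler set would likely call for a hash keyed on the Ethernet type.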

ipv4_rcv

static int ipv4_rcv(struct rte_mbuf *mbuf, struct netif_port *port)
{
......
    // after a series of sanity checks, ipv4_rcv calls INET_HOOK
    return INET_HOOK(INET_HOOK_PRE_ROUTING, mbuf, port, NULL, ipv4_rcv_fin);

INET_HOOK

int INET_HOOK(unsigned int hook, struct rte_mbuf *mbuf,
        struct netif_port *in, struct netif_port *out,
        int (*okfn)(struct rte_mbuf *mbuf))
{
......
    // inet_hooks are registered in dp_vs_init
    state.hook = hook;
    hook_list = &inet_hooks[hook];
    
    ops = list_entry(hook_list, struct inet_hook_ops, list);

    if (!list_empty(hook_list)) {
        verdict = INET_ACCEPT;
        list_for_each_entry_continue(ops, hook_list, list) {
repeat:
            // runs the registered hooks in turn: dp_vs_in and dp_vs_pre_routing
            verdict = ops->hook(ops->priv, mbuf, &state);
            if (verdict != INET_ACCEPT) {
                if (verdict == INET_REPEAT)
                    goto repeat;
                break;
            }
        }
    }
}
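
The verdict handling in INET_HOOK follows a familiar netfilter-style pattern: run each hook, stop on the first non-accept verdict, and re-run a hook that asks to be repeated. The sketch below models only that control flow. The names (`verdict`, `pkt_ctx`, `run_hooks`, and the three sample hooks) are hypothetical, and the `okfn` callback that the real function receives is left out.

```c
#include <stddef.h>

enum verdict { V_ACCEPT, V_DROP, V_REPEAT };

struct pkt_ctx {
    int hook_calls;     /* how many hook invocations this packet has seen */
    int repeats_left;   /* how many times hook_retry will still ask to repeat */
};

typedef enum verdict (*hook_fn)(struct pkt_ctx *p);

static enum verdict hook_count(struct pkt_ctx *p)
{
    p->hook_calls++;
    return V_ACCEPT;
}

static enum verdict hook_drop(struct pkt_ctx *p)
{
    p->hook_calls++;
    return V_DROP;
}

static enum verdict hook_retry(struct pkt_ctx *p)
{
    p->hook_calls++;
    if (p->repeats_left > 0) {
        p->repeats_left--;
        return V_REPEAT;
    }
    return V_ACCEPT;
}

/* Run hooks in array order; a hook returning V_REPEAT is called again,
 * and any verdict other than V_ACCEPT ends the chain. */
static enum verdict run_hooks(const hook_fn *hooks, size_t n, struct pkt_ctx *p)
{
    enum verdict v = V_ACCEPT;

    for (size_t i = 0; i < n; i++) {
        do {
            v = hooks[i](p);
        } while (v == V_REPEAT);
        if (v != V_ACCEPT)
            break;
    }
    return v;
}
```

In the chain `{hook_count, hook_drop, hook_count}` the third hook never runs, which is the property a load balancer relies on when an early hook consumes or drops a packet.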

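L4 Processing Stage

The main logic of dp_vs_in is to check whether a conn already exists for the packet's src/dst. If one exists, the packet is forwarded directly; otherwise prot->conn_sched creates a new conn and the packet is then forwarded.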

static int dp_vs_in(void *priv, struct rte_mbuf *mbuf, 
                    const struct inet_hook_state *state)
{
......
    /* packet belongs to existing connection ? */
    conn = prot->conn_lookup(prot, &iph, mbuf, &dir, false);

    // for TCP, conn_sched resolves to tcp_conn_sched
    if (unlikely(!conn)) {
        /* try schedule RS and create new connection */
        if (prot->conn_sched(prot, &iph, mbuf, &conn, &verdict) != EDPVS_OK) {
            /* RTE_LOG(DEBUG, IPVS, "%s: fail to schedule.\n", __func__); */
            return verdict;
        }

        /* only SNAT triggers connection by inside-outside traffic. */
        if (conn->dest->fwdmode == DPVS_FWD_MODE_SNAT)
            dir = DPVS_CONN_DIR_OUTBOUND;
        else
            dir = DPVS_CONN_DIR_INBOUND;
    }

......
    // xmit_inbound forwards the packet to the RS
    // xmit_outbound sends the RS reply back
    /* holding the conn, need a "put" later. */
    if (dir == DPVS_CONN_DIR_INBOUND)
        return xmit_inbound(mbuf, prot, conn);
    else
        return xmit_outbound(mbuf, prot, conn);
}
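
The lookup-then-schedule logic of dp_vs_in can be illustrated with a toy connection table. This is a sketch under stated assumptions, not the HDSLB data structure: the fixed-size open-addressing table, the round-robin choice of real server, and all names (`flow_key`, `conn`, `conn_probe`, `conn_get_rs`, `CONN_TABLE_SIZE`, `NUM_RS`) are hypothetical, and entries are never aged out.

```c
#include <stdint.h>
#include <stddef.h>

#define CONN_TABLE_SIZE 64   /* hypothetical table size */
#define NUM_RS          3    /* hypothetical number of real servers */

struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

struct conn {
    struct flow_key key;
    int             rs;       /* index of the chosen real server */
    int             in_use;
};

static struct conn conn_table[CONN_TABLE_SIZE];
static unsigned    rr_next;   /* round-robin cursor */

static int key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport &&
           a->proto == b->proto;
}

static unsigned key_slot(const struct flow_key *k)
{
    uint32_t h = k->saddr ^ k->daddr ^
                 (((uint32_t)k->sport << 16) | k->dport) ^ k->proto;

    return (h * 2654435761u) % CONN_TABLE_SIZE;
}

/* Find the conn for the key, or the free slot where it would go
 * (linear probing). Returns NULL when the table is full. */
static struct conn *conn_probe(const struct flow_key *k)
{
    unsigned start = key_slot(k);

    for (unsigned i = 0; i < CONN_TABLE_SIZE; i++) {
        struct conn *c = &conn_table[(start + i) % CONN_TABLE_SIZE];
        if (!c->in_use || key_equal(&c->key, k))
            return c;
    }
    return NULL;
}

/* Return the RS index for a packet, creating a conn on a miss.
 * *created is set to 1 only when a new conn was scheduled. */
static int conn_get_rs(const struct flow_key *k, int *created)
{
    struct conn *c = conn_probe(k);

    *created = 0;
    if (!c)
        return -1;
    if (!c->in_use) {           /* miss: pick an RS and record the conn */
        c->key = *k;
        c->rs = (int)(rr_next++ % NUM_RS);
        c->in_use = 1;
        *created = 1;
    }
    return c->rs;
}
```

The first packet of a flow pays for scheduling; every later packet with the same key reuses the recorded real server, which is what keeps a TCP connection pinned to one backend.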

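Advanced Features

Elephant Flow Forwarding Optimization

Modern DPDK programs rely on RSS, the multi-core receive scaling technique, to map different IP 5-tuple flows onto specific cores for processing.

When elephant flows appear, however, 10% of the flows can account for 90% of the total traffic, leaving some cores overloaded while others sit idle. Worse, even a fully saturated core may still be unable to keep up with an elephant flow, and packets are dropped.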

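To address the elephant flow problem, HDSLB optimizes in three areas:

Elephant flow identification: first, tell elephant flows (Heavy) apart from mice flows (Light).
Elephant flow splitting: next, split an elephant flow across multiple cores for parallel processing instead of mapping it to a single core.
Elephant flow reordering: finally, put the traffic that was processed in parallel on multiple cores back into its correct order.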
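
As a purely software illustration of the identification step, the sketch below flags flows by their share of total bytes. It is not how HDSLB does it; the article attributes identification to NIC hardware. The function name `mark_heavy` and the percentage threshold are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/*
 * Mark every flow whose byte count is at least `pct` percent of the total.
 * bytes[i] is the byte count of flow i; heavy[i] receives the verdict.
 * Returns the number of flows marked heavy.
 */
static size_t mark_heavy(const uint64_t *bytes, size_t n, unsigned pct,
                         bool *heavy)
{
    uint64_t total = 0;
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        total += bytes[i];

    for (size_t i = 0; i < n; i++) {
        heavy[i] = total > 0 && bytes[i] * 100 >= total * pct;
        if (heavy[i])
            count++;
    }
    return count;
}
```

For byte counts of 9000, 100, 200, 300, and 400 with a 50 percent threshold, only the first flow is marked heavy: it carries 90 percent of the 10000 bytes seen.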

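As the schematic shows, the key technology is DLB (Dynamic Load Balancer), a hardware feature of Intel CPUs. DLB makes it possible to:

On receive: split an elephant flow across multiple cores for processing.
On transmit: merge the traffic from those cores and restore its order.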
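
The transmit-side reordering can be pictured with a small software reorder buffer: packets receive a sequence number at ingress, workers finish them in any order, and egress releases them strictly in sequence. The article describes DLB doing this in hardware, so the code below is only a conceptual sketch; `reorder_buf`, `reorder_push`, and `REORDER_WINDOW` are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define REORDER_WINDOW 8   /* hypothetical window size */

struct reorder_buf {
    uint32_t next_seq;                  /* next sequence number to release */
    bool     ready[REORDER_WINDOW];
    int      payload[REORDER_WINDOW];
};

/*
 * Hand in a packet that a worker has finished, tagged with the sequence
 * number it got at ingress. Every packet that is now in order is written
 * to out[] (up to out_cap) and the number written is returned. A packet
 * outside the current window is not accepted and 0 is returned.
 */
static size_t reorder_push(struct reorder_buf *rb, uint32_t seq, int payload,
                           int *out, size_t out_cap)
{
    size_t n = 0;

    if (seq - rb->next_seq >= REORDER_WINDOW)
        return 0;
    rb->ready[seq % REORDER_WINDOW] = true;
    rb->payload[seq % REORDER_WINDOW] = payload;

    while (n < out_cap && rb->ready[rb->next_seq % REORDER_WINDOW]) {
        uint32_t slot = rb->next_seq % REORDER_WINDOW;
        out[n++] = rb->payload[slot];
        rb->ready[slot] = false;
        rb->next_seq++;
    }
    return n;
}
```

If packets 1 and 2 finish before packet 0, nothing is released until packet 0 arrives, at which point all three leave in order. That head-of-line wait is the cost of spreading one flow over several cores.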

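More concretely, the HDSLB elephant flow design needs a Switch Filter on the Main Core, built on the Intel NIC FDIR hardware feature, that identifies and marks elephant and mice flows and maps each class to different cores. Doing this in hardware cuts down on software matching and table lookups and therefore performs better. The Worker Cores in turn implement elephant flow splitting on top of the Intel CPU DLB hardware feature, as the following figures show:

[Figure: Main Core]
[Figure: Worker Cores]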

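Fast/Slow Path Separation

Separating the fast and slow paths is by now standard in high-performance forwarding. HDSLB implements a Session/Conn fast path for TCP traffic, which is performance-sensitive and has complex processing logic.

Basic Packet Forwarding Optimization

For basic packet forwarding, HDSLB works on two fronts:

Vectorize: the vector forwarding idea comes from VPP; processing packets of the same kind in batches improves icache/dcache hit rates. Recommended reading: "FD.io/VPP: How VPP Works".
microjobs: the original jobs are further broken down into multiple microjobs sized to fit the icache. Combined with prefetching across pipeline nodes, this further reduces the cost of icache/dcache misses.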
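
The batching-plus-prefetch idea can be sketched in a few lines, in the same spirit as the `rte_prefetch0` lookahead in lcore_process_packets earlier in this article. The sketch assumes GCC or Clang for `__builtin_prefetch`; `pkt`, `process_batch`, and `PREFETCH_AHEAD` are hypothetical names, and summing lengths stands in for real per-packet work.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 4   /* hypothetical lookahead distance */

struct pkt {
    uint32_t len;
};

/* Process a batch of packets, prefetching a few entries ahead so the data
 * is likely in cache by the time the loop reaches it.
 * Returns the total byte count of the batch. */
static uint64_t process_batch(struct pkt *const *pkts, size_t count)
{
    uint64_t bytes = 0;
    size_t i;

    /* Warm up: request the first few packets before the main loop. */
    for (i = 0; i < PREFETCH_AHEAD && i < count; i++)
        __builtin_prefetch(pkts[i]);

    for (i = 0; i < count; i++) {
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(pkts[i + PREFETCH_AHEAD]);
        bytes += pkts[i]->len;
    }
    return bytes;
}
```

The right lookahead distance depends on how much work each packet needs and on memory latency, so it is usually tuned by measurement rather than fixed up front.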

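Closing

With this series, I hope to recommend HDSLB, an excellent open-source project, to readers interested in DPDK Data Plane development. DPVS is already a very capable Data Plane project in its own right, and HDSLB layers on top of it the hardware/software co-acceleration techniques Intel has built up over many years. What is especially valuable is that HDSLB does not settle for being a research project: it delivers a complete, deployable solution to typical production problems such as elephant flows. That is something most open-source projects could learn from.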

