Initialization of Linux scheduling domains and scheduling groups
On systems with multiple CPUs, scheduling domains and scheduling groups are the foundation of task load balancing. This article describes how scheduling domains and scheduling groups are initialized on a multi-CPU x86 system.
What is a scheduling domain?
A large machine may contain hundreds of CPUs, and those CPUs are not fully symmetric: SMT siblings share the L1 cache and some execution units, some CPUs share an L3 cache, CPUs inside the same NUMA node share the memory bus, and there are even larger levels such as a socket, of which the whole system may contain several. The system therefore forms a hierarchical topology. Because these physical levels exist, migrating load without regard for them may not give the best result. To make this hierarchy information usable, a scheduling structure that mirrors the system topology was created: the scheduling domain, struct sched_domain.
What is a scheduling group?
Generally, load balancing is performed inside each scheduling domain. To decide whether the load inside a domain is balanced, the balance is computed between subsets of its CPUs; such a CPU subset is a scheduling group. These subsets are not chosen arbitrarily: normally the CPUs covered by one scheduling group of a domain correspond to one of its child domains. This makes sense, because when we check whether a domain is balanced we are really checking whether its child domains are balanced against each other.
Data structures for scheduling domains and scheduling groups
sched_domain_topology_level
struct sched_domain_topology_level {
	sched_domain_mask_f	mask;
	sched_domain_flags_f	sd_flags;
	int			flags;
	int			numa_level;
	struct sd_data		data;
#ifdef CONFIG_SCHED_DEBUG
	char			*name;
#endif
};
sd_data
struct sd_data {
	struct sched_domain *__percpu		*sd;
	struct sched_domain_shared *__percpu	*sds;
	struct sched_group *__percpu		*sg;
	struct sched_group_capacity *__percpu	*sgc;
};
sched_domain
struct sched_domain {
	/* These fields must be setup */
	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
	struct sched_domain __rcu *child;	/* bottom domain must be null terminated */
	struct sched_group *groups;		/* the balancing groups of the domain */
	unsigned long min_interval;		/* Minimum balance interval ms */
	unsigned long max_interval;		/* Maximum balance interval ms */
	unsigned int busy_factor;		/* less balancing by factor if busy */
	unsigned int imbalance_pct;		/* No balance until over watermark */
	unsigned int cache_nice_tries;		/* Leave cache hot tasks for # tries */
	unsigned int imb_numa_nr;		/* Nr running tasks that allows a NUMA imbalance */

	int nohz_idle;				/* NOHZ IDLE status */
	int flags;				/* See SD_* */
	int level;

	/* Runtime fields. */
	unsigned long last_balance;		/* init to jiffies. units in jiffies */
	unsigned int balance_interval;		/* initialise to 1. units in ms. */
	unsigned int nr_balance_failed;		/* initialise to 0 */

	/* idle_balance() stats */
	u64 max_newidle_lb_cost;
	unsigned long last_decay_max_lb_cost;

	...

#ifdef CONFIG_SCHED_DEBUG
	char *name;
#endif
	union {
		void *private;			/* used during construction */
		struct rcu_head rcu;		/* used during destruction */
	};
	struct sched_domain_shared *shared;

	unsigned int span_weight;
	/*
	 * Span of all CPUs in this domain.
	 *
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long span[];
};
sched_group
struct sched_group {
	struct sched_group *next;		/* Must be a circular list */
	atomic_t ref;

	unsigned int group_weight;
	unsigned int cores;
	struct sched_group_capacity *sgc;
	int asym_prefer_cpu;			/* CPU of highest priority in group */
	int flags;

	/*
	 * The CPUs this group covers.
	 *
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long cpumask[];
};
sched_group_capacity
struct sched_group_capacity {
	atomic_t ref;
	/*
	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
	 * for a single CPU.
	 */
	unsigned long capacity;
	unsigned long min_capacity;		/* Min per-CPU capacity in group */
	unsigned long max_capacity;		/* Max per-CPU capacity in group */
	unsigned long next_update;
	int imbalance;				/* XXX unrelated to capacity but shared group state */

#ifdef CONFIG_SCHED_DEBUG
	int id;
#endif

	unsigned long cpumask[];		/* Balance mask */
};
Initialization of scheduling domains and scheduling groups
The initialization is, of course, done during kernel boot. In the path start_kernel->rest_init->kernel_init->kernel_init_freeable, two functions are involved: smp_init and sched_init_smp.

build_sched_topology fills in an array of struct sched_domain_topology_level and installs it as the global variable sched_domain_topology; this variable is used later when the sched_domains themselves are built.
static void __init build_sched_topology(void)
{
	int i = 0;

#ifdef CONFIG_SCHED_SMT
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT)
	};
#endif
#ifdef CONFIG_SCHED_CLUSTER
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
	};
#endif
#ifdef CONFIG_SCHED_MC
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC)
	};
#endif
	/*
	 * When there is NUMA topology inside the package skip the PKG domain
	 * since the NUMA domains will auto-magically create the right spanning
	 * domains based on the SLIT.
	 */
	if (!x86_has_numa_in_package) {
		x86_topology[i++] = (struct sched_domain_topology_level){
			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(PKG)
		};
	}

	/*
	 * There must be one trailing NULL entry left.
	 */
	BUG_ON(i >= ARRAY_SIZE(x86_topology)-1);

	set_sched_topology(x86_topology);
}
Let us look at the levels mentioned here: SMT, CLS (cluster) and MC. The SMT level covers the hyper-threads inside one core. A cluster is the set of CPUs that share an L2 cache; on x86 the SMT siblings already share L1 and L2, so the cluster domain coincides with the SMT domain and you normally do not see a separate CLS domain on x86 machines. The MC domain covers the CPUs that share an L3 cache (the LLC). On Intel machines a NUMA node typically has a single LLC, so the MC domain may span a whole node; on AMD machines one node may contain several LLCs, so the MC domain sits below the node level.
If there is no NUMA topology inside the package, all CPUs are simply covered by one final domain, PKG.
The initialization here is clearly not complete: it only covers the SMT, CLS and MC domains (plus possibly PKG); the NUMA domains are initialized elsewhere. x86_topology is a global array.
static struct sched_domain_topology_level x86_topology[6];
It holds at most six levels, and the last entry must remain NULL as a terminator. x86_topology is then installed as sched_domain_topology via set_sched_topology.
The NUMA domains are initialized in sched_init_numa. The function is worth reading in full; since it is rather long, only excerpts are shown here. It first determines how many distinct distances exist between NUMA nodes; that number of distances becomes the number of NUMA domain levels.
void sched_init_numa(int offline_node)
{
	struct sched_domain_topology_level *tl;
	unsigned long *distance_map;
	int nr_levels = 0;
	int i, j;
	int *distances;
	struct cpumask ***masks;

	/*
	 * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
	 * unique distances in the node_distance() table.
	 */
	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
	if (!distance_map)
		return;

	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
	for_each_cpu_node_but(i, offline_node) {
		for_each_cpu_node_but(j, offline_node) {
			int distance = node_distance(i, j);

			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
				sched_numa_warn("Invalid distance value range");
				bitmap_free(distance_map);
				return;
			}

			bitmap_set(distance_map, distance, 1);	/* record the distance; equal distances share one bit */
		}
	}
	/*
	 * We can now figure out how many unique distance values there are and
	 * allocate memory accordingly.
	 */
	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);	/* number of distinct distances = number of NUMA levels */
Then the NUMA-level scheduling domains are built according to that number of levels.
	rcu_assign_pointer(sched_domains_numa_masks, masks);	/* the code computing the cpumask of each distance level is omitted */

	/* Compute default topology size */
	for (i = 0; sched_domain_topology[i].mask; i++);	/* sched_domain_topology was already set up: the SMT/CLS/MC levels above */

	tl = kzalloc((i + nr_levels + 1) *			/* with the NUMA levels added, allocate a new array of the right size */
			sizeof(struct sched_domain_topology_level),
			GFP_KERNEL);
	if (!tl)
		return;

	/*
	 * Copy the default topology bits..
	 */
	for (i = 0; sched_domain_topology[i].mask; i++)
		tl[i] = sched_domain_topology[i];

	/*
	 * Add the NUMA identity distance, aka single NODE.
	 */
	tl[i++] = (struct sched_domain_topology_level){		/* the lowest NUMA level is the NODE domain: the local node itself */
		.mask = sd_numa_mask,
		.numa_level = 0,
		SD_INIT_NAME(NODE)
	};

	/*
	 * .. and append 'j' levels of NUMA goodness.
	 */
	for (j = 1; j < nr_levels; i++, j++) {			/* every level spanning multiple nodes is called NUMA */
		tl[i] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.sd_flags = cpu_numa_flags,
			.flags = SDTL_OVERLAP,
			.numa_level = j,
			SD_INIT_NAME(NUMA)
		};
	}

	sched_domain_topology_saved = sched_domain_topology;
	sched_domain_topology = tl;				/* sched_domain_topology no longer points at the original global array */
As you can see, a single node forms the NODE level, while every level spanning several nodes is simply named NUMA. For example, on a two-socket machine the kernel only distinguishes node distances, so the socket level is also treated as a NUMA level. Once the topology array is complete it is assigned to the global variable sched_domain_topology, which therefore no longer points at the original static array.
Initializing the sched_domains
sched_init_domains calls build_sched_domains to initialize the sched_domains.
int __init sched_init_domains(const struct cpumask *cpu_map)
{
	int err;

	ndoms_cur = 1;
	doms_cur = alloc_sched_domains(ndoms_cur);
	if (!doms_cur)
		doms_cur = &fallback_doms;
	cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
	err = build_sched_domains(doms_cur[0], NULL);

	return err;
}
doms_cur excludes the isolated CPUs (the mask is ANDed with housekeeping_cpumask(HK_TYPE_DOMAIN)), so isolated CPUs, for example those listed in the isolcpus= boot parameter, are never used by load balancing.
build_sched_domains is the key function for building the sched_domains; the rest of this article walks through it.
static int build_sched_domains(const struct cpumask *cpu_map,
			       struct sched_domain_attr *attr)
{
	...
	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
	..
}

static enum s_alloc
__visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
{
	memset(d, 0, sizeof(*d));

	if (__sdt_alloc(cpu_map))
		return sa_sd_storage;
	d->sd = alloc_percpu(struct sched_domain *);
	if (!d->sd)
		return sa_sd_storage;
	d->rd = alloc_rootdomain();
	if (!d->rd)
		return sa_sd;

	return sa_rootdomain;
}
It first calls __visit_domain_allocation_hell, which uses __sdt_alloc to allocate the per-CPU sched_domain, sched_group, sched_group_capacity and sched_domain_shared objects for every CPU in cpu_map.
Next, the domain hierarchy is built for each CPU.
	/* Set up domains for CPUs specified by the cpu_map: */
	for_each_cpu(i, cpu_map) {
		struct sched_domain_topology_level *tl;

		sd = NULL;
		for_each_sd_topology(tl) {
			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
				goto error;

			sd = build_sched_domain(tl, cpu_map, attr, sd, i);

			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;

			if (tl == sched_domain_topology)
				*per_cpu_ptr(d.sd, i) = sd;
			if (tl->flags & SDTL_OVERLAP)
				sd->flags |= SD_OVERLAP;
			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
				break;
		}
	}
This is a two-level loop: the outer loop walks the CPUs, the inner loop walks the topology levels, so the sched_domain is a per-CPU object that every CPU owns at every level, and each iteration calls build_sched_domain to create the domain for that level. Note that only the lowest-level domain is stored in the CPU's per-CPU d.sd pointer; the domains form a hierarchy, and all domains of a CPU can be reached by walking upwards from the bottom one.
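As a mental model, here is a minimal sketch (a hypothetical debug helper, not kernel code) of how a CPU's whole domain hierarchy can be reached from the lowest-level per-CPU pointer by following ->parent; the kernel itself uses the for_each_domain() macro for this walk:

/*
 * Hypothetical helper: walk one CPU's sched_domain hierarchy from the
 * bottom level upwards. The walk is done under RCU because domains can
 * be rebuilt at runtime (CPU hotplug, cpuset changes).
 */
static void dump_domains_of(int cpu)
{
	struct sched_domain *sd;

	rcu_read_lock();
	for (sd = rcu_dereference(cpu_rq(cpu)->sd); sd; sd = sd->parent)
		printk("cpu%d: level %d, %u CPUs in span\n",
		       cpu, sd->level, sd->span_weight);
	rcu_read_unlock();
}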
Next, the scheduling groups are built.
	/* Build the groups for the domains */
	for_each_cpu(i, cpu_map) {
		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			sd->span_weight = cpumask_weight(sched_domain_span(sd));
			if (sd->flags & SD_OVERLAP) {
				if (build_overlap_sched_groups(sd, i))
					goto error;
			} else {
				if (build_sched_groups(sd, i))
					goto error;
			}
		}
	}
Every domain of every CPU has its own set of groups, so this is again a two-level loop. Depending on whether the groups may overlap, either build_overlap_sched_groups or build_sched_groups is called.
Next comes an optimization for NUMA nodes with multiple LLCs, which we skip. After that the sched_group_capacity structures are initialized.
	/* Calculate CPU capacity for physical packages and nodes */
	for (i = nr_cpumask_bits-1; i >= 0; i--) {
		if (!cpumask_test_cpu(i, cpu_map))
			continue;

		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			claim_allocations(i, sd);
			init_sched_groups_capacity(i, sd);
		}
	}
Again this structure exists per domain, so it is another two-level loop: the outer loop walks the CPUs, the inner loop walks that CPU's domains. claim_allocations sets the per-CPU pointers in sd_data (including the sched_group_capacity pointer) to NULL for the structures that are actually in use, so that they are not freed when the temporary allocations are released.
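For reference, the body of claim_allocations is roughly the following (a paraphrased sketch of the kernel function, not guaranteed verbatim): anything whose refcount is non-zero has been claimed by a domain or group and must survive the cleanup path.

/* Paraphrased sketch of claim_allocations(): NULL the sd_data slots
 * that are in use so that the cleanup path does not free them. */
static void claim_allocations(int cpu, struct sched_domain *sd)
{
	struct sd_data *sdd = sd->private;

	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
	*per_cpu_ptr(sdd->sd, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
		*per_cpu_ptr(sdd->sds, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
		*per_cpu_ptr(sdd->sg, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
}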
init_sched_groups_capacity computes the cores and capacity of the groups seen from the current CPU.
Where does capacity come from?
CPU capacity is obtained in architecture-specific ways. On systems without heterogeneous CPUs it is normally 1024. init_sched_groups_capacity eventually calls update_cpu_capacity to initialize the capacity of each sched_group_capacity.
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
	unsigned long capacity = scale_rt_capacity(cpu);
	struct sched_group *sdg = sd->groups;

	if (!capacity)
		capacity = 1;

	cpu_rq(cpu)->cpu_capacity = capacity;
	trace_sched_cpu_capacity_tp(cpu_rq(cpu));

	sdg->sgc->capacity = capacity;
	sdg->sgc->min_capacity = capacity;
	sdg->sgc->max_capacity = capacity;
}
scale_rt_capacity computes the CPU's remaining capacity.
static unsigned long scale_rt_capacity(int cpu)
{
	unsigned long max = get_actual_cpu_capacity(cpu);
	struct rq *rq = cpu_rq(cpu);
	unsigned long used, free;
	unsigned long irq;

	irq = cpu_util_irq(rq);

	if (unlikely(irq >= max))
		return 1;

	/*
	 * avg_rt.util_avg and avg_dl.util_avg track binary signals
	 * (running and not running) with weights 0 and 1024 respectively.
	 */
	used = cpu_util_rt(rq);
	used += cpu_util_dl(rq);

	if (unlikely(used >= max))
		return 1;

	free = max - used;

	return scale_irq_capacity(free, irq, max);
}
It takes the maximum capacity of the CPU from get_actual_cpu_capacity, subtracts the utilization of the RT and deadline classes, and then scales the remainder down proportionally by the IRQ utilization to obtain the final capacity.
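For intuition, the final proportional scaling done by scale_irq_capacity amounts to free * (max - irq) / max. A minimal sketch with made-up numbers (illustration only, not kernel code):

/*
 * Minimal sketch of the arithmetic at the end of scale_rt_capacity().
 * All numbers are made up: max = 1024, irq = 64, rt + dl usage = 256.
 */
static unsigned long example_capacity(void)
{
	unsigned long max = 1024, irq = 64, used = 256;
	unsigned long free = max - used;	/* 768 */

	return free * (max - irq) / max;	/* 768 * 960 / 1024 = 720 */
}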
get_actual_cpu_capacity in turn gets the value from arch_scale_cpu_capacity.
unsigned long arch_scale_cpu_capacity(int cpu)
{
	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);

	return SCHED_CAPACITY_SCALE;
}
So, on a non-hybrid machine, arch_scale_cpu_capacity simply returns SCHED_CAPACITY_SCALE, which is 1024 in the kernel.
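For reference, SCHED_CAPACITY_SCALE is a fixed-point scale defined roughly as follows (paraphrased from the kernel headers; exact spelling and location may differ between versions):

/* Paraphrased from the kernel headers: capacity values are fixed point
 * with a scale of 1 << 10 == 1024. */
# define SCHED_FIXEDPOINT_SHIFT		10
# define SCHED_CAPACITY_SHIFT		SCHED_FIXEDPOINT_SHIFT
# define SCHED_CAPACITY_SCALE		(1L << SCHED_CAPACITY_SHIFT)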
Next the domains are attached.
	rcu_read_lock();
	for_each_cpu(i, cpu_map) {
		rq = cpu_rq(i);
		sd = *per_cpu_ptr(d.sd, i);

		cpu_attach_domain(sd, d.rd, i);

		if (lowest_flag_domain(i, SD_CLUSTER))
			has_cluster = true;
	}
cpu_attach_domain is called for every CPU. It first validates the CPU's sched_domain hierarchy and removes domains that should not exist, for instance a domain containing only one CPU, which serves no purpose. It then attaches the CPU to the root_domain and stores it in rq->rd, and finally calls update_top_cache_domain to refresh the topology shortcuts, including:
	per_cpu(sd_llc, cpu)			/* the LLC (last-level cache) domain */
	per_cpu(sd_llc_size, cpu)		/* number of CPUs in the LLC domain */
	per_cpu(sd_llc_id, cpu)			/* representative CPU id of the LLC domain */
	per_cpu(sd_llc_shared, cpu)		/* shared state of the LLC domain */
	per_cpu(sd_share_id, cpu)		/* representative id of the cluster/LLC domain */
	per_cpu(sd_numa, cpu)			/* the NUMA domain */
	per_cpu(sd_asym_packing, cpu)		/* the asymmetric-packing domain */
	per_cpu(sd_asym_cpucapacity, cpu)	/* the asymmetric-capacity domain */
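These shortcuts are what the wakeup fast path later consults. A minimal sketch (a hypothetical helper, not kernel code) of the usual consumption pattern:

/*
 * Hypothetical helper showing how such a per-CPU shortcut is typically
 * consumed: dereference under RCU, since the domains may be rebuilt
 * concurrently (CPU hotplug, cpuset changes).
 */
static int llc_span_size(int cpu)
{
	struct sched_domain *sd;
	int size = 1;

	rcu_read_lock();
	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (sd)
		size = per_cpu(sd_llc_size, cpu);
	rcu_read_unlock();

	return size;
}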
That completes the high-level walk through build_sched_domains; now let us look more closely at build_sched_domain and at the code that builds the groups.
Building a scheduling domain: build_sched_domain
static struct sched_domain *
build_sched_domain(struct sched_domain_topology_level *tl,
		   const struct cpumask *cpu_map,
		   struct sched_domain_attr *attr,
		   struct sched_domain *child, int cpu)
{
	struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu);

	if (child) {
		sd->level = child->level + 1;
		sched_domain_level_max = max(sched_domain_level_max, sd->level);
		child->parent = sd;
	}
	set_domain_attribute(sd, attr);

	return sd;
}
It has three parts: sd_init initializes the sched_domain, the parent/child links establish the hierarchy, and set_domain_attribute applies the attributes.
sd_init
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl,
	const struct cpumask *cpu_map,
	struct sched_domain *child, int cpu)
{
#ifdef CONFIG_NUMA
	/*
	 * Ugly hack to pass state to sd_numa_mask()...
	 */
	sched_domains_curr_level = tl->numa_level;
#endif

	sd_weight = cpumask_weight(tl->mask(cpu));

	if (tl->sd_flags)
		sd_flags = (*tl->sd_flags)();
	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
			"wrong sd_flags in topology description\n"))
		sd_flags &= TOPOLOGY_SD_FLAGS;

	*sd = (struct sched_domain){
		.min_interval		= sd_weight,
		.max_interval		= 2*sd_weight,
		.busy_factor		= 16,
		.imbalance_pct		= 117,

		.cache_nice_tries	= 0,

		.flags			= 1*SD_BALANCE_NEWIDLE
					| 1*SD_BALANCE_EXEC
					| 1*SD_BALANCE_FORK
					| 0*SD_BALANCE_WAKE
					| 1*SD_WAKE_AFFINE
					| 0*SD_SHARE_CPUCAPACITY
					| 0*SD_SHARE_LLC
					| 0*SD_SERIALIZE
					| 1*SD_PREFER_SIBLING
					| 0*SD_NUMA
					| sd_flags
					,

		.last_balance		= jiffies,
		.balance_interval	= sd_weight,
		.max_newidle_lb_cost	= 0,
		.last_decay_max_lb_cost	= jiffies,
		.child			= child,
		.name			= tl->name,
	};
sd_init initializes part of the sched_domain members. min_interval depends on the number of CPUs in the domain; busy_factor, imbalance_pct and balance_interval are also given their defaults here.
	if (sd->flags & SD_SHARE_CPUCAPACITY) {
		sd->imbalance_pct = 110;

	} else if (sd->flags & SD_SHARE_LLC) {
		sd->imbalance_pct = 117;
		sd->cache_nice_tries = 1;

#ifdef CONFIG_NUMA
	} else if (sd->flags & SD_NUMA) {
		sd->cache_nice_tries = 2;

		sd->flags &= ~SD_PREFER_SIBLING;
		sd->flags |= SD_SERIALIZE;
		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
		}

#endif
	} else {
		sd->cache_nice_tries = 1;
	}
Here cache_nice_tries is tuned per level: the higher the level, the larger the value; it is 2 for SD_NUMA domains and 1 for the MC level. For domains that share an LLC, imbalance_pct stays at the relatively high 117 (compared with 110 for SMT), which makes migrations within an LLC less eager. For SD_NUMA domains, SD_PREFER_SIBLING is cleared and SD_SERIALIZE is set; the latter serializes load balancing on these domains. There is one further special case: when the NUMA distance inside the domain exceeds node_reclaim_distance, SD_BALANCE_EXEC, SD_BALANCE_FORK and SD_WAKE_AFFINE are cleared. In other words, the nodes in such a domain are too far apart to be worth searching for a CPU at wakeup, exec or fork time.
	if (sd->flags & SD_SHARE_LLC) {
		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
		atomic_inc(&sd->shared->ref);
		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
	}
For domains that share an LLC, the sched_domain_shared structure is set up; nr_busy_cpus is initialized to the domain's weight (why start it at the maximum value?).
Now let us see how the scheduling groups are built.
Two functions build groups, build_overlap_sched_groups and build_sched_groups, chosen according to the SD_OVERLAP flag. What does "overlap" mean? We know that groups are subsets of a domain and that a domain has several groups. The question is whether those subsets can intersect. They can; we may cover that topic in detail later, but here only the conclusion matters: a domain whose groups can intersect is marked SD_OVERLAP. The flag is applied right after the domain is created, whenever the current topology level carries SDTL_OVERLAP.
build_sched_groups is the simpler one, so we look at it first.
static int
build_sched_groups(struct sched_domain *sd, int cpu)
{
	struct sched_group *first = NULL, *last = NULL;
	struct sd_data *sdd = sd->private;
	const struct cpumask *span = sched_domain_span(sd);
	struct cpumask *covered;
	int i;

	lockdep_assert_held(&sched_domains_mutex);
	covered = sched_domains_tmpmask;

	cpumask_clear(covered);

	for_each_cpu_wrap(i, span, cpu) {
		struct sched_group *sg;

		if (cpumask_test_cpu(i, covered))
			continue;

		sg = get_group(i, sdd);

		cpumask_or(covered, covered, sched_group_span(sg));

		if (!first)
			first = sg;
		if (last)
			last->next = sg;
		last = sg;
	}
	last->next = first;
	sd->groups = first;

	return 0;
}
The main logic is above. It iterates over every CPU in the domain, obtains that CPU's group via get_group, and links the groups into a circular list; the domain's groups member points to the first group. One key point: the first group of a domain, the one first points to, covers the same CPUs as the domain's child domain. Do not confuse the two: sched_group and sched_domain are different structures; they are only "equal" in the sense of spanning the same CPU set. The covered mask records the CPUs already covered by some group; if the current CPU is already covered it is skipped, which keeps duplicate groups out of the list and guarantees the groups do not overlap. The remaining piece is get_group.
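Because the groups form a circular list, later consumers (the load-balancer statistics code, for instance) walk them with a do/while loop; a minimal sketch (a hypothetical helper, not kernel code):

/*
 * Hypothetical helper: visit every group of a domain exactly once by
 * following the circular ->next list, the same pattern the load
 * balancer uses when gathering per-group statistics.
 */
static unsigned int count_groups(struct sched_domain *sd)
{
	struct sched_group *sg = sd->groups;
	unsigned int n = 0;

	do {
		n++;
		sg = sg->next;
	} while (sg != sd->groups);

	return n;
}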
get_group is an important function that helps you understand where groups come from. It also carries a very detailed comment that is worth reading carefully; here is a short summary:
1. There are three important objects: the scheduling domain, the scheduling group and sched_group_capacity. Domains are traversed vertically, groups horizontally, and one sched_group_capacity is shared by all CPU members of a group.
2. The first group of a domain spans the same CPUs as its child domain.
3. When the CPU topology has no overlap, only the first CPU of each prospective group needs to build the group; that is exactly what the covered mask above achieves.
Now the code.
static struct sched_group *get_group(int cpu, struct sd_data *sdd)
{
	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
	struct sched_domain *child = sd->child;
	struct sched_group *sg;
	bool already_visited;

	if (child)
		cpu = cpumask_first(sched_domain_span(child));

	sg = *per_cpu_ptr(sdd->sg, cpu);
	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

	/* Increase refcounts for claim_allocations: */
	already_visited = atomic_inc_return(&sg->ref) > 1;
	/* sgc visits should follow a similar trend as sg */
	WARN_ON(already_visited != (atomic_inc_return(&sg->sgc->ref) > 1));

	/* If we have already visited that group, it's already initialized. */
	if (already_visited)
		return sg;

	if (child) {
		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
		cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
		sg->flags = child->flags;
	} else {
		cpumask_set_cpu(cpu, sched_group_span(sg));
		cpumask_set_cpu(cpu, group_balance_mask(sg));
	}

	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;

	return sg;
}
The code is short and easy to follow. The parameters are the CPU and the sd_data of the topology level. As mentioned earlier, the sched_group is a per-CPU object. The function first looks at the domain's child domain and takes the child's first CPU; only that first CPU actually builds the group. sgc, the sched_group_capacity, is also a per-CPU object attached to the group. When a child domain exists, its CPU span is copied into the group's span, which is exactly why "a group equals the child domain" as stated above. When there is no child, the domain is at the lowest level and its group contains only the current CPU. There is also group_balance_mask, stored inside sched_group_capacity, which here is kept identical to the group's cpumask. Note that during load balancing not every CPU of a group participates; only one CPU, usually the first idle one, does the work. Groups also carry flags with the same meaning as domain flags; a group's flags are taken from the child domain.
Finally the capacity members of sgc are initialized. CPU capacity was discussed above: since every CPU here has the same capacity, the total capacity of a group is simply SCHED_CAPACITY_SCALE multiplied by the number of CPUs in the group.
With the non-overlapping case covered, you should now have a clear picture of the domain and group topology. The easiest things to confuse are the child domain and the group: they share the same cpumask, yet they are different structures with different roles.
Groups of a non-overlapping domain are built from child domains. For an overlapping domain, building groups from the sibling's child domain can go wrong; an example is given in the comment of build_overlap_sched_groups, reproduced below.
/*
 * Usually we build sched_group by sibling's child sched_domain
 * But for machines whose NUMA diameter are 3 or above, we move
 * to build sched_group by sibling's proper descendant's child
 * domain because sibling's child sched_domain will span out of
 * the sched_domain being built as below.
 *
 * Smallest diameter=3 topology is:
 *
 *   node   0   1   2   3
 *     0:  10  20  30  40
 *     1:  20  10  20  30
 *     2:  30  20  10  20
 *     3:  40  30  20  10
 *
 *   0 --- 1 --- 2 --- 3
 *
 * NUMA-3       0-3             N/A             N/A             0-3
 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
 *
 * NUMA-2       0-2             0-3             0-3             1-3
 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
 *
 * NUMA-1       0-1             0-2             1-3             2-3
 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
 *
 * NUMA-0       0               1               2               3
 *
 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
 * group span isn't a subset of the domain span.
 */
In this example there are four NUMA nodes whose pairwise distances are not fully symmetric, so the domains of the individual nodes are asymmetric as well. The groups listed in the figure are what a naive construction from the sibling's child domain would produce. For node 0's NUMA-2 domain, which spans nodes 0-2, node 0 contributes the group 0-1, which is fine; but node 2's child domain spans 1-3, which is wrong: node 0's NUMA-2 domain covers only 0-2 and does not include node 3, so that group is not a subset of the domain. In other words, we can no longer build groups the way the non-overlapping case does.
Let us see how the kernel solves this, namely the implementation of find_descended_sibling.
static struct sched_domain *
find_descended_sibling(struct sched_domain *sd, struct sched_domain *sibling)
{
	/*
	 * The proper descendant would be the one whose child won't span out
	 * of sd
	 */
	while (sibling->child &&
	       !cpumask_subset(sched_domain_span(sibling->child),
			       sched_domain_span(sd)))
		sibling = sibling->child;

	/*
	 * As we are referencing sgc across different topology level, we need
	 * to go down to skip those sched_domains which don't contribute to
	 * scheduling because they will be degenerated in cpu_attach_domain
	 */
	while (sibling->child &&
	       cpumask_equal(sched_domain_span(sibling->child),
			     sched_domain_span(sibling)))
		sibling = sibling->child;

	return sibling;
}
This is also easy to understand: it walks down the sibling's hierarchy until it finds a descendant whose span fits inside the current domain. Note that the current domain and this descendant belong to different CPUs; they are siblings in the topology.
Once the proper descendant is found, build_group_from_child_sched_domain builds the group.
static struct sched_group *
build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
{
	struct sched_group *sg;
	struct cpumask *sg_span;

	sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
			GFP_KERNEL, cpu_to_node(cpu));

	if (!sg)
		return NULL;

	sg_span = sched_group_span(sg);
	if (sd->child) {
		cpumask_copy(sg_span, sched_domain_span(sd->child));
		sg->flags = sd->child->flags;
	} else {
		cpumask_copy(sg_span, sched_domain_span(sd));
	}

	atomic_inc(&sg->ref);
	return sg;
}
There is a small optimization here: the sg is allocated on the node the CPU belongs to, so accessing it does not require going to a remote node. The main logic is again copying the child domain's cpumask into the group's cpumask. A question arises: the argument is already the descendant domain found above, so why take its child here? The answer is that the descendant found earlier was chosen by comparing against the current domain, and its child gives the actual group span. In the non-overlapping case the descendant found would simply be the sibling's own domain (in which case no search is needed at all). For the lowest-level domain with no child, the domain's own cpumask is copied into the group (could that cause problems?).
Afterwards init_overlap_sched_group initializes the balance cpumask and the capacity.
static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg)
{
	struct cpumask *mask = sched_domains_tmpmask2;
	struct sd_data *sdd = sd->private;
	struct cpumask *sg_span;
	int cpu;

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first(mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
	if (atomic_inc_return(&sg->sgc->ref) == 1)
		cpumask_copy(group_balance_mask(sg), mask);
	else
		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));

	/*
	 * Initialize sgc->capacity such that even if we mess up the
	 * domains and no possible iteration will get us here, we won't
	 * die on a /0 trap.
	 */
	sg_span = sched_group_span(sg);
	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
build_balance_mask filters out the CPUs whose own child domain at this level does not span exactly the same CPUs as the group.
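Roughly, its logic is the following (a paraphrased sketch, not the verbatim kernel code): for every CPU in the group span, look up that CPU's domain at this topology level and keep the CPU only if the group is exactly its child domain.

/*
 * Paraphrased sketch of build_balance_mask(): a CPU may do the load
 * balancing for this group only if, from its own point of view, the
 * group is exactly its child domain.
 */
static void build_balance_mask(struct sched_domain *sd, struct sched_group *sg,
			       struct cpumask *mask)
{
	const struct cpumask *sg_span = sched_group_span(sg);
	struct sd_data *sdd = sd->private;
	struct sched_domain *sibling;
	int i;

	cpumask_clear(mask);

	for_each_cpu(i, sg_span) {
		sibling = *per_cpu_ptr(sdd->sd, i);

		/* Skip CPUs that have no child domain at this level. */
		if (!sibling->child)
			continue;

		/* Keep the CPU only if the group matches its child domain. */
		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
			continue;

		cpumask_set_cpu(i, mask);
	}
}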
Once the groups are built, they are linked into a circular list just as before, and group construction is complete.
That concludes the analysis of the scheduling domain and scheduling group initialization code.