Initialization of Linux scheduling domains and scheduling groups
On systems with multiple CPUs, scheduling domains and scheduling groups are the foundation of task load balancing. This article describes how scheduling domains and scheduling groups are initialized on a multi-CPU x86 system.
What is a scheduling domain?
A large machine may contain hundreds of CPUs, and those CPUs are not fully symmetric: SMT siblings share the L1 cache and some execution units, some CPUs share an L3 cache, CPUs inside the same NUMA node share the memory bus, and there are even larger levels such as a socket, of which the whole system may contain several. The system therefore forms a hierarchical topology. Because these physical levels exist, migrating load without regard for them may not give the best result. To make this hierarchy information usable, a scheduling structure that mirrors the system topology was created: the scheduling domain, struct sched_domain.
What is a scheduling group?
Generally, load balancing is performed inside each scheduling domain. To decide whether the load inside a domain is balanced, the balance is computed between subsets of its CPUs; such a CPU subset is a scheduling group. These subsets are not chosen arbitrarily: normally the CPUs covered by one scheduling group of a domain correspond to one of its child domains. This makes sense, because when we check whether a domain is balanced we are really checking whether its child domains are balanced against each other.
Data structures for scheduling domains and scheduling groups
sched_domain_topology_level
struct sched_domain_topology_level {
	sched_domain_mask_f	mask;
	sched_domain_flags_f	sd_flags;
	int			flags;
	int			numa_level;
	struct sd_data		data;
#ifdef CONFIG_SCHED_DEBUG
	char			*name;
#endif
};
sd_data
struct sd_data {
	struct sched_domain *__percpu		*sd;
	struct sched_domain_shared *__percpu	*sds;
	struct sched_group *__percpu		*sg;
	struct sched_group_capacity *__percpu	*sgc;
};
sched_domain
struct sched_domain {
	/* These fields must be setup */
	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
	struct sched_domain __rcu *child;	/* bottom domain must be null terminated */
	struct sched_group *groups;		/* the balancing groups of the domain */
	unsigned long min_interval;		/* Minimum balance interval ms */
	unsigned long max_interval;		/* Maximum balance interval ms */
	unsigned int busy_factor;		/* less balancing by factor if busy */
	unsigned int imbalance_pct;		/* No balance until over watermark */
	unsigned int cache_nice_tries;		/* Leave cache hot tasks for # tries */
	unsigned int imb_numa_nr;		/* Nr running tasks that allows a NUMA imbalance */

	int nohz_idle;				/* NOHZ IDLE status */
	int flags;				/* See SD_* */
	int level;

	/* Runtime fields. */
	unsigned long last_balance;		/* init to jiffies. units in jiffies */
	unsigned int balance_interval;		/* initialise to 1. units in ms. */
	unsigned int nr_balance_failed;		/* initialise to 0 */

	/* idle_balance() stats */
	u64 max_newidle_lb_cost;
	unsigned long last_decay_max_lb_cost;

	...

#ifdef CONFIG_SCHED_DEBUG
	char *name;
#endif
	union {
		void *private;			/* used during construction */
		struct rcu_head rcu;		/* used during destruction */
	};
	struct sched_domain_shared *shared;

	unsigned int span_weight;
	/*
	 * Span of all CPUs in this domain.
	 *
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long span[];
};
sched_group
struct sched_group {
	struct sched_group *next;		/* Must be a circular list */
	atomic_t ref;

	unsigned int group_weight;
	unsigned int cores;
	struct sched_group_capacity *sgc;
	int asym_prefer_cpu;			/* CPU of highest priority in group */
	int flags;

	/*
	 * The CPUs this group covers.
	 *
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long cpumask[];
};
sched_group_capacity
struct sched_group_capacity {
	atomic_t ref;
	/*
	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
	 * for a single CPU.
	 */
	unsigned long capacity;
	unsigned long min_capacity;		/* Min per-CPU capacity in group */
	unsigned long max_capacity;		/* Max per-CPU capacity in group */
	unsigned long next_update;
	int imbalance;				/* XXX unrelated to capacity but shared group state */

#ifdef CONFIG_SCHED_DEBUG
	int id;
#endif

	unsigned long cpumask[];		/* Balance mask */
};
Initialization of scheduling domains and scheduling groups
The initialization is, of course, done during kernel boot. In the path start_kernel->rest_init->kernel_init->kernel_init_freeable, two functions are involved: smp_init and sched_init_smp.

build_sched_topology fills in an array of struct sched_domain_topology_level and installs it as the global variable sched_domain_topology; this variable is used later when the sched_domains themselves are built.
static void __init build_sched_topology(void)
{
	int i = 0;

#ifdef CONFIG_SCHED_SMT
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT)
	};
#endif
#ifdef CONFIG_SCHED_CLUSTER
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
	};
#endif
#ifdef CONFIG_SCHED_MC
	x86_topology[i++] = (struct sched_domain_topology_level){
		cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC)
	};
#endif
	/*
	 * When there is NUMA topology inside the package skip the PKG domain
	 * since the NUMA domains will auto-magically create the right spanning
	 * domains based on the SLIT.
	 */
	if (!x86_has_numa_in_package) {
		x86_topology[i++] = (struct sched_domain_topology_level){
			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(PKG)
		};
	}

	/*
	 * There must be one trailing NULL entry left.
	 */
	BUG_ON(i >= ARRAY_SIZE(x86_topology)-1);

	set_sched_topology(x86_topology);
}
Let us look at the levels mentioned here: SMT, CLS (cluster) and MC. The SMT level covers the hyper-threads inside one core. A cluster is the set of CPUs that share an L2 cache; on x86 the SMT siblings already share L1 and L2, so the cluster domain coincides with the SMT domain and you normally do not see a separate CLS domain on x86 machines. The MC domain covers the CPUs that share an L3 cache (the LLC). On Intel machines a NUMA node typically has a single LLC, so the MC domain may span a whole node; on AMD machines one node may contain several LLCs, so the MC domain sits below the node level.
If there is no NUMA topology inside the package, all CPUs are simply covered by one final domain, PKG.
The initialization here is clearly not complete: it only covers the SMT, CLS and MC domains (plus possibly PKG); the NUMA domains are initialized elsewhere. x86_topology is a global array.
static struct sched_domain_topology_level x86_topology[6];
It holds at most six levels, and the last entry must remain NULL as a terminator. x86_topology is then installed as sched_domain_topology via set_sched_topology.
The NUMA domains are initialized in sched_init_numa. The function is worth reading in full; since it is rather long, only excerpts are shown here. It first determines how many distinct distances exist between NUMA nodes; that number of distances becomes the number of NUMA domain levels.
void sched_init_numa(int offline_node)
{
	struct sched_domain_topology_level *tl;
	unsigned long *distance_map;
	int nr_levels = 0;
	int i, j;
	int *distances;
	struct cpumask ***masks;

	/*
	 * O(nr_nodes^2) de-duplicating selection sort -- in order to find the
	 * unique distances in the node_distance() table.
	 */
	distance_map = bitmap_alloc(NR_DISTANCE_VALUES, GFP_KERNEL);
	if (!distance_map)
		return;

	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
	for_each_cpu_node_but(i, offline_node) {
		for_each_cpu_node_but(j, offline_node) {
			int distance = node_distance(i, j);

			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
				sched_numa_warn("Invalid distance value range");
				bitmap_free(distance_map);
				return;
			}

			bitmap_set(distance_map, distance, 1);	/* record the distance; equal distances share one bit */
		}
	}
	/*
	 * We can now figure out how many unique distance values there are and
	 * allocate memory accordingly.
	 */
	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);	/* number of distinct distances = number of NUMA levels */
Then the NUMA-level scheduling domains are built according to that number of levels.
	rcu_assign_pointer(sched_domains_numa_masks, masks);	/* the code computing the cpumask of each distance level is omitted */

	/* Compute default topology size */
	for (i = 0; sched_domain_topology[i].mask; i++);	/* sched_domain_topology was already set up: the SMT/CLS/MC levels above */

	tl = kzalloc((i + nr_levels + 1) *			/* with the NUMA levels added, allocate a new array of the right size */
			sizeof(struct sched_domain_topology_level),
			GFP_KERNEL);
	if (!tl)
		return;

	/*
	 * Copy the default topology bits..
	 */
	for (i = 0; sched_domain_topology[i].mask; i++)
		tl[i] = sched_domain_topology[i];

	/*
	 * Add the NUMA identity distance, aka single NODE.
	 */
	tl[i++] = (struct sched_domain_topology_level){		/* the lowest NUMA level is the NODE domain: the local node itself */
		.mask = sd_numa_mask,
		.numa_level = 0,
		SD_INIT_NAME(NODE)
	};

	/*
	 * .. and append 'j' levels of NUMA goodness.
	 */
	for (j = 1; j < nr_levels; i++, j++) {			/* every level spanning multiple nodes is called NUMA */
		tl[i] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.sd_flags = cpu_numa_flags,
			.flags = SDTL_OVERLAP,
			.numa_level = j,
			SD_INIT_NAME(NUMA)
		};
	}

	sched_domain_topology_saved = sched_domain_topology;
	sched_domain_topology = tl;				/* sched_domain_topology no longer points at the original global array */
As you can see, a single node forms the NODE level, while every level spanning several nodes is simply named NUMA. For example, on a two-socket machine the kernel only distinguishes node distances, so the socket level is also treated as a NUMA level. Once the topology array is complete it is assigned to the global variable sched_domain_topology, which therefore no longer points at the original static array.
Initializing the sched_domains
sched_init_domains calls build_sched_domains to initialize the sched_domains.
int __init sched_init_domains(const struct cpumask *cpu_map)
{
	int err;

	ndoms_cur = 1;
	doms_cur = alloc_sched_domains(ndoms_cur);
	if (!doms_cur)
		doms_cur = &fallback_doms;
	cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
	err = build_sched_domains(doms_cur[0], NULL);

	return err;
}
doms_cur excludes the isolated CPUs (the mask is ANDed with housekeeping_cpumask(HK_TYPE_DOMAIN)), so isolated CPUs, for example those listed in the isolcpus= boot parameter, are never used by load balancing.
build_sched_domains is the key function for building the sched_domains; the rest of this article walks through it.
static int build_sched_domains(const struct cpumask *cpu_map,
			       struct sched_domain_attr *attr)
{
	...
	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
	..
}

static enum s_alloc
__visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
{
	memset(d, 0, sizeof(*d));

	if (__sdt_alloc(cpu_map))
		return sa_sd_storage;
	d->sd = alloc_percpu(struct sched_domain *);
	if (!d->sd)
		return sa_sd_storage;
	d->rd = alloc_rootdomain();
	if (!d->rd)
		return sa_sd;

	return sa_rootdomain;
}
It first calls __visit_domain_allocation_hell, which uses __sdt_alloc to allocate the per-CPU sched_domain, sched_group, sched_group_capacity and sched_domain_shared objects for every CPU in cpu_map.
Next, the domain hierarchy is built for each CPU.
	/* Set up domains for CPUs specified by the cpu_map: */
	for_each_cpu(i, cpu_map) {
		struct sched_domain_topology_level *tl;

		sd = NULL;
		for_each_sd_topology(tl) {
			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
				goto error;

			sd = build_sched_domain(tl, cpu_map, attr, sd, i);

			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;

			if (tl == sched_domain_topology)
				*per_cpu_ptr(d.sd, i) = sd;
			if (tl->flags & SDTL_OVERLAP)
				sd->flags |= SD_OVERLAP;
			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
				break;
		}
	}
This is a two-level loop: the outer loop walks the CPUs, the inner loop walks the topology levels, so the sched_domain is a per-CPU object that every CPU owns at every level, and each iteration calls build_sched_domain to create the domain for that level. Note that only the lowest-level domain is stored in the CPU's per-CPU d.sd pointer; the domains form a hierarchy, and all domains of a CPU can be reached by walking upwards from the bottom one.
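As a mental model, here is a minimal sketch (a hypothetical debug helper, not kernel code) of how a CPU's whole domain hierarchy can be reached from the lowest-level per-CPU pointer by following ->parent; the kernel itself uses the for_each_domain() macro for this walk:

/*
 * Hypothetical helper: walk one CPU's sched_domain hierarchy from the
 * bottom level upwards. The walk is done under RCU because domains can
 * be rebuilt at runtime (CPU hotplug, cpuset changes).
 */
static void dump_domains_of(int cpu)
{
	struct sched_domain *sd;

	rcu_read_lock();
	for (sd = rcu_dereference(cpu_rq(cpu)->sd); sd; sd = sd->parent)
		printk("cpu%d: level %d, %u CPUs in span\n",
		       cpu, sd->level, sd->span_weight);
	rcu_read_unlock();
}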
Next, the scheduling groups are built.
	/* Build the groups for the domains */
	for_each_cpu(i, cpu_map) {
		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			sd->span_weight = cpumask_weight(sched_domain_span(sd));
			if (sd->flags & SD_OVERLAP) {
				if (build_overlap_sched_groups(sd, i))
					goto error;
			} else {
				if (build_sched_groups(sd, i))
					goto error;
			}
		}
	}
Every domain of every CPU has its own set of groups, so this is again a two-level loop. Depending on whether the groups may overlap, either build_overlap_sched_groups or build_sched_groups is called.
Next comes an optimization for NUMA nodes with multiple LLCs, which we skip. After that the sched_group_capacity structures are initialized.
	/* Calculate CPU capacity for physical packages and nodes */
	for (i = nr_cpumask_bits-1; i >= 0; i--) {
		if (!cpumask_test_cpu(i, cpu_map))
			continue;

		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			claim_allocations(i, sd);
			init_sched_groups_capacity(i, sd);
		}
	}
Again this structure exists per domain, so it is another two-level loop: the outer loop walks the CPUs, the inner loop walks that CPU's domains. claim_allocations sets the per-CPU pointers in sd_data (including the sched_group_capacity pointer) to NULL for the structures that are actually in use, so that they are not freed when the temporary allocations are released.
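For reference, the body of claim_allocations is roughly the following (a paraphrased sketch of the kernel function, not guaranteed verbatim): anything whose refcount is non-zero has been claimed by a domain or group and must survive the cleanup path.

/* Paraphrased sketch of claim_allocations(): NULL the sd_data slots
 * that are in use so that the cleanup path does not free them. */
static void claim_allocations(int cpu, struct sched_domain *sd)
{
	struct sd_data *sdd = sd->private;

	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
	*per_cpu_ptr(sdd->sd, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
		*per_cpu_ptr(sdd->sds, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
		*per_cpu_ptr(sdd->sg, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
}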
init_sched_groups_capacity computes the cores and capacity of the groups seen from the current CPU.
Where does capacity come from?
CPU capacity is obtained in architecture-specific ways. On systems without heterogeneous CPUs it is normally 1024. init_sched_groups_capacity eventually calls update_cpu_capacity to initialize the capacity of each sched_group_capacity.
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
	unsigned long capacity = scale_rt_capacity(cpu);
	struct sched_group *sdg = sd->groups;

	if (!capacity)
		capacity = 1;

	cpu_rq(cpu)->cpu_capacity = capacity;
	trace_sched_cpu_capacity_tp(cpu_rq(cpu));

	sdg->sgc->capacity = capacity;
	sdg->sgc->min_capacity = capacity;
	sdg->sgc->max_capacity = capacity;
}
scale_rt_capacity computes the CPU's remaining capacity.
static unsigned long scale_rt_capacity(int cpu)
{
	unsigned long max = get_actual_cpu_capacity(cpu);
	struct rq *rq = cpu_rq(cpu);
	unsigned long used, free;
	unsigned long irq;

	irq = cpu_util_irq(rq);

	if (unlikely(irq >= max))
		return 1;

	/*
	 * avg_rt.util_avg and avg_dl.util_avg track binary signals
	 * (running and not running) with weights 0 and 1024 respectively.
	 */
	used = cpu_util_rt(rq);
	used += cpu_util_dl(rq);

	if (unlikely(used >= max))
		return 1;

	free = max - used;

	return scale_irq_capacity(free, irq, max);
}
It takes the maximum capacity of the CPU from get_actual_cpu_capacity, subtracts the utilization of the RT and deadline classes, and then scales the remainder down proportionally by the IRQ utilization to obtain the final capacity.
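For intuition, the final proportional scaling done by scale_irq_capacity amounts to free * (max - irq) / max. A minimal sketch with made-up numbers (illustration only, not kernel code):

/*
 * Minimal sketch of the arithmetic at the end of scale_rt_capacity().
 * All numbers are made up: max = 1024, irq = 64, rt + dl usage = 256.
 */
static unsigned long example_capacity(void)
{
	unsigned long max = 1024, irq = 64, used = 256;
	unsigned long free = max - used;	/* 768 */

	return free * (max - irq) / max;	/* 768 * 960 / 1024 = 720 */
}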
get_actual_cpu_capacity in turn gets the value from arch_scale_cpu_capacity.
unsigned long arch_scale_cpu_capacity(int cpu)
{
	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
		return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);

	return SCHED_CAPACITY_SCALE;
}
So, on a non-hybrid machine, arch_scale_cpu_capacity simply returns SCHED_CAPACITY_SCALE, which is 1024 in the kernel.
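For reference, SCHED_CAPACITY_SCALE is a fixed-point scale defined roughly as follows (paraphrased from the kernel headers; exact spelling and location may differ between versions):

/* Paraphrased from the kernel headers: capacity values are fixed point
 * with a scale of 1 << 10 == 1024. */
# define SCHED_FIXEDPOINT_SHIFT		10
# define SCHED_CAPACITY_SHIFT		SCHED_FIXEDPOINT_SHIFT
# define SCHED_CAPACITY_SCALE		(1L << SCHED_CAPACITY_SHIFT)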
Next the domains are attached.
	rcu_read_lock();
	for_each_cpu(i, cpu_map) {
		rq = cpu_rq(i);
		sd = *per_cpu_ptr(d.sd, i);

		cpu_attach_domain(sd, d.rd, i);

		if (lowest_flag_domain(i, SD_CLUSTER))
			has_cluster = true;
	}
cpu_attach_domain is called for every CPU. It first validates the CPU's sched_domain hierarchy and removes domains that should not exist, for instance a domain containing only one CPU, which serves no purpose. It then attaches the CPU to the root_domain and stores it in rq->rd, and finally calls update_top_cache_domain to refresh the topology shortcuts, including:
	per_cpu(sd_llc, cpu)			/* the LLC (last-level cache) domain */
	per_cpu(sd_llc_size, cpu)		/* number of CPUs in the LLC domain */
	per_cpu(sd_llc_id, cpu)			/* representative CPU id of the LLC domain */
	per_cpu(sd_llc_shared, cpu)		/* shared state of the LLC domain */
	per_cpu(sd_share_id, cpu)		/* representative id of the cluster/LLC domain */
	per_cpu(sd_numa, cpu)			/* the NUMA domain */
	per_cpu(sd_asym_packing, cpu)		/* the asymmetric-packing domain */
	per_cpu(sd_asym_cpucapacity, cpu)	/* the asymmetric-capacity domain */
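These shortcuts are what the wakeup fast path later consults. A minimal sketch (a hypothetical helper, not kernel code) of the usual consumption pattern:

/*
 * Hypothetical helper showing how such a per-CPU shortcut is typically
 * consumed: dereference under RCU, since the domains may be rebuilt
 * concurrently (CPU hotplug, cpuset changes).
 */
static int llc_span_size(int cpu)
{
	struct sched_domain *sd;
	int size = 1;

	rcu_read_lock();
	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (sd)
		size = per_cpu(sd_llc_size, cpu);
	rcu_read_unlock();

	return size;
}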
That completes the high-level walk through build_sched_domains; now let us look more closely at build_sched_domain and at the code that builds the groups.
Building a scheduling domain: build_sched_domain
static struct sched_domain *
build_sched_domain(struct sched_domain_topology_level *tl,
		   const struct cpumask *cpu_map,
		   struct sched_domain_attr *attr,
		   struct sched_domain *child, int cpu)
{
	struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu);

	if (child) {
		sd->level = child->level + 1;
		sched_domain_level_max = max(sched_domain_level_max, sd->level);
		child->parent = sd;
	}
	set_domain_attribute(sd, attr);

	return sd;
}
It has three parts: sd_init initializes the sched_domain, the parent/child links establish the hierarchy, and set_domain_attribute applies the attributes.
sd_init
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl,
	const struct cpumask *cpu_map,
	struct sched_domain *child, int cpu)
{
#ifdef CONFIG_NUMA
	/*
	 * Ugly hack to pass state to sd_numa_mask()...
	 */
	sched_domains_curr_level = tl->numa_level;
#endif

	sd_weight = cpumask_weight(tl->mask(cpu));

	if (tl->sd_flags)
		sd_flags = (*tl->sd_flags)();
	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
			"wrong sd_flags in topology description\n"))
		sd_flags &= TOPOLOGY_SD_FLAGS;

	*sd = (struct sched_domain){
		.min_interval		= sd_weight,
		.max_interval		= 2*sd_weight,
		.busy_factor		= 16,
		.imbalance_pct		= 117,

		.cache_nice_tries	= 0,

		.flags			= 1*SD_BALANCE_NEWIDLE
					| 1*SD_BALANCE_EXEC
					| 1*SD_BALANCE_FORK
					| 0*SD_BALANCE_WAKE
					| 1*SD_WAKE_AFFINE
					| 0*SD_SHARE_CPUCAPACITY
					| 0*SD_SHARE_LLC
					| 0*SD_SERIALIZE
					| 1*SD_PREFER_SIBLING
					| 0*SD_NUMA
					| sd_flags
					,

		.last_balance		= jiffies,
		.balance_interval	= sd_weight,
		.max_newidle_lb_cost	= 0,
		.last_decay_max_lb_cost	= jiffies,
		.child			= child,
		.name			= tl->name,
	};
sd_init initializes part of the sched_domain members. min_interval depends on the number of CPUs in the domain; busy_factor, imbalance_pct and balance_interval are also given their defaults here.
	if (sd->flags & SD_SHARE_CPUCAPACITY) {
		sd->imbalance_pct = 110;

	} else if (sd->flags & SD_SHARE_LLC) {
		sd->imbalance_pct = 117;
		sd->cache_nice_tries = 1;

#ifdef CONFIG_NUMA
	} else if (sd->flags & SD_NUMA) {
		sd->cache_nice_tries = 2;

		sd->flags &= ~SD_PREFER_SIBLING;
		sd->flags |= SD_SERIALIZE;
		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
		}

#endif
	} else {
		sd->cache_nice_tries = 1;
	}
Here cache_nice_tries is tuned per level: the higher the level, the larger the value; it is 2 for SD_NUMA domains and 1 for the MC level. For domains that share an LLC, imbalance_pct stays at the relatively high 117 (compared with 110 for SMT), which makes migrations within an LLC less eager. For SD_NUMA domains, SD_PREFER_SIBLING is cleared and SD_SERIALIZE is set; the latter serializes load balancing on these domains. There is one further special case: when the NUMA distance inside the domain exceeds node_reclaim_distance, SD_BALANCE_EXEC, SD_BALANCE_FORK and SD_WAKE_AFFINE are cleared. In other words, the nodes in such a domain are too far apart to be worth searching for a CPU at wakeup, exec or fork time.
	if (sd->flags & SD_SHARE_LLC) {
		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
		atomic_inc(&sd->shared->ref);
		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
	}
For domains that share an LLC, the sched_domain_shared structure is set up; nr_busy_cpus is initialized to the domain's weight (why start it at the maximum value?).
Now let us see how the scheduling groups are built.
Two functions build groups, build_overlap_sched_groups and build_sched_groups, chosen according to the SD_OVERLAP flag. What does "overlap" mean? We know that groups are subsets of a domain and that a domain has several groups. The question is whether those subsets can intersect. They can; we may cover that topic in detail later, but here only the conclusion matters: a domain whose groups can intersect is marked SD_OVERLAP. The flag is applied right after the domain is created, whenever the current topology level carries SDTL_OVERLAP.
build_sched_groups is the simpler one, so we look at it first.
static int
build_sched_groups(struct sched_domain *sd, int cpu)
{
	struct sched_group *first = NULL, *last = NULL;
	struct sd_data *sdd = sd->private;
	const struct cpumask *span = sched_domain_span(sd);
	struct cpumask *covered;
	int i;

	lockdep_assert_held(&sched_domains_mutex);
	covered = sched_domains_tmpmask;

	cpumask_clear(covered);

	for_each_cpu_wrap(i, span, cpu) {
		struct sched_group *sg;

		if (cpumask_test_cpu(i, covered))
			continue;

		sg = get_group(i, sdd);

		cpumask_or(covered, covered, sched_group_span(sg));

		if (!first)
			first = sg;
		if (last)
			last->next = sg;
		last = sg;
	}
	last->next = first;
	sd->groups = first;

	return 0;
}
The main logic is above. It iterates over every CPU in the domain, obtains that CPU's group via get_group, and links the groups into a circular list; the domain's groups member points to the first group. One key point: the first group of a domain, the one first points to, covers the same CPUs as the domain's child domain. Do not confuse the two: sched_group and sched_domain are different structures; they are only "equal" in the sense of spanning the same CPU set. The covered mask records the CPUs already covered by some group; if the current CPU is already covered it is skipped, which keeps duplicate groups out of the list and guarantees the groups do not overlap. The remaining piece is get_group.
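Because the groups form a circular list, later consumers (the load-balancer statistics code, for instance) walk them with a do/while loop; a minimal sketch (a hypothetical helper, not kernel code):

/*
 * Hypothetical helper: visit every group of a domain exactly once by
 * following the circular ->next list, the same pattern the load
 * balancer uses when gathering per-group statistics.
 */
static unsigned int count_groups(struct sched_domain *sd)
{
	struct sched_group *sg = sd->groups;
	unsigned int n = 0;

	do {
		n++;
		sg = sg->next;
	} while (sg != sd->groups);

	return n;
}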
get_group is an important function that helps you understand where groups come from. It also carries a very detailed comment that is worth reading carefully; here is a short summary:
1. There are three important objects: the scheduling domain, the scheduling group and sched_group_capacity. Domains are traversed vertically, groups horizontally, and one sched_group_capacity is shared by all CPU members of a group.
2. The first group of a domain spans the same CPUs as its child domain.
3. When the CPU topology has no overlap, only the first CPU of each prospective group needs to build the group; that is exactly what the covered mask above achieves.
Now the code.
static struct sched_group *get_group(int cpu, struct sd_data *sdd)
{
	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
	struct sched_domain *child = sd->child;
	struct sched_group *sg;
	bool already_visited;

	if (child)
		cpu = cpumask_first(sched_domain_span(child));

	sg = *per_cpu_ptr(sdd->sg, cpu);
	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

	/* Increase refcounts for claim_allocations: */
	already_visited = atomic_inc_return(&sg->ref) > 1;
	/* sgc visits should follow a similar trend as sg */
	WARN_ON(already_visited != (atomic_inc_return(&sg->sgc->ref) > 1));

	/* If we have already visited that group, it's already initialized. */
	if (already_visited)
		return sg;

	if (child) {
		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
		cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
		sg->flags = child->flags;
	} else {
		cpumask_set_cpu(cpu, sched_group_span(sg));
		cpumask_set_cpu(cpu, group_balance_mask(sg));
	}

	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;

	return sg;
}
The code is short and easy to follow. The parameters are the CPU and the sd_data of the topology level. As mentioned earlier, the sched_group is a per-CPU object. The function first looks at the domain's child domain and takes the child's first CPU; only that first CPU actually builds the group. sgc, the sched_group_capacity, is also a per-CPU object attached to the group. When a child domain exists, its CPU span is copied into the group's span, which is exactly why "a group equals the child domain" as stated above. When there is no child, the domain is at the lowest level and its group contains only the current CPU. There is also group_balance_mask, stored inside sched_group_capacity, which here is kept identical to the group's cpumask. Note that during load balancing not every CPU of a group participates; only one CPU, usually the first idle one, does the work. Groups also carry flags with the same meaning as domain flags; a group's flags are taken from the child domain.
Finally the capacity members of sgc are initialized. CPU capacity was discussed above: since every CPU here has the same capacity, the total capacity of a group is simply SCHED_CAPACITY_SCALE multiplied by the number of CPUs in the group.
With the non-overlapping case covered, you should now have a clear picture of the domain and group topology. The easiest things to confuse are the child domain and the group: they share the same cpumask, yet they are different structures with different roles.
Groups of a non-overlapping domain are built from child domains. For an overlapping domain, building groups from the sibling's child domain can go wrong; an example is given in the comment of build_overlap_sched_groups, reproduced below.
/*
 * Usually we build sched_group by sibling's child sched_domain
 * But for machines whose NUMA diameter are 3 or above, we move
 * to build sched_group by sibling's proper descendant's child
 * domain because sibling's child sched_domain will span out of
 * the sched_domain being built as below.
 *
 * Smallest diameter=3 topology is:
 *
 *   node   0   1   2   3
 *     0:  10  20  30  40
 *     1:  20  10  20  30
 *     2:  30  20  10  20
 *     3:  40  30  20  10
 *
 *   0 --- 1 --- 2 --- 3
 *
 * NUMA-3       0-3             N/A             N/A             0-3
 *  groups:     {0-2},{1-3}                                     {1-3},{0-2}
 *
 * NUMA-2       0-2             0-3             0-3             1-3
 *  groups:     {0-1},{1-3}     {0-2},{2-3}     {1-3},{0-1}     {2-3},{0-2}
 *
 * NUMA-1       0-1             0-2             1-3             2-3
 *  groups:     {0},{1}         {1},{2},{0}     {2},{3},{1}     {3},{2}
 *
 * NUMA-0       0               1               2               3
 *
 * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
 * group span isn't a subset of the domain span.
 */
In this example there are four NUMA nodes whose pairwise distances are not fully symmetric, so the domains of the individual nodes are asymmetric as well. The groups listed in the figure are what a naive construction from the sibling's child domain would produce. For node 0's NUMA-2 domain, which spans nodes 0-2, node 0 contributes the group 0-1, which is fine; but node 2's child domain spans 1-3, which is wrong: node 0's NUMA-2 domain covers only 0-2 and does not include node 3, so that group is not a subset of the domain. In other words, we can no longer build groups the way the non-overlapping case does.
Let us see how the kernel solves this, namely the implementation of find_descended_sibling.
static struct sched_domain *
find_descended_sibling(struct sched_domain *sd, struct sched_domain *sibling)
{
	/*
	 * The proper descendant would be the one whose child won't span out
	 * of sd
	 */
	while (sibling->child &&
	       !cpumask_subset(sched_domain_span(sibling->child),
			       sched_domain_span(sd)))
		sibling = sibling->child;

	/*
	 * As we are referencing sgc across different topology level, we need
	 * to go down to skip those sched_domains which don't contribute to
	 * scheduling because they will be degenerated in cpu_attach_domain
	 */
	while (sibling->child &&
	       cpumask_equal(sched_domain_span(sibling->child),
			     sched_domain_span(sibling)))
		sibling = sibling->child;

	return sibling;
}
This is also easy to understand: it walks down the sibling's hierarchy until it finds a descendant whose span fits inside the current domain. Note that the current domain and this descendant belong to different CPUs; they are siblings in the topology.
Once the proper descendant is found, build_group_from_child_sched_domain builds the group.
static struct sched_group *
build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
{
	struct sched_group *sg;
	struct cpumask *sg_span;

	sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
			GFP_KERNEL, cpu_to_node(cpu));

	if (!sg)
		return NULL;

	sg_span = sched_group_span(sg);
	if (sd->child) {
		cpumask_copy(sg_span, sched_domain_span(sd->child));
		sg->flags = sd->child->flags;
	} else {
		cpumask_copy(sg_span, sched_domain_span(sd));
	}

	atomic_inc(&sg->ref);
	return sg;
}
There is a small optimization here: the sg is allocated on the node the CPU belongs to, so accessing it does not require going to a remote node. The main logic is again copying the child domain's cpumask into the group's cpumask. A question arises: the argument is already the descendant domain found above, so why take its child here? The answer is that the descendant found earlier was chosen by comparing against the current domain, and its child gives the actual group span. In the non-overlapping case the descendant found would simply be the sibling's own domain (in which case no search is needed at all). For the lowest-level domain with no child, the domain's own cpumask is copied into the group (could that cause problems?).
Afterwards init_overlap_sched_group initializes the balance cpumask and the capacity.
static void init_overlap_sched_group(struct sched_domain *sd,
				     struct sched_group *sg)
{
	struct cpumask *mask = sched_domains_tmpmask2;
	struct sd_data *sdd = sd->private;
	struct cpumask *sg_span;
	int cpu;

	build_balance_mask(sd, sg, mask);
	cpu = cpumask_first(mask);

	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
	if (atomic_inc_return(&sg->sgc->ref) == 1)
		cpumask_copy(group_balance_mask(sg), mask);
	else
		WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));

	/*
	 * Initialize sgc->capacity such that even if we mess up the
	 * domains and no possible iteration will get us here, we won't
	 * die on a /0 trap.
	 */
	sg_span = sched_group_span(sg);
	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
build_balance_mask filters out the CPUs whose own child domain at this level does not span exactly the same CPUs as the group.
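Roughly, its logic is the following (a paraphrased sketch, not the verbatim kernel code): for every CPU in the group span, look up that CPU's domain at this topology level and keep the CPU only if the group is exactly its child domain.

/*
 * Paraphrased sketch of build_balance_mask(): a CPU may do the load
 * balancing for this group only if, from its own point of view, the
 * group is exactly its child domain.
 */
static void build_balance_mask(struct sched_domain *sd, struct sched_group *sg,
			       struct cpumask *mask)
{
	const struct cpumask *sg_span = sched_group_span(sg);
	struct sd_data *sdd = sd->private;
	struct sched_domain *sibling;
	int i;

	cpumask_clear(mask);

	for_each_cpu(i, sg_span) {
		sibling = *per_cpu_ptr(sdd->sd, i);

		/* Skip CPUs that have no child domain at this level. */
		if (!sibling->child)
			continue;

		/* Keep the CPU only if the group matches its child domain. */
		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
			continue;

		cpumask_set_cpu(i, mask);
	}
}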
Once the groups are built, they are linked into a circular list just as before, and group construction is complete.
That concludes the analysis of the scheduling domain and scheduling group initialization code.