Kubernetes Source Code [Zuo Yang's Deep Dive]: kube-scheduler (Scheduling Series, Part 2), a close reading of the built-in plugins one by one: NodeResourcesFit / NodeAffinity / TaintToleration / PodTopologySpread / VolumeBinding / InterPodAffinity


Zuo Yang (左扬) · 2026-06-20 · via Cnblogs (博客园) - Zuo Yang


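The first article in this scheduling series mapped out the framework's six major extension points, and the second (the previous one) covered CycleState, the scheduler's "blackboard", in depth. This article follows the data flow PreFilter → Filter → PreScore → Score → Reserve → PreBind and takes six built-in plugins apart one at a time: what each plugin does at which extension point, which CycleState key it reads or writes, and which performance details or pitfalls come with it.

After reading it you should be able to answer: why does NodeResourcesFit merge the initContainers resource requests in PreFilter rather than in Filter? Why does PodTopologySpread use [2]criticalPaths instead of storing every path? How does VolumeBinding keep its PVC binding result alive across the PreFilter, Reserve and PreBind stages?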

Kubernetes Scheduler Plugin NodeResourcesFit PodTopologySpread VolumeBinding k8s v1.36.1

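Study guide: read the whole article once, then come back to the items marked below

Must know

Source locations of the 6 plugins: pkg/scheduler/framework/plugins/{noderesources,nodeaffinity,tainttoleration,podtopologyspread,volumebinding,interpodaffinity}/*.go

The extension-point matrix: each plugin's var _ fwk.XxxPlugin = &Plugin{} assertions are the single source of truth for what it can do

CycleState key naming: all plugins follow the same convention, preFilterStateKey = "PreFilter" + Name and preScoreStateKey = "PreScore" + Name

The three patterns of state.Clone(): (1) return the original state directly; (2) shallow-copy the slice/map; (3) nil-safe return

Good to know

PodTopologySpread's [2]criticalPaths: an engineering trade-off that keeps only the 2 paths with the fewest matches

VolumeBinding's sync.Mutex embedded in its state: the guard for concurrent writes to podVolumesByNode during Filter

InterPodAffinity's IgnorePreferredTermsOfExistingPods: an optimization that skips the preferred terms of other Pods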

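Contents

1. Overview: the 6 built-in plugins at a glance
2. NodeResourcesFit: does the node have room?
3. NodeAffinity: node selectors and node affinity
4. TaintToleration: taints and tolerations
5. PodTopologySpread: topology spreading
6. VolumeBinding: storage binding (spanning three stages)
7. InterPodAffinity: affinity and anti-affinity between Pods
8. Side-by-side comparison: capabilities of the 6 plugins
9. Pitfalls from the field
10. FAQ & Roadmap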

1. Overview: the 6 built-in plugins at a glance

Let's start with the matrix of which extension points each of the 6 plugins implements:

Plugin	PreFilter	Filter	PreScore	Score	Reserve	PreBind	Core responsibility
NodeResourcesFit	✓	✓	✓	✓	-	-	Whether CPU / memory / custom resources are sufficient
NodeAffinity	✓	✓	✓	✓	-	-	Node selector + required/preferred terms
TaintToleration	-	✓	✓	✓	-	-	Hard filter on NoSchedule + soft score on PreferNoSchedule
PodTopologySpread	✓	✓	✓	✓	-	-	Spreading across zones/hosts
VolumeBinding	✓	✓	✓	✓	✓	✓	PVC/PV binding + capacity scoring
InterPodAffinity	✓	✓	✓	✓	-	-	Pod affinity / anti-affinity

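Tip: how to read the extension-point matrix

Of the six plugins covered here, VolumeBinding is the only one that implements both Reserve and PreBind: it has to "hold the seat" and also do the actual binding. The other five stop at Score; whatever comes after is left to the Reserve and PreBind plugins enabled in the profile.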

2. NodeResourcesFit: does the node have room?

2.1 Source location & implemented interfaces

pkg/scheduler/framework/plugins/noderesources/fit.go. The struct is defined at fit.go:93-105:

// fit.go:93 (k8s v1.36.1): the NodeResourcesFit plugin struct
type Fit struct {
    ignoredResources                              sets.Set[string]
    ignoredResourceGroups                         sets.Set[string]
    enableInPlacePodVerticalScaling               bool
    enableSidecarContainers                       bool
    enableSchedulingQueueHint                     bool
    enablePodLevelResources                       bool
    enableDRAExtendedResource                     bool
    enableInPlacePodLevelResourcesVerticalScaling bool
    handle                                        fwk.Handle
    *resourceAllocationScorer
    placementScorer *resourceAllocationScorer
}

// fit.go:44-50: the interfaces this plugin implements
var _ fwk.PreFilterPlugin  = &Fit{}
var _ fwk.FilterPlugin     = &Fit{}
var _ fwk.EnqueueExtensions = &Fit{}
var _ fwk.PreScorePlugin   = &Fit{}
var _ fwk.ScorePlugin      = &Fit{}
var _ fwk.SignPlugin       = &Fit{}
var _ fwk.PlacementScorePlugin = &Fit{}

2.2 PreFilter: merging the Pod's resource requests

The PreFilter at fit.go:334 merges the resource requests of all of the Pod's containers, initContainers, sidecars and overhead into a single framework.Resource and writes it to CycleState:

// fit.go:334 (k8s v1.36.1): PreFilter entry point
func (f *Fit) PreFilter(ctx context.Context, cycleState fwk.CycleState, pod *v1.Pod, nodes []fwk.NodeInfo) (*fwk.PreFilterResult, *fwk.Status) {
    // Compute the Pod's total request: the sum over containers, compared with the max over initContainers, plus overhead and sidecars
    s := f.calculateResource(pod)
    cycleState.Write(preFilterStateKey, &preFilterState{Resource: s})  // fit.go:58
    return nil, nil
}

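Design insight

Why merge the resource requests in PreFilter rather than in Filter? Because the merge needs to happen once per Pod per scheduling cycle, while Filter runs on N nodes. PreFilter hoists the computation that every node would otherwise repeat to the entry of the scheduling cycle, writes the result to CycleState, and Filter just reads it: the standard "write once, read many" practice. The initContainers rule is applied here as well: init containers run one after another, so they contribute the maximum over their requests rather than the sum, and that maximum is then compared with the sum over the regular containers.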
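The merge rule can be illustrated with a minimal sketch. It is not the code in fit.go: the Resources type and the EffectiveRequest function are hypothetical names, only CPU and memory are modeled, and sidecar (restartable init) containers are left out for brevity.

```go
package main

import "fmt"

// Resources is a simplified, hypothetical stand-in for a resource request.
type Resources struct {
	MilliCPU int64 // CPU in millicores
	Memory   int64 // memory in bytes
}

// add returns the per-dimension sum of two requests.
func (r Resources) add(o Resources) Resources {
	return Resources{r.MilliCPU + o.MilliCPU, r.Memory + o.Memory}
}

// maxOf returns the per-dimension maximum of two requests.
func maxOf(a, b Resources) Resources {
	out := a
	if b.MilliCPU > out.MilliCPU {
		out.MilliCPU = b.MilliCPU
	}
	if b.Memory > out.Memory {
		out.Memory = b.Memory
	}
	return out
}

// EffectiveRequest computes max(sum(containers), max(initContainers)) + overhead.
// Assumption: sidecar containers are ignored in this sketch.
func EffectiveRequest(containers, initContainers []Resources, overhead Resources) Resources {
	var sum, initMax Resources
	for _, c := range containers {
		sum = sum.add(c) // regular containers run concurrently: sum them
	}
	for _, ic := range initContainers {
		initMax = maxOf(initMax, ic) // init containers run one at a time: take the max
	}
	return maxOf(sum, initMax).add(overhead)
}

func main() {
	const mi = 1 << 20
	got := EffectiveRequest(
		[]Resources{{500, 1024 * mi}, {250, 512 * mi}},
		[]Resources{{1000, 256 * mi}},
		Resources{100, 64 * mi},
	)
	fmt.Printf("cpu=%dm mem=%dMi\n", got.MilliCPU, got.Memory/mi)
}
```

In this example the init container dominates on CPU (1000m against 750m) while the regular containers dominate on memory, which is why the maximum has to be taken per dimension.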

2.3 Filter: per-node resource comparison

The Filter at fit.go:615 is where the real "does it fit?" decision is made:

// fit.go:615 (k8s v1.36.1): Filter entry point
func (f *Fit) Filter(ctx context.Context, cycleState fwk.CycleState, pod *v1.Pod, nodeInfo fwk.NodeInfo) *fwk.Status {
    s, err := getPreFilterState(cycleState)
    if err != nil {
        return fwk.AsStatus(err)
    }

    // Fits(): returns every resource dimension that does not fit, with its reason
    insufficient := Fits(pod, nodeInfo, f.draManager, ResourceRequestsOptions{
        EnablePodLevelResources:   f.enablePodLevelResources,
        EnableDRAExtendedResource: f.enableDRAExtendedResource,
    })
    if len(insufficient) != 0 {
        return fwk.NewStatus(fwk.Unschedulable, fitError+insufficient[0].Reason)
    }
    return nil
}

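Note

As excerpted above, Filter reports only the first resource dimension that does not fit as the reason. This is a design choice: it is debugging-friendly (you see the most critical bottleneck immediately), but it also hides the other shortages. For the full list of mismatches, look at the return value ([]InsufficientResource) of the Fits() function at fit.go:674.

2.4 Score: three scoring strategies

The nodeResourceStrategyTypeMap at fit.go:65-90 registers three scoring strategies:

Strategy	Formula (simplified)	Characteristic
LeastAllocated (default)	(capacity - requested) / capacity	High score = more left over = spreading
MostAllocated	requested / capacity	High score = more in use = bin packing
RequestedToCapacityRatio	Based on a piecewise-linear Shape function	Configurable weights + utilization points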
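The three formulas in the table can be written out as a minimal sketch for a single resource. The function names, the shapePoint type and the 0 to 100 score range are assumptions made for illustration; the real scorers also weight several resources against each other, which is omitted here.

```go
package main

import "fmt"

// maxNodeScore is assumed to be 100, mirroring the usual node score range.
const maxNodeScore = 100

// leastAllocatedScore favors nodes with more free capacity (spreading).
func leastAllocatedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * maxNodeScore / capacity
}

// mostAllocatedScore favors nodes that are already heavily used (bin packing).
func mostAllocatedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	if requested > capacity {
		requested = capacity
	}
	return requested * maxNodeScore / capacity
}

// shapePoint is a hypothetical (utilization %, score) pair of a shape function.
type shapePoint struct {
	Utilization int64
	Score       int64
}

// shapeScore interpolates linearly between shape points, which are assumed
// to be sorted by strictly increasing Utilization.
func shapeScore(shape []shapePoint, utilization int64) int64 {
	if len(shape) == 0 {
		return 0
	}
	for i, p := range shape {
		if utilization <= p.Utilization {
			if i == 0 {
				return p.Score
			}
			prev := shape[i-1]
			return prev.Score + (p.Score-prev.Score)*(utilization-prev.Utilization)/(p.Utilization-prev.Utilization)
		}
	}
	return shape[len(shape)-1].Score
}

func main() {
	fmt.Println(leastAllocatedScore(2000, 8000)) // 25% used
	fmt.Println(mostAllocatedScore(2000, 8000))
	fmt.Println(shapeScore([]shapePoint{{0, 0}, {100, 100}}, 25))
}
```

With a shape of (0, 0) → (100, 100) the third strategy behaves like MostAllocated; reversing the points turns it into a LeastAllocated-like curve, which is what makes the shape function the most flexible of the three.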

3. NodeAffinity: node selectors and node affinity

3.1 Source location & implemented interfaces

pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go. The struct is at node_affinity.go:39-44:

// node_affinity.go:39 (k8s v1.36.1)
type NodeAffinity struct {
    handle                    fwk.Handle
    addedNodeSelector         *nodeaffinity.NodeSelector          // scheduler-wide enforced node selector
    addedPrefSchedTerms       *nodeaffinity.PreferredSchedulingTerms // scheduler-wide preferred terms
    enableSchedulingQueueHint bool
}

// node_affinity.go:46-51: implemented extension points
var _ fwk.PreFilterPlugin = &NodeAffinity{}
var _ fwk.FilterPlugin    = &NodeAffinity{}
var _ fwk.PreScorePlugin  = &NodeAffinity{}
var _ fwk.ScorePlugin     = &NodeAffinity{}
var _ fwk.EnqueueExtensions = &NodeAffinity{}
var _ fwk.SignPlugin      = &NodeAffinity{}

3.2 PreFilter: parsing the Pod's RequiredDuringSchedulingIgnoredDuringExecution

The PreFilter at node_affinity.go:159 merges the Pod's node selector and node affinity into one RequiredNodeAffinity, writes it to CycleState, and returns a node set as an optimization:

// node_affinity.go:159 (k8s v1.36.1): PreFilter
func (pl *NodeAffinity) PreFilter(...) (*fwk.PreFilterResult, *fwk.Status) {
    affinity := pod.Spec.Affinity
    noNodeAffinity := (affinity == nil || affinity.NodeAffinity == nil ||
        affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil)
    if noNodeAffinity && pl.addedNodeSelector == nil && pod.Spec.NodeSelector == nil {
        return nil, fwk.NewStatus(fwk.Skip)   // nothing to do
    }

    state := &preFilterState{
        requiredNodeSelectorAndAffinity: nodeaffinity.GetRequiredNodeAffinity(pod),
    }
    cycleState.Write(preFilterStateKey, state)

    // Without required node affinity there is nothing to prune; returning here
    // also avoids dereferencing a nil affinity below.
    if noNodeAffinity {
        return nil, nil
    }

    // Key step: try to turn the affinity into "the Pod can only land on these nodes"
    terms := affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
    var nodeNames sets.Set[string]
    for _, t := range terms {
        // If the term contains key==metadata.name with operator==In,
        // only these nodes can possibly match
        for _, r := range t.MatchFields {
            if r.Key == metav1.ObjectNameField && r.Operator == v1.NodeSelectorOpIn {
                s := sets.New(r.Values...)
                // Intersect within a term (AND); terms are ORed with each other
                ...
            }
        }
    }
    if nodeNames != nil && len(nodeNames) > 0 {
        return &fwk.PreFilterResult{NodeNames: nodeNames}, nil   // node_affinity.go:205
    }
    return nil, nil
}

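Design insight

The key return value of PreFilter is fwk.PreFilterResult{NodeNames: ...}. Once the framework has this set, it skips every node that is not in it (treating it as if Filter had failed). This is the scheduler's "early pruning" mechanism: if your Pod explicitly requires nodeName in {node-a, node-b}, the framework never runs NodeResourcesFit on node-c at all, which saves a large number of Filter calls. That is the core benefit PreFilter has over Filter.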
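How such a node-name set can be derived is sketched below. The term type and candidateNodes function are hypothetical simplifications (each requirement is reduced to the value list of a "metadata.name In [...]" match field); the sketch only illustrates the set logic, intersecting within a term and taking the union across terms, and is not the upstream implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// term is a hypothetical simplification of one NodeSelectorTerm: each inner
// slice holds the values of one "metadata.name In [...]" requirement.
type term [][]string

// toSet converts a list of node names into a set.
func toSet(names []string) map[string]struct{} {
	s := make(map[string]struct{}, len(names))
	for _, n := range names {
		s[n] = struct{}{}
	}
	return s
}

// candidateNodes returns the set of node names the terms can match.
// The boolean is false when pruning is impossible because at least one
// term carries no node-name requirement (it could match any node).
func candidateNodes(terms []term) (map[string]struct{}, bool) {
	if len(terms) == 0 {
		return nil, false
	}
	union := map[string]struct{}{}
	for _, t := range terms {
		if len(t) == 0 {
			return nil, false
		}
		// Requirements inside one term are ANDed: intersect them.
		inter := toSet(t[0])
		for _, vals := range t[1:] {
			next := toSet(vals)
			for n := range inter {
				if _, ok := next[n]; !ok {
					delete(inter, n)
				}
			}
		}
		// Terms are ORed: add this term's survivors to the union.
		for n := range inter {
			union[n] = struct{}{}
		}
	}
	return union, true
}

func main() {
	terms := []term{
		{{"node-a", "node-b"}, {"node-b", "node-c"}}, // AND: only node-b survives
		{{"node-d"}},                                 // ORed with the first term
	}
	names, ok := candidateNodes(terms)
	sorted := make([]string, 0, len(names))
	for n := range names {
		sorted = append(sorted, n)
	}
	sort.Strings(sorted)
	fmt.Println(ok, sorted)
}
```

Returning "no pruning" as soon as one term lacks a name requirement matters: because terms are ORed, that single term could still match any node in the cluster.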

3.3 Filter: matching node affinity

The Filter at node_affinity.go:218 does almost no computation; it only reads CycleState and calls RequiredNodeAffinity.Match(node):

// node_affinity.go:218 (k8s v1.36.1): Filter
func (pl *NodeAffinity) Filter(ctx context.Context, state fwk.CycleState, pod *v1.Pod, nodeInfo fwk.NodeInfo) *fwk.Status {
    node := nodeInfo.Node()

    // First match the scheduler-enforced selector (a global node selector the admin added through configuration)
    if pl.addedNodeSelector != nil && !pl.addedNodeSelector.Match(node) {
        return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, errReasonEnforced)
    }

    // Then match the Pod's own requirements
    s, err := getPreFilterState(state)
    if err != nil {
        // Fallback for when PreFilter was skipped: compute it on the spot
        s = &preFilterState{requiredNodeSelectorAndAffinity: nodeaffinity.GetRequiredNodeAffinity(pod)}
    }
    match, _ := s.requiredNodeSelectorAndAffinity.Match(node)
    if !match {
        return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, ErrReasonPod)
    }
    return nil
}

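Tip: the fallback affinity computation

Note the fallback logic at node_affinity.go:227-230: if PreFilter was skipped (it returned Skip), the key is absent from CycleState and Filter recomputes RequiredNodeAffinity on the spot. This is CycleState's fault-tolerance pattern: writing is best-effort, and reading must cope with the "never written" case. In the vast majority of cases the fallback should not be taken, though, because taking it throws away the benefit of PreFilter.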

4. TaintToleration: taints and tolerations

4.1 Source location & implemented interfaces

pkg/scheduler/framework/plugins/tainttoleration/taint_toleration.go. The struct is at taint_toleration.go:35-39:

// taint_toleration.go:35 (k8s v1.36.1)
type TaintToleration struct {
    handle                                   fwk.Handle
    enableSchedulingQueueHint                bool
    enableTaintTolerationComparisonOperators bool
}

// taint_toleration.go:41-45: the only one of the six plugins without a PreFilter
var _ fwk.FilterPlugin     = &TaintToleration{}
var _ fwk.PreScorePlugin   = &TaintToleration{}
var _ fwk.ScorePlugin      = &TaintToleration{}
var _ fwk.EnqueueExtensions = &TaintToleration{}
var _ fwk.SignPlugin       = &TaintToleration{}

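Tip: why does TaintToleration have no PreFilter?

TaintToleration's decision is per-node ("can this Pod's tolerations tolerate this node's taints?"). There is nothing to prune across nodes, because every node has to be checked independently. PreFilter therefore has nothing to optimize, and the plugin goes straight to Filter.

4.2 Filter: hard-filtering NoSchedule taints

The Filter at taint_toleration.go:119 is minimal; it only filters taints with the NoSchedule or NoExecute effect: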

// taint_toleration.go:119 (k8s v1.36.1): Filter entry point
func (pl *TaintToleration) Filter(ctx context.Context, state fwk.CycleState, pod *v1.Pod, nodeInfo fwk.NodeInfo) *fwk.Status {
    node := nodeInfo.Node()
    logger := klog.FromContext(ctx)   // the excerpt uses logger below, so it is defined here

    taint, isUntolerated := v1helper.FindMatchingUntoleratedTaint(
        logger, node.Spec.Taints, pod.Spec.Tolerations,
        helper.DoNotScheduleTaintsFilterFunc(),   // only NoSchedule and NoExecute taints are considered
        pl.enableTaintTolerationComparisonOperators,
    )
    if !isUntolerated {
        return nil
    }
    return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, "node(s) had untolerated taint(s)")
}

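4.3 PreScore + Score: the soft score for PreferNoSchedule

The PreScore at taint_toleration.go:157 filters out the Pod's tolerations that apply to PreferNoSchedule and writes them to CycleState. The Score at :195 counts the PreferNoSchedule taints the Pod cannot tolerate: the more of them a node has, the lower its final score.

// taint_toleration.go:146 (k8s v1.36.1): extract the tolerations that apply to PreferNoSchedule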
func getAllTolerationPreferNoSchedule(tolerations []v1.Toleration) (tolerationList []v1.Toleration) {
    for _, toleration := range tolerations {
        if len(toleration.Effect) == 0 || toleration.Effect == v1.TaintEffectPreferNoSchedule {
            tolerationList = append(tolerationList, toleration)
        }
    }
    return
}

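// taint_toleration.go:195: Score entry point
func (pl *TaintToleration) Score(ctx context.Context, state fwk.CycleState, pod *v1.Pod, nodeInfo fwk.NodeInfo) (int64, *fwk.Status) {
    node := nodeInfo.Node()            // node and logger are used below, so they are defined here
    logger := klog.FromContext(ctx)
    s, _ := getPreScoreState(state)
    score := int64(pl.countIntolerableTaintsPreferNoSchedule(logger, node.Spec.Taints, s.tolerationsPreferNoSchedule))
    return score, nil   // raw score = number of intolerable taints, so higher is worse; score normalization inverts it afterwards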
}
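The relationship between the raw count and the final score is easier to see in a small sketch. The taint and toleration types here are hypothetical simplifications (exact key/value match only, empty effect matches any effect), and reverseNormalize is an assumed helper that maps raw counts onto 0 to 100 so that the node with the fewest intolerable taints ends up highest.

```go
package main

import "fmt"

// taint and toleration are simplified, hypothetical versions of the API types.
type taint struct{ Key, Value, Effect string }
type toleration struct{ Key, Value, Effect string }

// tolerates reports whether tol covers t. Assumption: exact key/value match
// only; an empty toleration Effect matches every effect.
func tolerates(tol toleration, t taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	return tol.Key == t.Key && tol.Value == t.Value
}

// countIntolerable counts the PreferNoSchedule taints no toleration covers.
func countIntolerable(taints []taint, tols []toleration) int64 {
	var n int64
	for _, t := range taints {
		if t.Effect != "PreferNoSchedule" {
			continue
		}
		covered := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				covered = true
				break
			}
		}
		if !covered {
			n++
		}
	}
	return n
}

// reverseNormalize maps raw counts to 0..100, inverted: fewer intolerable
// taints yields a higher final score.
func reverseNormalize(raw []int64) []int64 {
	var maxRaw int64
	for _, r := range raw {
		if r > maxRaw {
			maxRaw = r
		}
	}
	out := make([]int64, len(raw))
	for i, r := range raw {
		if maxRaw == 0 {
			out[i] = 100
			continue
		}
		out[i] = 100 - r*100/maxRaw
	}
	return out
}

func main() {
	tols := []toleration{{Key: "gpu", Value: "true"}}
	nodeTaints := [][]taint{
		{{"gpu", "true", "PreferNoSchedule"}},
		{{"spot", "true", "PreferNoSchedule"}},
		{{"spot", "true", "PreferNoSchedule"}, {"old", "true", "PreferNoSchedule"}},
	}
	raw := make([]int64, len(nodeTaints))
	for i, ts := range nodeTaints {
		raw[i] = countIntolerable(ts, tols)
	}
	fmt.Println(raw, reverseNormalize(raw))
}
```

The raw values alone would rank the worst node highest, which is why the inversion step belongs to the plugin's scoring contract rather than being an afterthought.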

5. PodTopologySpread: topology spreading

5.1 Source location & implemented interfaces

pkg/scheduler/framework/plugins/podtopologyspread/plugin.go + filtering.go + scoring.go. The struct is at plugin.go:60-73:

// plugin.go:60 (k8s v1.36.1)
type PodTopologySpread struct {
    systemDefaulted                              bool
    parallelizer                                 fwk.Parallelizer
    defaultConstraints                           []v1.TopologySpreadConstraint
    sharedLister                                 fwk.SharedLister
    services, replicationCtrls, replicaSets, statefulSets ...
    enableNodeInclusionPolicyInPodTopologySpread bool
    enableMatchLabelKeysInPodTopologySpread      bool
    enableSchedulingQueueHint                    bool
}

// plugin.go:46-57: system default constraints (added automatically when the Pod specifies none)
var systemDefaultConstraints = []v1.TopologySpreadConstraint{
    {TopologyKey: v1.LabelHostname,     WhenUnsatisfiable: v1.ScheduleAnyway, MaxSkew: 3},
    {TopologyKey: v1.LabelTopologyZone, WhenUnsatisfiable: v1.ScheduleAnyway, MaxSkew: 5},
}

5.2 The key design of preFilterState: [2]criticalPaths

The preFilterState at filtering.go:41-52 is the most complex among the 6 plugins:

// filtering.go:41 (k8s v1.36.1)
type preFilterState struct {
    Constraints []topologySpreadConstraint
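    // Key design: only 2 critical paths are stored per constraint
    // CriticalPaths[i][0].MatchNum is always the minimum match count
    // CriticalPaths[i][1].MatchNum is always >= [0], but not guaranteed to be the second smallest
    CriticalPaths []*criticalPaths
    // Number of matching Pods per topology value, per constraint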
    TpValueToMatchNum []map[string]int
}

// filtering.go:97: criticalPaths is a fixed-size array of [2]struct, not a slice
type criticalPaths [2]struct {
    TopologyValue string
    MatchNum      int
}

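Design insight

Why [2] rather than []? See the comment at filtering.go:90-95. The current preemption algorithm guarantees two facts: (1) preemption only removes Pods that sit on the same node; (2) each node's preemption cycle works on its own copy of preFilterState. Taken together, each preemption attempt only moves some Pods off one node, so for any constraint it changes the count of a single topology value and touches at most 2 critical paths. Storing [2] is therefore enough, and it avoids the dynamic allocation of a slice. This is a classic example of trading domain knowledge for performance.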
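How a two-slot structure can keep track of the minimum is shown in the sketch below. It is modeled on the idea described above and is an illustration under assumptions, not a copy of filtering.go: the update method, the newCriticalPaths constructor and the zone names are chosen for the example.

```go
package main

import (
	"fmt"
	"math"
)

// criticalPaths keeps the two topology values with the fewest matches seen.
// Slot 0 always holds the minimum; slot 1 is >= slot 0.
type criticalPaths [2]struct {
	TopologyValue string
	MatchNum      int
}

// newCriticalPaths starts both slots at "infinity" so any real count wins.
func newCriticalPaths() *criticalPaths {
	return &criticalPaths{{MatchNum: math.MaxInt32}, {MatchNum: math.MaxInt32}}
}

// update records a new match count for tpVal and restores the invariant.
func (p *criticalPaths) update(tpVal string, num int) {
	i := -1
	if tpVal == p[0].TopologyValue {
		i = 0
	} else if tpVal == p[1].TopologyValue {
		i = 1
	}
	if i >= 0 {
		// Known topology value: overwrite, then swap if the order broke.
		p[i].MatchNum = num
		if p[0].MatchNum > p[1].MatchNum {
			p[0], p[1] = p[1], p[0]
		}
		return
	}
	// Unknown topology value: it only matters if it beats one of the slots.
	if num < p[0].MatchNum {
		p[1] = p[0]
		p[0].TopologyValue, p[0].MatchNum = tpVal, num
	} else if num < p[1].MatchNum {
		p[1].TopologyValue, p[1].MatchNum = tpVal, num
	}
}

func main() {
	p := newCriticalPaths()
	p.update("zone-a", 3)
	p.update("zone-b", 1)
	p.update("zone-c", 2)
	fmt.Println(p[0], p[1])

	// A Pod count changes in zone-b only: the minimum moves to the other slot.
	p.update("zone-b", 5)
	fmt.Println(p[0], p[1])
}
```

Because a single update touches one topology value, the other slot always still holds a valid candidate for the new minimum. That property is exactly what breaks once Pods on several nodes can be removed at the same time.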

5.3 Clone pattern: shallow copy + explicit map copies

The Clone at filtering.go:71-88 is the most involved among the 6 plugins, because preFilterState nests slices and maps:

// filtering.go:71 (k8s v1.36.1): Clone
func (s *preFilterState) Clone() fwk.StateData {
    if s == nil {
        return nil                          // ① nil-safe
    }
    copy := preFilterState{
        Constraints:       s.Constraints,                          // ② slice is shared (treated as immutable)
        CriticalPaths:     make([]*criticalPaths, len(s.CriticalPaths)),  // ③ allocate a new slice
        TpValueToMatchNum: make([]map[string]int, len(s.TpValueToMatchNum)),
    }
    for i, paths := range s.CriticalPaths {
        copy.CriticalPaths[i] = &criticalPaths{paths[0], paths[1]}  // ④ copy the [2] struct array
    }
    for i, tpMap := range s.TpValueToMatchNum {
        copy.TpValueToMatchNum[i] = maps.Clone(tpMap)  // ⑤ shallow-copy each map with maps.Clone (Go 1.21+)
    }
    return &copy
}

6. VolumeBinding: storage binding (spanning three stages)

6.1 Source location & implemented interfaces

pkg/scheduler/framework/plugins/volumebinding/volume_binding.go. The struct is at volume_binding.go:73-79:

// volume_binding.go:73 (k8s v1.36.1)
type VolumeBinding struct {
    Binder      SchedulerVolumeBinder         // performs the real PV binding in Reserve/PreBind
    PVCLister   corelisters.PersistentVolumeClaimLister
    classLister storagelisters.StorageClassLister
    scorer      volumeCapacityScorer
    fts         feature.Features
}

// volume_binding.go:81-88: the only one of the six that spans PreFilter/Filter/PreScore/Score/Reserve/PreBind
var _ fwk.PreFilterPlugin  = &VolumeBinding{}
var _ fwk.FilterPlugin     = &VolumeBinding{}
var _ fwk.ReservePlugin    = &VolumeBinding{}
var _ fwk.PreBindPlugin    = &VolumeBinding{}
var _ fwk.PreScorePlugin   = &VolumeBinding{}
var _ fwk.ScorePlugin      = &VolumeBinding{}
var _ fwk.EnqueueExtensions = &VolumeBinding{}
var _ fwk.SignPlugin       = &VolumeBinding{}

6.2 stateData: cross-stage state with an embedded sync.Mutex

The stateData at volume_binding.go:50-68 is the only state among the 6 plugins that embeds a sync.Mutex:

// volume_binding.go:53 (k8s v1.36.1)
type stateData struct {
    allBound bool
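    // podVolumesByNode: the tentative volume-binding result cached for each candidate node during Filter
    podVolumesByNode map[string]*PodVolumes
    podVolumeClaims  *PodVolumeClaims
    hasStaticBindings bool   // whether statically bound PVs are involved (affects whether Score can be skipped)
    sync.Mutex       // key point: guards concurrent writes to podVolumesByNode
}

func (d *stateData) Clone() fwk.StateData {
    return d   // returns the same pointer, no copy at all: the embedded sync.Mutex must not be copied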
}

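Note

VolumeBinding's Clone() returns d directly (volume_binding.go:66-68) and copies nothing. This is deliberate: stateData contains a sync.Mutex (which must not be copied, because the copy's lock state would be inconsistent), and the lifetime of stateData is strictly confined to one scheduling cycle, so no deep copy is needed. It is the other side of "Clone is best-effort": sometimes doing nothing is the most correct choice.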
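The effect of returning the same pointer can be demonstrated with a small sketch. The stateData below is a hypothetical reduction (a map of strings instead of *PodVolumes), and filterAll simulates Filter running concurrently on several nodes; none of it is the plugin's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// stateData is a hypothetical reduction of the plugin state: an embedded
// mutex guarding a per-node map.
type stateData struct {
	sync.Mutex
	podVolumesByNode map[string]string
}

// Clone returns the receiver itself, so every "clone" shares one mutex
// and one map. Copying the struct would copy the mutex, which is unsafe.
func (d *stateData) Clone() *stateData { return d }

// recordNode stores the tentative result for one node under the lock.
func recordNode(d *stateData, node, volumes string) {
	d.Lock()
	defer d.Unlock()
	d.podVolumesByNode[node] = volumes
}

// filterAll simulates Filter running in parallel, one goroutine per node.
func filterAll(d *stateData, nodes []string) {
	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			recordNode(d.Clone(), node, "assumed-for-"+node)
		}(n)
	}
	wg.Wait()
}

func main() {
	d := &stateData{podVolumesByNode: map[string]string{}}
	filterAll(d, []string{"node-a", "node-b", "node-c"})
	fmt.Println(len(d.podVolumesByNode), d.Clone() == d)
}
```

Because all goroutines lock the same mutex, the concurrent map writes are serialized; with per-clone mutexes they would race on the shared map even though each goroutine held "its" lock.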

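6.3 Reserve + PreBind: actually binding the PV

The Reserve at volume_binding.go:531 turns the "tentative binding" into an "actual reservation", and the PreBind at :577 makes the apiserver create the real PV/PVC binding:

// volume_binding.go:531 (k8s v1.36.1): Reserve (simplified excerpt)
func (pl *VolumeBinding) Reserve(ctx context.Context, cs fwk.CycleState, pod *v1.Pod, nodeName string) *fwk.Status {
    state, err := getStateData(cs)
    if err != nil {
        return fwk.AsStatus(err)
    }
    // 1. Fetch the PodVolumes tentatively bound for this node during Filter
    pvMap := state.podVolumesByNode[nodeName]
    // 2. Tell the assume cache that this Pod has "taken its seat": other Pods can no longer assume these PVs.
    //    AssumePodVolumes reports (allBound, err) rather than a *fwk.Status, so convert explicitly.
    allBound, err := pl.Binder.AssumePodVolumes(pod, nodeName, pvMap)
    if err != nil {
        return fwk.AsStatus(err)
    }
    state.allBound = allBound
    return nil
}

// volume_binding.go:577: PreBind (simplified excerpt)
func (pl *VolumeBinding) PreBind(ctx context.Context, cs fwk.CycleState, pod *v1.Pod, nodeName string) *fwk.Status {
    // Real apiserver calls happen here; the returned error is converted into a Status
    if err := pl.Binder.BindPodVolumes(ctx, pod, nodeName); err != nil {
        return fwk.AsStatus(err)
    }
    return nil
}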

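Design insight

AssumeCache (assume_cache.go) is the heart of VolumeBinding: it maintains the set of PVs that have already been tentatively bound. Once one Pod has assumed PV-A, other Pods see through the assume cache during Filter that PV-A is taken and do not try to assume the same PV. This prevents two Pods from being scheduled onto the same PV at the same time. Reserve records the assumption in the assume cache; PreBind commits it to the apiserver.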
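The "seat holding" idea can be sketched with a toy cache. The assumeCache type and its Assume, Restore and IsFree methods are hypothetical and much simpler than the real assume_cache.go (which stores full API objects); the sketch only shows why a second Pod cannot assume a PV that is already taken, and how an assumption is rolled back.

```go
package main

import (
	"fmt"
	"sync"
)

// assumeCache is a toy model: PV name -> name of the Pod that assumed it.
type assumeCache struct {
	mu      sync.Mutex
	assumed map[string]string
}

func newAssumeCache() *assumeCache {
	return &assumeCache{assumed: map[string]string{}}
}

// Assume marks pv as taken by pod. It fails if another Pod already holds it.
func (c *assumeCache) Assume(pv, pod string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if owner, ok := c.assumed[pv]; ok && owner != pod {
		return fmt.Errorf("PV %s is already assumed by %s", pv, owner)
	}
	c.assumed[pv] = pod
	return nil
}

// Restore drops the assumption, for example when a later stage fails.
func (c *assumeCache) Restore(pv string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.assumed, pv)
}

// IsFree reports whether pv can still be assumed, as a Filter check would.
func (c *assumeCache) IsFree(pv string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, taken := c.assumed[pv]
	return !taken
}

func main() {
	cache := newAssumeCache()
	fmt.Println(cache.Assume("pv-a", "pod-1")) // pod-1 takes the seat
	fmt.Println(cache.Assume("pv-a", "pod-2")) // rejected: seat is taken
	cache.Restore("pv-a")                      // pod-1's binding failed
	fmt.Println(cache.Assume("pv-a", "pod-2")) // now pod-2 can take it
}
```

The rollback path is as important as the happy path: without Restore, a Pod whose binding fails would leave the PV blocked for everyone else.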

7. InterPodAffinity: affinity and anti-affinity between Pods

7.1 Source location & implemented interfaces

pkg/scheduler/framework/plugins/interpodaffinity/plugin.go + filtering.go + scoring.go. The struct is at plugin.go:47-53:

// plugin.go:47 (k8s v1.36.1)
type InterPodAffinity struct {
    parallelizer              fwk.Parallelizer
    args                      config.InterPodAffinityArgs
    sharedLister              fwk.SharedLister       // key point: gives a view of all Pods, not just this one
    nsLister                  listersv1.NamespaceLister
    enableSchedulingQueueHint bool
}

// plugin.go:39-44: implemented interfaces
var _ fwk.PreFilterPlugin = &InterPodAffinity{}
var _ fwk.FilterPlugin    = &InterPodAffinity{}
var _ fwk.PreScorePlugin  = &InterPodAffinity{}
var _ fwk.ScorePlugin     = &InterPodAffinity{}
var _ fwk.EnqueueExtensions = &InterPodAffinity{}
var _ fwk.SignPlugin      = &InterPodAffinity{}

7.2 preFilterState: three topology-to-count maps

The preFilterState at filtering.go:44-56 is the semantically richest among the 6 plugins:

// filtering.go:44 (k8s v1.36.1)
type preFilterState struct {
    // Anti-affinity of existing Pods: for each (topologyKey, topologyValue) pair, how many existing Pods have anti-affinity terms matching this Pod
    existingAntiAffinityCounts topologyToMatchedTermCount
    // Affinity of this Pod: for each (topologyKey, topologyValue) pair, how many existing Pods match this Pod's affinity terms
    affinityCounts topologyToMatchedTermCount
    // Anti-affinity of this Pod: for each (topologyKey, topologyValue) pair, how many existing Pods match this Pod's anti-affinity terms
    antiAffinityCounts topologyToMatchedTermCount
    podInfo        fwk.PodInfo
    namespaceLabels labels.Set
}

7.3 Clone pattern: explicitly cloning the 3 nested maps

The Clone at filtering.go:59-72 explicitly calls each map's own clone() method:

// filtering.go:59 (k8s v1.36.1): Clone
func (s *preFilterState) Clone() fwk.StateData {
    if s == nil {
        return nil
    }
    copy := preFilterState{}
    copy.affinityCounts         = s.affinityCounts.clone()
    copy.antiAffinityCounts     = s.antiAffinityCounts.clone()
    copy.existingAntiAffinityCounts = s.existingAntiAffinityCounts.clone()
    copy.podInfo        = s.podInfo          // podInfo is an immutable snapshot, so sharing it is fine
    copy.namespaceLabels = s.namespaceLabels
    return &copy
}

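Tip: why is InterPodAffinity the slowest plugin?

Look at the SignPod implementation at plugin.go:63-79: when the Pod has PodAffinity or PodAntiAffinity, it gives up on signing outright (it returns an Unschedulable status). The reason is that InterPodAffinity's decision depends on where all the other Pods are placed, whereas the signing mechanism exists so that the Pod's own fields alone decide a cache hit. The two semantics are incompatible. This is the root cause of InterPodAffinity being unable to benefit from the cache and being the slowest plugin in the Filter stage.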

8. Side-by-side comparison: capabilities of the 6 plugins

Dimension	NodeResourcesFit	NodeAffinity	TaintToleration	PodTopologySpread	VolumeBinding	InterPodAffinity
Can sign	✓	✓	✓	✗	✓	✗
Node pruning in PreFilter	✗	✓	N/A	✗	✗	✗
preFilterState contains sync.Mutex	✗	✗	N/A	✗	✓	✗
Clone pattern	return s	return s	return s	shallow copy + maps.Clone	return d (with lock)	explicit clone of 3 maps
Filter complexity	O(resource kinds)	O(terms)	O(taints)	O(constraints)	O(PVCs × nodes)	O(all Pods)
Reuse across scheduling cycles	none (per-pod)	none	none	none	AssumeCache	none

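9. Pitfalls from the field

9.1 Merging initContainers resources only in Filter

Symptom: a custom plugin merges container resources in Filter, so initContainers are recomputed on every Filter call and scheduling becomes 30%+ slower.

Root cause: the PreFilter optimization opportunity was missed, violating the "write once, read many" principle.

Correct approach: any computation on Pod fields that does not depend on a specific node belongs in PreFilter. NodeResourcesFit's fit.go:334 is the reference example.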

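9.2 Adding your own PreFilterResult.NodeNames pruning but forgetting the OR relationship

Symptom: a home-grown NodeAffinity implementation computes the node-name set of the first term only and returns it, so nodes that other terms could match are pruned away.

Root cause: NodeSelectorTerms are ORed (any one match suffices), while the MatchExpressions inside a term are ANDed. See node_affinity.go:179-198: the correct approach is to take the union of the node names over all terms and the intersection of the MatchFields within a term.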

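9.3 Turning PodTopologySpread's criticalPaths [2] into a slice

Symptom: an attempt to "generalize" it into []criticalPath breaks the preemption logic.

Root cause: the comment at filtering.go:90-95 states explicitly that the design is "based on two facts of the current preemption algorithm". A slice looks more flexible on the surface, but the semantics have changed, and the preemption logic can no longer be guaranteed correct.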

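9.4 Changing VolumeBinding's Clone into a deep copy

Symptom: return d was taken for a bug and replaced with a deep copy, after which a deadlock occurred.

Root cause: a sync.Mutex must not be copied (the copy's lock state is inconsistent). In addition, during preemption the framework needs to take the lock of the original stateData; after a deep copy there are two stateData instances, each with its own lock, and the Filter stage can deadlock.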

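9.5 Adding your own signing optimization to InterPodAffinity

Symptom: the assumption was that a Pod can be signed as long as it declares no affinity; the result was occasional wrong placements after a cache hit.

Root cause: InterPodAffinity's decision depends on the placement of other Pods, and that information is not part of the Pod itself. The cache can only hit on the Pod's own fields, so the conflict is fundamental and signing is impossible. Kubernetes explicitly forbids it at plugin.go:63-79.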

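10. FAQ & Roadmap

Q1: In what order do the 6 plugins run? Who goes first?

A: The order follows the order in which the plugins are listed in the scheduler profile's plugin configuration; it is not hard-coded inside the plugins. The defaults live in pkg/scheduler/apis/config/v1/default_plugins.go. In recent releases that list runs roughly NodeUnschedulable → NodeName → TaintToleration → NodeAffinity → NodePorts → NodeResourcesFit → ... → VolumeBinding → ... → PodTopologySpread → InterPodAffinity; check the file for the exact order in your version. Do not reorder plugins casually unless you can show that the scheduling result stays correct after the change.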

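Q2: Why doesn't PodTopologySpread prune nodes in PreFilter by default?

A: Because its decision depends on "how many matching Pods this topology domain holds", and that count differs per domain, so nodes cannot be pruned ahead of time (unless an entire topology domain can never satisfy maxSkew, but that check is equivalent to Filter itself, so it is simpler to just run Filter). NodeAffinity is different: "must match this label" is a 0/1 decision, so it can be pruned early.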

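Q3: How do I decide whether a computation belongs in PreFilter or Filter?

A: The criteria: (1) if the computation depends only on the Pod and not on the Node, put it in PreFilter; (2) if the result can prune the node set ahead of time, put it in PreFilter; (3) otherwise put it in Filter. NodeResourcesFit's request merging satisfies (1) but cannot prune nodes ahead of time (every node has a different allocatable amount), so its PreFilter only writes to CycleState and returns no NodeNames for pruning.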

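Q4: What is VolumeBinding's AssumeCache, and why is it the only one of the six that has one?

A: AssumeCache is a snapshot of PV occupancy shared across Pods; it records which PVs have already been tentatively bound. The other plugins' computations involve only the Pod and the Node, not a cluster-wide exclusive resource, so they do not need such a mechanism. A PV is an exclusive resource, and two Pods must never assume it at the same time. AssumeCache solves exactly this problem.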

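Q5: Why is TaintToleration the only "hard filter" plugin among the six without a PreFilter?

A: Because its decision is purely per-node (node taints against Pod tolerations), and there is no Pod-level precomputation to do. NodeResourcesFit's PreFilter does not prune nodes, but merging the initContainers resource requests is a Pod-level computation that can be done once in PreFilter. NodeAffinity's PreFilter parses the Pod's nodeAffinity expressions directly, so it can both precompute requiredNodeSelectorAndAffinity and prune nodes early.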

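Q6: How do I tell whether a plugin can sign?

A: Look at its SignPod(ctx, pod) function. Returning nil, nil means it can sign; returning nil, fwk.NewStatus(Unschedulable, ...) means it cannot. The criterion is whether the plugin's decision is based entirely on the Pod's own fields. NodeResourcesFit looks at pod.Spec.Containers and NodeAffinity looks at pod.Spec.Affinity.NodeAffinity; both are Pod fields, so they can sign. InterPodAffinity depends on the placement of other Pods, so it cannot.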

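Q7: Can a built-in plugin be disabled?

A: Yes. In the profiles section of KubeSchedulerConfiguration, list the plugin under disabled for the relevant extension point (or under multiPoint). Be careful, though: disabling TaintToleration, for example, lets Pods bypass every taint check and breaks the node isolation model. Unless you have a special scenario (such as a custom scheduler that takes over the taint logic), disabling is not recommended.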

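Q8: Do all built-in plugins share one scheduler configuration?

A: You can have several profiles. KubeSchedulerConfiguration.profiles can define multiple profiles, each with its own list of enabled plugins and its own plugin arguments. Some workloads can use a "spreading" profile and others a "bin packing" profile; a Pod selects its profile through spec.schedulerName, which must match the profile's schedulerName.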

References and source links:
  • noderesources/fit.go · NodeResourcesFit implementation
  • nodeaffinity/node_affinity.go · NodeAffinity implementation
  • tainttoleration/taint_toleration.go · TaintToleration implementation
  • podtopologyspread/filtering.go · preFilterState + criticalPaths [2]
  • volumebinding/volume_binding.go · stateData + sync.Mutex
  • interpodaffinity/filtering.go · the 3 topology-to-count maps
  • volumebinding/assume_cache.go · cache of tentatively bound PVs
  • kube-scheduler/framework · plugin interface definitions
  • Scheduling, Preemption and Eviction · official documentation
