Next, let's look at how the kube-scheduler scheduling algorithm (predicates and priorities) ties in with the operations described above:

// Fit is determined by resource availability.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
scheduler.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)

// RegisterFitPredicate registers a fit predicate with the algorithm
// registry. Returns the name with which the predicate was registered.
func RegisterFitPredicate(name string, predicate predicates.FitPredicate) string {
  return RegisterFitPredicateFactory(name, func(AlgorithmFactoryArgs) predicates.FitPredicate { return predicate })
}

// RegisterFitPredicateFactory registers a fit predicate factory with the
// algorithm registry. Returns the name with which the predicate was registered.
func RegisterFitPredicateFactory(name string, predicateFactory FitPredicateFactory) string {
  schedulerFactoryMutex.Lock()
  defer schedulerFactoryMutex.Unlock()
  // ...
  fitPredicateMap[name] = predicateFactory
  return name
}
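
The registration pattern above can be illustrated with a minimal, self-contained sketch. Every name in it (`fitPredicate`, `predicateFactory`, `registerPredicate`, `fitsCPU`, and the simplified `pod`/`node` types) is a hypothetical stand-in rather than a real kube-scheduler type. It only shows the idea: a plain predicate function is wrapped in a factory, and the factory is stored by name in a mutex-guarded map.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for the scheduler's pod and node types (assumptions).
type pod struct{ cpuRequest int }
type node struct{ cpuFree int }

// fitPredicate reports whether the pod fits on the node.
type fitPredicate func(p pod, n node) bool

// predicateFactory builds a predicate on demand, mirroring the factory indirection.
type predicateFactory func() fitPredicate

var (
	registryMu   sync.Mutex
	predicateMap = map[string]predicateFactory{}
)

// registerPredicateFactory stores the factory under name and returns the name.
func registerPredicateFactory(name string, f predicateFactory) string {
	registryMu.Lock()
	defer registryMu.Unlock()
	predicateMap[name] = f
	return name
}

// registerPredicate wraps a plain predicate in a factory that always returns it.
func registerPredicate(name string, p fitPredicate) string {
	return registerPredicateFactory(name, func() fitPredicate { return p })
}

// fitsCPU is a toy predicate: the node must have enough free CPU.
func fitsCPU(p pod, n node) bool { return p.cpuRequest <= n.cpuFree }

func main() {
	name := registerPredicate("FitsCPU", fitsCPU)
	pred := predicateMap[name]() // build the predicate from its factory
	fmt.Println(name, pred(pod{cpuRequest: 2}, node{cpuFree: 4}))
}
```

The factory layer looks redundant for a fixed function, but it lets other predicates be constructed later from arguments that are only known when the scheduler is assembled.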



// Prioritizes nodes that have labels matching NodeAffinity
scheduler.RegisterPriorityMapReduceFunction(priorities.NodeAffinityPriority, priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1)

// RegisterPriorityMapReduceFunction registers a priority function with the algorithm registry. Returns the name,
// with which the function was registered.
func RegisterPriorityMapReduceFunction(
  name string,
  mapFunction priorities.PriorityMapFunction,
  reduceFunction priorities.PriorityReduceFunction,
  weight int) string {
  return RegisterPriorityConfigFactory(name, PriorityConfigFactory{
    MapReduceFunction: func(AlgorithmFactoryArgs) (priorities.PriorityMapFunction, priorities.PriorityReduceFunction) {
      return mapFunction, reduceFunction
    },
    Weight: int64(weight),
  })
}

// RegisterPriorityConfigFactory registers a priority config factory with its name.
func RegisterPriorityConfigFactory(name string, pcf PriorityConfigFactory) string {
  schedulerFactoryMutex.Lock()
  defer schedulerFactoryMutex.Unlock()
  // ...
  priorityFunctionMap[name] = pcf
  return name
}
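
To make the Map+Reduce split concrete, here is a small sketch under stated assumptions. `nodeScore`, `mapMatchingLabels`, and `reduceNormalize` are hypothetical names, and the scoring rule (count matching labels, then rescale so the best node gets 10) is a toy chosen for illustration, not the actual NodeAffinity logic. The map step scores one node in isolation; the reduce step sees all nodes and normalizes the raw scores.

```go
package main

import "fmt"

// nodeScore pairs a node name with a score (simplified stand-in).
type nodeScore struct {
	Name  string
	Score int64
}

// maxPriority is the top of the 0-10 range that priority functions are expected to use.
const maxPriority = 10

// mapMatchingLabels is a toy map step: the raw score is the number of
// preferred labels that the node carries.
func mapMatchingLabels(preferred []string, nodeName string, nodeLabels map[string]bool) nodeScore {
	var count int64
	for _, l := range preferred {
		if nodeLabels[l] {
			count++
		}
	}
	return nodeScore{Name: nodeName, Score: count}
}

// reduceNormalize is a toy reduce step: it rescales the raw scores in place
// so that the best node receives maxPriority.
func reduceNormalize(scores []nodeScore) {
	var highest int64
	for _, s := range scores {
		if s.Score > highest {
			highest = s.Score
		}
	}
	if highest == 0 {
		return // every node scored 0, nothing to rescale
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * maxPriority / highest
	}
}

func main() {
	preferred := []string{"ssd", "gpu"}
	nodes := []struct {
		name   string
		labels map[string]bool
	}{
		{"node-a", map[string]bool{"ssd": true}},
		{"node-b", map[string]bool{"ssd": true, "gpu": true}},
		{"node-c", map[string]bool{}},
	}

	// Map step: one raw score per node.
	scores := make([]nodeScore, 0, len(nodes))
	for _, n := range nodes {
		scores = append(scores, mapMatchingLabels(preferred, n.name, n.labels))
	}
	// Reduce step: normalize across all nodes.
	reduceNormalize(scores)

	const weight = 1 // plays the same role as the weight passed at registration
	for _, s := range scores {
		fmt.Println(s.Name, s.Score*weight)
	}
}
```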



// (g.predicates)
// podFitsOnNode checks whether a node given by NodeInfo satisfies the given predicate functions.
// For given pod, podFitsOnNode will check if any equivalent pod exists and try to reuse its cached
// predicate results as possible.
// This function is called from two different places: Schedule and Preempt.
// When it is called from Schedule, we want to test whether the pod is schedulable
// on the node with all the existing pods on the node plus higher and equal priority
// pods nominated to run on the node.
// When it is called from Preempt, we should remove the victims of preemption and
// add the nominated pods. Removal of the victims is done by SelectVictimsOnNode().
// It removes victims from meta and NodeInfo before calling this function.
func (g *genericScheduler) podFitsOnNode(
  ctx context.Context,
  state *framework.CycleState,
  pod *v1.Pod,
  meta predicates.Metadata,
  info *schedulernodeinfo.NodeInfo,
  alwaysCheckAllPredicates bool,
) (bool, []predicates.PredicateFailureReason, *framework.Status, error) {
  var failedPredicates []predicates.PredicateFailureReason
  var status *framework.Status
  podsAdded := false
  // We run predicates twice in some cases. If the node has greater or equal priority
  // nominated pods, we run them when those pods are added to meta and nodeInfo.
  // If all predicates succeed in this pass, we run them again when these
  // nominated pods are not added. This second pass is necessary because some
  // predicates such as inter-pod affinity may not pass without the nominated pods.
  // If there are no nominated pods for the node or if the first run of the
  // predicates fail, we don't run the second pass.
  // We consider only equal or higher priority pods in the first pass, because
  // the current "pod" must yield to them and not take a space opened
  // for running them. It is ok if the current "pod" take resources freed for
  // lower priority pods.
  // Requiring that the new pod is schedulable in both circumstances ensures that
  // we are making a conservative decision: predicates like resources and inter-pod
  // anti-affinity are more likely to fail when the nominated pods are treated
  // as running, while predicates like pod affinity are more likely to fail when
  // the nominated pods are treated as not running. We can't just assume the
  // nominated pods are running because they are not running right now and in fact,
  // they may end up getting scheduled to a different node.
  for i := 0; i < 2; i++ {
    // ...
    for _, predicateKey := range predicates.Ordering() {
      // ...
      if predicate, exist := g.predicates[predicateKey]; exist {
        fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
        if err != nil {
          return false, []predicates.PredicateFailureReason{}, nil, err
        }
        // ...
      }
    }
    // ...
  }

  return len(failedPredicates) == 0 && status.IsSuccess(), failedPredicates, status, nil
}
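
The two-pass check described in the comment above can be sketched in isolation. This is a simplified model, not the scheduler's code: `pod`, `nodeState`, `withNominated`, `podFits`, `fitsCPU`, and `needsPeer` are all hypothetical names, CPU is the only resource, and "affinity" is reduced to requiring a named pod on the node. The first pass treats nominated pods of equal or higher priority as running; the second pass, run only if something was added, repeats the predicates without them.

```go
package main

import "fmt"

type pod struct {
	name     string
	cpu      int
	priority int
}

type nodeState struct {
	cpuFree int
	pods    []pod
}

type predicate func(p pod, n nodeState) bool

// fitsCPU is a resource-style predicate.
func fitsCPU(p pod, n nodeState) bool { return p.cpu <= n.cpuFree }

// needsPeer is an affinity-style predicate: a pod named peer must be on the node.
func needsPeer(peer string) predicate {
	return func(_ pod, n nodeState) bool {
		for _, q := range n.pods {
			if q.name == peer {
				return true
			}
		}
		return false
	}
}

// withNominated returns a copy of n in which nominated pods with priority
// equal to or higher than p's are counted as running, and reports whether
// any pod was added.
func withNominated(p pod, n nodeState, nominated []pod) (nodeState, bool) {
	out := nodeState{cpuFree: n.cpuFree, pods: append([]pod(nil), n.pods...)}
	added := false
	for _, np := range nominated {
		if np.priority >= p.priority {
			out.cpuFree -= np.cpu
			out.pods = append(out.pods, np)
			added = true
		}
	}
	return out, added
}

// podFits requires the predicates to pass both with and without the nominated pods.
func podFits(p pod, n nodeState, nominated []pod, preds []predicate) bool {
	added := false
	for pass := 0; pass < 2; pass++ {
		state := n
		if pass == 0 {
			state, added = withNominated(p, n, nominated)
		} else if !added {
			break // nothing was added, so the first pass already covered this case
		}
		for _, pr := range preds {
			if !pr(p, state) {
				return false
			}
		}
	}
	return true
}

func main() {
	n := nodeState{cpuFree: 4}
	nominated := []pod{{name: "db", cpu: 1, priority: 10}}
	web := pod{name: "web", cpu: 2, priority: 5}

	// Resources fit in both passes.
	fmt.Println(podFits(web, n, nominated, []predicate{fitsCPU}))
	// The affinity-style predicate passes only while "db" is counted as running,
	// so the second pass rejects the node.
	fmt.Println(podFits(web, n, nominated, []predicate{fitsCPU, needsPeer("db")}))
}
```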



// (g.prioritizers)
// prioritizeNodes prioritizes the nodes by running the individual priority functions in parallel.
// Each priority function is expected to set a score of 0-10
// 0 is the lowest priority score (least preferred node) and 10 is the highest
// Each priority function can also have its own weight
// The node scores returned by the priority function are multiplied by the weights to get weighted scores
// All scores are finally combined (added) to get the total weighted scores of all nodes
func (g *genericScheduler) prioritizeNodes(
  ctx context.Context,
  state *framework.CycleState,
  pod *v1.Pod,
  meta interface{},
  nodes []*v1.Node,
) (framework.NodeScoreList, error) {
  // ...
  workqueue.ParallelizeUntil(context.TODO(), 16, len(nodes), func(index int) {
    nodeInfo := g.nodeInfoSnapshot.NodeInfoMap[nodes[index].Name]
    for i := range g.prioritizers {
      var err error
      results[i][index], err = g.prioritizers[i].Map(pod, meta, nodeInfo)
      if err != nil {
        // ...
        results[i][index].Name = nodes[index].Name
      }
    }
  })

  for i := range g.prioritizers {
    if g.prioritizers[i].Reduce == nil {
      continue
    }
    // ...
    go func(index int) {
      // ...
      defer func() {
        // ...
      }()
      if err := g.prioritizers[index].Reduce(pod, meta, g.nodeInfoSnapshot, results[index]); err != nil {
        // ...
      }
      if klog.V(10) {
        for _, hostPriority := range results[index] {
          klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), hostPriority.Name, g.prioritizers[index].Name, hostPriority.Score)
        }
      }
    }(i)
  }
  // Wait for all computations to be finished.
  // ...
}
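
The scoring flow described in the function comment (run each priority function per node in parallel, multiply by its weight, add everything up) can be sketched as follows. This is a simplified model under assumptions: `prioritizer`, `nodeScore`, and `prioritize` are hypothetical names, nodes are plain strings, plain goroutines with a `sync.WaitGroup` replace `workqueue.ParallelizeUntil`, and the reduce step is left out.

```go
package main

import (
	"fmt"
	"sync"
)

type nodeScore struct {
	Name  string
	Score int64
}

// prioritizer is a simplified stand-in: a per-node map function plus a weight.
type prioritizer struct {
	Name   string
	Weight int64
	Map    func(node string) int64
}

// prioritize runs every prioritizer for every node, one goroutine per node,
// then sums the weighted scores into one total per node.
func prioritize(nodes []string, ps []prioritizer) []nodeScore {
	// results[i][index] holds the score of prioritizer i for node index.
	results := make([][]int64, len(ps))
	for i := range ps {
		results[i] = make([]int64, len(nodes))
	}

	var wg sync.WaitGroup
	for idx := range nodes {
		wg.Add(1)
		go func(index int) {
			defer wg.Done()
			for i := range ps {
				// Each goroutine writes only its own column, so no lock is needed.
				results[i][index] = ps[i].Map(nodes[index])
			}
		}(idx)
	}
	// Wait for all computations to be finished.
	wg.Wait()

	total := make([]nodeScore, len(nodes))
	for index, name := range nodes {
		total[index].Name = name
		for i := range ps {
			total[index].Score += results[i][index] * ps[i].Weight
		}
	}
	return total
}

func main() {
	nodes := []string{"node-a", "node-bb"}
	ps := []prioritizer{
		{Name: "NameLength", Weight: 1, Map: func(n string) int64 { return int64(len(n)) }},
		{Name: "Constant", Weight: 2, Map: func(string) int64 { return 3 }},
	}
	for _, s := range prioritize(nodes, ps) {
		fmt.Println(s.Name, s.Score)
	}
}
```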




To sum up, to add a policy on top of kube-scheduler, follow these steps:

Set the default predicate and priority policies: defaultPredicates and defaultPriorities (k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/defaults.go)

Register the predicate and priority handler functions: register predicate functions (k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/register_predicates.go); register priority functions (k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/register_priorities.go)

Write the predicate and priority handler functions: write the predicate function (k8s.io/kubernetes/pkg/scheduler/algorithm/predicates/predicates.go); write the priority Map+Reduce functions (k8s.io/kubernetes/pkg/scheduler/algorithm/priorities/xxx.go)

Besides the default predicates and priorities, you can also specify the scheduling policy manually with the --policy-config-file command-line flag (this overrides the default policy), for example examples/scheduler-policy-config.json
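
As a rough illustration of what such a policy file carries, the sketch below parses a minimal policy document in Go. The local types (`policy`, `predicatePolicy`, `priorityPolicy`) and the `parsePolicy` helper are hypothetical and cover only a few fields; the embedded JSON is a trimmed sample written for this example, not the contents of examples/scheduler-policy-config.json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// predicatePolicy names one predicate to enable (simplified local type).
type predicatePolicy struct {
	Name string `json:"name"`
}

// priorityPolicy names one priority function and its weight (simplified local type).
type priorityPolicy struct {
	Name   string `json:"name"`
	Weight int64  `json:"weight"`
}

// policy mirrors only the top-level fields used in this sketch.
type policy struct {
	Kind       string            `json:"kind"`
	APIVersion string            `json:"apiVersion"`
	Predicates []predicatePolicy `json:"predicates"`
	Priorities []priorityPolicy  `json:"priorities"`
}

// samplePolicy is an illustrative policy document, not a file from the repository.
const samplePolicy = `{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [{"name": "PodFitsResources"}],
  "priorities": [{"name": "NodeAffinityPriority", "weight": 1}]
}`

// parsePolicy decodes a policy document into the local types.
func parsePolicy(data []byte) (policy, error) {
	var p policy
	err := json.Unmarshal(data, &p)
	return p, err
}

func main() {
	p, err := parsePolicy([]byte(samplePolicy))
	if err != nil {
		panic(err)
	}
	for _, pred := range p.Predicates {
		fmt.Println("predicate:", pred.Name)
	}
	for _, prio := range p.Priorities {
		fmt.Println("priority:", prio.Name, "weight:", prio.Weight)
	}
}
```

Because the file overrides the defaults, any predicate or priority not listed in it is no longer applied.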


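Whereas recoding only makes simple code changes, standalone performs heavy secondary customization on top of kube-scheduler. The pros and cons of this approach are as follows:

Pros

  Allows the greatest degree of refactoring and customization of the scheduler

Cons

  In real-world engineering, if you only want to add predicate or priority policies, you would switch to the first approach (recoding) rather than develop and deploy a separate scheduler

  Customizing the scheduler is hard to develop (it requires, at the very least, thorough familiarity with the scheduler code) and has a large impact on the Kubernetes cluster (whether it is deployed on its own or side by side with the default scheduler), so later upgrade and maintenance costs are high

  Scheduling conflicts may occur: when two schedulers are deployed at the same time, one scheduler may find at bind time that the resources have in fact already been allocated by the other scheduler

Therefore, adopt the standalone approach only when the other approaches cannot meet your extension needs, and deploy only one scheduler in production.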

scheduler extender

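The Kubernetes project is happy for developers to use it and to file bugs or PRs (these are welcome), but it does not recommend that developers modify the Kubernetes core code directly to meet business requirements, because doing so affects the code quality and stability of Kubernetes itself. Kubernetes therefore prefers to address custom user requirements through peripheral mechanisms wherever possible.

In fact, any good project should think this way: isolate the core code as far as possible, so that it rarely changes or is changed only by maintainers (improving code quality and reducing the project's own development and operations costs), and move third-party user requirements out to the periphery wherever possible (giving users freedom), for example in the form of plugins (e.g. CNI, CRI, CSI and scheduler framework).

The default-scheduler recoding and standalone approaches described above are both intrusive and not very elegant, whereas scheduler extender and scheduler framework are non-intrusive. This section focuses on scheduler extender.

scheduler extender works like a webhook: after the default scheduling algorithm has finished, kube-scheduler calls the extender over http/https; the extender server runs the custom predicate and priority logic and returns the prescribed fields to the scheduler, which combines this information to make the final scheduling decision. This is how extended logic is implemented through an extender.

scheduler extender is suited to scenarios in which the scheduling policy depends on resources that the standard kube-scheduler does not manage. Of course, you can also use an extender to achieve the same results as the two approaches above.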
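
To give a feel for the webhook-style interaction, here is a minimal sketch of an extender-like HTTP endpoint. It is an assumption-laden toy: `filterArgs` and `filterResult` are simplified, hypothetical payloads rather than the real extender wire format, the "gpu-" name prefix stands in for a resource that kube-scheduler does not manage, and `httptest` plays the role of the scheduler calling the server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// filterArgs is a simplified, hypothetical request payload.
type filterArgs struct {
	PodName   string   `json:"podName"`
	NodeNames []string `json:"nodeNames"`
}

// filterResult is a simplified, hypothetical response payload.
type filterResult struct {
	NodeNames   []string          `json:"nodeNames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// filterNodes is the custom predicate logic: keep only nodes whose name
// starts with "gpu-" and record a reason for every rejected node.
func filterNodes(args filterArgs) filterResult {
	res := filterResult{FailedNodes: map[string]string{}}
	for _, n := range args.NodeNames {
		if strings.HasPrefix(n, "gpu-") {
			res.NodeNames = append(res.NodeNames, n)
		} else {
			res.FailedNodes[n] = "node has no custom GPU resource"
		}
	}
	return res
}

// filterHandler decodes the request, applies filterNodes, and encodes the result.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args filterArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(filterNodes(args)); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	// A test server stands in for a deployed extender.
	srv := httptest.NewServer(http.HandlerFunc(filterHandler))
	defer srv.Close()

	// This request plays the part of the scheduler calling the extender.
	body, err := json.Marshal(filterArgs{PodName: "demo", NodeNames: []string{"gpu-1", "cpu-1"}})
	if err != nil {
		panic(err)
	}
	resp, err := http.Post(srv.URL, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var res filterResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		panic(err)
	}
	fmt.Println(res.NodeNames, res.FailedNodes)
}
```

Keeping the decision logic in a plain function (`filterNodes`) separate from the HTTP handler makes the custom policy easy to test without a running server.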

Below we explain how the extender works, with reference to the code:

