Tencent Cloud: A One-Article Guide to Kubernetes Scheduler Extension (Part 1)



Preface

Scheduler是">
The Scheduler is one of the simpler Kubernetes components in terms of function and logic. Its main job is to watch kube-apiserver for pods whose PodSpec.NodeName is empty, pick the best node for each such pod using predicate (filtering) and priority (scoring) algorithms, and finally bind the pod to that node so that the pod runs there.
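Scheduler是">

The "bind" at the end of this flow is an ordinary API write: the scheduler posts a Binding object whose target is the chosen node, and the kubelet on that node then starts the pod. As a minimal sketch (not from the original article; it assumes the v1.17-era client-go API, matching the kube-scheduler code quoted below, and a hypothetical helper name bindPodToNode):

import (
  v1 "k8s.io/api/core/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
)

// bindPodToNode (hypothetical) writes the scheduling decision back to
// kube-apiserver by creating a Binding for the pod; this is what sets
// PodSpec.NodeName and lets the kubelet on that node start the pod.
func bindPodToNode(client kubernetes.Interface, pod *v1.Pod, nodeName string) error {
  return client.CoreV1().Pods(pod.Namespace).Bind(&v1.Binding{
    ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name, UID: pod.UID},
    Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
  })
}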

展开">
Expanding the scheduler portion of the call flow above, the internal call details are shown in the figure (see the Kubernetes Scheduler reference).

scheduler内部">
The scheduler comes with many built-in predicate and priority algorithms (see scheduler_algorithm). Predicates include NoDiskConflict, PodFitsResources, MatchNodeSelector, and CheckNodeMemoryPressure; priorities include LeastRequestedPriority, BalancedResourceAllocation, CalculateAntiAffinityPriority, and NodeAffinityPriority. In production, however, we often need special scheduling policies, such as batch scheduling (aka coscheduling or gang scheduling), that the default Kubernetes scheduling policies cannot satisfy. In those cases we have to extend the scheduler ourselves.

Scheduler extension approaches

目前">
Kubernetes currently supports four ways to implement custom scheduling algorithms (predicates & priorities):

default-scheduler recoding: add your algorithms directly on top of the default Kubernetes scheduler, then recompile kube-scheduler

standalone: implement a custom scheduler that runs in the cluster on its own or alongside the default kube-scheduler

scheduler extender: implement a "scheduler extender" that kube-scheduler calls over http/https as a supplement to the default scheduling algorithms (predicates, priorities, and bind); a minimal sketch follows this list

scheduler framework: implement scheduler framework plugins and recompile kube-scheduler; similar to the first approach, but more standardized and plugin-oriented

The following sections walk through the principle and development guidance for each of these approaches.
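下面">

As a quick taste of the extender approach before the detailed walkthrough: an extender is just an HTTP service that kube-scheduler POSTs candidate nodes to. A minimal sketch (not from the original article; it assumes the v1.17-era k8s.io/kube-scheduler/extender/v1 types and a trivial policy that accepts every node):

package main

import (
  "encoding/json"
  "net/http"

  extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filter implements the extender "filter" verb: kube-scheduler POSTs
// ExtenderArgs (pod + candidate nodes) and expects the surviving nodes back.
func filter(w http.ResponseWriter, r *http.Request) {
  var args extenderv1.ExtenderArgs
  if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
    http.Error(w, err.Error(), http.StatusBadRequest)
    return
  }
  // Trivial policy for illustration: every candidate node passes.
  json.NewEncoder(w).Encode(&extenderv1.ExtenderFilterResult{
    Nodes:       args.Nodes,
    NodeNames:   args.NodeNames,
    FailedNodes: extenderv1.FailedNodesMap{},
  })
}

func main() {
  http.HandleFunc("/filter", filter)
  http.ListenAndServe(":8888", nil) // this URL is configured in the scheduler policy file
}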

default-scheduler recoding

这里">
Let's first analyze the scheduling-related entry points in kube-scheduler:

设置">
Setting the default predicate & priority policies

见defaultPredicates">
See defaultPredicates and defaultPriorities (k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/defaults.go):

func init() {
  registerAlgorithmProvider(defaultPredicates(), defaultPriorities())
}

func defaultPredicates() sets.String {
  return sets.NewString(
    predicates.NoVolumeZoneConflictPred,
    predicates.MaxEBSVolumeCountPred,
    predicates.MaxGCEPDVolumeCountPred,
    predicates.MaxAzureDiskVolumeCountPred,
    predicates.MaxCSIVolumeCountPred,
    predicates.MatchInterPodAffinityPred,
    predicates.NoDiskConflictPred,
    predicates.GeneralPred,
    predicates.PodToleratesNodeTaintsPred,
    predicates.CheckVolumeBindingPred,
    predicates.CheckNodeUnschedulablePred,
  )
}

func defaultPriorities() sets.String {
  return sets.NewString(
    priorities.SelectorSpreadPriority,
    priorities.InterPodAffinityPriority,
    priorities.LeastRequestedPriority,
    priorities.BalancedResourceAllocation,
    priorities.NodePreferAvoidPodsPriority,
    priorities.NodeAffinityPriority,
    priorities.TaintTolerationPriority,
    priorities.ImageLocalityPriority,
  )
}

func registerAlgorithmProvider(predSet, priSet sets.String) {
  // Registers algorithm providers. By default we use 'DefaultProvider', but user can specify one to be used
  // by specifying flag.
  scheduler.RegisterAlgorithmProvider(scheduler.DefaultProvider, predSet, priSet)
  // Cluster autoscaler friendly scheduling algorithm.
  scheduler.RegisterAlgorithmProvider(ClusterAutoscalerProvider, predSet,
    copyAndReplace(priSet, priorities.LeastRequestedPriority, priorities.MostRequestedPriority))
}

const (
  // DefaultProvider defines the default algorithm provider name.
  DefaultProvider = "DefaultProvider"
)
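
For the recoding approach, the snippet above is the first hook point: add the name of your custom predicate or priority to these default sets so that DefaultProvider picks it up after you recompile kube-scheduler. A sketch with a made-up name ("CustomNodeLoadPred" is hypothetical, not a real kube-scheduler key):

// Hypothetical edit to defaults.go: append a custom predicate key to the
// default set; the key must match the name used at registration time.
func defaultPredicates() sets.String {
  return sets.NewString(
    predicates.NoVolumeZoneConflictPred,
    // ... the built-in predicates listed above ...
    predicates.CheckNodeUnschedulablePred,
    "CustomNodeLoadPred", // newly added custom predicate
  )
}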

注册预选和优选">
Registering the predicate and priority handler functions

注册预选函数">
Predicate function registration (k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/register_predicates.go):

func init() {
  ...
  // Fit is determined by resource availability.
  // This predicate is actually a default predicate, because it is invoked from
  // predicates.GeneralPredicates()
  scheduler.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
}
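
A custom predicate is registered the same way; a minimal sketch, assuming the hypothetical "CustomNodeLoadPred" key added to defaultPredicates() above and a function CustomNodeLoad with the FitPredicate signature shown in the next section:

func init() {
  // Hypothetical registration: the key ties this function to the name
  // listed in defaultPredicates().
  scheduler.RegisterFitPredicate("CustomNodeLoadPred", CustomNodeLoad)
}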

注册优选函数">
Priority function registration (k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults/register_priorities.go):

func init() {
  ...
  // Prioritizes nodes that have labels matching NodeAffinity
  scheduler.RegisterPriorityMapReduceFunction(priorities.NodeAffinityPriority, priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1)
}
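
Likewise for a custom priority; a minimal sketch, assuming a hypothetical "CustomPriority" key and the Map/Reduce pair sketched at the end of this section, registered with weight 1:

func init() {
  // Hypothetical registration of a custom Map/Reduce priority pair.
  scheduler.RegisterPriorityMapReduceFunction("CustomPriority", CalculateCustomPriorityMap, CalculateCustomPriorityReduce, 1)
}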

编写">
Writing the predicate and priority handler functions

PodFitsResourcesPred">
The predicate function behind PodFitsResourcesPred is as follows (k8s.io/kubernetes/pkg/scheduler/algorithm/predicates/predicates.go):

// PodFitsResources checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// First return value indicates whether a node has sufficient resources to run a pod while the second return value indicates the
// predicate failure reasons if the node has insufficient resources to run the pod.
func PodFitsResources(pod *v1.Pod, meta Metadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error) {
  node := nodeInfo.Node()
  if node == nil {
    return false, nil, fmt.Errorf("node not found")
  }

  var predicateFails []PredicateFailureReason
  allowedPodNumber := nodeInfo.AllowedPodNumber()
  if len(nodeInfo.Pods())+1 > allowedPodNumber {
    predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
  }

  // No extended resources should be ignored by default.
  ignoredExtendedResources := sets.NewString()

  var podRequest *schedulernodeinfo.Resource
  if predicateMeta, ok := meta.(*predicateMetadata); ok && predicateMeta.podFitsResourcesMetadata != nil {
    podRequest = predicateMeta.podFitsResourcesMetadata.podRequest
    if predicateMeta.podFitsResourcesMetadata.ignoredExtendedResources != nil {
      ignoredExtendedResources = predicateMeta.podFitsResourcesMetadata.ignoredExtendedResources
    }
  } else {
    // We couldn't parse metadata - fallback to computing it.
    podRequest = GetResourceRequest(pod)
  }
  if podRequest.MilliCPU == 0 &&
    podRequest.Memory == 0 &&
    podRequest.EphemeralStorage == 0 &&
    len(podRequest.ScalarResources) == 0 {
    return len(predicateFails) == 0, predicateFails, nil
  }

  allocatable := nodeInfo.AllocatableResource()
  if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
    predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
  }
  if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
    predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
  }
  if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
    predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
  }

  for rName, rQuant := range podRequest.ScalarResources {
    if v1helper.IsExtendedResourceName(rName) {
      // If this resource is one of the extended resources that should be
      // ignored, we will skip checking it.
      if ignoredExtendedResources.Has(string(rName)) {
        continue
      }
    }
    if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
      predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
    }
  }

  if klog.V(10) {
    if len(predicateFails) == 0 {
      // We explicitly don't do klog.V(10).Infof() to avoid computing all the parameters if this is
      // not logged. There is visible performance gain from it.
      klog.Infof("Schedule Pod %+v on Node %+v is allowed, Node is running only %v out of %v Pods.",
        podName(pod), node.Name, len(nodeInfo.Pods()), allowedPodNumber)
    }
  }
  return len(predicateFails) == 0, predicateFails, nil
}
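
A custom predicate only needs to follow the same signature. A minimal sketch of the hypothetical CustomNodeLoad registered earlier, written as if it lived in the same predicates package (so Metadata, PredicateFailureReason, and NewPredicateFailureError resolve); it fails nodes that lack an example label:

// CustomNodeLoad (hypothetical) admits only nodes that are explicitly
// labeled schedulable by some external controller.
func CustomNodeLoad(pod *v1.Pod, meta Metadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error) {
  node := nodeInfo.Node()
  if node == nil {
    return false, nil, fmt.Errorf("node not found")
  }
  // "node.example.com/schedulable" is a made-up label used for illustration.
  if node.Labels["node.example.com/schedulable"] != "true" {
    return false, []PredicateFailureReason{NewPredicateFailureError("CustomNodeLoadPred", "node not labeled schedulable")}, nil
  }
  return true, nil, nil
}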

The Map and Reduce functions behind the NodeAffinityPriority priority are as follows (k8s.io/kubernetes/pkg/scheduler/algorithm/priorities/node_affinity.go):

// CalculateNodeAffinityPriorityMap prioritizes nodes according to node affinity scheduling preferences
// indicated in PreferredDuringSchedulingIgnoredDuringExecution. Each time a node matches a preferredSchedulingTerm,
// it will get an add of preferredSchedulingTerm.Weight. Thus, the more preferredSchedulingTerms
// the node satisfies and the more the preferredSchedulingTerm that is satisfied weights, the higher
// score the node gets.
func CalculateNodeAffinityPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (framework.NodeScore, error) {
  node := nodeInfo.Node()
  if node == nil {
    return framework.NodeScore{}, fmt.Errorf("node not found")
  }

  // default is the podspec.
  affinity := pod.Spec.Affinity
  if priorityMeta, ok := meta.(*priorityMetadata); ok {
    // We were able to parse metadata, use affinity from there.
    affinity = priorityMeta.affinity
  }

  var count int32
  // A nil element of PreferredDuringSchedulingIgnoredDuringExecution matches no objects.
  // An element of PreferredDuringSchedulingIgnoredDuringExecution that refers to an
  // empty PreferredSchedulingTerm matches all objects.
  if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
    // Match PreferredDuringSchedulingIgnoredDuringExecution term by term.
    for i := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
      preferredSchedulingTerm := &affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution[i]
      if preferredSchedulingTerm.Weight == 0 {
        continue
      }

      // TODO: Avoid computing it for all nodes if this becomes a performance problem.
      nodeSelector, err := v1helper.NodeSelectorRequirementsAsSelector(preferredSchedulingTerm.Preference.MatchExpressions)
      if err != nil {
        return framework.NodeScore{}, err
      }
      if nodeSelector.Matches(labels.Set(node.Labels)) {
        count += preferredSchedulingTerm.Weight
      }
    }
  }

  return framework.NodeScore{
    Name:  node.Name,
    Score: int64(count),
  }, nil
}

// CalculateNodeAffinityPriorityReduce is a reduce function for node affinity priority calculation.
var CalculateNodeAffinityPriorityReduce = NormalizeReduce(framework.MaxNodeScore, false)
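
A custom priority follows the same Map/Reduce shape. A minimal sketch of the hypothetical pair registered earlier, written as if inside the same priorities package; the Map score favors nodes with more free pod slots, and NormalizeReduce scales the raw scores to [0, framework.MaxNodeScore] just as NodeAffinity does above:

// CalculateCustomPriorityMap (hypothetical) scores a node by how many more
// pods it could still accept; emptier nodes score higher before reduction.
func CalculateCustomPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (framework.NodeScore, error) {
  node := nodeInfo.Node()
  if node == nil {
    return framework.NodeScore{}, fmt.Errorf("node not found")
  }
  free := nodeInfo.AllowedPodNumber() - len(nodeInfo.Pods())
  if free < 0 {
    free = 0
  }
  return framework.NodeScore{Name: node.Name, Score: int64(free)}, nil
}

// CalculateCustomPriorityReduce normalizes the raw scores across all nodes.
var CalculateCustomPriorityReduce = NormalizeReduce(framework.MaxNodeScore, false)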
