What is kube-scheduler?
kube-scheduler is one of the core components of a Kubernetes cluster: it assigns newly created Pods to nodes. It makes that decision based on many factors, including the Pod's resource requirements (requests), node and Pod affinity/anti-affinity rules, taints and tolerations, and Pod priority.
In a Deployment's spec you will often see declarations of limits and requests, for example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              memory: 1Gi
              cpu: 1
            requests:
              memory: 256Mi
              cpu: 100m
```
Here, limits and requests are two key concepts in container resource management: requests declares the amount of resources the container is guaranteed, and is what the scheduler compares against a node's available capacity; limits declares the maximum amount the container may consume at runtime.
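The capacity that requests are checked against is a node's allocatable resources. A quick way to inspect them (docker-desktop is the node name used later in this article; substitute your own):

```shell
# Show what the scheduler considers available on the node
kubectl describe node docker-desktop | grep -A 8 "Allocatable"
```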
What do limits and requests have to do with the scheduler?
Within the cluster, kube-scheduler works quietly behind the scenes. Its main job is: first it filters out the nodes that cannot run the Pod (for example, nodes whose allocatable CPU or memory cannot cover the Pod's requests), then it scores the remaining candidates and binds the Pod to the highest-scoring node.
In short, kube-scheduler guarantees that a Pod is scheduled onto a Node that can satisfy its resource requirements.
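Scheduling decisions surface as events, so kube-scheduler's work can be observed directly (assumes access to a running cluster):

```shell
# List the binding decisions the scheduler has made recently
kubectl get events --field-selector reason=Scheduled
```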
LimitRange
LimitRange is a resource object used mainly to constrain resource usage within a namespace. It can set default resource requests and limits, as well as minimum and maximum resource values, ensuring that every Pod or container follows a specific resource policy and that no single Pod or container consumes too much. Example usage:
Create a YAML file holding the LimitRange, e.g. mem-limit-range.yaml:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
    - default:
        memory: 512Mi
      defaultRequest:
        memory: 256Mi
      type: Container
```
Apply it to the cluster:
$ kubectl apply -f mem-limit-range.yaml
Inspect the created LimitRange object:
$ kubectl describe limitrange mem-limit-range
Output:
```
Name:       mem-limit-range
Namespace:  default
Type        Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
----        --------  ---  ---  ---------------  -------------  -----------------------
Container   memory    -    -    256Mi            512Mi          -
```
Note: type: Container means the rule applies per container. Any container created in this namespace without its own requests/limits declaration automatically receives a default memory request of 256Mi and a default memory limit of 512Mi.
Verification
Define a Deployment that declares no resource requests, in a file named nginx-without-resource.yaml:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
```
Apply the Deployment to the cluster:
$ kubectl apply -f nginx-without-resource.yaml
Once the Pod has been created, you can confirm whether the LimitRange took effect by inspecting its configuration.
$ kubectl describe pod [POD_NAME]
```
Containers:
  # ...
    Limits:
      memory:  512Mi
    Requests:
      memory:  256Mi
```
initContainers run preparatory tasks before the main application containers start. Common scenarios include waiting for a dependent service to become available, generating configuration files, and running database migrations.
An initContainer stops after its task finishes, and it must complete successfully before the main containers start, which makes it a good fit for pre-start initialization work.
Example: declare an initContainers field in the Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      initContainers:
        - name: init-myservice
          image: busybox:1.28
          command: ['sh', '-c', 'echo The app is running! && sleep 10']
      containers:
        - name: nginx
          image: nginx
```
Apply the Deployment to the cluster:
$ kubectl apply -f init-container.yaml
Once the Pod has started, the event log shows the order in which the containers were loaded:
```
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m20s  default-scheduler  Successfully assigned default/nginx-deployment-6445f86ddc-fmmzw to docker-desktop
  Normal  Pulling    2m20s  kubelet            Pulling image "busybox:1.28"
  Normal  Pulled     116s   kubelet            Successfully pulled image "busybox:1.28" in 23.099396719s (23.099404677s including waiting)
  Normal  Created    116s   kubelet            Created container init-myservice
  Normal  Started    116s   kubelet            Started container init-myservice
  Normal  Pulling    106s   kubelet            Pulling image "nginx"
  Normal  Pulled     88s    kubelet            Successfully pulled image "nginx" in 18.382000675s (18.382006008s including waiting)
  Normal  Created    88s    kubelet            Created container nginx
  Normal  Started    88s    kubelet            Started container nginx
```
The container declared in initContainers was loaded first. Next, check the log output of that specific container:
$ kubectl logs [POD_NAME] -c init-myservice
The app is running!
Verification complete.
How do initContainers relate to kube-scheduler?
If an initContainer declares no resource requirements, it too receives the defaults declared by the LimitRange. This also means initContainers are scheduled and created by kube-scheduler, so adding resource requirements to initContainers influences kube-scheduler's scheduling decisions as well.
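As an illustrative sketch (the resource values below are my own, not from the original article), declaring resources on an initContainer looks like this:

```yaml
spec:
  initContainers:
    - name: init-myservice
      image: busybox:1.28
      command: ['sh', '-c', 'sleep 10']
      resources:
        requests:        # the scheduler must find a node that can satisfy this
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 256Mi
          cpu: 100m
```

Because init containers run one at a time before the app containers, the effective request used for scheduling is the maximum of the largest init-container request and the sum of the app containers' requests.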
nodeSelector
In a Deployment, the nodeSelector field schedules the Pod onto nodes carrying specific labels. If no node satisfies the requirement, the Pod waits indefinitely. Example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
      nodeSelector:
        disktype: ssd
```
In this example the nodeSelector value is disktype: ssd, meaning the Pod should be scheduled onto a Node labeled disktype=ssd. When scheduling, kube-scheduler will only select such a node to run this Pod.
First apply the Deployment to the cluster:
$ kubectl apply -f node-selector.yaml
Then check the Pod status:
$ kubectl get pod
```
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-f5bc98d57-pmq9v   0/1     Pending   0          2m17s
```
The Pod stays in the "Pending" state. The event log reveals the reason:
```
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  4m38s  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
```
The event log shows the Pod cannot be scheduled because no node satisfies its node selector: my cluster indeed has no running node labeled disktype: ssd.
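To make the Pod schedulable, you can add the expected label to a node (docker-desktop is the node name used elsewhere in this article):

```shell
kubectl label nodes docker-desktop disktype=ssd
# Later, remove it again with the trailing-dash form:
kubectl label nodes docker-desktop disktype-
```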
Affinity
Affinity is the evolution of NodeSelector and provides more sophisticated selection rules. Beyond simple matching it supports richer condition expressions such as Exists, NotIn, and In, and it supports affinity/anti-affinity both between Pods (Pod Affinity/Anti-Affinity) and between Pods and nodes (Node Affinity). In later Kubernetes versions, Affinity has gradually superseded NodeSelector.
podAffinity
podAffinity defines affinity between Pods: it schedules a Pod onto the same node as other Pods carrying specific labels.
Use case: when you want a group of services to work closely together, for example different components of an application that need low-latency communication.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-anti
spec:
  replicas: 2
  selector:
    matchLabels:
      app: anti-nginx
  template:
    metadata:
      labels:
        app: anti-nginx
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: a
                    operator: In
                    values:
                      - b
              topologyKey: kubernetes.io/hostname
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - anti-nginx
              topologyKey: kubernetes.io/hostname
      containers:
        - name: with-pod-affinity
          image: nginx
```
The deployment's affinity settings say two things: the podAffinity rule requires the Pod to land on a node (topologyKey: kubernetes.io/hostname) that already runs a Pod labeled a=b, while the podAntiAffinity rule forbids two Pods labeled app=anti-nginx from sharing a node.
After applying the deployment to the cluster, check how the Pods are distributed:
```
NAME                          READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx-anti-5656fcbb98-62mds   0/1     Pending   0          5s    <none>   <none>   <none>           <none>
nginx-anti-5656fcbb98-wxphs   0/1     Pending   0          5s    <none>   <none>   <none>           <none>
```
Both Pods stay in a waiting state because the affinity rules cannot be satisfied, which the event log of either Pod confirms:
```
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  27s   default-scheduler  0/1 nodes are available: 1 node(s) didn't match pod affinity rules. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
```
Pod affinity and anti-affinity rules let you control where Pods are placed, to meet specific scheduling and load-distribution requirements.
nodeAffinity
nodeAffinity defines affinity between a Pod and nodes, controlling which nodes (by label or attribute) the Pod may be scheduled onto.
Use case: when you need to place Pods according to hardware characteristics (GPUs, high-performance storage) or other custom labels (such as environment labels).
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: disktype
                    operator: In
                    values:
                      - ssd
      containers:
        - name: nginx
          image: nginx
```
The deployment's nodeAffinity rule requires the Pod to run on a node labeled disktype=ssd.
After applying the deployment to the cluster, check how the Pod is doing:
```
NAME                                READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx-deployment-565d7797dc-jf5nk   0/1     Pending   0          14s   <none>   <none>   <none>           <none>
```
```
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  89s   default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
```
preferredDuringSchedulingIgnoredDuringExecution
Unlike the requiredDuringScheduling type above, preferredDuringScheduling expresses a preference: the scheduler prefers nodes that satisfy the rule, but if none exists it will still schedule the Pod onto some other node.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: disktype
                    operator: In
                    values:
                      - ssd
      containers:
        - name: nginx
          image: nginx
```
Configuration note: this uses the preferredDuringSchedulingIgnoredDuringExecution type, meaning the scheduler will try, but not insist, to place the Pod on a node labeled disktype: ssd.
```
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-69c654d896-7qh8t   1/1     Running   0          28s
```
Even though my local cluster has no Node satisfying the affinity rule, the Pod is scheduled anyway.
Summary: podAffinity/podAntiAffinity place Pods relative to other Pods, while nodeAffinity places Pods relative to node labels; required... rules are hard constraints, whereas preferred... rules are soft preferences the scheduler will relax when necessary.
Taints and Tolerations
Taints and Tolerations are a Kubernetes mechanism for controlling which Pods may be scheduled onto particular nodes. Where Affinity is an attraction-based mechanism, Taints are a repulsion-based one, used to repel Pods that do not meet specific conditions.
Taints have three effects:
- NoSchedule: new Pods without a matching Toleration are not scheduled onto the node.
- PreferNoSchedule: the scheduler tries to avoid the node, but will use it if it must.
- NoExecute: new Pods are not scheduled, and already-running Pods without a matching Toleration are evicted.
Common applications of Taints include reserving dedicated nodes for particular workloads, keeping ordinary Pods off nodes with special hardware, and evicting Pods from nodes that are being drained or are misbehaving.
Usage example:
Add a Taint to a node so that no Pod is automatically scheduled there unless it carries a matching Toleration:
$ kubectl taint nodes docker-desktop for-special-user=cadmin:NoSchedule
First define a Deployment without any Tolerations to verify the effect:
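The manifest itself is not shown at this point in the original; judging from the earlier examples it is presumably the plain nginx Deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
```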
Apply it to the cluster; the Pod status stays Pending:
```
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-77b4fdf86c-wm5f9   0/1     Pending   0          23s
```
The event log shows the Taint at work:
```
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  56s   default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {for-special-user: cadmin}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
```
Then add Tolerations to the Pod definition so it may be scheduled onto the tainted node:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
      tolerations:
        - key: "for-special-user"
          operator: "Equal"
          value: "cadmin"   # must match the taint value added above
          effect: "NoSchedule"
```
This deployment sets a Toleration rule: the Pod is allowed onto nodes tainted for-special-user=cadmin with the NoSchedule effect.
Apply it to the cluster and check the Pod status:
```
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-dd7d69c9c-77qlf   1/1     Running   0          31s
```
The Pod is now scheduled normally; the Toleration is doing its job.
If the node no longer needs the Taint as a barrier, it can be removed:
$ kubectl taint nodes docker-desktop for-special-user=cadmin:NoSchedule-
node/docker-desktop untainted
PriorityClass
PriorityClass defines the scheduling priority of Pods. Common scenarios include ensuring that critical system or business workloads are scheduled ahead of less important ones, and letting higher-priority Pods preempt lower-priority ones when the cluster is short of resources.
Steps to use a PriorityClass:
Create a PriorityClass:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
```
Reference it from a Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  priorityClassName: high-priority
  containers:
    - name: mycontainer
      image: myimage
```
The priorityClassName field applies the PriorityClass created above, giving this Pod a higher scheduling priority.
Custom schedulers
The default scheduler is designed for general-purpose workloads. If it cannot meet your needs, you can run a custom scheduler to satisfy more specialized requirements, for example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  schedulerName: my-custom-scheduler
  containers:
    - name: mycontainer
      image: myimage
```
The community also offers mature open-source custom schedulers, such as Volcano (batch scheduling) and Apache YuniKorn.
You can also implement your own scheduler by studying the kube-scheduler source code.
Original post (Chinese): https://www.cnblogs.com/xiao2shiqi/p/17863674.html