Kubernetes inter-pod anti-affinity and descheduling

Introduction

Before talking about affinity and anti-affinity inside a Kubernetes cluster, let's first understand what Kubernetes is. Kubernetes is a platform for managing and orchestrating container-based workloads and services. It offers a lot of features, such as auto-scaling (vertical and horizontal), container replicas, and secrets management.

Kubernetes also offers 2 important scheduling features that you can configure to control how pods are placed on nodes. Those features are:

Node affinity: This is similar to nodeSelector, with the difference that the language is more expressive and you can create rules that are soft/preferred rather than hard requirements, meaning that the scheduler will still be able to schedule your pod even if the rules cannot be met.
Inter-pod affinity and anti-affinity: These allow you to define rules that constrain which nodes your pod is eligible to be scheduled on, based on the labels of pods that are already running on the node rather than on the labels of the node itself.
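
To make the first feature concrete, here is a minimal sketch of a soft node affinity rule. The pod name and the disktype=ssd node label are assumptions made up for illustration; they are not used anywhere else in this post. The scheduler will favor nodes carrying that label, but it can still place the pod elsewhere if no such node is available.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-demo # hypothetical name, for illustration only
spec:
  affinity:
    nodeAffinity:
      # "preferred" makes this a soft rule: it influences scoring, it does not filter nodes
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50 # 1-100, higher weights count more when nodes are scored
        preference:
          matchExpressions:
          - key: disktype # hypothetical node label
            operator: In
            values:
            - ssd
  containers:
  - name: hello
    image: nginxdemos/hello
```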

The focus of this post will be on inter-pod anti-affinity:

How to deploy an application with a rule that tells the scheduler to prefer a different node when a node already runs a pod with the same labels as the pod being scheduled (as in the case of multiple replicas of the same app).
How to fix a real-world edge case that can leave your pods stuck on the same node, even if you have specified a pod anti-affinity rule.

So let's get started!

Creating a local multi-node cluster

To create a local multi-node cluster on our machine, we will be using Kind, so let's go ahead and follow the installation guide.

Once installed, let's create a configuration file (kind-config.yaml), specifying a cluster with 1 control-plane and 3 worker nodes:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# One control plane node and three "workers".
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker

Now let's create a cluster by running the following command:

kind create cluster --name k8s-playground --config kind-config.yaml

Note: It may take a few minutes, depending on your computer's resources.

Let's check that the cluster was created and has the right nodes. To do this, run the following command:

kubectl get nodes

You should see output similar to this:

NAME                           STATUS   ROLES                  AGE   VERSION
k8s-playground-control-plane   Ready    control-plane,master   51s   v1.21.1
k8s-playground-worker          Ready    <none>                 25s   v1.21.1
k8s-playground-worker2         Ready    <none>                 25s   v1.21.1
k8s-playground-worker3         Ready    <none>                 25s   v1.21.1

Now, create a deployment manifest (deployment.yaml) for an application named demo-app:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: demo-app
  name: demo-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels: # These are the Pod labels
        app: demo-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions: # The key and value of the label that you will match against
                - key: app
                  operator: In
                  values:
                  - demo-app # In this example we are matching against the same labels as the pod label
              topologyKey: kubernetes.io/hostname
      containers:
      - image: nginxdemos/hello
        imagePullPolicy: Always
        name: hello
        resources: {}

Let's apply the deployment:

kubectl apply -f deployment.yaml

Run kubectl get pods -o wide to see the running pods:

NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE                     NOMINATED NODE   READINESS GATES
demo-app-99d479bc9-w6f6p   1/1     Running   0          4s    10.244.1.7   k8s-playground-worker3   <none>           <none>
demo-app-99d479bc9-xhfj8   1/1     Running   0          4s    10.244.3.6   k8s-playground-worker    <none>           <none>
demo-app-99d479bc9-xwsfk   1/1     Running   0          4s    10.244.2.8   k8s-playground-worker2   <none>           <none>

As you can see, each replica landed on a different node: the scheduler prefers nodes that do not already have an instance of the app running.

What happens if we have more replicas than the number of nodes? Well, let's see. Run the following command to scale the deployment to 5 replicas:

kubectl scale deployment demo-app --replicas 5

Now run kubectl get pods -o wide. The output should be similar to this:

NAME                       READY   STATUS    RESTARTS   AGE    IP           NODE                     NOMINATED NODE   READINESS GATES
demo-app-99d479bc9-blmdm   1/1     Running   0          14s    10.244.1.8   k8s-playground-worker3   <none>           <none>
demo-app-99d479bc9-td9rk   1/1     Running   0          14s    10.244.3.7   k8s-playground-worker    <none>           <none>
demo-app-99d479bc9-w6f6p   1/1     Running   0          5m5s   10.244.1.7   k8s-playground-worker3   <none>           <none>
demo-app-99d479bc9-xhfj8   1/1     Running   0          5m5s   10.244.3.6   k8s-playground-worker    <none>           <none>
demo-app-99d479bc9-xwsfk   1/1     Running   0          5m5s   10.244.2.8   k8s-playground-worker2   <none>           <none>

As you can see, since we are using preferredDuringSchedulingIgnoredDuringExecution, the rule is only a preference. With no empty node left to schedule to, Kubernetes placed the other 2 replicas on nodes that already had pods of the same app running.
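
For comparison, here is a sketch of what the hard version of the same rule could look like. It would replace the affinity block in deployment.yaml above, and it is not what this post uses. With requiredDuringSchedulingIgnoredDuringExecution the scheduler refuses to put two demo-app pods on the same node, so with 3 worker nodes and 5 replicas the 2 extra pods would stay in Pending instead of doubling up.

```yaml
# Fragment of spec.template.spec in deployment.yaml (hard variant, for comparison only)
affinity:
  podAntiAffinity:
    # "required" terms have no weight and no podAffinityTerm wrapper
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - demo-app
      topologyKey: kubernetes.io/hostname
```

The preferred form trades that strictness for the ability to keep every replica running when nodes are scarce.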

Taking down nodes

What happens if a node goes down? Well, let's find out.

Let's drain the node k8s-playground-worker3 to simulate that node going down.

kubectl drain k8s-playground-worker3 --ignore-daemonsets

If we run kubectl get pods -o wide, we can see that the two pods evicted from k8s-playground-worker3 were recreated on nodes k8s-playground-worker and k8s-playground-worker2, since k8s-playground-worker3 is no longer schedulable.

NAME                       READY   STATUS    RESTARTS   AGE     IP           NODE                     NOMINATED NODE   READINESS GATES
demo-app-99d479bc9-2ztpg   1/1     Running   0          2m11s   10.244.2.9   k8s-playground-worker2   <none>           <none>
demo-app-99d479bc9-c6pfn   1/1     Running   0          2m11s   10.244.3.9   k8s-playground-worker    <none>           <none>
demo-app-99d479bc9-td9rk   1/1     Running   0          20m     10.244.3.7   k8s-playground-worker    <none>           <none>
demo-app-99d479bc9-xhfj8   1/1     Running   0          25m     10.244.3.6   k8s-playground-worker    <none>           <none>
demo-app-99d479bc9-xwsfk   1/1     Running   0          25m     10.244.2.8   k8s-playground-worker2   <none>           <none>

Now, let's drain the node k8s-playground-worker2.

kubectl drain k8s-playground-worker2 --ignore-daemonsets

If we run kubectl get pods -o wide, we can see (as expected) that all the pods are now running on the node k8s-playground-worker, since it is the only worker node left that can accept pods.

NAME                       READY   STATUS    RESTARTS   AGE    IP            NODE                    NOMINATED NODE   READINESS GATES
demo-app-99d479bc9-87pwh   1/1     Running   0          47s    10.244.3.11   k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-c6pfn   1/1     Running   0          7m5s   10.244.3.9    k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-kvwq7   1/1     Running   0          47s    10.244.3.10   k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-td9rk   1/1     Running   0          25m    10.244.3.7    k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-xhfj8   1/1     Running   0          30m    10.244.3.6    k8s-playground-worker   <none>           <none>

Restoring the drained nodes

So, what happens when a node comes back online?

Let's see. Run the following commands to restore the nodes:

kubectl uncordon k8s-playground-worker2
kubectl uncordon k8s-playground-worker3

Wait a moment then run the following command to see if all nodes are back online.

kubectl get nodes -o wide

It should print output similar to this:

NAME                           STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION                      CONTAINER-RUNTIME
k8s-playground-control-plane   Ready    control-plane,master   64m   v1.21.1   172.29.0.5    <none>        Ubuntu 21.04   5.10.60.1-microsoft-standard-WSL2   containerd://1.5.2
k8s-playground-worker          Ready    <none>                 64m   v1.21.1   172.29.0.2    <none>        Ubuntu 21.04   5.10.60.1-microsoft-standard-WSL2   containerd://1.5.2
k8s-playground-worker2         Ready    <none>                 64m   v1.21.1   172.29.0.4    <none>        Ubuntu 21.04   5.10.60.1-microsoft-standard-WSL2   containerd://1.5.2
k8s-playground-worker3         Ready    <none>                 64m   v1.21.1   172.29.0.3    <none>        Ubuntu 21.04   5.10.60.1-microsoft-standard-WSL2   containerd://1.5.2

We can see that all the nodes are back online!

What happened to the pods?

Run kubectl get pods -o wide to list the pods.

NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE                    NOMINATED NODE   READINESS GATES
demo-app-99d479bc9-87pwh   1/1     Running   0          16m   10.244.3.11   k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-c6pfn   1/1     Running   0          22m   10.244.3.9    k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-kvwq7   1/1     Running   0          16m   10.244.3.10   k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-td9rk   1/1     Running   0          41m   10.244.3.7    k8s-playground-worker   <none>           <none>
demo-app-99d479bc9-xhfj8   1/1     Running   0          46m   10.244.3.6    k8s-playground-worker   <none>           <none>

What?! All pods are still running on node k8s-playground-worker, even though all the other nodes are back online!

What does this mean?

If node k8s-playground-worker goes down, our application will have downtime while the pods are rescheduled to the other nodes, since all the pods are on the same node.
We have lost high availability (HA) for that app, even though multiple nodes are up and running in our cluster.

The issue

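What happened is that inter-pod anti-affinity rules are only evaluated during scheduling (that is what the IgnoredDuringExecution part of the rule name means). Once a pod is running, the scheduler does not move it, even if a better node becomes available later. For the rules to be applied again, the pod has to be recreated.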

The solution

To fix this, we need something that watches for changes in the cluster and evicts (de-schedules) pods, so that they get recreated and the scheduler can distribute the workload according to the rules again.

Luckily there is already a tool that does that. It is called Descheduler.

Let's install its Helm chart by running the following commands:

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm install descheduler --namespace kube-system descheduler/descheduler

And that's all. We only need to wait a few minutes for it to take effect. Run kubectl get pods -o wide to watch the changes in the pods.

NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
demo-app-99d479bc9-9vxbr   1/1     Running   0          45s   10.244.2.12   k8s-playground-worker2   <none>           <none>
demo-app-99d479bc9-s95f8   1/1     Running   0          45s   10.244.1.11   k8s-playground-worker3   <none>           <none>
demo-app-99d479bc9-td9rk   1/1     Running   0          91m   10.244.3.7    k8s-playground-worker    <none>           <none>
demo-app-99d479bc9-xgjgz   1/1     Running   0          45s   10.244.2.11   k8s-playground-worker2   <none>           <none>
demo-app-99d479bc9-xhfj8   1/1     Running   0          96m   10.244.3.6    k8s-playground-worker    <none>           <none>

As you can see, the Descheduler evicted three of the pods from k8s-playground-worker, and the scheduler placed their replacements on the other nodes, taking the anti-affinity preference into account again. High availability for your application has been restored!
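
The chart was installed with its default values here, so the post does not show which policy did the work. As a rough sketch only, a standalone policy in the descheduler/v1alpha1 format could look like the one below. That format belongs to Descheduler releases from around the time of Kubernetes v1.21; newer releases use a different schema, so check the documentation for the version you install. Enabling RemoveDuplicates is an assumption about a strategy that fits this scenario, since it evicts pods when several replicas of the same ReplicaSet share a node.

```yaml
# Sketch of a Descheduler policy (v1alpha1 format, older releases only)
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  # Assumed strategy: evicts extra pods of the same ReplicaSet that sit on one node,
  # so the scheduler gets a chance to place their replacements elsewhere
  "RemoveDuplicates":
    enabled: true
```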

Note: Descheduler has a lot more options, but that's a story for another post.

Summary

We have learned how to deploy an application with anti-affinity rules that spread its pods across the nodes and keep all replicas from being scheduled on the same node, which improves availability. We also learned that some edge cases can make our apps lose high availability and introduce downtime. To fix those kinds of situations, there are tools like Descheduler that can help us overcome those issues.

If you want to learn more about how to assign pods to nodes, you can check the official Kubernetes documentation.

Thanks for taking the time to read this article. See you in the next one!