
It is possible to create an AKS cluster whose node pools span more than one availability zone. This ensures that the nodes in the cluster are physically separated across different zones within the same region, which adds redundancy: if one zone goes down, the cluster keeps working. The documentation mentions some limitations (not all regions support availability zones, and the node pool's VM size must be available in every zone you choose).

Of course, you can only set the availability zones of a node pool when you create it; once created, they cannot be modified.
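
For example, creating a zone-spanning cluster and an extra node pool with the Azure CLI might look roughly like this (the cluster, resource group and pool names are placeholders of my own):

# Create an AKS cluster whose default node pool is spread across zones 1, 2 and 3
az aks create `
  --name myAksCluster `
  --resource-group myResourceGroup `
  --node-count 3 `
  --zones 1 2 3 `
  --generate-ssh-keys

# Additional node pools also get their zones fixed at creation time
az aks nodepool add `
  --cluster-name myAksCluster `
  --resource-group myResourceGroup `
  --name zonedpool `
  --zones 1 2 3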

On paper that's great: anything that adds fault tolerance is good. But beware, this comes with a price to pay, namely that a node can only attach disks that are in its own availability zone. That means that if a node is in availability zone 1, all of its pods will only be able to attach disks that are in that same zone. Nodes are marked with the labels topology.kubernetes.io/region and failure-domain.beta.kubernetes.io/region indicating the region, and the labels topology.kubernetes.io/zone and failure-domain.beta.kubernetes.io/zone indicating the availability zone (there are two labels for the same thing because the failure-domain labels are deprecated but kept for compatibility).
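
You can check which zone each node ended up in by printing those labels as extra columns:

> kubectl get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone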

Availability zones and dynamically provisioned PVs

What happens when a PV is dynamically provisioned from a PVC? Remember that, in this case, the PV is created automatically to satisfy the PVC. To check what happens I created four PVCs (a sketch of one of them is shown below) and waited for the four underlying PVs to be created.
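
A minimal sketch of one of those PVCs (pvc3 in the listing below; managed-premium is one of the storage classes that AKS creates by default):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc3
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 5Gi

Once all four PVCs are bound, the PVs look like this: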

> kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS      REASON   AGE
pvc-a9d08554-7557-46dd-a2b6-5c269ba7a688   5Gi        RWO            Delete           Bound    default/pvc3          managed-premium            84s
pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea   5Gi        RWO            Delete           Bound    default/pvc1          default                    84s
pvc-afd7dc31-f633-4c99-8b7e-1f59f58f4416   3Gi        RWO            Delete           Bound    default/pvc4          managed-premium            84s
pvc-5e147a8a-5e1f-49c5-9a02-51ebe9ee2807   8Gi        RWO            Delete           Bound    default/pvc2          default                    84s

In my MC_* resource group the disks have already been created. But in which availability zone?

$aks = "<aks-name>"
$rg = "<aks-resource-group>"
# The dynamically provisioned disks live in the cluster's MC_* node resource group
$rgmc = $(az aks show -n $aks -g $rg -ojson | ConvertFrom-Json).nodeResourceGroup
# List each disk together with the availability zone it was created in
az disk list -g $rgmc --query '[][name,zones[0]]'

The output in my case is:

[
  [
    "kubernetes-dynamic-pvc-a9d08554-7557-46dd-a2b6-5c269ba7a688",
    "2"
  ],
  [
    "kubernetes-dynamic-pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea",
    "1"
  ],
  [
    "kubernetes-dynamic-pvc-afd7dc31-f633-4c99-8b7e-1f59f58f4416",
    "2"
  ],
  [
    "kubernetes-dynamic-pvc-5e147a8a-5e1f-49c5-9a02-51ebe9ee2807",
    "1"
  ]
]

In this case, two of the disks were created in availability zone 1 and the other two in availability zone 2. The disk's availability zone is also reflected in the associated PV, which declares a nodeAffinity to ensure that pods using that PV run on a node in that same zone:

> kubectl describe pv pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
Name:              pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
Labels:            failure-domain.beta.kubernetes.io/region=westeurope
                   failure-domain.beta.kubernetes.io/zone=westeurope-1
Annotations:       pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
                   volumehelper.VolumeDynamicallyCreatedByKey: azure-disk-dynamic-provisioner
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      default
Status:            Bound
Claim:             default/pvc1
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          5Gi
Node Affinity:
  Required Terms:
    Term 0:        failure-domain.beta.kubernetes.io/region in [westeurope]
                   failure-domain.beta.kubernetes.io/zone in [westeurope-1]
Message:
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     kubernetes-dynamic-pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
    DiskURI:      /subscriptions/ac529d82-e944-485d-9a9b-4d46d44214da/resourceGroups/mc_cloudiseasy_test_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-62d88cfc-d7ca-451e-abf4-99fb42651eea
    Kind:         Managed
    FSType:
    CachingMode:  ReadOnly
    ReadOnly:     false
Events:           <none>

Notice how the Node Affinity section declares that this PV can only be used from a node whose failure-domain.beta.kubernetes.io/zone label has the value westeurope-1 (availability zone 1, the zone of the disk associated with this PV).

If I now create a pod that uses this PV (through the associated PVC, pvc1 in the example above), the Kubernetes scheduler will make sure that the pod runs on a node in the same availability zone. The result is that the pod is assigned to the node aks-agentpool-65616547-vmss000000, which is the one in availability zone 1:

> kubectl get po -o wide
NAME                         READY   STATUS    RESTARTS   AGE    IP            NODE                                NOMINATED NODE   READINESS GATES
nginxpvc1-5845934b55-megtq   1/1     Running   0          112s   10.242.0.12   aks-agentpool-65616547-vmss000000   <none>           <none>
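
For reference, a deployment of this kind, in the spirit of the deploy-pvc1.yaml used below, might look roughly like this (the nginx image and the labels are assumptions of mine):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginxpvc1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginxpvc1
  template:
    metadata:
      labels:
        app: nginxpvc1
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: pvc1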

What if, for whatever reason, this node could not run the pod? Very simple: it stays in Pending:

> kubectl delete deploy nginxpvc1
deployment.apps "nginxpvc1" deleted
> kubectl taint node aks-agentpool-65616547-vmss000000 donothing:NoSchedule
node/aks-agentpool-65616547-vmss000000 tainted
> kubectl apply -f .\deploy-pvc1.yaml
deployment.apps/nginxpvc1 created
> kubectl get po
NAME                         READY   STATUS    RESTARTS   AGE
nginxpvc1-57458397b55-dhrkd   0/1     Pending   0          58s

And if you run kubectl describe on the pod, the Events section makes it clear that the pod cannot be scheduled because the scheduler cannot find any suitable node.
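
To undo the experiment, removing the taint (note the trailing "-") lets the scheduler place the pod on that node again:

> kubectl taint node aks-agentpool-65616547-vmss000000 donothing:NoSchedule-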

Availability zones and static PVs

If you create a PV linked to a pre-existing disk, you must declare a nodeAffinity in that PV; if you don't, the scheduler may place the pod on any node, and if that node is in a different availability zone, the disk cannot be attached. Let's look at an example. I have a disk called testvhd created in availability zone 3, and I define a PV associated with that disk.
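
A minimal sketch of such a PV follows; the subscription ID and resource group in diskURI are placeholders, and note that there is deliberately no nodeAffinity yet:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: testvhd
spec:
  capacity:
    storage: 8Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-premium
  claimRef:
    namespace: default
    name: testvhd
  azureDisk:
    kind: Managed
    diskName: testvhd
    diskURI: /subscriptions/<subscription-id>/resourceGroups/<mc-resource-group>/providers/Microsoft.Compute/disks/testvhd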

This PV is linked to the disk indicated in diskURI and to the PVC indicated in claimRef.name. Of course, that PVC needs to be created as well.
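
Something along these lines will do; volumeName pins the claim to the pre-existing PV (the names and size mirror the PV above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testvhd
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  volumeName: testvhd
  resources:
    requests:
      storage: 8Gi

Once the PVC is applied, the PV becomes Bound: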

> kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS      REASON   AGE
testvhd                                    8Gi        RWO            Retain           Bound    default/testvhd       managed-premium            108s

Now we can deploy a pod that uses the PVC testvhd and see what happens. Well, two things can happen:

  1. The pod is scheduled, by pure chance, onto a node that is in the same availability zone as the disk (3 in my case). In this case the pod runs without problems. (By default, the scheduler tries to spread pods across the different availability zones.)
  2. The pod is scheduled, by pure chance, onto a node in any other availability zone. In this case the pod stays in ContainerCreating, and in the Events section you will see something like:
Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Normal   Scheduled           72s                default-scheduler        Successfully assigned default/nginxpvctestvhd-5575bcddcb-bbjrh to aks-agentpool-75116936-vmss000001
  Warning  FailedAttachVolume  72s                attachdetach-controller  Multi-Attach error for volume "testvhd" Volume is already used by pod(s) nginxpvctestvhd-5575bcddcb-tgccb
  Warning  FailedAttachVolume  15s (x7 over 49s)  attachdetach-controller  AttachVolume.Attach failed for volume "testvhd" : Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
  "error": {
    "code": "BadRequest",
    "message": "Disk /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/MC_velerotest_velerotest_westeurope/providers/Microsoft.Compute/disks/testvhd cannot be attached to the VM because it is not in the same zone as the VM. VM zone: '2'. Disk zone: '3'."
  }
}

That's because, unlike the PVs that are created automatically, our PV had no node affinity defined, so nothing tells the scheduler which nodes can actually attach the disk. Hence, you must add a nodeAffinity to the definition of the PV: this way, when the scheduler needs to place a pod that uses this PV, it will place it on a node in availability zone 3 and everything will work correctly.
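
In this example, the nodeAffinity block added to the PV spec would look something like this (topology.kubernetes.io/zone is the current label; failure-domain.beta.kubernetes.io/zone would also match on older clusters):

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - westeurope-3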

Conclusion

Using availability zones increases the fault tolerance of your AKS cluster, but it adds an extra dimension to your cluster's topology: once a PV is placed in an availability zone, any pod that uses it must be scheduled in that zone. Therefore, you must also balance the zones well. For example, having only one node per zone may not be a good idea, because if that node goes down, its pods cannot be relocated to any other node.