Run a replicated stateful application using local storage in Kubernetes

This post shows how to run a replicated stateful application on local storage using a StatefulSet controller. This application is a replicated MySQL database. The example topology has a single primary server and multiple replicas, using asynchronous row-based replication. The MySQL data is using a storage class backed by local SSD storage provided by the vSphere CSI driver performing the dynamic PersistentVolume provisioner.

This post continues from the previous post where I described how to setup multi-AZ topology aware volume provisioning with local storage.

I used this example here to setup a StatefulSet with MySQL to get an example application up and running.

However, I did not use the default storage class, but added one line to the mysql-statefulset.yaml file to use the storage class that is backed by local SSDs instead.

from:

    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

to:

    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: k8s-local-nvme
      resources:
        requests:
          storage: 10Gi

I also appended the StatefulSet to include the spec.template.spec.affinity and spec.template.spec.podAntiAffinity settings to make use of the three AZs for pod scheduling.

spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.csi.vmware.com/k8s-zone
                operator: In
                values:
                - az-1
                - az-2
                - az-3
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mysql
            topologyKey: topology.csi.vmware.com/k8s-zone

Everything else stayed the same. Please spend some time reading the example from kubernetes.io as I will be performing the same steps but using local storage instead to test the behavior of MySQL replication.

Architecture

I am using the same setup, with three replicas in the StatefulSet to match with the three AZs that I have setup in my lab.

My AZ layout is the following.

AZESX hostTKG worker
az-1esx1.vcd.labtkg-hugo-md-0-7d455b7488-g28bl
az-2esx2.vcd.labtkg-hugo-md-1-7bbd55cdb8-996×2
az-3esx3.vcd.labtkg-hugo-md-2-6c6c49dc67-xbpg7

We can see which pod runs on which worker using the following command:

k get po -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP               NODE                             NOMINATED NODE   READINESS GATES
mysql-0             2/2     Running   0          3h24m   100.120.135.67   tkg-hugo-md-1-7bbd55cdb8-996x2   <none>           <none>
mysql-1             2/2     Running   0          3h22m   100.127.29.3     tkg-hugo-md-0-7d455b7488-g28bl   <none>           <none>
mysql-2             2/2     Running   0          113m    100.109.206.65   tkg-hugo-md-2-6c6c49dc67-xbpg7   <none>           <none>

To see which PVCs are using which AZs using the CSI driver’s node affinity we can use this command.

kubectl get pv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.claimRef.name}{"\t"}{.spec.nodeAffinity}{"\n"}{end}'
pvc-06f9a40c-9fdf-48e3-9f49-b31ca2faf5a5        data-mysql-1    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-1"]}]}]}}
pvc-1f586db7-12bb-474c-adb6-2f92d44789bb        data-mysql-2    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-3"]}]}]}}
pvc-7108f915-e0e4-4028-8d45-6770b4d5be20        data-mysql-0    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-2"]}]}]}}

We can see that each PV has been allocated to each AZ.

PVClaim NameAZ
pvc-06f9a40c-9fdf-48e3-9f49-b31ca2faf5a5data-mysql-1az-1
pvc-1f586db7-12bb-474c-adb6-2f92d44789bbdata-mysql-2az-3
pvc-7108f915-e0e4-4028-8d45-6770b4d5be20data-mysql-0az-2

So we know which pod and which PV are on which worker node and on which ESX host.

PodPVCWorkerESX HostAZ
mysql-0data-mysql-0tkg-hugo-md-1-7bbd55cdb8-996×2esx2.vcd.labaz-2
mysql-1data-mysql-1tkg-hugo-md-0-7d455b7488-g28blesx1.vcd.labaz-1
mysql-2data-mysql-2tkg-hugo-md-2-6c6c49dc67-xbpg7esx3.vcd.labaz-3

For the rest of this exercise, I will perform the tests on mysql-2, tkg-hugo-md-2 and esx3.vcd.lab, which are all members of az-3.

Show data using mysql-client-loop pod

When all the pods are running we can run the following pod to constantly query the MySQL clusters.

kubectl run mysql-client-loop --image=mysql:5.7 -i -t --rm --restart=Never --\
  bash -ic "while sleep 1; do mysql -h mysql-read -e 'SELECT @@server_id,NOW()'; done"

Which would give us the following result:

+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         100 | 2022-02-06 14:12:57 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         101 | 2022-02-06 14:12:58 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         101 | 2022-02-06 14:12:59 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         100 | 2022-02-06 14:13:00 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         102 | 2022-02-06 14:13:01 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         101 | 2022-02-06 14:13:02 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         102 | 2022-02-06 14:13:03 |
+-------------+---------------------+

The server_id’s are either 100, 101, or 102, referencing either mysql-0, mysql-1 or mysql-2 respectively. We can see that we can read data from all three of the pods which means our MySQL service is running well across all three AZs.

Simulating Pod and Node downtime

To demonstrate the increased availability of reading from the pool of replicas instead of a single server, keep the SELECT @@server_id loop from above running while you force a Pod out of the Ready state.

Delete Pods

The StatefulSet also recreates Pods if they’re deleted, similar to what a ReplicaSet does for stateless Pods.

kubectl delete pod mysql-2

The StatefulSet controller notices that no mysql-2 Pod exists anymore, and creates a new one with the same name and linked to the same PersistentVolumeClaim. You should see server ID 102 disappear from the loop output for a while and then return on its own.

Drain a Node

If your Kubernetes cluster has multiple Nodes, you can simulate Node downtime (such as when Nodes are upgraded) by issuing a drain.

We already know that mysql-2 is running on worker tkg-hugo-md-2. Then drain the Node by running the following command, which cordons it so no new Pods may schedule there, and then evicts any existing Pods.

kubectl drain tkg-hugo-md-2-6c6c49dc67-xbpg7 --force --delete-emptydir-data --ignore-daemonsets

What happens now is the pod mysql-2 will be evicted, it will also have its PVC unattached. Because we only have one worker per AZ, mysql-2 won’t be able to be scheduled on another node in another AZ.

The mysql-client-loop pod would show that 102 (mysql-2) is no longer serving MySQL requests. The pod mysql-2 will stay with a status as pending until a worker is available in AZ2 again.

Perform maintenance on ESX

After draining the worker node, we can now go ahead and perform maintenance operations on the ESX host by placing it into maintenance mode. Doing so will VMotion any VMs that are not using shared storage. You will find that because the worker node is still powered on and has locally attached VMDKs, this will prevent the ESX host from going into maintenance mode.

We know that the worker node is already drained and the MySQL application has two other replicas that are running in two other AZs, so we can safely power off this worker and enable the ESX host to complete going into maintenance mode. Yes, power off instead of gracefully shutting down. Kubernetes worker nodes are cattle and not pets and Kubernetes will destroy it anyway.

Operations with local storage

Consider the following when using local storage with Tanzu Kubernetes Grid.

  1. TKG worker nodes that have been tagged with a k8s-zone and have attached PVs will not be able to VMotion.
  2. TKG worker nodes that have been tagged with a k8s-zone and do not have attached PVs will also not be able to VMotion as they have the affinity rule set to “Must run on this host”.
  3. Placing a ESX host into maintenance mode will not complete until the TKG worker node running on that host has been powered off.

However, do not be alarmed by any of this, as this is normal behavior. Kubernetes workers can be replaced very often and since we have a stateful application with more than one replica, we can do this with no consequences.

The following section shows why this is the case.

How do TKG clusters with local storage handle ESX maintenance?

To perform maintenance on an ESX host that requires a host reboot perform the following.

  • Drain the TKG worker node of the host that you want to place into maintenance mode
kubectl drain <node-name> --force --delete-emptydir-data --ignore-daemonsets

What this does is it evicts all pods but daemonsets, it will also evict the MySQL pod running on this node, including removing the volume mount. In our example here, we still have the other two MySQL pods running on two other worker nodes.

  • Now place the ESX host into maintenance mode.
  • Power off the TKG worker node on this ESX host to allow the host to go into maintenance mode.
  • You might notice that TKG will try to delete that worker node and clone a new worker node on this host, but it cannot due to the host being in maintenance mode. This is normal behavior as any Kubernetes clusters will try to replace a worker that is no longer accessible. This of course is the case as we have powered ours off.
  • You will notice that Kubernetes does not try to create a worker node on any other ESX host. This is because the powered-off worker is labelled with one of the AZs therefore Kubernetes tries to place a new worker in the same AZ.
  • Perform ESX maintenance as normal and when complete exit the host from maintenance mode.
  • When the host exits maintenance mode, you’ll notice that Kubernetes can now delete the powered-off worker and replace it with a new one.
  • When the new worker node powers on and becomes ready, you will notice that the previous PV that was attached to the now deleted worker node is now attached to the new worker node.
  • The MySQL pod will then claim the PV and the pod will start and come out of pending status into ready status.
  • All three MySQL pods are now up and running and we have a healthy MySQL cluster again. Any MySQL data that was changed during this maintenance window will be replicated to the MySQL pod.

Summary

Using local storage backed storage classes with TKG is a viable alternative to using shared storage when your applications can perform data protection and replication at a higher level. Applications such as databases like the MySQL example that I used can benefit from using cheaper locally attached fast solid state media such as SSD or NVMe without the need to create hyperconverged storage environments. Applications that can replicated data at the application level, can avoid using SAN and NAS completely and benefit from simpler infrastructures and lower costs as well as benefiting from faster storage and lower latencies.

Using local storage with Tanzu Kubernetes Grid Topology Aware Volume Provisioning

With the vSphere CSI driver, it is now possible to use local storage with TKG clusters. This is enabled by TKG’s Topology Aware Volume Provisioning capability.

With this model, it is possible to present individual SSDs or NVMe drives attached to an ESXi host and configure a local datastore for use with topology aware volume provisioning. Kubernetes can then create persistent volumes and schedule pods that are deployed onto the worker nodes that are on the same ESXi host as the volume. This enables Kubernetes pods to have direct local access to the underlying storage.

With the vSphere CSI driver version 2.4.1, it is now possible to use local storage with TKG clusters. This is enabled by TKG’s Topology Aware Volume Provisioning capability.

Using local storage has distinct advantages over shared storage, especially when it comes to supporting faster and cheaper storage media for applications that do not benefit from or require the added complexity of having their data replicated by the storage layer. Examples of applications that do not require storage protection (RAID or failures to tolerate) are applications that can achieve data protection at the application level.

With this model, it is possible to present individual SSDs or NVMe drives attached to an ESXi host and configure a local datastore for use with topology aware volume provisioning. Kubernetes can then create persistent volumes and schedule pods that are deployed onto the worker nodes that are on the same ESXi host as the volume. This enables Kubernetes pods to have direct local access to the underlying storage.

Figure 1.

To setup such an environment, it is necessary to go over some of the requirements first.

  1. Deploy Tanzu Kubernetes Clusters to Multiple Availability Zones on vSphere – link
  2. Spread Nodes Across Multiple Hosts in a Single Compute Cluster
  3. Configure Tanzu Kubernetes Plans and Clusters with an overlay that is topology-aware – link
  4. Deploy TKG clusters into a multi-AZ topology
  5. Deploy the k8s-local-ssd storage class
  6. Deploy Workloads with WaitForFirstConsumer Mode in Topology-Aware Environment – link

Before you start

Note that only the CSI driver for vSphere version 2.4.1 supports local storage topology in a multi-AZ topology. To check if you have the correct version in your TKG cluster, run the following.

tanzu package installed get vsphere-csi -n tkg-system
- Retrieving installation details for vsphere-csi... I0224 19:20:29.397702  317993 request.go:665] Waited for 1.03368201s due to client-side throttling, not priority and fairness, request: GET:https://172.16.3.94:6443/apis/secretgen.k14s.io/v1alpha1?timeout=32s
\ Retrieving installation details for vsphere-csi...
NAME:                    vsphere-csi
PACKAGE-NAME:            vsphere-csi.tanzu.vmware.com
PACKAGE-VERSION:         2.4.1+vmware.1-tkg.1
STATUS:                  Reconcile succeeded
CONDITIONS:              [{ReconcileSucceeded True  }]

Deploy Tanzu Kubernetes Clusters to Multiple Availibility Zones on vSphere

In my example, I am using the Spread Nodes Across Multiple Hosts in a Single Compute Cluster example, each ESXi host is an availability zone (AZ) and the vSphere cluster is the Region.

Figure 1. shows a TKG cluster with three worker nodes, each node is running on a separate ESXi host. Each ESXi host has a local SSD drive formatted with VMFS 6. The topology aware volume provisioner would always place pods and their replicas on separate worker nodes and also any persistent volume claims (PVC) on separate ESXi hosts.

ParameterSpecificationvSphere objectDatastore
RegiontagCategory: k8s-regioncluster*
Zone
az-1
az-2
az-3
tagCategory: k8s-zone
host-group-1
host-group-2
host-group-3

esx1.vcd.lab
esx2.vcd.lab
esx3.vcd.lab

esx1-ssd-1
esx2-ssd-1
esx3-ssd-1
Storage Policyk8s-local-ssdesx1-ssd-1
esx2-ssd-1
esx3-ssd-1
TagstagCategory: k8s-storage
tag: k8s-local-ssd
esx1-ssd-1
esx2-ssd-1
esx3-ssd-1

*Note that “cluster” is the name of my vSphere cluster.

Ensure that you’ve set up the correct rules that enforce worker nodes to their respective ESXi hosts. Always use “Must run on hosts in group“, this is very important for local storage topology to work. This is because the worker nodes will be labelled for topology awareness, and if a worker node is vMotion’d accidentally then the CSI driver will not be able to bind the PVC to the worker node.

Below is my vsphere-zones.yaml file.

Note that autoConfigure is set to true. Which means that you do not have to tag the cluster or the ESX hosts yourself, you would only need to setup up the affinity rules under Cluster, Configure, VM/Host Groups and VM/Host Rules. The setting autoConfigure: true, would then make CAPV automatically configure the tags and tag categories for you.

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
 name: az-1
spec:
 region:
   name: cluster
   type: ComputeCluster
   tagCategory: k8s-region
   autoConfigure: true
 zone:
   name: az-1
   type: HostGroup
   tagCategory: k8s-zone
   autoConfigure: true
 topology:
   datacenter: home.local
   computeCluster: cluster
   hosts:
     vmGroupName: workers-group-1
     hostGroupName: host-group-1
   datastore: lun01
   networks:
   - tkg-workload
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
 name: az-2
spec:
 region:
   name: cluster
   type: ComputeCluster
   tagCategory: k8s-region
   autoConfigure: true
 zone:
   name: az-2
   type: HostGroup
   tagCategory: k8s-zone
   autoConfigure: true
 topology:
   datacenter: home.local
   computeCluster: cluster
   hosts:
     vmGroupName: workers-group-2
     hostGroupName: host-group-2
   datastore: lun01
   networks:
   - tkg-workload
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
 name: az-3
spec:
 region:
   name: cluster
   type: ComputeCluster
   tagCategory: k8s-region
   autoConfigure: true
 zone:
   name: az-3
   type: HostGroup
   tagCategory: k8s-zone
   autoConfigure: true
 topology:
   datacenter: home.local
   computeCluster: cluster
   hosts:
     vmGroupName: workers-group-3
     hostGroupName: host-group-3
   datastore: lun01
   networks:
   - tkg-workload
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereDeploymentZone
metadata:
 name: az-1
spec:
 server: vcenter.vmwire.com
 failureDomain: az-1
 placementConstraint:
   resourcePool: tkg-vsphere-workload
   folder: tkg-vsphere-workload
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereDeploymentZone
metadata:
 name: az-2
spec:
 server: vcenter.vmwire.com
 failureDomain: az-2
 placementConstraint:
   resourcePool: tkg-vsphere-workload
   folder: tkg-vsphere-workload
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereDeploymentZone
metadata:
 name: az-3
spec:
 server: vcenter.vmwire.com
 failureDomain: az-3
 placementConstraint:
   resourcePool: tkg-vsphere-workload
   folder: tkg-vsphere-workload

Note that Kubernetes does not like using parameter names that are not standard, I suggest for your vmGroupName and hostGroupName parameters, use lowercase and dashes instead of periods. For example host-group-3, instead of Host.Group.3. The latter will be rejected.

Configure Tanzu Kubernetes Plans and Clusters with an overlay that is topology-aware

To ensure that this topology can be built by TKG, we first need to create a TKG cluster plan overlay that tells Tanzu how what to do when creating worker nodes in a multi-availability zone topology.

Lets take a look at my az-overlay.yaml file.

Since I have three AZs, I need to create an overlay file that includes the cluster plan for all three AZs.

ParameterSpecification
Zone
az-1
az-2
az-3
VSphereMachineTemplate
-worker-0
-worker-1
-worker-2
KubeadmConfigTemplate
-md-0
-md-1
-md-2
#! Please add any overlays specific to vSphere provider under this file.

#@ load("@ytt:overlay", "overlay")
#@ load("@ytt:data", "data")

#@ load("lib/helpers.star", "get_bom_data_for_tkr_name", "get_default_tkg_bom_data", "kubeadm_image_repo", "get_image_repo_for_component", "get_vsphere_thumbprint")

#@ load("lib/validate.star", "validate_configuration")
#@ load("@ytt:yaml", "yaml")
#@ validate_configuration("vsphere")

#@ bomDataForK8sVersion = get_bom_data_for_tkr_name()

#@ if data.values.CLUSTER_PLAN == "dev" and not data.values.IS_WINDOWS_WORKLOAD_CLUSTER:
#@overlay/match by=overlay.subset({"kind":"VSphereCluster"})
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: #@ data.values.CLUSTER_NAME
spec:
  thumbprint: #@ get_vsphere_thumbprint()
  server: #@ data.values.VSPHERE_SERVER
  identityRef:
    kind: Secret
    name: #@ data.values.CLUSTER_NAME

#@overlay/match by=overlay.subset({"kind":"MachineDeployment", "metadata":{"name": "{}-md-0".format(data.values.CLUSTER_NAME)}})
---
spec:
  template:
    spec:
      #@overlay/match missing_ok=True
      #@ if data.values.VSPHERE_AZ_0:
      failureDomain: #@ data.values.VSPHERE_AZ_0
      #@ end
      infrastructureRef:
        name: #@ "{}-worker-0".format(data.values.CLUSTER_NAME)

#@overlay/match by=overlay.subset({"kind":"VSphereMachineTemplate", "metadata":{"name": "{}-worker".format(data.values.CLUSTER_NAME)}})
---
metadata:
  name: #@ "{}-worker-0".format(data.values.CLUSTER_NAME)
spec:
  template:
    spec:
      #@overlay/match missing_ok=True
      #@ if data.values.VSPHERE_AZ_0:
      failureDomain: #@ data.values.VSPHERE_AZ_0
      #@ end
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
  #@overlay/match missing_ok=True
  annotations:
    vmTemplateMoid: #@ data.values.VSPHERE_TEMPLATE_MOID
spec:
  template:
    spec:
      cloneMode:  #@ data.values.VSPHERE_CLONE_MODE
      datacenter: #@ data.values.VSPHERE_DATACENTER
      datastore: #@ data.values.VSPHERE_DATASTORE
      storagePolicyName: #@ data.values.VSPHERE_STORAGE_POLICY_ID
      diskGiB: #@ data.values.VSPHERE_WORKER_DISK_GIB
      folder: #@ data.values.VSPHERE_FOLDER
      memoryMiB: #@ data.values.VSPHERE_WORKER_MEM_MIB
      network:
        devices:
          #@overlay/match by=overlay.index(0)
          #@overlay/replace
          - networkName: #@ data.values.VSPHERE_NETWORK
            #@ if data.values.WORKER_NODE_NAMESERVERS:
            nameservers: #@ data.values.WORKER_NODE_NAMESERVERS.replace(" ", "").split(",")
            #@ end
            #@ if data.values.TKG_IP_FAMILY == "ipv6":
            dhcp6: true
            #@ elif data.values.TKG_IP_FAMILY in ["ipv4,ipv6", "ipv6,ipv4"]:
            dhcp4: true
            dhcp6: true
            #@ else:
            dhcp4: true
            #@ end
      numCPUs: #@ data.values.VSPHERE_WORKER_NUM_CPUS
      resourcePool: #@ data.values.VSPHERE_RESOURCE_POOL
      server: #@ data.values.VSPHERE_SERVER
      template: #@ data.values.VSPHERE_TEMPLATE
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: #@ "{}-md-2".format(data.values.CLUSTER_NAME)
  #@overlay/match missing_ok=True
  annotations:
    vmTemplateMoid: #@ data.values.VSPHERE_TEMPLATE_MOID
spec:
  template:
    spec:
      cloneMode:  #@ data.values.VSPHERE_CLONE_MODE
      datacenter: #@ data.values.VSPHERE_DATACENTER
      datastore: #@ data.values.VSPHERE_DATASTORE
      storagePolicyName: #@ data.values.VSPHERE_STORAGE_POLICY_ID
      diskGiB: #@ data.values.VSPHERE_WORKER_DISK_GIB
      folder: #@ data.values.VSPHERE_FOLDER
      memoryMiB: #@ data.values.VSPHERE_WORKER_MEM_MIB
      network:
        devices:
          #@overlay/match by=overlay.index(0)
          #@overlay/replace
          - networkName: #@ data.values.VSPHERE_NETWORK
            #@ if data.values.WORKER_NODE_NAMESERVERS:
            nameservers: #@ data.values.WORKER_NODE_NAMESERVERS.replace(" ", "").split(",")
            #@ end
            #@ if data.values.TKG_IP_FAMILY == "ipv6":
            dhcp6: true
            #@ elif data.values.TKG_IP_FAMILY in ["ipv4,ipv6", "ipv6,ipv4"]:
            dhcp4: true
            dhcp6: true
            #@ else:
            dhcp4: true
            #@ end
      numCPUs: #@ data.values.VSPHERE_WORKER_NUM_CPUS
      resourcePool: #@ data.values.VSPHERE_RESOURCE_POOL
      server: #@ data.values.VSPHERE_SERVER
      template: #@ data.values.VSPHERE_TEMPLATE
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
  name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
spec:
  clusterName: #@ data.values.CLUSTER_NAME
  replicas: #@ data.values.WORKER_MACHINE_COUNT_1
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
        node-pool: #@ "{}-worker-pool".format(data.values.CLUSTER_NAME)
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
      clusterName: #@ data.values.CLUSTER_NAME
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
      version: #@ data.values.KUBERNETES_VERSION
      #@ if data.values.VSPHERE_AZ_1:
      failureDomain: #@ data.values.VSPHERE_AZ_1
      #@ end
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
  name: #@ "{}-md-2".format(data.values.CLUSTER_NAME)
spec:
  clusterName: #@ data.values.CLUSTER_NAME
  replicas: #@ data.values.WORKER_MACHINE_COUNT_2
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
        node-pool: #@ "{}-worker-pool".format(data.values.CLUSTER_NAME)
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: #@ "{}-md-2".format(data.values.CLUSTER_NAME)
      clusterName: #@ data.values.CLUSTER_NAME
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: #@ "{}-md-2".format(data.values.CLUSTER_NAME)
      version: #@ data.values.KUBERNETES_VERSION
      #@ if data.values.VSPHERE_AZ_2:
      failureDomain: #@ data.values.VSPHERE_AZ_2
      #@ end
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
  namespace: '${ NAMESPACE }'
spec:
  template:
    spec:
      useExperimentalRetryJoin: true
      joinConfiguration:
        nodeRegistration:
          criSocket: /var/run/containerd/containerd.sock
          kubeletExtraArgs:
            cloud-provider: external
            tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
          name: '{{ ds.meta_data.hostname }}'
      preKubeadmCommands:
        - hostname "{{ ds.meta_data.hostname }}"
        - echo "::1         ipv6-localhost ipv6-loopback" >/etc/hosts
        - echo "127.0.0.1   localhost" >>/etc/hosts
        - echo "127.0.0.1   {{ ds.meta_data.hostname }}" >>/etc/hosts
        - echo "{{ ds.meta_data.hostname }}" >/etc/hostname
      files: []
      users:
        - name: capv
          sshAuthorizedKeys:
            - #@ data.values.VSPHERE_SSH_AUTHORIZED_KEY
          sudo: ALL=(ALL) NOPASSWD:ALL
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: #@ "{}-md-2".format(data.values.CLUSTER_NAME)
  namespace: '${ NAMESPACE }'
spec:
  template:
    spec:
      useExperimentalRetryJoin: true
      joinConfiguration:
        nodeRegistration:
          criSocket: /var/run/containerd/containerd.sock
          kubeletExtraArgs:
            cloud-provider: external
            tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
          name: '{{ ds.meta_data.hostname }}'
      preKubeadmCommands:
        - hostname "{{ ds.meta_data.hostname }}"
        - echo "::1         ipv6-localhost ipv6-loopback" >/etc/hosts
        - echo "127.0.0.1   localhost" >>/etc/hosts
        - echo "127.0.0.1   {{ ds.meta_data.hostname }}" >>/etc/hosts
        - echo "{{ ds.meta_data.hostname }}" >/etc/hostname
      files: []
      users:
        - name: capv
          sshAuthorizedKeys:
            - #@ data.values.VSPHERE_SSH_AUTHORIZED_KEY
          sudo: ALL=(ALL) NOPASSWD:ALL
#@ end

Deploy a TKG cluster into a multi-AZ topology

To deploy a TKG cluster that spreads its worker nodes over multiple AZs, we need to add some key value pairs into the cluster config file.

Below is an example for my cluster config file – tkg-hugo.yaml.

The new key value pairs are described in the table below.

ParameterSpecificationDetails
VSPHERE_REGIONk8s-regionMust be the same as the configuration in the vsphere-zones.yaml file
VSPHERE_ZONEk8s-zoneMust be the same as the configuration in the vsphere-zones.yaml file
VSPHERE_AZ_0
VSPHERE_AZ_1
VSPHERE_AZ_2
az-1
az-2
az-3
Must be the same as the configuration in the vsphere-zones.yaml file
WORKER_MACHINE_COUNT3This is the number of worker nodes for the cluster.

The total number of workers are distributed in a round-robin fashion across the number of AZs specified.
A note on WORKER_MACHINE_COUNT when using CLUSTER_PLAN: dev instead of prod.

If you change the az-overlay.yaml @ if data.values.CLUSTER_PLAN == “prod” to @ if data.values.CLUSTER_PLAN == “dev”
Then the WORKER_MACHINE_COUNT reverts to the number of workers for each AZ. So if you set this number to 3, in a three AZ topology, you would end up with a TKG cluster with nine workers!
CLUSTER_CIDR: 100.96.0.0/11
CLUSTER_NAME: tkg-hugo
CLUSTER_PLAN: prod
ENABLE_CEIP_PARTICIPATION: 'false'
ENABLE_MHC: 'true'
IDENTITY_MANAGEMENT_TYPE: none
INFRASTRUCTURE_PROVIDER: vsphere
SERVICE_CIDR: 100.64.0.0/13
TKG_HTTP_PROXY_ENABLED: false
DEPLOY_TKG_ON_VSPHERE7: 'true'
VSPHERE_DATACENTER: /home.local
VSPHERE_DATASTORE: lun02
VSPHERE_FOLDER: /home.local/vm/tkg-vsphere-workload
VSPHERE_NETWORK: /home.local/network/tkg-workload
VSPHERE_PASSWORD: <encoded:snipped>
VSPHERE_RESOURCE_POOL: /home.local/host/cluster/Resources/tkg-vsphere-workload
VSPHERE_SERVER: vcenter.vmwire.com
VSPHERE_SSH_AUTHORIZED_KEY: ssh-rsa <snipped> administrator@vsphere.local
VSPHERE_USERNAME: administrator@vsphere.local
CONTROLPLANE_SIZE: small
WORKER_MACHINE_COUNT: 3
WORKER_SIZE: small
VSPHERE_INSECURE: 'true'
ENABLE_AUDIT_LOGGING: 'true'
ENABLE_DEFAULT_STORAGE_CLASS: 'false'
ENABLE_AUTOSCALER: 'false'
AVI_CONTROL_PLANE_HA_PROVIDER: 'true'
VSPHERE_REGION: k8s-region
VSPHERE_ZONE: k8s-zone
VSPHERE_AZ_0: az-1
VSPHERE_AZ_1: az-2
VSPHERE_AZ_2: az-3

Deploy the k8s-local-ssd Storage Class

Below is my storageclass-k8s-local-ssd.yaml.

Note that parameters.storagePolicyName: k8s-local-ssd, which is the same as the name of the storage policy for the local storage. All three of the local VMFS datastores that are backed by the local SSD drives are members of this storage policy.

Note that the volumeBindingMode is set to WaitForFirstConsumer.

Instead of creating a volume immediately, the WaitForFirstConsumer setting instructs the volume provisioner to wait until a pod using the associated PVC runs through scheduling. In contrast with the Immediate volume binding mode, when the WaitForFirstConsumer setting is used, the Kubernetes scheduler drives the decision of which failure domain to use for volume provisioning using the pod policies.

This guarantees the pod at its volume is always on the same AZ (ESXi host).

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: k8s-local-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  storagePolicyName: k8s-local-ssd

Deploy a workload that uses Topology Aware Volume Provisioning

Below is a statefulset that deploys three pods running nginx. It configures two persistent volumes, one for www and another for log. Both of these volumes are going to be provisioned onto the same ESXi host where the pod is running. The statefulset also runs an initContainer that will download a simple html file from my repo and copy it to the www mount point (/user/share/nginx/html).

You can see under spec.affinity.nodeAffinity how the statefulset uses the topology.

The statefulset then exposes the nginx app using the nginx-service which uses the Gateway API, that I wrote about in a previous blog post.

apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: default
  labels:
    ako.vmware.com/gateway-name: gateway-tkg-workload-vip
    ako.vmware.com/gateway-namespace: default
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx-service
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.csi.vmware.com/k8s-zone
                operator: In
                values:
                - az-1
                - az-2
                - az-3
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: topology.csi.vmware.com/k8s-zone
      terminationGracePeriodSeconds: 10
      initContainers:
      - name: install
        image: busybox
        command:
        - wget
        - "-O"
        - "/www/index.html"
        - https://raw.githubusercontent.com/hugopow/cse/main/index.html
        volumeMounts:
        - name: www
          mountPath: "/www"
      containers:
        - name: nginx
          image: k8s.gcr.io/nginx-slim:0.8
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
            - name: logs
              mountPath: /logs
  volumeClaimTemplates:
    - metadata:
        name: www
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: k8s-local-ssd
        resources:
          requests:
            storage: 2Gi
    - metadata:
        name: logs
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: k8s-local-ssd
        resources:
          requests:
            storage: 1Gi

What if you wanted to use more than three availability zones?

Some notes here on what I experienced during my testing.

The TKG cluster config has the following three lines to specify the names of the AZs that you want to use which will be passed onto the Tanzu CLI to use to deploy your TKG cluster using the ytt overlay file. However, the Tanzu CLI only supports a total of three AZs.

VSPHERE_AZ_0: az-1
VSPHERE_AZ_1: az-2
VSPHERE_AZ_2: az-3

If you wanted to use more than three AZs, then you would have to remove these three lines from the TKG cluster config and change the ytt overlay to not use the VSPHERE_AZ_# variables but to hard code the AZs into the ytt overlay file instead.

To do this replace the following:

      #@ if data.values.VSPHERE_AZ_2:
      failureDomain: #@ data.values.VSPHERE_AZ_0
      #@ end

with the following:

      failureDomain: az-2

and create an additional block of MachineDeployment and KubeadmConfigTemplate for each additional AZ that you need.

Summary

Below are screenshots and the resulting deployed objects after running kubectl apply -f to the above.

kubectl get nodes
NAME                             STATUS   ROLES                  AGE     VERSION
tkg-hugo-md-0-7d455b7488-d6jrl   Ready    <none>                 3h23m   v1.22.5+vmware.1
tkg-hugo-md-1-bc76659f7-cntn4    Ready    <none>                 3h23m   v1.22.5+vmware.1
tkg-hugo-md-2-6bb75968c4-mnrk5   Ready    <none>                 3h23m   v1.22.5+vmware.1

You can see that the worker nodes are distributed across the ESXi hosts as per our vsphere-zones.yaml and also our az-overlay.yaml files.

kubectl get po -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP                NODE                             NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          3h14m   100.124.232.195   tkg-hugo-md-2-6bb75968c4-mnrk5   <none>           <none>
web-1   1/1     Running   0          3h13m   100.122.148.67    tkg-hugo-md-1-bc76659f7-cntn4    <none>           <none>
web-2   1/1     Running   0          3h12m   100.108.145.68    tkg-hugo-md-0-7d455b7488-d6jrl   <none>           <none>

You can see that each pod is placed on a separate worker node.

kubectl get csinodes -o jsonpath='{range .items[*]}{.metadata.name} {.spec}{"\n"}{end}'
tkg-hugo-md-0-7d455b7488-d6jrl {"drivers":[{"allocatable":{"count":59},"name":"csi.vsphere.vmware.com","nodeID":"tkg-hugo-md-0-7d455b7488-d6jrl","topologyKeys":["topology.csi.vmware.com/k8s-region","topology.csi.vmware.com/k8s-zone"]}]}
tkg-hugo-md-1-bc76659f7-cntn4 {"drivers":[{"allocatable":{"count":59},"name":"csi.vsphere.vmware.com","nodeID":"tkg-hugo-md-1-bc76659f7-cntn4","topologyKeys":["topology.csi.vmware.com/k8s-region","topology.csi.vmware.com/k8s-zone"]}]}
tkg-hugo-md-2-6bb75968c4-mnrk5 {"drivers":[{"allocatable":{"count":59},"name":"csi.vsphere.vmware.com","nodeID":"tkg-hugo-md-2-6bb75968c4-mnrk5","topologyKeys":["topology.csi.vmware.com/k8s-region","topology.csi.vmware.com/k8s-zone"]}]}

We can see that the CSI driver has correctly configured the worker nodes with the topologyKeys that enables the topology aware volume provisioning.

kubectl get pvc -o wide
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE     VOLUMEMODE
logs-web-0   Bound    pvc-13cf4150-db60-4c13-9ee2-cbc092dba782   1Gi        RWO            k8s-local-ssd   3h18m   Filesystem
logs-web-1   Bound    pvc-e99cfe33-9fa4-46d8-95f8-8a71f4535b15   1Gi        RWO            k8s-local-ssd   3h17m   Filesystem
logs-web-2   Bound    pvc-6bd51eed-e0aa-4489-ac0a-f546dadcee16   1Gi        RWO            k8s-local-ssd   3h17m   Filesystem
www-web-0    Bound    pvc-8f46420a-41c4-4ad3-97d4-5becb9c45c94   2Gi        RWO            k8s-local-ssd   3h18m   Filesystem
www-web-1    Bound    pvc-c3c9f551-1837-41aa-b24f-f9dc6fdb9063   2Gi        RWO            k8s-local-ssd   3h17m   Filesystem
www-web-2    Bound    pvc-632a9f81-3e9d-492b-847a-9316043a2d47   2Gi        RWO            k8s-local-ssd   3h17m   Filesystem
kubectl get pv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.claimRef.name}{"\t"}{.spec.nodeAffinity}{"\n"}{end}'
pvc-13cf4150-db60-4c13-9ee2-cbc092dba782        logs-web-0      {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-3"]}]}]}}
pvc-632a9f81-3e9d-492b-847a-9316043a2d47        www-web-2       {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-1"]}]}]}}
pvc-6bd51eed-e0aa-4489-ac0a-f546dadcee16        logs-web-2      {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-1"]}]}]}}
pvc-8f46420a-41c4-4ad3-97d4-5becb9c45c94        www-web-0       {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-3"]}]}]}}
pvc-c3c9f551-1837-41aa-b24f-f9dc6fdb9063        www-web-1       {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-2"]}]}]}}
pvc-e99cfe33-9fa4-46d8-95f8-8a71f4535b15        logs-web-1      {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-2"]},{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]}]}]}}

Here we see the placement for the persistent volumes within the AZs and they also align to the right worker node.

k get no tkg-hugo-md-0-7d455b7488-d6jrl -o yaml | grep topology.kubernetes.io/zone:
topology.kubernetes.io/zone: az-1
k get no tkg-hugo-md-1-bc76659f7-cntn4 -o yaml | grep topology.kubernetes.io/zone:
topology.kubernetes.io/zone: az-2
k get no tkg-hugo-md-2-6bb75968c4-mnrk5 -o yaml | grep topology.kubernetes.io/zone:
topology.kubernetes.io/zone: az-3
k get volumeattachments.storage.k8s.io
NAME                                                                   ATTACHER                 PV                                         NODE                             ATTACHED   AGE
csi-476b244713205d0d4d4e13da1a6bd2beec49ac90fbd4b64c090ffba8468f6479   csi.vsphere.vmware.com   pvc-c3c9f551-1837-41aa-b24f-f9dc6fdb9063   tkg-hugo-md-1-bc76659f7-cntn4    true       9h
csi-5a759811557125917e3b627993061912386f4d2e8fb709e85fc407117138b178   csi.vsphere.vmware.com   pvc-8f46420a-41c4-4ad3-97d4-5becb9c45c94   tkg-hugo-md-2-6bb75968c4-mnrk5   true       9h
csi-6016904b0ac4ac936184e95c8ff0b3b8bebabb861a99b822e6473c5ee1caf388   csi.vsphere.vmware.com   pvc-6bd51eed-e0aa-4489-ac0a-f546dadcee16   tkg-hugo-md-0-7d455b7488-d6jrl   true       9h
csi-c5b9abcc05d7db5348493952107405b557d7eaa0341aa4e952457cf36f90a26d   csi.vsphere.vmware.com   pvc-13cf4150-db60-4c13-9ee2-cbc092dba782   tkg-hugo-md-2-6bb75968c4-mnrk5   true       9h
csi-df68754411ab34a5af1c4014db9e9ba41ee216d0f4ec191a0d191f07f99e3039   csi.vsphere.vmware.com   pvc-e99cfe33-9fa4-46d8-95f8-8a71f4535b15   tkg-hugo-md-1-bc76659f7-cntn4    true       9h
csi-f48a7db32aafb2c76cc22b1b533d15d331cd14c2896b20cfb4d659621fd60fbc   csi.vsphere.vmware.com   pvc-632a9f81-3e9d-492b-847a-9316043a2d47   tkg-hugo-md-0-7d455b7488-d6jrl   true       9h

And finally, some other screenshots to show the PVCs in vSphere.

ESX1

ESX2

ESX3