Run a replicated stateful application using local storage in Kubernetes

This post shows how to run a replicated stateful application on local storage using a StatefulSet. The application is a replicated MySQL database. The example topology has a single primary server and multiple replicas, using asynchronous row-based replication. The MySQL data is stored on a storage class backed by local SSD storage, with the vSphere CSI driver acting as the dynamic PersistentVolume provisioner.

This post continues from the previous post, where I described how to set up multi-AZ topology-aware volume provisioning with local storage.

I used the example from kubernetes.io to set up a StatefulSet with MySQL and get an example application up and running.

However, I did not use the default storage class, but added one line to the mysql-statefulset.yaml file to use the storage class that is backed by local SSDs instead.

from:

    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

to:

    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: k8s-local-nvme
      resources:
        requests:
          storage: 10Gi
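
Once the StatefulSet is deployed, a quick check can confirm that the claims were provisioned from the local storage class rather than the default one. The following is a minimal sketch: the function name mysql_pvc_classes is hypothetical, and it assumes the PVCs created from the volumeClaimTemplates carry the app=mysql label.

```shell
# List each MySQL PVC with its storage class and binding phase.
# Assumption: the PVCs are labelled app=mysql (helper name is illustrative).
mysql_pvc_classes() {
  kubectl get pvc -l app=mysql \
    -o custom-columns='NAME:.metadata.name,CLASS:.spec.storageClassName,STATUS:.status.phase'
}

# Usage against a live cluster (not executed here):
#   mysql_pvc_classes
```

In this setup the CLASS column should read k8s-local-nvme for every claim.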

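I also extended the StatefulSet with the spec.template.spec.affinity.nodeAffinity and spec.template.spec.affinity.podAntiAffinity settings to make use of the three AZs for pod scheduling. The node affinity restricts the pods to nodes in the three zones, and the pod anti-affinity ensures that no two mysql pods land in the same zone.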

spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.csi.vmware.com/k8s-zone
                operator: In
                values:
                - az-1
                - az-2
                - az-3
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mysql
            topologyKey: topology.csi.vmware.com/k8s-zone

Everything else stayed the same. Please spend some time reading the example from kubernetes.io as I will be performing the same steps but using local storage instead to test the behavior of MySQL replication.

Architecture

I am using the same setup, with three replicas in the StatefulSet to match the three AZs that I have set up in my lab.

My AZ layout is the following.

AZ	ESX host	TKG worker
az-1	esx1.vcd.lab	tkg-hugo-md-0-7d455b7488-g28bl
az-2	esx2.vcd.lab	tkg-hugo-md-1-7bbd55cdb8-996x2
az-3	esx3.vcd.lab	tkg-hugo-md-2-6c6c49dc67-xbpg7

We can see which pod runs on which worker using the following command:

k get po -o wide

NAME                READY   STATUS    RESTARTS   AGE     IP               NODE                             NOMINATED NODE   READINESS GATES
mysql-0             2/2     Running   0          3h24m   100.120.135.67   tkg-hugo-md-1-7bbd55cdb8-996x2   <none>           <none>
mysql-1             2/2     Running   0          3h22m   100.127.29.3     tkg-hugo-md-0-7d455b7488-g28bl   <none>           <none>
mysql-2             2/2     Running   0          113m    100.109.206.65   tkg-hugo-md-2-6c6c49dc67-xbpg7   <none>           <none>

To see which PV is pinned to which AZ through the node affinity set by the CSI driver, we can use this command.

kubectl get pv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.claimRef.name}{"\t"}{.spec.nodeAffinity}{"\n"}{end}'

pvc-06f9a40c-9fdf-48e3-9f49-b31ca2faf5a5        data-mysql-1    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-1"]}]}]}}
pvc-1f586db7-12bb-474c-adb6-2f92d44789bb        data-mysql-2    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-3"]}]}]}}
pvc-7108f915-e0e4-4028-8d45-6770b4d5be20        data-mysql-0    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"topology.csi.vmware.com/k8s-region","operator":"In","values":["cluster"]},{"key":"topology.csi.vmware.com/k8s-zone","operator":"In","values":["az-2"]}]}]}}

We can see that each PV has been allocated to a different AZ.

PV	Claim Name	AZ
pvc-06f9a40c-9fdf-48e3-9f49-b31ca2faf5a5	data-mysql-1	az-1
pvc-1f586db7-12bb-474c-adb6-2f92d44789bb	data-mysql-2	az-3
pvc-7108f915-e0e4-4028-8d45-6770b4d5be20	data-mysql-0	az-2
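
The table above can also be derived from the command output with a small filter. This is a sketch only: pv_zone_table is a hypothetical helper, and it assumes the node affinity JSON is printed in the same layout as the output shown earlier (key, then operator, then values).

```shell
# Reads lines of "<pv> <claim> <nodeAffinity JSON>" on stdin and prints
# "<pv>\t<claim>\t<zone>". Prints "unknown" when no k8s-zone term is found.
pv_zone_table() {
  awk '{
    zone = "unknown"
    if (match($0, /k8s-zone","operator":"In","values":\["[^"]+"/)) {
      s = substr($0, RSTART, RLENGTH)
      sub(/.*\["/, "", s)   # drop everything up to the opening ["
      sub(/"$/, "", s)      # drop the closing quote
      zone = s
    }
    print $1 "\t" $2 "\t" zone
  }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.claimRef.name}{"\t"}{.spec.nodeAffinity}{"\n"}{end}' | pv_zone_table
```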

So we know which pod and which PV are on which worker node and on which ESX host.

Pod	PVC	Worker	ESX Host	AZ
mysql-0	data-mysql-0	tkg-hugo-md-1-7bbd55cdb8-996x2	esx2.vcd.lab	az-2
mysql-1	data-mysql-1	tkg-hugo-md-0-7d455b7488-g28bl	esx1.vcd.lab	az-1
mysql-2	data-mysql-2	tkg-hugo-md-2-6c6c49dc67-xbpg7	esx3.vcd.lab	az-3

For the rest of this exercise, I will perform the tests on mysql-2, tkg-hugo-md-2 and esx3.vcd.lab, which are all members of az-3.

Show data using mysql-client-loop pod

When all the pods are running, we can run the following pod to continuously query the MySQL cluster.

kubectl run mysql-client-loop --image=mysql:5.7 -i -t --rm --restart=Never --\
  bash -ic "while sleep 1; do mysql -h mysql-read -e 'SELECT @@server_id,NOW()'; done"

Which would give us the following result:

+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         100 | 2022-02-06 14:12:57 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         101 | 2022-02-06 14:12:58 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         101 | 2022-02-06 14:12:59 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         100 | 2022-02-06 14:13:00 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         102 | 2022-02-06 14:13:01 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         101 | 2022-02-06 14:13:02 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         102 | 2022-02-06 14:13:03 |
+-------------+---------------------+

The server IDs are 100, 101, or 102, corresponding to mysql-0, mysql-1, and mysql-2 respectively. We can read data from all three pods, which means our MySQL service is running well across all three AZs.
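
To summarise a longer run of the loop, the output can be tallied per server. The sketch below uses a hypothetical helper, count_server_ids, and assumes that the server ID is 100 plus the StatefulSet ordinal, as the output above suggests. The file name loop.log in the usage note is also an assumption.

```shell
# Reads the client loop's table output on stdin and prints
# "<server_id> <pod> <responses>", one line per server, sorted by ID.
# Assumption: server_id = 100 + pod ordinal (100 -> mysql-0, and so on).
count_server_ids() {
  awk -F'|' '
    $2 ~ /^ *[0-9]+ *$/ { id = $2 + 0; seen[id]++ }   # data rows only
    END { for (id in seen) print id, "mysql-" (id - 100), seen[id] }
  ' | sort -n
}

# Usage (not executed here): save some loop output to loop.log, then
#   count_server_ids < loop.log
```

A server that is missing from the summary, or whose count stops growing between runs, is not answering reads.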

Simulating Pod and Node downtime

To demonstrate the increased availability of reading from the pool of replicas instead of a single server, keep the SELECT @@server_id loop from above running while you force a Pod out of the Ready state.

Delete Pods

The StatefulSet also recreates Pods if they’re deleted, similar to what a ReplicaSet does for stateless Pods.

kubectl delete pod mysql-2

The StatefulSet controller notices that no mysql-2 Pod exists anymore, and creates a new one with the same name and linked to the same PersistentVolumeClaim. You should see server ID 102 disappear from the loop output for a while and then return on its own.
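
One way to confirm that the recreated Pod really comes back on the same volume is to compare the PV bound to its claim before and after the delete. This is a minimal sketch: claim_volume is a hypothetical helper, and the claim name data-mysql-2 is taken from the PV listing earlier in this post.

```shell
# Print the name of the PV bound to a PVC (defaults to data-mysql-2).
claim_volume() {
  kubectl get pvc "${1:-data-mysql-2}" -o jsonpath='{.spec.volumeName}'
}

# Usage against a live cluster (not executed here):
#   before=$(claim_volume)
#   kubectl delete pod mysql-2
#   kubectl get pods -l app=mysql --watch   # Ctrl-C once mysql-2 is 2/2 Running
#   [ "$before" = "$(claim_volume)" ] && echo "mysql-2 kept its volume"
```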

Drain a Node

If your Kubernetes cluster has multiple Nodes, you can simulate Node downtime (such as when Nodes are upgraded) by issuing a drain.

We already know that mysql-2 is running on worker tkg-hugo-md-2. Drain the Node by running the following command, which cordons it so no new Pods may schedule there, and then evicts any existing Pods.

kubectl drain tkg-hugo-md-2-6c6c49dc67-xbpg7 --force --delete-emptydir-data --ignore-daemonsets

The mysql-2 pod is evicted and its PV is detached from the node. Because the PV is pinned to az-3 and we only have one worker per AZ, mysql-2 cannot be scheduled on a node in another AZ.

The mysql-client-loop pod will show that 102 (mysql-2) is no longer serving MySQL requests. The mysql-2 pod will stay in Pending status until a worker is available in az-3 again.
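
To spot the stuck Pod quickly, the default pod listing can be filtered on its STATUS column. The helper name pending_pods is hypothetical, and this is only a sketch that assumes the standard kubectl table layout shown earlier.

```shell
# Reads the default `kubectl get pods` table on stdin and prints the
# names of Pods whose STATUS column is Pending.
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -l app=mysql | pending_pods
#   kubectl describe pod mysql-2   # the Events section shows why scheduling failed
```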

Perform maintenance on ESX

After draining the worker node, we can perform maintenance on the ESX host by placing it into maintenance mode. Doing so will vMotion any VMs that are not using local storage. However, because the worker node is still powered on and has locally attached VMDKs, it cannot be migrated, and this prevents the ESX host from entering maintenance mode.

We know that the worker node is already drained and the MySQL application has two other replicas running in two other AZs, so we can safely power off this worker to let the ESX host finish entering maintenance mode. Yes, power off instead of gracefully shutting down. Kubernetes worker nodes are cattle, not pets, and TKG will replace this one anyway.

Operations with local storage

Consider the following when using local storage with Tanzu Kubernetes Grid.

TKG worker nodes that have been tagged with a k8s-zone and have attached PVs cannot be moved with vMotion.
TKG worker nodes that have been tagged with a k8s-zone and do not have attached PVs also cannot be moved with vMotion, as they have the affinity rule set to “Must run on this host”.
Placing an ESX host into maintenance mode will not complete until the TKG worker node running on that host has been powered off.

However, do not be alarmed by any of this, as this is normal behavior. Kubernetes workers can be replaced very often and since we have a stateful application with more than one replica, we can do this with no consequences.

The following section shows why this is the case.

How do TKG clusters with local storage handle ESX maintenance?

To perform maintenance on an ESX host that requires a host reboot, follow these steps.

Drain the TKG worker node of the host that you want to place into maintenance mode

kubectl drain <node-name> --force --delete-emptydir-data --ignore-daemonsets

This evicts all pods except those managed by DaemonSets. It also evicts the MySQL pod running on this node and unmounts its volume. In our example, we still have the other two MySQL pods running on two other worker nodes.

Now place the ESX host into maintenance mode.
Power off the TKG worker node on this ESX host to allow the host to go into maintenance mode.
You might notice that TKG will try to delete that worker node and clone a new worker node on this host, but it cannot because the host is in maintenance mode. This is normal behavior, as any Kubernetes cluster will try to replace a worker that is no longer accessible, which is the case here because we have powered ours off.
You will notice that Kubernetes does not try to create a worker node on any other ESX host. This is because the powered-off worker is labelled with one of the AZs, so Kubernetes tries to place the new worker in the same AZ.
Perform ESX maintenance as normal and when complete exit the host from maintenance mode.
When the host exits maintenance mode, you’ll notice that Kubernetes can now delete the powered-off worker and replace it with a new one.
When the new worker node powers on and becomes ready, you will notice that the PV that was attached to the now-deleted worker node is attached to the new worker node.
The MySQL pod will then claim the PV, start, and move from Pending to Ready status.
All three MySQL pods are now up and running and we have a healthy MySQL cluster again. Any MySQL data that changed during the maintenance window will be replicated to the returning MySQL pod.
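
The Kubernetes side of the procedure above can be sketched as a few shell helpers. This is an illustration under stated assumptions, not a tested runbook: the function names, the 30-second polling interval and the 15-minute timeout are mine, and the container name mysql in the last usage line is assumed from the kubernetes.io example. The ESX steps (maintenance mode, power off, the maintenance itself) remain manual.

```shell
# Step 1: drain the worker that runs on the ESX host to be serviced.
drain_worker() {
  kubectl drain "$1" --force --delete-emptydir-data --ignore-daemonsets
}

# Succeeds when at least one node in the given zone reports plain "Ready".
# A cordoned or powered-off worker shows "Ready,SchedulingDisabled" or
# "NotReady,SchedulingDisabled" and is therefore not counted.
zone_has_ready_node() {
  kubectl get nodes -l "topology.csi.vmware.com/k8s-zone=$1" --no-headers 2>/dev/null |
    awk '$2 == "Ready" { found = 1 } END { exit !found }'
}

# Final steps: once the host has left maintenance mode, wait for the
# replacement worker in the zone, then for the MySQL Pod to become Ready.
# $1 = zone (for example az-3), $2 = Pod name (for example mysql-2).
wait_for_recovery() {
  until zone_has_ready_node "$1"; do sleep 30; done
  kubectl wait --for=condition=Ready "pod/$2" --timeout=15m
}

# Usage against a live cluster (not executed here):
#   drain_worker tkg-hugo-md-2-6c6c49dc67-xbpg7
#   ... place the host into maintenance mode, power off the worker,
#   ... perform the maintenance, exit maintenance mode ...
#   wait_for_recovery az-3 mysql-2
#   kubectl exec mysql-2 -c mysql -- mysql -h 127.0.0.1 -e 'SHOW SLAVE STATUS\G'
```

The last command is one way to check that the returning replica has caught up with the primary.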

Summary

Using local-storage-backed storage classes with TKG is a viable alternative to shared storage when your applications can perform data protection and replication at a higher level. Databases such as the MySQL example that I used can benefit from cheaper, fast, locally attached solid-state media such as SSD or NVMe without the need to create hyperconverged storage environments. Applications that can replicate data at the application level can avoid SAN and NAS completely and benefit from simpler infrastructure, lower costs, faster storage, and lower latencies.

Author: Hugo Phan
