restic – VMwire

Using Velero with Restic for Kubernetes Data Protection

Velero (formerly Heptio Ark) gives you tools to back up and restore your Kubernetes cluster resources and persistent volumes. You can run Velero with a cloud provider or on-premises.

This works with any Kubernetes cluster, including Tanzu Kubernetes Grid and Kubernetes clusters deployed with Container Service Extension with VMware Cloud Director.

This solution can be used in air-gapped environments where the Kubernetes clusters do not have Internet access and cannot use public services such as Amazon S3 or Tanzu Mission Control Data Protection. These are SaaS services, which are generally unavailable in air-gapped environments.

Overview

Velero lets you:

Take backups of your cluster and restore in case of loss.
Migrate cluster resources to other clusters.
Replicate your production cluster to development and testing clusters.

Velero consists of:

A server that runs on your Kubernetes cluster
A command-line client that runs locally

Install Velero onto your workstation

Download the latest Velero release for your preferred operating system. This is usually the machine where you have your kubectl tools.

https://github.com/vmware-tanzu/velero/releases

Extract the contents.

tar zxvf velero-v1.8.1-linux-amd64.tar.gz

You’ll see a folder structure like the following.

ls -l
total 70252
-rw-r----- 1 phanh users    10255 Mar 10 09:45 LICENSE
drwxr-x--- 4 phanh users     4096 Apr 11 08:40 examples
-rw-r----- 1 phanh users    15557 Apr 11 08:52 values.yaml
-rwxr-x--- 1 phanh users 71899684 Mar 15 02:07 velero

Copy the velero binary to the /usr/local/bin location so it is usable from anywhere.

sudo cp velero /usr/local/bin/velero

sudo chmod +x /usr/local/bin/velero

sudo chmod 755 /usr/local/bin/velero
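
To confirm the binary is usable from anywhere, a quick check like the one below can help. It is a minimal sketch: have_tool is a hypothetical helper name, not part of Velero, and the version check only prints the client version, so no cluster connection is needed.

```shell
# Report whether a CLI tool is reachable via PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

for tool in kubectl velero; do
  if have_tool "$tool"; then
    echo "$tool found at $(command -v "$tool")"
  else
    echo "$tool is not on PATH"
  fi
done

# Print only the client version; this does not contact the cluster.
if have_tool velero; then
  velero version --client-only
fi
```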

If you want to enable bash auto completion, please follow this guide.

Set up an S3 service and bucket

I’m using TrueNAS’ S3-compatible storage in my lab. TrueNAS provides an S3-compatible object storage service and is incredibly easy to set up. You can use other S3-compatible object stores such as Amazon S3. A full list of supported providers can be found here.

Follow these instructions to set up S3 on TrueNAS.

Add a certificate: go to System, Certificates.
Add, Import Certificate, then copy and paste cert.pem and cert.key.
Storage, Pools, click on the three dots next to the pool that will hold the S3 root bucket.
Add a Dataset and give it a name such as s3-storage.
Services, S3, click on the pencil icon.
Configure the service like the example below.

Set up the access key and secret key for this configuration.

access key: AKIAIOSFODNN7EXAMPLE
secret key: wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY

Update DNS to point s3.vmwire.com to 10.92.124.20 (the IP of TrueNAS). Note that this FQDN and IP address need to be accessible from the Kubernetes worker nodes. For example, if you are installing Velero onto Kubernetes clusters in VCD, the worker nodes on the Organization network need to be able to route to your S3 service. If you are a service provider, you can place your S3 service on the services network that is accessible by all tenants in VCD.

Test access

Download and install the S3 browser tool https://s3-browser.en.uptodown.com/windows

Set up the connection to your S3 service using the access key and secret key.

Create a new bucket to store some backups. If you are using Container Service Extension with VCD, create a new bucket for each tenant organization. This ensures multi-tenancy is maintained. I’ve created a new bucket named tenant1, which corresponds to one of my tenant organizations in my VCD environment.

Install Velero into the Kubernetes cluster

You can use the velero-plugin-for-aws and the AWS provider with any S3 API compatible system, including TrueNAS, Cloudian HyperStore and others.

Set up a file named credentials-velero with your access key and secret key details.

vi credentials-velero
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY
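
If you would rather not open an editor, the same file can be written from the shell. This is a minimal sketch: S3_ACCESS_KEY and S3_SECRET_KEY are hypothetical variable names of my choosing, and the defaults are simply the placeholder keys shown above.

```shell
# Write the Velero credentials file without opening an editor.
# S3_ACCESS_KEY and S3_SECRET_KEY are hypothetical variable names; when
# unset, the placeholder keys from this post are used.
write_velero_credentials() {
  target="${1:-./credentials-velero}"
  (
    umask 077   # a newly created file is readable by the owner only
    cat > "$target" <<EOF
[default]
aws_access_key_id = ${S3_ACCESS_KEY:-AKIAIOSFODNN7EXAMPLE}
aws_secret_access_key = ${S3_SECRET_KEY:-wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY}
EOF
  )
  chmod 600 "$target"   # also tighten a file that already existed
}

write_velero_credentials ./credentials-velero
```

Keeping the file owner-readable only is a sensible default, since it holds the secret key in clear text.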

Change your Kubernetes context to the cluster that you want to enable for Velero backups. The Velero CLI will connect to your Kubernetes cluster and deploy all the resources for Velero.

velero install \
    --use-restic \
    --default-volumes-to-restic \
    --use-volume-snapshots=false \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.4.0 \
    --bucket tenant1 \
    --backup-location-config region=default,s3ForcePathStyle="true",s3Url=https://s3.vmwire.com:9000 \
    --secret-file ./credentials-velero

The --use-restic flag in the velero install command above is what deploys Restic alongside the Velero server. See the install overview for more details on the other flags for the install command.

When using Restic on a storage provider that doesn’t have Velero support for snapshots, the --use-volume-snapshots=false flag prevents an unused VolumeSnapshotLocation from being created on installation. The VCD CSI provider does not provide native snapshot capability, which is why Restic is a good option here.

I’ve used the --default-volumes-to-restic flag so that all pod volumes are included in every Velero backup by default, rather than having to opt in each volume individually. Refer to the install overview for details.

Specify the bucket with the --bucket flag. I’m using tenant1 here to correspond to a VCD tenant that will have its own bucket for storing backups of the Kubernetes cluster.

For the --backup-location-config flag, configure your settings like mine and use the s3Url key to point to your S3 object store. If you don’t set it, Velero will use AWS’ public S3 endpoints.
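
Since each tenant gets its own bucket, it can be handy to wrap the install command in a small function that takes the bucket and S3 URL as arguments. The sketch below uses only the flags from the command above; the velero_install_cmd name and the RUN variable are hypothetical. By default it only prints the command so you can review it first.

```shell
# Print (or run) the velero install command for one tenant bucket.
# Set RUN=1 to execute it; otherwise the command is only printed.
velero_install_cmd() {
  bucket="$1"
  s3_url="$2"
  set -- velero install \
    --use-restic \
    --default-volumes-to-restic \
    --use-volume-snapshots=false \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.4.0 \
    --bucket "$bucket" \
    --backup-location-config "region=default,s3ForcePathStyle=true,s3Url=$s3_url" \
    --secret-file ./credentials-velero
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

velero_install_cmd tenant1 https://s3.vmwire.com:9000
```

Switch your Kubernetes context to the right cluster before running it with RUN=1, because Velero installs into whichever cluster the current context points at.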

The Velero server logs from a working deployment look like this:

time="2022-04-11T19:24:22Z" level=info msg="Starting Controller" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.downloadrequest reconciler group=velero.io reconciler kind=DownloadRequest
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=restore logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=backup logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=restic-repo logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=backup-sync logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting workers" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.backupstoragelocation reconciler group=velero.io reconciler kind=BackupStorageLocation worker count=1
time="2022-04-11T19:24:22Z" level=info msg="Starting workers" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.downloadrequest reconciler group=velero.io reconciler kind=DownloadRequest worker count=1
time="2022-04-11T19:24:22Z" level=info msg="Starting workers" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.serverstatusrequest reconciler group=velero.io reconciler kind=ServerStatusRequest worker count=10
time="2022-04-11T19:24:22Z" level=info msg="Validating backup storage location" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:114"
time="2022-04-11T19:24:22Z" level=info msg="Backup storage location valid, marking as available" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:121"
time="2022-04-11T19:25:22Z" level=info msg="Validating backup storage location" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:114"
time="2022-04-11T19:25:22Z" level=info msg="Backup storage location valid, marking as available" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:121"

To see all resources deployed, use this command (k is an alias for kubectl).

k get all -n velero

NAME                          READY   STATUS    RESTARTS   AGE
pod/restic-x6r69              1/1     Running   0          49m
pod/velero-7bc4b5cd46-k46hj   1/1     Running   0          49m

NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/restic   1         1         1       1            1           <none>          49m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/velero   1/1     1            1           49m

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/velero-7bc4b5cd46   1         1         1       49m

Example to test Velero and Restic integration

Please use this link here: https://velero.io/docs/v1.5/examples/#snapshot-example-with-persistentvolumes

You may need to edit the with-pv.yaml manifest if you don’t have a default storage class.
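
To find out whether the cluster has a default storage class, you can look for the "(default)" marker that kubectl prints next to the class name. A minimal sketch follows; has_default_storageclass is a hypothetical helper name.

```shell
# Succeeds when the `kubectl get storageclass` output on stdin marks a
# storage class as the default.
has_default_storageclass() {
  grep -q '(default)'
}

if command -v kubectl >/dev/null 2>&1; then
  if kubectl get storageclass 2>/dev/null | has_default_storageclass; then
    echo "a default storage class exists; with-pv.yaml can be applied as-is"
  else
    echo "no default storage class found; set storageClassName on the PersistentVolumeClaim in with-pv.yaml"
  fi
fi
```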

Useful commands

velero get backup-locations

NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED                  ACCESS MODE   DEFAULT
default   aws        tenant1         Available   2022-04-11 19:26:22 +0000 UTC   ReadWrite     true
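
If you want to check the PHASE column from a script, for example before kicking off a backup, the table can be parsed with awk. This is only a sketch that assumes the column layout shown above; bsl_is_available is a hypothetical helper name.

```shell
# Succeeds when the named backup storage location reports phase "Available".
# Reads the table printed by `velero get backup-locations` on stdin.
bsl_is_available() {
  awk -v name="${1:-default}" '
    NR > 1 && $1 == name && $4 == "Available" { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

if command -v velero >/dev/null 2>&1; then
  if velero get backup-locations 2>/dev/null | bsl_is_available default; then
    echo "backup location is available"
  else
    echo "backup location is not available yet" >&2
  fi
fi
```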

Create a backup example

velero backup create nginx-backup --selector app=nginx

Show backup logs

velero backup logs nginx-backup

Delete a backup

velero delete backup nginx-backup

Show all backups

velero backup get

Back up the VCD PostgreSQL database; see this previous blog post.

velero backup create postgresql --ordered-resources 'statefulsets=vmware-cloud-director/postgresql-primary' --include-namespaces=vmware-cloud-director

Show logs for this backup

velero backup logs postgresql

Describe the postgresql backup

velero backup describe postgresql

Describe volume backups

kubectl -n velero get podvolumebackups -l velero.io/backup-name=nginx-backup -o yaml

apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: PodVolumeBackup
  metadata:
    annotations:
      velero.io/pvc-name: nginx-logs
    creationTimestamp: "2022-04-13T17:55:04Z"
    generateName: nginx-backup-
    generation: 4
    labels:
      velero.io/backup-name: nginx-backup
      velero.io/backup-uid: c92d306a-bc76-47ba-ac81-5b4dae92c677
      velero.io/pvc-uid: cf3bdb2f-714b-47ee-876c-5ed1bbea8263
    name: nginx-backup-vgqjf
    namespace: velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Backup
      name: nginx-backup
      uid: c92d306a-bc76-47ba-ac81-5b4dae92c677
    resourceVersion: "8425774"
    uid: 1fcdfec5-9854-4e43-8bc2-97a8733ee38f
  spec:
    backupStorageLocation: default
    node: node-7n43
    pod:
      kind: Pod
      name: nginx-deployment-66689547d-kwbzn
      namespace: nginx-example
      uid: 05afa981-a6ac-4caf-963b-95750c7a31af
    repoIdentifier: s3:https://s3.vmwire.com:9000/tenant1/restic/nginx-example
    tags:
      backup: nginx-backup
      backup-uid: c92d306a-bc76-47ba-ac81-5b4dae92c677
      ns: nginx-example
      pod: nginx-deployment-66689547d-kwbzn
      pod-uid: 05afa981-a6ac-4caf-963b-95750c7a31af
      pvc-uid: cf3bdb2f-714b-47ee-876c-5ed1bbea8263
      volume: nginx-logs
    volume: nginx-logs
  status:
    completionTimestamp: "2022-04-13T17:55:06Z"
    path: /host_pods/05afa981-a6ac-4caf-963b-95750c7a31af/volumes/kubernetes.io~csi/pvc-cf3bdb2f-714b-47ee-876c-5ed1bbea8263/mount
    phase: Completed
    progress:
      bytesDone: 618
      totalBytes: 618
    snapshotID: 8aa5e473
    startTimestamp: "2022-04-13T17:55:04Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
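
The full YAML is useful for troubleshooting, but for a quick look a one-line summary per volume is often enough. The sketch below uses kubectl custom columns with the field paths visible in the output above; pvb_summary is a hypothetical helper name.

```shell
# Print one summary row per PodVolumeBackup that belongs to a Velero backup.
pvb_summary() {
  kubectl -n velero get podvolumebackups \
    -l "velero.io/backup-name=$1" \
    -o custom-columns=NAME:.metadata.name,VOLUME:.spec.volume,PHASE:.status.phase,DONE:.status.progress.bytesDone,TOTAL:.status.progress.totalBytes
}

if command -v kubectl >/dev/null 2>&1; then
  pvb_summary nginx-backup || echo "could not list podvolumebackups" >&2
fi
```

A volume is fully backed up when PHASE is Completed and DONE matches TOTAL.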

Migrating VMware Cloud Director to Kubernetes

This post summarizes how you can migrate the VMware Cloud Director database from PostgreSQL running in the VCD appliance into a PostgreSQL pod running in Kubernetes, and then create new VCD cells running as pods in Kubernetes to run the VCD services. In short, it turns VCD into a modern application.

I wanted to experiment with VMware Cloud Director to see if it would run in Kubernetes. One of the reasons for this is to reduce resource consumption in my home lab. The VCD appliance is a resource-hungry VM that needs a minimum of 2 vCPUs and 6GB of RAM. Running VCD in Kubernetes reduces this and frees up much-needed RAM for other applications. Running this workload in Kubernetes also brings faster deployment, higher availability, easier lifecycle management and operations, and additional benefits from the ecosystem such as observability tools.

Here’s a view of the current VCD appliance in the portal. 172.16.1.34 is the IP of the appliance, 172.16.1.0/27 is the network for the NSX-T segment that I’ve created for the VCD DMZ network. At the end of this post, you’ll see VCD running in Kubernetes pods with IP addresses assigned by the CNI instead.

Tanzu Kubernetes Grid Shared Services Cluster

I am using a Tanzu Kubernetes Grid cluster set up for shared services. It’s the ideal place to run applications that, in the virtual machine world, would have been running in a traditional vSphere Management Cluster. I also run the Container Service Extension and App Launchpad pods in this cluster.

Step 1. Deploy PostgreSQL with Kubeapps into a Kubernetes cluster

If you have Kubeapps, this is the easiest way to deploy PostgreSQL.

Copy my settings below to create a PostgreSQL database server and the vcloud user and database that are required for the database restore.

Step 1. Alternatively, use Helm directly.

# Create database server using KubeApps or Helm, vcloud user with password

helm repo add bitnami https://charts.bitnami.com/bitnami

# Pull the chart, unzip then edit values.yaml
helm pull bitnami/postgresql
tar zxvf postgresql-11.1.11.tgz

helm install postgresql bitnami/postgresql -f /home/postgresql/values.yaml -n vmware-cloud-director

# Expose postgres service using load balancer
k expose pod -n vmware-cloud-director postgresql-primary-0 --type=LoadBalancer --name postgresql-public

# Get the IP address of the load balancer service
k get svc -n vmware-cloud-director postgresql-public

# Connect to database as postgres user from VCD appliance to test connection
psql --host 172.16.4.70 -U postgres -p 5432

# Type password you used when you deployed postgresql

# Quit
\q

Step 2. Back up the database from the VCD appliance and restore it to the PostgreSQL Kubernetes pod

Log into the VCD appliance using SSH.

# Stop vcd services on all VCD appliances
service vmware-vcd stop

# Back up the database and important files on the VCD appliance
/opt/vmware/appliance/bin/create_backup.sh

# Unzip the zip file into /opt/vmware/vcloud-director/data/transfer/backups

# Restore database using pg_dump backup file. Do this from the VCD appliance as it already has the postgres tools installed.

pg_restore --host 172.16.4.70 -U postgres -p 5432 -C -d postgres /opt/vmware/vcloud-director/data/transfer/backups/vcloud-database.sql

# Edit responses.properties and change the IP address of the database server from the load balancer IP to the assigned FQDN for the postgresql pod, e.g. postgresql-primary.vmware-cloud-director.svc.cluster.local

# Shut down the VCD appliance, it's no longer needed

Step 3. Deploy Helm Chart for VCD

# Pull the Helm Chart
helm pull oci://harbor.vmwire.com/library/vmware-cloud-director

# Uncompress the Helm Chart
tar zxvf vmware-cloud-director-0.5.0.tgz

# Edit the values.yaml to suit your needs

# Deploy the Helm Chart
helm install vmware-cloud-director vmware-cloud-director --version 0.5.0 -n vmware-cloud-director -f /home/vmware-cloud-director/values.yaml

# Wait for about five minutes for the installation to complete

# Monitor logs
k logs -f  -n vmware-cloud-director vmware-cloud-director-0

Known Issues

If you see an error such as:

Error starting application: Unable to create marker file in the transfer spooling area: VfsFile[fileObject=file:///opt/vmware/vcloud-director/data/transfer/cells/4c959d7c-2e3a-4674-b02b-c9bbc33c5828]

This is because the transfer share was created by a different vcloud user on the original VCD appliance. That user has a different Linux user ID, normally 1000 or 1001, so ownership needs to be changed to the new vcloud user.

Run the following commands to resolve this issue:

# Launch a bash session into the VCD pod
k exec -it -n vmware-cloud-director vmware-cloud-director-0 -- /bin/bash

# Change ownership of the transfer share to the vcloud user
chown -R vcloud:vcloud /opt/vmware/vcloud-director/data/transfer

# type exit to quit
exit
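
The ownership change can also be done in one step without an interactive session. This is a sketch that assumes the container user is allowed to change ownership of the share, just as in the bash session above; fix_transfer_ownership is a hypothetical helper name.

```shell
# Change ownership of the transfer share inside a VCD pod in one step.
# The pod name defaults to the first cell used in this post.
fix_transfer_ownership() {
  pod="${1:-vmware-cloud-director-0}"
  kubectl exec -n vmware-cloud-director "$pod" -- \
    chown -R vcloud:vcloud /opt/vmware/vcloud-director/data/transfer
}

if command -v kubectl >/dev/null 2>&1; then
  fix_transfer_ownership vmware-cloud-director-0 ||
    echo "chown failed; check the pod name and cluster access" >&2
fi
```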

Once that’s done, the cell can start and you’ll see the following:

Successfully verified transfer spooling area: VfsFile[fileObject=file:///opt/vmware/vcloud-director/data/transfer]
Cell startup completed in 2m 26s

Accessing VCD

The VCD pod is exposed using a load balancer in Kubernetes. Ports 443 and 8443 are exposed on a single IP, just like how it is configured on the VCD appliance.

Run the following to obtain the new load balancer IP address of VCD.

k get svc -n vmware-cloud-director  vmware-cloud-director

vmware-cloud-director   LoadBalancer   100.64.230.197   172.16.4.71   443:31999/TCP,8443:30016/TCP   16m
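
If you only need the external IP, for example to feed it into a DNS update, a jsonpath query returns just that field. A minimal sketch; lb_ip is a hypothetical helper name, and it assumes the load balancer reports an IP rather than a hostname.

```shell
# Print the external IP that the load balancer assigned to a Service.
# Arguments: namespace, service name.
lb_ip() {
  kubectl get svc -n "$1" "$2" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

if command -v kubectl >/dev/null 2>&1; then
  ip="$(lb_ip vmware-cloud-director vmware-cloud-director 2>/dev/null || true)"
  echo "VCD load balancer IP: ${ip:-not assigned yet}"
fi
```

The same function works for the postgresql-public service created earlier.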

Update your DNS record for both the HTTP and VMRC services to point to this new IP address, e.g., 172.16.4.71.

If everything ran successfully, you should now be able to log into VCD. Here’s my VCD instance that I use for my lab environment which was previously running in a VCD appliance, now migrated over to Kubernetes.

Notice, the old cell is now inactive because it is powered-off. It can now be removed from VCD and deleted from vCenter.

The pod vmware-cloud-director-0 is now running the VCD application. Notice its assigned IP address of 100.107.74.159. This is the pod’s IP address.

Everything else will work as normal. Any UI customizations and TLS certificates are kept just as they were before the migration, because we restored the database and used responses.properties to add the new cells.

Even opening a remote console to a VM will continue to work.

Load Balancer is NSX Advanced LB (Avi)

Avi provides the load balancing services automatically through the Avi Kubernetes Operator (AKO).

AKO automatically configures the services in Avi for you when services are exposed.

Deploy another VCD cell, I mean pod

It is now very easy to scale VCD by deploying additional replicas.

Edit the values.yaml file and change the replicas number from 1 to 2.

# Upgrade the Helm Chart
helm upgrade vmware-cloud-director vmware-cloud-director --version 0.5.0 -n vmware-cloud-director -f /home/vmware-cloud-director/values.yaml

# Wait for about five minutes for the upgrade to complete

# Monitor logs
k logs -f  -n vmware-cloud-director vmware-cloud-director-1
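
Instead of watching the clock, you can let kubectl wait until the new replica is ready. This sketch assumes the statefulset is named vmware-cloud-director, as it is in my deployment; wait_for_vcd_rollout is a hypothetical helper name and the 10 minute default timeout is an arbitrary choice.

```shell
# Block until all replicas of the VCD statefulset are ready, or time out.
# Optional argument: timeout (defaults to 10m).
wait_for_vcd_rollout() {
  kubectl rollout status statefulset/vmware-cloud-director \
    -n vmware-cloud-director --timeout="${1:-10m}"
}

if command -v kubectl >/dev/null 2>&1; then
  wait_for_vcd_rollout 10m || echo "rollout did not finish in time" >&2
fi
```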

When the VCD services start up successfully, you’ll notice that the cell will appear in the VCD UI and Avi is also updated automatically with another pool.

We can also see that Avi is load balancing traffic across the two pods.

Deploy as many replicas as you like.

Resource usage

Here’s a very brief overview of what we have deployed so far.

Notice that the two PostgreSQL pods together are using only 700 MB of RAM. The VCD pods are consuming much more, but it is still a vast improvement over the 6GB that one appliance needed previously.

High Availability

You can ensure that the VCD pods are scheduled on different Kubernetes worker nodes by using a multi-availability-zone topology. To do this, just change the values.yaml.

# Availability zones in deployment.yaml are set up for TKG and must match VsphereFailureDomain and VsphereDeploymentZones
availabilityZones:
  enabled: true

This ensures that when you scale up the vmware-cloud-director statefulset, Kubernetes does not place two of its pods on the same worker node.

As you can see from the Kubernetes Dashboard output under Resource usage above, vmware-cloud-director-0 and vmware-cloud-director-1 pods are scheduled on different worker nodes.

More importantly, you can see that I have also used the same for the postgresql-primary-0 and postgresql-read-0 pods. These are really important to keep separate in case of failure of a worker node or of an ESX server that the worker node runs on.
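
If you don't have the Kubernetes Dashboard handy, the same check can be done from the command line. The sketch below lists each pod with its node and prints any node that hosts more than one pod with a given name prefix; nodes_with_multiple_pods is a hypothetical helper name. No output means the pods are spread out.

```shell
# Print every node that hosts more than one pod whose name starts with the
# given prefix. Expects "POD NODE" pairs on stdin, one per line.
nodes_with_multiple_pods() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 { count[$2]++ }
    END { for (n in count) if (count[n] > 1) print n }
  '
}

if command -v kubectl >/dev/null 2>&1; then
  pods="$(kubectl get pods -n vmware-cloud-director --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName 2>/dev/null || true)"
  for prefix in vmware-cloud-director- postgresql-; do
    printf '%s\n' "$pods" | nodes_with_multiple_pods "$prefix"
  done
fi
```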

Finally

Here are a few screenshots of VCD, CSE and ALP all running in my Shared Services Kubernetes cluster.

Backing up the PostgreSQL database

For Day 2 operations such as backing up the PostgreSQL database, you can use Velero or just take a backup of the database using the pg_dump tool.

Backing up the database with pg_dump using a Docker container

It’s super easy to take a database backup using a Docker container. Just make sure you have Docker running on your workstation and that it can reach the load balancer IP address for the PostgreSQL service.

# No -it here: allocating a TTY would add carriage returns to the redirected dump
docker run --rm -e PGPASSWORD='Vmware1!' postgres:14.2 pg_dump -h 172.16.4.70 -U postgres vcloud > backup.sql

The command will create a file in the current working directory named backup.sql.
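
For regular backups, a timestamped file name avoids overwriting the previous dump, and a quick size check catches a dump that failed silently. This is a minimal sketch; backup_filename and dump_looks_valid are hypothetical helper names, and the usage in the comments assumes the same host, user, and database as above with PGPASSWORD exported in your shell.

```shell
# Build a timestamped file name so repeated dumps do not overwrite each other.
backup_filename() {
  echo "vcloud-$(date +%Y%m%d-%H%M%S).sql"
}

# Succeeds only when the dump file exists and is not empty.
dump_looks_valid() {
  [ -s "$1" ]
}

# Usage:
#   out="$(backup_filename)"
#   docker run --rm -e PGPASSWORD postgres:14.2 \
#     pg_dump -h 172.16.4.70 -U postgres vcloud > "$out"
#   dump_looks_valid "$out" || echo "dump is empty" >&2
```

Passing -e PGPASSWORD without a value makes Docker forward the variable from your shell, which keeps the password out of the command line.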

Backing up the database with Velero

Please see this other post on how to set up Velero and Restic to back up Kubernetes pods and persistent volumes.

To create a backup of the PostgreSQL database using Velero run the following command.

velero backup create postgresql --ordered-resources 'statefulsets=vmware-cloud-director/postgresql-primary' --include-namespaces=vmware-cloud-director

Describe the backup

velero backup describe postgresql

Show backup logs

velero backup logs postgresql

To delete the backup

velero backup delete postgresql