Best practices for installing CSE 4.0

Container Service Extension 4 was released recently. This post aims to ease the setup of CSE 4.0, which has a different deployment model: it uses the Solutions framework instead of deploying the CSE appliance into the traditional management cluster that service providers use to run VMware management components such as vCenter, NSX-T Managers, Avi Controllers and other management systems.

Step 1 – Create a CSE Service Account

Perform these steps using the administrator@system account or an equivalent system administrator role.

Setup a Service Account in the Provider (system) organization with the role CSE Admin Role.

In my environment I created a user to use as a service account named svc-cse. You’ll notice that this user has been assigned the CSE Admin Role.

The CSE Admin Role is created automatically by CSE when you use the CSE Management UI as a Provider administrator; just perform those steps while logged in as administrator@system.

Step 2 – Create a token for the Service Account

Log out of VCD and log back into the Provider organization as the service account you created in Step 1 above. Once logged in, it should look like the following screenshot; notice that the svc-cse user is logged into the Provider organization.

Click on the downward arrow at the top right of the screen, next to the user svc-cse and select User Preferences.

Under Access Tokens, create a new token and copy it to a safe place. You will use this token when you deploy the CSE appliance later.
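
If you want to sanity-check the token before deploying the appliance, you can exchange it for an access token against the provider OAuth endpoint. This is a minimal sketch using curl; it assumes VCD 10.3.1 or later and a VCD endpoint of vcd.example.com, so substitute your own FQDN and the token value you just copied.

# Exchange the API token (a refresh token) for a short-lived access token
curl -sk -X POST "https://vcd.example.com/oauth/provider/token" \
  -H "Accept: application/json" \
  -d "grant_type=refresh_token&refresh_token=<token-from-user-preferences>"

A JSON response containing an access_token confirms the token is valid for the svc-cse service account.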

Log out of VCD and log back in to the Provider organization as administrator@system.

Step 3 – Deploy CSE appliance

Create a new tenant Organization where you will run CSE. This new organization is dedicated to VCD extensions such as CSE and is managed by the service provider.

For example, you can name this new organization something like “solutions-org”. Create an Org VDC within this organization, along with the necessary network infrastructure such as a T1 router and an organization network with internet access.

Still logged into the Provider organization, open another tab by clicking on the Open in Tenant Portal link to your “solutions-org” organization. You must deploy the CSE vApp as a Provider.

Now you can deploy the CSE vApp.

Use the Add vApp From Catalog workflow.

Accept the EULA and continue with the workflow.

When you get to Step 8 of the Create vApp from Template workflow, ensure that you set up the OVF properties as in my screenshot below:

The important thing is to ensure that you use the correct service account username and the token from Step 2 above.

Also, since the service account must live in the Provider organization, leave the CSE service account's org set to the default system organization.

The last value is very important: it must be set to the tenant organization that will run the CSE appliance, in our case the “solutions-org” org.

Once the OVA is deployed you can boot it up, or if you want to customize the root password, do so before you start the vApp. Otherwise, the default credentials are root and vmware.

Rights required for deploying TKG clusters

Ensure that the user that is logged into a tenant organization has the correct rights to deploy a TKG cluster. This user must have, at a minimum, the rights in the Kubernetes Cluster Author Global Role.

App LaunchPad

You’ll also need to upgrade App Launchpad to the latest version alp-2.1.2-20764259 to support CSE 4.0 deployed clusters.

Also ensure that the App-Launchpad-Service role has the rights to manage CAPVCD clusters.

Otherwise you may encounter errors when App Launchpad tries to manage these clusters.

VCD API Protected by Web Application Firewalls

If you are using a web application firewall (WAF) in front of your VCD cells and you are blocking access to the provider side APIs, you will need to add the SNAT IP address of the solutions-org T1 to the WAF whitelist.

The CSE appliance will need access to the VCD provider side APIs.
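
A quick way to confirm that the provider side APIs are reachable from the solutions-org network is to run a simple check from a VM on that network (or from the CSE appliance console once it is up). A minimal sketch, assuming your VCD endpoint is vcd.example.com:

# Should return the SupportedVersions XML if the API is reachable through the WAF
curl -sk https://vcd.example.com/api/versions | head

If this request is blocked by the WAF, CSE will not be able to talk to VCD.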

I wrote about using a WAF in front of VCD in the past to protect provider side APIs. You can read those posts here and here.

VMware Cloud Director, Container Service Extension and App Launchpad Running in Kubernetes

I’ve been experimenting with the VMware Cloud Director, Container Service Extension and App Launchpad applications and wanted to test if these applications would run in Kubernetes.

The short answer is yes!

I initially deployed these apps as a standalone Docker container to see if they would run as a container. I wanted to eventually get them to run in a Kubernetes cluster to benefit from all the goodies that Kubernetes provides.

Packaging the apps wasn’t too difficult; it just needed patience and a lot of Googling. The process was as follows:

  • start with a base Linux image: CentOS for VCD and Photon OS for ALP and CSE.
  • prepare all the prerequisites, such as running yum update or tdnf update.
  • commit the image to a Harbor registry.
  • build a Helm chart to deploy the applications using those images, and create a shell script that runs when the container starts to install and launch the applications.

Well, it’s not that simple, but you can take a look at the code for all three Helm charts on my GitHub or pull them from my public Harbor repository.

VMware Cloud Director

Github: https://github.com/hugopow/vmware-cloud-director

Helm Chart: helm pull oci://harbor.vmwire.com/library/vmware-cloud-director

How to install: Update values.yaml and then run

helm install vmware-cloud-director oci://harbor.vmwire.com/library/vmware-cloud-director --version 0.5.0 -n vmware-cloud-director

Notice how easy that was to install?

The values.yaml file is the only file you’ll need to edit; just update it to suit your environment.

# Default values for vmware-cloud-director.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

replicaCount: 1

installFirstCell:
  enabled: true

installAdditionalCell:
  enabled: false

storageClass: iscsi
pvcCapacity: 2Gi

vcdNfs:
  server: 10.92.124.20
  mountPath: /mnt/nvme/vcd-k8s

vcdSystem:
  user: administrator
  password: Vmware1!
  email: admin@domain.local
  systemName: VCD
  installationId: 1

postgresql:
  dbHost: postgresql.vmware-cloud-director.svc.cluster.local
  dbName: vcloud
  dbUser: vcloud
  dbPassword: Vmware1!

# Availability zones in deployment.yaml are setup for TKG and must match VsphereFailureDomain and VsphereDeploymentZones
availabilityZones:
  enabled: false

httpsService:
  type: LoadBalancer
  port: 443

consoleProxyService:
  port: 8443

publicAddress:
  uiBaseUri: https://vcd-k8s.vmwire.com
  uiBaseHttpUri: http://vcd-k8s.vmwire.com
  restapiBaseUri: https://vcd-k8s.vmwire.com
  restapiBaseHttpUri: http://vcd-k8s.vmwire.com
  consoleProxy: vcd-vmrc.vmwire.com

tls:
  certFullChain: |-
    -----BEGIN CERTIFICATE-----
          wildcard certificate
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
          intermediate certificate
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
          root certificate
    -----END CERTIFICATE-----
  certKey: |-
    -----BEGIN PRIVATE KEY-----
          wildcard certificate private key
    -----END PRIVATE KEY-----

The installation process is quite fast: less than three minutes to get the first pod up and running and two minutes for each subsequent pod. That means a VCD multi-cell system can be up and running in less than ten minutes.

I’ve deployed VCD as a StatefulSet with three replicas. Since the replica count is set to three, three VCD “pods” are deployed; in the old world these would be the cells. Here you can see three pods running, which provides both load balancing and high availability. The other pod is the PostgreSQL database that these cells use. You should also be able to see that Kubernetes has scheduled each pod on a different worker node; I have three worker nodes in this Kubernetes cluster.

Below is the view in VCD of the three cells.

The StatefulSet also has a LoadBalancer service configured for performing the load balancing of the HTTP and Console Proxy traffic on TCP 443 and TCP 8443 respectively.

You can see the LoadBalancer service has configured the services for HTTP and the Console Proxy. Note that this is done automatically by Kubernetes using a manifest in the Helm chart.
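
If you prefer the CLI to screenshots, the same information can be pulled with kubectl. A quick check, assuming the release was installed into the vmware-cloud-director namespace as in the helm install command above:

# The cells run as a StatefulSet; -o wide shows which worker node each pod landed on
kubectl -n vmware-cloud-director get statefulset,pods -o wide

# The LoadBalancer service exposes HTTPS (443) and the console proxy (8443)
kubectl -n vmware-cloud-director get svc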

Migrating an existing VCD instance to Kubernetes

If you want to migrate an existing VCD instance to Kubernetes, see this post here.

Container Service Extension

Github: https://github.com/hugopow/container-service-extension

Helm Chart: helm pull oci://harbor.vmwire.com/library/container-service-extension

How to install: Update values.yaml and then run helm install container-service-extension oci://harbor.vmwire.com/library/container-service-extension --version 0.2.0 -n container-service-extension
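
To confirm the chart deployed cleanly, here is a quick check with kubectl, assuming the release went into the container-service-extension namespace as in the command above:

kubectl -n container-service-extension get deployment,pods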

Here’s CSE running as a pod in Kubernetes. Since CSE is a stateless application, I’ve configured it to run as a Deployment.

CSE also does not need a database, as it communicates with VCD purely through a message bus such as MQTT or RabbitMQ. Additionally, no external access to CSE is required because everything is done via VCD, so no load balancer is needed either.

You can see that when CSE is idle it only needs 1 millicore of CPU and 102 MiB of RAM. This is much better in terms of resource requirements than running CSE in a VM, and is one of the advantages of running pods vs VMs: pods use considerably fewer resources than VMs.

App Launchpad

Github: https://github.com/hugopow/app-launchpad

Helm Chart: helm pull oci://harbor.vmwire.com/library/app-launchpad

How to install: Update values.yaml and then run helm install app-launchpad oci://harbor.vmwire.com/library/app-launchpad --version 0.4.0 -n app-launchpad

The values.yaml file is the only file you’ll need to edit; just update it to suit your environment.

# Default values for app-launchpad.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

alpConnect:
  saUser: "svc-alp"
  saPass: Vmware1!
  url: https://vcd-k8s.vmwire.com
  adminUser: administrator@system
  adminPass: Vmware1!
  mqtt: true
  eula: accept
# If you accept the EULA then type "accept" in the eula key value to install ALP. You can find the EULA in the README.md file.

I’ve already written an article about ALP here. That article contains a lot more details so I’ll share a few screenshots below for ALP.

Just like CSE, ALP is a stateless application and is deployed as a Deployment. ALP also does not require external access through a load balancer as it too communicates with VCD using the MQTT or RabbitMQ message bus.

You can see that ALP, when idle, requires just 3 millicores of CPU and 400 MiB of RAM.

ALP can be deployed with multiple instances to provide load balancing and high availability. This is done by deploying RabbitMQ and connecting ALP and VCD to the same exchange. Note that VCD does not support multiple instances of ALP if MQTT is used.

When RabbitMQ is configured, ALP can be scaled by changing the Deployment's replica count to two or more; Kubernetes will then deploy additional ALP pods, as shown in the sketch below.
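
As a sketch of what that looks like, assuming the chart creates a Deployment named app-launchpad in the app-launchpad namespace (check the actual name with kubectl get deployments -n app-launchpad first):

# Scale ALP out to two replicas once RabbitMQ is in place
kubectl -n app-launchpad scale deployment app-launchpad --replicas=2

# Watch the new pod come up
kubectl -n app-launchpad get pods -w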

Using Velero with Restic for Kubernetes Data Protection

Overview

Velero (formerly Heptio Ark) gives you tools to back up and restore your Kubernetes cluster resources and persistent volumes. You can run Velero with a cloud provider or on-premises. Velero lets you:

  • Take backups of your cluster and restore in case of loss.
  • Migrate cluster resources to other clusters.
  • Replicate your production cluster to development and testing clusters.

Velero consists of:

  • A server that runs on your Kubernetes cluster
  • A command-line client that runs locally

Velero works with any Kubernetes cluster, including Tanzu Kubernetes Grid and Kubernetes clusters deployed using Container Service Extension with VMware Cloud Director.

This solution can be used for air-gapped environments where the Kubernetes clusters do not have Internet access and cannot use public services such as Amazon S3, or Tanzu Mission Control Data Protection. These services are SaaS services which are pretty much out of bounds in air-gapped environments.

Install Velero onto your workstation

Download the latest Velero release for your preferred operating system onto your workstation; this is usually the machine where you have your kubectl tools.

https://github.com/vmware-tanzu/velero/releases

Extract the contents.

tar zxvf velero-v1.8.1-linux-amd64.tar.gz

You’ll see a folder structure like the following.

ls -l
total 70252
-rw-r----- 1 phanh users    10255 Mar 10 09:45 LICENSE
drwxr-x--- 4 phanh users     4096 Apr 11 08:40 examples
-rw-r----- 1 phanh users    15557 Apr 11 08:52 values.yaml
-rwxr-x--- 1 phanh users 71899684 Mar 15 02:07 velero

Copy the velero binary to the /usr/local/bin location so it is usable from anywhere.

sudo cp velero /usr/local/bin/velero

sudo chmod +x /usr/local/bin/velero

sudo chmod 755 /usr/local/bin/velero

If you want to enable bash auto completion, please follow this guide.
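
If your Velero version includes the completion subcommand, the binary can generate the script itself. A minimal sketch for bash, assuming bash-completion is already installed:

# Load completion for the current shell session
source <(velero completion bash)

# Or persist it for future sessions
velero completion bash | sudo tee /etc/bash_completion.d/velero > /dev/null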

Setup an S3 service and bucket

I’m using TrueNAS’ S3-compatible storage in my lab. TrueNAS provides an S3-compliant object store and is incredibly easy to set up. You can use other S3-compatible object stores such as Amazon S3. A full list of supported providers can be found here.

Follow these instructions to setup S3 on TrueNAS.

  1. Add a certificate: go to System, Certificates.
  2. Click Add, Import Certificate, then copy and paste cert.pem and cert.key.
  3. Go to Storage, Pools, and click the three dots next to the pool that will hold the S3 root bucket.
  4. Add a Dataset and give it a name such as s3-storage.
  5. Go to Services, S3, and click the pencil icon.
  6. Set it up like the example below.

Setup the access key and secret key for this configuration.

access key: AKIAIOSFODNN7EXAMPLE
secret key: wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY

Update DNS to point s3.vmwire.com to 10.92.124.20 (the IP of TrueNAS). Note that this FQDN and IP address need to be accessible from the Kubernetes worker nodes. For example, if you are installing Velero onto Kubernetes clusters in VCD, the worker nodes on the organization network need to be able to route to your S3 service. If you are a service provider, you can place your S3 service on the services network that is accessible by all tenants in VCD.
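
Before moving on, it is worth confirming reachability from inside the cluster itself. A quick test, assuming the S3 endpoint above listens on port 9000 (the same port used in the velero install command later):

# Run a throwaway pod and probe the S3 endpoint from within the cluster
kubectl run s3-test --rm -it --restart=Never --image=curlimages/curl --command -- curl -skI https://s3.vmwire.com:9000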

Test access

Download and install the S3 browser tool https://s3-browser.en.uptodown.com/windows

Setup the connection to your S3 service using the access key and secret key.

Create a new bucket to store some backups. If you are using Container Service Extension with VCD, create a new bucket for each tenant organization; this ensures multi-tenancy is maintained. I’ve created a new bucket named tenant1, which corresponds to one of my tenant organizations in my VCD environment.
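
If you prefer to script bucket creation per tenant rather than use a GUI, the standard AWS CLI works against any S3-compatible endpoint. A sketch, assuming the AWS CLI is installed and configured with the access key and secret key from above; add --no-verify-ssl if your certificate is self-signed:

# Create one bucket per tenant organization against the TrueNAS endpoint
aws s3 mb s3://tenant1 --endpoint-url https://s3.vmwire.com:9000

# List buckets to confirm
aws s3 ls --endpoint-url https://s3.vmwire.com:9000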

Install Velero into the Kubernetes cluster

You can use the velero-plugin-for-aws and the AWS provider with any S3 API compatible system; this includes TrueNAS, Cloudian HyperStore and others.

Set up a file named credentials-velero containing your access key and secret key details.

vi credentials-velero
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY

Change your Kubernetes context to the cluster that you want to enable for Velero backups. The Velero CLI will connect to your Kubernetes cluster and deploy all the resources for Velero.

velero install \
    --use-restic \
    --default-volumes-to-restic \
    --use-volume-snapshots=false \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.4.0 \
    --bucket tenant1 \
    --backup-location-config region=default,s3ForcePathStyle="true",s3Url=https://s3.vmwire.com:9000 \
    --secret-file ./credentials-velero

To install Restic, use the --use-restic flag in the velero install command. See the install overview for more details on other flags for the install command.

velero install --use-restic

When using Restic on a storage provider that doesn’t have Velero snapshot support, the --use-volume-snapshots=false flag prevents an unused VolumeSnapshotLocation from being created on installation. The VCD CSI provider does not offer native snapshot capability, which is why Restic is a good option here.

I’ve enabled the default behavior of including all persistent volumes in pod backups by running the velero install command with the --default-volumes-to-restic flag. Refer to the install overview for details.

Specify the bucket with the --bucket flag; I’m using tenant1 here to correspond to a VCD tenant that has its own bucket for storing backups from its Kubernetes clusters.

For the --backup-location-config flag, configure your settings like mine and use the s3Url setting to point to your S3 object store; if you don’t set this, Velero will use AWS’ public S3 URIs.

A working deployment looks like this:

time="2022-04-11T19:24:22Z" level=info msg="Starting Controller" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.downloadrequest reconciler group=velero.io reconciler kind=DownloadRequest
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=restore logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=backup logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=restic-repo logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting controller" controller=backup-sync logSource="pkg/controller/generic_controller.go:76"
time="2022-04-11T19:24:22Z" level=info msg="Starting workers" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.backupstoragelocation reconciler group=velero.io reconciler kind=BackupStorageLocation worker count=1
time="2022-04-11T19:24:22Z" level=info msg="Starting workers" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.downloadrequest reconciler group=velero.io reconciler kind=DownloadRequest worker count=1
time="2022-04-11T19:24:22Z" level=info msg="Starting workers" logSource="/go/pkg/mod/github.com/bombsimon/logrusr@v1.1.0/logrusr.go:111" logger=controller.serverstatusrequest reconciler group=velero.io reconciler kind=ServerStatusRequest worker count=10
time="2022-04-11T19:24:22Z" level=info msg="Validating backup storage location" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:114"
time="2022-04-11T19:24:22Z" level=info msg="Backup storage location valid, marking as available" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:121"
time="2022-04-11T19:25:22Z" level=info msg="Validating backup storage location" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:114"
time="2022-04-11T19:25:22Z" level=info msg="Backup storage location valid, marking as available" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:121"

To see all resources deployed, use this command.

k get all -n velero
NAME                          READY   STATUS    RESTARTS   AGE
pod/restic-x6r69              1/1     Running   0          49m
pod/velero-7bc4b5cd46-k46hj   1/1     Running   0          49m

NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/restic   1         1         1       1            1           <none>          49m

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/velero   1/1     1            1           49m

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/velero-7bc4b5cd46   1         1         1       49m

Example to test Velero and Restic integration

Please use this link here: https://velero.io/docs/v1.5/examples/#snapshot-example-with-persistentvolumes

You may need to edit the with-pv.yaml manifest if you don’t have a default storage class.
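
If your cluster does have a suitable storage class but it is not marked as the default, you can either edit with-pv.yaml to reference it explicitly or mark it as the default. A sketch of the latter, assuming the storage class is named iscsi as in my environment:

kubectl patch storageclass iscsi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'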

Useful commands

velero get backup-locations
NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED                  ACCESS MODE   DEFAULT
default   aws        tenant1          Available   2022-04-11 19:26:22 +0000 UTC   ReadWrite     true

Create a backup example

velero backup create nginx-backup --selector app=nginx

Show backup logs

velero backup logs nginx-backup

Delete a backup

velero delete backup nginx-backup

Show all backups

velero backup get
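
Restore from an existing backup (a sketch, assuming the nginx-backup created earlier; Velero restores into the original namespaces by default)

velero restore create --from-backup nginx-backup

velero restore get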

Back up the VCD PostgreSQL database (see this previous blog post)

velero backup create postgresql --ordered-resources 'statefulsets=vmware-cloud-director/postgresql-primary' --include-namespaces=vmware-cloud-director

Show logs for this backup

velero backup logs postgresql

Describe the postgresql backup

velero backup describe postgresql

Describe volume backups

kubectl -n velero get podvolumebackups -l velero.io/backup-name=nginx-backup -o yaml

apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: PodVolumeBackup
  metadata:
    annotations:
      velero.io/pvc-name: nginx-logs
    creationTimestamp: "2022-04-13T17:55:04Z"
    generateName: nginx-backup-
    generation: 4
    labels:
      velero.io/backup-name: nginx-backup
      velero.io/backup-uid: c92d306a-bc76-47ba-ac81-5b4dae92c677
      velero.io/pvc-uid: cf3bdb2f-714b-47ee-876c-5ed1bbea8263
    name: nginx-backup-vgqjf
    namespace: velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Backup
      name: nginx-backup
      uid: c92d306a-bc76-47ba-ac81-5b4dae92c677
    resourceVersion: "8425774"
    uid: 1fcdfec5-9854-4e43-8bc2-97a8733ee38f
  spec:
    backupStorageLocation: default
    node: node-7n43
    pod:
      kind: Pod
      name: nginx-deployment-66689547d-kwbzn
      namespace: nginx-example
      uid: 05afa981-a6ac-4caf-963b-95750c7a31af
    repoIdentifier: s3:https://s3.vmwire.com:9000/tenant1/restic/nginx-example
    tags:
      backup: nginx-backup
      backup-uid: c92d306a-bc76-47ba-ac81-5b4dae92c677
      ns: nginx-example
      pod: nginx-deployment-66689547d-kwbzn
      pod-uid: 05afa981-a6ac-4caf-963b-95750c7a31af
      pvc-uid: cf3bdb2f-714b-47ee-876c-5ed1bbea8263
      volume: nginx-logs
    volume: nginx-logs
  status:
    completionTimestamp: "2022-04-13T17:55:06Z"
    path: /host_pods/05afa981-a6ac-4caf-963b-95750c7a31af/volumes/kubernetes.io~csi/pvc-cf3bdb2f-714b-47ee-876c-5ed1bbea8263/mount
    phase: Completed
    progress:
      bytesDone: 618
      totalBytes: 618
    snapshotID: 8aa5e473
    startTimestamp: "2022-04-13T17:55:04Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Deploying Kubeapps on TKG in vCloud Director Clouds

Kubeapps is a web-based UI for deploying and managing applications in Kubernetes clusters. This guide shows how you can deploy Kubeapps into your TKG clusters deployed in VMware Cloud Director.

With Kubeapps you can browse Helm charts from public or private repositories, and deploy, upgrade and delete applications in your cluster.

Pre-requisites:

  • a Kubernetes cluster deployed in VCD
  • Avi is set up for VCD to provide L4 load balancing for Kubernetes services
  • NSX-T is set up for VCD
  • A default storage class is defined for your Kubernetes cluster
  • Helm is installed on your workstation (if using Photon OS, it’s already installed)

Step 1: Install KubeApps

helm repo add bitnami https://charts.bitnami.com/bitnami
kubectl create namespace kubeapps
helm install kubeapps --namespace kubeapps bitnami/kubeapps

Step 2: Create demo credentials

kubectl create --namespace default serviceaccount kubeapps-operator
kubectl create clusterrolebinding kubeapps-operator --clusterrole=cluster-admin --serviceaccount=default:kubeapps-operator

Step 3: Obtain token to login to KubeApps

kubectl get --namespace default secret $(kubectl get --namespace default serviceaccount kubeapps-operator -o jsonpath='{range .secrets[*]}{.name}{"\n"}{end}' | grep kubeapps-operator-token) -o jsonpath='{.data.token}' -o go-template='{{.data.token | base64decode}}' && echo
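
Note that on Kubernetes 1.24 and later, token secrets are no longer created automatically for service accounts, so the command above may return nothing. In that case you can request a short-lived token directly; a sketch, assuming the kubeapps-operator service account created in Step 2:

kubectl create token kubeapps-operator --namespace default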

Step 4: Expose KubeApps using Avi load balancer

k edit svc kubeapps -n kubeapps

Change the line type: ClusterIP to type: LoadBalancer.

Alternatively, expose Kubeapps using the Gateway API by adding ako.vmware.com labels to the kubeapps service like this (not supported in VCD clouds):

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: kubeapps
    meta.helm.sh/release-namespace: kubeapps
  creationTimestamp: "2022-03-26T13:47:45Z"
  labels:
    ako.vmware.com/gateway-name: gateway-tkg-workload-vip
    ako.vmware.com/gateway-namespace: default
    app.kubernetes.io/component: frontend
    app.kubernetes.io/instance: kubeapps
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kubeapps
    helm.sh/chart: kubeapps-7.8.13
  name: kubeapps
  namespace: kubeapps

Step 5: Log into KubeApps with the token

Resize a TKGm cluster in CSE

When trying to resize a TKGm cluster with CSE in the VCD UI, you might encounter this error below:

Cluster resize request failed. Please contact your provider if this problem persists. (Error: Unknown error)

Checking the logs in ~/.cse-logs, there are no entries that show what the error is. It appears to be an issue with the Container UI Plugin for CSE 3.1.0.

If you review the console messages in Chrome’s developer tools you might see something like the following:

TypeError: Cannot read properties of null (reading 'length')
    at getFullSpec (https://vcd.vmwire.com/tenant/tenant1/uiPlugins/80134fc9-86e1-41db-9d02-b02d5e9e1e3c/ca5642fa-7186-4da2-b273-2dbd3451fd50/bundle.js:1:170675)
    at resizeCseCluster

This post shows how you can use the vcd cse cli to workaround this problem.

Using the vcd cse cli to resize a TKGm cluster

  1. First, log into the CSE appliance or another machine with the vcd cse cli installed.
  2. Then log into the VCD Org that has the cluster that you want to resize, using a user whose role has the cse:nativecluster rights bundle.
    • vcd login vcd.vmwire.com tenant1 tenant1-admin -p Vmware1!
  3. Let’s list the clusters using this command:
    • vcd cse cluster list
  4. CSE should show you the clusters belonging to this organization.
  5. Now let’s obtain the details of the cluster that we want to resize:
    • vcd cse cluster info hugo-tkg
    • copy the entire output of that command and paste it into Notepad++
  6. Delete everything from the status: line downwards so you only end up with the apiVersion, kind, metadata and spec sections, like this:
apiVersion: cse.vmware.com/v2.0
kind: TKGm
metadata:
  name: hugo-tkg
  orgName: tenant1
  site: https://vcd.vmwire.com
  virtualDataCenterName: tenant1-vdc
spec:
  distribution:
    templateName: ubuntu-2004-kube-v1.20.5-vmware.2-tkg.1-6700972457122900687
    templateRevision: 1
  settings:
    network:
      cni: null
      expose: true
      pods:
        cidrBlocks:
        - 100.96.0.0/11
      services:
        cidrBlocks:
        - 100.64.0.0/13
    ovdcNetwork: default-organization-network
    rollbackOnFailure: true
    sshKey: ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAhcw67bz3xRjyhPLysMhUHJPhmatJkmPUdMUEZre+MeiDhC602jkRUNVu43Nk8iD/I07kLxdAdVPZNoZuWE7WBjmn13xf0Ki2hSH/47z3ObXrd8Vleq0CXa+qRnCeYM3FiKb4D5IfL4XkHW83qwp8PuX8FHJrXY8RacVaOWXrESCnl3cSC0tA3eVxWoJ1kwHxhSTfJ9xBtKyCqkoulqyqFYU2A1oMazaK9TYWKmtcYRn27CC1Jrwawt2zfbNsQbHx1jlDoIO6FLz8Dfkm0DToanw0GoHs2Q+uXJ8ve/oBs0VJZFYPquBmcyfny4WIh4L0lwzsiAVWJ6PvzF5HMuNcwQ==
      rsa-key-20210508
  topology:
    controlPlane:
      count: 1
      cpu: null
      memory: null
      sizingClass: small
      storageProfile: iscsi
    nfs:
      count: 0
      sizingClass: null
      storageProfile: null
    workers:
      count: 3
      cpu: null
      memory: null
      sizingClass: medium
      storageProfile: iscsi

Prepare a cluster config file

  1. Change the workers: count to your new desired number of workers.
  2. Save this file as update_my_cluster.yaml
  3. Update the cluster with this command
    • vcd cse cluster apply update_my_cluster.yaml
  4. You’ll notice that CSE will deploy another worker node into the same vApp, and after a few minutes your TKGm cluster will have another node added to it.
root@photon-manager [ ~/.kube ]# kubectl get nodes
NAME        STATUS   ROLES                  AGE   VERSION
mstr-zcn7   Ready    control-plane,master   14m   v1.20.5+vmware.2
node-7swy   Ready    <none>                 10m   v1.20.5+vmware.2
node-90sb   Ready    <none>                 12m   v1.20.5+vmware.2
root@photon-manager [ ~/.kube ]# kubectl get nodes
NAME        STATUS   ROLES                  AGE   VERSION
mstr-zcn7   Ready    control-plane,master   22m   v1.20.5+vmware.2
node-7swy   Ready    <none>                 17m   v1.20.5+vmware.2
node-90sb   Ready    <none>                 19m   v1.20.5+vmware.2
node-rbmz   Ready    <none>                 43s   v1.20.5+vmware.2

Viewing client logs

The vcd cse cli commands run client side; to enable logging for them, do the following:

  1. Run this command in the CSE appliance or on your workstation that has the vcd cse cli installed.
    • export CSE_CLIENT_WIRE_LOGGING=True
  2. View the logs by using this command:
    • tail -f cse-client-debug.log

A couple of notes

The vcd cse cluster resize command is not enabled if your CSE server is running with legacy_mode: false. You can read more about this at this link.

Therefore, the only way to resize a cluster is to update it using the vcd cse cluster apply command. The apply command supports the following:

apply a configuration to a cluster resource by filename. The
resource will be created if it does not exist. (The command
can be used to create the cluster, scale-up/down worker count,
scale-up NFS nodes, upgrade the cluster to a new K8s version.

CSE 3.1.1 can only scale up a TKGm cluster; it does not support scale-down yet.