
Kubernetes Cluster Maintenance and Upgrades

This page covers expanding the cluster with additional worker or control plane nodes, moving apps out of the default namespace, node maintenance, Kubernetes and component upgrades, and rebuilding nodes from scratch.

Use Ansible to prepare new nodes so the baseline is consistent across the cluster.

Terminal window
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-cpu.yaml

Use the GPU-specific playbooks to enable the runtime and device plugins:

Terminal window
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-nvidia-gpu.yaml
Terminal window
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-intel-gpu.yaml
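
Once the node has joined the cluster (next step), the device plugins installed by these playbooks should surface GPU resources in the node's allocatable list. A quick check (the node name is a placeholder; exact resource names depend on which plugins the playbooks install):

Terminal window
kubectl describe node <gpu-node-name> | grep -A 10 "Allocatable:"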

Generate Join Token (On Existing Control Plane)


Tokens expire after 24 hours. Run this on an existing control plane node to generate a new join command:

Terminal window
kubeadm token create --print-join-command

Run the join command printed above on the new worker node:

Terminal window
sudo kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

For HA control planes, you need a load balancer in front of all control planes and must initialize the first control plane with --control-plane-endpoint=<load-balancer-ip>:6443.
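
For reference, a minimal sketch of how that first control plane might be initialized behind the load balancer (other flags, such as a pod network CIDR, are cluster-specific and omitted; --upload-certs is optional here since the certificate key can be regenerated later):

Terminal window
sudo kubeadm init \
--control-plane-endpoint=<load-balancer-ip>:6443 \
--upload-certs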

Generate a certificate key on an existing control plane:

Terminal window
sudo kubeadm init phase upload-certs --upload-certs

Then join with the --control-plane flag:

Terminal window
sudo kubeadm join <load-balancer-ip>:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane \
--certificate-key <certificate-key>

After joining, configure kubectl on the new control plane:

Terminal window
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Use these checks for routine health verification, especially if the cluster runs unattended for long periods.

Terminal window
kubectl get nodes
kubectl get pods -A
kubectl get apps -n argocd

Use this flow to move existing apps out of the default namespace.

Step 1: Add a namespace and update manifests


Create apps/<app-name>/namespace.yaml and apps/<app-name>/app.yaml, then update every manifest in the app folder to use namespace: <app-name>. If two apps must share storage, point both app.yaml files at the same namespace.
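
As a minimal sketch, apps/<app-name>/namespace.yaml can be as simple as the Namespace object itself:

apps/<app-name>/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: <app-name>

In each app.yaml, this typically means pointing the Application's spec.destination.namespace at the same <app-name> (assuming the app.yaml files in this repo are ArgoCD Application manifests).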

Once the changes are pushed, ArgoCD will reconcile the app into the new namespace.

After the new namespace is healthy, delete the old resources in default to avoid conflicts.

Terminal window
kubectl delete deployment,service,httproute -n default -l app=<app-name>

If the app owns PVCs, plan a data migration before deleting the old claims.
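
To see which claims are affected before planning that migration, list the PVCs still bound in default (the label selector mirrors the delete command above and assumes the app's resources are labeled app=<app-name>):

Terminal window
kubectl get pvc -n default -l app=<app-name>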

Use this flow to patch or reboot a node safely.

Terminal window
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Perform the host maintenance, then uncordon the node so workloads can be scheduled on it again:

Terminal window
kubectl uncordon <node-name>
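
To confirm the node has rejoined cleanly, check that it reports Ready and that pods are scheduling onto it again:

Terminal window
kubectl get node <node-name>
kubectl get pods -A --field-selector spec.nodeName=<node-name>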

Upgrade control plane nodes first, then upgrade workers.

Update k8s_version in ansible/group_vars/all.yaml, then run the provisioning playbook against all nodes.
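
Assuming the same provisioning playbook shown earlier applies here, that step might look like:

Terminal window
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-cpu.yaml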

Run these on each control plane node, one at a time:

Terminal window
sudo kubeadm upgrade plan
TARGET_VERSION="v1.35.x"
sudo kubeadm upgrade apply ${TARGET_VERSION}
sudo apt-get install -y kubelet kubeadm kubectl
sudo systemctl restart kubelet

Replace v1.35.x with the target patch version after checking the latest stable release.
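
One way to check the latest stable release is the upstream version endpoint (kubeadm upgrade plan also lists the versions it can upgrade to):

Terminal window
curl -L -s https://dl.k8s.io/release/stable.txt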

Drain, upgrade, then uncordon each worker. Run kubectl drain and kubectl uncordon from a machine with cluster access; run the kubeadm, apt-get, and systemctl commands on the worker itself:

Terminal window
kubectl drain <worker-name> --ignore-daemonsets --delete-emptydir-data
sudo kubeadm upgrade node
sudo apt-get install -y kubelet kubeadm kubectl
sudo systemctl restart kubelet
kubectl uncordon <worker-name>

Verify that every node reports the new version and all pods return to a healthy state:

Terminal window
kubectl get nodes
kubectl get pods -A

Use this section to keep component versions consistent across the cluster while maintaining HA.

Host-level versions are pinned in ansible/group_vars/all.yaml (see the sketch after this list):

  • k8s_version for kubeadm/kubelet/kubectl
  • containerd_version for the container runtime
  • cilium_version for the CNI
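
A sketch of those keys with placeholder values (check the actual file for the real values and any additional settings):

ansible/group_vars/all.yaml (sketch)
k8s_version: "<kubernetes-version>"         # kubeadm, kubelet, kubectl
containerd_version: "<containerd-version>"  # container runtime
cilium_version: "<cilium-version>"          # CNI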

Cluster-level components (Longhorn, Tailscale, Envoy Gateway, cert-manager, ExternalDNS, ArgoCD) are pinned in their ArgoCD templates or manifests:

  • bootstrap/templates/longhorn.yaml
  • infrastructure/tailscale/tailscale-operator.yaml
  • infrastructure/envoy-gateway/envoy-gateway.yaml
  • infrastructure/envoy-gateway-crds/
  • infrastructure/gateway-api-crds/
  • infrastructure/cert-manager/cert-manager.yaml
  • infrastructure/external-dns/external-dns.yaml
  • infrastructure/external-secrets-crds/
  • bootstrap/templates/*-appset.yaml for repo references

When upgrading any of these components:

  • Host packages (Kubernetes, containerd): rerun the provisioning playbook on all nodes so every host converges to the same version.
  • Cluster add-ons (Cilium, Longhorn, ArgoCD): update the version in Git and let ArgoCD sync. For Cilium, use cilium upgrade after updating cilium_version.
  • Upgrade control planes one at a time, then workers.
  • Drain nodes before upgrades and uncordon after, as described in the Kubernetes upgrade steps above.
  • Verify health between nodes: kubectl get nodes and kubectl get pods -A.

Update cilium_version in ansible/group_vars/all.yaml and targetRevision in infrastructure/cilium/cilium.yaml, then run:

Terminal window
CILIUM_VERSION=$(grep -E "cilium_version:" ansible/group_vars/all.yaml | head -n 1 | awk -F'"' '{print $2}')
cilium upgrade --version ${CILIUM_VERSION}
cilium status --wait

Update bootstrap/argocd/kustomization.yaml to the desired ArgoCD release tag.
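
The exact contents of that file depend on how this repo pulls the upstream manifests; as one common shape (with <release-tag> as a placeholder), a kustomization can reference the pinned install manifest directly:

bootstrap/argocd/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
  # <release-tag> is a placeholder for the desired ArgoCD version tag
  - https://raw.githubusercontent.com/argoproj/argo-cd/<release-tag>/manifests/install.yaml

Then apply the updated kustomization and wait for the ArgoCD server rollout: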

Terminal window
kubectl apply -k bootstrap/argocd
kubectl wait --for=condition=available --timeout=600s deployment/argocd-server -n argocd

Update targetRevision in bootstrap/templates/longhorn.yaml, then let ArgoCD sync the application.
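
As a hypothetical excerpt (the chart source shown here is an assumption; the template in this repo may pull Longhorn differently), the field being bumped is the Application source's targetRevision:

bootstrap/templates/longhorn.yaml (hypothetical excerpt)
spec:
  source:
    repoURL: https://charts.longhorn.io   # assumed chart repository
    chart: longhorn
    targetRevision: <new-longhorn-version>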

Use this section when rebuilding a node from bare metal or VM images:

Install Ubuntu 24.04 LTS and log in as a user with sudo. Clone this repo and review the pinned versions and paths in ansible/group_vars/all.yaml.

Run the Ansible provisioning playbook, then continue at the Kubernetes section.

Follow the bare metal tutorial path in order, starting with Prerequisites.

If you are rebuilding with existing data disks for Longhorn, ensure the storage path in bootstrap/templates/longhorn.yaml points to the correct mount before applying bootstrap/root.yaml.
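
A rough pre-flight check, assuming the data disk is already mounted and the path is configured through Longhorn's defaultDataPath setting (verify the exact key used in this repo's template):

Terminal window
lsblk -f
grep -n "defaultDataPath" bootstrap/templates/longhorn.yaml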