# Kubernetes Cluster Maintenance and Upgrades

## Maintenance

### Adding Nodes to the Cluster

This section covers expanding the cluster with additional worker nodes or control plane nodes.
#### Prerequisites for New Nodes

Use Ansible to prepare new nodes so the baseline is consistent across the cluster:

```bash
ansible-playbook -i ansible/inventory/hosts.yaml \
  ansible/playbooks/provision-cpu.yaml
```

#### For Nodes with GPUs
Use the GPU-specific playbooks to enable the runtime and device plugins:

```bash
ansible-playbook -i ansible/inventory/hosts.yaml \
  ansible/playbooks/provision-nvidia-gpu.yaml

ansible-playbook -i ansible/inventory/hosts.yaml \
  ansible/playbooks/provision-intel-gpu.yaml
```

#### Generate Join Token (On Existing Control Plane)
Tokens expire after 24 hours. Run this on an existing control plane node to generate a new join command:

```bash
kubeadm token create --print-join-command
```

#### Add a Worker Node
Run the join command from above on the new worker node:

```bash
sudo kubeadm join <control-plane-ip>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
```

#### Add a Control Plane Node (HA Setup)
For HA control planes, you need a load balancer in front of all control planes and must initialize the first control plane with `--control-plane-endpoint=<load-balancer-ip>:6443`.
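For reference, a minimal sketch of what that initial `kubeadm init` looks like when the endpoint is set up front (placeholders only; this is not necessarily the exact command used to build this cluster):

```bash
# Run once, on the very first control plane node only.
# <load-balancer-ip> is the VIP or load balancer fronting all API servers.
sudo kubeadm init \
  --control-plane-endpoint=<load-balancer-ip>:6443 \
  --upload-certs
```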
Generate a certificate key on an existing control plane:

```bash
sudo kubeadm init phase upload-certs --upload-certs
```

Then join with the `--control-plane` flag:

```bash
sudo kubeadm join <load-balancer-ip>:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>
```

After joining, configure kubectl on the new control plane:

```bash
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```

### Routine checks
Use these checks if the cluster runs unattended for long periods:

```bash
kubectl get nodes
kubectl get pods -A
kubectl get apps -n argocd
```

### Namespace migration
Use this flow to move existing apps out of the `default` namespace.

#### Step 1: Add a namespace and update manifests

Create `apps/<app-name>/namespace.yaml` and `apps/<app-name>/app.yaml`, then update every manifest in the app folder to use `namespace: <app-name>`. If two apps must share storage, point both `app.yaml` files at the same namespace.
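As an illustration, a minimal `namespace.yaml` can be as small as the following (labels and extra metadata are omitted; follow whatever conventions the other app manifests use):

```yaml
# apps/<app-name>/namespace.yaml (hypothetical minimal example)
apiVersion: v1
kind: Namespace
metadata:
  name: <app-name>
```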
#### Step 2: Let ArgoCD sync

Once the changes are pushed, ArgoCD will reconcile the app into the new namespace.
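To confirm the reconciliation, you can watch the Application and the workloads in the new namespace (a sketch; the Application name is assumed to match the app folder):

```bash
# Expect the Application to report Synced / Healthy
kubectl get application <app-name> -n argocd
kubectl get pods -n <app-name>
```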
#### Step 3: Remove old resources

After the new namespace is healthy, delete the old resources in `default` to avoid conflicts:

```bash
kubectl delete deployment,service,httproute -n default -l app=<app-name>
```

If the app owns PVCs, plan a data migration before deleting the old claims.
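Before touching anything stateful, it helps to see which claims would be affected; a sketch, assuming the PVCs carry the same `app=<app-name>` label:

```bash
# Claims still living in the default namespace
kubectl get pvc -n default -l app=<app-name>
# The PersistentVolumes they bind to (CLAIM column shows namespace/claim)
kubectl get pv
```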
### Node maintenance window

Use this flow to patch or reboot a node safely.

#### Step 1: Drain the node

```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

#### Step 2: Apply OS updates or reboot
Perform the host maintenance, then confirm the node is back online.
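On an Ubuntu host this typically looks something like the following (a sketch; substitute whatever patching the node actually needs):

```bash
# On the drained node
sudo apt-get update && sudo apt-get upgrade -y
sudo reboot

# From a machine with kubectl access: wait for the node to report Ready again
kubectl get node <node-name> -w
```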
#### Step 3: Uncordon the node

```bash
kubectl uncordon <node-name>
```

## Upgrade Kubernetes (Bare Metal)

Upgrade control plane nodes first, then upgrade workers.
### Step 1: Update versions and packages

Update `k8s_version` in `ansible/group_vars/all.yaml`, then run the provisioning playbook against all nodes.
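That is the same playbook invocation shown in the node-addition section above; for example:

```bash
# Re-run provisioning so every host converges on the new k8s_version
# (GPU nodes also need their GPU-specific playbook re-run)
ansible-playbook -i ansible/inventory/hosts.yaml \
  ansible/playbooks/provision-cpu.yaml
```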
### Step 2: Upgrade the control plane

Run these on each control plane node, one at a time:

```bash
sudo kubeadm upgrade plan
TARGET_VERSION="v1.35.x"
sudo kubeadm upgrade apply ${TARGET_VERSION}
sudo apt-get install -y kubelet kubeadm kubectl
sudo systemctl restart kubelet
```

Replace `v1.35.x` with the target patch version after checking the latest stable release. On additional control plane nodes, run `sudo kubeadm upgrade node` instead of `kubeadm upgrade apply`; the remaining commands stay the same.
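One way to see which patch releases are available is to query the Kubernetes apt repository configured during provisioning (an assumption; use whatever package source your hosts actually track):

```bash
# List kubeadm versions offered by the configured apt repository
apt-cache madison kubeadm
```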
### Step 3: Upgrade each worker node

Drain, upgrade, then uncordon each worker:

```bash
kubectl drain <worker-name> --ignore-daemonsets --delete-emptydir-data
sudo kubeadm upgrade node
sudo apt-get install -y kubelet kubeadm kubectl
sudo systemctl restart kubelet
kubectl uncordon <worker-name>
```

### Step 4: Verify the upgrade

```bash
kubectl get nodes
kubectl get pods -A
```

## Version Management (Ansible + ArgoCD)
Use this flow to keep every node consistent while maintaining HA.

### Step 1: Update the pinned versions

Host-level versions are pinned in `ansible/group_vars/all.yaml`:

- `k8s_version` for kubeadm/kubelet/kubectl
- `containerd_version` for the container runtime
- `cilium_version` for the CNI
Cluster-level components (Longhorn, Tailscale, Envoy Gateway, cert-manager, ExternalDNS, ArgoCD) are pinned in their ArgoCD templates or manifests:

- `bootstrap/templates/longhorn.yaml`
- `infrastructure/tailscale/tailscale-operator.yaml`
- `infrastructure/envoy-gateway/envoy-gateway.yaml`
- `infrastructure/envoy-gateway-crds/`
- `infrastructure/gateway-api-crds/`
- `infrastructure/cert-manager/cert-manager.yaml`
- `infrastructure/external-dns/external-dns.yaml`
- `infrastructure/external-secrets-crds/`
- `bootstrap/templates/*-appset.yaml` for repo references
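In most of these files the pin is a Helm chart `targetRevision` on an ArgoCD `Application`; a hypothetical excerpt (values are illustrative, not the repo's actual pins):

```yaml
# Hypothetical excerpt of an ArgoCD Application pinning a Helm chart version
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  source:
    repoURL: https://charts.jetstack.io   # upstream Helm repo (example)
    chart: cert-manager
    targetRevision: v1.x.y                # bump this to upgrade
```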
### Step 2: Apply the change consistently

- Host packages (Kubernetes, containerd): rerun the provisioning playbook on all nodes so every host converges to the same version.
- Cluster add-ons (Cilium, Longhorn, ArgoCD): update the version in Git and let ArgoCD sync. For Cilium, use `cilium upgrade` after updating `cilium_version`.
### Step 3: Keep HA during updates

- Upgrade control planes one at a time, then workers.
- Drain nodes before upgrades and uncordon after, as described in the Kubernetes upgrade steps above.
- Verify health between nodes: `kubectl get nodes` and `kubectl get pods -A`.
## Upgrade Cilium

Update `cilium_version` in `ansible/group_vars/all.yaml` and `targetRevision` in `infrastructure/cilium/cilium.yaml`, then run:

```bash
CILIUM_VERSION=$(grep -E "cilium_version:" ansible/group_vars/all.yaml | head -n 1 | awk -F'"' '{print $2}')
cilium upgrade --version ${CILIUM_VERSION}
cilium status --wait
```

## Upgrade ArgoCD
Update `bootstrap/argocd/kustomization.yaml` to the desired ArgoCD release tag, then apply:

```bash
kubectl apply -k bootstrap/argocd
kubectl wait --for=condition=available --timeout=600s deployment/argocd-server -n argocd
```

## Upgrade Longhorn

Update `targetRevision` in `bootstrap/templates/longhorn.yaml`, then let ArgoCD sync the application.
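To watch the rollout once ArgoCD picks up the new revision (a sketch; this assumes the Application is named `longhorn` and deploys into the usual `longhorn-system` namespace):

```bash
kubectl get application longhorn -n argocd
kubectl get pods -n longhorn-system -w
```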
## Recreate From Scratch (Ubuntu 24.04)

Use this section when rebuilding a node from bare metal or VM images.

### Base setup

Install Ubuntu 24.04 LTS and log in as a user with sudo. Clone this repo and review the pinned versions and paths in `ansible/group_vars/all.yaml`.
### Automated setup

Run the Ansible provisioning playbook, then continue at Kubernetes.

### Manual setup

Follow the bare metal tutorial path in order, starting with Prerequisites.

If you are rebuilding with existing data disks for Longhorn, ensure the storage path in `bootstrap/templates/longhorn.yaml` points to the correct mount before applying `bootstrap/root.yaml`.
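For orientation, the Longhorn chart exposes the data directory as `defaultSettings.defaultDataPath`; a hypothetical excerpt of how that might appear in `bootstrap/templates/longhorn.yaml` (the mount point and surrounding structure are assumptions, not the repo's actual layout):

```yaml
# Hypothetical excerpt: Helm values passed through the ArgoCD Application
spec:
  source:
    helm:
      valuesObject:
        defaultSettings:
          defaultDataPath: /mnt/longhorn   # placeholder; must match the existing data disk mount
```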