How I automated Kubernetes cluster upgrades using Ansible, turning a tedious multi-hour process into a single command.
The Problem
My homelab Kubernetes cluster was running v1.28, several versions behind the current stable release. Kubernetes only supports upgrading one minor version at a time, meaning I’d need to go through:
1.28 → 1.29 → 1.30 → 1.31 → 1.32
Each upgrade involves:
- Upgrading the control plane (kubeadm, then kubelet/kubectl)
- Draining each worker node
- Upgrading packages on each worker
- Uncordoning and waiting for Ready state
- Verifying cluster health
Doing this manually across 4 nodes, 4 times, is tedious and error-prone. Enter Ansible.
Cluster Setup
| Node | Role | OS |
|---|---|---|
| k8s-master01 | Control Plane | CentOS Stream 9 |
| k8s-worker01 | Worker | CentOS Stream 9 |
| k8s-worker02 | Worker | CentOS Stream 9 |
| polycephala | Worker | Rocky Linux 10 |
All nodes use kubeadm for cluster management and containerd as the container runtime.
The Playbooks
I created two playbooks:
- k8s-upgrade-check.yml - Pre-flight checks
- k8s-upgrade.yml - Performs the actual upgrade
Directory Structure
ansible/
├── inventory/
│   └── hosts.yml
├── playbooks/
│   ├── k8s-upgrade-check.yml
│   └── k8s-upgrade.yml
└── README.md
Inventory
Simple YAML inventory using hostnames (assuming DNS/hosts resolution works):
all:
  children:
    control_plane:
      hosts:
        k8s-master01:
    workers:
      hosts:
        k8s-worker01:
        k8s-worker02:
        polycephala:
    k8s_cluster:
      children:
        control_plane:
        workers:
      vars:
        ansible_user: will
        ansible_become: true
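Before touching the cluster, it's worth confirming Ansible can actually reach every node. The built-in ping module does the job:

ansible -i inventory/hosts.yml k8s_cluster -m ping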
Pre-flight Check Playbook
Before upgrading, I want to verify:
- All nodes are Ready
- The target version packages are available
- etcd is healthy
- No pending pods that would block draining
- name: Check available kubeadm versions (RHEL-based)
  shell: |
    dnf list available kubeadm --showduplicates 2>/dev/null | grep -E "{{ next_minor_version }}" | tail -5
  register: available_kubeadm

- name: Check all nodes are Ready
  command: kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
  register: node_status

- name: Check etcd cluster health
  shell: |
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
      endpoint health
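These snippets assume next_minor_version is supplied via play vars or -e. The remaining checklist items map to tasks along these lines; this is a sketch (task names are mine, and listing Pending pods is only a rough proxy for drain blockers - PodDisruptionBudgets are what actually block a drain):

- name: Show the kubeadm upgrade plan
  command: kubeadm upgrade plan
  register: upgrade_plan

- name: List pods stuck in Pending
  shell: kubectl get pods -A --field-selector=status.phase=Pending -o wide || true
  register: pending_pods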
Usage:
ansible-playbook -i inventory/hosts.yml playbooks/k8s-upgrade-check.yml
Output shows version availability, node status, and the kubeadm upgrade plan.
Upgrade Playbook
The main upgrade playbook handles the entire process:
1. Backup etcd
Never upgrade without a backup:
- name: Backup etcd
  shell: |
    ETCDCTL_API=3 etcdctl snapshot save {{ backup_file }} \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
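backup_file isn't defined in the snippet above. One way to build it with a timestamp so repeated runs don't clobber earlier snapshots (the path is illustrative):

vars:
  # ansible_date_time is only populated when gather_facts is enabled
  backup_file: "/root/etcd-backup-{{ ansible_date_time.iso8601_basic_short }}.db"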
2. Upgrade Control Plane
- name: Upgrade kubeadm
  dnf:
    name: "kubeadm-{{ full_target_version }}"
    state: present
    disable_excludes: kubernetes

- name: Apply kubeadm upgrade
  command: "kubeadm upgrade apply v{{ full_target_version }} --yes"

- name: Upgrade kubelet and kubectl
  dnf:
    name:
      - "kubelet-{{ full_target_version }}"
      - "kubectl-{{ full_target_version }}"
    state: present
    disable_excludes: kubernetes

- name: Restart kubelet
  systemd:
    name: kubelet
    state: restarted
    daemon_reload: true
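kubeadm upgrade apply restarts the static control plane pods, so the API server briefly goes away. A minimal wait before continuing (task name and retry counts are my choice):

- name: Wait for the API server to respond
  command: kubectl get --raw /readyz
  register: api_ready
  until: api_ready.rc == 0
  retries: 30
  delay: 10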
3. Upgrade Workers (Serial)
Workers are upgraded one at a time (serial: 1) so the cluster keeps serving workloads while each node drains.
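At the play level that's a single directive (sketched here; the play name is mine):

- name: Upgrade worker nodes
  hosts: workers
  serial: 1
  tasks:
    # drain, upgrade packages, restart kubelet, uncordon - as below

The per-node tasks: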
- name: Drain node
  command: "kubectl drain {{ inventory_hostname }} --ignore-daemonsets --delete-emptydir-data --force"
  delegate_to: "{{ groups['control_plane'][0] }}"

- name: Upgrade kubeadm
  dnf:
    name: "kubeadm-{{ full_target_version }}"
    state: present
    disable_excludes: kubernetes

- name: Upgrade node config
  command: kubeadm upgrade node

- name: Upgrade kubelet and kubectl
  dnf:
    name:
      - "kubelet-{{ full_target_version }}"
      - "kubectl-{{ full_target_version }}"
    state: present
    disable_excludes: kubernetes

- name: Restart kubelet
  systemd:
    name: kubelet
    state: restarted
    daemon_reload: true

- name: Uncordon node
  command: "kubectl uncordon {{ inventory_hostname }}"
  delegate_to: "{{ groups['control_plane'][0] }}"
4. Verification
After all nodes are upgraded:
- name: Get cluster version info
  command: kubectl get nodes -o wide

- name: Check all system pods
  shell: |
    # grep exits non-zero when every pod is Running/Completed, which is the healthy case
    kubectl get pods -n kube-system -o wide | grep -v Running | grep -v Completed || true
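A useful extra assertion (not part of the playbook above) is that every node now reports the same kubelet version:

- name: Assert all nodes run the same kubelet version
  shell: |
    kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' | sort -u
  register: kubelet_versions
  failed_when: kubelet_versions.stdout_lines | length != 1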
Safety Features
The playbook includes several safety mechanisms:
Version Jump Validation
Prevents accidentally skipping minor versions:
- name: Validate version jump
  fail:
    msg: "Cannot skip minor versions. Current: {{ current_minor }}, Target: {{ upgrade_minor }}"
  when: (upgrade_minor.split('.')[1] | int) - (current_minor.split('.')[1] | int) > 1
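current_minor and upgrade_minor are strings like 1.28 and 1.29. For reference, one way to derive current_minor from the live cluster (a sketch; the task names and regex handling are illustrative):

- name: Detect current control plane version
  command: kubectl version -o json
  register: kubectl_version_raw

- name: Extract the minor version (v1.28.9 -> 1.28)
  set_fact:
    current_minor: "{{ (kubectl_version_raw.stdout | from_json).serverVersion.gitVersion | regex_replace('^v(\\d+\\.\\d+).*$', '\\1') }}"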
Confirmation Prompt
Requires explicit confirmation before proceeding:
- name: Confirm upgrade
  pause:
    prompt: "Press ENTER to continue with upgrade to {{ full_target_version }}, or Ctrl+C then 'a' to abort"
Wait for Ready State
Doesn’t proceed until each node is Ready:
- name: Wait for node to be Ready
  command: kubectl get node {{ inventory_hostname }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  delegate_to: "{{ groups['control_plane'][0] }}"
  register: node_ready
  until: node_ready.stdout == "True"
  retries: 30
  delay: 10
Usage
Check Readiness
ansible-playbook -i inventory/hosts.yml playbooks/k8s-upgrade-check.yml
Perform Upgrade
# Auto-detect next version
ansible-playbook -i inventory/hosts.yml playbooks/k8s-upgrade.yml
# Specific version
ansible-playbook -i inventory/hosts.yml playbooks/k8s-upgrade.yml -e target_version=1.29.15
Full Upgrade Path
To go from 1.28 to 1.32:
for version in 1.29 1.30 1.31 1.32; do
  ansible-playbook -i inventory/hosts.yml playbooks/k8s-upgrade-check.yml -e target_version=$version
  ansible-playbook -i inventory/hosts.yml playbooks/k8s-upgrade.yml -e target_version=$version
done
Lessons Learned
1. Serial Execution is Essential
Initially I tried upgrading workers in parallel. Bad idea: if something goes wrong, you've potentially broken multiple nodes at once. serial: 1 ensures only one node is out of service at a time.
2. Always Backup etcd
Even with Ansible automation, things can go wrong. The etcd backup saved me once when a network issue mid-upgrade left a node in a weird state.
3. Check Package Availability First
The pre-flight check playbook isn’t optional. I’ve been burned by assuming packages were available when the Kubernetes repo wasn’t properly configured on a node.
4. DNS Resolution Matters
Using hostnames instead of IP addresses in the inventory makes the playbooks cleaner and works better with kubectl drain/uncordon commands.
5. disable_excludes is Required
On RHEL-based systems, the kubernetes repo uses exclude=kubelet kubeadm kubectl to prevent accidental upgrades. The disable_excludes: kubernetes option is required to actually install the new versions.
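For reference, a typical /etc/yum.repos.d/kubernetes.repo following the upstream pkgs.k8s.io instructions (note the exclude line; also note each minor version has its own repo, so the baseurl must be bumped before each upgrade step):

[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.29/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.29/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni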
What’s Not Covered
This playbook assumes:
- Single control plane node (no HA control plane)
- RHEL-based systems (dnf package manager)
- kubeadm-based cluster
- containerd runtime (no Docker)
For HA control planes, you'd need to upgrade control plane nodes serially as well: the first node runs kubeadm upgrade apply, and each remaining one runs kubeadm upgrade node.
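A minimal sketch of how that could look, in a serial: 1 play over the control_plane group (task names are mine):

- name: Upgrade first control plane node
  command: "kubeadm upgrade apply v{{ full_target_version }} --yes"
  when: inventory_hostname == groups['control_plane'][0]

- name: Upgrade remaining control plane nodes
  command: kubeadm upgrade node
  when: inventory_hostname != groups['control_plane'][0]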
Conclusion
What used to be a multi-hour, error-prone manual process is now:
ansible-playbook -i inventory/hosts.yml playbooks/k8s-upgrade.yml
The playbook handles the entire upgrade safely:
- Backs up etcd first
- Validates the version jump
- Upgrades control plane
- Drains, upgrades, and uncordons each worker
- Verifies cluster health
For a homelab with 4 nodes, this saves about 2-3 hours per version upgrade - and more importantly, eliminates the risk of forgetting a step or making a typo.