December 11

Updating Kubernetes nodes’ OS

With nine nodes in my cluster right now, each running Ubuntu 24.04, I want to ensure that the latest updates are present on the nodes.

I know I can remove the node from the cluster, update the OS, and then re-add the node, but I’m hoping there is an easier way.

I asked ChatGPT, and the two best methods suggested were to create a custom Ansible playbook to do the updates, or to use the Kubernetes Cluster API. The Cluster API would take a lot of effort to set up, so I’m opting for the playbook approach.

The steps suggested are:

  • cordon the node

  • drain the node

  • apply apt updates

  • reboot

  • wait for node to be ready

  • uncordon

ChatGPT provided an example playbook with these steps. My cluster uses Longhorn storage, however, so I want to change the Longhorn node drain policy before the updates run, so that the drain command doesn’t time out waiting to evict a volume whose only replica is on the node. After the upgrade, the drain policy can be restored.

The revised playbook (rolling_apt_upgrade.yaml) looks like this:

---
- hosts: kube_node
  serial: 1
  become: yes

  pre_tasks:
    - name: "Set Longhorn node-drain-policy BEFORE rolling updates"
      command: >
        kubectl -n longhorn-system patch setting node-drain-policy
        --type=merge -p '{"value":"block-for-eviction-if-contains-last-replica"}'
      delegate_to: "{{ groups['kube_control_plane'][0] }}"
      run_once: true

  tasks:
    - name: Cordon the node
      command: kubectl cordon {{ inventory_hostname }}
      delegate_to: "{{ groups['kube_control_plane'][0] }}"

    - name: Drain the node
      command: >
        kubectl drain {{ inventory_hostname }}
        --ignore-daemonsets
        --delete-emptydir-data
        --grace-period=30
      delegate_to: "{{ groups['kube_control_plane'][0] }}"

    - name: Apply apt upgrades
      apt:
        upgrade: dist
        update_cache: yes

    - name: Reboot the node
      reboot:

    - name: Wait for node to return to Ready
      command: kubectl get node {{ inventory_hostname }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      register: node_ready
      retries: 40
      delay: 10
      until: node_ready.stdout == "True"
      delegate_to: "{{ groups['kube_control_plane'][0] }}"

    - name: Uncordon the node
      command: kubectl uncordon {{ inventory_hostname }}
      delegate_to: "{{ groups['kube_control_plane'][0] }}"

  post_tasks:
    - name: "Restore Longhorn node-drain-policy AFTER rolling updates"
      command: >
        kubectl -n longhorn-system patch setting node-drain-policy
        --type=merge -p '{"value":"block-if-contains-last-replica"}'
      delegate_to: "{{ groups['kube_control_plane'][0] }}"
      run_once: true

From my ~/workspace/picluster area, with the playbook in the playbooks sub-directory, I invoked it with:

ansible-playbook -i inventory/mycluster/hosts.yaml playbooks/rolling_apt_upgrade.yaml

I had issues on one node that was not becoming Ready. What I saw was that the node did not know the IP of the API endpoint (lb-apiserver.kubernetes.local); to resolve this, I had to add an entry to /etc/hosts mapping the IP to that name. I suspect the problem was that, on reboot, kubelet is not yet up, so the node cannot get the DNS info for the API, and I don’t have a separate DNS server.
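
The workaround can be sketched as a one-line hosts-file append. This is shown against a scratch copy of /etc/hosts; on the affected node you would append to the real file (with sudo). The IP below is a placeholder, not my actual address:

```shell
# Placeholder values: substitute your API load-balancer IP.
API_IP="10.0.0.100"
API_NAME="lb-apiserver.kubernetes.local"

# Scratch copy standing in for /etc/hosts in this sketch.
HOSTS_FILE="$(mktemp)"
cp /etc/hosts "$HOSTS_FILE"

# Append the mapping only if it is not already present (idempotent).
grep -q "$API_NAME" "$HOSTS_FILE" || printf '%s %s\n' "$API_IP" "$API_NAME" >> "$HOSTS_FILE"

# Confirm the entry is now present.
grep "$API_NAME" "$HOSTS_FILE"
```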

I added an Ansible playbook to do this in playbooks/update_host_tmpl.yaml, and it can be run with --limit to specify the node, if desired. I’m adding this to the node prep steps in Part IV of my series on Raspberry PI clusters.

Category: bare-metal, Kubernetes, Linux, Raspberry PI
December 11

Updating Kubernetes and Kubespray

On my current cluster, I was running Kubernetes 1.31.1, deployed with a Kubespray version slightly newer than 2.27.0. I wanted to bring the cluster up to 1.34.2. Because upgrades can only move one minor version at a time, though, I had to move to 1.32.10 (or any 1.32 version), then 1.33.5, and finally 1.34.2.
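
The one-minor-version-per-hop rule means 1.31 to 1.34 takes three separate upgrades. A tiny illustration of computing the path:

```shell
# Sketch: enumerate the minor-version hops needed for an upgrade.
from_minor=31
to_minor=34

path=""
m=$from_minor
while [ "$m" -lt "$to_minor" ]; do
  m=$((m + 1))
  path="$path 1.$m"
done

# Three hops: 1.32, then 1.33, then 1.34.
echo "upgrade hops:$path"
```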

Now, to do that, I have to use Kubespray versions that have checksums for the desired Kubernetes images (see roles/kubespray_defaults/vars/main/checksums.yml), and each Kubespray version also has a minimum supported Kubernetes version. For example, Kubespray 2.29.0 does not allow upgrading from anything older than 1.32.0, so I cannot jump straight to the latest Kubespray version. Likewise, because of the checksums, a given Kubespray version may not know about the desired Kubernetes release.

In the past, I always specified kube_version in k8s_cluster.yml in my inventory to pin the version. However, if one omits this, Kubespray will use the most recent version of Kubernetes that it knows about.
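
For reference, the pin is a single line in the inventory’s cluster config (in recent Kubespray the sample file lives under group_vars/k8s_cluster/; the version and whether a leading “v” is expected follow the sample file for your checkout):

```yaml
# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
# Example pin only; omit or comment out to let Kubespray pick the newest
# Kubernetes version it has checksums for.
kube_version: 1.32.10
```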

I checked out the 2.28.0 tag of Kubespray and did an upgrade. This brought Kubernetes to 1.32.5. Next, I checked out the 2.29.0 tag of Kubespray and did another upgrade, which brought Kubernetes to 1.33.5. Finally, I checked out the latest Kubespray commit, saw that it supported 1.34.2, and did another upgrade to that version.

Upgrade Process

The first thing I did (once) was run “brew update” and “brew upgrade” so that I had the latest tools (including the kubectl client).

Next, because I’m using Longhorn, there can be situations where a node has the only replica of a volume (even though my volumes are configured with three replicas). When an upgrade occurs, the drain can be delayed long enough to cause a timeout. To prevent this, before doing any upgrades, I change the node drain policy to allow a quick switch to another node so the drain can proceed:

kubectl -n longhorn-system patch setting node-drain-policy \
  --type=merge -p '{"value":"block-for-eviction-if-contains-last-replica"}'

For a single upgrade, I do the following:

  • Check out the desired Kubespray commit (e.g. git checkout v2.29.0).
  • Check the Kubespray requirements.txt file and make sure you have exactly the same versions installed in your environment (e.g. poetry add ansible@10.7.0).
  • Rename your Kubernetes inventory directory (~/workspace/kubernetes/picluster/inventory) to a different name (e.g. mv mycluster{,.previous}).
  • Copy the Kubespray inventory/sample directory to the Kubernetes inventory area (e.g. cp -r inventory/sample ../picluster/inventory/mycluster).
  • Do a recursive diff, and apply the customizations from your previous inventory to the new one. This involves copying some files (like hosts.yaml) and changing config settings in others. I usually add a comment with my initials so that I know which lines are modified.
  • Keep in mind that default settings sometimes change between Kubespray versions, so you will need to adjust accordingly. For example, the ingress_nginx default in 2.27.0 was true, so no change was needed, but in 2.28.0 and 2.29.0 the default is false, so it had to be set to true explicitly.
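
The copy-and-diff step can be sketched with scratch directories standing in for Kubespray’s inventory/sample and the customized inventory (file and setting names below are made up for illustration):

```shell
# Scratch stand-ins for inventory/sample and the customized inventory.
SAMPLE_DIR="$(mktemp -d)"
MY_DIR="$(mktemp -d)"

# The pristine sample config, and a copy with one customized setting,
# marked with a comment so it is easy to find after future upgrades.
printf 'setting_a: 1\nsetting_b: false\n' > "$SAMPLE_DIR/k8s-cluster.yml"
printf 'setting_a: 1\nsetting_b: true   # XX: customized\n' > "$MY_DIR/k8s-cluster.yml"

# The recursive diff output is exactly the set of customizations to carry
# forward into the new inventory (diff exits 1 when files differ).
diff -r "$SAMPLE_DIR" "$MY_DIR" || true
```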

Now, the upgrade can be done. I use the command:

ansible-playbook -i ../picluster/inventory/mycluster/hosts.yaml -u ${USER} \
-b -v --private-key=~/.ssh/id_ed25519 upgrade-cluster.yml \
-e "serial=1" -e upgrade_cluster_setup=true

Once the run completes, assuming success, check that all the resources are running. If there is an issue with a node, you can run the upgrade again with the argument:

--limit "node1,node4"

Where node4 is the node that had failed, and node1 is one of the control plane nodes (I believe you need to include a control plane node in the limit).

I’ve also had occasions where simply running the install again corrected the nodes that had issues (I had one worker node that was “Not Ready” after the upgrade). Lastly, you can always delete failed pods, in hopes that the restarted pods come up working.

When you are all done with upgrades, you can restore the Longhorn node drain policy:

kubectl -n longhorn-system patch setting node-drain-policy \
  --type=merge -p '{"value":"block-if-contains-last-replica"}'

Category: bare-metal, Kubernetes, Raspberry PI