Rebuilding a Kubernetes Node
When I created my Kubernetes cluster, I had (naively) partitioned the 1TB disk on each node into separate areas for root, home, and logs. What I found out later was that the log area wasn’t always large enough, so I was running into disk space issues.
I decided that I would re-image these nodes at a later time. Well, that time is now. I have eight nodes, five of which have these multi-partition drives: three are worker nodes and two are control plane nodes.
Prep Work
Before doing anything, I made backups of the Postgres databases my apps use in the cluster by exporting them. I have a script for each app that does this. For example:
cat > exec-init-db-pod <<'EOT'
kubectl exec -it -n viewmaster `kubectl get pod -n viewmaster -l tier=postgres | cut -f1 -d" " | tail -1` -- /bin/bash
EOT
chmod +x exec-init-db-pod
This gets me into the database pod for an app I have in the namespace ‘viewmaster’. I then create a backup:
pg_dump -U <DB_USERNAME> -W -F t <DB_NAME> > viewmasterdb.tar
I entered the database password when prompted, exited the pod, and then ran this command from my Mac to pull down the backup:
cat > move-backup <<'EOT'
kubectl cp -n viewmaster `kubectl get pod -n viewmaster -l tier=postgres | cut -f1 -d" " | tail -1`:viewmasterdb.tar "viewmasterdb-${TIMESTAMP}.tar"
EOT
chmod +x move-backup
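The move-backup script expects a TIMESTAMP variable to be set in the environment (it isn’t defined in the script itself), so a minimal usage sketch looks like the following; the date format is just my choice, and the pg_restore line is only a rough reminder of how a restore would go later (it assumes the target database already exists in the pod):

# Set a timestamp and pull the backup down to the Mac
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
./move-backup

# Later, to restore (run inside the database pod, after copying the tar back up)
pg_restore -U <DB_USERNAME> -W -d <DB_NAME> viewmasterdb.tar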
With Longhorn running in my cluster, I had also set it up to do periodic snapshots of the volumes, so hopefully I’ve got everything I need, in case things go south (one never really knows until something bad happens and the cluster has to be rebuilt).
Worker Nodes
Figuring I would tackle these first, as they would be easier, I started with the node ‘cypher’, and then did ‘niobe’ and ‘mouse’ together, since most operations use Kubespray commands and I can specify more than one node.
Node Removal
I read that one would typically cordon the node (to prevent pods from being scheduled on it) and then drain it (to evict the running pods so they get rescheduled elsewhere). Kubespray has a playbook to remove a node, and it appears to do the draining, so with ‘cypher’ I did a ‘kubectl cordon cypher’ before running the playbook. With the other two nodes, I just ran the playbook and it was fine. Here are the steps I did…
Before removal, I checked into what pods (especially ones I created) were running on the nodes with:
kubectl get pods -A -o wide --field-selector spec.nodeName=cypher
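If you want to do the cordon and drain by hand rather than rely on the playbook, the standard kubectl commands would be roughly the following (DaemonSet pods can’t be evicted, hence the flag):

# Stop new pods from landing on the node, then evict what's running there
kubectl cordon cypher
kubectl drain cypher --ignore-daemonsets --delete-emptydir-data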
To remove the node, I did:
export TARGET_NODE=cypher
cd ~/workspace/kubernetes/picluster
poetry shell
cd ../kubespray
ansible-playbook -i ../picluster/inventory/mycluster/hosts.yaml -u ${USER} -b -v --private-key=~/.ssh/id_ed25519 remove-node.yml -e node=${TARGET_NODE}
For the other two nodes, I set TARGET_NODE="niobe,mouse" before running the remove-node.yml playbook.
I ran “kubectl get nodes -o wide” to make sure that the nodes were removed from the cluster. I kept them in the inventory, as I was going to re-add them after re-imaging the drives.
Re-imaging The SSD Drive
Since I do not have a monitor near the cluster, I pulled out the nodes and brought them to the study (one at a time), connected a keyboard, mouse, monitor, ethernet cable, and power module. Here are the steps done for each node…
Holding the SHIFT key down, I powered on the RPI and watched it enter net boot mode. It downloads the image and then reboots into the RPI Imager. I followed Part II of my cluster bring-up procedure to specify RPI4, select the 64-bit Ubuntu 24.04.1 server image, and choose the SSD drive for storage. I then chose to edit the custom settings to set the node name to what it was before, and set the user and password to my username (and a simple password). I selected the America/New_York time zone, saved the changes, and then confirmed imaging the whole drive.
Note that my router has a reserved DHCP entry with the desired IP for each node, based on the MAC of the ethernet interface, so the node will retain the same IP address.
When the imaging was done and the node had booted, I logged in and generated an SSH key pair using ‘ssh-keygen -t ed25519’. I then did an ‘ssh-copy-id <IP-OF-MY-MAC-HOST>’ and entered my password to copy the key. On my Mac, I had to remove the known_hosts entry for the IP of this node. I then did an “ssh-copy-id <IP-OF-THE-NODE>” and used the password to copy the public key. I verified that I can SSH to the node and the node can SSH to my Mac.
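For reference, the whole key exchange boils down to a handful of commands; the IP placeholders are the same ones used above, and ssh-keygen -R is just one way to clear the stale known_hosts entry:

# On the freshly imaged node: generate a key pair and push the public key to my Mac
ssh-keygen -t ed25519
ssh-copy-id <IP-OF-MY-MAC-HOST>

# On the Mac: drop the old host key for the re-imaged node, then push my key to it
ssh-keygen -R <IP-OF-THE-NODE>
ssh-copy-id <IP-OF-THE-NODE>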
Continuing with Part II of the process, I SSH’ed to the node and replaced /etc/netplan/50-cloud-init.yaml with the static IP, DNS IPs, and search domain (use the template in Part II and replace the IP for the node). From the console of the node, I did a “sudo netplan apply”.
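As a rough illustration only (the real values come from the Part II template), writing a static-IP netplan config and applying it looks something like this; the interface name, addresses, and search domain here are all placeholders:

# Sketch: overwrite the cloud-init netplan file with a static configuration
sudo tee /etc/netplan/50-cloud-init.yaml >/dev/null <<'EOT'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [192.168.1.21/24]
      nameservers:
        addresses: [192.168.1.1]
        search: [example.home]
      routes:
        - to: default
          via: 192.168.1.1
EOT
sudo netplan apply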
This is enough of a basic configuration that I can now use Ansible playbooks to do the rest of the work. I did a “sudo shutdown -h 0” on the node, unplugged everything, and reinstalled it in the rack of my cluster.
Preparing The Nodes
Following Part IV of the Kubernetes cluster setup, I set TARGET_HOST to “cypher” (and later to “niobe,mouse”), and, while still in the same Poetry shell, did a ping check of the cluster to make sure I could communicate with the node being updated:
cd ~/workspace/kubernetes/picluster
poetry shell
ansible-playbook -i inventory/mycluster/hosts.yaml playbooks/ping.yaml --private-key=~/.ssh/id_ed25519
With that working, I ran through each of the commands to set things up (entering the node password for the first command, when prompted):
ansible-playbook -i "${TARGET_HOST}," playbooks/passwordless_sudo.yaml -v --private-key=~/.ssh/id_ed25519 --ask-become-pass
ansible-playbook -i "${TARGET_HOST}," playbooks/ssh.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/hostname.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/os_update.yaml --extra-vars "inventory=all reboot_default=false proxy_env=[]" --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/tools.yaml -v --private-key=~/.ssh/id_ed25519
This sets up password-less sudo, restricts SSH to key-based (passphrase) logins, sets the FQDN, updates the OS, and installs the desired tools. Since the cluster uses kube-vip, we need to define the hostname for the kube-apiserver, so that when the node boots it can contact the API server before CoreDNS is running. It involves adding the following line to /etc/cloud/templates/hosts.debian.tmpl:
<LOAD-BALANCER-IP-FOR-API-SERVER> lb-apiserver.kubernetes.local
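One way to append that line by hand (the placeholder is the same kube-vip VIP as above):

# Add the kube-vip API server VIP to the hosts template so it survives reboots
echo "<LOAD-BALANCER-IP-FOR-API-SERVER> lb-apiserver.kubernetes.local" | sudo tee -a /etc/cloud/templates/hosts.debian.tmpl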
There are some more RPI setup steps to do (this assumes you have set up ~/workspace/SKU_RM0004/ as specified in Part IV)…
ansible-playbook -i "${TARGET_HOST}," playbooks/uctronics.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/cgroups.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/iptables.yaml -v --private-key=~/.ssh/id_ed25519
This sets up the LCD display on the UCTRONICS front panel, configures cgroups, loads the overlay modules, sets up iptables for bridged traffic, and enables IPv4 forwarding.
You can check the UCTRONICS display, check the IP address for the node (ip addr), check the FQDN (hostname --fqdn), and check that the kernel is what you expected (uname -a). We are ready to add the node back to the cluster…
Re-Adding The Node
Moving over to the kubespray area, we will run the cluster.yml playbook; since the node is still in the inventory, it will be added back to the cluster:
cd ~/workspace/kubernetes/kubespray
ansible-playbook -i ../picluster/inventory/mycluster/hosts.yaml -u ${USER} -b -v --private-key=~/.ssh/id_ed25519 cluster.yml
It takes a long time, but after completion, you can check that the node is present (kubectl get nodes -o wide), and check that all resources are ready/running (kubectl get all -A). The cluster should be good to go now.
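As an extra sanity check, kubectl can wait for the node to report Ready and then show what has been scheduled onto it (standard kubectl, nothing cluster-specific):

# Wait for the re-added node to become Ready, then see what landed on it
kubectl wait --for=condition=Ready node/cypher --timeout=10m
kubectl get pods -A -o wide --field-selector spec.nodeName=cypher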
Control Plane/ETCD Node
This is a bit trickier. First, if one of the control plane nodes being re-imaged is the FIRST entry in the inventory, you need to move it in the ordering so that it is not the first entry. The file is ~/workspace/kubernetes/picluster/inventory/mycluster/hosts.yaml in my case. From what I can see in the Kubespray documentation, the reordering applies to the control plane, etcd, and node sections. I’m not sure if I have to do it in all three places, but I will, so as not to anger the Kubespray gods :).
This is the case for me today, as I want to re-image ‘apoc’ and ‘lock’, and the former is first in the inventory list.
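For illustration, the relevant groups in hosts.yaml after the reordering would look roughly like this; ‘<third-cp-node>’ is a stand-in for my remaining control plane node, and the group names follow the standard Kubespray inventory layout (the kube_node group would get the same treatment if the node appears there):

    kube_control_plane:
      hosts:
        <third-cp-node>:    # moved to the top so 'apoc' is no longer first
        apoc:
        lock:
    etcd:
      hosts:
        <third-cp-node>:
        apoc:
        lock: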
Second, there is supposed to be an odd number of etcd nodes. I have three, but two of them are nodes I need to re-image. It “looks” like I can temporarily run with an even number of nodes during the re-imaging process.
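While running with only two members, it seems worth keeping an eye on etcd health; a rough check from one of the remaining etcd nodes might look like this (the cert paths and file names are my assumption about where Kubespray normally puts them, and <node-name> is the host you run it on, so adjust to your install):

# Check the health of all etcd members from an etcd node
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-<node-name>.pem \
  --key=/etc/ssl/etcd/ssl/admin-<node-name>-key.pem \
  endpoint health --cluster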
Third, the Kubespray example node configuration shows etcd running on the control plane nodes. I really don’t know if this is a requirement or just a simple convention used by Kubespray. I’ll ask on Slack, as it may be easier to change the inventory to move etcd from the ‘apoc’ and ‘lock’ control plane nodes, to two worker nodes that are already re-imaged.
This may also help with balancing the load on the nodes. Currently, the three control plane nodes run the normal services (kube-apiserver, kube-controller-manager, kube-scheduler), along with Calico, Longhorn, MetalLB, cert-manager, etcd, Prometheus, and Loki. With all that, the load is very high on these nodes and much lower on the worker nodes. I’m wondering if etcd can be moved to worker nodes, and whether that would relieve a significant amount of load from the control plane nodes.
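For a quick before-and-after comparison of node load, something like the following would do; kubectl top requires metrics-server (which isn’t in my list above, so that part is an assumption about the setup), and uptime over SSH works regardless:

# Compare load across nodes (needs metrics-server)
kubectl top nodes
# ...or spot-check a node directly
ssh apoc uptime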