Rebuilding a Kubernetes Node
When I created my Kubernetes cluster, I had (naively) partitioned the 1TB disk on each node into separate areas for root, home, and logs. What I found out later was that the log area wasn’t always large enough, so I was running into disk space issues.
I decided that I would re-image these nodes at a later time. Well, that time is now. I have eight nodes, five of which have these multi-partition drives: three are worker nodes and two are control plane nodes.
Prep Work
Before doing anything, I made backups of the Postgres databases my apps use in the cluster by exporting them. I have a script for each app that does this. For example:
cat > exec-init-db-pod <<'EOT'
kubectl exec -it -n viewmaster `kubectl get pod -n viewmaster -l tier=postgres | cut -f1 -d" " | tail -1` -- /bin/bash
EOT
chmod +x exec-init-db-pod
This gets me into the database pod for an app I have in the namespace ‘viewmaster’. I then create a backup:
pg_dump -U <DB_USERNAME> -W -F t <DB_NAME> > viewmasterdb.tar
I entered the database password when prompted, exited the pod, and then ran this command from my Mac to pull down the backup:
cat > move-backup <<'EOT'
kubectl cp -n viewmaster `kubectl get pod -n viewmaster -l tier=postgres | cut -f1 -d" " | tail -1`:viewmasterdb.tar "viewmasterdb-${TIMESTAMP}.tar"
EOT
chmod +x move-backup
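The move-backup script expects a TIMESTAMP variable to be set in the environment (it isn’t defined in the script itself), so a minimal usage sketch looks like the following; the date format is just my choice, and the pg_restore line is only a rough reminder of how a restore would go later (it assumes the target database already exists in the pod):

# Set a timestamp and pull the backup down to the Mac
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
./move-backup

# Later, to restore (run inside the database pod, after copying the tar back up)
pg_restore -U <DB_USERNAME> -W -d <DB_NAME> viewmasterdb.tar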
With Longhorn running in my cluster, I had also set it up to do periodic snapshots of the volumes, so hopefully I’ve got everything I need, in case things go south (one never really knows until something bad happens and the cluster has to be rebuilt).
Worker Nodes
Figuring I would tackle these first, as they would be easier, I started with the node ‘cypher’, and then did ‘niobe’ and ‘mouse’ together, since most operations use Kubespray commands and I can specify more than one node.
Node Removal
I read that one would typically cordon the node (to prevent pods from being scheduled on it) and then drain it (to evict the running pods so they get rescheduled elsewhere). Kubespray has a playbook to remove a node, and it appears to do the draining, so with ‘cypher’ I did a ‘kubectl cordon cypher’ before running the playbook. With the other two nodes, I just ran the playbook and it was fine. Here are the steps I did…
Before removal, I checked into what pods (especially ones I created) were running on the nodes with:
kubectl get pods -A -o wide --field-selector spec.nodeName=cypher
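If you want to do the cordon and drain by hand rather than rely on the playbook, the standard kubectl commands would be roughly the following (DaemonSet pods can’t be evicted, hence the flag):

# Stop new pods from landing on the node, then evict what's running there
kubectl cordon cypher
kubectl drain cypher --ignore-daemonsets --delete-emptydir-data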
To remove the node, I did:
export TARGET_NODE=cypher
cd ~/workspace/kubernetes/picluster
poetry shell
cd ../kubespray
ansible-playbook -i ../picluster/inventory/mycluster/hosts.yaml -u ${USER} -b -v --private-key=~/.ssh/id_ed25519 remove-node.yml -e node=${TARGET_NODE}
For the other two nodes, I set TARGET_NODE="niobe,mouse" before running the remove-node.yml playbook.
I ran “kubectl get nodes -o wide” to make sure that the nodes were removed from the cluster. I kept them in the inventory, as I was going to re-add them after re-imaging the drives.
Re-imaging The SSD Drive
Since I do not have a monitor near the cluster, I pulled out the nodes and brought them to the study (one at a time), connected a keyboard, mouse, monitor, ethernet cable, and power module. Here are the steps done for each node…
Holding the SHIFT key down, I powered on the RPI and watched it enter net boot mode. It downloads the image and then reboots into the RPI Imager. I followed Part II of my cluster bring-up procedure to specify RPI4, select the 64-bit Ubuntu 24.04.1 server image, and choose the SSD drive for storage. I then chose to edit the custom settings to set the node name to what it was before, and set the user and password to my username (and a simple password). I selected the America/New_York time zone, saved the changes, and then confirmed imaging the whole drive.
Note that my router has a reserved DHCP entry with the desired IP for each node, based on the MAC of the ethernet interface, so the node will retain the same IP address.
When the imaging was done and the node had booted, I logged in and generated an SSH key pair using ‘ssh-keygen -t ed25519’. I then did an ‘ssh-copy-id <IP-OF-MY-MAC-HOST>’ and entered my password to copy the key. On my Mac, I had to remove the known_hosts entry for the IP of this node. I then did an “ssh-copy-id <IP-OF-THE-NODE>” and used the password to copy the public key. I verified that I can SSH to the node and the node can SSH to my Mac.
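For reference, the whole key exchange boils down to a handful of commands; the IP placeholders are the same ones used above, and ssh-keygen -R is just one way to clear the stale known_hosts entry:

# On the freshly imaged node: generate a key pair and push the public key to my Mac
ssh-keygen -t ed25519
ssh-copy-id <IP-OF-MY-MAC-HOST>

# On the Mac: drop the old host key for the re-imaged node, then push my key to it
ssh-keygen -R <IP-OF-THE-NODE>
ssh-copy-id <IP-OF-THE-NODE>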
Continuing with Part II of the process, I SSH’ed to the node and replaced /etc/netplan/50-cloud-init.yaml with the static IP, DNS IPs, and search domain (use the template in Part II and replace the IP for the node). From the console of the node, I did a “sudo netplan apply”.
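As a rough illustration only (the real values come from the Part II template), writing a static-IP netplan config and applying it looks something like this; the interface name, addresses, and search domain here are all placeholders:

# Sketch: overwrite the cloud-init netplan file with a static configuration
sudo tee /etc/netplan/50-cloud-init.yaml >/dev/null <<'EOT'
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [192.168.1.21/24]
      nameservers:
        addresses: [192.168.1.1]
        search: [example.home]
      routes:
        - to: default
          via: 192.168.1.1
EOT
sudo netplan apply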
This is enough of a basic configuration that I can now use Ansible playbooks to do the rest of the work. I did a “sudo shutdown -h 0” on the node, unplugged everything, and reinstalled it in the rack of my cluster.
Preparing The Nodes
Following Part IV of the Kubernetes cluster setup, I set TARGET_HOST to “cypher” (and later to “niobe,mouse”), and, while still in the same Poetry shell, did a ping check of the cluster to make sure I could communicate with the node being updated:
cd ~/workspace/kubernetes/picluster
poetry shell
ansible-playbook -i inventory/mycluster/hosts.yaml playbooks/ping.yaml --private-key=~/.ssh/id_ed25519
With that working, I ran through each of the commands to set things up (entering the node password for the first command, when prompted):
ansible-playbook -i "${TARGET_HOST}," playbooks/passwordless_sudo.yaml -v --private-key=~/.ssh/id_ed25519 --ask-become-pass
ansible-playbook -i "${TARGET_HOST}," playbooks/ssh.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/hostname.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/os_update.yaml --extra-vars "inventory=all reboot_default=false proxy_env=[]" --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/tools.yaml -v --private-key=~/.ssh/id_ed25519
This sets up password-less sudo, restricts SSH to key-based (passphrase) logins, sets the FQDN, updates the OS, and installs the desired tools. Since the cluster uses kube-vip, we need to define the hostname for the kube-apiserver, so that when the node boots it can contact the API server before CoreDNS is running. It involves adding the following line to /etc/cloud/templates/hosts.debian.tmpl:
<LOAD-BALANCER-IP-FOR-API-SERVER> lb-apiserver.kubernetes.local
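One way to append that line by hand (the placeholder is the same kube-vip VIP as above):

# Add the kube-vip API server VIP to the hosts template so it survives reboots
echo "<LOAD-BALANCER-IP-FOR-API-SERVER> lb-apiserver.kubernetes.local" | sudo tee -a /etc/cloud/templates/hosts.debian.tmpl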
There are some more RPI setup steps to do (this assumes you have set up ~/workspace/SKU_RM0004/ as specified in Part IV)…
ansible-playbook -i "${TARGET_HOST}," playbooks/uctronics.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/cgroups.yaml -v --private-key=~/.ssh/id_ed25519
ansible-playbook -i "${TARGET_HOST}," playbooks/iptables.yaml -v --private-key=~/.ssh/id_ed25519
This sets up the LCD display on the UCTRONICS front panel, configures cgroups, loads the overlay modules, sets up iptables for bridged traffic, and enables IPv4 forwarding.
You can check the UCTRONICS display, check the IP address for the node (ip addr), check the FQDN (hostname --fqdn), and check that the kernel is what you expected (uname -a). We are ready to add the node back to the cluster…
Re-Adding The Node
Moving over to the kubespray area, we will run the cluster.yml playbook; since the node is still in the inventory, it will be added back to the cluster:
cd ~/workspace/kubernetes/kubespray
ansible-playbook -i ../picluster/inventory/mycluster/hosts.yaml -u ${USER} -b -v --private-key=~/.ssh/id_ed25519 cluster.yml
It takes a long time, but after completion, you can check that the node is present (kubectl get nodes -o wide), and check that all resources are ready/running (kubectl get all -A). The cluster should be good to go now.
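As an extra sanity check, kubectl can wait for the node to report Ready and then show what has been scheduled onto it (standard kubectl, nothing cluster-specific):

# Wait for the re-added node to become Ready, then see what landed on it
kubectl wait --for=condition=Ready node/cypher --timeout=10m
kubectl get pods -A -o wide --field-selector spec.nodeName=cypher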
Control Plane/ETCD Node
This is a bit trickier. First, if one of the control plane nodes being re-imaged is the FIRST entry in the inventory, you need to move it in the ordering so that it is not the first entry. The file is ~/workspace/kubernetes/picluster/inventory/mycluster/hosts.yaml in my case. From what I can see in the Kubespray documentation, the reordering applies to the control plane, etcd, and node sections. I’m not sure if I have to do it in all three places, but I will, so as not to anger the Kubespray gods :).
This is the case for me today, as I want to re-image ‘apoc’ and ‘lock’, and the former is first in the inventory list.
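For illustration, the relevant groups in hosts.yaml after the reordering would look roughly like this; ‘<third-cp-node>’ is a stand-in for my remaining control plane node, and the group names follow the standard Kubespray inventory layout (the kube_node group would get the same treatment if the node appears there):

    kube_control_plane:
      hosts:
        <third-cp-node>:    # moved to the top so 'apoc' is no longer first
        apoc:
        lock:
    etcd:
      hosts:
        <third-cp-node>:
        apoc:
        lock: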
Second, there is supposed to be an odd number of etcd nodes. I have three, but two of them are nodes I need to re-image. It “looks” like I can temporarily run with an even number of nodes during the re-imaging process.
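While running with only two members, it seems worth keeping an eye on etcd health; a rough check from one of the remaining etcd nodes might look like this (the cert paths and file names are my assumption about where Kubespray normally puts them, and <node-name> is the host you run it on, so adjust to your install):

# Check the health of all etcd members from an etcd node
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-<node-name>.pem \
  --key=/etc/ssl/etcd/ssl/admin-<node-name>-key.pem \
  endpoint health --cluster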
Third, the Kubespray example node configuration shows etcd running on the control plane nodes. I really don’t know if this is a requirement or just a simple convention used by Kubespray. I’ll ask on Slack, as it may be easier to change the inventory to move etcd from the ‘apoc’ and ‘lock’ control plane nodes, to two worker nodes that are already re-imaged.
This may also help with balancing the load on the nodes. Currently, the three control plane nodes run the normal services (kube-apiserver, kube-controller-manager, kube-scheduler), along with Calico, Longhorn, MetalLB, cert-manager, etcd, Prometheus, and Loki. With all that, the load is very high on these nodes and much lower on the worker nodes. I’m wondering if etcd can be moved to worker nodes, and whether that would relieve a significant amount of load from the control plane nodes.
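For a quick before-and-after comparison of node load, something like the following would do; kubectl top requires metrics-server (which isn’t in my list above, so that part is an assumption about the setup), and uptime over SSH works regardless:

# Compare load across nodes (needs metrics-server)
kubectl top nodes
# ...or spot-check a node directly
ssh apoc uptime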