How to recover (I think) from a botched Kubernetes update
I was using KubeAdm v1.10 and wanted to give the latest Kubernetes from master a try. Unfortunately, I just updated the binaries for kubeadm, kubectl, and kubelet. I restarted the kubelet daemon (“sudo systemctl restart kubelet”), and then ran “kubeadm init”, hoping to sit back and watch the new cluster come up.
First, I found that the config file needed a newer API version, so I changed that to use “kubeadm.k8s.io/v1alpha2”, instead of “kubeadm.k8s.io/v1alpha1”, and tried “kubeadm init” again.
Well, it failed to come up, and in looking at the issue, I found that the kubelet configuration file, /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, was referring to a kubelet config file that did not exist.
I had no clue how to create this file, nor why it wasn’t there.
What Should Have Been Done?
It looks like, going from v1.10 to a newer version, the upgrade procedures should be used. One needs to go one minor release at a time (1.10 -> 1.11, 1.11 -> 1.12, …). In this process, one can use “kubeadm config migrate --old-config kubeadm.conf --new-config new-kubeadm.conf” to update the config file. You can then change the API version, and set the Kubernetes version, before using the config file in the update.
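The one-minor-release-at-a-time rule can be sketched as a small shell helper. This is only an illustration: upgrade_path is a hypothetical function name (it is not part of kubeadm), and it assumes both versions are written as MAJOR.MINOR and share the same major number.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubeadm): print each minor-release hop
# needed to get from one Kubernetes version to another.
# Assumes versions look like MAJOR.MINOR and share the same major number.
upgrade_path() (
    major="${1%%.*}"   # "1.10" -> "1"
    minor="${1#*.}"    # "1.10" -> "10"
    target="${2#*.}"   # "1.12" -> "12"
    while [ "$minor" -lt "$target" ]; do
        next=$((minor + 1))
        echo "${major}.${minor} -> ${major}.${next}"
        minor=$next
    done
)

# List the hops from v1.10 to v1.12, one per line.
upgrade_path 1.10 1.12
```

Each printed hop would be one pass through the upgrade procedure, with the config file migrated before it is used for that hop.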
I’m not sure what you would do, if you didn’t have a running v1.10 cluster, as this method seems to imply that is needed. Maybe you’d end up in the same state as I was in.
What to do, if you didn’t do the upgrade?
From what I can tell, v1.11+ needs /var/lib/kubelet/config.yaml, which doesn’t exist in v1.10. It gets generated when kubeadm init is invoked, and removed when kubeadm reset is done. But, with a previous v1.10 install on the system, it was not getting created during init, and with this file missing, init fails. That file was the key to recovering from my mess.
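A quick way to see whether you are in the same state is to check for that file before running init. The following is a minimal sketch; check_kubelet_config is a made-up helper name, and the default path is simply the one mentioned above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: report whether the kubelet config file exists.
# The default path is the one discussed in the text; pass another path to
# override it.
check_kubelet_config() {
    cfg="${1:-/var/lib/kubelet/config.yaml}"
    if [ -f "$cfg" ]; then
        echo "found: $cfg"
    else
        echo "missing: $cfg"
        return 1
    fi
}

# Report the status; "|| true" keeps the script going when the file is absent.
check_kubelet_config || true
```

The function returns non-zero when the file is missing, so it could also be used as a guard in front of other commands.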
On a fresh system install, I brought up Kubernetes v1.11, using KubeAdm and the same config file that I was using on the corrupt system. I took the config.yaml (this one is what I had, YMMV) that was created, and placed it on the system that was corrupted, and then brought up the cluster with “kubeadm init”.
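Putting the borrowed file in place could look something like the sketch below. The helper name install_kubelet_config and the ".bak" backup suffix are my own inventions for illustration; the destination defaults to the path kubelet was looking for, and how the file gets from the fresh system to the broken one (scp, USB stick, etc.) is left out.

```shell
#!/bin/sh
# Hypothetical helper: copy a known-good kubelet config into place,
# keeping a backup of any file that is already there.
install_kubelet_config() {
    src="$1"
    dest="${2:-/var/lib/kubelet/config.yaml}"
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    mkdir -p "$(dirname "$dest")"
    if [ -f "$dest" ]; then
        cp "$dest" "$dest.bak"   # assumption: a simple .bak copy is enough
    fi
    cp "$src" "$dest"
    echo "installed $dest"
}

# Example (as root on the broken system, after copying the file over):
#   install_kubelet_config ./config.yaml
#   kubeadm init
```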
The cluster came up OK with that change. Oddly enough, doing a “kubeadm reset” removed the file, and it was recreated the next time I did “kubeadm init”. I’m not sure why the file was not created when I had v1.10 set up and then switched to v1.11. In any case, I’m happy it is working now.
I did see another problem though, and I’m not sure if it is related. Once I brought up the master node, set the bridge CNI plugin config file, and untainted the node, I created some alpine pods. From the pod, I could not ping other pods on the node. Looking at the iptables rules, I was seeing this rule…
-P FORWARD DROP
-P FORWARD ACCEPT
I flushed all the iptables rules, and then brought up the cluster with KubeAdm again, and after untainting and creating pods, everything seemed to work just peachy.
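If you want to check the FORWARD chain's default policy before and after, a small filter over the "iptables -S" output will do, since that output lists policies as "-P CHAIN POLICY" lines. This is just a sketch: forward_policy is a hypothetical helper name, and the sample input in the demo is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "iptables -S" style output on stdin and print
# the default policy of the FORWARD chain.
forward_policy() {
    awk '$1 == "-P" && $2 == "FORWARD" { print $3 }'
}

# Demo on sample text (made up for illustration):
forward_policy <<'EOF'
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
EOF

# On a real node, something like:
#   sudo iptables -S | forward_policy
```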