When managed software goes bad… A cloud tale
So the other day I wrote about my experience with the first ‘critical’ Kubernetes bug, and how the mitigation took down my Google Kubernetes Engine (GKE) cluster. In that case, Google pushed an upgrade and missed something with the Calico migration (Calico had been installed by them as well; nothing had been changed by me). Ooops.
Today, Azure AKS. Errors like:
"heapster" is forbidden: User "system:serviceaccount:kube-system:heapster" cannot update deployments.extensions in the namespace "kube-system"
start appearing. Along with a mysterious ‘server is misbehaving’ message associated with ‘exec’ to a single namespace (other namespaces are ok, and non-exec calls within this namespace are ok). Hmm.
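For what it's worth, the forbidden message already names everything needed to re-check the permission by hand. Here is a minimal Python sketch, assuming the error always has the shape quoted above: it pulls out the user, verb, resource and namespace and prints a `kubectl auth can-i` command that impersonates the service account. The regex and the helper names (`parse_forbidden`, `can_i_command`) are my own illustration, not anything from AKS or the linked issue.

```python
import re
import shlex

# Pattern for the RBAC "forbidden" message shape quoted above (an assumption
# based on that one sample, not a complete grammar of API server errors).
# The closing quote is optional so a truncated log line still matches.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\S+) (?P<resource>\S+) '
    r'in the namespace "(?P<namespace>[^"]+)"?'
)


def parse_forbidden(message):
    """Return the user, verb, resource and namespace named in the error, or None."""
    match = FORBIDDEN_RE.search(message)
    return match.groupdict() if match else None


def can_i_command(fields):
    """Build a `kubectl auth can-i` invocation that re-checks the same permission."""
    return [
        "kubectl", "auth", "can-i", fields["verb"], fields["resource"],
        "--as=" + fields["user"],
        "--namespace=" + fields["namespace"],
    ]


error = (
    '"heapster" is forbidden: User "system:serviceaccount:kube-system:heapster" '
    'cannot update deployments.extensions in the namespace "kube-system"'
)
fields = parse_forbidden(error)
print(shlex.join(can_i_command(fields)))
```

Running the printed command against the cluster should answer yes or no for that exact verb and resource, which at least confirms whether the role binding is what is missing. Impersonating with `--as` needs a user who is allowed to impersonate, such as a cluster admin.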
Some online ‘research’ and we are led to Issue #664.
Looking deeper at the ‘server misbehaving’ leads to some discussion about kube-dns being broken. Kube-system shows errors like:
Node aks-nodepool1-19254313-0 has no valid hostname and/or IP address: aks-nodepool1-19254313-0
Hmm. That is my node name, how could it lose track of its own hostname? I don’t even have (easy) access to this, it’s all managed.
OK, unpack the ‘access one Azure node’ here. And we’re into the assumed ‘sick’ node. Snoop around, nothing seems too wrong.
So… peanut gallery, what does one do? Delete the cluster and move on with life? Open a support ticket?