My build is slow: understanding resource limits and steal time in the cloudy world of cloud
I earlier wrote about steal time, the concept of “my image wants to use time but it’s not available due to some unknown noisy neighbour stealing it”. In a nutshell, you have a server with X resources. You then ‘sell’ 10X to your users, making it 10:1 oversubscribed. The Internet industry of late has been making noises about regulating oversubscription through e.g. sunshine/transparency, or truth in advertising. There has also been a ton of research done around queuing, etc., in an oversubscribed network.
The way I normally explain this to the layperson is: my house has 5 toilets. Each has a 3″ outlet valve. But my house has a 4″ sewer line to the city. Clearly this is not designed to flush all 5 at the same instant.
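To put a number on the toilet analogy, here is a rough sketch that compares cross-sectional areas. It assumes the 3″ and 4″ figures are pipe diameters and that capacity scales with area, which is a simplification (real drain flow does not). The function names are made up for illustration.

```python
import math

def pipe_area(diameter_in: float) -> float:
    """Cross-sectional area of a round pipe, in square inches."""
    return math.pi * (diameter_in / 2) ** 2

def oversubscription_ratio(n_outlets: int, outlet_in: float, trunk_in: float) -> float:
    """Combined outlet area divided by trunk area; above 1 means oversubscribed."""
    return n_outlets * pipe_area(outlet_in) / pipe_area(trunk_in)

# Five 3-inch outlets feeding one 4-inch sewer line (assumed to be diameters).
ratio = oversubscription_ratio(5, 3.0, 4.0)
print(f"{ratio:.2f}:1")  # prints 2.81:1
```

So under those assumptions the house is roughly 2.8:1 oversubscribed, which is fine as long as everybody does not flush at once.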
The reason resources are oversubscribed is cost. You actually want a high peak and a low average for most things (consider how wide the road would need to be if each person had their own guaranteed lane).
So today I’m watching my CI pipeline. I tested the change locally and did a push. And it’s been more than 1 hour in ‘the cloud’ for something that took only about 5 minutes locally. How can this be? I thought the cloud was fast.
First, let’s compare. In the cloud I have 3 x 4 vCPU/7.5GB of RAM. On the desktop I have 1 x 8C/16T/32GB of RAM. So I guess the desktop is bigger than the cloud.
Second, let’s compare how much we actually get. The single machine the job runs on (4 vCPU/7.5GB RAM) is running Kubernetes and some other pods. But it’s also carved out of a bigger machine that other people share (the ‘noisy neighbours’).
I first started to dig into my old favourite, ‘steal time’. But it was showing 0.0. Hmm, not what I expected. So the hypervisor is probably not overselling my CPU, which leaves a cgroup limit or an IO limit as the suspects.
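For reference, steal time can be computed from two samples of the aggregate ‘cpu’ line in /proc/stat, which is where top gets its ‘st’ figure. This is a minimal sketch, assuming a Linux kernel new enough to report the steal column (the eighth counter). The sample lines are invented for illustration, not taken from my nodes.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]

def steal_percent(before: str, after: str) -> float:
    """Percentage of elapsed CPU time counted as 'steal' between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Column order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only 8.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0

# Hypothetical samples taken a few seconds apart.
before = "cpu  1000 0 500 8000 100 0 50 0 0 0"
after = "cpu  1400 0 700 8700 150 0 50 50 0 0"
print(f"steal: {steal_percent(before, after):.1f}%")
```

On a live node, the two samples would be the first line of /proc/stat read a few seconds apart.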
I then tried ‘pv /dev/zero > foo’ on each of the Kubernetes nodes. This gives me a rough idea of their disk write performance. One of the nodes is ~100MiB/s, another is ~130MiB/s, and one is ~190MiB/s. So yes, we have some noisy neighbours.
In comparison, the humble desktop is showing 1.9GiB/s. So somewhat more than 10x faster than even the best node. I guess the cloud SSD is not as good as my single-stick NVMe. Hmm. But I don’t think this 10x disk speed accounts for my issue.
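If pv is not installed on a node, a similar rough number can be had from a few lines of Python. This is only a sketch of the same idea (sequential writes of zeros), not what I ran. The function name and default sizes are arbitrary, and unlike the pv one-liner it calls fsync so the page cache does not flatter the result.

```python
import os
import tempfile
import time

def write_throughput_mib_s(directory: str, total_mib: int = 256, chunk_mib: int = 4) -> float:
    """Write zeros to a temporary file in `directory`, fsync, and return MiB/s."""
    chunk = bytes(chunk_mib * 1024 * 1024)
    n_chunks = total_mib // chunk_mib
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            start = time.perf_counter()
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # count the time needed to reach the device
            elapsed = time.perf_counter() - start
    finally:
        os.unlink(path)  # clean up the scratch file
    return n_chunks * chunk_mib / elapsed

# Example: print(f"{write_throughput_mib_s('.'):.0f} MiB/s")
```

As with pv, this only measures sequential writes of highly compressible data, so treat it as a way to compare nodes rather than as a benchmark.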
I snoop around @ TRIM. I see that the nodes do not have ‘discard’ on the mount, so I run ‘fstrim -av’. I see that one of the nodes has 74GiB of untrimmed data. Hmm, maybe that helps? OK, it did a bit: the ‘slow’ node (the one with the most untrimmed data) caught up to the others.
Want to understand why you need to run TRIM on an SSD? Well, here’s a starter. But, in a nutshell, flash can only be block-erased, and TRIM tells the drive which ‘garbage’ pages are no longer in use so it can erase them before you need them again.
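The ‘discard’ check above can be scripted by reading the mount table. Below is a small sketch that parses text in /proc/mounts format and lists the mount points with no discard option. The filesystem list and the sample table are assumptions for illustration. A missing ‘discard’ is not a problem in itself if something runs fstrim periodically.

```python
def mounts_without_discard(mounts_text: str, fs_types=("ext4", "xfs", "btrfs")) -> list[str]:
    """Return mount points of the given filesystem types lacking a discard option."""
    missing = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = parts[:4]
        opts = options.split(",")
        # Accept both plain "discard" and forms such as "discard=async".
        has_discard = any(o == "discard" or o.startswith("discard=") for o in opts)
        if fs_type in fs_types and not has_discard:
            missing.append(mount_point)
    return missing

# Hypothetical mount table; on a real node, read /proc/mounts instead.
sample = """/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data ext4 rw,relatime,discard 0 0
proc /proc proc rw,nosuid 0 0
"""
print(mounts_without_discard(sample))  # prints ['/']
```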
OK, maybe this is a resource limit? Kubernetes has some resource management that applies to the cgroup of each container. First let’s check the namespace level:
don@cube:~/src-ag/corp-tools/k8s-gitlab$ kubectl describe namespace gitlab-runner
Name:         gitlab-runner
Labels:       <none>
Annotations:  <none>
Status:       Active
No resource quota.
No resource limits.
OK, that was not it, no limit.
Now, let’s do a kubectl describe on the pod and look at the limits:
...
Requests:
  cpu:     100m
  memory:  128Mi
...
QoS Class:  Burstable
We have a request amount, but no limit. We are also ‘burstable’. If I read this correctly, it means we get at least 100 millicores and 128MiB of memory, plus whatever is left over from the other pods.
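As I understand the Kubernetes rules, the QoS class follows from the requests and limits alone. Here is a simplified sketch of that classification. The dict layout is invented for illustration, and quantities are compared as plain strings, so ‘100m’ and ‘0.1’ would wrongly count as different.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from each container's 'requests' and 'limits' dicts."""
    nothing_set = True
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            nothing_set = False
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is omitted defaults to the limit, if one is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if nothing_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

# The runner pod above: requests only, no limits.
runner = [{"requests": {"cpu": "100m", "memory": "128Mi"}}]
print(qos_class(runner))  # prints Burstable
```

With requests but no limits the pod lands in Burstable, which matches what kubectl reported.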
So, no insight there.
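One more way to double-check the cgroup angle is to look at the throttling counters the kernel keeps per cgroup. This is a sketch only, assuming cgroup v2, where the container’s cpu.stat file (typically /sys/fs/cgroup/cpu.stat from inside the container) carries nr_periods and nr_throttled. The sample text is made up. With no CPU limit set I would expect both counters to stay at zero.

```python
def throttle_summary(cpu_stat_text: str) -> dict:
    """Summarise CPU throttling from the contents of a cgroup cpu.stat file."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    return {
        "nr_periods": periods,
        "nr_throttled": throttled,
        # Fraction of enforcement periods in which the cgroup hit its quota.
        "throttled_fraction": throttled / periods if periods else 0.0,
    }

# Hypothetical cpu.stat contents for a container that does have a CPU limit.
sample_stat = """usage_usec 1000000
user_usec 800000
system_usec 200000
nr_periods 200
nr_throttled 50
throttled_usec 1234567
"""
print(throttle_summary(sample_stat)["throttled_fraction"])  # prints 0.25
```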
So, I’m a bit stumped. I looked @ top: all nodes have reasonable idle and memory. I looked @ IO, and all have reasonable IO. I looked @ resource limits, and it’s not that. I looked @ steal time, and I’m not being oversold.
So, any votes on why my CI is slow? Comments?