Stealing a CPU: how to tell how oversubscribed your cloud is
Imagine you have a virtual machine. It’s running peacefully inside the cloud somewhere. You are a bit curious how close it is to max capacity. You log in and run top. And then you are baffled. Maybe it shows 99% but you think it can do more. Maybe it shows 1% but you are skeptical it has 100x to go.
But wait, what is that field at the far right in top? %st?
That field is called ‘steal time’, and it works like this. Your ‘top’ uses standard Unix accounting, and it expects CPU time to sum to 100% across categories such as idle, system and user. When you are running on shared infrastructure there is one more category: time in which some other virtual machine ‘stole’ the CPU. This is an involuntary wait, i.e. your VM wanted to be running in system or user, but the physical CPU wasn’t available.
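If you would rather compute the number yourself than read it off top, here is a minimal Python sketch. It assumes a Linux guest, where the first line of /proc/stat holds cumulative tick counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The helper names (parse_cpu_line, steal_percent, read_cpu_line) are made up for illustration, and this is not necessarily how top does it internally.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line into a dict of cumulative tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer fields; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Share of elapsed CPU ticks between two samples that was stolen."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total > 0 else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (assumption: Linux guest)."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = parse_cpu_line(read_cpu_line())
    time.sleep(1.0)  # sampling interval, chosen arbitrarily
    second = parse_cpu_line(read_cpu_line())
    print(f"steal: {steal_percent(first, second):.1f}%")
```

The counters are cumulative since boot, so a single reading only gives a long-run average; the difference between two samples is what tells you about steal right now.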
Another way to look at it: if steal time is quite large, then you are running on a cloud instance with a ‘noisy neighbour’, and the host is probably a bit more oversubscribed than you want.
So you have a couple of cases:
- Top shows you are pretty busy, but steal time is low: you need to buy more vCPUs (you are busy but the host is not).
- Top shows you are not that busy, but steal time is high: you need to run on a less oversubscribed cloud.
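The two cases above can be written down as a tiny decision helper. This is only a sketch: the function name diagnose and both thresholds (80% busy, 10% steal) are assumptions picked for illustration, not recommended values, and busy_pct here means user plus system as reported by top.

```python
def diagnose(busy_pct, steal_pct, busy_threshold=80.0, steal_threshold=10.0):
    """Map a (busy %, steal %) reading onto the two cases described above.

    Both thresholds are hypothetical defaults; tune them for your workload.
    Returns None for combinations the two cases do not cover.
    """
    busy = busy_pct >= busy_threshold
    stolen = steal_pct >= steal_threshold
    if busy and not stolen:
        # You are busy but the host is not.
        return "buy more vCPUs"
    if stolen and not busy:
        # The host is busy on someone else's behalf.
        return "move to a less oversubscribed cloud"
    return None


if __name__ == "__main__":
    print(diagnose(busy_pct=95.0, steal_pct=1.0))
    print(diagnose(busy_pct=20.0, steal_pct=35.0))
```

Where you draw the line between ‘low’ and ‘high’ steal is a judgment call, which is why the thresholds are parameters rather than constants.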
There’s a neat article here with some graphs, for those interested in seeing this worked out empirically.