Stealing a CPU: how to tell how oversubscribed your cloud is

Imagine you have a virtual machine. It's running peacefully inside the cloud somewhere. You are a bit curious how close it is to max capacity. You log in and run top. And then you are baffled. Maybe it shows 99% but you think it can do more. Maybe it shows 1% but you are skeptical it has 100x to go.

But wait, what is that field at the far right in top? %st?

That field is called ‘steal time’, and it works like this. Your ‘top’ uses standard Unix accounting, which expects CPU time to sum to 100% across the usual buckets (idle/system/user). When you are running inside shared infrastructure there is one more bucket: time that some other virtual machine ‘stole’. This is an involuntary wait, i.e. your VM wanted to be in system/user, but the hypervisor had handed the physical CPU to someone else.
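If you want the same number without top, here is a minimal sketch of the accounting. On Linux, the aggregate `cpu` line of /proc/stat holds cumulative tick counters, and steal is the eighth one. The function names and the two sample lines are made up for illustration; on a real guest you would read the first line of /proc/stat twice, a few seconds apart, instead.

```python
# Counters on the aggregate "cpu" line of /proc/stat, in order.
# The trailing guest fields are ignored in this sketch.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu ...' line into a dict of cumulative tick counts."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def percentages(before, after):
    """Share of each bucket between two samples; the shares sum to 100."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical samples (not real measurements), taken some seconds apart.
SAMPLE_T0 = "cpu  1000 0 500 8000 100 0 0 400 0 0"
SAMPLE_T1 = "cpu  1300 0 600 8400 100 0 0 600 0 0"

shares = percentages(parse_cpu_line(SAMPLE_T0), parse_cpu_line(SAMPLE_T1))
print(f"steal: {shares['steal']:.1f}%  idle: {shares['idle']:.1f}%")
```

The counters are cumulative since boot, so only the difference between two samples says anything about what is happening right now.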

Another way to look at it: if steal time is quite large, you are probably running on a cloud instance with a ‘noisy neighbour’, and the host is a bit more oversubscribed than you want.

So you have a couple of cases:

  1. Top shows you are pretty busy, but steal time is low: you are busy but the host is not, so you need to buy more vCPU.
  2. Top shows you are not that busy, but steal time is high: you need to run on a less oversubscribed cloud.
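The two cases can be sketched as a tiny decision function. The thresholds (80% busy, 10% steal) and the name `diagnose` are assumptions picked for illustration, not numbers from any vendor, and the last two branches are extra cases the list above does not cover.

```python
def diagnose(busy_pct, steal_pct, busy_threshold=80.0, steal_threshold=10.0):
    """Map (busy %, steal %) to a rough recommendation.

    busy_pct is user + system as shown by top; steal_pct is the %st field.
    Both thresholds are illustrative assumptions, so tune them to taste.
    """
    busy = busy_pct >= busy_threshold
    stolen = steal_pct >= steal_threshold
    if busy and not stolen:
        # Case 1: you are busy but the host is not.
        return "buy more vCPU"
    if not busy and stolen:
        # Case 2: you are waiting on the host, not on your own work.
        return "move to a less oversubscribed host"
    if busy and stolen:
        # Not in the list above: both problems at once.
        return "busy and oversubscribed: fix the host first, then re-measure"
    return "nothing to do"


print(diagnose(95.0, 1.0))   # case 1
print(diagnose(15.0, 30.0))  # case 2
```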

There’s a neat article here with some graphs for those interested in seeing this worked out empirically.

