Security and the Cloud: The need for high bandwidth entropy

Entropy. It's the clutter on your desk, the noise from your fan, the randomness in your life. We spend most of our lives trying to reduce entropy (filing things, sorting, making order from chaos).

But what if I told you there is an entropy shortage somewhere near you, and that you should care? Would you come and lock me up in pyjamas with sleeves in the back? Well... you see, good entropy (randomness) is important for good encryption. And you use a lot of it when you create connections to things (as well as on an ongoing basis). If someone can predict your randomness, even just a little, your protections are reduced.

Enter the cloud. A big server, shared with a lot of people. I've even heard terms like 'serverless' bandied about for something that looks suspiciously like a server to me, just one that shares with a big pool of the great unwashed.

Let's examine the entropy of one of my home Kubernetes systems (which has been pressed into service to build envoy, which uses bazel, which has previously caused a lot of trouble). See the graph? See how it falls off a cliff when the job starts, and then slowly rebuilds? And this is with a hardware-assisted random-number generator (rngd is reading from /dev/hwrng). Imagine my poor machine trying to collect randomness: where will it get it from? There's no mouse or keyboard to get randomness from me. It just sits quietly in the basement, same humdrum existence day in and day out.
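A graph like the one described above can be approximated by polling the kernel's own entropy estimate. This is a minimal sketch, assuming a Linux host that exposes /proc/sys/kernel/random/entropy_avail; the function names, sample count, and interval are illustrative, not what produced my graph. Recent kernels have reworked the RNG, so on a newer machine the number may simply sit at a constant.

```python
import time
from pathlib import Path

# Linux-specific path (an assumption about the host being measured).
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_avail(path=ENTROPY_PATH):
    """Return the kernel's entropy estimate in bits, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def sample_entropy(samples=3, interval=0.5, path=ENTROPY_PATH):
    """Poll the estimate `samples` times, sleeping `interval` seconds between reads."""
    readings = []
    for i in range(samples):
        readings.append(read_entropy_avail(path))
        if i < samples - 1:
            time.sleep(interval)
    return readings


if __name__ == "__main__":
    for value in sample_entropy():
        print(value)
```

Run it in a loop while a build starts and you can watch the estimate drop and recover.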

Now, there are USB random number generators (like this one). $100. It generates about 400kbits/s of randomness. Is it random? Well, that's a hard test to perform. And it matters. What if its random number generator is like the one in my old TI-99/4A machine? You called 'seed(n)' and it followed a chain.
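The 'seed(n)' problem is easy to show. This is a small sketch using Python's standard (non-cryptographic) random module as a stand-in for that kind of generator; the helper name seeded_sequence is made up for illustration. Anyone who knows or guesses the seed can replay the whole chain.

```python
import random


def seeded_sequence(seed, count=5):
    """Return `count` byte values from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(count)]


first = seeded_sequence(42)
replay = seeded_sequence(42)

# Same seed, same chain: the output is fully determined by the seed.
print(first == replay)  # True
```

That is fine for games and simulations, and exactly what you don't want behind a key.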

We could splurge, $300 for this one. It's got 3.2Mb/s of randomness. Maybe I should get a few of these and come up with a cloud service, randomness as a service? Oh wait, you say that exists?

6 comments on “Security and the Cloud: The need for high bandwidth entropy”
    • db db says:

      Now that I want!
      I’m not sure the lava lamp is the most energy efficient, but the style points are high.

  1. Looking at the screenshot you posted, the y axis is compressed, so you actually have lots of entropy even after the drop at 08:52.

    The other item I wanted to touch on: while building entropy after a cold boot is extremely important, the crypto community doesn’t seem to support the idea that, after that initial seeding, CSPRNGs such as the one in the Linux kernel will produce predictable numbers when entropy is low.

    Instead of trying to go into detail here, this article has a better explanation than I can provide:
    https://pthree.org/2014/07/21/the-linux-random-number-generator/

  2. db db says:

    The issue is that on these hosts (running Kubernetes, OpenStack, or serverless), there are a lot of ‘cold boots’ per second occurring.

    Each time a pod is created for Kubernetes, it’s like booting a new machine and creating a new pairwise mTLS trust, new PKI certs, etc.

    Yes, the graph does not go to ‘0’, but that is 1 pod starting, and it took about 10% of my pool. If it had multiple containers in it, and I started multiple pods, and I was using Istio for an mTLS mesh, then I would have blocked (it won’t run out, since it reads /dev/random, not /dev/urandom).

    Good news though. I have purchased a pair of https://www.crowdsupply.com/13-37/infinite-noise-trng to try! It uses a hash function for whitening (similar to the SHA-1 that the kernel uses), which fixes the adjacent-bit correlation issue found in most hardware implementations.

    So it’s not about having a pool of entropy lying around, it’s about replenishing it quickly w/o introducing correlations. It looks like that 1 operation used about 3.2kbits of entropy, and it looks like it replenishes at a rate of about 300bits/s. The ‘infinite-noise-trng’ above should replenish @ 300kb/s w/o trouble.
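To put those figures side by side, here is the arithmetic as a short sketch. It assumes ‘kbits’ and ‘kb/s’ both mean multiples of 1000 bits, and the constant and function names are just labels for this example. At 300 bits/s, one 3.2kbit operation takes roughly 10.7 seconds to pay back; at 300kb/s it takes about a hundredth of a second.

```python
def refill_seconds(bits_consumed, refill_bits_per_second):
    """Seconds needed to replace `bits_consumed` at the given refill rate."""
    if refill_bits_per_second <= 0:
        raise ValueError("refill rate must be positive")
    return bits_consumed / refill_bits_per_second


# Figures taken from the comment above; kilo = 1000 is an assumption.
POD_START_BITS = 3_200
OBSERVED_REFILL_BPS = 300
TRNG_REFILL_BPS = 300_000

for label, rate in (("observed", OBSERVED_REFILL_BPS), ("trng", TRNG_REFILL_BPS)):
    seconds = refill_seconds(POD_START_BITS, rate)
    starts_per_second = rate / POD_START_BITS
    print(f"{label}: {seconds:.3f} s per pod start, {starts_per_second:.2f} pod starts/s sustained")
```

In other words, the observed rate sustains less than one such operation every ten seconds, while the faster source would sustain dozens per second.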

    • I don’t think I would necessarily consider starting a container a cold boot: the kernel CSPRNG is not isolated by namespaces, and each container is using the same underlying state. If using VMs or VM-isolated containers (Intel Clear Containers), it might be a different story, as I don’t know off-hand how the VM implementations seed the CSPRNG in the guest OSes.

      I do see where you’re going with the statement: the concern is more around scheduling lots of containers to a node, which requires bootstrapping mTLS and a bunch of cryptographic secrets on launch, and something in this bootstrap is using /dev/random and could potentially block. Whichever component is using /dev/random and draining the entropy estimate, I would be curious to understand why they made that choice.

      What I’m trying to point out is that a non-blocking read from the CSPRNG isn’t generally considered insecure when entropy is low. It doesn’t produce predictable numbers. In theory they are correlated, because they are generated by advancing the same 4096 bits of kernel state, but in practice it’s not generally considered possible to observe this state without total ownership of the system, in which case I wouldn’t read the CSPRNG state, I would just read the secrets from the running processes.

      Here are better sources going through this:
      https://www.2uo.de/myths-about-urandom/
      https://crypto.stackexchange.com/questions/41595/when-to-use-dev-random-over-dev-urandom-in-linux
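For what a non-blocking read looks like from an application, here is a minimal Python sketch. The helper names new_key and try_getrandom are made up for this example; secrets.token_bytes, os.urandom, and os.getrandom are standard-library calls, with os.getrandom available only on Linux. It is an illustration of the approach argued for above, not what Istio or any other component actually does.

```python
import os
import secrets


def new_key(num_bytes=32):
    """Draw key material from the OS CSPRNG via the standard library."""
    return secrets.token_bytes(num_bytes)


def try_getrandom(num_bytes=32):
    """Non-blocking read; returns None if the kernel RNG is not yet seeded."""
    if not hasattr(os, "getrandom"):
        # Not on Linux: fall back to the portable interface.
        return os.urandom(num_bytes)
    try:
        return os.getrandom(num_bytes, os.GRND_NONBLOCK)
    except BlockingIOError:
        # Only expected very early in boot, before initial seeding.
        return None


key = new_key()
print(len(key))  # 32
```

The only case where try_getrandom comes back empty is the genuine cold boot, which is the one situation both sides of this thread agree matters.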

  3. db db says:

    In Kubernetes with Istio, there is mutual TLS. A new pod has multiple containers; each boots some random thing, but each also now has a pair-wise PKI setup and service account created for each service (TCP port).

    In some environments (e.g. gVisor, Kata) these are using a type-1 hypervisor underneath, i.e. not sharing the kernel. My $0.02 is this will increase in the future: more people will run gVisor style than ‘I really hope the kernel namespace works’ style.

    This can require ~10+ pair-wise PKI setups per container, and you want to get 10 containers going in under a second. That more than drains the entropy pool.

    I’m not sure why some of the components use blocking random, TBH. I definitely observed this problem on my previous company’s OpenStack system. When a large number of instances were scheduled, the controller would block progress while it regenerated its entropy.

    But I’ll post how the ‘infinite-noise-trng’ works. It should also be a bit more efficient, since it does the whitening in the hw vs the kernel’s CSPRNG running on top of Intel’s rngd.
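Until then, here is a rough sketch of what hash-based whitening does to adjacent-bit correlation. Everything in it is an assumption for illustration: the ‘hardware’ is a simulated source that repeats its previous bit 80% of the time, and SHA-256 stands in for whatever hash the device or the kernel really uses. Each 1024-bit raw block is compressed to 256 output bits, so the output never claims more bits than went in.

```python
import hashlib
import random


def correlated_bits(count, repeat_probability=0.8, seed=1):
    """Simulated raw source: each bit tends to repeat the previous one."""
    rng = random.Random(seed)
    bits = [rng.getrandbits(1)]
    while len(bits) < count:
        if rng.random() < repeat_probability:
            bits.append(bits[-1])
        else:
            bits.append(1 - bits[-1])
    return bits


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def unpack_bits(data):
    """Inverse of pack_bits."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def whiten(raw_bits, block_bits=1024):
    """Hash each full raw block down to 256 output bits."""
    out = bytearray()
    for start in range(0, len(raw_bits) - block_bits + 1, block_bits):
        block = pack_bits(raw_bits[start:start + block_bits])
        out += hashlib.sha256(block).digest()
    return unpack_bits(bytes(out))


def adjacent_equal_fraction(bits):
    """Fraction of neighbouring bit pairs that match (0.5 means no correlation)."""
    same = sum(1 for a, b in zip(bits, bits[1:]) if a == b)
    return same / (len(bits) - 1)


raw = correlated_bits(65536)
white = whiten(raw)
print(f"raw:      {adjacent_equal_fraction(raw):.3f}")
print(f"whitened: {adjacent_equal_fraction(white):.3f}")
```

The raw stream shows neighbouring bits agreeing far more often than half the time, while the hashed output sits close to one half. Whitening hides the correlation; it does not create entropy, which is why the block is hashed down rather than passed through at full rate.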
