Navigating the restrictions of CI and Bazel

I’m working on a tool called ‘envoy’. It’s a proxy server that’s at the heart of Istio, and it builds with a tool called ‘Bazel’.

I’ve been having some battles with Bazel. First it ate my laptop (well, not literally, but it uses more resources than the laptop has: the laptop is 2C4T w/ 16GB, and Bazel wants most of that). Hmm. OK, worked around that w/ a nifty Ryzen 2700X. Done!

Moving on: of course I want this built in my CI system. I suspect Bazel is meant to be used by a team, as a server, on a central system w/ shared disks. But that’s not really the way the CI works.

OK, attempt one. We make a simple .gitlab-ci.yml file, and add a section like:

  build:
    stage: build
    script: |
      bazel --batch build ${BAZEL_BUILD_OPTIONS} -c opt //source/exe:envoy-static
    artifacts:
      paths:
        - bazel-bin/source/exe/envoy-static

So, this kind of works, but it’s got some issues:

  1. It’s very inefficient: Bazel builds a huge cache that we then discard
  2. The GitLab runners are my nodes in k8s, and they have 4C w/ 15GB of RAM. They struggle with this since most of the CPU and RAM is consumed by the Java-Hog-That-Is-Bazel

OK, let’s address the first. And this is where the restrictions of the CI and Bazel nearly left me a null set. GitLab CI has a concept called ‘cache’. So, let’s cache ~/.cache (where Bazel puts its stuff). Hmm, that doesn’t work: GitLab CI only allows caching from within the WORKSPACE. OK, so let’s change the Bazel output dir. Wait, that’s not configurable directly? We have two options:

  1. Change HOME
  2. Set the magic env var TEST_TMPDIR (used in Bazel’s own unit tests)

So, if we follow #1 we end up with other troubles (e.g. git fails for the dependencies since there is no ~/.gitconfig preconfigured, and so on). #2 fails since it isn’t really supported, and some things seem to depend on the cache being in ~/.cache.
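For reference, the GitLab cache stanza itself is simple; the catch is that its paths have to live under the checkout dir. Something like this (the key name is my own choice):

  cache:
    key: bazel
    paths:
      - .cache/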


So my project is checked out in /GROUP/REPO. Let’s try this:

  1. Set TEST_TMPDIR to ${CI_PROJECT_DIR}/../.cache
  2. In before_script, move the cache from ${CI_PROJECT_DIR}/.cache to ..
  3. In after_script, move the cache from ../.cache back to ${CI_PROJECT_DIR}
  before_script:
    - echo Start
    - mkdir -p .cache
    - ls .cache
    - touch .cache/foobar
    - find .cache | wc -l
    - mv .cache ..

  after_script:
    - echo End
    - mv ../.cache .
    - find .cache | wc -l
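Putting those pieces together, the whole job ends up looking roughly like this (a sketch; the cache key name is my own choice):

  build:
    stage: build
    variables:
      TEST_TMPDIR: ${CI_PROJECT_DIR}/../.cache
    cache:
      key: bazel
      paths:
        - .cache/
    before_script:
      - mkdir -p .cache
      - mv .cache ..
    script: |
      bazel --batch build ${BAZEL_BUILD_OPTIONS} -c opt //source/exe:envoy-static
    after_script:
      - mv ../.cache .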

OK, now we are cooking. But then we run into a couple more problems:

  1. GitLab CI uses a Kubernetes EmptyDir as the volume type for the code/build, and the Node doesn’t have enough disk for that
  2. The maximum object size on the S3-like cache (minio) is 500M due to the Ingress setup, so the cache upload fails on the POST

OK, so let’s address #1 along with our original #2 issue (the Node isn’t big enough). The way I want to do this is to use an auto-scaling pool in Azure. Ideally I would be able to launch containers without having virtual-machine nodes, and be charged only for what I use. That is research for another day. For now, we’ll use the ‘beast of the basement’: it has 36C72T w/ 256G of RAM, and a 4x1TB NVMe Ceph cluster. I should be able to use that instead of EmptyDir, right?

Well, no, not per se. That is hard-coded into the GitLab runner. It will mount other volumes for you during a job, but not where it checks out the repo. Grr. So I spent a bunch of time looking at its code to add a Ceph RBD option there; for now I will leave that.
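For context, here’s roughly what the runner’s Kubernetes executor does let you declare in its config.toml; extra volumes like this get mounted into the job pod, just not at the build path (the names and paths here are made up):

  [[runners.kubernetes.volumes.host_path]]
    name = "ceph-scratch"            # made-up name
    mount_path = "/mnt/scratch"      # where it appears in the job pod
    host_path = "/mnt/ceph/scratch"  # path on the node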

I’ve been using kube-spawn to make a local K8s cluster, so let me look at what it does for node size. Hmmm, /var/lib/kubelet is where EmptyDir lives, and that is inside the container of its node, so max ~1GB of space free.

So I modified kube-spawn and sent a PR for it to bind-mount the host. Now it has access to that Ceph cluster indirectly; good enough for now.

Now to address that 500M Ingress issue. We are using minio to provide S3-like access on top of GCS because GitLab CI didn’t work properly w/ the GCS-exposed S3 API, nor w/ its native API (different reasons for each). So I invoke minio like:

helm install --name minio --namespace minio \
  --set accessKey=blah,persistence.enabled=false stable/minio

I then created an Ingress and told it to find my LetsEncrypt cert:

cat << EOF | kubectl apply -f -
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: minio-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    # allow unlimited body size, so large cache uploads don't get a 413
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.org/client-max-body-size: "0"
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
  rules:
  - host:
    http:
      paths:
      - backend:
          serviceName: minio
          servicePort: 9000
        path: /
  tls:
  - hosts:
---
apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name:
spec:
  acme:
    config:
    - domains:
      http01:
        ingress: minio-ingress
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-prod
---
apiVersion: v1
kind: Secret
EOF
Now, the minio chart does have the ability to enable the Ingress, and to have a certificate, but I couldn’t see how to hook it to the LetsEncrypt issuer, so I did this.
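For the record, here’s roughly what I’d expect the chart values to look like if it could hook up (a sketch, untested; the host and secret names are made up, and kubernetes.io/tls-acme is the annotation cert-manager’s ingress-shim watches):

  ingress:
    enabled: true
    annotations:
      kubernetes.io/tls-acme: "true"
    hosts:
      - minio.example.com        # made-up host
    tls:
      - secretName: minio-tls    # made-up secret name
        hosts:
          - minio.example.com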

So now I have:

  1. A node large enough to build Envoy
  2. A cache large enough to hold/retrieve the build cache bits for the stages

And it works. I still haven’t addressed auto-scaling on-demand. I want that to be very low-latency to start (e.g. ~0-1s to get the job going), I want it to be a burst of a very large instance (ideally >= 16 cores and >= 64G RAM), and I want it to be per-second billed.

An option for me is to look at Google Cloud Builder, but I’m not sure how I would gasket it into the pipeline. One of the very nice things about GitLab CI is that it handles key creation for the duration of the job, so I can use my private registry and my git repo (e.g. clone a 2nd dependency, or push a container at the end).




