Starting to work w/ Azure. Go to create my first Kubernetes cluster. After 15 minutes of watching the sliding dots in the web portal, I give up. I try the CLI, same deal. This must be just me, right? Wrong.

This is a general observation about cloud tooling. Things are very slow because of all the API hand-offs and polling. Cloud scales horizontally (many small things working independently), not vertically. And those small things only scale if they don't talk to each other or the central API (much). This is Amdahl's Law at work: the part that has to be coordinated serially caps the speed-up, however many workers you add. Think more glacier flow than avalanche. The glacier carved out the Great Lakes, but it did it very slowly.
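To put rough numbers on that, here is a minimal sketch of Amdahl's Law in Python. The 5% serial fraction is an assumption picked purely for illustration, not something measured on Azure, and the function name is made up.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speed-up from `workers` parallel units when `serial_fraction`
    of the work cannot be parallelised (Amdahl's Law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Assumed for illustration: 5% of the job is central-API chatter and polling.
SERIAL_FRACTION = 0.05

for workers in (1, 10, 100, 1000):
    speedup = amdahl_speedup(SERIAL_FRACTION, workers)
    print(f"{workers:>5} workers -> {speedup:.2f}x")
```

With a 5% serial part the speed-up can never exceed 1/0.05 = 20x, no matter how many small things are thrown at the problem.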

Debugging a CI pipeline is another thing which is very slow. Make change, commit, push, wait, repeat.

This in turn means that as a developer I need a lot of unrelated small work items to switch to, inefficiently, rather than just stay on task and run. It took 28 minutes for my Azure AKS cluster to come alive. That is 26 minutes longer than it took on Google GKE or on my desktop (kube-spawn). That's an inconvenient amount of time. If it were a day, I would switch tasks entirely. If it were 1 minute, I would wait. 28 minutes is on that threshold where switching makes no sense and waiting makes no sense. Tempus fugit becomes tardius fluit becomes carpe diem.


A museum is where you go to see old technology, now retired. The steam museum, etc.

Let's get together and create an IPv4 museum. It will have 2^32 exhibits. There will be class-A halls, class-B halls, class-C halls. I'm not sure yet how to arrange the class-D hall, maybe it's everywhere and nowhere at the same time. We'll take our children there and point and say, "back in my day we talked about dotted-quad decimal" and they will look at us with the same crazy look we get when we wax lyrical about 8-track and LP.
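For the floor plan, a small sketch using Python's standard ipaddress module. The hall boundaries are the historical classful ranges; the HALLS table and the hall_for helper are names made up for this illustration, and a class-E hall is added because the reserved 240.0.0.0/4 range is needed for the exhibit count to reach 2^32.

```python
import ipaddress

# Historical classful ranges, one "hall" per class.
HALLS = {
    "A": ipaddress.ip_network("0.0.0.0/1"),
    "B": ipaddress.ip_network("128.0.0.0/2"),
    "C": ipaddress.ip_network("192.0.0.0/3"),
    "D": ipaddress.ip_network("224.0.0.0/4"),  # multicast
    "E": ipaddress.ip_network("240.0.0.0/4"),  # reserved
}


def hall_for(address: str) -> str:
    """Return the class letter of the hall that exhibits `address`."""
    ip = ipaddress.IPv4Address(address)  # raises on anything that is not IPv4
    for name, network in HALLS.items():
        if ip in network:
            return name
    raise ValueError(f"no hall for {address}")


total_exhibits = sum(net.num_addresses for net in HALLS.values())
print("exhibits:", total_exhibits, "== 2**32:", total_exhibits == 2**32)
print("192.168.1.1 is in hall", hall_for("192.168.1.1"))
```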

Hipsters will spend big $ to have a special-purpose IP (we never mention version) to IPv4 'NAT' so they can use archaic tech and look cool while doing it.

Sadly, we seem further than ever from making this museum of retired IP addresses. Even new technologies like Kubernetes have very poor IPv6 support. And I've spent the morning trying to figure out how to get ::1 bound to lo in a docker container. The related issue is 2 years old. My brand spanking new Google Kubernetes Engine (GKE) clusters have no v6 in sight.


So Azure has a 'serverless' kubelet concept. In a nutshell we follow the virtual-kubelet instructions (except they were missing az provider register --namespace 'Microsoft.ContainerInstance'; pull request sent).

What this does is schedule Pods (which carry special tolerations) to a farm of servers which are willing to accept 'foreign' containers (Pods). This means your Kubernetes master delegates work to a shared Node. What this means for security, well, let's chat about that another day. But what it means for horizontal scale is Good Things(tm). We can now share the pool of many servers rather than the small number of virtual machines we have pressed into service as Nodes in our own K8s cluster.

So, when I run this:

apiVersion: v1
kind: Pod
metadata:
  name: u
spec:
  containers:
  - image: ubuntu:18.04
    command: ['sh', '-c', 'sleep 3600']
    imagePullPolicy: Always
    name: u
    resources:
      requests:
        memory: 4G
        cpu: 2
  dnsPolicy: ClusterFirst
  tolerations:
  - key: virtual-kubelet.io/provider
    operator: Exists
  - key: azure.com/aci
    effect: NoSchedule

It runs 'somewhere', a container floating in the universe with no specific host to call home. Where it lands, nobody knows; who it's beside on that server, nobody knows.

Now, is this good enough for me with my Gitlab Runner CI?

In short, not really.

  1. Max size is 4 CPU, 14GB ram (https://docs.microsoft.com/en-us/azure/container-instances/container-instances-quotas#region-availability)
  2. It is not available in Canada. Closest is US-EAST. So this would be an issue for people needing Data Sovereignty.
  3. I would have to change Gitlab Runner to add these tolerations (indulgences?)

But, I think it's a step in the right direction. I wonder if we will see a Google 'raw container pool' engine, given that Azure has ACI and AWS has Fargate?

A little snooping on the remote container in dmesg. It's interesting: there is the full boot sequence, then a large gap, then a line about my interface coming up:

[ 72.832894] hv_balloon: Received INFO_TYPE_MAX_PAGE_CNT
[ 72.832960] hv_balloon: Data Size is 8
[104939.085094] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[104939.104792] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[104939.105115] device veth32524353 entered promiscuous mode
[104939.105197] cbr0: port 1(veth32524353) entered forwarding state

How is this achieved? We can see that the host is running Ubuntu 16.04, that it has an uptime of 1 day, 5:13 hours. So it is not a VM created just for me.
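As a sanity check, the dmesg timestamps are seconds since the host kernel booted, so the lines above can be turned into an uptime directly. A minimal sketch (the helper name is hypothetical, and it assumes the default dmesg timestamp format shown above):

```python
import re
from datetime import timedelta

# Matches the "[seconds.micros]" prefix that dmesg puts on each line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def seconds_since_boot(line: str) -> float:
    """Extract the kernel timestamp (seconds since boot) from a dmesg line."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError(f"no dmesg timestamp in: {line!r}")
    return float(match.group(1))


# Two lines copied from the dmesg output above.
last_boot_line = "[ 72.832960] hv_balloon: Data Size is 8"
link_up_line = "[104939.085094] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready"

link_up = seconds_since_boot(link_up_line)
gap = link_up - seconds_since_boot(last_boot_line)

print("host uptime when my interface appeared:", timedelta(seconds=int(link_up)))
print("quiet gap after boot, in hours:", round(gap / 3600, 1))
```

The interface came up at 1 day, 5:08:59 after boot, which fits with the reported uptime of 1 day, 5:13 a few minutes later.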

We can see that the host is itself running on Microsoft Hyper-V (dmesg rats it out).

We can see we are vulnerable to some Spectre + Meltdown madness:

Hardware check
* Hardware support (CPU microcode) for mitigation techniques
* Indirect Branch Restricted Speculation (IBRS)
* SPEC_CTRL MSR is available: UNKNOWN (is msr kernel module available?)
* CPU indicates IBRS capability: UNKNOWN (is cpuid kernel module available?)
* Indirect Branch Prediction Barrier (IBPB)
* PRED_CMD MSR is available: UNKNOWN (is msr kernel module available?)
* CPU indicates IBPB capability: UNKNOWN (is cpuid kernel module available?)
* Single Thread Indirect Branch Predictors (STIBP)
* SPEC_CTRL MSR is available: UNKNOWN (is msr kernel module available?)
* CPU indicates STIBP capability: UNKNOWN (is cpuid kernel module available?)
* Speculative Store Bypass Disable (SSBD)
* CPU indicates SSBD capability: NO
* L1 data cache invalidation
* FLUSH_CMD MSR is available: UNKNOWN (is msr kernel module available?)
* Enhanced IBRS (IBRS_ALL)
* CPU indicates ARCH_CAPABILITIES MSR availability: UNKNOWN (is cpuid kernel module available?)
* ARCH_CAPABILITIES MSR advertises IBRS_ALL capability: UNKNOWN
* CPU explicitly indicates not being vulnerable to Meltdown (RDCL_NO): UNKNOWN
* CPU explicitly indicates not being vulnerable to Variant 4 (SSB_NO): UNKNOWN
* Hypervisor indicates host CPU might be vulnerable to RSB underflow (RSBA): UNKNOWN
* CPU microcode is known to cause stability problems: NO (model 0x3f family 0x6 stepping 0x2 ucode 0xffffffff cpuid 0x0)
* CPU microcode is the latest known available version: UNKNOWN (latest microcode version for your CPU model is unknown)
* CPU vulnerability to the speculative execution attack variants
* Vulnerable to Variant 1: YES
* Vulnerable to Variant 2: YES
* Vulnerable to Variant 3: YES
* Vulnerable to Variant 3a: YES
* Vulnerable to Variant 4: YES
* Vulnerable to Variant l1tf: YES

CVE-2017-5753 [bounds check bypass] aka 'Spectre Variant 1'
* Mitigated according to the /sys interface: YES (Mitigation: __user pointer sanitization)
* Kernel has array_index_mask_nospec: UNKNOWN (couldn't check (couldn't find your kernel image in /boot, if you used netboot, this is normal))
* Kernel has the Red Hat/Ubuntu patch: UNKNOWN (missing 'strings' tool, please install it, usually it's in the binutils package)
* Kernel has mask_nospec64 (arm64): UNKNOWN (couldn't check (couldn't find your kernel image in /boot, if you used netboot, this is normal))
* Checking count of LFENCE instructions following a jump in kernel... UNKNOWN (couldn't check (couldn't find your kernel image in /boot, if you used netboot, this is normal))
> STATUS: NOT VULNERABLE (Mitigation: __user pointer sanitization)

CVE-2017-5715 [branch target injection] aka 'Spectre Variant 2'
* Mitigated according to the /sys interface: YES (Mitigation: Full generic retpoline)
* Mitigation 1
* Kernel is compiled with IBRS support: YES
* IBRS enabled and active: NO
* Kernel is compiled with IBPB support: YES
* IBPB enabled and active: NO
* Mitigation 2
* Kernel has branch predictor hardening (arm): NO
* Kernel compiled with retpoline option: UNKNOWN (couldn't read your kernel configuration)
> STATUS: VULNERABLE (IBRS+IBPB or retpoline+IBPB is needed to mitigate the vulnerability)

CVE-2017-5754 [rogue data cache load] aka 'Meltdown' aka 'Variant 3'
* Mitigated according to the /sys interface: YES (Mitigation: PTI)
* Kernel supports Page Table Isolation (PTI): NO
* PTI enabled and active: YES
* Reduced performance impact of PTI: YES (CPU supports INVPCID, performance impact of PTI will be greatly reduced)
* Running as a Xen PV DomU: NO
> STATUS: NOT VULNERABLE (Mitigation: PTI)

CVE-2018-3640 [rogue system register read] aka 'Variant 3a'
* CPU microcode mitigates the vulnerability: NO
> STATUS: VULNERABLE (an up-to-date CPU microcode is needed to mitigate this vulnerability)

CVE-2018-3639 [speculative store bypass] aka 'Variant 4'
* Mitigated according to the /sys interface: NO (Vulnerable)
* Kernel supports speculation store bypass: YES (found in /proc/self/status)
> STATUS: VULNERABLE (Your CPU doesn't support SSBD)

CVE-2018-3615/3620/3646 [L1 terminal fault] aka 'Foreshadow & Foreshadow-NG'
* Mitigated according to the /sys interface: YES (Mitigation: PTE Inversion)
> STATUS: NOT VULNERABLE (Mitigation: PTE Inversion)


Working on a tool called 'Envoy'. It's a proxy server that's at the heart of Istio. And it is built with a tool called 'Bazel'.

I've been having some battles with Bazel. First it ate my laptop (well, not literally, but it uses more resources than the laptop has). The laptop has 2C4T w/ 16GB, and Bazel wants most of that. Hmm. OK, worked around that w/ a nifty Ryzen 2700X, done!

Moving on, of course I want this built in my CI system. I suspect Bazel is meant to be used by a team, as a server, on a central system w/ shared disks. But that's not really the way that the CI works.

OK, attempt one. We make a simple .gitlab-ci.yml file. Add a section like:

build:
  stage: build
  artifacts:
    paths:
      - bazel-bin/source/exe/envoy-static
  script: |
    bazel --batch build ${BAZEL_BUILD_OPTIONS} -c opt //source/exe:envoy-static

So, this kind of works. It's got some issues:

  1. It's very inefficient, Bazel builds a huge cache that we discard
  2. The gitlab runners are my nodes in k8s, and they have 4C w/ 15GB of ram. They struggle with this since most of the CPU and RAM is consumed by the Java-Hog-That-Is-Bazel

OK, let's address the first. And this is where the restrictions of the CI and Bazel nearly left me a null set. Gitlab CI has a concept called 'cache'. So, let's cache ~/.cache (where Bazel puts its stuff). Hmm, that doesn't work, Gitlab CI only allows caching from within the WORKSPACE. OK, so let's change the Bazel output dir. Wait, that's not configurable directly? We have two options:

  1. Change HOME
  2. Set magic env var TEST_TMPDIR (used in Bazel's own unit tests)

So, if we follow #1 we end up with other troubles (e.g. git fails for the dependencies since there is no ~/.gitconfig preconfigured, etc). #2 fails since this isn't really supported and some things seem to depend on it being in ~/.cache.

Hmmm.

So my project is checked out in /GROUP/REPO. Let's try this:

  1. Set TEST_TMPDIR to ${CI_PROJECT_DIR}/../.cache
  2. In before_script, move the cache from ${CI_PROJECT_DIR}/.cache to ..
  3. In after_script, move the cache from ../.cache back to ${CI_PROJECT_DIR}

before_script:
  - echo Start
  - mkdir -p .cache
  - ls .cache
  - touch .cache/foobar
  - find .cache | wc -l
  - mv .cache ..

after_script:
  - echo End
  - mv ../.cache .
  - find .cache | wc -l

OK, now we are cooking. But then we run into a couple more problems:

  1. Gitlab CI uses Kubernetes EmptyDir as the volume type for the code/build, and the Node doesn't have enough disk for that
  2. The maximum object-size on the s3-like cache (minio) is 500M due to the Ingress setup, so the upload (POST) fails.

OK, so let's address #1 along with our original #2 issue (the size of the Node is not enough). Now the way I want to do this is to use an auto-scaling pool in Azure. Ideally I would be able to launch containers without having virtual-machine nodes, and they would just charge me for what I use. That is research for another day. For now, we'll use the 'beast of the basement', it has 36C72T w/ 256G of ram. It has a 4x1TB NVMe Ceph cluster, I should be able to use that instead of EmptyDir, right?

Well, no, not per se. That is hard-coded into Gitlab runner. It will mount other volumes for you during a job, but not where it checks out the repo. Grr. So I spent a bunch of time looking at its code to add a Ceph RBD option there, and for now I will leave that.

I've been using kube-spawn to make a local K8s cluster, so let me look at what it does for node-size. Hmmm, /var/lib/kubelet is where EmptyDir lives, and, that is inside the container of its node, so max ~1GB space free.

So I modified kube-spawn and sent a PR for it to bind-mount the host. Now it has access to that Ceph cluster indirectly, good enough for now.

Now address that 500M Ingress issue. We are using minio to make S3-like access on GCS because Gitlab CI didn't work properly w/ the GCS-exposed S3 API, nor its native API (different reasons for each). So I invoke Minio like:

helm install --name minio --namespace minio \
  --set accessKey=blah,secretKey=blahblah \
  --set defaultBucket.enabled=true,defaultBucket.name=MY-gitlab-runner-cache,defaultBucket.purge=true \
  --set persistence.enabled=false \
  stable/minio

I then created an Ingress and told it to find my Let's Encrypt cert:

cat << EOF | kubectl apply -f -
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.org/server-tokens: "false"
  name: minio-ingress
spec:
  rules:
  - host: s3.MYDOMAIN.com
    http:
      paths:
      - backend:
          serviceName: minio
          servicePort: 9000
        path: /
  tls:
  - hosts:
    - s3.MYDOMAIN.com
    secretName: s3.MYDOMAIN.com-tls
---
apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: s3.MYDOMAIN.com-tls
spec:
  acme:
    config:
    - domains:
      - s3.MYDOMAIN.com
      http01:
        ingress: minio-ingress
  commonName: s3.MYDOMAIN.com
  dnsNames:
  - s3.MYDOMAIN.com
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-prod
  secretName: s3.MYDOMAIN.com-tls

---
apiVersion: v1
kind: Secret
metadata:
  name: s3.MYDOMAIN.com-tls
EOF

Now, the minio chart does have the ability to enable Ingress, and to have a certificate, but I couldn't see how to hook it to the Let's Encrypt issuer, so I did this.

So now I have:

  1. A node large enough to build Envoy
  2. A cache large enough to hold/retrieve the build cache bits for the stages

And it works. I still haven't addressed auto-scaling on-demand. I want that to be very low-latency to start (e.g. ~0-1s to get the job going). I want that to be a burst of a very large instance (ideally >= 16 cores and >= 64G ram), and I want that to be per-second billed.

An option for me is to look at Google Cloud Builder, but I'm not sure how I would gasket it into the pipeline. One of the very nice things about the Gitlab CI is it handles the key creation for the duration of the job so I can use my private registry and my git repo (e.g. clone a 2nd dependency, or push a container at the end).


I'm using Gitlab, and one of the things they promote is Auto-Devops. In a nutshell, you use the Gitlab-CI as your means from start to finish, starting w/ an idea, through code, unit-test, address-space-tests, dynamic-tests, thread-tests, license-checks, lint, code-format, static scans, ... all the way until it lands on a running server somewhere for your customers to get their grubby virtual fingers on it.

And I gotta say, it works really well.

Enter Weave. They have a pattern called 'gitops'. It has 'git' in the name so it must be good, right? They also have some opinions on whether a CI tool is good for continuous deployment. In short: NO:

Your CI server is not an orchestration tool.  You need something that continually attempts to make progress (until there are no more diffs). CI fails when it encounters a difference.  In combination with a human operator, a CI server can be made to force convergence, but this creates other issues. For example your CI scripts might not be able to enforce an idempotent and/or atomic group of changes.  
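The "continually attempts to make progress (until there are no more diffs)" idea in that quote can be sketched as a reconcile loop. This is a toy illustration only, not Weave's implementation: the desired and actual states are plain dicts, and every function name here is hypothetical.

```python
def diff(desired: dict, actual: dict) -> dict:
    """Changes needed to turn `actual` into `desired` (None means delete)."""
    changes = {name: want for name, want in desired.items()
               if actual.get(name) != want}
    for name in actual.keys() - desired.keys():
        changes[name] = None
    return changes


def apply_change(actual: dict, name: str, want) -> None:
    """Stand-in for a real deploy step; mutates the in-memory state."""
    if want is None:
        actual.pop(name, None)
    else:
        actual[name] = want


def reconcile(desired: dict, actual: dict, apply=apply_change,
              max_rounds: int = 10) -> int:
    """Keep applying diffs until none remain; return the rounds used.

    A one-shot CI job would stop at the first failure. Here a failed
    change is simply retried on the next round.
    """
    for rounds_used in range(max_rounds + 1):
        pending = diff(desired, actual)
        if not pending:
            return rounds_used
        if rounds_used == max_rounds:
            break
        for name, want in pending.items():
            try:
                apply(actual, name, want)
            except RuntimeError:
                continue  # transient failure: leave it for the next round
    raise RuntimeError("state did not converge")


cluster = {"old-service": "v1"}
wanted = {"web": "v2"}
print("rounds:", reconcile(wanted, cluster), "state:", cluster)
```

The point of the design is that each round starts from a fresh diff, so a partial failure leaves work for the next round rather than a broken pipeline that a human has to re-run.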

They have coined the term 'CIOps' for the alternative, and they diagram it against their product (gitops).

[Diagrams: CIOps pipeline versus gitops pipeline]

They don't talk about gitlab-CI (which I think is stronger than the travis and circle ones they reference). It's much better integrated with Kubernetes. Also, gitlab does monitoring where the others don't. It also supports 'environments' (e.g. staging, dev, production).

So, gentle reader, any opinions on this?
