IPv4. It's rare when it's public, and annoying when it's private. So we try to conserve this precious resource. One of the things that makes it complex is Kubernetes namespaces. A Kubernetes Ingress can only point at services in its own namespace (you can't have a shared Ingress that has services in multiple namespaces). Or can you?

What if I told you you could install a single Ingress (and cert-manager etc) and then have a service in each namespace served by it? Would you rejoice over saving a few $100/mo on public IP rental in 'the cloud'?

Let's dig in. Imagine we have 3 namespaces with interesting services: 'foo', 'bar' and 'kube-system' (which has our dashboard).

Let's assume we have 'kibana' running in kube-system. We want to expose this to the 'public internet'. Likely we would also use oauth2 proxy here to sign in, but I'll ignore auth for now. We are going to use a new (synthetic) service which lives in the default namespace alongside our Ingress controller as 'glue'. It's kind of like a DNS CNAME.

First we install a single global ingress. Let's use helm:

helm install stable/nginx-ingress --name ingress --set controller.service.externalTrafficPolicy=Local --set rbac.create=true

Wait for the LoadBalancer to get a public IP, then register it in DNS. You can either use a wildcard (*.something.MYDOMAIN.CA) or register each service, your call. All will use the same IP.

(To avoid complicating this, I'll show cert-manager etc. at the end, but it's optional; we just need the next step with the Ingress + Service.)

Once we have installed the YAML below we can browse to https://kibana.MYDOMAIN.CA/ and we are there. Repeat for the other services. Done! We have a single public IP.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kibana
  annotations:
    kubernetes.io/ingress.class: nginx
    certmanager.k8s.io/cluster-issuer: letsencrypt
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    kubernetes.io/tls-acme: 'true'
    nginx.ingress.kubernetes.io/tls-acme: 'true'
spec:
  tls:
  - hosts:
    - kibana.MYDOMAIN.CA
    secretName: tls-secret-kibana
  rules:
  - host: kibana.MYDOMAIN.CA
    http:
      paths:
      - path: /
        backend:
          serviceName: kibana
          servicePort: 5601
---
kind: Service
apiVersion: v1
metadata:
  name: kibana
  namespace: default
spec:
  type: ExternalName
  externalName: kibana.kube-system.svc.cluster.local
  ports:
  - port: 5601
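
The same 'glue' Service repeats for every namespace you want to expose. As a minimal sketch (the helper name glue_service and the 'web' service on port 8080 in 'foo' are made up for illustration, not taken from a real cluster), a small shell function can print the ExternalName Service for any name/namespace/port:

```shell
#!/bin/sh
# Print an ExternalName 'glue' Service in the default namespace that
# points at service NAME in NAMESPACE. Names and ports are illustrative.
glue_service() {
  name="$1"; namespace="$2"; port="$3"
  cat <<EOF
kind: Service
apiVersion: v1
metadata:
  name: ${name}
  namespace: default
spec:
  type: ExternalName
  externalName: ${name}.${namespace}.svc.cluster.local
  ports:
  - port: ${port}
EOF
}

# Usage (hypothetical service 'web' on port 8080 in namespace 'foo'):
#   glue_service web foo 8080 | kubectl apply -f -
```

Each glue Service still needs its own Ingress (in the default namespace) naming it as the backend, exactly as in the kibana example.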

Now, to complete this, let's add SSL certificates. You don't need this (the above is all you need to expose the service), but why not do TLS at the same time? It's free!

helm install stable/cert-manager --namespace kube-system --set ingressShim.defaultIssuerName=letsencrypt --set ingressShim.defaultIssuerKind=ClusterIssuer --name cert

cat << EOF | kubectl -n kube-system apply -f -
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: don@agilicus.com
    privateKeySecretRef:
      name: letsencrypt
    http01: {}
EOF

The difference between Azure AKS and Google GKE is stark. GKE just worked. Single sign on, login, create cluster. It walked me through a couple of questions (how many nodes, what size of node). A minute or so later it was done.

Azure. Still working on it.

Attempt 1. Use the web interface. Now, it was a bit tricky to find. It's in the 'misc' case. Highlighted down the left are things like Cosmos DB and Virtual machines. But (see screenshot) no container services in sight. To find it, use 'all services' and then find it in the sea of unsorted links. OK, I go through the wizard. Set up the 'basics'. The default is 1 'Standard DS2 v2' node. Let's leave that default alone. Accept defaults, get to the last screen. After a second or so it pops up a pink error: "Validation failed: Required information is missing or not valid". No amount of trying would yield more info. No amount of changing cluster name or RBAC etc. would affect it. Huh, what field could be wrong?

So let's take a break and try the CLI. I use their example.

az group create --name az-canada-central --location canadacentral
az aks create --resource-group az-canada-central --name glr --node-count 1 --enable-addons monitoring --generate-ssh-keys

After chewing on this for about 5 minutes, I'm given:

Operation failed with status: 'Bad Request'. Details: The VM size of AgentPoolProfile:nodepool1 is not allowed in your subscription in location 'canadacentral'. The available VM sizes are Standard_A1,Standard_A1_v2,Standard_A2,Standard_A2_v2,Standard_A2m_v2,Standard_A3,Standard_A4,Standard_A4_v2,Standard_A4m_v2,Standard_A5,Standard_A6,Standard_A7,Standard_A8_v2,Standard_A8m_v2,Standard_F16s_v2,Standard_F2s_v2,Standard_F32s_v2,Standard_F4s_v2,Standard_F64s_v2,Standard_F72s_v2,Standard_F8s_v2,Standard_G1,Standard_G2,Standard_G3,Standard_G4,Standard_G5

After reading this for a bit, I realise the DS2 v2 default node is not present in the 'acceptable' node types. Oh. OK, well, I guess they must mean this for some reason.
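
Spotting a missing size in that wall of text by eye is error-prone. A small sketch of checking it mechanically: the helper size_allowed is hypothetical, the list is a shortened copy of the one in the error message, and 'Standard_DS2_v2' is assumed to be the SKU name behind the 'Standard DS2 v2' default.

```shell
#!/bin/sh
# Return success if SIZE appears in a comma-separated list of VM sizes,
# such as the list quoted in the error message.
size_allowed() {
  size="$1"; list="$2"
  case ",${list}," in
    *",${size},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# A shortened copy of the list from the error message.
allowed="Standard_A1,Standard_A1_v2,Standard_A2,Standard_F2s_v2,Standard_G1"

for candidate in Standard_DS2_v2 Standard_A1_v2; do
  if size_allowed "$candidate" "$allowed"; then
    echo "$candidate: allowed"
  else
    echo "$candidate: not allowed"
  fi
done
```

Wrapping the list in commas before matching avoids false hits such as 'Standard_A1' matching inside 'Standard_A1_v2'.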

So, let's try the web interface with a Standard_A1_v2. Well, something is happening. Dots are wandering across the top bar in vaguely progress-esque fashion. After some number of minutes I become suspicious of the progress and decide to hedge my bet with the CLI:

az aks create --node-vm-size Standard_A1_v2 --resource-group az-canada-central --name glr --node-count 1 --enable-addons monitoring --generate-ssh-keys
- Running ..

OK, it's been sitting there for 15 minutes now. No word on the 'running'. Glance at web browser. Still doing the dot-dance, so progress is slow.

The 'nodepool' did get created, I can see the virtual machine that is the node running.

So I sit, a bit stumped. If the easy things are this hard, how hard will the hard things be?


Today I 'released' endoscope. This is a tool that solves a couple of 'simple' problems:

  1. I have a running container in Kubernetes. I wish I could have a shell inside it that is root, but also with a bunch of tools like gdb or ptrace. My container doesn't allow root or ptrace. I don't want to rebuild a debug version of it and create a new Pod
  2. I want to ping/create network traffic as if it originated from a specific pod
  3. I want to capture network traffic from/to a specific pod

If you have those problems, well, this is for you!

Let's look at an example:

scope -n NAMESPACE -p POD strace [-p #] [-e expr]

What sorcery is this? You mean from my current host I can run strace on a remote application in a container without knowing the node or ssh or anything? Yes! Simply run with the namespace/pod info (and -p # if there is more than one pid in the container; the default is the first), and optionally e.g. -e file to filter. You can use 'scope pids' to show the pids if you want (the first one is not always the right one for more complex containers).

Current commands include gdb, ping, shell, strace, hping. If you use 'shell', you are in the network + pid namespace of the debuggee (check ifconfig if you don't believe me!).

Work in progress is to allow you to, from the Wireshark GUI, simply select a pod and capture/filter its traffic in real time. Pull requests welcome 🙂


In Kubernetes v1.11 you can resize persistent volume claims. Great!

Sadly, Google has not rolled this out to us great unwashed yet (it's available to early adopters, or to everyone on alpha clusters); we are on v1.10.

Side note: Docker registry. One of the most commonly asked questions is: how do I delete or clean up? tl;dr: you can't. It's the Hotel California. Get over it. All those bash/php/ruby/... scripts people have written to try and work around this? Don't spend your life trying to make them work.

Double sadly, today was the day the container registry hit super-critical. So, once more into the breach: we can't wait for v1.11.

So, what do I need to do? I want to:

  1. tar the /registry somewhere
  2. stop / delete the pod
  3. delete the pvc
  4. create a new larger pvc
  5. restart / reschedule the pod

But I ran into an issue on #2. It's part of a larger helm deployment. I suppose I could take down the whole deployment and let it recreate later in 5. But why should I?

Instead what I did was 'kubectl edit <deployment>' and set the replicas to 0. This caused the pods to all exit, leaving the pvc unclaimed. Now I can delete the pvc, create a new one, and then 'kubectl edit ...' again and set the replicas back. Easy peasy.

# Please edit the object below.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
 ...
  uid: 22125323-77c8-11e8-9758-42010aa200b4
spec:
  progressDeadlineSeconds: 600
  replicas: 0
  revisionHistoryLimit: 10
  selector:
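
The same sequence can be scripted with 'kubectl scale' instead of hand-editing the replica count. This is only a sketch under assumptions: the namespace, deployment, pvc and file names are placeholders, the original replica count is assumed to be 1, and the backup in step 1 is left out. It prints the commands by default rather than running them.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of running it.
# Set DRY_RUN=0 to actually execute against the current kubectl context.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Arguments: namespace, deployment, pvc name, YAML file for the new pvc.
resize_pvc() {
  ns="$1"; deploy="$2"; pvc="$3"; new_pvc_yaml="$4"
  run kubectl -n "$ns" scale deployment "$deploy" --replicas=0  # pods exit, pvc unclaimed
  run kubectl -n "$ns" delete pvc "$pvc"
  run kubectl -n "$ns" apply -f "$new_pvc_yaml"                 # same name, larger size
  run kubectl -n "$ns" scale deployment "$deploy" --replicas=1
}

# Placeholder names, for illustration only.
resize_pvc gitlab registry registry-data bigger-pvc.yaml
```

Deleting the pvc may delete the underlying data depending on the reclaim policy, so take the tar copy first.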

I earlier wrote about steal time, the concept of "my image wants to use time but it's not available due to some unknown noisy neighbour stealing it". In a nutshell, you have a server with X resources. You then 'sell' 10X to your users, making it 10:1 oversubscribed. The Internet industry of late has been making noises about regulating subscription through e.g. sunshine/transparency, or truth in advertising. There has also been a ton of research done around queuing etc. in an oversubscribed network.

The way I normally explain this to the lay-person is: my house has 5 toilets. Each has a 3" outlet valve. But my house has a 4" sewer line to the city. Clearly this is not designed to flush all 5 at the same instant.

The reason resources are oversubscribed is cost. You actually want to have a high peak and a low average for most things (consider how wide the road would be if it had a guaranteed lane for each person).
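
To put a rough number on the toilet analogy: cross-sectional area scales with the square of the diameter, so five 3" outlets into one 4" line works out to a bit under 3:1. The sketch below only compares areas (it ignores slope, venting and real flow rates), and the function name oversub_ratio is made up for illustration.

```shell
#!/bin/sh
# Oversubscription ratio of N identical pipes of diameter d feeding one
# shared pipe of diameter D. Area is proportional to diameter squared,
# so the pi/4 factors cancel out of the ratio.
oversub_ratio() {
  awk -v n="$1" -v d="$2" -v D="$3" \
    'BEGIN { printf "%.2f\n", (n * d * d) / (D * D) }'
}

oversub_ratio 5 3 4   # five 3" outlets into one 4" sewer line; prints 2.81
```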

So today I'm watching my CI pipeline. I tested out the change locally and did a push. And it's been more than 1 hour in 'the cloud' for something that took only about 5 minutes locally. How can this be? I thought the cloud was fast.

First, let's compare. In the cloud I have 3 x 4 VCPU/7.5GB of RAM. On the desktop I have 1 x 8C/16T/32GB of RAM. So I guess the desktop is bigger than the cloud.

Second, let's compare how much we get. That single machine the job runs on (4 VCPU/7.5GB RAM) is running Kubernetes, and some other pods. But it's also carved out of a bigger machine that other people share (the 'noisy neighbours').

I first started to dig into my old favourite, the 'steal time'. But it was showing 0.0. Hmm, not what I expected. This suggests it is more likely a cgroup limit or an IO limit.
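
For reference, the steal figure can also be read straight from /proc/stat on a Linux node. A minimal sketch follows; the function name steal_percent is hypothetical. Note that these counters are cumulative since boot, so this gives an overall average rather than the per-interval figure that top shows.

```shell
#!/bin/sh
# Read /proc/stat-formatted text on stdin and print steal time as a
# percentage of all CPU time on the aggregate "cpu" line.
# Fields after the label: user nice system idle iowait irq softirq steal ...
steal_percent() {
  awk '$1 == "cpu" {
    total = 0
    # Sum user..steal; guest time is already accounted for in user.
    for (i = 2; i <= 9; i++) total += $i
    if (total > 0) printf "%.1f\n", 100 * $9 / total
  }'
}

# On a Linux node:
#   steal_percent < /proc/stat
```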

I then tried 'pv /dev/zero > foo' on each of the Kubernetes nodes. This gives me a rough idea of their disk performance. 1 of the nodes is ~100MiB/s, another is ~130MiB/s, and one is ~190MiB/s. So yes, we have some noisy neighbours.

In comparison, the humble desktop is showing 1.9GiB/s. So somewhat more than 10x faster. I guess the cloud SSD is not as good as my single-stick NVMe. Hmm. But I don't think this 10x disk speed accounts for my issue.

I snoop around @ TRIM. I see that the nodes do not have 'discard' on the mount, so I run 'fstrim -av'. I see that one of the nodes has 74GiB of untrimmed data. Hmm, maybe that helps? OK, it did a bit: the 'slow' node caught up to the others (and was the one w/ the most untrimmed data).
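
Checking for the 'discard' mount option can be done by reading /proc/mounts. A small sketch, with a hypothetical helper name, that lists mount points of /dev/... devices lacking the option:

```shell
#!/bin/sh
# Read /proc/mounts-formatted text on stdin and list mount points of
# block devices (/dev/...) that are NOT mounted with 'discard'.
# Fields: device mountpoint fstype options dump pass
mounts_without_discard() {
  awk '$1 ~ /^\/dev\// {
    n = split($4, opts, ",")
    found = 0
    for (i = 1; i <= n; i++) if (opts[i] == "discard") found = 1
    if (!found) print $2
  }'
}

# On a node:
#   mounts_without_discard < /proc/mounts
```

Anything it lists is a candidate for a periodic 'fstrim' rather than relying on online discard.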

Want to understand why you need to run TRIM on an SSD? Well, here's a starter. But, in a nutshell, flash can only be block-erased, and TRIM takes 'garbage' pages that are no longer in use and erases them before you need them again.

OK, maybe this is a resource limit? Kubernetes has some resource management that applies to the cgroup of each container. First let's check the namespace level:

don@cube:~/src-ag/corp-tools/k8s-gitlab$ kubectl describe namespace gitlab-runner
Name: gitlab-runner
Labels: <none>
Annotations: <none>
Status: Active

No resource quota.
No resource limits.

OK, that was not it, no limit.

Now, let's do a kubectl describe on the pod and look at the limits:

...
Requests:
  cpu: 100m
  memory: 128Mi
 ...
QoS Class:       Burstable

We have a request amount, but no limit. We are also 'burstable'. If I read this correctly it means we get at least 100 millicores and 128MiB of memory, and we get whatever is left over from the other pods.
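
How the QoS class falls out of requests and limits can be sketched as below. This is a simplification under stated assumptions: it handles a single-container pod, compares values as plain strings (so '1' and '1000m' would not count as equal), and the function name qos_class is made up. The real value can be read with kubectl get pod POD -o jsonpath='{.status.qosClass}'.

```shell
#!/bin/sh
# Simplified QoS classification for a single-container pod.
# Arguments: cpu request, memory request, cpu limit, memory limit
# (pass "" for anything unset).
qos_class() {
  req_cpu="$1"; req_mem="$2"; lim_cpu="$3"; lim_mem="$4"
  # A limit with no explicit request implies request == limit.
  if [ -z "$req_cpu" ]; then req_cpu="$lim_cpu"; fi
  if [ -z "$req_mem" ]; then req_mem="$lim_mem"; fi
  if [ -z "$req_cpu$req_mem$lim_cpu$lim_mem" ]; then
    echo BestEffort          # nothing requested, nothing limited
  elif [ -n "$lim_cpu" ] && [ -n "$lim_mem" ] &&
       [ "$req_cpu" = "$lim_cpu" ] && [ "$req_mem" = "$lim_mem" ]; then
    echo Guaranteed          # both resources limited, requests == limits
  else
    echo Burstable           # everything in between
  fi
}

qos_class 100m 128Mi "" ""   # requests only, as in the pod above; prints Burstable
```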

So, no insight there.

So, I'm a bit stumped. I looked @ top, all have reasonable idle and memory. I looked @ IO, and all have reasonable IO. I looked @ resource limits, and it's not that. I looked @ steal time, I'm not being oversold.

So, any votes on why my CI is slow? Comments?
