The difference between Azure AKS and Google GKE is stark. GKE just worked. Single sign on, login, create cluster. It walked me through a couple of questions (how many nodes, what size of node). A minute or so later it was done.

Azure. Still working on it.

Attempt 1. Use the web interface. Now, it was a bit tricky to find. It's in the 'misc' category: highlighted down the left are things like Cosmos DB and Virtual machines, but (see screenshot) no container services in sight. To find it, use 'All services' and then hunt through the sea of unsorted links. OK, I go through the wizard. Set up the 'basics'. The default is 1 'Standard DS2 v2' node. Let's leave that default alone. Accept defaults, get to the last screen. After a second or so it pops up a pink error: "Validation failed: Required information is missing or not valid". No amount of trying would yield more info. No amount of changing the cluster name, RBAC, etc. made any difference. Huh, what field could be wrong?

So let's take a break and try the CLI. I use their example.

az group create --name az-canada-central --location canadacentral
az aks create --resource-group az-canada-central --name glr --node-count 1 --enable-addons monitoring --generate-ssh-keys

After chewing on this for about 5 minutes, I'm given:

Operation failed with status: 'Bad Request'. Details: The VM size of AgentPoolProfile:nodepool1 is not allowed in your subscription in location 'canadacentral'. The available VM sizes are Standard_A1,Standard_A1_v2,Standard_A2,Standard_A2_v2,Standard_A2m_v2,Standard_A3,Standard_A4,Standard_A4_v2,Standard_A4m_v2,Standard_A5,Standard_A6,Standard_A7,Standard_A8_v2,Standard_A8m_v2,Standard_F16s_v2,Standard_F2s_v2,Standard_F32s_v2,Standard_F4s_v2,Standard_F64s_v2,Standard_F72s_v2,Standard_F8s_v2,Standard_G1,Standard_G2,Standard_G3,Standard_G4,Standard_G5

After reading this for a bit, I realise the DS2 v2 default node is not present in the 'acceptable' node types. Oh. OK, well, I guess they must mean this for some reason.
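With hindsight, one way to see which VM sizes your subscription can actually use in a given region is to ask the CLI directly, rather than waiting for a failed create to tell you:

az vm list-skus --location canadacentral --output table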

So, let's try the web interface with a Standard_A1_v2. Well, something is happening. Dots are wandering across the top bar in vaguely progress-esque fashion. After some number of minutes I become suspicious of the progress and decide to hedge my bets with the CLI:

az aks create --node-vm-size Standard_A1_v2 --resource-group az-canada-central --name glr --node-count 1 --enable-addons monitoring --generate-ssh-keys
- Running ..

OK, it's been sitting there for 15 minutes now. No word on the 'running'. Glance at the web browser. Still doing the dot-dance, so progress is slow.

The 'nodepool' did get created; I can see the virtual machine that is the node running.
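For what it's worth, the CLI can also report what state it thinks the cluster is in (same resource group and cluster name as above):

az aks show --resource-group az-canada-central --name glr --query provisioningState --output tsv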

So I sit, a bit stumped. If the easy things are this hard, how hard will the hard things be?


Today I 'released' endoscope. This is a tool that solves a couple of 'simple' problems:

  1. I have a running container in Kubernetes. I wish I could have a shell inside it that is root, but also with a bunch of tools like gdb or ptrace. My container doesn't allow root or ptrace. I don't want to rebuild a debug version of it and create a new Pod.
  2. I want to ping/create network traffic as if it originated from a specific pod
  3. I want to capture network traffic from/to a specific pod

If you have those problems, well, this is for you!

Let's look at an example:

scope -n NAMESPACE -p POD strace [-p #] [-e expr]

What sorcery is this? You mean from my current host I can run strace on a remote application in a container without knowing the node or ssh'ing in or anything? Yes! Simply run with the namespace/pod info (and -p # if there is more than one pid in the container; the default is the first), and optionally e.g. -e file to filter. You can use 'scope pids' to show the pids if you want (the first one is not always the right one for more complex containers).

Current commands include gdb, ping, shell, strace, hping. If you use 'shell', you are in the network + pid namespace of the debuggee (check ifconfig if you don't believe me!).
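A couple of concrete invocations, following the same pattern as the strace example above (the namespace, pod name, and ping target here are hypothetical, so adjust to taste):

scope -n default -p my-app shell
scope -n default -p my-app strace -e file
scope -n default -p my-app ping 8.8.8.8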

Work in progress is to allow you to, from the Wireshark GUI, simply select a pod and capture/filter its traffic in real time. Pull requests welcome 🙂


In Kubernetes v1.11 you can resize persistent volume claims. Great!

Sadly, Google has not rolled this out to us great unwashed yet (it's available to early adopters or for everyone on alpha clusters); we are on v1.10.

Side note: Docker registry. One of the most commonly asked questions is: how do I delete or clean up? tl;dr: you can't. It's the Hotel California. Get over it. All those bash/php/ruby/... scripts people have written to try and work around this? Don't spend your life trying to make them work.

Double sadly, today was the day the container registry hit super-critical. So, once more into the breach: we can't wait for v1.11.

So, what do I need to do? I want to:

  1. tar the /registry somewhere
  2. stop / delete the pod
  3. delete the pvc
  4. create a new larger pvc
  5. restart / reschedule the pod

But I ran into an issue on #2. It's part of a larger Helm deployment. I suppose I could take down the whole deployment and let it recreate later in step 5. But why should I?

Instead what I did is 'kubectl edit <deployment>' and set the replicas to 0. This caused the pods to all exit, making the PVC unclaimed. Now I can delete the PVC, create a new one, and then 'kubectl edit ...' again and set the replicas back. Easy peasy. The relevant bit of the edit looks like this:

# Please edit the object below.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
 ...
  uid: 22125323-77c8-11e8-9758-42010aa200b4
spec:
  progressDeadlineSeconds: 600
  replicas: 0
  revisionHistoryLimit: 10
  selector:
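If you'd rather not open an editor, kubectl can flip the replica count non-interactively too (the deployment name here is hypothetical):

kubectl scale deployment docker-registry --replicas=0
# ... delete and recreate the PVC ...
kubectl scale deployment docker-registry --replicas=1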

I earlier wrote about steal time, the concept of "my image wants to use time but it's not available due to some unknown noisy neighbour stealing it". In a nutshell, you have a server with X resources. You then 'sell' 10X to your users, making it 10:1 oversubscribed. The Internet industry of late has been making noises about regulating oversubscription through e.g. sunshine/transparency, or truth in advertising. There has also been a ton of research done around queuing etc. in oversubscribed networks.

The way I normally explain this to the lay-person is: my house has 5 toilets. Each has a 3" outlet valve. But my house has a 4" sewer line to the city. Clearly this is not designed to flush all 5 at the same instant.

The reason resources are oversubscribed is cost. You actually want a high peak and a low average for most things (consider how wide the road would have to be if it had a guaranteed lane for every driver).

So today I'm watching my CI pipeline. I tested the change locally and did a push. And it's been more than an hour in 'the cloud' for something that took only about 5 minutes locally. How can this be? I thought the cloud was fast.

First, let's compare sizes. In the cloud I have 3 x 4 vCPU/7.5GB of RAM. On the desktop I have 1 x 8C/16T/32GB of RAM. So I guess the desktop is bigger than the cloud.

Second, let's compare how much of that we actually get. That single machine the job runs on (4 vCPU/7.5GB RAM) is running Kubernetes and some other pods. But it's also carved out of a bigger physical machine that other people share (the 'noisy neighbours').

I first started to dig into my old favourite, 'steal time'. But it was showing 0.0. Hmm, not what I expected. So this means it's a cgroup limit or an IO limit.
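For reference, steal time shows up as the 'st' column at the end of vmstat's output (and as the %st field in top's CPU summary line):

vmstat 1 5   # last column, 'st': CPU time stolen by the hypervisor for other guests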

I then tried 'pv /dev/zero > foo' on each of the Kubernetes nodes. This gives me a rough idea of their disk write performance. One of the nodes is ~100MiB/s, another is ~130MiB/s, and one is ~190MiB/s. So yes, we have some noisy neighbours.

In comparison, the humble desktop is showing 1.9GiB/s, so somewhat more than 10x faster. I guess the cloud SSD is not as good as my single-stick NVMe. Hmm. But I don't think this 10x disk speed accounts for my issue.

I snoop around @ TRIM. I see that the nodes do not have 'discard' on the mount, so I run 'fstrim -av'. I see that one of the nodes has 74GiB of untrimmed data. Hmm, maybe that helps? OK, it did a bit: the 'slow' node caught up to the others (and it was the one w/ the most untrimmed data).

Want to understand why you need to run TRIM on an SSD? Well, here's a starter. But in a nutshell, flash can only be erased in blocks, and TRIM takes 'garbage' pages that are no longer in use and erases them before you need them again.
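If you want to check whether your devices support discard at all (and then trim them), something like this works; non-zero DISC-GRAN/DISC-MAX means the device supports it:

lsblk --discard
sudo fstrim -av   # trim every mounted filesystem that supports it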

OK, maybe this is a resource limit? Kubernetes has some resource management that applies to the cgroup of each container. First, let's check the namespace level:

don@cube:~/src-ag/corp-tools/k8s-gitlab$ kubectl describe namespace gitlab-runner
Name: gitlab-runner
Labels: <none>
Annotations: <none>
Status: Active

No resource quota.
No resource limits.

OK, that was not it, no limit.

Now, let's do a kubectl describe on the pod and look at the limits:

...
Requests:
  cpu: 100m
  memory: 128Mi
 ...
QoS Class:       Burstable

We have a request amount but no limit, and we are 'Burstable'. If I read this correctly it means we get at least 100 millicores and 128MiB of memory, plus whatever is left over from the other pods.
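For comparison, a pod that did have a hard ceiling would carry a limits stanza alongside requests in its container spec; a minimal sketch (the numbers here are made up):

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi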

So, no insight there.

So, I'm a bit stumped. I looked @ top, all have reasonable idle and memory. I looked @ IO, and all have reasonable IO. I looked @ resource limits, and it's not that. I looked @ steal time, and I'm not being oversold.

So, any votes on why my CI is slow? Comments?


Ceph has long been a favourite technology of mine. It's a storage mechanism that just scales out forever. Gone are the days of RAID and complex sizing/setup. Chuck all your disks into whatever number of servers, and let Ceph take care of it. Want more read speed? Let it have more read replicas. Want a filesystem that is consistent on many hosts? Use CephFS. Want your OpenStack Nova/Glance/Cinder to play nice, work well, and have tons of space? Use Ceph.

TL;DR: want to save a lot of money in an organisation, use Ceph.

Why do you want these things? Cost and scalability. Ceph can dramatically lower the cost in your organisation vs running a big NAS or SAN, and do it with higher performance and better onward scalability. Don't believe me? Check YouTube.

My Ceph system at home is wicked fast, but not that big. It's 3 x 1TB NVMe. We talked about this earlier, and you may recall the beast-of-the-basement and its long NVMe challenges. It's been faithfully serving my OpenStack system for a while, so why not the Kubernetes one?

NVMe is not expensive anymore. I bought 3 of these, $200 each for 1TB. But, and this is really trick-mode, it has built-in capacitor-backed 'hard power down'. So you don't have to have a battery-backed RAID. If your server shuts down dirty, the blocks in the drive's cache still get flushed to flash, meaning you can run without hard-sync. Performance is much higher.

OK, first we digress. Kubernetes has this concept of a 'provisioner', sort of like Cinder. Now, there are 3 main ways I could have gone:

  1. We use 'magnum' on OpenStack; it creates Kubernetes clusters, which in turn have access to Ceph automatically.
  2. We use OpenStack Cinder as the PVC backend for Kubernetes.
  3. We use the Ceph rbd-provisioner in Kubernetes.

I tried #1 and it worked OK. I have not tried #2. This post is about #3. Want to see? Let's dig in. Pull your parachute now if you don't want to be blinded by YAML.

cat <<EOF | kubectl create -n kube-system -f -
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rbd-provisioner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["services"]
    resourceNames: ["coredns", "kube-dns"]
    verbs: ["list", "get"]
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rbd-provisioner
subjects:
  - kind: ServiceAccount
    name: rbd-provisioner
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: rbd-provisioner
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: rbd-provisioner
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rbd-provisioner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rbd-provisioner
subjects:
- kind: ServiceAccount
  name: rbd-provisioner
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rbd-provisioner
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: rbd-provisioner
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: rbd-provisioner
    spec:
      containers:
      - name: rbd-provisioner
        image: "quay.io/external_storage/rbd-provisioner:latest"
        env:
        - name: PROVISIONER_NAME
          value: ceph.com/rbd
      serviceAccount: rbd-provisioner
EOF
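Before wiring up the secrets, it's worth a quick check that the provisioner pod actually came up (using the app label from the Deployment above):

kubectl -n kube-system get pods -l app=rbd-provisioner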

kubectl create secret generic ceph-secret --type="kubernetes.io/rbd" --from-literal=key=$(sudo ceph --cluster ceph auth get-key client.admin) --namespace=kube-system

sudo ceph --cluster ceph osd pool create kube 128
sudo ceph osd pool application enable kube rbd
sudo ceph --cluster ceph auth get-or-create client.kube mon 'allow r' osd 'allow rwx pool=kube'
sudo ceph --cluster ceph auth get-key client.kube

kubectl create secret generic ceph-secret-kube --type="kubernetes.io/rbd" --from-literal=key=$(sudo ceph --cluster ceph auth get-key client.kube) --namespace kube-system 

Now we need to create the StorageClass. We need the **NAME** of 1 or more of the mons (you don't need all of them); replace MONHOST1 w/ your **NAME**. Note: if you don't have a name for your monhost and want to use an IP, you can create an ExternalName service w/ xip.io:

kind: Service
apiVersion: v1
metadata:
  name: monhost1
  namespace: default
spec:
  type: ExternalName
  externalName: 1.2.3.4.xip.io

and you would then use monhost1.default.svc.cluster.local as the name below.

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd
provisioner: ceph.com/rbd
parameters:
  monitors: MONHOST1:6789, MONHOST2:6789, ...
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-secret-kube
  userSecretNamespace: kube-system
  imageFormat: "2"
  imageFeatures: layering
EOF

Now we are done, let's test:

cat <<EOF | kubectl create -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: rbdclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: rbd
EOF
kubectl get pvc -w rbdclaim
kubectl describe pvc rbdclaim
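Once the claim shows as Bound, any pod can mount it like a normal volume; a minimal sketch (the pod name, image, and mount path here are made up):

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: rbd-test
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: rbdclaim
EOF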