This is not much of a book review. Head over to http://nap.edu/12050 if you want to read this yourself 🙂

If you don’t want to read it, tl;dr: IT insecurity exists in many devices, and some of them control the fate of countries. Earlier I wrote about some of the SCADA problems as found through everyone’s favourite tool Shodan.

There is a concept called ‘critical infrastructure’. Things we need to run our lives. Electricity is one of them. Without it, people die, quickly. Hospitals, food safety, transportation, etc.

And the power grid is the original distributed network. Many companies all connect to it, and are collectively regulated (in North America, under NERC). In 2003 we saw a large-scale Eastern Seaboard outage. The cause? A software bug in an alarm system at FirstEnergy, one of the many interconnected companies. As power is used, the overhead cables heat and droop, and load is rebalanced to keep them from sagging too far into trees; with the alarm system silently down, that rebalancing didn’t happen in time.

Now imagine: is this the only software bug? What if others can be exploited, perhaps remotely? That is what this book is about. Physical security (you and a pair of bolt-cutters can get into many power-grid areas). Smart meters, which use wireless IP connectivity. Smart grids that will decide when my car charges and, soon, let me drive power back into the grid. Solar and other local-generation balancing.

It’s all a set of software, systems, and people. And any or all of those things can be imperfect.

The 2003 outage had an estimated ‘social cost’ of $4B – $10B. ($8.6B according to this IEEE presentation).

Now imagine the amount of time you spend securing your home widgets. Upgrading them. You have done that this week, right? Now imagine if you had to climb poles or drive a truck to a remote site to do it. Hmm.

Anyway, read the book, it’s interesting.

My first experiences with NFS (Network File System) started in 1989. My first term at university, a set of VAX machines running BSD Unix, some VT220 terminals, and ‘rn’.

My first understanding of NFS came a few years later. ClearCase. I was working at HP; the year was 1992. Most of us on the team shared a 68040-based HP server (an HP/Apollo 9000/380), but we were very excited because the new PA-RISC machines (an HP 9000/720) were about ready to use, promising much higher speeds. We were using a derivative of RCS for revision control (as was everyone at the time).

Our HP division was convinced (decided?) to try some new software that had its genesis in HP buying Apollo in 1989. A new company (Atria) had formed, and, well, ClearCase was born out of something that had been internal (DSEE). And it was based on distributed computing principles. The most novel thing about it was called ‘MVFS’ (multi-version file system). This was really unique: it was a database plus a remote file access protocol. You created a ‘config spec’, and, when you did an ‘ls’ in a directory, it would compute which versions you should see and send those to you. It was amazing.

This amazement lasted about an hour. And then we learned what a ‘hard mount’ in NFS was. You see, the original framers of Unix had decided that blocking IO would, well, block. If you wrote something to disk, you would wait until the disk was finished writing it. When you made this a network filesystem, it meant that if the network were down, you would wait for it to come back up.
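
For the curious, the knob in question looks like this on a modern Linux box (a sketch, not the 1992 syntax, but the idea is the same): ‘hard’ (the default) retries forever and the process blocks in the kernel; ‘soft’ gives up after a few retries and returns an I/O error instead.

# hard (the default): I/O blocks until the server answers, possibly forever
mount -t nfs -o hard nfs-server:/exports /mnt/data

# soft: give up after 'retrans' retries of 'timeo' tenths of a second and return an error
mount -t nfs -o soft,timeo=30,retrans=3 nfs-server:/exports /mnt/data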

Enter the next ‘feature’ of our network architecture of the time: everything mounted everything. This was highly convenient but it meant that if any one machine in that building went down, or became unavailable, *all* of them would lock up waiting.

And this gave rise to a lot of pacing the halls and discussion over donuts of “when will ClearCase be back?”

Side note: HP was the original tech startup. And one of its original traditions was the donut break. Every day at 10, all work would stop, all staff would congregate in the caf, a donut and a coffee would be had, and you would chat with people from other groups. It was how information moved. Don’t believe me? Check it out.

OK, back to this. Over the years, ClearCase and computing reliability/speed largely kept pace with each other. Bigger software, bigger machines; it stayed something that was slow and not perfectly reliable, but not a squeaky enough wheel to fix. As we started the next company, we kept ClearCase, and then on into the next. So many years of my life involved with that early decision.

But what does this have to do with today, you ask? Well, earlier I posted about ReadWriteOnce problems and backup, and my solution. But today I ran into another issue. You see, when I deployed gitlab, two of the containers (registry + gitlab) shared a volume for purposes other than backup. And it bit me. Something squawked in the system, it got rescheduled, and then refused to run.

OK, no problem, this is Unix, I got this. So I decided that, well, the lesser of all evils would be NFS. You see, in Google Kubernetes Engine (GKE) there is no ReadWriteMany option. So I decided to make a ReadWriteOnce volume, attach it to a pod running an NFS server, and then mount it multiple times in the two culprits (gitlab + docker registry). It would be grand.
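
For reference, the NFS-server half of that plan looks roughly like this: a single-replica Deployment that mounts the ReadWriteOnce disk and exports it, plus a Service so the other pods can reach it by name. This is a sketch; the claim name (nfs-export) is a placeholder, and the image is the stock example NFS server from the Kubernetes examples repo.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
        - name: nfs-server
          image: gcr.io/google_containers/volume-nfs:0.8  # stock example NFS server image
          ports:
            - name: nfs
              containerPort: 2049
            - name: mountd
              containerPort: 20048
            - name: rpcbind
              containerPort: 111
          securityContext:
            privileged: true        # needed to run the in-kernel NFS server
          volumeMounts:
            - name: export
              mountPath: /exports   # the directory this image exports
      volumes:
        - name: export
          persistentVolumeClaim:
            claimName: nfs-export   # placeholder: the ReadWriteOnce GCE PD claim
---
apiVersion: v1
kind: Service
metadata:
  name: nfs-server                  # gives us nfs-server.default.svc.cluster.local
spec:
  selector:
    role: nfs-server
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111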

And then time vanished into a black-hole vortex. You see, when you dig under the covers of Kubernetes, it is a very early and raw system. It has this concept of a PersistentVolumeClaim. On this you can set options such as NFS hard vs soft. But you cannot use it with anything other than the built-in provisioners. I looked at the external provisioners but, well, a) incubator, and b) they are for some custom hardware I don’t have.

Others were clearly worried about NFS mount options, since issue 17226 existed. After 2.5 years of work on it, mission accomplished was declared. But only for PersistentVolume/PersistentVolumeClaim objects, not for inline volumes. And this matters.
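
For the record, what that work delivered is a mountOptions list on the PersistentVolume object. A sketch of what it looks like (reusing the in-cluster server name; I haven’t been able to make this path work on GKE, as you’ll see below):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs-soft
spec:
  capacity:
    storage: 10Mi
  accessModes:
    - ReadWriteMany
  mountOptions:           # this is the part issue 17226 added
    - soft
    - timeo=30
    - retrans=3
  nfs:
    server: nfs-server.default.svc.cluster.local
    path: /

What I actually have to work with, though, is an inline volume in the pod spec, like this: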

    spec:
      containers:
        - name: nfs-client
...
          volumeMounts:
            - name: nfs
              mountPath: /registry
      volumes:
        - name: nfs
          nfs:
            server: nfs-server.default.svc.cluster.local
            path: /

You see, I need something like that (which doesn’t accept options), because I cannot use:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nfs-client
  labels:
    name: pv-nfs-client   # label added so the claim's selector below actually matches
spec:
  capacity:
    storage: 10Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.default.svc.cluster.local
    path: "/"

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-nfs-client
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
  selector:
    matchLabels:
      name: pv-nfs-client

That route requires a built-in provisioner for some external NFS appliance.

OK, maybe I have panicked too early. I mean, it’s 2018. It’s 26 years since my first bad experiences with NFS, and 29 since my first ones at all.

Let’s try. So I wrote k8s-nfs-test.yaml. Feel free to try it yourself: ‘kubectl create -f k8s-nfs-test.yaml’. Wait a minute, then delete it. And, if you run ‘kubectl get pods’, you will find you have something stuck in Terminating:

nfs-client-797f96b748-j8ttv 0/1 Terminating 0 14m

Now, you can pretend to delete this:

kubectl delete pod --force --grace-period 0 nfs-client-797f96b748-xgxnb

But, and I stress but, you haven’t. You can log into the Nodes and see the mount:

nfs-server.default.svc.cluster.local:/exports on 
/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/fa6c9ce3-61d0-11e8-9758-42010aa200b4/volumes/kubernetes.io~nfs/nfs
type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.19.240.34,mountvers=3,mountport=20048,
mountproto=tcp,local_lock=none,addr=10.19.240.34)

in all its ‘hard’ glory. You can try and force unmount it:

umount -f -l /path

But, well, that just makes it vanish from the list of mounts, no real change.

So. Here I sit. Broken-hearted. I’ve looped full circle back to the start of my career.

So, peanut gallery, who wants to tell me what I’m doing wrong, so that I can ‘claim’ an export I’ve made from a container I’ve launched, rather than one in the cloud infrastructure? Or, alternatively, how I can set a mount-option on a volumeMount: line rather than a PersistentVolume: line?

Or chip in on 51835.

This is actually pretty cool. If you go to the GitLab website, you are presented with a tool at the bottom that shows you what they use cookies for, and allows you to control which you accept.

This is in turn a requirement (partially) driven by the EU GDPR.

Now, there are browser extensions and other tools that do some of this, but to see it presented here, with the what and the why, I like.

They are using a tool called Cookiebot. Check it out!

So, for example, I see GitLab has used a Facebook cookie (at right). And it’s used to deliver real-time bidding and advertisement. Since GitLab doesn’t have any ads, this means there is a revenue stream somewhere else. Interesting.

In the top image we can see a ‘BigIP’ cookie. This is added by an F5 load balancer, presumably to lock a session to the same backend.

Anyway, I like the transparency here. Give it a try.

So earlier I wrote about my simple rsync approach. It worked well for a bit, and then mysteriously broke. Why you ask? And how did I fix it?

Well, down the rabbit-hole we go.

First I get an email from Google GCP/GKE people.

Hello Google Kubernetes Customer,

Several vulnerabilities were recently discovered in the Linux kernel which may allow escalation of privileges or denial of service (via Kernel Crash) from an unprivileged process. These CVEs are identified with tags CVE-2018-1000199, CVE-2018-8897 and CVE-2018-1087. All Google Kubernetes Engine (GKE) nodes are affected by these vulnerabilities, and we recommend that you upgrade to the latest patch version, as we detail below.

OK, well, you’ve got my attention. I read up on the CVEs; the solution seems simple. We have ‘auto-upgrade’ set, just let it do its thing.

A while later I find everything is dead. Huh. So what has happened: I have 2 nodes online, each with 2 vCPU and 7.5GB of RAM. The workload I have running uses about 60% of each. So the upgrade cordoned the first node so it would take no new work, and then started moving stuff over. But when the remaining node hit 100% usage, it couldn’t take any more, and we were locked up. A bug, I think. OK, start a 3rd node (it’s only money, right? I mean, I’m spending $300/mo to lease about the speed of my laptop, what a deal!). OK, that node comes online, we rescue the deadlock, and the upgrade proceeds.
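
(For the record, growing the node pool is one gcloud command, something like the line below. The cluster, pool, and zone names are placeholders, and the exact flag for the node count differs between gcloud versions.)

gcloud container clusters resize my-cluster --node-pool default-pool --num-nodes 3 --zone us-east1-b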

But, suddenly, I notice that the work moved from Node 1 to Node 3 (which has an upgraded Kubernetes version) goes into CrashLoopBackOff. Why?

Well, it turns out it relates to the PersistentVolumeClaim and use of ReadWriteOnce. For some reason, the previous version of k8s allowed my rsync container to do a ReadOnly mount of something that someone else had mounted ReadWrite. I did not think this was an error, but, reviewing the docs, it is. The reason is that a GCEPersistentDisk is exposed as block-io and attached to the Node, so the access mode is really enforced per node. In my 2-node case, the rsync container and the thing it backed up were running on the same node, so there was only 1 block-io mount. In the new case, they are not. And it’s now checked, so you can’t circumvent it. Hmm. Looking at the table below, I now find I have a null set of solutions. On my home OpenStack I could use CephFS to solve this (ReadWriteMany), but that is not exposed in GKE. Hmm. I could bring up my own GlusterFS solution, but that is a fair bit of work I’m not excited to do. There must be a better way?

Volume Plugin        | ReadWriteOnce | ReadOnlyMany | ReadWriteMany
AWSElasticBlockStore | ✓             | –            | –
AzureFile            | ✓             | ✓            | ✓
AzureDisk            | ✓             | –            | –
CephFS               | ✓             | ✓            | ✓
Cinder               | ✓             | –            | –
FC                   | ✓             | ✓            | –
FlexVolume           | ✓             | ✓            | –
Flocker              | ✓             | –            | –
GCEPersistentDisk    | ✓             | ✓            | –
Glusterfs            | ✓             | ✓            | ✓
HostPath             | ✓             | –            | –
iSCSI                | ✓             | ✓            | –
Quobyte              | ✓             | ✓            | ✓
NFS                  | ✓             | ✓            | ✓
RBD                  | ✓             | ✓            | –
VsphereVolume        | ✓             | –            | – (works when pods are collocated)
PortworxVolume       | ✓             | –            | ✓
ScaleIO              | ✓             | ✓            | –
StorageOS            | ✓             | –            | –

OK, here is what I did (my GitHub repo). I found a great backup tool, restic. It allows me to use all kinds of backends (S3, GCS, …). I chose to use the REST backend and run it off-site. I then found this great tool ‘stash’, which automatically injects itself into all your pods, and the problem just goes away. Beauty, right? Well, flaw: Stash doesn’t support the REST backend. Sigh. (Side note: please vote for that issue so we can get it fixed!) OK, I’m pretty pot-committed at this stage, so I trundle on (by now you’ve read my git repo in the other window and seen my solution, but humour me).

Next problem: restic assumes the output is a tty, supporting SIGWINCH etc. Grr. So, issue opened, and judicious use of sed in the Dockerfile. PPS: why oh why did the framers of Go consider it a good thing to hard-code the git repo name in the source? Same issue Java had (with the com.company… thing). So hard to ‘fork’ and fix.

So what I have done is create a container that has restic in it, plus cron. If you run it as ‘auto’, then every so-many hours it wakes up and backs up your pod. I then add it to the pod as a 2nd container manually (so much for Stash), roughly as sketched below. And we are good to go.
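
The wiring ends up looking roughly like this: the restic+cron container rides along as a second container in the pod and mounts the same volume read-only. This is a sketch; the image name, repository URL, and secret name here are placeholders (the real thing is in the repo), and RESTIC_REPOSITORY / RESTIC_PASSWORD are restic’s standard environment variables.

spec:
  containers:
    - name: gitlab
      # ... the existing application container ...
    - name: backup
      image: registry.example.com/restic-cron:latest    # placeholder for the image built from my Dockerfile
      args: ["auto"]                                     # wake up every so-many hours and run a backup
      env:
        - name: RESTIC_REPOSITORY
          value: rest:https://backup.example.com/gitlab  # restic REST backend (placeholder URL)
        - name: RESTIC_PASSWORD
          valueFrom:
            secretKeyRef:
              name: restic-secret
              key: password
      volumeMounts:
        - name: data
          mountPath: /data
          readOnly: true                                 # the sidecar only needs to read what it backs up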

OK, let’s redo that earlier poll.

Is this a 'good' thing?


So the new office doesn’t have free parking. It’s not far from home, and I’ve been walking, but as it’s getting hotter that is getting stickier. Time for a solution involving electrical gadgets, I hear you say. You are right!

So I acquired a VoltBike Urban (black). Spiffy, right? It folds up to fit into differently-shaped spaces (perhaps the trunk of a car?). And it’s got secrets. Lurking within is a modest 350W assist motor and a 36V 11.6Ah battery. And does it ever propel you. Even in top gear you feel a little bit like a hamster in a wheel on full assist. It’s speed-limited (the motor, that is) to 32km/h, which is plenty I think.

So now the questions. Can it/should it be hacked somehow? The controller is one of these in the LCD family.

The bike is pretty decent quality. But the charger… Whoa. It’s hot if you just leave it plugged in, not charging 🙂 And it’s suspiciously light, that charging brick, for something that is outputting 42V @ 2A. I might spring for an ‘upgrade’ on AliExpress to 42V @ 5A with a fan in a metal case.

So what do you think, will I look like too much of a hipster on this? Pee-wee Herman accessories? Tassels?