Cloud Lessons: Of gits and ci’s and pipelines and k8s and charts
OK, no cats, no IoT robots, no crappy consumer electronics disassembly, no spray-foam-snowblower tires. This is a post about the journey to get a source control system + continuous integration system going in a faraway cloud. WiFi fireplaces and feral cats will return, I promise 🙂 Also, if you want to make this like Jeopardy and start with the answer, well, here.
PS that's my cat (TC) at the right, before he got big and mean. That's all the cat you get in this post.
So. I am in the process of getting a new company off the ground, and currently there is somewhat of a labour shortage: it's just me. And I want to understand all the dirty secrets involved; I'm not looking to just push a 'go easy' button.
I'm a fan of 'gitlab'. It works well, it's got the workflow I want. And I've been running it at home for some years now, so I am familiar with it. It should be a simple matter to 'lift and shift' that to a public cloud, right? Well, um, yes it is now, but it wasn't when I started. Which brings us to lesson #1:
"It is a lot of work to make things simple"
So all the current 'hotness' today is around 'declarative' setups. Kubernetes, YAML, that sort of thing. In this model, instead of me saying "create 1 of these, wire it to that" (imperative), I write a model that says "make the universe this way", and apply it. If the universe wanders from my desire, the controller pushes it back (e.g. if something crashes or dies or whatever). OK, no problem, I know YAML and can read the models, I got this. But, well, it's hard to make this flexible enough, and I'm familiar w/ Ansible and its use of 'mustaches' (Jinja2). I snooped around a bit and found there is a thing for Kubernetes called "Helm". It takes a 'Chart' and applies templating before offering the result to the Kubernetes API. OK, I'm in. It's very crude in some ways (no line numbers if you make an error), but it works.
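To make the 'declarative' idea concrete, here is a toy sketch of a reconcile loop. This is not how Kubernetes is implemented; the service names, the replica counts and the `reconcile`/`apply` helpers are all made up for illustration. The point is only that I state the desired universe, and the loop works out what to do about the difference.

```python
# Toy reconcile loop. The "universe" is a hypothetical dict of
# service name -> number of running copies; nothing here talks to a real cluster.

def reconcile(desired, actual):
    """Return the actions needed to move `actual` towards `desired`."""
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that was never declared gets stopped.
    for name in sorted(set(actual) - set(desired)):
        actions.append(("stop", name, actual[name]))
    return actions


def apply(actions, actual):
    """Apply the actions to the in-memory universe and return it."""
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        remaining = actual.get(name, 0) + delta
        if remaining > 0:
            actual[name] = remaining
        else:
            actual.pop(name, None)
    return actual


# What I want (declarative), versus what is there after a crash.
desired = {"gitlab": 1, "registry": 1, "runner": 2}
actual = {"gitlab": 1, "runner": 1}

actions = reconcile(desired, actual)
apply(actions, actual)
```

Run it again once the universe matches and `reconcile` returns an empty list, which is the whole appeal: applying the same model twice is harmless.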
One of my goals for the new company is 'BeyondCorp' model. And part of this is a single-sign-on need, which I wish to do via OAuth2 against G-Suite. E.g. I want to have a single spot I login, once, and have all other trust federated off that. That makes 2FA bearable and still strong. I have that on the home gitlab, so I want to maintain that. This in turn has some downstream ramifications (e.g. the container registry needs to trust those tokens).
Another goal was to understand how *all* of it deploys, what can go wrong, what security choices/shortcuts exist. Standard operating procedure for me. What things 'just worked' without supplying a shared secret? And if so, how could they?
Another goal was to see what this whole 'public cloud' thing was about. I've run a fairly large OpenStack system for years, so I've got that part of it in hand.
So to start with I chose Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE). I chose a single 'instance' in the node pool (2 vCPU, 7.5 GB of RAM), should be lots, right? This led to lesson #2:
"Public cloud is not cheap".
Ouch. I posted the 1-day bill here, and it's gone up!
Once I started deploying, I was shocked to learn that this machine was not big enough. Why? It's definitely way more than it uses at home. Turns out it's a 'micro-service' issue. Micro-services are inherently less efficient (how many copies of nginx does one need?) since you end up with more things running, each with its own fixed overhead. Plus there are a lot of management services and telemetry. This led to lesson #3:
"Flexibility and scalability are costly at the low end"
OK, so I add another node to the pool and continue on. Now, gitlab is in a bit of a transition from a huge monolith (which they term 'omnibus') to 'cloud-native'. Refactoring, splitting, tearing, etc. And it's termed 'alpha'. But there are a lot of restrictions. No backups, no upgradability, ssh hostkeys wander around, ... In short, it's not yet ready for production (they do not claim it is).
So, well, I planned to learn how to make the sausage in this sausage factory anyway, so I made my own helm chart. And, after a couple of days of tinkering and cursing, it's as simple as:
helm install -f values.yml --name git .
That's it. You have a production-ready git + CI + container registry.
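If the `-f values.yml` part looks like magic, it is roughly this: the chart ships defaults, and my file overrides them key by key before the templates get rendered. Below is a rough sketch of that merge in Python. Helm itself is written in Go and does this internally, so `merge_values` and every key name here are hypothetical, not the contents of my chart.

```python
import copy

# Rough sketch of layering user values over chart defaults.
# All keys below are invented for illustration.

def merge_values(defaults, overrides):
    """Deep-merge `overrides` onto `defaults`; nested maps merge, everything else is replaced."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# What the chart would ship with (hypothetical).
chart_defaults = {
    "externalUrl": "git.example.com",
    "persistence": {"size": "10Gi", "existingClaim": None},
    "oauth": {"enabled": False},
}

# What I would put in values.yml (hypothetical).
user_values = {
    "persistence": {"existingClaim": "git-data"},
    "oauth": {"enabled": True, "provider": "google_oauth2"},
}

values = merge_values(chart_defaults, user_values)
```

The nice property is that values.yml only has to say what is different for this install; everything else stays at the chart's defaults.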
And I learned a few other things. One is about the adage of 'pets' and 'cattle'. You see, I had this mental battle, a dichotomy, about how to do 'PersistentVolumeClaims'. In Kubernetes, by default, all storage is ephemeral and vanishes when it pleases. Not so great when you are talking about building the next $1B company and all its IP. I had some options:
- Manually create the storage, and let Helm take it over and use it
- Allow Helm to create the storage, and either never delete the chart, or, rely on backup/restore on upgrades
- Create a new storage class with persistence, on first use let Helm create it, and subsequent use take it over. But then we can only have a single instance running
OK, all are viable. I chose #1, but made the Chart support #2 as well.
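Supporting both #1 and #2 in one chart boils down to a single conditional: if I hand it the name of an existing claim, use it; otherwise create one. Here is that logic sketched in Python rather than in Helm templating. The `existingClaim` and `size` keys, the `-data` naming and both helper functions are assumptions for the sake of the example, not lifted from my chart.

```python
# Sketch of the "bring your own storage, or let the chart create it" choice.
# Key names and the "<release>-data" naming scheme are hypothetical.

def render_pvc(release, persistence):
    """Return a PersistentVolumeClaim manifest, or None when an existing claim is reused."""
    if persistence.get("existingClaim"):
        # Option 1: storage was created by hand, so the chart must not create another.
        return None
    # Option 2: the chart owns the claim (and deleting the chart may take it along).
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{release}-data"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": persistence.get("size", "10Gi")}},
        },
    }


def claim_name(release, persistence):
    """Name the pods should mount, whichever option was chosen."""
    return persistence.get("existingClaim") or f"{release}-data"


manual = {"existingClaim": "git-manual"}   # option 1
managed = {"size": "50Gi"}                 # option 2
```

With `manual`, `render_pvc` returns nothing and the pods mount `git-manual`; with `managed`, the chart emits a 50Gi claim named after the release.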
I chose to use cert-manager to manage creating/updating Let's Encrypt certificates for the public side. This made it really easy. And it integrates with the Kubernetes Ingress (for which I'm using Nginx). But it leaves me w/ the east-west traffic unencrypted, and that is not easily fixed. I mean, I could go with a self-signed certificate on those legs. Or I could do SSL passthrough, give up VHOSTS, and use more public IPs.
Other things proved tough too. Like all 'OSS', there is an embarrassment of choices. What networking do you want, Flannel, Calico, Canal, ...? Ksonnet versus Helm? etc. This gave rise to lesson #4:
"Pick an approach and go until unviable, don't go into analysis-paralysis".
(For those of you who have not, read Malcolm Gladwell's Blink)
Another thing I found: Google does not have all services in all places. As I posted in the cloud billing, there is no Google Container Registry in Montreal. Which affected my data sovereignty. Which led to lesson #5:
"Beware of blanket statements".
Things like 'data sovereignty' or 'country of install' are not Booleans. Where is their disaster recovery? Their backup? What about that service you might use someday, or someone on your team will because it's enabled in the API?
- A lot of upstream 'risk' comes in. I posted about this before. But Debian + Ubuntu + Alpine, Go, Python, Ruby... Lots of risk there, lots of 'who builds that container and how' trust.
- Networking topologies are not obvious. Pod networks vs overlay networks vs nodeports. A lot of IP space, not all reachable.
- A lot of security seems to be just assuming "if you can talk to me you are trusted" (I'm looking at you redis).
- The Kubernetes API. It's powerful. And a lot of containers inside it have access to it, do you trust them? Did you set up RBAC perfectly in all cases?
- Data sovereignty. Complex.
So, is it better than the one on my home machine? Well, it's certainly not faster. And it's certainly not cheaper. But it's not about scaling down, it's about scaling out. And now I know what's inside the sausage.
in harmonia progressio