‘first’ and ‘only’ are four-letter words in cloud. How to do something `once` and `first` in a Kubernetes Deployment

A funny problem exists that you may not be aware of. If you like being blissfully unaware, perhaps head over to kittenwar for a bit. Otherwise, read on: it involves the words 'first' and 'only'.

You see, in a cloud-native world, there is a continuum. There is no 'first' or 'only', only the many. It's kind of like the Borg. You have a whole bunch of things running already, and there was no start time. There was no bootstrap, no initial creation, no 'let there be light' moment. But you may have some prerequisite, something that must be done exactly once before the universe is ready to go online.

Perhaps it's installing the schema into your database, or upgrading it. If you have a Deployment with n replicas and n > 1, they will all come up and try to install this schema, non-transactionally, and badly.
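To see why "everyone installs" is the default outcome, here is a toy model in plain Python, not Kubernetes or SQLAlchemy: a dict stands in for the database, threads stand in for Pods, and a `threading.Barrier` forces the worst-case interleaving where every Pod checks the schema before any Pod installs it. The names `simulate_rollout` and `naive_pod` are hypothetical, chosen for illustration.

```python
import threading


def simulate_rollout(n_replicas=3):
    """Toy model: each thread is a Pod running an unguarded schema install."""
    schema = {"version": 0}   # stand-in for the real database
    installs = []             # which Pods ran the install step
    # Forces every Pod to finish its check before any Pod acts on it.
    everyone_checked = threading.Barrier(n_replicas)

    def naive_pod(name):
        needs_install = schema["version"] == 0   # 1. check
        everyone_checked.wait()
        if needs_install:                        # 2. act on a stale answer
            installs.append(name)
            schema["version"] = 1

    pods = [threading.Thread(target=naive_pod, args=(f"pod-{i}",))
            for i in range(n_replicas)]
    for p in pods:
        p.start()
    for p in pods:
        p.join()
    return installs


print(len(simulate_rollout(3)))  # 3: every replica ran the install
```

In a real cluster the interleaving is up to the scheduler rather than a barrier, but any overlap between "check" and "act" produces the same duplicate, concurrent installs.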

How can you solve this dilemma? You could read the long issue kubernetes/community#1171. It's all going in the right direction (ReplicaSet lifecycle hooks, etc.), and then it falls off a cliff. Perhaps all the people involved in it were beamed up by aliens? It seems the most likely answer.

But, while you are waiting, I have another answer for you.
Let's say you have a Django or Flask (or Quart, you asyncio lover!) application. It uses SQLAlchemy. The schema upgrades are bulletproof and beautiful. If only there were a good moment to run them in Kubernetes.

You could make a Job. It will run once, but only once: it will not run again on upgrade. You could use an initContainer, but it runs in every Pod of the ReplicaSet (here, one managed by a Deployment). So, let's use a database transaction to serialise safely.

Now, last chance to head to kittenwar before this gets a bit complex. OK, still here? Well, uh, Python time.

In a nutshell:

  • create table
  • start nested session
  • lock table
  • run external commands
  • commit
  • end session

Easy, right? I chose to run external commands rather than calling the migration directly (here, Flask-Migrate's upgrade) so that the technique works for other tasks too.
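The lock-then-check pattern can be tried locally without PostgreSQL. This is a sketch under stated assumptions, not the author's script: it uses SQLite from the standard library, where `BEGIN EXCLUSIVE` on a separate lock database plays the role of `LOCK TABLE install_locks IN ACCESS EXCLUSIVE MODE` (SQLite locks whole database files rather than tables, hence two files). The hypothetical `run_upgrade` function stands in for the external upgrade command and opens its own connection, just as a separate process would. Because it checks the current version first, only the first Pod to get the lock does any work.

```python
import os
import sqlite3
import tempfile
import threading


def run_upgrade(app_db):
    """Hypothetical stand-in for the external upgrade command.

    Uses its own connection (as a separate process would) and is
    idempotent: returns True only if it actually installed the schema.
    """
    conn = sqlite3.connect(app_db, timeout=30)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS schema_version (v INTEGER)")
        if conn.execute("SELECT v FROM schema_version").fetchone() is None:
            conn.execute("INSERT INTO schema_version (v) VALUES (1)")
            conn.commit()
            return True
        return False
    finally:
        conn.close()


def pod(lock_db, app_db, did_work, name):
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    lock_conn = sqlite3.connect(lock_db, timeout=30, isolation_level=None)
    try:
        # Waits (up to `timeout` seconds) while another Pod holds the lock.
        lock_conn.execute("BEGIN EXCLUSIVE")
        if run_upgrade(app_db):
            did_work.append(name)
        lock_conn.execute("COMMIT")  # release the lock for the next Pod
    finally:
        # If anything above raised, closing rolls back and frees the lock.
        lock_conn.close()


def rollout(n_replicas=5):
    with tempfile.TemporaryDirectory() as tmp:
        lock_db = os.path.join(tmp, "lock.db")
        app_db = os.path.join(tmp, "app.db")
        # Equivalent of creating the install_locks table up front.
        setup = sqlite3.connect(lock_db)
        setup.execute("CREATE TABLE install_locks (lock INTEGER)")
        setup.commit()
        setup.close()

        did_work = []
        pods = [threading.Thread(target=pod,
                                 args=(lock_db, app_db, did_work, f"pod-{i}"))
                for i in range(n_replicas)]
        for p in pods:
            p.start()
        for p in pods:
            p.join()
        return did_work


print(len(rollout(5)))  # 1: one Pod upgraded, the rest found nothing to do
```

In the real setup the lock table and the schema live in the same PostgreSQL database, since PostgreSQL can lock individual tables; the two-file split is purely a SQLite workaround.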

Hack on.

"""
This exists to solve a simple problem. We have a Deployment with >1
Pods. Each Pod requires that the database be up-to-date with the
right schema for itself. The schema install is non-transactional.
If we start 2 Pods in parallel, and each tries to upgrade the schema,
they fail.
If we don't upgrade the schema, then we can't go online until some
manual step.

Instead we create an 'install_locks' table in the database. A wrapper
Python script takes an exclusive transaction lock on this table,
and then goes on with the initial setup / upgrade of the schema.
This serialises the Pods: one Pod does the work while the others wait,
and by the time they get the lock there is no work left to do.

Whenever the imageTag is changed, this Deployment will update
and the process will repeat.

The initContainer doing this must run the same software as the main
container.
Note: we could instead have done this in the main start script
rather than as an initContainer.

See kubernetes/community#1171 for a longer discussion

"""

import subprocess

import environ
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, inspect, text
from sqlalchemy.exc import DatabaseError

"""
    Could have just run this:
    db = SQLAlchemy(app)
    ...
    migrate = Migrate(app, db)
    from flask_migrate import upgrade as _upgrade
    _upgrade()
    but we want this to be generic for other db operations,
    so we run external commands instead.
"""

env = environ.Env(DEBUG=(bool, False))
SQLALCHEMY_DATABASE_URI = env(
    'SQLALCHEMY_DATABASE_URI',
    default='sqlite:////var/lib/superset/superset.db')

engine = create_engine(SQLALCHEMY_DATABASE_URI)
# repr() of the URL masks the password, so credentials stay out of the logs.
print("USE DB %r" % engine.url)

# Note: there is a race here: we check for the table, then create it.
# If the create fails, the table was likely created by another instance,
# so only re-raise if it still does not exist.
if not inspect(engine).has_table('install_locks'):
    metadata = MetaData()
    Table('install_locks', metadata, Column('lock', Integer))
    try:
        metadata.create_all(engine)
    except DatabaseError:
        if not inspect(engine).has_table('install_locks'):
            raise

# LOCK TABLE ... IN ACCESS EXCLUSIVE MODE is PostgreSQL syntax (it will not
# work against the SQLite default above). engine.begin() opens a transaction
# and the lock is held until it commits, or rolls back on error, so any other
# Pod blocks on this statement until we are done.
with engine.begin() as conn:
    conn.execute(text('LOCK TABLE install_locks IN ACCESS EXCLUSIVE MODE'))
    # check=True raises if the command fails: the transaction rolls back,
    # the lock is released, and the initContainer exits non-zero.
    subprocess.run(['/usr/local/bin/superset', 'db', 'upgrade'], check=True)
    # ... other init commands ...
3 comments on “‘first’ and ‘only’ are four-letter words in cloud. How to do something `once` and `first` in a Kubernetes Deployment”
  1. Vincent says:

    I did something similar a couple of years back, in my case by writing a record in a database to get an atomic lock and make sure only one pod instance would perform a schema migration. If I faced this problem again, I would probably look into using etcd’s compare-and-swap to get an atomic distributed lock instead.

  2. db says:

    So here, since it's actually the same database it's modifying, this is a convenient way… no additional RBAC or permissions needed.
    There’s no convenient way to get at the k8s etcd, and I certainly don’t want another one 🙂
    I had considered having a Deployment with replicas=1 just for the upgrade, and then having it delete itself (e.g. create an RBAC role that allowed it to delete itself).
    I also looked at https://github.com/pulcy/kube-lock. But I don’t really want to change annotations, and it has the same RBAC need as the previous option.

    My experience with a regional k8s cluster (e.g. a 3-node master across 3 zones) is that these types of locks are not the best either.

    This is a bit hackish IMHO, and the thread about ReplicaSet-level hooks was interesting until it fell off a cliff.

  3. Chris says:

    You had me at kittenwar.
