Tag: devops

Gitlab 11.6.0 was just released, and one of the new features is the ability to delete a pipeline. Why might you want to delete a little bit of history? Well, perhaps you inadvertently caused some private information (a password, say) to be printed. Or perhaps you left Auto DevOps enabled and ended up with a whole slew of useless failures. Either way, you want to go and delete the pipeline in question along with all of its output, artifacts, etc. Well, here's a quick 'hack' to achieve the goal.

You run it as 'python delete-pipeline.py -t token -u url -g group -p project' and it will list all the pipeline IDs. You can then run it with "-d '*'" to delete all of them, or "-d #,#,#..." to delete specific ones. You'll likely want to 'pip install python-gitlab' first.

It's a quick and dirty hack, but it worked for me. YMMV.

#!/usr/bin/env python3
import gitlab
import argparse
import os
import requests
import sys

parser = argparse.ArgumentParser()
parser.add_argument('-u', '--url', help='API url', required=True)
parser.add_argument('-t', '--token', help='personal token', required=True)
parser.add_argument('-g', '--group', help='group', default=None, required=True)
parser.add_argument('-p', '--project', help='project', default=None, required=True)
parser.add_argument('-d', '--delete', help='delete id(s)', default=None)

args = parser.parse_args()

if args.delete:
    args.delete = args.delete.split(',')

gl = gitlab.Gitlab(args.url, private_token=args.token)
gl.auth()

def find_group(gl, group):
    groups = gl.groups.list(all=True)
    for g in groups:
        if g.name == group:
            return g
    return None

def find_subgroup(gl, group, subgroup):
    subgroups = group.subgroups.list(all=True)
    for sg in subgroups:
        if sg.name == subgroup:
            return sg
    return None

# if project is of form x/y, then assume x is a subgroup
def find_project(gl, group, project_with_subgroup):
    pws = project_with_subgroup.split('/')
    if len(pws) == 2:
        subgroup = find_subgroup(gl, group, pws[0])
        if not subgroup:
            return None
        project = pws[1]
        group = gl.groups.get(subgroup.id, lazy=True)
    else:
        project = pws[0]
    projects = group.projects.list(all=True)
    for p in projects:
        if p.name == project:
            return p
    return None


group = find_group(gl, args.group)
if not group:
    print("Error: group %s does not exist" % args.group)
    sys.exit(1)
gproject = find_project(gl, group, args.project)
if not gproject:
    print("Error: %s / %s does not exist" % (args.group, args.project))
    sys.exit(1)

project = gl.projects.get(gproject.id)
pipelines = project.pipelines.list(all=True)  # all=True: fetch every page, not just the first
for pid in pipelines:
    pl = project.pipelines.get(pid.id)
    if args.delete and (str(pid.id) in args.delete or args.delete == ['*']):
        path = "%s/api/v4/projects/%s/pipelines/%s" % (args.url, project.id, pid.id)
        r = requests.delete(path, headers={'PRIVATE-TOKEN': args.token})
        print("%s %s %s -- DELETE" % (pid.id, project.name, pl.status))
    else:
        print("%s %s %s" % (pid.id, project.name, pl.status))


So the other day I wrote of my experience with the first 'critical' Kubernetes bug, and how the mitigation took down my Google Kubernetes Engine (GKE) cluster. In that case, Google pushed an upgrade and missed something with the Calico migration (Calico had been installed by them as well; nothing had been changed by me). Ooops.

Today, Azure AKS. Errors like:

"heapster" is forbidden: User "system:serviceaccount:kube-system:heapster" cannot update deployments.extensions in the namespace "kube-system

start appearing, along with a mysterious 'server is misbehaving' message associated with 'exec' to a single namespace (other namespaces are OK, and non-exec calls within this namespace are OK). Hmm.

Some online 'research' and we are led to Issue #664.

Looking deeper at the 'server misbehaving' leads to some discussion about kube-dns being broken. Kube-system shows errors like:

Node aks-nodepool1-19254313-0 has no valid hostname and/or IP address: aks-nodepool1-19254313-0 

Hmm. That is my node name; how could it lose track of its own hostname? I don't even have (easy) access to this, it's all managed.

OK, unpack the 'access one Azure node' instructions here. And we're into the assumed 'sick' node. Snoop around, nothing seems too wrong.

So... peanut gallery, what does one do? Delete the cluster and move on with life? Open a support ticket?


Another day, another piece of infrastructure cowardly craps out. Today it was Google GKE. It updated itself to 1.11.3-gke.18, and then had this to say (while nothing was working, all pods were stuck in Creating, and the nodes would not come online since the CNI failed):

2018-12-03 21:10:57.996 [ERROR][12] migrate.go 884: Unable to store the v3 resources
2018-12-03 21:10:57.996 [ERROR][12] daemon.go 288: Failed to migrate Kubernetes v1 configuration to v3 error=error storing converted data: resource does not exist: BGPConfiguration(default) with error: the server could not find the requested resource (post BGPConfigurations.crd.projectcalico.org)

Um, OK. Yes, I am running 'Project Calico' in my GKE cluster. The intent was to have network policy available. My cluster dates from the 'ancient times' of April 2018.

I did some searching online and found nothing. So in desperation I did what most of you would have done: just created the CRD, as below. And, miracle of miracles, I'm back up and running. If this fixes you, you're welcome. If you came here to tell me what dire thing is ahead of me for manually messing with a managed product, I'd love to know that too.

kubectl -n kube-system create -f - << EOF
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: bgpconfigurations.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BGPConfiguration
    plural: bgpconfigurations
    singular: bgpconfiguration
EOF


Tomorrow (Tues 27, 2018) we're going to have the next meetup to talk Continuous Integration. Got a burning desire to rant about the flakiness of an infinite number of shell scripts bundled into a container and shipped to a remote agent that is more or less busy at different hours?

Wondering if it's better to use Travis, Circle, Gitlab, Jenkins, with a backend of OpenStack, Kubernetes, AWS, ...?

We've got 3 short prepared bits of material and a chance to snoop around the shiny new offices of SSIMWAVE.

So, in the Greater Waterloo area and got some time tomorrow night to talk tech? Check the link.


So I have this architecture where we have 2 separate Kubernetes clusters. The first cluster runs in GKE, the second on 'the beast of the basement' (and then there's a bunch in AKS but they are for different purposes). I run Gitlab-runner on these 2 Kubernetes clusters. But... trouble is brewing.

You see, runners are not sticky, and you have multiple stages that might need to share info. There are two methods built into gitlab-runner for this: one is the 'cache' and one is 'artifacts'. The general workflow is 'build... test... save... deploy...'. But 'test' has multiple parallel phases (SAST, Unit, DAST, System, Sanitisers, ...).

So right now I'm using this pattern:

But a problem creeps in if 'build' runs on a different cluster than 'linux_asan'. Currently I use the 'cache' (courtesy of minio) to save the build data and bootstrap the other phases. For this particular pipeline, if the cache turns up empty, each of the test phases still works, just at reduced efficiency.
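The cache side of that pattern looks roughly like the sketch below. This is a minimal, hypothetical .gitlab-ci.yml fragment, not my actual config: the make targets, the build/ path, and keying the cache on CI_PIPELINE_ID are all illustrative assumptions.

stages:
  - build
  - test

build:
  stage: build
  script:
    - make build              # placeholder: produce output into build/
  cache:
    key: "$CI_PIPELINE_ID"    # one cache entry shared across this pipeline
    paths:
      - build/
    policy: push              # producer only uploads

linux_asan:
  stage: test
  script:
    - make test-asan          # placeholder: reuses build/ if the cache was restored
  cache:
    key: "$CI_PIPELINE_ID"
    paths:
      - build/
    policy: pull              # consumer only downloads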

However, I have other pipelines where the 'build' stage creates a Docker image. In this case, the cache holds a 'docker save' of the image and the pre-script of the later stages runs 'docker load' (using DinD). When the subsequent stages run on a different cluster, they actually fail instead: a cache miss there means there is no image to load.
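The Docker-image variant is the same idea, with the image tarball as the cached object. Again a sketch only: the image name, tag, and image.tar filename are made up, and I'm assuming the runner's DinD wiring (DOCKER_HOST and friends) is already in place.

# build job (sketch): docker build -t myapp:$CI_PIPELINE_ID . && docker save -o image.tar myapp:$CI_PIPELINE_ID
# with image.tar pushed into the same $CI_PIPELINE_ID-keyed cache as above

linux_asan:
  stage: test
  image: docker:stable
  services:
    - docker:dind
  before_script:
    - docker load -i image.tar   # hard failure if the cache didn't follow us to this cluster
  script:
    - make test-asan
  cache:
    key: "$CI_PIPELINE_ID"
    paths:
      - image.tar
    policy: pull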

So... Solutions.

  1. Run a single cache (in GKE). Both clusters use it. All works, but the performance of cache-upload from the 2nd cluster is poor
  2. Delete the GKE cluster gitlab-runner entirely
  3. Figure out a hierarchical-cache using ATS or Squid
  4. Use the 'registry': on build, push with some random tag; after test, fix the tag (see the sketch after this list)
  5. Use 'artifacts' for this. It suffers from the same cache-upload speed issue
  6. Go back to the original issue that caused the deployment of the self-run Kubernetes cluster and find a cost-economical way to have large build machines dynamically (I tried Azure virtual kubelet but they are nowhere close to big enough; I tried Google Builder but it's very complex to insert into this pipeline).
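For option 4, the shape would be roughly the following. It's a sketch only: the 'promote' job, the pipeline-$CI_PIPELINE_ID tag, and the :latest retag are assumptions, and it leans on GitLab's predefined CI_REGISTRY, CI_REGISTRY_IMAGE and CI_JOB_TOKEN variables.

build:
  stage: build
  image: docker:stable
  services:
    - docker:dind
  script:
    - docker login -u gitlab-ci-token -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:pipeline-$CI_PIPELINE_ID" .
    - docker push "$CI_REGISTRY_IMAGE:pipeline-$CI_PIPELINE_ID"

# ... test stages pull $CI_REGISTRY_IMAGE:pipeline-$CI_PIPELINE_ID instead of loading from cache ...

promote:
  stage: deploy
  image: docker:stable
  services:
    - docker:dind
  script:
    - docker login -u gitlab-ci-token -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
    - docker pull "$CI_REGISTRY_IMAGE:pipeline-$CI_PIPELINE_ID"
    - docker tag "$CI_REGISTRY_IMAGE:pipeline-$CI_PIPELINE_ID" "$CI_REGISTRY_IMAGE:latest"
    - docker push "$CI_REGISTRY_IMAGE:latest"

This uses the GitLab registry, rather than the runner cache, as the transport for the image between clusters.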

Others?

The problem is the basement-machine has 1Gbps downstream, but only 50Mbps upstream. And some of these cache items are multi-GiB.
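To put a number on that: at 50 Mbps a single GiB takes roughly three minutes to upload, so a multi-GiB cache push from the basement can easily add ten minutes or more to a stage.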
