An open padlock with a key inserted into it, on a printed circuit board

Pulling container images from private registries (including Docker Hub) with a Kubernetes Kubelet Credential Provider

At work last week, I finally solved an issue by writing some code, and I wanted to explain why I wrote it.

At it’s core, Kubernetes is an orchestrator which runs “Container Images”, which are structured filesystem snapshots, taken after running individual commands against a base system. These container images are stored in a container registry, and the most well known of these is the Docker registry, known as Docker Hub.

A registry can be public, meaning you don’t need credentials to get any images from it, or private. Some also offer a mixed-mode where you can make a certain number of requests without requiring authentication, but if you need more than that amount of requests, you need to provide credentials.

During the build-out of a new cluster, I discovered that the ECR (Elastic Container Registry) from AWS requires a new type of authentication – the Kubelet Credential Provider, which required the following changes:

  1. In /etc/sysconfig/kubelet you provide these two switches;
    --image-credential-provider-bin-dir /usr/local/bin/image-credential-provider and --image-credential-provider-config /etc/kubernetes/image-credential-provider-config.json.
  2. In /etc/kubernetes/image-credential-provider-config.json you provide a list of registries and the credential provider to use, which looks like this:
{
  "apiVersion": "kubelet.config.k8s.io/v1",
  "kind": "CredentialProviderConfig",
  "providers": [
    {
      "name": "binary-credential-provider-name",
      "matchImages": [
        "example.org",
        "registry.*.example.org",
        "*.registry.*.example.org"
      ],
      "defaultCacheDuration": "12h",
      "apiVersion": "credentialprovider.kubelet.k8s.io/v1"
    }
  ]
}
  1. Downloading and placing the credential provider binary into the /usr/local/bin/image-credential-provider path.

The ECR Credential Provider has it’s own Github repostitory, and it made me think – we’ve been using the “old” method of storing credentials using the containerd configuration file, which is now marked as deprecated – but this means that any changes to these credentials would require a restart of the containerd service (which apparently used to have a big impact on the platform), but this new ECR provider doesn’t.

I decided to write my own Credential Provider, following the documentation for the Kubelet Credential Provider API and I wrote it in Python – a language I’m trying to get better in! (Pull requests, feature requests, etc. are all welcome!)

I will confess I made heavy use of ChatGPT to get a steer on certain aspects of how to write the code, but all the code is generic and there’s nothing proprietary in this code.

Using the Generic Credential Provider

  1. Follow the steps above – change your Kubernetes environment to ensure you have the kubelet configuration changes and the JSON credential provider configuration put in the relevant parts of your tree. Set the “matchImages” values to include the registry in question – for dockerhub, I’d probably use ["docker.io", "*.docker.io"]
  2. Download the generic-credential-provider script from Github, put it in the right path in your worker node’s filesystem (if you followed my notes above it’ll be in /usr/local/bin/image-credential-provider/generic-credential-provider but this is *your* system we’re talking about, not mine! You know your build better than I do!)
  3. Create the /etc/kubernetes/registries directory – this can be changed by editing the script to use a new path, and for testing purposes there is a flag --credroot /some/new/path but that doesn’t work for the kubelet configuration file.
  4. Create a credential file, for example, /etc/kubernetes/registries/example.org.json which contains this string: {"username":"token_username","password":"token_password"}. [Yes, it’s a plaintext credential. Make sure it’s scoped for only image downloads. No, this still isn’t very good. But how else would you do this?! (Pull requests are welcomed!)] You can add a duration value into that JSON dictionary, to change the default timeout from 5 minutes. Technically, the default is actually set in /etc/kubernetes/image-credential-provider-config.json but I wanted to have my own per-credential, and as these values are coming from the filesystem, and therefore has very little performance liability, I didn’t want to have a large delay in the cache.
  5. Test your credential! This code is what I used:
echo '{
  "apiVersion": "credentialprovider.kubelet.k8s.io/v1",
  "kind": "CredentialProviderRequest",
  "image": "your.registry.example.org/org/image:version"
}' | /usr/local/bin/image-credential-provider/generic-credential-provider

which should return:

'{"kind": "CredentialProviderResponse", "apiVersion": "credentialprovider.kubelet.k8s.io/v1", "cacheKeyType": "Registry", "cacheDuration": "0h5m0s", "auth": {"your.registry.example.com": {"username": "token_username", "password": "token_password"}}}'

You should also see an entry in your syslog service showing a line that says “Credential request fulfilled for your.registry.example.com” and if you pass it a check that it fails, it should say “Failed to fulfill credential request for failure.example.org“.

If this helped you, please consider buying me a drink to say thanks!

Featured image is “Padlock on computer parts” by “Marco Verch Professional Photographer” on Flickr and is released under a CC-BY license.

A medieval helm in gold on a bench at a museum

Fixing 403 errors from ghcr.io with #helm pull

At work, I’m using skaffold to deploy a helm chart which references a ghcr.io repository. Here’s the stanza I’m looking at:

apiVersion: skaffold/v3
kind: Config
deploy:
  helm:
    releases:
      - name: {package}
        remoteChart: oci://ghcr.io/{owner}/{package}

This is the first time I’ve tried to deploy this chart, and I kept getting this message:

No tags generated
Starting test...
Starting deploy...
Helm release {package} not installed. Installing...
Error: INSTALLATION FAILED: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden
deploying "{package}": install: exit status 1

I thought this might have been an issue with the skaffold file, so I tried running this directly with helm:

$ helm pull oci://ghcr.io/{owner}/{package}
Error: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden

Huh, that looks a bit familiar. I spent a little while checking to see whether this was something at the Kubernetes cluster, or if it was just me, and ended up finding this nugget (thanks to a steer from this post)

$ gh auth token | helm registry login ghcr.io -u {my_github_user} --password-stdin
Login Succeeded

And now it works!

elm pull oci://ghcr.io/{owner}/{package}
Pulled: ghcr.io/{owner}/{package}:1.2.3
Digest: sha256:decafbad1234567890aabbccddeeffdeadbeefbadbadbadbad12345678901234

Featured image is “helm” by “23 dingen voor musea” on Flickr and is released under a CC-BY-SA license.

A series of gold blocks, each crossed, and one of the lower blocks has engraved "Eduardo Nery 1995-1998 Aleluia - Secla"

Deploying the latest build of a template (machine) image with #Xen #Orchestrator

In my current role we are using Packer to build images on a Xen Orchestrator environment, use a CI/CD system to install that image into both a Xen Template and an AWS AMI, and then we use Terraform to use that image across our estate. The images we build with Packer have this stanza in it:

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}
variable "artifact_name" {
  default = "SomeLinux-version.iso"
}
source "xenserver-iso" "this" {
  vm_name = "${var.artifact_name}-${local.timestamp}"
  # more config below
}

As a result, the built images include a timestamp.

When we use the AMI in Terraform, we can locate it with this code:

variable "ami_name" {
  default = "SomeLinux-version.iso-"
}

data "aws_ami" "this" {
  most_recent = true

  filter {
    name   = "name"
    values = [var.ami_name]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = [var.owner]
}

But, because Xen doesn’t track when a template is created, instead I needed to do something different. Enter get_xoa_template.sh.

#!/bin/bash
trap cleanup SIGINT SIGTERM EXIT
exit_message=""
set_exit=0
fail() {
[ -n "$1" ] && echo "$1" >&2
[ "$2" -gt 0 ] && exit $2
}
cleanup() {
trap – SIGINT SIGTERM EXIT
[ "$UNREGISTER" -eq 1 ] && [ "$STATE" == "signed_in" ] && xo-cli –unregister 2>&1
[ -n "$exit_message" ] && fail "$exit_message" $set_exit
}
log_debug() {
[ -n "$DEBUG" ] && echo "$1" >> "$DEBUG"
}
parse_params() {
UNREGISTER=1
DEBUG=""
while :; do
case "${1-}" in
-h | –help)
echo "usage: get_xoa_template.sh –template SomeTemplatePrefix" >&2
echo "" >&2
echo "Options:" >&2
echo " -t | –template MyTemplatePrefix = The template to look for (required)" >&2
echo " -s | –server ws://192.0.2.1 = Sign into Xen Orchestrator on 192.0.2.1" >&2
echo " [Default to using XOA_URL environment variable]" >&2
echo " -u | –user username@example.org = Sign into Xen Orchestrator using this username" >&2
echo " [Default to using XOA_USER environment variable]" >&2
echo " -p | –password hunter2 = Sign into Xen Orchestrator using this password" >&2
echo " [Default to using XOA_PASSWORD environment variable]" >&2
echo " -l | –pool MyXenPool1 = Use this pool when looking for the template." >&2
echo " [Omit to ignore]" >&2
echo " -x | –no-unregister = Don't log out from the XOA server once connected." >&2
echo " -d | –debug = Log output to /tmp/xocli_output" >&2
echo " -d | –debug /path/to/debug = Log output to the specified path" >&2
echo " –debug=/path/to/debug = Log output to the specified path" >&2
exit 255
;;
-s | –server)
XOA_URL="${2-}"
shift
;;
-u | –user)
XOA_USER="${2-}"
shift
;;
-p | –password)
XOA_PASSWORD="${2-}"
shift
;;
-l | –pool)
XOA_POOL="${2-}"
shift
;;
-t | –template)
TEMPLATE="${2-}"
shift
;;
-x | –no-unregister)
UNREGISTER=0
;;
-d | –debug)
DEBUG=/tmp/xocli_output
[ -n "${2-}" ] && [ "$(echo "${2-}" | cut -c1)" != "" ] && DEBUG="${2-}" && shift
;;
–debug=*)
DEBUG="$(echo $1 | sed -E -e 's/^[^=]+=//')"
;;
*)
break
;;
esac
shift
done
}
sign_in() {
[ -z "$XOA_URL" ] || [ -z "$XOA_USER" ] || [ -z "$XOA_PASSWORD" ] && fail "Missing sign-in details" 1
log_debug "Logging in"
if [ -n "$DEBUG" ]
then
xo-cli –register –au "$XOA_URL" "$XOA_USER" "$XOA_PASSWORD" 2>&1 | tee -a "$DEBUG" | grep -q 'Successfully' || fail "Login failed" 2
else
xo-cli –register –au "$XOA_URL" "$XOA_USER" "$XOA_PASSWORD" 2>&1 | grep -q 'Successfully' || fail "Login failed" 2
fi
STATE="signed_in"
}
get_pool() {
[ -z "$XOA_POOL" ] && log_debug "No Pool" && return 0
log_debug "Getting Pool ID"
if [ -n "$DEBUG" ]
then
POOL_ID="\$pool=$(xo-cli –list-objects type=pool | jq -c -r ".[] | select(.name_label | match(\"${XOA_POOL}\")) | .uuid" | sort | tail -n 1 | tee -a "$DEBUG")"
else
POOL_ID="\$pool=$(xo-cli –list-objects type=pool | jq -c -r ".[] | select(.name_label | match(\"${XOA_POOL}\")) | .uuid" | sort | tail -n 1)"
fi
[ "$POOL_ID" == "\$pool=" ] && fail "Pool provided but no ID received" 3
}
get_template() {
log_debug "Getting template"
if [ -n "$DEBUG" ]
then
TEMPLATE_IS="$(xo-cli –list-objects type=VM-template "${POOL_ID-}" | jq -c ".[] | select(.name_label | match(\"${TEMPLATE}\")) | .name_label" | sort | tail -n 1 | tee -a "$DEBUG")"
else
TEMPLATE_IS="$(xo-cli –list-objects type=VM-template "${POOL_ID-}" | jq -c ".[] | select(.name_label | match(\"${TEMPLATE}\")) | .name_label" | sort | tail -n 1)"
fi
[ -z "$TEMPLATE_IS" ] && fail "Could not match this template" 4
if [ -n "$DEBUG" ]
then
echo "{\"is\": ${TEMPLATE_IS}}" | tee -a "$DEBUG"
else
echo "{\"is\": ${TEMPLATE_IS}}"
fi
}
[ -n "$(command -v xo-cli)" ] || fail "xo-cli is missing, and is a required dependency for this script. Please install it; \`sudo npm -g install xo-cli\`" 5
parse_params "$@"
if [ -n "$DEBUG" ]
then
rm -f "$DEBUG"
log_debug "Invoked: $(date)"
log_debug "Template: $TEMPLATE"
log_debug "Pool: $XOA_POOL"
fi
sign_in
get_pool
get_template

This script is invoked from your terraform like this:

variable "template_name" {
  default     = "SomeLinux-version.iso-"
  description = "A regex, partial or full string to match in the template name"
}

variable "poolname" {
  default = "MyPool"
}

data "external" "get_xoa_template" {
  program = [
    "/bin/bash", "${path.module}/get_xoa_template.sh",
    "--template", var.template_name,
    "--pool", var.poolname
  ]
}

data "xenorchestra_pool" "pool" {
  name_label = var.poolname
}

data "xenorchestra_template" "template" {
  name_label = data.external.get_xoa_template.result.is
  pool_id    = data.xenorchestra_pool.pool.id
}

And that’s how you do it. Oh, and if you need to pin to a specific version? Change the template_name value from the partial or regex version to the full version, like this:

variable "template_name" {
  # This assumes your image was minted at midnight on 1970-01-01
  default     = "SomeLinux-version.iso-19700101000000"
}

Featured image is “Barcelos and Braga-18” by “Graeme Churchard” on Flickr and is released under a CC-BY license.

"Tickets" by "Becky Snyder" on Flickr

IP Address Management using PHPIPAM integrated with Keycloak for SAML2 Authentication

I’ve recently been working with a network estate that was a bit hard to get a handle on. It had grown organically, and was a bit tricky to allocate new network segments in. To fix this, I deployed PHPIPAM, which was super easy to setup and configure (I used the docker-compose file on the project’s docker hub page, and put it behind an NGINX server which was pre-configured with a LetsEncrypt TLS/HTTPS certificate).

PHPIPAM is a IP Address Management tool which is self-hostable. I started by setting up the “Sections” (which was the hosting environments the estate is using), and then setup the supernets and subnets in the “Subnets” section.

Already, it was much easier to understand the network topology, but now I needed to get others in to take a look at the outcome. The team I’m working with uses a slightly dated version of Keycloak to provide Single Sign-On. PHPIPAM will use SAML for authentication, which is one of the protocols that Keycloak offers. The documentation failed me a bit at this point, but fortunately a well placed ticket helped me move it along.

Setting up Keycloak

Here’s my walk through

  1. Go to “Realm Settings” in the sidebar and find the “SAML Identity Provider Metadata” (on my system it’s on the “General” tab but it might have changed position on your system). This will be an XML file, and (probably) the largest block of continuous text will be in a section marked “ds:X509Certificate” – copy this text, and you’ll need to use this as the “IDP X.509 public cert” in PHPIPAM.
  2. Go to “Clients” in the sidebar and click “Create”. If you want Keycloak to offer access to PHPIPAM as a link, the client ID needs to start “urn:” If you just want to use the PHPIPAM login option, give the client ID whatever you want it to be (I’ve seen some people putting in the URL of the server at this point). Either way, this needs to be unique. The client protocol is “saml” and the client SAML endpoint is the URL that you will be signing into on PHPIPAM – in my case https://phpipam.example.org/saml2/. It should look like this:

    Click Save to take you to the settings for this client.
  3. If you want Keycloak to offer a sign-in button, set the name of the button and description.

    Further down the page is “Root URL” set that to the SAML Endpoint (the one ending /saml2/ from before). Also set the “Valid Redirect URIs” to that too.

    Where it says “IDP Initiated SSO URL Name” put a string that will identify the client – I put phpipam, but it can be anything you want. This will populate a URL like this: https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam, which you’ll need as the “IDP Issuer”, “IDP Login URL” and “IDP Logout URL”. Put everything after the /auth/ in the box marked “Base URL”. It should look like this:

    Hit Save.
  4. Go to the “SAML Keys” tab. Copy the private key and certificate, these are needed as the “Authn X.509 signing” cert and cert key in PHPIPAM.
  5. Go to the “Mappers” tab. Create each of the following mappers;
    • A Role List mapper, with the name of “role list”, Role Attribute Name of “Role”, no friendly name, the SAML Attribute NameFormat set to “Basic” and Single Role Attribute set to on.
    • A User Attribute mapper, with the name, User Attribute, Friendly Name and SAML Attribute Name set to “email”, the SAML Attribute NameFormat set to “Basic” and Aggregate Attribute Values set to “off”.
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “display_name” and the SAML Attribute NameFormat set to “Basic”. The Script should be set to this single line: user.getFirstName() + ' ' + user.getLastName().
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “is_admin” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
is_admin = false;
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "phpipamadmins":
            is_admin = true;
            break;
    }
}
is_admin
  • Create one more mapper item:
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “groups” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
everyone_who_can_access_gets_read_only_access = false;
send_groups = "";
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "LDAP_GROUP_1":
            use_group = "IPAM_GROUP_1";
            break;
        case "LDAP_GROUP_2":
            use_group = "IPAM_GROUP_2";
            break;
    }
    if (use_group !== "") {
        if (send_groups !== "") {
          send_groups = send_groups + ","
        }
        send_groups = send_groups + use_group;
    }    
}
if (send_groups === "" && everyone_who_can_access_gets_read_only_access) {
    "Guests"
} else {
    send_groups
}

For context, the groups listed there, LDAP_GROUP_1 might be “Customer 1 Support Staff” or “ITSupport” or “Networks”, and the IPAM_GROUP_1 might be “Customer 1” or “WAN Links” or “DC Patching” – depending on the roles and functions of the teams. In my case they relate to other roles assigned to the staff member and the name of the role those people will perform in PHP IPAM. Likewise in the is_admin mapper, I’ve mentioned a group called “phpipamadmins” but this could be any relevant role that might grant someone admin access to PHPIPAM.

Late Update (2023-06-07): I’ve figured out how to enable modules now too. Create a Javascript mapper as per above, but named “modules” and have this script in it:

// Current modules as at 2023-06-07
// Some default values are set here.
noaccess       =  0;
readonly       =  1;
readwrite      =  2;
readwriteadmin =  3;
unsetperm      = -1;

var modules = {
    "*":       readonly,  "vlan":      unsetperm, "l2dom":    unsetperm,
    "devices": unsetperm, "racks":     unsetperm, "circuits": unsetperm,
    "nat":     unsetperm, "locations":  noaccess, "routing":  unsetperm,
    "pdns":    unsetperm, "customers": unsetperm
}

function updateModules(modules, new_value, list_of_modules) {
    for (var module in list_of_modules) {
        modules[module] = new_value;
    }
    return modules;
}

var GroupSet = user.getGroups();
for (var group in GroupSet) {
    switch (group.getName()) {
        case "LDAP_ROLE_3":
            modules = updateModules(modules, readwriteadmin, [
                'racks', 'devices', 'nat', 'routing'
            ]);
            break;
    }
}

var moduleList = '';

for (var key in modules) {
    if (modules.hasOwnProperty(key) && modules[key] !==-1) {
        if (moduleList !== '') {
            moduleList += ',';
        }
        moduleList += key + ':' + modules[key];
    }
}

moduleList;

OK, that’s Keycloak sorted. Let’s move on to PHPIPAM.

Setting up PHPIPAM

In the administration menu, select “Authentication Methods” and then “Create New” and select “Create new SAML2 authentication”.

In the description field, give it a relevant name, I chose SSO, but you could call it any SSO system name. Set “Enable JIT” to “on”, leave “Use advanced settings” as “off”. In Client ID put the Client ID you defined in Keycloak, probably starting urn: or https://. Leave “Strict mode” off. Next is the IDP Issuer, IDP Login URL and IDP Logout URL, which should all be set to the same URL – the “IDP Initiated SSO URL Name” from step 4 of the Keycloak side (that was set to something like https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam).

After that is the certificate section – first the IDP X.509 public cert that we got in step 1, then the “Sign Authn requests” should be set to “On” and the Authn X.509 signing cert and cert key are the private key and certificate we retrieved in step 5 above. Leave “SAML username attribute” and “SAML mapped user” blank and “Debugging” set to “Off”. It should look like this:

Hit save.

Next, any groups you specified in the groups mapper need to be defined. This is in Administration -> Groups. Create the group name and set a description.

Lastly, you need to configure the sections to define whigh groups have access. Each defined group gets given four radio buttons; “na” (no access), “ro” (read only), “rw” (read write) and “rwa” (read, write and administrate).

Try logging in. It should just work!

Debugging

If it doesn’t, and checking all of the above doesn’t help, I’ve tried adding some code into the PHP file in app/saml2/index.php, currently on line 149, above where it says:

if (empty($auth->getAttribute("display_name")[0])) {                                                                                    
    $Result->show("danger", _("Mandatory SAML JIT attribute missing")." : display_name (string)", true);
}

This code is:

$outfile = fopen("/tmp/log.out", "w");                                                           
fwrite($outfile, var_export($auth, true));                                                                     
fclose($outfile);

**REMEMBER THIS IS JUST FOR TESTING PURPOSES AND SHOULD BE REMOVED ASAP**

In here is an array called _attributes which will show you what has been returned from the Keycloak server when someone tries to log in. In my case, I got this:

   '_attributes' =>
  array (
    'groups' => 
    array (
      0 => 'PHPIPAM_GROUP_1',
    ),
    'is_admin' => 
    array (
      0 => 'false',
    ),
    'display_name' => 
    array (
      0 => 'Jon Spriggs',
    ),
    'email' => 
    array (
      0 => 'spriggsj@example.org',
    ),
  )

If you get something back here that isn’t what you expected, now at least you have a fighting chance of finding where that issue was! Good luck!!

Featured image is “Tickets” by “Becky Snyder” on Flickr and is released under a CC-BY license.

Quick tip: How to stop package installations from auto-starting server services with Debian based distributions (like Ubuntu)

I’m working on another toy project to understand a piece of software a little better, and to make it work, I needed to install dnsmasq inside an Ubuntu-based virtual machine. The problem with this is that Ubuntu already runs systemd-resolved to perform DNS lookups, and Debian likes to start server services as soon as it’s installed them. So how do we work around this? Well, actually, it’s pretty simple.

Thanks to this blog post from 2013, I found out that if you create an executable script called /usr/sbin/policy-rc.d with the content:

exit 101

This will stop all services in the dpkg/apt process from running on install, so I was able to do this:

echo 'exit 101' >> /usr/sbin/policy-rc.d
chmod +x /usr/sbin/policy-rc.d
apt update
apt install dnsmasq -y
systemctl disable --now systemd-resolved
# Futz with dnsmasq config
systemctl enable --now dnsmasq
dig example.com

Brilliant

Picture of a comms rack with a patch panel, a unifi USW-Pro-24 switch, two Dell Optiplex 3040M computers, two external hard drives and a Raspberry Pi.

Building a Highly Available (HA) two-node Home Lab on Proxmox

Warning, this is a long and dense document!

That said, If you’re thinking of getting started with Proxmox though it’s well worth a read. If you’ve *used* Proxmox, and think I’m doing something wrong here, let me know in the comments!

Context

In the various podcasts I listen to, I’ve been hearing over and over again about Proxmox, and how it’s a great system for building and running virtual machines. In a former life, I’d use a combination of VMWare ESXi servers or desktop machines running Vagrant and Virtualbox to build out small labs and build environments, and at home I’d previously used a i3 ex-demo machine that was resold to staff at a reduced price. Unfortunately, the power supply went pop one evening on that, and all my home-lab experiments died.

When I changed to my most recent job, I had a small cash windfall at the same time, and decided to rebuild my home lab. I bought two Dell Optiplex 3040M i5 with 16GB RAM and two 3TB external USB3 hard drives to provide storage. These were selected because of the small size which meant they would fit in the small comms rack I had fitted when I got my house wired with CAT6 networking cables last year. These were patched into the UniFi USW-Pro-24 which was fitted as part of the networking build.

Picture of a comms rack with a patch panel, a unifi USW-Pro-24 switch, two Dell Optiplex 3040M computers, two external hard drives and a Raspberry Pi.

(Yes, it’s a bit of a mess, but it’s also not been in there very long, so needs a bit of a clean-up!)

The Install

I allocated two static IP addresses for these hosts, and performed a standard installation of Proxmox using a USB stick with the multi-image-installer Ventoy on it.

Some screenshots follow:

Proxmox installation screen showing the EULA
Proxmox installation screen showing the installation target
Proxmox installation screen showing the location and timezone settings
Proxmox installation screen showing the prompt for credentials and contact email address
Proxmox installation screen showing the IP address and hostname selection screen

Note that these screenshots were built on one pass, and have been rebuilt with new IPs that are used later.

Proxmox installation screen showing the summary of all the options selected
Proxmox installation screen showing the actual installation details and an advert for why you should use it.
Proxmox installation screen showing the success screen

As I don’t have an enterprise subscription, I ran these commands to use tteck’s Post PVE Install script to change the repositories.

wget https://raw.githubusercontent.com/tteck/Proxmox/main/misc/post-pve-install.sh
# Run the following to confirm the download looks OK and non-corrupted
less post-pve-install.sh
bash post-pve-install.sh

This results in the following (time-lapse) output, which is a series of options asking you to approve making changes to the system.

A time-lapse video of what happens during the post-pve-install script.

[Most of the following are derived from this YouTube video: “1/2 Create a 2-node Proxmox VE Cluster. Gluster as shared storage. With High Availability! First ep”]

Clustering

After signing into both Proxmox nodes, I went to my first node (proxmox01), selected “Datacenter” and then “Cluster”.

An image of the Proxmox server selecting the cluster screen

I clicked on “Create Cluster”, and created a cluster, called (unimaginatively) proxmox-cluster.

The create cluster dialogue box
The task completion details for the create cluster action

I clicked “Join Information”.

A screenshot showing the "Join information" button
The join information dialogue box

Next, on proxmox02 on the same screen, I clicked on “Join Cluster” and then pasted that information into the dialogue box. I entered the root password, and clicked “Join ‘proxmox-cluster'”.

A screenshot of the proxmox cluster, showing where the "join cluster" button is.
The Cluster Join screen, showing the pasted in text from the other cluster and that the password has been entered.

When this finished running, if either screen has hung, check whether one of the screens is showing an error like permission denied - invalid PVE ticket (401), like this (hidden just behind the “Task Viewer: Join Cluster” dialogue box):

A screen shot showing the error message "permission denied - invalid PVE ticket"

Or /etc/pve/nodes/NODENAME/pve-ssl.pem' does not exist! (500):

A screen shot of the error message "pve-ssl.pem does not exist"

Refresh your browsers, and you’ll probably find that the joining node will present a new TLS certificate:

A screen shot of Firefox's "unknown certificate" screen

Accept the certificate to resume the process.

To ensure I had HA quorum, which requires three nodes, I added an unused Raspberry Pi 3 running Raspberry Pi OS.

With that, I enabled root SSH access:

echo "PermitRootLogin yes" | tee /etc/ssh/sshd_config.d/root_login.conf >/dev/null && systemctl restart ssh.service

Next, I setup a password for the root account:

sudo passwd

And I installed the package “corosync-qnetd” on it:

sudo apt update && sudo apt install -y corosync-qnetd

Back on both of the Proxmox nodes, I installed the package “corosync-qdevice”:

apt update && apt install -y corosync-qdevice
A screen shot of the installation of the corosync-qdevice package having completed

On proxmox01 I then ran pvecm qdevice setup 192.168.1.179(where 192.168.1.179 is the IP address of the Raspberry Pi device).

A screen shot of the first half of of the setup of the command pvecm qdevice setup
A screen shot of the second half of of the setup of the command pvecm qdevice setup

This gave me my quorum of 3 nodes. To confirm this, I ran pvecm statuswhich resulted in this output:

root@proxmox01:~# pvecm status
Cluster information
-------------------
Name:             proxmox-cluster
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue May 16 20:38:15 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate Qdevice 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.1.200 (local)
0x00000002          1    A,V,NMW 192.168.1.201
0x00000000          1            Qdevice
root@proxmox01:~#
A screen shot of the output from the pvecm status command.

Storage

ZFS

Once the machines were built, I went into the Disks screen on each node, found the 3TB drive and selected “Wipe Disk”.

A screenshot of the disks page, showing the location of the "wipe disk" button.
A confirmation screen shot asking if I want to format the disk.
The completion screen shot for the wipe disk action

Next I clicked “Initialize Disk with GPT”.

The disk screen showing the location of the "Initialize Disk with GPT" button
The completion screenshot for initializing the disk

Next I went into the ZFS page in the node and created a ZFS Single Disk pool.

The ZFS screen shot, showing the location of the "Create: ZFS" button.

This pool was named “zfs-proxmox##” where “##” was replaced by the node number (so zfs-proxmox01 and zfs-proxmox02).

A screen shot of the options for creating the ZFS pool.

This mounts the pool as the pool name in the root (so /zfs-proxmox01 and /zfs-proxmox02).

A screen shot confirming that the disks have been mounted

GlusterFS

I added the Gluster debian repository by downloading the key from https://download.gluster.org/pub/gluster/glusterfs/10/rsa.pub and placing it in /etc/apt/keyrings/gluster.asc.

mkdir /etc/apt/keyrings
cd /etc/apt/keyrings
wget https://download.gluster.org/pub/gluster/glusterfs/10/rsa.pub
mv rsa.pub gluster.asc
A screen shot showing that the gluster key has been added to the system

Next I created a new repository entry in /etc/apt/sources.list.d/gluster.listwhich contained the line:

deb [arch=amd64 signed-by=/etc/apt/keyrings/gluster.asc] https://download.gluster.org/pub/gluster/glusterfs/10/LATEST/Debian/bullseye/amd64/apt bullseye main
A screenshot showing the apt repository being added to the system

I next ran apt update && apt install -y glusterfs-serverwhich installed the Gluster service.

A screen shot showing the installation of glusterfs-server in progress
A screenshot showing the completion of the glusterfs-server package and it's dependencies having been installed.

Following the YouTube link above, I created an entry for gluster01 and gluster02 in /etc/hosts which pointed to the IP address of proxmox01 and proxmox02 respectively.

A screen shot of editing the hosts file

Next, I edited /etc/glusterfs/glusterd.volso it contained this content:

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option transport.socket.listen-port 24007
    option transport.rdma.bind-address gluster01
    option transport.socket.bind-address gluster01
    option transport.tcp.bind-address gluster01
    option ping-timeout 0
    option event-threads 1
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
    option max-port  60999
end-volume
A screen shot of editing the glusterd.vol file.

Note that this content above is for proxmox01. For proxmox02 I replaced “gluster01” with “gluster02”. I then ran systemctl enable --now glusterdwhich started the Gluster service.

Once this is done, you must run gluster probe gluster02from proxmox01 (or vice versa), otherwise, when you run the next command, you get this message:

volume create: gluster-volume: failed: Host gluster02 is not in 'Peer in Cluster' state
A screen shot of the error message issued when you've not run gluster probe before creating the volume

(This takes some backing out… ugh)

On proxmox01, I created the volume using this command:

gluster volume create gluster-volume replica 2 gluster01:/zfs-proxmox01/gluster-volume gluster02:/zfs-proxmox02/gluster-volume
A screen shot of creating the gluster volume.

As you can see in the above screenshot, this warned about split brain situations. However, as this is for my home lab, I accepted the risk here. Following the YouTube video again, I ran these commands to “avoid [a] split-brain situation”:

gluster volume start gluster-volume
gluster volume set gluster-volume cluster.heal-timeout 5
gluster volume heal gluster-volume enable
gluster volume set gluster-volume cluster.quorum-reads false
gluster volume set gluster-volume cluster.quorum-count 1
gluster volume set gluster-volume network.ping-timeout 2
gluster volume set gluster-volume cluster.favorite-child-policy mtime
gluster volume heal gluster-volume granular-entry-heal enable
gluster volume set gluster-volume cluster.data-self-heal-algorithm full
A screenshot of the output of all the commands issued to prevent a gluster split brain scenario

I created /gluster-volume on both proxmox01 and proxmox02, and then added this line to /etc/fstab(yes, I know it should really have been a systemd mount unit) on proxmox01:

gluster01:gluster-volume /gluster-volume glusterfs defaults,_netdev,x-systemd.automount,backupvolfile-server=gluster02 0 0
A screen shot of the command issued to add the gluster volume to fstab

And on proxmox02:

gluster02:gluster-volume /gluster-volume glusterfs defaults,_netdev,x-systemd.automount,backupvolfile-server=gluster01 0 0

On both systems, I ensured that /gluster-volume was created, and then ran mount -a.

The result of adding the line to staband then mounting the volume.

In the Proxmox UI, I went to the “Datacenter” and selected “Storage”, then “Add” and selected “Directory”.

A screen shot of adding a directory to the proxmox server

I set the ID to “gluster-volume”, the directory to “/gluster-volume”, ticked the “Shared” box and selected all the content types (it looks like a list box, but it’s actually a multi-select box).

The Add Directory dialogue screen shot

(I forgot to click “Shared” before I selected all the items under “Content” here.)

I clicked Add and it was available on both systems then.

A screen shot proving that the gluster volume has been added.

Backups

This one saved me from having to rebuild my Home Assistant system last week! Go into “Datacenter” and select the “Backup” option.

A screen shot of the backup screen in Proxmox, showing the location of the "add" button.

Click the “Add” button, select the storage you’ve just configured (gluster-volume) and a schedule (I picked daily at 04:00) and choose “Selection Mode” of “All”.

A screenshot of the dialogue box for creating the backup job

On the retention tab, I entered the number 3 for “Keep Daily”, “Keep Weekly”, “Keep Monthly” and “Keep Yearly”. Your retention needs are likely to be different to mine!

A screenshot of the dialogue box for creating the retention in the backup job
Proof that the backup job has been created.

If you end up needing to restore one of these backups, you need a different tool depending on whether it’s a LXC container or a QEMU virtual machine. For a container, you’d run:

vmid=199
pct restore $vmid /path/to/backup-file

For a virtual machine, you’d run:

vmid=199
qmrestore /path/to/backup-file $vmid

…and yes, you can replace the vmid=199 \n $vmidwith just the number for the VMID like this:

pct restore 123 /backup/vzdump-lxc-100-1970_01_01-04_00_00.tar.zst

If you need to point the storage at a different device (perhaps Gluster broke, or your external drive) you’d add --storage storage-label(e.g. --storage local-lvm)

Networking

The biggest benefit for me of a home lab is being able to build things on their own VLAN. A VLAN allows a single network interface to carry traffic for multiple logical networks, in such a way that other ports on the switch which aren’t configured to carry that logical network can’t access that traffic.

For example, I’ve configured my switch to have a new VLAN on it, VLAN 30. This VLAN is exposed to the two Proxmox servers (which can access all the VLANs) and also the port to my laptop. This means that I can run virtual machines on VLAN 30 which can’t be accessed by any other machine on my network.

There are two ways to do this, the “easy way” and the “explicit way”. Both ways produce the same end state, it’s just down to which makes more logical sense in your head.

In both routes, you must create the VLANs on your switch first – I’m just addressing the way of configuring Proxmox to pass this traffic to your network switch.

Note that these VLAN tagged interfaces also don’t have a DHCP server or Internet gateway (unless you create one), so any addresses will need to be manually configured in any installation screens.

The easy way

Go into the individual nodes and select the Network option in the sidebar (nested under “System”). You’ll need to perform these actions on both nodes.

Click on the “Linux Bridge” line which is aligned to your “trunked” network interface. For me, as I have a single network interface (enp2s0) I have a single Linux Bridge (vmbr0). Click “Edit” and tick the “VLAN aware” box and click “OK”.

A screen shot showing how to add VLAN awareness to the linux bridge configuration.
A screen shot showing the changes to /etc/network/interfaces

When you now create your virtual machines, on the hardware option in the sidebar, find the network interface and enter the VLAN tag you want to assign.

A screen shot showing how to configure the VLAN tag when creating a new virtual machine in Proxmox

(This screenshot shows no VLAN tag added, but it’s fairly clear where you’d put that tag in there)

The explicit way

Go into the individual nodes and select the Network option in the sidebar. You’ll need to perform all the steps in the section on both nodes!

Create a new “Linux VLAN” object.

A screen shot showing where to add the van on the proxmox node.

Call it by the name of the interface (e.g. enp2s0) followed by a dot and then the VLAN tag, like this enp2s0.30. Click Create.

A screenshot of the dialogue box for creating a VLAN tagged interface

Next create a new “Linux Bridge”.

A screen shot showing where to find the Bridge interface button

Call it vmbr and then the VLAN tag, like this vmbr30. Set the ports to the VLAN you just created (enp2s0.30)

A screen shot of the creation of the  bridge interface, with the addition of the bridge port previously created.
A screen shot of the changes to the /etc/network/interfaces screen.

(I should note that I added the comment between writing this guide and taking these screen shots)

When you create your virtual machines select this bridge for accessing that VLAN.

A screen shot of the selection of the VLAN tagged bridge.

Making machines run in “HA”

If you haven’t already done the part with the QDevice under clustering, go back there and run those steps! You need quorum to do this right!

YOU MUST HAVE THE SAME NETWORK AND STORAGE CONFIGURATION FOR HIGH AVAILABILITY AND MIGRATIONS. This means every VM which you want to migrate from proxmox01 to proxmox02 must use the same network interface and storage device, no matter which host it’s connected to.

  • If you’re connecting enp2s0 to VLAN 55 by using a VLAN Bridge called vmbr55, then both nodes need this VLAN Bridge available. Alternatively, if you’re using a VLAN tag on vmbr0, that’s fine, but both nodes need to have vmbr0 set to be “VLAN aware”.
  • If you’re using a disk on gluster-volume, this must be shared across the cluster

Go to “Datacenter” and select “Groups” which is nested under “HA” in the sidebar.

A screen shot of where to find the HA Group Creation button.

Create a new group (again, unimaginatively, I went with “proxmox”). Select both nodes and press Create.

A screen shot of the HA Group Creation dialogue box.

Now go to the “HA” option in the sidebar and verify you have quorum, although it doesn’t matter which is the master.

A screen shot showing how to verify the HA quorum status

Under resources on that page, click “Add”.

A screen shot showing where the add button is to enable HA of a virtual machine.

In the VM box, select the ID for the container or virtual machine you want to be highly available and click Add.

A screen shot of the dialogue box when setting up high availability of a virtual machine.

This will restart that machine or container in HA mode.

A screen shot showing the HA status of that virtual machine.

The wrap up!

So, after all of this, there’s still no virtual machines running (well, that Ubuntu Desktop is created but not running yet!) and I’ve not even started playing around with Terraform yet… but I’m feeling really positive about Proxmox. It’s close enough to the proprietary solutions I’ve used at work in the past that I’m reasonably comfortable with it, but it’s open enough to mess around under the surface. I’m looking forward to doing more experiments!

The featured image is of the comms rack in my garage showing how bad my wiring is when I can’t get to the back of a rack!! It’s released under a CC-0 license.

Using multiple GitHub accounts from the Command Line with Environment Variables (using `direnv`) and per-account SSH keys

I recently was in the situation where I had two github profiles (one work, one personal) that I needed to incorporate in projects.

My work account on this device is my “default”, I use it to push, pull and so on, but the occasional personal activities (like terminate-notice) all should be attributed to my personal account.

To make this happen, I used direnv which reads a .envrcfile in the parents of the directory you’re currently in. I created a directory for my personal projects – ~/Code/Personaland placed a .envrc file which contains:

export GIT_AUTHOR_EMAIL=jon@sprig.gs
export GIT_COMMITTER_EMAIL=jon@sprig.gs
export GIT_SSH_COMMAND="ssh -i ~/.ssh/personal.id_ed25519"
export SSH_AUTH_SOCK=

This means that I have a specific SSH key just for my personal activities (~/.ssh/personal.id_ed25519) and I’ve got my email address defined as two environment variables – AUTHOR (who wrote the code) and COMMITTER (who added it to the tree) – both are required when you’re changing them like this!

Because I don’t ever want it to try to use my SSH Agent, I’ve added the fact that SSH_AUTH_SOCK should be empty.

As an aside, work also require Commit Signing, but I don’t want to use that for my personal projects right now, so I also discovered a new feature as-of 2020 – the environment variables GIT_CONFIG_KEY_x, GIT_CONFIG_VALUE_x and GIT_CONFIG_COUNT=x

By using these, you can override any system, global and repo-level configuration values, like this:

export GIT_CONFIG_KEY_0=commit.gpgSign
export GIT_CONFIG_VALUE_0=false
export GIT_CONFIG_KEY_1=push.gpgSign
export GIT_CONFIG_VALUE_1=false
export GIT_CONFIG_KEY_2=tag.gpgSign
export GIT_CONFIG_VALUE_2=false
export GIT_CONFIG_COUNT=2

This ensures that I *will not* GPG Sign commits, tags or pushes.

If I accidentally cloned a repo into an unusual location, or on purpose need to make a directory or submodule a personal repo, I just copy the .envrc file into that part of the tree, run direnv allowand hey-presto! I’ve turned that area into a personal repo, without having to remember the .gitconfigstring to mark a new part of my tree as a personal one.

The direnv and SSH part was largely inspired by : Handle multiple github accounts while the GIT_CONFIG_* bit was found via this StackOverflow answer.

Featured image is “Mirrored Lotus” by “Faye Mozingo” on Flickr and is released under a CC-BY-SA license.

"Apoptosis Network (alternate)" by "Simon Cockell" on Flickr

Multipass on Ubuntu with Bridged Network Interfaces

I’m working on a new project, and I am using Multipass on an Ubuntu machine to provision some virtual machines on my local machine using cloudinit files. All good so far!

I wanted to expose one of the services I’ve created to the bridged network (so I can run avahi-daemon), and did this by running multipass launch -n vm01 --network enp3s0 when, what should I see but: launch failed: The bridging feature is not implemented on this backend. OH NO!

By chance, I found a random Stack Overflow answer, which said:

Currently only the LXD driver supports the networks command on Linux.

So, let’s make multipass on Ubuntu use LXD! (Be prepared for entering your password a few times!)

Firstly, we need to install LXD. Dead simple:

snap install lxd

Next, we need to tell snap that it’s allowed to connect LXD to multipass:

snap connect multipass:lxd lxd

And lastly, we tell multipass to use lxd:

multipass set local.driver=lxd

Result?

user@host:~$ multipass networks
Name             Type      Description
enp3s0           ethernet  Ethernet device
mpbr0            bridge    Network bridge for Multipass

And when I brought my machine up with avahi-daemon installed and configured to broadcast it’s hostname?

user@host:~$ ip -4 addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
37: br-enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.0.2.33/24 brd 192.0.2.255 scope global dynamic noprefixroute br-enp3s0
       valid_lft 6455sec preferred_lft 6455sec
user@host:~$ multipass list
Name         State       IPv4             Image
vm01         Running     203.0.113.15     Ubuntu 22.04 LTS
                         192.0.2.101
user@host:~$ ping vm01.local
PING vm01.local (192.0.2.101) 56(84) bytes of data.

Tada!

Featured image is “Apoptosis Network (alternate)” by “Simon Cockell” on Flickr and is released under a CC-BY license.

"Sensitive Species" by "Rennett Stowe" on Flickr

HOWTO: Do DynDNS-style (DDNS) updates with Terraform (without leaking your credentials in the console)

For some of my projects, I run a Dynamic DNS server service attached to one of the less-standard DNS Names I own, and use that to connect to the web pages I’m spinning up. In a recent demo, I noticed that the terraform “changes” log where it shows what things are being updated showed the credentials I was using, because I was using “simple” authentication, like this:

data "http" "ddns_web" {
  url = "https://my.ddns.example.org/update?secret=${var.ddns_secret}&domain=web&addr=192.0.2.1"
}

variable "ddns_secret" {
  default = "bob"
}

For context, that would ask the DDNS service running at ddns.example.org to create a DNS record for web.ddns.example.org with an A record of 192.0.2.1.

While this is fine for my personal projects, any time this goes past, anyone who spots that update line would see the credentials I use for this service. Not great.

I had a quick look at the other options I had for authentication, and noticed that the DDNS server I’m running also supports the DynDNS update mechanism. In that case, we need to construct things a little differently!

data "http" "ddns_web" {
  url             = "https://my.ddns.example.org/nic/update?hostname=web&myip=192.0.2.1"
  request_headers = {
    Authorization = "Basic ${base64encode("user:${var.ddns_secret}")}"
  }
}

variable "ddns_secret" {
  type      = string
  sensitive = true
  default   = "bob"
}

So now, we change the URL to include the /nic/ path fragment, we use different names for the variables and we’re using Basic Authentication which is a request header. It’s a little frustrating that the http data source doesn’t also have a query type or a path constructor we could have used, but…

In this context the request header of “Authorization” is a string starting “Basic” but then with a Base64 encoded value of the username (which for this DDNS service, can be anything, so I’ve set it as the word “user”), then a colon and then the password. By setting the ddns_secret variable as being “sensitive”, if I use terraform console, and ask it for the value of data.http.ddns_web I get

> data.http.ddns_web
{
  "body" = <<-EOT
  good 192.0.2.1
  
  EOT
  "id" = "https://my.ddns.example.org/nic/update?hostname=web&myip=192.0.2.1"
  "request_headers" = tomap({
    "Authorization" = (sensitive)
  })
  "response_body" = <<-EOT
  good 192.0.2.1
  
  EOT
  "response_headers" = tomap({
    "Content-Length" = "18"
    "Content-Type" = "text/plain; charset=utf-8"
    "Date" = "Thu, 01 Jan 1970 00:00:00 UTC"
    "Server" = "nginx"
    "Strict-Transport-Security" = "max-age=31536000; includeSubDomains"
    "X-Content-Type-Options" = "nosniff"
    "X-Xss-Protection" = "1; mode=block"
  })
  "url" = "https://my.ddns.example.org/nic/update?hostname=web&myip=192.0.2.1"
}
>

Note that if your DDNS service has a particular username requirement, this can also be entered, in the same way, by changing the string “user” to something like ${var.ddns_user}.

Featured image is “Sensitive Species” by “Rennett Stowe” on Flickr and is released under a CC-BY license.

"Catch and Release" by "Trish Hamme" on Flickr

Releasing files for multiple operating systems with Github Actions in 2021

Hi! Long time, no see!

I’ve been working on my Decision Records open source project for a few months now, and I’ve finally settled on the cross-platform language Rust to create my script. As a result, I’ve got a build process which lets me build for Windows, Mac OS and Linux. I’m currently building a single, unsigned binary for each platform, and I wanted to make it so that Github Actions would build and release these three files for me. Most of the guidance which is currently out there points to some unmaintained actions, originally released by GitHub… but now they point to a 3rd party “release” action as their recommended alternative, so I thought I’d explain how I’m using it to release on several platforms at once.

Although I can go into detail about the release file I’m using for Rust-Decision-Records, I’m instead going to provide a much more simplistic view, based on my (finally working) initial test run.

GitHub Actions

GitHub have a built-in Continuous Integration, Continuous Deployment/Delivery (CI/CD) system, called GitHub Actions. You can have several activities it performs, and these are executed by way of instructions in .github/workflows/<somefile>.yml. I’ll be using .github/workflows/build.yml in this example. If you have multiple GitHub Action files you wanted to invoke (perhaps around issue management, unit testing and so on), these can be stored in separate .yml files.

The build.yml actions file will perform several tasks, separated out into two separate activities, a “Create Release” stage, and a “Build Release” stage. The Build stage will use a “Matrix” to execute builds on the three platforms at the same time – Linux AMD64, Windows and Mac OS.

The actual build steps? In this case, it’ll just be writing a single-line text file, stating the release it’s using.

So, let’s get started.

Create Release

A GitHub Release is typically linked to a specific “tagged” commit. To trigger the release feature, every time a commit is tagged with a string starting “v” (like v1.0.0), this will trigger the release process. So, let’s add those lines to the top of the file:

name: Create Release

on:
  push:
    tags:
      - 'v*'

You could just as easily use the filter pattern ‘v[0-9]+.[0-9]+.[0-9]+’ if you wanted to use proper Semantic Versioning, but this is a simple demo, right? 😉

Next we need the actual action we want to start with. This is at the same level as the “on” and “name” tags in that YML file, like this:

jobs:
  create_release:
    name: Create Release
    runs-on: ubuntu-latest
    steps:
      - name: Create Release
        id: create_release
        uses: softprops/action-gh-release@v1
        with:
          name: ${{ github.ref_name }}
          draft: false
          prerelease: false
          generate_release_notes: false

So, this is the actual “create release” job. I don’t think it matters what OS it runs on, but ubuntu-latest is the one I’ve seen used most often.

In this, you instruct it to create a simple release, using the text in the annotated tag you pushed as the release notes.

This is using a third-party release action, softprops/action-gh-release, which has not been vetted by me, but is explicitly linked from GitHub’s own action.

If you check the release at this point, (that is, without any other code working) you’d get just the source code as a zip and a .tgz file. BUT WE WANT MORE! So let’s build this mutha!

Build Release

Like with the create_release job, we have a few fields of instructions before we get to the actual actions it’ll take. Let’s have a look at them first. These instructions are at the same level as the jobs:\n create_release: line in the previous block, and I’ll have the entire file listed below.

  build_release:
    name: Build Release
    needs: create_release
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        include:
          - os: ubuntu-latest
            release_suffix: ubuntu
          - os: macos-latest
            release_suffix: mac
          - os: windows-latest
            release_suffix: windows
    runs-on: ${{ matrix.os }}

So this section gives this job an ID (build_release) and a name (Build Release), so far, so exactly the same as the previous block. Next we say “You need to have finished the previous action (create_release) before proceeding” with the needs: create_release line.

But the real sting here is the strategy:\n matrix: block. This says “run these activities with several runners” (in this case, an unspecified Ubuntu, Mac OS and Windows release (each just “latest”). The include block asks the runners to add some template variables to the tasks we’re about to run – specifically release_suffix.

The last line in this snippet asks the runner to interpret the templated value matrix.os as the OS to use for this run.

Let’s move on to the build steps.

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Run Linux Build
        if: matrix.os == 'ubuntu-latest'
        run: echo "Ubuntu Latest" > release_ubuntu
      
      - name: Run Mac Build
        if: matrix.os == 'macos-latest'
        run: echo "MacOS Latest" > release_mac

      - name: Run Windows Build
        if: matrix.os == 'windows-latest'
        run: echo "Windows Latest" > release_windows

This checks out the source code on each runner, and then has a conditional build statement, based on the OS you’re using for each runner.

It should be fairly simple to see how you could build this out to be much more complex.

The final step in the matrix activity is to add the “built” file to the release. For this we use the softprops release action again.

      - name: Release
        uses: softprops/action-gh-release@v1
        with:
          tag_name: ${{ needs.create_release.outputs.tag-name }}
          files: release_${{ matrix.release_suffix }}

The finished file

So how does this all look when it’s done, this most simple CI/CD build script?

name: Create Release

on:
  push:
    tags:
      - 'v*'

jobs:
  create_release:
    name: Create Release
    runs-on: ubuntu-latest
    steps:
      - name: Create Release
        id: create_release
        uses: softprops/action-gh-release@v1
        with:
          name: ${{ github.ref_name }}
          draft: false
          prerelease: false
          generate_release_notes: false

  build_release:
    name: Build Release
    needs: create_release
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        include:
          - os: ubuntu-latest
            release_suffix: ubuntu
          - os: macos-latest
            release_suffix: mac
          - os: windows-latest
            release_suffix: windows
    runs-on: ${{ matrix.os }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Run Linux Build
        if: matrix.os == 'ubuntu-latest'
        run: echo "Ubuntu Latest" > release_ubuntu
      
      - name: Run Mac Build
        if: matrix.os == 'macos-latest'
        run: echo "MacOS Latest" > release_mac

      - name: Run Windows Build
        if: matrix.os == 'windows-latest'
        run: echo "Windows Latest" > release_windows

      - name: Release
        uses: softprops/action-gh-release@v1
        with:
          tag_name: ${{ needs.create_release.outputs.tag-name }}
          files: release_${{ matrix.release_suffix }}

I hope this helps you!

My Sources and Inspirations

Featured image is “Catch and Release” by “Trish Hamme” on Flickr and is released under a CC-BY license.