"Killer travel plug and socket board" by "Ashley Basil" on Flickr

Testing and Developing WordPress Plugins using Vagrant to provide the test environment

I keep trundling back to a collection of WordPress plugins that I really love. And sometimes I want to contribute patches to the plugin.

I don’t want to develop against this server (that would be crazy… huh… right… no one does that… *cough*) but instead, I want a nice, fresh and new WordPress instance to just check that it works the way I was expecting.

So, I created a little Vagrant environment, just for testing WordPress plugins. I clone the repository for the plugin, and create a “TestingEnvironment” directory in there.

I then create the following Vagrantfile.

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/jammy64"
  # This will create an IP address in the range 192.168.64.0/24 (usually)
  config.vm.network "private_network", type: "dhcp"
  # This loads the git repo for the plugin into /tmp/git_repo
  config.vm.synced_folder "../", "/tmp/git_repo"

  # If you've got vagrant-cachier, this will speed up apt update/install operations
  if Vagrant.has_plugin?("vagrant-cachier")
    config.cache.scope = :box
  end

  config.vm.provision "shell", inline: <<-SHELL

    # Install Dependencies
    apt-get update
    apt-get install -y apache2 libapache2-mod-fcgid php-fpm mysql-server php-mysql git

    # Set up Apache
    a2enmod proxy_fcgi setenvif
    a2enconf "$(basename "$(ls /etc/apache2/conf-available/php*)" .conf)"
    systemctl restart apache2
    rm -f /var/www/html/index.html

    # Set up WordPress
    bash /vagrant/root_install_wordpress.sh
  SHELL
end

Next, let’s create that root_install_wordpress.sh file.

#! /bin/bash

# Allow us to run commands as www-data
chsh -s /bin/bash www-data
# Let www-data access files in the web-root.
chown -R www-data:www-data /var/www

# Install wp-cli system-wide
curl -s -S -O https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar
mv wp-cli.phar /usr/local/bin/wp
chmod +x /usr/local/bin/wp

# Slightly based on 
# https://www.a2hosting.co.uk/kb/developer-corner/mysql/managing-mysql-databases-and-users-from-the-command-line
echo "CREATE DATABASE wp;" | mysql -u root
echo "CREATE USER 'wp'@'localhost' IDENTIFIED BY 'wp';" | mysql -u root
echo "GRANT ALL PRIVILEGES ON wp.* TO 'wp'@'localhost';" | mysql -u root
echo "FLUSH PRIVILEGES;" | mysql -u root

# Execute the generic install script
su - www-data -c bash -c /vagrant/user_install_wordpress.sh
# Install any plugins with this script
su - www-data -c bash -c /vagrant/customise_wordpress.sh
# Log the path to access
echo "URL: http://$(sh /vagrant/get_ip.sh) User: admin Password: password"

Now we have our dependencies installed and our database created, let’s get WordPress installed with user_install_wordpress.sh.

#! /bin/bash

# Largely based on https://d9.hosting/blog/wp-cli-install-wordpress-from-the-command-line/
cd /var/www/html
# Install the latest WP into this directory
wp core download --locale=en_GB
# Configure the database with the credentials set up in root_install_wordpress.sh
wp config create --dbname=wp --dbuser=wp --dbpass=wp --locale=en_GB
# Skip the first-run-wizard
wp core install --url="http://$(sh /vagrant/get_ip.sh)" --title=Test --admin_user=admin --admin_password=password --admin_email=example@example.com --skip-email
# Setup basic permalinks
wp option update permalink_structure ""
# Flush the rewrite schema based on the permalink structure
wp rewrite structure ""

Excellent. This gives us a working WordPress environment. Now we need to add our customisation – the plugin we’re deploying. In this case, I’ve been tweaking the “presenter” plugin so here’s the customise_wordpress.sh code:

#! /bin/bash

cd /var/www/html/wp-content/plugins
git clone /tmp/git_repo presenter --recurse-submodules
wp plugin activate presenter

Actually, that /tmp/git_repo path is a call-back to this line in the Vagrantfile: config.vm.synced_folder "../", "/tmp/git_repo".

And there you have it; a vanilla WordPress install, with the plugin installed and ready to test. It only took 4 years to write up a blog post for it!

As an alternative, you could instead put the plugin you’re working with in a subdirectory of the Vagrantfile and supporting files, then you’d just need to change that git clone /tmp/git_repo line to git clone /vagrant/MyPlugin – but then you can’t offer this to the plugin repo as a PR, can you? 😀

Featured image is “Killer travel plug and socket board” by “Ashley Basil” on Flickr and is released under a CC-BY license.

"Fishing fleet" by "Nomad Tales" on Flickr

Using Terraform to select multiple Instance Types for an Autoscaling Group in AWS

Tale as old as time, the compute instance type you want to use in AWS is highly contested (or worse yet, not as available in every availability zone in your region)! You plead with your TAM or AM “Please let us have more of that instance type” only to be told “well, we can put in a request, but… haven’t you thought about using a range of instance types”?

And yes, I’ve been on both sides of that conversation, sadly.

The commented terraform

# This is your legacy instance_type variable. Ideally we'd have
# a warning we could raise at this point, telling you not to use
# this variable, but... it's not ready yet.
variable "instance_type" {
  description = "The legacy single-instance size, e.g. t3.nano. Please migrate to instance_types ASAP. If you specify instance_types, this value will be ignored."
  type        = string
  default     = null
}

# This is your new instance_types value. If you don't already have
# some sort of legacy use of the instance_type variable, then don't
# bother with that variable or the locals block below!
variable "instance_types" {
  description = "A list of instance sizes, e.g. [t2.nano, t3.nano] and so on."
  type        = list(string)
  default     = null
}

# Use only this locals block (and the value further down) if you
# have some legacy autoscaling groups which might use individual
# instance_type sizes.
locals {
  # This means if var.instance_types is not defined, then use it,
  # otherwise create a new list with the single instance_type
  # value in it!
  instance_types = var.instance_types != null ? var.instance_types : [ var.instance_type ]
}

resource "aws_launch_template" "this" {
  # The prefix for the launch template name
  # default "my_autoscaling_group"
  name_prefix = var.name

  # The AMI to use. Calculated outside this process.
  image_id = data.aws_ami.this.id

  # This block ensures that any new instances are created
  # before deleting old ones.
  lifecycle {
    create_before_destroy = true
  }

  # This block defines the disk size of the root disk in GB
  block_device_mappings {
    device_name = data.aws_ami.centos.root_device_name
    ebs {
      volume_size = var.disksize # default "10"
      volume_type = var.disktype # default "gp2"
    }
  }

  # Security Groups to assign to the instance. Alternatively
  # create a network_interfaces{} block with your
  # security_groups = [ var.security_group ] in it.
  vpc_security_group_ids = [ var.security_group ]

  # Any on-boot customizations to make.
  user_data = var.userdata
}

resource "aws_autoscaling_group" "this" {
  # The name of the Autoscaling Group in the Web UI
  # default "my_autoscaling_group"
  name = var.name

  # The list of subnets into which the ASG should be deployed.
  vpc_zone_identifier = var.private_subnets
  # The smallest and largest number of instances the ASG should scale between
  min_size            = var.min_rep
  max_size            = var.max_rep

  mixed_instances_policy {
    launch_template {
      # Use this template to launch all the instances
      launch_template_specification {
        launch_template_id = aws_launch_template.this.id
        version            = "$Latest"
      }

      # This loop can either use the calculated value "local.instance_types"
      # or, if you have no legacy use of this module, remove the locals{}
      # and the variable "instance_type" {} block above, and replace the
      # for_each and instance_type values (defined as "local.instance_types")
      # with "var.instance_types".
      #
      # Loop through the whole list of instance types and create a
      # set of "override" values (the values are defined in the content{}
      # block).
      dynamic "override" {
        for_each = local.instance_types
        content {
          instance_type = local.instance_types[override.key]
        }
      }
    }

    instances_distribution {
      # If we "enable spot", then make it 100% spot.
      on_demand_percentage_above_base_capacity = var.enable_spot ? 0 : 100
      spot_allocation_strategy                 = var.spot_allocation_strategy
      spot_max_price                           = "" # Empty string is "on-demand price"
    }
  }
}

So what is all this then?

This is two Terraform resources; an aws_launch_template and an aws_autoscaling_group. These two resources define what should be launched by the autoscaling group, and then the settings for the autoscaling group.

You will need to work out what instance types you want to use (e.g. “must have 16 cores and 32 GB RAM, have an x86_64 architecture and allow up to 15 Gigabit/second throughput”)

When might you use this pattern?

If you have been seeing messages like “There is no Spot capacity available that matches your request.” or “We currently do not have sufficient <size> capacity in the Availability Zone you requested.” then you need to consider diversifying the fleet that you’re requesting for your autoscaling group. To do that, you need to specify more instance types. To achieve this, I’d use the above code to replace (something like) one of the code samples below.

If you previously have had something like this:

resource "aws_launch_configuration" "this" {
  iam_instance_profile        = var.instance_profile_name
  image_id                    = data.aws_ami.this.id
  instance_type               = var.instance_type
  name_prefix                 = var.name
  security_groups             = [ var.security_group ]
  user_data_base64            = var.userdata
  spot_price                  = var.spot_price

  root_block_device {
    volume_size = var.disksize
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "this" {
  capacity_rebalance   = false
  launch_configuration = aws_launch_configuration.this.id
  max_size             = var.max_rep
  min_size             = var.min_rep
  name                 = var.name
  vpc_zone_identifier  = var.private_subnets
}

Or this:

resource "aws_launch_template" "this" {
  lifecycle {
    create_before_destroy = true
  }

  block_device_mappings {
    device_name = data.aws_ami.this.root_device_name
    ebs {
      volume_size = var.disksize
    }
  }

  iam_instance_profile {
    name = var.instance_profile_name
  }

  network_interfaces {
    associate_public_ip_address = true
    security_groups             = local.node_security_groups
  }

  image_id      = data.aws_ami.this.id
  name_prefix   = var.name
  instance_type = var.instance_type
  user_data     = var.userdata

  instance_market_options {
    market_type = "spot"
    spot_options {
      spot_instance_type = "one-time"
    }
  }

  metadata_options {
    http_tokens                 = var.imds == 1 ? "optional" : "required"
    http_endpoint               = "enabled"
    http_put_response_hop_limit = 1
  }
}

resource "aws_autoscaling_group" "this" {
  name                = var.name
  vpc_zone_identifier = var.private_subnets
  min_size            = var.min_rep
  max_size            = var.max_rep

  launch_template {
    id      = aws_launch_template.this.id
    version = "$Latest"
  }
}

Then this new method is a much better idea :) Even more so if you had two launch templates to support spot and non-spot instance types!

Hat-tip to former colleague Paul Moran who opened my eyes to defining your fleet of variable instance types, as well as to my former customer (deliberately unnamed) and my current employer who both stumbled into the same documentation issue. Without Paul’s advice with my prior customer’s issue I’d never have known what I was looking for this time around!

Featured image is “Fishing fleet” by “Nomad Tales” on Flickr and is released under a CC-BY-SA license.

"Traffic" by "Make Lemons" on Flickr

A Quick Guide to setting up Traefik on a single Docker node inside your home network

I have a small server running Docker for services at home. There are several services which will want to use HTTP, but I can’t have them all sharing the same port without a reverse proxy to manage how to route the traffic to the containers!

This is my guide to how I got Traefik set up to serve HTTP and HTTPS traffic.

The existing setup for one service

Currently, I have phpIPAM which has the following docker-compose.yml file:

version: '3'

services:
  web:
    image: phpipam/phpipam-www:latest
    ports:
      - "80:80"
    cap_add:
      - NET_ADMIN
      - NET_RAW
    environment:
      - TZ=Europe/London
      - IPAM_DATABASE_HOST=db
      - IPAM_DATABASE_USER=someuser
      - IPAM_DATABASE_PASS=somepassword
      - IPAM_DATABASE_WEBHOST=%
    restart: unless-stopped
    volumes:
      - phpipam-logo:/phpipam/css/images/logo
      - phpipam-ca:/usr/local/share/ca-certificates:ro
    depends_on:
      - db

  cron:
    image: phpipam/phpipam-cron:latest
    cap_add:
      - NET_ADMIN
      - NET_RAW
    environment:
      - TZ=Europe/London
      - IPAM_DATABASE_HOST=db
      - IPAM_DATABASE_USER=someuser
      - IPAM_DATABASE_PASS=somepassword
      - SCAN_INTERVAL=1h
    restart: unless-stopped
    volumes:
      - phpipam-ca:/usr/local/share/ca-certificates:ro
    depends_on:
      - db

  db:
    image: mariadb:latest
    environment:
      - MYSQL_USER=someuser
      - MYSQL_PASSWORD=somepassword
      - MYSQL_RANDOM_ROOT_PASSWORD=yes
      - MYSQL_DATABASE=phpipam
    restart: unless-stopped
    volumes:
      - phpipam-db-data:/var/lib/mysql

volumes:
  phpipam-db-data:
  phpipam-logo:
  phpipam-ca:

The moment I want to bind another service to TCP/80, I get an error because we’ve already used TCP/80 for phpIPAM. Enter Traefik. Let’s stop the docker container with docker compose down and build our Traefik setup.

Traefik Setup

I always store my docker compose files in /opt/docker/<servicename>, so let’s create a directory for traefik; sudo mkdir -p /opt/docker/traefik

The (“dynamic”) configuration file

Next we need to create a configuration file called traefik.yaml

# Ensure all logs are sent to stdout for `docker compose logs`
accessLog: {}
log: {}

# Enable docker provider but don't switch it on by default
providers:
  docker:
    exposedByDefault: false
    # Select this as the docker network to connect from traefik to containers
    # This is defined in the docker-compose.yaml file
    network: web

# Enable the API and Dashboard on TCP/8080
api:
  dashboard: true
  insecure: true
  debug: true

# Listen on both HTTP and HTTPS
entryPoints:
  http:
    address: ":80"
    http: {}
  https:
    address: ":443"
    http:
      tls: {}

With the configuration file like this, we’ll serve HTTPS traffic with a self-signed TLS certificate on TCP/443 and plain HTTP on TCP/80. We have a dashboard on TCP/8080 served over HTTP, so make sure you don’t expose *that* to the public internet!

The Docker-Compose File

Next we need the docker-compose file for Traefik, so let’s create docker-compose.yaml

version: '3'

networks:
  web:
    name: web
    attachable: true

services:
  traefik:
    image: traefik:latest
    ports:
      - "8080:8080"
      - "443:443"
      - "80:80"
    networks:
      - web
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./traefik.yaml:/etc/traefik/traefik.yaml
    restart: always

There are a few parts here which aren’t spelled out on the Traefik quickstart! Firstly, if you don’t define a network, it’ll create one using the docker-compose file path, so probably traefik_traefik or traefik_default, which is not what we want! So, we’ll create one called “web” (but you can call it whatever you want. On other deployments, I’ve used the name “traefik” but I found it tedious to remember how to spell that each time). This network needs to be “attachable” so that other containers can use it later.

You then attach that network to the traefik service, and expose the ports we need (80, 443 and 8080).

And then start the container with docker compose up -d

alpine-docker:/opt/docker/traefik# docker compose up -d
[+] Running 2/2
 ✔ Network web                  Created   0.2s 
 ✔ Container traefik-traefik-1  Started   1.7s 
alpine-docker:/opt/docker/traefik#

Adding Traefik to phpIPAM

Going back to phpIPAM, So that Traefik can reach the containers, and so that the container can reach it’s database, we need two network statements now; the first is the “external” network for the traefik connection which we called “web“. The second is the inter-container network so that the “web” service can reach the “db” service, and so that the “cron” service can reach the “db” service. So we need to add that to the start of /opt/docker/phpipam/docker-compose.yaml, like this;

networks:
  web:
    name: web
    external: true
    attachable: true
  ipam:
    name: ipam

We then need to add both networks that to the “web” container, like this:

services:
  web:
    image: phpipam/phpipam-www:latest
    networks:
      - ipam
      - web
# ...... and the rest of the config

Remove the “ports” block and replace it with an expose block like this:

services:
  web:
# ...... The rest of the config for this service
    ## Don't bind to port 80 - we use traefik now
    # ports:
    #   - "80:80"
    ## Do expose port 80 for Traefik to use 
    expose:
      - 80
# ...... and the rest of the config

And just the inter-container network to the “cron” and “db” containers, like this:

  cron:
    image: phpipam/phpipam-cron:latest
    networks:
      - ipam
# ...... and the rest of the config

  db:
    image: mariadb:latest
    networks:
      - ipam
# ...... and the rest of the config

There’s one other set of changes we need to make in the “web” service, which are to enable Traefik to know that this is a container to look at, and to work out what traffic to send to it, and that’s to add labels, like this:

services:
  web:
# ...... The rest of the config for this service
    labels:
      - traefik.enable=true
      - traefik.http.routers.phpipam.rule=Host(`phpipam.homenet`)
# ...... and the rest of the config

Right, now we run docker compose up -d

alpine-docker:/opt/docker/phpipam# docker compose up -d
[+] Running 4/4
 ✔ Network ipam              Created   0.4s 
 ✔ Container phpipam-db-1    Started   1.4s 
 ✔ Container phpipam-cron-1  Started   2.1s 
 ✔ Container phpipam-web-1   Started   2.6s 
alpine-docker:/opt/docker/phpipam#

If you notice, this doesn’t show to the web network being created (because it was already created by Traefik) but does bring up the container.

Checking to make sure it’s working

A screenshot of the traefik dashboard showing the phpipam service added.

If we head to the Traefik dashboard (http://your-docker-server:8080) you’ll see the phpipam service identified there… yey!

Better TLS with Lets Encrypt

So, at home I actually have a DNS suffix that is a real DNS name. For the sake of the rest of this documentation, assume it’s homenet.sprig.gs (but it isn’t 😁).

This DNS space is hosted by Digital Ocean, so I can use a DNS Challenge with Lets Encrypt to provide hostnames which are not publically accessible. If you’re hosting with someone else, then that’s probably also available – check the Traefik documentation for your specific variables. The table on that page (as of 2023-12-30) shows the environment variables you need to pass to Traefik to get LetsEncrypt working.

A screen capture of the table on the Traefik website, showing the environment variables needed to use the Lets Encrypt DNS challenge with Digital Ocean

As you can see here, I just need to add the value DO_AUTH_TOKEN, which is an API key. I went to the Digital Ocean console, and navigated to the API panel, and added a new “Personal Access Token”, like this:

Screen capture of part of the Digital Ocean console showing the personal access token, showing I needed "read" and "write" capabilities.

Notice that the API key needed to provide both “Read” and “Write” capabilities, and has been given a name so I can clearly see it’s purpose.

Changing the traefik docker-compose.yaml file

In /opt/docker/traefik/docker-compose.yaml we need to add that new environment variable; DO_AUTH_TOKEN, like this:

services:
  traefik:
# ...... The rest of the config for this service
    environment:
      DO_AUTH_TOKEN: dop_v1_decafbad1234567890abcdef....1234567890
# ...... and the rest of the config

Changing the traefik.yaml file

In /opt/docker/traefik/traefik.yaml we need to tell it to use Let’s Encrypt. Add this block to the end of the file:

certificatesResolvers:
  letsencrypt:
    acme:
      email: yourname@example.org
      storage: acme.json
      dnsChallenge:
        provider: digitalocean
        delayBeforeCheck: 1 # Minutes
        resolvers:
          - "1.1.1.1:53"
          - "8.8.8.8:53"

Obviously change the email address to a valid one for you! I hit a few issues with the value specified in the documentation for delayBeforeCheck, as their value of “0” wasn’t long enough for the DNS value to be propogated around the network – 1 minute is enough though!

I also had to add the resolvers, as my local network has a caching DNS server, so I’d never have seen the updates! You may be able to remove both those values from your files.

Now you’ve made all the changes to the Traefik service, restart it with docker compose down ; docker compose up -d

Changing the services to use Lets Encrypt

We need to add one final label to the /opt/docker/phpipam/docker-compose.yaml file, which is this one:

services:
  web:
# ...... The rest of the config for this service
    labels:
      - traefik.http.routers.phpipam.tls.certresolver=letsencrypt
# ...... and the rest of the config

Also, update your .rule=Host(`hostname`) to use the actual DNS name you want to be able to use, then restart the docker container.

phpIPAM doesn’t like trusting proxies, unless explicitly told to, so I also had add an environment variable IPAM_TRUST_X_FORWARDED=true to the /opt/docker/phpipam/docker-compose.yaml file too, because phpIPAM tried to write the HTTP scheme for any links which came up, based on what protocol it thought it was running – not what the proxy was telling it it was being accessed as!

Debugging any issues

If you have it all setup as per the above, and it isn’t working, go into /opt/docker/traefik/traefik.yaml and change the stanza which says log: {} to:

log:
  level: DEBUG

Be aware though, this adds a LOT to your logs! (But you won’t see why your ACME requests have failed without it). Change it back to log: {} once you have it working again.

Adding your next service

I now want to add that second service to my home network – WordPress. Here’s /opt/docker/wordpress/docker-compose.yaml for that service;

version: '3.7'

networks:
  web:
    name: web 
    external: true
    attachable: true
  wordpress:
    name: wordpress

services:
  php:
    image: wordpress:latest
    expose:
      - 80
    environment:
      - WORDPRESS_DB_HOST=mariadb
      - WORDPRESS_DB_USER=db_user
      - WORDPRESS_DB_PASSWORD=db_pass
      - WORDPRESS_DB_NAME=wordpress
    volumes:
      - wordpress:/var/www/html
    labels:
      - traefik.enable=true
      - traefik.http.routers.wordpress.rule=Host(`wp.homenet.sprig.gs`)
      - traefik.http.routers.wordpress.tls.certresolver=letsencrypt
    depends_on:
      - mariadb
    networks:
      - wordpress
      - web 

  mariadb:
    image: mariadb:10.3
    environment:
      MYSQL_ROOT_PASSWORD: True
      MYSQL_USER: db_user
      MYSQL_PASSWORD: db_pass
      MYSQL_DATABASE: wordpress
    volumes:
      - db:/var/lib/mysql
    networks:
      - wordpress

volumes:
  wordpress:
  db:

And then we start it up;

alpine-docker:/opt/docker/wordpress# docker compose up -d
[+] Running 3/3
 ✔ Network wordpress              Created   0.2s 
 ✔ Container wordpress-mariadb-1  Started   3.0s 
 ✔ Container wordpress-php-1      Started   3.8s 
alpine-docker:/opt/docker/wordpress# 

Tada!

One final comment – I never did work out how to make connections forceably upgrade from HTTP to HTTPS, so instead, I shut down port 80 in Traefik, and instead run this container.

Featured image is “Traffic” by “Make Lemons” on Flickr and is released under a CC-BY-SA license.

A medieval helm in gold on a bench at a museum

Fixing 403 errors from ghcr.io with #helm pull

At work, I’m using skaffold to deploy a helm chart which references a ghcr.io repository. Here’s the stanza I’m looking at:

apiVersion: skaffold/v3
kind: Config
deploy:
  helm:
    releases:
      - name: {package}
        remoteChart: oci://ghcr.io/{owner}/{package}

This is the first time I’ve tried to deploy this chart, and I kept getting this message:

No tags generated
Starting test...
Starting deploy...
Helm release {package} not installed. Installing...
Error: INSTALLATION FAILED: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden
deploying "{package}": install: exit status 1

I thought this might have been an issue with the skaffold file, so I tried running this directly with helm:

$ helm pull oci://ghcr.io/{owner}/{package}
Error: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden

Huh, that looks a bit familiar. I spent a little while checking to see whether this was something at the Kubernetes cluster, or if it was just me, and ended up finding this nugget (thanks to a steer from this post)

$ gh auth token | helm registry login ghcr.io -u {my_github_user} --password-stdin
Login Succeeded

And now it works!

elm pull oci://ghcr.io/{owner}/{package}
Pulled: ghcr.io/{owner}/{package}:1.2.3
Digest: sha256:decafbad1234567890aabbccddeeffdeadbeefbadbadbadbad12345678901234

Featured image is “helm” by “23 dingen voor musea” on Flickr and is released under a CC-BY-SA license.

A text dialogue from a web page showing "Uh oh. Something really just went wrong. Good thing we know about it and have our crack team of squirrels getting their nuts out of the system!"

How to capture stdout and stderr from a command in a shellscript without preventing piped processes from seeing them

I love the tee command – it captures stdout [1] and puts it in a file, while then returning that output to stdout for the next process in a pipe to consume, for example:

$ ls -l | tee /tmp/output
total 1
xrwxrwxrw 1 jonspriggs jonspriggs 0 Jul 27 11:16 build.sh
$ cat /tmp/output
total 1
xrwxrwxrw 1 jonspriggs jonspriggs 0 Jul 27 11:16 build.sh

But wait, why is that useful? Well, in a script, you don’t always want to see the content scrolling past, but in the case of a problem, you might need to catch up with the logs afterwards. Alternatively, you might do something like this:

if some_process | tee /tmp/output | grep -q "some text"
then
  echo "Found 'some text' - full output:"
  cat /tmp/output
fi

This works great for stdout but what about stderr [2]? In this case you could just do:

some_process 2>&1 | tee /tmp/output

But that mashes all of stdout and stderr into the same blob.

In my case, I want to capture all the output (stdout and stderr) of a given process into a file. Only stdout is forwarded to the next process, but I still wanted to have the option to see stderr as well during processing. Enter process substitution.

TEMP_DATA_PATH="$(mktemp -d)"
capture_out() {
  base="${TEMP_DATA_PATH}/${1}"
  mkdir "${base}"
  shift
  "$@" 2> >(tee "${base}/stderr" >&2) 1> >(tee "${base}/stdout")
}

With this, I run capture_out step-1 do_a_thing and then in /tmp/tmp.sometext/step-1/stdout and /tmp/tmp.sometext/step-1/stderr are the full outputs I need… but wait, I can also do:

$ capture_out step-1 do_a_thing | \
  capture_out step-2 process --the --thing && \
  capture_out step-3 echo "..." | capture_out step-4 profit
$ find /tmp/tmp.sometext -type f
/tmp/tmp.sometext/step-1/stdout
/tmp/tmp.sometext/step-1/stderr
/tmp/tmp.sometext/step-2/stdout
/tmp/tmp.sometext/step-2/stderr
/tmp/tmp.sometext/step-4/stdout
/tmp/tmp.sometext/step-4/stderr
/tmp/tmp.sometext/step-3/stderr
/tmp/tmp.sometext/step-3/stdout

Or

if capture_out has_an_error something-wrong | capture_out handler check_output
then
  echo "It all went great"
else
  echo "Process failure"
  echo "--Initial process"
  # Use wc -c to check the number of characters in the file
  if [ -e "${TEMP_DATA_PATH}/has_an_error/stdout"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/has_an_error/stdout")" ]
  then
    echo "----stdout:"
    cat "${TEMP_DATA_PATH}/has_an_error/stdout"
  fi
  if [ -e "${TEMP_DATA_PATH}/has_an_error/stderr"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/has_an_error/stderr")" ]
  then
    echo "----stderr:"
    cat "${TEMP_DATA_PATH}/has_an_error/stderr"
  fi
  echo "--Second stage"
  if [ -e "${TEMP_DATA_PATH}/handler/stdout"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/handler/stdout")" ]
  then
    echo "----stdout:"
    cat "${TEMP_DATA_PATH}/handler/stdout"
  fi
  if [ -e "${TEMP_DATA_PATH}/handler/stderr"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/handler/stderr")" ]
  then
    echo "----stderr:"
    cat "${TEMP_DATA_PATH}/handler/stderr"
  fi
fi

This has become part of my normal toolkit now for logging processes. Thanks bash!

Also, thanks to ChatGPT for helping me find this structure that I’d seen before, but couldn’t remember how to do it! (it almost got it right too! Remember kids, don’t *trust* what ChatGPT gives you, use it as a research starting point, test *that* against your own knowledge, test *that* against your environment and test *that* against expected error cases too! Copy & Paste is not the best idea with AI generated code!)

Footnotes

[1] stdout is the name of the normal output text we see in a shell, it’s also sometimes referred to as “file descriptor 1” or “fd1”. You can also output to &1 with >&1 which means “send to fd1”

[2] stderr is the name of the output in a shell when an error occurs. It isn’t caught by things like some_process > /dev/null which makes it useful when you don’t want to see output, just errors. Like stdout, it’s also referred to as “file descriptor 2” or “fd2” and you can output to &2 with >&2 if you want to send stdout to stderr.

Featured image is “WordPress Error” by “tara hunt” on Flickr and is released under a CC-BY-SA license.

A series of gold blocks, each crossed, and one of the lower blocks has engraved "Eduardo Nery 1995-1998 Aleluia - Secla"

Deploying the latest build of a template (machine) image with #Xen #Orchestrator

In my current role we are using Packer to build images on a Xen Orchestrator environment, use a CI/CD system to install that image into both a Xen Template and an AWS AMI, and then we use Terraform to use that image across our estate. The images we build with Packer have this stanza in it:

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}
variable "artifact_name" {
  default = "SomeLinux-version.iso"
}
source "xenserver-iso" "this" {
  vm_name = "${var.artifact_name}-${local.timestamp}"
  # more config below
}

As a result, the built images include a timestamp.

When we use the AMI in Terraform, we can locate it with this code:

variable "ami_name" {
  default = "SomeLinux-version.iso-"
}

data "aws_ami" "this" {
  most_recent = true

  filter {
    name   = "name"
    values = [var.ami_name]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = [var.owner]
}

But, because Xen doesn’t track when a template is created, instead I needed to do something different. Enter get_xoa_template.sh.

#!/bin/bash
trap cleanup SIGINT SIGTERM EXIT
exit_message=""
set_exit=0
fail() {
[ -n "$1" ] && echo "$1" >&2
[ "$2" -gt 0 ] && exit $2
}
cleanup() {
trap – SIGINT SIGTERM EXIT
[ "$UNREGISTER" -eq 1 ] && [ "$STATE" == "signed_in" ] && xo-cli –unregister 2>&1
[ -n "$exit_message" ] && fail "$exit_message" $set_exit
}
log_debug() {
[ -n "$DEBUG" ] && echo "$1" >> "$DEBUG"
}
parse_params() {
UNREGISTER=1
DEBUG=""
while :; do
case "${1-}" in
-h | –help)
echo "usage: get_xoa_template.sh –template SomeTemplatePrefix" >&2
echo "" >&2
echo "Options:" >&2
echo " -t | –template MyTemplatePrefix = The template to look for (required)" >&2
echo " -s | –server ws://192.0.2.1 = Sign into Xen Orchestrator on 192.0.2.1" >&2
echo " [Default to using XOA_URL environment variable]" >&2
echo " -u | –user username@example.org = Sign into Xen Orchestrator using this username" >&2
echo " [Default to using XOA_USER environment variable]" >&2
echo " -p | –password hunter2 = Sign into Xen Orchestrator using this password" >&2
echo " [Default to using XOA_PASSWORD environment variable]" >&2
echo " -l | –pool MyXenPool1 = Use this pool when looking for the template." >&2
echo " [Omit to ignore]" >&2
echo " -x | –no-unregister = Don't log out from the XOA server once connected." >&2
echo " -d | –debug = Log output to /tmp/xocli_output" >&2
echo " -d | –debug /path/to/debug = Log output to the specified path" >&2
echo " –debug=/path/to/debug = Log output to the specified path" >&2
exit 255
;;
-s | –server)
XOA_URL="${2-}"
shift
;;
-u | –user)
XOA_USER="${2-}"
shift
;;
-p | –password)
XOA_PASSWORD="${2-}"
shift
;;
-l | –pool)
XOA_POOL="${2-}"
shift
;;
-t | –template)
TEMPLATE="${2-}"
shift
;;
-x | –no-unregister)
UNREGISTER=0
;;
-d | –debug)
DEBUG=/tmp/xocli_output
[ -n "${2-}" ] && [ "$(echo "${2-}" | cut -c1)" != "-" ] && DEBUG="${2-}" && shift
;;
–debug=*)
DEBUG="$(echo $1 | sed -E -e 's/^[^=]+=//')"
;;
*)
break
;;
esac
shift
done
}
sign_in() {
[ -z "$XOA_URL" ] || [ -z "$XOA_USER" ] || [ -z "$XOA_PASSWORD" ] && fail "Missing sign-in details" 1
log_debug "Logging in"
if [ -n "$DEBUG" ]
then
xo-cli –register –au "$XOA_URL" "$XOA_USER" "$XOA_PASSWORD" 2>&1 | tee -a "$DEBUG" | grep -q 'Successfully' || fail "Login failed" 2
else
xo-cli –register –au "$XOA_URL" "$XOA_USER" "$XOA_PASSWORD" 2>&1 | grep -q 'Successfully' || fail "Login failed" 2
fi
STATE="signed_in"
}
get_pool() {
[ -z "$XOA_POOL" ] && log_debug "No Pool" && return 0
log_debug "Getting Pool ID"
if [ -n "$DEBUG" ]
then
POOL_ID="\$pool=$(xo-cli –list-objects type=pool | jq -c -r ".[] | select(.name_label | match(\"${XOA_POOL}\")) | .uuid" | sort | tail -n 1 | tee -a "$DEBUG")"
else
POOL_ID="\$pool=$(xo-cli –list-objects type=pool | jq -c -r ".[] | select(.name_label | match(\"${XOA_POOL}\")) | .uuid" | sort | tail -n 1)"
fi
[ "$POOL_ID" == "\$pool=" ] && fail "Pool provided but no ID received" 3
}
get_template() {
log_debug "Getting template"
if [ -n "$DEBUG" ]
then
TEMPLATE_IS="$(xo-cli –list-objects type=VM-template "${POOL_ID-}" | jq -c ".[] | select(.name_label | match(\"${TEMPLATE}\")) | .name_label" | sort | tail -n 1 | tee -a "$DEBUG")"
else
TEMPLATE_IS="$(xo-cli –list-objects type=VM-template "${POOL_ID-}" | jq -c ".[] | select(.name_label | match(\"${TEMPLATE}\")) | .name_label" | sort | tail -n 1)"
fi
[ -z "$TEMPLATE_IS" ] && fail "Could not match this template" 4
if [ -n "$DEBUG" ]
then
echo "{\"is\": ${TEMPLATE_IS}}" | tee -a "$DEBUG"
else
echo "{\"is\": ${TEMPLATE_IS}}"
fi
}
[ -n "$(command -v xo-cli)" ] || fail "xo-cli is missing, and is a required dependency for this script. Please install it; \`sudo npm -g install xo-cli\`" 5
parse_params "$@"
if [ -n "$DEBUG" ]
then
rm -f "$DEBUG"
log_debug "Invoked: $(date)"
log_debug "Template: $TEMPLATE"
log_debug "Pool: $XOA_POOL"
fi
sign_in
get_pool
get_template

This script is invoked from your terraform like this:

variable "template_name" {
  default     = "SomeLinux-version.iso-"
  description = "A regex, partial or full string to match in the template name"
}

variable "poolname" {
  default = "MyPool"
}

data "external" "get_xoa_template" {
  program = [
    "/bin/bash", "${path.module}/get_xoa_template.sh",
    "--template", var.template_name,
    "--pool", var.poolname
  ]
}

data "xenorchestra_pool" "pool" {
  name_label = var.poolname
}

data "xenorchestra_template" "template" {
  name_label = data.external.get_xoa_template.result.is
  pool_id    = data.xenorchestra_pool.pool.id
}

And that’s how you do it. Oh, and if you need to pin to a specific version? Change the template_name value from the partial or regex version to the full version, like this:

variable "template_name" {
  # This assumes your image was minted at midnight on 1970-01-01
  default     = "SomeLinux-version.iso-19700101000000"
}

Featured image is “Barcelos and Braga-18” by “Graeme Churchard” on Flickr and is released under a CC-BY license.

A photo of a door with the focus on the handle which has a lock in the centre of the knob. The lock has a key in it, with a bunch of keys dangling from the central ring.

Using direnv with terraform, terragrunt, saml2aws, SOPS and AWS KMS

In my current project I am often working with Infrastructure as Code (IoC) in the form of Terraform and Terragrunt files. Before I joined the team a decision was made to use SOPS from Mozilla, and this is encrypted with an AWS KMS key. You can only access specific roles using the SAML2AWS credentials, and I won’t be explaining how to set that part up, as that is highly dependant on your SAML provider.

While much of our environment uses AWS, we do have a small presence hosted on-prem, using a hypervisor service. I’ll demonstrate this with Proxmox, as this is something that I also use personally :)

Firstly, make sure you have all of the above tools installed! For one stage, you’ll also require yq to be installed. Ensure you’ve got your shell hook setup for direnv as we’ll need this later too.

Late edit 2023-07-03: There was a bug in v0.22.0 of the terraform which didn’t recognise the environment variables prefixed PROXMOX_VE_ – a workaround by using TF_VAR_PROXMOX_VE and a variable "PROXMOX_VE_" {} block in the Terraform code was put in place for the inital publication of this post. The bug was fixed in 0.23.0 which this post now uses instead, and so as a result the use of TF_VAR_ prefixed variables was removed too.

Set up AWS Vault

AWS KMS

AWS Key Management Service (KMS) is a service which generates and makes available encryption keys, backed by the AWS service. There are *lots* of ways to cut that particular cake, but let’s do this a quick and easy way… terraform

variable "name" {
  default = "SOPS"
  type    = string
}
resource "aws_kms_key" "this" {
  tags                     = {
    Name : var.name,
    Owner : "Admins"
  }
  key_usage                = "ENCRYPT_DECRYPT"
  customer_master_key_spec = "SYMMETRIC_DEFAULT"
  deletion_window_in_days  = 30
  is_enabled               = true
  enable_key_rotation      = false
  policy                   = <<EOF
{
  "Version": "2012-10-17",
  "Id": "key-default-1",
  "Statement": [
    {
      "Sid": "Root Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${get_aws_account_id()}:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Estate Admin Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${get_aws_account_id()}:role/estateadmins"
      },
      "Action": [
        "kms:Describe*",
        "kms:List*",
        "kms:Get*",
        "kms:Encrypt*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_kms_alias" "this" {
  target_key_id = aws_kms_key.this.key_id
  name          = "alias/${var.name}"
}

output "key" {
  value = aws_kms_alias.this.arn
}

After running this, let’s assume that we get an output for the “key” value of:

arn:aws:kms:us-east-1:123456789012:alias/main

Setup Sops

In your terragrunt tree, create a file called .sops.yaml, which contains:

---
creation_rules:
  - kms: arn:aws:kms:us-east-1:123456789012:alias/main

And a file called secrets.enc.yaml which contains:

---
PROXMOX_VE_USERNAME: root@pam
PROXMOX_VE_PASSWORD: deadb33f@2023

Test that your KMS works by assuming your IAM role via SAML2AWS like this:

$ saml2aws login --skip-prompt --quiet
$ saml2aws exec -- sops --verbose --encrypt --in-place secrets.enc.yaml
[AWSKMS]	 INFO[0000] Encryption succeeded                          arn="arn:aws:kms:us-east-1:123456789012:alias/main"
[CMD]		 INFO[0000] File written successfully

Setup direnv

Outside your tree, in ~/.config/direnv/lib create a file called use_sops.sh (does not need to be chmod +x or chmod 755!) containing this:

# Based on https://github.com/direnv/direnv/wiki/Sops
use_sops() {
    local path=${1:-$PWD/secrets.enc.yaml}
    if [ -e "$path" ]
    then
        if grep -q -E '^sops:' "$path"
        then
            eval "$(sops --decrypt --output-type dotenv "$path" 2>/dev/null | direnv dotenv bash /dev/stdin || false)"
        else
            if [ -n "$(command -v yq)" ]
            then
                eval "$(yq eval --output-format props "$path" | direnv dotenv bash /dev/stdin)"
                export SOPS_WARNING="unencrypted $path"
            fi
        fi
    fi
    watch_file "$path"
}

There are two key lines here, the first of which is:

eval "$(sops -d --output-type dotenv "$path" 2>/dev/null | direnv dotenv bash /dev/stdin || false)"

This line asks sops to decrypt the secrets file, using the “dotenv” output type, however, the dotenv format looks like this:

some_key = "some value"

So, as a result, we then pass that value to direnv and ask it to rewrite it in the format it expects, which looks like this:

export some_key="some value"

The second key line is this:

eval "$(yq eval --output-format props "$path" | direnv dotenv bash /dev/stdin)"

This asks yq to parse the secrets file, using the “props” formatter, which results in lines just like the dotenv output we saw above.

However, because we used yq to parse the file, it means that we know this file isn’t encrypted, so we also add an extra export value:

export SOPS_WARNING="unencrypted $path"

This can be picked up as part of your shell prompt to put a warning in! Anyway… let’s move on.

Now that you have your reusable library file, we now configure the direnv file, .envrc for the root of your proxmox cluster:

use sops

Oh, ok, that was simple. You can add several files here if you wish, like this:

use sops file1.enc.yaml
use sops file2.enc.yml
use sops ~/.core_sops

But, we don’t need that right now!

Open your shell in that window, and you’ll get this warning:

direnv: error /path/to/demo/.envrc is blocked. Run `direnv allow` to approve its content

So, let’s do that!

$ direnv allow
direnv: loading /path/to/demo/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
$

So far, so good… but wait, you’ve authenticated to your SAML access to AWS. Let’s close that shell, and go back in again

$ cd /path/to/demo
direnv: loading /path/to/demo/.envrc
direnv: using sops
$

Ah, now we don’t have our values exported. That’s what we wanted!

What now?!

Configuring the details of the proxmox cluster

We have our .envrc file which provides our credentials (let’s pretend we’re using a shared set of credentials across all the boxes), but now we need to setup access to each of the boxes.

Let’s make our two cluster directories;

mkdir cluster_01
mkdir cluster_02

And in each of these clusters, we need to put an .envrc file with the right IP address in. This needs to check up the tree for any credentials we may have already loaded:

source_env "$(find_up ../.envrc)"
export PROXMOX_VE_ENDPOINT="https://192.0.2.1:8006" # Documentation IP address for the first cluster - change for the second cluster.

The first line works up the tree, looking for a parent .envrc file to inject, and then, with the second line, adds the Proxmox API endpoint to the end of that chain. When we run direnv allow (having logged back into our saml2aws session), we get this:

$ direnv allow
direnv: loading /path/to/demo/cluster_01/.envrc
direnv: loading /path/to/demo/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_ENDPOINT +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
$

Great, now we can setup the connection to the cluster in the terragrunt file!

Set up Terragrunt

In /path/to/demo/terragrunt.hcl put this:

remote_state {
  backend = "s3"
  config  = {
    encrypt                = true
    bucket                 = "example-inc-terraform-state"
    key                    = "${path_relative_to_include()}/terraform.tfstate"
    region                 = "us-east-1"
    dynamodb_table         = "example-inc-terraform-state-lock"
    skip_bucket_versioning = false
  }
}
generate "providers" {
  path      = "providers.tf"
  if_exists = "overwrite"
  contents  = <<EOF
terraform {
  required_providers {
    proxmox = {
      source = "bpg/proxmox"
      version = "0.23.0"
    }
  }
}

provider "proxmox" {
  insecure = true
}
EOF
}

Then in the cluster_01 directory, create a directory for the code you want to run (e.g. create a VLAN might be called “VLANs/30/“) and put in it this terragrunt.hcl

terraform {
  source = "${get_terragrunt_dir()}/../../../terraform-module-network//vlan"
  # source = "git@github.com:YourProject/terraform-module-network//vlan?ref=production"
}

include {
  path = find_in_parent_folders()
}

inputs = {
  vlan_tag    = 30
  description = "VLAN30"
}

This assumes you have a terraform directory called terraform-module-network/vlan in a particular place in your tree or even better, a module in your git repo, which uses the input values you’ve provided.

That double slash in the source line isn’t a typo either – this is the point in that tree that Terragrunt will copy into the directory to run terraform from too.

A quick note about includes and provider blocks

The other key thing is that the “include” block loads the values from the first matching terragrunt.hcl file in the parent directories, which in this case is the one which defined the providers block. You can’t include multiple different parent files, and you can’t have multiple generate blocks either.

Running it all together!

Now we have all our depending files, let’s run it!

user@host:~$ cd test
direnv: loading ~/test/.envrc
direnv: using sops
user@host:~/test$ saml2aws login --skip-prompt --quiet ; saml2aws exec -- bash
direnv: loading ~/test/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
user@host:~/test$ cd cluster_01/VLANs/30
direnv: loading ~/test/cluster_01/.envrc
direnv: loading ~/test/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_ENDPOINT +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
user@host:~/test/cluster_01/VLANs/30$ terragrunt apply
data.proxmox_virtual_environment_nodes.available_nodes: Reading...
data.proxmox_virtual_environment_nodes.available_nodes: Read complete after 0s [id=nodes]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # proxmox_virtual_environment_network_linux_bridge.this[0] will be created
  + resource "proxmox_virtual_environment_network_linux_bridge" "this" {
      + autostart  = true
      + comment    = "VLAN30"
      + id         = (known after apply)
      + mtu        = (known after apply)
      + name       = "vmbr30"
      + node_name  = "proxmox01"
      + ports      = [
          + "enp3s0.30",
        ]
      + vlan_aware = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
proxmox_virtual_environment_network_linux_bridge.this[0]: Creating...
proxmox_virtual_environment_network_linux_bridge.this[0]: Creation complete after 2s [id=proxmox01:vmbr30]
user@host:~/test/cluster_01/VLANs/30$

Winning!!

Featured image is “2018/365/1 Home is Where The Key Fits” by “Alan Levine” on Flickr and is released under a CC-0 license.

"Tickets" by "Becky Snyder" on Flickr

IP Address Management using PHPIPAM integrated with Keycloak for SAML2 Authentication

I’ve recently been working with a network estate that was a bit hard to get a handle on. It had grown organically, and was a bit tricky to allocate new network segments in. To fix this, I deployed PHPIPAM, which was super easy to setup and configure (I used the docker-compose file on the project’s docker hub page, and put it behind an NGINX server which was pre-configured with a LetsEncrypt TLS/HTTPS certificate).

PHPIPAM is a IP Address Management tool which is self-hostable. I started by setting up the “Sections” (which was the hosting environments the estate is using), and then setup the supernets and subnets in the “Subnets” section.

Already, it was much easier to understand the network topology, but now I needed to get others in to take a look at the outcome. The team I’m working with uses a slightly dated version of Keycloak to provide Single Sign-On. PHPIPAM will use SAML for authentication, which is one of the protocols that Keycloak offers. The documentation failed me a bit at this point, but fortunately a well placed ticket helped me move it along.

Setting up Keycloak

Here’s my walk through

  1. Go to “Realm Settings” in the sidebar and find the “SAML Identity Provider Metadata” (on my system it’s on the “General” tab but it might have changed position on your system). This will be an XML file, and (probably) the largest block of continuous text will be in a section marked “ds:X509Certificate” – copy this text, and you’ll need to use this as the “IDP X.509 public cert” in PHPIPAM.
  2. Go to “Clients” in the sidebar and click “Create”. If you want Keycloak to offer access to PHPIPAM as a link, the client ID needs to start “urn:” If you just want to use the PHPIPAM login option, give the client ID whatever you want it to be (I’ve seen some people putting in the URL of the server at this point). Either way, this needs to be unique. The client protocol is “saml” and the client SAML endpoint is the URL that you will be signing into on PHPIPAM – in my case https://phpipam.example.org/saml2/. It should look like this:

    Click Save to take you to the settings for this client.
  3. If you want Keycloak to offer a sign-in button, set the name of the button and description.

    Further down the page is “Root URL” set that to the SAML Endpoint (the one ending /saml2/ from before). Also set the “Valid Redirect URIs” to that too.

    Where it says “IDP Initiated SSO URL Name” put a string that will identify the client – I put phpipam, but it can be anything you want. This will populate a URL like this: https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam, which you’ll need as the “IDP Issuer”, “IDP Login URL” and “IDP Logout URL”. Put everything after the /auth/ in the box marked “Base URL”. It should look like this:

    Hit Save.
  4. Go to the “SAML Keys” tab. Copy the private key and certificate, these are needed as the “Authn X.509 signing” cert and cert key in PHPIPAM.
  5. Go to the “Mappers” tab. Create each of the following mappers;
    • A Role List mapper, with the name of “role list”, Role Attribute Name of “Role”, no friendly name, the SAML Attribute NameFormat set to “Basic” and Single Role Attribute set to on.
    • A User Attribute mapper, with the name, User Attribute, Friendly Name and SAML Attribute Name set to “email”, the SAML Attribute NameFormat set to “Basic” and Aggregate Attribute Values set to “off”.
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “display_name” and the SAML Attribute NameFormat set to “Basic”. The Script should be set to this single line: user.getFirstName() + ' ' + user.getLastName().
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “is_admin” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
is_admin = false;
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "phpipamadmins":
            is_admin = true;
            break;
    }
}
is_admin
  • Create one more mapper item:
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “groups” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
everyone_who_can_access_gets_read_only_access = false;
send_groups = "";
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "LDAP_GROUP_1":
            use_group = "IPAM_GROUP_1";
            break;
        case "LDAP_GROUP_2":
            use_group = "IPAM_GROUP_2";
            break;
    }
    if (use_group !== "") {
        if (send_groups !== "") {
          send_groups = send_groups + ","
        }
        send_groups = send_groups + use_group;
    }    
}
if (send_groups === "" && everyone_who_can_access_gets_read_only_access) {
    "Guests"
} else {
    send_groups
}

For context, the groups listed there, LDAP_GROUP_1 might be “Customer 1 Support Staff” or “ITSupport” or “Networks”, and the IPAM_GROUP_1 might be “Customer 1” or “WAN Links” or “DC Patching” – depending on the roles and functions of the teams. In my case they relate to other roles assigned to the staff member and the name of the role those people will perform in PHP IPAM. Likewise in the is_admin mapper, I’ve mentioned a group called “phpipamadmins” but this could be any relevant role that might grant someone admin access to PHPIPAM.

Late Update (2023-06-07): I’ve figured out how to enable modules now too. Create a Javascript mapper as per above, but named “modules” and have this script in it:

// Current modules as at 2023-06-07
// Some default values are set here.
noaccess       =  0;
readonly       =  1;
readwrite      =  2;
readwriteadmin =  3;
unsetperm      = -1;

var modules = {
    "*":       readonly,  "vlan":      unsetperm, "l2dom":    unsetperm,
    "devices": unsetperm, "racks":     unsetperm, "circuits": unsetperm,
    "nat":     unsetperm, "locations":  noaccess, "routing":  unsetperm,
    "pdns":    unsetperm, "customers": unsetperm
}

function updateModules(modules, new_value, list_of_modules) {
    for (var module in list_of_modules) {
        modules[module] = new_value;
    }
    return modules;
}

var GroupSet = user.getGroups();
for (var group in GroupSet) {
    switch (group.getName()) {
        case "LDAP_ROLE_3":
            modules = updateModules(modules, readwriteadmin, [
                'racks', 'devices', 'nat', 'routing'
            ]);
            break;
    }
}

var moduleList = '';

for (var key in modules) {
    if (modules.hasOwnProperty(key) && modules[key] !==-1) {
        if (moduleList !== '') {
            moduleList += ',';
        }
        moduleList += key + ':' + modules[key];
    }
}

moduleList;

OK, that’s Keycloak sorted. Let’s move on to PHPIPAM.

Setting up PHPIPAM

In the administration menu, select “Authentication Methods” and then “Create New” and select “Create new SAML2 authentication”.

In the description field, give it a relevant name, I chose SSO, but you could call it any SSO system name. Set “Enable JIT” to “on”, leave “Use advanced settings” as “off”. In Client ID put the Client ID you defined in Keycloak, probably starting urn: or https://. Leave “Strict mode” off. Next is the IDP Issuer, IDP Login URL and IDP Logout URL, which should all be set to the same URL – the “IDP Initiated SSO URL Name” from step 4 of the Keycloak side (that was set to something like https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam).

After that is the certificate section – first the IDP X.509 public cert that we got in step 1, then the “Sign Authn requests” should be set to “On” and the Authn X.509 signing cert and cert key are the private key and certificate we retrieved in step 5 above. Leave “SAML username attribute” and “SAML mapped user” blank and “Debugging” set to “Off”. It should look like this:

Hit save.

Next, any groups you specified in the groups mapper need to be defined. This is in Administration -> Groups. Create the group name and set a description.

Lastly, you need to configure the sections to define whigh groups have access. Each defined group gets given four radio buttons; “na” (no access), “ro” (read only), “rw” (read write) and “rwa” (read, write and administrate).

Try logging in. It should just work!

Debugging

If it doesn’t, and checking all of the above doesn’t help, I’ve tried adding some code into the PHP file in app/saml2/index.php, currently on line 149, above where it says:

if (empty($auth->getAttribute("display_name")[0])) {                                                                                    
    $Result->show("danger", _("Mandatory SAML JIT attribute missing")." : display_name (string)", true);
}

This code is:

$outfile = fopen("/tmp/log.out", "w");                                                           
fwrite($outfile, var_export($auth, true));                                                                     
fclose($outfile);

**REMEMBER THIS IS JUST FOR TESTING PURPOSES AND SHOULD BE REMOVED ASAP**

In here is an array called _attributes which will show you what has been returned from the Keycloak server when someone tries to log in. In my case, I got this:

   '_attributes' =>
  array (
    'groups' => 
    array (
      0 => 'PHPIPAM_GROUP_1',
    ),
    'is_admin' => 
    array (
      0 => 'false',
    ),
    'display_name' => 
    array (
      0 => 'Jon Spriggs',
    ),
    'email' => 
    array (
      0 => 'spriggsj@example.org',
    ),
  )

If you get something back here that isn’t what you expected, now at least you have a fighting chance of finding where that issue was! Good luck!!

Featured image is “Tickets” by “Becky Snyder” on Flickr and is released under a CC-BY license.

Tools I use when writing documentation

I’ve just finished what felt like the worlds longest work instruction, and it made me think… I’ve got these great tools at my disposal for when I’m writing documentation – perhaps not everyone knows about them!?

So here are the main ones!

Open Broadcaster Software (OBS) Studio

OBS Studio is a screen recording application, most commonly used by eSports players and gamers to stream content to streaming platforms like Twitch. While it’s really easy to do stuff like “lower third” bars (like this)

It’s also really *really* easy to just record a screen, which makes it easy to just “set and forget” and then run through the tasks. In the past, I’ve been on highly-technical in-person training courses and run OBS while I’ve been in a room, just to make sure I didn’t miss any details when I went back over my notes later!

Clicking “Start Recording” in the bottom third of that screen starts writing the content of that screen into (by default) a MKV file in your home directory (on Linux), or into your Videos directory on Windows. I don’t recall where it went on OS X!

This means after the task is done, I can play it back with…

VideoLan Client (VLC)

VLC is (or, at least, was) the defacto video player for anyone who consumes media on their computer. In my case, I only really use it now for playing back files I’ve recorded with OBS, but it’s a great player.

I only really make one key change to VLC, which is; adding the playback speed buttons. Different platforms do it differently, but on Linux and Windows at least, go to Tools -> Customise Interface and you’re presented with a screen like this:

I drag the “Slower” and “Faster” buttons to either side of the stop button, and will typically play videos back at 3x or even 8x speed to get to the areas I want, and then slow it down, sometimes, to 0.5x speed to get specific frames I’m looking for.

Once I get that frame, I’ll use a…

Screen Snipping Tool

On Windows this is the built-in snipping tool, found by pressing Windows+Shift+S. On Ubuntu, I’ve installed the package gnome-screenshot and the Gnome Extension Screenshot Tool.

With Snipping Tool you can draw circles around things and make annotations with a pen, but in extreme cases, I rely on the …

GNU Image Manipulation Program (GIMP).

GIMP, aside from being a badly named project (notably because most people over a certain age know about that scene from Pulp Fiction) is a FANTASTIC image editor. I use it for all sorts of editing, cropping and resizing, but I particularly love the layers menu on the side.

A couple of years ago my son wanted to give all his friends “membership cards” for a club he’d made. These were created in GIMP and just required him to clone a layer from the block on the right, hide the previous layer, and change the text in the clone.

This also means that if you want to easily highlight specific areas of the screen it’s only a few clicks to do it. Now, that said, it’s not as straightforward as Paint, and for anything beyond easy tweaks it will take a bit of experience, but it works.

This is all well and good, but sometimes you just need details from a terminal, where I’d use…

Asciinema

I use Asciinema (just a pip, apt or rpm away) when I’m working on something on a command line that I want to play back as a video, and the main reason is that I can adjust my output really quickly and easily!

Consider this screen of data:

I “recorded” that using asciinema rec ./demo.rec and ended it with ctrl+d. That created a file called demo.rec, which looks like this:

Each line is prefixed with a timestamp, it then has a stream action (mostly o for “output”, but you can add m for “marker” or i for “input” in there if you wanted to be clever ;) ) and then the text which will be rendered when it’s played back (more about that in a tic). This means that if you make a drastic typo, you can fix it in this file, or if you want to redact a password that you never meant to type, you can do!

But, and here’s the next awesome thing, you can also ask it to play back ignoring any gaps longer than X amount of fractions of a second, so if you went off to make a coffee in the middle of doing a recording, playback won’t show it.

And then, playing it back… well, you can just run asciinema play demo.rec but that’s a bit boring… but you can link to it, like this demo I created a while ago or even embed it in your own site

Rendering it all

And the last thing I use? Whatever writing tool I have to hand! I typically write on my blog if it’s for public consumption (like this), but I’ve used private Github and gitlab repos, Visual Studio Code to create markdown files, Google Docs, Microsoft Word, LibreOffice Writer and even slide decks if it’s the right tool for the job.

But the most important thing is that you get your documentation OUT THERE! It’s so frustrating only having half a story… like this one:

This is an XKCD Comic titled "Wisdom of the Ancients" which has the alt-text:
"All long help threads should have a sticky globally-editable post at the top saying 'DEAR PEOPLE FROM THE FUTURE: Here's what we've figured out so far ...'"

It depicts a stick figure shaking his screen, and a series of lines of text which say:
"Never have I felt so close to another soul and yet so helplessly alone as when I Google an erorr and there's one result a thread by someone with the same problem and no answer last posted to in 2003"

Above the stick figure is the text "Who were you DenverCoder9? What did you see?!"

Featured image is “Tools” by “Paul Downey” on Flickr and is released under a CC-BY license.

Picture of a comms rack with a patch panel, a unifi USW-Pro-24 switch, two Dell Optiplex 3040M computers, two external hard drives and a Raspberry Pi.

Building a Highly Available (HA) two-node Home Lab on Proxmox

Warning, this is a long and dense document!

That said, If you’re thinking of getting started with Proxmox though it’s well worth a read. If you’ve *used* Proxmox, and think I’m doing something wrong here, let me know in the comments!

Context

In the various podcasts I listen to, I’ve been hearing over and over again about Proxmox, and how it’s a great system for building and running virtual machines. In a former life, I’d use a combination of VMWare ESXi servers or desktop machines running Vagrant and Virtualbox to build out small labs and build environments, and at home I’d previously used a i3 ex-demo machine that was resold to staff at a reduced price. Unfortunately, the power supply went pop one evening on that, and all my home-lab experiments died.

When I changed to my most recent job, I had a small cash windfall at the same time, and decided to rebuild my home lab. I bought two Dell Optiplex 3040M i5 with 16GB RAM and two 3TB external USB3 hard drives to provide storage. These were selected because of the small size which meant they would fit in the small comms rack I had fitted when I got my house wired with CAT6 networking cables last year. These were patched into the UniFi USW-Pro-24 which was fitted as part of the networking build.

Picture of a comms rack with a patch panel, a unifi USW-Pro-24 switch, two Dell Optiplex 3040M computers, two external hard drives and a Raspberry Pi.

(Yes, it’s a bit of a mess, but it’s also not been in there very long, so needs a bit of a clean-up!)

The Install

I allocated two static IP addresses for these hosts, and performed a standard installation of Proxmox using a USB stick with the multi-image-installer Ventoy on it.

Some screenshots follow:

Proxmox installation screen showing the EULA
Proxmox installation screen showing the installation target
Proxmox installation screen showing the location and timezone settings
Proxmox installation screen showing the prompt for credentials and contact email address
Proxmox installation screen showing the IP address and hostname selection screen

Note that these screenshots were built on one pass, and have been rebuilt with new IPs that are used later.

Proxmox installation screen showing the summary of all the options selected
Proxmox installation screen showing the actual installation details and an advert for why you should use it.
Proxmox installation screen showing the success screen

As I don’t have an enterprise subscription, I ran these commands to use tteck’s Post PVE Install script to change the repositories.

wget https://raw.githubusercontent.com/tteck/Proxmox/main/misc/post-pve-install.sh
# Run the following to confirm the download looks OK and non-corrupted
less post-pve-install.sh
bash post-pve-install.sh

This results in the following (time-lapse) output, which is a series of options asking you to approve making changes to the system.

A time-lapse video of what happens during the post-pve-install script.

[Most of the following are derived from this YouTube video: “1/2 Create a 2-node Proxmox VE Cluster. Gluster as shared storage. With High Availability! First ep”]

Clustering

After signing into both Proxmox nodes, I went to my first node (proxmox01), selected “Datacenter” and then “Cluster”.

An image of the Proxmox server selecting the cluster screen

I clicked on “Create Cluster”, and created a cluster, called (unimaginatively) proxmox-cluster.

The create cluster dialogue box
The task completion details for the create cluster action

I clicked “Join Information”.

A screenshot showing the "Join information" button
The join information dialogue box

Next, on proxmox02 on the same screen, I clicked on “Join Cluster” and then pasted that information into the dialogue box. I entered the root password, and clicked “Join ‘proxmox-cluster'”.

A screenshot of the proxmox cluster, showing where the "join cluster" button is.
The Cluster Join screen, showing the pasted in text from the other cluster and that the password has been entered.

When this finished running, if either screen has hung, check whether one of the screens is showing an error like permission denied - invalid PVE ticket (401), like this (hidden just behind the “Task Viewer: Join Cluster” dialogue box):

A screen shot showing the error message "permission denied - invalid PVE ticket"

Or /etc/pve/nodes/NODENAME/pve-ssl.pem' does not exist! (500):

A screen shot of the error message "pve-ssl.pem does not exist"

Refresh your browsers, and you’ll probably find that the joining node will present a new TLS certificate:

A screen shot of Firefox's "unknown certificate" screen

Accept the certificate to resume the process.

To ensure I had HA quorum, which requires three nodes, I added an unused Raspberry Pi 3 running Raspberry Pi OS.

With that, I enabled root SSH access:

echo "PermitRootLogin yes" | tee /etc/ssh/sshd_config.d/root_login.conf >/dev/null && systemctl restart ssh.service

Next, I setup a password for the root account:

sudo passwd

And I installed the package “corosync-qnetd” on it:

sudo apt update && sudo apt install -y corosync-qnetd

Back on both of the Proxmox nodes, I installed the package “corosync-qdevice”:

apt update && apt install -y corosync-qdevice
A screen shot of the installation of the corosync-qdevice package having completed

On proxmox01 I then ran pvecm qdevice setup 192.168.1.179(where 192.168.1.179 is the IP address of the Raspberry Pi device).

A screen shot of the first half of of the setup of the command pvecm qdevice setup
A screen shot of the second half of of the setup of the command pvecm qdevice setup

This gave me my quorum of 3 nodes. To confirm this, I ran pvecm statuswhich resulted in this output:

root@proxmox01:~# pvecm status
Cluster information
-------------------
Name:             proxmox-cluster
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue May 16 20:38:15 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate Qdevice 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.1.200 (local)
0x00000002          1    A,V,NMW 192.168.1.201
0x00000000          1            Qdevice
root@proxmox01:~#
A screen shot of the output from the pvecm status command.

Storage

ZFS

Once the machines were built, I went into the Disks screen on each node, found the 3TB drive and selected “Wipe Disk”.

A screenshot of the disks page, showing the location of the "wipe disk" button.
A confirmation screen shot asking if I want to format the disk.
The completion screen shot for the wipe disk action

Next I clicked “Initialize Disk with GPT”.

The disk screen showing the location of the "Initialize Disk with GPT" button
The completion screenshot for initializing the disk

Next I went into the ZFS page in the node and created a ZFS Single Disk pool.

The ZFS screen shot, showing the location of the "Create: ZFS" button.

This pool was named “zfs-proxmox##” where “##” was replaced by the node number (so zfs-proxmox01 and zfs-proxmox02).

A screen shot of the options for creating the ZFS pool.

This mounts the pool as the pool name in the root (so /zfs-proxmox01 and /zfs-proxmox02).

A screen shot confirming that the disks have been mounted

GlusterFS

I added the Gluster debian repository by downloading the key from https://download.gluster.org/pub/gluster/glusterfs/10/rsa.pub and placing it in /etc/apt/keyrings/gluster.asc.

mkdir /etc/apt/keyrings
cd /etc/apt/keyrings
wget https://download.gluster.org/pub/gluster/glusterfs/10/rsa.pub
mv rsa.pub gluster.asc
A screen shot showing that the gluster key has been added to the system

Next I created a new repository entry in /etc/apt/sources.list.d/gluster.listwhich contained the line:

deb [arch=amd64 signed-by=/etc/apt/keyrings/gluster.asc] https://download.gluster.org/pub/gluster/glusterfs/10/LATEST/Debian/bullseye/amd64/apt bullseye main
A screenshot showing the apt repository being added to the system

I next ran apt update && apt install -y glusterfs-serverwhich installed the Gluster service.

A screen shot showing the installation of glusterfs-server in progress
A screenshot showing the completion of the glusterfs-server package and it's dependencies having been installed.

Following the YouTube link above, I created an entry for gluster01 and gluster02 in /etc/hosts which pointed to the IP address of proxmox01 and proxmox02 respectively.

A screen shot of editing the hosts file

Next, I edited /etc/glusterfs/glusterd.volso it contained this content:

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option transport.socket.listen-port 24007
    option transport.rdma.bind-address gluster01
    option transport.socket.bind-address gluster01
    option transport.tcp.bind-address gluster01
    option ping-timeout 0
    option event-threads 1
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
    option max-port  60999
end-volume
A screen shot of editing the glusterd.vol file.

Note that this content above is for proxmox01. For proxmox02 I replaced “gluster01” with “gluster02”. I then ran systemctl enable --now glusterdwhich started the Gluster service.

Once this is done, you must run gluster probe gluster02from proxmox01 (or vice versa), otherwise, when you run the next command, you get this message:

volume create: gluster-volume: failed: Host gluster02 is not in 'Peer in Cluster' state
A screen shot of the error message issued when you've not run gluster probe before creating the volume

(This takes some backing out… ugh)

On proxmox01, I created the volume using this command:

gluster volume create gluster-volume replica 2 gluster01:/zfs-proxmox01/gluster-volume gluster02:/zfs-proxmox02/gluster-volume
A screen shot of creating the gluster volume.

As you can see in the above screenshot, this warned about split brain situations. However, as this is for my home lab, I accepted the risk here. Following the YouTube video again, I ran these commands to “avoid [a] split-brain situation”:

gluster volume start gluster-volume
gluster volume set gluster-volume cluster.heal-timeout 5
gluster volume heal gluster-volume enable
gluster volume set gluster-volume cluster.quorum-reads false
gluster volume set gluster-volume cluster.quorum-count 1
gluster volume set gluster-volume network.ping-timeout 2
gluster volume set gluster-volume cluster.favorite-child-policy mtime
gluster volume heal gluster-volume granular-entry-heal enable
gluster volume set gluster-volume cluster.data-self-heal-algorithm full
A screenshot of the output of all the commands issued to prevent a gluster split brain scenario

I created /gluster-volume on both proxmox01 and proxmox02, and then added this line to /etc/fstab(yes, I know it should really have been a systemd mount unit) on proxmox01:

gluster01:gluster-volume /gluster-volume glusterfs defaults,_netdev,x-systemd.automount,backupvolfile-server=gluster02 0 0
A screen shot of the command issued to add the gluster volume to fstab

And on proxmox02:

gluster02:gluster-volume /gluster-volume glusterfs defaults,_netdev,x-systemd.automount,backupvolfile-server=gluster01 0 0

On both systems, I ensured that /gluster-volume was created, and then ran mount -a.

The result of adding the line to staband then mounting the volume.

In the Proxmox UI, I went to the “Datacenter” and selected “Storage”, then “Add” and selected “Directory”.

A screen shot of adding a directory to the proxmox server

I set the ID to “gluster-volume”, the directory to “/gluster-volume”, ticked the “Shared” box and selected all the content types (it looks like a list box, but it’s actually a multi-select box).

The Add Directory dialogue screen shot

(I forgot to click “Shared” before I selected all the items under “Content” here.)

I clicked Add and it was available on both systems then.

A screen shot proving that the gluster volume has been added.

Backups

This one saved me from having to rebuild my Home Assistant system last week! Go into “Datacenter” and select the “Backup” option.

A screen shot of the backup screen in Proxmox, showing the location of the "add" button.

Click the “Add” button, select the storage you’ve just configured (gluster-volume) and a schedule (I picked daily at 04:00) and choose “Selection Mode” of “All”.

A screenshot of the dialogue box for creating the backup job

On the retention tab, I entered the number 3 for “Keep Daily”, “Keep Weekly”, “Keep Monthly” and “Keep Yearly”. Your retention needs are likely to be different to mine!

A screenshot of the dialogue box for creating the retention in the backup job
Proof that the backup job has been created.

If you end up needing to restore one of these backups, you need a different tool depending on whether it’s a LXC container or a QEMU virtual machine. For a container, you’d run:

vmid=199
pct restore $vmid /path/to/backup-file

For a virtual machine, you’d run:

vmid=199
qmrestore /path/to/backup-file $vmid

…and yes, you can replace the vmid=199 \n $vmidwith just the number for the VMID like this:

pct restore 123 /backup/vzdump-lxc-100-1970_01_01-04_00_00.tar.zst

If you need to point the storage at a different device (perhaps Gluster broke, or your external drive) you’d add --storage storage-label(e.g. --storage local-lvm)

Networking

The biggest benefit for me of a home lab is being able to build things on their own VLAN. A VLAN allows a single network interface to carry traffic for multiple logical networks, in such a way that other ports on the switch which aren’t configured to carry that logical network can’t access that traffic.

For example, I’ve configured my switch to have a new VLAN on it, VLAN 30. This VLAN is exposed to the two Proxmox servers (which can access all the VLANs) and also the port to my laptop. This means that I can run virtual machines on VLAN 30 which can’t be accessed by any other machine on my network.

There are two ways to do this, the “easy way” and the “explicit way”. Both ways produce the same end state, it’s just down to which makes more logical sense in your head.

In both routes, you must create the VLANs on your switch first – I’m just addressing the way of configuring Proxmox to pass this traffic to your network switch.

Note that these VLAN tagged interfaces also don’t have a DHCP server or Internet gateway (unless you create one), so any addresses will need to be manually configured in any installation screens.

The easy way

Go into the individual nodes and select the Network option in the sidebar (nested under “System”). You’ll need to perform these actions on both nodes.

Click on the “Linux Bridge” line which is aligned to your “trunked” network interface. For me, as I have a single network interface (enp2s0) I have a single Linux Bridge (vmbr0). Click “Edit” and tick the “VLAN aware” box and click “OK”.

A screen shot showing how to add VLAN awareness to the linux bridge configuration.
A screen shot showing the changes to /etc/network/interfaces

When you now create your virtual machines, on the hardware option in the sidebar, find the network interface and enter the VLAN tag you want to assign.

A screen shot showing how to configure the VLAN tag when creating a new virtual machine in Proxmox

(This screenshot shows no VLAN tag added, but it’s fairly clear where you’d put that tag in there)

The explicit way

Go into the individual nodes and select the Network option in the sidebar. You’ll need to perform all the steps in the section on both nodes!

Create a new “Linux VLAN” object.

A screen shot showing where to add the van on the proxmox node.

Call it by the name of the interface (e.g. enp2s0) followed by a dot and then the VLAN tag, like this enp2s0.30. Click Create.

A screenshot of the dialogue box for creating a VLAN tagged interface

Next create a new “Linux Bridge”.

A screen shot showing where to find the Bridge interface button

Call it vmbr and then the VLAN tag, like this vmbr30. Set the ports to the VLAN you just created (enp2s0.30)

A screen shot of the creation of the  bridge interface, with the addition of the bridge port previously created.
A screen shot of the changes to the /etc/network/interfaces screen.

(I should note that I added the comment between writing this guide and taking these screen shots)

When you create your virtual machines select this bridge for accessing that VLAN.

A screen shot of the selection of the VLAN tagged bridge.

Making machines run in “HA”

If you haven’t already done the part with the QDevice under clustering, go back there and run those steps! You need quorum to do this right!

YOU MUST HAVE THE SAME NETWORK AND STORAGE CONFIGURATION FOR HIGH AVAILABILITY AND MIGRATIONS. This means every VM which you want to migrate from proxmox01 to proxmox02 must use the same network interface and storage device, no matter which host it’s connected to.

  • If you’re connecting enp2s0 to VLAN 55 by using a VLAN Bridge called vmbr55, then both nodes need this VLAN Bridge available. Alternatively, if you’re using a VLAN tag on vmbr0, that’s fine, but both nodes need to have vmbr0 set to be “VLAN aware”.
  • If you’re using a disk on gluster-volume, this must be shared across the cluster

Go to “Datacenter” and select “Groups” which is nested under “HA” in the sidebar.

A screen shot of where to find the HA Group Creation button.

Create a new group (again, unimaginatively, I went with “proxmox”). Select both nodes and press Create.

A screen shot of the HA Group Creation dialogue box.

Now go to the “HA” option in the sidebar and verify you have quorum, although it doesn’t matter which is the master.

A screen shot showing how to verify the HA quorum status

Under resources on that page, click “Add”.

A screen shot showing where the add button is to enable HA of a virtual machine.

In the VM box, select the ID for the container or virtual machine you want to be highly available and click Add.

A screen shot of the dialogue box when setting up high availability of a virtual machine.

This will restart that machine or container in HA mode.

A screen shot showing the HA status of that virtual machine.

The wrap up!

So, after all of this, there’s still no virtual machines running (well, that Ubuntu Desktop is created but not running yet!) and I’ve not even started playing around with Terraform yet… but I’m feeling really positive about Proxmox. It’s close enough to the proprietary solutions I’ve used at work in the past that I’m reasonably comfortable with it, but it’s open enough to mess around under the surface. I’m looking forward to doing more experiments!

The featured image is of the comms rack in my garage showing how bad my wiring is when I can’t get to the back of a rack!! It’s released under a CC-0 license.