A note to myself; resetting error status on proxmox HA workloads after a crash

I’ve had a couple of issues with brown-outs recently which have interrupted my Proxmox server, and stopped my connected disks from coming back up cleanly (yes, I’m working on that separately!) but it’s left me in a state where several of my containers and virtual machines on the cluster are down.

It’s possible to point-and-click your way around this, but far easier to script it!

A failed state may look like this:

root@proxmox1:~# ha-manager status
quorum OK
master proxmox2 (active, Fri Mar 22 10:40:49 2024)
lrm proxmox1 (active, Fri Mar 22 10:40:52 2024)
lrm proxmox2 (active, Fri Mar 22 10:40:54 2024)
service ct:101 (proxmox1, error)
service ct:102 (proxmox2, error)
service ct:103 (proxmox2, error)
service ct:104 (proxmox1, error)
service ct:105 (proxmox1, error)
service ct:106 (proxmox2, error)
service ct:107 (proxmox2, error)
service ct:108 (proxmox1, error)
service ct:109 (proxmox2, error)
service vm:100 (proxmox2, error)

Once you’ve fixed your issue, you can do this on each node:

for worker in $(ha-manager status | grep "($(hostnamectl hostname), error)" | cut -d\  -f2)
do
  echo "Disabling $worker"
  ha-manager set $worker --state disabled
  until ha-manager status | grep "$worker" | grep -q disabled ; do sleep 1 ; done
  echo "Restarting $worker"
  ha-manager set $worker --state started
  until ha-manager status | grep "$worker" | grep -q started ; do sleep 1 ; done
done

Note that this hasn’t been tested, but a scan over it with those nodes working suggests it should. I guess I’ll be updating this the next time I get a brown-out!

A padlock and chain on a rusted gate

Using #NetworkFirewall and #Route53 #DNS #Firewall to protect a private subnet’s egress traffic in #AWS

I wrote this post in January 2023, and it’s been languishing in my Drafts folder since then. I’ve had a look through it, and I can’t see any glaring reasons why I didn’t publish it so… it’s published… Enjoy 😁

If you’ve ever built a private subnet in AWS, you know it can be a bit tricky to get updates from the Internet – you end up having a NAT gateway or a self-managed proxy, and you can never be 100% certain that the egress traffic isn’t going somewhere you don’t want it to.

In this case, I wanted to ensure that outbound HTTPS traffic was being blocked if the SNI didn’t explicitly show the DNS name I wanted to permit through, and also, I only wanted specific DNS names to resolve. To do this, I used AWS Network Firewall and Route 53 DNS Firewall.

I’ve written this blog post, and followed along with this, I’ve created a set of terraform files to represent the steps I’ve taken.

The Setup

Let’s start this story from a simple VPC with three private subnets for my compute resources, and three private subnets for the VPC Endpoints for Systems Manager (SSM).

Here’s our network diagram, with the three subnets containing the VPC Endpoints at the top, and the three instances at the bottom.

I’ve created a tag in my Github repo at this “pre-changes” state, called step 1.

At this point, none of those instances can reach anything outside the network, with the exception of the SSM environment. So, we can’t install any packages, we can’t get data from outside the network or anything similar.

Getting Protected Internet Access

In order to get internet access, we need to add 4 things;

  1. An internet gateway
  2. A NAT gateway in each AZ
  3. Which needs three new subnets
  4. And three Elastic IP addresses
  5. Route tables in all the subnets

To clarify, a NAT gateway acts like a DSL router. It hides the source IP address of outbound traffic behind a single, public IP address (using an Elastic IP from AWS), and routes any return traffic back to wherever that traffic came from. To reduce inter-AZ data transfer rates, I’m putting one in each AZ, but if there’s not a lot of outbound traffic or the outbound traffic isn’t critical enough to require resiliency, this could all be centralised to a single NAT gateway. To put a NAT gateway in each AZ, you need a subnet in each AZ, and to get out to the internet (by whatever means you have), you need an internet gateway and route tables for how to reach the NAT and internet gateways.

We also should probably add, at this point, four additional things.

  1. The Network Firewall
  2. Subnets for the Firewall interfaces
  3. Stateless Policy
  4. Stateful Policy

The Network Firewall acts like a single appliance, and uses a Gateway Load Balancer to present an interface into each of the availability zones. It has a stateless policy (which is very fast, but needs to address both inbound and outbound traffic flows) to do IP and Port based filtering (referred to as “Layer 3” filtering) and then specific traffic can be passed into a stateful policy (which is slower) to do packet and flow inspection.

In this case, I only want outbound HTTPS traffic to be passed, so my stateless rule group is quite simple;

  • VPC range on any port → Internet on TCP/443; pass to Stateful rule groups
  • Internet on TCP/443 → VPC range on any port; pass to Stateful rule groups

I have two stateful rule groups, one is defined to just allow access out to example.com and any relevant subdomains, using the “Domain List” stateful policy item. The other allows access to example.org and any relevant subdomains, using a Suricata stateful policy item, to show the more flexible alternative route. (Suricata has lots more filters than just the SNI value, you can check for specific SSH versions, Kerberos CNAMEs, SNMP versions, etc. You can also add per-rule logging this way, which you can’t with the Domain List route).

These are added to the firewall policy, which also defines that if a rule doesn’t match a stateless rule group, or an established flow doesn’t match a stateful rule group, then it should be dropped.

New network diagram with more subnets and objects, but essentially, as described in the paragraphs above. Traffic flows from the instances either down towards the internet, or up towards the VPCe.

I’ve created a tag in my Github repo at this state, with the firewall, NAT Gateway and Internet Gateway, called step 2.

So far, so good… but why let our users even try to resolve the DNS name of a host they’re not permitted to reach. Let’s turn on DNS Firewalling too.

Turning on Route 53 DNS Firewall

You’ll notice that in the AWS Network Firewall, I didn’t let DNS out of the network. This is because, by default, AWS enables Route 53 as it’s local resolver. This lives on the “.2” address of the VPC, so in my example environment, this would be 198.18.0.2. Because it’s a local resolver, it won’t cross the Firewall exiting to the internet. You can also make Route 53 use your own DNS servers for specific DNS resolution (for example, if you’re running an Active Directory service inside your network).

Any Network Security Response team members you have working with you would appreciate it if you’d turn on DNS Logging at this point, so I’ll do it too!

In March 2021, AWS announced “Route 53 DNS Firewall”, which allow this DNS resolver to rewrite responses, or even to completely deny the existence of a DNS record. With this in mind, I’m going to add some custom DNS rules.

The first thing I want to do is to only permit traffic to my specific list of DNS names – example.org, example.com and their subdomains. DNS quite likes to terminate DNS names with a dot, signifying it shouldn’t try to resolve any higher up the chain, so I’m going to make a “permitted domains” DNS list;

example.com.
example.org.
*.example.com.
*.example.org.

Nice and simple! Except, this also stops me from being able to access the instances over SSM, so I’ll create a separate “VPCe” DNS list:

ssm.ex-ample-1.amazonaws.com.
*.ssm.ex-ample-1.amazonaws.com.
ssmmessages.ex-ample-1.amazonaws.com.
*.ssmmessages.ex-ample-1.amazonaws.com.
ec2messages.ex-ample-1.amazonaws.com.
*.ec2messages.ex-ample-1.amazonaws.com.

Next I create a “default deny” DNS list:

*.

And then build a DNS Firewall Policy which allows access to the “permitted domains”, “VPCe” lists, but blocks resolution of any “default deny” entries.

I’ve created a tag in my Github repo at this state, with the Route 53 DNS Firewall configured, called step 3.

In conclusion…

So there we have it. While the network is not “secure” (there’s still a few gaps here) it’s certainly MUCH more secure than it was, and it certainly would take a lot more work for anyone with malicious intent to get your content out.

Feel free to have a poke around, and leave comments below if this has helped or is of interest!

A green notice board in a country setting. It has leaflets and cards on it, although they are not readable in this image.

Create yourself a “Work Profile” to let others know how (and when) to contact you!

I recently got talking to a colleague about how people prefer to work and how they prefer to be contacted. It’s obvious in an office – if Bob isn’t there, then he’s not around, but when some of the team is remote, some are hybrid working, then it’s a lot harder.

There are three things I’ve found are really useful to know when trying to reach someone, and I’ve written this up in a simple page stored on our internal wiki;

  1. What’s your baseline – where do you live and when are you usually in the office.
  2. What are your usual working hours – how accurate is your calendar for non-meetings? do you have fixed meetings that happen every week, or a school run that you typically do? Do you need to be away from your desk at certain times for religious reasons?
  3. What’s the best way to contact you – if you’ve got a choice of tools (like Slack Hangouts or Google Meet) which would you rather use, and why. Is it best to drop in a 15 minute appointment, or just call you?

Once you’ve got these three items, in something everyone can access, add it to your directory profile, bio on slack, your email signature (for internal emails) and so on.

From here to the end of the post is a mildly sanitised version of my internally posted profile. I hope it’s useful to you!


Baseline

I am based in the UK, using the Europe/London time zone. I am remote based with very infrequent visits to the London office.

Typical Working Hours Patterns

I work from Monday to Friday, normally starting at X and finishing at X. During school term times, I will be out of the office between 3:00PM and 3:45PM to do school drop-off and pick ups. On Monday to Thursday, I am in a stand-up from X until Y. I will typically take my lunch break between X and Y. On Friday I have a weekly one-to-one which starts at X and finishes at Y. I will then take lunch until 1:00PM.

During school holidays, the start and end times will need to be a bit more flexible, and drop-off and pick-up slots will vary based on day-to-day activities.

I will keep my calendar up-to-date accordingly.

Contact Preference

I prefer being contacted by Slack mention or DM, however, I will often follow-up with a request for a DM chat or call, especially if I have been typing a lot during the day, or am trying to resolve an issue which I expect will require a lot of interaction.

I am happy to use Google Meetings, Slack Huddles, Microsoft Teams or Amazon Chime, all of which I have tested and work on my computer. I personally prefer to use Microsoft Teams because the presenter can allow participants to interact with the presenter’s screen or Slack Huddles because that allows participants to draw on the presenters screen, and because I can see more of your screen by default.


Featured image is “Notice board / Bulletin Board” by “Matthew Paul Argall” on Flickr and is released under a CC-BY license.

A stack of Jenga bricks falling over

A Quick Fix for “Backend initialization required” from Terragrunt

Today I ran terragrunt apply against a IaC directory, and got this response:

╷
│ Error: Backend initialization required: please run "terraform init"
│ 
│ Reason: Backend configuration block has changed
│ 
│ The "backend" is the interface that Terraform uses to store state,
│ perform operations, etc. If this message is showing up, it means that the
│ Terraform configuration you're using is using a custom configuration for
│ the Terraform backend.
│ 
│ Changes to backend configurations require reinitialization. This allows
│ Terraform to set up the new configuration, copy existing state, etc. Please
│ run
│ "terraform init" with either the "-reconfigure" or "-migrate-state" flags
│ to
│ use the current configuration.
│ 
│ If the change reason above is incorrect, please verify your configuration
│ hasn't changed and try again. At this point, no changes to your existing
│ configuration or state have been made.
╵
ERRO[0000] Hit multiple errors:
Hit multiple errors:
exit status 1 

But wait, I hear you say, Terragrunt runs terraform init for you… so what gives?

Well, in this case, the terragrunt.hcl has a dependency block, and one of those dependencies has not run properly, so… let’s fix it

Read the content of your terragrunt.hcl

terraform {
  source = "git@github.com:example/example-terraform-modules.git//module"
}

include {
  path = find_in_parent_folders()
}

dependency "dependency_1" {
  config_path = "${get_terragrunt_dir()}/../dependency"

  mock_outputs_allowed_terraform_commands = ["destroy", "force-unlock"]
  mock_outputs = {
    output_1 = []
    output_2 = ""
  }
}

inputs = {
  name      = "some_module"
  some_key  = dependency.dependency_1.outputs.output_1
  other_key = dependency.dependency_1.outputs.output_2
}

Right, so for some reason the dependency won’t run. Change into that directory, and run terragrunt apply --terragrunt-source-update. Hopefully, you’ll get something like this:

Initializing the backend...

Successfully configured the backend "example"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Reusing previous version of example/example from the dependency lock file
- Installing example/example v1.0.0...
- Installed example/example v1.0.0 (signed by Example)

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
example_module.this: Refreshing state... [id=an-example]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:
output_1 = {"some_key": "some_value"}
output_2 = "some_string"

You may find yourself having to traverse several different dependencies until you get to the one which is missing… and then it should work :)

Featured image is “Jenga” by “Mara Tr.” on Flickr and is released under a CC-BY license.

A medieval helm in gold on a bench at a museum

Fixing 403 errors from ghcr.io with #helm pull

At work, I’m using skaffold to deploy a helm chart which references a ghcr.io repository. Here’s the stanza I’m looking at:

apiVersion: skaffold/v3
kind: Config
deploy:
  helm:
    releases:
      - name: {package}
        remoteChart: oci://ghcr.io/{owner}/{package}

This is the first time I’ve tried to deploy this chart, and I kept getting this message:

No tags generated
Starting test...
Starting deploy...
Helm release {package} not installed. Installing...
Error: INSTALLATION FAILED: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden
deploying "{package}": install: exit status 1

I thought this might have been an issue with the skaffold file, so I tried running this directly with helm:

$ helm pull oci://ghcr.io/{owner}/{package}
Error: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden

Huh, that looks a bit familiar. I spent a little while checking to see whether this was something at the Kubernetes cluster, or if it was just me, and ended up finding this nugget (thanks to a steer from this post)

$ gh auth token | helm registry login ghcr.io -u {my_github_user} --password-stdin
Login Succeeded

And now it works!

elm pull oci://ghcr.io/{owner}/{package}
Pulled: ghcr.io/{owner}/{package}:1.2.3
Digest: sha256:decafbad1234567890aabbccddeeffdeadbeefbadbadbadbad12345678901234

Featured image is “helm” by “23 dingen voor musea” on Flickr and is released under a CC-BY-SA license.

A text dialogue from a web page showing "Uh oh. Something really just went wrong. Good thing we know about it and have our crack team of squirrels getting their nuts out of the system!"

How to capture stdout and stderr from a command in a shellscript without preventing piped processes from seeing them

I love the tee command – it captures stdout [1] and puts it in a file, while then returning that output to stdout for the next process in a pipe to consume, for example:

$ ls -l | tee /tmp/output
total 1
xrwxrwxrw 1 jonspriggs jonspriggs 0 Jul 27 11:16 build.sh
$ cat /tmp/output
total 1
xrwxrwxrw 1 jonspriggs jonspriggs 0 Jul 27 11:16 build.sh

But wait, why is that useful? Well, in a script, you don’t always want to see the content scrolling past, but in the case of a problem, you might need to catch up with the logs afterwards. Alternatively, you might do something like this:

if some_process | tee /tmp/output | grep -q "some text"
then
  echo "Found 'some text' - full output:"
  cat /tmp/output
fi

This works great for stdout but what about stderr [2]? In this case you could just do:

some_process 2>&1 | tee /tmp/output

But that mashes all of stdout and stderr into the same blob.

In my case, I want to capture all the output (stdout and stderr) of a given process into a file. Only stdout is forwarded to the next process, but I still wanted to have the option to see stderr as well during processing. Enter process substitution.

TEMP_DATA_PATH="$(mktemp -d)"
capture_out() {
  base="${TEMP_DATA_PATH}/${1}"
  mkdir "${base}"
  shift
  "$@" 2> >(tee "${base}/stderr" >&2) 1> >(tee "${base}/stdout")
}

With this, I run capture_out step-1 do_a_thing and then in /tmp/tmp.sometext/step-1/stdout and /tmp/tmp.sometext/step-1/stderr are the full outputs I need… but wait, I can also do:

$ capture_out step-1 do_a_thing | \
  capture_out step-2 process --the --thing && \
  capture_out step-3 echo "..." | capture_out step-4 profit
$ find /tmp/tmp.sometext -type f
/tmp/tmp.sometext/step-1/stdout
/tmp/tmp.sometext/step-1/stderr
/tmp/tmp.sometext/step-2/stdout
/tmp/tmp.sometext/step-2/stderr
/tmp/tmp.sometext/step-4/stdout
/tmp/tmp.sometext/step-4/stderr
/tmp/tmp.sometext/step-3/stderr
/tmp/tmp.sometext/step-3/stdout

Or

if capture_out has_an_error something-wrong | capture_out handler check_output
then
  echo "It all went great"
else
  echo "Process failure"
  echo "--Initial process"
  # Use wc -c to check the number of characters in the file
  if [ -e "${TEMP_DATA_PATH}/has_an_error/stdout"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/has_an_error/stdout")" ]
  then
    echo "----stdout:"
    cat "${TEMP_DATA_PATH}/has_an_error/stdout"
  fi
  if [ -e "${TEMP_DATA_PATH}/has_an_error/stderr"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/has_an_error/stderr")" ]
  then
    echo "----stderr:"
    cat "${TEMP_DATA_PATH}/has_an_error/stderr"
  fi
  echo "--Second stage"
  if [ -e "${TEMP_DATA_PATH}/handler/stdout"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/handler/stdout")" ]
  then
    echo "----stdout:"
    cat "${TEMP_DATA_PATH}/handler/stdout"
  fi
  if [ -e "${TEMP_DATA_PATH}/handler/stderr"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/handler/stderr")" ]
  then
    echo "----stderr:"
    cat "${TEMP_DATA_PATH}/handler/stderr"
  fi
fi

This has become part of my normal toolkit now for logging processes. Thanks bash!

Also, thanks to ChatGPT for helping me find this structure that I’d seen before, but couldn’t remember how to do it! (it almost got it right too! Remember kids, don’t *trust* what ChatGPT gives you, use it as a research starting point, test *that* against your own knowledge, test *that* against your environment and test *that* against expected error cases too! Copy & Paste is not the best idea with AI generated code!)

Footnotes

[1] stdout is the name of the normal output text we see in a shell, it’s also sometimes referred to as “file descriptor 1” or “fd1”. You can also output to &1 with >&1 which means “send to fd1”

[2] stderr is the name of the output in a shell when an error occurs. It isn’t caught by things like some_process > /dev/null which makes it useful when you don’t want to see output, just errors. Like stdout, it’s also referred to as “file descriptor 2” or “fd2” and you can output to &2 with >&2 if you want to send stdout to stderr.

Featured image is “WordPress Error” by “tara hunt” on Flickr and is released under a CC-BY-SA license.

A photo of a door with the focus on the handle which has a lock in the centre of the knob. The lock has a key in it, with a bunch of keys dangling from the central ring.

Using direnv with terraform, terragrunt, saml2aws, SOPS and AWS KMS

In my current project I am often working with Infrastructure as Code (IoC) in the form of Terraform and Terragrunt files. Before I joined the team a decision was made to use SOPS from Mozilla, and this is encrypted with an AWS KMS key. You can only access specific roles using the SAML2AWS credentials, and I won’t be explaining how to set that part up, as that is highly dependant on your SAML provider.

While much of our environment uses AWS, we do have a small presence hosted on-prem, using a hypervisor service. I’ll demonstrate this with Proxmox, as this is something that I also use personally :)

Firstly, make sure you have all of the above tools installed! For one stage, you’ll also require yq to be installed. Ensure you’ve got your shell hook setup for direnv as we’ll need this later too.

Late edit 2023-07-03: There was a bug in v0.22.0 of the terraform which didn’t recognise the environment variables prefixed PROXMOX_VE_ – a workaround by using TF_VAR_PROXMOX_VE and a variable "PROXMOX_VE_" {} block in the Terraform code was put in place for the inital publication of this post. The bug was fixed in 0.23.0 which this post now uses instead, and so as a result the use of TF_VAR_ prefixed variables was removed too.

Set up AWS Vault

AWS KMS

AWS Key Management Service (KMS) is a service which generates and makes available encryption keys, backed by the AWS service. There are *lots* of ways to cut that particular cake, but let’s do this a quick and easy way… terraform

variable "name" {
  default = "SOPS"
  type    = string
}
resource "aws_kms_key" "this" {
  tags                     = {
    Name : var.name,
    Owner : "Admins"
  }
  key_usage                = "ENCRYPT_DECRYPT"
  customer_master_key_spec = "SYMMETRIC_DEFAULT"
  deletion_window_in_days  = 30
  is_enabled               = true
  enable_key_rotation      = false
  policy                   = <<EOF
{
  "Version": "2012-10-17",
  "Id": "key-default-1",
  "Statement": [
    {
      "Sid": "Root Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${get_aws_account_id()}:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Estate Admin Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${get_aws_account_id()}:role/estateadmins"
      },
      "Action": [
        "kms:Describe*",
        "kms:List*",
        "kms:Get*",
        "kms:Encrypt*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_kms_alias" "this" {
  target_key_id = aws_kms_key.this.key_id
  name          = "alias/${var.name}"
}

output "key" {
  value = aws_kms_alias.this.arn
}

After running this, let’s assume that we get an output for the “key” value of:

arn:aws:kms:us-east-1:123456789012:alias/main

Setup Sops

In your terragrunt tree, create a file called .sops.yaml, which contains:

---
creation_rules:
  - kms: arn:aws:kms:us-east-1:123456789012:alias/main

And a file called secrets.enc.yaml which contains:

---
PROXMOX_VE_USERNAME: root@pam
PROXMOX_VE_PASSWORD: deadb33f@2023

Test that your KMS works by assuming your IAM role via SAML2AWS like this:

$ saml2aws login --skip-prompt --quiet
$ saml2aws exec -- sops --verbose --encrypt --in-place secrets.enc.yaml
[AWSKMS]	 INFO[0000] Encryption succeeded                          arn="arn:aws:kms:us-east-1:123456789012:alias/main"
[CMD]		 INFO[0000] File written successfully

Setup direnv

Outside your tree, in ~/.config/direnv/lib create a file called use_sops.sh (does not need to be chmod +x or chmod 755!) containing this:

# Based on https://github.com/direnv/direnv/wiki/Sops
use_sops() {
    local path=${1:-$PWD/secrets.enc.yaml}
    if [ -e "$path" ]
    then
        if grep -q -E '^sops:' "$path"
        then
            eval "$(sops --decrypt --output-type dotenv "$path" 2>/dev/null | direnv dotenv bash /dev/stdin || false)"
        else
            if [ -n "$(command -v yq)" ]
            then
                eval "$(yq eval --output-format props "$path" | direnv dotenv bash /dev/stdin)"
                export SOPS_WARNING="unencrypted $path"
            fi
        fi
    fi
    watch_file "$path"
}

There are two key lines here, the first of which is:

eval "$(sops -d --output-type dotenv "$path" 2>/dev/null | direnv dotenv bash /dev/stdin || false)"

This line asks sops to decrypt the secrets file, using the “dotenv” output type, however, the dotenv format looks like this:

some_key = "some value"

So, as a result, we then pass that value to direnv and ask it to rewrite it in the format it expects, which looks like this:

export some_key="some value"

The second key line is this:

eval "$(yq eval --output-format props "$path" | direnv dotenv bash /dev/stdin)"

This asks yq to parse the secrets file, using the “props” formatter, which results in lines just like the dotenv output we saw above.

However, because we used yq to parse the file, it means that we know this file isn’t encrypted, so we also add an extra export value:

export SOPS_WARNING="unencrypted $path"

This can be picked up as part of your shell prompt to put a warning in! Anyway… let’s move on.

Now that you have your reusable library file, we now configure the direnv file, .envrc for the root of your proxmox cluster:

use sops

Oh, ok, that was simple. You can add several files here if you wish, like this:

use sops file1.enc.yaml
use sops file2.enc.yml
use sops ~/.core_sops

But, we don’t need that right now!

Open your shell in that window, and you’ll get this warning:

direnv: error /path/to/demo/.envrc is blocked. Run `direnv allow` to approve its content

So, let’s do that!

$ direnv allow
direnv: loading /path/to/demo/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
$

So far, so good… but wait, you’ve authenticated to your SAML access to AWS. Let’s close that shell, and go back in again

$ cd /path/to/demo
direnv: loading /path/to/demo/.envrc
direnv: using sops
$

Ah, now we don’t have our values exported. That’s what we wanted!

What now?!

Configuring the details of the proxmox cluster

We have our .envrc file which provides our credentials (let’s pretend we’re using a shared set of credentials across all the boxes), but now we need to setup access to each of the boxes.

Let’s make our two cluster directories;

mkdir cluster_01
mkdir cluster_02

And in each of these clusters, we need to put an .envrc file with the right IP address in. This needs to check up the tree for any credentials we may have already loaded:

source_env "$(find_up ../.envrc)"
export PROXMOX_VE_ENDPOINT="https://192.0.2.1:8006" # Documentation IP address for the first cluster - change for the second cluster.

The first line works up the tree, looking for a parent .envrc file to inject, and then, with the second line, adds the Proxmox API endpoint to the end of that chain. When we run direnv allow (having logged back into our saml2aws session), we get this:

$ direnv allow
direnv: loading /path/to/demo/cluster_01/.envrc
direnv: loading /path/to/demo/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_ENDPOINT +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
$

Great, now we can setup the connection to the cluster in the terragrunt file!

Set up Terragrunt

In /path/to/demo/terragrunt.hcl put this:

remote_state {
  backend = "s3"
  config  = {
    encrypt                = true
    bucket                 = "example-inc-terraform-state"
    key                    = "${path_relative_to_include()}/terraform.tfstate"
    region                 = "us-east-1"
    dynamodb_table         = "example-inc-terraform-state-lock"
    skip_bucket_versioning = false
  }
}
generate "providers" {
  path      = "providers.tf"
  if_exists = "overwrite"
  contents  = <<EOF
terraform {
  required_providers {
    proxmox = {
      source = "bpg/proxmox"
      version = "0.23.0"
    }
  }
}

provider "proxmox" {
  insecure = true
}
EOF
}

Then in the cluster_01 directory, create a directory for the code you want to run (e.g. create a VLAN might be called “VLANs/30/“) and put in it this terragrunt.hcl

terraform {
  source = "${get_terragrunt_dir()}/../../../terraform-module-network//vlan"
  # source = "git@github.com:YourProject/terraform-module-network//vlan?ref=production"
}

include {
  path = find_in_parent_folders()
}

inputs = {
  vlan_tag    = 30
  description = "VLAN30"
}

This assumes you have a terraform directory called terraform-module-network/vlan in a particular place in your tree or even better, a module in your git repo, which uses the input values you’ve provided.

That double slash in the source line isn’t a typo either – this is the point in that tree that Terragrunt will copy into the directory to run terraform from too.

A quick note about includes and provider blocks

The other key thing is that the “include” block loads the values from the first matching terragrunt.hcl file in the parent directories, which in this case is the one which defined the providers block. You can’t include multiple different parent files, and you can’t have multiple generate blocks either.

Running it all together!

Now we have all our depending files, let’s run it!

user@host:~$ cd test
direnv: loading ~/test/.envrc
direnv: using sops
user@host:~/test$ saml2aws login --skip-prompt --quiet ; saml2aws exec -- bash
direnv: loading ~/test/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
user@host:~/test$ cd cluster_01/VLANs/30
direnv: loading ~/test/cluster_01/.envrc
direnv: loading ~/test/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_ENDPOINT +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
user@host:~/test/cluster_01/VLANs/30$ terragrunt apply
data.proxmox_virtual_environment_nodes.available_nodes: Reading...
data.proxmox_virtual_environment_nodes.available_nodes: Read complete after 0s [id=nodes]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # proxmox_virtual_environment_network_linux_bridge.this[0] will be created
  + resource "proxmox_virtual_environment_network_linux_bridge" "this" {
      + autostart  = true
      + comment    = "VLAN30"
      + id         = (known after apply)
      + mtu        = (known after apply)
      + name       = "vmbr30"
      + node_name  = "proxmox01"
      + ports      = [
          + "enp3s0.30",
        ]
      + vlan_aware = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
proxmox_virtual_environment_network_linux_bridge.this[0]: Creating...
proxmox_virtual_environment_network_linux_bridge.this[0]: Creation complete after 2s [id=proxmox01:vmbr30]
user@host:~/test/cluster_01/VLANs/30$

Winning!!

Featured image is “2018/365/1 Home is Where The Key Fits” by “Alan Levine” on Flickr and is released under a CC-0 license.

"Tickets" by "Becky Snyder" on Flickr

IP Address Management using PHPIPAM integrated with Keycloak for SAML2 Authentication

I’ve recently been working with a network estate that was a bit hard to get a handle on. It had grown organically, and was a bit tricky to allocate new network segments in. To fix this, I deployed PHPIPAM, which was super easy to setup and configure (I used the docker-compose file on the project’s docker hub page, and put it behind an NGINX server which was pre-configured with a LetsEncrypt TLS/HTTPS certificate).

PHPIPAM is a IP Address Management tool which is self-hostable. I started by setting up the “Sections” (which was the hosting environments the estate is using), and then setup the supernets and subnets in the “Subnets” section.

Already, it was much easier to understand the network topology, but now I needed to get others in to take a look at the outcome. The team I’m working with uses a slightly dated version of Keycloak to provide Single Sign-On. PHPIPAM will use SAML for authentication, which is one of the protocols that Keycloak offers. The documentation failed me a bit at this point, but fortunately a well placed ticket helped me move it along.

Setting up Keycloak

Here’s my walk through

  1. Go to “Realm Settings” in the sidebar and find the “SAML Identity Provider Metadata” (on my system it’s on the “General” tab but it might have changed position on your system). This will be an XML file, and (probably) the largest block of continuous text will be in a section marked “ds:X509Certificate” – copy this text, and you’ll need to use this as the “IDP X.509 public cert” in PHPIPAM.
  2. Go to “Clients” in the sidebar and click “Create”. If you want Keycloak to offer access to PHPIPAM as a link, the client ID needs to start “urn:” If you just want to use the PHPIPAM login option, give the client ID whatever you want it to be (I’ve seen some people putting in the URL of the server at this point). Either way, this needs to be unique. The client protocol is “saml” and the client SAML endpoint is the URL that you will be signing into on PHPIPAM – in my case https://phpipam.example.org/saml2/. It should look like this:

    Click Save to take you to the settings for this client.
  3. If you want Keycloak to offer a sign-in button, set the name of the button and description.

    Further down the page is “Root URL” set that to the SAML Endpoint (the one ending /saml2/ from before). Also set the “Valid Redirect URIs” to that too.

    Where it says “IDP Initiated SSO URL Name” put a string that will identify the client – I put phpipam, but it can be anything you want. This will populate a URL like this: https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam, which you’ll need as the “IDP Issuer”, “IDP Login URL” and “IDP Logout URL”. Put everything after the /auth/ in the box marked “Base URL”. It should look like this:

    Hit Save.
  4. Go to the “SAML Keys” tab. Copy the private key and certificate, these are needed as the “Authn X.509 signing” cert and cert key in PHPIPAM.
  5. Go to the “Mappers” tab. Create each of the following mappers;
    • A Role List mapper, with the name of “role list”, Role Attribute Name of “Role”, no friendly name, the SAML Attribute NameFormat set to “Basic” and Single Role Attribute set to on.
    • A User Attribute mapper, with the name, User Attribute, Friendly Name and SAML Attribute Name set to “email”, the SAML Attribute NameFormat set to “Basic” and Aggregate Attribute Values set to “off”.
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “display_name” and the SAML Attribute NameFormat set to “Basic”. The Script should be set to this single line: user.getFirstName() + ' ' + user.getLastName().
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “is_admin” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
is_admin = false;
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "phpipamadmins":
            is_admin = true;
            break;
    }
}
is_admin
  • Create one more mapper item:
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “groups” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
everyone_who_can_access_gets_read_only_access = false;
send_groups = "";
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "LDAP_GROUP_1":
            use_group = "IPAM_GROUP_1";
            break;
        case "LDAP_GROUP_2":
            use_group = "IPAM_GROUP_2";
            break;
    }
    if (use_group !== "") {
        if (send_groups !== "") {
          send_groups = send_groups + ","
        }
        send_groups = send_groups + use_group;
    }    
}
if (send_groups === "" && everyone_who_can_access_gets_read_only_access) {
    "Guests"
} else {
    send_groups
}

For context, the groups listed there, LDAP_GROUP_1 might be “Customer 1 Support Staff” or “ITSupport” or “Networks”, and the IPAM_GROUP_1 might be “Customer 1” or “WAN Links” or “DC Patching” – depending on the roles and functions of the teams. In my case they relate to other roles assigned to the staff member and the name of the role those people will perform in PHP IPAM. Likewise in the is_admin mapper, I’ve mentioned a group called “phpipamadmins” but this could be any relevant role that might grant someone admin access to PHPIPAM.

Late Update (2023-06-07): I’ve figured out how to enable modules now too. Create a Javascript mapper as per above, but named “modules” and have this script in it:

// Current modules as at 2023-06-07
// Some default values are set here.
noaccess       =  0;
readonly       =  1;
readwrite      =  2;
readwriteadmin =  3;
unsetperm      = -1;

var modules = {
    "*":       readonly,  "vlan":      unsetperm, "l2dom":    unsetperm,
    "devices": unsetperm, "racks":     unsetperm, "circuits": unsetperm,
    "nat":     unsetperm, "locations":  noaccess, "routing":  unsetperm,
    "pdns":    unsetperm, "customers": unsetperm
}

function updateModules(modules, new_value, list_of_modules) {
    for (var module in list_of_modules) {
        modules[module] = new_value;
    }
    return modules;
}

var GroupSet = user.getGroups();
for (var group in GroupSet) {
    switch (group.getName()) {
        case "LDAP_ROLE_3":
            modules = updateModules(modules, readwriteadmin, [
                'racks', 'devices', 'nat', 'routing'
            ]);
            break;
    }
}

var moduleList = '';

for (var key in modules) {
    if (modules.hasOwnProperty(key) && modules[key] !==-1) {
        if (moduleList !== '') {
            moduleList += ',';
        }
        moduleList += key + ':' + modules[key];
    }
}

moduleList;

OK, that’s Keycloak sorted. Let’s move on to PHPIPAM.

Setting up PHPIPAM

In the administration menu, select “Authentication Methods” and then “Create New” and select “Create new SAML2 authentication”.

In the description field, give it a relevant name, I chose SSO, but you could call it any SSO system name. Set “Enable JIT” to “on”, leave “Use advanced settings” as “off”. In Client ID put the Client ID you defined in Keycloak, probably starting urn: or https://. Leave “Strict mode” off. Next is the IDP Issuer, IDP Login URL and IDP Logout URL, which should all be set to the same URL – the “IDP Initiated SSO URL Name” from step 4 of the Keycloak side (that was set to something like https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam).

After that is the certificate section – first the IDP X.509 public cert that we got in step 1, then the “Sign Authn requests” should be set to “On” and the Authn X.509 signing cert and cert key are the private key and certificate we retrieved in step 5 above. Leave “SAML username attribute” and “SAML mapped user” blank and “Debugging” set to “Off”. It should look like this:

Hit save.

Next, any groups you specified in the groups mapper need to be defined. This is in Administration -> Groups. Create the group name and set a description.

Lastly, you need to configure the sections to define whigh groups have access. Each defined group gets given four radio buttons; “na” (no access), “ro” (read only), “rw” (read write) and “rwa” (read, write and administrate).

Try logging in. It should just work!

Debugging

If it doesn’t, and checking all of the above doesn’t help, I’ve tried adding some code into the PHP file in app/saml2/index.php, currently on line 149, above where it says:

if (empty($auth->getAttribute("display_name")[0])) {                                                                                    
    $Result->show("danger", _("Mandatory SAML JIT attribute missing")." : display_name (string)", true);
}

This code is:

$outfile = fopen("/tmp/log.out", "w");                                                           
fwrite($outfile, var_export($auth, true));                                                                     
fclose($outfile);

**REMEMBER THIS IS JUST FOR TESTING PURPOSES AND SHOULD BE REMOVED ASAP**

In here is an array called _attributes which will show you what has been returned from the Keycloak server when someone tries to log in. In my case, I got this:

   '_attributes' =>
  array (
    'groups' => 
    array (
      0 => 'PHPIPAM_GROUP_1',
    ),
    'is_admin' => 
    array (
      0 => 'false',
    ),
    'display_name' => 
    array (
      0 => 'Jon Spriggs',
    ),
    'email' => 
    array (
      0 => 'spriggsj@example.org',
    ),
  )

If you get something back here that isn’t what you expected, now at least you have a fighting chance of finding where that issue was! Good luck!!

Featured image is “Tickets” by “Becky Snyder” on Flickr and is released under a CC-BY license.

Quick tip: How to stop package installations from auto-starting server services with Debian based distributions (like Ubuntu)

I’m working on another toy project to understand a piece of software a little better, and to make it work, I needed to install dnsmasq inside an Ubuntu-based virtual machine. The problem with this is that Ubuntu already runs systemd-resolved to perform DNS lookups, and Debian likes to start server services as soon as it’s installed them. So how do we work around this? Well, actually, it’s pretty simple.

Thanks to this blog post from 2013, I found out that if you create an executable script called /usr/sbin/policy-rc.d with the content:

exit 101

This will stop all services in the dpkg/apt process from running on install, so I was able to do this:

echo 'exit 101' >> /usr/sbin/policy-rc.d
chmod +x /usr/sbin/policy-rc.d
apt update
apt install dnsmasq -y
systemctl disable --now systemd-resolved
# Futz with dnsmasq config
systemctl enable --now dnsmasq
dig example.com

Brilliant

(Don’t do this!) Note to self; Is your shutdown command causing an error for your packer build? Try this.

Don’t do this! Turns out I was doing this wrong. The below code is only needed if you’ve got things wrong, and you should instead be using keep_vm = "on_success". The more you know, eh?

If you’ve got a command in your packer script that looks like this:

provisioner "shell" {
  inline = ["shutdown -h now"]
}

Try running this instead:

provisioner "shell" {
  inline = [
    "echo Provisioning complete. Shutting down.",
    "(sleep 5 ; shutdown -h now) &"
  ]
}

This will force packer to execute a command which is pushed into the background, returning a return code (RC) of 0, which the system will interpret as a successful result. 5 seconds later the machine will shut itself down by itself.