A padlock and chain on a rusted gate

Using #NetworkFirewall and #Route53 #DNS #Firewall to protect a private subnet’s egress traffic in #AWS

I wrote this post in January 2023, and it’s been languishing in my Drafts folder since then. I’ve had a look through it, and I can’t see any glaring reasons why I didn’t publish it so… it’s published… Enjoy 😁

If you’ve ever built a private subnet in AWS, you know it can be a bit tricky to get updates from the Internet – you end up having a NAT gateway or a self-managed proxy, and you can never be 100% certain that the egress traffic isn’t going somewhere you don’t want it to.

In this case, I wanted to ensure that outbound HTTPS traffic was being blocked if the SNI didn’t explicitly show the DNS name I wanted to permit through, and also, I only wanted specific DNS names to resolve. To do this, I used AWS Network Firewall and Route 53 DNS Firewall.

I’ve written this blog post, and followed along with this, I’ve created a set of terraform files to represent the steps I’ve taken.

The Setup

Let’s start this story from a simple VPC with three private subnets for my compute resources, and three private subnets for the VPC Endpoints for Systems Manager (SSM).

Here’s our network diagram, with the three subnets containing the VPC Endpoints at the top, and the three instances at the bottom.

I’ve created a tag in my Github repo at this “pre-changes” state, called step 1.

At this point, none of those instances can reach anything outside the network, with the exception of the SSM environment. So, we can’t install any packages, we can’t get data from outside the network or anything similar.

Getting Protected Internet Access

In order to get internet access, we need to add 4 things;

  1. An internet gateway
  2. A NAT gateway in each AZ
  3. Which needs three new subnets
  4. And three Elastic IP addresses
  5. Route tables in all the subnets

To clarify, a NAT gateway acts like a DSL router. It hides the source IP address of outbound traffic behind a single, public IP address (using an Elastic IP from AWS), and routes any return traffic back to wherever that traffic came from. To reduce inter-AZ data transfer rates, I’m putting one in each AZ, but if there’s not a lot of outbound traffic or the outbound traffic isn’t critical enough to require resiliency, this could all be centralised to a single NAT gateway. To put a NAT gateway in each AZ, you need a subnet in each AZ, and to get out to the internet (by whatever means you have), you need an internet gateway and route tables for how to reach the NAT and internet gateways.

We also should probably add, at this point, four additional things.

  1. The Network Firewall
  2. Subnets for the Firewall interfaces
  3. Stateless Policy
  4. Stateful Policy

The Network Firewall acts like a single appliance, and uses a Gateway Load Balancer to present an interface into each of the availability zones. It has a stateless policy (which is very fast, but needs to address both inbound and outbound traffic flows) to do IP and Port based filtering (referred to as “Layer 3” filtering) and then specific traffic can be passed into a stateful policy (which is slower) to do packet and flow inspection.

In this case, I only want outbound HTTPS traffic to be passed, so my stateless rule group is quite simple;

  • VPC range on any port → Internet on TCP/443; pass to Stateful rule groups
  • Internet on TCP/443 → VPC range on any port; pass to Stateful rule groups

I have two stateful rule groups, one is defined to just allow access out to example.com and any relevant subdomains, using the “Domain List” stateful policy item. The other allows access to example.org and any relevant subdomains, using a Suricata stateful policy item, to show the more flexible alternative route. (Suricata has lots more filters than just the SNI value, you can check for specific SSH versions, Kerberos CNAMEs, SNMP versions, etc. You can also add per-rule logging this way, which you can’t with the Domain List route).

These are added to the firewall policy, which also defines that if a rule doesn’t match a stateless rule group, or an established flow doesn’t match a stateful rule group, then it should be dropped.

New network diagram with more subnets and objects, but essentially, as described in the paragraphs above. Traffic flows from the instances either down towards the internet, or up towards the VPCe.

I’ve created a tag in my Github repo at this state, with the firewall, NAT Gateway and Internet Gateway, called step 2.

So far, so good… but why let our users even try to resolve the DNS name of a host they’re not permitted to reach. Let’s turn on DNS Firewalling too.

Turning on Route 53 DNS Firewall

You’ll notice that in the AWS Network Firewall, I didn’t let DNS out of the network. This is because, by default, AWS enables Route 53 as it’s local resolver. This lives on the “.2” address of the VPC, so in my example environment, this would be 198.18.0.2. Because it’s a local resolver, it won’t cross the Firewall exiting to the internet. You can also make Route 53 use your own DNS servers for specific DNS resolution (for example, if you’re running an Active Directory service inside your network).

Any Network Security Response team members you have working with you would appreciate it if you’d turn on DNS Logging at this point, so I’ll do it too!

In March 2021, AWS announced “Route 53 DNS Firewall”, which allow this DNS resolver to rewrite responses, or even to completely deny the existence of a DNS record. With this in mind, I’m going to add some custom DNS rules.

The first thing I want to do is to only permit traffic to my specific list of DNS names – example.org, example.com and their subdomains. DNS quite likes to terminate DNS names with a dot, signifying it shouldn’t try to resolve any higher up the chain, so I’m going to make a “permitted domains” DNS list;

example.com.
example.org.
*.example.com.
*.example.org.

Nice and simple! Except, this also stops me from being able to access the instances over SSM, so I’ll create a separate “VPCe” DNS list:

ssm.ex-ample-1.amazonaws.com.
*.ssm.ex-ample-1.amazonaws.com.
ssmmessages.ex-ample-1.amazonaws.com.
*.ssmmessages.ex-ample-1.amazonaws.com.
ec2messages.ex-ample-1.amazonaws.com.
*.ec2messages.ex-ample-1.amazonaws.com.

Next I create a “default deny” DNS list:

*.

And then build a DNS Firewall Policy which allows access to the “permitted domains”, “VPCe” lists, but blocks resolution of any “default deny” entries.

I’ve created a tag in my Github repo at this state, with the Route 53 DNS Firewall configured, called step 3.

In conclusion…

So there we have it. While the network is not “secure” (there’s still a few gaps here) it’s certainly MUCH more secure than it was, and it certainly would take a lot more work for anyone with malicious intent to get your content out.

Feel free to have a poke around, and leave comments below if this has helped or is of interest!

"Fishing fleet" by "Nomad Tales" on Flickr

Using Terraform to select multiple Instance Types for an Autoscaling Group in AWS

Tale as old as time, the compute instance type you want to use in AWS is highly contested (or worse yet, not as available in every availability zone in your region)! You plead with your TAM or AM “Please let us have more of that instance type” only to be told “well, we can put in a request, but… haven’t you thought about using a range of instance types”?

And yes, I’ve been on both sides of that conversation, sadly.

The commented terraform

# This is your legacy instance_type variable. Ideally we'd have
# a warning we could raise at this point, telling you not to use
# this variable, but... it's not ready yet.
variable "instance_type" {
  description = "The legacy single-instance size, e.g. t3.nano. Please migrate to instance_types ASAP. If you specify instance_types, this value will be ignored."
  type        = string
  default     = null
}

# This is your new instance_types value. If you don't already have
# some sort of legacy use of the instance_type variable, then don't
# bother with that variable or the locals block below!
variable "instance_types" {
  description = "A list of instance sizes, e.g. [t2.nano, t3.nano] and so on."
  type        = list(string)
  default     = null
}

# Use only this locals block (and the value further down) if you
# have some legacy autoscaling groups which might use individual
# instance_type sizes.
locals {
  # This means if var.instance_types is not defined, then use it,
  # otherwise create a new list with the single instance_type
  # value in it!
  instance_types = var.instance_types != null ? var.instance_types : [ var.instance_type ]
}

resource "aws_launch_template" "this" {
  # The prefix for the launch template name
  # default "my_autoscaling_group"
  name_prefix = var.name

  # The AMI to use. Calculated outside this process.
  image_id = data.aws_ami.this.id

  # This block ensures that any new instances are created
  # before deleting old ones.
  lifecycle {
    create_before_destroy = true
  }

  # This block defines the disk size of the root disk in GB
  block_device_mappings {
    device_name = data.aws_ami.centos.root_device_name
    ebs {
      volume_size = var.disksize # default "10"
      volume_type = var.disktype # default "gp2"
    }
  }

  # Security Groups to assign to the instance. Alternatively
  # create a network_interfaces{} block with your
  # security_groups = [ var.security_group ] in it.
  vpc_security_group_ids = [ var.security_group ]

  # Any on-boot customizations to make.
  user_data = var.userdata
}

resource "aws_autoscaling_group" "this" {
  # The name of the Autoscaling Group in the Web UI
  # default "my_autoscaling_group"
  name = var.name

  # The list of subnets into which the ASG should be deployed.
  vpc_zone_identifier = var.private_subnets
  # The smallest and largest number of instances the ASG should scale between
  min_size            = var.min_rep
  max_size            = var.max_rep

  mixed_instances_policy {
    launch_template {
      # Use this template to launch all the instances
      launch_template_specification {
        launch_template_id = aws_launch_template.this.id
        version            = "$Latest"
      }

      # This loop can either use the calculated value "local.instance_types"
      # or, if you have no legacy use of this module, remove the locals{}
      # and the variable "instance_type" {} block above, and replace the
      # for_each and instance_type values (defined as "local.instance_types")
      # with "var.instance_types".
      #
      # Loop through the whole list of instance types and create a
      # set of "override" values (the values are defined in the content{}
      # block).
      dynamic "override" {
        for_each = local.instance_types
        content {
          instance_type = local.instance_types[override.key]
        }
      }
    }

    instances_distribution {
      # If we "enable spot", then make it 100% spot.
      on_demand_percentage_above_base_capacity = var.enable_spot ? 0 : 100
      spot_allocation_strategy                 = var.spot_allocation_strategy
      spot_max_price                           = "" # Empty string is "on-demand price"
    }
  }
}

So what is all this then?

This is two Terraform resources; an aws_launch_template and an aws_autoscaling_group. These two resources define what should be launched by the autoscaling group, and then the settings for the autoscaling group.

You will need to work out what instance types you want to use (e.g. “must have 16 cores and 32 GB RAM, have an x86_64 architecture and allow up to 15 Gigabit/second throughput”)

When might you use this pattern?

If you have been seeing messages like “There is no Spot capacity available that matches your request.” or “We currently do not have sufficient <size> capacity in the Availability Zone you requested.” then you need to consider diversifying the fleet that you’re requesting for your autoscaling group. To do that, you need to specify more instance types. To achieve this, I’d use the above code to replace (something like) one of the code samples below.

If you previously have had something like this:

resource "aws_launch_configuration" "this" {
  iam_instance_profile        = var.instance_profile_name
  image_id                    = data.aws_ami.this.id
  instance_type               = var.instance_type
  name_prefix                 = var.name
  security_groups             = [ var.security_group ]
  user_data_base64            = var.userdata
  spot_price                  = var.spot_price

  root_block_device {
    volume_size = var.disksize
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "this" {
  capacity_rebalance   = false
  launch_configuration = aws_launch_configuration.this.id
  max_size             = var.max_rep
  min_size             = var.min_rep
  name                 = var.name
  vpc_zone_identifier  = var.private_subnets
}

Or this:

resource "aws_launch_template" "this" {
  lifecycle {
    create_before_destroy = true
  }

  block_device_mappings {
    device_name = data.aws_ami.this.root_device_name
    ebs {
      volume_size = var.disksize
    }
  }

  iam_instance_profile {
    name = var.instance_profile_name
  }

  network_interfaces {
    associate_public_ip_address = true
    security_groups             = local.node_security_groups
  }

  image_id      = data.aws_ami.this.id
  name_prefix   = var.name
  instance_type = var.instance_type
  user_data     = var.userdata

  instance_market_options {
    market_type = "spot"
    spot_options {
      spot_instance_type = "one-time"
    }
  }

  metadata_options {
    http_tokens                 = var.imds == 1 ? "optional" : "required"
    http_endpoint               = "enabled"
    http_put_response_hop_limit = 1
  }
}

resource "aws_autoscaling_group" "this" {
  name                = var.name
  vpc_zone_identifier = var.private_subnets
  min_size            = var.min_rep
  max_size            = var.max_rep

  launch_template {
    id      = aws_launch_template.this.id
    version = "$Latest"
  }
}

Then this new method is a much better idea :) Even more so if you had two launch templates to support spot and non-spot instance types!

Hat-tip to former colleague Paul Moran who opened my eyes to defining your fleet of variable instance types, as well as to my former customer (deliberately unnamed) and my current employer who both stumbled into the same documentation issue. Without Paul’s advice with my prior customer’s issue I’d never have known what I was looking for this time around!

Featured image is “Fishing fleet” by “Nomad Tales” on Flickr and is released under a CC-BY-SA license.

An open padlock with a key inserted into it, on a printed circuit board

Pulling container images from private registries (including Docker Hub) with a Kubernetes Kubelet Credential Provider

At work last week, I finally solved an issue by writing some code, and I wanted to explain why I wrote it.

At it’s core, Kubernetes is an orchestrator which runs “Container Images”, which are structured filesystem snapshots, taken after running individual commands against a base system. These container images are stored in a container registry, and the most well known of these is the Docker registry, known as Docker Hub.

A registry can be public, meaning you don’t need credentials to get any images from it, or private. Some also offer a mixed-mode where you can make a certain number of requests without requiring authentication, but if you need more than that amount of requests, you need to provide credentials.

During the build-out of a new cluster, I discovered that the ECR (Elastic Container Registry) from AWS requires a new type of authentication – the Kubelet Credential Provider, which required the following changes:

  1. In /etc/sysconfig/kubelet you provide these two switches;
    --image-credential-provider-bin-dir /usr/local/bin/image-credential-provider and --image-credential-provider-config /etc/kubernetes/image-credential-provider-config.json.
  2. In /etc/kubernetes/image-credential-provider-config.json you provide a list of registries and the credential provider to use, which looks like this:
{
  "apiVersion": "kubelet.config.k8s.io/v1",
  "kind": "CredentialProviderConfig",
  "providers": [
    {
      "name": "binary-credential-provider-name",
      "matchImages": [
        "example.org",
        "registry.*.example.org",
        "*.registry.*.example.org"
      ],
      "defaultCacheDuration": "12h",
      "apiVersion": "credentialprovider.kubelet.k8s.io/v1"
    }
  ]
}
  1. Downloading and placing the credential provider binary into the /usr/local/bin/image-credential-provider path.

The ECR Credential Provider has it’s own Github repostitory, and it made me think – we’ve been using the “old” method of storing credentials using the containerd configuration file, which is now marked as deprecated – but this means that any changes to these credentials would require a restart of the containerd service (which apparently used to have a big impact on the platform), but this new ECR provider doesn’t.

I decided to write my own Credential Provider, following the documentation for the Kubelet Credential Provider API and I wrote it in Python – a language I’m trying to get better in! (Pull requests, feature requests, etc. are all welcome!)

I will confess I made heavy use of ChatGPT to get a steer on certain aspects of how to write the code, but all the code is generic and there’s nothing proprietary in this code.

Using the Generic Credential Provider

  1. Follow the steps above – change your Kubernetes environment to ensure you have the kubelet configuration changes and the JSON credential provider configuration put in the relevant parts of your tree. Set the “matchImages” values to include the registry in question – for dockerhub, I’d probably use ["docker.io", "*.docker.io"]
  2. Download the generic-credential-provider script from Github, put it in the right path in your worker node’s filesystem (if you followed my notes above it’ll be in /usr/local/bin/image-credential-provider/generic-credential-provider but this is *your* system we’re talking about, not mine! You know your build better than I do!)
  3. Create the /etc/kubernetes/registries directory – this can be changed by editing the script to use a new path, and for testing purposes there is a flag --credroot /some/new/path but that doesn’t work for the kubelet configuration file.
  4. Create a credential file, for example, /etc/kubernetes/registries/example.org.json which contains this string: {"username":"token_username","password":"token_password"}. [Yes, it’s a plaintext credential. Make sure it’s scoped for only image downloads. No, this still isn’t very good. But how else would you do this?! (Pull requests are welcomed!)] You can add a duration value into that JSON dictionary, to change the default timeout from 5 minutes. Technically, the default is actually set in /etc/kubernetes/image-credential-provider-config.json but I wanted to have my own per-credential, and as these values are coming from the filesystem, and therefore has very little performance liability, I didn’t want to have a large delay in the cache.
  5. Test your credential! This code is what I used:
echo '{
  "apiVersion": "credentialprovider.kubelet.k8s.io/v1",
  "kind": "CredentialProviderRequest",
  "image": "your.registry.example.org/org/image:version"
}' | /usr/local/bin/image-credential-provider/generic-credential-provider

which should return:

'{"kind": "CredentialProviderResponse", "apiVersion": "credentialprovider.kubelet.k8s.io/v1", "cacheKeyType": "Registry", "cacheDuration": "0h5m0s", "auth": {"your.registry.example.com": {"username": "token_username", "password": "token_password"}}}'

You should also see an entry in your syslog service showing a line that says “Credential request fulfilled for your.registry.example.com” and if you pass it a check that it fails, it should say “Failed to fulfill credential request for failure.example.org“.

If this helped you, please consider buying me a drink to say thanks!

Featured image is “Padlock on computer parts” by “Marco Verch Professional Photographer” on Flickr and is released under a CC-BY license.

A green notice board in a country setting. It has leaflets and cards on it, although they are not readable in this image.

Create yourself a “Work Profile” to let others know how (and when) to contact you!

I recently got talking to a colleague about how people prefer to work and how they prefer to be contacted. It’s obvious in an office – if Bob isn’t there, then he’s not around, but when some of the team is remote, some are hybrid working, then it’s a lot harder.

There are three things I’ve found are really useful to know when trying to reach someone, and I’ve written this up in a simple page stored on our internal wiki;

  1. What’s your baseline – where do you live and when are you usually in the office.
  2. What are your usual working hours – how accurate is your calendar for non-meetings? do you have fixed meetings that happen every week, or a school run that you typically do? Do you need to be away from your desk at certain times for religious reasons?
  3. What’s the best way to contact you – if you’ve got a choice of tools (like Slack Hangouts or Google Meet) which would you rather use, and why. Is it best to drop in a 15 minute appointment, or just call you?

Once you’ve got these three items, in something everyone can access, add it to your directory profile, bio on slack, your email signature (for internal emails) and so on.

From here to the end of the post is a mildly sanitised version of my internally posted profile. I hope it’s useful to you!


Baseline

I am based in the UK, using the Europe/London time zone. I am remote based with very infrequent visits to the London office.

Typical Working Hours Patterns

I work from Monday to Friday, normally starting at X and finishing at X. During school term times, I will be out of the office between 3:00PM and 3:45PM to do school drop-off and pick ups. On Monday to Thursday, I am in a stand-up from X until Y. I will typically take my lunch break between X and Y. On Friday I have a weekly one-to-one which starts at X and finishes at Y. I will then take lunch until 1:00PM.

During school holidays, the start and end times will need to be a bit more flexible, and drop-off and pick-up slots will vary based on day-to-day activities.

I will keep my calendar up-to-date accordingly.

Contact Preference

I prefer being contacted by Slack mention or DM, however, I will often follow-up with a request for a DM chat or call, especially if I have been typing a lot during the day, or am trying to resolve an issue which I expect will require a lot of interaction.

I am happy to use Google Meetings, Slack Huddles, Microsoft Teams or Amazon Chime, all of which I have tested and work on my computer. I personally prefer to use Microsoft Teams because the presenter can allow participants to interact with the presenter’s screen or Slack Huddles because that allows participants to draw on the presenters screen, and because I can see more of your screen by default.


Featured image is “Notice board / Bulletin Board” by “Matthew Paul Argall” on Flickr and is released under a CC-BY license.

A stack of Jenga bricks falling over

A Quick Fix for “Backend initialization required” from Terragrunt

Today I ran terragrunt apply against a IaC directory, and got this response:

╷
│ Error: Backend initialization required: please run "terraform init"
│ 
│ Reason: Backend configuration block has changed
│ 
│ The "backend" is the interface that Terraform uses to store state,
│ perform operations, etc. If this message is showing up, it means that the
│ Terraform configuration you're using is using a custom configuration for
│ the Terraform backend.
│ 
│ Changes to backend configurations require reinitialization. This allows
│ Terraform to set up the new configuration, copy existing state, etc. Please
│ run
│ "terraform init" with either the "-reconfigure" or "-migrate-state" flags
│ to
│ use the current configuration.
│ 
│ If the change reason above is incorrect, please verify your configuration
│ hasn't changed and try again. At this point, no changes to your existing
│ configuration or state have been made.
╵
ERRO[0000] Hit multiple errors:
Hit multiple errors:
exit status 1 

But wait, I hear you say, Terragrunt runs terraform init for you… so what gives?

Well, in this case, the terragrunt.hcl has a dependency block, and one of those dependencies has not run properly, so… let’s fix it

Read the content of your terragrunt.hcl

terraform {
  source = "git@github.com:example/example-terraform-modules.git//module"
}

include {
  path = find_in_parent_folders()
}

dependency "dependency_1" {
  config_path = "${get_terragrunt_dir()}/../dependency"

  mock_outputs_allowed_terraform_commands = ["destroy", "force-unlock"]
  mock_outputs = {
    output_1 = []
    output_2 = ""
  }
}

inputs = {
  name      = "some_module"
  some_key  = dependency.dependency_1.outputs.output_1
  other_key = dependency.dependency_1.outputs.output_2
}

Right, so for some reason the dependency won’t run. Change into that directory, and run terragrunt apply --terragrunt-source-update. Hopefully, you’ll get something like this:

Initializing the backend...

Successfully configured the backend "example"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Reusing previous version of example/example from the dependency lock file
- Installing example/example v1.0.0...
- Installed example/example v1.0.0 (signed by Example)

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
example_module.this: Refreshing state... [id=an-example]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:
output_1 = {"some_key": "some_value"}
output_2 = "some_string"

You may find yourself having to traverse several different dependencies until you get to the one which is missing… and then it should work :)

Featured image is “Jenga” by “Mara Tr.” on Flickr and is released under a CC-BY license.

A medieval helm in gold on a bench at a museum

Fixing 403 errors from ghcr.io with #helm pull

At work, I’m using skaffold to deploy a helm chart which references a ghcr.io repository. Here’s the stanza I’m looking at:

apiVersion: skaffold/v3
kind: Config
deploy:
  helm:
    releases:
      - name: {package}
        remoteChart: oci://ghcr.io/{owner}/{package}

This is the first time I’ve tried to deploy this chart, and I kept getting this message:

No tags generated
Starting test...
Starting deploy...
Helm release {package} not installed. Installing...
Error: INSTALLATION FAILED: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden
deploying "{package}": install: exit status 1

I thought this might have been an issue with the skaffold file, so I tried running this directly with helm:

$ helm pull oci://ghcr.io/{owner}/{package}
Error: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ghcr.io/token?scope=repository%3A{owner}%2F{package}%3Apull&scope=repository%3Auser%2Fimage%3Apull&service=ghcr.io: 403 Forbidden

Huh, that looks a bit familiar. I spent a little while checking to see whether this was something at the Kubernetes cluster, or if it was just me, and ended up finding this nugget (thanks to a steer from this post)

$ gh auth token | helm registry login ghcr.io -u {my_github_user} --password-stdin
Login Succeeded

And now it works!

elm pull oci://ghcr.io/{owner}/{package}
Pulled: ghcr.io/{owner}/{package}:1.2.3
Digest: sha256:decafbad1234567890aabbccddeeffdeadbeefbadbadbadbad12345678901234

Featured image is “helm” by “23 dingen voor musea” on Flickr and is released under a CC-BY-SA license.

A text dialogue from a web page showing "Uh oh. Something really just went wrong. Good thing we know about it and have our crack team of squirrels getting their nuts out of the system!"

How to capture stdout and stderr from a command in a shellscript without preventing piped processes from seeing them

I love the tee command – it captures stdout [1] and puts it in a file, while then returning that output to stdout for the next process in a pipe to consume, for example:

$ ls -l | tee /tmp/output
total 1
xrwxrwxrw 1 jonspriggs jonspriggs 0 Jul 27 11:16 build.sh
$ cat /tmp/output
total 1
xrwxrwxrw 1 jonspriggs jonspriggs 0 Jul 27 11:16 build.sh

But wait, why is that useful? Well, in a script, you don’t always want to see the content scrolling past, but in the case of a problem, you might need to catch up with the logs afterwards. Alternatively, you might do something like this:

if some_process | tee /tmp/output | grep -q "some text"
then
  echo "Found 'some text' - full output:"
  cat /tmp/output
fi

This works great for stdout but what about stderr [2]? In this case you could just do:

some_process 2>&1 | tee /tmp/output

But that mashes all of stdout and stderr into the same blob.

In my case, I want to capture all the output (stdout and stderr) of a given process into a file. Only stdout is forwarded to the next process, but I still wanted to have the option to see stderr as well during processing. Enter process substitution.

TEMP_DATA_PATH="$(mktemp -d)"
capture_out() {
  base="${TEMP_DATA_PATH}/${1}"
  mkdir "${base}"
  shift
  "$@" 2> >(tee "${base}/stderr" >&2) 1> >(tee "${base}/stdout")
}

With this, I run capture_out step-1 do_a_thing and then in /tmp/tmp.sometext/step-1/stdout and /tmp/tmp.sometext/step-1/stderr are the full outputs I need… but wait, I can also do:

$ capture_out step-1 do_a_thing | \
  capture_out step-2 process --the --thing && \
  capture_out step-3 echo "..." | capture_out step-4 profit
$ find /tmp/tmp.sometext -type f
/tmp/tmp.sometext/step-1/stdout
/tmp/tmp.sometext/step-1/stderr
/tmp/tmp.sometext/step-2/stdout
/tmp/tmp.sometext/step-2/stderr
/tmp/tmp.sometext/step-4/stdout
/tmp/tmp.sometext/step-4/stderr
/tmp/tmp.sometext/step-3/stderr
/tmp/tmp.sometext/step-3/stdout

Or

if capture_out has_an_error something-wrong | capture_out handler check_output
then
  echo "It all went great"
else
  echo "Process failure"
  echo "--Initial process"
  # Use wc -c to check the number of characters in the file
  if [ -e "${TEMP_DATA_PATH}/has_an_error/stdout"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/has_an_error/stdout")" ]
  then
    echo "----stdout:"
    cat "${TEMP_DATA_PATH}/has_an_error/stdout"
  fi
  if [ -e "${TEMP_DATA_PATH}/has_an_error/stderr"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/has_an_error/stderr")" ]
  then
    echo "----stderr:"
    cat "${TEMP_DATA_PATH}/has_an_error/stderr"
  fi
  echo "--Second stage"
  if [ -e "${TEMP_DATA_PATH}/handler/stdout"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/handler/stdout")" ]
  then
    echo "----stdout:"
    cat "${TEMP_DATA_PATH}/handler/stdout"
  fi
  if [ -e "${TEMP_DATA_PATH}/handler/stderr"] && [ 0 -ne "$(wc -c "${TEMP_DATA_PATH}/handler/stderr")" ]
  then
    echo "----stderr:"
    cat "${TEMP_DATA_PATH}/handler/stderr"
  fi
fi

This has become part of my normal toolkit now for logging processes. Thanks bash!

Also, thanks to ChatGPT for helping me find this structure that I’d seen before, but couldn’t remember how to do it! (it almost got it right too! Remember kids, don’t *trust* what ChatGPT gives you, use it as a research starting point, test *that* against your own knowledge, test *that* against your environment and test *that* against expected error cases too! Copy & Paste is not the best idea with AI generated code!)

Footnotes

[1] stdout is the name of the normal output text we see in a shell, it’s also sometimes referred to as “file descriptor 1” or “fd1”. You can also output to &1 with >&1 which means “send to fd1”

[2] stderr is the name of the output in a shell when an error occurs. It isn’t caught by things like some_process > /dev/null which makes it useful when you don’t want to see output, just errors. Like stdout, it’s also referred to as “file descriptor 2” or “fd2” and you can output to &2 with >&2 if you want to send stdout to stderr.

Featured image is “WordPress Error” by “tara hunt” on Flickr and is released under a CC-BY-SA license.

A series of gold blocks, each crossed, and one of the lower blocks has engraved "Eduardo Nery 1995-1998 Aleluia - Secla"

Deploying the latest build of a template (machine) image with #Xen #Orchestrator

In my current role we are using Packer to build images on a Xen Orchestrator environment, use a CI/CD system to install that image into both a Xen Template and an AWS AMI, and then we use Terraform to use that image across our estate. The images we build with Packer have this stanza in it:

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}
variable "artifact_name" {
  default = "SomeLinux-version.iso"
}
source "xenserver-iso" "this" {
  vm_name = "${var.artifact_name}-${local.timestamp}"
  # more config below
}

As a result, the built images include a timestamp.

When we use the AMI in Terraform, we can locate it with this code:

variable "ami_name" {
  default = "SomeLinux-version.iso-"
}

data "aws_ami" "this" {
  most_recent = true

  filter {
    name   = "name"
    values = [var.ami_name]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = [var.owner]
}

But, because Xen doesn’t track when a template is created, instead I needed to do something different. Enter get_xoa_template.sh.

#!/bin/bash
trap cleanup SIGINT SIGTERM EXIT
exit_message=""
set_exit=0
fail() {
[ -n "$1" ] && echo "$1" >&2
[ "$2" -gt 0 ] && exit $2
}
cleanup() {
trap – SIGINT SIGTERM EXIT
[ "$UNREGISTER" -eq 1 ] && [ "$STATE" == "signed_in" ] && xo-cli –unregister 2>&1
[ -n "$exit_message" ] && fail "$exit_message" $set_exit
}
log_debug() {
[ -n "$DEBUG" ] && echo "$1" >> "$DEBUG"
}
parse_params() {
UNREGISTER=1
DEBUG=""
while :; do
case "${1-}" in
-h | –help)
echo "usage: get_xoa_template.sh –template SomeTemplatePrefix" >&2
echo "" >&2
echo "Options:" >&2
echo " -t | –template MyTemplatePrefix = The template to look for (required)" >&2
echo " -s | –server ws://192.0.2.1 = Sign into Xen Orchestrator on 192.0.2.1" >&2
echo " [Default to using XOA_URL environment variable]" >&2
echo " -u | –user username@example.org = Sign into Xen Orchestrator using this username" >&2
echo " [Default to using XOA_USER environment variable]" >&2
echo " -p | –password hunter2 = Sign into Xen Orchestrator using this password" >&2
echo " [Default to using XOA_PASSWORD environment variable]" >&2
echo " -l | –pool MyXenPool1 = Use this pool when looking for the template." >&2
echo " [Omit to ignore]" >&2
echo " -x | –no-unregister = Don't log out from the XOA server once connected." >&2
echo " -d | –debug = Log output to /tmp/xocli_output" >&2
echo " -d | –debug /path/to/debug = Log output to the specified path" >&2
echo " –debug=/path/to/debug = Log output to the specified path" >&2
exit 255
;;
-s | –server)
XOA_URL="${2-}"
shift
;;
-u | –user)
XOA_USER="${2-}"
shift
;;
-p | –password)
XOA_PASSWORD="${2-}"
shift
;;
-l | –pool)
XOA_POOL="${2-}"
shift
;;
-t | –template)
TEMPLATE="${2-}"
shift
;;
-x | –no-unregister)
UNREGISTER=0
;;
-d | –debug)
DEBUG=/tmp/xocli_output
[ -n "${2-}" ] && [ "$(echo "${2-}" | cut -c1)" != "-" ] && DEBUG="${2-}" && shift
;;
–debug=*)
DEBUG="$(echo $1 | sed -E -e 's/^[^=]+=//')"
;;
*)
break
;;
esac
shift
done
}
sign_in() {
[ -z "$XOA_URL" ] || [ -z "$XOA_USER" ] || [ -z "$XOA_PASSWORD" ] && fail "Missing sign-in details" 1
log_debug "Logging in"
if [ -n "$DEBUG" ]
then
xo-cli –register –au "$XOA_URL" "$XOA_USER" "$XOA_PASSWORD" 2>&1 | tee -a "$DEBUG" | grep -q 'Successfully' || fail "Login failed" 2
else
xo-cli –register –au "$XOA_URL" "$XOA_USER" "$XOA_PASSWORD" 2>&1 | grep -q 'Successfully' || fail "Login failed" 2
fi
STATE="signed_in"
}
get_pool() {
[ -z "$XOA_POOL" ] && log_debug "No Pool" && return 0
log_debug "Getting Pool ID"
if [ -n "$DEBUG" ]
then
POOL_ID="\$pool=$(xo-cli –list-objects type=pool | jq -c -r ".[] | select(.name_label | match(\"${XOA_POOL}\")) | .uuid" | sort | tail -n 1 | tee -a "$DEBUG")"
else
POOL_ID="\$pool=$(xo-cli –list-objects type=pool | jq -c -r ".[] | select(.name_label | match(\"${XOA_POOL}\")) | .uuid" | sort | tail -n 1)"
fi
[ "$POOL_ID" == "\$pool=" ] && fail "Pool provided but no ID received" 3
}
get_template() {
log_debug "Getting template"
if [ -n "$DEBUG" ]
then
TEMPLATE_IS="$(xo-cli –list-objects type=VM-template "${POOL_ID-}" | jq -c ".[] | select(.name_label | match(\"${TEMPLATE}\")) | .name_label" | sort | tail -n 1 | tee -a "$DEBUG")"
else
TEMPLATE_IS="$(xo-cli –list-objects type=VM-template "${POOL_ID-}" | jq -c ".[] | select(.name_label | match(\"${TEMPLATE}\")) | .name_label" | sort | tail -n 1)"
fi
[ -z "$TEMPLATE_IS" ] && fail "Could not match this template" 4
if [ -n "$DEBUG" ]
then
echo "{\"is\": ${TEMPLATE_IS}}" | tee -a "$DEBUG"
else
echo "{\"is\": ${TEMPLATE_IS}}"
fi
}
[ -n "$(command -v xo-cli)" ] || fail "xo-cli is missing, and is a required dependency for this script. Please install it; \`sudo npm -g install xo-cli\`" 5
parse_params "$@"
if [ -n "$DEBUG" ]
then
rm -f "$DEBUG"
log_debug "Invoked: $(date)"
log_debug "Template: $TEMPLATE"
log_debug "Pool: $XOA_POOL"
fi
sign_in
get_pool
get_template

This script is invoked from your terraform like this:

variable "template_name" {
  default     = "SomeLinux-version.iso-"
  description = "A regex, partial or full string to match in the template name"
}

variable "poolname" {
  default = "MyPool"
}

data "external" "get_xoa_template" {
  program = [
    "/bin/bash", "${path.module}/get_xoa_template.sh",
    "--template", var.template_name,
    "--pool", var.poolname
  ]
}

data "xenorchestra_pool" "pool" {
  name_label = var.poolname
}

data "xenorchestra_template" "template" {
  name_label = data.external.get_xoa_template.result.is
  pool_id    = data.xenorchestra_pool.pool.id
}

And that’s how you do it. Oh, and if you need to pin to a specific version? Change the template_name value from the partial or regex version to the full version, like this:

variable "template_name" {
  # This assumes your image was minted at midnight on 1970-01-01
  default     = "SomeLinux-version.iso-19700101000000"
}

Featured image is “Barcelos and Braga-18” by “Graeme Churchard” on Flickr and is released under a CC-BY license.

A photo of a door with the focus on the handle which has a lock in the centre of the knob. The lock has a key in it, with a bunch of keys dangling from the central ring.

Using direnv with terraform, terragrunt, saml2aws, SOPS and AWS KMS

In my current project I am often working with Infrastructure as Code (IoC) in the form of Terraform and Terragrunt files. Before I joined the team a decision was made to use SOPS from Mozilla, and this is encrypted with an AWS KMS key. You can only access specific roles using the SAML2AWS credentials, and I won’t be explaining how to set that part up, as that is highly dependant on your SAML provider.

While much of our environment uses AWS, we do have a small presence hosted on-prem, using a hypervisor service. I’ll demonstrate this with Proxmox, as this is something that I also use personally :)

Firstly, make sure you have all of the above tools installed! For one stage, you’ll also require yq to be installed. Ensure you’ve got your shell hook setup for direnv as we’ll need this later too.

Late edit 2023-07-03: There was a bug in v0.22.0 of the terraform which didn’t recognise the environment variables prefixed PROXMOX_VE_ – a workaround by using TF_VAR_PROXMOX_VE and a variable "PROXMOX_VE_" {} block in the Terraform code was put in place for the inital publication of this post. The bug was fixed in 0.23.0 which this post now uses instead, and so as a result the use of TF_VAR_ prefixed variables was removed too.

Set up AWS Vault

AWS KMS

AWS Key Management Service (KMS) is a service which generates and makes available encryption keys, backed by the AWS service. There are *lots* of ways to cut that particular cake, but let’s do this a quick and easy way… terraform

variable "name" {
  default = "SOPS"
  type    = string
}
resource "aws_kms_key" "this" {
  tags                     = {
    Name : var.name,
    Owner : "Admins"
  }
  key_usage                = "ENCRYPT_DECRYPT"
  customer_master_key_spec = "SYMMETRIC_DEFAULT"
  deletion_window_in_days  = 30
  is_enabled               = true
  enable_key_rotation      = false
  policy                   = <<EOF
{
  "Version": "2012-10-17",
  "Id": "key-default-1",
  "Statement": [
    {
      "Sid": "Root Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${get_aws_account_id()}:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "Estate Admin Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${get_aws_account_id()}:role/estateadmins"
      },
      "Action": [
        "kms:Describe*",
        "kms:List*",
        "kms:Get*",
        "kms:Encrypt*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_kms_alias" "this" {
  target_key_id = aws_kms_key.this.key_id
  name          = "alias/${var.name}"
}

output "key" {
  value = aws_kms_alias.this.arn
}

After running this, let’s assume that we get an output for the “key” value of:

arn:aws:kms:us-east-1:123456789012:alias/main

Setup Sops

In your terragrunt tree, create a file called .sops.yaml, which contains:

---
creation_rules:
  - kms: arn:aws:kms:us-east-1:123456789012:alias/main

And a file called secrets.enc.yaml which contains:

---
PROXMOX_VE_USERNAME: root@pam
PROXMOX_VE_PASSWORD: deadb33f@2023

Test that your KMS works by assuming your IAM role via SAML2AWS like this:

$ saml2aws login --skip-prompt --quiet
$ saml2aws exec -- sops --verbose --encrypt --in-place secrets.enc.yaml
[AWSKMS]	 INFO[0000] Encryption succeeded                          arn="arn:aws:kms:us-east-1:123456789012:alias/main"
[CMD]		 INFO[0000] File written successfully

Setup direnv

Outside your tree, in ~/.config/direnv/lib create a file called use_sops.sh (does not need to be chmod +x or chmod 755!) containing this:

# Based on https://github.com/direnv/direnv/wiki/Sops
use_sops() {
    local path=${1:-$PWD/secrets.enc.yaml}
    if [ -e "$path" ]
    then
        if grep -q -E '^sops:' "$path"
        then
            eval "$(sops --decrypt --output-type dotenv "$path" 2>/dev/null | direnv dotenv bash /dev/stdin || false)"
        else
            if [ -n "$(command -v yq)" ]
            then
                eval "$(yq eval --output-format props "$path" | direnv dotenv bash /dev/stdin)"
                export SOPS_WARNING="unencrypted $path"
            fi
        fi
    fi
    watch_file "$path"
}

There are two key lines here, the first of which is:

eval "$(sops -d --output-type dotenv "$path" 2>/dev/null | direnv dotenv bash /dev/stdin || false)"

This line asks sops to decrypt the secrets file, using the “dotenv” output type, however, the dotenv format looks like this:

some_key = "some value"

So, as a result, we then pass that value to direnv and ask it to rewrite it in the format it expects, which looks like this:

export some_key="some value"

The second key line is this:

eval "$(yq eval --output-format props "$path" | direnv dotenv bash /dev/stdin)"

This asks yq to parse the secrets file, using the “props” formatter, which results in lines just like the dotenv output we saw above.

However, because we used yq to parse the file, it means that we know this file isn’t encrypted, so we also add an extra export value:

export SOPS_WARNING="unencrypted $path"

This can be picked up as part of your shell prompt to put a warning in! Anyway… let’s move on.

Now that you have your reusable library file, we now configure the direnv file, .envrc for the root of your proxmox cluster:

use sops

Oh, ok, that was simple. You can add several files here if you wish, like this:

use sops file1.enc.yaml
use sops file2.enc.yml
use sops ~/.core_sops

But, we don’t need that right now!

Open your shell in that window, and you’ll get this warning:

direnv: error /path/to/demo/.envrc is blocked. Run `direnv allow` to approve its content

So, let’s do that!

$ direnv allow
direnv: loading /path/to/demo/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
$

So far, so good… but wait, you’ve authenticated to your SAML access to AWS. Let’s close that shell, and go back in again

$ cd /path/to/demo
direnv: loading /path/to/demo/.envrc
direnv: using sops
$

Ah, now we don’t have our values exported. That’s what we wanted!

What now?!

Configuring the details of the proxmox cluster

We have our .envrc file which provides our credentials (let’s pretend we’re using a shared set of credentials across all the boxes), but now we need to setup access to each of the boxes.

Let’s make our two cluster directories;

mkdir cluster_01
mkdir cluster_02

And in each of these clusters, we need to put an .envrc file with the right IP address in. This needs to check up the tree for any credentials we may have already loaded:

source_env "$(find_up ../.envrc)"
export PROXMOX_VE_ENDPOINT="https://192.0.2.1:8006" # Documentation IP address for the first cluster - change for the second cluster.

The first line works up the tree, looking for a parent .envrc file to inject, and then, with the second line, adds the Proxmox API endpoint to the end of that chain. When we run direnv allow (having logged back into our saml2aws session), we get this:

$ direnv allow
direnv: loading /path/to/demo/cluster_01/.envrc
direnv: loading /path/to/demo/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_ENDPOINT +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
$

Great, now we can setup the connection to the cluster in the terragrunt file!

Set up Terragrunt

In /path/to/demo/terragrunt.hcl put this:

remote_state {
  backend = "s3"
  config  = {
    encrypt                = true
    bucket                 = "example-inc-terraform-state"
    key                    = "${path_relative_to_include()}/terraform.tfstate"
    region                 = "us-east-1"
    dynamodb_table         = "example-inc-terraform-state-lock"
    skip_bucket_versioning = false
  }
}
generate "providers" {
  path      = "providers.tf"
  if_exists = "overwrite"
  contents  = <<EOF
terraform {
  required_providers {
    proxmox = {
      source = "bpg/proxmox"
      version = "0.23.0"
    }
  }
}

provider "proxmox" {
  insecure = true
}
EOF
}

Then in the cluster_01 directory, create a directory for the code you want to run (e.g. create a VLAN might be called “VLANs/30/“) and put in it this terragrunt.hcl

terraform {
  source = "${get_terragrunt_dir()}/../../../terraform-module-network//vlan"
  # source = "git@github.com:YourProject/terraform-module-network//vlan?ref=production"
}

include {
  path = find_in_parent_folders()
}

inputs = {
  vlan_tag    = 30
  description = "VLAN30"
}

This assumes you have a terraform directory called terraform-module-network/vlan in a particular place in your tree or even better, a module in your git repo, which uses the input values you’ve provided.

That double slash in the source line isn’t a typo either – this is the point in that tree that Terragrunt will copy into the directory to run terraform from too.

A quick note about includes and provider blocks

The other key thing is that the “include” block loads the values from the first matching terragrunt.hcl file in the parent directories, which in this case is the one which defined the providers block. You can’t include multiple different parent files, and you can’t have multiple generate blocks either.

Running it all together!

Now we have all our depending files, let’s run it!

user@host:~$ cd test
direnv: loading ~/test/.envrc
direnv: using sops
user@host:~/test$ saml2aws login --skip-prompt --quiet ; saml2aws exec -- bash
direnv: loading ~/test/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
user@host:~/test$ cd cluster_01/VLANs/30
direnv: loading ~/test/cluster_01/.envrc
direnv: loading ~/test/.envrc
direnv: using sops
direnv: export +PROXMOX_VE_ENDPOINT +PROXMOX_VE_USERNAME +PROXMOX_VE_PASSWORD
user@host:~/test/cluster_01/VLANs/30$ terragrunt apply
data.proxmox_virtual_environment_nodes.available_nodes: Reading...
data.proxmox_virtual_environment_nodes.available_nodes: Read complete after 0s [id=nodes]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # proxmox_virtual_environment_network_linux_bridge.this[0] will be created
  + resource "proxmox_virtual_environment_network_linux_bridge" "this" {
      + autostart  = true
      + comment    = "VLAN30"
      + id         = (known after apply)
      + mtu        = (known after apply)
      + name       = "vmbr30"
      + node_name  = "proxmox01"
      + ports      = [
          + "enp3s0.30",
        ]
      + vlan_aware = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
proxmox_virtual_environment_network_linux_bridge.this[0]: Creating...
proxmox_virtual_environment_network_linux_bridge.this[0]: Creation complete after 2s [id=proxmox01:vmbr30]
user@host:~/test/cluster_01/VLANs/30$

Winning!!

Featured image is “2018/365/1 Home is Where The Key Fits” by “Alan Levine” on Flickr and is released under a CC-0 license.

"Tickets" by "Becky Snyder" on Flickr

IP Address Management using PHPIPAM integrated with Keycloak for SAML2 Authentication

I’ve recently been working with a network estate that was a bit hard to get a handle on. It had grown organically, and was a bit tricky to allocate new network segments in. To fix this, I deployed PHPIPAM, which was super easy to setup and configure (I used the docker-compose file on the project’s docker hub page, and put it behind an NGINX server which was pre-configured with a LetsEncrypt TLS/HTTPS certificate).

PHPIPAM is a IP Address Management tool which is self-hostable. I started by setting up the “Sections” (which was the hosting environments the estate is using), and then setup the supernets and subnets in the “Subnets” section.

Already, it was much easier to understand the network topology, but now I needed to get others in to take a look at the outcome. The team I’m working with uses a slightly dated version of Keycloak to provide Single Sign-On. PHPIPAM will use SAML for authentication, which is one of the protocols that Keycloak offers. The documentation failed me a bit at this point, but fortunately a well placed ticket helped me move it along.

Setting up Keycloak

Here’s my walk through

  1. Go to “Realm Settings” in the sidebar and find the “SAML Identity Provider Metadata” (on my system it’s on the “General” tab but it might have changed position on your system). This will be an XML file, and (probably) the largest block of continuous text will be in a section marked “ds:X509Certificate” – copy this text, and you’ll need to use this as the “IDP X.509 public cert” in PHPIPAM.
  2. Go to “Clients” in the sidebar and click “Create”. If you want Keycloak to offer access to PHPIPAM as a link, the client ID needs to start “urn:” If you just want to use the PHPIPAM login option, give the client ID whatever you want it to be (I’ve seen some people putting in the URL of the server at this point). Either way, this needs to be unique. The client protocol is “saml” and the client SAML endpoint is the URL that you will be signing into on PHPIPAM – in my case https://phpipam.example.org/saml2/. It should look like this:

    Click Save to take you to the settings for this client.
  3. If you want Keycloak to offer a sign-in button, set the name of the button and description.

    Further down the page is “Root URL” set that to the SAML Endpoint (the one ending /saml2/ from before). Also set the “Valid Redirect URIs” to that too.

    Where it says “IDP Initiated SSO URL Name” put a string that will identify the client – I put phpipam, but it can be anything you want. This will populate a URL like this: https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam, which you’ll need as the “IDP Issuer”, “IDP Login URL” and “IDP Logout URL”. Put everything after the /auth/ in the box marked “Base URL”. It should look like this:

    Hit Save.
  4. Go to the “SAML Keys” tab. Copy the private key and certificate, these are needed as the “Authn X.509 signing” cert and cert key in PHPIPAM.
  5. Go to the “Mappers” tab. Create each of the following mappers;
    • A Role List mapper, with the name of “role list”, Role Attribute Name of “Role”, no friendly name, the SAML Attribute NameFormat set to “Basic” and Single Role Attribute set to on.
    • A User Attribute mapper, with the name, User Attribute, Friendly Name and SAML Attribute Name set to “email”, the SAML Attribute NameFormat set to “Basic” and Aggregate Attribute Values set to “off”.
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “display_name” and the SAML Attribute NameFormat set to “Basic”. The Script should be set to this single line: user.getFirstName() + ' ' + user.getLastName().
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “is_admin” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
is_admin = false;
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "phpipamadmins":
            is_admin = true;
            break;
    }
}
is_admin
  • Create one more mapper item:
    • A Javascript Mapper, with the name, Friendly Name and SAML Attribute Name set to “groups” and the SAML Attribute NameFormat set to “Basic”.

      The script should be as follows:
everyone_who_can_access_gets_read_only_access = false;
send_groups = "";
var GroupSet = user.getGroups();
for each (var group in GroupSet) {
    use_group = ""
    switch (group.getName()) {
        case "LDAP_GROUP_1":
            use_group = "IPAM_GROUP_1";
            break;
        case "LDAP_GROUP_2":
            use_group = "IPAM_GROUP_2";
            break;
    }
    if (use_group !== "") {
        if (send_groups !== "") {
          send_groups = send_groups + ","
        }
        send_groups = send_groups + use_group;
    }    
}
if (send_groups === "" && everyone_who_can_access_gets_read_only_access) {
    "Guests"
} else {
    send_groups
}

For context, the groups listed there, LDAP_GROUP_1 might be “Customer 1 Support Staff” or “ITSupport” or “Networks”, and the IPAM_GROUP_1 might be “Customer 1” or “WAN Links” or “DC Patching” – depending on the roles and functions of the teams. In my case they relate to other roles assigned to the staff member and the name of the role those people will perform in PHP IPAM. Likewise in the is_admin mapper, I’ve mentioned a group called “phpipamadmins” but this could be any relevant role that might grant someone admin access to PHPIPAM.

Late Update (2023-06-07): I’ve figured out how to enable modules now too. Create a Javascript mapper as per above, but named “modules” and have this script in it:

// Current modules as at 2023-06-07
// Some default values are set here.
noaccess       =  0;
readonly       =  1;
readwrite      =  2;
readwriteadmin =  3;
unsetperm      = -1;

var modules = {
    "*":       readonly,  "vlan":      unsetperm, "l2dom":    unsetperm,
    "devices": unsetperm, "racks":     unsetperm, "circuits": unsetperm,
    "nat":     unsetperm, "locations":  noaccess, "routing":  unsetperm,
    "pdns":    unsetperm, "customers": unsetperm
}

function updateModules(modules, new_value, list_of_modules) {
    for (var module in list_of_modules) {
        modules[module] = new_value;
    }
    return modules;
}

var GroupSet = user.getGroups();
for (var group in GroupSet) {
    switch (group.getName()) {
        case "LDAP_ROLE_3":
            modules = updateModules(modules, readwriteadmin, [
                'racks', 'devices', 'nat', 'routing'
            ]);
            break;
    }
}

var moduleList = '';

for (var key in modules) {
    if (modules.hasOwnProperty(key) && modules[key] !==-1) {
        if (moduleList !== '') {
            moduleList += ',';
        }
        moduleList += key + ':' + modules[key];
    }
}

moduleList;

OK, that’s Keycloak sorted. Let’s move on to PHPIPAM.

Setting up PHPIPAM

In the administration menu, select “Authentication Methods” and then “Create New” and select “Create new SAML2 authentication”.

In the description field, give it a relevant name, I chose SSO, but you could call it any SSO system name. Set “Enable JIT” to “on”, leave “Use advanced settings” as “off”. In Client ID put the Client ID you defined in Keycloak, probably starting urn: or https://. Leave “Strict mode” off. Next is the IDP Issuer, IDP Login URL and IDP Logout URL, which should all be set to the same URL – the “IDP Initiated SSO URL Name” from step 4 of the Keycloak side (that was set to something like https://keycloak.example.org/auth/realms/yourrealm/protocol/saml/clients/phpipam).

After that is the certificate section – first the IDP X.509 public cert that we got in step 1, then the “Sign Authn requests” should be set to “On” and the Authn X.509 signing cert and cert key are the private key and certificate we retrieved in step 5 above. Leave “SAML username attribute” and “SAML mapped user” blank and “Debugging” set to “Off”. It should look like this:

Hit save.

Next, any groups you specified in the groups mapper need to be defined. This is in Administration -> Groups. Create the group name and set a description.

Lastly, you need to configure the sections to define whigh groups have access. Each defined group gets given four radio buttons; “na” (no access), “ro” (read only), “rw” (read write) and “rwa” (read, write and administrate).

Try logging in. It should just work!

Debugging

If it doesn’t, and checking all of the above doesn’t help, I’ve tried adding some code into the PHP file in app/saml2/index.php, currently on line 149, above where it says:

if (empty($auth->getAttribute("display_name")[0])) {                                                                                    
    $Result->show("danger", _("Mandatory SAML JIT attribute missing")." : display_name (string)", true);
}

This code is:

$outfile = fopen("/tmp/log.out", "w");                                                           
fwrite($outfile, var_export($auth, true));                                                                     
fclose($outfile);

**REMEMBER THIS IS JUST FOR TESTING PURPOSES AND SHOULD BE REMOVED ASAP**

In here is an array called _attributes which will show you what has been returned from the Keycloak server when someone tries to log in. In my case, I got this:

   '_attributes' =>
  array (
    'groups' => 
    array (
      0 => 'PHPIPAM_GROUP_1',
    ),
    'is_admin' => 
    array (
      0 => 'false',
    ),
    'display_name' => 
    array (
      0 => 'Jon Spriggs',
    ),
    'email' => 
    array (
      0 => 'spriggsj@example.org',
    ),
  )

If you get something back here that isn’t what you expected, now at least you have a fighting chance of finding where that issue was! Good luck!!

Featured image is “Tickets” by “Becky Snyder” on Flickr and is released under a CC-BY license.