"DeBugged!" by "Randy Heinitz" on Flickr

Debugging Bash Scripts

Yesterday I was struggling a bit with a bash script I was writing. I needed to stop it from running flat out through every loop, and I wanted to see what certain values were at key points in the script.

Yes, I know I could use “read” to pause the script and “echo” to print values, but that leaves a lot of mess that I need to clean up afterwards… so I went looking for something else I could try.

You can have extensive debug statements, which are enabled with a --debug flag or environment variable… but again, messy.

You can run bash -x ./myscript.sh – and, indeed, I do frequently do that… but that shows you the commands which were run at each point, not what the outcome is of each of those commands.

If my problem had been a syntax one, I could have installed shellcheck, which is basically a linter for Bash and other shell scripting languages, but no, I needed more detail about what was happening during the processing.

Instead, I wanted something like xdebug (from PHP)… and I found Bash Debug for VSCode. This doesn’t even need you to install any scripts or services on the target machine – it’s interactive, and has a “watch” section, where you either highlight and right-click a variable expression (like $somevar or ${somevar}) to see when it changes. You can see where in the “callstack” you are and see what values are registered by that script.

Shellcheck shows me problems in my code…
But Bash Debug helps me to find out what values are at specific points in the code.

All in all, a worthy addition to my toolbelt!

Featured image is “DeBugged!” by “Randy Heinitz” on Flickr and is released under a CC-BY license.

"raspberry pie" by "stu_spivack" on Flickr

Post-Config of a RaspberryPi Zero W as an OTG-USB Gadget for off-device computing


A few months ago, I was working on a personal project that needed a separate, offline linux environment. I tried various different schemes to run what I was doing in the confines of my laptop and I couldn’t make what I was working on actually achieve my goals. So… I bought a Raspberry Pi Zero W and a “Solderless Zero Dongle“, with the intention of running Docker containers on it… unfortunately, while Docker runs on a Pi Zero, it’s really hard to find base images for the ARMv6/armhf platform that the Pi Zero W… so I put it back in the drawer, and left it there.

Roll forwards a month or so, and I was doing some experiments with Nebula, and only had an old Chromebook to test it on… except, I couldn’t install the Nebula client for Linux on there, and the Android client wouldn’t give me some features I wanted… so I broke out that old Pi Zero W again…

Now, while the tests with Nebula I was working towards will be documented later, I found that a lot of the documentation about using a Raspberry Pi Zero as a USB gadget were rough and unexplained. So, this post breaks down much of the content of what I found, what I tried, and what did and didn’t work.

Late Edit 2021-06-04: I spotted some typos around providing specific DHCP options for interfaces, based on work I’m doing elsewhere with this script. I’ve updated these values accordingly. I’ve also created a specific branch for this revision.

Late Edit 2021-06-06: I’ve noticed this document doesn’t cover IPv6 at all right now. I started to perform some tweaks to cover IPv6, but as my ISP has decided not to bother with IPv6, and won’t support Hurricane Electric‘s Tunnelbroker system, I can’t test any of it, without building out an IPv6 test environment… maybe soon, eh?

Read More
"Observatories Combine to Crack Open the Crab Nebula" by "NASA Goddard Space Flight Center" on Flickr

Nebula Offline Certificate Management with a Raspberry Pi using Bash

I have been playing again, recently, with Nebula, an Open Source Peer-to-Peer VPN product which boasts speed, simplicity and in-built firewalling. Although I only have a few nodes to play with (my VPS, my NAS, my home server and my laptop), I still wanted to simplify, for me, the process of onboarding devices. So, naturally, I spent a few evenings writing a bash script that helps me to automate the creation of my Nebula nodes.

Nebula Certificates

Nebula have implemented their own certificate structure. It’s similar to an x509 “TLS Certificate” (like you’d use to access an HTTPS website, or to establish an OpenVPN connection), but has a few custom fields.

The result of typing “nebula-cert print -path ca.crt” to print the custom fields

In this context, I’ve created a nebula Certificate Authority (CA), using this command:

nebula-cert ca -name nebula.example.org -ips,, -groups Mobile,Workstation,Server,Lighthouse,db

So, what does this do?

Well, it creates the certificate and private key files, storing the name for the CA as “nebula.example.org” (there’s a reason for this!) and limiting the subnets and groups (like AWS or Azure Tags) the CA can issue certificates with.

Here, I’ve limited the CA to only issue IP addresses in the RFC5737 “Documentation” ranges, which are, and, but this can easily be expanded to or lots of individual subnets (I tested, and proved 1026 separate subnets which worked fine).

Groups, in Nebula parlance, are building blocks of the Security product, and can act like source or destination filters. In this case, I limited the CA to only being allowed to issue certificates with the groups of “Mobile”, “Workstation”, “Server”, “Lighthouse” and “db”.

As this certificate authority requires no internet access, and only enough access to read and write files, I have created my Nebula CA server on a separate Micro SD card to use with a Raspberry Pi device, and this is used only to generate a new CA certificate each 6 months (in theory, I’ve not done this part yet!), and to sign keys for all the client devices as they come on board.

I copy the ca.crt file to my target machines, and then move on to creating my client certificates

Client Certificates

When you generate key materials for Public Key Cryptographic activities (like this one), you’re supposed to generate the private key on the source device, and the private key should never leave the device on which it’s generated. Nebula allows you to do this, using the nebula-cert command again. That command looks like this:

nebula-cert keygen -out-key host.key -out-pub host.pub

If you notice, there’s a key difference at this point between Nebula’s key signing routine, and an x509 TLS style certificate, you see, this stage would be called a “Certificate Signing Request” or CSR in TLS parlance, and it usually would specify the record details for the certificate (normally things like “region”, “organisational unit”, “subject name” and so on) before sending it to the CA for signing (marking it as trusted).

In the Nebula world, you create a key, and send the public part of that (in this case, “host.pub” but it can have any name you like) to the CA, at which point the CA defines what IP addresses it will have, what groups it is in, and so on, so let’s do that.

nebula-cert sign -ca-crt ca.crt -ca-key ca.key -in-pub host.pub -out-crt host.crt -groups Workstation -ip -name host.nebula.example.org

Let’s pick apart these options, shall we? The first four flags “-ca-crt“, “-ca-key“, “-in-pub” and “-out-crt” all refer to the CSR process – it’s reading the CA certificate and key, as well as the public part of the keypair created for the process, and then defines what the output certificate will be called. The next switch, -groups, identifies the tags we’re assigning to this node, then (the mandatory flag) -ip sets the IP address allocated to the node. Note that the certificate is using one of the valid group names, and has been allocated a valid IP address address in the ranges defined above. If you provide a value for the certificate which isn’t valid, you’ll get a warning message.

nebula-cert issues a warning when signing a certificate that tries to specify a value outside the constraints of the CA

In the above screenshot, I’ve bypassed the key generation and asked for the CA to sign with values which don’t match the constraints.

The last part is the name of the certificate. This is relevant because Nebula has a DNS service which can resolve the Nebula IPs to the hostnames assigned on the Certificates.

Anyway… Now that we know how to generate certificates the “hard” way, let’s make life a bit easier for you. I wrote a little script – Nebula Cert Maker, also known as certmaker.sh.


So, what does certmaker.sh do that is special?

  1. It auto-assigns an IP address, based on the MD5SUM of the FQDN of the node. It uses (by default) the first CIDR mask (the IP range, written as something like specified in the CA certificate. If multiple CIDR masks are specified in the certificate, there’s a flag you can use to select which one to use. You can override this to get a specific increment from the network address.
  2. It takes the provided name (perhaps webserver) and adds, as a suffix, the name of the CA Certificate (like nebula.example.org) to the short name, to make the FQDN. This means that you don’t need to run a DNS service for support staff to access machines (perhaps you’ll have webserver1.nebula.example.org and webserver2.nebula.example.org as well as database.nebula.example.org).
  3. Three “standard” roles have been defined for groups, these are “Server”, “Workstation” and “Lighthouse” [1] (the latter because you can configure Lighthouses to be the DNS servers mentioned in step 2.) Additional groups can also be specified on the command line.

[1] A lighthouse, in Nebula terms, is a publically accessible node, either with a static IP, or a DNS name which resolves to a known host, that can help other nodes find each other. Because all the nodes connect to it (or a couple of “it”s) this is a prime place to run the DNS server, as, well, it knows where all the nodes are!

So, given these three benefits, let’s see these in a script. This script is (at least currently) at the end of the README file in that repo.

# Create the CA
mkdir -p /tmp/nebula_ca
nebula-cert ca -out-crt /tmp/nebula_ca/ca.crt -out-key /tmp/nebula_ca/ca.key -ips, -name nebula.example.org

# First lighthouse, lighthouse1.nebula.example.org -, group "Lighthouse"
./certmaker.sh --cert_path /tmp/nebula_ca --name lighthouse1 --ip 1 --lighthouse

# Second lighthouse, lighthouse2.nebula.example.org -, group "Lighthouse"
./certmaker.sh -c /tmp/nebula_ca -n lighthouse2 -i 2 -l

# First webserver, webserver1.nebula.example.org -, groups "Server" and "web"
./certmaker.sh --cert_path /tmp/nebula_ca --name webserver1 --server --group web

# Second webserver, webserver2.nebula.example.org -, groups "Server" and "web"
./certmaker.sh -c /tmp/nebula_ca -n webserver2 -s -g web

# Database Server, db.nebula.example.org -, groups "Server" and "db"
./certmaker.sh --cert_path /tmp/nebula_ca --name db --server --group db

# First workstation, admin1.nebula.example.org -, group "Workstation"
./certmaker.sh --cert_path /tmp/nebula_ca --index 1 --name admin1 --workstation

# Second workstation, admin2.nebula.example.org -, group "Workstation"
./certmaker.sh -c /tmp/nebula_ca -d 1 -n admin2 -w

# First Mobile device - Create the private/public key pairing first
nebula-cert keygen -out-key mobile1.key -out-pub mobile1.pub
# Then sign it, mobile1.nebula.example.org -, group "mobile"
./certmaker.sh --cert_path /tmp/nebula_ca --index 1 --name mobile1 --group mobile --public mobile1.pub

# Second Mobile device - Create the private/public key pairing first
nebula-cert keygen -out-key mobile2.key -out-pub mobile2.pub
# Then sign it, mobile2.nebula.example.org -, group "mobile"
./certmaker.sh -c /tmp/nebula_ca -d 1 -n mobile2 -g mobile -p mobile2.pub

Technically, the mobile devices are simulating the local creation of the private key, and the sharing of the public part of that key. It also simulates what might happen in a more controlled environment – not where everything is run locally.

So, let’s pick out some spots where this content might be confusing. I’ve run each type of invocation twice, once with the short version of all the flags (e.g. -c instead of --cert_path, -n instead of --name) and so on, and one with the longer versions. Before each ./certmaker.sh command, I’ve added a comment, showing what the hostname would be, the IP address, and the Nebula Groups assigned to that node.

It is also possible to override the FQDN with your own FQDN, but this command option isn’t in here. Also, if the CA doesn’t provide a CIDR mask, one will be selected for you (, or you can provide one with the -b/--subnet flag.

If the CA has multiple names (e.g. nebula.example.org and nebula.example.com), then the name for the host certificates will be host.nebula.example.org and also host.nebula.example.com.

Using Bash

So, if you’ve looked at, well, almost anything on my site, you’ll see that I like to use tools like Ansible and Terraform to deploy things, but for something which is going to be run on this machine, I’d like to keep things as simple as possible… and there’s not much in this script that needed more than what Bash offers us.

For those who don’t know, bash is the default shell for most modern Linux distributions and Docker containers. It can perform regular expression parsing (checking that strings, or specific collections of characters appear in a variable), mathematics, and perform extensive loop and checks on values.

I used a bash template found on a post at BetterDev.blog to give me a basic structure – usage, logging and parameter parsing. I needed two functions to parse and check whether IP addresses were valid, and what ranges of those IP addresses might be available. These were both found online. To get just enough of the MD5SUM to generate a random IPv4 address, I used a function to convert the hexedecimal number that the MDSUM produces, and then turned that into a decimal number, which I loop around the address space in the subnets. Lastly, I made extensive use of Bash Arrays in this. This was largely thanks to an article on OpenSource.com about bash arrays. It’s well worth a read!

So, take a look at the internals of the script, if you want to know some options on writing bash scripts that manipulate IP addresses and read the output of files!

If you’re looking for some simple tasks to start your portfolio of work, there are some “good first issue” tasks in the “issues” of the repo, and I’d be glad to help you work through them.

Wrap up

I hope you enjoy using this script, and I hope, if you’re planning on writing some bash scripts any time soon, that you take a look over the code and consider using some of the templates I reference.

Featured image is “Observatories Combine to Crack Open the Crab Nebula” by “NASA Goddard Space Flight Center” on Flickr and is released under a CC-BY license.

"Family" by "Ivan" on Flickr

Debian on Docker using Vagrant

I want to use Vagrant-Docker to try standing up some environments. There’s no reasonable justification, it’s just a thing I wanted to do. Normally, I’d go into this long and rambling story about why… but on this occasion, the reason was “Because it’s possible”…

TL;DR?: Get the code from the repo and enjoy 😁

Installing Docker

On Ubuntu you can install Docker following the instructions on the Docker Install Page, which includes a convenience script (that runs all the commands you need), if you want to use it. Similar instructions for Debian, CentOS and Fedora exist.

On Windows or Mac there are downloads you can get from the Docker Hub. The Windows Version requires WSL2. I don’t have a Mac, so I don’t know what the requirements are there! Installing WSL2 has a whole host of extra steps that I can’t really do justice to. See this Microsoft article for details.

Installing Vagrant

On Debian and Ubuntu you can add the HashiCorp Apt Repo and then install Vagrant, using these commands:

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt install vagrant

There are similar instructions for RHEL, CentOS and Fedora users there too.

Windows and Mac users will have to get the application from the download page.

Creating your Dockerfile

A Dockerfile is a simple text file which has a series of line prefixes which instruct the Docker image processor to add certain instructions to the Docker Image. I found two pages which helped me with what to add for this; “Ansible. Docker. Vagrant. Bringing together” and the git repo “AkihiroSuda/containerized-systemd“.

You see, while a Dockerfile is great at starting single binary files or scripts, it’s not very good at running SystemD… and I needed SystemD to be able to run the SSH service that Vagrant requires, and to also run the scripts and commands I needed for the image I wanted to build…

Sooooo…. here’s the Dockerfile I created:

# Based on https://vtorosyan.github.io/ansible-docker-vagrant/
# and https://github.com/AkihiroSuda/containerized-systemd/

FROM debian:buster AS debian_with_systemd

# This stuff enables SystemD on Debian based systems
RUN DEBIAN_FRONTEND=noninteractive apt update && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends systemd systemd-sysv dbus dbus-user-session
COPY docker-entrypoint.sh /
RUN chmod 755 /docker-entrypoint.sh
ENTRYPOINT [ "/docker-entrypoint.sh" ]
CMD [ "/bin/bash" ]

# This part installs an SSH Server (required for Vagrant)
RUN DEBIAN_FRONTEND=noninteractive apt install -y sudo openssh-server
RUN mkdir /var/run/sshd
#    We enable SSH here, but don't start it with "now" as the build stage doesn't run anything long-lived.
RUN systemctl enable ssh

# This part creates the vagrant user, sets the password to "vagrant", adds the insecure key and sets up password-less sudo.
RUN useradd -G sudo -m -U -s /bin/bash vagrant
#    chpasswd takes a colon delimited list of username/password pairs.
RUN echo 'vagrant:vagrant' | chpasswd
RUN mkdir -m 700 /home/vagrant/.ssh
# This key from https://github.com/hashicorp/vagrant/tree/main/keys. It will be replaced on first run.
RUN echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA6NF8iallvQVp22WDkTkyrtvp9eWW6A8YVr+kz4TjGYe7gHzIw+niNltGEFHzD8+v1I2YJ6oXevct1YeS0o9HZyN1Q9qgCgzUFtdOKLv6IedplqoPkcmF0aYet2PkEDo3MlTBckFXPITAMzF8dJSIFo9D8HfdOV0IAdx4O7PtixWKn5y2hMNG0zQPyUecp4pzC6kivAIhyfHilFR61RGL+GPXQ2MWZWFYbAGjyiYJnAmCP3NOTd0jMZEnDkbUvxhMmBYSdETk1rRgm+R4LOzFUGaHqHDLKLX+FIPKcF96hrucXzcWyLbIbEgE98OHlnVYCzRdK8jlqm8tehUc9c9WhQ== vagrant insecure public key' > /home/vagrant/.ssh/authorized_keys
RUN chmod 600 /home/vagrant/.ssh/authorized_keys
RUN chown -R vagrant:vagrant /home/vagrant
RUN echo 'vagrant ALL=(ALL:ALL) NOPASSWD:ALL' >> /etc/sudoers

This Dockerfile calls out to a separate script, called docker-entrypoint.sh, taken verbatim from AkihiroSuda’s repo, so here’s that file:

set -ex
export container

if [ $# -eq 0 ]; then
	echo >&2 'ERROR: No command specified. You probably want to run `journalctl -f`, or maybe `bash`?'
	exit 1

if [ ! -t 0 ]; then
	echo >&2 'ERROR: TTY needs to be enabled (`docker run -t ...`).'
	exit 1

env >/etc/docker-entrypoint-env

cat >/etc/systemd/system/docker-entrypoint.target <<EOF
Description=the target for docker-entrypoint.service
Requires=docker-entrypoint.service systemd-logind.service systemd-user-sessions.service
cat /etc/systemd/system/docker-entrypoint.target

quoted_args="$(printf " %q" "${@}")"
echo "${quoted_args}" >/etc/docker-entrypoint-cmd
cat /etc/docker-entrypoint-cmd

cat >/etc/systemd/system/docker-entrypoint.service <<EOF

ExecStart=/bin/bash -exc "source /etc/docker-entrypoint-cmd"
# EXIT_STATUS is either an exit code integer or a signal name string, see systemd.exec(5)
ExecStopPost=/bin/bash -ec "if echo \${EXIT_STATUS} | grep [A-Z] > /dev/null; then echo >&2 \"got signal \${EXIT_STATUS}\"; systemctl exit \$(( 128 + \$( kill -l \${EXIT_STATUS} ) )); else systemctl exit \${EXIT_STATUS}; fi"

cat /etc/systemd/system/docker-entrypoint.service

systemctl mask systemd-firstboot.service systemd-udevd.service
systemctl unmask systemd-logind
systemctl enable docker-entrypoint.service

if [ -x /lib/systemd/systemd ]; then
elif [ -x /usr/lib/systemd/systemd ]; then
elif [ -x /sbin/init ]; then
	echo >&2 'ERROR: systemd is not installed'
	exit 1
systemd_args="--show-status=false --unit=multi-user.target"
echo "$0: starting $systemd $systemd_args"
exec $systemd $systemd_args

Now, if you were to run this straight in Docker, it will fail, because you must pass certain flags to Docker to get this to run. These flags are:

  • -t : pass a “TTY” to the shell
  • --tmpfs /tmp : Create a temporary filesystem in /tmp
  • --tmpfs /run : Create another temporary filesystem in /run
  • --tmpfs /run/lock : Apparently having a tmpfs in /run isn’t enough – you ALSO need one in /run/lock
  • -v /sys/fs/cgroup:/sys/fs/cgroup:ro : Mount the CGroup kernel configuration values into the container

(I found these flags via a RedHat blog post, and a Podman issue on Github.)

So, how would this look, if you were to try and run it?

docker exec -t --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup:ro YourImage

Blimey, what a long set of text! Perhaps we could hide that behind something a bit more legible? Enter Vagrant.

Creating your Vagrantfile

Vagrant is an abstraction tool, designed to hide complicated virtualisation scripts into a simple command. In this case, we’re hiding a containerisation script into a simple command.

Like with the Dockerfile, I made extensive use of the two pages I mentioned before, as well as the two pages to get the flags to run this.

# Based on https://vtorosyan.github.io/ansible-docker-vagrant/
# and https://github.com/AkihiroSuda/containerized-systemd/
# and https://developers.redhat.com/blog/2016/09/13/running-systemd-in-a-non-privileged-container/
# with tweaks indicated by https://github.com/containers/podman/issues/3295
Vagrant.configure("2") do |config|
  config.vm.provider "docker" do |d|
    d.build_dir       = "."
    d.has_ssh         = true
    d.remains_running = false
    d.create_args     = ['--tmpfs', '/tmp', '--tmpfs', '/run', '--tmpfs', '/run/lock', '-v', '/sys/fs/cgroup:/sys/fs/cgroup:ro', '-t']

If you create that file, and run vagrant up you’ll get a working Vagrant boot… But if you try and execute any shell scripts, they’ll fail to run, as the they aren’t passed in with execute permissions… so I want to use Ansible to execute things, as these don’t require execute permissions on the /vagrant directory (also, as the thing I’m building in there requires Ansible… so it’s helpful either way 😁)

Executing Ansible scripts

Ansible still expects to find python in /usr/bin/python but current systems don’t make the symlink to /usr/bin/python3, as python was typically a symlink to /usr/bin/python2… and also I wanted to put the PPA for Ansible in the sources, which is what the Ansible team recommend in their documentation. I’ve done this as part of the Dockerfile, as again, I can’t run scripts from Vagrant. So, here’s the addition I made to the Dockerfile.

FROM debian_with_systemd AS debian_with_systemd_and_ansible
RUN apt install -y gnupg2 lsb-release software-properties-common
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 93C4A3FD7BB9C367
RUN add-apt-repository "deb http://ppa.launchpad.net/ansible/ansible/ubuntu trusty main"
RUN apt install -y ansible
# Yes, I know. Trusty? On Debian Buster?? But, that's what the Ansible Docs say!

In the Vagrantfile, I’ve added this block:

config.vm.provision "ansible_local" do |ansible|
  ansible.playbook = "test.yml"

And I created a test.yml, which looks like this:

- hosts: all
  - debug:
      msg: "Hello from Docker"

Running it

So how does this look on Windows when I run it?

PS C:\Dev\VagrantDockerBuster> vagrant up
==> default: Creating and configuring docker networks...
==> default: Building the container from a Dockerfile...
    default: #20 DONE 0.1s
    default: Image: 190ffdeaeed0b7ed206097e6c1d4b5cc796a428700c9bd3e27eedacce47fb63b
==> default: Creating the container...
    default:   Name: 2021-02-13DockerBusterWithSSH_default_1613469604
    default:  Image: 190ffdeaeed0b7ed206097e6c1d4b5cc796a428700c9bd3e27eedacce47fb63b
    default: Volume: C:/Users/SPRIGGSJ/OneDrive - FUJITSU/Documents/95 My Projects/2021-02-13 Docker Buster With SSH:/vagrant
    default:   Port:
    default: Container created: b64ed264d8949b12
==> default: Enabling network interfaces...
==> default: Starting container...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address:
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default: Inserting generated public key within guest...
==> default: Machine booted and ready!
==> default: Running provisioner: ansible_local...
    default: Running ansible-playbook...

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
[WARNING]: Platform linux on host default is using the discovered Python
interpreter at /usr/bin/python, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.9/referen
ce_appendices/interpreter_discovery.html for more information.
ok: [default]

TASK [debug] *******************************************************************
ok: [default] => {
    "msg": "Hello from Docker"

PLAY RECAP *********************************************************************
default                    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

PS C:\Dev\VagrantDockerBuster>

And on Linux?

Bringing machine 'default' up with 'docker' provider...
==> default: Creating and configuring docker networks...
==> default: Building the container from a Dockerfile...
    default: Removing intermediate container e56bed4f7be9
    default:  ---> cef749c205bf
    default: Successfully built cef749c205bf
    default: Image: cef749c205bf
==> default: Creating the container...
    default:   Name: 2021-02-13DockerBusterWithSSH_default_1613470091
    default:  Image: cef749c205bf
    default: Volume: /home/spriggsj/Projects/2021-02-13 Docker Buster With SSH:/vagrant
    default:   Port:
    default: Container created: 3fe46b02d7ad10ab
==> default: Enabling network interfaces...
==> default: Starting container...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address:
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!
==> default: Running provisioner: ansible_local...
    default: Running ansible-playbook...

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
[WARNING]: Platform linux on host default is using the discovered Python
interpreter at /usr/bin/python, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.9/referen
ce_appendices/interpreter_discovery.html for more information.
ok: [default]

TASK [debug] *******************************************************************
ok: [default] => {
    "msg": "Hello from Docker"

PLAY RECAP *********************************************************************
default                    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

So, if you’re crazy and want to do Vagrant using Docker with Debian Buster and Ansible, this is how to do it. I don’t know how much I’m likely to be using this in the future, but if you use it, let me know what you’re doing with it! 😀

Featured image is “Family” by “Ivan” on Flickr and is released under a CC-BY license.

"Blueprints" by "Cameron Degelia" on Flickr

Using Architectural Decision Records (ADR) with adr-tools

Introducing Architectural Decision Records

Over the last week, I discovered a new tool for my arsenal called Architectural Decision Records (ADR). They were first written about in 2011, in a post called “Documenting Architecture Decisions“, where the author, Michael Nygard, advocates for short documents explaining each decision that influences the architecture of an environment.

I found this via a Github repository, created by the team at gov.uk, which includes their ADR library, and references the tool they use to manage these documents – adr-tools.

Late edit 2021-01-25: I also found a post which suggests that Spotify uses ADR.

Installing adr-tools on Linux

Currently adr-tools are easier to install under OSX rather than Linux or Windows Subsystem for Linux (WSL) (I’m working on this – bear with me! 😃 ).

The current installation notes suggest for Linux (which would also work on WSL) is to download the latest release tar.gz or zip file and unpack it into your path somewhere. This isn’t exactly the best way to deploy anything on Linux, but… I guess it works right now?

For me, I downloaded the file, and unpacked the whole tar.gz file (as root) into /usr/local/bin/, giving me a directory of /usr/local/bin/adr-tools-3.0.0/. There’s a subdirectory in here, called src which contains a large number of files – mostly starting _adr or adr- and two additional files, init.md and template.md.

Rather than putting all of these files into /usr/local/bin directly, instead I leave them in the adr-tools-3.0.0 directory, and create a symbolic link (symlink) to the /usr/local/bin directory with this command:

cd /usr/local/bin
ln -s adr-tools-3.0.0/src/* .

This gives me all those files in one place, so I can refer to them later.

An aside – why link everything in that src directory? (Feel free to skip this block!)

Now, why, you might ask, do all of these unrelated files need to be in the same place? Well…. the author of the script has put this in at the top of almost all the files:

set -e
eval "$($(dirname $0)/adr-config)"

And then in that script, it says:

basedir=$(cd -L $(dirname $0) > /dev/null 2>&1 && pwd -L)
echo "adr_bin_dir=$basedir"
echo "adr_template_dir=$basedir"

There are, technically, good reasons for this! This is designed to be run in, what in the Windows world, you might call as a “Portable Script”. So, you bung adr-tools into some directory somewhere, and then just call adr somecommand and it knows that all the files are where they need to be. The (somewhat) down side to this is that if you just want to call adr somecommand rather than path/to/my/adr somecommand then all those files need to be there

I’m currently looking to see if I can improve this somewhat, so that it’s not quite so complex to install, but for now, that’s what you need.


Using adr-tools to document your decisions

I’ll start documenting a fictional hosted web service project, and note down some of the decisions which have been made.

Initializing your ADR directory

Start by running adr init. You may want to specify a directory where you want to put these records, so instead use: adr init path/to/adr, like this:

Initializing the ADR in “documentation/architecture-decisions” with adr init documentation/architecture-decisions

You’ll notice that when I run this command, it creates a new entry, called 0001-record-architecture-decisions.md. Let’s open this up, and see what’s in here.

The VSCode record for the choice to use ADR. It is a markdown file, with the standard types of data recorded.

In here we have the record ID (1.), the title of the record Record architecture decisions, the date the choice was made Date: 2021-01-19, a status of Accepted, the context on why we made this choice, the decision, and the consequences of making this decision. Make changes, if needed, and save it. Let’s move on.

Creating our first own record

This all is quite straightforward thus far. Let’s create our next record.

Issuing the command adr new <sometitle> you create the next ADR record.

Let’s open up that record.

The template for the ADR record for “Use AWS”.

Like the first record, we have a title, a status, a context, decision and consequences. Let’s define these.

A “finished” brief ADR record.

This document shouldn’t be very long! It just describes why a choice was made and what that entails.

Changing decisions – completely replacing (superseding) a decision

Of course, over time, decisions will be replaced due to various decisions elsewhere.

You can ask adr to supersede a previous record, using the “-s” flag, and the record number.

Let’s look at how that works on the second ADR record.

After the command adr new -s 2 Use Azure, the ADR record number 2 has a new status, “Superceded by” and the superseded linked document. Yes, “Superceded” is a typo. There is an open PR for it

So, under the “Status”, where is previously said “Approved”, it now says “Superceded by [3. Use Azure](0003-use-azure.md)“. This is a markdown statement which indicates where the superseded document is located. As I mentioned in the comment below the above image, there is an open Pull Request to fix this on the adr-tools, so hopefully that typo won’t last long!

We’ve got our new ADR too – let’s take a look at that one?

Our new ADR shows that it “supercedes” the previous record. Which is good! Typo aside :)

Other references

Of course, you don’t always completely overrule a decision. Sometimes your decision is influenced by, or has a dependency on something else, like this one.

We know which provider we’re using at long last, now let’s pick a region. Use the -l flag to “link” between the referenced and new ADR. The context for the -l flag is “<number>:<text for link to number>:<text for link in targetted document>”.

The command here is:

adr new -l '3:Dependency:Influences' Use Region UK South and UK West

I’m just going to crop from the “Status” block on both the referenced ADR (3) and the ADR which references it (4):

Status block in ADR 0003 which is referenced by ADR 0004
Status block in the new ADR 0004 which references ADR 0003

And of course, you can also use the same switch to mark documents as partially obsoleted, like this:

adr new -l '4:Partially obsoletes:Partially obsoleted by' Use West Europe region instead of UK West region
Status block in ADR 0004 indicating it’s partially obsoleted. Probably worth updating the status properly to show it’s not just “Accepted”.

If you forget to add the referencing in, you can also use the adr link command, like this:

adr link 3 Influences 5 Dependency

To be clear, that command adds a (complete) line to ADR 0003 saying “Influences [5. ADR Title](link)” and a separate (complete) line to ADR 0005 saying “Dependency [3. ADR Title](link)“.

What else can we do?

There are four other “things” that it’s worth doing at this point.

  1. Note that you can change the template per-ADR directory.

Create a directory called “templates” in the ADR directory, and put a file in there called “template.md“. Tweak this as you need. Ensure you have AT LEAST the line ## Status and # NUMBER. TITLE as these are required by the script.

A much abbreviated template file, containing just “Number”, “Title”, “Date”, “Status”, and a new dummy heading called “Stuff”.
And the result of running adr new Some Text once you’ve created that template.

As you can see, it’s possible to add all sorts of content in this template as a result. Bear in mind, before your template turns into something like this, that it’s supposed to be a short document explaining why each decision was made, not a funding proposal, or a complex epic of your user stories!

Be careful not to let your template run away with you!
  1. Note that you can automatically open an editor, by setting the EDITOR (where the process is expected to finish before returning control, like using nano, emacs or vim, for example) or VISUAL (where the process is expected to “fork”, like for example, gedit or vscode) environment variable, and then running adr new A Title, like this:
  1. We can create “Table of Contents” files, using the adr generate toc command, like this:
Generating the table of contents, for injecting into other files.

This can be included into your various other markdown files. There are switches, so you can set the link path, but your best bet is to find that using adr help generate toc.

  1. We can also generate graphviz files of the link maps between elements of the various ADRs, like this: adr generate graph | dot -Tjpg > graph.jpg

If you omit the “| dot -Tjpg > graph.jpg” part, then you’ll see the graphviz output, which looks like this: (I’ve removed the documents 6 and 7).

digraph {
  node [shape=plaintext];
  subgraph {
    _1 [label="1. Record architecture decisions"; URL="0001-record-architecture-decisions.html"];
    _2 [label="2. Use AWS"; URL="0002-use-aws.html"];
    _1 -> _2 [style="dotted", weight=1];
    _3 [label="3. Use Azure"; URL="0003-use-azure.html"];
    _2 -> _3 [style="dotted", weight=1];
    _4 [label="4. Use Region UK South and UK West"; URL="0004-use-region-uk-south-and-uk-west.html"];
    _3 -> _4 [style="dotted", weight=1];
    _5 [label="5. Use West Europe region instead of UK West region"; URL="0005-use-west-europe-region-instead-of-uk-west-region.html"];
    _4 -> _5 [style="dotted", weight=1];
  _3 -> _2 [label="Supercedes", weight=0]
  _3 -> _5 [label="Influences", weight=0]
  _4 -> _3 [label="Dependency", weight=0]
  _5 -> _4 [label="Partially obsoletes", weight=0]
  _5 -> _3 [label="Dependency", weight=0]

To make the graphviz part work, you’ll need to install graphviz, which is just an apt get away.

Any caveats?

adr-tools is not actively maintained. I’ve contacted the author, about seeing if I can help out with the maintenance, but… we’ll see, and given some fairly high profile malware takeovers of projects like this sort of thing on Github, Docker, NPM, and more… I can see why there might be some reluctance to consider it! Also, I’m an unknown entity, I’ve just dropped in on the project and offered to help, with no previous exposure to the lead dev or the project… so, we’ll see. Worst case, I’ll fork it!

Working with this also requires an understanding of markdown files, and why these might be a useful document format for records like this. There was a PR submitted to support multiple file formats (like asciidoc and rst) but these were not approved by the author.

There is no current intention to support languages other than English. The tool is hard-coded to look for strings like “status” and “superceded” which is hard. Part of the reason I raised the PRs I did was to let me fix some of these sorts of issues. Again, we’ll see what happens.

Lastly, it can be overwhelming to see a lot of documents in one place, particularly if they’re as granular as the documents I produced in this demo. If the project supported categories, or could be broken down into components (like doc/adr/networking and doc/adr/server_builds and doc/adr/applications) then this might help, but it’s not on the roadmap right now!

Late edit 2021-01-25: If you don’t think these templates have enough context or content, there are lots of others listed on Joel Parker Henderson’s repo of examples and templates. If you want a python based viewer of ADR records, take a look at adr-viewer.

Featured image is “Blueprints” by “Cameron Degelia” on Flickr and is released under a CC-BY license.

"Raven" by "Jim Bahn" on Flickr

Sending SSH login notifications to Matrix via Huginn using Webhooks

On the Self Hosted Podcast’s Discord Server, someone posted a link to the following blog post, which I read and found really interesting…: https://blog.hay-kot.dev/ssh-notifications-in-home-assistant/

You see, the key part of that post wasn’t that they were posting to Home Assistant when they were logging in, but instead that they were triggering a webhook on login. And I can do stuff with Webhooks.

What’s a webhook?

A webhook is a callable URL, either with a “secret” embedded in the URL or some authentication header that lets you trigger an action of some sort. I first came across these with Github, but they’re pretty common now. Services will offer these as a way to get an action in one service to do something in another. A fairly common webhook for those getting started with these sorts of things is where creating a pull request (PR) on a Github repository will trigger a message on something like Slack to say the PR is there.

Well, that’s all well and good, but what does Matrix or Huginn have to do with things?

Matrix is a decentralized, end to end encrypted, eventually consistent database system, that just happens to be used extensively as a chat network. In particular, it’s used by Open Source projects, like KDE and Mozilla, and by Government bodies, like the whole French goverment (lead by DINSIC) the German Bundeswehr (Unified Armed Forces) department.

Matrix has a reference client, Element, that was previously called “Riot”, and in 2018 I produced a YouTube video showing how to bridge various alternative messaging systems into Matrix.

Huginn describes itself as:

Huginn is a system for building agents that perform automated tasks for you online. They can read the web, watch for events, and take actions on your behalf. Huginn’s Agents create and consume events, propagating them along a directed graph. Think of it as a hackable version of IFTTT or Zapier on your own server. You always know who has your data. You do.

Huginn Readme

With Huginn, I can create “agents”, including a “receive webhook agent” that will take the content I send, and tweak it to do something else. In the past I used IFTTT to do some fun things, like making this blog work, but now I use Huginn to post Tweets when I post to this blog.

So that I knew that Huginn was posting my twitter posts, I created a Matrix room called “Huginn Alerts” and used the Matrix account I created for the video I mentioned before to send me messages that it had made the posts I wanted. I followed the guidance from this page to do it: https://drwho.virtadpt.net/archive/2020-02-12/integrating-huginn-with-a-matrix-server/

Enough already. Just show me what you did.

In Element.io

  1. Get an access token for the Matrix account you want to post with.

Log into the web interface at https://app.element.io and go to your settings

Click where it says your handle, then click on where it says “All Settings”.

Then click on “Help & About” and scroll to the bottom of that page, where it says “Advanced”

Get to the “Advanced” part of the settings, under “Help & About” to get your access token.

Click where it says “Access Token: <click to reveal>” (strangely, I’m not posting that 😉)

  1. Click on the room, then click on it’s name at the top to open the settings, then click on “Advanced” to get the “Internal room ID”
Gettng the Room ID. Note, it starts with an exclamation mark (!) and ends :<servername>.

In Huginn

  1. Go to the “Credentials” tab, and click on “New Credential”. Give the credential a name (perhaps “Matrix Bot Access Token”), leave it as text and put your access token in here.
  1. Create a Credential for the Room ID. Like before, name it something sensible and put the ID you found earlier.
  1. Create a “Post Agent” by going to Agents and selecting “New agent”. This will show just the “Type” box. You can type in this box to put “Post Agent” and then find it. That will then provide you with the rest of these boxes. Provide a name, and tick the box marked “Propagate immediately”. I’ll cover the content of the “Options” box after this screenshot.

In the “Options” block is a button marked “Toggle View”. Select this which turns it from the above JSON pretty editor, into this text field (note your text is likely to be different):

My content of that box is as follows:

  "post_url": "https://matrix.org/_matrix/client/r0/rooms/{% credential Personal_Matrix_Notification_Channel %}/send/m.room.message?access_token={% credential Matrix_Bot_Access_Credential %}",
  "expected_receive_period_in_days": "365",
  "content_type": "json",
  "method": "post",
  "payload": {
    "msgtype": "m.text",
    "body": "{{ text }}"
  "emit_events": "true",
  "no_merge": "false",
  "output_mode": "clean"

Note that the “post_url” value contains two “credential” values, like this:

{% credential Personal_Matrix_Notification_Channel %} (this is the Room ID we found earlier) and {% credential Matrix_Bot_Access_Credential %} (this is the Access Token we found earlier).

If you’ve used different names for these values (which are perfectly valid!) then just change these two. The part where it says “{{ text }}” leave there, because we’ll be using that in a later section. Click “Save” (the blue button at the bottom).

  1. Create a Webhook Agent. Go to Agents and then “New Agent”. Select “Webhook Agent” from the “Type” field. Give it a name, like “SSH Logged In Notification Agent”. Set “Keep Events” to a reasonable number of days, like 5. In “Receivers” find the Notification agent you created (“Send Matrix Message to Notification Room” was the name I used). Then, in the screenshot, I’ve pressed the “Toggle View” button on the “Options” section, as this is, to me a little clearer.

The content of the “options” box is:

  "secret": "supersecretstring",
  "expected_receive_period_in_days": 365,
  "payload_path": ".",
  "response": ""

Change the “secret” from “supersecretstring” to something a bit more useful and secure.

The “Expected Receive Period in Days” basically means, if you’ve not had an event cross this item in X number of days, does Huginn think this agent is broken? And the payload path of “.” basically means “pass everything to the next agent”.

Once you’ve completed this step, press “Save” which will take you back to your agents, and then go into the agent again. This will show you a page like this:

Copy that URL, because you’ll need it later…

On the server you are logging the SSH to

As root, create a file called /etc/ssh/sshrc. This file will be your script that will run every time someone logs in. It must have the file permissions 0644 (u+rw,g+r,o+r), which means that there is a slight risk that the Webhook secret is exposed.

The content of that file is as follows:

ip="$(echo "$SSH_CONNECTION" | cut -d " " -f 1)"
curl --silent\
     --header "Content-Type: application/json"\
     --request POST\
     --data '{
       "At": "'"$(date -Is)"'",
       "Connection": "'"$SSH_CONNECTION"'",
       "User": "'"$USER"'",
       "Host": "'"$(hostname)"'",
       "Src": "'"$ip"'",
       "text": "'"$USER@$(hostname) logged in from $ip at $(date +%H:%M:%S)"'"

The heading line (#!/bin/sh) is more there for shellcheck, as, according to the SSH man page this is executed by /bin/sh either way.

The bulk of these values (At, Connection, User, Host or Src) are not actually used by Huginn, but might be useful for later… the key one is text, which if you recall from the “Send Matrix Message to Notification Room” Huginn agent, we put {{ text }} into the “options” block – that’s this block here!

So what happens when we log in over SSH?

SSH asks the shell in the user’s context to execute /etc/ssh/sshrc before it hands over to the user’s login session. This script calls curl and hands some POST data to the url.

Huginn receives this POST via the “SSH Logged In Notification Agent”, and files it.

Huginn then hands that off to the “Send Matrix Message to Notification Room”:

Huginn makes a POST to the Matrix.org server, and Matrix sends the finished message to all the attached clients.

Featured image is “Raven” by “Jim Bahn” on Flickr and is released under a CC-BY license.

"The Guitar Template" by "Neil Williamson" on Flickr

Testing (and failing inline) for data types in Ansible

I tend to write long and overly complicated set_fact statements in Ansible, ALL THE DAMN TIME. I write stuff like this:

rulebase: |
    {% for var in vars | dict2items %}
      {% if var.key | regex_search(regex_rulebase_match) | type_debug != "NoneType"
        and (
          var.value | type_debug == "dict" 
          or var.value | type_debug == "AnsibleMapping"
        ) %}
        {% for item in var.value | dict2items %}
          {% if item.key | regex_search(regex_rulebase_match) | type_debug != "NoneType"
            and (
              item.value | type_debug == "dict" 
              or item.value | type_debug == "AnsibleMapping"
            ) %}
            "{{ var.key | regex_replace(regex_rulebase_match, '\2') }}{{ item.key | regex_replace(regex_rulebase_match, '\2') }}": {
              {# This block is used for rulegroup level options #}
              {% for key in ['log_from_start', 'log', 'status', 'nat', 'natpool', 'schedule', 'ips_enable', 'ssl_ssh_profile', 'ips_sensor'] %}
                {% if var.value[key] is defined and rule.value[key] is not defined %}
                  {% if var.value[key] | type_debug in ['string', 'AnsibleUnicode'] %}
                    "{{ key }}": "{{ var.value[key] }}",
                  {% else %}
                    "{{ key }}": {{ var.value[key] }},
                  {% endif %}
                {% endif %}
              {% endfor %}
              {% for rule in item.value | dict2items %}
                {% if rule.key in ['sources', 'destinations', 'services', 'src_internet_service', 'dst_internet_service'] and rule.value | type_debug not in ['list', 'AnsibleSequence'] %}
                  "{{ rule.key }}": ["{{ rule.value }}"],
                {% elif rule.value | type_debug in ['string', 'AnsibleUnicode'] %}
                  "{{ rule.key }}": "{{ rule.value }}",
                {% else %}
                  "{{ rule.key }}": {{ rule.value }},
                {% endif %}
              {% endfor %}
          {% endif %}
        {% endfor %}
      {% endif %}
    {% endfor %}

Now, if you’re writing set_fact or vars like this a lot, what you tend to end up with is the dreaded dict2items requires a dictionary, got instead. which basically means “Hah! You wrote a giant blob of what you thought was JSON, but didn’t render right, so we cast it to a string for you!”

The way I usually write my playbooks, I’ll do something with this set_fact at line, let’s say, 10, and then use it at line, let’s say, 500… So, I don’t know what the bloomin’ thing looks like then!

So, how to get around that? Well, you could do a type check. In fact, I wrote a bloomin’ big blog post explaining just how to do that!

However, that gets unwieldy really quickly, and what I actually wanted to do was to throw the breaks on as soon as I’d created an invalid data type. So, to do that, I created a collection of functions which helped me with my current project, and they look a bit like this one, called “is_a_string.yml“:

- name: Type Check - is_a_string
    quiet: yes
    - vars[this_key] is not boolean
    - vars[this_key] is not number
    - vars[this_key] | int | string != vars[this_key] | string
    - vars[this_key] | float | string != vars[this_key] | string
    - vars[this_key] is string
    - vars[this_key] is not mapping
    - vars[this_key] is iterable
    success_msg: "{{ this_key }} is a string"
    fail_msg: |-
      {{ this_key }} should be a string, and is instead
      {%- if vars[this_key] is not defined %} undefined
      {%- else %} {{ vars[this_key] is boolean | ternary(
        'a boolean',
        (vars[this_key] | int | string == vars[this_key] | string) | ternary(
          'an integer',
          (vars[this_key] | float | string == vars[this_key] | string) | ternary(
            'a float',
            vars[this_key] is string | ternary(
              'a string',
              vars[this_key] is mapping | ternary(
                'a dict',
                vars[this_key] is iterable | ternary(
                  'a list',
                  'unknown (' ~ vars[this_key] | type_debug ~ ')'
      )}}{% endif %} - {{ vars[this_key] | default('unset') }}

To trigger this, I do the following:

- hosts: localhost
  gather_facts: false
    SomeString: abc123
    SomeDict: {'somekey': 'somevalue'}
    SomeList: ['somevalue']
    SomeInteger: 12
    SomeFloat: 12.0
    SomeBoolean: false
  - name: Type Check - SomeString
      this_key: SomeString
    include_tasks: tasks/type_check/is_a_string.yml
  - name: Type Check - SomeDict
      this_key: SomeDict
    include_tasks: tasks/type_check/is_a_dict.yml
  - name: Type Check - SomeList
      this_key: SomeList
    include_tasks: tasks/type_check/is_a_list.yml
  - name: Type Check - SomeInteger
      this_key: SomeInteger
    include_tasks: tasks/type_check/is_an_integer.yml
  - name: Type Check - SomeFloat
      this_key: SomeFloat
    include_tasks: tasks/type_check/is_a_float.yml
  - name: Type Check - SomeBoolean
      this_key: SomeBoolean
    include_tasks: tasks/type_check/is_a_boolean.yml

I hope this helps you, bold traveller with complex jinja2 templating requirements!

(Oh, and if you get “template error while templating string: no test named 'boolean'“, you’re probably running Ansible which you installed using apt from Ubuntu Universe, version 2.9.6+dfsg-1 [or, at least I was!] – to fix this, use pip to install a more recent version – preferably using virtualenv first!)

Featured image is “The Guitar Template” by “Neil Williamson” on Flickr and is released under a CC-BY-SA license.

'Geocache "Goodies"' by 'sk' on Flickr

Caching online data sources in Ansible for later development or testing

My current Ansible project relies on me collecting a lot of data from AWS and then checking it again later, to see if something has changed.

This is great for one-off tests (e.g. terraform destroy ; terraform apply ; ansible-playbook run.yml) but isn’t great for repetitive tests, especially if you have to collect data that may take many minutes to run all the actions, or if you have slow or unreliable internet in your development environment.

To get around this, I wrote a wrapper for caching this data.

At the top of my playbook, run.yml, I have these tasks:

- name: Set Online Status.
  # This stores the value of run_online, unless run_online
  # is not set, in which case, it defines it as "true".
    run_online: |-
      {{- run_online | default(true) | bool -}}

- name: Create cache_data path.
  # This creates a "cached_data" directory in the same
  # path as the playbook.
  when: run_online | bool and cache_data | default(false) | bool
  delegate_to: localhost
  run_once: true
    path: "cached_data"
    state: directory
    mode: 0755

- name: Create cache_data for host.
  # This creates a directory under "cached_data" in the same
  # path as the playbook, with the name of each of the inventory
  # items.
  when: run_online | bool and cache_data | default(false) | bool
  delegate_to: localhost
    path: "cached_data/{{ inventory_hostname }}"
    state: directory
    mode: 0755

Running this sets up an expectation for the normal operation of the playbook, that it will be “online”, by default.

Then, every time I need to call something “online”, for example, collect EC2 Instance Data (using the community.aws.ec2_instance_info module), I call out to (something like) this set of tasks, instead of just calling the task by itself.

- name: List all EC2 instances in the regions of interest.
  when: run_online | bool
    region: "{{ item.region_name }}"
  loop: "{{ regions }}"
    label: "{{ item.region_name }}"
  register: regional_ec2

- name: "NOTE: Set regional_ec2 data path"
  when: not run_online | bool or cache_data | default(false) | bool
    regional_ec2_cached_data_file_loop: "{{ regional_ec2_cached_data_file_loop | default(0) | int + 1 }}"
    cached_data_filename: "cached_data/{{ inventory_hostname }}/{{ cached_data_file | default('regional_ec2') }}.{{ regional_ec2_cached_data_file_loop | default(0) | int + 1 }}.json"

- name: "NOTE: Cache/Get regional_ec2 data path"
  when: not run_online | bool or cache_data | default(false) | bool
    msg: "File: {{ cached_data_filename }}"

- name: Cache all EC2 instances in the regions of interest.
  when: run_online | bool and cache_data | default(false) | bool
  delegate_to: localhost
    dest: "{{ cached_data_filename }}"
    mode: "0644"
    content: "{{ regional_ec2 }}"

- name: "OFFLINE: Load all EC2 instances in the regions of interest."
  when: not run_online | bool
    regional_ec2: "{% include( cached_data_filename ) %}"

The first task, if it’s still set to being “online” will execute the task, and registers the result for later. If cache_data is configured, we generate a filename for the caching, record the filename to the log (via the debug task) and then store it (using the copy task). So far, so online… but what happens when we don’t need the instance to be up and running?

In that case, we use the set_fact module, triggered by running the playbook like this: ansible-playbook run.yml -e run_online=false. This reads the cached data out of that locally stored pool of data for later use.

Featured image is ‘Geocache “Goodies”‘ by ‘sk‘ on Flickr and is released under a CC-BY-ND license.

"Main console" by "Steve Parker" on Flickr

Running services (like SSH, nginx, etc) on Windows Subsystem for Linux (WSL1) on boot

I recently got a new laptop, and for various reasons, I’m going to be primarily running Windows on that laptop. However, I still like having a working SSH server, running in the context of my Windows Subsystem for Linux (WSL) environment.

Initially, trying to run service ssh start failed with an error, because you need to re-execute the ssh configuration steps which are missed in a WSL environment. To fix that, run sudo apt install --reinstall openssh-server.

Once you know your service runs OK, you start digging around to find out how to start it on boot, and you’ll see lots of people saying things like “Just run a shell script that starts your first service, and then another shell script for the next service.”

Well, the frustration for me is that Linux already has this capability – the current popular version is called SystemD, but a slightly older variant is still knocking around in modern linux distributions, and it’s called SystemV Init, often referred to as just “sysv” or “init.d”.

The way that those services work is that you have an “init” file in /etc/init.d and then those files have a symbolic link into a “runlevel” directory, for example /etc/rc3.d. Each symbolic link is named S##service or K##service, where the ## represents the order in which it’s to be launched. The SSH Daemon, for example, that I want to run is created in there as /etc/rc3.d/S01ssh.

So, how do I make this work in the grander scheme of WSL? I can’t use SystemD, where I could say systemctl enable --now ssh, instead I need to add a (yes, I know) shell script, which looks in my desired runlevel directory. Runlevel 3 is the level at which network services have started, hence using that one. If I was trying to set up a graphical desktop, I’d instead be looking to use Runlevel 5, but the X Windows system isn’t ported to Windows like that yet… Anyway.

Because the rc#.d directory already has this structure for ordering and naming services to load, I can just step over this directory looking for files which match or do not match the naming convention, and I do that with this script:

#! /bin/bash
function run_rc() {
  base="$(basename "$1")"
  if [[ ${base:0:1} == "S" ]]
    "$1" start
    "$1" stop

if [ "$1" != "" ] && [ -e "$1" ]
  run_rc "$1"
  if [ "$1" != "" ] && [ -e "/etc/rc${$1}.d/" ]
  for digit1 in {0..9}
    for digit2 in {0..9}
      find "/etc/rc${rc}.d/" -name "[SK]${digit1}${digit2}*" -exec "$0" '{}' \; 2>/dev/null

I’ve put this script in /opt/wsl_init.sh

This does a bit of trickery, but basically runs the bottom block first. It loops over the digits 0 to 9 twice (giving you 00, 01, 02 and so on up to 99) and looks in /etc/rc3.d for any file containing the filename starting S or K and then with the two digits you’ve looped to by that point. Finally, it runs itself again, passing the name of the file it just found, and this is where the top block comes in.

In the top block we look at the “basename” – the part of the path supplied, without any prefixed directories attached, and then extract just the first character (that’s the ${base:0:1} part) to see whether it’s an “S” or anything else. If it’s an S (which everything there is likely to be), it executes the task like this: /etc/rc3.d/S01ssh start and this works because it’s how that script is designed! You can run one of the following instances of this command: service ssh start, /etc/init.d/ssh start or /etc/rc3.d/S01ssh start. There are other options, notably “stop” or “status”, but these aren’t really useful here.

Now, how do we make Windows execute this on boot? I’m using NSSM, the “Non-sucking service manager” to add a line to the Windows System services. I placed the NSSM executable in C:\Program Files\nssm\nssm.exe, and then from a command line, ran C:\Program Files\nssm\nssm.exe install WSL_Init.

I configured it with the Application Path: C:\Windows\System32\wsl.exe and the Arguments: -d ubuntu -e sudo /opt/wsl_init.sh. Note that this only works because I’ve also got Sudo setup to execute this command without prompting for a password.

Here I invoke C:\Windows\System32\wsl.exe -d ubuntu -e sudo /opt/wsl_init.sh
I define the name of the service, as Services will see it, and also the description of the service.
I put in MY username and My Windows Password here, otherwise I’m not running WSL in my user context, but another one.

And then I rebooted. SSH was running as I needed it.

Featured image is “Main console” by “Steve Parker” on Flickr and is released under a CC-BY license.

"inventory" by "Lee" on Flickr

Using a AWS Dynamic Inventory with Ansible 2.10

In Ansible 2.10, Ansible started bundling modules and plugins as “Collections”, basically meaning that Ansible didn’t need to make a release every time a vendor wanted to update the libraries it required, or API changes required new fields to be supplied to modules. As part of this split between “Collections” and “Core”, the AWS modules and plugins got moved into a collection.

Now, if you’re using Ansible 2.9 or earlier, this probably doesn’t impact you, but there are some nice features in Ansible 2.10 that I wanted to use, so… buckle up :)

Getting started with Ansible 2.10, using a virtual environment

If you currently are using Ansible 2.9, it’s probably worth creating a “python virtual environment”, or “virtualenv” to try out Ansible 2.10. I did this on my Ubuntu 20.04 machine by typing:

sudo apt install -y virtualenv
mkdir -p ~/bin
cd ~/bin
virtualenv -p python3 ansible_2.10

The above ensures that you have virtualenv installed, creates a directory called “bin” in your home directory, if it doesn’t already exist, and then places the virtual environment, using Python3, into a directory there called “ansible_2.10“.

Whenever we want to use this new environment you must activate it, using this command:

source ~/bin/ansible_2.10/bin/activate

Once you’ve executed this, any binary packages created in that virtual environment will be executed from there, in preference to the file system packages.

You can tell that you’ve “activated” this virtual environment, because your prompt changes from user@HOST:~$ to (ansible_2.10) user@HOST:~$ which helps 😀

Next, let’s create a requirements.txt file. This will let us install the environment in a repeatable manner (which is useful with Ansible). Here’s the content of this file.


So, this isn’t just Ansible, it’s also the supporting libraries we’ll need to talk to AWS from Ansible.

We execute the following command:

pip install -r requirements.txt

Note, on Windows Subsystem for Linux version 1 (which I’m using) this will take a reasonable while, particularly if it’s crossing from the WSL environment into the Windows environment, depending on where you have specified the virtual environment to be placed.

If you get an error message about something to do with being unable to install ffi, then you’ll need to install the package libffi-dev with sudo apt install -y libffi-dev and then re-run the pip install command above.

Once the installation has completed, you can run ansible --version to see something like the following:

ansible 2.10.2
  config file = None
  configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/user/ansible_2.10/lib/python3.8/site-packages/ansible
  executable location = /home/user/ansible_2.10/bin/ansible
  python version = 3.8.2 (default, Jul 16 2020, 14:00:26) [GCC 9.3.0]

Configuring Ansible for local collections

Ansible relies on certain paths in the filesystem to store things like collections, roles and modules, but I like to circumvent these things – particularly if I’m developing something, or moving from one release to the next. Fortunately, Ansible makes this very easy, using a single file, ansible.cfg to tell the code that’s running in this path where to find things.

A quick note on File permissions with ansible.cfg

Note that the POSIX file permissions for the directory you’re in really matter! It must be set to 775 (-rwxrwxr-x) as a maximum – if it’s “world writable” (the last number) it won’t use this file! Other options include 770, 755. If you accidentally set this as world writable, or are using a directory from the “Windows” side of WSL, then you’ll get an error message like this:

[WARNING]: Ansible is being run in a world writable directory (/home/user/ansible_2.10_aws), ignoring it as an ansible.cfg source. For more information see

That link is this one: https://docs.ansible.com/ansible/devel/reference_appendices/config.html#cfg-in-world-writable-dir and has some useful advice.

Back to configuring Ansible

In ansible.cfg, I have the following configured:

collections_paths = ./collections:~/.ansible/collections:/usr/share/ansible/collections

This file didn’t previously exist in this directory, so I created that file.

This block asks Ansible to check the following paths in order:

  • collections in this path (e.g. /home/user/ansible_2.10_aws/collections)
  • collections in the .ansible directory under the user’s home directory (e.g. /home/user/.ansible/collections)
  • and finally /usr/share/ansible/collections for system-wide collections.

If you don’t configure Ansible with the ansible.cfg file, the default is to store the collections in ~/.ansible/collections, but you can “only have one version of the collection”, so this means that if you’re relying on things not to change when testing, or if you’re running multiple versions of Ansible on your system, then it’s safest to store the collections in the same file tree as you’re working in!

Installing Collections

Now we have Ansible 2.10 installed, and our Ansible configuration file set up, let’s get our collection ready to install. We do this with a requirements.yml file, like this:

- name: amazon.aws
  version: ">=1.2.1"

What does this tell us? Firstly, that we want to install the Amazon AWS collection from Ansible Galaxy. Secondly that we want at least the most current version (which is currently version 1.2.1). If you leave the version line out, it’ll get “the latest” version. If you replace ">=1.2.1" with 1.2.1 it’ll install exactly that version from Galaxy.

If you want any other collections, you add them as subsequent lines (more details here), like this:

- name: amazon.aws
  version: ">=1.2.1"
- name: some.other
- name: git+https://example.com/someorg/somerepo.git
  version: 1.0.0
- name: git@example.com:someorg/someotherrepo.git

Once we’ve got this file, we run this command to install the content of the requirements.yml: ansible-galaxy collection install -r requirements.yml

In our case, this installs just the amazon.aws collection, which is what we want. Fab!

Getting our dynamic inventory

Right, so we’ve got all the pieces now that we need! Let’s tell Ansible that we want it to ask AWS for an inventory. There are three sections to this.

Configuring Ansible, again!

We need to open up our ansible.cfg file. Because we’re using the collection to get our Dynamic Inventory plugin, we need to tell Ansible to use that plugin. Edit ./ansible.cfg in your favourite editor, and add this block to the end:

enable_plugins = aws_ec2

If you previously created the ansible.cfg file when you were setting up to get the collection installed alongside, then your ansible.cfg file will look (something) like this:

collections_paths     = ./collections:~/.ansible/collections:/usr/share/ansible/collections

enable_plugins = amazon.aws.aws_ec2

Configure AWS

Your machine needs to have access tokens to interact with the AWS API. These are stored in ~/.aws/credentials (e.g. /home/user/.aws/credentials) and look a bit like this:

aws_access_key_id = A1B2C3D4E5F6G7H8I9J0
aws_secret_access_key = A1B2C3D4E5F6G7H8I9J0a1b2c3d4e5f6g7h8i9j0

Set up your inventory

In a bit of a change to how Ansible usually does the inventory, to have a plugin based dynamic inventory, you can’t specify a file any more, you have to specify a directory. So, create the file ./inventory/aws_ec2.yaml (having created the directory inventory first). The file contains the following:

plugin: amazon.aws.aws_ec2

Late edit 2020-12-01: Further to the comment by Giovanni, I’ve amended this file snippet from plugin: aws_ec2 to plugin: amazon.aws.aws_ec2.

By default, this just retrieves the hostnames of any running EC2 instance, as you can see by running ansible-inventory -i inventory --graph

  |  |--ec2-176-34-76-187.eu-west-1.compute.amazonaws.com
  |  |--ec2-54-170-131-24.eu-west-1.compute.amazonaws.com
  |  |--ec2-54-216-87-131.eu-west-1.compute.amazonaws.com

I need a bit more detail than this – I like to use the tags I assign to AWS assets to decide what I’m going to target the machines with. I also know exactly which regions I’ve got my assets in, and what I want to use to get the names of the devices, so this is what I’ve put in my aws_ec2.yaml file:

plugin: amazon.aws.aws_ec2
- key: tags
  prefix: tag
- key: 'security_groups|json_query("[].group_name")'
  prefix: security_group
- key: placement.region
  prefix: aws_region
- key: tags.Role
  prefix: role
- eu-west-1
- tag:Name
- dns-name
- public-ip-address
- private-ip-address

Late edit 2020-12-01: Again, I’ve amended this file snippet from plugin: aws_ec2 to plugin: amazon.aws.aws_ec2.

Now, when I run ansible-inventory -i inventory --graph, I get this output:

  |  |--euwest1-firewall
  |  |--euwest1-demo
  |  |--euwest1-manager
  |  |--euwest1-firewall
  |  |--euwest1-demo
  |  |--euwest1-manager
  |  |--euwest1-firewall
  |  |--euwest1-manager
  |  |--euwest1-demo
  |  |--euwest1-firewall
  |  |--euwest1-demo
  |  |--euwest1-manager
  |  |--euwest1-firewall
  |  |--euwest1-demo
  |  |--euwest1-manager
  |  |--euwest1-firewall
  |  |--euwest1-manager
  |  |--euwest1-demo

To finish

Now you have your dynamic inventory, you can target your playbook at any of the groups listed above (like role_Firewall, aws_ec2, aws_region_eu_west_1 or some other tag) like you would any other inventory assignment, like this:

- hosts: role_Firewall
  gather_facts: false
  - name: Show the name of this device
      msg: "{{ inventory_hostname }}"

And there you have it. Hope this is useful!

Late edit: 2020-11-23: Following a conversation with Andy from Work, we’ve noticed that if you’re trying to do SSM connections, rather than username/password based ones, you might want to put this in your aws_ec2.yml file:

plugin: amazon.aws.aws_ec2
  - tag:Name
  ansible_host: instance_id
  ansible_connection: 'community.aws.aws_ssm'

Late edit 2020-12-01: One final instance, I’ve changed plugin: aws_ec2 to plugin: amazon.aws.aws_ec2.

This will keep your hostnames “pretty” (with whatever you’ve tagged it as), but will let you connect over SSM to the Instance ID. Good fun :)

Featured image is “inventory” by “Lee” on Flickr and is released under a CC-BY-SA license.