"Family" by "Ivan" on Flickr

Debian on Docker using Vagrant

I want to use Vagrant-Docker to try standing up some environments. There’s no reasonable justification, it’s just a thing I wanted to do. Normally, I’d go into this long and rambling story about why… but on this occasion, the reason was “Because it’s possible”…

TL;DR?: Get the code from the repo and enjoy 😁

Installing Docker

On Ubuntu you can install Docker following the instructions on the Docker Install Page, which includes a convenience script (that runs all the commands you need), if you want to use it. Similar instructions for Debian, CentOS and Fedora exist.

On Windows or Mac there are downloads you can get from the Docker Hub. The Windows Version requires WSL2. I don’t have a Mac, so I don’t know what the requirements are there! Installing WSL2 has a whole host of extra steps that I can’t really do justice to. See this Microsoft article for details.

Installing Vagrant

On Debian and Ubuntu you can add the HashiCorp Apt Repo and then install Vagrant, using these commands:

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt install vagrant

There are similar instructions for RHEL, CentOS and Fedora users there too.

Windows and Mac users will have to get the application from the download page.

Creating your Dockerfile

A Dockerfile is a simple text file which has a series of line prefixes which instruct the Docker image processor to add certain instructions to the Docker Image. I found two pages which helped me with what to add for this; “Ansible. Docker. Vagrant. Bringing together” and the git repo “AkihiroSuda/containerized-systemd“.

You see, while a Dockerfile is great at starting single binary files or scripts, it’s not very good at running SystemD… and I needed SystemD to be able to run the SSH service that Vagrant requires, and to also run the scripts and commands I needed for the image I wanted to build…

Sooooo…. here’s the Dockerfile I created:

# Based on https://vtorosyan.github.io/ansible-docker-vagrant/
# and https://github.com/AkihiroSuda/containerized-systemd/

FROM debian:buster AS debian_with_systemd

# This stuff enables SystemD on Debian based systems
STOPSIGNAL SIGRTMIN+3
RUN DEBIAN_FRONTEND=noninteractive apt update && DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends systemd systemd-sysv dbus dbus-user-session
COPY docker-entrypoint.sh /
RUN chmod 755 /docker-entrypoint.sh
ENTRYPOINT [ "/docker-entrypoint.sh" ]
CMD [ "/bin/bash" ]

# This part installs an SSH Server (required for Vagrant)
RUN DEBIAN_FRONTEND=noninteractive apt install -y sudo openssh-server
RUN mkdir /var/run/sshd
#    We enable SSH here, but don't start it with "now" as the build stage doesn't run anything long-lived.
RUN systemctl enable ssh
EXPOSE 22

# This part creates the vagrant user, sets the password to "vagrant", adds the insecure key and sets up password-less sudo.
RUN useradd -G sudo -m -U -s /bin/bash vagrant
#    chpasswd takes a colon delimited list of username/password pairs.
RUN echo 'vagrant:vagrant' | chpasswd
RUN mkdir -m 700 /home/vagrant/.ssh
# This key from https://github.com/hashicorp/vagrant/tree/main/keys. It will be replaced on first run.
RUN echo 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA6NF8iallvQVp22WDkTkyrtvp9eWW6A8YVr+kz4TjGYe7gHzIw+niNltGEFHzD8+v1I2YJ6oXevct1YeS0o9HZyN1Q9qgCgzUFtdOKLv6IedplqoPkcmF0aYet2PkEDo3MlTBckFXPITAMzF8dJSIFo9D8HfdOV0IAdx4O7PtixWKn5y2hMNG0zQPyUecp4pzC6kivAIhyfHilFR61RGL+GPXQ2MWZWFYbAGjyiYJnAmCP3NOTd0jMZEnDkbUvxhMmBYSdETk1rRgm+R4LOzFUGaHqHDLKLX+FIPKcF96hrucXzcWyLbIbEgE98OHlnVYCzRdK8jlqm8tehUc9c9WhQ== vagrant insecure public key' > /home/vagrant/.ssh/authorized_keys
RUN chmod 600 /home/vagrant/.ssh/authorized_keys
RUN chown -R vagrant:vagrant /home/vagrant
RUN echo 'vagrant ALL=(ALL:ALL) NOPASSWD:ALL' >> /etc/sudoers

This Dockerfile calls out to a separate script, called docker-entrypoint.sh, taken verbatim from AkihiroSuda’s repo, so here’s that file:

#!/bin/bash
set -ex
container=docker
export container

if [ $# -eq 0 ]; then
	echo >&2 'ERROR: No command specified. You probably want to run `journalctl -f`, or maybe `bash`?'
	exit 1
fi

if [ ! -t 0 ]; then
	echo >&2 'ERROR: TTY needs to be enabled (`docker run -t ...`).'
	exit 1
fi

env >/etc/docker-entrypoint-env

cat >/etc/systemd/system/docker-entrypoint.target <<EOF
[Unit]
Description=the target for docker-entrypoint.service
Requires=docker-entrypoint.service systemd-logind.service systemd-user-sessions.service
EOF
cat /etc/systemd/system/docker-entrypoint.target

quoted_args="$(printf " %q" "${@}")"
echo "${quoted_args}" >/etc/docker-entrypoint-cmd
cat /etc/docker-entrypoint-cmd

cat >/etc/systemd/system/docker-entrypoint.service <<EOF
[Unit]
Description=docker-entrypoint.service

[Service]
ExecStart=/bin/bash -exc "source /etc/docker-entrypoint-cmd"
# EXIT_STATUS is either an exit code integer or a signal name string, see systemd.exec(5)
ExecStopPost=/bin/bash -ec "if echo \${EXIT_STATUS} | grep [A-Z] > /dev/null; then echo >&2 \"got signal \${EXIT_STATUS}\"; systemctl exit \$(( 128 + \$( kill -l \${EXIT_STATUS} ) )); else systemctl exit \${EXIT_STATUS}; fi"
StandardInput=tty-force
StandardOutput=inherit
StandardError=inherit
WorkingDirectory=$(pwd)
EnvironmentFile=/etc/docker-entrypoint-env

[Install]
WantedBy=multi-user.target
EOF
cat /etc/systemd/system/docker-entrypoint.service

systemctl mask systemd-firstboot.service systemd-udevd.service
systemctl unmask systemd-logind
systemctl enable docker-entrypoint.service

systemd=
if [ -x /lib/systemd/systemd ]; then
	systemd=/lib/systemd/systemd
elif [ -x /usr/lib/systemd/systemd ]; then
	systemd=/usr/lib/systemd/systemd
elif [ -x /sbin/init ]; then
	systemd=/sbin/init
else
	echo >&2 'ERROR: systemd is not installed'
	exit 1
fi
systemd_args="--show-status=false --unit=multi-user.target"
echo "$0: starting $systemd $systemd_args"
exec $systemd $systemd_args

Now, if you were to run this straight in Docker, it will fail, because you must pass certain flags to Docker to get this to run. These flags are:

  • -t : pass a “TTY” to the shell
  • --tmpfs /tmp : Create a temporary filesystem in /tmp
  • --tmpfs /run : Create another temporary filesystem in /run
  • --tmpfs /run/lock : Apparently having a tmpfs in /run isn’t enough – you ALSO need one in /run/lock
  • -v /sys/fs/cgroup:/sys/fs/cgroup:ro : Mount the CGroup kernel configuration values into the container

(I found these flags via a RedHat blog post, and a Podman issue on Github.)

So, how would this look, if you were to try and run it?

docker exec -t --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup:ro YourImage

Blimey, what a long set of text! Perhaps we could hide that behind something a bit more legible? Enter Vagrant.

Creating your Vagrantfile

Vagrant is an abstraction tool, designed to hide complicated virtualisation scripts into a simple command. In this case, we’re hiding a containerisation script into a simple command.

Like with the Dockerfile, I made extensive use of the two pages I mentioned before, as well as the two pages to get the flags to run this.

# Based on https://vtorosyan.github.io/ansible-docker-vagrant/
# and https://github.com/AkihiroSuda/containerized-systemd/
# and https://developers.redhat.com/blog/2016/09/13/running-systemd-in-a-non-privileged-container/
# with tweaks indicated by https://github.com/containers/podman/issues/3295
ENV['VAGRANT_DEFAULT_PROVIDER'] = 'docker'
Vagrant.configure("2") do |config|
  config.vm.provider "docker" do |d|
    d.build_dir       = "."
    d.has_ssh         = true
    d.remains_running = false
    d.create_args     = ['--tmpfs', '/tmp', '--tmpfs', '/run', '--tmpfs', '/run/lock', '-v', '/sys/fs/cgroup:/sys/fs/cgroup:ro', '-t']
  end
end

If you create that file, and run vagrant up you’ll get a working Vagrant boot… But if you try and execute any shell scripts, they’ll fail to run, as the they aren’t passed in with execute permissions… so I want to use Ansible to execute things, as these don’t require execute permissions on the /vagrant directory (also, as the thing I’m building in there requires Ansible… so it’s helpful either way 😁)

Executing Ansible scripts

Ansible still expects to find python in /usr/bin/python but current systems don’t make the symlink to /usr/bin/python3, as python was typically a symlink to /usr/bin/python2… and also I wanted to put the PPA for Ansible in the sources, which is what the Ansible team recommend in their documentation. I’ve done this as part of the Dockerfile, as again, I can’t run scripts from Vagrant. So, here’s the addition I made to the Dockerfile.

FROM debian_with_systemd AS debian_with_systemd_and_ansible
RUN apt install -y gnupg2 lsb-release software-properties-common
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 93C4A3FD7BB9C367
RUN add-apt-repository "deb http://ppa.launchpad.net/ansible/ansible/ubuntu trusty main"
RUN apt install -y ansible
# Yes, I know. Trusty? On Debian Buster?? But, that's what the Ansible Docs say!

In the Vagrantfile, I’ve added this block:

config.vm.provision "ansible_local" do |ansible|
  ansible.playbook = "test.yml"
end

And I created a test.yml, which looks like this:

---
- hosts: all
  tasks:
  - debug:
      msg: "Hello from Docker"

Running it

So how does this look on Windows when I run it?

PS C:\Dev\VagrantDockerBuster> vagrant up
==> default: Creating and configuring docker networks...
==> default: Building the container from a Dockerfile...
<SNIP A LOAD OF DOCKER STUFF>
    default: #20 DONE 0.1s
    default:
    default: Image: 190ffdeaeed0b7ed206097e6c1d4b5cc796a428700c9bd3e27eedacce47fb63b
==> default: Creating the container...
    default:   Name: 2021-02-13DockerBusterWithSSH_default_1613469604
    default:  Image: 190ffdeaeed0b7ed206097e6c1d4b5cc796a428700c9bd3e27eedacce47fb63b
    default: Volume: C:/Users/SPRIGGSJ/OneDrive - FUJITSU/Documents/95 My Projects/2021-02-13 Docker Buster With SSH:/vagrant
    default:   Port: 127.0.0.1:2222:22
    default:
    default: Container created: b64ed264d8949b12
==> default: Enabling network interfaces...
==> default: Starting container...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default:
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default:
    default: Inserting generated public key within guest...
==> default: Machine booted and ready!
==> default: Running provisioner: ansible_local...
    default: Running ansible-playbook...

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
[WARNING]: Platform linux on host default is using the discovered Python
interpreter at /usr/bin/python, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.9/referen
ce_appendices/interpreter_discovery.html for more information.
ok: [default]

TASK [debug] *******************************************************************
ok: [default] => {
    "msg": "Hello from Docker"
}

PLAY RECAP *********************************************************************
default                    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

PS C:\Dev\VagrantDockerBuster>

And on Linux?

Bringing machine 'default' up with 'docker' provider...
==> default: Creating and configuring docker networks...
==> default: Building the container from a Dockerfile...
<SNIP A LOAD OF DOCKER STUFF>
    default: Removing intermediate container e56bed4f7be9
    default:  ---> cef749c205bf
    default: Successfully built cef749c205bf
    default:
    default: Image: cef749c205bf
==> default: Creating the container...
    default:   Name: 2021-02-13DockerBusterWithSSH_default_1613470091
    default:  Image: cef749c205bf
    default: Volume: /home/spriggsj/Projects/2021-02-13 Docker Buster With SSH:/vagrant
    default:   Port: 127.0.0.1:2222:22
    default:
    default: Container created: 3fe46b02d7ad10ab
==> default: Enabling network interfaces...
==> default: Starting container...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default:
    default: Vagrant insecure key detected. Vagrant will automatically replace
    default: this with a newly generated keypair for better security.
    default:
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!
==> default: Running provisioner: ansible_local...
    default: Running ansible-playbook...

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
[WARNING]: Platform linux on host default is using the discovered Python
interpreter at /usr/bin/python, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.9/referen
ce_appendices/interpreter_discovery.html for more information.
ok: [default]

TASK [debug] *******************************************************************
ok: [default] => {
    "msg": "Hello from Docker"
}

PLAY RECAP *********************************************************************
default                    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

So, if you’re crazy and want to do Vagrant using Docker with Debian Buster and Ansible, this is how to do it. I don’t know how much I’m likely to be using this in the future, but if you use it, let me know what you’re doing with it! πŸ˜€

Featured image is β€œFamily” by β€œIvan” on Flickr and is released under a CC-BY license.

"Blueprints" by "Cameron Degelia" on Flickr

Using Architectural Decision Records (ADR) with adr-tools

Introducing Architectural Decision Records

Over the last week, I discovered a new tool for my arsenal called Architectural Decision Records (ADR). They were first written about in 2011, in a post called “Documenting Architecture Decisions“, where the author, Michael Nygard, advocates for short documents explaining each decision that influences the architecture of an environment.

I found this via a Github repository, created by the team at gov.uk, which includes their ADR library, and references the tool they use to manage these documents – adr-tools.

Late edit 2021-01-25: I also found a post which suggests that Spotify uses ADR.

Late edit 2021-08-11: I wrote a post about using other tooling.

Late edit 2021-12-14: I released (v0.0.1) my own rust-based application for making Decision Records. Yes, Decision Records – not Architecture Decision Records… because I think you should be able to apply the same logic to all decisions, not just architectural ones.

Installing adr-tools on Linux

Currently adr-tools are easier to install under OSX rather than Linux or Windows Subsystem for Linux (WSL) (I’m working on this – bear with me! πŸ˜ƒ ).

The current installation notes suggest for Linux (which would also work on WSL) is to download the latest release tar.gz or zip file and unpack it into your path somewhere. This isn’t exactly the best way to deploy anything on Linux, but… I guess it works right now?

For me, I downloaded the file, and unpacked the whole tar.gz file (as root) into /usr/local/bin/, giving me a directory of /usr/local/bin/adr-tools-3.0.0/. There’s a subdirectory in here, called src which contains a large number of files – mostly starting _adr or adr- and two additional files, init.md and template.md.

Rather than putting all of these files into /usr/local/bin directly, instead I leave them in the adr-tools-3.0.0 directory, and create a symbolic link (symlink) to the /usr/local/bin directory with this command:

cd /usr/local/bin
ln -s adr-tools-3.0.0/src/* .

This gives me all those files in one place, so I can refer to them later.

An aside – why link everything in that src directory? (Feel free to skip this block!)

Now, why, you might ask, do all of these unrelated files need to be in the same place? Well…. the author of the script has put this in at the top of almost all the files:

#!/bin/bash
set -e
eval "$($(dirname $0)/adr-config)"

And then in that script, it says:

#!/bin/bash
basedir=$(cd -L $(dirname $0) > /dev/null 2>&1 && pwd -L)
echo "adr_bin_dir=$basedir"
echo "adr_template_dir=$basedir"

There are, technically, good reasons for this! This is designed to be run in, what in the Windows world, you might call as a “Portable Script”. So, you bung adr-tools into some directory somewhere, and then just call adr somecommand and it knows that all the files are where they need to be. The (somewhat) down side to this is that if you just want to call adr somecommand rather than path/to/my/adr somecommand then all those files need to be there

I’m currently looking to see if I can improve this somewhat, so that it’s not quite so complex to install, but for now, that’s what you need.

Anyway…

Using adr-tools to document your decisions

I’ll start documenting a fictional hosted web service project, and note down some of the decisions which have been made.

Initializing your ADR directory

Start by running adr init. You may want to specify a directory where you want to put these records, so instead use: adr init path/to/adr, like this:

Initializing the ADR in “documentation/architecture-decisions” with adr init documentation/architecture-decisions

You’ll notice that when I run this command, it creates a new entry, called 0001-record-architecture-decisions.md. Let’s open this up, and see what’s in here.

The VSCode record for the choice to use ADR. It is a markdown file, with the standard types of data recorded.

In here we have the record ID (1.), the title of the record Record architecture decisions, the date the choice was made Date: 2021-01-19, a status of Accepted, the context on why we made this choice, the decision, and the consequences of making this decision. Make changes, if needed, and save it. Let’s move on.

Creating our first own record

This all is quite straightforward thus far. Let’s create our next record.

Issuing the command adr new <sometitle> you create the next ADR record.

Let’s open up that record.

The template for the ADR record for “Use AWS”.

Like the first record, we have a title, a status, a context, decision and consequences. Let’s define these.

A “finished” brief ADR record.

This document shouldn’t be very long! It just describes why a choice was made and what that entails.

Changing decisions – completely replacing (superseding) a decision

Of course, over time, decisions will be replaced due to various decisions elsewhere.

You can ask adr to supersede a previous record, using the “-s” flag, and the record number.

Let’s look at how that works on the second ADR record.

After the command adr new -s 2 Use Azure, the ADR record number 2 has a new status, “Superceded by” and the superseded linked document. Yes, “Superceded” is a typo. There is an open PR for it

So, under the “Status”, where is previously said “Approved”, it now says “Superceded by [3. Use Azure](0003-use-azure.md)“. This is a markdown statement which indicates where the superseded document is located. As I mentioned in the comment below the above image, there is an open Pull Request to fix this on the adr-tools, so hopefully that typo won’t last long!

We’ve got our new ADR too – let’s take a look at that one?

Our new ADR shows that it “supercedes” the previous record. Which is good! Typo aside :)

Other references

Of course, you don’t always completely overrule a decision. Sometimes your decision is influenced by, or has a dependency on something else, like this one.

We know which provider we’re using at long last, now let’s pick a region. Use the -l flag to “link” between the referenced and new ADR. The context for the -l flag is “<number>:<text for link to number>:<text for link in targetted document>”.

The command here is:

adr new -l '3:Dependency:Influences' Use Region UK South and UK West

I’m just going to crop from the “Status” block on both the referenced ADR (3) and the ADR which references it (4):

Status block in ADR 0003 which is referenced by ADR 0004
Status block in the new ADR 0004 which references ADR 0003

And of course, you can also use the same switch to mark documents as partially obsoleted, like this:

adr new -l '4:Partially obsoletes:Partially obsoleted by' Use West Europe region instead of UK West region
Status block in ADR 0004 indicating it’s partially obsoleted. Probably worth updating the status properly to show it’s not just “Accepted”.

If you forget to add the referencing in, you can also use the adr link command, like this:

adr link 3 Influences 5 Dependency

To be clear, that command adds a (complete) line to ADR 0003 saying “Influences [5. ADR Title](link)” and a separate (complete) line to ADR 0005 saying “Dependency [3. ADR Title](link)“.

What else can we do?

There are four other “things” that it’s worth doing at this point.

  1. Note that you can change the template per-ADR directory.

Create a directory called “templates” in the ADR directory, and put a file in there called “template.md“. Tweak this as you need. Ensure you have AT LEAST the line ## Status and # NUMBER. TITLE as these are required by the script.

A much abbreviated template file, containing just “Number”, “Title”, “Date”, “Status”, and a new dummy heading called “Stuff”.
And the result of running adr new Some Text once you’ve created that template.

As you can see, it’s possible to add all sorts of content in this template as a result. Bear in mind, before your template turns into something like this, that it’s supposed to be a short document explaining why each decision was made, not a funding proposal, or a complex epic of your user stories!

Be careful not to let your template run away with you!
  1. Note that you can automatically open an editor, by setting the EDITOR (where the process is expected to finish before returning control, like using nano, emacs or vim, for example) or VISUAL (where the process is expected to “fork”, like for example, gedit or vscode) environment variable, and then running adr new A Title, like this:
  1. We can create “Table of Contents” files, using the adr generate toc command, like this:
Generating the table of contents, for injecting into other files.

This can be included into your various other markdown files. There are switches, so you can set the link path, but your best bet is to find that using adr help generate toc.

  1. We can also generate graphviz files of the link maps between elements of the various ADRs, like this: adr generate graph | dot -Tjpg > graph.jpg

If you omit the “| dot -Tjpg > graph.jpg” part, then you’ll see the graphviz output, which looks like this: (I’ve removed the documents 6 and 7).

digraph {
  node [shape=plaintext];
  subgraph {
    _1 [label="1. Record architecture decisions"; URL="0001-record-architecture-decisions.html"];
    _2 [label="2. Use AWS"; URL="0002-use-aws.html"];
    _1 -> _2 [style="dotted", weight=1];
    _3 [label="3. Use Azure"; URL="0003-use-azure.html"];
    _2 -> _3 [style="dotted", weight=1];
    _4 [label="4. Use Region UK South and UK West"; URL="0004-use-region-uk-south-and-uk-west.html"];
    _3 -> _4 [style="dotted", weight=1];
    _5 [label="5. Use West Europe region instead of UK West region"; URL="0005-use-west-europe-region-instead-of-uk-west-region.html"];
    _4 -> _5 [style="dotted", weight=1];
  }
  _3 -> _2 [label="Supercedes", weight=0]
  _3 -> _5 [label="Influences", weight=0]
  _4 -> _3 [label="Dependency", weight=0]
  _5 -> _4 [label="Partially obsoletes", weight=0]
  _5 -> _3 [label="Dependency", weight=0]
}

To make the graphviz part work, you’ll need to install graphviz, which is just an apt get away.

Any caveats?

adr-tools is not actively maintained. I’ve contacted the author, about seeing if I can help out with the maintenance, but… we’ll see, and given some fairly high profile malware takeovers of projects like this sort of thing on Github, Docker, NPM, and more… I can see why there might be some reluctance to consider it! Also, I’m an unknown entity, I’ve just dropped in on the project and offered to help, with no previous exposure to the lead dev or the project… so, we’ll see. Worst case, I’ll fork it!

Working with this also requires an understanding of markdown files, and why these might be a useful document format for records like this. There was a PR submitted to support multiple file formats (like asciidoc and rst) but these were not approved by the author.

There is no current intention to support languages other than English. The tool is hard-coded to look for strings like “status” and “superceded” which is hard. Part of the reason I raised the PRs I did was to let me fix some of these sorts of issues. Again, we’ll see what happens.

Lastly, it can be overwhelming to see a lot of documents in one place, particularly if they’re as granular as the documents I produced in this demo. If the project supported categories, or could be broken down into components (like doc/adr/networking and doc/adr/server_builds and doc/adr/applications) then this might help, but it’s not on the roadmap right now!

Late edit 2021-01-25: If you don’t think these templates have enough context or content, there are lots of others listed on Joel Parker Henderson’s repo of examples and templates. If you want a python based viewer of ADR records, take a look at adr-viewer.

Featured image is β€œBlueprints” by β€œCameron Degelia” on Flickr and is released under a CC-BY license.

"Raven" by "Jim Bahn" on Flickr

Sending SSH login notifications to Matrix via Huginn using Webhooks

On the Self Hosted Podcast’s Discord Server, someone posted a link to the following blog post, which I read and found really interesting…: https://blog.hay-kot.dev/ssh-notifications-in-home-assistant/

You see, the key part of that post wasn’t that they were posting to Home Assistant when they were logging in, but instead that they were triggering a webhook on login. And I can do stuff with Webhooks.

What’s a webhook?

A webhook is a callable URL, either with a “secret” embedded in the URL or some authentication header that lets you trigger an action of some sort. I first came across these with Github, but they’re pretty common now. Services will offer these as a way to get an action in one service to do something in another. A fairly common webhook for those getting started with these sorts of things is where creating a pull request (PR) on a Github repository will trigger a message on something like Slack to say the PR is there.

Well, that’s all well and good, but what does Matrix or Huginn have to do with things?

Matrix is a decentralized, end to end encrypted, eventually consistent database system, that just happens to be used extensively as a chat network. In particular, it’s used by Open Source projects, like KDE and Mozilla, and by Government bodies, like the whole French goverment (lead by DINSIC) the German Bundeswehr (Unified Armed Forces) department.

Matrix has a reference client, Element, that was previously called “Riot”, and in 2018 I produced a YouTube video showing how to bridge various alternative messaging systems into Matrix.

Huginn describes itself as:

Huginn is a system for building agents that perform automated tasks for you online. They can read the web, watch for events, and take actions on your behalf. Huginn’s Agents create and consume events, propagating them along a directed graph. Think of it as a hackable version of IFTTT or Zapier on your own server. You always know who has your data. You do.

Huginn Readme

With Huginn, I can create “agents”, including a “receive webhook agent” that will take the content I send, and tweak it to do something else. In the past I used IFTTT to do some fun things, like making this blog work, but now I use Huginn to post Tweets when I post to this blog.

So that I knew that Huginn was posting my twitter posts, I created a Matrix room called “Huginn Alerts” and used the Matrix account I created for the video I mentioned before to send me messages that it had made the posts I wanted. I followed the guidance from this page to do it: https://drwho.virtadpt.net/archive/2020-02-12/integrating-huginn-with-a-matrix-server/

Enough already. Just show me what you did.

In Element.io

  1. Get an access token for the Matrix account you want to post with.

Log into the web interface at https://app.element.io and go to your settings

Click where it says your handle, then click on where it says “All Settings”.

Then click on “Help & About” and scroll to the bottom of that page, where it says “Advanced”

Get to the “Advanced” part of the settings, under “Help & About” to get your access token.

Click where it says “Access Token: <click to reveal>” (strangely, I’m not posting that πŸ˜‰)

  1. Click on the room, then click on it’s name at the top to open the settings, then click on “Advanced” to get the “Internal room ID”
Gettng the Room ID. Note, it starts with an exclamation mark (!) and ends :<servername>.

In Huginn

  1. Go to the “Credentials” tab, and click on “New Credential”. Give the credential a name (perhaps “Matrix Bot Access Token”), leave it as text and put your access token in here.
  1. Create a Credential for the Room ID. Like before, name it something sensible and put the ID you found earlier.
  1. Create a “Post Agent” by going to Agents and selecting “New agent”. This will show just the “Type” box. You can type in this box to put “Post Agent” and then find it. That will then provide you with the rest of these boxes. Provide a name, and tick the box marked “Propagate immediately”. I’ll cover the content of the “Options” box after this screenshot.

In the “Options” block is a button marked “Toggle View”. Select this which turns it from the above JSON pretty editor, into this text field (note your text is likely to be different):

My content of that box is as follows:

{
  "post_url": "https://matrix.org/_matrix/client/r0/rooms/{% credential Personal_Matrix_Notification_Channel %}/send/m.room.message?access_token={% credential Matrix_Bot_Access_Credential %}",
  "expected_receive_period_in_days": "365",
  "content_type": "json",
  "method": "post",
  "payload": {
    "msgtype": "m.text",
    "body": "{{ text }}"
  },
  "emit_events": "true",
  "no_merge": "false",
  "output_mode": "clean"
}

Note that the “post_url” value contains two “credential” values, like this:

{% credential Personal_Matrix_Notification_Channel %} (this is the Room ID we found earlier) and {% credential Matrix_Bot_Access_Credential %} (this is the Access Token we found earlier).

If you’ve used different names for these values (which are perfectly valid!) then just change these two. The part where it says “{{ text }}” leave there, because we’ll be using that in a later section. Click “Save” (the blue button at the bottom).

  1. Create a Webhook Agent. Go to Agents and then “New Agent”. Select “Webhook Agent” from the “Type” field. Give it a name, like “SSH Logged In Notification Agent”. Set “Keep Events” to a reasonable number of days, like 5. In “Receivers” find the Notification agent you created (“Send Matrix Message to Notification Room” was the name I used). Then, in the screenshot, I’ve pressed the “Toggle View” button on the “Options” section, as this is, to me a little clearer.

The content of the “options” box is:

{
  "secret": "supersecretstring",
  "expected_receive_period_in_days": 365,
  "payload_path": ".",
  "response": ""
}

Change the “secret” from “supersecretstring” to something a bit more useful and secure.

The “Expected Receive Period in Days” basically means, if you’ve not had an event cross this item in X number of days, does Huginn think this agent is broken? And the payload path of “.” basically means “pass everything to the next agent”.

Once you’ve completed this step, press “Save” which will take you back to your agents, and then go into the agent again. This will show you a page like this:

Copy that URL, because you’ll need it later…

On the server you are logging the SSH to

As root, create a file called /etc/ssh/sshrc. This file will be your script that will run every time someone logs in. It must have the file permissions 0644 (u+rw,g+r,o+r), which means that there is a slight risk that the Webhook secret is exposed.

The content of that file is as follows:

#!/bin/sh
ip="$(echo "$SSH_CONNECTION" | cut -d " " -f 1)"
curl --silent\
     --header "Content-Type: application/json"\
     --request POST\
     --data '{
       "At": "'"$(date -Is)"'",
       "Connection": "'"$SSH_CONNECTION"'",
       "User": "'"$USER"'",
       "Host": "'"$(hostname)"'",
       "Src": "'"$ip"'",
       "text": "'"$USER@$(hostname) logged in from $ip at $(date +%H:%M:%S)"'"
     }'\
     https://my.huginn.website/some/path/web_requests/taskid/secret

The heading line (#!/bin/sh) is more there for shellcheck, as, according to the SSH man page this is executed by /bin/sh either way.

The bulk of these values (At, Connection, User, Host or Src) are not actually used by Huginn, but might be useful for later… the key one is text, which if you recall from the “Send Matrix Message to Notification Room” Huginn agent, we put {{ text }} into the “options” block – that’s this block here!

So what happens when we log in over SSH?

SSH asks the shell in the user’s context to execute /etc/ssh/sshrc before it hands over to the user’s login session. This script calls curl and hands some POST data to the url.

Huginn receives this POST via the “SSH Logged In Notification Agent”, and files it.

Huginn then hands that off to the “Send Matrix Message to Notification Room”:

Huginn makes a POST to the Matrix.org server, and Matrix sends the finished message to all the attached clients.

Featured image is β€œRaven” by β€œJim Bahn” on Flickr and is released under a CC-BY license.

"The Guitar Template" by "Neil Williamson" on Flickr

Testing (and failing inline) for data types in Ansible

I tend to write long and overly complicated set_fact statements in Ansible, ALL THE DAMN TIME. I write stuff like this:

rulebase: |
  {
    {% for var in vars | dict2items %}
      {% if var.key | regex_search(regex_rulebase_match) | type_debug != "NoneType"
        and (
          var.value | type_debug == "dict" 
          or var.value | type_debug == "AnsibleMapping"
        ) %}
        {% for item in var.value | dict2items %}
          {% if item.key | regex_search(regex_rulebase_match) | type_debug != "NoneType"
            and (
              item.value | type_debug == "dict" 
              or item.value | type_debug == "AnsibleMapping"
            ) %}
            "{{ var.key | regex_replace(regex_rulebase_match, '\2') }}{{ item.key | regex_replace(regex_rulebase_match, '\2') }}": {
              {# This block is used for rulegroup level options #}
              {% for key in ['log_from_start', 'log', 'status', 'nat', 'natpool', 'schedule', 'ips_enable', 'ssl_ssh_profile', 'ips_sensor'] %}
                {% if var.value[key] is defined and rule.value[key] is not defined %}
                  {% if var.value[key] | type_debug in ['string', 'AnsibleUnicode'] %}
                    "{{ key }}": "{{ var.value[key] }}",
                  {% else %}
                    "{{ key }}": {{ var.value[key] }},
                  {% endif %}
                {% endif %}
              {% endfor %}
              {% for rule in item.value | dict2items %}
                {% if rule.key in ['sources', 'destinations', 'services', 'src_internet_service', 'dst_internet_service'] and rule.value | type_debug not in ['list', 'AnsibleSequence'] %}
                  "{{ rule.key }}": ["{{ rule.value }}"],
                {% elif rule.value | type_debug in ['string', 'AnsibleUnicode'] %}
                  "{{ rule.key }}": "{{ rule.value }}",
                {% else %}
                  "{{ rule.key }}": {{ rule.value }},
                {% endif %}
              {% endfor %}
            },
          {% endif %}
        {% endfor %}
      {% endif %}
    {% endfor %}
  }

Now, if you’re writing set_fact or vars like this a lot, what you tend to end up with is the dreaded dict2items requires a dictionary, got instead. which basically means “Hah! You wrote a giant blob of what you thought was JSON, but didn’t render right, so we cast it to a string for you!”

The way I usually write my playbooks, I’ll do something with this set_fact at line, let’s say, 10, and then use it at line, let’s say, 500… So, I don’t know what the bloomin’ thing looks like then!

So, how to get around that? Well, you could do a type check. In fact, I wrote a bloomin’ big blog post explaining just how to do that!

However, that gets unwieldy really quickly, and what I actually wanted to do was to throw the breaks on as soon as I’d created an invalid data type. So, to do that, I created a collection of functions which helped me with my current project, and they look a bit like this one, called “is_a_string.yml“:

- name: Type Check - is_a_string
  assert:
    quiet: yes
    that:
    - vars[this_key] is not boolean
    - vars[this_key] is not number
    - vars[this_key] | int | string != vars[this_key] | string
    - vars[this_key] | float | string != vars[this_key] | string
    - vars[this_key] is string
    - vars[this_key] is not mapping
    - vars[this_key] is iterable
    success_msg: "{{ this_key }} is a string"
    fail_msg: |-
      {{ this_key }} should be a string, and is instead
      {%- if vars[this_key] is not defined %} undefined
      {%- else %} {{ vars[this_key] is boolean | ternary(
        'a boolean',
        (vars[this_key] | int | string == vars[this_key] | string) | ternary(
          'an integer',
          (vars[this_key] | float | string == vars[this_key] | string) | ternary(
            'a float',
            vars[this_key] is string | ternary(
              'a string',
              vars[this_key] is mapping | ternary(
                'a dict',
                vars[this_key] is iterable | ternary(
                  'a list',
                  'unknown (' ~ vars[this_key] | type_debug ~ ')'
                )
              )
            )
          )
        )
      )}}{% endif %} - {{ vars[this_key] | default('unset') }}

To trigger this, I do the following:

- hosts: localhost
  gather_facts: false
  vars:
    SomeString: abc123
    SomeDict: {'somekey': 'somevalue'}
    SomeList: ['somevalue']
    SomeInteger: 12
    SomeFloat: 12.0
    SomeBoolean: false
  tasks:
  - name: Type Check - SomeString
    vars:
      this_key: SomeString
    include_tasks: tasks/type_check/is_a_string.yml
  - name: Type Check - SomeDict
    vars:
      this_key: SomeDict
    include_tasks: tasks/type_check/is_a_dict.yml
  - name: Type Check - SomeList
    vars:
      this_key: SomeList
    include_tasks: tasks/type_check/is_a_list.yml
  - name: Type Check - SomeInteger
    vars:
      this_key: SomeInteger
    include_tasks: tasks/type_check/is_an_integer.yml
  - name: Type Check - SomeFloat
    vars:
      this_key: SomeFloat
    include_tasks: tasks/type_check/is_a_float.yml
  - name: Type Check - SomeBoolean
    vars:
      this_key: SomeBoolean
    include_tasks: tasks/type_check/is_a_boolean.yml

I hope this helps you, bold traveller with complex jinja2 templating requirements!

(Oh, and if you get “template error while templating string: no test named 'boolean'“, you’re probably running Ansible which you installed using apt from Ubuntu Universe, version 2.9.6+dfsg-1 [or, at least I was!] – to fix this, use pip to install a more recent version – preferably using virtualenv first!)

Featured image is β€œThe Guitar Template” by β€œNeil Williamson” on Flickr and is released under a CC-BY-SA license.

'Geocache "Goodies"' by 'sk' on Flickr

Caching online data sources in Ansible for later development or testing

My current Ansible project relies on me collecting a lot of data from AWS and then checking it again later, to see if something has changed.

This is great for one-off tests (e.g. terraform destroy ; terraform apply ; ansible-playbook run.yml) but isn’t great for repetitive tests, especially if you have to collect data that may take many minutes to run all the actions, or if you have slow or unreliable internet in your development environment.

To get around this, I wrote a wrapper for caching this data.

At the top of my playbook, run.yml, I have these tasks:

- name: Set Online Status.
  # This stores the value of run_online, unless run_online
  # is not set, in which case, it defines it as "true".
  ansible.builtin.set_fact:
    run_online: |-
      {{- run_online | default(true) | bool -}}

- name: Create cache_data path.
  # This creates a "cached_data" directory in the same
  # path as the playbook.
  when: run_online | bool and cache_data | default(false) | bool
  delegate_to: localhost
  run_once: true
  file:
    path: "cached_data"
    state: directory
    mode: 0755

- name: Create cache_data for host.
  # This creates a directory under "cached_data" in the same
  # path as the playbook, with the name of each of the inventory
  # items.
  when: run_online | bool and cache_data | default(false) | bool
  delegate_to: localhost
  file:
    path: "cached_data/{{ inventory_hostname }}"
    state: directory
    mode: 0755

Running this sets up an expectation for the normal operation of the playbook, that it will be “online”, by default.

Then, every time I need to call something “online”, for example, collect EC2 Instance Data (using the community.aws.ec2_instance_info module), I call out to (something like) this set of tasks, instead of just calling the task by itself.

- name: List all EC2 instances in the regions of interest.
  when: run_online | bool
  community.aws.ec2_instance_info:
    region: "{{ item.region_name }}"
  loop: "{{ regions }}"
  loop_control:
    label: "{{ item.region_name }}"
  register: regional_ec2

- name: "NOTE: Set regional_ec2 data path"
  when: not run_online | bool or cache_data | default(false) | bool
  set_fact:
    regional_ec2_cached_data_file_loop: "{{ regional_ec2_cached_data_file_loop | default(0) | int + 1 }}"
    cached_data_filename: "cached_data/{{ inventory_hostname }}/{{ cached_data_file | default('regional_ec2') }}.{{ regional_ec2_cached_data_file_loop | default(0) | int + 1 }}.json"

- name: "NOTE: Cache/Get regional_ec2 data path"
  when: not run_online | bool or cache_data | default(false) | bool
  debug:
    msg: "File: {{ cached_data_filename }}"

- name: Cache all EC2 instances in the regions of interest.
  when: run_online | bool and cache_data | default(false) | bool
  delegate_to: localhost
  copy:
    dest: "{{ cached_data_filename }}"
    mode: "0644"
    content: "{{ regional_ec2 }}"

- name: "OFFLINE: Load all EC2 instances in the regions of interest."
  when: not run_online | bool
  set_fact:
    regional_ec2: "{% include( cached_data_filename ) %}"

The first task, if it’s still set to being “online” will execute the task, and registers the result for later. If cache_data is configured, we generate a filename for the caching, record the filename to the log (via the debug task) and then store it (using the copy task). So far, so online… but what happens when we don’t need the instance to be up and running?

In that case, we use the set_fact module, triggered by running the playbook like this: ansible-playbook run.yml -e run_online=false. This reads the cached data out of that locally stored pool of data for later use.

Featured image is ‘Geocache “Goodies”‘ by ‘sk‘ on Flickr and is released under a CC-BY-ND license.

"Main console" by "Steve Parker" on Flickr

Running services (like SSH, nginx, etc) on Windows Subsystem for Linux (WSL1) on boot

I recently got a new laptop, and for various reasons, I’m going to be primarily running Windows on that laptop. However, I still like having a working SSH server, running in the context of my Windows Subsystem for Linux (WSL) environment.

Initially, trying to run service ssh start failed with an error, because you need to re-execute the ssh configuration steps which are missed in a WSL environment. To fix that, run sudo apt install --reinstall openssh-server.

Once you know your service runs OK, you start digging around to find out how to start it on boot, and you’ll see lots of people saying things like “Just run a shell script that starts your first service, and then another shell script for the next service.”

Well, the frustration for me is that Linux already has this capability – the current popular version is called SystemD, but a slightly older variant is still knocking around in modern linux distributions, and it’s called SystemV Init, often referred to as just “sysv” or “init.d”.

The way that those services work is that you have an “init” file in /etc/init.d and then those files have a symbolic link into a “runlevel” directory, for example /etc/rc3.d. Each symbolic link is named S##service or K##service, where the ## represents the order in which it’s to be launched. The SSH Daemon, for example, that I want to run is created in there as /etc/rc3.d/S01ssh.

So, how do I make this work in the grander scheme of WSL? I can’t use SystemD, where I could say systemctl enable --now ssh, instead I need to add a (yes, I know) shell script, which looks in my desired runlevel directory. Runlevel 3 is the level at which network services have started, hence using that one. If I was trying to set up a graphical desktop, I’d instead be looking to use Runlevel 5, but the X Windows system isn’t ported to Windows like that yet… Anyway.

Because the rc#.d directory already has this structure for ordering and naming services to load, I can just step over this directory looking for files which match or do not match the naming convention, and I do that with this script:

#! /bin/bash
function run_rc() {
  base="$(basename "$1")"
  if [[ ${base:0:1} == "S" ]]
  then
    "$1" start
  else
    "$1" stop
  fi
}

if [ "$1" != "" ] && [ -e "$1" ]
then
  run_rc "$1"
else
  rc=3
  if [ "$1" != "" ] && [ -e "/etc/rc${$1}.d/" ]
  then
    rc="$1"
  fi
  for digit1 in {0..9}
  do
    for digit2 in {0..9}
    do
      find "/etc/rc${rc}.d/" -name "[SK]${digit1}${digit2}*" -exec "$0" '{}' \; 2>/dev/null
    done
  done
fi

I’ve put this script in /opt/wsl_init.sh

This does a bit of trickery, but basically runs the bottom block first. It loops over the digits 0 to 9 twice (giving you 00, 01, 02 and so on up to 99) and looks in /etc/rc3.d for any file containing the filename starting S or K and then with the two digits you’ve looped to by that point. Finally, it runs itself again, passing the name of the file it just found, and this is where the top block comes in.

In the top block we look at the “basename” – the part of the path supplied, without any prefixed directories attached, and then extract just the first character (that’s the ${base:0:1} part) to see whether it’s an “S” or anything else. If it’s an S (which everything there is likely to be), it executes the task like this: /etc/rc3.d/S01ssh start and this works because it’s how that script is designed! You can run one of the following instances of this command: service ssh start, /etc/init.d/ssh start or /etc/rc3.d/S01ssh start. There are other options, notably “stop” or “status”, but these aren’t really useful here.

Now, how do we make Windows execute this on boot? I’m using NSSM, the “Non-sucking service manager” to add a line to the Windows System services. I placed the NSSM executable in C:\Program Files\nssm\nssm.exe, and then from a command line, ran C:\Program Files\nssm\nssm.exe install WSL_Init.

I configured it with the Application Path: C:\Windows\System32\wsl.exe and the Arguments: -d ubuntu -e sudo /opt/wsl_init.sh. Note that this only works because I’ve also got Sudo setup to execute this command without prompting for a password.

Here I invoke C:\Windows\System32\wsl.exe -d ubuntu -e sudo /opt/wsl_init.sh
I define the name of the service, as Services will see it, and also the description of the service.
I put in MY username and My Windows Password here, otherwise I’m not running WSL in my user context, but another one.

And then I rebooted. SSH was running as I needed it.

Featured image is β€œMain console” by β€œSteve Parker” on Flickr and is released under a CC-BY license.

"2009.01.17 - UNKNOWN, Unknown" by "Adrian Clark" on Flickr

Creating tagged AWS EC2 resources (like Elastic IPs) with Ansible

This is a quick note, having stumbled over this one today.

Mostly these days, I’m used to using Terraform to create Elastic IP (EIP) items in AWS, and I can assign tags to them during creation. For various reasons in $Project I’m having to create my EIPs in Ansible.

To make this work, you can’t just create an EIP with tags (like you would in Terraform), instead what you need to do is to create the EIP and then tag it, like this:

  - name: Allocate a new elastic IP
    community.aws.ec2_eip:
      state: present
      in_vpc: true
      region: eu-west-1
    register: eip

  - name: Tag that resource
    amazon.aws.ec2_tag:
      region: eu-west-1
      resource: "{{ eip.allocation_id }}"
      state: present
      tags:
        Name: MyTag
    register: tag

Notice that we create a VPC associated EIP, and assign the allocation_id from the result of that module to the resource we want to tag.

How about if you’re trying to be a bit more complex?

Here I have a list of EIPs I want to create, and then I pass this into the ec2_eip module, like this:

- name: Create list of EIPs
  set_fact:
    region: eu-west-1
    eip_list:
    - demo-eip-1
    - demo-eip-2
    - demo-eip-3

  - name: Allocate new elastic IPs
    community.aws.ec2_eip:
      state: present
      in_vpc: true
      region: "{{ region }}"
    register: eip
    loop: "{{ eip_list | dict2items }}"
    loop_control:
      label: "{{ item.key }}"

  - name: Tag the EIPs
    amazon.aws.ec2_tag:
      region: "{{ item.invocation.module_args.region }}"
      resource: "{{ item.allocation_id }}"
      state: present
      tags:
        Name: "{{ item.item.key }}"
    register: tag
    loop: "{{ eip.results }}"
    loop_control:
      label: "{{ item.item.key }}"

So, in this instance we pass the list of EIP names we want to create as a list with the loop instruction. Now, at the point we create them, we don’t actually know what they’ll be called, but we’re naming them there because when we tag them, we get the “item” (from the loop) that was used to create the EIP. When we then tag the EIP, we can use some of the data that was returned from the ec2_eip module (region, EIP allocation ID and the name we used as the loop key). I’ve trimmed out the debug statements I created while writing this, but here’s what you get back from ec2_eip:

"eip": {
        "changed": true,
        "msg": "All items completed",
        "results": [
            {
                "allocation_id": "eipalloc-decafbaddeadbeef1",
                "ansible_loop_var": "item",
                "changed": true,
                "failed": false,
                "invocation": {
                    "module_args": {
                        "allow_reassociation": false,
                        "aws_access_key": null,
                        "aws_ca_bundle": null,
                        "aws_config": null,
                        "aws_secret_key": null,
                        "debug_botocore_endpoint_logs": false,
                        "device_id": null,
                        "ec2_url": null,
                        "in_vpc": true,
                        "private_ip_address": null,
                        "profile": null,
                        "public_ip": null,
                        "public_ipv4_pool": null,
                        "region": "eu-west-1",
                        "release_on_disassociation": false,
                        "reuse_existing_ip_allowed": false,
                        "security_token": null,
                        "state": "present",
                        "tag_name": null,
                        "tag_value": null,
                        "validate_certs": true,
                        "wait_timeout": null
                    }
                },
                "item": {
                    "key": "demo-eip-1",
                    "value": {}
                },
                "public_ip": "192.0.2.1"
            }
     ]
}

So, that’s what I’m doing next!

Featured image is β€œ2009.01.17 – UNKNOWN, Unknown” by β€œAdrian Clark” on Flickr and is released under a CC-BY-ND license.

"pharmacy" by "Tim Evanson" on Flickr

AWX – The Gateway Drug to Ansible Tower

A love letter to Ansible Tower

I love Ansible… I mean, I really love Ansible. You can ask anyone, and they’ll tell you my first love is my wife, then my children… and then it’s Ansible.

OK, maybe it’s Open Source and then Ansible, but either way, Ansible is REALLY high up there.

But, while I love Ansible, I love what Ansible Tower brings to an environment. See, while you get to easily and quickly manage a fleet of machines with Ansible, Ansible Tower gives you the fine grained control over what you need to expose to your developers, your ops team, or even, in a fit of “what-did-you-just-do”-ness, your manager. (I should probably mention that Ansible Tower is actually part of a much larger portfolio of products, called Ansible Automation Platform, and there’s some hosted SaaS stuff that goes with it… but the bit I really want to talk about is Tower, so I’ll be talking about Tower and not Ansible Automation Platform. Sorry!)

Ansible Tower has a scheduling engine, so you can have a “Go” button, for deploying the latest software to your fleet, or just for the 11PM patching cycle. It has a credential store, so your teams can’t just quickly go and perform an undocumented quick fix on that “flaky” box – they need to do their changes via Ansible. And lastly, it has an inventory, so you can see that the last 5 jobs failed to deploy on that host, so maybe you’ve got a problem with it.

One thing that people don’t so much love to do, is to get a license to deploy Tower, particularly if they just want to quickly spin up a demonstration for some colleagues to show how much THEY love Ansible. And for those people, I present AWX.

The first hit is free

One of the glorious and beautiful things that RedHat did, when they bought Ansible, was to make the same assertion about the Ansible products that they make to the rest of their product line, which is… while they may sell a commercial product, underneath it will be an Open Source version of that product, and you can be part of developing and improving that version, to help improve the commercial product. Thus was released AWX.

Now, I hear the nay-sayers commenting, “but what if you have an issue with AWX at 2AM, how do you get support on that”… and to those people, I reply: “If you need support at 2AM for your box, AWX is not the tool for you – what you need is Tower.”… Um, I mean Ansible Automation Platform. However, Tower takes a bit more setting up than what I’d want to do for a quick demo, and it has a few more pre-requisites. ANYWAY, enough about dealing with the nay-sayers.

AWX is an application inside Docker containers. It’s split into three parts, the AWX Web container, which has the REST API. There’s also a PostgreSQL database inside there too, and one “Engine”, which is the separate container which gets playbooks from your version control system, asks for any dynamic inventories, and then runs those playbooks on your inventories.

I like running demos of Tower, using AWX, because it’s reasonably easy to get stood up, and it’s reasonably close to what Tower looks and behaves like (except for the logos)… and, well, it’s a good gateway to getting people interested in what Tower can do for them, without them having to pay (or spend time signing up for evaluation licenses) for the environment in the first place.

And what’s more, it can all be automated

Yes, folks, because AWX is just a set of docker containers (and an install script), and Ansible knows how to start Docker containers (and run an install script), I can add an Ansible playbook to my cloud-init script, Vagrantfile or, let’s face it, when things go really wrong, put it in a bash script for some poor keyboard jockey to install for you.

If you’re running a demo, and you don’t want to get a POC (proof of concept) or evaluation license for Ansible Tower, then the chances are you’re probably not running this on RedHat Enterprise Linux (RHEL) either. That’s OK, once you’ve sold the room on using Tower (by using AWX), you can sell them on using RHEL too. So, I’ll be focusing on using CentOS 8 instead. Partially because there’s a Vagrant box for CentOS 8, but also because I can also use CentOS 8 on AWS, where I can prove that the Ansible Script I’m putting into my Vagrantfile will also deploy nicely via Cloud-Init too. With a very small number of changes, this is likely to work on anything that runs Docker, so everything from Arch to Ubuntu… probably 😁

“OK then. How can you work this magic, eh?” I hear from the back of the room. OK, pipe down, nay-sayers.

First, install Ansible on your host. You just need to run dnf install -y ansible.

Next, you need to install Docker. This is a marked difference between AWX and Ansible Tower, as AWX is based on Docker, but Ansible Tower uses other magic to make it work. When you’re selling the benefits of Tower, note that it’s not a 1-for-1 match at this point, but it’s not a big issue. Fortunately, CentOS can install Docker Community edition quite easily. At this point, I’m swapping to using Ansible playbooks. At the end, I’ll drop a link to where you can get all this in one big blob… In fact, we’re likely to use it with our Cloud-Init deployment.

Aw yehr, here’s the good stuff

tasks:
- name: Update all packages
  dnf:
    name: "*"
    state: latest

- name: Add dependency for "yum config-manager"
  dnf:
    name: yum-utils
    state: present

- name: Add the Docker Repo
  shell: yum config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
  args:
    creates: /etc/yum.repos.d/docker-ce.repo
    warn: false

- name: Install Docker
  dnf:
    name:
    - docker-ce
    - docker-ce-cli
    - containerd.io
    state: present
  notify: Start Docker

That first stanza – update all packages? Well, that’s because containerd.io relies on a newer version of libseccomp, which hasn’t been built in the CentOS 8 Vagrantbox I’m using.

The next one? That ensures I can run yum config-manager to add a repo. I could use the copy module in Ansible to create the repo files so yum and/or dnf could use that instead, but… meh, this is a single line shell command.

And then we install the repo, and the docker-ce packages we require. We use the “notify” statement to trigger a handler call to start Docker, like this:

handlers:
- name: Start Docker
  systemd:
    name: docker
    state: started

Fab. We’ve got Docker. Now, let’s clone the AWX repo to our machine. Again, we’re doing this with Ansible, naturally :)

tasks:
- name: Clone AWX repo to local path
  git:
    repo: https://github.com/ansible/awx.git
    dest: /opt/awx

- name: Get latest AWX tag
  shell: |
    if [ $(git status -s | wc -l) -gt 0 ]
    then
      git stash >/dev/null 2>&1
    fi
    git fetch --tags && git describe --tags $(git rev-list --tags --max-count=1)
    if [ $(git stash list | wc -l) -gt 0 ]
    then
      git stash pop >/dev/null 2>&1
    fi
  args:
    chdir: /opt/awx
  register: latest_tag
  changed_when: false

- name: Use latest released version of AWX
  git:
    repo: https://github.com/ansible/awx.git
    dest: /opt/awx
    version: "{{ latest_tag.stdout }}"

OK, there’s a fair bit to get from this, but essentially, we clone the repo from Github, then ask (using a collection of git commands) for the latest released version (yes, I’ve been bitten by just using the head of “devel” before), and then we check out that released version.

Fab, now we can configure it.

tasks:
- name: Set or Read admin password
  set_fact:
    admin_password_was_generated: "{{ (admin_password is defined or lookup('env', 'admin_password') != '') | ternary(false, true) }}"
    admin_password: "{{ admin_password | default (lookup('env', 'admin_password') | default(lookup('password', 'pw.admin_password chars=ascii_letters,digits length=20'), true) ) }}"

- name: Configure AWX installer
  lineinfile:
    path: /opt/awx/installer/inventory
    regexp: "^#?{{ item.key }}="
    line: "{{ item.key }}={{ item.value }}"
  loop:
  - key: "awx_web_hostname"
    value: "{{ ansible_fqdn }}"
  - key: "pg_password"
    value: "{{ lookup('password', 'pw.pg_password chars=ascii_letters,digits length=20') }}"
  - key: "rabbitmq_password"
    value: "{{ lookup('password', 'pw.rabbitmq_password chars=ascii_letters,digits length=20') }}"
  - key: "rabbitmq_erlang_cookie"
    value: "{{ lookup('password', 'pw.rabbitmq_erlang_cookie chars=ascii_letters,digits length=20') }}"
  - key: "admin_password"
    value: "{{ admin_password }}"
  - key: "secret_key"
    value: "{{ lookup('password', 'pw.secret_key chars=ascii_letters,digits length=64') }}"
  - key: "create_preload_data"
    value: "False"
  loop_control:
    label: "{{ item.key }}"

If we don’t already have a password defined, then create one. We register the fact we’ve had to create one, as we’ll need to tell ourselves it once the build is finished.

After that, we set a collection of values into the installer – the hostname, passwords, secret keys and so on. It loops over a key/value pair, and passes these to a regular expression rewrite command, so at the end, we have the settings we want, without having to change this script between releases.

When this is all done, we execute the installer. I’ve seen this done two ways. In an ideal world, you’d throw this into an Ansible shell module, and get it to execute the install, but the problem with that is that the AWX install takes quite a while, so I’d much rather actually be able to see what’s going on… and so, instead, we exit our prepare script at this point, and drop back to the shell to run the installer. Let’s look at both options, and you can decide which one you want to do. In my script, I’m doing the first, but just because it’s a bit neater to have everything in one place.

- name: Run the AWX install.
  shell: ansible-playbook -i inventory install.yml
  args:
    chdir: /opt/awx/installer
cd /opt/awx/installer
ansible-playbook -i inventory install.yml

When this is done, you get a prepared environment, ready to access using the username admin and the password of … well, whatever you set admin_password to.

AWX takes a little while to stand up, so you might want to run this next Ansible stanza to see when it’s ready to go.

- name: Test access to AWX
  tower_user:
    tower_host: "http://{{ ansible_fqdn }}"
    tower_username: admin
    tower_password: "{{ admin_password }}"
    email: "admin@{{ ansible_fqdn }}"
    first_name: "admin"
    last_name: ""
    password: "{{ admin_password }}"
    username: admin
    superuser: yes
    auditor: no
  register: _result
  until: _result.failed == false
  retries: 240 # retry 240 times
  delay: 5 # pause for 5 sec between each try

The upshot to using that command there is that it sets the email address of the admin account to “admin@your.awx.example.org“, if the fully qualified domain name (FQDN) of your machine is your.awx.example.org.

Moving from the Theoretical to the Practical

Now we’ve got our playbook, let’s wrap this up in both a Vagrant Vagrantfile and a Terraform script, this means you can deploy it locally, to test something internally, and in “the cloud”.

To simplify things, and because the version of Ansible deployed on the Vagrant box isn’t the one I want to use, I am using a single “user-data.sh” script for both Vagrant and Terraform. Here that is:

#!/bin/bash
if [ -e "$(which yum)" ]
then
  yum install git python3-pip -y
  pip3 install ansible docker docker-compose
else
  echo "This script only supports CentOS right now."
  exit 1
fi

git clone https://gist.github.com/JonTheNiceGuy/024d72f970d6a1c6160a6e9c3e642e07 /tmp/Install_AWX
cd /tmp/Install_AWX
/usr/local/bin/ansible-playbook Install_AWX.yml

While they both have their differences, they both can execute a script once the machine has finished booting. Let’s start with Vagrant.

Vagrant.configure("2") do |config|
  config.vm.box = "centos/8"

  config.vm.provider :virtualbox do |v|
    v.memory = 4096
  end

  config.vm.provision "shell", path: "user-data.sh"

  config.vm.network "forwarded_port", guest: 80, host: 8080, auto_correct: true
end

To boot this up, once you’ve got Vagrant and Virtualbox installed, run vagrant up and it’ll tell you that it’s set up a port forward from the HTTP port (TCP/80) to a “high” port – TCP/8080. If there’s a collision (because you’re running something else on TCP/8080), it’ll tell you what port it’s forwarded the HTTP port to instead. Once you’ve finished, run vagrant destroy to shut it down. There are lots more tricks you can play with Vagrant, but this is a relatively quick and easy one. Be aware that you’re not using HTTPS, so traffic to the AWX instance can be inspected, but if you’re running this on your local machine, it’s probably not a big issue.

How about running this on a cloud provider, like AWS? We can use the exact same scripts – both the Ansible script, and the user-data.sh script, using Terraform, however, this is a little more complex, as we need to create a VPC, Internet Gateway, Subnet, Security Group and Elastic IP before we can create the virtual machine. What’s more, the Free Tier (that “first hit is free” thing that Amazon Web Services provide to you) does not have enough horsepower to run AWX, so, if you want to look at how to run up AWX in EC2 (or to tweak it to run on Azure, GCP, Digital Ocean or one of the fine offerings from IBM or RedHat), then click through to the gist I’ve put all my code from this post into. The critical lines in there are to select a “CentOS 8” image, open HTTP and SSH into the machine, and to specify the user-data.sh file to provision the machine. Everything else is cruft to make the virtual machine talk to, and be seen by, hosts on the Internet.

To run this one, you need to run terraform init to load the AWS plugin, then terraform apply. Note that this relies on having an AWS access token defined, so if you don’t have them set up, you’ll need to get that sorted out first. Once you’ve finished with your demo, you should run terraform destroy to remove all the assets created by this terraform script. Again, when you’re running that demo, note that you ONLY have HTTP access set up, not HTTPS, so don’t use important credentials on there!

Once you’ve got your AWX environment running, you’ve got just enough AWX there to demo what Ansible Tower looks like, what it can bring to your organisation… and maybe even convince them that it’s worth investing in a license, rather than running AWX in production. Just in case you have that 2AM call-out that we all dread.

Featured image is β€œpharmacy” by β€œTim Evanson” on Flickr and is released under a CC-BY-SA license.

"Status" by "Doug Letterman" on Flickr

Adding your Git Status to your Bash prompt

I was watching Lorna Mitchell‘s Open Source Hour twitch stream this morning, and noticed that she had a line in her prompt showing what her git status was.

A snip from Lorna’s screen during the Open Source Hour stream.

Git, for those of you who aren’t aware, is the version control software which has dominated software development and documentation for over 10 years now. It’s used for almost everything now, supplanting it’s competitors like Subversion, Visual Source Safe, Mercurial and Bazaar. While many people are only aware of Git using GitHub, before there was GitHub, there was the Git command line. I’m using the git command in a Bash shell all the time because I find it easier to use that for the sorts of things I do, than it is to use the GUI tools.

However, the thing that often stumbles me is what state I’m in with the project, and this line showed me just how potentially powerful this command can be.

During the video, I started researching how I could get this prompt set up on my machine, and finally realised that actually, git prompt was installed as part of the git package on my Ubuntu 20.04 install. To use it, I just had to add this string $(__git_ps1) into my prompt. This showed me which branch I was on, but I wanted more detail than that!

So, then I started looking into how to configure this prompt. I found this article from 2014, called “Git Prompt Variables” which showed me how to configure which features I wanted to enable:

GIT_PS1_DESCRIBE_STYLE='contains'
GIT_PS1_SHOWCOLORHINTS='y'
GIT_PS1_SHOWDIRTYSTATE='y'
GIT_PS1_SHOWSTASHSTATE='y'
GIT_PS1_SHOWUNTRACKEDFILES='y'
GIT_PS1_SHOWUPSTREAM='auto'

To turn this on, I edited ~/.bashrc (again, this is Ubuntu 20.04, I’ve not tested this on CentOS, Fedora, Slackware or any other distro). Here’s the lines I’m looking for:

The lines in the middle, between the two red lines are the lines in question – the lines above and below are for context in the standard .bashrc file shipped with Ubuntu 20.04

I edited each line starting PS1=, to add this: $(__git_ps1), so this now looks like this:

The content of those two highlighted lines in .bashrc

I’m aware that line is pretty hard to read in many cases, so here’s just the text for each PS1 line:

PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w$(__git_ps1)\[\033[00m\]\$ '
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w$(__git_ps1)\$ '

The first of those is the version that is triggered when if [ "$color_prompt" = yes ] is true, the second is when it isn’t.

What does this look like?

Let’s run through a “standard” work-flow of “conditions”. Yes, this is a really trivial example, and quite (I would imagine) different from how most people approach things like this… but it gives you a series of conditions so you can see what it looks like.

Note, as I’m still running a slightly older version of git, and I’ve not adjusted my defaults, the “initial” branch created is still called “master”, not “main”. For the purposes of this demonstration, it’s fine, but I should really have fixed this from the outset. My apologies.

First, we create and git init a directory, called git_test in /tmp.

Following a git init, the prompt ends (master #). Following the git init of the master branch, we are in a state where there is “No HEAD to compare against”, so the git prompt fragment ends #.

Next, we create a file in here. It’s unstaged.

Following the creation of an empty file, using the touch command, the prompt ends (master #%). We’re on the master branch, with no HEAD to compare against (#), and we have an untracked file (%).

And then we add that to the staging area.

Following a git add, the prompt ends (master +). We’re on the master branch, with a staged change (+).

Next we commit the file to the repository.

Following a git commit, the prompt ends (master). We’re on the master branch with a clean staging and unstaged area.

We add some content to the README file.

Following the change of a tracked file, by echoing content into the file, the prompt ends (master *). We’re on the master branch with an unstaged change (*).

We realise that we can’t use this change right now, let’s stash it for later.

Following the git stash of a tracked file, the prompt ends (master $). We’re on the master branch with stashed files ($).

We check out a new branch, so we can use that stash in there.

Following the creation of a new branch with git checkout -b, the prompt ends (My-New-Feature $). We’re on the My-New-Feature branch with stashed files ($).

And then pop the stashed file out.

Following the restoration of a stashed file, using git stash pop, the prompt ends (My-New-Feature *). We’re on the My-New-Feature branch with stashed files (*).

We then add the file and commit it.

Following a git add and git commit of the previously stashed file, the prompt ends (My-New-Feature). We’re on the My-New-Feature branch with a clean staged and unstaged area.

How about working with remote sources? Let’s change to back to the /tmp directory and fork git_test to git_local_fork.

Following the clone of the repository in a new path using git clone, and then changing into that directory with the cd command, the prompt ends (My-New-Feature=). We’re on the My-New-Feature branch, which is in an identical state to it’s default remote tracked branch (=).

We’ve checked it out at the feature branch instead of master, let’s check out master instead.

Subsequent to checking out the master branch in the repository using git checkout master, the prompt ends (master=). We’re on the master branch, which is in an identical state to it’s default remote tracked branch (=).

In the meantime, upstream, someone merged “My-New-Feature” into “master” on our original git_test repo.

Following the merge of a feature branch, using git merge, the prompt ends (master). We’re on the master branch with a clean staged and unstaged area.

On our local branch again, let’s fetch the state of our “upstream” into git_local_fork.

After we fetch the state of our default upstream repository, the prompt ends (master<). We’re on the master branch with a clean staged and unstaged area, but we’re behind the default remote tracked branch (<).

And then pull, to make sure we’re in-line with upstream’s master branch.

Once we perform a git pull to bring this branch up-to-date with the upstream repository, the prompt ends (master=). We’re on the master branch, which is back in an identical state to it’s default remote tracked branch (=).

We should probably make some local changes to this repository.

The prompt changes from (master=) to (master *=) to (master +=) and then (master>) as we create an unstaged change (*), stage it (+) and then bring the branch ahead of the default remote tracked branch (>).

Meanwhile, someone made some changes to the upstream repository.

The prompt changes from (master) to (master *) to (master +) and then (master) as we create an unstaged change (*) with echo, stage it (+) with git add and then end up with a clear staged and unstaged area following a git commit.

So, before we try and push, let’s quickly fetch their tree to see what’s going on.

After a git fetch to pull the latest state from the remote repository, the prompt ends (master <>). We’re on the master branch, but our branch has diverged from the default remote and won’t merge cleanly (<>).

Oh no, we’ve got a divergence. We need to fix this! Let’s pull the upstream master branch.

We do git pull and end up with the prompt ending (master *+|MERGING<>). We have unstaged (*) and staged (+) changes, and we’re in a “merging” state (MERGING) to try to resolve our diverged branches (<>).

Let’s fix the failed merge.

We resolve the merge conflict with nano, and confirm it has worked with cat, and then stage the merge resolution change using git add. The prompt ends (master +|MERGING<>). We have staged (+) changes, and we’re in a “merging” state (MERGING) to try to resolve our diverged branches (<>).

I think we’re ready to go with the merge.

We perform a git commit and the prompt ends as (master>). We have resolved our diverged master branches, have exited the “merging” state and are simply ahead of the default remote branch (>).

If the remote were a system like github, at this point we’d just do a git push. But… it’s not, so we’d need to do a git pull /tmp/git_local_fork in /tmp/git_test and then a git fetch in /tmp/git_local_fork… but that’s an implementation detail πŸ˜‰

Featured image is β€œStatus” by β€œDoug Letterman” on Flickr and is released under a CC-BY license.

"#security #lockpick" by "John Jones" on Flickr

Auto-starting an SSH Agent in Windows Subsystem for Linux

I tend to use Windows Subsystem for Linux (WSL) as a comprehensive SSH client, mostly for running things like Ansible scripts and Terraform. One of the issues I’ve had with it though is that, on a Linux GUI based system, I would start my SSH Agent on login, and then the first time I used an SSH key, I would unlock the key using the agent, and it would be cached for the duration of my logged in session.

While I was looking for something last night, I came across this solution on Stack Overflow (which in turn links to this blog post, which in turn links to this mailing list post) that suggests adding the following stanza to ~/.profile in WSL. I’m running the WSL version of Ubuntu 20.04, but the same principles apply on Cygwin, or, probably, any headless-server installation of a Linux distribution, if that’s your thing.

SSH_ENV="$HOME/.ssh/agent-environment"
function start_agent {
    echo "Initialising new SSH agent..."
    /usr/bin/ssh-agent | sed 's/^echo/#echo/' > "${SSH_ENV}"
    echo succeeded
    chmod 600 "${SSH_ENV}"
    . "${SSH_ENV}" > /dev/null
}
# Source SSH settings, if applicable
if [ -f "${SSH_ENV}" ]; then
    . "${SSH_ENV}" > /dev/null
    ps -ef | grep ${SSH_AGENT_PID} | grep ssh-agent$ > /dev/null || {
        start_agent;
    }
else
    start_agent;
fi

Now, this part is all well-and-good, but what about that part where you want to SSH using a key, and then that being unlocked for the duration of your SSH Agent being available?

To get around that, in the same solution page, there is a suggestion of adding this line to your .ssh/config: AddKeysToAgent yes. I’ve previously suggested using dynamically included SSH configuration files, so in this case, I’d look for your file which contains your “wildcard” stanza (if you have one), and add the line there. This is what mine looks like:

Host *
  AddKeysToAgent yes
  IdentityFile ~/.ssh/MyCurrentKey

How does this help you? Well, if you’re using jump hosts (using ProxyJump MyBastionHost, for example) you’ll only be prompted for your SSH Key once, or if you typically do a lot of SSH sessions, you’ll only need to unlock your session once.

BUT, and I can’t really stress this enough, don’t use this on a shared or suspected compromised system! If you’ve got a root account which can access the content of your Agent’s Socket and PID, then any protections that private key may have held for your system is compromised.

Featured image is β€œ#security #lockpick” by β€œJohn Jones” on Flickr and is released under a CC-BY-ND license.