"Tower" by " Yijun Chen" on Flickr

Building a Gitlab and Ansible Tower (AWX) Demo in Vagrant with Ansible

TL;DR – I created a repository on GitHub‌ containing a Vagrantfile and an Ansible Playbook to build a VM running Docker. That VM hosts AWX (Ansible Tower’s upstream open-source project) and Gitlab.

A couple of years ago, a colleague created (and I enhanced) a Vagrant and Ansible playbook called “Project X” which would run an AWX instance in a Virtual Machine. It’s a bit heavy, and did a lot of things to do with persistence that I really didn’t need, so I parked my changes and kept an eye on his playbook…

Fast-forward to a week-or-so ago. I needed to explain what a Git/Ansible Workflow would look like, and so I went back to look at ProjectX. Oh my, it looks very complex and consumed a lot of roles that, historically, I’ve not been that impressed with… I just needed the basics to run AWX. Oh, and I also needed a Gitlab environment.

I knew that Gitlab had a docker-based install, and so does AWX, so I trundled off to find some install guides. These are listed in the playbook I eventually created (hence not listing them here). Not all the choices I made were inspired by those guides – I wanted to make quite a bit of this stuff “build itself”… this meant I wanted users, groups and projects to be created in Gitlab, and users, projects, organisations, inventories and credentials to be created in AWX.

I knew that you can create Docker Containers in Ansible, so after I’d got my pre-requisites built (full upgrade, docker installed, pip libraries installed), I add the gitlab-ce:latest docker image, and expose some ports. Even now, I’m not getting the SSH port mapped that I was expecting, but … it’s no disaster.

I did notice that the Gitlab service takes ages to start once the container is marked as running, so I did some more digging, and found that the uri module can be used to poll a URL. It wasn’t documented well how you can make it keep polling until you get the response you want, so … I added a PR on the Ansible project’s github repo for that one (and I also wrote a blog post about that earlier too).

Once I had a working Gitlab service, I needed to customize it. There are a bunch of Gitlab modules in Ansible but since a few releases back of Gitlab, these don’t work any more, so I had to find a different way. That different way was to run an internal command called “gitlab-rails”. It’s not perfect (so it doesn’t create repos in your projects) but it’s pretty good at giving you just enough to build your demo environment. So that’s getting Gitlab up…

Now I need to build AWX. There’s lots of build guides for this, but actually I had most luck using the README in their repository (I know, who’d have thought it!??!) There are some “Secrets” that should be changed in production that I’m changing in my script, but on the whole, it’s pretty much a vanilla install.

Unlike the Gitlab modules, the Ansible Tower modules all work, so I use these to create the users, credentials and so-on. Like the gitlab-rails commands, however, the documentation for using the tower modules is pretty ropey, and I still don’t have things like “getting your users to have access to your organisation” working from the get-go, but for the bulk of the administration, it does “just work”.

Like all my playbooks, I use group_vars to define the stuff I don’t want to keep repeating. In this demo, I’ve set all the passwords to “Passw0rd”, and I’ve created 3 users in both AWX and Gitlab – csa, ops and release – indicative of the sorts of people this demo I ran was aimed at – Architects, Operations and Release Managers.

Maybe, one day, I’ll even be able to release the presentation that went with the demo ;)

On a more productive note, if you’re doing things with the tower_ modules and want to tell me what I need to fix up, or if you’re doing awesome things with the gitlab-rails tool, please visit the repo with this automation code in, and take a look at some of my “todo” items! Thanks!!

Featured image is “Tower” by “Yijun Chen” on Flickr and is released under a CC-BY-SA license.

"funfair action" by "Jon Bunting" on Flickr

Improving the speed of Azure deployments in Ansible with Async

Recently I was building a few environments in Azure using Ansible, and found this stanza which helped me to speed things up.

  - name: "Schedule UDR Creation"
    azure_rm_routetable:
      resource_group: "{{ resource_group }}"
      name: "{{ item.key }}_udr"
    loop: "{{ routetables | dict2items }}"
    loop_control:
        label: "{{ item.key }}_udr"
    async: 1000
    poll: 0
    changed_when: False
    register: sleeper

  - name: "Check UDRs Created"
    async_status:
      jid: "{{ item.ansible_job_id }}"
    register: sleeper_status
    until: sleeper_status.finished
    retries: 500
    delay: 4
    loop: "{{ sleeper.results|flatten(levels=1) }}"
    when: item.ansible_job_id is defined
    loop_control:
      label: "{{ item._ansible_item_label }}"

What we do here is to start an action with an “async” time (to give the Schedule an opportunity to register itself) and a “poll” time of 0 (to prevent the Schedule from waiting to be finished). We then tell it that it’s “never changed” (changed_when: False) because otherwise it always shows as changed, and to register the scheduled item itself as a “sleeper”.

After all the async jobs get queued, we then check the status of all the scheduled items with the async_status module, passing it the registered job ID. This lets me spin up a lot more items in parallel, and then “just” confirm afterwards that they’ve been run properly.

It’s not perfect, and it can make for rather messy code. But, it does work, and it’s well worth giving it the once over, particularly if you’ve got some slow-to-run tasks in your playbook!

Featured image is “funfair action” by “Jon Bunting” on Flickr and is released under a CC-BY license.

A web browser with the example.com web page loaded

Working around the fact that Ansible’s URI module doesn’t honour the no_proxy variable…

An Ansible project I’ve been working on has tripped me up this week. I’m working with some HTTP APIs and I need to check early whether I can reach the host. To do this, I used a simple Ansible Core Module which lets you call an HTTP URI.

- uri:
    follow_redirects: none
    validate_certs: False
    timeout: 5
    url: "http{% if ansible_https | default(True) %}s{% endif %}://{{ ansible_host }}/login"
  register: uri_data
  failed_when: False
  changed_when: False

This all seems pretty simple. One of the environments I’m working in uses the following values in their environment:

http_proxy="http://192.0.2.1:8080"
https_proxy="http://192.0.2.1:8080"
no_proxy="10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24"

And this breaks the uri module, because it tries to punt everything through the proxy if the “no_proxy” contains CIDR values (like 192.0.2.0/24) (there’s a bug raised for this)… So here’s my fix!

- set_fact:
    no_proxy_match: |
      {
        {% for no_proxy in (lookup('env', 'no_proxy') | replace(',', '') ).split() %}
          {% if no_proxy| ipaddr | type_debug != 'NoneType' %}
            {% if ansible_host | ipaddr(no_proxy) | type_debug != 'NoneType' %}
              "match": "True"
            {% endif %}
          {% endif %}
        {% endfor %}
      }

- uri:
    follow_redirects: none
    validate_certs: False
    timeout: 5
    url: "http{% if ansible_https | default(True) %}s{% endif %}://{{ ansible_host }}/login"
  register: uri_data
  failed_when: False
  changed_when: False
  environment: "{ {% if no_proxy_match.match | default(False) %}'no_proxy': '{{ ansible_host }}'{% endif %} }"

So, let’s break this down.

The key part to this script is that we need to override the no_proxy environment variable with the IP address that we’re trying to address (so that we’re not putting 16M addresses for 10.0.0.0/8 into no_proxy, for example). To do that, we use the exact same URI block, except for the environment line at the end.

In turn, the set_fact block steps through the no_proxy values, looking for IP Addresses to check ({% if no_proxy | ipaddr ... %}‌ says “if the no_proxy value is an IP Address, return it, but if it isn’t, return a ‘None’ value”) and if it’s an IP address or subnet mask, it checks to see whether the IP address of the host you’re trying to reach falls inside that IP Address or Subnet Mask ({% if ansible_host | ipaddr(no_proxy) ... %} says “if the ansible_host address falls inside the no_proxy range, then return it, otherwise return a ‘None’ value”). Both of these checks say “If this previous check returns anything other than a ‘None’ value, do the next thing”, and on the last check, the “next” thing is to set the flag ‘match’ to ‘true’. When we get to the environment variable, we say “if match is not true, it’s false, so don’t put a value in there”.

So that’s that! Yes, I could merge the set_fact block into the environment variable, but I do end up using that a fair amount. And really, if it was merged, that would be even MORE complicated to pick through.

I have raised a pull request on the Ansible project to update the documentation, so we’ll see whether we end up with people over here looking for ways around this issue. If so, let me know in the comments below! Thanks!!

"Matrix" by "Paul Downey" on Flickr

Idempotent Dynamic Content in Ansible

One of my colleagues recently sent out a link to a post about generating idempotent random numbers for Ansible. As I was reading it, I realised that there are other ways of doing the same thing (but not quite as pretty).

See, one of the things I (mis-)use Ansible for is to build Azure, AWS and OpenStack environments (instead of, perhaps, using Terraform, Cloud Formations or Heat Stacks). As a result, I frequently want to set complex passwords that are unique to *that environment* but that aren’t new for each build. My way of doing this is to run a delegated task to generate files in host_vars. Here’s a version of the playbook I use for that!

In the same gist as that block has been sourced from I have some example output from “20 hosts” – one of which has a pre-defined password in the inventory, and the rest of which are generated by the script.

I hope this is useful to someone!

Late Edit – 2019-05-19: Encrypting the values you generate

Following this post, a friend of mine – Jeremy mentioned on Linked In that I should have a look at Ansible Vault. Well, *ideally*, yes, however, when I looked at this code, I couldn’t work out a way of forcing the session to run Vault against a value I’ve just created, short of running something a raw or a shell module like “ansible-vault encrypt {{ file_containing_password }}“. Realistically, if you’re doing a lot with these passwords, you should probably use an external password vault, such as HashiCorp’s Vault or PasswordStore.org’s Pass. Neither of which I tend to use, because it’s just not part of my life yet – but I’ve heard good things about both!

Featured image is “Matrix” by “Paul Downey” on Flickr and is released under a CC-BY license.

"LEGO Factory Playset" from Brickset on Flickr

Building Azure Environments in Ansible

Recently, I’ve been migrating my POV (proof of value) and POC (proof of concept) environment from K5 to Azure to be able to test vendor products inside Azure. I ran a few tests to build the environment using the native tools (the powershell scripts) and found that the Powershell way of delivering Azure environments seems overly complicated… particularly as I’m comfortable with how Ansible works.

To be fair, I also need to look at Terraform, but that isn’t what I’m looking at today :)

So, let’s start with the scaffolding. Any Ansible Playbook which deals with creating virtual machines needs to have some extra modules installed. Make sure you’ve got ansible 2.7 or later and the python azure library 2.0.0 or later (you can get both with pip for python).

Next, let’s look at the group_vars for this playbook.

This file has several pieces. We define the project settings (anything prefixed project_ is a project setting), including the prefix used for all resources we create (in this case “env01“), and a standard password used for all VMs we create (in this case “My$uper$ecret$Passw0rd“).

Next we define the standard images to load from the Marketplace. You can extend this with other images, these are just the “easiest” ones that I’m most familiar with (your mileage may vary). Next up is the networks to build inside the VNet, and lastly we define the actual machines we want to build. If you’ve got questions about any of the values we define here, just let me know in the comments below :)

Next, we’ll start looking at the playbook (this has been exploded out – the full playbook is also in the gist).

Here we start by pulling in the variables we might want to override, and we do this by reading system environment variables (ANSIBLE_PREFIX and BREAKGLASS) and using them if they’re set. If they’re not, use the project defaults, and if that hasn’t been set, use some pre-defined values… and then tell us what they are when we’re running the tasks (those are the debug: lines).

This block is where we create our “Static Assets” – individual items that we will be consuming later. This shows a clear win here over the Powershell methods endorsed by Microsoft – here you can create a Resource Group (RG) as part of the playbook! We also create a single Storage Account for this RG and a single VNET too.

These creation rules are not suitable for production use, as this defines an “Any-Any” Security group! You should tailor your security groups for your need, not for blanket access in!

This is where things start to get a bit more interesting – We’re using the “async/async_status” pattern here (and the rest of these sections) to start creating the resources in parallel. As far as I can tell, sometimes you’ll get a case where the async doesn’t quite get set up fast enough, then the async_status can’t track the resources properly, but re-running the playbook should be enough to sort that out, without slowing things down too much.

But what are we actually doing with this block of code? A UDR is a “User Defined Route” or routing table for Azure. Effectively, you treat each network interface as being plumbed directly to the router (none of this “same subnet broadcast” stuff works here!) so you can do routing at the router for all the networks.

By default there are some existing network routes (stuff to the internet flows to the internet, RFC1918 addresses are dropped with the exception of any RFC1918 addresses you have covered in your VNETs, and each of your subnets can reach each other “directly”). Adding a UDR overrides this routing table. The UDRs we’re creating here are applied at a subnet level, but currently don’t override any of the existing routes (they’re blank). We’ll start putting routes in after we’ve added the UDRs to the subnets. Talking of which….

Again, this block is not really suitable for production use, and assumes the VNET supernet of /8 will be broken down into several /24’s. In the “real world” you might deliver a handful of /26’s in a /24 VNET… or you might even have lots of disparate /24’s in the VNET which are then allocated exactly as individual /24 subnets… this is not what this model delivers but you might wish to investigate further!

Now that we’ve created our subnets, we can start adding the routing table to the UDR. This is a basic one – add a 0.0.0.0/0 route (internet access) from the “protected” network via the firewall. You can get a lot more specific than this – most people are likely to want to add the VNET range (in this case 10.0.0.0/8) via the firewall as well, except for this subnet (because otherwise, for example, 10.0.0.100 trying to reach 10.0.0.101 will go via the firewall too).

Without going too much into the intricacies of network architecture, if you are routing your traffic between subnets to the firewall, it’s probably better to get an appliance with more interfaces, so you can route traffic across the appliance, rather than going across a single interface as this will halve your traffic bandwidth (it’s currently capped 1Gb/s – so 500Mb/s).

Having mentioned “The Internet” – let’s give our firewall a public IP address, and create the rest of the interfaces as well.

This script creates a public IP address by default for each interface unless you explicitly tell it not to (see lines 40, 53 and 62 in the group_vars file I rendered above). You could easily turn this around by changing the lines which contain this:

item.1.public is not defined or (item.1.public is defined and item.1.public == 'true')

into lines which contain this:

item.1.public is defined and item.1.public == 'true'

OK, having done all that, we’re now ready to build our virtual machines. I’ve introduced a “Priority system” here – VMs with priority 0 go first, then 1, and 2 go last. The code snippet below is just for priority 0, but you can easily see how you’d extrapolate that out (and in fact, the full code sample does just that).

There are a few blocks here to draw attention to :) I’ve re-jigged them a bit here so it’s clearer to understand, but when you see them in the main playbook they’re a bit more compact. Let’s start with looking at the Network Interfaces section!

network_interfaces: |
  [
    {%- for nw in item.value.ports -%}
      '{{ prefix }}{{ item.value.name }}port{{ nw.subnet.name }}'
      {%- if not loop.last -%}, {%- endif -%} 
    {%- endfor -%}
  ]

In this part, we loop over the ports defined for the virtual machine. This is because one device may have 1 interface, or four interfaces. YAML is parsed to make a JSON variable, so here we can create a JSON variable, that when the YAML is parsed it will just drop in. We’ve previously created all the interfaces to have names like this PREFIXhostnamePORTsubnetname (or aFW01portWAN in more conventional terms), so here we construct a JSON array, like this: ['aFW01portWAN'] but that could just as easily have been ['aFW01portWAN', 'aFW01portProtect', 'aFW01portMGMT', 'aFW01portSync']. This will then attach those interfaces to the virtual machine.

Next up, custom_data. This section is sometimes known externally as userdata or config_disk. My code has always referred to it as a “Provision Script” – hence the variable name in the code below!

custom_data: |
  {%- if item.value.provision_script is defined and item.value.provision_script != '' -%}
    {%- include(item.value.provision_script) -%}
  {%- elif item.value.image.provision_script is defined and item.value.image.provision_script != '' -%}
    {%- include(item.value.image.provision_script) -%}
  {%- else -%}
    {{ omit }}
  {%- endif -%}

Let’s pick this one apart too. If we’ve defined a provisioning script file for the VM, include it, if we’ve defined a provisioning script file for the image (or marketplace entry), then include that instead… otherwise, pretend that there’s no “custom_data” field before you submit this to Azure.

One last quirk to Azure, is that some images require a “plan” to go with it, and others don’t.

plan: |
  {%- if item.value.image.plan is not defined -%}{{ omit }}{%- else -%}
    {'name': '{{ item.value.image.sku }}',
     'publisher': '{{ item.value.image.publisher }}',
     'product': '{{ item.value.image.offer }}'
    }
  {%- endif -%}

So, here we say “if we’ve not got a plan, omit the value being passed to Azure, otherwise use these fields we previously specified. Weird huh?

The very last thing we do in the script is to re-render the standard password we’ve used for all these builds, so that we can check them out!

Want to review this all in one place?

Here’s the link to the full playbook, as well as the group variables (which should be in ./group_vars/all.yml) and two sample userdata files (which should be in ./userdata) for an Ubuntu machine (using cloud-init) and one for a FortiGate Firewall.

All the other files in that gist (prefixes from 10-16 and 00) are for this blog post only, and aren’t likely to work!

If you do end up using this, please drop me a note below, or star the gist! That’d be awesome!!

Image credit: “Lego Factory Playset” from Flickr by “Brickset” released under a CC-BY license. Used with Thanks!

Picture of a bull from an 1882 reference guide, as found on Flickr

Abattoir Architecture for Evergreen – using the Pets versus Cattle parable

Work very generously sent me on a training course today about a cloud based technology we’re considering deploying.

During the course, the organiser threw a question to the audience about “who can explain what a container does?” and a small number of us ended up talking about Docker (primarily for Linux) and CGroups, and this then turned into a conversation about the exceedingly high rate of changes deployed by Amazon, Etsy and others who have completely embraced microservices and efficient CI/CD pipelines… and then I mentioned the parable of Pets versus Cattle.

The link above points to where the story comes from, but the short version is…

When you get a pet, it comes as a something like a puppy or kitten, you name it, you nurture it, bring it up to live in your household, and when it gets sick, because you’ve made it part of your family, you bring in a vet who nurses it back to health.

When you have cattle, it doesn’t have a name, it has a number, and if it gets sick, you isolate it from the herd and if it doesn’t get better, you take it out back and shoot it, and get a new one.

A large number of the audience either hadn’t heard of the parable, or if they had, hadn’t heard it delivered like this.

We later went on to discuss how this applies in a practical sense, not just in docker or kubernetes containers, but how it could be applied to Infrastructure as a Service (IaaS), even down to things like vendor supplied virtual firewalls where you have Infrastructure as Code (IaC).

If, in your environment, you have some service you treat like cattle – perhaps a cluster of firewalls behind a load balancer or a floating IP address and you need to upgrade it (because it isn’t well, or it’s not had the latest set of policy deployed to it). You don’t amend the policy on the boxes in question… No! You stand up a new service using your IaC with the latest policy deployed upon it, and then you would test it (to make sure it’s stood up right), and then once you’re happy it’s ready, you transition your service to the new nodes. Once the connections have drained from your old nodes, you take them out and shoot them.

Or, if you want this in pictures…

Stage 1 - Before the new service deployment
Stage 1 – Before the new service deployment
Stage 2 - The new service deployment is built and tested
Stage 2 – The new service deployment is built and tested
Stage 3 - Service transitions to the new service deployment
Stage 3 – Service transitions to the new service deployment
Stage 4 - The old service is demised, and any resources (including licenses) return to the pool
Stage 4 – The old service is demised, and any resources (including licenses) return to the pool

I was advised (by a very enthusiastic Mike until he realised that I intended to follow through with it) that the name for this should be as per the title. So, the next time someone asks me to explain how they could deploy, I’ll suggest they look for the Abattoir in my blog, because, you know, that’s normal, right? :)

Image credit: Image from page 255 of “Breeder and sportsman” (1882) via Internet Archive Book Image on Flickr

Troubleshooting FortiGate API issues with the CLI?

One of my colleagues has asked me for some help with an Ansible script he’s writing to push some policy to a cloud hosted FortiGate appliance. Unfortunately, he kept getting some very weird error messages, like this one:

fatal: [localhost]: FAILED! => {"changed": false, "meta": {"build": 200, "error": -651, "http_method": "PUT", "http_status": 500, "mkey": "vip8080", "name": "vip", "path": "firewall", "revision": "36.0.0.10745196634707694665.1544442857", "serial": "CENSORED", "status": "error", "vdom": "root", "version": "v6.0.3"}, "msg": "Error in repo"}

This is using Fortinet’s own Ansible Modules, which, in turn use the fortiosapi python module.

This same colleague came across a post on the Fortinet Developer Network site (access to the site requires vendor approval), which said “this might be an internal bug, but to debug it, use the following”

fgt # diagnose debug enable

fgt # diagnose debug cli 8
Debug messages will be on for 30 minutes.

And then run your API commands. Your error message will be surfaced there… so here’s mine! (Mapped port doesn’t match extport in a vip).

0: config firewall vip
0: edit "vip8080"
0: unset src-filter
0: unset service
0: set extintf "port1"
0: set portforward enable
0: unset srcintf-filter
0: set mappedip "192.0.2.1-192.0.2.1"
0: unset extport
0: set extport 8080-8081
0: unset mappedport
0: set mappedport 8080
-651: end

Late edit 2020-03-27: I spotted a bug in the Ansible issues tracker today, and I added a note to the end of that bug mentioning that as well as diagnose debug cli 8, if that doesn’t give you enough logs to figure out what’s up, you can also try diagnose debug application httpsd -1 but this enables LOTS AND LOTS of logs, so really think twice before turning that one on!

Oh, and if 30 minutes isn’t enough, try diagnose debug duration 480 or however many minutes you think you need. Beware that it will write event logs out to the serial console even when you’ve logged out.

Working with complicated template data UserData in Ansible

My new job means I’m currently building a lot of test boxes with Ansible, particularly OpenStack guests. This means I’m trying to script as much as possible without actually … getting my hands dirty with the actual “logging into it and running things” perspective.

This week, I hit a problem standing up a popular firewall vendor’s machine with Ansible, because I was trying to bypass the first-time-wizard… anyway, it wasn’t working, and I couldn’t figure out why. I talked to my colleague [mohclips] and he eventually told me that I needed to use a template, because what I was trying to do was too complicated.

But, damn him, I knew that wasn’t the answer :)

Anyway, I found this comment on a ticket, which lead me to the following… if you’re finding that your userdata: variable in the os_server module of Ansible isn’t working, you might need to wrap it up like this:

userdata: |
  {%- raw -%}#!/bin/bash
  # Kill script if the pipe fails
  set -euf -o pipefail
  # Write everything from this point on to Syslog
  echo " == Set admin credentials == "
  clish -c 'set user admin password-hash {% endraw -%}{{ default_password|password_hash('sha512') }}{%- raw -%}' -s
  {% endraw %}

Note that, if you have a space before your variable, use {% endraw -%} and if you’ve a space after it, use {%- raw %} as the hyphen means “ditch all the spaces before/after this command”.