
Getting Started with Terraform on AWS

I recently wrote a blog post about Getting Started with Terraform on Azure. You might have read it (I know I did!).

Having stood up a VM in Azure, I wanted to build a VM in AWS too – after all, it’s more-or-less the same steps. Note, this is a work-in-progress, and shouldn’t be considered “Final” – this is just something to use as *your* starting block.

What do you need?

You need an AWS account for this. If you’ve not got one, signing up for one is easy, but bear in mind that while there are free resources on AWS (only for the first year!), it’s also quite easy to suddenly enable a load of features that cost you money.

Best practice suggests (or rather, INSISTS) you shouldn’t use your “root” account for AWS. It’s literally just there to let you define the rest of your admin accounts. Turn on MFA (Multi-Factor Authentication) on that account, give it an exceedingly complex password, write that on a sheet of paper, and lock it in a box. You should NEVER use it!

Create your admin account and log in to it. Turn on MFA on *that* account too. Then, create an “Access Key” for your account – this is done in IAM (Identity and Access Management). This key is what we’ll use to let Terraform perform actions in AWS, without you needing to actually “log in”.

On my machine, I’ve put the credentials for this in /home/<MYUSER>/.aws/credentials and it looks like this:

[default]
aws_access_key_id = ABC123DEF456
aws_secret_access_key = AaBbCcDd1234EeFf56

This file should be chmod 600, and make sure it’s only your account that can access it. With this token, Terraform can perform *ANY ACTION* as you, including anything that charges you money, or creating servers that mine a “cryptocurrency” for someone malicious.
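
If you’re not sure of the commands to lock that file down, something like this should do it (a minimal sketch, assuming the default path):

chmod 700 ~/.aws
chmod 600 ~/.aws/credentials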

I’m using Windows Subsystem for Linux (WSL). I’m using the Ubuntu 18.04 distribution obtained from the Store. This post won’t explain how to get *that*. Also, you might want to run Terraform on Mac, in Windows or on Linux natively… so, yehr.

Next, we need to actually install Terraform. Excuse the long, unwrapped code block, but it gets what you need quickly (assuming the terraform webpage doesn’t change any time soon!)

mkdir -p ~/bin
cd ~/bin
sudo apt update && sudo apt install unzip
curl -sLO $(curl https://www.terraform.io/downloads.html | grep "linux_amd64.zip" | cut -d\" -f 2) && unzip terraform*.zip && rm terraform*.zip && chmod 755 terraform

Start coding your infrastructure

Before you can build your first virtual machine on AWS, you need to stand up the supporting infrastructure. These are:

  • An SSH Keypair (no password logins here!)
  • A VPC (“Virtual Private Cloud”, roughly the same as a VNet on Azure, or somewhat like a L3 switch in the Physical Realm).
  • An Internet Gateway (if your VPC isn’t classed as “the default one”)
  • A Subnet.
  • A Security Group.

Once we’ve got these, we can build our Virtual Machine on EC2 (“Elastic Compute Cloud”), and associate a “Public IP” to it.

To quote my previous post:

One quirk with Terraform, versus other tools like Ansible, is that when you run one of the terraform commands (like terraform init, terraform plan or terraform apply), it reads the entire content of any file suffixed “tf” in that directory, so if you don’t want a file to be loaded, you need to either move it out of the directory, comment it out, or rename it so it doesn’t end .tf. By convention, you normally have three “standard” files in a terraform directory – main.tf, variables.tf and output.tf, but logically speaking, you could have everything in a single file, or each instruction in its own file.

Getting Started with Terraform on Azure – Building the file structure

For the sake of editing and annotating the files for this post, these code blocks are all separated, but on my machine, they’re all currently one big file called “main.tf”.

In that file, I start by telling it that I’m working with the Terraform AWS provider, and that it should target my nearest region.

If you want to risk financial ruin, you can put things like your access tokens in here, but I really wouldn’t chance this!
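
As a rough sketch (the region is eu-west-2 here, because that’s the one I reference later – pick your nearest), the provider block needn’t be any more than this:

provider "aws" {
  # Credentials are read from ~/.aws/credentials - don't put them here!
  region = "eu-west-2"
}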

Next, we create our network infrastructure – VPC, Internet Gateway and Subnet. We also change the routing table.

I suspect, if I’d created the VPC as “The Default” VPC, then I wouldn’t have needed to amend the routing table, nor added an Internet Gateway. To help us make the routing table change, there’s a “data” block in this section of code. A data block is an instruction to Terraform to go and ask a resource for *something*, in this case, we need AWS to tell Terraform what the routing table is that it created for the VPC. Once we have that we can ask for the routing table change.
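
Here’s a rough sketch of that network block, including the data lookup described above. The resource names and CIDR ranges are mine for illustration, not gospel:

resource "aws_vpc" "demo" {
  cidr_block = "10.0.0.0/16"
  tags = {
    Name = "demo_vpc"
  }
}

resource "aws_internet_gateway" "demo" {
  vpc_id = aws_vpc.demo.id
  tags = {
    Name = "demo_igw"
  }
}

resource "aws_subnet" "demo" {
  vpc_id     = aws_vpc.demo.id
  cidr_block = "10.0.0.0/24"
  tags = {
    Name = "demo_subnet"
  }
}

# Ask AWS for the routing table it created alongside the VPC...
data "aws_route_table" "demo" {
  vpc_id = aws_vpc.demo.id
}

# ...and then add a default route out through the Internet Gateway.
resource "aws_route" "internet_access" {
  route_table_id         = data.aws_route_table.demo.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.demo.id
}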

AWS doesn’t actually give “proper” names to any of its assets. To provide something with a “real” name, you need to tag that thing with the “Name” tag. These can be practically anything, but I’ve given semi-sensible names to everything. You might want to name everything “main” (like I nearly did)!

We’re getting close to being able to create the VM now. First of all, we’ll create the Security Groups. I want to separate out my “Allow Egress Traffic” rule from my “Inbound SSH” rule. This means that I can clearly see what hosts allow inbound SSH access. Like with my Azure post, I’m using a “data provider” to get my public IP address, but in a normal “live” network, you’d specify a collection of valid source address ranges.
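
Something like this sketch would do it (again, the names are my own, not lifted from the original file):

resource "aws_security_group" "allow_egress" {
  name   = "allow_egress"
  vpc_id = aws_vpc.demo.id

  # Let everything out...
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "allow_ssh_in" {
  name   = "allow_ssh_in"
  vpc_id = aws_vpc.demo.id

  # ...but only let SSH in, and only from my current public IP.
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["${trimspace(data.http.icanhazip.body)}/32"]
  }
}

# Ask icanhazip.com what our public IP address is.
data "http" "icanhazip" {
  url = "https://ipv4.icanhazip.com"
}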

Last steps before we create the Virtual Machine. We need to upload our SSH key, and we need to find the “AMI” (Amazon Machine Image) ID of the image we’ll be using. To create the key, in this directory, alongside the .tf files, I’ve put my SSH public key (called id_rsa.pub), and we load that key when we create the “my_key” resource. To find the AMI, we need to make another data call, this time asking the AMI index to find the image with a name containing ubuntu-bionic-18.04 and some other stuff. AMIs are region specific, so the image I’m using in eu-west-2 will not have the same AMI ID in eu-west-1 or us-east-1 and so on. This filtering means that, as long as the image exists in that region, we can use “the right one”. So let’s take a look at this file.
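
As a sketch, those two pieces might look like this (the owner ID is Canonical’s well-known account for their public AMIs):

resource "aws_key_pair" "my_key" {
  key_name   = "my_key"
  public_key = file("id_rsa.pub")
}

# Find the most recent Ubuntu 18.04 image in whichever region we're using.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }
}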

So, now we have everything we need to create our VM. Let’s do that!

In here, we specify a “user_data” file to upload, in this case, the contents of a file – CloudDev.sh, but you can load anything you want in here. My CloudDev.sh is shown below, so you can see what I’m doing with this file :)
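
Pulling all the above together, a sketch of the VM resource itself might look a bit like this (the instance size is my choice here – it’s in the free tier):

resource "aws_instance" "demo" {
  ami                         = data.aws_ami.ubuntu.id
  instance_type               = "t2.micro"
  key_name                    = aws_key_pair.my_key.key_name
  subnet_id                   = aws_subnet.demo.id
  vpc_security_group_ids      = [aws_security_group.allow_egress.id, aws_security_group.allow_ssh_in.id]
  associate_public_ip_address = true
  user_data                   = file("CloudDev.sh")

  tags = {
    Name = "demo_vm"
  }
}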

So, having created all this lot, you need to execute the terraform workload. Initially you do terraform init. This downloads the providers your code relies on, and puts them into a hidden .terraform directory in the same tree as these .tf files are stored in. It also initialises the state store that Terraform uses to track what it has discovered or created.

Next, you do terraform plan -out tfout. Technically, the tfout part can be any filename, but having something like tfout marks it as clearly part of Terraform. This creates the tfout file, recording the current state and whatever needs to change to reach the desired state on the next run. Typically, if you don’t use a tfout file within about 20 minutes, it’s probably worth removing it.

Finally, once you’ve run your plan stage, you need to apply it. In this case you execute terraform apply tfout. This tfout is the same filename you specified in terraform plan. If you skip -out tfout on your plan (or skip the plan entirely) and run terraform apply with no filename, Terraform performs the plan itself and asks you to confirm before applying.

Once you’re done with your environment, use terraform destroy to shut it all down… and enjoy :)

Featured image is “code crunching” by “Ruben Molina” on Flickr and is released under a CC-ND license.

"Seca" by "Olearys" on Flickr

Getting Started with Terraform on Azure

I’m strongly in the “Ansible is my tool, what needs fixing” camp when it comes to Infrastructure as Code (IaC), but I know there are other tools out there which are equally good. I’ve been strongly advised to take a look at Terraform from HashiCorp. I’m most familiar at the moment with Azure, so this is going to be based around resources available on Azure.


Late edit: I want to credit my colleague, Pete, for his help getting started with this. While many of the code samples have been changed from what he provided me with, if it hadn’t been for these code samples in the first place, I’d never have got started!

Late edit 2: This post was initially based on Terraform 0.11, and I was prompted by another colleague, Jon, that the available documentation still follows the 0.11 layout. 0.12 was released in May, and changes how variables are reused in the code. This post now *should* follow the 0.12 conventions, but if you spot something where it doesn’t, check out this post from the Terraform team.


As with most things, there’s a learning curve, and I struggled to find a “simple” getting started guide for Terraform. I’m sure this is a failing on my part, but I thought it wouldn’t hurt to put something out there for others to pick up and see if it helps someone else (and, if that “someone else” is you, please let me know in the comments!)

Pre-requisites

You need an Azure account for this. This part is very far outside my spectrum of influence, but I’m assuming you’ve got one. If not, look at something like Digital Ocean, AWS or VMWare :) For my “controller”, I’m using Windows Subsystem for Linux (WSL), and wrote the following notes about getting my pre-requisites.

Building the file structure

One quirk with Terraform, versus other tools like Ansible, is that when you run one of the terraform commands (like terraform init, terraform plan or terraform apply), it reads the entire content of any file suffixed “tf” in that directory, so if you don’t want a file to be loaded, you need to either move it out of the directory, comment it out, or rename it so it doesn’t end .tf. By convention, you normally have three “standard” files in a terraform directory – main.tf, variables.tf and output.tf, but logically speaking, you could have everything in a single file, or each instruction in its own file. Because this is a relatively simple script, I’ll use this standard layout.

The actions I’ll be performing are the “standard” steps you’d perform in Azure to build a single Infrastructure as a Service (IaaS) server:

  • Create your Resource Group (RG)
  • Create a Virtual Network (VNET)
  • Create a Subnet
  • Create a Security Group (SG) and rules
  • Create a Public IP address (PubIP) with a DNS name associated to that IP.
  • Create a Network Interface (NIC)
  • Create a Virtual Machine (VM), supplying a username and password, the size of disks and VM instance, and any post-provisioning instructions (yep, I’m using Ansible for that :) ).

I’m using Visual Studio Code, but almost any IDE will have integrations for Terraform. The main things I’m using it for are auto-completion of resource, data and output types, and the fact that Ctrl+clicking a resource type opens your browser at the documentation page on terraform.io.

So, creating my main.tf, I start by telling it that I’m working with the Terraform AzureRM Provider (the bit of code that can talk Azure API).

This simple statement is enough to get Terraform to load the AzureRM, but it still doesn’t tell Terraform how to get access to the Azure account. Use az login from a WSL shell session to authenticate.
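
With the AzureRM provider releases current at the time of writing, the sketch below is about as minimal as that statement gets:

provider "azurerm" {
  # Authentication comes from "az login" in this shell session -
  # no credentials are needed in the code itself.
}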

Next, we create our resource group, VNet and subnet resources.
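
A rough sketch of those three resources (the address ranges here are mine, for illustration):

resource "azurerm_resource_group" "rg" {
  name     = var.resource_group_name
  location = var.location
}

resource "azurerm_virtual_network" "vnet" {
  name                = var.vnet_name
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

resource "azurerm_subnet" "subnet" {
  name                 = var.subnet_name
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefix       = "10.0.0.0/24"
}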

But wait, I hear you cry, what are those var.something bits in there? I mentioned before that in the “standard” set of files is a “variables.tf” file. In here, you specify values for later consumption. I have recorded variables for the resource group name and location, as well as the VNet name and subnet name. Let’s add those into variables.tf.
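
Something like this sketch would do it (the defaults here are mine – pick your own):

variable "resource_group_name" {
  default = "demo-rg"
}

variable "location" {
  default = "westeurope"
}

variable "vnet_name" {
  default = "demo-vnet"
}

variable "subnet_name" {
  default = "demo-subnet"
}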

When you’ve specified a resource, you can capture any of the results from that resource to use later – either in the main.tf or in the output.tf files. By creating the resource group (called “rg” here, but you can call it anything from “demo” to “myfirstresourcegroup”), we can consume the name or location with azurerm_resource_group.rg.name and azurerm_resource_group.rg.location, and so on. In the above code, for example, we use the VNet’s name when creating the subnet.

After the subnet is created, we can start adding the VM specific parts – a security group (with rules), a public IP (with DNS name) and a network interface. I’ll create the VM itself later. So, let’s do this.
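
Here’s a sketch of those three pieces. The NSG rule references a data source we define in a moment, and the NIC and public IP names match the ones consumed by the VM block further down; everything else is illustrative:

resource "azurerm_network_security_group" "iaassg" {
  name                = "iaas-sg"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  # Only allow SSH in, and only from my current public IP.
  security_rule {
    name                       = "AllowSSHIn"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "${trimspace(data.http.icanhazip.body)}/32"
    destination_address_prefix = "*"
  }
}

resource "azurerm_public_ip" "iaaspubip" {
  name                = "iaas-pubip"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  allocation_method   = "Dynamic"
  domain_name_label   = "my-iaas-demo"
}

resource "azurerm_network_interface" "iaasnic" {
  name                      = "iaas-nic"
  location                  = azurerm_resource_group.rg.location
  resource_group_name       = azurerm_resource_group.rg.name
  network_security_group_id = azurerm_network_security_group.iaassg.id

  ip_configuration {
    name                          = "ipconfig1"
    subnet_id                     = azurerm_subnet.subnet.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.iaaspubip.id
  }
}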

BUT WAIT, what’s that ${trimspace(data.http.icanhazip.body)}/32 bit there?? Any resources we want to load from the terraform state, but that we’ve not directly defined ourselves, need to come from somewhere. These items are classed as “data” – that is, we want to know what their values are, but we aren’t *changing* the service to get them. You can also use this to import other resource items, perhaps a virtual network that is created by another team, or perhaps your account doesn’t have the rights to create a resource group. I’ll include a commented-out data block in the overall main.tf file for review that specifies a VNet, if you want to see how that works.

In this case, I want to put the public IP address I’m coming from into the NSG Rule, so I can get access to the VM, without opening it up to *everyone*. I’m not that sure that my IP address won’t change between one run and the next, so I’m using the icanhazip.com service to determine my IP address. But I’ve not defined how to get that resource yet. Let’s add it to the main.tf for now.
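
That data block is tiny – a sketch:

# Ask icanhazip.com what our public IP address is.
data "http" "icanhazip" {
  url = "https://ipv4.icanhazip.com"
}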

So, we’re now ready to create our virtual machine. It’s quite a long block, but I’ll pull certain elements apart once I’ve pasted this block in.

So, this is broken into four main pieces.

  • Virtual Machine Details. This part is relatively sensible. Name, RG, location, NIC, size, and what happens to the disks when the machine is terminated. OK.
name                             = "iaas-vm"
location                         = azurerm_resource_group.rg.location
resource_group_name              = azurerm_resource_group.rg.name
network_interface_ids            = [azurerm_network_interface.iaasnic.id]
vm_size                          = "Standard_DS1_v2"
delete_os_disk_on_termination    = true
delete_data_disks_on_termination = true
  • Disk details.
storage_image_reference {
  publisher = "Canonical"
  offer     = "UbuntuServer"
  sku       = "18.04-LTS"
  version   = "latest"
}
storage_os_disk {
  name              = "iaas-os-disk"
  caching           = "ReadWrite"
  create_option     = "FromImage"
  managed_disk_type = "Standard_LRS"
}
  • OS basics: VM Hostname, username of the first user, and its password. Note, if you want to use an SSH key, this must be stored where Terraform can use it without a passphrase. If you mention an SSH key here, as well as a password, this can cause all sorts of connection issues, so pick one or the other.
os_profile {
  computer_name  = "iaas"
  admin_username = var.ssh_user
  admin_password = var.ssh_password
}
os_profile_linux_config {
  disable_password_authentication = false
}
  • And lastly, provisioning. I want to use Ansible for my provisioning. In this example, I have a basic playbook stored locally on my Terraform host, which I transfer to the VM, install Ansible via pip, and then execute ansible-playbook against the file I uploaded. This could just as easily be a git repo to clone or a shell script to copy in, but this is a “simple” example.
provisioner "remote-exec" {
  inline = ["mkdir /tmp/ansible"]

  connection {
    type     = "ssh"
    host     = azurerm_public_ip.iaaspubip.fqdn
    user     = var.ssh_user
    password = var.ssh_password
  }
}

provisioner "file" {
  source = "ansible/"
  destination = "/tmp/ansible"

  connection {
    type     = "ssh"
    host     = azurerm_public_ip.iaaspubip.fqdn
    user     = var.ssh_user
    password = var.ssh_password
  }
}

provisioner "remote-exec" {
  inline = [
    "sudo apt update > /tmp/apt_update || cat /tmp/apt_update",
    "sudo apt install -y python3-pip > /tmp/apt_install_python3_pip || cat /tmp/apt_install_python3_pip",
    "sudo -H pip3 install ansible > /tmp/pip_install_ansible || cat /tmp/pip_install_ansible",
    "ansible-playbook /tmp/ansible/main.yml"
  ]

  connection {
    type     = "ssh"
    host     = azurerm_public_ip.iaaspubip.fqdn
    user     = var.ssh_user
    password = var.ssh_password
  }
}

This part of the code is done in three parts – create the upload path, copy the files in, and then execute it. If you don’t create the upload path, it’ll upload just the first file it comes to into the path specified.

Each remote-exec and file provisioner statement must include the hostname, username and either the password, or SSH private key. In this example, I provide just the password.

So, having created all this lot, you need to execute the terraform workload. Initially you do terraform init. This downloads the providers your code relies on, and puts them into a hidden .terraform directory in the same tree as these .tf files are stored in. It also initialises the state store that Terraform uses to track what it has discovered or created.

Next, you do terraform plan -out tfout. Technically, the tfout part can be any filename, but having something like tfout marks it as clearly part of Terraform. This creates the tfout file, recording the current state and whatever needs to change to reach the desired state on the next run. Typically, if you don’t use a tfout file within about 20 minutes, it’s probably worth removing it.

Finally, once you’ve run your plan stage, you need to apply it. In this case you execute terraform apply tfout. This tfout is the same filename you specified in terraform plan. If you skip -out tfout on your plan (or skip the plan entirely) and run terraform apply with no filename, Terraform performs the plan itself and asks you to confirm before applying.

When I ran this, with a handful of changes to the variable files, I got this result:

Once you’re done with your environment, use terraform destroy to shut it all down… and enjoy :)

The full source is available in the associated Gist. Pull requests and constructive criticism are very welcome!

Featured image is “Seca” by “Olearys” on Flickr and is released under a CC-BY license.

"Untitled" by "Ryan Dickey" on Flickr

Run an Ansible Playbook against a Check Point Gaia node running R80+

Late Edit – 2019-11-05: Ansible 2.9 has some Check Point modules for interacting with the Check Point Manager API which are actually Idempotent, and if you’re running Ansible <=2.8, there are some non-idempotent modules available directly from Check Point. This post is about interacting with the OS. The OS might now be much more addressable using ansible_connection=ssh!

In Check Point Gaia R77, if you wanted to run Ansible against this node, you were completely out of luck. The version of Python on the host was broken, modules were missing and … well, it just wouldn’t work.

Today, I’m looking at running some Ansible playbooks against Check Point R80 nodes. Here are the steps you need to go through to make it work.

  1. Make sure the user that Ansible is going to be using has the shell /bin/bash. If you don’t have this set up, the command is: set user ansible shell /bin/bash.
  2. If you want a separate user account to do ansible actions, run these commands:
    add user ansible uid 9999 homedir /home/ansible
    set user ansible password-hash $1$D3caF9$B4db4Ddecafbadnogoood (note this hash is not valid!)
    add rba user ansible roles adminRole
    set user ansible shell /bin/bash
  3. Make sure your inventory specifies the right path for your Python binary. In the next code block you’ll see my inventory for three separate Check Point R80+ nodes. Note that I’ll only be targeting the “checkpoint” group, but that I’m using the r80_10, r80_20 and r80_30 groups to load the variables in there. I could, alternatively, add these in as values in group_vars/r80_10.yml and so on, but I find keeping everything to do with my connection in one place much cleaner. The python interpreter is in a separate path for each version, and if you don’t specify ansible_ssh_transfer_method=piped you’ll get a message like this: [WARNING]: sftp transfer mechanism failed on [cpr80-30]. Use ANSIBLE_DEBUG=1 to see detailed information (fix from Add pipeline-ish method using dd for file transfer over SSH (#18642) on the Ansible git repo)
[checkpoint]
cpr80-10        ansible_user=admin      ansible_password=Sup3rS3cr3t-
cpr80-20        ansible_user=admin      ansible_password=Sup3rS3cr3t-
cpr80-30        ansible_user=admin      ansible_password=Sup3rS3cr3t-

[r80_10]
cpr80-10

[r80_20]
cpr80-20

[r80_30]
cpr80-30

[r80_10:vars]
ansible_ssh_transfer_method=piped
ansible_python_interpreter=/opt/CPsuite-R80/fw1/Python/bin/python

[r80_20:vars]
ansible_ssh_transfer_method=piped
ansible_python_interpreter=/opt/CPsuite-R80.20/fw1/Python/bin/python

[r80_30:vars]
ansible_ssh_transfer_method=piped
ansible_python_interpreter=/opt/CPsuite-R80.30/fw1/Python/bin/python

And there you have it, one quick “ping” check later…

$ ansible -m 'ping' -i hosts checkpoint
cpr80-10 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
cpr80-30 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
cpr80-20 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

One quick word of warning though, don’t use gather_facts: true or the setup: module. Both of these still rely on missing libraries on the Check Point nodes, and won’t work… But then again, you can get whatever you need from shell commands….. right? ;)
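
To give you a flavour of that, here’s a minimal sketch of a play that leans on shell commands instead of facts (the clish command is just an example):

- hosts: checkpoint
  gather_facts: false
  tasks:
    - name: Get the Gaia version
      shell: clish -c "show version all"
      register: gaia_version

    - name: Show what came back
      debug:
        var: gaia_version.stdout_lines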

Featured image is “Untitled” by “Ryan Dickey” on Flickr and is released under a CC-BY license.

"Tower" by " Yijun Chen" on Flickr

Building a Gitlab and Ansible Tower (AWX) Demo in Vagrant with Ansible

TL;DR – I created a repository on GitHub containing a Vagrantfile and an Ansible Playbook to build a VM running Docker. That VM hosts AWX (Ansible Tower’s upstream open-source project) and Gitlab.

A couple of years ago, a colleague created (and I enhanced) a Vagrant and Ansible playbook called “Project X” which would run an AWX instance in a Virtual Machine. It’s a bit heavy, and did a lot of things to do with persistence that I really didn’t need, so I parked my changes and kept an eye on his playbook…

Fast-forward to a week-or-so ago. I needed to explain what a Git/Ansible Workflow would look like, and so I went back to look at ProjectX. Oh my, it looks very complex and consumed a lot of roles that, historically, I’ve not been that impressed with… I just needed the basics to run AWX. Oh, and I also needed a Gitlab environment.

I knew that Gitlab had a docker-based install, and so does AWX, so I trundled off to find some install guides. These are listed in the playbook I eventually created (hence not listing them here). Not all the choices I made were inspired by those guides – I wanted to make quite a bit of this stuff “build itself”… this meant I wanted users, groups and projects to be created in Gitlab, and users, projects, organisations, inventories and credentials to be created in AWX.

I knew that you can create Docker Containers in Ansible, so after I’d got my pre-requisites built (full upgrade, docker installed, pip libraries installed), I added the gitlab-ce:latest docker image, and exposed some ports. Even now, I’m not getting the SSH port mapped that I was expecting, but … it’s no disaster.
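
For illustration, a rough sketch of that task (the exact port mappings here are my guesses, not lifted from the playbook):

- name: Run the Gitlab container
  docker_container:
    name: gitlab
    image: gitlab/gitlab-ce:latest
    restart_policy: always
    published_ports:
      - "80:80"
      - "443:443"
      - "2222:22"   # the SSH mapping that isn't quite behaving as expected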

I did notice that the Gitlab service takes ages to start once the container is marked as running, so I did some more digging, and found that the uri module can be used to poll a URL. It wasn’t documented well how you can make it keep polling until you get the response you want, so … I added a PR on the Ansible project’s github repo for that one (and I also wrote a blog post about that earlier too).
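
That polling pattern ends up looking something like this sketch (the URL and timings are illustrative):

- name: Wait for Gitlab to finish starting
  uri:
    url: "http://localhost/users/sign_in"
    status_code: 200
  register: gitlab_check
  until: gitlab_check.status == 200
  retries: 60
  delay: 10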

Once I had a working Gitlab service, I needed to customize it. There are a bunch of Gitlab modules in Ansible, but as of a few Gitlab releases back, these don’t work any more, so I had to find a different way. That different way was to run an internal command called “gitlab-rails”. It’s not perfect (it doesn’t create repos in your projects, for instance) but it’s pretty good at giving you just enough to build your demo environment. So that’s getting Gitlab up…

Now I need to build AWX. There’s lots of build guides for this, but actually I had most luck using the README in their repository (I know, who’d have thought it!??!) There are some “Secrets” that should be changed in production that I’m changing in my script, but on the whole, it’s pretty much a vanilla install.

Unlike the Gitlab modules, the Ansible Tower modules all work, so I use these to create the users, credentials and so-on. Like the gitlab-rails commands, however, the documentation for using the tower modules is pretty ropey, and I still don’t have things like “getting your users to have access to your organisation” working from the get-go, but for the bulk of the administration, it does “just work”.

Like all my playbooks, I use group_vars to define the stuff I don’t want to keep repeating. In this demo, I’ve set all the passwords to “Passw0rd”, and I’ve created 3 users in both AWX and Gitlab – csa, ops and release – indicative of the sorts of people this demo I ran was aimed at – Architects, Operations and Release Managers.

Maybe, one day, I’ll even be able to release the presentation that went with the demo ;)

On a more productive note, if you’re doing things with the tower_ modules and want to tell me what I need to fix up, or if you’re doing awesome things with the gitlab-rails tool, please visit the repo with this automation code in, and take a look at some of my “todo” items! Thanks!!

Featured image is “Tower” by “Yijun Chen” on Flickr and is released under a CC-BY-SA license.

"funfair action" by "Jon Bunting" on Flickr

Improving the speed of Azure deployments in Ansible with Async

Recently I was building a few environments in Azure using Ansible, and found this stanza which helped me to speed things up.

  - name: "Schedule UDR Creation"
    azure_rm_routetable:
      resource_group: "{{ resource_group }}"
      name: "{{ item.key }}_udr"
    loop: "{{ routetables | dict2items }}"
    loop_control:
      label: "{{ item.key }}_udr"
    async: 1000
    poll: 0
    changed_when: False
    register: sleeper

  - name: "Check UDRs Created"
    async_status:
      jid: "{{ item.ansible_job_id }}"
    register: sleeper_status
    until: sleeper_status.finished
    retries: 500
    delay: 4
    loop: "{{ sleeper.results|flatten(levels=1) }}"
    when: item.ansible_job_id is defined
    loop_control:
      label: "{{ item._ansible_item_label }}"

What we do here is to start an action with an “async” time (to give the Schedule an opportunity to register itself) and a “poll” time of 0 (to prevent the Schedule from waiting to be finished). We then tell it that it’s “never changed” (changed_when: False) because otherwise it always shows as changed, and to register the scheduled item itself as a “sleeper”.

After all the async jobs get queued, we then check the status of all the scheduled items with the async_status module, passing it the registered job ID. This lets me spin up a lot more items in parallel, and then “just” confirm afterwards that they’ve been run properly.

It’s not perfect, and it can make for rather messy code. But, it does work, and it’s well worth giving it the once over, particularly if you’ve got some slow-to-run tasks in your playbook!

Featured image is “funfair action” by “Jon Bunting” on Flickr and is released under a CC-BY license.


Working around the fact that Ansible’s URI module doesn’t honour the no_proxy variable…

An Ansible project I’ve been working on has tripped me up this week. I’m working with some HTTP APIs and I need to check early whether I can reach the host. To do this, I used a simple Ansible Core Module which lets you call an HTTP URI.

- uri:
    follow_redirects: none
    validate_certs: False
    timeout: 5
    url: "http{% if ansible_https | default(True) %}s{% endif %}://{{ ansible_host }}/login"
  register: uri_data
  failed_when: False
  changed_when: False

This all seems pretty simple. One of the environments I’m working in uses the following values in their environment:

http_proxy="http://192.0.2.1:8080"
https_proxy="http://192.0.2.1:8080"
no_proxy="10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24"

And this breaks the uri module, because it tries to punt everything through the proxy if the “no_proxy” contains CIDR values (like 192.0.2.0/24) (there’s a bug raised for this)… So here’s my fix!

- set_fact:
    no_proxy_match: |
      {
        {% for no_proxy in (lookup('env', 'no_proxy') | replace(',', '') ).split() %}
          {% if no_proxy| ipaddr | type_debug != 'NoneType' %}
            {% if ansible_host | ipaddr(no_proxy) | type_debug != 'NoneType' %}
              "match": "True"
            {% endif %}
          {% endif %}
        {% endfor %}
      }

- uri:
    follow_redirects: none
    validate_certs: False
    timeout: 5
    url: "http{% if ansible_https | default(True) %}s{% endif %}://{{ ansible_host }}/login"
  register: uri_data
  failed_when: False
  changed_when: False
  environment: "{ {% if no_proxy_match.match | default(False) %}'no_proxy': '{{ ansible_host }}'{% endif %} }"

So, let’s break this down.

The key part to this script is that we need to override the no_proxy environment variable with the IP address that we’re trying to address (so that we’re not putting 16M addresses for 10.0.0.0/8 into no_proxy, for example). To do that, we use the exact same URI block, except for the environment line at the end.

In turn, the set_fact block steps through the no_proxy values, looking for IP Addresses to check ({% if no_proxy | ipaddr ... %}‌ says “if the no_proxy value is an IP Address, return it, but if it isn’t, return a ‘None’ value”) and if it’s an IP address or subnet mask, it checks to see whether the IP address of the host you’re trying to reach falls inside that IP Address or Subnet Mask ({% if ansible_host | ipaddr(no_proxy) ... %} says “if the ansible_host address falls inside the no_proxy range, then return it, otherwise return a ‘None’ value”). Both of these checks say “If this previous check returns anything other than a ‘None’ value, do the next thing”, and on the last check, the “next” thing is to set the flag ‘match’ to ‘true’. When we get to the environment variable, we say “if match is not true, it’s false, so don’t put a value in there”.

So that’s that! Yes, I could merge the set_fact block into the environment variable, but I do end up using that a fair amount. And really, if it was merged, that would be even MORE complicated to pick through.

I have raised a pull request on the Ansible project to update the documentation, so we’ll see whether we end up with people over here looking for ways around this issue. If so, let me know in the comments below! Thanks!!

"Juniper NetScreen 25 Firewall front" by "jackthegag" on Flickr

Standard Firewall Rules

One of the things I like to do is to explain how I set things up, but a firewall is one of those things that’s a bit complicated, because it depends on your situation, and what you’re trying to do in your environment. That said, there’s a template that you can probably get away with deploying, and see if it works for your content, and then you’ll see where to add the extra stuff from there. Firewall policies typically work from the top down.

This document will assume you have a simple boundary firewall. This simple firewall has two interfaces, the first being an “Outside” interface, connected to your ISP, with an IPv4 address of 192.0.2.2/24 and a default gateway of 192.0.2.1, it also has a IPv6 address of 2001:db8:123c:abd::2/64 and a default gateway address of 2001:db8:123c:abd::1. The second “Inside” interface, where your protected network is attached, has an IPv4 address of 198.51.100.1/24 and an IPv6 address of 2001:db8:123d:abc::1/64. On this inside interface, the firewall is the default gateway for the inside network.

I’ll be using simple text rules to describe firewall policies, following this format:

Source Interface: <outside | inside>
Source IP Address: <x.x.x.x/x | "any">
NAT Source IP Address: <x.x.x.x/x | no>
Destination Interface: <outside | inside>
Destination IP Address: <x.x.x.x/x | "any">
NAT Destination IP Address: <x.x.x.x/x | no>
Destination Port: <tcp | udp | icmp | ip>/<x>
Action: <allow | deny | reject>
Log: <yes | no>
Notes: <some commentary if required>

In this model, if you want to describe HTTP access to a web server, you might write the following policy:

Source Interface: outside
Source IP Address: 0.0.0.0/0 (Any IP)
NAT Source IP Address: no
Destination Interface: inside
Destination IP Address: 192.0.2.2 (External IP)
NAT Destination IP Address: 198.51.100.2 (Internal IP)
Destination Port: tcp/80
Action: allow
Log: yes

So, without further waffling, let’s build a policy. By default all traffic will be logged. In high-traffic environments, you may wish to prevent certain traffic from being logged, but on the whole, I think you shouldn’t really lose firewall logs unless you need to!

Allowing established, related and same-host traffic

This rule is only really needed on iptables-based firewalls, as all the commercial vendors (as far as I can tell, at least) already cover this as “standard”. If you’re using UFW (a wrapper around iptables), this rule is covered off already, but essentially it goes a bit like this (there’s an iptables sketch after the rule definitions below):

Source Interface: lo (short for "local", where the traffic never leaves the device)
Source IP Address: any
NAT Source IP Address: no
Destination Interface: lo
Destination IP Address: any
NAT Destination IP Address: no
Destination Port: any
Action: allow
Log: no
Notes: This above rule permits traffic between localhost addresses (127.0.0.0/8) or between public addresses on the same host, for example, between two processes without being blocked.
flags: Established OR Related
Action: allow
Log: no
Notes: This above rule is somewhat special, as it looks for specific flags on the packet, that says "If we've already got a session open, let it carry on talking".
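
On an iptables-based firewall, those two rules might be written something like this sketch (not a complete ruleset):

# Permit traffic that never leaves the device
iptables -A INPUT -i lo -j ACCEPT
# Permit established and related sessions to carry on talking
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT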

Dropping Noisy Traffic

In a network, some proportion of the traffic is going to be “noisy”. Whether it’s broadcast traffic from your application that uses mDNS, or the Windows File Share trying to find like-minded hosts to exchange data… these can fill up your logs, so lets drop the broadcast and multicast IPv4 traffic, and not log them.

Source Interface: any
Source IP Address: 0.0.0.0/0
NAT Source IP Address: no
Destination Interface: any
Destination IP Address: 255.255.255.255 (global broadcast), 192.0.2.255 ("outside" broadcast), 198.51.100.255 ("inside" broadcast) and 224.0.0.0/4 (multicast)
NAT Destination IP Address: no
Destination Port: any
Action: deny
Log: no
Notes: The global and local broadcast addresses are used to "find" other hosts in a network, whether that's a DHCP server or something like mDNS. Dropping this prevents the traffic from appearing in your logs later.

Permitting Management Traffic

Typically you want to trust certain machines to access or be accessed by this host – whether it’s your SYSLOG collector, or the box that can manage the firewall policy, so here we’ll create a policy that lets these in.

Source Interface: inside
Source IP Address: 198.51.100.2 and 2001:db8:123d:abc::2 (Management IP)
NAT Source IP Address: no
Destination Interface: inside
Destination IP Address: 198.51.100.1 and 2001:db8:123d:abc::1 (Firewall IP)
NAT Destination IP Address: no
Destination Port: SSH (tcp/22)
Action: permit
Log: yes
Notes: Allow inbound SSH access. You're unlikely to need more inbound ports, but if you do - customise them here.
Source Interface: inside
Source IP Address: 198.51.100.1 and 2001:db8:123d:abc::1 (Firewall IP)
NAT Source IP Address: no
Destination Interface: inside
Destination IP Address: 198.51.100.2 and 2001:db8:123d:abc::2 (Management IP)
NAT Destination IP Address: no
Destination Port: SYSLOG (udp/514)
Action: permit
Log: yes
Notes: Allow outbound SYSLOG access. Tailor this to outbound ports you need.

Allowing Control Traffic

ICMP is a protocol that is fundamental to IPv4 and IPv6. Commonly used for Traceroute and Ping, but also used to perform REJECT responses and that sort of thing. We’re only going to let it be initiated *out* not in. Some people won’t allow this rule, or tailor it to more specific destinations.

Source Interface: inside
Source IP Address: any
NAT Source IP Address: 192.0.2.2 (The firewall IP address which may be replaced with 0.0.0.0 indicating "whatever IP address is bound to the outbound interface")
Destination Interface: outside
Destination IP Address: any
NAT Destination IP Address: no
Destination Port: icmp
Action: allow
Log: yes
Notes: ICMPv4 and ICMPv6 are different things. This is just the ICMPv4 version. IPv4 does require NAT, hence the difference from the IPv6 version below.
Source Interface: inside
Source IP Address: any
NAT Source IP Address: no
Destination Interface: outside
Destination IP Address: any
NAT Destination IP Address: no
Destination Port: icmpv6
Action: allow
Log: yes
Notes: ICMPv4 and ICMPv6 may be treated as different things. This is just the ICMPv6 version. IPv6 does not require NAT.

Protect the Firewall

There should be no other traffic going to the Firewall, so let’s drop everything. There are two types of “Deny” action – a “Reject” and a “Drop”. A Reject sends a message back from the host which is refusing the connection – usually the end server, saying that the service didn’t want to reply to you – but if there’s a box in the middle, like a firewall, this reject (actually an ICMP packet) comes from the firewall instead. In that case it identifies that the firewall refused the connection on behalf of the node, which advertises the fact that the end server is protected by a security box. Instead, firewall administrators tend to use Drop, which just silently discards the initial request, leaving the initiating end to “Time Out”. You’re free to use either “Reject” or “Drop” wherever we show “Deny” in the policies below, but bear in mind that Reject is less secure than Drop.

Source Interface: any
Source IP Address: any
NAT Source IP Address: no
Destination Interface: any
Destination IP Address: 192.0.2.2, 2001:db8:123c:abd::2, 198.51.100.1 and 2001:db8:123d:abc::1 (may also be represented as :: or 0.0.0.0 depending on the platform)
NAT Destination IP Address: no
Destination Port: any
Action: deny
Log: no
Notes: Drop everything targeted at the firewall IPs. If you have more NICs or additional IP addresses on the firewall, these will also need blocking.

“Normal” Inbound Traffic

After you’ve got your firewall protected, now you can sort out your “normal” traffic flows. I’m going to add a single inbound policy to represent the sort of traffic you might want to configure (in this case a simple web server), but bear in mind some environments don’t have any “inbound” rules (for example, most homes would be in this case), and some might need lots and lots of inbound rules. This is just to give you a flavour on what you might see here.

Source Interface: outside
Source IP Address: any
NAT Source IP Address: no
Destination Interface: inside
Destination IP Address: 192.0.2.2 (External IP)
NAT Destination IP Address: 198.51.100.2 (Internal IP)
Destination Port: tcp/80 (HTTP), tcp/443 (HTTPS)
Action: allow
Log: yes
Notes: This is the IPv4-only rule. Note a NAT MUST be applied here.
Source Interface: outside
Source IP Address: any
NAT Source IP Address: no
Destination Interface: inside
Destination IP Address: 2001:db8:123d:abc::2
NAT Destination IP Address: no
Destination Port: tcp/80 (HTTP), tcp/443 (HTTPS)
Action: allow
Log: yes
Notes: This is the IPv6-only rule. Note that NO NAT is required (but, you may wish to perform NAT, depending on your environment).

“Normal” Outbound Traffic

If you’re used to a DSL router, that basically just allows all outbound traffic, and we’re going to implement the same here. If you want to be more specific about things, you’d define your outbound rules like the inbound rules in the block above… but if you’re not that worried, then the rule below is generally going to be all OK :)

Source Interface: inside
Source IP Address: any
NAT Source IP Address: 192.0.2.2 (The firewall IP address which may be replaced with 0.0.0.0 indicating "whatever IP address is bound to the outbound interface")
Destination Interface: outside
Destination IP Address: any
NAT Destination IP Address: no
Destination Port: any
Action: allow
Log: yes
Notes: This is just the IPv4 version. IPv4 does require NAT, hence the difference from the IPv6 version below.
Source Interface: inside
Source IP Address: any
NAT Source IP Address: no
Destination Interface: outside
Destination IP Address: any
NAT Destination IP Address: no
Destination Port: any
Action: allow
Log: yes
Notes: This is just the IPv6 version. IPv6 does not require NAT.

Drop Rule

Following your permit rules above, you now need to drop everything else. Fortunately, by now, you’ve “white-listed” all the permitted traffic, so now we can just drop “everything”. So, let’s do that!

Source Interface: any
Source IP Address: any
NAT Source IP Address: no
Destination Interface: any
Destination IP Address: any
NAT Destination IP Address: no
Destination Port: any
Action: deny
Log: yes

And so that is a basic firewall policy… or at least, it’s the template I tend to stick to! :)

Expanding XFS drives with LVM

Say, for example, you’ve got a lovely CentOS VM (using XFS by default) which has a disk that isn’t quite big enough. Fair enough, your VM hypervisor is sensible enough to resize that disk without question… but how do you resize the XFS partition? Assuming your disk is presented as /dev/sda, with a boot volume as partition 1 and a root volume as partition 2 (the standard install model):

  1. parted /dev/sda resizepart 2 100%
  2. partprobe /dev/sda
  3. pvresize /dev/sda2
  4. lvextend -l +100%FREE /dev/centos/root /dev/sda2
  5. xfs_growfs /dev/mapper/centos-root
The graphical version of the steps above


"Zenith Z-19 Terminal" from ajmexico on Flickr

Some things I learned this week while coding extensions to Ansible!

If you follow any of the content I post around the internet, you might have seen me asking questions about trying to get data out of azure_rm_*_facts into something that’s usable. I can’t go into why I needed that data yet (it’s a little project I’m working on), but the upshot is that trying to manipulate data using “set_fact” with jinja is *doable* but *messy*. In the end, I decided to hand it all off to a new ansible module I’m writing. So, here are the things I learned about this.

  1. There’s lots more documentation about writing a module (a plugin that lets you do stuff) than there is about writing filters (things that change text inline) or lookups (things that let you search other data stores). In the end, while I could have spent the time to figure out how better to write a filter or a lookup, it actually makes more sense in my context to hand a module all my data, and say “Parse this” and register the result than it would have done to have the playbook constantly check whether things were in other things. I still might have to do that, but… you know, for now, I’ve got the bits I want! :)
  2. I did start looking at writing a filter, and discovered that the “debugging advice” on the ansible site is all geared towards debugging modules rather than debugging filters… but I did discover that modules execute on their target (e.g. WebHost01) while filters and lookups execute on the local machine. Why does this matter? Well…..
  3. While I was looking for documentation about debugging Ansible code, I stumbled over this page on debugging modules that makes it all look easy. Except, it’s only for debugging *MODULES* (very frustrating). Well, what does it actually mean? The modules get zipped up and sent to the host that will be executing the code, which means that with an extra environment variable set when you run your playbook (ANSIBLE_KEEP_REMOTE_FILES=1 – even if it’s going to be run on “localhost”), you get the combined output of the script placed into a path on your machine, which means you can debug that specific play. That doesn’t work for filters…
  4. SOO, I jumped into #ansible on Freenode and asked for help. They in turn couldn’t help me (it’s more about writing playbooks than writing filters, modules, etc), so they directed me to #ansible-devel, where I was advised to use a python library called “q” (Edit, same day: my friend @mohclips pointed me to this youtube video from 2013 of the guy who wrote q explaining about it. Thanks Nick! I learned something *else* about this library).
  5. Oh man, this is the motherlode. So, q makes life *VERY* easy. Assuming you’ve got an otherwise-valid piece of filter code, all you need to do is add two lines, as you’ll see in the sketch after this list. This then dumps the output from each of the q(something) lines into /tmp/q for you to read at your leisure! (To be fair, I’d probably remove it after you’ve finished, so you don’t fill a disk :) )
  6. And that’s when I discovered that it’s actually easier to use q() for all my python debugging purposes than it is to follow the advice above about debugging modules. Yehr, it’s basically a load of print statements, so you don’t get to see stack traces, or read all the variables, and you don’t get to step through code to see why decisions were taken… but for the rubbish code I produce, it’s easily enough for me!
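
Here’s the sketch promised in item 5 – a made-up filter plugin, with the two extra q lines marked (the filter itself is hypothetical, just to show where the lines go):

import q                      # <-- extra line one: import the library


def to_shouting(value):
    q(value)                  # <-- extra line two: dump the value to /tmp/q
    return value.upper()


class FilterModule(object):
    def filters(self):
        return {'to_shouting': to_shouting}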

Header image is “Zenith Z-19 Terminal” by “ajmexico” on Flickr and is released under a CC-BY license. Used with thanks!

"LEGO Factory Playset" from Brickset on Flickr

Building Azure Environments in Ansible

Recently, I’ve been migrating my POV (proof of value) and POC (proof of concept) environment from K5 to Azure to be able to test vendor products inside Azure. I ran a few tests to build the environment using the native tools (the powershell scripts) and found that the Powershell way of delivering Azure environments seems overly complicated… particularly as I’m comfortable with how Ansible works.

To be fair, I also need to look at Terraform, but that isn’t what I’m looking at today :)

So, let’s start with the scaffolding. Any Ansible Playbook which deals with creating virtual machines needs to have some extra modules installed. Make sure you’ve got ansible 2.7 or later and the python azure library 2.0.0 or later (you can get both with pip for python).
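
If you’re grabbing both with pip, a one-liner like this sketch should do it (exact version pins may vary):

pip install 'ansible>=2.7' 'azure>=2.0.0'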

Next, let’s look at the group_vars for this playbook.

This file has several pieces. We define the project settings (anything prefixed project_ is a project setting), including the prefix used for all resources we create (in this case “env01”), and a standard password used for all VMs we create (in this case “My$uper$ecret$Passw0rd”).

Next we define the standard images to load from the Marketplace. You can extend this with other images, these are just the “easiest” ones that I’m most familiar with (your mileage may vary). Next up is the networks to build inside the VNet, and lastly we define the actual machines we want to build. If you’ve got questions about any of the values we define here, just let me know in the comments below :)

Next, we’ll start looking at the playbook (this has been exploded out – the full playbook is also in the gist).

Here we start by pulling in the variables we might want to override, and we do this by reading system environment variables (ANSIBLE_PREFIX and BREAKGLASS) and using them if they’re set. If they’re not, use the project defaults, and if that hasn’t been set, use some pre-defined values… and then tell us what they are when we’re running the tasks (those are the debug: lines).
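
A sketch of that override pattern (the variable names here are mine for illustration):

- set_fact:
    prefix: "{{ lookup('env', 'ANSIBLE_PREFIX') | default(project_prefix, true) }}"
    breakglass: "{{ lookup('env', 'BREAKGLASS') | default(project_password, true) }}"

- debug:
    msg: "Using the prefix {{ prefix }}"

(The second argument to default() makes an empty environment variable count as “unset”.)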

This block is where we create our “Static Assets” – individual items that we will be consuming later. This shows a clear win here over the Powershell methods endorsed by Microsoft – here you can create a Resource Group (RG) as part of the playbook! We also create a single Storage Account for this RG and a single VNET too.

These creation rules are not suitable for production use, as this defines an “Any-Any” Security group! You should tailor your security groups for your need, not for blanket access in!

This is where things start to get a bit more interesting – We’re using the “async/async_status” pattern here (and the rest of these sections) to start creating the resources in parallel. As far as I can tell, sometimes you’ll get a case where the async doesn’t quite get set up fast enough, then the async_status can’t track the resources properly, but re-running the playbook should be enough to sort that out, without slowing things down too much.

But what are we actually doing with this block of code? A UDR is a “User Defined Route” or routing table for Azure. Effectively, you treat each network interface as being plumbed directly to the router (none of this “same subnet broadcast” stuff works here!) so you can do routing at the router for all the networks.

By default there are some existing network routes (stuff to the internet flows to the internet, RFC1918 addresses are dropped with the exception of any RFC1918 addresses you have covered in your VNETs, and each of your subnets can reach each other “directly”). Adding a UDR overrides this routing table. The UDRs we’re creating here are applied at a subnet level, but currently don’t override any of the existing routes (they’re blank). We’ll start putting routes in after we’ve added the UDRs to the subnets. Talking of which….

Again, this block is not really suitable for production use, and assumes the VNET supernet of /8 will be broken down into several /24’s. In the “real world” you might deliver a handful of /26’s in a /24 VNET… or you might even have lots of disparate /24’s in the VNET which are then allocated exactly as individual /24 subnets… this is not what this model delivers but you might wish to investigate further!

Now that we’ve created our subnets, we can start adding the routing table to the UDR. This is a basic one – add a 0.0.0.0/0 route (internet access) from the “protected” network via the firewall. You can get a lot more specific than this – most people are likely to want to add the VNET range (in this case 10.0.0.0/8) via the firewall as well, except for this subnet (because otherwise, for example, 10.0.0.100 trying to reach 10.0.0.101 will go via the firewall too).
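
As a sketch, adding that default route with the azure_rm_route module looks something like this (the names and the firewall IP are illustrative):

- name: Route everything via the firewall
  azure_rm_route:
    resource_group: "{{ resource_group }}"
    route_table_name: "{{ prefix }}_protected_udr"
    name: default-via-firewall
    address_prefix: "0.0.0.0/0"
    next_hop_type: virtual_appliance
    next_hop_ip_address: "10.0.0.4"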

Without going too much into the intricacies of network architecture, if you are routing your traffic between subnets to the firewall, it’s probably better to get an appliance with more interfaces, so you can route traffic across the appliance, rather than hairpinning it across a single interface, as this will halve your traffic bandwidth (it’s currently capped at 1Gb/s – so 500Mb/s).

Having mentioned “The Internet” – let’s give our firewall a public IP address, and create the rest of the interfaces as well.

This script creates a public IP address by default for each interface unless you explicitly tell it not to (see lines 40, 53 and 62 in the group_vars file I rendered above). You could easily turn this around by changing the lines which contain this:

item.1.public is not defined or (item.1.public is defined and item.1.public == 'true')

into lines which contain this:

item.1.public is defined and item.1.public == 'true'

OK, having done all that, we’re now ready to build our virtual machines. I’ve introduced a “Priority system” here – VMs with priority 0 go first, then 1, and 2 go last. The code snippet below is just for priority 0, but you can easily see how you’d extrapolate that out (and in fact, the full code sample does just that).

There are a few blocks here to draw attention to :) I’ve re-jigged them a bit here so it’s clearer to understand, but when you see them in the main playbook they’re a bit more compact. Let’s start with looking at the Network Interfaces section!

network_interfaces: |
  [
    {%- for nw in item.value.ports -%}
      '{{ prefix }}{{ item.value.name }}port{{ nw.subnet.name }}'
      {%- if not loop.last -%}, {%- endif -%} 
    {%- endfor -%}
  ]

In this part, we loop over the ports defined for the virtual machine. This is because one device may have 1 interface, or four interfaces. YAML is parsed to make a JSON variable, so here we can create a JSON variable, that when the YAML is parsed it will just drop in. We’ve previously created all the interfaces to have names like this PREFIXhostnamePORTsubnetname (or aFW01portWAN in more conventional terms), so here we construct a JSON array, like this: ['aFW01portWAN'] but that could just as easily have been ['aFW01portWAN', 'aFW01portProtect', 'aFW01portMGMT', 'aFW01portSync']. This will then attach those interfaces to the virtual machine.

Next up, custom_data. This section is sometimes known externally as userdata or config_disk. My code has always referred to it as a “Provision Script” – hence the variable name in the code below!

custom_data: |
  {%- if item.value.provision_script is defined and item.value.provision_script != '' -%}
    {%- include(item.value.provision_script) -%}
  {%- elif item.value.image.provision_script is defined and item.value.image.provision_script != '' -%}
    {%- include(item.value.image.provision_script) -%}
  {%- else -%}
    {{ omit }}
  {%- endif -%}

Let’s pick this one apart too. If we’ve defined a provisioning script file for the VM, include it, if we’ve defined a provisioning script file for the image (or marketplace entry), then include that instead… otherwise, pretend that there’s no “custom_data” field before you submit this to Azure.

One last quirk to Azure, is that some images require a “plan” to go with it, and others don’t.

plan: |
  {%- if item.value.image.plan is not defined -%}{{ omit }}{%- else -%}
    {'name': '{{ item.value.image.sku }}',
     'publisher': '{{ item.value.image.publisher }}',
     'product': '{{ item.value.image.offer }}'
    }
  {%- endif -%}

So, here we say “if we’ve not got a plan, omit the value being passed to Azure, otherwise use these fields we previously specified”. Weird, huh?

The very last thing we do in the script is to re-render the standard password we’ve used for all these builds, so that we can check them out!

Want to review this all in one place?

Here’s the link to the full playbook, as well as the group variables (which should be in ./group_vars/all.yml) and two sample userdata files (which should be in ./userdata) for an Ubuntu machine (using cloud-init) and one for a FortiGate Firewall.

All the other files in that gist (prefixes from 10-16 and 00) are for this blog post only, and aren’t likely to work!

If you do end up using this, please drop me a note below, or star the gist! That’d be awesome!!

Image credit: “Lego Factory Playset” from Flickr by “Brickset” released under a CC-BY license. Used with Thanks!