"Apoptosis Network (alternate)" by "Simon Cockell" on Flickr

A few weird issues in the networking on our custom AWS EKS Workers, and how we worked around them

For “reasons”, at work we run AWS Elastic Kubernetes Service (EKS) with our own custom-built workers. These workers are based on Alma Linux 9, instead of AWS’ preferred Amazon Linux 2023. We manage the deployment of these workers using AWS Auto-Scaling Groups.

Our unusal configuration of these nodes mean that we sometimes trip over configurations which are tricky to get support on from AWS (no criticism of their support team, if I was in their position, I wouldn’t want to try to provide support for a customer’s configuration that was so far outside the recommended configuration either!)

Over the past year, we’ve upgraded EKS1.23 to EKS1.27 and then on to EKS1.31, and we’ve stumbled over a few issues on the way. Here are a couple of notes on the subject, in case they help anyone else in their journey.

All three of the issues below were addressed by running an additional service on the worker nodes in a Systemd timed service which triggers every minute.

Incorrect routing for the 12th IP address onwards

Something the team found really early on (around EKS 1.18 or somewhere around there) was that the AWS VPC-CNI wasn’t managing the routing tables on the node properly. We raised an issue on the AWS VPC CNI (we were on CentOS 7 at the time) and although AWS said they’d fixed the issue, we currently need to patch the routing tables every minute on our nodes.

What happens?

When you get past the number of IP addresses that a single ENI can have (typically ~12), the AWS VPC-CNI will attach a second interface to the worker, and start adding new IP addresses to that. The VPC-CNI should setup routing for that second interface, but for some reason, in our case, it doesn’t. You can see this happens because the traffic will come in on the second ENI, eth1, but then try to exit the node on the first ENI, eth0, with a tcpdump, like this:

[root@test-i-01234567890abcdef ~]# tcpdump -i any host 192.0.2.123
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
09:38:07.331619 eth1  In  IP ip-192-168-1-100.eu-west-1.compute.internal.41856 > ip-192-0-2-123.eu-west-1.compute.internal.irdmi: Flags [S], seq 1128657991, win 64240, options [mss 1359,sackOK,TS val 2780916192 ecr 0,nop,wscale 7], length 0
09:38:07.331676 eni989c4ec4a56 Out IP ip-192-168-1-100.eu-west-1.compute.internal.41856 > ip-192-0-2-123.eu-west-1.compute.internal.irdmi: Flags [S], seq 1128657991, win 64240, options [mss 1359,sackOK,TS val 2780916192 ecr 0,nop,wscale 7], length 0
09:38:07.331696 eni989c4ec4a56 In  IP ip-192-0-2-123.eu-west-1.compute.internal.irdmi > ip-192-168-1-100.eu-west-1.compute.internal.41856: Flags [S.], seq 3367907264, ack 1128657992, win 26847, options [mss 8961,sackOK,TS val 1259768406 ecr 2780916192,nop,wscale 7], length 0
09:38:07.331702 eth0  Out IP ip-192-0-2-123.eu-west-1.compute.internal.irdmi > ip-192-168-1-100.eu-west-1.compute.internal.41856: Flags [S.], seq 3367907264, ack 1128657992, win 26847, options [mss 8961,sackOK,TS val 1259768406 ecr 2780916192,nop,wscale 7], length 0

The critical line here is the last one – it’s come in on eth1 and it’s going out of eth0. Another test here is to look at ip rule

[root@test-i-01234567890abcdef ~]# ip rule
0:	from all lookup local
512:	from all to 192.0.2.111 lookup main
512:	from all to 192.0.2.143 lookup main
512:	from all to 192.0.2.66 lookup main
512:	from all to 192.0.2.113 lookup main
512:	from all to 192.0.2.145 lookup main
512:	from all to 192.0.2.123 lookup main
512:	from all to 192.0.2.5 lookup main
512:	from all to 192.0.2.158 lookup main
512:	from all to 192.0.2.100 lookup main
512:	from all to 192.0.2.69 lookup main
512:	from all to 192.0.2.129 lookup main
1024:	from all fwmark 0x80/0x80 lookup main
1536:	from 192.0.2.123 lookup 2
32766:	from all lookup main
32767:	from all lookup default

Notice here that we have two entries from all to 192.0.2.123 lookup main and from 192.0.2.123 lookup 2. Let’s take a look at what lookup 2 gives us, in the routing table

[root@test-i-01234567890abcdef ~]# ip route show table 2
192.0.2.1 dev eth1 scope link

Fix the issue

This is pretty easy – we need to add a default route if one doesn’t already exist. Long before I got here, my boss created a script which first runs ip route show table main | grep default to get the gateway for that interface, then runs ip rule list, looks for each lookup <number> and finally runs ip route add to put the default route on that table, the same as on the main table.

ip route add default via "${GW}" dev "${INTERFACE}" table "${TABLE}"

Is this still needed?

I know when we upgraded our cluster from EKS1.23 to EKS1.27, this script was still needed. When I’ve just checked a worker running EKS1.31, after around 12 hours of running, and a second interface being up, it’s not been needed… so perhaps we can deprecate this script?

Dropping packets to the containers due to Martians

When we upgraded our cluster from EKS1.23 to EKS1.27 we also changed a lot of the infrastructure under the surface (AlmaLinux 9 from CentOS7, Containerd and Runc from Docker, CGroups v2 from CGroups v1, and so on). We also moved from using an AWS Elastic Load Balancer (ELB) or “Classic Load Balancer” to AWS Network Load Balancer (NLB).

Following the upgrade, we started seeing packets not arriving at our containers and the system logs on the node were showing a lot of martian source messages, particularly after we configured our NLB to forward original IP source addresses to the nodes.

What happens

One thing we noticed was that each time we added a new pod to the cluster, it added a new eni[0-9a-f]{11} interface, but the sysctl value for net.ipv4.conf.<interface>.rp_filter (return path filtering – basically, should we expect the traffic to be arriving at this interface for that source?) in sysctl was set to 1 or “Strict mode” where the source MUST be the coming from the best return path for the interface it arrived on. The AWS VPC-CNI is supposed to set this to 2 or “Loose mode” where the source must be reachable from any interface.

In this case you’d tell this because you’d see this in your system journal (assuming you’ve got net.ipv4.conf.all.log_martians=1 configured):

Dec 03 10:01:19 test-i-01234567890abcdef kernel: IPv4: martian source 192.168.1.100 from 192.0.2.123, on dev eth1

The net result is that packets would be dropped by the host at this point, and they’d never be received by the containers in the pods.

Fix the issue

This one is also pretty easy. We run sysctl -a and loop through any entries which match net.ipv4.conf.([^\.]+).rp_filter = (0|1) and then, if we find any, we run sysctl -w net.ipv4.conf.\1.rp_filter = 2 to set it to the correct value.

Is this still needed?

Yep, absolutely. As of our latest upgrade to EKS1.31, if this value isn’t set, then it will drop packets. VPC-CNI should be fixing this, but for some reason it doesn’t. And setting the conf.ipv4.all.rp_filter to 2 doesn’t seem to make a difference, which is contrary to the documentation in the relevant Kernel documentation.

After 12 IP addresses are assigned to a node, Kubernetes services stop working for some pods

This was pretty weird. When we upgraded to EKS1.31 on our smallest cluster we initially thought we had an issue with CoreDNS, in that it sometimes wouldn’t resolve IP addresses for services (DNS names for services inside the cluster are resolved by <servicename>.<namespace>.svc.cluster.local to an internal IP address for the cluster – in our case, in the range 172.20.0.0/16). We upgraded CoreDNS to the EKS1.31 recommended version, v1.11.3-eksbuild.2 and that seemed to fix things… until we upgraded our next largest cluster, and things REALLY went wrong, but only when we had increased to over 12 IP addresses assigned to the node.

You might see this as frequent restarts of a container, particularly if you’re reliant on another service to fulfil an init container or the liveness/readyness check.

What happens

EKS1.31 moves KubeProxy from iptables or ipvs mode to nftables – a shift we had to make internally as AlmaLinux 9 no longer supports iptables mode, and ipvs is often quite flaky, especially when you have a lot of pod movements.

With a single interface and up to 11 IP addresses assigned to that interface, everything runs fine, but the moment we move to that second interface, much like in the first case above, we start seeing those pods attached to the second+ interface being unable to resolve service addresses. On further investigation, doing a dig from a container inside that pod to the service address of the CoreDNS service 172.20.0.10 would timeout, but a dig against the actual pod address 192.0.2.53 would return a valid response.

Under the surface, on each worker, KubeProxy adds a rule to nftables to say “if you try and reach 172.20.0.10, please instead direct it to 192.0.2.53”. As the containers fluctuate inside the cluster, KubeProxy is constantly re-writing these rules. For whatever reason though, KubeProxy currently seems unable to determine that a second or subsequent interface has been added, and so these rules are not applied to the pods attached to that interface…. or at least, that’s what it looks like!

Fix the issue

In this case, we wrote a separate script which was also triggered every minute. This script looks to see if the interfaces have changed by running ip link and looking for any interfaces called eth[0-9]+ which have changed, and then if it has, it runs crictl pods (which lists all the running pods in Containerd), looks for the Pod ID of KubeProxy, and then runs crictl stopp <podID> [1] and crictl rmp <podID> [1] to stop and remove the pod, forcing kubelet to restart the KubeProxy on the node.

[1] Yes, they aren’t typos, stopp means “stop the pod” and rmp means “remove the pod”, and these are different to stop and rm which relate to the container.

Is this still needed?

As this was what I was working on all-day yesterday, yep, I’d say so šŸ˜Š – in all seriousness though, if this hadn’t been a high-priority issue on the cluster, I might have tried to upgrade the AWS VPC-CNI and KubeProxy add-ons to a later version, to see if the issue was resolved, but at this time, we haven’t done that, so maybe I’ll issue a retraction later šŸ˜‚

Featured image is ā€œApoptosis Network (alternate)ā€ by ā€œSimon Cockellā€ on Flickr and is released under a CC-BY license.

Talk Summary – OggCamp ’24 – Kubernetes, A Guide for Docker users

Format: Theatre Style room. ~30 attendees.

Slides: Available to view (Firefox/Chrome recommended ā€“ press ā€œSā€ to see the required speaker notes)

Video: Not recorded. I’ll try to record it later, if I get a chance.

Slot: Graphine 1, 13:30-14:00

Notes: Apologies for the delay on posting this summary. The talk was delivered to a very busy room. Lots of amazing questions. The presenter notes were extensive, but entirely unused when delivered. One person asked a question, I said I’d follow up with them later, but didn’t find them before the end of the conference. One person asked about the benefits of EKS over ECS in AWS… as I’ve not used ECS, I couldn’t answer, but it sounds like they largely do the same thing.

Two pages from an old notebook with slightly yellowing paper, and black ink cursive writing and occasional doodles filling the pages

This little #bash script will make capturing #output from lots of #scripts a lot easier

A while ago, I was asked to capture a LOT of data for a support case, where they wanted lots of commands to be run, like “kubectl get namespace” and then for each namespace, get all the pods with “kubectl get pods -n $namespace” and then describe each pod with “kubectl get pod -n namespace $podname”. Then do the same with all the services, deployments, ingresses and endpoints.

I wrote this function, and a supporting script to execute the actual checks, and just found it while clearing up!

#!/bin/bash

filename="$(echo $* | sed -E -e 's~[ -/\\]~_~g').log"
echo "\$ $@" | tee "${filename}"
$@ 2>&1 | tee -a "${filename}"

This script is quite simple, it does three things

  1. Take the command you’re about to run, strip all the non-acceptable-filename characters out and replace them with underscores, and turn that into the output filename.
  2. Write the command into the output file, replacing any prior versions of that file
  3. Execute the command, and append the log to the output file.

So, how do you use this? Simple

log_result my-command --with --all --the options

This will produce a file called my-command_--with_--all_--the_options.log that contains this content:

$ my-command --with --all --the options
Congratulations, you ran my-command and turned on the options "--with --all --the options". Nice one!

… oh, and the command I ran to capture the data for the support case?

log_result kubectl get namespace
for TYPE in pod ingress service deployment endpoints
do
  for ns in $(kubectl get namespace | grep -v NAME | awk '{print $1}' )
  do
    echo $ns
    for item in $(kubectl get $TYPE -n $ns | grep -v NAME | awk '{print $1}')
    do
      log_result kubectl get $TYPE -n $ns $item -o yaml
      log_result kubectl describe $TYPE -n $ns $item
    done
  done
done

Featured image is ā€œTravel log textureā€ by ā€œMary Vicanā€ on Flickr and is released under a CC-BY license.

A photo of a conch shell in front of a blurry photo frame.

Why (and how) I’ve started writing my Shell Scripts in Python

I’ve been using Desktop Linux for probably 15 years, and Server Linux for more like 25 in one form or another. One of the things you learn to write pretty early on in Linux System Administration is Bash Scripting. Here’s a great example

#!/bin/bash

i = 0
until [ $i -eq 10 ]
do
  print "Jon is the best!"
  (( i += 1 ))
done

Bash scripts are pretty easy to come up with, you just write the things you’d type into the interactive shell, and it does those same things for you! Yep, it’s pretty hard not to love Bash for a shell script. Oh, and it’s portable too! You can write the same Bash script for one flavour of Linux (like Ubuntu), and it’s probably going to work on another flavour of Linux (like RedHat Enterprise Linux, or Arch, or OpenWRT).

But. There comes a point where a Bash script needs to be more than just a few commands strung together.

At work, I started writing a “simple” installer for a Kubernetes cluster – it provisions the cloud components with Terraform, and then once they’re done, it then starts talking to the Kubernetes API (all using the same CLI tools I use day-to-day) to install other components and services.

When the basic stuff works, it’s great. When it doesn’t work, it’s a bit of a nightmare, so I wrote some functions to put logs in a common directory, and another function to gracefully stop the script running when something fails, and then write those log files out to the screen, so I know what went wrong. And then I gave it to a colleague, and he ran it, and things broke in a way that didn’t make sense for either of us, so I wrote some more functions to trap that type of error, and try to recover from them.

And each time, the way I tested where it was working (or not working) was to just… run the shell script, and see what it told me. There had to be a better way.

Enter Python

Python earns my vote for a couple of reasons (and they might not be right for you!)

  • I’ve been aware of the language for some time, and in fact, had patched a few code libraries in the past to use Ansible features I wanted.
  • My preferred IDE (Integrated Desktop Environment), Visual Studio Code, has a step-by-step debugger I can use to work out what’s going on during my programming
  • It’s still portable! In fact, if anything, it’s probably more portable than Bash, because the version of Bash on the Mac operating system – OS X is really old, so lots of “modern” features I’d expect to be in bash and associate tooling isn’t there! Python is Python everywhere.
  • There’s an argument parsing tool built into the core library, so if I want to handle things like ./myscript.py --some-long-feature "option-A" --some-long-feature "option-B" -a -s -h -o -r -t --argument I can do, without having to remember how to write that in Bash (which is a bit esoteric!)
  • And lastly, for now at least!, is that Python allows you to raise errors that can be surfaced up to other parts of your program

Given all this, my personal preference is to write my shell scripts now in Python.

If you’ve not written python before, variables are written without any prefix (like you might have seen $ in PHP) and any flow control (like if, while, for, until) as well as any functions and classes use white-space indentation to show where that block finishes, like this:

def do_something():
  pass

if some_variable == 1:
  do_something()
  and_something_else()
  while some_variable < 2:
    some_variable = some_variable * 2

Starting with Boilerplate

I start from a “standard” script I use. This has a lot of those functions I wrote previously for bash, but with cleaner code, and in a way that’s a bit more understandable. I’ll break down the pieces I use regularly.

Starting the script up

Here’s the first bit of code I always write, this goes at the top of everything

#!/usr/bin/env python3
import logging
logger = logging

This makes sure this code is portable, but is always using Python3 and not Python2. It also starts to logging engine.

At the bottom I create a block which the “main” code will go into, and then run it.

def main():
  logger.basicConfig(level=logging.DEBUG)
  logger.debug('Started main')

if __name__ == "__main__":
    main()

Adding argument parsing

There’s a standard library which takes command line arguments and uses them in your script, it’s called argparse and it looks like this:

#!/usr/bin/env python3
# It's convention to put all the imports at the top of your files
import argparse
import logging
logger = logging

def process_args():
  parser=argparse.ArgumentParser(
    description="A script to say hello world"
  )

  parser.add_argument(
    '--verbose', # The stored variable can be found by getting args.verbose
    '-v',
    action="store_true",
    help="Be more verbose in logging [default: off]"
  )

  parser.add_argument(
    'who', # This is a non-optional, positional argument called args.who
    help="The target of this script"
  )
  args = parser.parse_args()

  if args.verbose:
      logger.basicConfig(level=logging.DEBUG)
      logger.debug('Setting verbose mode on')
  else:
      logger.basicConfig(level=logging.INFO)

  return args

def main():
  args=process_args()

  print(f'Hello {args.who}')
  # Using f'' means you can include variables in the string
  # You could instead do printf('Hello %s', args.who)
  # but I always struggle to remember in what order I wrote things!

if __name__ == "__main__":
    main()

The order you put things in makes a lot of difference. You need to have the if __name__ == "__main__": line after you’ve defined everything else, but then you can put the def main(): wherever you want in that file (as long as it’s before the if __name__). But by having everything in one file, it feels more like those bash scripts I was talking about before. You can have imports (a bit like calling out to other shell scripts) and use those functions and classes in your code, but for the “simple” shell scripts, this makes most sense.

So what else do we do in Shell scripts?

Running commands

This is class in it’s own right. You can pass a class around in a variable, but it has functions and properties of it’s own. It’s a bit chunky, but it handles one of the biggest issues I have with bash scripts – capturing both the “normal” output (stdout) and the “error” output (stderr) without needing to put that into an external file you can read later to work out what you saw, as well as storing the return, exit or error code.

# Add these extra imports
import os
import subprocess

class RunCommand:
    command = ''
    cwd = ''
    running_env = {}
    stdout = []
    stderr = []
    exit_code = 999

    def __init__(
      self,
      command: list = [], 
      cwd: str = None,
      env: dict = None,
      raise_on_error: bool = True
    ):
        self.command = command
        self.cwd = cwd
        
        self.running_env = os.environ.copy()

        if env is not None and len(env) > 0:
            for env_item in env.keys():
                self.running_env[env_item] = env[env_item]

        logger.debug(f'exec: {" ".join(command)}')

        try:
            result = subprocess.run(
                command,
                cwd=cwd,
                capture_output=True,
                text=True,
                check=True,
                env=self.running_env
            )
            # Store the result because it worked just fine!
            self.exit_code = 0
            self.stdout = result.stdout.splitlines()
            self.stderr = result.stderr.splitlines()
        except subprocess.CalledProcessError as e:
            # Or store the result from the exception(!)
            self.exit_code = e.returncode
            self.stdout = e.stdout.splitlines()
            self.stderr = e.stderr.splitlines()

        # If verbose mode is on, output the results and errors from the command execution
        if len(self.stdout) > 0:
            logger.debug(f'stdout: {self.list_to_newline_string(self.stdout)}')
        if len(self.stderr) > 0:
            logger.debug(f'stderr: {self.list_to_newline_string(self.stderr)}')

        # If it failed and we want to raise an exception on failure, record the command and args
        # then Raise Away!
        if raise_on_error and self.exit_code > 0:
            command_string = None
            args = []
            for element in command:
                if not command_string:
                    command_string = element
                else:
                    args.append(element)

            raise Exception(
                f'Error ({self.exit_code}) running command {command_string} with arguments {args}\nstderr: {self.stderr}\nstdout: {self.stdout}')

    def __repr__(self) -> str: # Return a string representation of this class
        return "\n".join(
            [
               f"Command: {self.command}",
               f"Directory: {self.cwd if not None else '{current directory}'}",
               f"Env: {self.running_env}",
               f"Exit Code: {self.exit_code}",
               f"nstdout: {self.stdout}",
               f"stderr: {self.stderr}" 
            ]
        )

    def list_to_newline_string(self, list_of_messages: list):
        return "\n".join(list_of_messages)

So, how do we use this?

Well… you can do this: prog = RunCommand(['ls', '/tmp', '-l']) with which we’ll get back the prog object. If you literally then do print(prog) it will print the result of the __repr__() function:

Command: ['ls', '/tmp', '-l']
Directory: current directory
Env: <... a collection of things from your environment ...>
Exit Code: 0
stdout: total 1
drwx------ 1 root  root  0 Jan 1 01:01 somedir
stderr:

But you can also do things like:

for line in prog.stdout:
  print(line)

or:

try:
  prog = RunCommand(['false'], raise_on_error=True)
catch Exception as e:
  logger.error(e)
  exit(e.exit_code)

Putting it together

So, I wrote all this up into a git repo, that you’re more than welcome to take your own inspiration from! It’s licenced under an exceptional permissive license, so you can take it and use it without credit, but if you want to credit me in some way, feel free to point to this blog post, or the git repo, which would be lovely of you.

Github: JonTheNiceGuy/python_shell_script_template

Featured image is ā€œThe Conchā€ by ā€œKurtis Garbuttā€ on Flickr and is released under a CC-BY license.

A scuffed painting on what appears to be a bin. The painting is of an orangutan holding up a sign saying "Don't Panic".

Mounting a damaged #ZFS Pool disk to recover data

TL;DR? zpool import -d /dev/sdb1 -o readonly=on -R /recovery/poolname poolname

I have a pair of Proxmox servers, each with a single ZFS drive attached, with GlusterFS over the top to provide storage to the VMs.

Last week I had a power outage which took both nodes offline. When the power came back on, one node’s system drive had failed entirely and during recovery the second machine refused to restart some of the VMs.

Rather than try to fix things properly, I decided to “Nuke-and-Pave”, a decision I’m now regretting a little!

I re-installed one of the nodes OK, set up the new ZFS drive, set up Gluster and then started transferring the content from the old machine to the new one.

During the file transfer, I saw a couple of messages about failed blocks, and finally got a message from the cluster about how the pool was considered degraded, but as this was largely performed while I was asleep, I didn’t notice until I woke up… when the new node was offline.

I connected a Keyboard and Monitor to the box and saw a kernel panic. I rebooted the node, and during the boot sequence, just after the Systemd service that scanned the ZFS pool, it panicked again.

Unplugging the data drive from the machine and rebooting it, the node came up just fine.

I plugged the drive into my laptop and ran zpool import -d /dev/sdb1 -R /recovery/poolname poolname and my laptop crashed (although, I was running this in GUI mode, so I don’t know if it was a kernel panic or “just” a crash.)

Finally, I ran zpool import -d /dev/sdb1 -o read-only=on -R /recovery/poolname poolname and the drive came up in /recovery/poolname, so I could transfer files off to another drive until I figure out what’s going on!

Once I was done, I ran zfs unmount poolname and was able to detach the disk from the device.

Featured image is ā€œdon’t panic orangutanā€ by ā€œEsperluetteā€ on Flickr and is released under a CC-BY license.

A colour photograph of a series of cogs and gears interlinked to create a machine

Making .bashrc more manageable

How many times have you seen an instruction in a setup script which says “Now add source <(somescript completion bash) to your ~/.bashrc file” or “Add export SOMEVAR=abc123 to your .bashrc file”?

This is great when it’s one or two lines, but for a big chunk of them? Whew!

Instead, I created this block in mine:

if [ -d ~/.bash_extensions.d ]; then
    for extension in ~/.bash_extensions.d/[a-zA-Z0-9]*
    do
        . "$extension"
    done
fi

This dynamically loads all the files in ~/.bash_extensions.d/ which start with a letter or a digit, so it means I can manage when things get loaded in, or removed from my bash shell.

For example, I recently installed the pre-release of Atuin, so my ~/.bash_extensions.d/atuin file looks like this:

source $HOME/.atuin/bin/env
eval "$(atuin init bash --disable-up-arrow)"

And when I installed direnv, I created ~/.bash_extensions.d/direnv which has this in it:

eval "$(direnv hook bash)"

This is dead simple, and now I know that if I stop using direnv, I just need to remove that file, rather than hunting for a line in .bashrc.

Featured image is ā€œGears gears cogs bits n piecesā€ by ā€œLes Chatfieldā€ on Flickr and is released under a CC-BY license.

Replacing Be-Fibre’s router with an OpenWRT-based router

I’d seen quite a few comments from the community talking about using your own router, and yesterday I finally got around to sorting this out for myself!

I obtained a Turris Omnia CZ11NIC13 router last year to replace the Be-Fibre supplied Adtrans 854-v6. Both Adtrans and Omnia use an SFP to terminate the fibre connection on-box, but if you have an ONT then the Omnia will activate the WAN Ethernet socket if the SFP isn’t filled.

The version I was using was a little on the older side than the currently listed models, and the version of OpenWRT that was shipped on it kept breaking when trying to perform updates and upgrades. I followed the instructions on the OpenWRT Wiki to replace the custom firmware with the more generic OpenWRT for this model (upgrading U-Boot and then flashing the updated image using dd in the rescue environment).

Initial stumbles

I’d said I’d got this last year, and when I’d tried to migrate it over, it wasn’t clear that the SFP had been detected… On this router eth2 is either the WAN Ethernet interface or the SFP, and the only way you can tell is to run dmesg | grep ether which shows configuring for inband/1000base-x link mode (it’s on the SFP) or configuring for fixed/rgmii link mode (it’s on the ethernet port). Once you realise that, it all becomes a bit easier!

Getting ready

The main thing to realise about this process is that Be-Fibre only really care about two things; the MAC address of the router (the SFP doesn’t have a MAC address) and the VLAN it’s connected to – which is VLAN 10.

So, let’s set that up.

Using the Web interface

The first thing we need to do is to change the MAC address of “eth2” – the WAN/SFP port. Navigate to to “Network” then “Interfaces” and select the “Devices” tab.

The devices page of the OpenWRT web interface before any changes have been made to it.

Click “Configure…” next to eth2 to change the MAC address. Here I’ve changed it to “DE:CA:FB:AD:99:99” which is the value I got from my Be-Fibre router.

Once you click “Save” you go back to the devices page, and you can see here that the changed MAC address is now in bold.

Next we need to setup the VLAN tag, so click on “Add device configuration…”

Once you click “Save”, you’ll be taken back to the devices page.

Click on “Save & Apply”, and then go into the “Interfaces” tab.

You may see that an IP address has been allocated to the wan interface… Don’t believe it! You still need to select the VLAN Tagged interface for it to work! Click “Edit” for the wan interface.

Change device to eth2.10 and click Save. Do the same for the wan6 interface.

Click Save & Apply and then navigate to the “System” -> “Reboot” screen and reboot the device to get the network changes recognised.

Using the command line

SSH to the OpenWRT device, and then run the following (replace MAC address as needed!)

eth2=$(uci show network.@device[1] | grep =device | cut -d= -f1 | cut -d. -f2)
uci set network.${eth2}.macaddr='DE:CA:FB:AD:99:99'
uci commit network

if ! ip link show dev eth2.20
then
  uci add network device
  uci set network.@device[-1].type='8021q'
  uci set network.@device[-1].ifname='eth2'
  uci set network.@device[-1].vid='10'
  uci set network.@device[-1].name='eth2.10'
  uci commit network
fi

uci set network.wan.device='eth2.10'
uci set network.wan6.device='eth2.10'
uci commit network
reload_config

A note to myself; resetting error status on proxmox HA workloads after a crash

I’ve had a couple of issues with brown-outs recently which have interrupted my Proxmox server, and stopped my connected disks from coming back up cleanly (yes, I’m working on that separately!) but it’s left me in a state where several of my containers and virtual machines on the cluster are down.

It’s possible to point-and-click your way around this, but far easier to script it!

A failed state may look like this:

root@proxmox1:~# ha-manager status
quorum OK
master proxmox2 (active, Fri Mar 22 10:40:49 2024)
lrm proxmox1 (active, Fri Mar 22 10:40:52 2024)
lrm proxmox2 (active, Fri Mar 22 10:40:54 2024)
service ct:101 (proxmox1, error)
service ct:102 (proxmox2, error)
service ct:103 (proxmox2, error)
service ct:104 (proxmox1, error)
service ct:105 (proxmox1, error)
service ct:106 (proxmox2, error)
service ct:107 (proxmox2, error)
service ct:108 (proxmox1, error)
service ct:109 (proxmox2, error)
service vm:100 (proxmox2, error)

Once you’ve fixed your issue, you can do this on each node:

for worker in $(ha-manager status | grep "($(hostnamectl hostname), error)" | cut -d\  -f2)
do
  echo "Disabling $worker"
  ha-manager set $worker --state disabled
  until ha-manager status | grep "$worker" | grep -q disabled ; do sleep 1 ; done
  echo "Restarting $worker"
  ha-manager set $worker --state started
  until ha-manager status | grep "$worker" | grep -q started ; do sleep 1 ; done
done

Note that this hasn’t been tested, but a scan over it with those nodes working suggests it should. I guess I’ll be updating this the next time I get a brown-out!

A padlock and chain on a rusted gate

Using #NetworkFirewall and #Route53 #DNS #Firewall to protect a private subnet’s egress traffic in #AWS

I wrote this post in January 2023, and it’s been languishing in my Drafts folder since then. I’ve had a look through it, and I can’t see any glaring reasons why I didn’t publish it so… it’s published… Enjoy šŸ˜

If you’ve ever built a private subnet in AWS, you know it can be a bit tricky to get updates from the Internet – you end up having a NAT gateway or a self-managed proxy, and you can never be 100% certain that the egress traffic isn’t going somewhere you don’t want it to.

In this case, I wanted to ensure that outbound HTTPS traffic was being blocked if the SNI didn’t explicitly show the DNS name I wanted to permit through, and also, I only wanted specific DNS names to resolve. To do this, I used AWS Network Firewall and Route 53 DNS Firewall.

I’ve written this blog post, and followed along with this, I’ve created a set of terraform files to represent the steps I’ve taken.

The Setup

Let’s start this story from a simple VPC with three private subnets for my compute resources, and three private subnets for the VPC Endpoints for Systems Manager (SSM).

Here’s our network diagram, with the three subnets containing the VPC Endpoints at the top, and the three instances at the bottom.

I’ve created a tag in my Github repo at this “pre-changes” state, called step 1.

At this point, none of those instances can reach anything outside the network, with the exception of the SSM environment. So, we can’t install any packages, we can’t get data from outside the network or anything similar.

Getting Protected Internet Access

In order to get internet access, we need to add 4 things;

  1. An internet gateway
  2. A NAT gateway in each AZ
  3. Which needs three new subnets
  4. And three Elastic IP addresses
  5. Route tables in all the subnets

To clarify, a NAT gateway acts like a DSL router. It hides the source IP address of outbound traffic behind a single, public IP address (using an Elastic IP from AWS), and routes any return traffic back to wherever that traffic came from. To reduce inter-AZ data transfer rates, I’m putting one in each AZ, but if there’s not a lot of outbound traffic or the outbound traffic isn’t critical enough to require resiliency, this could all be centralised to a single NAT gateway. To put a NAT gateway in each AZ, you need a subnet in each AZ, and to get out to the internet (by whatever means you have), you need an internet gateway and route tables for how to reach the NAT and internet gateways.

We also should probably add, at this point, four additional things.

  1. The Network Firewall
  2. Subnets for the Firewall interfaces
  3. Stateless Policy
  4. Stateful Policy

The Network Firewall acts like a single appliance, and uses a Gateway Load Balancer to present an interface into each of the availability zones. It has a stateless policy (which is very fast, but needs to address both inbound and outbound traffic flows) to do IP and Port based filtering (referred to as “Layer 3” filtering) and then specific traffic can be passed into a stateful policy (which is slower) to do packet and flow inspection.

In this case, I only want outbound HTTPS traffic to be passed, so my stateless rule group is quite simple;

  • VPC range on any port ā†’ Internet on TCP/443; pass to Stateful rule groups
  • Internet on TCP/443 ā†’ VPC range on any port; pass to Stateful rule groups

I have two stateful rule groups, one is defined to just allow access out to example.com and any relevant subdomains, using the “Domain List” stateful policy item. The other allows access to example.org and any relevant subdomains, using a Suricata stateful policy item, to show the more flexible alternative route. (Suricata has lots more filters than just the SNI value, you can check for specific SSH versions, Kerberos CNAMEs, SNMP versions, etc. You can also add per-rule logging this way, which you can’t with the Domain List route).

These are added to the firewall policy, which also defines that if a rule doesn’t match a stateless rule group, or an established flow doesn’t match a stateful rule group, then it should be dropped.

New network diagram with more subnets and objects, but essentially, as described in the paragraphs above. Traffic flows from the instances either down towards the internet, or up towards the VPCe.

I’ve created a tag in my Github repo at this state, with the firewall, NAT Gateway and Internet Gateway, called step 2.

So far, so good… but why let our users even try to resolve the DNS name of a host they’re not permitted to reach. Let’s turn on DNS Firewalling too.

Turning on Route 53 DNS Firewall

You’ll notice that in the AWS Network Firewall, I didn’t let DNS out of the network. This is because, by default, AWS enables Route 53 as it’s local resolver. This lives on the “.2” address of the VPC, so in my example environment, this would be 198.18.0.2. Because it’s a local resolver, it won’t cross the Firewall exiting to the internet. You can also make Route 53 use your own DNS servers for specific DNS resolution (for example, if you’re running an Active Directory service inside your network).

Any Network Security Response team members you have working with you would appreciate it if you’d turn on DNS Logging at this point, so I’ll do it too!

In March 2021, AWS announced “Route 53 DNS Firewall”, which allow this DNS resolver to rewrite responses, or even to completely deny the existence of a DNS record. With this in mind, I’m going to add some custom DNS rules.

The first thing I want to do is to only permit traffic to my specific list of DNS names – example.org, example.com and their subdomains. DNS quite likes to terminate DNS names with a dot, signifying it shouldn’t try to resolve any higher up the chain, so I’m going to make a “permitted domains” DNS list;

example.com.
example.org.
*.example.com.
*.example.org.

Nice and simple! Except, this also stops me from being able to access the instances over SSM, so I’ll create a separate “VPCe” DNS list:

ssm.ex-ample-1.amazonaws.com.
*.ssm.ex-ample-1.amazonaws.com.
ssmmessages.ex-ample-1.amazonaws.com.
*.ssmmessages.ex-ample-1.amazonaws.com.
ec2messages.ex-ample-1.amazonaws.com.
*.ec2messages.ex-ample-1.amazonaws.com.

Next I create a “default deny” DNS list:

*.

And then build a DNS Firewall Policy which allows access to the “permitted domains”, “VPCe” lists, but blocks resolution of any “default deny” entries.

I’ve created a tag in my Github repo at this state, with the Route 53 DNS Firewall configured, called step 3.

In conclusion…

So there we have it. While the network is not “secure” (there’s still a few gaps here) it’s certainly MUCH more secure than it was, and it certainly would take a lot more work for anyone with malicious intent to get your content out.

Feel free to have a poke around, and leave comments below if this has helped or is of interest!

"Killer travel plug and socket board" by "Ashley Basil" on Flickr

Testing and Developing WordPress Plugins using Vagrant to provide the test environment

I keep trundling back to a collection of WordPress plugins that I really love. And sometimes I want to contribute patches to the plugin.

I don’t want to develop against this server (that would be crazy… huh… right… no one does that… *cough*) but instead, I want a nice, fresh and new WordPress instance to just check that it works the way I was expecting.

So, I created a little Vagrant environment, just for testing WordPress plugins. I clone the repository for the plugin, and create a “TestingEnvironment” directory in there.

I then create the following Vagrantfile.

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/jammy64"
  # This will create an IP address in the range 192.168.64.0/24 (usually)
  config.vm.network "private_network", type: "dhcp"
  # This loads the git repo for the plugin into /tmp/git_repo
  config.vm.synced_folder "../", "/tmp/git_repo"

  # If you've got vagrant-cachier, this will speed up apt update/install operations
  if Vagrant.has_plugin?("vagrant-cachier")
    config.cache.scope = :box
  end

  config.vm.provision "shell", inline: <<-SHELL

    # Install Dependencies
    apt-get update
    apt-get install -y apache2 libapache2-mod-fcgid php-fpm mysql-server php-mysql git

    # Set up Apache
    a2enmod proxy_fcgi setenvif
    a2enconf "$(basename "$(ls /etc/apache2/conf-available/php*)" .conf)"
    systemctl restart apache2
    rm -f /var/www/html/index.html

    # Set up WordPress
    bash /vagrant/root_install_wordpress.sh
  SHELL
end

Next, let’s create that root_install_wordpress.sh file.

#! /bin/bash

# Allow us to run commands as www-data
chsh -s /bin/bash www-data
# Let www-data access files in the web-root.
chown -R www-data:www-data /var/www

# Install wp-cli system-wide
curl -s -S -O https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar
mv wp-cli.phar /usr/local/bin/wp
chmod +x /usr/local/bin/wp

# Slightly based on 
# https://www.a2hosting.co.uk/kb/developer-corner/mysql/managing-mysql-databases-and-users-from-the-command-line
echo "CREATE DATABASE wp;" | mysql -u root
echo "CREATE USER 'wp'@'localhost' IDENTIFIED BY 'wp';" | mysql -u root
echo "GRANT ALL PRIVILEGES ON wp.* TO 'wp'@'localhost';" | mysql -u root
echo "FLUSH PRIVILEGES;" | mysql -u root

# Execute the generic install script
su - www-data -c bash -c /vagrant/user_install_wordpress.sh
# Install any plugins with this script
su - www-data -c bash -c /vagrant/customise_wordpress.sh
# Log the path to access
echo "URL: http://$(sh /vagrant/get_ip.sh) User: admin Password: password"

Now we have our dependencies installed and our database created, let’s get WordPress installed with user_install_wordpress.sh.

#! /bin/bash

# Largely based on https://d9.hosting/blog/wp-cli-install-wordpress-from-the-command-line/
cd /var/www/html
# Install the latest WP into this directory
wp core download --locale=en_GB
# Configure the database with the credentials set up in root_install_wordpress.sh
wp config create --dbname=wp --dbuser=wp --dbpass=wp --locale=en_GB
# Skip the first-run-wizard
wp core install --url="http://$(sh /vagrant/get_ip.sh)" --title=Test --admin_user=admin --admin_password=password --admin_email=example@example.com --skip-email
# Setup basic permalinks
wp option update permalink_structure ""
# Flush the rewrite schema based on the permalink structure
wp rewrite structure ""

Excellent. This gives us a working WordPress environment. Now we need to add our customisation – the plugin we’re deploying. In this case, I’ve been tweaking the “presenter” plugin so here’s the customise_wordpress.sh code:

#! /bin/bash

cd /var/www/html/wp-content/plugins
git clone /tmp/git_repo presenter --recurse-submodules
wp plugin activate presenter

Actually, that /tmp/git_repo path is a call-back to this line in the Vagrantfile: config.vm.synced_folder "../", "/tmp/git_repo".

And there you have it; a vanilla WordPress install, with the plugin installed and ready to test. It only took 4 years to write up a blog post for it!

As an alternative, you could instead put the plugin you’re working with in a subdirectory of the Vagrantfile and supporting files, then you’d just need to change that git clone /tmp/git_repo line to git clone /vagrant/MyPlugin – but then you can’t offer this to the plugin repo as a PR, can you? šŸ˜€

Featured image is ā€œKiller travel plug and socket boardā€ by ā€œAshley Basilā€ on Flickr and is released under a CC-BY license.