I've happily made a career out of writing code to do interesting things with server, network, and application infrastructure, and in that time I've found the signal-to-noise ratio of the common post/tweet/gist has been wildly lopsided in favor of people throwing often-incomplete or inaccurate scripts into the wind without actually explaining how they were created. I'm not sure if what I'm about to post helps or hurts that ratio, but as I worked through a problem the other day in my lab, I got the bright and/or dumb idea that it might make an interesting post to explain the process of going from "I should do this thing" to "perfect, that's all wrapped up now" in detail, including how long each step takes and my thought process as I decide how to Do The Thing™.
So anyways, here's how I wrote some Ansible code to patch my servers.
The Problem
If for some strange reason you're a repeat reader of my posts rather than having stumbled across this stuff by accident, you may recall that a couple of months ago I had a totally-not-self-inflicted meltdown in the lab, in which I had to rebuild essentially all of my critical systems. The pain of that event aside, it gave me that rare opportunity to start fresh; I got (was forced) to take around a decade of technical debt and backlog and dump it into the nether where it simply ceased to exist.
One of the biggest takeaways I had from that was that I needed to overhaul the practices I adhered to for the super boring care and feeding of the lab and the things running in it. And one of the biggest changes I made was that I started actually patching my hosts regularly, where I previously had not been doing so at all, or had been doing so infrequently and only when something I was trying to do required it. Coming out of that event I rebuilt most of my useful tools, and the result was around 20 or so VMs. Not nothing, but certainly not a huge amount to patch either, which was great because I had been doing so by hand. This was okay in the immediate aftermath of my environment self-destructing, but I knew from the start it wouldn't scale and I'd need to write some automation eventually if I wanted to be able to stick with it as a principle. So, five months later, I sat down to do that.
What Am I Doing?
Total time: ~30 minutes
As always, the first step in building a thing is to put some meaningful definition to what the thing actually is. It really is true that the best way to waste your time in this profession is to just start writing code without thinking about what you want to build. In my case it's pretty straightforward so I open Notejoy and jot down a couple of requirements.
- Gracefully shut down all services, with validations
- Patch the host OS on all hosts
- Bring services back online and validate them
There's some nuance here that I take a few minutes to define too.
- What does "gracefully" mean?
- How do I account for different OSes?
- How do I account for services running in different ways (systemd vs direct invocation, etc.)?
To me, graceful shutdown and restart means that automation can take a working, functional, healthy service and shut it down, do something, and then bring it back up to a functional, healthy state again. The 80/20 rule is always a sensible starting point, so I decide that the first iteration will only cover the most common stuff: services running via systemd on Linux hosts. With that in mind, I can write out a simple problem statement:
I need automation to shut down healthy services on my hosts, patch and restart the hosts, and then bring the services back online.
How am I Doing It?
Total time: ~15 minutes
The time and energy required to figure this out are significantly less than they would be for a brand-new idea or solution, because of a few key factors:
- I have a smallish environment
- I have limited the scope to only a subset of things, all of which operate in exactly the same way
- I have predefined, standard tools of choice for automation in Ansible and Jenkins; I don't have to re-invent those parts of the pipeline or solve those problems.
The key aspects of the question "how do I do automation in my environment?" are already answered, so the design phase becomes an exercise in searching for the common ways Ansible handles this stuff, and for whatever a grab bag of solutions on Stack Overflow, GitHub, and YouTube seems to form a consensus around. The reason this works, and is safe, is the standardization and consensus. You never want to grab code off the internet and run it sight-unseen, but when multiple independent people find the same or similar solutions to a relatively simple problem (like patching a host), it's usually a safe bet that it works and can be made sane and sensible for your situation, occasionally with the application of a little extra crowbar.
Some quick searching reveals that Ansible has a tool for this, and it's pretty idiot-proof:
- name: Stop a service
  ansible.builtin.systemd:
    state: stopped
    name: httpd

- name: Start a service
  ansible.builtin.systemd:
    state: started
    name: httpd
That's about as simple as it gets.
Additionally, I know how to run Ansible playbooks in a Jenkins stage already using the Ansible plugin, because I've done that hundreds of times:
ansiblePlaybook(credentialsId: '---', inventory: 'hosts', playbook: 'my_playbook.yaml')
So a possible pattern is emerging in my brain:
- Put a list of hosts to patch in the hosts file
- Write a playbook which goes to each host, shuts down some services that are specified, patches, and restarts the host
- Write a Jenkins pipeline to run the playbook
An additional problem is that not all hosts run the same services. I'd like to have one playbook, so maybe I just need to dump a bunch of tasks into the playbook that each have a clause checking inventory_hostname to see if they want to run. For example, I could do something like this:
- name: Stop Gitea
  ansible.builtin.service:
    name: "{{ item }}"
    state: stopped
  when:
    - inventory_hostname == "gitea"
  with_items:
    - gitea
Finally, this design means that I'll have to allow for exceptions to the process - for example, if I'm running the playbook through a Jenkins pipeline, I probably shouldn't shut down the host Jenkins itself is running on. For now, I assume I can just handle this with the -l (limit) functionality in Ansible, or leave that host out of the hosts file entirely.
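For example, excluding a hypothetical jenkins host from a run looks something like this (the ! prefix in a limit pattern subtracts matching hosts):
ansible-playbook my_playbook.yaml -i hosts -l 'all:!jenkins'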
Building A Mousetrap
Total Time: ~2 hours
With a problem statement and basic design in hand/head, I can start writing code.
The first step is always the boring stuff: create a repository, run git init, create a readme. I had some scripts somewhere that made a lot of this easy but I haven't set them up on my new computer yet, so for now I'm doing it by hand.
Once that stuff is done, I can create a couple of playbooks:
├── playbooks/
│ ├── patching-services-shutdown.yaml
│ ├── patching-services-startup.yaml
│ └── patching-host-os.yaml
└── pipelines
The first version of these playbooks takes around an hour to create. That time is split out into a couple of different things:
- Scaffolding out a bunch of tasks in the playbooks
- Fighting with Ansible inventory_hostname to figure out why it isn't properly evaluating to the correct result
- Reconciling the services on each host with the list of things to manage in the playbooks (by running systemctl --type=service on each host, as sketched below) and scanning the results.
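The ad-hoc version of that reconnaissance is a single command against the whole inventory, rather than logging into every host (the exact systemctl flags are a matter of taste):
ansible all -i hosts -m ansible.builtin.command -a "systemctl list-units --type=service --state=running"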
The patching-services-shutdown.yaml playbook stops services in the correct order. There's some nuance here to make sure, for example, that the postgresql service isn't shut down until all of the services which use it have been shut down. Conversely, the startup playbook does the opposite - it starts the services in the order that makes the most sense (so, for example, postgresql first). I also add a third playbook, which checks that a host is reachable and is running a supported OS (in this case, CentOS) before patching it and rebooting it.
- hosts: all
  gather_facts: yes
  become: true
  collections:
  vars:
  tasks:
    - name: Check reachability of the host
      ansible.builtin.ping:
    - name: Validate that the host is running CentOS
      assert:
        that: "'{{ ansible_distribution }}' == 'CentOS'"
    - name: Upgrade packages
      ansible.builtin.yum:
        name: '*'
        state: latest
    - name: Reboot host
      ansible.builtin.reboot:
The last step is to combine it all into a Jenkins pipeline. In the example below, I'm running it against a hard-coded set of "tier-4" hosts, which are my testing/dev hosts. If something breaks these or they fail to come back online after a reboot, they're unimportant and can be disposed of and recreated without disrupting anything else. As an aside, most of the interesting services in my environment are configured to start automatically with the host, so the majority of the patching-services-startup.yaml playbook isn't actually needed since the step immediately preceding it is a reboot. But, it's there just in case something fails to start.
pipeline {
agent any
environment {
INVENTORY = "playbooks/inventory.yaml"
}
stages {
stage('Turn down services') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/patching-services-shutdown.yaml",
limit: "tier-4",
)
}
}
stage('Update all hosts') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/patching-host-os.yaml",
limit: "tier-4"
)
}
}
stage('Turn up services') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/patching-services-startup.yaml.yaml",
limit: "tier-4",
)
}
}
}
post {
always {
cleanWs()
}
}
}
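As a footnote to the aside about services starting automatically with the host: that behavior is just systemd's enabled flag, and the same module can enforce it if a host ever drifts (a minimal sketch, reusing the httpd example from earlier):
- name: Make sure the service comes back on its own after a reboot
  ansible.builtin.systemd:
    name: httpd
    enabled: yes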
Successfully running these playbooks causes my monitoring systems to go wild, but more importantly it means that the initial stage of developing the automation is done and I can take a few minutes to think about possible refinements. At this point I take about 30-45 minutes away from the computer. It's important to give your brain a chance to reset so that you can take on new problems and approach them with a clear mind, and I always do this before I start a refinement/refactoring stage.
Building A Better Mousetrap
Total time: ~1 1/2 hours
When I start to look through it, a couple of things stand out to me:
- These playbooks are very long
- They're very repetitive - every task is basically the exact same thing, just with a couple of variables swapped out
- It's probably fine, but something about running a sequence of steps that requires a certain order daisy-chained together just strikes me as a little fragile and susceptible to failure
The thought strikes me that I could probably improve this automation with a short refactor, and by combining a couple of the steps. At a glance, I'd estimate something like 80-85% of the code is probably all just the task that starts and stops services, something like this:
- name: Stop Gitea
  ansible.builtin.service:
    name: "{{ item }}"
    state: stopped
  when:
    - inventory_hostname == "gitea"
  with_items:
    - gitea

- name: Start Gitea
  ansible.builtin.service:
    name: "{{ item }}"
    state: started
  when:
    - inventory_hostname == "gitea"
  with_items:
    - gitea
There are two major improvements that stand out to me right away with this pattern. First, the list of services each host is running currently lives in the playbook, and is duplicated in each task that needs it. It would be smarter and more concise to move it to the inventory file where it can be associated with each host as a variable.
So instead of the tasks above, we end up with this in inventory.yaml:
gitea:
  services:
    - gitea
This may seem like a lateral move, but it gets really useful when a host has a lot of services that we have to enumerate in a list.
oc:
  services:
    - nginx
    - memcached
    - php-fpm
    - redis
    - wazuh-agent
The host above has five services running on it. If I have a startup and shutdown task for this host, I can eliminate half my lines of code by moving this list into the inventory file and re-using it, rather than hard-coding it in every single task. To re-use this list, the service control tasks become:
- name: Stop Gitea
  ansible.builtin.service:
    name: "{{ item }}"
    state: stopped
  when:
    - inventory_hostname == "gitea"
  with_items:
    - "{{ services }}"

- name: Start Gitea
  ansible.builtin.service:
    name: "{{ item }}"
    state: started
  when:
    - inventory_hostname == "gitea"
  with_items:
    - "{{ services }}"
The second improvement I can see is that there's really no compelling reason for these tasks to be separate; again, I'm just duplicating code that ultimately does the same thing. That's always going to be bad, and create maintenance and development headaches down the road when I have to change something. So my next step is to combine the tasks. Not only can I combine the startup/shutdown tasks for each host, but because I've abstracted the service list, I can actually collapse all tasks for all hosts into a single task. To do this, I add a SERVICE_STATE variable which is passed in as an extra variable at runtime:
vars:
  SERVICE_STATE: UNSET
tasks:
  - name: Set service state
    ansible.builtin.service:
      name: "{{ item }}"
      state: "{{ SERVICE_STATE }}"
    with_items:
      - "{{ services }}"
I also add a sanity-check task at the start of the playbook to ensure that it won't run unless you explicitly set SERVICE_STATE.
- name: Validate that required vars are set
  delegate_to: localhost
  run_once: true
  assert:
    quiet: true
    that:
      - lookup('vars', item) != "UNSET"
  with_items:
    - SERVICE_STATE
This is a really simple pattern that has saved my butt hundreds of times. If you want to have a piece of automation do something, don't let it make inferences and guesses. Tell it exactly what you want. If you don't know what you want to pass in here, or even that you need to pass something in, it's probably not a good idea to run the playbook until you do.
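In practice, that means the difference between these two invocations (shown with the playbook's eventual name from the repository layout below):
# Fails immediately at the assert, because SERVICE_STATE is still UNSET
ansible-playbook service-control.yaml -i inventory.yaml -l tier-3

# Runs, because the desired state is stated explicitly
ansible-playbook service-control.yaml -i inventory.yaml -l tier-3 -e SERVICE_STATE=stopped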
When all of that is said and done (and after some sensible renaming), the repository looks like this:
├── playbooks/
│ ├── service-control.yaml
│ └── do-update.yaml
└── pipelines
└── monthly-patching.jenkinsfile
The service-control.yaml playbook has collapsed from around 600 lines of code, to 68, which is a gigantic improvement. Only around 20 of those lines are actually required to control the service state, while the rest of it is just extra validations and sanity checks to make sure that it runs and/or fails gracefully. do-update.yaml remains small, with only around 10 lines of actual patching/restart code and the rest again dedicated to making sure everything runs smoothly for a total of 43 lines of code (including comments and documentation).
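To give a flavor of what those extra lines look like, here's the kind of check involved (a sketch rather than the literal playbook contents; it leans on the per-host services variable from the inventory):
- name: Validate that this host has a services list defined
  assert:
    quiet: true
    that:
      - services is defined
      - services | length > 0
    fail_msg: "No services defined for {{ inventory_hostname }} in the inventory"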
Further Improvements
Total time: ~1 hour
A bit of testing confirms that the pipeline works as expected with no real issues; two hosts fumble, but in both cases the cause is a host issue, not a playbook issue. It does however strike me that it might be nice to have the ability to just exclude hosts. There are two cases that come to mind here:
- Hosts that are failing and I haven't had time to troubleshoot or fix yet.
- Hosts that are doing something that I can't/don't want to interrupt (like my Factorio server, which has gotten a little silly and probably deserves its own post again).
The implementation here is pretty straightforward. I briefly considered adding a text input to the Jenkins pipeline, but that seemed like it might be over-engineered. The reality is, most of the time I either need to exclude no hosts, or the same host every run. It's easier to just put it in the pipeline itself. So I end up adding an environment variable.
EXCLUSIONS = "!host1,!host2"
After that, I can just concatenate it with the limit field on the playbook step, like this:
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/service-control.yaml",
limit: "tier-3,${env.EXCLUSIONS}",
extraVars: [
SERVICE_STATE: "stopped"
]
)
I also spend some time re-organizing the pipeline steps a bit. As mentioned above, things need to happen in a particular order to be "graceful". For my environment, that means:
- Turn everything down, starting with services that consume other services (vs providing them) or do not have any dependencies at all, and moving down the stack before turning off critical services (database, network monitoring, etc) last.
- Update "tier 1" hosts; these are the critical host services that everything else relies on, like postgres, log aggregation, monitoring, etc.
- Reboot and validate tier 1 hosts. Postgres has a quirk I haven't sat down to figure out which requires an additional service restart after the reboot, so that happens in this stage too. Tier 1 hosts should be fully patched, online, and validated before moving on.
- Repeat from #2 with tier-2 hosts: update, reboot, re-enable, and validate.
- Repeat with tier-3 hosts.
The original pipeline had it split by update stage, so for example all services were brought down, and then all hosts were patched, and then everything was rebooted. This created some dependency errors where things came online before their supporting services, so it was easier to re-organize around tiers than to try and put artificial limits and controls in place.
The last improvement I make is to add an additional playbook which I can invoke at the very start of the pipeline, called staging.yaml:
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/staging.yaml",
)
The only point of staging.yaml is to provide a hook into the pipeline where I can do things that are required before patching. These are one-off steps that don't fit anywhere else (neatly), like triggering a Factorio/Minecraft save before shutting down the server, or a database backup. Right now, there's nothing in it, but the goal is just to provide an entrypoint so that I can do those things as needed. I also add a similar hook at the end of the pipeline called cleanup.yaml.
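As an example of the kind of thing that might eventually land in staging.yaml, a pre-patch database dump could look something like this (entirely hypothetical; the dump command and path are placeholders):
- name: Back up the databases before patching
  # hypothetical example - dump everything somewhere safe before touching the host
  become_user: postgres
  ansible.builtin.shell: "pg_dumpall > /tmp/pre-patch-backup.sql"
  when:
    - inventory_hostname == "pg"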
Finishing Up
I won't cover the boring details here, but the rest of the work is just things like "testing" (in this case, clicking the build button a few more times in Jenkins so that I get a pretty green checkmark and a hit of dopamine) and mostly writing documentation. There are around 130 lines of code in the entire pipeline (excluding the pipeline file itself), and for all of that I'll write something like a thousand or so words of documentation in a readme and in-line comments. It's not the funnest part of the process, but - again - I've learned my lesson well enough to know that it's a necessity to keep this stuff manageable.
Here's the final (sanitized) output:
inventory.yaml
This is not my real inventory file; it's an example. You'll need to build your own.
all:
  hosts:
  children:
    tier-1:
      hosts:
        pg:
          services:
            - postgresql
    tier-2:
      hosts:
        gitea:
          services:
            - gitea
        wiki:
          services:
            - wiki
    tier-3:
      hosts:
        factorio:
          services:
            - factorio
        microbin:
          services:
            - microbin
service-control.yaml
---
# This playbook sets services to the desired state based on the value
# of the SERVICE_STATE extra variable. This value must be one of
# started, stopped, reloaded, or restarted.
#
# Prerequisites
# - An ansible inventory file with each host to patch specified, and
# a list of systemd services for each host defined as the "services"
# variable for that host (see inventory.yaml for example).
#
# - Hosts in the inventory file grouped by tier 1, tier 2, tier 3 with
# tier-1 hosts being patched first, and tier-3 hosts being patched last.
# Additional tiers may be added as required.
#
# Example Invocation:
# ansible-playbook service-control.yaml -i inventory.yaml -e SERVICE_STATE=stopped -l tier-3
- hosts: all
  gather_facts: yes
  become: true
  collections:
    - community.general
    - community.docker
  vars:
    SERVICE_STATE: UNSET
  tasks:
    - name: Validate that required vars are set
      delegate_to: localhost
      run_once: true
      assert:
        quiet: true
        that:
          - lookup('vars', item) != "UNSET"
      with_items:
        - SERVICE_STATE
    - name: Print debug information
      delegate_to: localhost
      run_once: true
      debug:
        msg:
          - "SERVICE_STATE: {{ SERVICE_STATE }}"
    - name: Set service state
      ansible.builtin.service:
        name: "{{ item }}"
        state: "{{ SERVICE_STATE }}"
      with_items:
        - "{{ services }}"
do-update.yaml
---
# This playbook updates and reboots a host.
#
# Prerequisites
# - An understanding that this will take the host offline. Don't
# do this if you don't know what you're doing.
#
# - An ansible inventory file with each host to patch specified
# (see inventory.yaml for example).
#
# - Hosts in the inventory file grouped by tier 1, tier 2, tier 3 with
# tier-1 hosts being patched first, and tier-3 hosts being patched last.
# Additional tiers may be added as required.
#
# Example Invocation:
# ansible-playbook do-update.yaml -i inventory.yaml -l tier-3
- hosts: all
  gather_facts: yes
  become: true
  collections:
  vars:
  tasks:
    - name: Check reachability of the host
      ansible.builtin.ping:
    - name: Validate that the host is running CentOS
      assert:
        that: "'{{ ansible_distribution }}' == 'CentOS'"
    - name: Upgrade all packages
      ansible.builtin.yum:
        name: '*'
        state: latest
    - name: Unconditionally reboot the machine with all defaults
      ansible.builtin.reboot:
staging.yaml/cleanup.yaml
---
# This playbook provides a pre-execution hook for the monthly patching
# routine. It is intended as a catch-all playbook for tasks which need
# to be executed prior to monthly patching, to allow them to be
# handled automatically.
#
# Prerequisites
# - None
#
# Example Invocation:
# ansible-playbook staging.yaml
- hosts: all
  gather_facts: yes
  become: true
  collections:
  vars:
  tasks:
    - name: Debugging
      delegate_to: localhost
      run_once: true
      debug:
        msg:
          - "NYI"
monthly-patching.jenkinsfile
pipeline {
agent any
environment {
INVENTORY = "playbooks/inventory.yaml"
EXCLUSIONS = "!host1,!host2" /* The ! notation means exclude this host from the inventory list */
}
stages {
/* Run the pre-update hook */
stage('Staging and Update Preparation') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/staging.yaml",
)
}
}
/* Turn down all services, starting with tier-3 (orphan) and tier-2 (consumer)
services before moving to tier-1 (provider/critical) services */
stage('Turn down services') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/service-control.yaml",
limit: "tier-3,${env.EXCLUSIONS}",
extraVars: [
SERVICE_STATE: "stopped"
]
)
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/service-control.yaml",
limit: "tier-2,${env.EXCLUSIONS}",
extraVars: [
SERVICE_STATE: "stopped"
]
)
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/service-control.yaml",
limit: "tier-1,${env.EXCLUSIONS}",
extraVars: [
SERVICE_STATE: "stopped"
]
)
}
}
/* Update and reboot tier-1 hosts first. They should be back online before the update
proceeds. Postgres has an unexplained behavior that the systemd service requires an
extra restart after rebooting before it starts listening for connections properly,
so do that here too. Probably also fix that at some point. */
stage('Update Tier 1 Hosts') {
steps {
/* Update the hosts */
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/do-update.yaml",
limit: "tier-1,${env.EXCLUSIONS}"
)
/* Postgres requires a special restart for !!reasons!! */
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/service-control.yaml",
limit: "pg",
extraVars: [
SERVICE_STATE: "restarted"
]
)
}
}
/* Repeat the process for tier-2 hosts */
stage('Update Tier 2 Hosts') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/do-update.yaml",
limit: "tier-2,${env.EXCLUSIONS}"
)
}
}
/* Repeat the process for tier-3 hosts */
stage('Update Tier 3 Hosts') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/do-update.yaml",
limit: "tier-3,${env.EXCLUSIONS}"
)
}
}
/* Run the post-update hook */
stage('Post-update Cleanup Tasks') {
steps {
ansiblePlaybook(
credentialsId: '---',
inventory: "${env.INVENTORY}",
playbook: "${env.WORKSPACE}/playbooks/monthly-patching/cleanup.yaml",
)
}
}
}
post {
always {
cleanWs()
}
}
}