Operations

How it runs.

This page covers how my estate is actually deployed, the signals I watch, and the incidents I've caused and what they changed. The estate is a hybrid NixOS fleet I own and operate.

The estate

Five NixOS hosts on a UniFi network and one Ubuntu NAS left off NixOS on purpose, all described in a single flake and deployed with one colmena apply. The off-site Linode leg is currently torn down; the config that rebuilds it is still in the repo.

| Host | Role | Zone |
|---|---|---|
| desktop | Workstation, and the Colmena deploy controller the fleet is pushed from | lan |
| mgmt | Internal DNS, a private step-ca CA, Forgejo, the Harmonia binary cache, and the SIEM | lan |
| media | The media stack, declaratively; storage mounted from the NAS over NFS | lan |
| playground | libvirt security lab: Kali + Parrot, reached in-browser through Guacamole | lan |
| hacktop | Staging host, and the self-hosted Forgejo Actions runner that gates the fleet | lan |
| cloud1 | Linode leg, Terraform-provisioned, nixos-anywhere-installed; currently torn down, one apply from returning | cloud |
| nas | Ubuntu 24.04 over NFSv4.2; the one host I left off NixOS on purpose | |

  1. One flake, one apply

    Five NixOS hosts plus an Ubuntu NAS live in a single Nix flake and deploy with colmena apply from one workstation. Adding a host means a new directory and one line; the Linode leg came and went exactly that way.

  2. Provision vs. configure

    Terraform makes cloud boxes exist; nixos-anywhere wipes them to NixOS; Nix owns everything inside. Neither tool reaches into the other’s lifecycle.

  3. A deploy user that can’t get a shell

    Colmena pushes closures as a dedicated deploy user, a Nix trusted-user with scoped NOPASSWD sudo for only the activation commands. A stolen deploy key re-runs a build; it never opens root.

  4. Secrets keyed to the host

    sops-nix; each host decrypts its own secrets with its existing SSH host key via ssh-to-age. No key distribution, and no plaintext secret in the Nix store.

  5. A private CA, wired into NixOS

    step-ca issues real, auto-renewing certificates to internal services over its own ACME endpoint. Trusting the root is a module a host opts into.

  6. A SIEM defined in the flake

    Wazuh is out; a declarative Loki/Alloy/Alertmanager stack ships every host’s journal to mgmt, with alert rules in git and alerts pushed to a self-hosted ntfy topic. The cutover was gated on mgmt because it serves DNS and PKI.

The pipeline

Changes to the infra repo run through a self-hosted Forgejo Actions pipeline on hacktop. Forgejo is the private source of truth; GitHub is the public mirror, and only green builds reach it.

  1. flake check

    nix flake check evaluates every host in the fleet. Bad config fails here, before anything is built.

  2. gitleaks

    Every push is scanned for secrets. The repo is mirrored publicly, so a leaked key is exposed the moment it lands.

  3. build all hosts

    A matrix builds every host against the Harmonia binary cache, fail-fast off. One broken host doesn’t hide the others.

  4. publish on green

    Only after the gates pass does a mirror job force-push to the public GitHub repo. Red commits stay in the private Forgejo and never reach the public mirror.
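The gating logic of the four stages can be sketched in a few lines of Python. This is an illustration only, not the actual Forgejo Actions workflow: the function names and the `build` callback are hypothetical, and the real pipeline runs these as separate jobs rather than one script. It shows two properties: builds run with fail-fast off, and the mirror push is allowed only when every gate is green.

```python
from typing import Callable, Dict, List, Tuple


def build_all_hosts(hosts: List[str], build: Callable[[str], bool]) -> Dict[str, bool]:
    """Build every host with fail-fast off: one failure never skips the rest."""
    results: Dict[str, bool] = {}
    for host in hosts:
        try:
            results[host] = bool(build(host))
        except Exception:
            # A crashing build counts as red, but the loop keeps going.
            results[host] = False
    return results


def should_publish(
    flake_check_ok: bool, gitleaks_ok: bool, builds: Dict[str, bool]
) -> Tuple[bool, List[str]]:
    """Return (publish?, reasons). Publishing requires every gate to be green."""
    reasons: List[str] = []
    if not flake_check_ok:
        reasons.append("flake check failed")
    if not gitleaks_ok:
        reasons.append("gitleaks found a secret")
    reasons += [f"build failed: {h}" for h, ok in sorted(builds.items()) if not ok]
    return (not reasons, reasons)


if __name__ == "__main__":
    hosts = ["desktop", "mgmt", "media", "playground", "hacktop"]
    # Hypothetical scenario: pretend the media build is broken.
    builds = build_all_hosts(hosts, lambda h: h != "media")
    print(should_publish(True, True, builds))
```

Because every host is built even after one fails, the list of reasons names each broken host in a single run instead of surfacing them one commit at a time.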

Next: NixOS VM tests that boot a host in a VM and assert its services before a change touches real hardware, then merge-gated auto-deploy to the staging boxes. The runner holds deploy keys to the whole fleet, so it's treated as the most sensitive box on the network.

Signals

A signal is something you can check. These are the ones I watch; the last row is the one I don't measure yet.

| Signal | What it tells me | State |
|---|---|---|
| Internal TLS | Every internal service on a real step-ca cert, ~90-day, auto-renewed | live · 15 vhosts |
| Fleet drift | colmena apply is a clean no-op when a host matches the flake | live · equality check |
| SIEM enrollment | Every host shipping its journal to the homelab SIEM | live · Loki/Alloy |
| Binary cache | A green build hits the Harmonia cache instead of rebuilding | live |
| Log retention | Journals shipped to a central store, 30-day, single-tenant | live |
| Availability SLOs + error budgets | A number I’d actually defend a change against | not yet measured |
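One way to picture the fleet-drift equality check is as a comparison of two store paths per host: the system the flake builds, and the system the host is actually running. The sketch below assumes both sets of paths have already been collected (for example, the running path from `readlink -f /run/current-system` on each host); the function name and the sample store paths are placeholders, not the check the fleet actually runs.

```python
from typing import Dict, List


def drifted_hosts(expected: Dict[str, str], running: Dict[str, str]) -> List[str]:
    """Return hosts whose running system path differs from the one the flake builds.

    A host missing from `running` counts as drifted: no answer is not a match.
    """
    return sorted(
        host for host, path in expected.items()
        if running.get(host) != path
    )


if __name__ == "__main__":
    # Placeholder store paths, for illustration only.
    expected = {
        "mgmt": "/nix/store/aaaa-nixos-system-mgmt",
        "media": "/nix/store/bbbb-nixos-system-media",
    }
    running = {
        "mgmt": "/nix/store/zzzz-nixos-system-mgmt",  # e.g. an out-of-band rebuild
        "media": "/nix/store/bbbb-nixos-system-media",
    }
    print(drifted_hosts(expected, running))
```

Treating an unreachable host as drifted is a deliberate choice in this sketch: a check that silently skips hosts it cannot reach would report a clean fleet at exactly the wrong moment.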

Incidents & postmortems

Blameless, pulled straight from the homelab log, each ending in a change that stuck.

pki/dns-deadlock Jun 15, 2026

Every internal service quietly served the wrong certificate for a day

Impact: All fifteen *.mgmt.lan services fell back to self-signed certs: trust warnings fleet-wide and the binary cache untrusted. No data exposed; step-ca itself was healthy.

What changed: mgmt was resolving DNS upstream to 1.1.1.1, not its own AdGuard, so step-ca’s HTTP-01 ACME validation couldn’t resolve *.mgmt.lan to fetch challenges; every order failed and nginx fell back to minica. The box that serves DNS couldn’t use it to validate its own certs. Fix: pin the ACME domains to mgmt’s own IP in networkd-managed /etc/hosts, clear the stale renew units, redeploy. The same bootstrap deadlock is why Colmena deploys by IP rather than by the name AdGuard resolves.
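A regression check for this fix could verify that every ACME domain is pinned to mgmt's own address in the hosts file, so validation never depends on the resolver. The following is a minimal sketch under stated assumptions: the domain names and the 192.0.2.10 address are placeholders (a documentation-range IP, not the real one), and the helper names are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_hosts(text: str) -> Dict[str, str]:
    """Map each hostname in /etc/hosts-style text to the first IP that names it."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # First entry wins, so later duplicates are ignored.
            mapping.setdefault(name.lower(), ip)
    return mapping


def unpinned(domains: Iterable[str], hosts_text: str, pinned_ip: str) -> List[str]:
    """Return the domains that are missing from the hosts text or point elsewhere."""
    mapping = parse_hosts(hosts_text)
    return [d for d in domains if mapping.get(d.lower()) != pinned_ip]


if __name__ == "__main__":
    # Placeholder hosts content and domain names.
    hosts_text = "127.0.0.1 localhost\n192.0.2.10 git.mgmt.lan cache.mgmt.lan\n"
    print(unpinned(["git.mgmt.lan", "ca.mgmt.lan"], hosts_text, "192.0.2.10"))
```

Run against the real `/etc/hosts` contents, an empty result would mean every listed domain resolves locally before any DNS server is consulted.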

drift/mgmt-channel Jun 20, 2026

The box that runs the fleet had quietly stopped running the fleet’s config

Impact: Weeks of staged flake changes were never live on mgmt, and colmena apply --on mgmt then failed auth: the deploy user only exists in the flake, which wasn’t the active system.

What changed: An unplanned nixos-rebuild had switched mgmt to a channel build, silently breaking Colmena management. Drift stays hidden until you go to deploy. Fix: re-fold carefully: run nix store diff-closures to confirm DNS/PKI weren’t in the diff, dry-activate to see what would restart, then switch on-box to recreate the deploy user and bring the flake live. Drift detection only works if you actually run the equality check, and the most-trusted box is the one worth checking most.
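The "confirm DNS/PKI weren't in the diff" step can be made mechanical. The sketch below parses text in the shape `nix store diff-closures` prints (one `name: old → new, +size` line per changed package) and flags any change to a protected package. It assumes plain output without color codes; the protected prefixes and the sample versions are illustrative guesses, not the fleet's actual package set.

```python
from typing import Iterable, List


def changed_packages(diff_output: str) -> List[str]:
    """Extract package names from diff-closures-style lines ("name: old → new, +size")."""
    names: List[str] = []
    for line in diff_output.splitlines():
        name, sep, _rest = line.partition(":")
        if sep and name.strip():
            names.append(name.strip())
    return names


def touches_protected(diff_output: str, protected: Iterable[str]) -> List[str]:
    """Return changed packages whose name starts with any protected prefix."""
    prefixes = tuple(protected)
    return [n for n in changed_packages(diff_output) if n.startswith(prefixes)]


if __name__ == "__main__":
    # Made-up sample output; versions are not real data.
    sample = (
        "acpi-call: 2020-04-07-5.8.16 → 2020-04-07-5.8.18\n"
        "step-ca: 0.27.0 → 0.27.1, +4.0 KiB\n"
        "dolphin: 20.08.1 → 20.08.2, +13.9 KiB\n"
    )
    # Hypothetical protected set for a box serving DNS and PKI.
    print(touches_protected(sample, ["step-ca", "adguardhome"]))
```

An empty list is the go signal; anything else means the switch would touch a service the rest of the fleet depends on, and the change deserves a closer look first.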

net/wifi-deploy Jun 14–15, 2026

A deploy that restarts the network, on a box reachable only over the network

Impact: A channel downgrade restarted NetworkManager on the Wi-Fi-only hacktop; it dropped off Wi-Fi and didn’t reconnect. The closure applied fine; the box was just gone. Later, a dead Ethernet dongle black-holed the route and caused ARP flux, taking it fully offline.

What changed: Know which deploys touch the link you’re standing on. Fix: pull the dead dongle to clear the ARP flux, redeploy from the no-git /tmp copy without bouncing NetworkManager, and promote “wire it to Ethernet with a static lease” to an action item. The fragile host is the one with no second route in.

cloud/kexec-oom Jun 20, 2026

The first cloud node OOM’d in the middle of its own install

Impact: nixos-anywhere hung bringing up the first Linode node: a 1 GB Nanode couldn’t evaluate the NixOS closure in the in-RAM kexec environment.

What changed: The fix was a terraform destroy and a bump to 2 GB. Greenfield NixOS installs need RAM headroom for the kexec; the smallest instance doesn’t leave enough for the install to fit.

Runbooks

If a failure mode is known, the response shouldn't be improvised. Runbooks live in the same repo as the configuration they describe, so the two version together.