Operations

How it runs.

This page covers how my estate is actually deployed, the signals I watch, and the incidents I've caused and what they changed. It runs on a hybrid NixOS fleet I own and operate.

The estate

Five NixOS hosts on a UniFi network and one Ubuntu NAS left off NixOS on purpose, all described in a single flake and deployed with one colmena apply. The off-site Linode leg is currently torn down — the config that rebuilds it is still in the repo.

Host	Role	Zone
desktop	Workstation, and the Colmena deploy controller the fleet is pushed from	lan
mgmt	Internal DNS, a private step-ca CA, Forgejo, the Harmonia binary cache, and the SIEM	lan
media	The media stack, declaratively; storage mounted from the NAS over NFS	lan
playground	libvirt security lab: Kali + Parrot, reached in-browser through Guacamole	lan
hacktop	Staging host, and the self-hosted Forgejo Actions runner that gates the fleet	lan
cloud1	Linode leg, Terraform-provisioned, nixos-anywhere-installed — currently torn down; one apply from returning	cloud
nas	Ubuntu 24.04 over NFSv4.2; the one host I left off NixOS on purpose	–

01

One flake, one apply

Five NixOS hosts plus an Ubuntu NAS live in a single Nix flake and deploy with colmena apply from one workstation. Adding a host means a new directory and one line; the Linode leg came and went exactly that way.
02

Provision vs. configure

Terraform makes cloud boxes exist; nixos-anywhere wipes them to NixOS; Nix owns everything inside. Neither tool reaches into the other’s lifecycle.
03

A deploy user that can’t get a shell

Colmena pushes closures as a dedicated deploy user, a Nix trusted-user with scoped NOPASSWD sudo for only the activation commands. A stolen deploy key re-runs a build; it never opens root.
04

Secrets keyed to the host

sops-nix: each host decrypts its own secrets with its existing SSH host key, converted to an age key via ssh-to-age. No separate key distribution, and no plaintext secret in the Nix store.
05

A private CA, wired into NixOS

step-ca issues real, auto-renewing certificates to internal services over its own ACME endpoint. Trusting the root is a module a host opts into.
06

A SIEM defined in the flake

Wazuh is out; a declarative Loki/Alloy/Alertmanager stack ships every host’s journal to mgmt, with alert rules in git and alerts pushed to a self-hosted ntfy topic. The cutover was gated on mgmt because it serves DNS and PKI.

The pipeline

Changes to the infra repo run through a self-hosted Forgejo Actions pipeline on hacktop. Forgejo is the private source of truth; GitHub is the public mirror, and only green builds reach it.

01

flake check

nix flake check evaluates every host in the fleet. Bad config fails here, before anything is built.
02

gitleaks

Every push is scanned for secrets. The repo is mirrored publicly, so a key that got past this gate would be exposed the moment it reached the mirror.
03

build all hosts

A matrix builds every host against the Harmonia binary cache, fail-fast off. One broken host doesn’t hide the others.
04

publish on green

Only after the gates pass does a mirror job force-push to the public GitHub repo. Red commits stay in the private Forgejo and never reach the public mirror.
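
The gate ordering above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the actual Forgejo Actions workflow: `run_pipeline`, its callables, and the host list passed in are all hypothetical stand-ins. It shows the two properties the pipeline leans on: the build matrix keeps going after a failure (fail-fast off), and the mirror step runs only when everything before it is green.

```python
from typing import Callable, Iterable

Gate = Callable[[], bool]  # returns True when the gate passes


def run_pipeline(
    flake_check: Gate,
    gitleaks: Gate,
    build_host: Callable[[str], bool],
    hosts: Iterable[str],
    publish: Callable[[], None],
) -> dict:
    """Model of the four stages; every callable is a hypothetical stand-in."""
    report = {"failed_gates": [], "failed_hosts": [], "published": False}

    # Stages 1 and 2: evaluation and secret scanning. Both are run here so both
    # failures get reported; whether the real workflow does that is an assumption.
    for name, gate in (("flake check", flake_check), ("gitleaks", gitleaks)):
        if not gate():
            report["failed_gates"].append(name)
    if report["failed_gates"]:
        return report  # nothing is built or published

    # Stage 3: fail-fast off, so every host is attempted and every failure kept.
    report["failed_hosts"] = [h for h in hosts if not build_host(h)]

    # Stage 4: publish only on green.
    if not report["failed_hosts"]:
        publish()
        report["published"] = True
    return report
```

Calling it with a `build_host` that fails for two hosts returns both names in `failed_hosts` and never calls `publish`, which is the behaviour "one broken host doesn't hide the others" describes.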

Next: NixOS VM tests that boot a host in a VM and assert its services before a change touches real hardware, then merge-gated auto-deploy to the staging boxes. The runner holds deploy keys to the whole fleet, so it's treated as the most sensitive box on the network.

Signals

A signal is something you can check. These are the ones I watch; the last row is the one I don't measure yet.

Signal	What it tells me	State
Internal TLS	Every internal service on a real step-ca cert, ~90-day, auto-renewed	live · 15 vhosts
Fleet drift	colmena apply is a clean no-op when a host matches the flake	live · equality check
SIEM enrollment	Every host shipping its journal to the homelab SIEM	live · Loki/Alloy
Binary cache	A green build hits the Harmonia cache instead of rebuilding	live
Log retention	Journals shipped to a central store, 30-day, single-tenant	live
Availability SLOs + error budgets	A number I’d actually defend a change against	not yet measured

Incidents & postmortems

Blameless, pulled straight from the homelab log, each ending in a change that stuck.

pki/dns-deadlock Jun 15, 2026

Every internal service quietly served the wrong certificate for a day

Impact: All fifteen *.mgmt.lan services fell back to self-signed certs: trust warnings fleet-wide and the binary cache untrusted. No data exposed; step-ca itself was healthy.

What changed: mgmt was resolving DNS upstream to 1.1.1.1, not its own AdGuard, so step-ca’s HTTP-01 ACME validation couldn’t resolve *.mgmt.lan to fetch challenges; every order failed and nginx fell back to minica. The box that serves DNS couldn’t use it to validate its own certs. Fix: pin the ACME domains to mgmt’s own IP in networkd-managed /etc/hosts, clear the stale renew units, redeploy. The same bootstrap deadlock is why Colmena deploys by IP rather than by the name AdGuard resolves.
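
The deadlock is easier to see as a toy model. The sketch below is not how AdGuard, step-ca, or the system resolver are implemented; the hostname `forgejo.mgmt.lan`, the address `192.0.2.10`, and both function names are hypothetical. It only encodes the lookup order (static hosts entries first, then the configured upstream) and the fact that HTTP-01 cannot proceed until the name resolves.

```python
from typing import Dict, Optional

# Hypothetical data: what a local resolver would answer, and a public
# upstream that knows nothing about *.mgmt.lan.
LOCAL_ZONE: Dict[str, str] = {"forgejo.mgmt.lan": "192.0.2.10"}
PUBLIC_UPSTREAM: Dict[str, str] = {}


def resolve(name: str, hosts_file: Dict[str, str], upstream: Dict[str, str]) -> Optional[str]:
    """Static hosts entries win; otherwise ask the configured upstream."""
    return hosts_file.get(name) or upstream.get(name)


def http01_can_validate(name: str, hosts_file: Dict[str, str], upstream: Dict[str, str]) -> bool:
    """HTTP-01 needs the CA to resolve the name before it can fetch the challenge."""
    return resolve(name, hosts_file, upstream) is not None
```

With an empty hosts table and the public upstream, validation fails, which is the incident. Either pointing the box at the local zone or pinning the name in the hosts table makes it pass; the fix described above is the second option, since it holds even when the local resolver is down.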

drift/mgmt-channel Jun 20, 2026

The box that runs the fleet had quietly stopped running the fleet’s config

Impact: Weeks of staged flake changes were never live on mgmt, and colmena apply --on mgmt then failed auth: the deploy user only exists in the flake, which wasn’t the active system.

What changed: An unplanned nixos-rebuild had switched mgmt to a channel build, silently breaking Colmena management. Drift stays hidden until you go to deploy. Fix: fold mgmt back in carefully. Run nix store diff-closures to confirm DNS/PKI weren’t in the diff, dry-activate to see what would restart, then switch on-box to recreate the deploy user and bring the flake live. Drift detection only works if you actually run the equality check, and the most-trusted box is the one worth checking most.
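
The "confirm DNS/PKI weren't in the diff" step can be made mechanical. A minimal sketch, assuming the output of nix store diff-closures has been captured as plain text with one changed package per line in the form `name: old → new, size`. The watch list is hypothetical: it guesses at the package names behind DNS and PKI on mgmt and would need adjusting to the real closure.

```python
# Hypothetical watch list: package names assumed to back DNS and PKI on mgmt.
GUARDED = {"adguardhome", "step-ca", "nginx"}


def touched_guarded(diff_output: str, guarded=GUARDED) -> list:
    """Return guarded package names that appear in diff-closures output.

    The text before the first colon on each line is taken as the package name;
    lines without a colon are ignored.
    """
    hits = []
    for line in diff_output.splitlines():
        name, sep, _ = line.partition(":")
        if sep and name.strip() in guarded:
            hits.append(name.strip())
    return hits
```

An empty result means none of the watched packages changed between the two closures. It says nothing about configuration-only changes to those services, so it complements the dry-activate step rather than replacing it.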

net/wifi-deploy Jun 14–15, 2026

A deploy that restarts the network, on a box reachable only over the network

Impact: A channel downgrade restarted NetworkManager on the Wi-Fi-only hacktop; it dropped off Wi-Fi and didn’t reconnect. The closure applied fine; the box was just gone. A later dead Ethernet dongle black-holed the route and caused ARP flux, taking it fully offline.

What changed: Know which deploys touch the link you’re standing on. Fix: pull the dead dongle to clear the ARP flux, redeploy from the no-git /tmp copy without bouncing NetworkManager, and promote “wire it to Ethernet with a static lease” to an action item. The fragile host is the one with no second route in.

cloud/kexec-oom Jun 20, 2026

The first cloud node OOM’d in the middle of its own install

Impact: nixos-anywhere hung bringing up the first Linode node: a 1 GB Nanode couldn’t evaluate the NixOS closure in the in-RAM kexec environment.

What changed: The fix was a terraform destroy and a bump to 2 GB. Greenfield NixOS installs need RAM headroom for the kexec; the smallest instance doesn’t leave enough for the install to fit.

Runbooks

If a failure mode is known, the response shouldn't be improvised. Runbooks live in the same repo as the configuration they describe, so the two version together.

host-wont-boot

Trigger: A NixOS host is stuck after a config change

Boot the previous generation from the bootloader menu, then colmena apply the known-good closure. That is what generations are for.
gated-mgmt-deploy

Trigger: Any change to mgmt: DNS and PKI live here

Never push blind: colmena apply dry-activate first, check nix store diff-closures for DNS/PKI, keep nixos-rebuild switch --rollback ready, then apply.
deploy-with-secrets

Trigger: A deploy needs sops secrets the flake can’t see

Secrets are gitignored, and a flake inside a Git checkout only sees tracked files. Deploy from an rsync’d /tmp copy of the tree with no .git directory, where the flake sees every file, not from the dirty repo.
cert-not-trusted

Trigger: An internal service is serving an untrusted cert

Check mgmt is resolving through its own AdGuard, not upstream. ACME validation needs *.mgmt.lan to resolve locally before step-ca can issue.
drift-check

Trigger: Unsure a host still matches its config

colmena apply should be a no-op. If it isn’t, the host has drifted. Find out why before forcing the closure.
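
The text does not say how the equality check is implemented. One plausible reading, sketched below, compares the closure a host is actually running with the store path the flake builds for it. On NixOS, /run/current-system is a symlink to the active system closure. Everything else here is an assumption: the function names, reaching the host over plain ssh, and how `expected_store_path` is obtained from the flake.

```python
import subprocess


def current_system(host: str, run=subprocess.run) -> str:
    """Ask a host which system closure is active by resolving /run/current-system."""
    result = run(
        ["ssh", host, "readlink", "-f", "/run/current-system"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def has_drifted(host: str, expected_store_path: str, run=subprocess.run) -> bool:
    """True when the active closure is not the one built from the flake."""
    return current_system(host, run) != expected_store_path
```

The `run` parameter exists so the check can be exercised without a real host. Unlike colmena apply, this only reads state, so it is safe to run on a schedule against mgmt.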
