Blameless postmortems, pulled straight from the homelab log, each ending in a
change that stuck.
pki/dns-deadlock Jun 15, 2026
Every internal service quietly served the wrong certificate for a day
Impact: All fifteen *.mgmt.lan services fell back to self-signed certs: trust warnings fleet-wide and the binary cache untrusted. No data exposed; step-ca itself was healthy.
What changed: mgmt was resolving DNS upstream to 1.1.1.1, not its own AdGuard, so step-ca’s HTTP-01 ACME validation couldn’t resolve *.mgmt.lan to fetch challenges; every order failed and nginx fell back to minica. The box that serves DNS couldn’t use it to validate its own certs. Fix: pin the ACME domains to mgmt’s own IP in networkd-managed /etc/hosts, clear the stale renew units, redeploy. The same bootstrap deadlock is why Colmena deploys by IP rather than by the name AdGuard resolves.
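A pre-flight check along these lines would have flagged the deadlock before any ACME order was placed. This is a minimal sketch, not the check actually in use: the hostnames and the 10.0.0.2 address in the usage note are hypothetical placeholders, and the `resolve` parameter exists only so the logic can be exercised without real DNS.

```python
import socket


def misresolved(names, expected_ip, resolve=socket.gethostbyname):
    """Return {name: resolved_ip_or_None} for names that do not resolve to expected_ip."""
    bad = {}
    for name in names:
        try:
            got = resolve(name)
        except OSError:
            # socket.gaierror is a subclass of OSError: the lookup failed outright,
            # which is what an upstream resolver does with internal-only names.
            got = None
        if got != expected_ip:
            bad[name] = got
    return bad
```

Run on mgmt itself, something like `misresolved(["ca.mgmt.lan", "cache.mgmt.lan"], "10.0.0.2")` (both names and the address are assumed) returns an empty dict only when every pinned name resolves to the box's own IP.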
drift/mgmt-channel Jun 20, 2026
The box that runs the fleet had quietly stopped running the fleet’s config
Impact: Weeks of staged flake changes were never live on mgmt, and colmena apply --on mgmt then failed auth: the deploy user only exists in the flake, which wasn’t the active system.
What changed: An unplanned nixos-rebuild had switched mgmt to a channel build, silently breaking Colmena management. Drift stays hidden until you go to deploy. Fix: re-fold carefully: nix store diff-closures to confirm DNS/PKI weren’t in the diff, dry-activate to see what would restart, then switch on-box to recreate the deploy user and bring the flake live. Drift detection only works if you actually run the equality check, and the most-trusted box is the one worth checking most.
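The equality check itself is small. A minimal sketch, assuming the expected store path for the host is obtained separately by building the flake's system closure (that step is not shown, and the function names here are made up for illustration):

```python
import os


def active_system(link="/run/current-system"):
    """Resolve the symlink NixOS keeps pointing at the running system closure."""
    return os.path.realpath(link)


def has_drifted(active, expected):
    """True when the running closure is not the one the flake builds."""
    return os.path.normpath(active) != os.path.normpath(expected)
```

Store paths are content-addressed by their inputs, so a plain string comparison is enough: any difference means the box is running something other than what the flake describes.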
net/wifi-deploy Jun 14–15, 2026
A deploy that restarts the network, on a box reachable only over the network
Impact: A channel downgrade restarted NetworkManager on the Wi-Fi-only hacktop; it dropped off Wi-Fi and didn’t reconnect. The closure applied fine; the box was just gone. A later dead Ethernet dongle black-holed the route and caused ARP flux, taking it fully offline.
What changed: The deploy restarted NetworkManager, the only link into the box, so the host dropped off as soon as the switch ran. Fix: pull the dead dongle to clear the ARP flux, redeploy from the no-git /tmp copy without bouncing NetworkManager, and promote “wire it to Ethernet with a static lease” to an action item. Know which deploys touch the link you’re standing on: the fragile host is the one with no second route in.
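One way to know ahead of time is to scan dry-activate output for network units before switching. A rough sketch under two assumptions: that the output lists units on lines shaped like "would restart the following units: a.service, b.service", and that the watched set below (a guess, not taken from the log) covers the units that carry the link on the host in question.

```python
# Units whose restart can drop the only route in (assumed list; adjust per host).
NETWORK_UNITS = {
    "NetworkManager.service",
    "systemd-networkd.service",
    "wpa_supplicant.service",
}


def risky_restarts(dry_activate_output, watched=NETWORK_UNITS):
    """Return watched units that the activation would stop, restart, or reload."""
    hits = set()
    for line in dry_activate_output.splitlines():
        if not line.startswith("would ") or ":" not in line:
            continue
        action, units = line.split(":", 1)
        # Units that are only being started are not a risk to an existing link.
        if not any(verb in action for verb in ("stop", "restart", "reload")):
            continue
        for unit in units.split(","):
            unit = unit.strip()
            if unit in watched:
                hits.add(unit)
    return sorted(hits)
```

A non-empty result on a Wi-Fi-only host is the cue to deploy from the console, or to wire in a second route first.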
cloud/kexec-oom Jun 20, 2026
The first cloud node OOM’d in the middle of its own install
Impact: nixos-anywhere hung bringing up the first Linode node: a 1 GB Nanode couldn’t evaluate the NixOS closure in the in-RAM kexec environment.
What changed: Greenfield NixOS installs need RAM headroom for the kexec, and the smallest instance doesn’t leave enough for the install to fit. Fix: a terraform destroy and a bump to 2 GB.
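The headroom can be checked before committing to an install. A minimal sketch that reads MemTotal from /proc/meminfo-style text; the 1536 MiB floor is an assumption chosen to sit between the 1 GB size that failed and the 2 GB size that worked, not a measured requirement.

```python
def mem_total_mib(meminfo_text):
    """Extract MemTotal from /proc/meminfo text and convert from kB to MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found")


def enough_for_kexec(meminfo_text, min_mib=1536):
    """True when total RAM meets the (assumed) floor for an in-RAM install."""
    return mem_total_mib(meminfo_text) >= min_mib
```

On a live target the input would come from reading /proc/meminfo; note that reported MemTotal is always somewhat below the nominal instance size.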