Managing a NixOS Fleet with Claude Code

Mar 16, 2026 · 5 min read AI Augmented

I manage a small fleet of NixOS machines — five hosts running everything from blockchain validators to a betting system to this website. The entire configuration lives in a git repository at /etc/nixos, and changes deploy with nixos-rebuild switch. It is the NixOS dream: declarative, reproducible, version-controlled infrastructure.

The reality is that even with declarative configuration, day-to-day infrastructure work involves a lot of context switching: reading logs on one host, editing a module, checking whether a service came up, debugging a firewall rule, looking up which VLAN a container is on, figuring out why a build failed because a package changed its license. Each task is small, but they add up.

I have been using Claude Code — Anthropic’s CLI tool — as an infrastructure co-pilot. Not as a replacement for understanding the system, but as a collaborator that can hold the full context of a NixOS configuration in its head while I focus on what I actually want to achieve.

The Workflow

The basic loop is:

1. I describe what I want ("set up a Consul cluster across griffin, wolfhound, and blackjack").
2. Claude reads the relevant Nix files and works out the existing architecture.
3. It writes the configuration, updates DNS records, and bumps zone serials.
4. I review the diff and approve the rebuild.
5. It runs nixos-rebuild switch and verifies the result.

This is not fundamentally different from pair programming. The difference is that Claude can read and cross-reference a dozen Nix files simultaneously, remember that wolfhound uses enp4s0 while blackjack uses enp5s0, and generate consistent configurations without me looking up IP addresses or copy-pasting boilerplate.
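Per-host facts like these can also be written down once, so that neither a human nor Claude has to remember them. A minimal sketch of one way to do that, not the author's layout: the table, the addresses (from the 192.0.2.0/24 documentation range) and the /24 prefix are hypothetical; only the interface names come from the post.

```nix
# Sketch: per-host facts kept in one table instead of scattered across modules.
{ config, ... }:
let
  # Hypothetical table; only the interface names are taken from the post.
  hosts = {
    wolfhound = { iface = "enp4s0"; address = "192.0.2.11"; };
    blackjack = { iface = "enp5s0"; address = "192.0.2.12"; };
  };
  # The entry for whichever machine this module is evaluated on
  # (evaluation fails for a host that is missing from the table).
  thisHost = hosts.${config.networking.hostName};
in
{
  networking.interfaces.${thisHost.iface}.ipv4.addresses = [
    { address = thisHost.address; prefixLength = 24; }
  ];
}
```

A module written this way picks the right interface on each machine from its hostname, so the same file can be imported fleet-wide.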

A Dedicated Service Account

For Claude to be useful for infrastructure work, it needs to run commands on the hosts. Rather than using my personal login, I created a dedicated service account with scoped privileges:

# modules/users.nix
{ pkgs, ... }:

{
  users.groups.claude = {};

  users.users.claude = {
    isNormalUser  = true;
    description   = "Claude Code Service Account";
    home          = "/home/claude";
    createHome    = true;
    shell         = pkgs.zsh;
    group         = "claude";
    extraGroups   = [ "docker" ];
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINb3... claude@nixos-fleet"
    ];
  };

  # NixOS asserts that zsh is enabled system-wide when it is a login shell;
  # without this the build fails evaluation.
  programs.zsh.enable = true;
}

Scoped Sudo

The service account gets passwordless sudo, but only for specific commands:

security.sudo.extraRules = [{
  users = [ "claude" ];
  commands = [
    { command = "/run/current-system/sw/bin/nixos-rebuild"; options = [ "NOPASSWD" ]; }
    { command = "/run/current-system/sw/bin/systemctl";     options = [ "NOPASSWD" ]; }
    { command = "/run/current-system/sw/bin/nixos-container"; options = [ "NOPASSWD" ]; }
    { command = "/run/current-system/sw/bin/journalctl";    options = [ "NOPASSWD" ]; }
    { command = "/run/current-system/sw/bin/machinectl";    options = [ "NOPASSWD" ]; }
  ];
}];

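This gives Claude the ability to rebuild configurations, manage containers, and read logs, but not to run rm -rf / or passwd root directly. The principle is the same as for any service account: minimum viable privileges for the task. Treat it as a guard against accidents rather than a hard security boundary, though. Membership in the docker group is root-equivalent, and an account that can both edit the configuration and run nixos-rebuild as root can grant itself anything.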

Since users.nix is in common.nix’s import chain, the account is created on every host in the fleet automatically.
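A minimal sketch of that import chain, assuming common.nix sits at the repository root next to one configuration file per host; only common.nix and modules/users.nix are named in the post, and the host file path is hypothetical.

```nix
# common.nix: shared by every host (sketch; the layout is an assumption)
{ ... }:
{
  imports = [
    ./modules/users.nix   # defines the claude service account
  ];
}

# A host file, e.g. a hypothetical hosts/griffin.nix, then only needs:
#   imports = [ ../common.nix ];
```

Anything added to common.nix, or to a module it imports, reaches all five hosts on their next rebuild.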

What a Session Looks Like

In a single session today, Claude and I:

Deployed Netdata across all five NixOS hosts via common.nix, debugged the withCloudUi flag requirement, documented three failed approaches to claiming the nodes in Netdata Cloud
Set up Uptime Kuma in a declarative container on anubis, fixed the DynamicUser bind mount path, fixed the localhost-only listen address, wrote a Python provisioning script for 40 monitors
Created a 3-node Consul cluster across griffin, wolfhound, and blackjack with declarative containers, registered 36 services with health checks, populated the KV store with infrastructure metadata
Fixed the touter container — added --capability=all for Docker-in-nspawn, enabled --profile default for the full compose stack
Updated DNS and dashboards throughout — new A records, CNAMEs, zone serial bumps, Dashlit links
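The two Uptime Kuma fixes in that list can be sketched as a declarative container. This assumes the stock NixOS services.uptime-kuma module, which runs the service with DynamicUser (so systemd keeps its state under /var/lib/private/uptime-kuma) and listens on 127.0.0.1 unless HOST is overridden. The bridge name, addresses, host path and stateVersion are hypothetical, not taken from the author's configuration.

```nix
# Sketch only: bridge, addresses, host path and stateVersion are hypothetical.
{ ... }:
{
  # Make sure the host-side directory exists before the container starts.
  systemd.tmpfiles.rules = [ "d /srv/uptime-kuma 0700 root root -" ];

  containers.uptime-kuma = {
    autoStart = true;
    privateNetwork = true;
    hostBridge = "br-mon";            # hypothetical host bridge
    localAddress = "10.20.0.50/24";   # hypothetical container address

    # DynamicUser services keep StateDirectory data in /var/lib/private/<name>;
    # /var/lib/uptime-kuma is only a symlink, so mount over the real path.
    bindMounts."/var/lib/private/uptime-kuma" = {
      hostPath = "/srv/uptime-kuma";
      isReadOnly = false;
    };

    config = { ... }: {
      services.uptime-kuma = {
        enable = true;
        # HOST defaults to 127.0.0.1, unreachable from outside the container.
        # Values in settings are strings.
        settings = {
          HOST = "0.0.0.0";
          PORT = "3001";
        };
      };
      networking.firewall.allowedTCPPorts = [ 3001 ];
      system.stateVersion = "24.11";  # hypothetical; match your release
    };
  };
}
```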

Each of these involved reading existing configuration, understanding the network topology, writing Nix modules, running rebuilds, and debugging issues. The context built up across the session — by the time we got to Consul, Claude already knew every container IP, every VLAN, every bridge interface from the earlier work.

The Hard Parts

This is not magic. Some things Claude handles well, and some require human judgement:

Works well:

Generating consistent NixOS modules that match existing patterns
Cross-referencing IPs, ports, and hostnames across multiple config files
Debugging build failures (the Consul BSL license error was diagnosed and fixed in one step)
Writing provisioning scripts for APIs (the Uptime Kuma Socket.IO workaround)
Keeping DNS records in sync with service changes
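The Consul work, including the license fix, might look roughly like this as a plain NixOS module with the declarative-container wrapper left out. It assumes the build error came from nixpkgs marking Consul unfree after its move to the Business Source License; the interface name and the retry_join addresses are hypothetical.

```nix
# Sketch of one Consul server; the same module would go on all three hosts.
{ lib, ... }:
{
  # nixpkgs marks BSL-licensed Consul as unfree, so allow just this package.
  nixpkgs.config.allowUnfreePredicate = pkg:
    builtins.elem (lib.getName pkg) [ "consul" ];

  services.consul = {
    enable = true;
    webUi = true;
    interface.bind = "eth0";   # hypothetical: interface holding the cluster address
    extraConfig = {
      server = true;
      bootstrap_expect = 3;    # griffin, wolfhound and blackjack
      # Hypothetical addresses of the three servers.
      retry_join = [ "192.0.2.21" "192.0.2.22" "192.0.2.23" ];
      # The HTTP API and UI listen on 127.0.0.1 unless client_addr is changed.
      client_addr = "0.0.0.0";
    };
  };

  # Consul defaults: 8300 server RPC, 8301 LAN gossip (TCP and UDP), 8500 HTTP.
  networking.firewall.allowedTCPPorts = [ 8300 8301 8500 ];
  networking.firewall.allowedUDPPorts = [ 8301 ];
}
```

Allowing only the named package keeps every other unfree package blocked, which is narrower than a blanket allowUnfree = true.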

Needs human input:

Network topology decisions (which VLAN, which bridge, which IP range)
Security trade-offs (--capability=all is a real privilege escalation — I chose to accept it)
Knowing when a service is genuinely down vs just unreachable from a different VLAN
Deciding whether to use Docker-in-nspawn or native services
Sops secrets — Claude can set up the plumbing but cannot know your Betfair password
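The split between plumbing and secret can be sketched with sops-nix, assuming that is the sops integration in use (the post only says sops). The secrets file, the secret name, the owner and the betting-bot unit are all hypothetical.

```nix
# Sketch: assumes the sops-nix NixOS module is already imported.
{ config, ... }:
{
  # Hypothetical encrypted file in the repo; only ciphertext is committed.
  sops.defaultSopsFile = ./secrets/betting.yaml;

  # Declares the secret (a nested "betfair: password:" key in the YAML).
  # This part can be written for you; the value inside the file cannot.
  sops.secrets."betfair/password" = {
    owner = "betting";   # hypothetical service user
  };

  # The hypothetical service receives a path to the decrypted file at runtime,
  # so the password never ends up in the world-readable Nix store.
  systemd.services.betting-bot.environment.BETFAIR_PASSWORD_FILE =
    config.sops.secrets."betfair/password".path;
}
```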

Practical Tips

If you are considering this workflow:

Version control is non-negotiable. Every change Claude makes goes through git diff before I approve it. The NixOS configuration is the source of truth, and git history is the audit trail.

Keep modules small and focused. One file per container or service. Claude can read and modify a 50-line module reliably. A 500-line monolith is harder for anyone — human or AI — to reason about.

Let it read before it writes. Claude’s best work comes when it reads the existing codebase first. "Read the dns.nix file and add an entry for consul" produces better results than "write me a DNS config."

Use it for the boring parts. Updating 5 DNS records, bumping a zone serial, adding firewall ports to 3 containers, writing the 30th monitor definition — this is where the time savings compound.

Review everything. The diff is your friend. Claude will occasionally propose something that works technically but does not match your intent. A quick review catches this before it hits production.

What This Is Not

This is not "AI replacing sysadmins." I still need to understand NixOS, networking, systemd, and the services I run. Claude does not make architectural decisions for me — it executes them. The value is in reducing the friction between "I know what I want" and "it is deployed and working."

Think of it as having a very fast, very patient colleague who has read every man page and never forgets an IP address. The architecture is still yours.

nixos claude-code ai infrastructure automation homelab devops
