Seven AI Agents Running My Infrastructure

The original version of this post described a single OpenClaw agent running as root, with one skill file for infra ops. That worked, but it missed the point of what the tool can actually do.

This is the real setup: seven specialized agents, a dedicated non-root user, proactive heartbeats, and a command center dashboard at agents.cloud.merox.dev. Everything is templated in the infra repo so I can rebuild from scratch in 15 minutes.

Why seven agents instead of one

One agent trying to do everything becomes a compromise. The system prompt grows, context gets polluted between domains, and the “personality” that makes an agent useful for one task makes it annoying for another.

The split:

Agent	Purpose	Runs
`news`	Morning briefing in Romanian — tech stack updates, CVEs filtered to installed stack, community news, stocks/crypto alerts	Daily at 07:00 UTC
`blog`	Analyzes merox.dev for content opportunities, keeps homelab posts up to date	Weekly (Mon 09:00 UTC)
`design`	UX/design review and recommendations for merox.dev	On demand
`infra`	K8s cluster + VPS health checks, security alerts	2× daily (08:00 + 20:00 UTC)
`costs`	Backup verification, resource tracking, storage trends	Weekly (Sun 09:00 UTC)
`dashboard`	Nightly audit + improvement of the command center dashboard	Daily at 23:00 UTC
`orchestrator`	Monitors all agents, auto-fixes safe issues, proposes improvements via Telegram	Daily at 12:00 UTC

Each agent has its own workspace directory, its own AGENTS.md (operating instructions), SOUL.md (personality), and sometimes HEARTBEAT.md (what to do proactively). The news, infra, dashboard, and orchestrator agents send Telegram messages on their own. The rest respond when asked.

Architecture

Phone / Laptop
    │
    └── Telegram ──────────────────────────► openclaw gateway (loopback:18789)
                                                      │
                                           openclaw user (non-root)
                                                      │
                                             Claude Code CLI (Pro OAuth)
                                                      │
          ┌────────────┬────────────┬────────────┬──────────────┬──────────────┐
      news agent  infra agent  blog agent  costs agent  dashboard    orchestrator
      AGENTS.md   AGENTS.md   AGENTS.md   AGENTS.md    AGENTS.md    AGENTS.md
      SOUL.md     SOUL.md     SOUL.md     SOUL.md      SOUL.md      SOUL.md
          │            │                                    │              │
     web search   kubectl / flux                     /srv/dashboard  agents.json
    /srv/dashboard talosctl / docker                  index.html    proposals.json

The gateway runs as a systemd user service under a dedicated openclaw user. No root. The agent workspaces describe what each agent manages and how it should behave. Claude figures out the commands to run.

Security model

This is where the previous setup was weak: running as root because it was easier. The new setup:

Dedicated user with minimal privileges:

useradd -m -s /bin/bash openclaw
usermod -aG docker openclaw
loginctl enable-linger openclaw

Sudoers — two files, each scoped to exact binaries:

/etc/sudoers.d/openclaw — infra tooling:

Defaults:openclaw !requiretty
openclaw ALL=(ALL) NOPASSWD: /usr/bin/kubectl
openclaw ALL=(ALL) NOPASSWD: /usr/bin/flux
openclaw ALL=(ALL) NOPASSWD: /usr/bin/talosctl
openclaw ALL=(ALL) NOPASSWD: /bin/systemctl status *
openclaw ALL=(ALL) NOPASSWD: /usr/bin/systemctl status *
openclaw ALL=(ALL) NOPASSWD: /usr/bin/node

/etc/sudoers.d/openclaw-fix-perms — permissions automation:

openclaw ALL=(root) NOPASSWD: /usr/local/bin/openclaw-fix-perms

Docker group membership handles container access. kubectl and talosctl get dedicated config files copied to /home/openclaw/.kube/ and /home/openclaw/.talos/. The agents are explicitly instructed not to read age.key, *.sops.yaml, and .env files; the rule is written into AGENTS.md instead of being left implicit.

Permissions automation — the Claude Code problem:

The gateway runs as openclaw, but if you interact with workspace files as root (e.g. via Claude Code CLI), any file you touch becomes root:root and the agents can’t write to it. The fix is a script that runs every 5 minutes and corrects ownership:

chown -R openclaw:openclaw /home/openclaw/.openclaw/
chown -R openclaw:openclaw /srv/dashboard/
# Update scripts stay root-owned intentionally
chown root:root /srv/dashboard/update-*.sh
chown -R openclaw:openclaw /srv/merox/src/content/blog/

Root crontab: */5 * * * * /usr/local/bin/openclaw-fix-perms

Systemd service: ExecStartPost=/usr/bin/sudo /usr/local/bin/openclaw-fix-perms

This means: even if you edit workspace files as root, they’re back to openclaw ownership within 5 minutes. No manual fixup needed.
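To see how much ownership drift the cron job is actually correcting, a read-only check can list the offending paths without changing anything. The following is a minimal sketch, not the fix-perms script itself: the function name `find_wrong_owner` is made up for illustration, and it only reports files, it never calls chown.

```python
from pathlib import Path


def find_wrong_owner(root: str, expected_uid: int) -> list[str]:
    """Return paths under root (root included) not owned by expected_uid."""
    wrong = []
    for path in [Path(root), *Path(root).rglob("*")]:
        # lstat judges a symlink by its own owner, not by its target's owner
        if path.lstat().st_uid != expected_uid:
            wrong.append(str(path))
    return sorted(wrong)
```

A possible use is to resolve the service user's UID with `pwd.getpwnam("openclaw").pw_uid` and call `find_wrong_owner("/home/openclaw/.openclaw", uid)`; an empty list means there is nothing for the next cron run to fix.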

Gateway binds to loopback only:

{
  gateway: {
    bind: "loopback",
    auth: { mode: "token" }
  }
}

Remote access goes through Tailscale. Nothing is exposed publicly.

Telegram allowlist:

{
  channels: {
    telegram: {
      botToken: "...",
      allowFrom: ["YOUR_NUMERIC_ID"]
    }
  }
}

Only your Telegram user ID can interact with the bot. Anyone else gets ignored at the gateway level, not the agent level.

Workspace files

OpenClaw’s workspace system is the key difference from a SKILL.md single-file approach. Each agent gets a directory with:

workspace-infra/
├── AGENTS.md      # operating instructions - what to check, how to respond
├── SOUL.md        # personality - paranoid SRE tone, factual, no false positives
├── HEARTBEAT.md   # what to do proactively on each scheduled tick
└── TOOLS.md       # local notes - paths, commands, how to update the dashboard

The content matters more than the format. Here’s what actually makes the infra agent useful:

AGENTS.md (excerpt):

## Security check (2x/day via heartbeat)

kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
docker ps --format "{{.Names}}\t{{.Status}}" | grep -v "Up"
df -h

Report on Telegram ONLY if there is a real problem.
Do not send "everything is ok" every check.

SOUL.md (infra agent):

You are an SRE with healthy paranoia. You don't dramatize, but you don't minimize.

- "Trust but verify" - check live, don't assume
- Silence is golden - no false positives; when you send, it's real
- When you don't know something, say so

HEARTBEAT.md (news agent):

Triggered daily at 04:00 UTC. Trigger message: MORNING_RUN.

You are running headless - use tools for every step, never output the briefing as text.

1. Read /srv/dashboard/data/news-releases.json (GitHub releases, pre-fetched every 6h)
2. Check memory/ for the last 3 days - no repeats
3. Collect tech & community news: Hacker News API, Reddit r/homelab + r/selfhosted,
   web search (CVEs, AI releases, infra news), GitHub trending
4. Write /srv/dashboard/data/news.json - dashboard data, BEFORE the HTML
5. Write /srv/dashboard/news.html - full HTML briefing
6. Send Telegram via Python urllib (not response text)
7. Update /srv/dashboard/data/agents.json
8. Save to memory/YYYY-MM-DD.md

CVE rules: only alert if it affects the exact installed version on K8s cluster
or Oracle VPS. gRPC/OpenSSL/Go runtime CVEs only if a specific service is
directly affected - not "K8s uses this internally". Scanner noise = skip.
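The CVE rule above is a prompt instruction, so the agent applies it by judgment. Expressed as code, the "exact installed version" test might look like the sketch below. Everything in it is an assumption for illustration: the `Advisory` shape, the idea that an advisory lists affected versions as exact strings, and the `installed` mapping are not taken from the real workspace files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    cve_id: str
    package: str
    affected_versions: frozenset  # exact version strings named by the advisory


def relevant_advisories(advisories, installed):
    """Keep advisories whose package is installed at an affected version.

    installed maps package name -> exact installed version string.
    """
    return [
        adv for adv in advisories
        # .get() returns None for packages that are not installed,
        # and None is never a member of affected_versions
        if installed.get(adv.package) in adv.affected_versions
    ]
```

Anything that fails the membership test is dropped, which covers both "not installed" and "installed at a different version", the two cases the rule treats as scanner noise.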

The command center dashboard

Each agent writes to /srv/dashboard/data/ — JSON files that a static HTML page reads and renders:

/srv/dashboard/
├── index.html              # command center - all agents status
├── news.html               # latest news briefing (written by news agent)
└── data/
    ├── agents.json         # status + last run per agent
    ├── news.json           # structured news items (written by news agent)
    ├── news-releases.json  # GitHub releases pre-fetch (written by update-news.sh)
    ├── infra.json          # live cluster/VPS metrics (written by update-infra.sh)
    ├── backup.json         # backup verification results
    ├── upgrades.json       # open Renovate PRs
    ├── weather.json        # hyperlocal weather - Open-Meteo, no API key (update-weather.sh)
    ├── network.json        # subnet scan - homelab LAN + WiFi + Tailscale (update-network.sh)
    ├── shared-memory.json  # cross-agent context (what agents flagged recently, suppressions)
    ├── orchestrator.json   # orchestrator run history
    └── proposals.json      # pending improvement proposals

The page auto-refreshes every 60 seconds. Dark mode, card-based layout. No backend needed — nginx serves static files.

An nginx container handles the serving:

services:
  agents-dashboard:
    image: nginx:alpine
    volumes:
      - /srv/dashboard:/usr/share/nginx/html:ro
    networks:
      network-cloud-merox:
        ipv4_address: 172.25.10.90
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.agents-dashboard.rule=Host(`agents.cloud.merox.dev`)"
      - "traefik.http.routers.agents-dashboard.entrypoints=https"
      - "traefik.http.routers.agents-dashboard.tls.certresolver=cloudflare"
      - "traefik.http.routers.agents-dashboard.middlewares=middlewares-authentik@file,default-headers@file"

Protected by Authentik — same SSO as the rest of the homelab stack.

After each run, an agent updates its status in agents.json:

import json

# Load the current status document
with open('/srv/dashboard/data/agents.json') as f:
    data = json.load(f)

# Replace this agent's entry with the result of the latest run
data['infra'] = {
    'lastRun': '2026-05-28T08:00:00Z',
    'status': 'ok',   # ok / warn / error
    'summary': 'All nodes healthy. Disk at 45%.'
}

# Write the whole document back
with open('/srv/dashboard/data/agents.json', 'w') as f:
    json.dump(data, f, indent=2)
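Because the dashboard page polls these files while agents rewrite them, a reader can in principle catch a half-written agents.json. One possible hardening, sketched here under assumptions and not taken from the repo, is to write to a temporary file in the same directory and rename it over the target. The helper name `update_agent_status` is hypothetical.

```python
import json
import os
import tempfile


def update_agent_status(path: str, agent: str, status: dict) -> None:
    """Merge one agent's status into the JSON file and replace it atomically."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {}  # first run: start from an empty document
    data[agent] = status

    # Write next to the target so the final rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the web server must read it
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise
```

This only protects readers from partial files. Two agents updating at the same moment can still overwrite each other's entry, since the read-modify-write cycle is not locked.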

OpenClaw config

All 7 agents defined in one openclaw.json:

{
  gateway: {
    mode: "local",
    port: 18789,
    bind: "loopback",
    auth: { mode: "token", token: "GENERATED_BY_ONBOARD_DO_NOT_SET_MANUALLY" }
  },
  agents: {
    defaults: {
      model: {
        primary: "anthropic/claude-sonnet-4-6",
        fallbacks: ["openai/gpt-5.5"]   // failover if Anthropic is down
      },
      thinkingDefault: "low",
      timeoutSeconds: 1800,
      heartbeat: { every: "0m" },       // heartbeats via cron, not gateway polling
      skipBootstrap: true,
      contextPruning: { mode: "cache-ttl", ttl: "5m" },
      agentRuntime: { id: "claude-cli" }
    },
    list: [
      { id: "news",         default: true, workspace: "/home/openclaw/.openclaw/workspace" },
      { id: "blog",         workspace: "/home/openclaw/.openclaw/workspace-blog" },
      { id: "design",       workspace: "/home/openclaw/.openclaw/workspace-design" },
      { id: "infra",        workspace: "/home/openclaw/.openclaw/workspace-infra" },
      { id: "costs",        workspace: "/home/openclaw/.openclaw/workspace-costs" },
      { id: "dashboard",    workspace: "/home/openclaw/.openclaw/workspace-dashboard" },
      { id: "orchestrator", workspace: "/home/openclaw/.openclaw/workspace-orchestrator" }
    ]
  },
  channels: {
    telegram: {
      botToken: "YOUR_TELEGRAM_BOT_TOKEN",
      allowFrom: ["YOUR_TELEGRAM_USER_ID"],
      dmPolicy: "allowlist"   // strict: only allowFrom IDs can DM the bot
    }
  },
  commands: {
    ownerAllowFrom: ["telegram:YOUR_TELEGRAM_USER_ID"]  // privileged commands
  },
  session: {
    scope: "per-sender",
    dmScope: "per-channel-peer",
    resetTriggers: ["/new", "/reset"],
    reset: { mode: "daily", atHour: 4, idleMinutes: 10080 },
    threadBindings: { enabled: true, idleHours: 24 }
  },
  tools: {
    profile: "coding",
    fs: { workspaceOnly: false },  // needed: agents write to /srv/dashboard/
    elevated: { enabled: false },
    agentToAgent: {
      enabled: true,
      allow: ["news", "blog", "design", "infra", "costs", "dashboard", "orchestrator"]
    },
    sessions: { visibility: "all" }
  },
  logging: {
    level: "info",
    redactSensitive: "tools"
  }
}

The model is claude-sonnet-4-6. OpenClaw uses Claude Code CLI’s OAuth — no separate Anthropic API key needed if you have Claude Pro. This was the main cost change from the previous setup.

One bot, seven agents. With a single Telegram bot, all messages go to the default agent (news). agentToAgent lets that agent delegate to the others when you ask: you send one message like “uită-te la cluster” (Romanian for “take a look at the cluster”), and news routes the request to infra, gets the answer, and reports back to you. No bot-switching needed.

Setup

Full install in 15 minutes from scratch:

1. Install OpenClaw

curl -fsSL https://deb.nodesource.com/setup_24.x | sudo -E bash -
sudo apt install -y nodejs
sudo npm install -g openclaw@latest
openclaw --version

2. Create the service user

sudo useradd -m -s /bin/bash -d /home/openclaw -c "OpenClaw Service" openclaw
sudo usermod -aG docker openclaw
sudo loginctl enable-linger openclaw

# Copy sudoers from infra repo
sudo cp /srv/kubernetes/infrastructure/agent/scripts/sudoers-openclaw /etc/sudoers.d/openclaw
sudo chmod 440 /etc/sudoers.d/openclaw

3. Kubeconfig + talosconfig

sudo -u openclaw mkdir -p /home/openclaw/.kube /home/openclaw/.talos
sudo cp /srv/kubernetes/infrastructure/kubeconfig /home/openclaw/.kube/config
sudo cp /srv/kubernetes/infrastructure/talos/clusterconfig/talosconfig /home/openclaw/.talos/config
sudo chown openclaw:openclaw /home/openclaw/.kube/config /home/openclaw/.talos/config
sudo chmod 600 /home/openclaw/.kube/config /home/openclaw/.talos/config

4. Configure and install workspaces

sudo -u openclaw mkdir -p /home/openclaw/.openclaw

# Copy config (fill in Telegram token + your user ID)
sudo cp /srv/kubernetes/infrastructure/agent/openclaw.json.example \
        /home/openclaw/.openclaw/openclaw.json
sudo chown openclaw:openclaw /home/openclaw/.openclaw/openclaw.json
sudo chmod 600 /home/openclaw/.openclaw/openclaw.json

# Install all 7 workspaces
AGENT_DIR=/srv/kubernetes/infrastructure/agent/workspaces
for agent in news blog design infra costs dashboard orchestrator; do
  sudo -u openclaw cp -r $AGENT_DIR/$agent \
    /home/openclaw/.openclaw/workspace$([ "$agent" = "news" ] && echo "" || echo "-$agent")
done
sudo -u openclaw mkdir -p /home/openclaw/.openclaw/workspace/memory
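The naming rule in the loop above (the default news agent lives in workspace, every other agent in workspace-<id>) is easy to get wrong on a rebuild. A small post-copy check could look like this sketch. It assumes AGENTS.md and SOUL.md are the two files every workspace must have, which matches the description earlier in this post; the function names are made up.

```python
from pathlib import Path

AGENTS = ["news", "blog", "design", "infra", "costs", "dashboard", "orchestrator"]
REQUIRED = ["AGENTS.md", "SOUL.md"]  # HEARTBEAT.md exists only for some agents


def workspace_dir(base: Path, agent: str) -> Path:
    """The default agent (news) uses 'workspace'; the rest use 'workspace-<id>'."""
    return base / ("workspace" if agent == "news" else f"workspace-{agent}")


def missing_files(base: Path) -> list[str]:
    """List required workspace files that are absent under base."""
    return [
        str(workspace_dir(base, agent) / name)
        for agent in AGENTS
        for name in REQUIRED
        if not (workspace_dir(base, agent) / name).is_file()
    ]
```

Calling `missing_files(Path("/home/openclaw/.openclaw"))` after step 4 should return an empty list; any path it prints points at a workspace that was copied to the wrong directory or not at all.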

5. Authenticate Claude Code + wire up OpenClaw auth

# 1. Log in to Claude Pro via OAuth (opens browser link)
sudo -u openclaw claude login

# 2. Run OpenClaw onboard to wire up Claude CLI auth and generate the gateway token
#    This adds agentRuntime: { id: "claude-cli" } to openclaw.json — no API key needed
sudo -u openclaw XDG_RUNTIME_DIR=/run/user/$(id -u openclaw) \
  openclaw onboard --non-interactive \
  --mode local \
  --auth-choice anthropic-cli \
  --skip-bootstrap \
  --skip-skills \
  --skip-daemon \
  --accept-risk

This uses your Claude Pro subscription — no per-token billing, no separate API key. The onboard command generates gateway.auth.token and adds agentRuntime: { id: "claude-cli" } for all Claude models. After this, re-add your Telegram credentials to ~/.openclaw/openclaw.json if onboard overwrote them.

6. Set up dashboard

sudo mkdir -p /srv/dashboard/data
sudo chown openclaw:openclaw /srv/dashboard /srv/dashboard/data

# Copy operational scripts (cron jobs that collect metrics without AI tokens)
sudo cp /srv/kubernetes/infrastructure/agent/scripts/dashboard/*.sh /srv/dashboard/
sudo chmod +x /srv/dashboard/*.sh
# Fill in secrets before starting: BOT_TOKEN in tg-notify.sh, GARAGE_TOKEN in update-backup.sh

cd /srv/docker/agents-dashboard && docker compose up -d

7. Install fix-perms script + crontab

# Copy fix-perms script from infra repo
sudo cp /srv/kubernetes/infrastructure/agent/scripts/openclaw-fix-perms /usr/local/bin/
sudo chmod 755 /usr/local/bin/openclaw-fix-perms
sudo cp /srv/kubernetes/infrastructure/agent/scripts/sudoers-fix-perms \
        /etc/sudoers.d/openclaw-fix-perms
sudo chmod 440 /etc/sudoers.d/openclaw-fix-perms

# Add to root crontab (sudo on both sides so it edits root's crontab, not yours)
(sudo crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/openclaw-fix-perms") | sudo crontab -

# Install talosctl system-wide (mise installs it in root's home, not accessible by openclaw)
sudo cp ~/.local/share/mise/installs/aqua-siderolabs-talos/TALOS_VERSION/talosctl /usr/local/bin/
sudo chmod 755 /usr/local/bin/talosctl

# Install openclaw user crontab
sudo -u openclaw crontab /srv/kubernetes/infrastructure/agent/scripts/openclaw-crontab

8. Start the gateway

sudo -u openclaw mkdir -p /home/openclaw/.config/systemd/user
sudo cp /srv/kubernetes/infrastructure/agent/scripts/openclaw-gateway.service \
        /home/openclaw/.config/systemd/user/
sudo chown openclaw:openclaw /home/openclaw/.config/systemd/user/openclaw-gateway.service

# Pass XDG_RUNTIME_DIR after sudo: variables set before it are dropped by sudo's env_reset
sudo -u openclaw XDG_RUNTIME_DIR=/run/user/$(id -u openclaw) \
  systemctl --user daemon-reload

sudo -u openclaw XDG_RUNTIME_DIR=/run/user/$(id -u openclaw) \
  systemctl --user enable --now openclaw-gateway.service

Verify:

sudo -u openclaw XDG_RUNTIME_DIR=/run/user/$(id -u openclaw) \
  systemctl --user status openclaw-gateway
sudo -u openclaw openclaw doctor

Proactive agents

Four agents run without being asked:

infra at 08:00 and 20:00 UTC — checks cluster nodes, unhealthy pods, disk space, stopped containers. Sends Telegram only if something is wrong. If you stop getting messages, either everything is fine or the agent itself died (check the service).

news at 04:00 UTC — reads pre-fetched GitHub release data, then searches Hacker News, Reddit r/homelab and r/selfhosted, and does targeted web searches for CVEs and AI/infra news. CVEs are filtered to what’s actually installed on the K8s cluster and Oracle VPS — no generic “this library is used somewhere” noise. Writes the structured JSON for the dashboard, the HTML briefing page, then sends a Telegram summary in Romanian.

dashboard at 23:00 UTC — audits the command center dashboard nightly: validates JSON data files, checks that all JavaScript references have matching HTML elements, then makes one incremental improvement if the audit passes. Self-tests after each change.

orchestrator at 12:00 UTC — audits all other agents: checks they ran on schedule, rotates oversized logs, validates data files. If it finds a pattern worth fixing in an agent’s logic, it sends a Telegram proposal with a concrete description. You approve or reject with /da or /nu. Applied changes are backed up automatically with rollback detection.

The orchestrator can also propose infrastructure-level changes — crontab schedule adjustments, new workspace files, openclaw.json agent additions — all requiring explicit approval before any change is applied. It never modifies security-sensitive config (auth tokens, allowlists, sudoers).
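Both the dashboard audit and the orchestrator audit include a "validate the JSON data files" step. What that step does is left to the agent, but the simplest version is a parse check over the data directory. The sketch below assumes exactly that and nothing more (no schema validation); `invalid_json_files` is a hypothetical helper name.

```python
import json
from pathlib import Path


def invalid_json_files(data_dir: str) -> dict:
    """Map each unparseable *.json file name in data_dir to its error message."""
    problems = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            problems[path.name] = str(exc)
    return problems
```

Run against /srv/dashboard/data, an empty dict means every file parses, which is a reasonable precondition before an agent is allowed to make its one incremental change.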

Heartbeats are configured via cron jobs on the openclaw user, not via OpenClaw’s built-in heartbeat polling (which burns tokens every 30 minutes even when there’s nothing to do):

# As openclaw user: crontab -e

# AI agents
0 7  * * *  /srv/dashboard/news-morning-run.sh  # wrapper: fresh session + MORNING_RUN trigger
0 8  * * *  /usr/bin/openclaw agent --agent infra        --message "HEARTBEAT"
0 20 * * *  /usr/bin/openclaw agent --agent infra        --message "HEARTBEAT"
0 9  * * 0  /usr/bin/openclaw agent --agent costs        --message "HEARTBEAT"
0 23 * * *  /usr/bin/openclaw agent --agent dashboard    --message "HEARTBEAT"
0 12 * * *  /usr/bin/openclaw agent --agent orchestrator --message "HEARTBEAT"
0 9  * * 1  /usr/bin/openclaw agent --agent blog         --message "HEARTBEAT"

# Zero-AI data collection (bash scripts, no tokens)
*/5  * * * *  /srv/dashboard/update-infra.sh       # K8s nodes/pods/docker/longhorn
*/30 * * * *  /srv/dashboard/update-backup.sh      # NAS/Garage S3/Longhorn backup status
*/30 * * * *  /srv/dashboard/update-weather.sh     # Open-Meteo hyperlocal weather (no API key)
0 */1 * * *   /srv/dashboard/update-network.sh     # Subnet scan — homelab LAN + WiFi + Tailscale
0 */6 * * *   /srv/dashboard/update-news.sh        # GitHub releases → news-releases.json
*/30 * * * *  /srv/dashboard/update-upgrades.sh    # Renovate PRs from infrastructure repo

# Watchdogs
5 8,20 * * *  /srv/dashboard/self-healing.sh       # restart stale agents after infra runs
0 */2 * * *   /srv/dashboard/check-logs.sh         # scan agent logs for errors
15 12 * * *   /srv/dashboard/check-proposals.sh    # notify pending orchestrator proposals
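Since the infra agent stays silent when all is well, the schedule audit and the watchdogs carry the job of noticing a dead agent. A staleness check over agents.json could be sketched as follows. The `MAX_AGE` values are assumptions derived from the schedules in this post plus an hour of slack; they are not OpenClaw settings, and the real self-healing.sh may work differently.

```python
from datetime import datetime, timedelta, timezone

# Assumed maximum gap between runs (schedule interval plus slack).
MAX_AGE = {
    "news": timedelta(hours=25),
    "infra": timedelta(hours=13),
    "dashboard": timedelta(hours=25),
    "costs": timedelta(days=7, hours=1),
    "blog": timedelta(days=7, hours=1),
}


def stale_agents(agents: dict, now: datetime) -> list[str]:
    """Return agents whose lastRun is missing or older than their allowed gap."""
    stale = []
    for name, max_age in MAX_AGE.items():
        last_run = agents.get(name, {}).get("lastRun")
        if last_run is None:
            stale.append(name)
            continue
        # agents.json stores UTC timestamps such as 2026-05-28T08:00:00Z
        ran_at = datetime.strptime(last_run, "%Y-%m-%dT%H:%M:%SZ")
        ran_at = ran_at.replace(tzinfo=timezone.utc)
        if now - ran_at > max_age:
            stale.append(name)
    return stale
```

Passing `now` in as a parameter keeps the function easy to test; a cron wrapper would call it with `datetime.now(timezone.utc)` and the parsed contents of agents.json.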

Orchestrator proposals

The orchestrator introduces an approval loop for agent improvements. When it detects a pattern worth fixing — say, the news agent has been alerting on releases that Renovate would handle anyway, three weeks in a row — it writes a proposal to /srv/dashboard/data/proposals.json and sends a Telegram message:

🔧 Proposal #prop-20260529-news-001

Agent: news
Risk: low

Skip Renovate-covered releases in news alerts
News agent has alerted on 4 simple version bumps this week that Renovate
would have caught on Saturday. Add a filter for releases where no CVE
or breaking change is present.

Reply with /da prop-20260529-news-001 or /nu prop-20260529-news-001

Reply /da and the next orchestrator run patches the relevant section of that agent’s AGENTS.md, backs up the original, and confirms on Telegram. If the change causes a regression (detected by comparing agents.json status before/after), it auto-rolls back and alerts.
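The before/after comparison that triggers a rollback can be reduced to a severity ordering over the three status values used in agents.json. This is a sketch of that idea only; how the orchestrator actually snapshots and compares state is not shown in this post, and `regressed` is a made-up name.

```python
SEVERITY = {"ok": 0, "warn": 1, "error": 2}  # status values used in agents.json


def regressed(before: dict, after: dict) -> list[str]:
    """Return agents whose status got worse between two agents.json snapshots."""
    worse = []
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue  # agent absent from the new snapshot: nothing to compare
        if SEVERITY[new["status"]] > SEVERITY[old["status"]]:
            worse.append(name)
    return worse
```

A non-empty result after an approved change would be the signal to restore the backed-up AGENTS.md and send the alert.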

Safe fixes — log rotation, missing JSON keys, stale file cleanup — happen automatically without asking.
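The news HEARTBEAT.md asks for Telegram delivery "via Python urllib", and a proposal message can be sent the same way. A minimal sketch using the Bot API sendMessage method follows. The function names are illustrative, and the token and chat ID are whatever you configured for the bot; nothing here is copied from the real scripts.

```python
import json
import urllib.parse
import urllib.request


def build_send_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send_message(bot_token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send the message and report whether Telegram answered with ok=true."""
    req = build_send_request(bot_token, chat_id, text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return bool(json.load(resp).get("ok"))
```

Splitting request construction from the network call keeps the first half testable offline, which matters for a script that otherwise only runs headless from cron.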

Disaster recovery

Everything except secrets and memory is in the infra repo at agent/. Rebuilding on a new server in ~20 minutes:

1. Install Node.js 24 + OpenClaw (npm install -g openclaw@latest)
2. Create openclaw user + docker group + linger
3. Copy sudoers-openclaw → /etc/sudoers.d/openclaw
4. Copy kubeconfig + talosconfig to /home/openclaw/.kube/ and /home/openclaw/.talos/
5. Install talosctl to /usr/local/bin/ (mise installs it per-user, not system-wide)
6. Copy all 7 workspaces from agent/workspaces/, then create memory/ dirs for all
7. Copy openclaw.json.example → ~/.openclaw/openclaw.json, fill in Telegram token
8. sudo -u openclaw claude login (OAuth, opens browser)
9. sudo -u openclaw openclaw onboard --auth-choice anthropic-cli --skip-bootstrap
10. Install openclaw-fix-perms to /usr/local/bin/ + root crontab */5 * * * *
11. Install openclaw-crontab as openclaw user’s crontab
12. Start systemd user service + docker compose up -d for dashboard

Can’t be auto-recovered:

openclaw.json secrets (Telegram token + gateway auth token) — keep in a password manager
workspace*/memory/ files — agents rebuild context over a few days, history is lost

What the agents can’t do without attention on first boot:

infra-extended.json won’t exist until the infra agent runs (08:00 UTC)
proposals.json needs to be initialized: echo '{"pending":[],"history":[]}' > /srv/dashboard/data/proposals.json

Everything else is in git or regenerates automatically.
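The proposals.json initialization above uses a shell redirect, which would wipe pending proposals if it were ever re-run on a live system. If that step gets scripted, an idempotent variant is safer. The sketch below assumes the same initial document shown above and uses a hypothetical helper name.

```python
import json


def ensure_proposals_file(path: str) -> bool:
    """Create an empty proposals file if none exists; return True if created."""
    try:
        # Mode "x" fails when the file already exists, so nothing is overwritten.
        with open(path, "x") as f:
            json.dump({"pending": [], "history": []}, f)
        return True
    except FileExistsError:
        return False
```

Calling it on every boot is harmless: the first call creates the file, later calls leave existing proposals untouched.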

Config template, workspace files, sudoers, and systemd unit: github.com/meroxdotdev/infrastructure/agent/