From Rack to Cloud — My Infrastructure in 2026

Homelab 2026

merox-cloud status

Hardware

Compute

| Device | CPU | RAM | Storage | Purpose |
|---|---|---|---|---|
| Beelink GTi 13 | i9-13900H (14C/20T) | 64GB DDR5 | 2× 2TB NVMe | Proxmox (px-0) |
| OptiPlex #1 | i5-6500T (4C/4T) | 32GB DDR4 | 128GB NVMe | Proxmox (px-1) |
| OptiPlex #2 | i5-6500T (4C/4T) | 32GB DDR4 | 128GB NVMe | Proxmox (px-2) |
| Synology DS223+ | ARM RTD1619B | 2GB | 2× 2TB RAID1 | NAS/Media |

Network Gear

| Device | Model | Specs | Purpose |
|---|---|---|---|
| ONT | Huawei | 1GbE | ISP Gateway |
| Firewall | XCY X44 | 8× 1GbE | pfSense Router |
| WiFi | TP-Link AX3000 | WiFi 6 | Wireless AP |
| Switch | TP-Link | 24-port | Core Switch |

Power Protection

| Device | Model | Protected Equipment | Capacity |
|---|---|---|---|
| UPS #1 | CyberPower | Mini PCs (Proxmox cluster) | 1500VA |
| UPS #2 | CyberPower | Network gear | 1000VA |

Network

Three dedicated physical interfaces on pfSense:

WAN Interface → Orange ISP (Bridge Mode)
LAN Interface → Homelab Network
WiFi Interface → Guest/IoT Isolation

Warning

WiFi clients are firewalled from homelab services, except whitelisted ones like Jellyfin.

pfSense

The firewall is an XCY X44, a fanless mini PC from AliExpress (~200€) that has been running pfSense for 3+ years.

pfSense services dashboard

Tailscale Subnet Router exposes the entire homelab to the cloud VPS without installing Tailscale on every device. It is also the workaround for CGNAT: when the ISP doesn’t hand out a public IP, the mesh still gets you in. Full setup guide →

Unbound DNS runs as a local recursive resolver with domain overrides for *.k8s.merox.dev pointing to K8s-Gateway.

Telegraf pushes system metrics to Grafana.

Firewall rules: WiFi → LAN blocks everything except whitelisted apps; LAN → WAN allows all; WAN → Internal blocks all except explicitly exposed services.
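The three rules above can be read as a small decision table. The sketch below is only a toy model in shell, not pfSense configuration: the zone names, the `fw_decision` helper, and the contents of the two lists are assumptions for illustration (Jellyfin is the only whitelisted app the post names, and the list of WAN-exposed services is left empty).

```shell
#!/usr/bin/env bash
# Toy model of the inter-zone policy described above; not pfSense config.
WIFI_WHITELIST="jellyfin"   # Jellyfin is named in the post; extend as needed
WAN_EXPOSED=""              # assumption: nothing is exposed directly on WAN

in_list() {  # in_list <item> <space-separated list>
  local item=$1 candidate
  for candidate in $2; do
    [ "$candidate" = "$item" ] && return 0
  done
  return 1
}

fw_decision() {  # fw_decision <src-zone> <dst-zone> <service>
  local src=$1 dst=$2 service=$3
  case "$src->$dst" in
    "wifi->lan")
      if in_list "$service" "$WIFI_WHITELIST"; then echo allow; else echo block; fi ;;
    "lan->wan")
      echo allow ;;
    "wan->lan"|"wan->wifi")
      if in_list "$service" "$WAN_EXPOSED"; then echo allow; else echo block; fi ;;
    *)
      echo unspecified ;;   # flows the post does not describe
  esac
}

fw_decision wifi lan jellyfin
fw_decision wifi lan ssh
fw_decision wan lan ssh
```

Keeping the whitelist as data rather than as extra `case` branches mirrors how an alias-based rule would work: adding an app means editing one list, not the rule itself.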

Tailscale Mesh

Tailscale creates a flat network across all locations — the homelab rack and the Oracle Cloud VPS both appear on the same mesh. No VPN tunnels to configure, no firewall holes to punch.

Tip

The Oracle instance doubles as a Tailscale exit node — useful for routing traffic through the US when needed.

The Subnet Router on pfSense means every device in the homelab (including things like the Synology and iDRAC interfaces) is reachable from anywhere on the mesh without touching their individual configs.

Homelab Network Topology

Virtualization

Proxmox Cluster

Three-node cluster across the mini PCs. Each node runs one Talos VM, so the Kubernetes control plane is spread across three physical hosts and keeps running if any single host goes down.

Proxmox cluster overview

Nodes:

| Node | Device | CPU | RAM | Role |
|---|---|---|---|---|
| px-0 | Beelink GTi 13 | i9-13900H (20T) | 64GB | Primary (hosts K8s controlplane-1) |
| px-1 | OptiPlex #1 | i5-6500T (4T) | 32GB | K8s controlplane-2 |
| px-2 | OptiPlex #2 | i5-6500T (4T) | 32GB | K8s controlplane-3 |

Storage:

| Pool | Type | Used | Total |
|---|---|---|---|
| cluster-storage | ZFS | 713GB | 899GB |
| synology-nas | NFS | 985GB | 1.4TB |
| local-data | dir | 177GB | 812GB |

Current VMs:

| VM | Purpose | Specs | Status |
|---|---|---|---|
| kubernetes-controlplane-1 | K8s node (px-0) | 8vCPU/24GB | Running |
| kubernetes-controlplane-2 | K8s node (px-1) | 4vCPU/16GB | Running |
| kubernetes-controlplane-3 | K8s node (px-2) | 4vCPU/16GB | Running |
| Home Assistant | Smart home hub | 2vCPU/4GB | Running |
| Windows 10 | Lab / testing | 4vCPU/12GB | Stopped |
| Windows Server 2019 | AD Lab | 8vCPU/14GB | Stopped |
| Windows 11 | Remote desktop | 8vCPU/16GB | Stopped |

Intel Iris Xe GPU on px-0 is passed through to the kubernetes-controlplane-1 VM for Jellyfin hardware transcoding (Intel QuickSync). GPU passthrough guide →

Synology DS223+

Dual purpose: NFS/SMB shares for the ARR stack (still experimenting with both protocols), and personal cloud via Synology Drive.

After 3 years of self-hosting Nextcloud, I switched. Better performance, native mobile apps that actually work, and zero maintenance. Sometimes the best self-hosted solution is the one you never have to think about.

Synology services dashboard

Power Management

Two CyberPower UPS units cover the mini PCs and the network gear. When power fails, a cascading shutdown is triggered: Kubernetes nodes drain properly before the Proxmox hosts go down.

| Feature | Implementation | Purpose |
|---|---|---|
| pwrstat | USB to GTi13 Pro | Automated shutdown orchestration |
| SSH Scripts | Custom automation | Graceful cluster shutdown |
| Monitoring | Telegram alerts | Real-time power notifications |
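The ordering of the cascade (drain Kubernetes first, power off the Proxmox hosts afterwards) can be sketched as below. This is not the author's script, only a minimal illustration under assumptions: the Kubernetes node names are taken to match the VM names, the hosts are assumed reachable as `root@px-N` over SSH, and px-0 goes last on the guess that it is the machine the UPS reports to over USB. `DRY_RUN` defaults to 1, so the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch of a cascading shutdown. With DRY_RUN=1 (the default) nothing is
# executed; each command is only printed with a leading "+".
K8S_NODES="kubernetes-controlplane-1 kubernetes-controlplane-2 kubernetes-controlplane-3"
PVE_HOSTS="px-1 px-2 px-0"   # assumption: px-0 last, it hosts the UPS USB link

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

shutdown_cascade() {
  local node host
  # 1. Evict workloads so pods stop cleanly before the VMs disappear.
  for node in $K8S_NODES; do
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s
  done
  # 2. Power off the hypervisors once the cluster is drained.
  for host in $PVE_HOSTS; do
    run ssh "root@$host" "shutdown -h now"
  done
}

shutdown_cascade
```

The `--timeout` on the drain matters on battery power: a stuck eviction should not hold up the rest of the sequence while the UPS runs down.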

UPS monitoring dashboard

Telegram UPS notification

Note

The Dell R720 is in the rack but currently powered off. It has NixOS installed for learning and exploration and is not part of the active production stack. iDRAC7 Enterprise gives remote console access when needed. R720 setup post →

Kubernetes

Fair warning: this is where I went full “because I can” mode. If you just want to run services, Docker is the right answer. But if you want to learn enterprise-grade container orchestration in your homelab, keep reading.

The starting point: onedr0p/cluster-template

Talos OS was the first immutable, declarative OS I’d run. After a few days of troubleshooting, I was sold.

Tip

Why Talos over K3s? Immutable OS means less maintenance, GitOps-first design, declarative everything, and it’s closer to what you’d run in production.

My infrastructure repo: github.com/meroxdotdev/infrastructure

Key customizations:

| Component | Modification | Reason |
|---|---|---|
| Storage | Longhorn CSI | Simpler PV/PVC management |
| Talos Patches | Custom machine config | Longhorn requirements |
| Custom Image | factory.talos.dev | Intel iGPU + iSCSI support |

GitOps structure:

kubernetes/apps/
├── cert-manager/ # TLS automation
├── default/ # Production workloads
├── flux-system/ # Flux operator + instance
├── kube-system/ # Cilium, CoreDNS, NFS CSI, metrics-server
├── network/ # k8s-gateway, Cloudflare tunnel + DNS
├── observability/ # Prometheus, Grafana, Loki
└── storage/ # Longhorn configuration

Grafana and Loki dashboard

Lens Kubernetes cluster overview

Deployed apps:

| App | Purpose | Notes |
|---|---|---|
| Radarr | Movie automation | NFS to Synology |
| Sonarr | TV automation | NFS to Synology |
| Prowlarr | Indexer manager | Central search |
| qBittorrent | Torrent client | Gluetun sidecar + SurfShark WireGuard VPN |
| Jellyseerr | Request management | Public via Cloudflare |
| Jellyfin | Media server | Intel QuickSync enabled |
| Homepage | Dashboard | |
| Grafana | Metrics dashboards | |
| Prometheus + Alertmanager | Metrics collection + alerts | |
| Loki + Promtail | Log aggregation | |
| Netdata | Per-node system monitoring | DaemonSet, one agent per K8s node |
| cert-manager | TLS certificate automation | ACME via Let’s Encrypt |

The live dashboard is public — current service status at inside.merox.dev.

LoadBalancer IPs are handled by Cilium’s L2 announcement — a pool of addresses (10.57.57.100–120) announced directly on the LAN via ARP, no external load balancer needed. Services like qBittorrent, k8s-gateway, and the Cloudflare tunnel endpoint each get a dedicated IP from this pool.
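When pinning a specific address to a Service, it helps to confirm the address is actually inside the announced range. The helper below is a small sketch for that check; the function name `ip_in_pool` is made up for illustration, and it assumes the pool is exactly 10.57.57.100 through 10.57.57.120 as stated above.

```shell
#!/usr/bin/env bash
# Returns 0 when an IPv4 address falls inside the pool 10.57.57.100-120.
POOL_PREFIX="10.57.57"
POOL_FIRST=100
POOL_LAST=120

ip_in_pool() {
  local ip=$1
  local prefix=${ip%.*}   # first three octets
  local last=${ip##*.}    # final octet
  case "$last" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric octets
  esac
  [ "$prefix" = "$POOL_PREFIX" ] \
    && [ "$last" -ge "$POOL_FIRST" ] \
    && [ "$last" -le "$POOL_LAST" ]
}

ip_in_pool 10.57.57.105 && echo "10.57.57.105: in pool"
ip_in_pool 10.57.57.50  || echo "10.57.57.50: outside pool"
```

The range is inclusive on both ends, which gives 21 usable addresses.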

Cluster Rebuild & Disaster Recovery

With declarative config for everything and Flux keeping state in Git, a full cluster rebuild takes ~20 minutes — provision the Talos VMs, bootstrap, and a single command restores all Longhorn volumes from S3 and reconciles every app. The full step-by-step procedure lives in DEPLOY.md.

# Phase 2 K8s restore: bootstrap the apps, then one command restores every volume
task bootstrap:apps
task restore:longhorn

task restore:longhorn handles the full sequence automatically: patches the Longhorn BackupTarget CRD, restores all volumes from S3, creates PersistentVolumes with correct claimRefs, rebinds grafana/loki/prometheus PVCs to their restored data, fixes the Longhorn 1.11.2 duplicate HelmRelease issue, and force-reconciles all app HelmReleases. No manual steps.
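A quick sanity check after the restore is to confirm that every PersistentVolumeClaim ended up Bound. The filter below is a sketch, not part of the repo: `unbound_pvcs` is a hypothetical helper that reads `kubectl get pvc -A --no-headers` output on stdin, and the sample rows in the demo are invented.

```shell
#!/usr/bin/env bash
# Lists claims that are not Bound. With `kubectl get pvc -A --no-headers`
# the columns are: NAMESPACE NAME STATUS VOLUME CAPACITY ...
unbound_pvcs() {
  awk 'NF >= 3 && $3 != "Bound" { print $1 "/" $2 " (" $3 ")" }'
}

# Against a live cluster (requires kubectl access):
#   kubectl get pvc -A --no-headers | unbound_pvcs
# Demo with made-up rows:
printf '%s\n' \
  'observability grafana Bound pvc-aaa 10Gi RWO longhorn 5m' \
  'observability loki Pending' \
  | unbound_pvcs
```

Empty output means every claim is Bound; anything printed is a claim whose rebinding needs a closer look.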

The repo also includes a full DR automation toolkit:

| Script / Task | Purpose |
|---|---|
| scripts/dr-preflight.sh | Pre-flight checks before DR: vault, age.key, Tailscale key, tools |
| scripts/dr-verify.sh --phase 1\|2\|3\|all | Post-DR verification for each phase |
| scripts/garage-extract-creds.sh | Extract Garage S3 creds after Phase 1, auto-update vault |
| scripts/gen-dr-talconfig.sh | Patch talconfig.yaml with DR IPs from Terraform outputs |
| talos/terraform/ | Terraform for 3 DR Talos VMs on Proxmox (IDs 810–812) |
| task dr:create-vms / dr:destroy-vms | Spin DR VMs up/down |

| Scenario | Where to look |
|---|---|
| Full rebuild (new hardware) | DEPLOY.md: Phase 1 (VPS) → Phase 2 (K8s) → Phase 3 (Agent) |
| Restore Longhorn volumes from S3 | task restore:longhorn (fully automated) |
| New hardware (different IPs/disks) | Update talos/talconfig.yaml, cluster-vars.yaml, cilium/networks.yaml |
| Intel iGPU absent on new hardware | Remove gpu.intel.com/i915 from Jellyfin HelmRelease, disable device plugin |

Warning

Back up two things before decommissioning any node: age.key (losing it = losing all SOPS-encrypted secrets) and ~/.openclaw/.env (Anthropic API key + Telegram tokens).
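One way to make that backup hard to forget is to bundle both files into a single archive. The function below is a sketch under assumptions: `backup_secrets` and the archive name are invented for illustration, the location of age.key depends on your checkout, and the archive itself is not encrypted, so the destination should be encrypted storage.

```shell
#!/usr/bin/env bash
# Bundle the two irreplaceable files into one archive before wiping a node.
backup_secrets() {  # backup_secrets <path-to-age.key> <path-to-.env> <dest-dir>
  local age_key=$1 env_file=$2 dest=$3 f key_dir env_dir archive
  for f in "$age_key" "$env_file"; do
    [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
  done
  mkdir -p "$dest" || return 1
  # Resolve absolute directories so tar's -C options do not depend on each other.
  key_dir=$(cd "$(dirname "$age_key")" && pwd) || return 1
  env_dir=$(cd "$(dirname "$env_file")" && pwd) || return 1
  dest=$(cd "$dest" && pwd) || return 1
  archive="$dest/homelab-secrets-$(date +%Y%m%d-%H%M%S).tar.gz"
  # Restrictive umask so the archive is readable by its owner only.
  ( umask 077 && tar -czf "$archive" \
      -C "$key_dir" "$(basename "$age_key")" \
      -C "$env_dir" "$(basename "$env_file")" ) || return 1
  echo "$archive"
}

# Example (paths are assumptions; adjust to your setup):
#   backup_secrets ./age.key ~/.openclaw/.env /mnt/encrypted-usb
```

The function prints the archive path on success and returns non-zero if either file is missing, so it can gate a decommissioning script.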

Longhorn PVC backups land in Garage S3 on the Oracle VPS, so persistent data survives even if all three Proxmox nodes go down simultaneously. Restore procedure: Restoring from Longhorn Backups →

Cloud

The Oracle Cloud Free Tier Ampere A1 instance (4 vCPU / 24GB RAM / 200GB disk) is the off-site anchor of the entire setup. It’s not just a place to park Docker containers — it’s the external access layer, the backup target, and the recovery fallback.

Everything on it is managed through a single Portainer instance at cloud.merox.dev:

Portainer multi-cluster view

Services

| Service | Purpose |
|---|---|
| Traefik | SSL termination for all VPS services |
| Pi-hole | Dedicated Tailscale split-DNS |
| Portainer | Container management |
| Authentik | Identity provider, SSO across all services |
| Guacamole | Remote desktop access via Cloudflare Tunnel |
| Joplin Server | Self-hosted notes sync |
| Uptime Kuma | Service uptime monitoring |
| Glances | System resource monitoring |
| Garage S3 + WebUI | S3-compatible object storage for Longhorn backups |
| Rsync endpoint | Off-site backup target from Synology NAS |
| OpenClaw | AI infrastructure agent (Telegram → kubectl/flux/docker) |

Authentik acts as the identity layer for everything — Google SSO, proxy authentication for Guacamole, OAuth2 for Portainer, and a Kubernetes outpost for cluster services. Full Authentik setup →

OpenClaw is a self-built AI agent that connects Telegram to the infrastructure — kubectl, flux, and docker commands via chat, from anywhere on the Tailscale mesh. Full OpenClaw setup →

Disaster Recovery

Oracle can and does terminate Always Free instances without warning, so I plan on the assumption that it will happen.

The full stack is codified in Ansible roles + Terraform under vps/: one make dr-full command provisions an on-demand Hetzner VPS and deploys every service in ~15 minutes. Hetzner is not a standing server — it’s spun up only when Oracle is lost, then torn down after migration back. Cloudflare Tunnel reconnects automatically (same token), Tailscale rejoins the mesh (same auth key), and data volumes restore from the Synology rsync backup.

make dr-preflight # checks vault, age.key, Tailscale key, tools
make dr-full # terraform apply + ansible deploy (~15 min)
./scripts/dr-verify.sh --phase 1 # post-DR verification

make dr-full
├─ terraform apply (~2 min): new Hetzner server, inventory updated
├─ ansible setup (~12 min): full stack deployed
└─ data restore (~30 min): volumes from Synology rsync

For the full walkthrough including security hardening, Ansible vault setup, and the Terraform config: Oracle Cloud Free Tier: Building a Full DR Plan →

Backup

Every layer has an off-site copy:

Kubernetes PVCs
▼ Longhorn → S3
Garage S3 (Oracle Cloud VPS) ◄── Synology NAS rsync

Longhorn dashboard

Longhorn PVCs → Garage S3 on Oracle (S3-compatible). Currently ~28GB used out of ~98GB free. Garage replaced MinIO after MinIO discontinued their Docker images and left older builds with known CVEs. MinIO → Garage migration →

Synology NAS → rsync to the same Oracle VPS on a schedule — ~26GB of Docker volumes, configs, and data. Off-site, encrypted. Synology backup setup →

Synology local redundancy — 2× 2TB in RAID1 for the primary NAS data.

For the full backup philosophy, retention policies, and restore procedures: 3-2-1 Backup Strategy →
