High Availability Control Plane using Talos VIP
The Problem: Single Point of Failure
After a busy weekend developing my Proxmox automated restart workflow (keep an eye out for that post), I discovered a critical issue: my kubectl was hardcoded to a single control plane node. When that node went down during the restart of one of the Proxmox hosts, my kubectl commands stopped working, even though I had three control planes running! I was able to point the server at another node to get things working again, but that's not how I expected or wanted it to work.
While Kubernetes itself was highly available (pods running on other nodes were fine), I couldn't manage the cluster without manual intervention because my client tools pointed to a downed node.
The Solution: Virtual IP (VIP)
Talos Linux has built-in support for Layer 2 Virtual IPs on control plane nodes. A VIP is a floating IP address that automatically moves between healthy control planes. If one node goes down, the VIP instantly fails over to another node—zero downtime for kubectl access.
Implementation
Network Setup
Before:
- Control planes: 192.168.1.30, 192.168.1.31, 192.168.1.32 (DHCP)
- kubectl → 192.168.1.30:6443 (single point of failure)
After:
- Control planes: Same individual IPs (DHCP)
- VIP: 192.168.1.250 (floating between control planes)
- kubectl → 192.168.1.250:6443 (HA!)
Step 1: Create VIP Patch
Create talos-vip-patch.yaml:
machine:
  network:
    interfaces:
      - interface: ens18 # Your network interface
        dhcp: true
        vip:
          ip: 192.168.1.250
Key notes:
- Interface name: Use talosctl get links -n <node-ip> to find yours (ens18 for my Proxmox VMs)
- Keep dhcp: true - nodes keep their individual IPs
- Choose an unused IP for the VIP
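For example, to confirm the interface name on one of my control planes:
# Find the network interface name on a control plane
talosctl get links --nodes 192.168.1.30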
Step 2: Apply to All Control Planes
# Apply VIP patch to each control plane
for node in 192.168.1.30 192.168.1.31 192.168.1.32; do
talosctl patch machineconfig --nodes $node --patch @talos-vip-patch.yaml
done
No reboot required! The VIP becomes active immediately.
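If you want to see which control plane currently holds the VIP, one quick check (a sketch; the exact resource output varies by Talos version) is to look for the address on each node:
# Check which node currently owns the VIP
for node in 192.168.1.30 192.168.1.31 192.168.1.32; do
  echo "--- $node ---"
  talosctl get addresses --nodes $node | grep 192.168.1.250 || true
done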
Step 3: Test the VIP
# VIP should respond to pings
ping 192.168.1.250
# Kubernetes API should be accessible
curl -k https://192.168.1.250:6443/version
# Should return: {"kind":"Status",...,"code":401} # Unauthorized = working!
Step 4: Update Cluster Endpoint
Update all control planes to use the VIP as their cluster endpoint:
# cluster-endpoint-patch.yaml
cluster:
  controlPlane:
    endpoint: https://192.168.1.250:6443
# Apply to all control planes
for node in 192.168.1.30 192.168.1.31 192.168.1.32; do
talosctl patch machineconfig --nodes $node --patch @cluster-endpoint-patch.yaml
done
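To double-check that the patch took effect, you can pull the live machine config back and look for the endpoint (the grep is just a quick sanity check):
# Confirm the cluster endpoint now points at the VIP
talosctl get machineconfig --nodes 192.168.1.30 -o yaml | grep "endpoint:"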
Step 5: Update kubectl and talosctl
# Update kubectl config
kubectl config set-cluster your-cluster --server=https://192.168.1.250:6443
# Test it works
kubectl cluster-info
# Should show Kubernetes control plane is running at https://192.168.1.250:6443
# Update talosctl endpoint
talosctl config endpoint 192.168.1.250
# Most commands work fine
talosctl version
# For health checks, specify --nodes as the VIP
talosctl health --nodes 192.168.1.250
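Alternatively, if you'd rather not edit your kubeconfig by hand, talosctl can regenerate one for you; since the cluster endpoint is now the VIP, the generated kubeconfig should point there:
# Regenerate the kubeconfig from a control plane (merges into your default kubeconfig)
talosctl kubeconfig --nodes 192.168.1.30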
Testing Failover
The real test: does it actually fail over?
# Shutdown one control plane
# Make sure you cordon/drain first if you have things running
talosctl shutdown --nodes 192.168.1.30
# kubectl should still work immediately!
kubectl get nodes
Result: Instant failover! kubectl continued working with zero downtime.
CI/CD Integration
I'm using Gitea Actions, but for any CI/CD, you'll need to update your talosconfig secret afterwards:
# After updating talosctl to use VIP
cat ~/.talos/config | base64
Update your actions secret with the value.
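For illustration, the step that restores the config in a workflow looks roughly like this (the secret name TALOSCONFIG is just an example):
# Example Gitea Actions step (secret name is illustrative)
- name: Restore talosconfig
  run: |
    mkdir -p ~/.talos
    echo "${{ secrets.TALOSCONFIG }}" | base64 -d > ~/.talos/config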
Now all your automation workflows benefit from HA control plane access!
Configuration Management
I created scripts to manage this setup in my homelab repo:
- sync-configs.sh - Pull current config from cluster, auto-redact secrets
- merge-secrets.sh - Merge real secrets from another config into redacted template for disaster recovery
- encode-talosconfig.sh - Encode for CI/CD secrets
This allows me to keep configs version-controlled with secrets safely redacted. That said, always keep another backup of the full config with the secrets!
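To give a flavor of the approach, here is a rough sketch of the sync/redact idea (the field names being redacted are illustrative and not exhaustive):
# sync-configs.sh (rough sketch) - pull the live machine config and redact obvious secret fields
talosctl get machineconfig --nodes 192.168.1.30 -o yaml > controlplane-live.yaml
sed -E -i 's/^([[:space:]]*)(token|key|secret): .*/\1\2: REDACTED/' controlplane-live.yaml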
Results
Before VIP
- ❌ Single control plane failure = lost cluster access
- ❌ Manual intervention required to reconnect (changing the server IP used by kubectl)
- ❌ Workflow failures during maintenance
After VIP
- ✅ Control plane failure = instant transparent failover
- ✅ Zero downtime for kubectl access
- ✅ CI/CD workflows resilient to node failures
- ✅ Safe to restart/maintain individual control planes
Lessons Learned
- HA is more than running multiple nodes - You need to configure HA client access too
- Talos VIP is trivial to set up - No external load balancer needed
- Test your HA! - My Proxmox testing revealed this gap
- VIP works great with DHCP - No need to change node network config
What's Next
This VIP setup resolved the final issue I ran into while building my Proxmox automated restart workflow. That workflow allows me to do the following when there is a Talos VM on the Proxmox host:
- Drain and cordon the Kubernetes node (a simplified sketch follows this list)
- Shut down the Talos VM
- Maintain the ability to use talosctl and kubectl commands
- Safely restart the Proxmox host
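Here's a simplified sketch of the per-node steps (node names and IPs are illustrative):
# Simplified sketch of the maintenance sequence for one node
kubectl cordon talos-cp-1
kubectl drain talos-cp-1 --ignore-daemonsets --delete-emptydir-data
talosctl shutdown --nodes 192.168.1.30
# ...restart the Proxmox host, wait for the VM to boot back up...
kubectl uncordon talos-cp-1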
I'll be blogging about that as a whole as well as a few other hurdles I encountered during development of this workflow, so keep an eye out for some more posts in the coming days.
Update (October 25, 2025)
The Proxmox automated restart post is up! Check out Automating Proxmox Host Restarts with Gitea Actions for the full story.