The Physics of Kubernetes: What Happens When You Force a Live Cluster Into Isolation
The Setup: A Simple Request
It's a classic Tuesday. My Dev Manager walks in with what sounds like a straightforward request:
"We need to simulate an air-gapped environment for 24 hours to see how the app handles zero-internet connectivity. Can you just block outbound traffic on the firewall for our RKE2 node?"
We have a dedicated air-gapped cluster built for this exact purpose: air-gap by design. But this time, they wanted to test our main development node: a single-node RKE2 cluster that normally lives, breathes, and updates via the internet.
If you've ever considered:
- Testing disaster recovery by disconnecting networks
- Simulating air-gapped deployments on existing infrastructure
- Investigating "what if the WAN goes down for a day"
This post-mortem will save you from catastrophic failure.
The Experiment: 24 Hours in the Dark
I configured the firewall to block outbound traffic from the DGX to the WAN. The cluster could still communicate internally, but it couldn't reach the internet. For 24 hours, everything looked perfectly normal. The cluster kept running, pods stayed healthy, the application continued serving requests.
Then I restored the connection.
The cluster didn't just "wake up". It went into a violent death spiral.
The Symptoms: When Everything Breaks at Once
Within seconds of reconnection:
gRPC Floods:
rpc error: code = ResourceExhausted desc = grpc: received message
larger than max (16782685 vs. 16777216)
Control Plane Failure:
Failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods":
context deadline exceeded
Kubelet Crash Loop:
Failed to initialize CPU manager: policy mismatch: expected static,
got none
GPUs Vanished:
Our NVIDIA A100 GPUs with MIG instances completely disappeared from the cluster's reported capacity: nvidia.com/gpu: 0
We were looking at total system failure.
The Root Cause: Understanding Kubernetes as a Physical System
After diving deep into journalctl logs, containerd traces, and Kubelet state files, I realized we weren't just dealing with a "network glitch." We had triggered a cascade failure caused by the fundamental physics of how Kubernetes manages state.
Think of Kubernetes not as software, but as a thermodynamic system:
- It has potential energy (pending events in buffers)
- It has equilibrium states (desired = actual)
- It has phase transitions (reconnection events)
- It has memory (state files and checkpoints)
And like any physical system, if you don't manage the energy properly, it releases all at once, creating entropy (chaos).
The Three Forces That Destroyed Our Cluster
The "Thundering Herd" Effect: Accumulated State Debt
What is the Reconciliation Loop?
Kubernetes operates on continuous reconciliation loops: every component constantly compares desired state with actual state and attempts to converge them:
┌─────────────────────────────────────────┐
│ Desired State (What should exist) │
│ (YAML, API) │
└──────────────┬──────────────────────────┘
│
│ Reconciliation Loop
▼
┌─────────────────────────────────────────┐
│ Actual State (What exists now) │
│ (Kubelet, Runtime, Nodes) │
└─────────────────────────────────────────┘
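The loop above can be sketched in a few lines of shell (a toy model with made-up desired/actual values, not actual Kubernetes code): each iteration nudges "actual" one step closer to "desired", the way a controller converges replica counts.

```shell
# Toy reconciliation loop: converge actual state toward desired state
desired=5   # e.g. replicas declared in YAML (illustrative value)
actual=2    # e.g. replicas currently running (illustrative value)
while [ "$actual" -ne "$desired" ]; do
  actual=$((actual + 1))                 # take one corrective action
  echo "reconciled: actual -> $actual"   # the real loop emits events instead
done
```

The key property for this story: the loop never gives up. If it cannot reach the API server, the corrective actions and status updates queue up instead of disappearing.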
What Happened During the 24-Hour Disconnect?
These loops didn't stop; they accumulated. Here's what was building up:
| Component | What It Queued |
|---|---|
| Kubelet → API Server | Pod status updates, node heartbeats, metric snapshots, volume mount events |
| Container Runtime → Kubelet | Image pull failures, container state changes, resource usage updates |
| Device Plugins → Kubelet | GPU availability reports, topology updates, allocation failures |
| CNI Plugins → Control Plane | Network policy reconciliation attempts, IP allocation requests |
| Controllers → API Server | ReplicaSet scaling events, Service endpoint updates |
Each component has internal buffers and event queues that kept filling during isolation. When we reconnected, they all tried to flush simultaneously.
The Physics Analogy:
Think of it like a dam:
- Water (events) doesn't disappear when blocked
- It accumulates pressure behind the dam (in buffers)
- Opening the gates all at once creates a flood (thundering herd)
We triggered a synchronized flood that exceeded gRPC's default 16MB message limit.
Normal operation:   API ←→ Kubelet      (small, continuous updates)
During disconnect:  API ✗✗✗ Kubelet     (queues fill on both sides)
Reconnection:       API ←←←←← Kubelet   (MASSIVE FLOOD, exceeds limits)
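To get a feel for the scale, here's a back-of-envelope shell calculation. The rates are assumptions for illustration, not measured values, but even a modest two 2KiB status updates per second, queued for 24 hours, dwarfs a 16MiB message limit.

```shell
# Assumed figures (illustrative): 2 status updates/sec at ~2KiB each, for 24h
updates_per_sec=2
bytes_per_update=2048
seconds=$((24 * 3600))                       # 86400 seconds in 24h
backlog=$((updates_per_sec * bytes_per_update * seconds))
limit=$((16 * 1024 * 1024))                  # gRPC default: 16MiB = 16777216
echo "backlog: $backlog bytes"
echo "limit:   $limit bytes"
echo "overrun: $((backlog / limit))x the limit"
```

Under these assumptions the backlog is roughly 21 times the default message limit, and that's a single component's queue.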
The YAML Trap: Silent Configuration Corruption
In the chaos, we tried to adjust settings in /etc/rancher/rke2/config.yaml. We wanted to increase max-pods and fix the CPU manager policy.
Here was our config:
# Block 1 - Our original settings
kubelet-arg:
- "max-pods=200"
- "cpu-manager-policy=static"
- "reserved-cpus=0-7"
# ... 50 lines later ...
# Block 2 - Added during troubleshooting
kubelet-arg:
- "event-qps=10"
- "event-burst=50"
The Problem:
YAML doesn't merge duplicate keys; most parsers silently keep only the last block, which completely overwrites the first.
Our effective config became:
kubelet-arg:
- "event-qps=10"
- "event-burst=50"
# Everything else GONE: max-pods, cpu-manager-policy, reserved-cpus
RKE2 silently reverted to defaults:
- max-pods: 110 (default) instead of 200
- cpu-manager-policy: "none" (default) instead of "static"
- reserved-cpus: not set
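You can reproduce the overwrite and catch it textually. A parser can't help here, because duplicates are merged before any query runs; a raw-text scan can. A sketch, using a hypothetical sample file under /tmp:

```shell
# Hypothetical sample reproducing the collision
cat > /tmp/dup-demo.yaml <<'EOF'
kubelet-arg:
  - "max-pods=200"
kubelet-arg:
  - "event-qps=10"
EOF

# Scan the raw text for repeated top-level keys (a parser would have merged them)
dups=$(awk -F: '/^[A-Za-z0-9_-]+:/ { if (seen[$1]++) print "duplicate key: " $1 }' /tmp/dup-demo.yaml)
echo "$dups"
```

This prints `duplicate key: kubelet-arg`, which is exactly the warning no YAML tool gave us during the incident.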
This created the next domino...
The CPU Manager "Memory": State File Conflict
The Kubelet has a state file at /var/lib/kubelet/cpu_manager_state that acts as its "memory":
{
"policyName": "static",
"defaultCpuSet": "8-95",
"entries": {
"pod-uid-1": {
"container-1": "16-23"
}
},
"checksum": 123456789
}
The Kubelet's Boot Logic:
// Pseudo-code of what Kubelet does on startup
func (cm *cpuManager) Start() error {
configPolicy := cm.getConfiguredPolicy() // "none" (from corrupted YAML)
statePolicy := cm.readStateFile() // "static" (from last run)
if configPolicy != statePolicy {
return CrashWithError("Policy mismatch! Cannot guarantee safe CPU allocation")
}
// Continue only if they match
}
Why Does It Crash Instead of Adapting?
This is intentional design. The Kubelet follows a fail-fast philosophy:
Better to crash than silently corrupt workload CPU pinning.
If it switched from static to none without intervention, pods that were guaranteed exclusive CPUs would suddenly share them, potentially violating SLAs or causing performance degradation in latency-sensitive workloads.
The RKE2 Amplification:
In RKE2, the control plane runs as static pods managed by the Kubelet:
Kubelet crashes → Static pods can't start → Control plane offline → Cluster dead
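The same fail-fast check can be mimicked in shell against a sample state file. This is illustrative only (the real kubelet does this in Go, as in the pseudo-code above); the paths and values mirror the incident.

```shell
# Sample state file standing in for /var/lib/kubelet/cpu_manager_state
state=/tmp/cpu_manager_state_demo
printf '{"policyName":"static","defaultCpuSet":"8-95","checksum":123456789}\n' > "$state"

configured="none"   # what the collapsed YAML now implies
persisted=$(sed -n 's/.*"policyName":"\([a-z]*\)".*/\1/p' "$state")

if [ "$configured" != "$persisted" ]; then
  # The real kubelet exits here, which in RKE2 takes the control plane with it
  echo "policy mismatch: expected $persisted, got $configured"
fi
```

Compare the printed message with the real kubelet error at the top of this post: same shape, same cause.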
The Volatile GPU Configuration
While troubleshooting, we discovered our NVIDIA MIG instances had vanished.
What are MIG instances?
Multi-Instance GPU (MIG) lets you partition a single A100 into up to 7 isolated GPU instances, each with dedicated memory and compute.
The Problem:
MIG configurations are volatile: the GPU instances live in driver state and are lost on reboot unless explicitly recreated or persisted.
Our MIG creation was done manually:
nvidia-smi mig -cgi 9,9,9,9,9,9,9 -C
This creates the instances in runtime memory only. After the chaos forced a reboot, they were gone.
The Recovery: Step-by-Step Resurrection
Here's how I brought the cluster back from the dead. If you find yourself in this situation, this is the path out.
Step 1: Fix the YAML Collision
Consolidate into a single kubelet-arg block:
kubelet-arg:
- "max-pods=200"
- "cpu-manager-policy=static"
- "reserved-cpus=0-7"
- "cpu-manager-reconcile-period=10s"
- "event-qps=10"
- "event-burst=50"
Pro-tip: Validate your YAML before applying. Note that yq and most YAML parsers keep only the last duplicate key, so querying the parsed document cannot reveal the collision; you need a tool or check that works on the raw text:
# yamllint flags duplicate keys out of the box (key-duplicates rule)
yamllint config.yaml
# Or grep the raw text for repeated keys
grep -c '^kubelet-arg:' config.yaml   # anything greater than 1 is a collision
Step 2: Clear the CPU Manager State
The Problem:
State file says "policyName": "static", but config now says none (due to YAML corruption).
The Fix:
# Stop RKE2
systemctl stop rke2-server
# Remove the conflicting state
rm -f /var/lib/kubelet/cpu_manager_state
# Restart with clean state
systemctl start rke2-server
# Verify Kubelet is healthy
journalctl -u rke2-server -f | grep "cpu_manager"
Expected output:
Successfully initialized CPU manager with policy "static"
Step 3: Increase gRPC Message Limits
The Problem:
The accumulated 24-hour backlog produced messages exceeding containerd's default max_recv_message_size of 16MiB (16,777,216 bytes; the failing message at the top of this post was 16,782,685 bytes, just over the line).
The Fix:
Edit /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl:
version = 2
[grpc]
max_recv_message_size = 67108864 # 64MB (4x default)
max_send_message_size = 67108864 # 64MB
[plugins."io.containerd.grpc.v1.cri"]
# ... rest of config ...
Why .tmpl and not .toml?
RKE2 regenerates config.toml from the template on restart. If you edit config.toml directly, your changes will be overwritten.
Apply the change:
systemctl restart rke2-server
Verify:
grep max_recv_message_size /var/lib/rancher/rke2/agent/etc/containerd/config.toml
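If you'd rather assert the limit programmatically than eyeball grep output, here's a small sketch. A sample file stands in for the rendered config; 64MiB is 67,108,864 bytes.

```shell
# Sample standing in for the rendered containerd config.toml
cat > /tmp/containerd-demo.toml <<'EOF'
[grpc]
  max_recv_message_size = 67108864
  max_send_message_size = 67108864
EOF

# Extract the recv limit and assert it is at least 64MiB
size=$(sed -n 's/.*max_recv_message_size = \([0-9]*\).*/\1/p' /tmp/containerd-demo.toml)
if [ "$size" -ge $((64 * 1024 * 1024)) ]; then
  echo "gRPC recv limit OK: $size bytes"
fi
```

Pointed at the real rendered file, a check like this makes a good post-restart smoke test, catching the case where RKE2 regenerated config.toml from an unedited template.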
Step 4: Restore and Persist MIG Configuration
Immediate restoration:
# Disable MIG mode
sudo nvidia-smi -i 0 -mig 0
# Re-enable with persistence
sudo nvidia-smi -i 0 -mig 1
# Create 7 balanced instances (1g.10gb profiles)
sudo nvidia-smi mig -cgi 9,9,9,9,9,9,9 -C
# Verify
nvidia-smi mig -lgi
Expected output:
+----+--------+------+--------+
| ID | Name | Size | Memory |
+----+--------+------+--------+
| 0 | 1g.10gb| 1/7 | 10GB |
| 1 | 1g.10gb| 1/7 | 10GB |
...
Making It Permanent:
Create /etc/systemd/system/nvidia-mig-setup.service:
[Unit]
Description=NVIDIA MIG Configuration
After=nvidia-persistenced.service
[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -mig 1
ExecStart=/usr/bin/nvidia-smi mig -cgi 9,9,9,9,9,9,9 -C
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
Enable it:
sudo systemctl daemon-reload
sudo systemctl enable nvidia-mig-setup.service
sudo systemctl start nvidia-mig-setup.service
Or use NVIDIA GPU Operator (preferred for production; field names vary across operator versions, so check the ClusterPolicy reference for yours):
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
name: gpu-cluster-policy
spec:
mig:
strategy: mixed
config:
name: all-balanced
persistent: true # ← Critical for survival across reboots
The Timeline: How It All Unfolded
T+0h : Firewall blocks outbound traffic (DGX -> WAN)
T+24h : Firewall restored. Here comes the pain.
T+24h:05 : gRPC limits exceeded, messages dropped
T+24h:10 : API server overwhelmed, starts rejecting requests
T+24h:15 : Attempted config fix. YAML collision introduced.
T+24h:20 : Kubelet detects cpu_manager_state mismatch. Crash loop begins.
T+24h:25 : Static pods (control plane) can't start
T+24h:30 : Emergency reboot to "fix it"
T+24h:35 : MIG instances lost in reboot
T+24h:40 : GPUs show 0 capacity in cluster
T+25h : Coffee break to prevent making things worse
T+26h : Begin systematic recovery (steps above)
T+28h : Cluster fully restored
Key Takeaways
Air-Gap by Design vs Air-Gap by Force
A purpose-built air-gapped cluster has:
- Local container registries (Harbor, Nexus)
- Pre-cached images and Helm charts
- Internal NTP servers
- Properly sized buffers for offline operation
- No expectation of external connectivity
Forcing isolation on a live cluster creates state debt that accumulates and must eventually be repaid, often violently.
Kubernetes Has Memory (And It's Unforgiving)
State files don't reset when you disconnect the network:
- /var/lib/kubelet/cpu_manager_state
- /var/lib/kubelet/memory_manager_state
- /var/lib/kubelet/device-plugins/
- Event queues in etcd
- Metrics buffers in Prometheus/monitoring agents
They accumulate, and they demand consistency with your configuration.
YAML Is Silently Dangerous
Duplicate keys overwrite without warning, and parsers merge them before any query can see them. Always validate the raw text:
# yamllint flags duplicate keys by default
yamllint config.yaml
# Or grep for repeated keys
grep -c '^kubelet-arg:' config.yaml
gRPC Limits Matter at Scale
Default 16MB message sizes are fine for steady-state operations but catastrophic during:
- Reconnection storms
- Mass pod evictions
- Large ConfigMap/Secret updates
- Metric backlog flushes
Set conservative limits:
max_recv_message_size = 67108864 # 64MB minimum for production
GPU Configurations Are Volatile Unless Persisted
MIG instances, GPU clock settings, power limits: none of these survive reboots unless you:
- Use systemd services
- Use NVIDIA GPU Operator with persistent: true
- Script them into your node provisioning
What Would I Do Differently?
If I had to run this experiment again:
Before Disconnecting:
1. Baseline the cluster:
kubectl get events --all-namespaces > events-before.txt
kubectl top nodes > resources-before.txt
2. Increase buffer sizes proactively:
# In config.toml.tmpl BEFORE going dark
max_recv_message_size = 67108864
3. Snapshot critical state:
cp -r /var/lib/kubelet /backup/kubelet-state-$(date +%s)
4. Document GPU config:
nvidia-smi mig -lgi > mig-config-backup.txt
During Disconnect:
1. Monitor buffer growth:
# Watch journald size
journalctl --disk-usage
# Watch etcd size
du -sh /var/lib/rancher/rke2/server/db/etcd
2. Set alerts for buffer thresholds:
# Prometheus alert
- alert: EventBufferBuildingUp
expr: sum(rate(apiserver_audit_event_total[5m])) > 100
During Reconnection:
1. Gradual restoration with rate limiting:
# Instead of instant "all at once", accept a limited rate and drop the rest
iptables -A OUTPUT -m limit --limit 100/sec --limit-burst 200 -j ACCEPT
iptables -A OUTPUT -j DROP   # without this, excess packets fall through to the default policy and nothing is actually limited
2. Monitor in real-time:
# Watch gRPC metrics
watch -n 1 'crictl stats'
# Watch API server load
kubectl get --raw /metrics | grep apiserver_request_duration
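The gradual-restoration idea from step 1 can be expressed as a schedule. This dry-run sketch only prints the commands it would apply (the rates, the chain layout, and the rule position are assumptions; adapt them to your ruleset before running anything for real):

```shell
# Print, rather than apply, a staged schedule that raises the outbound cap
# in steps, letting queues drain between stages instead of flooding at once
cmds=$(for rate in 10 50 100 500; do
  echo "iptables -R OUTPUT 1 -m limit --limit ${rate}/sec --limit-burst $((rate * 2)) -j ACCEPT"
done)
echo "$cmds"
```

In a real run you would sleep a few minutes between stages and watch API server latency before raising the cap again.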
For Production:
Don't do this experiment on production. Build a dedicated air-gap cluster from day one.
Final Thoughts
Kubernetes isn't just software: it's a distributed state machine with physical properties. It has inertia. It has momentum. It seeks equilibrium. And when you force it into isolation, you're not pressing pause; you're compressing a spring.
The longer it's compressed, the more violently it releases.
I learned this the hard way. The cluster didn't just "wake up" after 24 hours; it exploded with accumulated state debt. And every decision I made during the panic (the YAML edits, the emergency reboots) made things worse, because I didn't respect the underlying physics of the system.
The lesson: Respect the physics. Prepare your buffers. Validate your state. And remember that Kubernetes is a living, stateful organism with memory, momentum, and the capacity for catastrophic failure if mishandled.
What's Next?
If you're planning to test air-gap scenarios:
- Build a dedicated air-gapped cluster from scratch
- Use local registries and pre-cache all images
- Script your GPU/MIG configurations into systemd
- Increase gRPC limits before testing
- Validate your YAML rigorously
- Test your buffers before going dark
Have you experienced similar state debt catastrophes? I'd be curious to hear how you handled them.