The Physics of Kubernetes: What Happens When You Force a Live Cluster Into Isolation

The Setup: A Simple Request

It's a classic Tuesday. My Dev Manager walks in with what sounds like a straightforward request:

"We need to simulate an air-gapped environment for 24 hours to see how the app handles zero-internet connectivity. Can you just block outbound traffic on the firewall for our RKE2 node?"

We have a dedicated air-gapped cluster that was built for this exact purpose: Air-gap by Design. But this time, they wanted to test our main development node: a single-node RKE2 cluster (a DGX system) that normally lives, breathes, and updates via the internet.

If you've ever considered:

  • Testing disaster recovery by disconnecting networks
  • Simulating air-gapped deployments on existing infrastructure
  • Investigating "what if the WAN goes down for a day"

This post-mortem will save you from catastrophic failure.


The Experiment: 24 Hours in the Dark

I configured the firewall to block outbound traffic from the DGX to the WAN. The cluster could still communicate internally, but it couldn't reach the internet. For 24 hours, everything looked perfectly normal. The cluster kept running, pods stayed healthy, the application continued serving requests.

Then I restored the connection.

The cluster didn't just "wake up". It went into a violent death spiral.


The Symptoms: When Everything Breaks at Once

Within seconds of reconnection:

gRPC Floods:

rpc error: code = ResourceExhausted desc = grpc: received message
larger than max (16782685 vs. 16777216)

Control Plane Failure:

Failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods":
context deadline exceeded

Kubelet Crash Loop:

Failed to initialize CPU manager: policy mismatch: expected static,
got none

GPUs Vanished:

Our NVIDIA A100 GPUs with their MIG instances completely disappeared from the cluster capacity:

nvidia.com/gpu: 0

We were looking at total system failure.


The Root Cause: Understanding Kubernetes as a Physical System

After diving deep into journalctl logs, containerd traces, and Kubelet state files, I realized we weren't just dealing with a "network glitch." We had triggered a cascade failure caused by the fundamental physics of how Kubernetes manages state.

Think of Kubernetes not as software, but as a thermodynamic system:

  • It has potential energy (pending events in buffers)
  • It has equilibrium states (desired = actual)
  • It has phase transitions (reconnection events)
  • It has memory (state files and checkpoints)

And like any physical system, if you don't manage the stored energy properly, it all releases at once, creating entropy (chaos).


The Four Forces That Destroyed Our Cluster

The "Thundering Herd" Effect: Accumulated State Debt

What is the Reconciliation Loop?

Kubernetes operates on continuous reconciliation loops: every component constantly compares desired state with actual state and attempts to converge them:

┌─────────────────────────────────────────┐
│   Desired State (What should exist)    │
│              (YAML, API)                │
└──────────────┬──────────────────────────┘
               │
               │ Reconciliation Loop
               ▼
┌─────────────────────────────────────────┐
│   Actual State (What exists now)       │
│        (Kubelet, Runtime, Nodes)        │
└─────────────────────────────────────────┘
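The loop itself is simple enough to sketch. Here's a minimal, illustrative Python version (the names are mine, not the real Kubelet API): it diffs desired against actual state and emits the actions needed to converge them.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Diff desired vs. actual state and return converging actions."""
    actions = []
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actions.append(("apply", name, spec))   # create or update
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))        # garbage-collect strays
    return actions

# One iteration: the API says "web" should have 3 replicas,
# but the node is running 2, plus a pod nobody asked for.
desired = {"web": {"replicas": 3}}
actual = {"web": {"replicas": 2}, "orphan": {"replicas": 1}}
print(reconcile(desired, actual))
# [('apply', 'web', {'replicas': 3}), ('delete', 'orphan')]
```

A real controller runs this on a timer (and on watch events), forever. That "forever" is exactly what matters here: the loop keeps producing actions whether or not anyone is listening.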

What Happened During the 24-Hour Disconnect?

These loops didn't stop; their pending work accumulated. Here's what was building up:

What each component queued:

  • Kubelet → API Server: pod status updates, node heartbeats, metric snapshots, volume mount events
  • Container Runtime → Kubelet: image pull failures, container state changes, resource usage updates
  • Device Plugins → Kubelet: GPU availability reports, topology updates, allocation failures
  • CNI Plugins → Control Plane: network policy reconciliation attempts, IP allocation requests
  • Controllers → API Server: ReplicaSet scaling events, Service endpoint updates

Each component has internal buffers and event queues that kept filling during isolation. When we reconnected, they all tried to flush simultaneously.

The Physics Analogy:

Think of it like a dam:

  • Water (events) doesn't disappear when blocked
  • It accumulates pressure behind the dam (in buffers)
  • Opening the gates all at once creates a flood (thundering herd)

We triggered a synchronized flood that exceeded gRPC's default 16MB message limit.

Normal Operation:          During Disconnect:        Reconnection:

API ←→ Kubelet            API ✗✗✗ Kubelet          API ← ← ← ← Kubelet
  ↕                         ↓       ↓                 ↑↑↑↑↑↑↑↑↑
Small, continuous         Queue    Queue            MASSIVE FLOOD
updates                   fills    fills            (exceeds limits)
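You can model the flood with a few lines of Python. This is a toy (real components batch and retry in more complex ways), but the arithmetic is exactly what bit us: a modest per-second trickle, times 86,400 seconds, flushed as one message, blows past the 16 MiB default.

```python
GRPC_MAX = 16 * 1024 * 1024  # 16 MiB, containerd's default message limit

def flush(queue: list, max_bytes: int = GRPC_MAX) -> int:
    """Send the whole backlog as one message, as a naive client would."""
    payload = sum(len(event) for event in queue)
    if payload > max_bytes:
        raise RuntimeError(f"ResourceExhausted: {payload} vs. max {max_bytes}")
    return payload

print(flush([b"x" * 200]))          # steady state: one second of updates -> 200

backlog = [b"x" * 200] * 86_400     # 24 hours of queued updates, flushed at once
try:
    flush(backlog)
except RuntimeError as e:
    print(e)  # ResourceExhausted: 17280000 vs. max 16777216
```

200 bytes per second is nothing in steady state; as a single 24-hour batch it is 17,280,000 bytes, comfortably over the 16,777,216-byte ceiling.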

The YAML Trap: Silent Configuration Corruption

In the chaos, we tried to adjust settings in /etc/rancher/rke2/config.yaml. We wanted to increase max-pods and fix the CPU manager policy.

Here was our config:

# Block 1 - Our original settings
kubelet-arg:
  - "max-pods=200"
  - "cpu-manager-policy=static"
  - "reserved-cpus=0-7"

# ... 50 lines later ...

# Block 2 - Added during troubleshooting
kubelet-arg:
  - "event-qps=10"
  - "event-burst=50"

The Problem:

YAML doesn't merge duplicate keys. Most parsers silently keep the last occurrence, so the second block completely replaces the first.

Our effective config became:

kubelet-arg:
  - "event-qps=10"
  - "event-burst=50"
# Everything else GONE: max-pods, cpu-manager-policy, reserved-cpus

RKE2 silently reverted to defaults:

  • max-pods: 110 (default) instead of 200
  • cpu-manager-policy: "none" (default) instead of "static"
  • reserved-cpus: not set
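The cruel part is that a parser can't warn you: by the time the document is loaded, the duplicate is already gone. A raw-text scan catches it, though. Here's a stdlib-only sketch (the helper name is mine, not an RKE2 tool):

```python
import re

def duplicate_top_level_keys(text: str) -> set:
    """Scan raw YAML text for top-level keys that appear more than once."""
    keys = re.findall(r"^([\w.-]+):", text, flags=re.M)
    return {key for key in keys if keys.count(key) > 1}

config = """\
kubelet-arg:
  - "max-pods=200"
  - "cpu-manager-policy=static"
kubelet-arg:
  - "event-qps=10"
"""
print(duplicate_top_level_keys(config))  # {'kubelet-arg'}
```

Running something like this in CI against every config change would have caught our collision before the Kubelet ever saw it.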

This created the next domino...


The CPU Manager "Memory": State File Conflict

The Kubelet has a state file at /var/lib/kubelet/cpu_manager_state that acts as its "memory":

{
  "policyName": "static",
  "defaultCpuSet": "8-95",
  "entries": {
    "pod-uid-1": {
      "container-1": "16-23"
    }
  },
  "checksum": 123456789
}

The Kubelet's Boot Logic:

// Pseudo-code of what Kubelet does on startup
func (cm *cpuManager) Start() error {
    configPolicy := cm.getConfiguredPolicy()  // "none" (from corrupted YAML)
    statePolicy := cm.readStateFile()         // "static" (from last run)

    if configPolicy != statePolicy {
        return CrashWithError("Policy mismatch! Cannot guarantee safe CPU allocation")
    }

    // Continue only if they match
}

Why Does It Crash Instead of Adapting?

This is intentional design. The Kubelet follows a fail-fast philosophy:

Better to crash than silently corrupt workload CPU pinning.

If it switched from static to none without intervention, pods that were guaranteed exclusive CPUs would suddenly share them-potentially violating SLAs or causing performance degradation in latency-sensitive workloads.
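The same check, as a runnable sketch (my own Python rendition of the pseudo-code above, not actual Kubelet source):

```python
import json
import tempfile

def start_cpu_manager(configured_policy: str, state_path: str) -> str:
    """Refuse to start if the persisted policy disagrees with the config."""
    try:
        with open(state_path) as f:
            persisted = json.load(f)["policyName"]
    except FileNotFoundError:
        return configured_policy  # no state yet: adopt the configured policy
    if persisted != configured_policy:
        raise RuntimeError(
            f"policy mismatch: expected {persisted}, got {configured_policy}")
    return configured_policy

# Simulate our outage: the state file from the last run says "static",
# but the corrupted YAML now configures "none".
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"policyName": "static"}, f)

try:
    start_cpu_manager("none", f.name)
except RuntimeError as e:
    print(e)  # policy mismatch: expected static, got none
```

This is also why deleting the state file works as a recovery step: it puts the Kubelet on the "no state yet" path, where there's no memory left to contradict the configuration.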

The RKE2 Amplification:

In RKE2, the control plane runs as static pods managed by the Kubelet:

Kubelet crashes → Static pods can't start → Control plane offline → Cluster dead

The Volatile GPU Configuration

While troubleshooting, we discovered our NVIDIA MIG instances had vanished.

What are MIG instances?
Multi-Instance GPU (MIG) lets you partition a single A100 into up to 7 isolated GPU instances, each with dedicated memory and compute.

The Problem:
MIG configurations are volatile: they live in the GPU's runtime state and are lost on reboot unless explicitly persisted.

Our MIG creation was done manually (profile 19 is the 1g.10gb slice on an A100 80GB; seven of them fill the GPU):

sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

This creates the instances in runtime memory only. After the chaos forced a reboot, they were gone.


The Recovery: Step-by-Step Resurrection

Here's how I brought the cluster back from the dead. If you find yourself in this situation, this is the path out.

Step 1: Fix the YAML Collision

Consolidate into a single kubelet-arg block:

kubelet-arg:
  - "max-pods=200"
  - "cpu-manager-policy=static"
  - "reserved-cpus=0-7"
  - "cpu-manager-reconcile-period=10s"
  - "event-qps=10"
  - "event-burst=50"

Pro-tip: Validate your YAML before applying. A parser-based tool like yq can't catch this, because duplicate keys are already collapsed by the time the document is loaded. Check the raw text instead:

# Count occurrences of the key (anything > 1 is a collision)
grep -c '^kubelet-arg:' config.yaml

# yamllint flags duplicate keys (key-duplicates rule, on by default)
yamllint config.yaml

Step 2: Clear the CPU Manager State

The Problem:
State file says "policyName": "static", but config now says none (due to YAML corruption).

The Fix:

# Stop RKE2
systemctl stop rke2-server

# Remove the conflicting state
rm -f /var/lib/kubelet/cpu_manager_state

# Restart with clean state
systemctl start rke2-server

# Verify Kubelet is healthy
journalctl -u rke2-server -f | grep "cpu_manager"

Expected output:

Successfully initialized CPU manager with policy "static"

Step 3: Increase gRPC Message Limits

The Problem:
The accumulated 24-hour backlog exceeded Containerd's default max_recv_message_size of 16MB.

The Fix:

Edit /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl:

version = 2

[grpc]
  max_recv_message_size = 67108864  # 64MB (4x default)
  max_send_message_size = 67108864  # 64MB

[plugins."io.containerd.grpc.v1.cri"]
  # ... rest of config ...

Why .tmpl and not .toml?
RKE2 regenerates config.toml from the template on restart. If you edit config.toml directly, your changes will be overwritten.

Apply the change:

systemctl restart rke2-server

Verify:

grep max_recv_message_size /var/lib/rancher/rke2/agent/etc/containerd/config.toml

Step 4: Restore and Persist MIG Configuration

Immediate restoration:

# Disable, then re-enable MIG mode to start from a clean slate
sudo nvidia-smi -i 0 -mig 0
sudo nvidia-smi -i 0 -mig 1

# Recreate 7 balanced 1g.10gb instances (profile 19)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# Verify
nvidia-smi mig -lgi

Expected output:

+----+--------+------+--------+
| ID |  Name  | Size | Memory |
+----+--------+------+--------+
|  0 | 1g.10gb|  1/7 |  10GB  |
|  1 | 1g.10gb|  1/7 |  10GB  |
...

Making It Permanent:

Create /etc/systemd/system/nvidia-mig-setup.service:

[Unit]
Description=NVIDIA MIG Configuration
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -mig 1
ExecStart=/usr/bin/nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable it:

sudo systemctl daemon-reload
sudo systemctl enable nvidia-mig-setup.service
sudo systemctl start nvidia-mig-setup.service

Or use NVIDIA GPU Operator (preferred for production):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: mixed
    config:
      name: all-balanced
      persistent: true  # ← Critical for survival across reboots

The Timeline: How It All Unfolded

T+0h     : Firewall blocks outbound traffic (DGX -> WAN)
T+24h    : Firewall restored. Here comes the pain.
T+24h:05 : gRPC limits exceeded, messages dropped
T+24h:10 : API server overwhelmed, starts rejecting requests
T+24h:15 : Attempted config fix. YAML collision introduced.
T+24h:20 : Kubelet detects cpu_manager_state mismatch. Crash loop begins.
T+24h:25 : Static pods (control plane) can't start
T+24h:30 : Emergency reboot to "fix it"
T+24h:35 : MIG instances lost in reboot
T+24h:40 : GPUs show 0 capacity in cluster
T+25h    : Coffee break to prevent making things worse
T+26h    : Begin systematic recovery (steps above)
T+28h    : Cluster fully restored

Key Takeaways

Air-Gap by Design vs Air-Gap by Force

A purpose-built air-gapped cluster has:

  • Local container registries (Harbor, Nexus)
  • Pre-cached images and Helm charts
  • Internal NTP servers
  • Properly sized buffers for offline operation
  • No expectation of external connectivity

Forcing isolation on a live cluster creates state debt that accumulates and must eventually be repaid, often violently.

Kubernetes Has Memory (And It's Unforgiving)

State files don't reset when you disconnect the network:

  • /var/lib/kubelet/cpu_manager_state
  • /var/lib/kubelet/memory_manager_state
  • /var/lib/kubelet/device-plugins/
  • Event queues in etcd
  • Metrics buffers in Prometheus/monitoring agents

They accumulate, and they demand consistency with your configuration.

YAML Is Silently Dangerous

Duplicate keys overwrite without warning. Always validate, and remember that a YAML parser has already collapsed the duplicates, so check the raw text:

# Count occurrences of the key (anything > 1 is a collision)
grep -c '^kubelet-arg:' config.yaml

# yamllint flags duplicate keys (key-duplicates rule, on by default)
yamllint config.yaml

gRPC Limits Matter at Scale

Default 16MB message sizes are fine for steady-state operations but catastrophic during:

  • Reconnection storms
  • Mass pod evictions
  • Large ConfigMap/Secret updates
  • Metric backlog flushes

Set conservative limits:

max_recv_message_size = 67108864  # 64MB minimum for production

GPU Configurations Are Volatile Unless Persisted

MIG instances, GPU clock settings, power limits: none of these survive reboots unless you:

  • Use systemd services
  • Use NVIDIA GPU Operator with persistent: true
  • Script them into your node provisioning

What Would I Do Differently?

If I had to run this experiment again:

Before Disconnecting:

1. Baseline the cluster:

kubectl get events --all-namespaces > events-before.txt
kubectl top nodes > resources-before.txt

2. Increase buffer sizes proactively:

# In config.toml.tmpl BEFORE going dark
max_recv_message_size = 67108864

3. Snapshot critical state:

cp -r /var/lib/kubelet /backup/kubelet-state-$(date +%s)

4. Document GPU config:

nvidia-smi mig -lgi > mig-config-backup.txt

During Disconnect:

1. Monitor buffer growth:

# Watch journald size
journalctl --disk-usage

# Watch etcd size
du -sh /var/lib/rancher/rke2/server/db/etcd

2. Set alerts for buffer thresholds:

# Prometheus alert
- alert: EventBufferBuildingUp
  expr: sum(rate(apiserver_audit_event_total[5m])) > 100

During Reconnection:

1. Gradual restoration with rate limiting:

# Instead of instant "all at once": accept a limited rate, drop the excess
# (sketch; in practice scope these rules to the WAN interface with -o <iface>)
iptables -A OUTPUT -m limit --limit 100/sec --limit-burst 200 -j ACCEPT
iptables -A OUTPUT -j DROP

2. Monitor in real-time:

# Watch container runtime load
watch -n 1 'crictl stats'

# Watch API server load
kubectl get --raw /metrics | grep apiserver_request_duration
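The limit-plus-burst idea behind that iptables rule is a token bucket, and the same mechanism works at the application level if you control the client doing the flushing. A hypothetical Python sketch (my own illustration, not a Kubernetes or iptables API):

```python
import time

class TokenBucket:
    """Allow ~rate events/sec plus a burst, like iptables' limit match."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate                  # tokens refilled per second
        self.capacity = burst             # maximum stored tokens
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller waits and retries, no flood

bucket = TokenBucket(rate=100, burst=200)
sent = sum(bucket.allow() for _ in range(1000))
print(sent)  # roughly the burst size (~200); the other ~800 must wait
```

The point is the shape, not the numbers: after a long disconnect, anything that drains a backlog should pay out gradually instead of in one synchronized burst.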

For Production:

Don't do this experiment on production. Build a dedicated air-gap cluster from day one.


Final Thoughts

Kubernetes isn't just software; it's a distributed state machine with physical properties. It has inertia. It has momentum. It seeks equilibrium. And when you force it into isolation, you're not pressing pause; you're compressing a spring.

The longer it's compressed, the more violently it releases.

I learned this the hard way. The cluster didn't just "wake up" after 24 hours; it exploded with accumulated state debt. And every decision I made during the panic (the YAML edits, the emergency reboots) made things worse, because I didn't respect the underlying physics of the system.

The lesson: Respect the physics. Prepare your buffers. Validate your state. And remember that Kubernetes is a living, stateful organism with memory, momentum, and the capacity for catastrophic failure if mishandled.


What's Next?

If you're planning to test air-gap scenarios:

  1. Build a dedicated air-gapped cluster from scratch
  2. Use local registries and pre-cache all images
  3. Script your GPU/MIG configurations into systemd
  4. Increase gRPC limits before testing
  5. Validate your YAML rigorously
  6. Test your buffers before going dark

Have you experienced similar state debt catastrophes? I'd be curious to hear how you handled them.