How a Single-Node Kubernetes Cluster Accidentally Grew to 550 Pods
A cautionary tale about scope creep, resource management, and the slippery slope of infrastructure growth that nobody planned for. Here's how a modest single-node cluster became an operational nightmare—and why "it's working fine" isn't good enough.
This is the story every DevOps engineer recognizes but hopes never to experience. A past client from my DevOps-as-a-Service days had a single-node Kubernetes cluster that started with reasonable intentions. It was supposed to be lightweight—a simple setup for a modest workload.
Then came the incremental additions. "Can we add this monitoring tool?" Sure. "The team needs this developer environment." No problem. "Let's deploy this analytics service too." Why not.
Before anyone realized what was happening, that single node was running 550 pods.
Five. Hundred. Fifty. Pods.
On. One. Node.
How It Started
The initial setup was sensible. The client needed a development environment that could handle a few microservices. A single-node Kubernetes cluster seemed appropriate:
- Moderate resource requirements
- Limited budget for infrastructure
- Small team with simple deployment needs
- Expected to host maybe 20-30 pods at most
The server was provisioned accordingly. Not overpowered, not underpowered. Just right for what was planned.
The deployment worked. Services came up. Developers were happy. Everything looked green.
The Slippery Slope
The problem with infrastructure is that it never stays static. Teams always need "just one more thing."
Months 1-3: The additions seemed reasonable.
Initial pods: ~25
+ Logging stack (Elasticsearch, Fluentd, Kibana): +15 pods
+ Monitoring (Prometheus, Grafana): +8 pods
+ Developer tools: +12 pods
Current total: ~60 pods
Still manageable. The node had capacity. Everything worked.
Months 4-6: The pace accelerated.
+ Multiple staging environments: +80 pods
+ CI/CD tooling: +25 pods
+ Message queue infrastructure: +30 pods
+ Database replicas: +20 pods
Current total: ~215 pods
At this point, we started seeing occasional DNS timeouts. Nothing consistent. Easy to blame on "network issues" or "transient problems." Restarts fixed it, so nobody dug deeper.
Months 7-12: The floodgates opened.
+ Per-developer namespaces: +150 pods
+ A/B testing environments: +80 pods
+ Partner integration sandboxes: +60 pods
+ Various "temporary" testing workloads: +45 pods
Final count: 550 pods
Each addition made sense in isolation. Nobody was trying to create a disaster. But nobody was looking at the aggregate picture either.
When The Cracks Appeared
The problems started subtly, then cascaded.
DNS Resolution Failures
Services randomly couldn't resolve each other. Pods would work fine for hours, then suddenly start failing DNS lookups. CoreDNS was overwhelmed, but monitoring didn't catch it because the failures were intermittent.
kubectl logs -n kube-system coredns-xyz
[ERROR] plugin/errors: 2 example.svc.cluster.local. A: read udp timeout
[ERROR] plugin/errors: 2 example.svc.cluster.local. A: read udp timeout
[ERROR] plugin/errors: 2 example.svc.cluster.local. A: read udp timeout
We scaled CoreDNS replicas. It helped for a week. Then the problem returned.
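For reference, the stopgap looked roughly like this. A minimal sketch, assuming the default kubeadm CoreDNS deployment name and labels in kube-system; adjust for your distribution:

# Scale out DNS (buys time, doesn't fix the underlying load):
kubectl -n kube-system scale deployment coredns --replicas=4

# Keep an eye on restarts and resource pressure on the DNS pods:
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system top pods -l k8s-app=kube-dns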
Internal Networking Breakdown
This was the really nasty part. The internal cluster networking started failing in ways that were barely documented online.
The iptables rule set had grown far beyond anything reasonable. The kernel's connection tracking table was constantly full. Every new pod deployment caused a wave of connection resets across existing pods.
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 262144
net.netfilter.nf_conntrack_max = 262144
Connection tracking table: maxed out. All the time.
Tuning nf_conntrack_max higher helped temporarily. But with 550 pods constantly communicating, we were playing whack-a-mole with kernel limits.
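If you end up in the same place, the mitigation is a couple of sysctls. A sketch with illustrative numbers, not a recommendation; note that kube-proxy also manages nf_conntrack_max through its own conntrack settings, so the two need to agree:

# Raise the conntrack ceiling at runtime:
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
# Grow the hash table to match (roughly max/4 buckets):
echo 262144 | sudo tee /sys/module/nf_conntrack/parameters/hashsize
# Persist across reboots:
echo "net.netfilter.nf_conntrack_max = 1048576" | sudo tee /etc/sysctl.d/99-conntrack.conf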
Etcd Under Siege
The etcd database was never designed for this scale on a single node.
Watch streams multiplied. Every pod, every controller, every service—all maintaining watch connections to etcd. The database struggled under constant write pressure from status updates across hundreds of pods.
etcdctl endpoint status --write-out=table
+------------------+---------+---------+--------+----------+
|     ENDPOINT     | DB SIZE | MEMBERS | ERRORS | LATENCY  |
+------------------+---------+---------+--------+----------+
| 127.0.0.1:2379   | 6.8 GB  |       1 |     47 |     3.2s |
+------------------+---------+---------+--------+----------+
etcd latency in the seconds range. Database size far beyond the 2 GB default quota and closing in on the 8 GB the etcd docs suggest as the maximum. Backup and restore operations taking 20+ minutes.
We were one power outage away from complete cluster loss.
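What kept it limping along was routine compaction and defragmentation. A sketch, assuming etcdctl v3 with the endpoint and certificate flags already exported, and jq installed:

# Compact away old revisions, then reclaim the space on disk:
rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"
etcdctl defrag
# Clear any NOSPACE alarm once the DB shrinks below the quota:
etcdctl alarm disarm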
Resource Starvation
CPU and memory were predictably exhausted, but the symptoms were weird.
Kubernetes system components fought for resources with application pods. The scheduler would slow to a crawl. Pod startups took minutes instead of seconds. Health checks timed out not because pods were unhealthy, but because the kubelet was too busy to respond.
┌─────────────────────────────────────┐
│        Single Node - 550 Pods       │
│                                     │
│  System Pods:  Fighting             │
│  App Pods:     Fighting             │
│  etcd:         Drowning             │
│  kubelet:      Overwhelmed          │
│  CoreDNS:      Timeout              │
│  kube-proxy:   Broken               │
│                                     │
│  Status: "Everything's on fire"     │
└─────────────────────────────────────┘
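If you suspect the same squeeze, a few read-only checks make it visible. A sketch; the node name is a placeholder, and kubectl top needs metrics-server:

# Node conditions flapping between True and False is the tell:
kubectl describe node <node-name> | grep -A 8 "Conditions:"
# API server health, broken down by individual check:
kubectl get --raw='/readyz?verbose'
# Which system pods are eating the node:
kubectl -n kube-system top pods --sort-by=cpu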
Why It Was Hard To Stop
By the time we recognized the problem, we were trapped.
The Boiling Frog Effect
No single addition pushed the cluster over the edge. It was gradual degradation. Each new problem seemed fixable with tuning. "Just increase this limit." "Just scale that component." "Just restart those pods."
Nobody wanted to admit we'd crossed the point of no return because that meant a major architectural overhaul.
Migration Costs
Moving to a multi-node cluster meant:
- Redesigning networking
- Implementing proper resource allocation
- Migrating stateful workloads
- Coordinating downtime across multiple teams
- Potentially weeks of work
The business pressure was constant: "It's working, why spend time fixing it?"
Except it wasn't working. It was barely surviving.
The Documentation Gap
Here's something that surprised me: there's very little online documentation about the failure modes of single-node clusters at extreme scale.
Most Kubernetes docs assume you're running a reasonable multi-node setup. Troubleshooting guides don't cover "what happens when you push a single node to 550 pods" because nobody expects anyone to do that.
We were in uncharted territory with almost no community knowledge to reference.
The Warning Signs We Ignored
Looking back, the red flags were everywhere:
1. DNS timeouts that "resolved themselves"
Transient DNS failures are never "just transient." They're symptoms of capacity problems.
2. Increasing restart frequency
When pods need more frequent restarts to "fix issues," that's not fixing—that's masking.
3. Deployment slowdowns
When a deployment that used to take 30 seconds starts taking 5 minutes, the cluster is telling you it's overloaded.
4. Node resource utilization above 80%
Sustained high utilization isn't running efficiently—it's running out of headroom.
5. etcd database size growth
A multi-gigabyte etcd database on a single-node cluster is a disaster waiting to happen.
6. The "just one more" pattern
When every request is "just add one more small thing," and nobody's tracking the aggregate, you're on the slope.
What We Learned
Set Hard Limits Early
Single-node clusters should have enforced pod limits. Not soft suggestions. Hard limits.
If we'd set a 100-pod limit from day one, the conversation would have been different. "We need to add more services" would have forced "then we need to redesign the infrastructure," not "sure, there's still technically room."
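For context, the kubelet's default is 110 pods per node, so reaching 550 meant someone had already raised that ceiling on purpose. Enforcing a hard cap is a one-line kubelet setting; a minimal sketch, applied through whatever provisions your kubelet configuration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# The node advertises capacity for at most 100 pods; nothing beyond that is admitted.
maxPods: 100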
"It's Working" Isn't Enough
The cluster was technically functional at 550 pods. Services were running. Most requests succeeded.
But operational quality was terrible:
- Deployments were nerve-wracking
- Debugging was nearly impossible
- Every change risked cascading failures
- On-call was exhausting
Infrastructure shouldn't just work. It should work reliably, predictably, and without constant intervention.
Track The Trajectory
We should have been graphing pod count over time with projected growth. A simple trend line would have shown the problem months before it became critical.
Week 1: 25 pods
Week 12: 60 pods
Week 24: 215 pods ← Should have triggered action here
Week 36: 420 pods ← Definitely should have acted here
Week 48: 550 pods ← Too late
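A trivial tracker would have been enough. A sketch, run from cron; the log path is arbitrary, and count(kube_pod_info) is the equivalent Prometheus query if kube-state-metrics is installed:

#!/usr/bin/env bash
# Append a timestamped pod count; graph the CSV and extrapolate the trend.
echo "$(date -Iseconds),$(kubectl get pods --all-namespaces --no-headers | wc -l)" >> /var/log/pod-count.csv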
Architecture Triggers
Define architectural inflection points upfront:
- Over 50 pods: Add monitoring and capacity planning
- Over 100 pods: Evaluate multi-node migration
- Over 150 pods: Multi-node is mandatory
These should be organization policies, not suggestions.
Technical Debt Compounds
Every week we delayed the migration, the problem got worse:
- More services to migrate
- More teams dependent on the current setup
- More tribal knowledge about workarounds
- More resistance to change
Infrastructure technical debt is like financial debt. The interest compounds, and eventually it's all you can afford to pay.
How To Avoid This
For New Clusters
Single-node Kubernetes is fine for:
- Learning and development
- Truly small workloads (sub-30 pods)
- Time-limited POCs
Single-node Kubernetes is dangerous for:
- Production workloads
- Anything expected to grow
- Multi-team environments
- Long-term infrastructure
If you're starting with a single node because "we'll add nodes later when we need them," ask yourself: who will make that call and when?
For Existing Clusters
If you're reading this because you're already on the slope, here are the breakpoints:
Under 100 pods: You have time. Start planning migration now before it becomes urgent.
100-200 pods: You're in the danger zone. Migrations take time. Start immediately.
Over 200 pods: You're in crisis territory even if everything seems fine. This is your last chance to migrate on your terms instead of during an outage.
Over 300 pods: Stop adding services. Freeze the environment and execute an emergency migration.
Visibility Tools
Set up monitoring for these metrics specifically:
# Pod count over time
kubectl get pods --all-namespaces --no-headers | wc -l
# etcd health and size
etcdctl endpoint health
etcdctl endpoint status
# Connection tracking
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Node resource pressure (requires metrics-server)
kubectl top nodes
Alert when:
- Pod count increases 20% in a month
- etcd database exceeds 2GB
- Connection tracking exceeds 80% capacity
- Node CPU/memory sustained above 70%
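Two of those thresholds translate directly into alert rules. A sketch, assuming Prometheus is scraping kube-state-metrics and etcd; verify the metric names against your versions:

groups:
- name: single-node-capacity
  rules:
  - alert: PodCountGrowingFast
    # More than 20% above the count from 30 days ago.
    expr: count(kube_pod_info) > 1.2 * count(kube_pod_info offset 30d)
    for: 1h
  - alert: EtcdDatabaseTooLarge
    # 2 GB is etcd's default backend quota; don't get anywhere near it on one node.
    expr: etcd_mvcc_db_total_size_in_bytes > 2e9
    for: 15m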
The Eventual Outcome
We eventually migrated. It took three weeks of planning and a weekend cutover. We moved to a proper multi-node cluster with resource quotas, proper monitoring, and architectural review processes for new services.
The migration was painful. Services broke in unexpected ways. Teams discovered hidden dependencies. We burned a lot of coffee and midnight oil.
But the cluster that emerged was sustainable. Deployments were fast again. DNS worked reliably. Etcd was healthy. The on-call burden dropped dramatically.
Most importantly: when someone asks "can we add one more service," we now have a process that evaluates impact before automatically saying yes.
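The per-team guardrail behind that review process looks roughly like this. A sketch with illustrative names and numbers:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    # Hard ceilings per namespace; requests beyond these are rejected at admission.
    pods: "30"
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi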
The TL;DR
- Single-node Kubernetes clusters have hard scaling limits that are easy to hit through incremental growth
- No single addition causes the problem—it's the accumulated pressure that breaks things
- Technical problems at extreme single-node scale are poorly documented and difficult to troubleshoot
- Warning signs like DNS timeouts, slow deployments, and resource pressure should trigger architectural review
- Set hard pod limits on single-node clusters and define migration triggers upfront
- Infrastructure should be evaluated on operational quality, not just "does it work"
- The cost of migration only increases with time—act early
The Real Lesson
This isn't really a story about Kubernetes. It's a story about incremental technical debt and organizational decision-making.
The infrastructure didn't fail because of one bad decision. It failed because of hundreds of small decisions that individually seemed fine but collectively created an unsustainable situation.
Infrastructure doesn't fail gradually—it fails catastrophically after gradually degrading.
Your job as a DevOps engineer isn't just to make things work today. It's to recognize when "making it work" is accumulating debt that will break things tomorrow.
Sometimes the right answer to "can we add one more thing?" is "no, not until we fix the foundation."
That's a hard conversation to have. But it's infinitely easier than the conversation you'll have when the cluster finally breaks under its own weight.
Have you experienced the slippery slope of infrastructure growth? Found yourself saying "just one more pod" once too often? I'd love to hear how you handled it—or how you're currently stuck in it. The patterns are remarkably similar across different teams and technologies.