How a Single-Node Kubernetes Cluster Accidentally Grew to 550 Pods
A cautionary tale about scope creep, resource management, and the slippery slope of infrastructure growth that nobody planned for. Here's how a modest single-node cluster became an operational nightmare—and why "it's working fine" isn't good enough.
This is the story every DevOps engineer recognizes but hopes never to experience. A past client from my DevOps-as-a-Service days had a single-node Kubernetes cluster that started with reasonable intentions. It was supposed to be lightweight—a simple setup for a modest workload.
Then came the incremental additions. "Can we add this monitoring tool?" Sure. "The team needs this developer environment." No problem. "Let's deploy this analytics service too." Why not.
Before anyone realized what was happening, that single node was running 550 pods.
Five. Hundred. Fifty. Pods.
On. One. Node.
How It Started
The initial setup was sensible. The client needed a development environment that could handle a few microservices. A single-node Kubernetes cluster seemed appropriate:
- Moderate resource requirements
- Limited budget for infrastructure
- Small team with simple deployment needs
- Expected to host maybe 20-30 pods at most
The server was provisioned accordingly. Not overpowered, not underpowered. Just right for what was planned.
The deployment worked. Services came up. Developers were happy. Everything looked green.
The Slippery Slope
The problem with infrastructure is that it never stays static. Teams always need "just one more thing."
Months 1-3: The additions seemed reasonable.
Initial pods: ~25
+ Logging stack (Elasticsearch, Fluentd, Kibana): +15 pods
+ Monitoring (Prometheus, Grafana): +8 pods
+ Developer tools: +12 pods
Current total: ~60 pods
Still manageable. The node had capacity. Everything worked.
Months 4-6: The pace accelerated.
+ Multiple staging environments: +80 pods
+ CI/CD tooling: +25 pods
+ Message queue infrastructure: +30 pods
+ Database replicas: +20 pods
Current total: ~215 pods
At this point, we started seeing occasional DNS timeouts. Nothing consistent. Easy to blame on "network issues" or "transient problems." Restarts fixed it, so nobody dug deeper.
Months 7-12: The floodgates opened.
+ Per-developer namespaces: +150 pods
+ A/B testing environments: +80 pods
+ Partner integration sandboxes: +60 pods
+ Various "temporary" testing workloads: +45 pods
Final count: 550 pods
Each addition made sense in isolation. Nobody was trying to create a disaster. But nobody was looking at the aggregate picture either.
When The Cracks Appeared
The problems started subtly, then cascaded.
DNS Resolution Failures
Services randomly couldn't resolve each other. Pods would work fine for hours, then suddenly start failing DNS lookups. CoreDNS was overwhelmed, but monitoring didn't catch it because the failures were intermittent.
kubectl logs -n kube-system coredns-xyz
[ERROR] plugin/errors: 2 example.svc.cluster.local. A: read udp timeout
[ERROR] plugin/errors: 2 example.svc.cluster.local. A: read udp timeout
[ERROR] plugin/errors: 2 example.svc.cluster.local. A: read udp timeout
We scaled CoreDNS replicas. It helped for a week. Then the problem returned.
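For reference, the stopgap looked roughly like this. A minimal sketch, assuming the default kubeadm CoreDNS deployment name and labels in kube-system; adjust for your distribution:

# Scale out DNS (buys time, doesn't fix the underlying load):
kubectl -n kube-system scale deployment coredns --replicas=4

# Keep an eye on restarts and resource pressure on the DNS pods:
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system top pods -l k8s-app=kube-dns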
Internal Networking Breakdown
This was the really nasty part. The internal cluster networking started failing in ways that were barely documented online.
The iptables rule set had grown far beyond anything reasonable. The kernel's connection tracking table was constantly full. Every new pod deployment caused a wave of connection resets across existing pods.
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 262144
net.netfilter.nf_conntrack_max = 262144
Connection tracking table: maxed out. All the time.
Tuning nf_conntrack_max higher helped temporarily. But with 550 pods constantly communicating, we were playing whack-a-mole with kernel limits.
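If you end up in the same place, the mitigation is a couple of sysctls. A sketch with illustrative numbers, not a recommendation; note that kube-proxy also manages nf_conntrack_max through its own conntrack settings, so the two need to agree:

# Raise the conntrack ceiling at runtime:
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
# Grow the hash table to match (roughly max/4 buckets):
echo 262144 | sudo tee /sys/module/nf_conntrack/parameters/hashsize
# Persist across reboots:
echo "net.netfilter.nf_conntrack_max = 1048576" | sudo tee /etc/sysctl.d/99-conntrack.conf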
Etcd Under Siege
The etcd database was never designed for this scale on a single node.
Watch streams multiplied. Every pod, every controller, every service—all maintaining watch connections to etcd. The database struggled under constant write pressure from status updates across hundreds of pods.
etcdctl endpoint status --write-out=table
+------------------+---------+---------+--------+----------+
|     ENDPOINT     | DB SIZE | MEMBERS | ERRORS | LATENCY  |
+------------------+---------+---------+--------+----------+
| 127.0.0.1:2379   | 6.8 GB  |       1 |     47 |     3.2s |
+------------------+---------+---------+--------+----------+
etcd latency in the seconds range. Database size far beyond the 2 GB default quota and closing in on the 8 GB the etcd docs suggest as the maximum. Backup and restore operations taking 20+ minutes.
We were one power outage away from complete cluster loss.
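What kept it limping along was routine compaction and defragmentation. A sketch, assuming etcdctl v3 with the endpoint and certificate flags already exported, and jq installed:

# Compact away old revisions, then reclaim the space on disk:
rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"
etcdctl defrag
# Clear any NOSPACE alarm once the DB shrinks below the quota:
etcdctl alarm disarm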
Resource Starvation
CPU and memory were predictably exhausted, but the symptoms were weird.
Kubernetes system components fought for resources with application pods. The scheduler would slow to a crawl. Pod startups took minutes instead of seconds. Health checks timed out not because pods were unhealthy, but because the kubelet was too busy to respond.
┌─────────────────────────────────────┐
│        Single Node - 550 Pods       │
│                                     │
│  System Pods:  Fighting             │
│  App Pods:     Fighting             │
│  etcd:         Drowning             │
│  kubelet:      Overwhelmed          │
│  CoreDNS:      Timeout              │
│  kube-proxy:   Broken               │
│                                     │
│  Status: "Everything's on fire"     │
└─────────────────────────────────────┘
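If you suspect the same squeeze, a few read-only checks make it visible. A sketch; the node name is a placeholder, and kubectl top needs metrics-server:

# Node conditions flapping between True and False is the tell:
kubectl describe node <node-name> | grep -A 8 "Conditions:"
# API server health, broken down by individual check:
kubectl get --raw='/readyz?verbose'
# Which system pods are eating the node:
kubectl -n kube-system top pods --sort-by=cpu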
Why It Was Hard To Stop
By the time we recognized the problem, we were trapped.
The Boiling Frog Effect
No single addition pushed the cluster over the edge. It was gradual degradation. Each new problem seemed fixable with tuning. "Just increase this limit." "Just scale that component." "Just restart those pods."
Nobody wanted to admit we'd crossed the point of no return because that meant a major architectural overhaul.
Migration Costs
Moving to a multi-node cluster meant:
- Redesigning networking
- Implementing proper resource allocation
- Migrating stateful workloads
- Coordinating downtime across multiple teams
- Potentially weeks of work
The business pressure was constant: "It's working, why spend time fixing it?"
Except it wasn't working. It was barely surviving.
The Documentation Gap
Here's something that surprised me: there's very little online documentation about the failure modes of single-node clusters at extreme scale.
Most Kubernetes docs assume you're running a reasonable multi-node setup. Troubleshooting guides don't cover "what happens when you push a single node to 550 pods" because nobody expects anyone to do that.
We were in uncharted territory with almost no community knowledge to reference.
The Warning Signs We Ignored
Looking back, the red flags were everywhere:
1. DNS timeouts that "resolved themselves"
Transient DNS failures are never "just transient." They're symptoms of capacity problems.
2. Increasing restart frequency
When pods need more frequent restarts to "fix issues," that's not fixing—that's masking.
3. Deployment slowdowns
When a deployment that used to take 30 seconds starts taking 5 minutes, the cluster is telling you it's overloaded.
4. Node resource utilization above 80%
Sustained high utilization isn't running efficiently—it's running out of headroom.
5. etcd database size growth
A multi-gigabyte etcd database on a single-node cluster is a disaster waiting to happen.
6. The "just one more" pattern
When every request is "just add one more small thing," and nobody's tracking the aggregate, you're on the slope.
What We Learned
Set Hard Limits Early
Single-node clusters should have enforced pod limits. Not soft suggestions. Hard limits.
If we'd set a 100-pod limit from day one, the conversation would have been different. "We need to add more services" would have forced "then we need to redesign the infrastructure," not "sure, there's still technically room."
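For context, the kubelet's default is 110 pods per node, so reaching 550 meant someone had already raised that ceiling on purpose. Enforcing a hard cap is a one-line kubelet setting; a minimal sketch, applied through whatever provisions your kubelet configuration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# The node advertises capacity for at most 100 pods; nothing beyond that is admitted.
maxPods: 100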
"It's Working" Isn't Enough
The cluster was technically functional at 550 pods. Services were running. Most requests succeeded.
But operational quality was terrible:
- Deployments were nerve-wracking
- Debugging was nearly impossible
- Every change risked cascading failures
- On-call was exhausting
Infrastructure shouldn't just work. It should work reliably, predictably, and without constant intervention.
Track The Trajectory
We should have been graphing pod count over time with projected growth. A simple trend line would have shown the problem months before it became critical.
Week 1: 25 pods
Week 12: 60 pods
Week 24: 215 pods ← Should have triggered action here
Week 36: 420 pods ← Definitely should have acted here
Week 48: 550 pods ← Too late
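A trivial tracker would have been enough. A sketch, run from cron; the log path is arbitrary, and count(kube_pod_info) is the equivalent Prometheus query if kube-state-metrics is installed:

#!/usr/bin/env bash
# Append a timestamped pod count; graph the CSV and extrapolate the trend.
echo "$(date -Iseconds),$(kubectl get pods --all-namespaces --no-headers | wc -l)" >> /var/log/pod-count.csv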
Architecture Triggers
Define architectural inflection points upfront:
- Over 50 pods: Add monitoring and capacity planning
- Over 100 pods: Evaluate multi-node migration
- Over 150 pods: Multi-node is mandatory
These should be organization policies, not suggestions.
Technical Debt Compounds
Every week we delayed the migration, the problem got worse:
- More services to migrate
- More teams dependent on the current setup
- More tribal knowledge about workarounds
- More resistance to change
Infrastructure technical debt is like financial debt. The interest compounds, and eventually it's all you can afford to pay.
How To Avoid This
For New Clusters
Single-node Kubernetes is fine for:
- Learning and development
- Truly small workloads (sub-30 pods)
- Time-limited POCs
Single-node Kubernetes is dangerous for:
- Production workloads
- Anything expected to grow
- Multi-team environments
- Long-term infrastructure
If you're starting with a single node because "we'll add nodes later when we need them," ask yourself: who will make that call and when?
For Existing Clusters
If you're reading this because you're already on the slope, here are the breakpoints:
Under 100 pods: You have time. Start planning migration now before it becomes urgent.
100-200 pods: You're in the danger zone. Migrations take time. Start immediately.
Over 200 pods: You're in crisis territory even if everything seems fine. This is your last chance to migrate on your terms instead of during an outage.
Over 300 pods: Stop adding services. Freeze the environment and execute an emergency migration.
Visibility Tools
Set up monitoring for these metrics specifically:
# Pod count over time
kubectl get pods --all-namespaces --no-headers | wc -l
# etcd health and size
etcdctl endpoint health
etcdctl endpoint status
# Connection tracking
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Node resource pressure (requires metrics-server)
kubectl top nodes
Alert when:
- Pod count increases 20% in a month
- etcd database exceeds 2GB
- Connection tracking exceeds 80% capacity
- Node CPU/memory sustained above 70%
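Two of those thresholds translate directly into alert rules. A sketch, assuming Prometheus is scraping kube-state-metrics and etcd; verify the metric names against your versions:

groups:
- name: single-node-capacity
  rules:
  - alert: PodCountGrowingFast
    # More than 20% above the count from 30 days ago.
    expr: count(kube_pod_info) > 1.2 * count(kube_pod_info offset 30d)
    for: 1h
  - alert: EtcdDatabaseTooLarge
    # 2 GB is etcd's default backend quota; don't get anywhere near it on one node.
    expr: etcd_mvcc_db_total_size_in_bytes > 2e9
    for: 15m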
The Eventual Outcome
We eventually migrated. It took three weeks of planning and a weekend cutover. We moved to a proper multi-node cluster with resource quotas, proper monitoring, and architectural review processes for new services.
The migration was painful. Services broke in unexpected ways. Teams discovered hidden dependencies. We burned a lot of coffee and midnight oil.
But the cluster that emerged was sustainable. Deployments were fast again. DNS worked reliably. Etcd was healthy. The on-call burden dropped dramatically.
Most importantly: when someone asks "can we add one more service," we now have a process that evaluates impact before automatically saying yes.
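The per-team guardrail behind that review process looks roughly like this. A sketch with illustrative names and numbers:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    # Hard ceilings per namespace; requests beyond these are rejected at admission.
    pods: "30"
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi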
The TL;DR
- Single-node Kubernetes clusters have hard scaling limits that are easy to hit through incremental growth
- No single addition causes the problem—it's the accumulated pressure that breaks things
- Technical problems at extreme single-node scale are poorly documented and difficult to troubleshoot
- Warning signs like DNS timeouts, slow deployments, and resource pressure should trigger architectural review
- Set hard pod limits on single-node clusters and define migration triggers upfront
- Infrastructure should be evaluated on operational quality, not just "does it work"
- The cost of migration only increases with time—act early
The Real Lesson
This isn't really a story about Kubernetes. It's a story about incremental technical debt and organizational decision-making.
The infrastructure didn't fail because of one bad decision. It failed because of hundreds of small decisions that individually seemed fine but collectively created an unsustainable situation.
Infrastructure doesn't fail gradually—it fails catastrophically after gradually degrading.
Your job as a DevOps engineer isn't just to make things work today. It's to recognize when "making it work" is accumulating debt that will break things tomorrow.
Sometimes the right answer to "can we add one more thing?" is "no, not until we fix the foundation."
That's a hard conversation to have. But it's infinitely easier than the conversation you'll have when the cluster finally breaks under its own weight.
Have you experienced the slippery slope of infrastructure growth? Found yourself saying "just one more pod" once too often? I'd love to hear how you handled it—or how you're currently stuck in it. The patterns are remarkably similar across different teams and technologies.