How DNS Resolution Became Our CI/CD Bottleneck at 600+ Daily Builds
When our development team scaled from 10 to 40 engineers, our Jenkins pipeline went from 50 builds per day to over 600. Everything seemed fine—until it wasn't. Builds started failing randomly with network timeouts. The culprit? DNS resolution. Here's how understanding Linux networking fundamentals saved our CI/CD pipeline.
This is the story of how rapid growth exposed a bottleneck we never saw coming, and why knowing your Linux networking stack matters more than you think.
The Growth Trajectory
Six months ago, our development team was lean. 10 engineers, a handful of microservices, maybe 50 builds on a busy day. Our Jenkins setup was straightforward:
- 3 executor nodes with 4 executors each (12 total concurrent builds)
- Builds pulling from GitHub, DockerHub, and internal artifact repositories
- Average build time: 8-12 minutes
- Everything worked smoothly
Then we raised funding. The team doubled. Then doubled again.
Month 1: 10 engineers → ~50 builds/day
Month 3: 20 engineers → ~150 builds/day
Month 6: 40 engineers → 600+ builds/day
Jenkins scaling:
Month 1: 3 nodes, 12 executors
Month 6: 8 nodes, 32 executors
We added more Jenkins executors. We upgraded our infrastructure. We optimized our pipelines. On paper, we had the capacity.
But the builds started failing.
The Symptoms
At first, it was sporadic. A build would fail with a network timeout. Retry it, and it worked. "Transient network issue," we thought. Move on.
Then it got worse.
The Error Messages
Error cloning repository: Could not resolve host: github.com
Timeout while connecting to registry.hub.docker.com
Failed to fetch package: Temporary failure in name resolution
curl: (6) Could not resolve host: api.internal.company.com
All DNS-related. All intermittent. All concentrated during peak build times (9am-11am, 2pm-4pm).
The Pattern
We started tracking failures:
Time Range       Builds   Failures   Failure Rate
08:00 - 09:00        45          2           4.4%
09:00 - 10:00       120         18          15.0%
10:00 - 11:00       145         27          18.6%
11:00 - 12:00        95          8           8.4%
12:00 - 13:00        50          1           2.0%
13:00 - 14:00        65          3           4.6%
14:00 - 15:00       135         22          16.3%
15:00 - 16:00       140         25          17.9%
Clear correlation: more concurrent builds = more failures. All DNS-related.
The Investigation
Standard debugging didn't reveal much. Network connectivity was fine. DNS servers were healthy. Queries from individual nodes worked perfectly.
But then we started digging into the actual query volume.
DNS Query Explosion
Each build was making dozens of DNS queries:
- Cloning from GitHub: github.com
- Pulling Docker images: registry.hub.docker.com, auth.docker.io, production.cloudflare.docker.com
- Downloading dependencies: npmjs.org, pypi.org, maven.org, etc.
- Internal services: api.internal.company.com, artifacts.internal.company.com
- External APIs: various third-party endpoints
Conservatively: 20-30 unique DNS queries per build.
Do the math:
Peak concurrent builds: 32 (all executors running)
DNS queries per build: ~25
Query frequency: every few seconds during active build phases
Peak load: 32 builds × 25 domains = 800+ DNS queries
Duration: sustained for 8-12 minutes per build wave
We were hammering our DNS infrastructure with thousands of queries per minute.
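If you want to sanity-check numbers like this on your own executors, a rough approach (assuming tcpdump is installed and only one build is running on the node, so the traffic isn't blended) is to capture port-53 traffic for the length of a build and count the distinct names queried:
# Capture DNS traffic for up to 10 minutes (roughly one build), then count unique query names
sudo timeout 600 tcpdump -i any -nn -l 'udp port 53' 2>/dev/null \
  | grep -oP '(A|AAAA)\? \K[^ ]+' \
  | sort -u | wc -l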
Checking the DNS Setup
On our Jenkins executors, the DNS configuration was default Ubuntu:
$ cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
options timeout:2 attempts:3
Every DNS query was going out to Google's public DNS. No caching. No local resolver.
With 2-second timeouts and high query volume, we were hitting timeout limits under load. Queries were queuing, timing out, and causing build failures.
Understanding DNS Resolution in Linux
This is where Linux networking fundamentals became critical. To solve the problem, we needed to understand how DNS resolution actually works.
The DNS Resolution Chain
When an application makes a DNS query on Linux, here's what happens:
Application (curl, git, docker)
↓
glibc resolver (getaddrinfo)
↓
/etc/nsswitch.conf (determines resolution order)
↓
/etc/hosts (checked first if configured)
↓
DNS resolver (from /etc/resolv.conf)
↓
systemd-resolved OR traditional resolv.conf
↓
Upstream DNS servers (8.8.8.8, etc.)
By default, there's no caching at the OS level. Every query goes all the way to the upstream DNS server.
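You can walk this chain yourself on any node. A few standard commands show each hop (exact output varies by distro):
# Which sources NSS consults for hostnames, and in what order ("files dns", "files resolve ... dns", ...)
grep ^hosts /etc/nsswitch.conf
# Is /etc/resolv.conf a real file or a symlink into systemd-resolved's stub resolver?
readlink -f /etc/resolv.conf
# Resolve a name through the full NSS chain, exactly the way applications do
getent hosts github.com
# Resolve a name via DNS only, bypassing /etc/hosts and NSS
dig +short github.com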
The /etc/resolv.conf File
This file controls DNS behavior:
nameserver 8.8.8.8 # Primary DNS server
nameserver 8.8.4.4 # Fallback DNS server
options timeout:2 # Timeout per query (seconds)
options attempts:3 # Number of retry attempts
options rotate # Rotate through nameservers
options ndots:1 # Dots required for absolute lookup
Key insight: with timeout:2 and attempts:3, the resolver retries each nameserver in turn, so a query that never gets an answer blocks for about 6 seconds per server (roughly 12 seconds with both upstream servers listed) before the application sees a failure. Under load, these timeouts stack up.
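You can feel this stall directly by querying an address that will never answer; dig's +time and +tries flags mimic the resolver's timeout and attempts options (192.0.2.1 is a TEST-NET address reserved for documentation, so nothing responds):
# Three 2-second attempts against a dead server: dig gives up after roughly 6 seconds
time dig @192.0.2.1 +time=2 +tries=3 example.com
# real time is approximately timeout x attempts; with two real nameservers, a full failure takes even longer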
systemd-resolved
Modern Ubuntu uses systemd-resolved, which provides some caching:
$ systemctl status systemd-resolved
● systemd-resolved.service - Network Name Resolution
Active: active (running)
But the default cache is limited:
$ resolvectl statistics
Current Cache Size: 0
Cache Hits: 127
Cache Misses: 1843
Cache hit rate: about 6%. Not nearly enough for our use case.
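That figure comes straight from the counters above; a one-liner like this (assuming the Cache Hits / Cache Misses labels shown in the output) computes it:
# Compute systemd-resolved's cache hit rate from resolvectl statistics
resolvectl statistics | awk -F': *' '
  /Cache Hits/   {hits=$2}
  /Cache Misses/ {miss=$2}
  END {if (hits+miss > 0) printf "cache hit rate: %.1f%%\n", 100*hits/(hits+miss)}'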
Why No Caching?
The problem: most CI/CD workloads are short-lived. Jenkins executors start build jobs that run for minutes, then stop. The OS-level DNS cache (even with systemd-resolved) doesn't help much because:
- Each build runs in fresh processes, and glibc's stub resolver does no caching of its own
- systemd-resolved's cache is small and bounded by record TTLs
- Under high concurrency, cached entries are evicted before they can be reused
We needed a dedicated DNS caching layer.
The Solution: DNS Caching
We implemented a local DNS cache on each Jenkins executor using dnsmasq.
Why dnsmasq?
- Lightweight (minimal resource overhead)
- Simple configuration
- Aggressive caching (respects TTLs but caches effectively)
- Battle-tested (used in countless production environments)
Implementation
Step 1: Install dnsmasq
sudo apt-get update
sudo apt-get install -y dnsmasq
Step 2: Configure dnsmasq
Edit /etc/dnsmasq.conf:
# Don't read /etc/resolv.conf for upstream servers
no-resolv
# Define upstream DNS servers explicitly
server=8.8.8.8
server=8.8.4.4
server=1.1.1.1
# Cache size (default is 150, we increased it)
cache-size=10000
# Don't forward queries for plain names (security)
domain-needed
# Don't forward reverse lookups for private IP ranges
bogus-priv
# Listen on localhost only
listen-address=127.0.0.1
# Log queries (for debugging, disable in production)
# log-queries
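Before touching the service, let dnsmasq check the file for syntax errors:
# Validate the configuration without starting the daemon
sudo dnsmasq --test
# A healthy config prints: dnsmasq: syntax check OK.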
Step 3: Update /etc/resolv.conf
Point the system resolver to localhost:
# Replace /etc/resolv.conf (a symlink managed by systemd-resolved on stock Ubuntu)
sudo chattr -i /etc/resolv.conf 2>/dev/null || true
sudo rm -f /etc/resolv.conf
sudo tee /etc/resolv.conf > /dev/null <<'EOF'
nameserver 127.0.0.1
EOF
# Make the new file immutable so nothing rewrites it behind our backs
sudo chattr +i /etc/resolv.conf
Step 4: Disable systemd-resolved (it conflicts)
sudo systemctl disable systemd-resolved
sudo systemctl stop systemd-resolved
Step 5: Start dnsmasq
sudo systemctl enable dnsmasq
sudo systemctl start dnsmasq
sudo systemctl status dnsmasq
Verification
Test DNS resolution:
$ dig github.com @127.0.0.1
;; Query time: 45 msec (first query - cache miss)
$ dig github.com @127.0.0.1
;; Query time: 0 msec (cached!)
Check dnsmasq cache stats:
sudo kill -USR1 $(pidof dnsmasq)
sudo tail -20 /var/log/syslog
# Output shows:
# cache size 10000, 0/847 cache insertions re-used unexpired cache entries
# queries forwarded 1243, queries answered locally 8621
Cache hit rate after deployment: 87%.
The Results
The impact was immediate and dramatic.
Before vs After
Metric                        Before      After     Improvement
----------------------------------------------------------------
DNS query latency (avg)         45ms       0.5ms     90x faster
DNS timeouts per day             180           3     98% reduction
Build failure rate (peak)        18%        0.8%     95% reduction
Average build time           11.2min      9.8min     12% faster
Jenkins executor CPU (DNS)        8%        0.2%     40x less
The Unexpected Benefits
Beyond fixing the failures, we got surprising improvements:
- Faster builds: Eliminating DNS latency shaved 1-2 minutes off average build times
- Less network load: 87% cache hit rate meant 87% fewer outbound DNS queries
- More predictable performance: No more variance from DNS lookup times
- Better debugging: dnsmasq logs made DNS issues immediately visible
Monitoring DNS in Production
After deployment, we added DNS monitoring to our observability stack.
Key Metrics
We track:
# Cache hit rate
(queries_answered_locally / total_queries) * 100
# Query latency percentiles
p50, p95, p99 query times
# Timeout rate
dns_timeouts / total_queries
# Cache size utilization
current_cache_entries / max_cache_size
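The cache hit rate comes from the same SIGUSR1 stats dump used during verification; a minimal sketch (assuming dnsmasq logs to /var/log/syslog and the "queries forwarded ... queries answered locally ..." line format shown earlier):
#!/usr/bin/env bash
# Sketch: derive dnsmasq's cache hit rate from its SIGUSR1 statistics dump
set -euo pipefail
sudo kill -USR1 "$(pidof dnsmasq)"
sleep 1   # give dnsmasq a moment to write its stats lines
line=$(grep 'queries forwarded' /var/log/syslog | tail -1)
forwarded=$(grep -oP 'queries forwarded \K[0-9]+' <<< "$line")
local_hits=$(grep -oP 'queries answered locally \K[0-9]+' <<< "$line")
awk -v l="$local_hits" -v f="$forwarded" \
  'BEGIN { printf "cache hit rate: %.1f%% (%d local / %d total)\n", 100*l/(l+f), l, l+f }'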
Alerting
We alert on:
- Cache hit rate drops below 70%
- DNS timeout rate exceeds 1%
- Average query latency exceeds 10ms
- dnsmasq service is down
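Those thresholds translate directly into a small health probe that can run from cron or feed a node-exporter textfile collector; a sketch (the 10ms threshold mirrors the alert above, and the "Query time" parsing assumes standard dig output):
#!/usr/bin/env bash
# Sketch: liveness and latency probe for the local dnsmasq cache
if ! systemctl is-active --quiet dnsmasq; then
  echo "CRITICAL: dnsmasq is not running"; exit 2
fi
ms=$(dig +time=2 +tries=1 github.com @127.0.0.1 | awk '/Query time/ {print $4}')
if [ -z "$ms" ]; then
  echo "CRITICAL: local DNS lookup failed"; exit 2
elif [ "$ms" -gt 10 ]; then
  echo "WARNING: local DNS query took ${ms} ms"; exit 1
fi
echo "OK: dnsmasq healthy, query time ${ms} ms"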
Dashboard
We built a Grafana dashboard showing:
┌─────────────────────────────────────────────┐
│ DNS Performance - Jenkins Executors │
├─────────────────────────────────────────────┤
│ │
│ Cache Hit Rate: 87.3% ✓ │
│ Avg Query Time: 0.6ms ✓ │
│ Timeout Rate: 0.04% ✓ │
│ Queries/min: 1,240 │
│ │
│ [Graph: Query latency over time] │
│ [Graph: Cache hit rate over time] │
│ [Graph: Timeout rate correlation] │
│ │
└─────────────────────────────────────────────┘
What We Learned
Linux Networking Fundamentals Matter
You can't debug what you don't understand. Knowing how DNS resolution works in Linux—the resolver chain, /etc/resolv.conf, systemd-resolved, timeout behavior—was essential to solving this problem.
This isn't esoteric knowledge. This is fundamental DevOps infrastructure understanding.
Monitoring Has Blind Spots
Our monitoring showed:
- Network: healthy ✓
- DNS servers: healthy ✓
- Individual queries: working ✓
But it missed the big picture: thousands of DNS queries per minute overwhelming our DNS servers.
We added DNS-specific metrics after this. Volume matters, not just success rate.
"It Works On My Machine" Doesn't Scale
A single build making 25 DNS queries? No problem.
32 concurrent builds making 800 combined queries sustained over minutes? Problem.
Infrastructure that works at low scale can break in completely different ways at high scale. DNS resolution is a perfect example.
Caching Is Infrastructure
We think of caching for application data—Redis, Memcached, CDNs. But caching applies to infrastructure too:
- DNS caching (dnsmasq, nscd)
- Package caching (apt-cacher-ng, Docker registry cache)
- Artifact caching (local Nexus/Artifactory)
Every external dependency can become a bottleneck. Caching turns repeated network calls into local lookups.
Growth Exposes Assumptions
At 50 builds/day, we never thought about DNS. It just worked.
At 600 builds/day, our implicit assumption—"DNS resolution is fast and reliable"—was wrong.
Infrastructure that works at one scale often breaks at another. Planning ahead means questioning your assumptions before growth forces you to.
Broader Implications
Communication and Networking Are Core Skills
Understanding DNS isn't optional for DevOps engineers. Neither is understanding:
- TCP/IP fundamentals (connection limits, TIME_WAIT, etc.)
- HTTP/HTTPS behavior (keep-alive, connection pooling)
- Load balancing (L4 vs L7, connection distribution)
- Network timeouts (connect, read, total)
These aren't "networking team" problems. These are DevOps problems. Your applications run on networks. Your CI/CD runs on networks. Your infrastructure runs on networks.
Know your stack, all the way down.
Planning Ahead vs Reacting
We only reacted to problems. We added executors when builds queued. We added resources when systems slowed.
But we didn't plan ahead and ask: "What happens at 10x our current scale?"
Planning ahead means asking:
- What are our current bottlenecks?
- What will become bottlenecks at 5x scale? 10x?
- What assumptions break under load?
- Where is our infrastructure making repeated external calls?
DNS was obvious in hindsight. It should have been obvious in foresight.
Documentation and Knowledge Sharing
After fixing this, we documented:
- How DNS resolution works in our infrastructure
- Why we use dnsmasq
- How to debug DNS issues
- What metrics to monitor
- When to scale or reconfigure
Infrastructure knowledge can't live in one person's head. The next engineer debugging a DNS issue shouldn't have to rediscover all of this.
Implementation Checklist
If you're running CI/CD at scale, here's what to check:
1. Measure your DNS query volume
# On your CI/CD nodes, watch DNS traffic directly
sudo tcpdump -i any -nn port 53 -c 100
# Rough proxy for query rate: count in-flight UDP sockets talking to port 53
watch -n 5 'sudo ss -u | grep -c :53'
2. Check your DNS configuration
cat /etc/resolv.conf
systemctl status systemd-resolved
dig +trace example.com # Verify resolution path
3. Test DNS under load
# Simple load test
for i in {1..1000}; do dig github.com > /dev/null & done
wait
# Monitor for failures or slowdowns
4. Implement DNS caching if needed
- Install dnsmasq (or alternatives like nscd, unbound)
- Configure appropriate cache size for your workload
- Point /etc/resolv.conf to localhost
- Monitor cache hit rates
5. Add DNS monitoring
- Query latency (p50, p95, p99)
- Cache hit rate
- Timeout rate
- Query volume over time
6. Document your setup
- How DNS resolution works in your environment
- Why you configured it this way
- How to debug DNS issues
- When to scale or reconfigure
Alternatives to dnsmasq
We chose dnsmasq, but other options exist:
nscd (Name Service Cache Daemon)
- Pros: Built into most Linux distros, minimal setup
- Cons: Less configurable, can have bugs, less visibility
systemd-resolved (with tuning)
- Pros: Already present on modern Ubuntu, integrated
- Cons: Conservative caching, less control, complex configuration
unbound
- Pros: Very powerful, DNSSEC support, recursive resolver
- Cons: More complex, heavier weight, overkill for simple caching
CoreDNS
- Pros: Cloud-native, plugin ecosystem, Kubernetes-friendly
- Cons: Heavier than needed for simple caching, more resources, its own Corefile configuration to learn
For CI/CD caching, dnsmasq hits the sweet spot: simple, effective, lightweight.
The TL;DR
- Rapid development growth (10 → 40 engineers) scaled our Jenkins builds from 50 to 600+ per day
- High concurrent build volume exposed a DNS resolution bottleneck
- Thousands of DNS queries per minute overwhelmed our upstream DNS resolution
- Default Linux DNS configuration has no good caching for short-lived jobs
- Implementing dnsmasq for local DNS caching reduced query latency by 90x and build failures by 98%
- Understanding Linux networking fundamentals (resolv.conf, systemd-resolved, DNS resolution chain) was critical to solving the problem
- DNS monitoring and planning ahead prevent these issues before they cause failures
- Infrastructure assumptions that work at low scale often break at high scale
The Deeper Lesson
This story isn't really about DNS. It's about understanding the systems you build on.
Modern DevOps has many layers: Jenkins, Docker, Kubernetes, cloud platforms. It's easy to treat everything below your application as "infrastructure that just works."
But when things break—and at scale, they will—you need to understand how those layers actually work. Not just the tools, but the underlying systems.
DNS resolution. TCP connections. File systems. Process scheduling. Memory management.
These aren't "sysadmin" topics that DevOps has moved beyond. These are the fundamentals that everything else is built on.
When your Jenkins builds are failing and nobody can figure out why, knowing how /etc/resolv.conf works isn't optional knowledge. It's the difference between guessing and understanding.
The DevOps engineers who do well aren't the ones who know the most tools. They're the ones who understand their systems all the way down—and know when to dig deeper.