How DNS Resolution Became Our CI/CD Bottleneck at 600+ Daily Builds

When our development team scaled from 10 to 40 engineers, our Jenkins pipeline went from 50 builds per day to over 600. Everything seemed fine—until it wasn't. Builds started failing randomly with network timeouts. The culprit? DNS resolution. Here's how understanding Linux networking fundamentals saved our CI/CD pipeline.

This is the story of how rapid growth exposed a bottleneck we never saw coming, and why knowing your Linux networking stack matters more than you think.

The Growth Trajectory

Six months ago, our development team was lean. 10 engineers, a handful of microservices, maybe 50 builds on a busy day. Our Jenkins setup was straightforward:

  • 3 executor nodes with 4 executors each (12 total concurrent builds)
  • Builds pulling from GitHub, DockerHub, and internal artifact repositories
  • Average build time: 8-12 minutes
  • Everything worked smoothly

Then we raised funding. The team doubled. Then doubled again.

Month 1:  10 engineers → ~50 builds/day
Month 3:  20 engineers → ~150 builds/day
Month 6:  40 engineers → 600+ builds/day

Jenkins scaling:
Month 1:  3 nodes, 12 executors
Month 6:  8 nodes, 32 executors

We added more Jenkins executors. We upgraded our infrastructure. We optimized our pipelines. On paper, we had the capacity.

But the builds started failing.

The Symptoms

At first, it was sporadic. A build would fail with a network timeout. Retry it, and it worked. "Transient network issue," we thought. Move on.

Then it got worse.

The Error Messages

Error cloning repository: Could not resolve host: github.com
Timeout while connecting to registry.hub.docker.com
Failed to fetch package: Temporary failure in name resolution
curl: (6) Could not resolve host: api.internal.company.com

All DNS-related. All intermittent. All concentrated during peak build times (9am-11am, 2pm-4pm).

The Pattern

We started tracking failures:

Time Range       Builds    Failures    Failure Rate
08:00 - 09:00      45         2           4.4%
09:00 - 10:00     120        18          15.0%
10:00 - 11:00     145        27          18.6%
11:00 - 12:00      95         8           8.4%
12:00 - 13:00      50         1           2.0%
13:00 - 14:00      65         3           4.6%
14:00 - 15:00     135        22          16.3%
15:00 - 16:00     140        25          17.9%

Clear correlation: more concurrent builds = more failures. All DNS-related.

The Investigation

Standard debugging didn't reveal much. Network connectivity was fine. DNS servers were healthy. Queries from individual nodes worked perfectly.

But then we started digging into the actual query volume.

DNS Query Explosion

Each build was making dozens of DNS queries:

  • Cloning from GitHub: github.com
  • Pulling Docker images: registry.hub.docker.com, auth.docker.io, production.cloudflare.docker.com
  • Downloading dependencies: npmjs.org, pypi.org, maven.org, etc.
  • Internal services: api.internal.company.com, artifacts.internal.company.com
  • External APIs: various third-party endpoints

Conservatively: 20-30 unique DNS queries per build.

Do the math:

Peak concurrent builds: 32 (all executors running)
DNS queries per build: ~25
Query frequency: every few seconds during active build phases

Peak load: 32 builds × ~25 domains = 800+ lookups per build wave
Duration: sustained for 8-12 minutes per build wave

We were hammering our DNS infrastructure with thousands of queries per minute.
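
We didn't have a clean capture of this at the time, but you can approximate the per-build number yourself by sniffing DNS traffic on an executor while a single build runs. A rough sketch (assumes tcpdump is installed; 600 seconds covers a typical 10-minute build):

# Count the distinct hostnames resolved during one build window
sudo timeout 600 tcpdump -i any -nn -l dst port 53 2>/dev/null | \
  grep -oE 'A\? [^ ]+' | sort -u | wc -l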

Checking the DNS Setup

On our Jenkins executors, the DNS configuration was default Ubuntu:

$ cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
options timeout:2 attempts:3

Every DNS query was going out to Google's public DNS. No caching. No local resolver.

With 2-second timeouts and high query volume, we were hitting timeout limits under load. Queries were queuing, timing out, and causing build failures.

Understanding DNS Resolution in Linux

This is where Linux networking fundamentals became critical. To solve the problem, we needed to understand how DNS resolution actually works.

The DNS Resolution Chain

When an application makes a DNS query on Linux, here's what happens:

Application (curl, git, docker)
    ↓
glibc resolver (getaddrinfo)
    ↓
/etc/nsswitch.conf (determines resolution order)
    ↓
/etc/hosts (checked first if configured)
    ↓
Stub resolver reads /etc/resolv.conf
    ↓
systemd-resolved stub (127.0.0.53) OR the listed nameservers directly
    ↓
Upstream DNS servers (8.8.8.8, etc.)

By default, there's no caching at the OS level. Every query goes all the way to the upstream DNS server.
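
A quick way to see the two paths side by side (both are stock commands, nothing specific to our setup):

# getent walks the full glibc/nsswitch chain: /etc/hosts first, then DNS
getent hosts github.com

# dig skips nsswitch entirely and queries the nameserver from /etc/resolv.conf
dig +short github.com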

The /etc/resolv.conf File

This file controls DNS behavior:

nameserver 8.8.8.8      # Primary DNS server
nameserver 8.8.4.4      # Fallback DNS server
options timeout:2       # Timeout per query (seconds)
options attempts:3      # Number of retry attempts
options rotate          # Rotate through nameservers
options ndots:1         # Dots required for absolute lookup

Key insight: timeout:2 with attempts:3 means a single failed query blocks for at least 6 seconds (2s × 3 attempts) before giving up, and with two nameservers it can take even longer. Under load, these timeouts stack up.
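
You can reproduce that stall with dig by mirroring the resolv.conf settings and pointing it at an address that silently drops DNS packets (10.255.255.1 below is a placeholder, not part of our setup):

# 2-second timeout x 3 attempts: roughly 6 seconds before dig gives up
time dig +time=2 +tries=3 github.com @10.255.255.1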

systemd-resolved

Modern Ubuntu uses systemd-resolved, which provides some caching:

$ systemctl status systemd-resolved
● systemd-resolved.service - Network Name Resolution
   Active: active (running)

But the default cache is limited:

$ resolvectl statistics
Current Cache Size: 0
Cache Hits: 127
Cache Misses: 1843

Cache hit rate: about 6%. Not nearly enough for our use case.
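
That 6% is just hits / (hits + misses) from the output above. If you want to compute it on the fly, a one-liner works (a sketch; the exact field names in resolvectl output vary a little between systemd versions):

resolvectl statistics | awk -F': *' '
  /Cache Hits/   {hits = $2}
  /Cache Misses/ {miss = $2}
  END {printf "cache hit rate: %.1f%%\n", 100 * hits / (hits + miss)}'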

Why No Caching?

The problem: most CI/CD workloads are short-lived. Jenkins executors start build jobs that run for minutes, then stop. The OS-level DNS cache (even with systemd-resolved) doesn't help much because:

  1. The glibc resolver keeps no cache of its own, so every build process starts cold
  2. systemd-resolved's cache is small and bounded by short TTLs
  3. High concurrency churns that small cache, so entries get evicted or expire before they're reused

We needed a dedicated DNS caching layer.

The Solution: DNS Caching

We implemented a local DNS cache on each Jenkins executor using dnsmasq.

Why dnsmasq?

  • Lightweight (minimal resource overhead)
  • Simple configuration
  • Aggressive caching (respects TTLs but caches effectively)
  • Battle-tested (used in countless production environments)

Implementation

Step 1: Install dnsmasq

sudo apt-get update
sudo apt-get install -y dnsmasq

Step 2: Configure dnsmasq

Edit /etc/dnsmasq.conf:

# Don't read /etc/resolv.conf for upstream servers
no-resolv

# Define upstream DNS servers explicitly
server=8.8.8.8
server=8.8.4.4
server=1.1.1.1

# Cache size (default is 150, we increased it)
cache-size=10000

# Don't forward queries for plain names (security)
domain-needed

# Don't forward reverse lookups for private IP ranges
bogus-priv

# Listen on localhost only
listen-address=127.0.0.1

# Log queries (for debugging, disable in production)
# log-queries
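
Before moving on, dnsmasq can syntax-check the file you just edited. Not strictly required, but it's a cheap sanity step:

sudo dnsmasq --test
# dnsmasq: syntax check OK.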

Step 3: Update /etc/resolv.conf

Point the system resolver to localhost:

# Clear the immutable flag if it was set previously, so the file can be edited
sudo chattr -i /etc/resolv.conf 2>/dev/null || true

# Point the resolver at the local dnsmasq instance
sudo tee /etc/resolv.conf > /dev/null <<EOF
nameserver 127.0.0.1
EOF

# Make resolv.conf immutable so nothing rewrites it behind our back
sudo chattr +i /etc/resolv.conf

Step 4: Disable systemd-resolved (it conflicts)

sudo systemctl disable systemd-resolved
sudo systemctl stop systemd-resolved

Step 5: Start dnsmasq

sudo systemctl enable dnsmasq
sudo systemctl start dnsmasq
sudo systemctl status dnsmasq

Verification

Test DNS resolution:

$ dig github.com @127.0.0.1

;; Query time: 45 msec  (first query - cache miss)

$ dig github.com @127.0.0.1

;; Query time: 0 msec   (cached!)

Check dnsmasq cache stats:

sudo kill -USR1 $(pidof dnsmasq)
sudo tail -20 /var/log/syslog

# Output shows:
# cache size 10000, 0/847 cache insertions re-used unexpired cache entries
# queries forwarded 1243, queries answered locally 8621

Cache hit rate after deployment: 87%.
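
That 87% is queries answered locally divided by total queries, taken from the stats line above. To avoid doing the math by hand, a small parser works (assuming the stats land in /var/log/syslog in the format shown; adjust the path if your executors log elsewhere):

# Dump fresh stats, then compute the hit rate from the newest stats line
sudo kill -USR1 "$(pidof dnsmasq)"
grep 'queries forwarded' /var/log/syslog | tail -1 | awk '{
  fwd = $(NF-4); gsub(",", "", fwd)   # queries forwarded upstream
  local = $NF                         # queries answered from the local cache
  printf "cache hit rate: %.1f%%\n", 100 * local / (fwd + local)
}'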

The Results

The impact was immediate and dramatic.

Before vs After

Metric                          Before      After       Improvement
-----------------------------------------------------------------
DNS query latency (avg)         45ms        0.5ms       90x faster
DNS timeouts per day            180         3           98% reduction
Build failure rate (peak)       18%         0.8%        95% reduction
Average build time              11.2min     9.8min      12% faster
Jenkins executor CPU (DNS)      8%          0.2%        40x less

The Unexpected Benefits

Beyond fixing the failures, we got surprising improvements:

  1. Faster builds: Eliminating DNS latency shaved 1-2 minutes off average build times
  2. Less network load: 87% cache hit rate meant 87% fewer outbound DNS queries
  3. More predictable performance: No more variance from DNS lookup times
  4. Better debugging: dnsmasq logs made DNS issues immediately visible

Monitoring DNS in Production

After deployment, we added DNS monitoring to our observability stack.

Key Metrics

We track:

# Cache hit rate
(queries_answered_locally / total_queries) * 100

# Query latency percentiles
p50, p95, p99 query times

# Timeout rate
dns_timeouts / total_queries

# Cache size utilization
current_cache_entries / max_cache_size
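
None of these need a heavyweight agent to get started. For example, rough latency percentiles can be sampled straight from an executor with a loop of local queries (a sketch; 100 samples against the dnsmasq listener on 127.0.0.1):

# Sample 100 lookups and print approximate p50/p95/p99 query times (ms)
for i in $(seq 1 100); do
  dig +noall +stats github.com @127.0.0.1 | awk '/Query time/ {print $4}'
done | sort -n | awk '{v[NR] = $1} END {
  printf "p50: %sms  p95: %sms  p99: %sms\n",
    v[int(NR * 0.50)], v[int(NR * 0.95)], v[int(NR * 0.99)]
}'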

Alerting

We alert on:

  • Cache hit rate drops below 70%
  • DNS timeout rate exceeds 1%
  • Average query latency exceeds 10ms
  • dnsmasq service is down
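
We wired these into our existing alerting, but even a bare-bones check script catches the two worst cases (a sketch using the thresholds above; run it from cron or your agent of choice):

# Exit non-zero if dnsmasq is down or a local lookup is slower than 10 ms
systemctl is-active --quiet dnsmasq || { echo "dnsmasq is down"; exit 1; }
ms=$(dig +noall +stats github.com @127.0.0.1 | awk '/Query time/ {print $4}')
[ "${ms:-999}" -le 10 ] || { echo "slow or failed local lookup: ${ms:-none} ms"; exit 1; }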

Dashboard

We built a Grafana dashboard showing:

┌─────────────────────────────────────────────┐
│ DNS Performance - Jenkins Executors         │
├─────────────────────────────────────────────┤
│                                             │
│  Cache Hit Rate:        87.3%  ✓           │
│  Avg Query Time:        0.6ms  ✓           │
│  Timeout Rate:          0.04%  ✓           │
│  Queries/min:           1,240              │
│                                             │
│  [Graph: Query latency over time]          │
│  [Graph: Cache hit rate over time]         │
│  [Graph: Timeout rate correlation]         │
│                                             │
└─────────────────────────────────────────────┘

What We Learned

Linux Networking Fundamentals Matter

You can't debug what you don't understand. Knowing how DNS resolution works in Linux—the resolver chain, /etc/resolv.conf, systemd-resolved, timeout behavior—was essential to solving this problem.

This isn't esoteric knowledge. This is fundamental DevOps infrastructure understanding.

Monitoring Has Blind Spots

Our monitoring showed:

  • Network: healthy ✓
  • DNS servers: healthy ✓
  • Individual queries: working ✓

But it missed the big picture: thousands of DNS queries per minute overwhelming our DNS servers.

We added DNS-specific metrics after this. Volume matters, not just success rate.

"It Works On My Machine" Doesn't Scale

A single build making 25 DNS queries? No problem.

32 concurrent builds making 800 combined queries sustained over minutes? Problem.

Infrastructure that works at low scale can break in completely different ways at high scale. DNS resolution is a perfect example.

Caching Is Infrastructure

We think of caching for application data—Redis, Memcached, CDNs. But caching applies to infrastructure too:

  • DNS caching (dnsmasq, nscd)
  • Package caching (apt-cacher-ng, Docker registry cache)
  • Artifact caching (local Nexus/Artifactory)

Every external dependency can become a bottleneck. Caching turns repeated network calls into local lookups.

Growth Exposes Assumptions

At 50 builds/day, we never thought about DNS. It just worked.

At 600 builds/day, our implicit assumption—"DNS resolution is fast and reliable"—was wrong.

Infrastructure that works at one scale often breaks at another. Planning ahead means questioning your assumptions before growth forces you to.

Broader Implications

Communication and Networking Are Core Skills

Understanding DNS isn't optional for DevOps engineers. Neither is understanding:

  • TCP/IP fundamentals (connection limits, TIME_WAIT, etc.)
  • HTTP/HTTPS behavior (keep-alive, connection pooling)
  • Load balancing (L4 vs L7, connection distribution)
  • Network timeouts (connect, read, total)

These aren't "networking team" problems. These are DevOps problems. Your applications run on networks. Your CI/CD runs on networks. Your infrastructure runs on networks.

Know your stack, all the way down.

Planning Ahead vs Reacting

We only reacted to problems. We added executors when builds queued. We added resources when systems slowed.

But we didn't plan ahead and ask: "What happens at 10x our current scale?"

Planning ahead means asking:

  • What are our current bottlenecks?
  • What will become bottlenecks at 5x scale? 10x?
  • What assumptions break under load?
  • Where is our infrastructure making repeated external calls?

DNS was obvious in hindsight. It should have been obvious in foresight.

Documentation and Knowledge Sharing

After fixing this, we documented:

  1. How DNS resolution works in our infrastructure
  2. Why we use dnsmasq
  3. How to debug DNS issues
  4. What metrics to monitor
  5. When to scale or reconfigure

Infrastructure knowledge can't live in one person's head. The next engineer debugging a DNS issue shouldn't have to rediscover all of this.

Implementation Checklist

If you're running CI/CD at scale, here's what to check:

1. Measure your DNS query volume

# On your CI/CD nodes, monitor DNS traffic
sudo tcpdump -i any port 53 -c 100

# Count outbound queries over a 60-second window (DNS sockets are too
# short-lived for ss to give a meaningful count)
sudo timeout 60 tcpdump -i any -nn -l dst port 53 2>/dev/null | wc -l

2. Check your DNS configuration

cat /etc/resolv.conf
systemctl status systemd-resolved
dig +trace example.com  # Verify resolution path
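
On Ubuntu it's also worth checking whether /etc/resolv.conf is a real file or a symlink into systemd-resolved, because that determines which component actually answers your queries:

ls -l /etc/resolv.conf   # a symlink into /run/systemd/resolve/ means the 127.0.0.53 stub is in use
resolvectl status        # shows which DNS servers each interface actually uses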

3. Test DNS under load

# Simple load test
for i in {1..1000}; do dig github.com > /dev/null & done
wait

# Monitor for failures or slowdowns
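
If you want a number rather than eyeballing the output, a small variant counts outright failures (a sketch: 500 lookups, 50 at a time; anything that errors or exceeds a 2-second timeout prints FAIL):

seq 1 500 | xargs -P 50 -I{} sh -c \
  'dig +time=2 +tries=1 github.com > /dev/null 2>&1 || echo FAIL' | grep -c FAIL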

4. Implement DNS caching if needed

  • Install dnsmasq (or alternatives like nscd, unbound)
  • Configure appropriate cache size for your workload
  • Point /etc/resolv.conf to localhost
  • Monitor cache hit rates

5. Add DNS monitoring

  • Query latency (p50, p95, p99)
  • Cache hit rate
  • Timeout rate
  • Query volume over time

6. Document your setup

  • How DNS resolution works in your environment
  • Why you configured it this way
  • How to debug DNS issues
  • When to scale or reconfigure

Alternatives to dnsmasq

We chose dnsmasq, but other options exist:

nscd (Name Service Cache Daemon)

  • Pros: Built into most Linux distros, minimal setup
  • Cons: Less configurable, can have bugs, less visibility

systemd-resolved (with tuning)

  • Pros: Already present on modern Ubuntu, integrated
  • Cons: Conservative caching, less control, complex configuration

unbound

  • Pros: Very powerful, DNSSEC support, recursive resolver
  • Cons: More complex, heavier weight, overkill for simple caching

CoreDNS

  • Pros: Cloud-native, plugin ecosystem, Kubernetes-friendly
  • Cons: heavier than a single-node cache needs, more resources, configuration complexity

For CI/CD caching, dnsmasq hits the sweet spot: simple, effective, lightweight.

The TL;DR

  • Rapid development growth (10 → 40 engineers) scaled our Jenkins builds from 50 to 600+ per day
  • High concurrent build volume exposed a DNS resolution bottleneck
  • Thousands of DNS queries per minute overwhelmed our upstream DNS resolution
  • Default Linux DNS configuration has no good caching for short-lived jobs
  • Implementing dnsmasq for local DNS caching made queries ~90x faster and cut build failures by 98%
  • Understanding Linux networking fundamentals (resolv.conf, systemd-resolved, DNS resolution chain) was critical to solving the problem
  • DNS monitoring and planning ahead prevent these issues before they cause failures
  • Infrastructure assumptions that work at low scale often break at high scale

The Deeper Lesson

This story isn't really about DNS. It's about understanding the systems you build on.

Modern DevOps has many layers: Jenkins, Docker, Kubernetes, cloud platforms. It's easy to treat everything below your application as "infrastructure that just works."

But when things break—and at scale, they will—you need to understand how those layers actually work. Not just the tools, but the underlying systems.

DNS resolution. TCP connections. File systems. Process scheduling. Memory management.

These aren't "sysadmin" topics that DevOps has moved beyond. These are the fundamentals that everything else is built on.

When your Jenkins builds are failing and nobody can figure out why, knowing how /etc/resolv.conf works isn't optional knowledge. It's the difference between guessing and understanding.

The DevOps engineers who do well aren't the ones who know the most tools. They're the ones who understand their systems all the way down—and know when to dig deeper.


It's not DNS. There's no way it's DNS. It was DNS.