Syed Jahanzaib – سید جہانزیب – Personal Blog to Share Knowledge !

February 16, 2026

DNS Capacity Planning for ISPs: Recursive Load, QPS and Hit Ratio Explained (50K–100K Deployment Guide)



Measuring, Benchmarking, Modeling & Sizing Recursive Infrastructure

Author: Syed Jahanzaib
Audience: ISP Network & Systems Engineers
Scope: Production-grade DNS capacity planning for 10K–100K+ subscribers


⚠️ Disclaimer & Note on Writing Style

Every network environment is unique. A solution that works effectively in one infrastructure may require modification in another. Readers are strongly encouraged to understand the underlying concepts and adapt the guidance according to their own architecture, operational policies, and risk tolerance.

Blind copy-paste implementation without proper validation, testing, and change management is never recommended — especially in production environments. Always ensure proper backups and risk assessment before applying any configuration.

The content shared here is based on hands-on experience from real-world deployments, ISP environments, lab testing, and continuous learning. While I strive for technical accuracy, no technical implementation is entirely free from the possibility of error. Constructive discussion and alternative approaches are always welcome.

Due to professional commitments, it is not always feasible to publish highly detailed or multi-part write-ups. The technical logic and implementation details are written based on my own practical experience. AI tools such as ChatGPT are used only to refine grammar, structure, and presentation — not to generate the core technical concepts.

This blog is not intended for client acquisition or follower growth. It exists solely to share practical knowledge and real-world experience with the community.

Thank you for your understanding and continued support.


Executive Summary

DNS infrastructure in ISP environments is often sized using:

  • Subscriber count
  • Vendor marketing numbers
  • Approximate hardware specs

This approach frequently results in:

  • CPU saturation during peak hours
  • Increased latency
  • UDP packet drops
  • Recursive overload
  • Cache inefficiency

This post explains how to model DNS backend load using real measurements (QPS), cache behavior (Hit Ratio), and benchmarking, culminating in sizing recommendations for 50K and 100K subscriber ISPs. DNS capacity planning is not determined by subscriber count. It is determined by:

Recursive Load = Total QPS × (1 − Hit Ratio)

Only cache-miss traffic consumes real recursive CPU. In real ISP environments:

  • Frontend QPS can be very high
  • Cache hit ratio reduces backend load
  • Recursive servers are CPU-bound
  • RAM improves hit ratio and indirectly reduces CPU requirement

This guide walks through measurement, benchmarking, modeling, and real-world Pakistani ISP deployment examples (50K and 100K subscribers).


This whitepaper provides a measurement-driven engineering framework covering:

  1. Typical ISP DNS Design
  2. Measuring Production QPS Baseline
  3. Benchmarking Recursive Servers (Cache-Hit & Cache-Miss)
  4. Benchmarking DNSDIST Frontend Capacity
  5. ISP Capacity Modeling (100K Subscriber Example)
  6. Real Traffic Pattern Simulation (Zipf Distribution)
  7. Recommended Hardware for 100K ISP
  8. Real-World Case Study – 50K ISP Deployment (Pakistan)
  9. Real-World Case Study – 100K Karachi Metro ISP
  10. Final Comparative Snapshot
  11. Engineering Takeaway for Pakistani ISPs
  12. Conclusion
  13. Layered DNS Design with Pakistani ISP Context
  14. Threat Model & Risk Assessment
  15. Monitoring & Alerting Blueprint (What to monitor and thresholds)

The goal is deterministic DNS capacity planning — not guesswork.

Typical ISP Recursive DNS Architecture for ISP Deployment


Reference Architecture

Typical ISP DNS Design

 Components

DNSDIST Layer

  • Load balancing
  • Packet cache
  • Rate limiting
  • Frontend UDP/TCP handling

Recursive Layer (BIND / Unbound / PowerDNS Recursor)

  • Full recursion
  • Cache storage
  • DNSSEC validation
  • Upstream resolution

Authoritative Layer (Optional)

  • Local zones
  • Internal domains

Measure Real Production QPS (Baseline First)

Before benchmarking anything, measure real traffic.

DNS Capacity Planning Flow Model (QPS × (1 − Hit Ratio))


Why This Matters

Capacity modeling without baseline QPS is meaningless. DNS CPU demand is defined by:

Recursive Load = Total QPS × (1 − Hit Ratio)

Method 1 — BIND Statistics Channel (Recommended)

Enable statistics channel:

statistics-channels {
    inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
};

Restart BIND.

Retrieve counters:

curl http://127.0.0.1:8053/

Measure at time T1 and T2.

This gives actual production QPS.

Method 2 — rndc stats

rndc stats

Parse:

/var/cache/bind/named.stats

Automate sampling every 5 seconds for accurate peak measurement.
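As a minimal sketch of the sampling logic (counter names and statistics-channel endpoint layout vary across BIND versions, so treat the retrieval command as an assumption to adapt), QPS can be derived from two samples of the cumulative query counter:

```shell
#!/bin/bash
# Derive QPS from two samples of BIND's cumulative query counter.
# The counter itself can come from the statistics channel, e.g.:
#   curl -s http://127.0.0.1:8053/   (XML; grep the opcode QUERY counter)
# or from parsing /var/cache/bind/named.stats after `rndc stats`.

qps_from_samples() {
    # args: counter_at_T1 counter_at_T2 interval_seconds
    local c1=$1 c2=$2 interval=$3
    echo $(( (c2 - c1) / interval ))
}

# Example: counter rose from 1,200,000 to 1,200,450 over 5 seconds
qps_from_samples 1200000 1200450 5   # -> 90
```

Run the sampler continuously across the evening peak window; a single daytime sample will understate the QPS your design must survive.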

Benchmark Recursive Servers Independently

  • Recursive servers are the primary CPU bottleneck.
  • Always isolate them from DNSDIST during testing.

A recursive resolver will query authoritative servers when the answer is not in cache, increasing CPU/latency load.

TTL values directly impact the effective cache hit ratio:

  • Shorter TTL → more recursion
  • Longer TTL → better cache effectiveness

This is technically important because TTL distribution significantly affects hit ratio behavior — especially in real ISP traffic patterns.

Two Performance Modes

A) Cache-Hit Performance

Measures:

  • Memory speed
  • Thread scaling
  • Max theoretical QPS

B) Cache-Miss Performance (Real Recursion)

Measures:

  • CPU saturation
  • External lookups
  • True capacity

Cache-hit QPS can be 10x higher than recursion QPS.

Design for recursion load — not cache-hit numbers.

Using dnsperf

Install on test machine:

apt install dnsperf

Cache-Hit Test

Small repeated dataset:

dnsperf -s 10.10.2.164 -d queries_cache.txt -Q 2000 -l 30

Gradually increase load.

Cache-Miss Test

Large unique dataset (10K+ domains):

dnsperf -s 10.10.2.164 -d queries_miss.txt -Q 500 -l 60
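The queries_miss.txt input above can be generated as a sketch like this (the names and the bench-miss.example domain are placeholders, chosen only to guarantee unique lookups in dnsperf's `<name> <type>` input format):

```shell
#!/bin/bash
# Generate 10,000 unique query names so every lookup is a cache miss.
# bench-miss.example is a placeholder domain, not a real zone.
for i in $(seq 1 10000); do
    echo "host${i}.bench-miss.example. A"
done > queries_miss.txt

wc -l queries_miss.txt
```

Because every name is unique, the resolver cannot answer from cache and the test exercises the full recursion path.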

Monitor:

  • CPU per core
  • SoftIRQ
  • UDP drops (netstat -su)
  • Latency growth

Engineering Rule

  • Recursive DNS is CPU-bound.
  • DNSDIST is lightweight.
  • Recursive must be benchmarked first.

Benchmark DNSDIST Separately

Goal: Measure frontend packet handling capacity.

Isolate Backend Variable

Create fast local zone on backend:

zone "bench.local" {
    type master;
    file "/etc/bind/db.bench";
};

Enable DNSDIST packet cache:

pc = newPacketCache(1000000, {maxTTL=60})
getPool("rec"):setCache(pc)

Run:

dnsperf -s 10.10.2.160 -d bench_queries.txt -Q 10000 -l 30

What This Measures

  • Packet processing rate
  • Rule engine overhead
  • Cache lookup speed
  • Socket performance

Typical 8-core VM:

Component                 Typical QPS
DNSDIST                   40K–120K
Recursive (cache hit)     20K–50K
Recursive (miss heavy)    2K–5K


ISP Capacity Modeling (100K Subscriber Example)

Step 1 — Active Users

  • 100,000 subscribers
  • Assume 30% peak concurrency
Active = 100,000 × 0.3 = 30,000

Step 2 — Average QPS Per Active User

Engineering safe value: 3 QPS per active user

Total QPS = 30,000 × 3 = 90,000

Step 3 — Apply Cache Hit Ratio

Assume: H = 0.70 (70% cache hit)

Recursive QPS = 90,000 × (1 − 0.70) = 27,000
Step 4 — Core Requirement Calculation

Recursive Core Formula

Cores = Recursive QPS ÷ 1,000 per core ≈ 27,000 ÷ 1,000 = 27 cores (before headroom)

Example deployment:

Server Count    Cores per Server
3               10 cores
4               8 cores
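The whole sizing chain can be expressed as one function. This is a sketch; the ~1,000 recursive QPS per core figure is the conservative planning value discussed later in this post, and the inputs are the 100K example's assumptions:

```shell
#!/bin/bash
# Sizing model: subscribers -> active users -> total QPS -> recursive QPS -> cores
recursive_cores() {
    # args: subscribers concurrency_pct qps_per_user hit_ratio_pct per_core_qps
    local subs=$1 conc=$2 qps=$3 hit=$4 per_core=$5
    local active=$(( subs * conc / 100 ))
    local total=$(( active * qps ))
    local recursive=$(( total * (100 - hit) / 100 ))
    echo $(( (recursive + per_core - 1) / per_core ))   # round up
}

# 100K subscribers, 30% concurrency, 3 QPS/user, 70% hit ratio
recursive_cores 100000 30 3 70 1000   # -> 27
```

Re-run it with your own measured concurrency, per-user QPS, and worst-case hit ratio rather than these example values.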

DNSDIST Core Formula

DNSDIST is far lighter per query than recursion; the benchmark table earlier shows 40K–120K QPS on 8 cores, so 8 cores per node comfortably covers a 90K QPS frontend.

Recommended per node: 8 cores (HA pair)

Cache Hit Ratio Modeling

Typical ISP values:

ISP Size      Hit Ratio
5K users      50–60%
30K users     60–75%
100K users    70–85%

Why larger ISPs have higher hit ratio:

  • Higher domain overlap probability
  • CDN concentration
  • Popular content clustering

Important Note on the Formula:

The commonly used estimate of ~1000 recursive QPS per CPU core is a conservative planning value.
Actual performance depends on:

  • CPU generation and clock speed
  • DNS software (BIND vs Unbound vs PowerDNS)
  • Threading configuration
  • DNSSEC usage
  • Cache size

Real Traffic Pattern Simulation (Zipf Distribution)

ISP DNS Traffic Distribution Model (Zipf Behavior)

DNS traffic follows Zipf distribution:

  • 60–80% popular domains
  • 10–20% medium popularity
  • 5–10% long-tail

Testing only google.com is invalid.

Simulate burst:

dnsperf -s 10.10.2.164 -d queries_miss.txt -Q 5000 -l 30
dnsperf -s 10.10.2.164 -d queries_miss.txt -Q 10000 -l 30
dnsperf -s 10.10.2.164 -d queries_miss.txt -Q 20000 -l 30

Observe latency before packet drops.

Latency growth = early saturation warning.

RAM Sizing for Recursive Cache

Rule of Thumb

1 million entries ≈ 150–250 MB

Safe estimate:

200 bytes per entry

If:

RAM = 1,500,000 entries × 200 bytes = 300 MB

Multiply by 4–5 for safety.
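The rule of thumb above, as arithmetic (200 bytes per entry and the ×5 safety multiplier are the planning values from this section, not measured constants):

```shell
#!/bin/bash
# Cache RAM estimate in MB: entries × 200 bytes × safety multiplier
cache_ram_mb() {
    # args: cache_entries safety_multiplier
    echo $(( $1 * 200 * $2 / 1000000 ))
}

cache_ram_mb 1500000 1   # raw estimate  -> 300 (MB)
cache_ram_mb 1500000 5   # with headroom -> 1500 (MB)
```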

Recommended RAM

ISP Size    Recommended RAM
10K         8–16 GB
30K         16–24 GB
100K        32 GB

Insufficient RAM causes:

  • Cache eviction
  • Hit ratio drop
  • CPU spike
  • Latency explosion

DNS Performance Triangle

Core relationship:

  1. QPS
  2. Cache Hit Ratio
  3. CPU Cores

RAM influences hit ratio.
Hit ratio influences CPU.
CPU influences latency.

Subscriber count alone means nothing.

Recommended Hardware (100K ISP)

Layer              Cores   RAM       Notes
DNSDIST (×2 HA)    8       16 GB     Packet cache enabled
Recursive (×3–4)   8–12    32 GB     Large cache
Authoritative      4       8–16 GB   Light load

The following case study includes:

  • Realistic 50K ISP deployment model
  • Pakistan-specific traffic behavior
  • PTA / local bandwidth realities
  • WhatsApp / YouTube heavy usage pattern
  • Ramadan peak pattern
  • Load measurements
  • Final hardware design

Glossary of Key Terms

QPS (Queries Per Second)
Number of DNS queries received per second.

Hit Ratio (H)
Percentage of queries answered from cache.

Cache Miss
Query requiring full recursive resolution.

Recursive QPS
Cache-miss queries that consume CPU.

DNSDIST
DNS load balancer and frontend packet handler.

SoftIRQ
Linux kernel mechanism handling network interrupts.

Zipf Distribution
Statistical model where few domains dominate most queries.


Real-World Case Study

~50,000 Subscriber ISP Deployment (Pakistan)

Location: Mid-size city ISP in Karachi
Access Type: GPON + PPPoE
Upstream: PTCL + Transworld
Peak Hour: 8:30 PM – 11:30 PM
User Profile: Residential + small offices

Why This 50K Profile Matters

This profile represents a mid-sized Pakistani ISP typically operating in secondary cities.
Traffic is mobile-heavy, CDN-dominant, and shows strong evening peaks influenced by:

  • WhatsApp
  • YouTube
  • Android updates
  • Ramadan late-night spikes

This example demonstrates practical DNS scaling behavior in real Pakistani environments.

12.1 Network Overview

Architecture

  • Core Router (MikroTik CCR / Juniper MX)
  • BRAS / PPPoE Concentrator
  • DNSDIST HA pair (2 VMs)
  • 3 Recursive Servers (BIND)
  • Local NTP + Monitoring

12.2 Measured Production Data

Initial baseline measurement (using BIND statistics):

Total Subscribers:

  • 50,000

Peak Concurrent Users (measured via PPPoE sessions):

  • 14,800 – 16,500
  • ≈ 30–33%

Measured Peak QPS:

  • 38,000 – 44,000 QPS

Observed behavior:

  • Strong WhatsApp and YouTube dominance
  • TikTok traffic rising
  • Android update storms monthly
  • Windows update bursts on Patch Tuesday
  • Ramadan night peaks significantly higher

12.3 Pakistani Traffic Pattern Characteristics

1️⃣ YouTube & Google CDN Dominance

  • youtube.com
  • googlevideo.com
  • gvt1.com
  • whatsapp.net
  • fbcdn.net

High CDN reuse = High cache hit ratio

2️⃣ Ramadan Effect

During Ramadan:

  • Post-Iftar spike (~8 PM)
  • Late-night spike (1–2 AM)
  • Hit ratio increases (same content watched)

Peak QPS increased ~18% compared to normal month.

3️⃣ Mobile-Heavy Usage

70% users on Android devices.

This causes:

  • Background DNS queries
  • App telemetry lookups
  • Frequent short bursts

Average active user QPS observed:

2.7–3.5 QPS

Engineering value used: 3 QPS

12.4 Cache Hit Ratio Measurement

Measured over 24-hour window:

Time                  Hit Ratio
Normal hours          72%
Peak hours            76%
Ramadan late night    81%
During update storm   61%

Engineering worst-case design value used:

H=0.65

12.5 Capacity Modeling

12.6 Recursive Core Requirement

Assume:

  • Peak frontend QPS ≈ 48,000
  • Worst-case hit ratio H = 0.65
  • ~1,000 recursive QPS per core (conservative)

Recursive QPS = 48,000 × (1 − 0.65) = 16,800
Cores required ≈ 17 (before headroom)

Deployment chosen:

Server   CPU       RAM
REC1     8 cores   32 GB
REC2     8 cores   32 GB
REC3     8 cores   32 GB

Total = 24 cores (headroom included)
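As a sanity check, plugging the worst-case design values into the model (48,000 QPS is the design frontend figure used in section 12.7, H = 0.65 the worst-case hit ratio, ~1,000 recursive QPS per core the conservative planning value):

```shell
#!/bin/bash
# 50K case: worst-case recursive load vs deployed cores
total=48000; hit_pct=65; per_core=1000; deployed=24

recursive=$(( total * (100 - hit_pct) / 100 ))        # 16,800 recursive QPS
needed=$(( (recursive + per_core - 1) / per_core ))   # 17 cores, rounded up
echo "recursive=${recursive} needed=${needed} deployed=${deployed}"
```

17 cores required against 24 deployed is the headroom that later absorbed the update-storm hit-ratio drop described below.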

12.7 DNSDIST Frontend Requirement

  • Total frontend QPS ≈ 48,000

Deployment:

Node        CPU       RAM
DNSDIST-1   6 cores   16 GB
DNSDIST-2   6 cores   16 GB

Active-Active via VRRP

12.8 RAM Sizing Decision

Estimated unique domains per hour:

~600,000

With recursion state and buffers → 32GB chosen.

Result:

  • No swap
  • Stable cache
  • Hit ratio maintained

12.9 Benchmark Results (After Deployment)

Cache-Hit Benchmark:

  • 28,000 QPS per server stable

Cache-Miss Benchmark:

  • 4,200 QPS per server stable

Real Production Peak:

Metric            Value
Total QPS         44K
Recursive QPS     14–17K
CPU usage         55–68%
UDP drops         0
Avg latency       3–7 ms
99th percentile   < 18 ms

System stable even during:

  • PSL streaming nights
  • Ramadan peak
  • Android update storm

12.10 Lessons Learned (Local Engineering Insight)

1️⃣ Subscriber Count Is Misleading

  • 50K subscribers did NOT mean 50K load.
  • Peak concurrency was only 32%.

2️⃣ Cache Hit Ratio Is Gold

  • Higher cache hit ratio reduced recursive CPU by ~70%.
  • RAM investment reduced CPU investment.

3️⃣ Pakistani Traffic Is CDN Heavy

  • This increases hit ratio compared to some international ISPs.
  • Good for DNS performance.

4️⃣ Update Storms Are Real Risk

Worst-case hit ratio drop observed:

  • 61%
  • Recursive QPS jumped by 30%.
  • Headroom saved the network.

5️⃣ SoftIRQ Monitoring Is Critical

Early packet drops were observed before tuning; solved by increasing:

  • net.core.netdev_max_backlog

12.11 Final Hardware Summary (50K ISP)

Layer           Qty   CPU       RAM
DNSDIST         2     6 cores   16 GB
Recursive       3     8 cores   32 GB
Authoritative   1     4 cores   8 GB

This setup safely supports:

  • 50K subscribers
  • ~50K peak QPS
  • 30% growth buffer

12.12 Growth Projection

Projected growth to 70K subscribers:

Estimated QPS:

70,000 × 0.3 × 3 = 63,000

Existing infrastructure can handle with:

  • 1 additional recursive node
    OR
  • CPU upgrade to 12 cores per node

No DNSDIST change required.

Engineering Takeaway for Pakistani ISPs

In Pakistan:

  • High mobile usage
  • High CDN overlap
  • Ramadan spikes
  • Update storms
  • PSL / Cricket live streaming bursts

Design must consider:

Worst Case Hit Ratio

Not average.

  • Overdesign recursive layer slightly.
  • DNS failure at peak hour damages brand reputation immediately.

Closing Thought

DNS is invisible — until it fails.

In competitive Pakistani ISP market:

  • Latency matters
  • Stability matters
  • Evening performance defines customer satisfaction

Engineering-driven DNS sizing ensures:

  • No random slowdowns
  • No unexplained packet loss
  • No midnight emergency calls

The following is an additional urban-scale case study for a Karachi metro ISP with ~100K subscribers, structured in the same engineering style as the previous case study.


13. Real-World Case Study

100,000 Subscriber Metro ISP Deployment (Karachi Urban Profile)

Karachi Metro ISP – 100K Subscriber DNS Deployment Model


Location: Karachi (Metro Urban ISP)
Access Type: GPON + Metro Ethernet + High-rise FTTH
Upstream Providers: PTCL, Transworld, StormFiber peering, local IX (KIXP)
Customer Type: Dense residential, apartments, SMEs, co-working spaces
Peak Hours:

  • Weekdays: 8:00 PM – 12:00 AM
  • Weekends: 4:00 PM onward
  • Special Events: Cricket matches, PSL, political events, software release days

Why Karachi Metro Traffic Is Different

Karachi urban ISP environments show:

  • Higher concurrency (35–40%)
  • Higher QPS per user (gaming + streaming)
  • Event-driven traffic bursts (PSL, ICC matches)
  • More SaaS and SME usage

This significantly affects recursive CPU sizing and worst-case hit ratio modeling.

13.1 Metro Architecture Overview

Logical Layout

  • Core Routers (Juniper MX / MikroTik CCR2216 class)
  • PPPoE BRAS cluster
  • Anycast-ready DNSDIST HA pair
  • 4 Recursive Servers (BIND cluster)
  • Monitoring (Zabbix / Prometheus)
  • Netflow traffic analytics

13.2 Traffic Characteristics — Karachi Urban Behavior

Karachi differs from smaller cities in key ways:

1️⃣ Higher Concurrency Ratio

Measured peak concurrent users:

35–40%

Due to:

  • Dense apartments
  • Work-from-home population
  • Gaming users
  • Always-online devices

For modeling, we use:

100,000 × 0.38 = 38,000 active users

2️⃣ Higher Per-User QPS

Observed behavior:

  • Heavy gaming (PUBG, Valorant, Call of Duty)
  • Smart TVs
  • 3–5 mobile devices per household
  • CCTV cloud uploads
  • Background SaaS usage

Measured average:

3.2–4.1 QPS per active user

Engineering value used:

3.5 QPS

3️⃣ Event-Driven Traffic Spikes

Examples:

  • PSL match final
  • ICC cricket match
  • Major Windows release
  • Android security update rollout

QPS spike observed:

+22–28% above normal peak.

13.3 Measured Production Data

13.4 Cache Hit Ratio (Urban Environment)

Measured over 30-day period:

Condition       Hit Ratio
Normal day      74%
Peak evening    78%
Cricket match   83%
Update storm    58%

Urban CDN dominance increases hit ratio normally.

Worst-case engineering value chosen:

H=0.60

13.5 Recursive Load Calculation

This is the real CPU load requirement.

Frontend QPS = 38,000 active × 3.5 = 133,000
Recursive QPS = 133,000 × (1 − 0.60) = 53,200

13.6 Core Requirement Calculation

Assume safe recursion capacity: ~1,000 QPS per core → ≈ 54 cores required

Deployment selected:

Server   CPU        RAM
REC1     16 cores   64 GB
REC2     16 cores   64 GB
REC3     16 cores   64 GB
REC4     16 cores   64 GB

Total = 64 cores (headroom included)

Headroom margin ≈ 20%

13.7 DNSDIST Frontend Requirement

Frontend QPS ≈ 133,000 (peak)

Deployment:

Node        CPU        RAM
DNSDIST-1   12 cores   32 GB
DNSDIST-2   12 cores   32 GB

Configured in Active-Active mode with VRRP + ECMP.

13.8 RAM Sizing for Urban DNS

Unique domains per hour observed:

~1.5–2 million

Memory calculation:

2,000,000 entries × 200 bytes = 400 MB

Safety multiplier × 5 ≈ 2 GB for cache alone

With recursion states + buffers:

64GB selected for stability and growth.

13.9 Benchmark Results (After Deployment)

Cache-Hit Mode:

~45,000 QPS per recursive server stable

Cache-Miss Mode:

~5,500 QPS per server stable

Production Peak Snapshot:

Metric            Value
Total QPS         128K–135K
Recursive QPS     48K–55K
CPU Usage         60–72%
UDP Drops         0
Avg Latency       4–9 ms
99th Percentile   < 22 ms

Stable even during:

  • PSL final
  • Windows Update day
  • Ramadan night spikes

13.10 Karachi-Specific Engineering Observations

1️⃣ Gaming Traffic Increases DNS Load

Online games frequently resolve:

  • Matchmaking servers
  • Regional endpoints
  • CDN endpoints

Small TTL values increase recursion pressure.

2️⃣ High-Rise Apartments = High Overlap

  • Multiple households querying same domains simultaneously.
  • Boosts cache hit ratio significantly.

3️⃣ Corporate & SME Mix

SMEs introduce:

  • Microsoft 365
  • Google Workspace
  • SaaS endpoints

Increases DNS diversity.

4️⃣ IX Peering Improves Stability

  • Local IX (KIXP) reduces recursion latency.
  • Improved average resolution time by ~3ms.

13.11 Growth Projection (Urban Scaling)

Projected 130K subscribers:

Infrastructure supports up to:

~160K QPS safely

Upgrade path:

  • Add 5th recursive node
    OR
  • Upgrade CPUs to 24-core models

DNSDIST layer already sufficient.

13.12 Final Deployment Summary (Karachi Metro ISP)

Layer           Qty   CPU        RAM
DNSDIST         2     12 cores   32 GB
Recursive       4     16 cores   64 GB
Authoritative   2     6 cores    16 GB

Supports:

  • 100K subscribers
  • ~135K QPS peak
  • 25% growth buffer

Karachi Metro Engineering Insight

Urban ISPs must design for:

  • Higher concurrency
  • Higher QPS per user
  • Gaming + streaming overlap
  • Event-driven bursts
  • Rapid growth

In Karachi market:

  • Evening performance defines reputation.
  • DNS instability during cricket match = instant social media complaints.
  • Overdesign recursive layer slightly.
  • Frontend DNSDIST is rarely your bottleneck.

Final Comparative Snapshot

Comparative DNS Infrastructure – 50K vs 100K ISP

Parameter              50K ISP                100K Karachi Metro ISP
Peak concurrency       ~30–33%                ~35–40%
QPS per active user    3                      3.5
Peak total QPS         ~44K                   ~135K
Worst-case hit ratio   0.65                   0.60
DNSDIST nodes          2 × 6 cores / 16 GB    2 × 12 cores / 32 GB
Recursive nodes        3 × 8 cores / 32 GB    4 × 16 cores / 64 GB

Appendix A — Kernel Tuning (Linux)

Increase UDP Buffers

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.netdev_max_backlog = 50000

Apply:

sysctl -p

Monitor UDP Drops

netstat -su

Look for:

  • packet receive errors
  • receive buffer errors
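The same counters netstat reports can be read directly from /proc/net/snmp, which is handier for scripted polling (a Linux-only sketch; field positions are taken from the kernel's Udp header row rather than hard-coded):

```shell
#!/bin/bash
# Extract the UDP RcvbufErrors counter from /proc/net/snmp.
# The Udp section is two lines: a header row, then a data row.
udp_rcvbuf_errors() {
    awk '/^Udp:/ { if (hdr == "") hdr = $0; else data = $0 }
         END {
             n = split(hdr, h); split(data, d)
             for (i = 1; i <= n; i++)
                 if (h[i] == "RcvbufErrors") print d[i]
         }' /proc/net/snmp
}

udp_rcvbuf_errors   # 0 on a healthy box; a growing value means kernel drops
```

Poll this every few seconds during benchmarks: the absolute value matters less than whether it grows under load.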

Monitor SoftIRQ

cat /proc/softirqs

High softirq = network bottleneck.

Appendix B — Benchmark Checklist

Before declaring capacity:

  • No UDP drops
  • CPU < 80%
  • Stable latency
  • No kernel buffer errors
  • No swap usage

Final Engineering Principles

  • Measure first
  • Benchmark components independently
  • Model mathematically
  • Design for peak hour
  • Add headroom (30–40%)

Monitoring & Alerting Recommendations

Capacity planning is incomplete without monitoring.

Key Metrics to Track:

Metric                            Why It Matters
Total QPS                         Detect traffic spikes
Cache Hit Ratio                   Detect recursion surge
Recursive QPS                     True CPU load
CPU per core                      Saturation detection
UDP Drops                         Kernel bottleneck
SoftIRQ usage                     Network stack overload
Latency (avg + 99th percentile)   Early saturation warning

Recommended Thresholds:

  • CPU > 80% sustained → investigate
  • Hit ratio drop > 10% during peak → review cache size
  • UDP receive errors > 0 → kernel tuning required
  • 99th percentile latency rising → near saturation
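The thresholds above can be encoded as a simple check (a sketch; in practice the three inputs would be wired to your own collector, e.g. BIND stats and the UDP error counters):

```shell
#!/bin/bash
# Evaluate the recommended alert thresholds against sampled values.
check_thresholds() {
    # args: cpu_pct hit_ratio_drop_pct udp_recv_errors
    [ "$1" -gt 80 ] && echo "ALERT: CPU > 80% sustained"
    [ "$2" -gt 10 ] && echo "ALERT: hit ratio dropped > 10% during peak"
    [ "$3" -gt 0 ]  && echo "ALERT: UDP receive errors detected"
    return 0
}

check_thresholds 85 4 0   # prints only the CPU alert
```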

Suggested Monitoring Stack:

  • Prometheus + Grafana
  • Zabbix
  • Netdata (lightweight)
  • sysstat (sar)
  • Custom script polling BIND stats

Conclusion

DNS capacity planning is governed by:

Recursive Load = QPS × (1 − Hit Ratio)

Not subscriber count.

The expression:

QPS × (1 − Hit Ratio)

means:

Only the cache-miss portion of your total DNS traffic consumes real recursive CPU.

🔎 Step-by-Step Meaning

1️⃣ QPS

Queries Per Second hitting your DNS infrastructure (frontend load).

Example:

Total QPS = 90,000

This is what DNSDIST receives.

2️⃣ HitRatio

Percentage of queries answered from cache.

If:

HitRatio = 0.70  (70%)

That means:

  • 70% answered instantly from memory
  • 30% require full recursion

3️⃣ (1 − HitRatio)

This gives the cache-miss ratio.

So:

30% of total QPS hits recursive engine.

4️⃣ Final Formula

Example:

Recursive QPS = 90,000 × (1 − 0.70) = 27,000

That means:

  • Although the frontend is 90K QPS,
  • Only 27K QPS consumes recursive CPU.

💡 Why This Governs DNS Capacity Planning

Because:

  • DNSDIST load ≠ recursive CPU load
  • Subscriber count ≠ CPU requirement
  • Total QPS ≠ backend QPS

Recursive servers are CPU-bound.

And recursive CPU is determined by:

  • QPS × (1 − Hit Ratio)

🎯 Engineering Interpretation

If you improve hit ratio:

Hit Ratio   Recursive QPS (from 90K total)
50%         45K
70%         27K
80%         18K
90%         9K

Higher cache hit ratio = drastically lower CPU requirement.
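The table above is just the formula applied at a fixed 90K total:

```shell
#!/bin/bash
# Recursive QPS at a 90K frontend for a range of hit ratios
total=90000
for hit in 50 70 80 90; do
    echo "H=${hit}% -> recursive=$(( total * (100 - hit) / 100 )) QPS"
done
```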

🔥 Why RAM Matters

  • More RAM → Larger cache → Higher hit ratio
  • Higher hit ratio → Lower recursive CPU
  • Lower CPU → Stable latency

That’s the recursive performance triangle. So in simpler words, DNS capacity planning is governed by:

How many queries miss cache — not how many users you have.

Because only cache misses consume expensive recursive CPU cycles.


Correct engineering ensures:

  • Stable latency
  • No packet drops
  • Predictable scaling
  • Upgrade planning based on math

This is how ISP-grade DNS infrastructure should be designed.


Layered DNS Design with Pakistani ISP Context

Architecture Overview

In many Pakistani ISP environments — especially cable-net operators in Karachi, Lahore, Faisalabad, Multan, Peshawar and emerging FTTH providers — DNS infrastructure typically evolves reactively:

  • Start with one BIND server
  • Add second server as “secondary”
  • Increase RAM when complaints start
  • Restart named during peak
  • Hope it survives update storms

This works until subscriber density crosses ~25K active users. Beyond that point, DNS must move from “server-based” design to infrastructure-based architecture. The model described here is layered, scalable, and designed specifically for ISPs operating in Pakistani broadband realities.

High-Level Logical Architecture

Subscriber → Floating VIP → dnsdist (HA Pair) → Backend Pool → Internet

The system is divided into five functional layers. Each layer has a defined responsibility and failure boundary.

Layer 1 – Subscriber Ingress Layer

This is where real-world Pakistani ISP complexity begins.

Subscribers may be:

  • PPPoE users behind MikroTik BRAS
  • CGNAT users
  • FTTH ONT users
  • Shared cable-net NAT pools
  • Apartment building fiber aggregation

Important observation:

Even if 25K–30K subscribers are “behind NAT”, DNS load is not reduced. Each device generates independent queries.

In urban Karachi networks, for example:

  • One household may have 4–8 active devices
  • Streaming + mobile apps continuously generate DNS lookups
  • Smart TVs and Android boxes produce background DNS traffic

Subscribers are configured to use the floating VIP (e.g., 10.10.2.160). They never directly query the recursive backend. This abstraction is critical.

Layer 2 – Frontend Control Plane (dnsdist HA Pair)

Nodes:

  • LAB-DD1
  • LAB-DD2

Floating IP managed via VRRP.

Role:

  • Accept subscriber DNS traffic
  • Enforce ACLs
  • Apply rate limiting
  • Drop abusive patterns
  • Route queries to correct backend
  • Cache responses
  • Monitor backend health

This is not a resolver. It is a DNS traffic controller.
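A minimal dnsdist sketch of these frontend roles (the VIP 10.10.2.160 and backend 10.10.2.164 appear elsewhere in this post; the second recursive backend, the auth backend IP, the ACL ranges, and the 100-QPS per-IP cap are illustrative assumptions, not recommendations):

```lua
-- dnsdist as a DNS traffic controller, not a resolver
setLocal("10.10.2.160:53")                          -- floating VIP
setACL({"10.0.0.0/8", "100.64.0.0/10"})             -- subscriber ranges only
addAction(MaxQPSIPRule(100), DropAction())          -- per-IP rate limiting

newServer({address="10.10.2.164:53", pool="rec"})   -- recursive backend 1
newServer({address="10.10.2.165:53", pool="rec"})   -- recursive backend 2
newServer({address="10.10.2.170:53", pool="auth"})  -- authoritative backend

pc = newPacketCache(1000000, {maxTTL=60})           -- frontend packet cache
getPool("rec"):setCache(pc)

addAction(makeRule("ispname.local."), PoolAction("auth"))  -- local zone → auth
addAction(AllRule(), PoolAction("rec"))                    -- everything else → recursion
```

Rules are evaluated in order, so the catch-all PoolAction("rec") must come last; health checks against each newServer backend are enabled by default.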

Why This Matters in Pakistani ISP Context

During peak time (8PM–1AM):

  • Cricket streaming traffic increases
  • Mobile app usage spikes
  • Social media heavy usage
  • Windows and Android updates trigger bursts

Without frontend control:

  • The primary recursive server gets overloaded.
  • The secondary remains underused.

dnsdist prevents this uneven load.

Layer 3 – Traffic Classification Engine

Inside dnsdist, traffic is classified:

If domain belongs to local zone → Authoritative pool
Else → Recursive pool

In Pakistani ISP use cases, local domains may include:

  • ispname.local
  • billing portal
  • speedtest.isp
  • internal monitoring domains

If ISP does not host local zones, authoritative layer can be removed.

But separation remains best practice.

Layer 4 – Recursive Backend Pool

Recursive servers perform:

  • Internet resolution
  • Cache management
  • DNSSEC validation
  • External queries to root and TLD

In Pakistani ISP scenarios, recursive load characteristics:

Morning:
Low to moderate load

Afternoon:
Moderate browsing load

Evening:
High streaming + gaming + mobile app traffic

During major events (e.g., PSL match night):
Short burst QPS spikes

Without packet cache and horizontal scaling, recursive becomes bottleneck.

Layer 5 – External Resolution Layer

Recursive servers interact with:

  • Root servers
  • TLD servers
  • CDN authoritative servers
  • Google, Facebook, Akamai, Cloudflare zones

In Pakistan, upstream latency may vary depending on:

  • PTCL transit
  • TW1/TWA links
  • StormFiber transit
  • IX Pakistan exchange paths

Cache hit ratio reduces dependency on external latency.

End-to-End Query Flow Example (Pakistani Scenario)

Scenario 1 – Subscriber Opening YouTube

  1. User in Lahore opens YouTube.
  2. Device sends DNS query to VIP.
  3. dnsdist receives query.
  4. Cache checked.
  5. If cached → instant reply.
  6. If miss → forwarded to recursive.
  7. Recursive resolves via upstream.
  8. Response cached.
  9. Reply sent to subscriber.

Most repeated YouTube queries become cache hits within seconds.

Scenario 2 – Android Update Burst in Karachi

  1. 5,000 devices start update simultaneously.
  2. Unique subdomains requested.
  3. Cache hit ratio temporarily drops.
  4. Backend QPS spikes.
  5. dnsdist distributes evenly across recursive pool.
  6. Kernel buffers absorb short burst.
  7. No outage.

Without frontend layer, one recursive server may hit 100% CPU.

Scenario 3 – Infected Device Flood

  1. Compromised CPE sends 3,000 QPS random subdomain queries.
  2. dnsdist rate limiting drops excess.
  3. Recursive protected.
  4. Only abusive IP affected.

This is common in unmanaged cable-net deployments.

Failure Domain Isolation

Let’s analyze with Pakistani operational mindset.

If Recursive 1 crashes → Recursive 2 continues.
If dnsdist MASTER fails → BACKUP takes the VIP.
If Authoritative crashes → Only the local zone fails.
If a single backend CPU overloads → Load is redistributed.

Blast radius is contained.

VLAN Placement Strategy (Practical Pakistani ISP Setup)

Inside VMware or physical switch:

  • VLAN 10 – Subscriber DNS ingress (dnsdist nodes + VIP)
  • VLAN 20 – Backend DNS (recursive + auth)
  • VLAN 30 – Management

Do NOT create a separate VLAN per recursive server unnecessarily. Keep the design simple but logically separated.

Horizontal Scaling Model

As subscriber base grows:

  • From 25K → 50K → 80K active

You scale by:

  • Adding recursive servers to pool.
  • dnsdist automatically distributes.
  • No DHCP change required.
  • No client configuration change required.

This is true infrastructure scalability.

Why This Architecture Fits Pakistani ISP Growth Pattern

Many ISPs in Pakistan:

  • Start with 5K–10K users
  • Rapidly grow to 30K–40K
  • Suddenly hit stability issues
  • Increase RAM only
  • No architectural redesign

This layered design prevents crisis scaling. You can grow from:

  • 25K active → 100K active

By adding recursive nodes, not redesigning network.

Engineering Summary

This architecture provides:

✔ Deterministic failover
✔ Even load distribution
✔ Burst absorption
✔ Internal abuse containment
✔ Horizontal scalability
✔ Clear failure boundaries

In Pakistani ISP environments where growth is rapid and peak traffic patterns are unpredictable, DNS must be treated as core infrastructure — not as a background Linux service.


Threat Model & Risk Assessment

ISP DNS Infrastructure – Pakistani Operational Context

Designing DNS infrastructure without defining a threat model is like deploying a core router without thinking about routing loops.

In Pakistani ISP environments — especially cable-net and regional fiber operators — DNS sits in a very exposed position:

  • It faces tens of thousands of NATed subscribers
  • It faces infected home devices
  • It faces public internet traffic (if authoritative is exposed)
  • It handles high PPS UDP traffic
  • It becomes the first visible failure when something goes wrong

DNS is not just a resolver. It is an attack surface. This section defines the realistic threat model for a 25K–100K subscriber Pakistani ISP.

1. Threat Surface Definition

The DNS system contains multiple exposure layers:

  1. Subscriber ingress (PPPoE / CGNAT users)
  2. Frontend dnsdist layer (VIP)
  3. Recursive backend servers
  4. Authoritative backend (if used)
  5. Internet-facing queries (if auth exposed)
  6. Management interfaces

Each layer has different risk characteristics.

2. Internal Threats (Most Common in Pakistan)

In Pakistani ISP environments, the most frequent DNS stress does NOT come from external DDoS. It comes from internal subscriber networks.

2.1 Infected Subscriber Devices

Very common reality:

  • Windows PCs without updates
  • Pirated OS installations
  • Compromised Android devices
  • IoT cameras exposed to internet
  • IPTV boxes running modified firmware

These devices can generate:

  • High QPS bursts
  • Random subdomain queries
  • DNS tunneling attempts
  • Internal amplification behavior

Effect:

  • Recursive servers get overloaded from inside the network.
  • This is extremely common in cable-net deployments in dense urban areas.

Mitigation in This Design

  • Per-IP rate limiting in dnsdist
  • MaxQPSIPRule protection
  • ACL enforcement
  • Recursive servers not publicly exposed

Internal abuse is statistically more likely than external DDoS.

2.2 Update Storm Events

Real-world Pakistani scenarios:

  • Windows Patch Tuesday
  • Android system update rollout
  • Major app update (WhatsApp, TikTok, YouTube)
  • During Ramadan evenings (peak usage window)
  • PSL or Cricket World Cup streaming events

Sudden QPS spike occurs.

Symptoms:

  • Recursive CPU jumps to 90%
  • UDP drops increase
  • Latency increases
  • Customers complain “Internet slow”

Without cache and frontend load balancing, DNS collapses under burst.

Mitigation:

  • Packet cache in dnsdist
  • Large recursive cache
  • Horizontal recursive scaling
  • Kernel buffer tuning
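
The packet-cache piece of this mitigation can be sketched in dnsdist Lua (1.4+ syntax); the entry count and TTL bounds below are illustrative and should be sized to your memory budget:

```lua
-- Illustrative packet cache: absorbs repeat queries during update storms
pc = newPacketCache(1000000, {maxTTL = 86400, minTTL = 0})
getPool(""):setCache(pc)
```
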

3. External Threats

3.1 DNS Amplification / Reflection

If recursive is exposed publicly (misconfiguration):

  • Your ISP becomes reflection source.

Impact:

  • Upstream may null-route IP
  • Reputation damage
  • Regulatory complaints

Unfortunately, some smaller Pakistani ISPs accidentally expose recursive publicly.

Mitigation:

  • Recursive binds to private IP only
  • allow-recursion restricted
  • Firewall blocks external access
  • dnsdist ACL enforced
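
On the recursive side, the same restrictions can be sketched in BIND's named.conf. The ACL networks and listen address below are placeholders for your own private ranges:

```conf
// Illustrative named.conf fragment — adapt networks and addresses to your deployment
acl "subscribers" { 10.0.0.0/8; 100.64.0.0/10; };
options {
    listen-on { 10.10.10.11; };        // private backend IP only, never public
    recursion yes;
    allow-recursion { "subscribers"; };
};
```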

3.2 UDP Volumetric Flood

Attackers can send high PPS traffic to port 53.

Impact:

  • Kernel buffer overflow
  • SoftIRQ CPU spikes
  • Packet drops
  • VIP failover triggered

Mitigation:

  • Aggressive sysctl tuning
  • netdev backlog tuning
  • VRRP HA
  • Upstream filtering (if available)
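
The kernel-side tuning can be sketched as sysctl settings; these values are illustrative starting points only and must be load-tested before production use:

```shell
# Illustrative sysctl values for high-PPS UDP (test before production)
sysctl -w net.core.rmem_max=268435456        # max UDP receive buffer
sysctl -w net.core.rmem_default=26214400     # default receive buffer
sysctl -w net.core.netdev_max_backlog=250000 # backlog before kernel drops packets
```

Persist chosen values in /etc/sysctl.d/ so they survive reboot.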

Note:

  • dnsdist is not a full DDoS appliance.
  • Edge router protection still required.

3.3 Authoritative Targeting

If ISP hosts:

  • Internal captive portal domain
  • Billing portal
  • Speedtest domain
  • Public customer domain

That authoritative zone may be targeted. Without separation, recursive performance also suffers.

Mitigation:

  • Separate authoritative pool
  • Health check-based routing
  • Ability to isolate authoritative backend

4. Infrastructure Threats

4.1 Single Point of Failure

Common in small ISPs:

  • One DNS VM
  • No VRRP
  • No monitoring

Failure of one VM = total browsing failure.

This design removes single point of failure at:

  • Frontend layer
  • Backend layer

4.2 Silent Recursive Failure

Example:

  • named process running
  • But resolution broken
  • High latency responses
  • Partial packet drops

Without health checks, frontend continues sending traffic.

Mitigation:

  • dnsdist active health checks
  • checkType A-record validation
  • Automatic backend removal

4.3 Resource Exhaustion

Common during peak:

  • File descriptor exhaustion
  • UDP buffer exhaustion
  • Swap usage under memory pressure

Result:

Random resolution delays.

Mitigation:

  • Increase fs.file-max
  • Disable swap
  • Large cache memory
  • Kernel buffer tuning
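
These guards translate roughly as follows; the limits are illustrative and should be matched to your scale:

```shell
# Illustrative resource-exhaustion guards
sysctl -w fs.file-max=2097152   # system-wide descriptor ceiling
swapoff -a                      # disable swap now; also remove entries from /etc/fstab
# For the DNS service itself, raise the per-process limit (systemd example):
#   [Service]
#   LimitNOFILE=1048576
```
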

5. Control Plane Exposure

dnsdist control socket must not be exposed.

Risk:

  • Configuration manipulation
  • Traffic rerouting
  • Statistics scraping

Mitigation:

  • Bind to 127.0.0.1
  • Firewall management VLAN
  • Separate management network
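
A minimal local-only control socket looks like this in dnsdist; generate a real key with makeKey() rather than the placeholder shown:

```lua
-- Illustrative local-only control socket (never bind this to a public IP)
controlSocket("127.0.0.1:5199")
setKey("PASTE-OUTPUT-OF-makeKey()-HERE")
```
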

6. VLAN Design Risk Considerations

Over-segmentation can introduce complexity. Under-segmentation increases risk. Minimum practical separation:

  • Subscriber VLAN (dnsdist frontend)
  • Backend VLAN (recursive + auth)
  • Management VLAN

Do NOT place recursive directly on subscriber VLAN.

Do NOT expose backend IPs to customers.

7. Risk Matrix – Pakistani ISP Context

The most common operational stress in Pakistan comes from internal subscriber behavior, not nation-state attacks. Size your defenses accordingly.

8. Acceptable Risk Boundaries

This architecture protects against:

✔ Single frontend crash
✔ Single recursive crash
✔ Internal abuse spikes
✔ Update bursts
✔ Accidental overload
✔ Packet flood at moderate scale

It does NOT protect against:

✘ Full data center power outage
✘ Upstream fiber cut
✘ Large-scale multi-gigabit DDoS
✘ BGP hijacking

Those require multi-site + Anycast.

9. Operational Assumptions

This threat model assumes:

  • Firewall correctly configured
  • Recursive not publicly exposed
  • Monitoring enabled
  • Failover tested quarterly
  • Cache properly sized
  • Swap disabled

Without monitoring, architecture alone is insufficient.

10. Engineering Conclusion

In Pakistani ISP environments, DNS instability most often comes from:

  • Growth without redesign
  • Lack of QPS visibility
  • No cache modeling
  • No frontend control plane

By introducing:

  • dnsdist frontend
  • VRRP failover
  • Recursive separation
  • Rate limiting
  • Cache modeling
  • Aggressive OS tuning

We reduce:

  • Operational panic during peak
  • Subscriber complaint spikes
  • Random browsing failures
  • Overload-induced outages

DNS must be treated like:

  • BNG
  • Core Router
  • RADIUS

Not like a “side VM”.

Engineering begins with understanding threats.
Then designing boundaries.


Monitoring & Alerting Blueprint (What to monitor and thresholds)


Now we move into what separates a stable ISP from a reactive one. Most DNS failures in Pakistani ISP environments are not caused by bad architecture; they are caused by lack of visibility. Below is a full Monitoring & Alerting Blueprint designed specifically for:

  • 25K–100K subscriber ISPs
  • dnsdist + Recursive + VRRP architecture
  • VMware-based deployments
  • Pakistani cable-net operational realities

What to Monitor, Why It Matters, and Thresholds for 25K–100K ISPs

A DNS system without monitoring is a silent failure waiting to happen. In Pakistani ISP environments, monitoring must detect:

  • QPS surge before collapse
  • Cache hit drop before CPU spike
  • Packet drops before customers complain
  • Recursive latency before timeout
  • Failover event before NOC panic

Monitoring must be:

  • Continuous
  • Threshold-driven
  • Alert-based
  • Logged historically

1️⃣ Monitoring Layers

We monitor 4 logical layers:

  1. Frontend (dnsdist)
  2. Recursive servers
  3. System / Kernel
  4. Infrastructure (VRRP & VMware)

Each has separate metrics and thresholds.

2️⃣ dnsdist Monitoring Blueprint

dnsdist is your control plane. If this layer fails, everything fails.

2.1 Metrics to Monitor

From dnsdist console or Prometheus exporter:

  • Total QPS
  • QPS per backend
  • Cache hit count
  • Cache miss count
  • Backend latency
  • Backend up/down status
  • Dropped packets (rate limiting)
  • UDP vs TCP ratio
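
Most of these counters are visible straight from the dnsdist console:

```lua
-- Illustrative console commands for the metrics listed above
showServers()                          -- per-backend status, QPS, latency
dumpStats()                            -- global counters, including drops
getPool(""):getCache():printStats()    -- packet cache hits and misses
```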

2.2 Key Thresholds

🔴 Total QPS

For 25K–30K active ISP:

  • Normal peak: 40K–80K QPS

Alert if:

  • Sustained > 90% of tested maximum capacity

Example:

If dnsdist tested stable at 80K QPS
Alert at 70K sustained for 5 minutes

🔴 Cache Hit Ratio

Healthy ISP:

  • 65%–85%

Alert if:

  • Drops below 55% during peak

Why?

  • Lower hit ratio = recursive overload coming.
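
The check itself is simple arithmetic; a minimal sketch, using the 55% alert floor assumed above:

```python
# Minimal sketch: cache hit ratio and the 55% alert floor from above
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of queries answered from cache."""
    total = hits + misses
    return hits / total if total else 0.0

def should_alert(ratio: float, floor: float = 0.55) -> bool:
    """True when hit ratio has dropped below the alert floor."""
    return ratio < floor

peak = hit_ratio(72_000, 28_000)   # healthy evening: 72%
print(round(peak, 2), should_alert(peak))
print(should_alert(hit_ratio(52_000, 48_000)))   # 52% during peak: alert
```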

🔴 Backend Latency

Normal recursive latency:

  • 2–10 ms internal
  • 20–50 ms internet resolution

Alert if:

  • Average backend latency > 100 ms sustained

This indicates:

  • CPU saturation
  • Packet drops
  • Upstream latency issue

🔴 Backend DOWN Status

Immediate critical alert if any recursive backend is marked DOWN.

Even if redundancy exists, this must alert.

🔴 Dropped Queries (Rate Limiting)

Monitor how many queries are dropped by:

  • MaxQPSIPRule

Alert if:

  • Sudden spike in dropped queries

This may indicate:

  • Infected subscriber
  • Local DNS abuse
  • Misconfigured device flood

3️⃣ Recursive Server Monitoring Blueprint

Recursive is CPU-heavy layer.

3.1 Core Metrics

On each recursive:

  • CPU utilization per core
  • System load average
  • Memory usage
  • Swap usage (should be 0)
  • UDP receive errors
  • Packet drops
  • File descriptor usage
  • Cache size
  • Recursive QPS

3.2 Critical Thresholds

🔴 CPU

Alert if:

  • Any recursive server > 80% CPU sustained for 5 minutes

If >90% → immediate alert.

🔴 Memory

Alert if:

  • RAM usage > 85%
  • Any swap usage (swap must remain 0)

If swap > 0 → critical misconfiguration.

🔴 UDP Errors

Check:

  • netstat -su

Alert if:

  • Packet receive errors increasing continuously. This indicates kernel buffer exhaustion.
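
A quick way to see whether the counter is still climbing (output format varies by distro and kernel):

```shell
# Illustrative check: is the UDP receive-error counter growing?
netstat -su | grep -i 'receive errors'
# or compare two samples ten seconds apart:
nstat -az UdpRcvbufErrors; sleep 10; nstat -az UdpRcvbufErrors
```

A static non-zero value is history; a growing value is an active problem.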

🔴 Recursive QPS Per Node

If expected load per node:

12K QPS

Alert if:

Sustained > 15K QPS

That means you are approaching CPU limit.

4️⃣ System / Kernel Monitoring

This layer is ignored by many ISPs, yet UDP packet drops often originate here.

4.1 Monitor

  • net.core.netdev_max_backlog utilization
  • SoftIRQ CPU usage
  • Interrupt distribution
  • NIC packet drops
  • Interface errors
  • Ring buffer overflows

Alert if:

  • RX dropped packets increasing
  • SoftIRQ > 40% of CPU

5️⃣ VRRP Monitoring

Keepalived must be monitored.

Alert if:

  • VIP moves unexpectedly
  • MASTER changes state
  • Both nodes claim MASTER (split-brain)

In Pakistani ISP environments with shared switches, multicast issues may cause VRRP instability. Monitor VRRP logs continuously.
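
One low-effort way to surface these transitions is a keepalived notify script; the path and log tag below are illustrative:

```shell
#!/bin/sh
# Illustrative /etc/keepalived/notify.sh
# keepalived calls it as: notify.sh <GROUP|INSTANCE> <NAME> <STATE>
TYPE="$1"; NAME="$2"; STATE="$3"
logger -t keepalived "VRRP $TYPE $NAME transitioned to $STATE on $(hostname)"
# Wire it up in keepalived.conf:
#   vrrp_instance VI_1 {
#       notify /etc/keepalived/notify.sh
#   }
```

Shipping these log lines to your central syslog makes unexpected VIP moves and split-brain events immediately visible to the NOC.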

6️⃣ VMware-Level Monitoring

Since all VMs are on shared host:

Monitor:

  • Host CPU contention
  • Ready time (vCPU wait)
  • Datastore latency
  • Network contention

Alert if:

CPU ready time > 5%. DNS under high QPS is sensitive to CPU scheduling delay.

7️⃣ Alert Severity Model

Use 3 levels:

🟢 Warning
🟠 High
🔴 Critical

Example:

🟢 CPU 75%
🟠 CPU 85%
🔴 CPU 95%

Alerts must escalate if sustained > 3–5 minutes. Avoid alert fatigue.

8️⃣ Recommended Monitoring Stack

Practical for Pakistani ISPs:

  • Prometheus
  • Grafana
  • Node exporter
  • dnsdist Prometheus exporter
  • Alertmanager

Or simpler:

  • Zabbix
  • LibreNMS
  • Even basic Nagios

Do not rely only on “htop”.

9️⃣ What Not to Ignore

In Pakistani ISP environments, many outages occur because:

  • No QPS baseline known
  • No cache hit tracking
  • No packet drop monitoring
  • No failover testing
  • No alert thresholds defined

Monitoring must answer:

  • What is normal peak?
  • What is dangerous peak?
  • When to add recursive?
  • When to upgrade CPU?
  • When to add RAM?

10️⃣ Practical Example (25K–30K Active ISP)

Healthy Evening Metrics:

Total QPS: 60K
Hit ratio: 72%
Recursive per node: 9K QPS
CPU per recursive: 55–65%
UDP drops: 0

Danger Metrics:

Total QPS: 85K
Hit ratio: 52%
Recursive per node: 18K
CPU: 90%
UDP errors increasing

At this stage, scaling must be planned.

11️⃣ When to Add 3rd Recursive?

Add new recursive when:

  • CPU > 75% during peak for multiple days
  • Cache hit ratio stable but CPU rising
  • QPS trending upward month over month
  • Subscriber base increasing rapidly

Do NOT wait for outage. Scale before saturation.

12️⃣ Monitoring Philosophy

In Pakistani ISP context:

Most DNS outages happen not because architecture is bad,
but because growth outpaces monitoring.

DNS should have:

  • Real-time QPS dashboard
  • Cache hit graph
  • Backend latency graph
  • Per-node CPU graph
  • UDP drop graph

If you cannot see it, you cannot scale it.

Engineering Conclusion

Monitoring is not optional.

For 25K–100K subscriber ISPs, DNS monitoring must:

✔ Predict overload
✔ Detect abuse
✔ Track failover
✔ Measure cache efficiency
✔ Guide capacity planning

  • Architecture prevents collapse.
  • Monitoring prevents surprise.
  • Together, they create stability.

 

February 12, 2026

Designing NAS & BNG Architecture – MikroTik vs Carrier-Grade BNG (Juniper, Cisco, Nokia, Huawei)



Designing NAS & BNG Architecture for 50,000+ FTTH Subscribers

(MikroTik vs Carrier-Grade BNG: Juniper, Cisco, Nokia, Huawei)

A Real-World ISP Engineering Perspective

Author: Syed Jahanzaib | A Humble Human being! nothing else 😊
Platform: ISP
Audience: ISP / Telco Network Engineers, Architects, CTOs




Design Objective & Scope

This article evaluates NAS/BNG architecture design specifically for ISPs targeting 50,000+ FTTH subscribers. The purpose is not vendor comparison from a marketing perspective, but architectural decision-making based on:

  • Subscriber concurrency
  • Aggregate throughput modeling
  • CGNAT scaling
  • High availability design
  • Operational stability
  • Long-term growth projection

This guide assumes familiarity with PPPoE, RADIUS, CGNAT, and core routing fundamentals. The objective is to determine when MikroTik is sufficient — and when a carrier-grade BNG becomes operationally necessary.


Introduction

When an ISP crosses 50,000 active subscribers, traditional “router-as-NAS” thinking no longer applies.

At 80,000+ FTTH users, your NAS is no longer just a PPPoE termination device; it becomes the subscriber state engine of the entire network.

This article is written from real operational experience, not vendor marketing. It covers:

  • Realistic bandwidth & session modeling
  • Why MikroTik struggles at scale (even x86)
  • Correct distributed NAS design
  • CGNAT engineering at 50k+ scale
  • Carrier-grade BNG comparison (Juniper, Cisco, Nokia, Huawei)
  • MikroTik NAS performance tuning checklist
  • Monitoring KPIs that actually matter
  • CAPEX vs OPEX trade-offs
  • Office gateway comparison (MikroTik vs FortiGate vs Sangfor IAG)

But first, let’s discuss our Pakistani market.

Common Misconceptions in Pakistani Cable & ISP Market

In the Pakistani ISP and cable broadband market, several architectural mistakes are repeated due to cost pressure, legacy mindset, or partial understanding of scaling behavior.

Let’s clarify some common misconceptions.

Common Red Flags in Pakistani Cable Network Audits

  • Single NAS for 20k+ users
  • CGNAT + PPPoE on same box
  • Simple queues for 10k users
  • No PPS monitoring
  • No NAT logging
  • No or Minimum VLAN segmentation
  • No redundancy
  • No documented growth plan <<< This hits hard when something goes wrong…

❌ Misconception 1: “More CPU cores = More PPPoE users”

Many operators believe:

If we buy a 32-64-core x86 server, it will easily handle 20k–30k PPPoE users.

Reality:

  • PPPoE session handling is not perfectly multi-thread scalable.
  • IRQ imbalance causes one core to saturate.
  • Queue engine remains CPU-driven.
  • PPS (packet per second) becomes bottleneck before bandwidth.

Result:

  • 5k–8k users stable
  • Beyond that → latency spikes and random PPP drops

More cores do not automatically equal linear scaling.

❌ Misconception 2: “10G port means 10G performance”

Having 10G SFP+ does not guarantee 10G stable forwarding at scale.

Throughput depends on:

  • Packet size mix
  • PPS rate
  • CPU scheduler
  • Firewall complexity
  • Queue configuration

Many ISPs see:

  • 10G interface installed
  • But CPU hits 100% at 6–7 Gbps mixed traffic

Interface speed ≠ forwarding capacity.

❌ Misconception 3: “All users can be on one VLAN”

Some cable ISPs still run:

  • All ONUs in one broadcast domain
  • One PPPoE server
  • One NAS

At 20k–50k subscribers, this causes:

  • Broadcast storms
  • ARP pressure
  • Massive failure domain
  • Maintenance outage for entire network

Correct design:

  • VLAN per OLT
  • VLAN per PON
  • Distributed NAS load < KEY 🙂 SJZ

❌ Misconception 4: “CGNAT + PPPoE on same router saves cost”

This is very common in local deployments.

Operators try:

  • PPPoE termination
  • Queue shaping
  • Firewall
  • CGNAT
  • BGP
    All on one box.

Even if it works at 3k–5k users, at 20k+:

  • Latency increases
  • NAT session exhaustion
  • CPU spikes at evening peak

Cost saving today → outage tomorrow.

❌ Misconception 5: “If traffic is working, architecture is correct”

Many networks appear fine during daytime. Evening peak exposes design weakness.

True engineering validation requires:

  • 95th percentile monitoring
  • PPS monitoring
  • Per-core CPU tracking
  • Session growth tracking

If your design only works at 40% load, it is not stable.

❌ Misconception 6: “CDN means NAS load is reduced”

Local CDN (Facebook, YouTube, Netflix) reduces:

  • International bandwidth cost
  • Latency for content cached near your location

But it does NOT reduce:

  • Packet processing load
  • Subscriber state handling
  • PPPoE session load
  • Queue overhead

NAS still forwards total traffic internally.

❌ Misconception 7: “MikroTik is bad for large ISPs”

In Pakistani forums, you often hear:

“MikroTik cannot handle more than 2000~3000 users.”

That is not accurate. BUT FIRST Read this.

Common MikroTik Deployment Models in FTTH

In production ISP environments, MikroTik is typically deployed in one of the following architectures:

  1. Centralized PPPoE Concentrator

All subscriber sessions terminate on a single core router.

  2. Distributed NAS Model

Multiple MikroTik routers placed at aggregation layer to distribute session load.

  3. Hybrid Model

MikroTik handles PPPoE termination while core router handles CGNAT and routing.

Each deployment model affects:

  • Failure impact radius
  • CGNAT performance
  • RADIUS transaction load
  • Broadcast domain size
  • Scalability ceiling

Architecture choice directly impacts long-term stability at 50k+ subscriber scale.

MikroTik can handle large scale IF:

  • Distributed architecture is used (you need to distribute load by adding more NAS after specific number of users/BW/cpu load)
  • No simple queues
  • No heavy firewall
  • CGNAT separated
  • Proper VLAN segmentation
  • CPU margin maintained
  • NAT avoided on the NAS itself

The real problem is usually poor architecture — not brand limitation.

❌ Misconception 8: “Carrier BNG is only for Tier-1 ISPs”

Some operators believe:

Juniper / Cisco / Nokia / Huawei BNG is only for high-end, Tier-1-level operators.

Carrier-grade BNG platforms separate control plane from data plane:

  • Control Plane (subscriber authentication, routing logic)
  • Data Plane (packet forwarding, QoS enforcement)

This separation provides:

  • Predictable performance under load
  • Hardware forwarding acceleration (ASIC-based)
  • Reduced CPU spikes during mass reconnect events
  • Better CGNAT scalability

Software-based routers rely heavily on the CPU for both control and forwarding, which introduces scaling ceilings.

Reality:

If you have:

  • 50k+ active users
  • 200+ Gbps traffic
  • CGNAT > 50k users
  • Enterprise customers
  • Government compliance needs

You are already in carrier category — even if you started as cable operator.

❌ Misconception 9: “Scaling vertically is easier than horizontally”

Many ISPs prefer:

  • Buy one bigger router
  • Instead of multiple moderate routers

Vertical scaling increases:

  • Single point of failure
  • Maintenance impact
  • Risk exposure

Horizontal scaling increases:

  • Stability
  • Flexibility
  • Upgrade safety

At 50k+ users, horizontal scaling is the safer design.

❌ Misconception 10: “We will upgrade architecture later”

Common mindset:

Let’s grow to 100k users first, then redesign.

But migrating NAS architecture at 50k+ subscribers is operationally risky:

  • PPPoE session migration complexity
  • IP pool changes
  • RADIUS re-architecture
  • CGNAT port remapping
  • Subscriber outage risk

Architecture should scale with growth — not after crisis.

Operational Pitfalls at 50k+ Scale

At large FTTH scale, the following issues commonly appear:

  • CPU spikes during mass reconnect events
  • RADIUS overload during outage recovery
  • CGNAT table exhaustion
  • BGP route churn affecting stability
  • Single router failure impacting entire subscriber base

Design must assume failure events — not only steady-state operation. Architecture that survives failure is carrier-grade. Architecture that survives only normal load is not.

Reality of Pakistani ISP Environment

Challenges specific to local market:

  • IPv4 shortage → heavy CGNAT dependence (more Logging)
  • Budget constraints
  • Rapid subscriber growth
  • Low ARPU pressure
  • Hybrid fiber + other modes of deployments
  • Limited centralized monitoring culture

Because of these constraints, design discipline becomes even more important.

Engineering Mindset Shift Needed

Instead of asking:

“Which router is powerful?”

We should ask:

  • What is peak PPS?
  • What is per-core load?
  • What is session growth trend?
  • What is NAT port utilization?
  • What is failure blast radius?

This is the difference between:

Cable operator thinking
and
Carrier engineering thinking.


1️⃣ Defining the Real Scale: 50k+ FTTH Users

Option-1:
Capacity Planning Baseline Formula

Assumptions (realistic for FTTH):

  • Active users: 80,000
  • Average package: 10 Mbps
  • Peak concurrency: 25%
  • CDN present (Facebook, YouTube, Google)

Peak Bandwidth Calculation

  • 80,000 × 10 Mbps × 0.25 = 200 Gbps

Important:

  • CDN reduces upstream transit cost, not NAS forwarding load.
  • Your NAS still processes ~200 Gbps internally.
  • Design target should be ≥250 Gbps to allow growth and safety margin.
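
The arithmetic above can be written out directly; the figures mirror the assumptions already stated:

```python
# Sketch of the Option-1 sizing math (assumptions as stated above)
subscribers = 80_000
avg_package_mbps = 10
peak_concurrency = 0.25

# Internal forwarding load the NAS layer must process, in Gbps
peak_gbps = subscribers * avg_package_mbps * peak_concurrency / 1000
print(peak_gbps)

# Design target with ~25% headroom for growth and safety margin
design_target_gbps = peak_gbps * 1.25
print(design_target_gbps)
```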

Option-2:
Proper NAS/BNG sizing must be based on measurable parameters.

Peak Traffic Estimation:

Peak Traffic = Total Subscribers × Concurrency Ratio × Average Peak Bandwidth

Example:

  • 50,000 subscribers × 0.6 concurrency × 8 Mbps
  • = 240 Gbps theoretical peak demand

Concurrent Sessions:

  • 50,000 × 0.6 = 30,000 active sessions

Hardware must sustain:

  • 30k+ PPPoE sessions
  • 200–300 Gbps aggregate throughput
  • CGNAT state growth
  • Mass reconnect events during outages

Design decisions must be validated against these numbers — not vendor claims.


2️⃣ The Hidden Problem: Sessions & PPS (Not Bandwidth)

Modern households generate massive session counts:

  • Smart TVs, phones, tablets, IoT
  • Streaming, social media, updates

Conservative assumption:

  • 200 connections per subscriber
  • 80,000 × 200 = 16 million concurrent sessions

Why This Matters

  • PPPoE = per-subscriber state
  • Firewall/NAT = per-connection state
  • Queues = per-subscriber scheduling

Bandwidth is easy.
Session state + packets per second (PPS) is hard.


3️⃣ Industry-Standard BNG Architecture (Carrier Model)

Proper Carrier Flow

OLT / Access
→ Aggregation (10G / 100G)
→ Distributed BNG Layer
→ Core Router
→ Dedicated CGNAT Cluster
→ Transit / IX / CDN

Key Principles

  • No single NAS
  • Horizontal scaling
  • Hardware forwarding preferred
  • CGNAT always separate
  • AAA centralized (RADIUS cluster)

4️⃣ Why MikroTik (Including x86) Hits a Wall

Many ISPs report MikroTik instability beyond 5k–8k PPPoE users, even on powerful x86 servers. This is not a myth.

Root Causes

🔴 1. PPPoE is Not Fully Multi-Thread Scalable

  • One CPU core saturates
  • Others remain underutilized
  • Traffic chokes despite “low total CPU”

🔴 2. Software-Based Queuing

  • Simple queues / PCQ / queue tree = CPU
  • 5k–10k queues = scheduler overhead

🔴 3. High PPS Rate

  • Smaller packets (video, ACKs)
  • PPPoE overhead
  • CPU processes PPS, not ASIC

🔴 4. x86 IRQ & NUMA Issues

  • NIC interrupts bound to limited cores
  • Cross-NUMA memory latency
  • PCIe bottlenecks

Carrier BNGs avoid this by separating:

  • Control plane (CPU)
  • Forwarding plane (ASIC/NPU)

5️⃣ Practical MikroTik Capacity (Real World)

Platform         Stable PPPoE Users   Typical Throughput
CCR1036          1k–2k                2–3 Gbps
CCR2216          4k–5k                10–15 Gbps
x86 (high-end)   6k–10k               20–30 Gbps

These numbers assume clean configs and no CGNAT.

6️⃣ Correct MikroTik Design for 50k+ Users (If Budget-Constrained)

Distributed NAS Model

  • 16 × CCR2216
  • 5k users per node
  • VLAN segmentation per OLT / area
  • RADIUS dynamic rate-limit
  • No simple queues
  • Minimal firewall
  • FastPath enabled
  • CGNAT moved out

Each NAS should stay below:

  • CPU < 65%
  • Conntrack < 60%
  • Zero packet drops

7️⃣ CGNAT Engineering at 50k+ users Scale

Assume 80% Natted users:

  • 80,000 × 0.8 = 64,000 CGNAT subscribers

Connections:

  • 64,000 × 200 = 12.8 million NAT sessions

Best Practices

  • Dedicated CGNAT cluster
  • Port block allocation
  • ≥200 public IPs
  • NAT logging to syslog / ELK
  • No PPPoE on CGNAT devices
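
The “≥200 public IPs” figure can be sanity-checked with port-block arithmetic. The block size below is an assumption for illustration, not a recommendation:

```python
import math

# Sketch: public IPs needed under port-block allocation (block size illustrative)
cgnat_subscribers = 64_000
usable_ports_per_ip = 64_512        # ports 1024-65535
ports_per_subscriber_block = 200    # assumed fixed block per subscriber

subs_per_ip = usable_ports_per_ip // ports_per_subscriber_block
ips_needed = math.ceil(cgnat_subscribers / subs_per_ip)
print(subs_per_ip, ips_needed)
```

Larger blocks per subscriber mean fewer subscribers per IP and proportionally more public IPs; choose the block size from measured per-subscriber connection counts.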

8️⃣ Carrier-Grade BNG Platforms (Industry Standard)

Commonly deployed vendors:

  • Juniper Networks – MX Series
  • Cisco Systems – ASR Series
  • Nokia – 7750 SR
  • Huawei Technologies – NE / ME Series

Why They Scale Better

  • ASIC / NPU forwarding
  • Hardware QoS
  • Hardware subscriber tables
  • Millions of sessions
  • ISSU (hitless upgrades)
  • Lawful intercept support

Typical deployment:

  • 2–4 BNG nodes
  • 40k users per node
  • 100G interfaces

9️⃣ MikroTik NAS Performance Tuning Checklist

System & CPU

  • Enable FastPath
  • Disable unused services
  • Avoid dual-socket x86
  • Ensure IRQ distribution

PPPoE

  • One PPPoE server per VLAN
  • MTU/MRU = 1492
  • One-session-per-host

Queues

  • ❌ No simple queues
  • ✔ RADIUS rate-limit
  • Minimal queue tree (if required)

Firewall

  • Accept established/related
  • Drop invalid
  • No Layer-7
  • Minimal logging

Design Rule

If CPU or conntrack crosses threshold → add another NAS, not “optimize harder”.
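
A few of the checklist items above translate into RouterOS commands roughly as follows. The interface name, service name, and rule order are illustrative, and syntax varies by RouterOS version, so verify against your own deployment before applying:

```routeros
# Illustrative RouterOS v6-style commands (verify against your version)
/interface pppoe-server server add service-name=isp interface=vlan100 \
    max-mtu=1492 max-mru=1492 one-session-per-host=yes
/ip firewall filter add chain=forward connection-state=established,related \
    action=fasttrack-connection comment="FastTrack established traffic"
/ip firewall filter add chain=forward connection-state=established,related action=accept
/ip firewall filter add chain=forward connection-state=invalid action=drop
```

Note the accept rule immediately after fasttrack-connection: FastTracked packets bypass most of the firewall, but the first packets of each connection still need an explicit accept.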


🔟 Monitoring KPIs That Actually Matter

Minimum Mandatory KPIs for 50k Subscriber Network

A production FTTH network must continuously monitor:

  • Active PPPoE sessions
  • Session creation rate per minute
  • CGNAT active translations
  • CPU utilization per core
  • Interrupt load
  • Packet drops
  • Queue latency
  • RADIUS response time
  • BGP session stability

Without long-term KPI trending, scaling decisions become reactive instead of planned.

CGNAT KPIs

  • Active NAT sessions
  • Port utilization
  • Public IP pool usage
  • NAT failures
  • Log server reachability

Monitoring tools:

  • Zabbix
  • LibreNMS
  • ELK
  • NetFlow / sFlow

1️⃣1️⃣ CAPEX vs OPEX Reality

OPEX Consideration (Very Important)

MikroTik Model

Pros:

  • Low initial cost
  • Flexible expansion
  • No heavy licensing

Hidden OPEX:

  • More devices to manage
  • More manual config sync
  • Higher troubleshooting time
  • Longer MTTR during outages
  • Skill dependency on engineer

Operational staff requirement often higher.

Carrier Model

Pros:

  • Fewer nodes
  • Centralized management
  • Hardware QoS
  • Faster troubleshooting
  • Better SLA stability
  • Vendor TAC support

OPEX:

  • Annual support renewal
  • Licensing subscription

But operational stress is lower.

Risk-Based Cost Perspective

Cost is not only CAPEX.

Cost also includes:

  • Outage duration impact
  • Customer churn
  • Reputation damage
  • SLA penalties
  • Engineering burnout

If a 3-hour nationwide outage causes:

  • 5% customer churn
  • Social media backlash

That hidden cost may exceed hardware savings.

Realistic Strategy for Pakistani ISP

If ARPU low & growth moderate:

  • Start with distributed MikroTik
  • Plan migration path within 3–4 years

If ARPU stable & enterprise customers present:

  • Consider phased carrier BNG investment

Final Financial Thought

The question is not:

“Which is cheaper?”

The real question is:

“At what subscriber size does operational risk cost more than hardware savings?”

For many Pakistani ISPs, that tipping point is between:

  • 40 ~ 50K active subscribers

 

Factor             MikroTik   Carrier BNG
Initial Cost       Low        High
Stability Margin   Tight      Wide
Growth Headroom    Medium     High
Compliance         Limited    Full

1️⃣2️⃣ Office Gateway Comparison (<1000 Users)

This is a different problem space.

MikroTik

  • Best for routing, VPN, VLANs
  • Weak security inspection

FortiGate (NGFW)

  • IPS, AV, SSL inspection
  • Enterprise security posture

Sangfor IAG

  • Identity-based access
  • End-user access control for office environments

Rule of thumb:

  • Routing only → MikroTik
  • Security first → FortiGate
  • Identity-centric → Sangfor IAG

Final Thought

  • A network is not stable because it is working today.
  • It is stable because it can survive peak load, hardware failure, growth, and compliance pressure.
  • Most outages in Pakistani ISPs are not hardware failures — they are architecture failures.

Final Engineering Verdict

At 50k+ active FTTH subscribers:

  • MikroTik can work, but only in strictly distributed architecture (still try to avoid it for peace)
  • Single or few “big” NAS boxes will fail
  • Carrier BNG platforms are architecturally superior
  • The decision is not about brand, it’s about risk tolerance

Throughput is easy.
Subscriber state and PPS are hard.
Design accordingly.


Network Design & Compliance Health Assessment for Pakistani ISPs

Strategic Overview for Management & Decision Makers
By Syed Jahanzaib !

1️⃣ Why This Assessment Matters

At 10,000+ subscribers, an ISP is no longer running a small cable network.
At 50,000+ subscribers, the ISP is operating at carrier scale.

At this stage, poor architecture decisions can result in:

  • Nationwide service outages
  • Regulatory penalties
  • Subscriber churn
  • Revenue loss
  • Reputation damage

This executive summary explains what management must verify to ensure the network is:

  • Scalable
  • Stable
  • Compliant
  • Financially sustainable

2️⃣ Key Business Risks Identified in Pakistani ISPs

Risk 1: Single Point of Failure

Many ISPs run:

  • One large NAS
  • CGNAT + PPPoE on same device
  • No redundancy

Impact:

  • 1 device failure = full outage
  • Repair time = hours
  • Social media backlash
  • Subscriber complaints spike

Risk 2: Hidden Capacity Crisis

Network may appear “working” but:

  • CPU runs near saturation at peak
  • No headroom for growth
  • No performance margin

Impact:

  • Evening slow speeds
  • Gradual customer dissatisfaction
  • Churn increase

Risk 3: CGNAT Legal Exposure

If NAT logs are:

  • Incomplete
  • Time unsynchronized
  • Not searchable

Impact:

  • Legal liability
  • PTA/FIA pressure
  • Reputation risk

Risk 4: Growth Without Architecture Upgrade

Common pattern in Pakistan:

  • Subscriber growth rapid
  • Infrastructure unchanged
  • Upgrade delayed until crisis

Impact:

  • Emergency upgrades
  • Higher cost
  • Network instability

3️⃣ What Management Should Demand from Technical Team

Capacity Visibility

  • Monthly 95th percentile bandwidth report
  • Peak concurrency data
  • 3-year growth projection

Architecture Review

  • Distributed NAS model
  • No single device handling excessive load
  • Clear redundancy design

Compliance Readiness

  • NAT logs properly stored
  • Law enforcement request SOP defined
  • Subscriber data secured

Monitoring Dashboard

Management-level dashboard should show:

  • Active subscribers
  • Peak bandwidth
  • CPU health
  • CGNAT utilization
  • Uptime percentage

If these are not visible to management, risk is invisible.

4️⃣ Financial Perspective: CAPEX vs Risk

Example (50k subscribers, Pakistan market):

  • Distributed MikroTik model ≈ XX Million PKR
  • Carrier-grade BNG ≈ XXX–XXX Million PKR

Management must evaluate:

Is lower upfront cost worth higher operational risk?

Key questions:

  • What is the cost of a 3-hour nationwide outage?
  • What is the churn impact of a persistent evening slowdown?
  • What is the cost of reputational damage?

Sometimes the cheaper hardware is more expensive long-term.
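
To make the first question concrete, the directly billed revenue at risk per outage hour can be estimated from subscriber count and ARPU. The figures below are assumptions chosen only for illustration; churn and reputational cost come on top of this number:

```python
# Back-of-envelope: billed service value lost per hour of full outage.
# Subscriber count and ARPU are assumed example figures, not real data.

subscribers = 50_000
arpu_pkr = 1_500              # assumed average revenue per user, PKR/month
hours_in_month = 30 * 24      # ~720 hours of service sold per month

revenue_per_hour = subscribers * arpu_pkr / hours_in_month
print(round(revenue_per_hour))  # → 104167 PKR of billed service per outage hour
```

Even this simple arithmetic shows that a single multi-hour outage can erase more value than the savings from cheaper, non-redundant hardware.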

5️⃣ Decision Framework for Management

If:

  • ARPU is low
  • Growth moderate
  • No enterprise SLA

→ Distributed MikroTik model acceptable (with strict design discipline)

If:

  • 50k+ subscribers
  • Enterprise clients present
  • Compliance pressure high
  • Growth >20% yearly

→ Begin migration planning toward carrier-grade BNG
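
The ">20% yearly" growth trigger compounds quickly, which is why migration planning should start before the crisis. A small sketch of the 3-year projection (starting figures are assumptions for illustration):

```python
# Compound subscriber growth over a planning horizon; inputs are illustrative.

def project_subscribers(current, yearly_growth_pct, years=3):
    # Apply the growth rate once per year, compounding
    return round(current * (1 + yearly_growth_pct / 100) ** years)

print(project_subscribers(50_000, 20))  # → 86400 subscribers after 3 years at 20%/yr
```

At 20% yearly growth, a 50k-subscriber network crosses 86k within three years, so a BNG migration planned today lands roughly when it is actually needed.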

6️⃣ Governance Recommendations

Management should implement:

  1. Quarterly architecture review
  2. Annual compliance audit
  3. Capacity forecast planning
  4. Incident post-mortem reporting with root-cause analysis (RCA)
  5. Defined network upgrade roadmap

Network design must be proactive — not reactive.

7️⃣ Executive Risk Scorecard

Management can classify network maturity:

  • Capacity Headroom: Safe / Warning / Critical
  • Redundancy: Full / Partial / None
  • Compliance Readiness: Strong / Moderate / Weak
  • Monitoring Visibility: Complete / Limited / None
  • Growth Preparedness: Planned / Reactive / Unknown

If two or more categories sit at their worst rating (Critical / None / Weak / Unknown) → immediate redesign review required.
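
The scorecard trigger can be encoded as a trivial helper so it can run automatically against monitoring data. This is a hypothetical sketch; the category names follow the table above, and the worst-rating set is taken from each category's lowest status:

```python
# Hypothetical helper encoding the scorecard trigger:
# two or more categories at their worst rating => immediate redesign review.

WORST = {"Critical", "None", "Weak", "Unknown"}  # worst rating of each category

def needs_redesign_review(scorecard):
    # Count how many categories currently sit at their worst rating
    at_worst = sum(1 for status in scorecard.values() if status in WORST)
    return at_worst >= 2

card = {
    "Capacity Headroom":     "Critical",
    "Redundancy":            "None",
    "Compliance Readiness":  "Moderate",
    "Monitoring Visibility": "Limited",
    "Growth Preparedness":   "Planned",
}
print(needs_redesign_review(card))  # → True: two categories are at their worst level
```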

8️⃣ Strategic Recommendation

For Pakistani ISPs scaling beyond 50k subscribers:

  • Architecture discipline becomes more important than hardware brand.
  • Horizontal scaling reduces outage risk.
  • Compliance readiness protects license.
  • Monitoring visibility reduces crisis events.
  • Growth planning reduces emergency CAPEX.

The goal is not just “network running.”

The goal is:

  • Predictable performance
  • Regulatory safety
  • Sustainable growth
  • Controlled operational stress

Final Executive Message

A network is a revenue engine.

At 80,000 subscribers, every hour of outage directly impacts millions of rupees in revenue and long-term brand trust.

The difference between a cable operator and a carrier-grade ISP is not size; it is governance, planning, and architecture maturity.

About the Author

Syed Jahanzaib
A Humble Human being! nothing else 😊