Post-Mortem — Incident 522 / WAN Failover (April 8, 2026)

2026-04-08

⚡ In short

Date: April 7-8, 2026. Duration: ~1h (21:28 UTC → 22:31 UTC). Severity: P1.

arleo.eu was unreachable for about an hour. The root cause was not the server, nginx, or CrowdSec: it was an HTTPS port forwarding rule attached to generic WAN instead of explicit WAN1 on the Netgear PR60X. The daily DHCP lease renewal of the 4G modem (WAN2) triggered a NAT rebalance that broke routing to port 443.

🧠 Why

The PR60X is configured with dual WAN (FTTH WAN1 + 4G WAN2 backup). On each WAN state transition, the router rebalances its NAT rules. A rule attached to generic WAN does not consistently reattach to WAN1 after the rebalance — result: incoming packets on WAN1 are no longer routed to the server.

This configuration trap had been latent since the initial PR60X setup. It had not been identified earlier, in part because the daily WAN2 reboot time (23:17 local time, ISP-imposed) was not documented. The incident also revealed that the physical network architecture (Freebox bridge → PR60X dual WAN → NUC) was undocumented.

🔧 What was done

Network Architecture (Revealed by the Incident)

Internet
   ↓
[Freebox FTTH] (Bridge mode — no routing)
   ↓
[Netgear PR60X] (actual router/firewall, dual WAN)
   ├── WAN1 = FTTH Free → 82.xxx.xxx.xxx (primary)
   └── WAN2 = 4G/LTE backup → 100.96.220.53 (backup, daily reboot ~23:17)
   ↓
LAN 192.168.0.0/22
   ↓
[NUC8i3BEH] 192.168.1.26 → nginx + CrowdSec + Grav

Timeline (UTC)

Time	Event
21:01–21:12	CrowdSec bans 97 IPs (95× http:scan, 2× http:exploit) — scan wave from Microsoft Azure subnets
21:21:28	Last BetterStack `/ping` probe received by nginx (status 200)
21:22–21:23	nginx surge: active connections 2 → 34, traffic 355 KB/s — CrowdSec memory spike to 1.4 GB
21:27:10	WAN2 disconnected on PR60X (ISP-forced 4G lease renewal)
21:27:14	WAN2 reconnected — PR60X rebalances NAT rules
21:28:15	BetterStack incident triggered — HTTP 522
21:30–22:00	Initial investigation chases a false lead: CrowdSec AppSec timeouts in nginx error.log
22:20–22:25	Key discovery: strict `tcpdump` on `dst port 443` shows zero incoming packets from outside
22:28	Timing identified: 522 crash correlates exactly with PR60X WAN2 events at 21:27
22:30	HTTPS rule on PR60X edited: bound explicitly to WAN1
22:31	Service restored
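
The gaps between the key events can be checked with a little arithmetic. The sketch below uses only timestamps from the timeline; the `:00` seconds on the restore time are an assumption, since the timeline gives it to the minute.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline above (all UTC, same evening).
FMT = "%H:%M:%S"
wan2_reconnected = datetime.strptime("21:27:14", FMT)
incident_opened = datetime.strptime("21:28:15", FMT)
# The timeline only says 22:31; the seconds are assumed to be :00.
service_restored = datetime.strptime("22:31:00", FMT)


def gap(start: datetime, end: datetime) -> timedelta:
    """Elapsed time between two timestamps on the same day."""
    return end - start


detection_delay = gap(wan2_reconnected, incident_opened)
alert_to_restore = gap(incident_opened, service_restored)

print(f"WAN2 reconnect -> 522 alert:   {detection_delay}")   # 0:01:01
print(f"522 alert -> service restored: {alert_to_restore}")  # 1:02:45
```

The 61-second gap between the WAN2 reconnect and the first 522 is what made the correlation at 22:28 so convincing.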

Exact Failure Mechanism

WAN2 (4G modem) receives its ISP-forced daily DHCP lease renewal (~23:17 local time)
PR60X logs WAN2 disconnected then WAN2 reconnected a few seconds later
On each WAN state transition, the PR60X rebalances its NAT/port forwarding rules
Since the HTTPS rule was attached to generic WAN (not explicit WAN1), it does not consistently reattach to WAN1 after the rebalance
Result: incoming packets on WAN1 are no longer routed to 192.168.1.26:443
Cloudflare sends requests to the origin IP → timeout → HTTP 522

The False Positive: CrowdSec AppSec

During the first hour, the nginx logs showed very convincing CrowdSec AppSec errors. The Lua bouncer had a fail-closed policy (deny on timeout), and it was timing out under the load of the scan wave. But these denials could only affect requests that reached nginx, and no requests were arriving because port forwarding was broken.

The decisive diagnostic:

sudo timeout 20 tcpdump -i eno1 -n "tcp dst port 443"

Zero incoming packets were captured during 20 seconds of test requests sent from a phone on 4G. The server was not the problem: packets were not even reaching it.
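
The same check can be turned into a small decision rule. The sketch below is illustrative only: it parses the text that `tcpdump -n` prints for IPv4 TCP packets and counts connection attempts (bare SYN flags) to the port. The function names and the sample address are hypothetical and not part of the incident tooling.

```python
import re

# Start of a `tcpdump -n` line for an IPv4 TCP packet, for example:
# "22:21:03.512345 IP 203.0.113.7.51514 > 192.168.1.26.443: Flags [S], ..."
LINE_RE = re.compile(
    r"^\S+ IP (?P<src>\d+\.\d+\.\d+\.\d+)\.\d+ > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]*)\]"
)


def count_inbound_syns(capture: str, port: int = 443) -> int:
    """Count new connection attempts (SYN without ACK) to the given port."""
    count = 0
    for line in capture.splitlines():
        match = LINE_RE.match(line.strip())
        if match and int(match["dport"]) == port and match["flags"] == "S":
            count += 1
    return count


def verdict(capture: str, port: int = 443) -> str:
    """Separate 'nothing reaches the host' from 'the host gets traffic'."""
    if count_inbound_syns(capture, port) == 0:
        return "network: no connection attempts reach the host"
    return "server: packets arrive, debug the application"
```

Feeding it the captured output of the command above while a test request is sent from outside the LAN gives the same answer as reading the capture by eye: an empty capture points at the network path, not at nginx.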

Applied Fix

Primary Fix — PR60X:

Field	Before	After
WAN IP Address	`WAN` (generic)	`WAN1` (explicit)
External port	443	443
Internal IP	192.168.1.26	192.168.1.26

Bonus Fix — CrowdSec Hardening in /etc/crowdsec/bouncers/crowdsec-nginx-bouncer.conf:

APPSEC_FAILURE_ACTION=allow (fail-open instead of fail-closed on timeout)
APPSEC_CONNECT_TIMEOUT=500
APPSEC_SEND_TIMEOUT=500
APPSEC_PROCESS_TIMEOUT=2000 (all three timeouts are in milliseconds)
ALWAYS_SEND_TO_APPSEC=false (requests from IPs that already have a CrowdSec decision are not also sent to AppSec)
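
The difference between fail-closed and fail-open can be shown with a toy model. This is not the bouncer's Lua code: `gate_request`, `inspect` and `overloaded_engine` are hypothetical names, and the `allow`/`deny` strings simply mirror the wording used above.

```python
from typing import Callable


def gate_request(inspect: Callable[[], bool], failure_action: str) -> str:
    """Return 'allow' or 'deny' for one request.

    inspect() returns True when the request looks malicious and raises
    TimeoutError when the inspection engine does not answer in time.
    failure_action decides that last case: 'allow' is fail-open,
    'deny' is fail-closed.
    """
    if failure_action not in ("allow", "deny"):
        raise ValueError(f"unknown failure action: {failure_action!r}")
    try:
        return "deny" if inspect() else "allow"
    except TimeoutError:
        return failure_action


def overloaded_engine() -> bool:
    """Stand-in for an inspection engine that is too busy to answer."""
    raise TimeoutError("inspection engine timed out")


print(gate_request(overloaded_engine, "deny"))   # fail-closed: deny
print(gate_request(overloaded_engine, "allow"))  # fail-open: allow
```

With a fail-closed policy, every timeout under load becomes a denied request in the logs, which is why the AppSec errors looked like the cause during the first hour.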

Lessons Learned

Convincing server logs can be a false lead. Before diving into application debugging, send a request from outside the network and confirm with tcpdump on the destination port that packets actually reach the server.
Port forwarding rules on dual-WAN routers must always be attached to an explicit WAN — NAT rebalancing is a silent trap.
A planned daily event that correlates with an intermittent outage is almost always the cause. The 23:17 / 23:28 timing should have been an earlier signal.
Documenting the physical network architecture is as important as documenting the application stack.

Corrective Actions

Short term:

Fix HTTPS rule to explicit WAN1 (done during incident)
CrowdSec AppSec hardening (done during incident)
Check all other PR60X port forwarding rules (HTTP, DNS-TLS, Plex…) and fix to explicit WAN1
Switch PR60X WAN mode to Failover instead of Load Balancing

Medium term:

Add a BetterStack check probing 82.xxx.xxx.xxx:443 directly (bypassing Cloudflare)
Update /documentation with a “Physical Network Architecture” section
Add PR60X WAN event monitoring (SNMP or Insight Cloud) to detect NAT rebalances in real time
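
A direct origin probe can be as simple as a TCP connect that does not go through Cloudflare. The sketch below is a minimal version under stated assumptions: `origin_port_open` is a hypothetical helper, the host argument stands for the redacted WAN1 address, and the probe has to run from outside the LAN to exercise the port forwarding rule.

```python
import socket


def origin_port_open(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Timeouts, refused connections and unreachable hosts all raise OSError
    (or a subclass of it), so they are all reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connect only proves that the forwarding path to port 443 works; it says nothing about TLS or HTTP health, which the existing `/ping` check through Cloudflare already covers.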

🏁 Conclusion

This one-hour incident had a trivial root cause, a checkbox in the PR60X interface, but it was completely masked by misleading application logs. BetterStack detected the incident within 2 minutes, and the tcpdump capture settled the question of server problem versus network problem. No data loss, no impact on content.

To go further:

💡 Set up an hourly script checking PR60X port forwarding rule consistency to detect any slip from WAN1 to generic WAN
💡 Implement SNMP monitoring of PR60X WAN events to alert on NAT rebalances before they cause an outage
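
The consistency check from the first idea could look like the sketch below. How the rules are read from the PR60X is left open, since no export interface is described here; the `ForwardRule` structure and the second example rule are hypothetical and only illustrate the comparison.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ForwardRule:
    """One port forwarding rule, as it would be read from the router."""
    name: str
    wan: str            # "WAN" (generic), "WAN1" or "WAN2"
    external_port: int
    internal_ip: str


def rules_not_pinned(rules: Iterable[ForwardRule],
                     expected_wan: str = "WAN1") -> List[ForwardRule]:
    """Return every rule that is not bound to the expected explicit WAN."""
    return [rule for rule in rules if rule.wan != expected_wan]


# Example input: the HTTPS rule after the fix, plus an illustrative HTTP
# rule still attached to the generic WAN.
example_rules = [
    ForwardRule("HTTPS", "WAN1", 443, "192.168.1.26"),
    ForwardRule("HTTP", "WAN", 80, "192.168.1.26"),
]

for rule in rules_not_pinned(example_rules):
    print(f"{rule.name} (port {rule.external_port}) is bound to {rule.wan}")
```

Run hourly, a non-empty result would be the signal to alert before the next WAN2 lease renewal turns the drift into an outage.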