Post-Mortem — Incident 522 / WAN Failover (April 8, 2026)

⚡ In short
Date: April 7-8, 2026 · Duration: ~1h (21:28 UTC → 22:31 UTC) · Severity: P1
arleo.eu was unreachable for about an hour. The root cause was not the server, nginx, or CrowdSec: it was an HTTPS port forwarding rule attached to the generic WAN interface instead of explicitly to WAN1 on the Netgear PR60X. The daily DHCP lease renewal of the 4G modem (WAN2) triggered a NAT rebalance that broke routing to port 443.
🧠 Why
The PR60X is configured with dual WAN (FTTH on WAN1, 4G backup on WAN2). On each WAN state transition, the router rebalances its NAT rules. A rule attached to the generic WAN interface does not consistently reattach to WAN1 after the rebalance, so incoming packets on WAN1 are no longer routed to the server.
This configuration trap had been latent since the initial PR60X setup. It had never been identified before, because the daily WAN2 reboot timing (23:17 local time, ISP-imposed) was not documented. The incident also revealed that the physical network architecture (Freebox bridge → PR60X dual WAN → NUC) was undocumented.
🔧 What was done
Network Architecture (Revealed by the Incident)
Internet
↓
[Freebox FTTH] (Bridge mode — no routing)
↓
[Netgear PR60X] (actual router/firewall, dual WAN)
├── WAN1 = FTTH Free → 82.xxx.xxx.xxx (primary)
└── WAN2 = 4G/LTE backup → 100.96.220.53 (backup, daily reboot ~23:17)
↓
LAN 192.168.0.0/22
↓
[NUC8i3BEH] 192.168.1.26 → nginx + CrowdSec + Grav
Timeline (UTC)
| Time | Event |
|---|---|
| 21:01–21:12 | CrowdSec bans 97 IPs (95× http:scan, 2× http:exploit) — scan wave from Microsoft Azure subnets |
| 21:21:28 | Last BetterStack /ping probe received by nginx (status 200) |
| 21:22–21:23 | nginx surge: active connections 2 → 34, traffic 355 KB/s — CrowdSec memory spike to 1.4 GB |
| 21:27:10 | WAN2 disconnected on PR60X (ISP-forced 4G lease renewal) |
| 21:27:14 | WAN2 reconnected — PR60X rebalances NAT rules |
| 21:28:15 | BetterStack incident triggered — HTTP 522 |
| 21:30–22:00 | Initial investigation chases a false lead: CrowdSec AppSec timeouts in nginx error.log |
| 22:20–22:25 | Key discovery: strict tcpdump on dst port 443 shows zero incoming packets from outside |
| 22:28 | Timing identified: the onset of the 522 errors correlates with the PR60X WAN2 events at 21:27 |
| 22:30 | Edit HTTPS rule on PR60X → explicitly fixed to WAN1 |
| 22:31 | Service restored |
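The correlation noted in the 22:28 row can be checked with a few lines of date arithmetic. The sketch below only reuses timestamps from the table. The calendar date and the `:00` seconds on the restoration time are assumptions, since the table lists times of day only.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline table. The date is an assumption:
# the table only gives times of day.
DAY = "2026-04-07"


def at(hms: str) -> datetime:
    """Parse an HH:MM:SS time of day on the assumed incident date."""
    return datetime.strptime(f"{DAY} {hms}", "%Y-%m-%d %H:%M:%S")


wan2_down = at("21:27:10")   # WAN2 disconnected
wan2_up = at("21:27:14")     # WAN2 reconnected, NAT rebalance
alert = at("21:28:15")       # BetterStack incident triggered
restored = at("22:31:00")    # table says 22:31, seconds assumed to be :00

flap_seconds = (wan2_up - wan2_down).total_seconds()
detection_seconds = (alert - wan2_up).total_seconds()
outage_minutes = (restored - alert).total_seconds() / 60

print(f"WAN2 flap lasted {flap_seconds:.0f} s")
print(f"Alert fired {detection_seconds:.0f} s after WAN2 reconnected")
print(f"Alert to restoration: {outage_minutes:.0f} min")
```

Keeping the raw timestamps in one place makes it easy to rerun the same arithmetic against router logs from other nights.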
Exact Failure Mechanism
- WAN2 (4G modem) receives its ISP-forced daily DHCP lease renewal (~23:17 local time)
- PR60X logs `WAN2 disconnected`, then `WAN2 reconnected` a few seconds later
- On each WAN state transition, the PR60X rebalances its NAT/port forwarding rules
- Since the HTTPS rule was attached to generic `WAN` (not explicit `WAN1`), it does not consistently reattach to WAN1 after the rebalance
- Result: incoming packets on WAN1 are no longer routed to 192.168.1.26:443
- Cloudflare sends requests to the origin IP → timeout → HTTP 522
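The disconnect/reconnect pair at the start of this chain is easy to spot mechanically once the router log is exported. The following is a minimal sketch, assuming a hypothetical line format of `HH:MM:SS WANn disconnected|reconnected`. The actual PR60X log format is not shown in this post and may differ.

```python
import re
from datetime import datetime, timedelta

# Hypothetical log line format: "21:27:10 WAN2 disconnected"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) (WAN\d) (disconnected|reconnected)$")


def find_flaps(lines, max_gap=timedelta(seconds=30)):
    """Return (wan, down_time, up_time) for each disconnect that is followed
    by a reconnect of the same WAN within max_gap."""
    pending = {}  # WAN name -> time of its last unmatched disconnect
    flaps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore unrelated log lines
        ts = datetime.strptime(m.group(1), "%H:%M:%S")
        wan, event = m.group(2), m.group(3)
        if event == "disconnected":
            pending[wan] = ts
        elif wan in pending:
            down = pending.pop(wan)
            if ts - down <= max_gap:
                flaps.append((wan, down.time(), ts.time()))
    return flaps
```

Each returned flap is a moment when a NAT rebalance may have happened, so it is a natural trigger for an immediate reachability check. The sketch does not handle a flap that spans midnight.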
The False Positive: CrowdSec AppSec
During the first hour, nginx logs showed very convincing CrowdSec AppSec errors. The Lua bouncer had a fail-closed policy (deny on timeout), and under the scan wave load it was timing out. But these denials could only affect requests that reached nginx, and no requests were arriving because port forwarding was broken.
The decisive diagnostic:
`sudo timeout 20 tcpdump -i eno1 -n "tcp dst port 443"`
0 incoming packets during 20 seconds of testing from a phone on 4G. The server wasn’t the problem: packets weren’t even arriving.
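To make that check repeatable, the captured text can be counted rather than eyeballed. This is a rough sketch that parses the usual one-line-per-packet IPv4 output of `tcpdump -n`. The function names are illustrative, and IPv6 lines are ignored.

```python
import re

# Matches the destination part of a tcpdump -n line, e.g.
# "22:20:01.123456 IP 203.0.113.7.51514 > 192.168.1.26.443: Flags [S], ..."
DST_RE = re.compile(r" > (\d{1,3}(?:\.\d{1,3}){3})\.(\d+): ")


def count_inbound(tcpdump_text: str, server_ip: str, port: int = 443) -> int:
    """Count captured IPv4 packets whose destination is server_ip:port."""
    count = 0
    for line in tcpdump_text.splitlines():
        m = DST_RE.search(line)
        if m and m.group(1) == server_ip and int(m.group(2)) == port:
            count += 1
    return count


def verdict(packets: int) -> str:
    """Translate the packet count into the server-vs-network question."""
    if packets == 0:
        return "network path: packets never reach the server"
    return "server side: packets arrive, look at nginx and the bouncer"
```

Filtering on the server's own address matters because `tcp dst port 443` also matches the server's outbound HTTPS traffic, which would otherwise inflate the count.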
Applied Fix
Primary Fix — PR60X:
| Field | Before | After |
|---|---|---|
| WAN IP Address | WAN (generic) | WAN1 (explicit) |
| External port | 443 | 443 |
| Internal IP | 192.168.1.26 | 192.168.1.26 |
Bonus Fix — CrowdSec Hardening in /etc/crowdsec/bouncers/crowdsec-nginx-bouncer.conf:
- `APPSEC_FAILURE_ACTION=allow` (fail-open instead of fail-closed on timeout)
- `APPSEC_CONNECT_TIMEOUT=500` / `APPSEC_SEND_TIMEOUT=500` / `APPSEC_PROCESS_TIMEOUT=2000`
- `ALWAYS_SEND_TO_APPSEC=false` (targeted mode: only inspect suspicious requests)
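A small drift check can confirm that these settings survive package upgrades. The sketch below assumes the bouncer file is plain `KEY=VALUE` lines with `#` comments. The expected values are copied from this post as written and have not been checked against the bouncer's documentation.

```python
# Expected values as quoted in this post (not verified against upstream docs).
EXPECTED = {
    "APPSEC_FAILURE_ACTION": "allow",
    "APPSEC_CONNECT_TIMEOUT": "500",
    "APPSEC_SEND_TIMEOUT": "500",
    "APPSEC_PROCESS_TIMEOUT": "2000",
    "ALWAYS_SEND_TO_APPSEC": "false",
}


def parse_conf(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and full-line # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def drifted(settings: dict, expected: dict = EXPECTED) -> dict:
    """Return {key: (expected, actual)} for every setting that differs.
    A missing key shows up with an actual value of None."""
    return {
        key: (want, settings.get(key))
        for key, want in expected.items()
        if settings.get(key) != want
    }
```

Running `drifted(parse_conf(open(path).read()))` against the bouncer file would return an empty dict when nothing has changed. Trailing inline comments on a value are not stripped in this sketch.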
Lessons Learned
- Convincing server logs can be a false positive. Always validate from outside (`tcpdump` on `dst port`) that packets are actually arriving before diving into application debugging.
- Port forwarding rules on dual-WAN routers must always be attached to an explicit WAN: NAT rebalancing is a silent trap.
- A planned daily event that correlates with an intermittent outage is almost always the cause. The 23:17 / 23:28 timing should have been an earlier signal.
- Documenting the physical network architecture is as important as documenting the application stack.
Corrective Actions
Short term:
- Fix HTTPS rule to explicit WAN1 (done during incident)
- CrowdSec AppSec hardening (done during incident)
- Check all other PR60X port forwarding rules (HTTP, DNS-TLS, Plex…) and fix to explicit WAN1
- Switch PR60X WAN mode to Failover instead of Load Balancing
Medium term:
- Add a BetterStack check probing `82.xxx.xxx.xxx:443` directly (bypassing Cloudflare)
- Update `/documentation` with a “Physical Network Architecture” section
- Add PR60X WAN event monitoring (SNMP or Insight Cloud) to detect NAT rebalances in real time
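The direct-origin probe from the first medium-term item amounts to a plain TCP connect against the WAN1 address. The following is a minimal sketch using only the standard library. The function name and default timeout are illustrative, and it tests TCP reachability only, not TLS or HTTP.

```python
import socket


def origin_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    A refused connection, an unreachable network, or a timeout all count as
    "not reachable": each means the port forward is not delivering traffic.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The probe has to run from outside the LAN (a VPS or an external monitoring agent), with the WAN1 address passed as `host`. Run from inside, it would not exercise the PR60X port forwarding path at all.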
🏁 Conclusion
This roughly one-hour incident had a trivial root cause (a checkbox in the PR60X interface), but it was completely masked by misleading application logs. BetterStack detected the incident within 2 minutes, and the tcpdump capture gave a definitive verdict between a server problem and a network problem. No data loss, no impact on content.
To go further:
- 💡 Set up an hourly script checking PR60X port forwarding rule consistency to detect any slip from WAN1 to generic WAN
- 💡 Implement SNMP monitoring of PR60X WAN events to alert on NAT rebalances before they cause an outage
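The consistency check in the first idea reduces to a filter over the rule list. The sketch below shows only that comparison. How the rules are read from the PR60X is left open, and the `ForwardRule` fields are hypothetical names modelled on the fix table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardRule:
    """One port forwarding rule. Field names are hypothetical."""
    name: str
    wan: str            # "WAN" (generic), "WAN1" or "WAN2"
    external_port: int
    internal_ip: str


def rules_not_pinned(rules, required_wan: str = "WAN1"):
    """Return every rule that is not attached to the required explicit WAN."""
    return [rule for rule in rules if rule.wan != required_wan]


# The HTTPS rule as it was before the fix, taken from the fix table
https_before = ForwardRule("HTTPS", "WAN", 443, "192.168.1.26")
https_after = ForwardRule("HTTPS", "WAN1", 443, "192.168.1.26")

for rule in rules_not_pinned([https_before, https_after]):
    print(f"Rule {rule.name} (port {rule.external_port}) is bound to {rule.wan}")
```

An hourly job could alert whenever the returned list is non-empty, which covers both a rule slipping back to generic WAN and a new rule created without an explicit interface.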