Post-Mortem — Incident 522 / WAN Failover (April 8, 2026)

⚡ In short

Date: April 7-8, 2026 | Duration: ~1h (21:28 UTC → 22:31 UTC) | Severity: P1

arleo.eu was unreachable for about an hour. The root cause was not the server, nginx, or CrowdSec: it was an HTTPS port forwarding rule attached to the generic WAN interface instead of explicitly to WAN1 on the Netgear PR60X. The daily DHCP lease renewal of the 4G modem (WAN2) triggered a NAT rebalance that broke routing to port 443.

🧠 Why

The PR60X is configured with dual WAN (FTTH on WAN1, 4G backup on WAN2). On each WAN state transition, the router rebalances its NAT rules. A rule attached to the generic WAN does not consistently reattach to WAN1 after the rebalance, so incoming packets on WAN1 are no longer routed to the server.

This configuration trap had been latent since the initial PR60X setup. It had not been diagnosed earlier, in part because the daily WAN2 reboot time (23:17 local time, ISP-imposed) was not documented. The incident also revealed that the physical network architecture (Freebox bridge → PR60X dual WAN → NUC) was undocumented.

🔧 What was done

Network Architecture (Revealed by the Incident)

Internet
   ↓
[Freebox FTTH] (Bridge mode — no routing)
   ↓
[Netgear PR60X] (actual router/firewall, dual WAN)
   ├── WAN1 = FTTH Free → 82.xxx.xxx.xxx (primary)
   └── WAN2 = 4G/LTE backup → 100.96.220.53 (backup, daily reboot ~23:17)
   ↓
LAN 192.168.0.0/22
   ↓
[NUC8i3BEH] 192.168.1.26 → nginx + CrowdSec + Grav

Timeline (UTC)

| Time | Event |
|---|---|
| 21:01–21:12 | CrowdSec bans 97 IPs (95× http:scan, 2× http:exploit); scan wave from Microsoft Azure subnets |
| 21:21:28 | Last BetterStack /ping probe received by nginx (status 200) |
| 21:22–21:23 | nginx surge: active connections 2 → 34, traffic 355 KB/s; CrowdSec memory spike to 1.4 GB |
| 21:27:10 | WAN2 disconnected on PR60X (ISP-forced 4G lease renewal) |
| 21:27:14 | WAN2 reconnected; PR60X rebalances NAT rules |
| 21:28:15 | BetterStack incident triggered: HTTP 522 |
| 21:30–22:00 | Initial investigation: CrowdSec AppSec false positive (timeouts in nginx error.log) |
| 22:20–22:25 | Key discovery: strict tcpdump on dst port 443 shows zero incoming packets from outside |
| 22:28 | Timing identified: 522 crash correlates exactly with PR60X WAN2 events at 21:27 |
| 22:30 | Edit HTTPS rule on PR60X → explicitly fixed to WAN1 |
| 22:31 | Service restored |
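
The correlation spotted at 22:28 can be checked mechanically. The sketch below is only an illustration: the function name `events_before`, the helper `utc`, and the 5-minute window are assumptions, and the timestamps are copied from the timeline above. It lists the router events that occurred shortly before the alert.

```python
from datetime import datetime, timedelta

def events_before(alert, events, window=timedelta(minutes=5)):
    """Return (delay, label) pairs for events inside the window preceding the alert."""
    hits = []
    for when, label in events:
        delay = alert - when
        if timedelta(0) <= delay <= window:
            hits.append((delay, label))
    return sorted(hits)  # shortest delay first

def utc(hms):
    # Timeline entries are UTC times of day; the calendar date does not matter here.
    return datetime.strptime(hms, "%H:%M:%S")

# Values taken from the timeline above.
router_events = [
    (utc("21:27:10"), "WAN2 disconnected"),
    (utc("21:27:14"), "WAN2 reconnected"),
]
alert = utc("21:28:15")  # BetterStack incident triggered

for delay, label in events_before(alert, router_events):
    print(f"{label}: {int(delay.total_seconds())} s before the alert")
```

With these values both WAN2 events fall about a minute before the alert, which is the kind of proximity that points at the router rather than the application.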

Exact Failure Mechanism

  1. WAN2 (4G modem) receives its ISP-forced daily DHCP lease renewal (~23:17 local time)
  2. PR60X logs WAN2 disconnected then WAN2 reconnected a few seconds later
  3. On each WAN state transition, the PR60X rebalances its NAT/port forwarding rules
  4. Since the HTTPS rule was attached to generic WAN (not explicit WAN1), it does not consistently reattach to WAN1 after the rebalance
  5. Result: incoming packets on WAN1 are no longer routed to 192.168.1.26:443
  6. Cloudflare sends requests to the origin IP → timeout → HTTP 522

The False Positive: CrowdSec AppSec

During the first hour, the nginx logs showed convincing CrowdSec AppSec errors. The Lua bouncer had a fail-closed policy (deny on timeout), and under the scan-wave load it was timing out. But those denials could only affect requests that reached nginx, and none were arriving from outside because port forwarding was broken.

The decisive diagnostic:

sudo timeout 20 tcpdump -i eno1 -n "tcp dst port 443"

The capture showed 0 incoming packets in 20 seconds while the site was being requested from a phone on 4G. The server was not the problem: packets were not even reaching it.
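
The same check can be scripted on saved capture output. The sketch below is a rough illustration, not part of the incident tooling: it assumes the usual one-line-per-packet IPv4 format printed by `tcpdump -n`, the function names are made up, and the sample lines are invented (203.0.113.7 is a documentation address). It counts packets to port 443 whose source lies outside the LAN range.

```python
import ipaddress
import re

# Source and destination of a `tcpdump -n` IPv4 line, e.g.
# "21:30:02.000001 IP 203.0.113.7.51514 > 192.168.1.26.443: Flags [S], ..."
LINE = re.compile(r" IP (\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.(\d+):")

def external_packets(lines, lan="192.168.0.0/22", port=443):
    """Count captured packets to `port` whose source address is outside the LAN."""
    lan_net = ipaddress.ip_network(lan)
    count = 0
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # not an IPv4 packet line in the expected format
        src, _dst, dport = match.groups()
        if int(dport) == port and ipaddress.ip_address(src) not in lan_net:
            count += 1
    return count

def verdict(count):
    """Translate the packet count into a first-guess diagnosis."""
    if count == 0:
        return "network path: nothing from outside reaches the host"
    return "application: packets arrive, look at nginx and CrowdSec"

# Invented sample lines: one LAN client, one external client.
sample = [
    "21:30:01.000001 IP 192.168.1.40.50000 > 192.168.1.26.443: Flags [S], seq 1, length 0",
    "21:30:02.000001 IP 203.0.113.7.51514 > 192.168.1.26.443: Flags [S], seq 2, length 0",
]
print(verdict(external_packets(sample)))
```

Filtering out LAN sources matters because internal clients can still reach nginx while port forwarding is broken, which would otherwise hide the fault.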

Applied Fix

Primary Fix — PR60X:

| Field | Before | After |
|---|---|---|
| WAN IP Address | WAN (generic) | WAN1 (explicit) |
| External port | 443 | 443 |
| Internal IP | 192.168.1.26 | 192.168.1.26 |

Bonus Fix — CrowdSec Hardening in /etc/crowdsec/bouncers/crowdsec-nginx-bouncer.conf:

  • APPSEC_FAILURE_ACTION=allow (fail-open instead of fail-closed on timeout)
  • APPSEC_CONNECT_TIMEOUT=500 / APPSEC_SEND_TIMEOUT=500 / APPSEC_PROCESS_TIMEOUT=2000
  • ALWAYS_SEND_TO_APPSEC=false (requests from IPs that already have a decision are not forwarded to AppSec)
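
The difference between fail-closed and fail-open can be shown with a toy model. This is not the bouncer's code (the real bouncer is written in Lua): `appsec_decision`, `check`, and `slow_engine` are hypothetical names, and `check` merely stands in for the call to the AppSec engine.

```python
def appsec_decision(check, failure_action="allow"):
    """Return 'allow' or 'deny' for one request.

    `check` is a hypothetical callable standing in for the AppSec call: it
    returns True when the request is clean and raises TimeoutError on timeout.
    """
    try:
        return "allow" if check() else "deny"
    except TimeoutError:
        # Fail-open lets the request through; fail-closed blocks it.
        return "allow" if failure_action == "allow" else "deny"

def slow_engine():
    # Simulates an overloaded AppSec engine that never answers in time.
    raise TimeoutError("AppSec did not answer in time")

print(appsec_decision(slow_engine, failure_action="deny"))   # fail-closed behaviour
print(appsec_decision(slow_engine, failure_action="allow"))  # fail-open behaviour
```

Fail-open trades some inspection coverage under load for availability, which is the trade-off chosen here.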

Lessons Learned

  1. Convincing server logs can point at the wrong culprit. Before diving into application debugging, always confirm that packets from outside actually reach the server (tcpdump filtered on the destination port while testing from an external connection).
  2. Port forwarding rules on dual-WAN routers must always be attached to an explicit WAN — NAT rebalancing is a silent trap.
  3. A scheduled daily event that lines up with an outage is the first suspect. The proximity between the ~23:17 local WAN2 renewal and the 23:28 local (21:28 UTC) alert should have been an earlier signal.
  4. Documenting the physical network architecture is as important as documenting the application stack.

Corrective Actions

Short term:

  • Fix HTTPS rule to explicit WAN1 (done during incident)
  • CrowdSec AppSec hardening (done during incident)
  • Check all other PR60X port forwarding rules (HTTP, DNS-TLS, Plex…) and fix to explicit WAN1
  • Switch PR60X WAN mode to Failover instead of Load Balancing

Medium term:

  • Add a BetterStack check probing 82.xxx.xxx.xxx:443 directly (bypassing Cloudflare)
  • Update /documentation with a “Physical Network Architecture” section
  • Add PR60X WAN event monitoring (SNMP or Insight Cloud) to detect NAT rebalances in real time
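
A direct-origin check can also be prototyped locally before configuring it in BetterStack. The sketch below makes several assumptions: it tests plain TCP reachability only (no TLS or HTTP), the names `tcp_reachable` and `locate_fault` are illustrative, and the origin address is left as the redacted placeholder from this document.

```python
import socket

def tcp_reachable(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, and resolution errors
        return False

def locate_fault(cdn_ok, origin_ok):
    """Combine the Cloudflare-fronted check and the direct check into a first guess."""
    if not origin_ok:
        return "origin unreachable: check WAN and port forwarding first"
    if not cdn_ok:
        return "origin reachable: look at Cloudflare or the application"
    return "healthy"

# Hypothetical usage from a machine outside the LAN; replace the placeholder
# with the real WAN1 address:
#   origin_ok = tcp_reachable("82.xxx.xxx.xxx", 443)
#   print(locate_fault(cdn_ok=False, origin_ok=origin_ok))
```

The probe has to run from outside the LAN; from inside, the result depends on how the router handles hairpin NAT and would not reflect what Cloudflare sees.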

🏁 Conclusion

This incident, about an hour long, had a trivial root cause (a checkbox in the PR60X interface), but it was completely masked by misleading application logs. BetterStack detected the incident within 2 minutes, and the tcpdump capture settled the question of server problem versus network problem. No data loss, no impact on content.

To go further:

  • 💡 Set up an hourly script checking PR60X port forwarding rule consistency to detect any slip from WAN1 to generic WAN
  • 💡 Implement SNMP monitoring of PR60X WAN events to alert on NAT rebalances before they cause an outage
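
The rule-consistency check could start as small as the sketch below. It assumes the forwarding rules have already been obtained as a list of dictionaries; that format, the field names, and the sample rules are all hypothetical, and fetching the rules from the PR60X is deliberately left out because it depends on whatever export the router or Insight offers.

```python
def rules_off_wan1(rules, expected="WAN1"):
    """Return the names of forwarding rules that are not pinned to `expected`."""
    return [rule["name"] for rule in rules if rule.get("wan") != expected]

# Hypothetical rule export; the real field names and values may differ.
rules = [
    {"name": "HTTPS", "wan": "WAN1", "external_port": 443},
    {"name": "HTTP", "wan": "WAN", "external_port": 80},
]

drifted = rules_off_wan1(rules)
if drifted:
    print("Rules not attached to WAN1:", ", ".join(drifted))
```

Run from cron every hour, a non-empty result would be the signal to alert before the next WAN2 renewal turns the drift into an outage.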