Post-mortem: 3 MCP timeouts — IPAddressDeny + Cloudflare + NFS

Context
I deployed a Hugo MCP Server (FastAPI, 7 tools) that lets me edit arleo.eu from Claude.ai. Architecture: claude.ai → mcp-oauth-proxy NUC → hugo-mcp-proxy NUC → MCP server VM.
For 2 weeks, create_page and update_page calls regularly timed out from the Claude.ai side. Not systematically — about 1 in 3 calls. Hard to reproduce, harder to correlate.
Here’s the post-mortem in 3 acts.
Act 1 — The fake culprit: OAuth expiry
Symptom: create_page returns 401 Unauthorized after ~10 min idle.
Initial hypothesis: OAuth token expires and isn’t refreshed.
Investigation: mcp-oauth-proxy logs on the NUC:
2026-05-03 14:23:11 INFO Token validated for client xxxxx
2026-05-03 14:33:42 INFO Token validated for client xxxxx
2026-05-03 14:44:18 WARN Token expired, returning 401
Expiry is real. Fix: extend token lifetime to 1h, implement refresh in mcp-oauth-proxy.
Test: 30 min idle, then create_page → 200ms response.
Except the original symptom wasn’t a 401, it was a timeout. I hadn’t looked carefully enough.
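For the refresh half of that fix, here is a minimal sketch of a proactive expiry check. It assumes the proxy stores the token's expiry as a Unix timestamp. The function name, the 60-second margin, and the timestamp format are hypothetical; the actual mcp-oauth-proxy code is not shown in this post.

```python
import time
from typing import Optional

# Hypothetical margin: refresh this many seconds before the real expiry.
REFRESH_MARGIN_S = 60.0


def needs_refresh(expires_at: float, now: Optional[float] = None,
                  margin_s: float = REFRESH_MARGIN_S) -> bool:
    """Return True once the token is within margin_s seconds of expiring."""
    if now is None:
        now = time.time()
    return now >= expires_at - margin_s
```

Checking this before each proxied call, rather than waiting for a 401, avoids handing an expired token to the first request after an idle period.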
Act 2 — The real culprit: IPAddressDeny=any
Persistent symptom: after the OAuth fix, create_page is still slow. Not a 401, not a 500. Just 30 seconds of silence, then timeout from Claude.ai.
Hypothesis: the Hugo MCP VM service makes an external call (Cloudflare cache purge) that’s blocked.
Investigation: I had hardened the hugo-mcp.service the day before with IPAddressDeny=any + IPAddressAllow=127.0.0.0/8. I just wanted to allow loopback.
The problem: the create_page tool calls requests.post('https://api.cloudflare.com/client/v4/zones/.../purge_cache') at the end, and Cloudflare is NOT in the whitelist. Under IPAddressDeny, the connect() syscall doesn’t fail instantly: the kernel silently drops the outgoing packet, so the call hangs until a timeout fires, 30 seconds here. (requests applies no timeout of its own; it only gives up when a timeout= argument is passed.)
Fix:
[Service]
IPAddressDeny=any
IPAddressAllow=127.0.0.0/8
IPAddressAllow=192.168.122.1/32
IPAddressAllow=104.16.0.0/12
IPAddressAllow=2606:4700::/32
Test: create_page → 800ms.
Except sometimes it’s still slow. Not systematically.
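A silent drop like the one in this act is easier to spot when every outbound connection has an explicit, short timeout. Here is a minimal sketch using only the standard library. The probe function and its 3-second default are assumptions for illustration, not part of the MCP server.

```python
import socket


def probe(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Attempt a TCP connect and classify the outcome instead of hanging."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except socket.timeout:
        # No answer within timeout_s: typical of packets being dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered, but nothing listens on that port.
        return "refused"
    except OSError as exc:
        # DNS failure, unreachable network, permission error, etc.
        return f"error: {exc}"
```

Run from inside the hardened unit, probe("api.cloudflare.com", 443) should come back as "timeout" after 3 seconds instead of stalling the whole tool call, if packets to that destination are dropped as described above.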
Act 3 — The final culprit: timing instrumentation as the revealer
Persistent symptom: create_page takes either 800ms or 12s, with no visible pattern.
I added timing instrumentation to the MCP: each tool call logs durations of each sub-operation.
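import logging
import subprocess
import time

log = logging.getLogger(__name__)

def create_page(...):
    t0 = time.time()
    write_md_file(...)
    t1 = time.time()
    subprocess.run(["hugo", "--minify"])
    t2 = time.time()
    purge_cloudflare(...)
    t3 = time.time()
    # time.time() returns seconds, so convert each delta to milliseconds
    log.info(f"create_page write={(t1-t0)*1000:.0f}ms build={(t2-t1)*1000:.0f}ms purge={(t3-t2)*1000:.0f}ms")
Logs after instrumentation: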
create_page write=12ms build=2400ms purge=380ms
create_page write=14ms build=2350ms purge=380ms
create_page write=11ms build=11800ms purge=380ms ← SLOW
create_page write=13ms build=2410ms purge=380ms
The culprit isn’t purge_cloudflare. It’s hugo --minify, which sometimes takes 12 seconds instead of 2.4s.
Hugo investigation: Hugo reads all files in content/, themes/, static/. If one is on QNAP NFS and QNAP is doing its daily backup, NFS latencies explode.
I checked: the Hugo VM mounts static/images/ (page bundles) from the QNAP over NFS. When the QNAP runs its daily backup (6am at home), stat() on each file takes up to ~500ms instead of ~1ms. Across ~30 files, that adds roughly 10 to 15 seconds, which matches a build going from 2.4s to ~12s.
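The per-file stat() cost can be measured directly rather than inferred. Here is a minimal sketch that walks a directory and times one os.stat() per file. The function names are hypothetical, and NFS attribute caching can hide part of the latency, so the numbers are only indicative.

```python
import os
import time
from typing import Dict


def stat_latencies_ms(root: str) -> Dict[str, float]:
    """Time one os.stat() call per file under root, in milliseconds."""
    latencies: Dict[str, float] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            start = time.perf_counter()
            os.stat(path)
            latencies[path] = (time.perf_counter() - start) * 1000
    return latencies


def summarize(latencies: Dict[str, float]) -> str:
    """One-line summary: file count, total stat time, and the slowest file."""
    if not latencies:
        return "no files found"
    total = sum(latencies.values())
    worst = max(latencies, key=lambda p: latencies[p])
    return (f"{len(latencies)} files, total={total:.1f}ms, "
            f"worst={latencies[worst]:.1f}ms ({worst})")
```

Running summarize(stat_latencies_ms("static/images")) once during the backup window and once outside it would show whether the total moves from tens of milliseconds to several seconds.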
Permanent fix:
- Migrate static/images/ to local VM disk (no more NFS for Hugo)
- Add a BetterStack alert if the QNAP backup runs longer than 30 minutes (which would indicate a real issue, not just the daily run)
Result: create_page consistently at 800ms ± 100ms.
Lessons learned
1. Timing instrumentation is the #1 diagnostic tool
Without log.info(timing=...) in each sub-operation, I would have looked for hours in Cloudflare and OAuth logs. Once durations were separated, the culprit appeared in 30 seconds of log reading.
Rule: any MCP tool exceeding 500ms in p99 must have time.time() around each sub-operation.
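To keep that rule cheap to apply, the timing can live in a small context manager instead of hand-placed timestamps. This is a sketch, not the code running in the MCP server: timed and format_timings are hypothetical names, and it uses time.perf_counter() rather than time.time() because it is monotonic.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(durations_ms: Dict[str, float], name: str) -> Iterator[None]:
    """Store the duration of the with-block, in ms, under durations_ms[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed calls are timed too.
        durations_ms[name] = (time.perf_counter() - start) * 1000


def format_timings(tool: str, durations_ms: Dict[str, float]) -> str:
    """Render timings in the same 'name=123ms' style as the logs above."""
    parts = " ".join(f"{k}={v:.0f}ms" for k, v in durations_ms.items())
    return f"{tool} {parts}"


if __name__ == "__main__":
    durations: Dict[str, float] = {}
    with timed(durations, "write"):
        time.sleep(0.01)  # stand-in for the real sub-operation
    print(format_timings("create_page", durations))
```

Each sub-operation then costs one with line, and the dictionary can be logged or shipped to a dashboard as-is.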
2. IPAddressDeny=any is a double-edged sword
Excellent for security (reduces network attack surface), but tricky: any non-whitelisted external call becomes a silent timeout. Always audit ALL external calls the service makes before enabling this directive.
Audit list for a Python service:
- External APIs (Cloudflare, GitHub, AbuseIPDB…)
- DNS lookups (socket.gethostbyname)
- Monitoring webhooks (BetterStack heartbeat)
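Part of that audit can be automated by checking destination addresses against the unit's allowlist. Here is a minimal sketch with the standard ipaddress module. The ALLOWED list is copied from the unit file above, and is_allowed and blocked are hypothetical helper names.

```python
import ipaddress
from typing import Iterable, List

# Allowlist copied from the IPAddressAllow= lines of the unit file.
ALLOWED = ["127.0.0.0/8", "192.168.122.1/32", "104.16.0.0/12", "2606:4700::/32"]


def is_allowed(ip: str, allowlist: Iterable[str] = ALLOWED) -> bool:
    """Return True if ip falls inside at least one allowlisted network."""
    addr = ipaddress.ip_address(ip)
    for cidr in allowlist:
        net = ipaddress.ip_network(cidr)
        # Compare only within the same family (IPv4 vs IPv6).
        if addr.version == net.version and addr in net:
            return True
    return False


def blocked(ips: Iterable[str], allowlist: Iterable[str] = ALLOWED) -> List[str]:
    """Return the addresses the unit would not be able to reach."""
    allow = list(allowlist)
    return [ip for ip in ips if not is_allowed(ip, allow)]
```

Feeding it the addresses returned by socket.getaddrinfo() for each external hostname gives a quick list of calls that would hang. Resolved addresses change over time, so the check is a snapshot rather than a guarantee.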
3. NFS is not a local filesystem for CLI tools
Hugo, Git, and most CLI tools do repeated stat() calls. NFS introduces ~1ms minimum latency per syscall. Invisible normally, catastrophic during a backup or saturated share.
Rule: anything read or written during a critical operation (build, deploy, hot path) must be on local disk. NFS is for archives, rarely consulted media, backups.
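That rule can be checked before a build by looking up which filesystem a path lives on. The sketch below is Linux-only and parses text in the /proc/mounts format. The helper names and the example paths are hypothetical, and mount points containing spaces (escaped as \040 in /proc/mounts) are not handled.

```python
import os
from typing import Optional


def fs_type_for(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the longest mount point containing path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Trailing slash avoids matching /data against /database.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_on_nfs(path: str, mounts_text: str) -> bool:
    """True for nfs, nfs4 and similar filesystem types."""
    fs_type = fs_type_for(path, mounts_text)
    return fs_type is not None and fs_type.startswith("nfs")
```

In practice mounts_text would come from open("/proc/mounts").read(), and a build script could refuse to start, or at least log a warning, if any of content/, themes/ or static/ reports an NFS type.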
4. Hypothesis cascading wastes time
I spent ~1h on OAuth (which was a real bug, but not THE bug). Then ~2h on IPAddressDeny (same). Then ~30min on NFS.
If I’d started by instrumenting durations BEFORE forming hypotheses, NFS would have jumped out on the first log read.
Rule: before searching for the cause, measure where time is spent. Always.
Conclusion
3 distinct bugs, 3 different fixes. None would have been sufficient alone. Final validation test: 24h of operation with create_page consistent at 800ms, regardless of time of day or QNAP load.
The timing instrumentation stayed in prod. It now powers a BetterStack dashboard showing p50/p95/p99 per tool. Current p99: 1.2s, including Hugo rebuild. Acceptable for interactive editing.
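For reference, this is what those percentiles mean when computed from raw durations. It is a sketch using the nearest-rank definition, not a description of how BetterStack computes them. The percentile function is a hypothetical helper.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    durations = [790.0, 800.0, 820.0, 12000.0]
    for p in (50, 95, 99):
        print(f"p{p}={percentile(durations, p):.0f}ms")
```

With only a handful of samples, a single slow build dominates p95 and p99, which is why the tail percentiles surface the kind of intermittent slowdown described in Act 3 while p50 stays flat.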