Post-mortem: 3 MCP timeouts — IPAddressDeny + Cloudflare + NFS

2026-05-09

Context

I deployed a Hugo MCP Server (FastAPI, 7 tools) that lets me edit arleo.eu from Claude.ai. Architecture: claude.ai → mcp-oauth-proxy NUC → hugo-mcp-proxy NUC → MCP server VM.

For 2 weeks, create_page and update_page calls regularly timed out from the Claude.ai side. Not every time: about 1 in 3 calls. Hard to reproduce, harder to correlate.

Here’s the post-mortem in 3 acts.

Act 1 — The fake culprit: OAuth expiry

Symptom: create_page returns 401 Unauthorized after ~10 min idle.

Initial hypothesis: OAuth token expires and isn’t refreshed.

Investigation: mcp-oauth-proxy logs on the NUC:

2026-05-03 14:23:11 INFO  Token validated for client xxxxx
2026-05-03 14:33:42 INFO  Token validated for client xxxxx
2026-05-03 14:44:18 WARN  Token expired, returning 401

Expiry is real. Fix: extend token lifetime to 1h, implement refresh in mcp-oauth-proxy.
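For illustration, here is a minimal sketch of what "implement refresh" can look like: renew the token shortly before it expires instead of waiting for a 401. The class name, the fetch callback, and the 60-second margin are assumptions made for this example, not the actual mcp-oauth-proxy code.

```python
import time
from typing import Callable, Optional, Tuple


class RefreshingToken:
    """Caches a token and renews it shortly before expiry (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # returns (token, lifetime in seconds)
        margin_s: float = 60.0,                  # assumed safety margin
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._margin_s = margin_s
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refresh when there is no token yet or when it is about to expire.
        if self._value is None or now >= self._expires_at - self._margin_s:
            value, lifetime_s = self._fetch()
            self._value = value
            self._expires_at = now + lifetime_s
        return self._value
```

Injecting the clock keeps the expiry logic testable without waiting an hour.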

Test: 30 min idle, then create_page → 200ms response.

Except the original symptom wasn’t a 401, it was a timeout. I hadn’t looked carefully enough.

Act 2 — The real culprit: IPAddressDeny=any

Persistent symptom: after the OAuth fix, create_page is still slow. Not a 401, not a 500. Just 30 seconds of silence, then timeout from Claude.ai.

Hypothesis: the Hugo MCP VM service makes an external call (Cloudflare cache purge) that’s blocked.

Investigation: I had hardened the hugo-mcp.service the day before with IPAddressDeny=any + IPAddressAllow=127.0.0.0/8. I just wanted to allow loopback.

The problem: the create_page tool calls requests.post('https://api.cloudflare.com/client/v4/zones/.../purge_cache') at the end. Cloudflare is NOT in the allow-list. Under IPAddressDeny the connect() syscall doesn’t fail instantly: the kernel silently drops the packet, and because requests applies no timeout unless one is passed explicitly, the call just hangs until Claude.ai gives up after about 30 seconds.
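Whatever the firewall policy, the hang itself can be bounded. Below is a minimal sketch of a purge call with an explicit timeout, so that a dropped packet surfaces as an exception within seconds. It uses urllib from the standard library to stay dependency-free; with requests, the equivalent is passing timeout= to requests.post. The function name, the 5-second value, and the injectable opener are assumptions for this example.

```python
import json
import socket
import urllib.error
import urllib.request


def purge_cache(url, token, payload, timeout_s=5.0, opener=urllib.request.urlopen):
    """POST a purge request and fail fast instead of hanging (hypothetical helper)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        # The timeout covers the connect phase, where a silently dropped SYN would stall.
        with opener(req, timeout=timeout_s) as resp:
            return resp.status
    except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
        raise RuntimeError(f"purge failed within {timeout_s}s budget: {exc}") from exc
```

With a bound like this, a blocked egress shows up in the logs as an explicit error on the purge step rather than as silence.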

Fix:

[Service]
IPAddressDeny=any
IPAddressAllow=127.0.0.0/8
IPAddressAllow=192.168.122.1/32
IPAddressAllow=104.16.0.0/12
IPAddressAllow=2606:4700::/32

Test: create_page → 800ms.

Except sometimes it’s still slow. Not every time.

Act 3 — The final culprit: timing instrumentation as the revealer

Persistent symptom: create_page takes either 800ms or 12s, with no visible pattern.

I added timing instrumentation to the MCP: each tool call logs durations of each sub-operation.

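import logging
import subprocess
import time

log = logging.getLogger(__name__)

def create_page(...):  # arguments elided, as in the real tool
    t0 = time.time()
    write_md_file(...)
    t1 = time.time()

    subprocess.run(["hugo", "--minify"])
    t2 = time.time()

    purge_cloudflare(...)
    t3 = time.time()

    # time.time() returns seconds: convert to milliseconds before logging
    write_ms, build_ms, purge_ms = ((b - a) * 1000 for a, b in ((t0, t1), (t1, t2), (t2, t3)))
    log.info(f"create_page write={write_ms:.0f}ms build={build_ms:.0f}ms purge={purge_ms:.0f}ms")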

Logs after instrumentation:

create_page write=12ms  build=2400ms   purge=380ms
create_page write=14ms  build=2350ms   purge=380ms
create_page write=11ms  build=11800ms  purge=380ms   ← SLOW
create_page write=13ms  build=2410ms   purge=380ms

The culprit isn’t purge_cloudflare. It’s hugo --minify taking 12 seconds instead of 2.4s sometimes.

Hugo investigation: Hugo reads all files in content/, themes/, static/. If one is on QNAP NFS and QNAP is doing its daily backup, NFS latencies explode.

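I checked: the Hugo VM mounts static/images/ (page bundles) from the QNAP over NFS. When the QNAP runs its daily backup (6am at home), stat() takes ~500ms per file instead of ~1ms. Across ~30 files that is up to ~15 seconds of added latency, the same order of magnitude as the ~9.4s gap between the slow build (11.8s) and a normal one (2.4s).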
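A quick way to test this kind of hypothesis is to time stat() directly on the suspect directory, once during the backup window and once outside it. This is a minimal sketch, not the exact check I ran; the function name and the 100-file limit are assumptions. Note that the NFS attribute cache can hide latency on repeated runs, so the first pass is the meaningful one.

```python
import os
import time
from pathlib import Path


def stat_latencies_ms(root, limit=100):
    """Time os.stat() on up to `limit` entries under root; slowest first."""
    samples = []
    for i, path in enumerate(Path(root).rglob("*")):
        if i >= limit:
            break
        start = time.perf_counter()
        try:
            os.stat(path)
        except OSError:
            continue  # entry vanished or is unreadable: skip it
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        samples.append((elapsed_ms, str(path)))
    return sorted(samples, reverse=True)


if __name__ == "__main__":
    # Print the five slowest entries for the directory discussed above.
    for ms, path in stat_latencies_ms("static/images")[:5]:
        print(f"{ms:8.2f} ms  {path}")
```

On local disk these numbers should sit well under a millisecond; values in the hundreds of milliseconds point at the mount, not at Hugo.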

Permanent fix:

Migrate static/images/ to local VM disk (no more NFS for Hugo)
Add a BetterStack alert if the QNAP backup runs longer than 30 minutes (which would indicate a real problem, not just the routine daily run)

Result: create_page consistently at 800ms ± 100ms.

Lessons learned

1. Timing instrumentation is the #1 diagnostic tool

Without log.info(timing=...) in each sub-operation, I would have looked for hours in Cloudflare and OAuth logs. Once durations were separated, the culprit appeared in 30 seconds of log reading.

Rule: any MCP tool exceeding 500ms in p99 must have time.time() around each sub-operation.
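To keep that rule cheap to apply, the timing can live in a small context manager instead of hand-written t0/t1/t2 variables. A minimal sketch follows; the StepTimer name and its API are assumptions for this example, and it uses time.perf_counter(), which is better suited to measuring durations than wall-clock time.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("mcp.timing")


class StepTimer:
    """Collects per-step durations for one tool call (hypothetical helper)."""

    def __init__(self, name):
        self.name = name
        self.steps = {}  # label -> duration in milliseconds

    @contextmanager
    def step(self, label):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the step raises, so failures are timed too.
            self.steps[label] = (time.perf_counter() - start) * 1000.0

    def summary(self):
        parts = " ".join(f"{label}={ms:.0f}ms" for label, ms in self.steps.items())
        return f"{self.name} {parts}"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    timer = StepTimer("create_page")
    with timer.step("write"):
        time.sleep(0.01)  # stand-in for the real sub-operation
    with timer.step("build"):
        time.sleep(0.02)
    log.info(timer.summary())
```

The summary line keeps the same key=value shape as the logs above, so existing log parsing keeps working.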

2. `IPAddressDeny=any` is a double-edged sword

Excellent for security (reduces network attack surface), but tricky: any non-whitelisted external call becomes a silent timeout. Always audit ALL external calls the service makes before enabling this directive.

Audit list for a Python service:

External APIs (Cloudflare, GitHub, AbuseIPDB…)
DNS lookups (socket.gethostbyname)
Monitoring webhooks (BetterStack heartbeat)
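Part of that audit can be scripted. The sketch below checks whether the addresses a hostname resolves to fall inside the allow-list from the unit file above. The function names are assumptions for this example, and the resolution step has to run outside the sandboxed service (otherwise DNS itself may be blocked). CDN addresses also change over time, so this is a point-in-time check, not a guarantee.

```python
import ipaddress
import socket

# Same ranges as the IPAddressAllow= lines in the unit file above.
ALLOW = ["127.0.0.0/8", "192.168.122.1/32", "104.16.0.0/12", "2606:4700::/32"]


def is_allowed(addr, allow=ALLOW):
    """Return True if addr falls inside at least one allowed network."""
    ip = ipaddress.ip_address(addr)
    for cidr in allow:
        net = ipaddress.ip_network(cidr)
        if ip.version == net.version and ip in net:
            return True
    return False


def audit_host(host, port=443, allow=ALLOW):
    """Resolve host and report, per resolved address, whether it would pass."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    return {addr: is_allowed(addr, allow) for addr in addresses}


if __name__ == "__main__":
    for addr, ok in audit_host("api.cloudflare.com").items():
        print(f"{addr:40s} {'allowed' if ok else 'BLOCKED'}")
```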

3. NFS is not a local filesystem for CLI tools

Hugo, Git, and most CLI tools do repeated stat() calls. NFS introduces ~1ms minimum latency per syscall. Invisible normally, catastrophic during a backup or saturated share.

Rule: anything read or written during a critical operation (build, deploy, hot path) must be on local disk. NFS is for archives, rarely consulted media, backups.

4. Hypothesis cascading wastes time

I spent ~1h on OAuth (which was a real bug, but not THE bug). Then ~2h on IPAddressDeny (same). Then ~30min on NFS.

If I’d started by instrumenting durations BEFORE forming hypotheses, NFS would have jumped out on the first log read.

Rule: before searching for the cause, measure where time is spent. Always.

Conclusion

3 distinct bugs, 3 different fixes. None would have been sufficient alone. Final validation test: 24h of operation with create_page consistent at 800ms, regardless of time of day or QNAP load.

The timing instrumentation stayed in prod. It now powers a BetterStack dashboard showing p50/p95/p99 per tool. Current p99: 1.2s, including Hugo rebuild. Acceptable for interactive editing.
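For reference, the percentiles themselves are simple to derive from the logged durations. Here is a minimal sketch using the standard library; the function name is an assumption for this example, and the dashboard does its own aggregation, so this only shows the idea.

```python
import statistics


def percentiles(durations_ms):
    """Return p50/p95/p99 for a list of durations (needs at least 2 samples)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    # Build durations taken from the log excerpt earlier in this post.
    builds = [2400, 2350, 11800, 2410]
    print(percentiles(builds))
```

With so few samples a single slow build dominates the upper percentiles, which is exactly why p99 is the number to watch for intermittent problems like this one.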