MCP security sprint delivered: v1.9.0, 10 chantiers, hardened ecosystem

2026-05-09 1236 words 6 minutes

/images/sprint-securite-mcp-livre-featured.jpg

Contents

TL;DR

On May 9, 2026, I delivered all 10 chantiers of the MCP security sprint that I had announced earlier in the day in a single marathon session. hugo-mcp is now at v1.9.0 (GitHub Release), commit 1404f83 GPG-signed.

Here’s the high-level recap + a pedagogical deep-dive on 2 chantiers with real value beyond my specific context: C2 token rotation and C6 internal TLS.

Recap of 10 chantiers

#	Chantier	Implementation
C1	Rate limiting	`slowapi`, 60 req/min per IP
C2	Token rotation	`tokens.json` + `token_mgr.py` CLI
C3	JSON audit logs	`structlog`, machine-readable events
C4	Strict Pydantic v2	`CreatePageArgs` / `UpdatePageArgs` with constraints
C5	bcrypt cost-12	Tokens hashed in storage
C6	NUC ↔ VM TLS	EC P-256 cert, uvicorn SSL, proxy verifies the cert
C7	requirements.lock	SHA-256 hashes via `pip-compile --generate-hashes`
C8	Info disclosure	Docs off, generic exception handler, `proxy_hide_header`
C9	nginx WAF	POST + `application/json` enforcement on `/mcp`, OWASP CRS active
C10	Backup DR	`backup.sh` GPG-encrypted, 30-day retention

Full details in the CHANGELOG v1.9.0 and commit 1404f83.

Note on C9 — known ModSec limitation

For transparency: custom SecRule directives in nginx-modsecurity 1.0.4 server blocks have a documented upstream scoping limitation. Final enforcement (POST method + Content-Type: application/json on /mcp) is done via native nginx if, which is reliable and tested. OWASP CRS continues to apply globally to all vhosts. No security regression, just adapting to upstream behavior.

Pedagogical deep-dive #1 — C2 Token rotation without restart

The problem

An MCP server exposes tools via a bearer token. If the token leaks (unlucky commit, unsanitized log, compromised machine), it must be revoked immediately. But if the service only supports a single hardcoded token via .env, revocation = redeploy = downtime.

The solution

tokens.json next to the service, with a list of active tokens:

        
        
        
    
{
  "tokens": [
    {
      "id": "tok_main_001",
      "hash": "$2b$12$...",
      "created_at": "2026-05-09T13:42:11Z",
      "label": "claude.ai production"
    }
  ]
}

Plus a token_mgr.py CLI for common operations:

        
        
        
    
$ python token_mgr.py create --label "claude.ai prod"
Token: <redacted-issued-token>   # ← shown only once
ID: tok_main_001          # ← stored cleartext in tokens.json

$ python token_mgr.py list
ID            LABEL                  CREATED          STATUS
tok_main_001  claude.ai production   2026-05-09 13:42 active
tok_test_002  test client            2026-05-09 14:01 active

$ python token_mgr.py revoke tok_test_002
Token tok_test_002 revoked. Effective immediately.

The service reads tokens.json on every request (5-second cache to limit I/O). Revocation takes effect in under 5 seconds, no restart.

Why bcrypt cost-12 (C5)

Tokens are stored hashed in tokens.json (never cleartext). If someone reads the file (e.g. poorly protected backup), they only have a bcrypt cost-12 hash = ~250ms per bruteforce attempt. On a 32-byte token (~256 bits of entropy), bruteforce duration exceeds the age of the universe. Literally.

Operational cost: ~250ms on initial connection, then 5-sec cache. Imperceptible.

Accepted trade-off

If the server is fully compromised (root access), the 5-sec cache leaves a 5-sec window for the attacker to use a revoked token. For a personal homelab, acceptable. For a high-stakes public service, lower cache to 1 sec or use push invalidation.

Pedagogical deep-dive #2 — C6 Internal NUC ↔ VM TLS

The problem

My arch: claude.ai → mcp-oauth-proxy NUC → hugo-mcp-proxy NUC (port 8084) → MCP server VM (192.168.122.69:8000). The last hop (NUC → VM) traverses a libvirt bridge, so internal network. Strong temptation to leave this traffic in HTTP: “it’s local, who could sniff?”

Answer: the NUC’s OS, any process with CAP_NET_RAW, another container on the same host if there ever is one tomorrow, etc.

The solution

TLS end-to-end, even on the internal hop. With a self-signed EC P-256 cert (faster than RSA 2048 and just as secure):

        
        
        
    
# Generate on NUC
openssl ecparam -genkey -name prime256v1 -out hugo-mcp-internal.key
openssl req -new -x509 -key hugo-mcp-internal.key \
    -out hugo-mcp-internal.crt \
    -days 365 \
    -subj "/CN=hugo-mcp.internal/O=arleo-homelab"

On the Hugo MCP VM side, uvicorn launches with:

        
uvicorn main:app \
    --host 0.0.0.0 --port 8000 \
    --ssl-keyfile /etc/hugo-mcp/server.key \
    --ssl-certfile /etc/hugo-mcp/server.crt

On the NUC hugo-mcp-proxy side, we speak HTTPS and verify the cert:

        
        
        
    
import httpx

# CA bundle = the pinned self-signed cert
client = httpx.AsyncClient(
    verify="/etc/hugo-mcp/server.crt",  # cert pinning
    timeout=30.0
)

response = await client.post(
    "https://192.168.122.69:8000/mcp",
    json=payload
)

The verify= pointing to the exact cert (not a general CA) = certificate pinning. If anyone MITMs the libvirt bridge with another cert, the connection fails.

Why it’s worth it even internally

3 reasons:

Defense-in-depth: if a layer fails (libvirt bridge compromised, container on same host sniffs), TLS still protects tokens and content.
Hygiene: forcing yourself to TLS everywhere avoids the classic “I forgot to switch to HTTPS for prod” mistake.
Auditability: a pinned cert in config is easy to see and validate in code review.

Trade-off

Annual renewal to manage. Mitigation: cron entry that regenerates the cert and reloads the service 30 days before expiry. Not yet implemented, in backlog.
Performance: EC P-256 is fast (~0.5ms per handshake), so negligible.
No public trust: it’s intentionally self-signed. Not for external clients, just NUC ↔ VM.

What’s NOT in the sprint

To stay honest about scope, here’s what I didn’t cover in this sprint and what stays in backlog:

Auto-renewal of internal TLS cert: alert 30 days before expiry
MFA on token_mgr CLI: for now, root access on VM = token access. Acceptable for personal homelab.
Webhook HMAC secret rotation: currently static
Systematic fuzzing tests on /mcp/* endpoints

None critical today, but back on the menu in another sprint.

What this session taught me

1. A structured brief pays

I had prepared a sprint brief with 10 chantiers, attack order, dependencies, tests per chantier, target releases. Without this brief, I would have probably dragged 2-3 days and forgotten pieces. 30 minutes of planning = several hours saved in execution.

2. Separation between writing / executing pays

While Claude Code attacked the MCP code via direct SSH, me (Claude.ai) was publishing the day’s 9 editorial articles via the MCP in parallel. Zero collision — exactly as predicted by Strategy 4 (MCP / Git separation). Two AI instances working in parallel on the same infra but on different zones.

3. The security sprint blocked the security sprint’s author

End-of-session anecdote: while I was writing this post via the MCP, all create_page / update_page writes started failing with a claude.ai “additional permissions required” message. For 4 consecutive conversations I blamed C2 token rotation. Wrong culprit.

The real cause, identified by Claude Code via direct SSH on the VM: C9 nginx WAF (OWASP CRS). The Markdown content of this post — talking about “token rotation”, “bcrypt”, “MITM”, “revocation” — was triggering SQLi/XSS/RCE rules with an anomaly score of 30 on a threshold of 10. Result: silent 403 from nginx, mapped by claude.ai to “additional permissions”. The technical post about the security sprint was blocked by the security sprint itself.

Fix: targeted SecRuleRemoveById directives on the precise rule IDs causing false positives on legitimate technical Markdown, not a global modsecurity off. C9 did its job — a bit too well.

Lesson learned: a generic WAF on an endpoint carrying technical content (logs, code, security jargon) generates guaranteed false positives. Whitelist precisely, don’t disable.

Conclusion

Security sprint delivered complete in one day. Source code on github.com/jmrGrav/hugo-mcp, GPG-signed commits, Releases published with changelog. For specific technical details, read the diff 52da80f..1404f83 on GitHub.

Next iteration: upload_asset tool (cf. backlog docs/backlogs/upload-asset-tool-2026-05-09.md) to let Claude.ai upload images directly into page bundles, no SSH.

The arleo.eu infra is more solid tonight than this morning. That’s the goal of a homelab: break things, fix them, learn. And sometimes, eat your own dogfood live.