Self-Healing Egress Platform
From vendor negotiation to a 28-backend egress platform that recovers on its own.
Problem
Collection at scale needed reliable, geographically diverse egress IPs. External proxy spend was rising, and the existing system had no operational history — there was no way to answer who was on which IP at a given moment.
System
Three-stage evolution, sole owner throughout: (1) market research, vendor selection, and contract negotiation for external proxies; (2) a cost-driven shift to rotating an existing VPN pool, prototyped in docker-compose and moved to Kubernetes as a multi-deployment egress platform; (3) a Postgres event ledger plus a drain-rotate cron and a synthetic-CONNECT health checker that automatically quarantine and restore backends.
Impact
Now 28 backends (26 global + 2 Korea) behind HAProxy with consistent source-hash balancing. Nightly and weekend single-backend failures auto-recover without human intervention. A daily Slack health report makes overnight events visible at the start of each morning.
Architecture notes
The same architecture pattern shows up on the public lab at https://sungjukim.com/lab: a scheduled agent runs upstream, writes a static JSON artifact, and the consuming surface only ever reads that artifact. No LLM call is on the user's request path. Failures in the upstream collector or LLM stage cannot break the consuming surface — they only delay the next refresh.
The drain-rotate algorithm uses HAProxy’s admin port to gracefully drain active connections from a backend, polls the stats CSV until the connection count reaches zero, then patches the SERVER_HOSTNAMES config and triggers a kubectl rollout restart with strategy: Recreate. Once the new pod is up, it probes the public IP through the backend and writes the result to a Postgres event ledger.
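The drain step above, as an added example that is not the platform's own code: it puts one server into HAProxy's DRAIN state over the admin socket, parses the "show stat" CSV until that server's current-session count (scur) is zero, swaps the hostname in the SERVER_HOSTNAMES list, and only then hands off to kubectl. The socket commands ("set server .../... state drain", "show stat") and the scur column are real HAProxy behaviour; the pool name "egress", the server names "vpn-NN", the admin socket address, the deployment name and the ledger row shape are assumptions for illustration, and the sample stats output is constructed and truncated to a few columns.

```python
"""Drain-rotate for one HAProxy backend server (illustrative, not the platform's code).

Assumptions: HAProxy exposes its admin socket on TCP 127.0.0.1:9999
(``stats socket ipv4@127.0.0.1:9999 level admin``), the VPN servers live in a
pool named ``egress`` as ``vpn-NN``, and SERVER_HOSTNAMES is a comma-separated
list. The kubectl and probe steps are injected so the module runs offline.
"""
import csv
import io
import socket
import subprocess
import time
from typing import Callable, Dict, List

POOL = "egress"                      # assumed backend/pool name
DEPLOYMENT = "egress-vpn"            # assumed Kubernetes deployment name


def haproxy_cmd(cmd: str, host: str = "127.0.0.1", port: int = 9999) -> str:
    """Send one command to the HAProxy admin socket and return the full reply."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall((cmd + "\n").encode())
        chunks = []
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def current_sessions(show_stat_csv: str, pool: str, server: str) -> int:
    """Return the scur (current sessions) value for pool/server from `show stat` CSV.

    HAProxy prefixes the header line with "# ", which would otherwise become a
    column named "# pxname" in csv.DictReader, so strip exactly that prefix.
    """
    text = show_stat_csv[2:] if show_stat_csv.startswith("# ") else show_stat_csv
    for row in csv.DictReader(io.StringIO(text)):
        if row["pxname"] == pool and row["svname"] == server:
            return int(row["scur"] or 0)
    raise KeyError(f"{pool}/{server} not present in stats output")


def drain(server: str, pool: str = POOL, poll: Callable[[str], str] = haproxy_cmd,
          sleep: Callable[[float], None] = time.sleep,
          timeout: float = 600, interval: float = 5) -> float:
    """Set the server to DRAIN and wait until scur reaches zero.

    Returns the number of seconds waited; raises TimeoutError if the budget is
    exhausted so the cron run fails loudly instead of restarting a busy pod.
    """
    poll(f"set server {pool}/{server} state drain")
    waited = 0.0
    while waited <= timeout:
        if current_sessions(poll("show stat"), pool, server) == 0:
            return waited
        sleep(interval)
        waited += interval
    raise TimeoutError(f"{pool}/{server} still has sessions after {timeout}s")


def replace_hostname(hostnames: List[str], old: str, new: str) -> List[str]:
    """Return a new SERVER_HOSTNAMES list with `old` swapped for `new`, order preserved."""
    if old not in hostnames:
        raise ValueError(f"{old} is not in SERVER_HOSTNAMES")
    return [new if h == old else h for h in hostnames]


def rollout_restart(deployment: str = DEPLOYMENT) -> None:
    """Trigger the Recreate-strategy restart; the strategy itself is set on the Deployment."""
    subprocess.run(["kubectl", "rollout", "restart", f"deployment/{deployment}"], check=True)


def ledger_event(server: str, old: str, new: str, public_ip: str, waited: float) -> Dict[str, object]:
    """Row to insert into the event ledger (assumed columns; schema is not on the page)."""
    return {"event": "rotate", "server": server, "old_host": old,
            "new_host": new, "public_ip": public_ip, "drain_seconds": waited,
            "ts": time.time()}


# Constructed sample of `show stat` output, truncated to the columns used here.
SAMPLE_SHOW_STAT = (
    "# pxname,svname,qcur,qmax,scur,smax,slim,stot,\n"
    "egress,FRONTEND,,,3,10,,100,\n"
    "egress,vpn-01,0,0,{n},5,,40,\n"
    "egress,vpn-02,0,0,0,3,,30,\n"
)


def fake_haproxy(scur_sequence: List[int]) -> Callable[[str], str]:
    """Offline stand-in for haproxy_cmd: scur for vpn-01 follows the given sequence."""
    remaining = list(scur_sequence)

    def poll(cmd: str) -> str:
        if cmd.startswith("set server"):
            return ""
        return SAMPLE_SHOW_STAT.format(n=remaining.pop(0))
    return poll


if __name__ == "__main__":
    # Dry run against the constructed stats: sessions fall 2 -> 1 -> 0 over two polls.
    waited = drain("vpn-01", poll=fake_haproxy([2, 1, 0]), sleep=lambda s: None)
    hosts = replace_hostname(["vpn-a.example", "vpn-b.example"], "vpn-a.example", "vpn-c.example")
    print(waited, hosts, ledger_event("vpn-01", "vpn-a.example", "vpn-c.example", "203.0.113.7", waited))
```

Polling scur to zero rather than sleeping a fixed interval is what makes the restart safe for in-flight CONNECT tunnels; the TimeoutError keeps a backend that never drains from being restarted under load.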
A separate synthetic-CONNECT health checker runs as a cron pod. It performs direct CONNECT probes through each backend to a small set of round-robin endpoints: five consecutive failures quarantine the backend, and one success after that restores it. This catches a class of failures that HAProxy’s built-in L7 health check misses, because a backend can still answer the local check while its outbound tunnel no longer reaches the internet.
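The probe and the quarantine rule, as an added example rather than the checker the page describes: a raw CONNECT request is sent to each backend and counts as healthy only if the status line is 2xx, and a small state machine applies the five-failures-to-quarantine, one-success-to-restore rule from the text. The target endpoint list, the backend addresses, and the mapping of "quarantine" and "restore" onto HAProxy server states (maint/ready) are assumptions; the page does not say how the checker takes a backend out of rotation.

```python
"""Synthetic-CONNECT health checker (illustrative, not the platform's code).

Assumptions: each backend is an HTTP proxy reachable at host:port, the probe
targets are an assumed list of public hosts, and quarantine/restore are acted
on by changing the HAProxy server state (maint/ready) in an assumed pool.
"""
import socket
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Dict, Iterable, List, Optional, Tuple

FAIL_THRESHOLD = 5          # consecutive failures before quarantine (from the page)
POOL = "egress"             # assumed HAProxy pool name
ACTIONS = {                 # assumed mapping of transitions onto admin-socket commands
    "quarantine": "set server {pool}/{name} state maint",
    "restore": "set server {pool}/{name} state ready",
}


def parse_connect_reply(reply: bytes) -> bool:
    """True only for an HTTP/1.x 2xx status line; anything else means no tunnel."""
    status_line = reply.split(b"\r\n", 1)[0].split()
    return (len(status_line) >= 2
            and status_line[0].startswith(b"HTTP/1.")
            and status_line[1].startswith(b"2"))


def connect_probe(proxy_host: str, proxy_port: int, target: str,
                  target_port: int = 443, timeout: float = 5.0) -> bool:
    """Issue a raw CONNECT through the proxy; network errors count as failure."""
    request = (f"CONNECT {target}:{target_port} HTTP/1.1\r\n"
               f"Host: {target}:{target_port}\r\n\r\n").encode()
    try:
        with socket.create_connection((proxy_host, proxy_port), timeout=timeout) as s:
            s.sendall(request)
            reply = s.recv(4096)
    except OSError:
        return False
    return parse_connect_reply(reply)


@dataclass
class BackendState:
    name: str
    consecutive_failures: int = 0
    quarantined: bool = False


def apply_result(state: BackendState, ok: bool) -> Optional[str]:
    """Apply one probe result; return "quarantine", "restore", or None."""
    if ok:
        state.consecutive_failures = 0
        if state.quarantined:
            state.quarantined = False
            return "restore"
        return None
    state.consecutive_failures += 1
    if not state.quarantined and state.consecutive_failures >= FAIL_THRESHOLD:
        state.quarantined = True
        return "quarantine"
    return None


class HealthChecker:
    """Round-robins probe targets across backends and keeps per-backend state across runs."""

    def __init__(self, backends: Iterable[str], targets: Iterable[str],
                 probe: Callable[[str, str], bool]):
        self.states: Dict[str, BackendState] = {b: BackendState(b) for b in backends}
        self._targets = cycle(list(targets))   # persists between cycles
        self._probe = probe

    def run_cycle(self) -> List[Tuple[str, str, str]]:
        """Probe every backend once; return (backend, transition, admin command) triples."""
        events = []
        for name, state in self.states.items():
            transition = apply_result(state, self._probe(name, next(self._targets)))
            if transition:
                events.append((name, transition, ACTIONS[transition].format(pool=POOL, name=name)))
        return events


# Constructed offline scenario: vpn-01 fails every probe, vpn-02 is always healthy.
def scripted_probe(backend: str, target: str) -> bool:
    return backend != "vpn-01"


if __name__ == "__main__":
    checker = HealthChecker(["vpn-01", "vpn-02"], ["example.com", "example.net"], scripted_probe)
    for i in range(6):
        print(i + 1, checker.run_cycle())
```

Keeping the target iterator on the checker object matters: recreating it every cycle would probe every backend against the same first endpoint, so a single flaky destination would look like a backend failure.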
A daily Slack report joins the event ledger against the previous 24 hours, so overnight rotations, quarantines, and recoveries are visible at the start of the next morning rather than discovered by a downstream failure.