Sungju Kim · Data & AI Systems Engineer
Concurrency-Safe Resource Backend for a Collection Framework

The standard service layer every collector talks to for accounts, seeds, sessions, and captures.

Python 3.13 · FastAPI · MongoDB (Beanie) · PostgreSQL (Tortoise + asyncpg) · Playwright · Slack Bolt · Airflow trigger

Problem

Collectors operating across hundreds of login-gated sources kept getting blocked when concurrent runs reused the same account, and seed scheduling, session merging, and capture handoff were duplicated in each collector.

System

Role: primary maintainer. A FastAPI backend with 8 routers and 32 v2 endpoints in front of a dual MongoDB and PostgreSQL stack. Atomic account acquire/release keeps concurrent runs from sharing an account, seed scheduling assigns boards dynamically per request, Playwright StorageState sessions are merged automatically, and a Site Domain Management module rewrites URLs as source sites change domains.
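
The domain rewrite step can be pictured with a minimal sketch like the one below. It is an illustration only, not the module's actual code: the function name rewrite_url, the mapping shape, and the example hostnames are all assumptions, and the sketch deliberately ignores userinfo in URLs.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping of retired hostnames to their current replacements.
DOMAIN_MAP = {"old-board.example": "new-board.example"}


def rewrite_url(url: str, domain_map: dict[str, str]) -> str:
    """Return url with its host swapped if the host appears in domain_map."""
    parts = urlsplit(url)
    host = parts.hostname  # already lower-cased by urlsplit
    if host is None or host not in domain_map:
        return url  # unknown or relative URL: leave untouched
    netloc = domain_map[host]
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"  # keep an explicit port
    # Path, query, and fragment are carried over unchanged.
    return urlunsplit(parts._replace(netloc=netloc))
```

Keeping the mapping as data rather than code means a domain change becomes a record update instead of a collector redeploy, which fits the idea of managing it in one backend module.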

[Architecture diagram: Collector · Resource API · Mongo pool · PG state · Session merge · Airflow]

Impact

Every collector now goes through one backend for resources, so blocking-avoidance and concurrency policy live in one place. Two major migrations were carried out: MongoDB → PostgreSQL on the relational paths, and a v3.0.0 architecture refresh that retired the legacy API surface.

Architecture notes

The same architecture pattern shows up on the public lab at https://sungjukim.com/lab: a scheduled agent runs upstream, writes a static JSON artifact, and the consuming surface only ever reads that artifact. No LLM call is on the user's request path. Failures in the upstream collector or LLM stage cannot break the consuming surface; they only delay the next refresh.
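
One way to get the "failures only delay the next refresh" property is to publish the artifact with a write-then-rename step, so readers see either the previous file or the complete new one. The sketch below assumes that approach; the function names and the file layout are hypothetical, not taken from the lab's code.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_artifact(path: Path, payload: dict) -> None:
    """Write payload as JSON to a temp file, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
        # os.replace swaps the file in one step on the same filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # A failed run leaves the previously published artifact untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_artifact(path: Path) -> dict:
    """The consuming surface only reads; it never calls the upstream stage."""
    return json.loads(path.read_text(encoding="utf-8"))
```

If the scheduled run crashes partway through, the reader keeps serving the last good artifact, which is the behavior described above.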

The backend’s central abstraction is “resources”: accounts, seeds (board entries), and Playwright sessions. A collector requests resources, runs, and releases them. Behind the API, atomic MongoDB updates implement acquire and release as a state machine (ACTIVE → USING → INACTIVE / BLOCKED). If active resources are exhausted, the API can reclaim “zombie USING” rows whose owning job died, which prevents starvation.
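
The acquire/release cycle can be sketched in memory as follows. This is not the backend's code: a threading lock stands in for the atomic MongoDB update, and the class names, the lease timeout used to detect zombies, and the set of states a release may land in are all assumptions made for illustration.

```python
import threading
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(str, Enum):
    ACTIVE = "ACTIVE"
    USING = "USING"
    INACTIVE = "INACTIVE"
    BLOCKED = "BLOCKED"


@dataclass
class Account:
    account_id: str
    state: State = State.ACTIVE
    owner: Optional[str] = None
    acquired_at: Optional[float] = None


class AccountPool:
    """In-memory stand-in for the account pool (hypothetical)."""

    def __init__(self, accounts, zombie_after_s: float = 900.0):
        self._accounts = list(accounts)
        self._lock = threading.Lock()  # plays the role of the atomic update
        self._zombie_after_s = zombie_after_s  # assumed lease timeout

    def acquire(self, job_id: str, now: Optional[float] = None) -> Optional[Account]:
        now = time.time() if now is None else now
        with self._lock:
            # First choice: any ACTIVE account.
            found = next((a for a in self._accounts if a.state is State.ACTIVE), None)
            if found is None:
                # Fallback: reclaim a USING account whose lease has expired.
                found = next(
                    (a for a in self._accounts
                     if a.state is State.USING
                     and a.acquired_at is not None
                     and now - a.acquired_at > self._zombie_after_s),
                    None,
                )
            if found is None:
                return None  # pool exhausted
            found.state, found.owner, found.acquired_at = State.USING, job_id, now
            return found

    def release(self, account_id: str, job_id: str, next_state: State) -> bool:
        if next_state is State.USING:
            raise ValueError("release must leave the USING state")
        with self._lock:
            for a in self._accounts:
                # Only the current owner may release; a reclaimed lease wins.
                if a.account_id == account_id and a.state is State.USING and a.owner == job_id:
                    a.state, a.owner, a.acquired_at = next_state, None, None
                    return True
            return False
```

The important property is that the state check and the state change happen as one step. In a MongoDB-backed version that would typically mean a single find-and-modify style update whose filter includes the expected state, so two concurrent collectors can never both see the same account as ACTIVE.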

Session merge is a small but load-bearing trick: when a new Playwright StorageState is uploaded, the backend merges its cookies and origins with the existing cached session. Duplicates collapse to the newer value, and non-overlapping entries are preserved. This avoids losing cookies that a login flow sets only once.
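
A minimal sketch of such a merge, working on the JSON shape Playwright uses for storage state (a "cookies" list plus an "origins" list with "localStorage" entries). The merge keys are assumptions: cookies are treated as duplicates when name, domain, and path match, and origins are merged down to individual localStorage keys. The real backend may key them differently.

```python
from typing import Any

StorageState = dict[str, Any]


def merge_storage_state(existing: StorageState, incoming: StorageState) -> StorageState:
    """Merge two storage states; entries from incoming win on conflict."""
    # Cookies: later writes to the same key overwrite earlier ones.
    cookies: dict[tuple, dict] = {}
    for state in (existing, incoming):
        for cookie in state.get("cookies", []):
            key = (cookie["name"], cookie.get("domain"), cookie.get("path"))
            cookies[key] = cookie

    # Origins: merge localStorage per origin, key by key.
    origins: dict[str, dict[str, str]] = {}
    for state in (existing, incoming):
        for origin in state.get("origins", []):
            items = origins.setdefault(origin["origin"], {})
            for entry in origin.get("localStorage", []):
                items[entry["name"]] = entry["value"]

    return {
        "cookies": list(cookies.values()),
        "origins": [
            {
                "origin": name,
                "localStorage": [{"name": k, "value": v} for k, v in items.items()],
            }
            for name, items in origins.items()
        ],
    }
```

Processing the existing state first and the incoming state second is what makes the newer value win, while anything present only in the cached session survives the upload.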

The MongoDB → PostgreSQL migration moved everything with strong relational shape (site metadata, domain rewrites, capture audit) to PostgreSQL via Tortoise ORM; account and session pools stayed on MongoDB where atomic single-document updates fit naturally.

Stack

Python 3.13 · FastAPI · MongoDB (Beanie) · PostgreSQL (Tortoise + asyncpg) · Playwright · Slack Bolt · Airflow trigger