Parallel test runs feel solved right up until the day they stop being solved.
For a while our backend test environment looked healthy. We had pytest-xdist, modular fixtures, Factory Boy, a structured conftest.py, separate platform and tenant databases, and a ~/run_tests entrypoint that auto-loaded the test environment. One engineer running a focused slice of the suite got fast, stable results. A few workers in parallel was routine.
Then how we worked changed.
We started running more tests concurrently across more contexts: multiple tmux panes, background validation passes, longer fixture-heavy suites, and eventually multiple AI agents firing test invocations seconds apart against the same database host. That is when a structural problem we had been getting away with surfaced: even with xdist in place, two independent pytest invocations could collide on the same PostgreSQL schemas and produce LockNotAvailable, statement timeouts, or worse, silent fall-through writes into shared public.
The fix was not “raise the timeout again.” It was treating the test environment as shared infrastructure: per-invocation namespacing, fail-closed cleanup, deterministic connection labeling, and a few subtle bug fixes that mattered more than the headline change.
This post is about what broke, why xdist alone does not solve it, and the specific mechanisms we ended up needing.
Why xdist isn’t enough
pytest-xdist gives you worker parallelism inside one invocation. That is not the same as making multiple independent invocations safe on the same database host.
Our original isolation was schema-per-worker:
test_gw0
test_gw1
test_gw2
That works fine until two entirely separate ~/run_tests calls both spin up gw0. Both runs then try to drop, recreate, migrate, and seed the same physical schema. The result is structural lock contention that looks like flakiness until you see the pattern.
A multi-tenant SaaS test setup typically touches at least two databases:
- a platform database for users, orgs, roles, and global control-plane state
- one or more tenant databases for the operational data each customer actually works with
Once tests add multi-tenant isolation cases or extra tenant DBs, “isolated by worker name” becomes “isolated only when other invocations stay out of the way.” For a small human team that is an occasional flake. For a swarm of AI agents kicking off concurrent runs and inheriting shell state from each other, it is the normal operating envelope.
The hard lesson was simple: xdist parallelizes workers inside one invocation, but it does nothing to keep independent invocations out of each other's way on the same database host.
What we changed
Every pytest invocation now gets a namespace token, and every worker composes its schema from namespace + worker_id:
test_{namespace}_{worker_id}
A namespace token from ~/run_tests looks like this:
p18234t1745178234r3af93d71
That is p<PID>t<EPOCH>r<HEX> with 32 bits of /dev/urandom entropy in the suffix. Two concurrent invocations therefore look like:
test_p18234t1745178234r3af93d71_gw0
test_p18234t1745178234r3af93d71_gw1
test_p20491t1745178241r9c12d4fe_gw0
test_p20491t1745178241r9c12d4fe_gw1
The cross-invocation collision is now impossible by construction. The interesting work was in the parts of the system that had to learn this rule, and in two subtle bugs we hit along the way.
The runner generates and validates the namespace
~/run_tests generates a fresh token per invocation by default, even if the calling shell already has HONEY_TEST_SCHEMA_NAMESPACE exported from a previous run. That stale-export defense matters: a debug session that left the variable set could otherwise cause the next “normal” run to silently re-collide with itself.
A scoped debug override (HONEY_TEST_SCHEMA_NAMESPACE_FORCE=mydebug) is allowed but tightly restricted: lowercase alphanumeric, no underscores, no hyphens, max 32 characters. The runner validates the override in shell before the value reaches any report file path or environment export.
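The actual generation and validation live in the shell runner; the sketch below restates the same rules in Python, with fresh_namespace() and validated_override() as illustrative names rather than our real code:

```python
import os
import re
import time

def fresh_namespace() -> str:
    # p<PID>t<EPOCH>r<HEX>: os.urandom(4) supplies the 32-bit random suffix.
    return f"p{os.getpid()}t{int(time.time())}r{os.urandom(4).hex()}"

def validated_override(value: str) -> str:
    # Lowercase alphanumeric only, at most 32 characters: no underscores, no hyphens.
    if not re.fullmatch(r"[a-z0-9]{1,32}", value):
        raise ValueError(f"invalid HONEY_TEST_SCHEMA_NAMESPACE_FORCE: {value!r}")
    return value
```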
conftest.py covers direct pytest
The ~/run_tests entrypoint is not the only way pytest gets invoked. To keep the design coherent under direct pytest calls, tests/conftest.py:
- preserves any namespace already exported by the runner
- generates one if none exists
- propagates the controller’s namespace into every xdist worker subprocess via the pytest_configure_node hook (node.workerinput["schema_namespace"])
Without that propagation, each xdist worker would generate its own namespace and you would be back to per-worker collisions inside a single invocation.
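A trimmed sketch of that conftest.py logic. pytest_configure and pytest_configure_node are real pytest / pytest-xdist hooks; everything else here is simplified:

```python
import os
import time

def pytest_configure(config):
    if hasattr(config, "workerinput"):
        # xdist worker: adopt the controller's namespace, never mint a new one.
        os.environ["HONEY_TEST_SCHEMA_NAMESPACE"] = config.workerinput["schema_namespace"]
    else:
        # Controller or plain pytest: keep a runner-exported namespace, else generate one.
        os.environ.setdefault(
            "HONEY_TEST_SCHEMA_NAMESPACE",
            f"p{os.getpid()}t{int(time.time())}r{os.urandom(4).hex()}",
        )

def pytest_configure_node(node):
    # Controller-side xdist hook: push the namespace into each worker subprocess.
    node.workerinput["schema_namespace"] = os.environ["HONEY_TEST_SCHEMA_NAMESPACE"]
```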
Subtle bug 1: silent search_path fallthrough
This is the bug that mattered most and that no test would have caught.
PostgreSQL’s SET LOCAL search_path TO foo, public does not error if foo does not exist. It silently falls through to public. So if you migrate engine setup to namespaced schemas but a session-scoped seeder still composes the legacy name:
# Engines now use test_<namespace>_gw0 (correct)
# Seeder, still using the old pattern:
worker_id = os.getenv("PYTEST_XDIST_WORKER", "gw0")
schema = f"test_{worker_id}" # "test_gw0" — does not exist any more
seed_sess.execute(text(f"SET LOCAL search_path TO {schema}, public"))
seed_sess.add(OperationType(...)) # ends up in `public`
seed_sess.commit() # cross-invocation contamination, no error
The seeded rows land in shared public. Concurrent invocations now see each other’s data. The original LockNotAvailable symptom is gone, but a much subtler form of the same problem is silently active.
Every fixture, helper, and seeder that touches search_path now composes through one shared worker_schema() helper, so the engine’s schema and the seeder’s schema cannot diverge. We codified that as a structural invariant in the design doc.
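The helper itself is small; a sketch of its shape, using the environment variable names from this post:

```python
import os

def worker_schema() -> str:
    # The only place a test schema name is composed; engines, fixtures, and seeders all call it.
    namespace = os.environ["HONEY_TEST_SCHEMA_NAMESPACE"]
    worker_id = os.getenv("PYTEST_XDIST_WORKER", "gw0")
    return f"test_{namespace}_{worker_id}"
```

The seeder line from the bug above becomes `SET LOCAL search_path TO {worker_schema()}, public`, and it can no longer point anywhere the engine does not.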
Subtle bug 2: a regex that quietly halved DDL parallelism
Schema setup uses a 2-slot DDL semaphore to bound how many concurrent CREATE TABLE runs hit Postgres at once. The slot was keyed off the worker number:
# Original
slot_match = re.search(r"(\d+)", schema)
slot = int(slot_match.group(1)) % 2 if slot_match else 0
That worked when the schema was test_gw0 or test_gw1. After namespacing, the schema is test_p12345t1745178234r3af93d71_gw0, and re.search(r"(\d+)", ...) matches the first digit run: the PID. Every worker of one invocation gets 12345 % 2 = 1. The 2-lane semaphore degenerates to a single lane, with no error and no logged warning.
Fix: anchor on the worker suffix.
import re

def ddl_slot(schema: str, slot_count: int = 2) -> int:
    # Anchor on the worker suffix, not the first digit run in the name.
    m = re.search(r"_gw(\d+)(?:_|$)", schema)
    return int(m.group(1)) % slot_count if m else 0
The 2-slot count itself is empirical. Each metadata.create_all() for our 220+ tenant tables acquires roughly 4,000 catalog locks. PostgreSQL’s shared lock table holds about 19,200 slots at the default max_locks_per_transaction = 64. Two concurrent DDL transactions peak near 8,000 locks, with comfortable margin. Four pushed peak past 16,000 and produced out of shared memory failures under -n 22.
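For concreteness, here is roughly how the slot index feeds the semaphore. The real implementation uses file-system fcntl locks (mentioned again below); the lane helper and lock-file path here are illustrative:

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def ddl_lane(schema: str, slot_count: int = 2):
    # Map the worker's schema onto one of two lanes, then hold that lane's file lock
    # for the duration of the DDL; a blocking flock serializes DDL within the lane.
    slot = ddl_slot(schema, slot_count)
    with open(f"/tmp/pytest_ddl_slot_{slot}.lock", "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield slot
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```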
Active-namespace detection unions across every test DB
Stale-schema cleanup is best-effort and runs before pytest. It needs to drop crashed-run schemas without ever dropping a schema a live invocation is still using.
The first instinct (check pg_stat_activity in the DB you are about to sweep) is wrong. A live invocation may currently hold connections in only one of the four configured test DBs. If you check per-DB, you can drop a live namespace’s schemas from the other three, recreating the original race in a narrower window.
The sweep collects the live set across every reachable DB before sweeping any DB:
active_namespaces, errors = collect_active_namespaces(all_urls)
active_detection_complete = not errors

for url in all_urls:
    sweep_one(
        url,
        active_namespaces=active_namespaces,
        active_detection_complete=active_detection_complete,
    )
If any DB is unreachable during collection (auth failure, network timeout, host blackhole), active_detection_complete flips to False and the sweep drops nothing. Stale candidates are recorded with reason: active_detection_incomplete so the fail-closed decision is visible in the JSON output. We bound connection establishment with a 5-second connect_timeout so a blackholing host cannot stall the pre-pytest sweep before this safety logic runs.
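A simplified sketch of the collection step. It leans on the application_name convention described in the next section; the function name matches the snippet above, but the internals are trimmed:

```python
from sqlalchemy import create_engine, text

def collect_active_namespaces(urls):
    active, errors = set(), []
    for url in urls:
        try:
            engine = create_engine(url, connect_args={"connect_timeout": 5})
            with engine.connect() as conn:
                rows = conn.execute(text(
                    "SELECT application_name FROM pg_stat_activity "
                    "WHERE application_name LIKE 'pytest\\_%'"
                ))
                for (app_name,) in rows:
                    # pytest_<worker>_<namespace>[_<suffix>]; namespaces never contain underscores.
                    parts = app_name.split("_", 3)
                    if len(parts) >= 3:
                        active.add(parts[2])
        except Exception as exc:
            # Any unreachable DB means the picture is incomplete; the caller fails closed.
            errors.append((url, str(exc)))
    return active, errors
```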
One drop per transaction
The first version of the sweep dropped every stale schema inside one transaction. That worked on small backlogs. On a real backlog of 56 stale schemas (each containing 220+ tenant tables), the cumulative DROP SCHEMA ... CASCADE catalog locks blew past PostgreSQL’s shared lock table:
psycopg2.errors.OutOfMemory: out of shared memory
Worse, the failure aborted the transaction, leaving the rest of the backlog stranded for the next sweep to hit again at higher cost.
The fix: discovery in one read-only engine.connect() block, then each drop in its own short engine.begin() transaction with SET LOCAL lock_timeout = '5s' and SET LOCAL statement_timeout = '30s'. Failed drops land in a per-DB failed list rather than aborting the rest of the database. The cleanup pass cleared the 56-schema backlog cleanly: 56 dropped, 0 failed, no OOM.
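In sketch form, with discovery and the JSON bookkeeping trimmed and the names illustrative:

```python
from sqlalchemy import text

def drop_stale_schemas(engine, stale_schemas):
    dropped, failed = [], []
    for schema in stale_schemas:
        try:
            # One schema per transaction keeps the catalog-lock footprint bounded.
            with engine.begin() as conn:
                conn.execute(text("SET LOCAL lock_timeout = '5s'"))
                conn.execute(text("SET LOCAL statement_timeout = '30s'"))
                conn.execute(text(f'DROP SCHEMA IF EXISTS "{schema}" CASCADE'))
            dropped.append(schema)
        except Exception as exc:
            # A failed drop is recorded and skipped; it no longer aborts the rest of the backlog.
            failed.append((schema, str(exc)))
    return dropped, failed
```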
Connection labeling makes lock incidents traceable
Every test connection is labeled:
application_name = pytest_{worker_id}_{namespace}[_{suffix}]
Examples:
pytest_gw0_p18234t1745178234r3af93d71
pytest_gw0_p18234t1745178234r3af93d71_concurrency
The underscore ban for ad-hoc namespaces is structural here. pytest_gw0_my_debug_concurrency could parse as either (my_debug, concurrency) or (my_debug_concurrency, no suffix), and the sweep’s active-namespace detector cannot pick. Banning underscores in ad-hoc tokens makes the suffix boundary unambiguous.
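Getting the label onto every connection is a one-liner at engine construction; a sketch, assuming psycopg2 (which passes application_name through as a libpq connection parameter):

```python
import os
from sqlalchemy import create_engine

def labeled_engine(url: str, suffix: str = ""):
    namespace = os.environ["HONEY_TEST_SCHEMA_NAMESPACE"]
    worker_id = os.getenv("PYTEST_XDIST_WORKER", "gw0")
    app_name = f"pytest_{worker_id}_{namespace}" + (f"_{suffix}" if suffix else "")
    return create_engine(url, connect_args={"application_name": app_name})
```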
bin/diagnose_test_locks.py queries pg_blocking_pids() across every configured test DB and prints both sides:
SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    blocked.wait_event,
    blocked.query AS blocked_query,
    blocker.pid AS blocker_pid,
    blocker.application_name AS blocker_app,
    blocker.query AS blocker_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blockers(pid) ON true
JOIN pg_stat_activity blocker ON blocker.pid = blockers.pid;
Because every connection carries its worker and namespace in application_name, a LockNotAvailable is now traceable to a specific invocation, not inferred from a stack trace.
Test reports are namespaced
A small but load-bearing detail. ~/run_tests writes JSON reports to /tmp/pytest_reports/<timestamp>_<namespace>.json and collected-nodeid scratch files to .collected_nodeids_<namespace>.txt. Two concurrent invocations starting in the same second used to overwrite each other’s evidence. The reader (~/test_report) globs *.json, so namespacing the filenames does not break anything downstream.
Old design vs current design
| Concern | Earlier shape | Current shape |
|---|---|---|
| Worker schema naming | test_gw0 | test_{namespace}_{worker} |
| Cross-invocation safety | best effort | structural isolation |
| Report artifacts | shared timestamp paths | namespace-qualified paths |
| Active test detection | per-DB view, risk of partial picture | union across all configured DBs |
| Cleanup posture | could be destructive when uncertain | fail-closed when detection is incomplete |
| Schema drop strategy | many drops in one transaction | one schema per transaction |
| DDL slot derivation | first digit run in schema name | anchored on _gwN suffix |
| search_path composition | hand-built per call site | single worker_schema() helper |
| Lock triage | infer from failures | pg_blocking_pids() plus application_name labels |
What we deliberately did not do
A few options we considered and rejected, in case they are useful for someone making the same call:
- Per-invocation database (`CREATE DATABASE … DROP DATABASE`): needs `CREATEDB`, re-runs Alembic per invocation, and adds 4 to 8 seconds × N databases to every startup. Too heavy.
- Bigger `lock_timeout` everywhere: more band-aid. The failure mode becomes 120-second waits instead of 15-second errors. Does not fix the race.
- Postgres advisory locks instead of file-system fcntl: does not replace namespace isolation. Reasonable follow-up if multi-host CI runners ever show up; not worth doing inside a local-WSL workflow.
- Queueing invocations as a discipline rule: fragile, and was already getting violated in practice when the failures started.
Why this matters more for AI-assisted teams
A human team often hits this class of collision occasionally. An agent swarm hits it routinely:
- multiple agents may launch tests within seconds of each other
- they often run similar target sets
- they may inherit shell state you forgot about
- they are more likely to stress stale cleanup and report collection
What looked like a flaky edge case becomes the normal envelope of operation. The test environment has to behave like shared infrastructure, with namespacing, cleanup rules, diagnostics, and safe defaults that hold without operator coordination.
Closing
The same instincts behind this work show up in the product: isolate state explicitly, fail closed when certainty disappears, label work so you can trace it later, prefer structural guarantees over timeout band-aids. That is the posture behind the EquatorOps platform and the engine architecture.
If you are building operational software with real concurrency, the takeaway is not “use namespaces.” It is that worker-level parallelism stops being enough once independent runs can collide on shared state, and once AI agents are part of your engineering loop, those collisions stop being rare.
If you want to talk about the developer surface behind that architecture, /developers is the place to start.