2026-03-25 Run Registry For Deferred Cleanup¶
- Status:
implemented - Date:
2026-03-25 - Baseline repo commit before this change series:
3dba87b6e126a46fdedd48609c7d0d016cee436f - Baseline commit summary:
chore: remove codecov patch target - Initial implementation commits:
d37d7d7feat: add shared run registry primitives7a50b87feat: track writer runs in the registry6e56637test: cover run registry lifecycle
Summary¶
This change series added a writer-owned run registry that stores one run-scoped YAML record per writer invocation under a dedicated S3 prefix.
The run record is operational metadata for writers and future cleanup workflows. It is not part of the reader contract and is not embedded in the manifest body.
The core idea is:
- Create one run record when the write starts.
- Keep it fresh with a lease heartbeat while the writer is active.
- Mark it
succeededorfailedat the end. - Let a later sweeper derive cleanup targets by combining the run record, the manifest if present, and the existing
shards/run_id=...layout.
This replaced the earlier direction of writing per-attempt cleanup artifacts to S3.
Why This Change Was Needed¶
The existing best-effort cleanup_losers() behavior is useful, but it is not a complete contract for deferred cleanup:
- it runs only inline during the write
- it does not leave a durable writer-owned record that a periodic cleanup job can discover later
- if the writer process dies or cleanup fails transiently, a later sweeper needs some other durable signal to know which run should be inspected
At the same time, the solution had to satisfy a few constraints:
- work across all writers: Python, Dask, Ray, Spark, and single-db modes
- work for both HASH and CEL sharding
- avoid adding reader-only overhead to the manifest
- avoid generating one S3 object per partition or per attempt
- leave a clean migration path to a future SQLite-backed implementation
Problem With The Earlier Direction¶
The earlier plan was to write failed-attempt or attempt-journal YAML objects to S3 and let a later cleanup job process them.
That direction had three problems.
Object Fan-Out¶
One object per attempt or per partition does not scale well.
For large future runs, the cardinality can be very high. A run with hundreds of thousands or millions of partitions would produce an unacceptable number of tiny operational objects.
Distributed Capture Semantics¶
A precise per-attempt journal is easy for a driver-owned writer, but much harder for distributed systems:
- Spark executor attempts can fail before returning structured results
- Dask and Ray have framework-specific event paths, but they are not the same abstraction
- a shared mutable run journal on S3 is awkward because S3 is an object store, not a transactional append log
Reader Contract Pollution¶
Putting cleanup telemetry directly in the manifest would make a reader-facing artifact carry operational details that readers do not need.
That is the wrong ownership boundary. Cleanup metadata belongs to the writer side.
Main Design Decisions¶
1. Separate Operational Metadata From The Manifest¶
The manifest remains the reader-facing snapshot contract.
The run registry is a separate writer-owned artifact. On successful runs, the run record stores manifest_ref. The manifest does not need to point back to the run record.
This keeps readers decoupled from cleanup mechanics.
2. Store One Record Per Run, Not Per Attempt¶
The run registry stores one YAML object per writer invocation under:
s3://.../<run_registry_prefix>/<timestamp>_run_id=<run_id>_<uuid>/run.yaml
That gives cleanup jobs one stable entrypoint per run and avoids S3 object explosions.
3. Derive Loser Cleanup From Existing Shard Layout¶
The solution does not try to persist a complete list of failed attempts.
Instead, a future sweeper derives losers from what already exists:
- the run record says whether the run is still active, succeeded, or failed
- the manifest identifies winner shard URLs when a successful publish exists
- the shard layout already isolates attempts under
shards/run_id=<run_id>/.../attempt=NN
So the sweeper can:
- load the manifest and keep winner
db_urls - list all prefixes under the run's shard tree
- delete everything else
For failed runs without a manifest, the sweeper can delete the whole run shard tree once the run is terminal or its lease expires.
4. Use A Lease-Based Liveness Model¶
The run record is created as running and refreshed with a heartbeat.
Fields include:
statusstarted_atupdated_atlease_expires_ats3_prefixshard_prefixdb_path_templatemanifest_ref- optional failure context
This gives a future sweeper a simple rule:
- active
runningrecord with valid lease: leave it alone failedrecord: eligible for failed-run cleanup- stale
runningrecord with expired lease: treat as abandoned
5. Keep The Interface Storage-Agnostic¶
The implementation introduced a RunRegistry abstraction with:
S3RunRegistrynowInMemoryRunRegistryfor tests
This is intentionally shaped so a future SQLite-backed implementation can replace the storage backend without changing the writer lifecycle.
Alternatives Considered¶
Alternative 1: Failed-Attempt YAML Object Per Run¶
Idea:
- write one YAML report at the end containing failed attempt prefixes and run identity
Rejected because:
- failed Spark executor attempts may not be observable at the driver
- it still requires precise attempt bookkeeping across all backends
- it is better than per-attempt objects, but still pushes the design toward explicit attempt journaling instead of deriving from stable layout
Alternative 2: Immutable Per-Attempt Journal Objects Plus Summary¶
Idea:
- each attempt writes an immutable record
- the driver later writes one summary object
Rejected because:
- S3 object fan-out is too high
- a million-partition future makes this design operationally expensive
- framework-specific attempt signaling becomes part of the core design
Alternative 3: Put Attempt Metadata In The Manifest¶
Idea:
- extend manifest payload with attempt journal or cleanup metadata
Rejected because:
- readers do not need it
- failed runs may have no manifest at all
- it mixes operational cleanup metadata into the snapshot contract
Alternative 4: Driver-Owned SQLite Attempt Journal Right Now¶
Idea:
- record all attempts into one SQLite DB and publish it at the end
This is still a plausible future direction, but it was not chosen for the current implementation because:
- it is a larger design jump than needed for the immediate cleanup problem
- distributed attempt capture remains tricky unless all frameworks can reliably report attempt starts to a shared durable journal
- the simpler run-registry contract already solves discovery and cleanup ownership without committing to per-attempt persistence
Alternative 5: Framework-Specific Event Collection¶
Idea:
- Spark listeners
- Dask scheduler events
- Ray actors or queues
Rejected as the primary design because:
- it is not uniform across frameworks
- it solves detailed attempt telemetry, not the simpler question of "which runs need cleanup inspection?"
- it adds framework-specific infrastructure to a problem that can be solved from existing object layout
Final Solution¶
The implemented solution is a driver-owned run registry.
Writer Lifecycle¶
At run start:
- create one run record
- set
status=running - set
lease_expires_at
While the run is active:
- refresh
updated_at - refresh
lease_expires_at
At successful completion:
- publish the manifest
- update
manifest_ref - mark the run record
succeeded
At handled failure:
- mark the run record
failed - include
error_typeanderror_messagewhen available
If the process dies before terminal update:
- the record stays
running - the lease eventually expires
- a sweeper can treat it as abandoned
Cleanup Model¶
The current implementation does not add the sweeper itself. It establishes the contract the sweeper will use later.
The intended cleanup behavior is:
Successful Run With Manifest¶
- read the run record
- load
manifest_ref - keep winner
db_urls from the manifest - list all attempt prefixes under
shards/run_id=<run_id>/ - delete non-winner attempts
Failed Run Without Manifest¶
- read the run record
- if
status=failed, orstatus=runningwith expired lease, delete the whole run shard tree
This works for CEL as well because CEL still resolves into concrete db_id shards and writes into the same run-scoped attempt layout.
Implementation Areas¶
Primary implementation areas:
shardyfusion/run_registry.py- run record model
- S3 and in-memory registries
- lifecycle helper with heartbeat and terminal state updates
shardyfusion/config.pyoutput.run_registry_prefix- optional injected
run_registry shardyfusion/manifest.pyBuildResult.run_record_ref- writer entrypoints
- Python, Dask, Ray, Spark, and single-db writers now create and complete a run record
Test coverage was added across:
- shared unit tests for run-record lifecycle
- writer unit tests for success and retry-success paths
- local-S3 integration tests that validate run-record contents
- e2e smoke scenarios that validate run-record contents when the Garage environment is available
Pros¶
One Durable Artifact Per Run¶
The operational metadata footprint stays small and predictable.
Uniform Across Writers¶
The same contract works for:
- Python
- Dask
- Ray
- Spark
- single-db modes
- HASH and CEL sharding
Readers Stay Unchanged¶
No reader behavior depends on the run registry.
Cleanup Can Be Derived, Not Logged Explicitly¶
The solution reuses the existing attempt-isolated shard layout instead of inventing a separate attempt journal.
Good Fit For Future SQLite¶
The abstraction is already separated from the storage backend, so a future SQLite-backed run registry can preserve the same writer lifecycle.
Cons¶
No Per-Attempt Provenance¶
The run registry tells the sweeper which run to inspect, not the exact list of attempts that failed.
That is a deliberate tradeoff.
Sweeper Must List S3¶
Deferred cleanup now depends on listing the run shard tree and comparing it to the manifest. That is simple, but it shifts some work to the sweeper.
Abandoned Run Detection Is Lease-Based¶
Hard crashes are detected by expired heartbeat lease, not by an explicit crash marker.
Spark Retry Telemetry Is Still Implicit¶
The design does not try to record every Spark task attempt. It only guarantees that abandoned or completed runs are discoverable for cleanup.
Why This Solution Was Chosen¶
This solution was chosen because it is the smallest design that satisfies the real operational need:
- discover unfinished or completed runs later
- keep cleanup metadata out of reader-facing artifacts
- work uniformly across all writers
- scale better than per-attempt S3 records
It does not try to solve exact per-attempt journaling. That was an intentional boundary.
The run registry gives the system one durable source of truth for run lifecycle, while loser detection continues to come from the existing shard layout and manifest winner selection.
Follow-On Work¶
The run registry is the discovery layer. A future cleanup implementation still needs to:
- list run records under the configured registry prefix
- evaluate terminal or expired runs
- load manifests when present
- delete non-winner or abandoned attempt prefixes
- report cleanup results
If a later SQLite-based writer journal is introduced, it should implement the same conceptual contract:
- one logical run record
- terminal state
- manifest pointer when available
- enough metadata for deferred cleanup discovery