ADR-004: Consistent writer retry model¶
Status: Accepted (2026-03-25)
Source: historical-notes/2026-03-25-consistent-writer-retry.md
Context¶
Writers across frameworks (Spark, Dask, Ray, Python) have their own retry semantics for failed tasks. Without a shared model:
- Behavior diverges across frameworks (e.g. Ray retries a task; Dask doesn't; Python writer aborts).
- Per-shard transient failures cause whole-build aborts.
- Operators can't tune retry budget centrally.
Decision¶
Define a writer-level retry contract independent of the framework:
WriteConfig.shard_retrycontrols per-shard retry budget for transient adapter failures.- The
_writer_coreshared module owns the retry loop; framework writers delegate to it. - Framework-native retries (Spark task retry, Ray task retry) are additive — they handle worker-level failures;
shard_retryhandles adapter-level failures (S3 transient, SlateDB I/O). - File-spool fallback exists in the Python writer when
shard_retryis set, to allow re-running a single shard without re-feeding the input iterator.
Consequences¶
- Consistent shard-level retry semantics across frameworks.
ShardCoverageErrorraised only aftershard_retryis exhausted.- Operators tune one knob (
shard_retry) regardless of framework.