ADR-004: Consistent writer retry model¶

Status: Accepted (2026-03-25) Source: historical-notes/2026-03-25-consistent-writer-retry.md

Context¶

Writers across frameworks (Spark, Dask, Ray, Python) have their own retry semantics for failed tasks. Without a shared model:

Behavior diverges across frameworks (e.g. Ray retries a task; Dask doesn't; Python writer aborts).
Per-shard transient failures cause whole-build aborts.
Operators can't tune retry budget centrally.

Define a writer-level retry contract independent of the framework:

WriteConfig.shard_retry controls per-shard retry budget for transient adapter failures.
The _writer_core shared module owns the retry loop; framework writers delegate to it.
Framework-native retries (Spark task retry, Ray task retry) are additive — they handle worker-level failures; shard_retry handles adapter-level failures (S3 transient, SlateDB I/O).
File-spool fallback exists in the Python writer when shard_retry is set, to allow re-running a single shard without re-feeding the input iterator.