Error Handling
Error Hierarchy
All shardyfusion errors inherit from ShardyfusionError, which carries a retryable flag:
```mermaid
graph TD
    SE[ShardyfusionError<br/>retryable: bool]
    SE --> CVE[ConfigValidationError<br/>retryable=False]
    SE --> SAE[ShardAssignmentError<br/>retryable=False]
    SE --> SCE[ShardCoverageError<br/>retryable=False]
    SE --> SWE[ShardWriteError<br/>retryable=True]
    SE --> SDAE[SlateDbApiError<br/>retryable=False]
    SE --> MBE[ManifestBuildError<br/>retryable=False]
    SE --> PME[PublishManifestError<br/>retryable=True]
    SE --> PCE[PublishCurrentError<br/>retryable=True]
    SE --> MPE[ManifestParseError<br/>retryable=False]
    SE --> RSE[ReaderStateError<br/>retryable=False]
    SE --> STE[S3TransientError<br/>retryable=True]
    SE --> PEE[PoolExhaustedError<br/>retryable=True]
    SE --> MSTE[ManifestStoreError<br/>retryable=True]
```
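To make the shape of the hierarchy concrete, here is a minimal sketch using stand-in classes rather than the real `shardyfusion.errors` module. It assumes the flag is a class-level attribute that subclasses override; the library may set it differently (for example per instance), and the `is_retryable` helper is hypothetical.

```python
class ShardyfusionError(Exception):
    """Stand-in base class: every error carries a retryable flag."""

    retryable: bool = False  # assumption: class-level default, overridden below


class ConfigValidationError(ShardyfusionError):
    """Non-retryable: the configuration itself is invalid."""


class ShardWriteError(ShardyfusionError):
    """Retryable: adapter operations may fail transiently."""

    retryable = True


class S3TransientError(ShardyfusionError):
    """Retryable: throttling, HTTP 500/503, timeouts."""

    retryable = True


def is_retryable(exc: BaseException) -> bool:
    """Hypothetical helper: True only for hierarchy errors flagged retryable."""
    return isinstance(exc, ShardyfusionError) and exc.retryable
```

With this layout, a single `except ShardyfusionError` clause can branch on `exc.retryable` without listing each subclass.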
Classification
| Error | Retryable | When raised |
|---|---|---|
| `ConfigValidationError` | No | Invalid WriteConfig parameters (bad s3_prefix, num_dbs <= 0, unsupported sharding strategy) |
| `ShardAssignmentError` | No | Routing verification detects mismatch between framework-assigned and Python-computed shard IDs |
| `ShardCoverageError` | No | After shard writes, results don't cover all expected range(num_dbs) |
| `ShardWriteError` | Yes | Adapter operations (write_batch, flush, checkpoint) failed with a potentially transient error |
| `SlateDbApiError` | No | SlateDB package missing, reader close failures, API-level errors |
| `ManifestBuildError` | No | Manifest artifact creation failed during SqliteManifestBuilder.build() |
| `PublishManifestError` | Yes | Manifest upload to S3 fails (transient) |
| `PublishCurrentError` | Yes | CURRENT pointer upload fails after manifest is already published |
| `ManifestParseError` | No | Malformed manifest JSON, missing required fields, structural violations |
| `ReaderStateError` | No | Operations on a closed reader, missing CURRENT pointer |
| `S3TransientError` | Yes | Throttling, HTTP 500/503, timeout during S3 operations |
| `PoolExhaustedError` | Yes | All readers in the pool are checked out and checkout timed out |
| `ManifestStoreError` | Yes | Transient manifest store failure (DB connection, query timeout) |
Retryable vs Non-Retryable
```python
from shardyfusion.errors import ShardyfusionError

try:
    result = write_sharded(...)
except ShardyfusionError as exc:
    if exc.retryable:
        # Safe to retry: transient infrastructure failure
        retry_with_backoff(write_sharded, ...)
    else:
        # Programmer/data error: fix before retrying
        raise
```
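The `retry_with_backoff` helper used above is not defined on this page. One possible shape is sketched below, under stated assumptions: the parameter names (`attempts`, `base_delay`, `sleep`) are hypothetical, and the helper reads the `retryable` attribute with `getattr` so the sketch runs without importing shardyfusion.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[..., T],
    *args,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs,
) -> T:
    """Call func, retrying only exceptions whose retryable attribute is true."""
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            # Non-retryable errors and the final failure propagate unchanged.
            if last_attempt or not getattr(exc, "retryable", False):
                raise
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2**attempt)
    # Reached only when attempts < 1, so the loop body never ran.
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real delays; production callers can rely on the `time.sleep` default.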
PublishCurrentError Recovery
The most nuanced error scenario: the manifest has been successfully published to S3, but the CURRENT pointer update fails. The data is written and the manifest exists — only the pointer is missing.
```python
import logging

from shardyfusion.errors import PublishCurrentError

log = logging.getLogger(__name__)

try:
    result = write_sharded(...)
except PublishCurrentError as exc:
    # The manifest is already published; recover by retrying just the CURRENT update
    manifest_ref = exc.manifest_ref
    if manifest_ref:
        log.warning("CURRENT update failed, manifest at: %s", manifest_ref)
        # Option 1: Retry CURRENT update via the store
        # Option 2: Log manifest_ref for manual recovery
        # Option 3: Re-run the entire pipeline (idempotent with same run_id)
```
Note: As of the latest version, `publish_to_store()` automatically retries `PublishCurrentError` up to 3 times with exponential backoff (1s → 2s → 4s) before raising. Manual retry is only needed if the automatic retries are also exhausted.
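If the automatic retries are exhausted, Option 1 could be wired up roughly as follows. This is a sketch under assumptions, not the library's API: `PublishCurrentError` here is a stand-in class with a `manifest_ref` attribute, and `run_pipeline` and `republish_current` are hypothetical callables supplied by the caller (the second would wrap whatever store call updates the CURRENT pointer in your deployment).

```python
import logging
from typing import Callable, Optional

log = logging.getLogger(__name__)


class PublishCurrentError(Exception):
    """Stand-in for shardyfusion.errors.PublishCurrentError."""

    def __init__(self, message: str, manifest_ref: Optional[str] = None) -> None:
        super().__init__(message)
        self.manifest_ref = manifest_ref


def run_with_current_recovery(
    run_pipeline: Callable[[], str],
    republish_current: Callable[[str], None],
) -> str:
    """Run the pipeline; on PublishCurrentError, retry only the CURRENT update.

    Assumes run_pipeline returns the manifest reference on success.
    """
    try:
        return run_pipeline()
    except PublishCurrentError as exc:
        if not exc.manifest_ref:
            raise  # no recovery handle, so the caller must decide
        log.warning("CURRENT update failed, manifest at: %s", exc.manifest_ref)
        # The manifest already exists, so only the pointer is rewritten.
        republish_current(exc.manifest_ref)
        return exc.manifest_ref
```

If `republish_current` itself fails, its exception propagates, and the logged `manifest_ref` remains available for manual recovery.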
S3 Retry Behavior
The storage layer uses exponential backoff for transient S3 errors:
- Attempts: 3 (initial + 2 retries)
- Backoff: 1s → 2s → 4s
- Retried conditions: HTTP 500, 503, throttling, connection timeouts
- Not retried: HTTP 400, 403, 404, other client errors
If all retries are exhausted, S3TransientError is raised to the caller.
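The policy above can be sketched as a small retry loop. This is an illustration, not the storage layer's actual code: `HttpStatusError`, `should_retry`, and `call_s3` are hypothetical names, `S3TransientError` is a stand-in class, and throttling is assumed to surface as one of the retried statuses or as a timeout.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = {500, 503}  # server-side failures that are retried
MAX_ATTEMPTS = 3               # initial call + 2 retries


class S3TransientError(Exception):
    """Stand-in for shardyfusion's S3TransientError."""

    retryable = True


class HttpStatusError(Exception):
    """Hypothetical low-level error carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def should_retry(exc: Exception) -> bool:
    """Retry timeouts and 500/503; never retry other statuses such as 4xx."""
    if isinstance(exc, TimeoutError):
        return True
    return isinstance(exc, HttpStatusError) and exc.status in RETRYABLE_STATUS


def call_s3(op: Callable[[], T], sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op with exponential backoff, raising S3TransientError when exhausted."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return op()
        except Exception as exc:
            if not should_retry(exc):
                raise  # e.g. 400/403/404 surface immediately
            if attempt == MAX_ATTEMPTS - 1:
                raise S3TransientError("S3 retries exhausted") from exc
            sleep(2.0**attempt)  # waits 1s, then 2s
    raise AssertionError("unreachable")
```

With three attempts, this sketch sleeps twice (1s, then 2s); the 4s step in the schedule would apply only if a further retry were allowed.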
Best Practices
- Catch `ShardyfusionError`, not `Exception`, for shardyfusion-specific handling.
- Check `retryable` before implementing retry logic: non-retryable errors indicate bugs that retrying won't fix.
- Log `PublishCurrentError.manifest_ref` in production. This is your recovery handle when the CURRENT pointer fails to update.
- Use `MetricsCollector` to monitor `S3_RETRY` and `S3_RETRY_EXHAUSTED` events for infrastructure health.
- Don't suppress errors during cleanup: reader `close()` now raises `SlateDbApiError` if any shard handle fails to close. Catch this at the application level if you need graceful degradation.