Error Handling¶

Error Hierarchy¶

All shardyfusion errors inherit from ShardyfusionError, which carries a retryable flag:

graph TD
    SE[ShardyfusionError<br/>retryable: bool]

    SE --> CVE[ConfigValidationError<br/>retryable=False]
    SE --> SAE[ShardAssignmentError<br/>retryable=False]
    SE --> SCE[ShardCoverageError<br/>retryable=False]
    SE --> SWE[ShardWriteError<br/>retryable=True]
    SE --> SDAE[SlateDbApiError<br/>retryable=False]
    SE --> MBE[ManifestBuildError<br/>retryable=False]
    SE --> PME[PublishManifestError<br/>retryable=True]
    SE --> PCE[PublishCurrentError<br/>retryable=True]
    SE --> MPE[ManifestParseError<br/>retryable=False]
    SE --> RSE[ReaderStateError<br/>retryable=False]
    SE --> STE[S3TransientError<br/>retryable=True]
    SE --> PEE[PoolExhaustedError<br/>retryable=True]
    SE --> MSTE[ManifestStoreError<br/>retryable=True]

Classification¶

Error	Retryable	When raised
`ConfigValidationError`	No	Invalid `WriteConfig` parameters (bad `s3_prefix`, `num_dbs <= 0`, unsupported sharding strategy)
`ShardAssignmentError`	No	Routing verification detects mismatch between framework-assigned and Python-computed shard IDs
`ShardCoverageError`	No	After shard writes, results don't cover all expected `range(num_dbs)`
`ShardWriteError`	Yes	Adapter operations (write_batch, flush, checkpoint) failed with a potentially transient error
`SlateDbApiError`	No	SlateDB package missing, reader close failures, API-level errors
`ManifestBuildError`	No	Manifest artifact creation failed during `SqliteManifestBuilder.build()`
`PublishManifestError`	Yes	Manifest upload to S3 fails (transient)
`PublishCurrentError`	Yes	CURRENT pointer upload fails after manifest is already published
`ManifestParseError`	No	Malformed manifest JSON, missing required fields, structural violations
`ReaderStateError`	No	Operations on a closed reader, missing CURRENT pointer
`S3TransientError`	Yes	Throttling, HTTP 500/503, timeout during S3 operations
`PoolExhaustedError`	Yes	All readers in the pool are checked out and checkout timed out
`ManifestStoreError`	Yes	Transient manifest store failure (DB connection, query timeout)

Retryable vs Non-Retryable¶

from shardyfusion.errors import ShardyfusionError

try:
    result = write_sharded(...)
except ShardyfusionError as exc:
    if exc.retryable:
        # Safe to retry — transient infrastructure failure
        retry_with_backoff(write_sharded, ...)
    else:
        # Programmer/data error — fix before retrying
        raise

PublishCurrentError Recovery¶

The most nuanced error scenario: the manifest has been successfully published to S3, but the CURRENT pointer update fails. The data is written and the manifest exists — only the pointer is missing.

from shardyfusion.errors import PublishCurrentError

try:
    result = write_sharded(...)
except PublishCurrentError as exc:
    # The manifest is already published — recover by retrying just the CURRENT update
    manifest_ref = exc.manifest_ref
    if manifest_ref:
        log.warning(f"CURRENT update failed, manifest at: {manifest_ref}")
        # Option 1: Retry CURRENT update via the store
        # Option 2: Log manifest_ref for manual recovery
        # Option 3: Re-run the entire pipeline (idempotent with same run_id)

Note: As of the latest version, publish_to_store() automatically retries PublishCurrentError up to 3 times with exponential backoff (1s → 2s → 4s) before raising. Manual retry is only needed if the automatic retries are also exhausted.

S3 Retry Behavior¶

The storage layer uses exponential backoff for transient S3 errors:

Attempts: 3 (initial + 2 retries)
Backoff: 1s → 2s → 4s
Retried conditions: HTTP 500, 503, throttling, connection timeouts
Not retried: HTTP 400, 403, 404, other client errors

If all retries are exhausted, S3TransientError is raised to the caller.

Best Practices¶

Catch ShardyfusionError, not Exception, for shardyfusion-specific handling.
Check retryable before implementing retry logic — non-retryable errors indicate bugs that retrying won't fix.
Log PublishCurrentError.manifest_ref in production — this is your recovery handle when the CURRENT pointer fails to update.
Use MetricsCollector to monitor S3_RETRY and S3_RETRY_EXHAUSTED events for infrastructure health.
Don't suppress errors during cleanup — reader close() now raises SlateDbApiError if any shard handle fails to close. Catch this at the application level if you need graceful degradation.