Glossary¶
Attempt
: A single execution of a shard write. In Spark, task retries produce multiple attempts for the same db_id. The winner is selected deterministically (lowest attempt number, then task_attempt_id, then URL).
Checkpoint ID : An opaque string returned by SlateDB after flushing a shard database. Stored in the manifest and used to open readers at the exact write position.
CURRENT pointer
: A JSON file at {s3_prefix}/_CURRENT that references the latest manifest. Updated atomically after a successful manifest publish. Contains manifest_ref, run_id, updated_at, and format_version.
db_id
: Zero-based integer identifying a shard database. Range: [0, num_dbs). Assigned by the sharding strategy (hash, range, or custom).
Key encoding
: How routing keys are serialized to bytes for shard storage. u64be = 8-byte big-endian unsigned int (default). u32be = 4-byte variant. Stored in the manifest so readers use the same encoding.
Manifest
: A JSON document describing a complete snapshot: which shards exist, where they are, how they were sharded, and metadata about the build. Published to {s3_prefix}/manifests/run_id={run_id}/{manifest_name}.
num_dbs
: Number of independent shard databases in a snapshot. Set in WriteConfig. All shards must be present for a valid manifest.
Partition : In distributed frameworks (Spark, Dask, Ray), a unit of parallel execution. shardyfusion repartitions data so each partition maps to exactly one shard.
Routing
: The process of mapping a key to a db_id. Uses the sharding strategy (hash or range) and must produce identical results at write time and read time for correctness.
run_id : A unique identifier for a single write execution. Auto-generated (UUID hex) if not provided. Used in S3 paths and manifest references to isolate runs.
Shard
: A single shard database within a snapshot. Identified by db_id. Depending on the backend it may be a SlateDB database or a SQLite file, and it contains all key-value pairs that route to that db_id.
Snapshot
: The complete set of shards produced by a single write_sharded call, described by a manifest and pointed to by CURRENT. Readers load snapshots atomically.
Two-phase publish : The write protocol where (1) the manifest is published first, then (2) the CURRENT pointer is updated. If step 2 fails, the manifest is orphaned but the previous snapshot remains valid.
Winner
: The selected attempt for each db_id when multiple attempts exist (e.g., Spark task retries). Selected deterministically to ensure consistent results across pipeline reruns.
xxh3_64
: The hash function used for hash-based shard routing. Uses seed=0 and canonical_bytes(key) (int → 8-byte signed LE, str → UTF-8, bytes → passthrough). xxh3_64(canonical_bytes(key), seed=0) % num_dbs determines the db_id. All writers use the same Python implementation.