Configuration reference
This page enumerates the public configuration objects. Defaults and constraints come from source — links point at file:line.
HashShardedWriteConfig
shardyfusion/config.py:260. Dataclass, slots=True. Primary config for HASH sharded snapshot writes.
Inherits all common fields from BaseShardedWriteConfig (see below) and adds:
| Field | Type | Default | Purpose |
|---|---|---|---|
| `num_dbs` | `int \| None` | `None` | Number of shards. Required (>0) unless `max_keys_per_shard` is set. |
| `max_keys_per_shard` | `int \| None` | `None` | Alternative to `num_dbs`: computes shard count as `ceil(total_rows / max_keys_per_shard)` at write time. |
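
The rule connecting the two fields can be sketched in plain Python. This is a minimal illustration of the behavior described in the table, not the library's code: `resolve_num_dbs` and its error messages are hypothetical, and rejecting both fields at once follows the `HashShardingSpec` note that `max_keys_per_shard` is incompatible with an explicit `num_dbs`.

```python
import math


def resolve_num_dbs(num_dbs, max_keys_per_shard, total_rows):
    """Hypothetical helper: pick a shard count from the two HASH fields."""
    if num_dbs is not None and max_keys_per_shard is not None:
        raise ValueError("set either num_dbs or max_keys_per_shard, not both")
    if num_dbs is not None:
        if num_dbs <= 0:
            raise ValueError("num_dbs must be > 0")
        return num_dbs
    if max_keys_per_shard is None:
        raise ValueError("num_dbs is required unless max_keys_per_shard is set")
    if max_keys_per_shard <= 0:
        raise ValueError("max_keys_per_shard must be > 0")
    # ceil(total_rows / max_keys_per_shard), evaluated once the row count is known.
    return math.ceil(total_rows / max_keys_per_shard)


print(resolve_num_dbs(None, 1_000, 2_500))  # 3
```

With 2,500 rows and a cap of 1,000 keys per shard, the sketch yields three shards, the last one only half full.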
CelShardedWriteConfig
shardyfusion/config.py:289. Dataclass, slots=True. Primary config for CEL sharded snapshot writes.
Inherits all common fields from BaseShardedWriteConfig (see below) and adds:
| Field | Type | Default | Purpose |
|---|---|---|---|
| `cel_expr` | `str` | `""` | CEL expression that produces a shard ID or categorical token. Required. |
| `cel_columns` | `dict[str, str]` | `{}` | Mapping of CEL variable names to their types (e.g. `{"key": "int"}`). Required. |
| `routing_values` | `list[RoutingValue] \| None` | `None` | Optional categorical values for token-based routing. |
| `infer_routing_values_from_data` | `bool` | `False` | Discover routing values from input at write time (single-process only). |
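
To make token-based routing concrete, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: a Python function stands in for the CEL expression, a token's position in `routing_values` is taken as its shard ID, and inferred values are sorted for determinism. The library's real mapping may differ.

```python
def infer_routing_values(rows, token_fn):
    """Collect the distinct tokens seen in the input, in sorted order."""
    return sorted({token_fn(row) for row in rows})


def build_token_router(routing_values, token_fn):
    """Map each token to a shard ID (assumed here: its index in the list)."""
    shard_of = {token: i for i, token in enumerate(routing_values)}

    def route(row):
        token = token_fn(row)
        if token not in shard_of:
            # A closed token set means unknown tokens are an error.
            raise KeyError(f"token {token!r} is not in routing_values")
        return shard_of[token]

    return route


def region_token(row):
    """Stand-in for a CEL expression that returns a categorical token."""
    return row["region"]


rows = [{"region": "eu"}, {"region": "us"}, {"region": "eu"}]
values = infer_routing_values(rows, region_token)
route = build_token_router(values, region_token)
print(values, [route(row) for row in rows])  # ['eu', 'us'] [0, 1, 0]
```

Inferring the values requires a full pass over the input before any row is routed, which is one plausible reason the option is limited to single-process writes.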
BaseShardedWriteConfig (base class)
shardyfusion/config.py:162. Dataclass, slots=True. Base class for HashShardedWriteConfig and CelShardedWriteConfig. Do not instantiate directly.
Common fields inherited by both concrete configs:
| Field | Type | Default |
|---|---|---|
| `storage` | `WriterStorageConfig` | `WriterStorageConfig()` |
| `output` | `WriterOutputConfig` | `WriterOutputConfig()` |
| `manifest` | `WriterManifestConfig` | `WriterManifestConfig()` |
| `kv` | `KeyValueWriteConfig` | `KeyValueWriteConfig()` |
| `retry` | `WriterRetryConfig` | `WriterRetryConfig()` |
| `rate_limits` | `KvWriteRateLimitConfig` | `KvWriteRateLimitConfig()` |
| `observability` | `WriterObservabilityConfig` | `WriterObservabilityConfig()` |
| `lifecycle` | `WriterLifecycleConfig` | `WriterLifecycleConfig()` |
| `vector` | `VectorSpec \| None` | `None` |
For migration convenience, concrete config constructors also accept previous flat keyword names such as s3_prefix, key_encoding, batch_size, adapter_factory, metrics_collector, run_registry, shard_retry, credential_provider, s3_connection_options, vector_spec, and rate-limit fields. New code should prefer nested group names when several related fields are set together.
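
One way a constructor could accept both spellings is sketched below. It uses stand-in dataclasses rather than the real shardyfusion classes; `StorageGroup`, `KvGroup`, `WriteConfigSketch`, `FLAT_TO_GROUP`, and `make_config` are hypothetical names, and the defaults other than `U64BE` are illustrative. The actual implementation may work differently.

```python
from dataclasses import dataclass, field


@dataclass
class StorageGroup:  # stand-in for WriterStorageConfig
    s3_prefix: str = ""  # illustrative default


@dataclass
class KvGroup:  # stand-in for KeyValueWriteConfig
    key_encoding: str = "U64BE"
    batch_size: int = 1000  # illustrative default, not taken from the source


@dataclass
class WriteConfigSketch:
    storage: StorageGroup = field(default_factory=StorageGroup)
    kv: KvGroup = field(default_factory=KvGroup)


# Which nested group owns each flat keyword (a small subset, for illustration).
FLAT_TO_GROUP = {"s3_prefix": "storage", "key_encoding": "kv", "batch_size": "kv"}


def make_config(**kwargs):
    """Accept nested groups or flat legacy keywords and return one config."""
    config = WriteConfigSketch(
        storage=kwargs.pop("storage", None) or StorageGroup(),
        kv=kwargs.pop("kv", None) or KvGroup(),
    )
    for name, value in kwargs.items():
        if name not in FLAT_TO_GROUP:
            raise TypeError(f"unexpected keyword argument {name!r}")
        # Route the flat keyword to the field of the same name on its group.
        setattr(getattr(config, FLAT_TO_GROUP[name]), name, value)
    return config


flat = make_config(s3_prefix="s3://bucket/prefix", batch_size=500)
nested = make_config(
    storage=StorageGroup(s3_prefix="s3://bucket/prefix"),
    kv=KvGroup(batch_size=500),
)
print(flat == nested)  # True
```

The nested form keeps related settings together and makes it obvious which group a field belongs to, which is why it is the better fit when several fields of one group are set at once.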
Common nested groups
| Config | Key fields |
|---|---|
| `WriterStorageConfig` | `s3_prefix`, `credential_provider`, `s3_connection_options` |
| `WriterOutputConfig` | `run_id`, `db_path_template`, `shard_prefix`, `run_registry_prefix`, `local_root` |
| `WriterManifestConfig` | `store`, `custom_manifest_fields`, `credential_provider`, `s3_connection_options` |
| `KeyValueWriteConfig` | `key_encoding`, `batch_size`, `adapter_factory` |
| `WriterRetryConfig` | `shard_retry` |
| `KvWriteRateLimitConfig` | `max_writes_per_second`, `max_write_bytes_per_second` |
| `WriterObservabilityConfig` | `metrics_collector` |
| `WriterLifecycleConfig` | `run_registry` |
Manifest layout (path, naming, store) is configured via the nested manifest: WriterManifestConfig field — there is no top-level manifest_store or manifest_name on writer configs.
HashShardingSpec
shardyfusion/sharding_types.py. Strategy-specific parameters for HASH routing:
- `hash_algorithm`: `ShardHashAlgorithm`, currently `XXH3_64` with `seed=0`.
- `max_keys_per_shard`: soft cap (writer-side); incompatible with explicit `num_dbs`.
CelShardingSpec
shardyfusion/sharding_types.py. Strategy-specific parameters for CEL routing:
- `cel_expr`: CEL expression returning a routing token.
- `cel_columns`: input columns for CEL.
- `routing_values`: closed token set (categorical CEL).
- `infer_routing_values_from_data`: derive `routing_values` from input.
ShardingSpec
Base class for HashShardingSpec and CelShardingSpec.
KeyEncoding
shardyfusion/type_defs.py. Enum: U64BE, U32BE, UTF8, RAW. Default on KeyValueWriteConfig is U64BE.
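
As a rough guide to what these names usually denote, the sketch below encodes keys with the standard library. It assumes the conventional reading of each name (`U64BE` and `U32BE` as fixed-width unsigned big-endian integers, `UTF8` as UTF-8 text, `RAW` as bytes passed through unchanged); `encode_key` is a hypothetical helper, not part of shardyfusion.

```python
import struct


def encode_key(key, encoding):
    """Encode a key to bytes under the conventional meaning of each name."""
    if encoding == "U64BE":
        return struct.pack(">Q", key)  # unsigned 64-bit, big-endian
    if encoding == "U32BE":
        return struct.pack(">I", key)  # unsigned 32-bit, big-endian
    if encoding == "UTF8":
        return key.encode("utf-8")
    if encoding == "RAW":
        if not isinstance(key, (bytes, bytearray)):
            raise TypeError("RAW keys must already be bytes")
        return bytes(key)
    raise ValueError(f"unknown encoding: {encoding}")


print(encode_key(1, "U64BE").hex())  # 0000000000000001
print(encode_key(1, "U32BE").hex())  # 00000001
```

Fixed-width big-endian encodings have the useful property that byte-wise ordering matches numeric ordering, so range scans over encoded keys visit them in numeric order.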
ShardHashAlgorithm
shardyfusion/sharding_types.py. Enum: currently XXH3_64 only. The value is required in manifest sharding metadata so future readers can reject unsupported algorithms rather than silently misrouting.
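
The reject-rather-than-misroute behavior could look roughly like the check below. The dictionary layout, the `hash_algorithm` key in the manifest metadata, and the function name are assumptions made for illustration; only the algorithm name `XXH3_64` comes from this page.

```python
SUPPORTED_HASH_ALGORITHMS = {"XXH3_64"}


def check_hash_algorithm(sharding_metadata):
    """Fail fast when the manifest names an algorithm this reader lacks."""
    algorithm = sharding_metadata.get("hash_algorithm")
    if algorithm is None:
        raise ValueError("manifest sharding metadata has no hash_algorithm")
    if algorithm not in SUPPORTED_HASH_ALGORITHMS:
        # Routing with a different hash would send lookups to the wrong shard.
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    return algorithm


print(check_hash_algorithm({"hash_algorithm": "XXH3_64"}))  # XXH3_64
```

Failing at open time is cheaper to diagnose than a reader that hashes with the wrong function and reports keys as missing.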
VectorSpec
shardyfusion/config.py:93. Dataclass.
| Field | Type | Default |
|---|---|---|
| `dim` | `int` | required |
| `vector_col` | `str \| None` | `None` |
| `metric` | `VectorMetric` (`str`) | `"cosine"` (`"cosine"`, `"l2"`, `"dot_product"`) |
| `index_type` | `str` | `"hnsw"` |
| `index_params` | `dict[str, object]` | `{}` |
| `quantization` | `str \| None` | `None` (`"fp16"`, `"i8"`, or `None`) |
| `sharding` | `VectorSpecSharding` | default factory |
VectorSpec does not carry a backend field. Backend selection happens via the adapter factory (SqliteVecFactory ⇒ sqlite-vec; CompositeFactory(..., vector_factory=LanceDbFactory(), ...) ⇒ lancedb). The manifest's vector.backend field is filled in from the chosen adapter and used by UnifiedShardedReader to dispatch.
VectorShardedWriteConfig
shardyfusion/vector/config.py:74. Standalone and distributed vector writer config.
Key fields (see source for full list):
- `index_config` (`VectorIndexConfig`): required, `dim > 0`.
- `sharding` (`VectorShardingConfig`): carries `num_dbs` and strategy-specific settings.
- `storage` (`WriterStorageConfig`): `s3_prefix` and object-store connection settings.
- `output` (`WriterOutputConfig`): run and shard path settings.
- `manifest` (`WriterManifestConfig`): manifest-store settings.
- `adapter` (`VectorAdapterConfig`): vector adapter factory, reader factory, and batch size.
- `rate_limits` (`VectorWriteRateLimitConfig`): vector write ops/sec limit.
- `observability` (`WriterObservabilityConfig`): metrics collector.
- `lifecycle` (`WriterLifecycleConfig`): run registry.
Like the KV configs, VectorShardedWriteConfig accepts flat migration keywords such as num_dbs, s3_prefix, adapter_factory, reader_factory, batch_size, and max_writes_per_second, but grouped configs are preferred for new code.
Writer input and options
Public writer functions take `(data, config, input, options=None)` except standalone vector writes, where `input` is optional:
- `PythonRecordInput(key_fn, value_fn, columns_fn=None, vector_fn=None)` for Python KV writes.
- `ColumnWriteInput(key_col, value_spec, vector=None, vector_fn=None)` for Spark/Dask/Ray KV writes.
- `VectorColumnInput(vector_col, id_col=None, payload_cols=None, shard_id_col=None, routing_context_cols=None)` for distributed vector writes. `shard_id_col` here is the user input column carrying explicit shard IDs (EXPLICIT strategy only). It is distinct from `config.shard_id_col` on `VectorShardedWriteConfig`, which names the internal routing column added by the writer.
- `PythonWriteOptions`, `SparkWriteOptions`, `DaskWriteOptions`, `RayWriteOptions`, `SingleDbWriteOptions`, and `VectorWriteOptions` carry per-call execution behavior.
VectorShardingConfig / VectorShardingSpec
Strategies: CLUSTER (k-means, default), LSH, EXPLICIT (uses VectorRecord.shard_id), CEL (uses routing_context).
Manifest paths
- Manifest: `manifests/<timestamp>_run_id=<run_id>/manifest`
- Run record: `runs/<timestamp>_run_id=<run_id>_<uuidhex>/run.yaml`
- Pointer: `_CURRENT` (configurable as `current_pointer_key`)
- Timestamp format: `%Y-%m-%dT%H:%M:%S.%fZ`
- Supported manifest format versions: `{4}`. `CurrentPointer.format_version: int = 1` (separate from manifest version).
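
The manifest key layout can be reproduced with the standard library, as in the sketch below. It assumes the timestamp is taken in UTC (suggested by the trailing `Z`, but not stated on this page); `manifest_key` is a hypothetical helper, not a shardyfusion function.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def manifest_key(run_id, now):
    """Build manifests/<timestamp>_run_id=<run_id>/manifest."""
    # Assumption: timestamps are normalized to UTC before formatting.
    timestamp = now.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"manifests/{timestamp}_run_id={run_id}/manifest"


now = datetime(2024, 1, 2, 3, 4, 5, 678901, tzinfo=timezone.utc)
print(manifest_key("abc123", now))
# manifests/2024-01-02T03:04:05.678901Z_run_id=abc123/manifest
```

Because every field in the timestamp is fixed-width and ordered from most to least significant, sorting these keys as strings also sorts them chronologically.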
Adapter factories
| Factory | Module | Notes |
|---|---|---|
| `SlateDbFactory()` | `shardyfusion.slatedb_adapter` | Default. In top-level `__all__`. |
| `LocalSlateDbFactory(s3_connection_options=None, credential_provider=None)` | `shardyfusion.local_slatedb_adapter` | Local-first: writes to `file://`, uploads to S3 on close. In top-level `__all__`. |
| `SqliteFactory(page_size=4096, cache_size_pages=-2000, journal_mode="OFF", synchronous="OFF", temp_store="MEMORY", mmap_size=0)` | `shardyfusion.sqlite_adapter` | Not re-exported. |
| `SqliteVecFactory(vector_spec, ...)` | `shardyfusion.sqlite_vec_adapter` | Unified KV+vector single backend. |
| `LanceDbFactory()` | `shardyfusion.vector.adapters.lancedb_adapter` | Vector only. |
| `CompositeFactory(kv_factory, vector_factory, vector_spec)` | `shardyfusion.composite_adapter` | KV + vector composition. |
All adapter factories are kw-only-callable: factory(*, db_url, local_dir).
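
The keyword-only call shape can be expressed as a `typing.Protocol`, as in the sketch below. `AdapterFactory` and `RecordingFactory` are hypothetical names, the `str` parameter types and the return type are assumptions, and the toy factory only echoes its arguments instead of opening a database.

```python
from typing import Any, Protocol


class AdapterFactory(Protocol):
    """Hypothetical protocol for the keyword-only factory call shape."""

    def __call__(self, *, db_url: str, local_dir: str) -> Any: ...


class RecordingFactory:
    """Toy factory that just records what it was asked to open."""

    def __call__(self, *, db_url: str, local_dir: str) -> dict:
        return {"db_url": db_url, "local_dir": local_dir}


factory: AdapterFactory = RecordingFactory()
adapter = factory(db_url="s3://bucket/shard-0", local_dir="/tmp/shard-0")
print(adapter["db_url"])  # s3://bucket/shard-0

try:
    factory("s3://bucket/shard-0", "/tmp/shard-0")  # positional call
except TypeError:
    print("positional arguments are rejected")
```

Keyword-only parameters prevent a caller from silently swapping the two paths, since both would otherwise be accepted as plain strings.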