Build a composite KV+vector snapshot (SlateDB + LanceDB)

Use the composite adapter to write two backends per shard — SlateDB for KV and LanceDB for vector search — published under one manifest.

When to use

  • You need both point-key lookups (KV) and approximate nearest-neighbor search over the same shard layout.
  • You want LanceDB's mature vector backend with HNSW/IVF tuning.
  • You are happy to pay the cost of two adapters per shard.

When NOT to use

  • You only need one of the two capabilities — use SlateDbFactory or LanceDbFactory on its own instead of the composite.
  • Shard size or upload time is tight: the composite roughly doubles both versus KV-only, and the LanceDB index build adds to writer wall time.

Install

uv add 'shardyfusion[unified-vector,writer-python]'

The unified-vector extra is shorthand for the vector-lancedb and cel extras combined.

Minimal example (Python)

from shardyfusion import HashShardedWriteConfig, PythonRecordInput, VectorSpec
from shardyfusion.writer.python import write_hash_sharded
from shardyfusion.slatedb_adapter import SlateDbFactory
from shardyfusion.vector.adapters.lancedb_adapter import LanceDbFactory
from shardyfusion.composite_adapter import CompositeFactory

vector_spec = VectorSpec(dim=384, metric="cosine")

config = HashShardedWriteConfig(
    num_dbs=16,
    s3_prefix="s3://my-bucket/snapshots/items",
    adapter_factory=CompositeFactory(
        kv_factory=SlateDbFactory(),
        vector_factory=LanceDbFactory(),
        vector_spec=vector_spec,
    ),
    vector_spec=vector_spec,
)

result = write_hash_sharded(
    records,
    config,
    PythonRecordInput(
        key_fn=lambda r: r["id"].encode(),
        value_fn=lambda r: r["payload"],
        vector_fn=lambda r: (r["id"], r["embedding"], None),
    ),
)
Spark

from shardyfusion import ColumnWriteInput, HashShardedWriteConfig, VectorColumnInput, VectorSpec
from shardyfusion.writer.spark import write_hash_sharded
from shardyfusion.slatedb_adapter import SlateDbFactory
from shardyfusion.vector.adapters.lancedb_adapter import LanceDbFactory
from shardyfusion.composite_adapter import CompositeFactory
from shardyfusion.serde import ValueSpec

vector_spec = VectorSpec(dim=384, metric="cosine")

config = HashShardedWriteConfig(
    num_dbs=16,
    s3_prefix="s3://my-bucket/snapshots/items",
    adapter_factory=CompositeFactory(
        kv_factory=SlateDbFactory(),
        vector_factory=LanceDbFactory(),
        vector_spec=vector_spec,
    ),
    vector_spec=vector_spec,
)

result = write_hash_sharded(
    df,
    config,
    ColumnWriteInput(
        key_col="id",
        value_spec=ValueSpec.binary_col("payload"),
        vector=VectorColumnInput(vector_col="embedding", id_col="id"),
    ),
)
Dask

from shardyfusion import ColumnWriteInput, HashShardedWriteConfig, VectorColumnInput, VectorSpec
from shardyfusion.writer.dask import write_hash_sharded
from shardyfusion.slatedb_adapter import SlateDbFactory
from shardyfusion.vector.adapters.lancedb_adapter import LanceDbFactory
from shardyfusion.composite_adapter import CompositeFactory
from shardyfusion.serde import ValueSpec

vector_spec = VectorSpec(dim=384, metric="cosine")

config = HashShardedWriteConfig(
    num_dbs=16,
    s3_prefix="s3://my-bucket/snapshots/items",
    adapter_factory=CompositeFactory(
        kv_factory=SlateDbFactory(),
        vector_factory=LanceDbFactory(),
        vector_spec=vector_spec,
    ),
    vector_spec=vector_spec,
)

result = write_hash_sharded(
    ddf,
    config,
    ColumnWriteInput(
        key_col="id",
        value_spec=ValueSpec.binary_col("payload"),
        vector=VectorColumnInput(vector_col="embedding", id_col="id"),
    ),
)
Ray

from shardyfusion import ColumnWriteInput, HashShardedWriteConfig, VectorColumnInput, VectorSpec
from shardyfusion.writer.ray import write_hash_sharded
from shardyfusion.slatedb_adapter import SlateDbFactory
from shardyfusion.vector.adapters.lancedb_adapter import LanceDbFactory
from shardyfusion.composite_adapter import CompositeFactory
from shardyfusion.serde import ValueSpec

vector_spec = VectorSpec(dim=384, metric="cosine")

config = HashShardedWriteConfig(
    num_dbs=16,
    s3_prefix="s3://my-bucket/snapshots/items",
    adapter_factory=CompositeFactory(
        kv_factory=SlateDbFactory(),
        vector_factory=LanceDbFactory(),
        vector_spec=vector_spec,
    ),
    vector_spec=vector_spec,
)

result = write_hash_sharded(
    ds,
    config,
    ColumnWriteInput(
        key_col="id",
        value_spec=ValueSpec.binary_col("payload"),
        vector=VectorColumnInput(vector_col="embedding", id_col="id"),
    ),
)

Configuration

  • VectorSpec(dim, metric, index_type="hnsw", ...) — set on HashShardedWriteConfig.vector_spec or CelShardedWriteConfig.vector_spec.
  • metric for LanceDB: cosine, l2, dot_product (mapped to "dot" internally at vector/adapters/lancedb_adapter.py:142).
  • The backend ("lancedb") is determined by the adapter factory; the manifest's vector.backend field is filled from there and used by UnifiedShardedReader to dispatch.
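The metric mapping described above can be sketched in pure Python. This is an illustration of the behavior, not the library's code (the real mapping lives in vector/adapters/lancedb_adapter.py); the function and dict names here are hypothetical:

```python
# Illustrative mapping of shardyfusion metric names to LanceDB metric names.
_LANCEDB_METRICS = {"cosine": "cosine", "l2": "l2", "dot_product": "dot"}

def to_lancedb_metric(metric: str) -> str:
    """Translate a VectorSpec metric to its LanceDB equivalent."""
    try:
        return _LANCEDB_METRICS[metric]
    except KeyError:
        raise ValueError(f"unsupported metric for LanceDB: {metric!r}")
```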

Functional properties

  • Each shard contains a SlateDB store and a LanceDB table side by side.
  • Two sets of files are uploaded per shard.
  • Atomic publish across both backends (single manifest entry per shard).

Guarantees

  • A successful return means both KV and vector data are addressable via the same _CURRENT.
  • UnifiedShardedReader dispatches to the right backend based on manifest vector.backend.
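The dispatch guarantee above can be pictured as a lookup keyed on the manifest's vector.backend field. This is a minimal sketch of the idea only — the registry shape and function below are hypothetical, not UnifiedShardedReader's actual API:

```python
# Sketch: pick a vector reader based on the manifest's vector.backend field.
def pick_backend(manifest: dict, registry: dict):
    backend = manifest["vector"]["backend"]  # e.g. "lancedb"
    if backend not in registry:
        raise KeyError(f"no reader registered for backend {backend!r}")
    return registry[backend]
```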

Weaknesses

  • Roughly 2x shard size and upload time vs KV-only.
  • LanceDB index build cost included in writer wall time.

Failure modes & recovery

  • Vector dim mismatch — surfaces as ConfigValidationError at write start. Recovery: fix VectorSpec.dim.
  • LanceDB index build failure — surfaces as VectorIndexError. Recovery: check disk; rerun.
  • Either backend fails on a shard — surfaces as ShardCoverageError after retries (config.shard_retry). Recovery: rerun.
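The "rerun" recovery for shard-coverage failures amounts to a bounded retry loop around the write call. A hedged sketch, assuming the exception name from the table above (the wrapper function and how it would wire into write_hash_sharded are illustrative):

```python
# Illustrative rerun loop for incomplete shard coverage. The exception class
# stands in for shardyfusion's ShardCoverageError; this is not library code.
class ShardCoverageError(Exception):
    pass

def write_with_reruns(write_once, max_reruns: int = 2):
    """Call write_once(), rerunning up to max_reruns times on coverage failure."""
    for attempt in range(max_reruns + 1):
        try:
            return write_once()
        except ShardCoverageError:
            if attempt == max_reruns:
                raise
```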

See also