
Sharded Vector Search

This use case is vector-only: no KV data, no point-key lookups. You write vector embeddings into sharded indices and query by approximate nearest-neighbor (ANN) search.

This page covers concepts shared by both backends. Backend and reader specifics live in the child pages.


Query flow: scatter-gather

A vector search fans out across all shards (or a routed subset), collects each shard's local top-k results, and merges them into a global top-k:

```mermaid
flowchart TD
    Q[Query vector] --> R{Routing}
    R -->|CLUSTER<br/>nearest centroids| S0[Shard 0]
    R -->|LSH<br/>matching buckets| S1[Shard 1]
    R -->|CEL / EXPLICIT<br/>target shards| S2[Shard 2]
    R -->|no routing<br/>all shards| S3[All shards]

    S0 -->|local top-k| M[Merge + global sort]
    S1 -->|local top-k| M
    S2 -->|local top-k| M
    S3 -->|local top-k| M
    M --> O[Global top-k results]
```

  1. Routing — some strategies skip shards that cannot contain relevant vectors.
  2. Fan-out — each target shard runs a local ANN search.
  3. Local top-k — each shard returns its top_k nearest neighbors.
  4. Global merge — results are merged and re-sorted by distance to produce the final top-k.

The merge logic lives in shardyfusion/vector/_merge.py and is reused by all vector reader variants.
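
As a rough illustration of that merge step, a k-way merge over already-sorted per-shard hit lists can look like this. The ShardHit record and function signature here are assumptions for the sketch, not the actual contents of _merge.py:

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class ShardHit:
    # Hypothetical hit record; the real structures in _merge.py may differ.
    id: str
    distance: float
    shard_id: int


def merge_local_results(local_results: list[list[ShardHit]], top_k: int) -> list[ShardHit]:
    """Merge per-shard top-k lists (each already sorted by ascending
    distance) into a single, globally sorted top-k."""
    merged = heapq.merge(*local_results, key=lambda hit: hit.distance)
    return list(itertools.islice(merged, top_k))
```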


Vector sharding strategies

Four strategies control how vectors are assigned to shards at write time and which shards are queried at read time.

| Strategy | Write-time routing | Read-time routing | Best for |
|---|---|---|---|
| CLUSTER (default) | K-means centroids trained on a sample; each vector assigned to the shard of its nearest centroid | Query routed to the n nearest centroid shards | Maximizing recall on naturally grouped data |
| LSH | Random hyperplanes hash vectors into buckets | Query hashed to the same bucket; optional multi-probe | Massive ingestion throughput; streaming data |
| CEL | CEL expression on metadata (e.g. tenant_id == "acme") | Exact-match routing on routing_context | Strict tenant isolation |
| EXPLICIT | Caller provides shard_id per vector | Caller provides shard_ids at query time | External directory or custom logic |
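
As a rough illustration of how the strategies differ at call time, configurations might look something like the following. All key names here are hypothetical and only mirror the columns above; they are not the library's actual API:

```python
# Hypothetical configuration sketches; key names are illustrative only.
cluster_cfg  = {"strategy": "CLUSTER", "num_dbs": 16, "nprobe": 2}
lsh_cfg      = {"strategy": "LSH", "num_hyperplanes": 12, "seed": 42, "multi_probe": 1}
cel_cfg      = {"strategy": "CEL", "expression": 'tenant_id == "acme"'}
explicit_cfg = {"strategy": "EXPLICIT"}  # caller supplies shard_id / shard_ids directly
```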

CLUSTER details

```mermaid
flowchart LR
    subgraph "Write phase"
        A[Sample vectors] --> B[Train k-means<br/>k = num_dbs]
        B --> C[Assign each vector<br/>to nearest centroid]
    end
    subgraph "Read phase"
        D[Query vector] --> E[Find nearest<br/>centroids]
        E --> F[Search only<br/>those shards]
    end
```

  • Requires a sampling pass over the data to train centroids (unless train_centroids=False).
  • Centroids are stored as .npy files in S3 and referenced in the manifest.
  • nprobe controls how many centroid shards are queried (default = 1).
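
A minimal NumPy sketch of both phases, assuming centroids is the (num_dbs, dim) array loaded from the .npy file referenced by the manifest:

```python
import numpy as np


def assign_shard(vector: np.ndarray, centroids: np.ndarray) -> int:
    """Write time: place a vector on the shard of its nearest centroid."""
    distances = np.linalg.norm(centroids - vector, axis=1)
    return int(np.argmin(distances))


def route_query(query: np.ndarray, centroids: np.ndarray, nprobe: int = 1) -> list[int]:
    """Read time: pick the nprobe shards whose centroids are nearest the query."""
    distances = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(distances)[:nprobe].tolist()
```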

LSH details

  • Hyperplanes are generated deterministically from a seed.
  • No training phase — hashing is virtually free.
  • Multi-probe search can query nearby buckets to improve recall.
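
A minimal sketch of seeded hyperplane hashing; the library's actual bit layout and bucket-to-shard mapping may differ:

```python
import numpy as np


def lsh_bucket(vector: np.ndarray, num_planes: int, seed: int = 0) -> int:
    """Hash a vector into a bucket id using seeded random hyperplanes.

    Because the hyperplanes are derived deterministically from the seed,
    writers and readers reproduce the same buckets with no training phase.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_planes, vector.shape[0]))
    bits = (planes @ vector) >= 0  # one sign bit per hyperplane
    return int(sum(1 << i for i, bit in enumerate(bits) if bit))
```

Multi-probe search would additionally query buckets whose bit patterns differ from this one in a small number of positions.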

Backend comparison

| | LanceDB | sqlite-vec |
|---|---|---|
| File format | .lance directory per shard | .sqlite file per shard |
| Index types | HNSW, IVF, PQ, SQ | Flat (brute-force within shard) |
| Metrics | cosine, l2, dot_product | cosine, l2 |
| Scale | Large-scale, high-dim (768d–3072d) | Small-to-medium, fits in memory |
| Tuning surface | M, ef_construction, nprobes, quantization | None |
| Dependencies | lancedb | sqlite-vec extension |
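
For a feel of the tuning surface, a single-shard local search against a LanceDB shard might look roughly like this. The URI, table name, and tuning values are illustrative; see the backend child pages for the readers' actual wiring:

```python
import lancedb

query_vector = [0.1] * 768  # illustrative 768-d query embedding
top_k = 10

# Open one shard's LanceDB table (URI and table name are illustrative).
db = lancedb.connect("s3://my-bucket/snapshots/2024-06-01/shard_0")
table = db.open_table("vectors")

# Local ANN search; metric and nprobes mirror the tuning surface above.
local_hits = (
    table.search(query_vector)
    .metric("cosine")
    .nprobes(10)
    .limit(top_k)
    .to_list()
)
```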

Shared snapshot properties

Vector-only snapshots use the same manifest + _CURRENT pointer model as KV and KV+vector snapshots. The manifest is still the reader contract: it records vector shard locations, routing metadata, and vector-specific custom fields. See Shared Snapshot Workflow for the project-wide publish/read flow and Manifest & _CURRENT for implementation details.

All the safety properties from the shared workflow apply:

  • Two-phase publish — manifest first, then _CURRENT.
  • Immutable shards — indices are never modified after upload.
  • Atomic visibility — readers see either the entire old snapshot or the entire new one, never a mix.
  • Backward rollback — point _CURRENT at any previous manifest.
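
As a sketch of the publish ordering only (bucket and key names are illustrative; the real flow is described in Shared Snapshot Workflow):

```python
import json

import boto3

s3 = boto3.client("s3")
manifest = {"snapshot_id": 124, "shards": ["shard_0", "shard_1"]}  # illustrative contents

# Phase 1: upload the immutable manifest for the new snapshot.
s3.put_object(
    Bucket="my-bucket",
    Key="snapshots/manifest-000124.json",
    Body=json.dumps(manifest).encode(),
)

# Phase 2: flip _CURRENT; readers atomically switch to the new snapshot.
# Rollback is the same write pointed at an older manifest key.
s3.put_object(
    Bucket="my-bucket",
    Key="snapshots/_CURRENT",
    Body=b"snapshots/manifest-000124.json",
)
```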

The manifest carries vector-specific metadata in custom["vector"] (centroids, hyperplanes, index config) so readers can reconstruct the routing and search parameters.
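
An illustrative shape for that block (field names are assumptions, not the exact manifest schema):

```python
vector_metadata = {
    # Illustrative shape of custom["vector"]; field names are assumptions.
    "strategy": "CLUSTER",
    "centroids_uri": "s3://my-bucket/snapshots/centroids-000124.npy",
    "nprobe": 2,
    "metric": "cosine",
    "index": {"type": "HNSW", "M": 16, "ef_construction": 200},
}
```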


Child pages