Choosing a writer and backend
This section covers building sharded KV snapshots. Pick your writer based on your data source and infrastructure:
| Writer | Input type | Java required | Cluster required | Best for |
|---|---|---|---|---|
| Python | Iterable[T] | No | No | Single-host, streaming, simplicity |
| Spark | PySpark DataFrame | Yes | Optional (local mode works) | Large-scale ETL, existing Spark pipeline |
| Dask | Dask DataFrame | No | Optional | Distributed scale-out without JVM |
| Ray | Ray Dataset | No | Optional | ML preprocessing pipelines, actor scheduling |
All writers share the same core behavior: deterministic routing, attempt-isolated paths, deterministic winner selection, two-phase publish, and run records. See KV Storage Overview for the conceptual model.
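As a rough illustration of two of the shared behaviors listed above, deterministic routing and deterministic winner selection, here is a minimal sketch. The function names `route_key` and `pick_winner`, the SHA-256 hash, the "smallest path wins" rule, and the example path layout are all assumptions made for illustration; this section does not state which hash or selection rule the writers actually use.

```python
import hashlib


def route_key(key: bytes, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    Assumption: SHA-256 is used here only because it is stable across
    processes and hosts (unlike Python's built-in hash(), which is salted
    per process for str and bytes). The real routing function may differ.
    """
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def pick_winner(attempt_paths: list[str]) -> str:
    """Pick one attempt for a shard using a fixed, order-independent rule.

    Assumption: the lexicographically smallest path wins. Any rule works
    as long as every caller applies the same one to the same candidates.
    """
    return min(attempt_paths)


# Hypothetical attempt-isolated paths for one shard (layout is illustrative).
attempts = ["shard=3/attempt=01", "shard=3/attempt=00"]
shard = route_key(b"user:42", num_shards=8)
winner = pick_winner(attempts)
```

The point of the sketch is that both functions depend only on their inputs, so retries and concurrent attempts on different hosts reach the same shard assignment and the same winner.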
Choosing a backend
| Backend | Read-side access | When to use |
|---|---|---|
| SlateDB (default) | Point-key get / multi_get | Lowest friction, LSM characteristics, default for most users |
| SlateDB (local) | Point-key get / multi_get | Writes to local disk, then bulk-uploads to S3; decouples write throughput from S3 latency |
| SQLite | Point-key + SQL queries + range-read VFS | Need SQL, single-file shards, or remote page-level access |
Backend selection is a single config swap: pass adapter_factory=SqliteFactory() instead of relying on the default. Everything else (routing, publishing, reading) works identically.
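To show why a factory parameter keeps the rest of the pipeline unchanged, here is a self-contained sketch of the pattern. It does not use the library's real classes: `ShardAdapter`, `DictAdapter`, `DemoSqliteAdapter`, `DictFactory`, `DemoSqliteFactory`, and `write_shard` are all hypothetical stand-ins, and only the `adapter_factory` parameter name is taken from the text above.

```python
import sqlite3
from typing import Callable, Iterable, Optional, Protocol, Tuple


class ShardAdapter(Protocol):
    """Hypothetical minimal interface a storage backend must satisfy."""

    def put(self, key: bytes, value: bytes) -> None: ...
    def get(self, key: bytes) -> Optional[bytes]: ...


class DictAdapter:
    """Stand-in for a default backend: an in-memory dict."""

    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


class DemoSqliteAdapter:
    """Stand-in for a SQLite backend: one in-memory database per shard."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE kv (k BLOB PRIMARY KEY, v BLOB)")

    def put(self, key: bytes, value: bytes) -> None:
        self._conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get(self, key: bytes) -> Optional[bytes]:
        row = self._conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return None if row is None else row[0]


class DictFactory:
    def __call__(self) -> ShardAdapter:
        return DictAdapter()


class DemoSqliteFactory:
    def __call__(self) -> ShardAdapter:
        return DemoSqliteAdapter()


def write_shard(
    rows: Iterable[Tuple[bytes, bytes]],
    adapter_factory: Optional[Callable[[], ShardAdapter]] = None,
) -> ShardAdapter:
    """Write rows through whichever adapter the factory produces."""
    adapter = (adapter_factory or DictFactory())()
    for key, value in rows:
        adapter.put(key, value)
    return adapter


rows = [(b"a", b"1"), (b"b", b"2")]
default_shard = write_shard(rows)
sqlite_shard = write_shard(rows, adapter_factory=DemoSqliteFactory())
```

Because `write_shard` only talks to the adapter interface, swapping the factory is the only change needed at the call site, which mirrors the single-config-swap claim above.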