2026-05-04 slatedb 0.12 uniffi Migration¶
- Status: implemented
- Date: 2026-05-04
Summary¶
This engineering note documents the migration of shardyfusion's writer
and reader paths from the legacy synchronous slatedb top-level API
(slatedb.SlateDB, slatedb.SlateDBReader) to the async-only
uniffi-generated bindings under slatedb.uniffi shipped in
slatedb>=0.12,<0.13. It covers the sync→async bridge design, the
removal of read-side checkpoint pinning, the switch to opaque
shardyfusion-generated UUID checkpoint_id values, the
seal()-vs-checkpoint() Protocol change, the typed
SlateDbSettings configuration model, the new
iterator_chunk_size knob, and the perf microbenchmark scaffolding
introduced to guard against bridge-overhead regressions.
1. What problem is being solved or functionality being added by the changes?¶
slatedb 0.12 deleted the synchronous top-level Python API that
shardyfusion was built on. Every public method on Db, DbReader,
WriteBatch, and the iterator types is now async def, and the
read-side checkpoint_id argument no longer exists. The migration
needed to:
- Replumb the hot path onto an async-only library while
preserving shardyfusion's synchronous
DbAdapter and ShardReader Protocols, because Spark/Dask executors and the Python writer's multiprocessing workers are sync code paths. Pushing async upward would have rippled into every framework integration.
- Replace the read-side checkpoint pinning model that no longer exists in slatedb. The previous design relied on DbReaderBuilder accepting a checkpoint_id to pin a specific on-disk snapshot.
- Keep the public shardyfusion API stable so that downstream Spark/Dask/Ray writers and existing snapshots don't churn.
- Avoid catastrophic per-row scan overhead introduced by the sync→async bridge cost (~15–40 µs per round-trip; ~30× slowdown for naive per-row iteration).
- Make missing-symbol failures actionable instead of producing AttributeError stack traces deep inside writer/reader code when slatedb's surface drifts again.
2. What design decisions were considered with their pros and cons and trade offs?¶
Decision 1: How to bridge sync shardyfusion Protocols to async uniffi¶
Option A: Generate a sync wrapper class (SyncDb) around uniffi¶
Pros:
- Looks idiomatic at call sites (db.write(batch) instead of
run_coro(db.write(batch))).
Cons:
- Doubles the surface to maintain: every uniffi method needs a sync mirror.
- Obscures where async actually happens, making it easy to accidentally call from inside an event loop and deadlock.
- Adds a wrapper-object lifetime to track on top of the underlying uniffi object.
Option B: Process-global daemon-thread asyncio loop with run_coro helper (chosen)¶
Pros:
- One file (shardyfusion/_slatedb_runtime.py) owns the bridge.
- Honest at call sites: run_coro(reader.get(k)) makes the
async hop visible.
- Daemon thread gives us process-lifetime semantics without
shutdown coordination.
- Cannot be hijacked by a request-scoped loop (the loop is private).
Cons:
- Per-call cost (~15–40 µs) is non-trivial for hot paths.
- Tests that need a fake event loop must monkey-patch run_coro.
Option C: Make the loop user-pluggable¶
Pros:
- Lets advanced callers reuse their own loop.
Cons:
- Users will accidentally pin it to a request-scoped loop and deadlock on shutdown.
- The Protocol contract becomes "sync, but configurably so", which is a worse abstraction.
We chose Option B. The bridge is shardyfusion-owned and non-customizable; cost is amortized by the iterator chunking decision below.
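A minimal sketch of the chosen bridge, assuming the module layout described above (shardyfusion/_slatedb_runtime.py); the internals shown here are illustrative rather than the committed implementation:

```python
# shardyfusion/_slatedb_runtime.py (sketch): one process-global loop on a
# daemon thread, plus the run_coro helper used at every sync call site.
import asyncio
import threading
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")

_loop: asyncio.AbstractEventLoop | None = None
_loop_lock = threading.Lock()


def _get_loop() -> asyncio.AbstractEventLoop:
    """Lazily start the private, process-lifetime event loop."""
    global _loop
    with _loop_lock:
        if _loop is None:
            _loop = asyncio.new_event_loop()
            threading.Thread(
                target=_loop.run_forever, name="slatedb-bridge", daemon=True
            ).start()
        return _loop


def run_coro(coro: Coroutine[Any, Any, T]) -> T:
    """Submit an async uniffi call from sync code and block for its result."""
    future = asyncio.run_coroutine_threadsafe(coro, _get_loop())
    return future.result()
```

Call sites stay honest about the async hop, e.g. run_coro(db.write(batch)).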
Decision 2: Iterator chunking knob — where it lives¶
Option A: Push chunking into DbAdapter / writer side¶
Cons:
- Writes already batch via WriteBatch; adding a second knob
invites confusion about which path it controls.
Option B: iterator_chunk_size only on SlateDbReaderFactory (chosen)¶
Pros:
- Single place to tune, single place to document.
- The default of 1024 amortizes the bridge cost across rows of typical size; the failure mode is "uses more memory per chunk", not a correctness problem.
Cons:
- Callers with very large values must override the default explicitly.
We chose Option B.
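To make the amortization concrete, here is an illustrative sketch of the chunked drain pattern (not the exact scan_iter code), assuming the 0.12.1 DbIterator surface where await iterator.next() returns Optional[KeyValue]:

```python
# One run_coro hop per chunk instead of one per row.
from typing import Any, Iterator

from shardyfusion._slatedb_runtime import run_coro  # bridge helper from Decision 1


async def _drain_chunk(iterator: Any, chunk_size: int) -> list[Any]:
    """Pull up to chunk_size rows inside a single async hop."""
    chunk: list[Any] = []
    while len(chunk) < chunk_size:
        kv = await iterator.next()
        if kv is None:
            break
        chunk.append(kv)
    return chunk


def scan_rows(iterator: Any, chunk_size: int = 1024) -> Iterator[Any]:
    """Yield rows one at a time while paying the bridge cost once per chunk."""
    while True:
        chunk = run_coro(_drain_chunk(iterator, chunk_size))
        if not chunk:
            return
        yield from chunk
```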
Decision 3: How to identify shards now that slatedb has no read-side checkpoint API¶
Option A: Hash the materialized SlateDB / SQLite file (legacy SQLite/SQLiteVec behaviour)¶
Pros:
- Content-addressable; identical bytes → identical id.
Cons:
- Forces a read-back pass on the writer hot path on every shard close.
- Not actually needed for correctness given shardyfusion's invariants (single writer per SlateDB; manifest published only after writers finish; no post-publish updates).
Option B: Opaque uuid.uuid4().hex stamped by the writer (chosen)¶
Pros:
- Zero I/O on the writer hot path.
- Uniqueness guaranteed without reading any bytes.
- Centralized in
shardyfusion._checkpoint_id.generate_checkpoint_id().
- Cache identity for SQLite/SQLiteVec/LanceDB factories is
preserved.
Cons:
- Two writes that produce identical bytes get different ids, but this never happens under our invariants.
Option C: Re-hash from the manifest after publish¶
Cons:
- The manifest doesn't see the bytes; we'd need a separate pass.
- Adds an ordering dependency between shard finalize and manifest build.
We chose Option B. SlateDbReaderFactory accepts
checkpoint_id for Protocol symmetry but ignores it — cache identity
for SlateDB shards collapses to db_url only.
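The id helper is deliberately tiny; a sketch of the centralized function named above (the body is illustrative):

```python
# shardyfusion/_checkpoint_id.py (sketch)
import uuid


def generate_checkpoint_id() -> str:
    """Return an opaque, unique shard id with zero I/O and no hashing."""
    return uuid.uuid4().hex
```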
Decision 4: Adapter Protocol — seal() vs checkpoint()¶
The old Protocol had checkpoint() -> str | None, which told the
adapter "flush, finalize, and tell me your checkpoint id". With
Decision 3 the writer now stamps the id itself; adapters only need
to flush + finalize.
Option A: Keep checkpoint() -> str | None and ignore the return¶
Cons:
- Custom adapters returning a hash silently get their value dropped; there is no way to notice the contract changed.
Option B: Rename to seal() -> None (chosen)¶
Pros:
- Compile-time / Protocol-check failure for any adapter still on the old contract.
- The method name accurately describes the new responsibility.
Cons:
- Source-incompatible for custom adapters, but the alternative (a silent behavior change) is worse.
We chose Option B.
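A hedged sketch of the new adapter contract; only seal() is taken from the text above, the other method is illustrative:

```python
from typing import Protocol


class DbAdapter(Protocol):
    def write_rows(self, rows: list[tuple[bytes, bytes]]) -> None:
        """Append rows to the open shard (hypothetical method, shown for context)."""
        ...

    def seal(self) -> None:
        """Flush and finalize the shard. Returns nothing: the writer stamps the
        checkpoint id itself (Decision 3)."""
        ...
```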
Decision 5: Symbol resolution layer¶
Option A: import slatedb.uniffi directly at every call site¶
Cons:
- import shardyfusion blows up on machines without slatedb.
- Missing symbols become AttributeError deep in writer code.
- Tests must patch every site individually.
Option B: Single choke point in _slatedb_symbols.py (chosen)¶
Pros:
- Lazy import isolates the optional-dependency check to one
try/except.
- Missing symbols become a single
DbAdapterError("slatedb.uniffi.X is unavailable").
- Tests monkey-patch sys.modules["slatedb"] and
sys.modules["slatedb.uniffi"] once and every shardyfusion
call site picks up the fake.
We chose Option B.
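A sketch of the choke point; _import_uniffi() is the real hook mentioned in the Gotchas below, while the get_symbol() helper and the local DbAdapterError definition are illustrative (the real error type may live in a shared errors module):

```python
# shardyfusion/_slatedb_symbols.py (sketch)
from typing import Any


class DbAdapterError(RuntimeError):
    """Raised when slatedb or one of its uniffi symbols is unavailable."""


def _import_uniffi() -> Any:
    """Lazy import so that `import shardyfusion` works without slatedb installed."""
    try:
        import slatedb.uniffi as uniffi
    except ImportError as exc:
        raise DbAdapterError("slatedb>=0.12,<0.13 is required for SlateDB shards") from exc
    return uniffi


def get_symbol(name: str) -> Any:
    """Resolve e.g. 'DbReaderBuilder'; a missing symbol becomes one clear error."""
    uniffi = _import_uniffi()
    try:
        return getattr(uniffi, name)
    except AttributeError as exc:
        raise DbAdapterError(f"slatedb.uniffi.{name} is unavailable") from exc
```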
Decision 6: Configuration model — SlateDbSettings¶
Option A: Keep accepting the JSON-ish dict from the legacy API¶
Cons:
- Typo'd keys are silently dropped; nothing validates the dict against the new typed Settings class in slatedb 0.12.
Option B: Typed SlateDbSettings dataclass with raw_overrides: dict[str, Any] escape hatch + legacy-dict adapter that emits DeprecationWarning (chosen)¶
Pros:
- Typos caught at construction time.
- Escape hatch for fields shardyfusion hasn't modeled.
- One-release deprecation cycle keeps the migration non-breaking
for downstream callers.
We chose Option B.
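An illustrative sketch of the settings model; raw_overrides and the deprecation adapter come from the text above, while the concrete option fields are assumptions about which slatedb settings shardyfusion models:

```python
import warnings
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class SlateDbSettings:
    flush_interval_ms: int | None = None   # assumed modeled field
    l0_sst_size_bytes: int | None = None   # assumed modeled field
    raw_overrides: dict[str, Any] = field(default_factory=dict)  # escape hatch


def settings_from_legacy_dict(opts: dict[str, Any]) -> SlateDbSettings:
    """One-release adapter for the old JSON-ish dict shape."""
    warnings.warn(
        "dict-based slatedb options are deprecated; pass SlateDbSettings",
        DeprecationWarning,
        stacklevel=2,
    )
    modeled = {k: opts[k] for k in ("flush_interval_ms", "l0_sst_size_bytes") if k in opts}
    extra = {k: v for k, v in opts.items() if k not in modeled}
    return SlateDbSettings(**modeled, raw_overrides=extra)
```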
Decision 7: env_file + db_url resolution¶
slatedb 0.12's ObjectStore.resolve(db_url) reads env at resolve
time. We need env present at that moment, scoped to the resolution
call.
Option A: Pull in python-dotenv¶
Cons:
- Ships a CLI and global config-file behavior we don't want.
- Adds a third-party dependency for ~30 lines of logic.
Option B: In-house apply_env_file() context manager (chosen)¶
Pros:
- Scopes env mutations to the resolve call and restores them on exit.
- No new dependency.
We chose Option B.
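A sketch of the context manager, assuming a simple KEY=VALUE env-file format; the function name matches the text above, the parsing details are illustrative:

```python
import contextlib
import os
from typing import Iterator


@contextlib.contextmanager
def apply_env_file(path: str | None) -> Iterator[None]:
    """Apply KEY=VALUE lines for the duration of the block, then restore."""
    if path is None:
        yield
        return
    previous: dict[str, str | None] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip()
            previous[key] = os.environ.get(key)
            os.environ[key] = value
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```

ObjectStore.resolve(db_url) is then called inside the with block, so credentials from the env file are visible for exactly that call.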
Decision 8: KeyRange over kwargs in scan_iter¶
A late perf-test failure caught that uniffi's DbReader.scan takes
a single KeyRange positional — not start=/end= kwargs. We
added get_key_range_class() and now build a
KeyRange(start=..., start_inclusive=True, end=..., end_inclusive=False)
once per call, mirroring half-open [start, end) Python
semantics. We also moved iterator construction outside the per-chunk
drain loop. The previous code re-opened the iterator on every
chunk, which would have silently re-served the first N rows forever
once a shard exceeded iterator_chunk_size.
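A sketch of the corrected shape, reusing the _drain_chunk helper from the Decision 2 sketch; get_key_range_class() and run_coro are the helpers named above, everything else is illustrative:

```python
def scan_iter(reader, start: bytes, end: bytes, chunk_size: int = 1024):
    KeyRange = get_key_range_class()        # resolved via _slatedb_symbols
    # Half-open [start, end) semantics, mirroring Python slicing.
    key_range = KeyRange(start=start, start_inclusive=True, end=end, end_inclusive=False)
    iterator = run_coro(reader.scan(key_range))   # opened once, outside the drain loop
    while True:
        chunk = run_coro(_drain_chunk(iterator, chunk_size))
        if not chunk:
            return
        yield from chunk
```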
Decision 9: Performance microbenchmarks — gating¶
The bridge overhead is the single largest correctness-adjacent risk in this migration.
Option A: Run perf benchmarks on every CI run¶
Cons:
- Adds wall-clock time and flakiness to every PR.
- Hard to tune budgets that survive shared-runner variance.
Option B: Marker-gated, on-demand just perf recipe (chosen)¶
Pros:
- Deliberate, focused check when investigating bridge regressions.
- addopts = "-ra -m 'not perf'" in pyproject.toml keeps default
pytest runs clean.
- Loose budgets (~5× measured) catch order-of-magnitude
regressions without flaking.
We chose Option B.
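An illustrative shape for one such marker-gated benchmark under tests/integration/perf/; the budget constant below is an assumption in the spirit of the "~5× measured" guidance, not the committed number:

```python
import time

import pytest

from shardyfusion._slatedb_runtime import run_coro


async def _noop() -> None:
    return None


@pytest.mark.perf
def test_bridge_round_trip_budget():
    n = 10_000
    start = time.perf_counter()
    for _ in range(n):
        run_coro(_noop())                      # pure bridge cost, no slatedb work
    per_call_us = (time.perf_counter() - start) / n * 1e6
    # Loose budget (~5x the ~18 µs measured) catches order-of-magnitude regressions.
    assert per_call_us < 100
```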
Decision 10: Async migration scope¶
Option A: Migrate shardyfusion to async end-to-end¶
Cons:
- Spark/Dask/Ray worker contracts are sync.
- This is a separate, much larger project.
Option B: Keep shardyfusion's Protocols sync; bridge per-call (chosen)¶
We chose Option B — see Decision 1.
3. What is the impact of these changes (covering testability, performance, and complexity)?¶
Testability¶
- _slatedb_symbols.py gives tests one place to patch (sys.modules["slatedb"] + sys.modules["slatedb.uniffi"]) instead of N call sites; see the sketch after this list.
- New helpers shardyfusion.testing.open_slatedb_db() and open_slatedb_reader() wrap the canonical "build → resolve → run_coro" pattern so test seed code mirrors production exactly.
- The integration-test file URL remap pattern (map_s3_db_url_to_file_url(db_url, object_store_root) followed by SlateDbReaderFactory() delegation) is now the standard for any test that materializes data on local disk under an s3:// URL.
- The perf suite (tests/integration/perf/, @pytest.mark.perf, just perf) provides a deliberate guardrail for bridge-overhead regressions without polluting default CI.
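A sketch of that patching pattern using pytest's monkeypatch fixture; the fake symbol set mirrors the availability check in §4, and the test body itself is illustrative:

```python
import sys
import types


def test_reader_uses_fake_uniffi(monkeypatch):
    # Build a fake uniffi module exposing the symbols shardyfusion resolves.
    fake_uniffi = types.ModuleType("slatedb.uniffi")
    for name in ("Db", "DbBuilder", "DbReader", "DbReaderBuilder", "WriteBatch",
                 "ObjectStore", "Settings", "FlushOptions", "FlushType", "KeyRange"):
        setattr(fake_uniffi, name, type(name, (), {}))
    fake_slatedb = types.ModuleType("slatedb")
    fake_slatedb.uniffi = fake_uniffi
    # Both entries are required: _slatedb_symbols._import_uniffi() imports
    # "slatedb.uniffi", so patching only the top-level module is not enough.
    monkeypatch.setitem(sys.modules, "slatedb", fake_slatedb)
    monkeypatch.setitem(sys.modules, "slatedb.uniffi", fake_uniffi)
    # ... exercise shardyfusion code that resolves symbols through the choke point.
```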
Performance¶
- Per-call bridge cost: ~15–40 µs measured. Acceptable for get and write (already batched via WriteBatch).
- Per-row iteration without chunking: ~30× slowdown vs. an in-process iterator. The iterator_chunk_size=1024 default recovers most of it; perf budgets catch regressions.
- The writer hot path lost the read-back hashing pass that the legacy SHA-256 checkpoint_id required for SQLite/SQLiteVec, a small but real win.
Complexity¶
- Net +1 module (_slatedb_runtime.py) and +1 module (_checkpoint_id.py); slight expansion of _slatedb_symbols.py.
- Net −1 read-side concept (checkpoint pinning) that was never needed under our invariants.
- Bridge call sites are visible (run_coro(...)): readers can trace where async actually happens.
- Configuration surface narrowed: typed SlateDbSettings with one raw_overrides escape hatch instead of an open-ended dict.
4. API delta: slatedb 0.11.x → 0.12.1¶
| Concern | 0.11.x (legacy) | 0.12.1 (uniffi) |
|---|---|---|
| Module surface | slatedb.SlateDB, slatedb.SlateDBReader | slatedb.uniffi.{Db, DbBuilder, DbReader, DbReaderBuilder, WriteBatch, ObjectStore, Settings, FlushOptions, FlushType, KeyRange, DbIterator, KeyValue} |
| Sync/async | Sync methods | All methods async def |
| Open writer | SlateDB(local_dir, url=..., **opts) | await DbBuilder(path, store).build() where store = ObjectStore.resolve(url) |
| Open reader | SlateDBReader(local_dir, url=..., checkpoint_id=...) | await DbReaderBuilder(path, store).build() (no checkpoint_id arg) |
| Write batch | db.write(batch) (sync) | await db.write(WriteBatch()) |
| Flush WAL | db.flush() / db.flush_with_options("wal") | await db.flush_with_options(FlushOptions(flush_type=FlushType.WAL)) |
| Range scan | reader.scan(start=..., end=...) returning sync iterator | await reader.scan(KeyRange(start=..., start_inclusive=True, end=..., end_inclusive=False)) returning async DbIterator |
| Iterate | for kv in scan_result (sync) | await iterator.next() returning Optional[KeyValue] (per-row only; no next_batch in 0.12.1) |
| Checkpoint create | db.create_checkpoint() returning hash str | Removed; no read-side checkpoint pinning API exists |
| Close | implicit GC | await db.shutdown() for Db; DbReader has no explicit close |
| Settings | dict/kwargs to constructor | Settings(...) typed object passed via builder |
| Object store creds | env vars + url string | ObjectStore.resolve(url) reads env at resolve time |
| Python version floor | 3.9+ | 3.11+ (uniffi-generated bindings) |
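For orientation, a hedged sketch of the canonical "resolve → build → run_coro" open pattern on the 0.12.1 surface; the empty-string path argument follows the DbBuilder usage described in §7.2, and error handling is omitted:

```python
from slatedb.uniffi import DbBuilder, DbReaderBuilder, ObjectStore

from shardyfusion._slatedb_runtime import run_coro


def open_writer(db_url: str):
    store = ObjectStore.resolve(db_url)            # reads env at resolve time
    return run_coro(DbBuilder("", store).build())  # async build bridged to sync


def open_reader(db_url: str):
    store = ObjectStore.resolve(db_url)
    return run_coro(DbReaderBuilder("", store).build())  # no checkpoint_id arg
```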
Symbol availability check¶
```python
from importlib.metadata import version
print(version("slatedb"))  # '0.12.1'

from slatedb.uniffi import (
    Db, DbBuilder, DbReader, DbReaderBuilder,
    WriteBatch, ObjectStore, Settings,
    FlushOptions, FlushType, KeyRange,
)
```
slatedb<0.12 lacks the uniffi submodule entirely; slatedb==0.11.1
satisfied a naive slatedb<0.13 constraint and triggered the
migration's first false-start. The pin is now
slatedb>=0.12,<0.13 in both pyproject.toml extras and main deps.
5. Observed-but-deliberate gotchas¶
- DbReader.scan requires a KeyRange, not kwargs. Documented in the _SlateDbReaderHandle docstring and AGENTS.md Gotchas.
- DbIterator exposes only next in 0.12.1, no next_batch. scan_iter keeps a next_batch fast path behind try/except AttributeError for forward compatibility with later slatedb releases that may add it.
- Per-call bridge cost is real (~15–40 µs); any future hot-path call must batch through run_coro once, not per item.
- SlateDbReaderFactory.checkpoint_id is accepted but ignored. Document this in any new factory subclass or wrapper to avoid misleading callers.
- Test patching must hit both sys.modules["slatedb"] and sys.modules["slatedb.uniffi"]; patching only the top-level module doesn't intercept _slatedb_symbols._import_uniffi().
6. What was explicitly not done, and why¶
- Did not introduce a sync wrapper class around uniffi. See Decision 1, Option A.
- Did not move shardyfusion to async end-to-end. See Decision 10.
- Did not pin slatedb to an exact version. >=0.12,<0.13 lets patch releases through; the symbol-resolution layer makes any newly renamed class fail at one obvious site.
- Did not preserve content-addressed checkpoint IDs. See Decision 3: the single-writer + serial-publish invariant makes content addressing unnecessary, and removing it deleted a write-time hashing pass we were paying for on every shard close.
7. Post-merge audit¶
After the migration landed, two concerns were raised in review and investigated quantitatively before closing the work.
7.1 Bridge-loop contention¶
Concern. All synchronous slatedb operations (writer + sync reader) funnel through one process-global asyncio loop running on a single daemon thread. Spark, Dask, Ray, and Python writers can run many worker threads inside one Python process — does the shared loop serialise them?
Topology. Cluster writers (mapPartitionsWithIndex for Spark,
analogous for Dask/Ray) put one shard per Python worker process.
Inside one process the partition writer is itself sequential, so
there is at most one in-flight write_batch per process anyway —
the loop is not contended. Multi-shard-per-process only occurs in
the Python writer's parallel=True mode, which uses multiprocessing
spawn; each subprocess has its own loop.
The interesting case is a single process serving many concurrent
sync get/scan calls — e.g. a FastAPI app wrapping
ConcurrentShardedReader with a thread pool.
Measurement. A microbenchmark on the local file:// backend
(see commit log; not committed as a perf test because the absolute
numbers are FS-dependent):
| Topology | per-op latency (one write_batch of 100 rows) |
|---|---|
| 1 thread, shared bridge loop | 101.10 ms |
| 2 threads, shared bridge loop, separate DBs | 101.09 ms |
| 4 threads, shared bridge loop, separate DBs | 101.12 ms |
| 8 threads, shared bridge loop, separate DBs | 101.18 ms |
| 8 threads, separate loops, separate DBs | 101.13 ms |
Pure bridge cost (no slatedb work) under the same loop:
| Topology | per-call latency | aggregate ops/s |
|---|---|---|
| 1 thread | 18.42 µs | 54,300 |
| 8 threads (shared loop) | 16.05 µs | 62,300 |
Findings.
- The bridge contributes ~18 µs per call. A typical write_batch spends ~101 ms in slatedb, so the bridge is <0.02 % of write latency. For point gets it's a larger fraction but still well below the 1 ms budget set in the perf tests.
- Adding threads with separate loops (option C) gives no throughput improvement either, which means slatedb's internal Tokio runtime is already serialising per-DB writes; the bridge is not the bottleneck even in principle.
- Aggregate bridge throughput actually increases slightly with threads (62k > 54k ops/s) because the loop amortises the wakeup cost across multiple inflight submissions.
Verdict. Single shared bridge loop is correct. No mitigation
needed. The cost of switching to per-thread loops would be losing
uniffi's set_event_loop registration invariant (uniffi binds Rust
async tasks to the loop that created the resource) for no
measurable benefit.
7.2 What did the migration give up?¶
A line-by-line audit of pre-0.12 capabilities versus the current adapter surface:
| Pre-0.12 capability | Post-0.12 status | Operational impact |
|---|---|---|
| db.create_checkpoint(scope="durable") returning a slatedb-managed checkpoint id | Removed. uniffi 0.12 has no checkpoint create/pin API. Writer stamps an opaque uuid4().hex. | Manifests still record a per-shard id, so cleanup and winner-selection are unaffected. Lost: ability to pin-read a specific historical checkpoint of a SlateDB shard via the engine. Safe under the single-writer + serial-publish + S3-strong-consistency invariant: shards at db_url are immutable after publish, so readers never need a checkpoint pin. |
| Content-addressed checkpoint_id (SHA-256 of materialized DB) | Replaced by uuid4().hex. | _winner_sort_key is (attempt, task_attempt_id, db_url); checkpoint_id was never consulted for tiebreaks. No regression. External consumers that persisted SHA-256 fingerprints must switch to comparing db_bytes() or computing their own digest from the downloaded shard. |
| Reader-side checkpoint_id pinning (with_checkpoint_id) | No equivalent in uniffi 0.12. SlateDbReaderFactory.checkpoint_id accepted for Protocol symmetry; ignored at runtime. | Safe under the immutability invariant; if a future workflow needs reader pinning we will need a SlateDB upstream change first. |
| flush_with_options("wal") short string form | Replaced with typed FlushOptions(flush_type=FlushType.WAL). | Cosmetic; same semantics. |
| Free-form JsonObject settings forwarded as JSON to slatedb | Now typed SlateDbSettings with raw_overrides escape hatch. | Net positive: typed surface, IDE help. Legacy dict shape is rejected at the type level (library not yet released; no migration cycle needed). |
| SlateDB(path, url=..., env_file=..., settings=...) synchronous constructor | DbBuilder("", ObjectStore.resolve(db_url)) + Settings.set(...) per option, opened via await builder.build() through the bridge. | More verbose, more explicit. local_dir is now unused for the SlateDB backend (data goes straight to object store) but is still threaded through factories for symmetry with SQLite/SQLiteVec/LanceDB adapters. |
| Db.close() | Renamed Db.shutdown(); awaitable. | Adapter close() calls run_coro(self._db.shutdown()). Same lifecycle semantics. |
| DbIterator.next_batch(n) | Not present in 0.12.1; only next(). | scan_iter retains a next_batch try/except fast path for forward compatibility; chunking happens in Python via iterator_chunk_size (default 1024 on SlateDbReaderFactory). |
| Sync I/O directly from worker thread | Now goes through bridge loop. | ~18 µs per call; mitigated by batching. See §7.1. |
| Pre-migration DbAdapter.checkpoint() -> str \| None | Renamed to seal() -> None; checkpoint id stamped by writer. | Custom adapters must rename. Documented in CHANGELOG, AGENTS.md Gotchas, and adapter-authoring guide. |
Net assessment. All losses are documented, and either (a) unused under the existing invariants (checkpoint pinning, content addressing) or (b) cosmetic (constructor shape, flush options). The migration removed a write-time SHA-256 pass and added ~18 µs of bridge overhead per call — net positive on the hot path.