SQLite B-tree metadata sidecar¶
The writer emits a small auxiliary artifact alongside each finalized SQLite shard. The sidecar bundles the pages that range-mode readers always traverse during navigation — page 1, the schema btree, and every interior B-tree page — so the reader can fetch them once on shard open and pin them for the lifetime of the shard reader.
Why it exists¶
A finalized shard database is read-only once published. Page numbering and the content of every B-tree internal node are frozen for the life of the shard. Without a sidecar:
- Each point lookup walks root → interior → leaf — typically 3-4 page reads per query for a 1 M-row shard.
- The reader's LRU cache must hold those interior pages alongside hot leaves, and on warm-but-pressured workloads it evicts them, regenerating the round trips.
With a sidecar:
- One S3 GET on shard open primes the page cache with every interior page.
- Each point lookup needs at most one S3 GET (the leaf) for an uncached read; warm reads serve from the pinned cache.
- ~3× lower steady-state latency per point lookup, and far less LRU churn at high QPS.
For shardyfusion's typical fan-out (≤ 1 GiB shards, 4 KiB pages, depth-3 B-trees) the sidecar is ~0.5–2% of the main DB.
Artifact¶
Each finalized shard publishes two sibling objects:
s3://bucket/.../shards/run_id=X/db=00000/attempt=00/shard.db
s3://bucket/.../shards/run_id=X/db=00000/attempt=00/shard.btreemeta
The sidecar is a self-describing little-endian binary blob with a small uncompressed header followed by a zstd-compressed body:
offset size field
------ ------- ----------------------------------
0 8 magic = b"SFBTM\x00\x00\x00"
8 4 format_version (u32) = 3
12 ... zstd-compressed body
The body, once decompressed, contains:
offset size field
------ ------- ----------------------------------
0 4 page_size (u32, e.g. 4096)
4 4 page_count N (u32) — number of pages in sidecar
8 8 * N index (u32 pageno, u32 offset) per entry,
sorted ascending by pageno
8+8N PS * N page_data (page_size bytes each)
offset is the body-relative byte position of the corresponding page
slab. Equivalently, when a consumer writes the decompressed body to a
file on disk, offset is the file offset.
This shape lets a reader pick its memory strategy:
- Pin everything in memory. Decompress the body, populate the LRU cache by walking the index. Standard path for the eventual range-mode reader.
- On-disk lookups. Decompress the body to a local file. Hold only
the
pageno → offsetmap in memory (~8 bytes per page) andpreadindividual pages on demand; the OS page cache amortizes hot reads. For million-page sidecars this trades ~MB-scale resident memory for ~KB-scale.
Storing offset explicitly is redundant when pages are uniformly
page_size bytes (it equals 8 + 8*N + i*page_size), but keeping it
in the format costs almost nothing under zstd (offsets are an
arithmetic progression and compress to a few bits each) and:
- Removes the offset-arithmetic burden from disk consumers.
- Future-proofs the format for variable-size pages (e.g. per-page compression for very large sidecars).
The magic and version stay uncompressed so a reader can validate the artifact (and reject mismatched versions) before paying the decompression cost. Btree pages compress ~12× under zstd at level 3 in practice — interior pages are typically ~50% free space, and the per-page headers and cell pointer arrays repeat across the bundle. A representative 100 k-row shard (9.5 MiB DB) emits a ~21 KB sidecar (~0.2% of the DB).
The byte-exact format is specified in
reference/sqlite-btree-sidecar-format.md;
this page covers the surrounding architecture (page selection,
discovery, failure semantics).
Page selection¶
The extractor uses SQLite's own dbstat virtual table (via APSW) to
identify pages, then reads page bytes by direct file I/O at the
deterministic (pageno - 1) * page_size offsets. The selection query is:
SELECT pageno
FROM dbstat
WHERE pagetype = 'internal'
OR name IN ('sqlite_master', 'sqlite_schema')
ORDER BY pageno;
This selects:
- Every interior B-tree page in any btree (table or index, regular or
shadow tables of the
sqlite-vecvirtual table). - Every page belonging to the schema btree, regardless of whether it is
leaf or interior. Schema leaves matter: they hold each table/index's
rootpage, which SQLite reads on every connection open. Caching them eliminates one round trip on shard open.
We deliberately do not parse the SQLite file format ourselves. SQLite's own introspection automatically handles every page type, transparently covers virtual-table shadow btrees, and tracks any future format changes.
Manifest discovery¶
When the writer emits sidecars, it also records a snapshot-level flag in
the manifest's custom field:
{
"sqlite_btreemeta": {
"format_version": 1,
"page_size": 4096,
"filename": "shard.btreemeta"
}
}
A range-mode reader can detect the flag once per snapshot and fetch
{shard.db_url}/shard.btreemeta for every shard it opens, with no
per-shard probe.
Configuration¶
Both SqliteFactory.emit_btree_metadata and
SqliteVecFactory.emit_btree_metadata default to True. Opt out
explicitly when desired:
from shardyfusion.sqlite_adapter import SqliteFactory
SqliteFactory(emit_btree_metadata=False)
Recommended only when the deployment never uses the range-read VFS — for download-mode reads, the sidecar costs an extra PUT at write time without benefit.
Coverage¶
| Adapter / Factory | Sidecar emitted? |
|---|---|
SqliteFactory |
Yes — toggle on the factory. |
SqliteVecFactory |
Yes — same toggle, same code path. dbstat covers sqlite-vec shadow btrees. |
CompositeAdapter with SqliteFactory for the KV side |
KV-side only — the inner SQLite adapter handles its own sidecar; the vector half (e.g. LanceDB) is not SQLite. |
| Other adapters (SlateDB, etc.) | No. |
Failure semantics¶
Sidecar emission is best-effort. The contract:
| Event | Behavior |
|---|---|
| APSW not installed (writer base install only) | Debug-level sqlite_btreemeta_unsupported log with reason=apsw_not_installed; skip sidecar; continue to upload shard.db. |
zstandard not installed |
Same — reason=zstandard_not_installed. |
dbstat virtual table missing in the SQLite build |
Same — reason=dbstat_unavailable. |
| Unsafe journal mode on the finalized DB | Same — reason=unsupported_journal_mode. The extractor reads page bytes by direct file I/O at deterministic offsets; this is safe only when committed state lives entirely in the main .db file (modes off, delete, memory, truncate, persist). WAL-family modes can park committed pages in a -wal sidecar that raw file I/O would miss, so the guard refuses to run on wal/wal2 files rather than emit a silently incomplete sidecar. |
| Any other extraction error | Warning-level sqlite_btreemeta_failed log; skip sidecar; continue to upload shard.db. |
| Sidecar PUT fails | Same — sidecar dropped, manifest still lists the shard, reader (when wired) falls back to per-page lazy fetch. |
Main shard.db PUT fails |
Existing behavior — re-raise; the shard's retry path handles it. |
A missing or corrupt sidecar must never block a write or break the reader's existing path.
Why we don't use sqlite_dbpage¶
SQLite has a virtual table, sqlite_dbpage, that returns raw page bytes
through SQLite's own pager — which would automatically handle WAL,
encryption, custom VFSes, and concurrent writers. We deliberately don't
use it: APSW does not consistently ship with SQLITE_ENABLE_DBPAGE_VTAB
enabled, which would make sidecar emission silently no-op on otherwise
healthy environments.
Instead, the extractor uses dbstat (which APSW does ship with) to
identify which pages we want, then reads their bytes by direct file
I/O at the deterministic (pageno - 1) * page_size offsets. In the
narrow context of a finalized shardyfusion shard — journal_mode=OFF,
no encryption, no concurrent writers, file already closed by
checkpoint() — this is byte-for-byte equivalent to what
sqlite_dbpage would return. The journal-mode guard above turns the
"narrow context" assumption into an explicit invariant: any future
writer change that violates it (e.g. enabling WAL) trips the guard at
write time rather than producing silently-incomplete sidecars.
Dependencies¶
Sidecar emission requires:
apsw— for thedbstatvirtual table that identifies which pages to bundle. APSW bundles a SQLite built withSQLITE_ENABLE_DBSTAT_VTABenabled.zstandard— for body compression.
Both ship under the [sqlite-range] extra, which is also what readers
install for the range-read VFS:
pip install 'shardyfusion[sqlite-range]'
If either dependency is missing the writer logs at debug level and skips
the sidecar; the main shard.db upload proceeds unchanged.