shardyfusion¶
shardyfusion builds and reads sharded snapshots backed by independent shard databases. The default backend uses SlateDB, and the project also supports SQLite shard databases. It lets you partition large datasets during a write pipeline, then route point lookups to the correct shard at read time.
How the Pieces Fit Together¶
Writers ingest data (from Spark, Dask, Ray, or plain Python), assign each row to a shard, write the shards to S3-compatible storage, and publish a manifest describing the result. Readers load the manifest, build a routing table, and serve key lookups by directing each key to its shard. The CLI wraps a reader for interactive and batch use.
flowchart LR
subgraph Writers
Spark
Dask
Ray
Python
end
Writers --> Core["_writer_core\n(routing, winner selection,\nmanifest building)"]
Core --> MS["ManifestStore\n(S3 / DB / in-memory)"]
MS --> Readers
subgraph Readers
Sync["ShardedReader"]
Conc["ConcurrentShardedReader"]
Async["AsyncShardedReader"]
end
Readers --> CLI["CLI (shardy)"]
Obs["Metrics & Logging"] -.-> Writers
Obs -.-> Readers
All four writer backends share the same core logic (_writer_core.py) for routing, winner selection, manifest building, and two-phase publishing. This guarantees that any backend produces a manifest readable by any reader variant, whether the shard storage uses the default SlateDB adapter or SQLite.
Dependency Layers¶
The codebase is organized into layers with strict dependency direction — lower layers never import higher ones:
flowchart TB
L0["Layer 0 — Core Types\nerrors, type_defs, sharding_types,\nordering, logging, metrics protocol"]
L1["Layer 1 — Config & Serialization\nconfig, serde, _rate_limiter"]
L2["Layer 2 — Storage & Routing\nstorage, manifest, routing,\nmanifest_store, async_manifest_store,\ndb_manifest_store"]
L3["Layer 3 — Writer Core\n_writer_core"]
L4["Layer 4 — Entry Points\nwriter/{spark,dask,ray,python},\nreader, async_reader, CLI"]
L5["Layer 5 — Adapters & Testing\nslatedb_adapter, sqlite_adapter,\ntesting"]
L0 --> L1 --> L2 --> L3 --> L4 --> L5
style L0 fill:#e8f5e9
style L1 fill:#e3f2fd
style L2 fill:#fff3e0
style L3 fill:#fce4ec
style L4 fill:#f3e5f5
style L5 fill:#e0f2f1
Writer Backends¶
Four backends produce identical manifest formats. They differ in how they distribute work but share all core logic, and can write either SlateDB or SQLite shards depending on the installed backend extra:
| Feature | Spark | Dask | Ray | Python |
|---|---|---|---|---|
| Input type | PySpark DataFrame |
Dask DataFrame |
Ray Dataset |
Iterable[T] |
| Requires Java | Yes | No | No | No |
| Sharded write | write_sharded |
write_sharded |
write_sharded |
write_sharded |
| Single-shard write | write_single_db |
write_single_db |
write_single_db |
— |
| Hash sharding | Yes | Yes | Yes | Yes |
| Range sharding | Yes | Yes | Yes | Yes |
| Custom expr sharding | Yes | No | No | No |
| Rate limiting (ops/sec) | Yes | Yes | Yes | Yes |
| Rate limiting (bytes/sec) | Yes | Yes | Yes | Yes |
| Parallel mode | Built-in (RDD) | Built-in (Dask) | Built-in (Ray) | parallel=True |
Reader Variants¶
| Variant | Thread-safe | Async | Best for |
|---|---|---|---|
ShardedReader |
No | No | Single-threaded scripts |
ConcurrentShardedReader |
Yes | No | Multi-threaded web servers |
AsyncShardedReader |
N/A | Yes | asyncio services (FastAPI, etc.) |
All readers support health checks, cold-start fallback, refresh, and optional rate limiting.
Quick Links¶
| Page | Description |
|---|---|
| Getting Started | Installation, extras, and dev setup |
| Architecture | Write/read pipeline deep dive and S3 layout |
| Writer Side | All writer backends, single-shard writers, rate limiting |
| Reader Side | Sync, concurrent, and async readers with health and fallback |
| Manifest Stores | S3, database, async, and in-memory manifest storage |
| CLI (shardy) | Interactive lookups, batch scripts, history, and rollback |
| Observability | Prometheus, OpenTelemetry, and structured logging |
| Error Handling | Error hierarchy and retry semantics |
| Operations | CI, rollback procedures, health monitoring, troubleshooting |
| API Reference | Auto-generated API docs from docstrings |