shardyfusion

shardyfusion builds and reads sharded snapshots backed by independent shard databases. The default backend uses SlateDB, and the project also supports SQLite shard databases. It lets you partition large datasets during a write pipeline, then route point lookups to the correct shard at read time.

How the Pieces Fit Together

Writers ingest data (from Spark, Dask, Ray, or plain Python), assign each row to a shard, write the shards to S3-compatible storage, and publish a manifest describing the result. Readers load the manifest, build a routing table, and serve key lookups by directing each key to its shard. The CLI wraps a reader for interactive and batch use.

flowchart LR
    subgraph Writers
        Spark
        Dask
        Ray
        Python
    end

    Writers --> Core["_writer_core\n(routing, winner selection,\nmanifest building)"]
    Core --> MS["ManifestStore\n(S3 / DB / in-memory)"]
    MS --> Readers

    subgraph Readers
        Sync["ShardedReader"]
        Conc["ConcurrentShardedReader"]
        Async["AsyncShardedReader"]
    end

    Readers --> CLI["CLI (shardy)"]

    Obs["Metrics & Logging"] -.-> Writers
    Obs -.-> Readers
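
The write-then-route flow in the diagram can be illustrated with a minimal, self-contained sketch. It does not use the shardyfusion API: `shard_for`, `write_snapshot`, `lookup`, the manifest layout, and the shard count are all hypothetical names chosen for illustration, and plain dictionaries stand in for shard databases.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen for the example


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard id with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process and would route the same key differently in writer and reader.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def write_snapshot(rows):
    """Writer side: assign each row to a shard and describe the result."""
    shards = defaultdict(dict)
    for key, value in rows:
        shards[shard_for(key)][key] = value
    # Hypothetical manifest: just enough for a reader to rebuild the routing.
    manifest = {
        "num_shards": NUM_SHARDS,
        "shards": {sid: f"shard-{sid:05d}" for sid in range(NUM_SHARDS)},
    }
    return manifest, dict(shards)


def lookup(manifest, shards, key):
    """Reader side: route the key to its shard, then do a point lookup."""
    sid = shard_for(key, manifest["num_shards"])
    return shards.get(sid, {}).get(key)


rows = [(f"user:{i}", {"id": i}) for i in range(100)]
manifest, shards = write_snapshot(rows)
print(lookup(manifest, shards, "user:42"))
```

The important property is that writer and reader agree on the routing function; the manifest carries whatever the reader needs to reproduce it.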

All four writer backends share the same core logic (_writer_core.py) for routing, winner selection, manifest building, and two-phase publishing. This guarantees that any backend produces a manifest readable by any reader variant, whether the shard storage uses the default SlateDB adapter or SQLite.
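
As a rough illustration of winner selection and two-phase publishing, the sketch below makes two assumptions that this page does not confirm: that winner selection means keeping exactly one write attempt per shard when retried tasks produce duplicates, and that two-phase publishing means writing the manifest first and updating a "current" pointer second. `ShardAttempt`, `select_winners`, `InMemoryStore`, and the key names are hypothetical, not the `_writer_core` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardAttempt:
    shard_id: int
    attempt: int  # retry number of the task that wrote this shard
    url: str      # where this attempt wrote its shard database


def select_winners(attempts):
    """Keep one attempt per shard; in this sketch the lowest attempt wins."""
    winners = {}
    for a in attempts:
        best = winners.get(a.shard_id)
        if best is None or a.attempt < best.attempt:
            winners[a.shard_id] = a
    return [winners[sid] for sid in sorted(winners)]


class InMemoryStore:
    """Hypothetical manifest store backed by a dict."""

    def __init__(self):
        self.objects = {}

    def publish(self, run_id, winners):
        manifest_key = f"manifests/{run_id}.json"
        # Phase 1: write the manifest under a run-specific key.
        self.objects[manifest_key] = {
            "run_id": run_id,
            "shards": [w.url for w in winners],
        }
        # Phase 2: flip the pointer; readers see the new snapshot only now.
        self.objects["CURRENT"] = manifest_key

    def load_current(self):
        return self.objects[self.objects["CURRENT"]]


attempts = [
    ShardAttempt(0, 1, "s3://bucket/shard-0/attempt-1"),
    ShardAttempt(0, 0, "s3://bucket/shard-0/attempt-0"),
    ShardAttempt(1, 0, "s3://bucket/shard-1/attempt-0"),
]
store = InMemoryStore()
store.publish("run-1", select_winners(attempts))
print(store.load_current()["shards"])
```

Because the pointer changes only after the manifest exists, a reader that follows the pointer never observes a half-written snapshot in this model.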

Dependency Layers

The codebase is organized into layers with a strict dependency direction: lower layers never import from higher ones.

flowchart TB
    L0["Layer 0 — Core Types\nerrors, type_defs, sharding_types,\nordering, logging, metrics protocol"]
    L1["Layer 1 — Config & Serialization\nconfig, serde, _rate_limiter"]
    L2["Layer 2 — Storage & Routing\nstorage, manifest, routing,\nmanifest_store, async_manifest_store,\ndb_manifest_store"]
    L3["Layer 3 — Writer Core\n_writer_core"]
    L4["Layer 4 — Entry Points\nwriter/{spark,dask,ray,python},\nreader, async_reader, CLI"]
    L5["Layer 5 — Adapters & Testing\nslatedb_adapter, sqlite_adapter,\ntesting"]

    L0 --> L1 --> L2 --> L3 --> L4 --> L5

    style L0 fill:#e8f5e9
    style L1 fill:#e3f2fd
    style L2 fill:#fff3e0
    style L3 fill:#fce4ec
    style L4 fill:#f3e5f5
    style L5 fill:#e0f2f1
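
The layering rule can be expressed as a small check over import edges. This is a hedged sketch rather than a tool shipped with the project: the module-to-layer map copies a few names from the diagram, while `find_violations` and the sample import edges are hypothetical.

```python
# Layer numbers for a few modules, taken from the diagram above.
LAYER = {
    "errors": 0,
    "config": 1,
    "routing": 2,
    "_writer_core": 3,
    "reader": 4,
    "slatedb_adapter": 5,
}


def find_violations(imports):
    """Return (importer, imported) pairs where a lower layer imports a higher one."""
    return [(src, dst) for src, dst in imports if LAYER[src] < LAYER[dst]]


# Hypothetical import edges: the last one breaks the rule on purpose.
imports = [
    ("config", "errors"),         # layer 1 imports layer 0: allowed
    ("reader", "routing"),        # layer 4 imports layer 2: allowed
    ("_writer_core", "routing"),  # layer 3 imports layer 2: allowed
    ("routing", "reader"),        # layer 2 imports layer 4: violation
]
print(find_violations(imports))
```

A check of this shape could run in CI against edges extracted from the real source tree.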

Writer Backends

All four backends produce the same manifest format. They differ in how they distribute work but share all core logic, and each can write either SlateDB or SQLite shards depending on the installed backend extra:

| Feature | Spark | Dask | Ray | Python |
| --- | --- | --- | --- | --- |
| Input type | PySpark DataFrame | Dask DataFrame | Ray Dataset | Iterable[T] |
| Requires Java | Yes | No | No | No |
| Sharded write | write_sharded | write_sharded | write_sharded | write_sharded |
| Single-shard write | write_single_db | write_single_db | write_single_db | |
| Hash sharding | Yes | Yes | Yes | Yes |
| Range sharding | Yes | Yes | Yes | Yes |
| Custom expr sharding | Yes | No | No | No |
| Rate limiting (ops/sec) | Yes | Yes | Yes | Yes |
| Rate limiting (bytes/sec) | Yes | Yes | Yes | Yes |
| Parallel mode | Built-in (RDD) | Built-in (Dask) | Built-in (Ray) | parallel=True |
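
Hash sharding spreads keys by a hash value, while range sharding assigns each key to the shard whose key interval contains it. The sketch below shows the range case with the standard library only; `BOUNDARIES` and `range_shard` are hypothetical and do not reflect how shardyfusion configures or stores its split points.

```python
from bisect import bisect_right

# Hypothetical split points producing four shards:
#   shard 0: key < 100
#   shard 1: 100 <= key < 1000
#   shard 2: 1000 <= key < 10000
#   shard 3: key >= 10000
BOUNDARIES = [100, 1000, 10000]


def range_shard(key: int, boundaries=BOUNDARIES) -> int:
    """Return the index of the interval that contains key."""
    # bisect_right counts the boundaries that are <= key, which is exactly
    # the shard index when each boundary is the inclusive lower bound of
    # the next shard.
    return bisect_right(boundaries, key)


for key in (5, 100, 999, 10000):
    print(key, "->", range_shard(key))
```

Range sharding keeps neighbouring keys in the same shard, at the cost of needing split points that match the key distribution.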

Reader Variants

| Variant | Thread-safe | Async | Best for |
| --- | --- | --- | --- |
| ShardedReader | No | No | Single-threaded scripts |
| ConcurrentShardedReader | Yes | No | Multi-threaded web servers |
| AsyncShardedReader | N/A | Yes | asyncio services (FastAPI, etc.) |

All readers support health checks, cold-start fallback, refresh, and optional rate limiting.
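
One common way to combine thread-safe lookups with refresh is to swap an immutable snapshot reference under a lock. The sketch below shows that general pattern only; it is not the implementation of `ConcurrentShardedReader`, and `SwappableReader` and `load_snapshot` are hypothetical names.

```python
import threading


class SwappableReader:
    """Hypothetical reader whose snapshot can be replaced while lookups run."""

    def __init__(self, load_snapshot):
        self._load_snapshot = load_snapshot  # callable returning {key: value}
        self._lock = threading.Lock()
        self._snapshot = load_snapshot()

    def get(self, key):
        # Take the reference under the lock, then read outside it so a slow
        # lookup never blocks a refresh.
        with self._lock:
            snapshot = self._snapshot
        return snapshot.get(key)

    def refresh(self):
        # Load the new snapshot first; hold the lock only for the swap.
        new_snapshot = self._load_snapshot()
        with self._lock:
            self._snapshot = new_snapshot


versions = iter([{"k": 1}, {"k": 2}])
reader = SwappableReader(lambda: next(versions))
print(reader.get("k"))
reader.refresh()
print(reader.get("k"))
```

A lookup that started before a refresh finishes against the old snapshot, so each call sees one consistent version.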

| Page | Description |
| --- | --- |
| Getting Started | Installation, extras, and dev setup |
| Architecture | Write/read pipeline deep dive and S3 layout |
| Writer Side | All writer backends, single-shard writers, rate limiting |
| Reader Side | Sync, concurrent, and async readers with health and fallback |
| Manifest Stores | S3, database, async, and in-memory manifest storage |
| CLI (shardy) | Interactive lookups, batch scripts, history, and rollback |
| Observability | Prometheus, OpenTelemetry, and structured logging |
| Error Handling | Error hierarchy and retry semantics |
| Operations | CI, rollback procedures, health monitoring, troubleshooting |
| API Reference | Auto-generated API docs from docstrings |