Tox and Dependency Matrix
shardyfusion is a sharded snapshot writer/reader library that supports many distinct use cases. That breadth is the root cause of the size and complexity of pyproject.toml and tox.ini.
This page explains:
1. the use cases the project supports
2. how use cases drive the package extras
3. how extras drive the tox matrix
4. how to add new tox environments responsibly
Use Cases the Project Supports
shardyfusion can be used in several production pipeline scenarios:
Writer Side
| Use case | Input | Requires |
|---|---|---|
| PySpark DataFrame | DataFrame with custom sharding | Java 17, PySpark |
| Dask DataFrame | Dask DataFrame | Dask |
| Ray Dataset | Ray Dataset | Ray |
| Pure Python | Iterable of records | nothing extra |
Reader Side
| Use case | API | Requires |
|---|---|---|
| Sync reader | ShardedReader | none |
| Concurrent reader | ConcurrentShardedReader | none |
| Async reader | AsyncShardedReader | aiobotocore |
Storage Backends
| Use case | Backend | Requires |
|---|---|---|
| Default shard storage | SlateDB | slatedb, boto3 |
| SQLite shards | SQLite-on-S3 | boto3 |
| Range reads | SQLite with APSW VFS | apsw, boto3 |
Optional Features
| Use case | Feature | Requires |
|---|---|---|
| CLI tool | shardy CLI | click |
| Expression routing | CEL sharding | cel-expr-python |
| Observability | Prometheus metrics | prometheus-client |
| Observability | OpenTelemetry | opentelemetry-api |
| Vector search | LanceDB | lancedb, numpy |
| Vector search | sqlite-vec | sqlite-vec, numpy |
Python Versions
Current support: Python 3.11, 3.12, 3.13. Python 3.14 is excluded until all dependencies consistently support it.
How Use Cases Drive Package Extras
Every distinct use case above becomes a public extra in pyproject.toml. Users install exactly what they need:
# Reader only (default SlateDB backend)
uv sync --extra read
# Async reader
uv sync --extra read-async
# SQLite shards instead of SlateDB
uv sync --extra read-sqlite
# Adaptive SQLite reader (download + range, auto-picked per snapshot)
uv sync --extra read-sqlite-adaptive
# Spark writer (requires Java)
uv sync --extra writer-spark
# Dask writer
uv sync --extra writer-dask
# Ray writer
uv sync --extra writer-ray
# CLI tool — kitchen-sink (every read backend bundled)
uv sync --extra cli
# CLI tool — slim (bring your own backend)
uv sync --extra cli-minimal --extra read-sqlite-adaptive
# Full install (all use cases)
uv sync --all-extras
The public extras are intentionally user-shaped. They answer: "How do I use this library for X?"
Why So Many Extras?
Because every use case has different dependencies. Users should not have to install:
- PySpark if they use Dask
- aiobotocore if they only need sync reads
- lancedb if they only need key-value lookups
The matrix is large because the option set is large.
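One way to see the effect of slim extras is to probe which optional modules are importable in the current environment. The following is a minimal sketch, not part of shardyfusion: `USE_CASE_MODULES` and `missing_modules` are hypothetical names, and the import names are assumptions derived from the package names in the tables above.

```python
from importlib.util import find_spec

# Hypothetical mapping from use case to the top-level modules it imports.
# The module names are assumptions derived from the package names above.
USE_CASE_MODULES = {
    "async reader": ["aiobotocore"],
    "spark writer": ["pyspark"],
    "dask writer": ["dask"],
    "ray writer": ["ray"],
    "lancedb vector search": ["lancedb", "numpy"],
}


def missing_modules(modules):
    """Return the entries of `modules` that cannot be imported here."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    for use_case, modules in USE_CASE_MODULES.items():
        missing = missing_modules(modules)
        status = "available" if not missing else f"missing {', '.join(missing)}"
        print(f"{use_case}: {status}")
```

In an environment synced with only the read extra, most of these use cases would presumably report missing modules, which is the intended outcome of keeping extras narrow.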
How Extras Drive the Tox Matrix
The tox matrix tests every supported combination of use cases, Python versions, backends, and verification stages.
flowchart LR
UC[Use Cases] --> Extras[pyproject.toml extras]
Extras --> Groups[dependency-groups]
Extras --> Tox[tox.ini environments]
Groups --> Tox
Tox --> Labels[tox labels]
Labels --> CI[ci-matrix.json]
| From | To | What it adds |
|---|---|---|
| use case | public extra | user install target |
| use case | dependency group | reusable slice for tox |
| extra + Python + stage | tox env | concrete verification job |
| tox env family | label | group for workflows |
| label | CI job matrix | GitHub Actions parallelization |
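The "extra + Python + stage" row can be pictured as composing env names from factor lists. The sketch below is illustrative only: the factor values are placeholders (foo is the placeholder capability this page uses elsewhere), `expand_envs` is a hypothetical helper, and the real env_list in tox.ini lists only supported combinations rather than a full cross product.

```python
from itertools import product

# Illustrative factor values only; the real lists live in tox.ini.
PYTHONS = ["py311", "py312", "py313"]
CAPABILITIES = ["read", "foo"]
BACKENDS = ["slatedb"]
STAGES = ["unit", "integration"]


def expand_envs(pythons, capabilities, backends, stages):
    """Compose tox env names as <python>-<capability>-<backend>-<stage>."""
    return [
        "-".join(factors)
        for factors in product(pythons, capabilities, backends, stages)
    ]


envs = expand_envs(PYTHONS, CAPABILITIES, BACKENDS, STAGES)
print(len(envs))  # 3 * 2 * 1 * 2 combinations
print(envs[0])
```

Even this small example yields a dozen envs, which is why the matrix grows quickly as use cases are added.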
The Layers
| Layer | Defined in | Purpose |
|---|---|---|
| Public optional extras | pyproject.toml | User-facing install shapes |
| Dependency groups | pyproject.toml | Internal reusable slices |
| Tox environments | tox.ini | Concrete verification targets |
| Tox labels | tox.ini | Workflow entry points |
| CI matrix | .github/ci-matrix.json | Generated CI jobs |
Tox Factorization
The tox config is compact because one base [testenv] describes many environments through factors.
| Piece | What it does |
|---|---|
| package = editable | Test the working tree directly |
| extras = test | Add pytest and fixtures |
| dependency_groups = ... | Add only the slice needed |
| deps = ... | Runtime version pins (e.g., Spark 3.5 vs 4) |
| commands = ... | Test path per env family |
| labels = ... | Group by stage for workflows |
Example: py311-sparkwriter-spark4-slatedb-unit
| Piece | Means |
|---|---|
| py311 | Python 3.11 |
| sparkwriter | adds cap-writer-spark + mod-cel |
| spark4 | adds pyspark>=4,<5 |
| slatedb | adds backend-slatedb |
| unit | runs Spark unit tests |
This env runs the Spark writer unit tests on Python 3.11 with Spark 4 against the SlateDB backend.
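The same decomposition can be done mechanically by splitting the env name on hyphens. This is a small sketch: the factor meanings are copied from the table above, while `FACTOR_MEANINGS` and `explain_env` are hypothetical names used only for illustration.

```python
# Factor meanings copied from the table above; anything else is "unknown factor".
FACTOR_MEANINGS = {
    "py311": "Python 3.11",
    "sparkwriter": "adds cap-writer-spark + mod-cel",
    "spark4": "adds pyspark>=4,<5",
    "slatedb": "adds backend-slatedb",
    "unit": "runs Spark unit tests",
}


def explain_env(env_name):
    """Split a tox env name on '-' and look up each factor."""
    return [
        (factor, FACTOR_MEANINGS.get(factor, "unknown factor"))
        for factor in env_name.split("-")
    ]


for factor, meaning in explain_env("py311-sparkwriter-spark4-slatedb-unit"):
    print(f"{factor:12} {meaning}")
```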
Why the Structure Exists
1. Honest support boundaries
The tox env list encodes exactly which combinations are supported, so there is no ambiguity about what works on which Python version with which backend.
2. Fast, focused installs
Most envs install only what they need. py312-read-slatedb-unit never pulls PySpark.
3. Clear failure isolation
When py313-vector-lancedb-unit fails, the env name itself narrows the problem to Python 3.13 + vector + LanceDB rather than "some dependency problem".
4. CI parallelization
Each tox env maps to one CI job, so the repo can run many combinations in parallel.
Quality Envs
The quality label handles non-test checks separately:
- lint/format: code style
- type-*: per-path type checking (not one env installing everything)
- package: build validation, separate from editable installs
- docs-check: site build
Type checking is split because each type path needs different dependencies and pyright configs.
Labels
| Label | What | Why separate |
|---|---|---|
| quality | lint, format, type, package, docs | not test-path-based |
| unit | fast slices | quick feedback |
| integration | cross-component | moto S3, framework stacks |
| smoke | broad "all" on scheduled matrix | catch interactions without PR cost |
| e2e | Garage container | needs setup |
How CI Uses Tox
For the unit, integration, and smoke labels, tox is the source of truth.
flowchart LR
Tox[tox.ini labels] --> Script[generate_ci_matrix.py]
Script --> Json[ci-matrix.json]
Json --> Workflow[GitHub Actions]
Workflow --> Job[run-tests action]
Job --> Run[tox -e env]
- tox labels define env groups
- just ci-matrix regenerates .github/ci-matrix.json
- CI loads the matrix and runs each tox env as its own job
- the quality job checks that the matrix is up to date
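The generate-then-verify loop can be sketched as follows. This is not the project's generate_ci_matrix.py, whose contents and JSON schema are not shown on this page: `build_matrix`, `render`, `is_up_to_date`, the `label`/`tox_env` keys, and the integration env name are all assumptions chosen for illustration, loosely following the `include` list shape that GitHub Actions matrices accept.

```python
import json


def build_matrix(labels):
    """Turn {label: [tox envs]} into a GitHub Actions style matrix dict."""
    include = [
        {"label": label, "tox_env": env}
        for label in sorted(labels)
        for env in labels[label]
    ]
    return {"include": include}


def render(matrix):
    """Serialize deterministically so a staleness check is a string compare."""
    return json.dumps(matrix, indent=2, sort_keys=True) + "\n"


def is_up_to_date(labels, committed_text):
    """True when the committed JSON equals what the labels would generate."""
    return render(build_matrix(labels)) == committed_text


# Illustrative label contents; the real ones come from tox.ini.
labels = {
    "unit": ["py312-read-slatedb-unit", "py313-vector-lancedb-unit"],
    "integration": ["py312-read-slatedb-integration"],
}
committed = render(build_matrix(labels))
print(is_up_to_date(labels, committed))

# Adding an env without regenerating makes the committed file stale.
labels["unit"].append("py312-foo-slatedb-unit")
print(is_up_to_date(labels, committed))
```

Deterministic serialization is the design choice that matters here: it lets the quality job detect a stale matrix with a plain comparison.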
When To Add A New Tox Environment
Add a new tox env when the change creates a new use case or a new supported combination.
Usually needs a new env
| Change | Reason |
|---|---|
| New writer implementation | new use case |
| New backend | new storage option |
| New Python version | expands support matrix |
| New Spark major version | different runtime |
| New vector backend | different search option |
Usually does NOT need a new env
| Change | Reason |
|---|---|
| More tests in existing directory | same use case |
| More CLI tests | already covered |
| Refactor without behavior change | no new boundary |
Decision flow
flowchart TD
Change[New change] --> Boundary{"New use case<br/>or supported combo?"}
Boundary -->|No| Reuse[Reuse existing env]
Boundary -->|Yes| Existing{"Same use case<br/>already covered?"}
Existing -->|Yes| Extend[Expand existing test path]
Existing -->|No| New[Add new tox env]
New --> Label{"Under unit,<br/>integration, smoke?"}
Label -->|Yes| MatrixRegen[Run just ci-matrix]
Label -->|No| SkipMatrix[No regeneration needed]
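The flowchart above can also be read as a small function. This is a sketch of the decision logic only; `tox_action` and its parameter names are hypothetical.

```python
def tox_action(new_boundary, already_covered, label=None):
    """Mirror the flowchart: return the steps a change calls for."""
    if not new_boundary:
        return ["reuse existing env"]
    if already_covered:
        return ["expand existing test path"]
    steps = ["add new tox env"]
    # Only these labels feed the generated CI matrix.
    if label in {"unit", "integration", "smoke"}:
        steps.append("run just ci-matrix")
    return steps


print(tox_action(new_boundary=True, already_covered=False, label="unit"))
print(tox_action(new_boundary=True, already_covered=False, label="quality"))
```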
How To Add A New Tox Environment
1. Add the extra or dependency group
If user-facing: add to [project.optional-dependencies]
If internal only: add to [dependency-groups]
2. Add to tox env_list
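Follow the naming pattern: py<version>-<capability>-<backend>-<stage>. Some env families insert an extra runtime factor, as spark4 does in py311-sparkwriter-spark4-slatedb-unit.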
3. Add to the right label
- quality for lint/format/type/package/docs
- unit for fast slices
- integration for cross-component
- smoke for broad scheduled coverage
- e2e for container-based tests
4. Wire dependencies
Use the factor pattern in the base [testenv]:
[testenv]
dependency_groups =
    foo: cap-foo
    slatedb: backend-slatedb
commands =
    foo-unit: pytest -q tests/unit/foo {posargs}
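To make the factor pattern concrete, here is a simplified model of how conditional lines are selected for one env. It is a sketch, not tox's implementation: it assumes tox's documented behavior that a hyphenated condition requires all listed factors and a comma-separated condition requires any one alternative, and it omits negation. `condition_matches` and `select` are hypothetical helpers.

```python
def condition_matches(condition, env_name):
    """Simplified model: ',' means OR, '-' means AND across env factors."""
    env_factors = set(env_name.split("-"))
    return any(
        set(alternative.split("-")) <= env_factors
        for alternative in condition.split(",")
    )


def select(conditional_values, env_name):
    """Keep the values whose condition matches the env name."""
    return [
        value
        for condition, value in conditional_values
        if condition_matches(condition, env_name)
    ]


# (condition, value) pairs taken from the [testenv] example above.
dependency_groups = [("foo", "cap-foo"), ("slatedb", "backend-slatedb")]
commands = [("foo-unit", "pytest -q tests/unit/foo {posargs}")]

env = "py312-foo-slatedb-unit"
print(select(dependency_groups, env))
print(select(commands, env))
```

Under this model py312-foo-slatedb-unit picks up both dependency groups and the foo unit command, while an env without the foo factor would get only backend-slatedb.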
5. Regenerate CI matrix (if needed)
just ci-matrix
git add .github/ci-matrix.json
git commit -m "chore: regenerate ci-matrix"
6. Verify
Run the specific env:
uv run tox -e py312-foo-slatedb-unit
Then run the label:
uv run tox -m unit
Maintenance Rule
Keep the sources of truth in this order:
- use cases define what the project does
- extras let users install use cases
- tox verifies use cases work
- ci-matrix.json is generated from tox
When use cases are added, the chain flows naturally. Do not add tox complexity without a corresponding use case.
Project-Specific Settings
| Setting | Why |
|---|---|
| skip_missing_interpreters = false | Missing Python versions fail loudly |
| package = editable | Fast edit-test cycle |
| separate package env | Verify the built wheel, not the editable install |
| SPARK_LOCAL_IP=127.0.0.1 | Avoid Spark hostname issues |
| RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 | Ray otherwise detects uv run and creates fresh envs |