Tox and Dependency Matrix

shardyfusion is a sharded snapshot writer/reader library that supports several writers, readers, storage backends, and optional features. That breadth of use cases is why pyproject.toml and tox.ini are large and complex.

This page explains:

1. the use cases the project supports
2. how use cases drive the package extras
3. how extras drive the tox matrix
4. how to add new tox environments responsibly

Use Cases the Project Supports

shardyfusion can be used for different scenarios in production pipelines:

Writer Side

Use case	Input	Requires
PySpark DataFrame	DataFrame with custom sharding	Java 17, PySpark
Dask DataFrame	Dask DataFrame	Dask
Ray Dataset	Ray Dataset	Ray
Pure Python	Iterable of records	nothing extra

Reader Side

Use case	API	Requires
Sync reader	`ShardedReader`	none
Concurrent reader	`ConcurrentShardedReader`	none
Async reader	`AsyncShardedReader`	aiobotocore

Storage Backends

Use case	Backend	Requires
Default shard storage	SlateDB	slatedb, boto3
SQLite shards	SQLite-on-S3	boto3
Range reads	SQLite with APSW VFS	apsw, boto3

Optional Features

Use case	Feature	Requires
CLI tool	`shardy` CLI	click
Expression routing	CEL sharding	cel-expr-python
Observability	Prometheus metrics	prometheus-client
Observability	OpenTelemetry	opentelemetry-api
Vector search	LanceDB	lancedb, numpy
Vector search	sqlite-vec	sqlite-vec, numpy

Python Versions

Current support: Python 3.11, 3.12, 3.13. Python 3.14 is excluded until all dependencies consistently support it.

How Use Cases Drive Package Extras

Every distinct use case above becomes a public extra in pyproject.toml. Users install exactly what they need:

# Reader only (default SlateDB backend)
uv sync --extra read

# Async reader
uv sync --extra read-async

# SQLite shards instead of SlateDB
uv sync --extra read-sqlite

# Adaptive SQLite reader (download + range, auto-picked per snapshot)
uv sync --extra read-sqlite-adaptive

# Spark writer (requires Java)
uv sync --extra writer-spark

# Dask writer
uv sync --extra writer-dask

# Ray writer
uv sync --extra writer-ray

# CLI tool — kitchen-sink (every read backend bundled)
uv sync --extra cli

# CLI tool — slim (bring your own backend)
uv sync --extra cli-minimal --extra read-sqlite-adaptive

# Full install (all use cases)
uv sync --all-extras

The public extras are intentionally user-shaped. They answer: "How do I use this library for X?"

Why So Many Extras?

Each use case has different dependencies, and users should not have to install:

PySpark if they use Dask
aiobotocore if they only need sync reads
lancedb if they only need key-value lookups

The matrix is large because the option set is large.

How Extras Drive the Tox Matrix

The tox matrix tests every supported combination of use cases, Python versions, backends, and verification stages.

flowchart LR
    UC[Use Cases] --> Extras[pyproject.toml extras]
    Extras --> Groups[dependency-groups]
    Extras --> Tox[tox.ini environments]
    Groups --> Tox
    Tox --> Labels[tox labels]
    Labels --> CI[ci-matrix.json]

From	To	What it adds
use case	public extra	user install target
use case	dependency group	reusable slice for tox
extra + Python + stage	tox env	concrete verification job
tox env family	label	group for workflows
label	CI job matrix	GitHub Actions parallelization

The Layers

Layer	Defined in	Purpose
Public optional extras	`pyproject.toml`	User-facing install shapes
Dependency groups	`pyproject.toml`	Internal reusable slices
Tox environments	`tox.ini`	Concrete verification targets
Tox labels	`tox.ini`	Workflow entry points
CI matrix	`.github/ci-matrix.json`	Generated CI jobs

Tox Factorization

The tox config is compact because one base [testenv] describes many environments through factors.

Piece	What it does
`package = editable`	Test working tree directly
`extras = test`	Add pytest and fixtures
`dependency_groups = ...`	Add only the slice needed
`deps = ...`	Runtime version pins (e.g., Spark 3.5 vs 4)
`commands = ...`	Test path per env family
`labels = ...`	Group by stage for workflows
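
The way factors multiply into concrete environments can be sketched in Python. This is only an illustration of the idea, not how tox itself expands names; the helper `expand_envs` and the `sqlite` backend factor are hypothetical, and the real env list lives in tox.ini.

```python
from itertools import product


def expand_envs(*factor_groups: list[str]) -> list[str]:
    """Join one choice from each factor group with '-' to form env names."""
    return ["-".join(combo) for combo in product(*factor_groups)]


# Hypothetical factor groups, chosen only for illustration.
envs = expand_envs(
    ["py311", "py312", "py313"],  # Python versions
    ["read"],                     # capability
    ["slatedb", "sqlite"],        # backend ("sqlite" is an assumed factor name)
    ["unit"],                     # stage
)

for name in envs:
    print(name)
```

Three Python versions times two backends gives six env names from one short description, which is why the matrix grows quickly as options are added.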

Example: `py311-sparkwriter-spark4-slatedb-unit`

Piece	Means
`py311`	Python 3.11
`sparkwriter`	adds `cap-writer-spark` + `mod-cel`
`spark4`	adds `pyspark>=4,<5`
`slatedb`	adds `backend-slatedb`
`unit`	runs Spark unit tests

This env runs the Spark writer unit tests on Python 3.11 with Spark 4 against the SlateDB backend.
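
The same reading of an env name can be sketched as a small lookup. The meanings are copied from the table above; the function `describe_env` is a hypothetical helper for illustration and is not part of the project.

```python
# Factor meanings taken from the example table; other factors are reported as unknown.
FACTOR_MEANINGS = {
    "py311": "Python 3.11",
    "sparkwriter": "adds cap-writer-spark + mod-cel",
    "spark4": "adds pyspark>=4,<5",
    "slatedb": "adds backend-slatedb",
    "unit": "runs Spark unit tests",
}


def describe_env(env_name: str) -> list[tuple[str, str]]:
    """Split a tox env name on '-' and pair each factor with its meaning."""
    return [
        (factor, FACTOR_MEANINGS.get(factor, "unknown factor"))
        for factor in env_name.split("-")
    ]


description = describe_env("py311-sparkwriter-spark4-slatedb-unit")
for factor, meaning in description:
    print(f"{factor:12} {meaning}")
```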

Why the Structure Exists

1. Honest support boundaries

Each tox env encodes exactly one supported combination. There is no ambiguity about which Python version works with which backend.

2. Fast, focused installs

Most envs install only what they need. py312-read-slatedb-unit never pulls PySpark.

3. Clear failure isolation

When py313-vector-lancedb-unit fails, the env name tells you the issue lies in the Python 3.13 + vector + LanceDB combination, not in "some dependency problem".

4. CI parallelization

Each tox env maps to one CI job. The repo can run many combinations in parallel.

Quality Envs

The quality label handles non-test checks separately:

lint / format — code style
type-* — per-path type checking (not one env installing everything)
package — build validation separately from editable installs
docs-check — site build

Type checking is split because each type path needs different dependencies and pyright configs.

Labels

Label	What	Why separate
`quality`	lint, format, type, package, docs	not test-path-based
`unit`	fast slices	quick feedback
`integration`	cross-component	moto S3, framework stacks
`smoke`	broad "all" on scheduled matrix	catch interactions without PR cost
`e2e`	Garage container	needs setup

How CI Uses Tox

For the unit, integration, and smoke labels, tox is the source of truth.

flowchart LR
    Tox[tox.ini labels] --> Script[generate_ci_matrix.py]
    Script --> Json[ci-matrix.json]
    Json --> Workflow[GitHub Actions]
    Workflow --> Job[run-tests action]
    Job --> Run[tox -e env]

tox labels define env groups
just ci-matrix regenerates .github/ci-matrix.json
CI loads and runs each tox env as its own job
quality job checks matrix is up to date
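
The step from tox labels to one CI job per env could look roughly like the sketch below. It is not the project's generate_ci_matrix.py, whose contents are not shown on this page; the function `build_ci_matrix`, the `include`, `label`, and `tox_env` keys, and the label contents are all assumptions made for illustration.

```python
import json


def build_ci_matrix(labels: dict[str, list[str]]) -> dict:
    """Flatten a label -> envs mapping into one job entry per tox env."""
    include = [
        {"label": label, "tox_env": env}
        for label, envs in sorted(labels.items())
        for env in envs
    ]
    return {"include": include}


# Hypothetical label contents; the real ones are defined in tox.ini.
labels = {
    "unit": ["py312-read-slatedb-unit", "py313-vector-lancedb-unit"],
    "integration": ["py312-read-slatedb-integration"],
}

matrix = build_ci_matrix(labels)
print(json.dumps(matrix, indent=2))
```

Sorting the labels keeps the output stable between runs, which matters when the generated file is committed and compared in CI.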

When To Add A New Tox Environment

Add a new tox env when the change creates a new use case or new supported combination.

Usually needs a new env

Change	Reason
New writer implementation	new use case
New backend	new storage option
New Python version	expands support matrix
New Spark major version	different runtime
New vector backend	different search option

Usually does NOT need a new env

Change	Reason
More tests in existing directory	same use case
More CLI tests	already covered
Refactor without behavior change	no new boundary

Decision flow

flowchart TD
    Change[New change] --> Boundary{"New use case<br/>or supported combo?"}
    Boundary -->|No| Reuse[Reuse existing env]
    Boundary -->|Yes| Existing{"Same use case<br/>already covered?"}
    Existing -->|Yes| Extend[Expand existing test path]
    Existing -->|No| New[Add new tox env]
    New --> Label{"Under unit,<br/>integration, smoke?"}
    Label -->|Yes| MatrixRegen[Run just ci-matrix]
    Label -->|No| SkipMatrix[No regeneration needed]

How To Add A New Tox Environment

1. Add the extra or dependency group

If user-facing: add to [project.optional-dependencies]
If internal only: add to [dependency-groups]

2. Add to tox env_list

Follow the naming pattern: py<version>-<capability>-<backend>-<stage>. Envs with a runtime pin carry an extra factor, as in py311-sparkwriter-spark4-slatedb-unit.

3. Add to the right label

quality for lint/format/type/package/docs
unit for fast slices
integration for cross-component
smoke for broad scheduled coverage
e2e for container-based tests

4. Wire dependencies

Use the factor pattern in the base [testenv]:

[testenv]
dependency_groups =
    foo: cap-foo
    slatedb: backend-slatedb
commands =
    foo-unit: pytest -q tests/unit/foo {posargs}
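
How factor-conditional lines like these select values for one env can be sketched as follows. This is a simplified model, not tox's parser: it handles only `factor: value` and `factor-factor: value` prefixes plus unconditional lines, and it ignores negation and comma-separated alternatives. The function `resolve_conditional` is a hypothetical helper.

```python
def resolve_conditional(lines: list[str], env_name: str) -> list[str]:
    """Keep the values whose factor prefix is satisfied by env_name."""
    factors = set(env_name.split("-"))
    selected = []
    for line in lines:
        prefix, sep, value = line.partition(":")
        if not sep:
            # No prefix: the line applies to every env.
            selected.append(line.strip())
        elif all(f in factors for f in prefix.strip().split("-")):
            # Every factor named in the prefix is present in the env name.
            selected.append(value.strip())
    return selected


dependency_groups = ["foo: cap-foo", "slatedb: backend-slatedb"]
commands = ["foo-unit: pytest -q tests/unit/foo {posargs}"]

groups = resolve_conditional(dependency_groups, "py312-foo-slatedb-unit")
cmds = resolve_conditional(commands, "py312-foo-slatedb-unit")
print(groups)
print(cmds)
```

An env without the `foo` factor, such as py312-read-slatedb-unit, would pick up only `backend-slatedb` and none of the foo commands.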

5. Regenerate CI matrix (if needed)

just ci-matrix
git add .github/ci-matrix.json
git commit -m "chore: regenerate ci-matrix"
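
Skipping this step is what the quality job's up-to-date check is meant to catch. A minimal sketch of such a check is shown below, assuming the committed file is JSON and a freshly generated matrix is available as a dict. The function `matrix_is_current` and the sample payload are hypothetical and do not reflect the project's actual check.

```python
import json


def matrix_is_current(committed_text: str, regenerated: dict) -> bool:
    """Return True when the committed JSON equals a freshly generated matrix."""
    return json.loads(committed_text) == regenerated


# Hypothetical payloads; the real schema of ci-matrix.json is not shown here.
committed = '{"include": [{"tox_env": "py312-foo-slatedb-unit"}]}'
fresh = {"include": [{"tox_env": "py312-foo-slatedb-unit"}]}
without_new_env = {"include": []}

print(matrix_is_current(committed, fresh))
print(matrix_is_current(committed, without_new_env))
```

Comparing parsed JSON rather than raw text keeps the check insensitive to whitespace and key order.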

6. Verify

Run the specific env:

uv run tox -e py312-foo-slatedb-unit

Then run the label:

uv run tox -m unit

Maintenance Rule

Keep the source of truth in order:

use cases define what the project does
extras let users install use cases
tox verifies use cases work
ci-matrix.json is generated from tox

A new use case propagates down this chain in order: extra, tox env, regenerated ci-matrix.json. Do not add tox complexity without a corresponding use case.

Project-Specific Settings

Setting	Why
`skip_missing_interpreters = false`	Missing Python versions fail loudly
`package = editable`	Fast edit-test cycle
separate `package` env	Verify built wheel, not editable
`SPARK_LOCAL_IP=127.0.0.1`	Avoid Spark hostname issues
`RAY_ENABLE_UV_RUN_RUNTIME_ENV=0`	Disables Ray's `uv run` detection, which would otherwise create fresh envs