Cloud Testing Checklist
This document provides a structured checklist for validating shardyfusion against real cloud infrastructure. All tests below are manual; they are not part of the automated CI pipeline.
AWS S3 + Spark (EMR)
Prerequisites
- AWS account with EMR and S3 permissions
- S3 bucket provisioned (e.g. s3://my-org-shardyfusion-test/)
- EMR cluster with Spark 3.5 or 4.x, Python 3.11-3.13, Java 17
Write Test
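# On the EMR master node (shell)
pip install "shardyfusion[writer-spark]"
# The Python below assumes df is an existing Spark DataFrame with "id" and "payload" columns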
from shardyfusion import WriteConfig, ValueSpec
from shardyfusion.writer.spark import write_sharded
config = WriteConfig(num_dbs=8, s3_prefix="s3://my-org-shardyfusion-test/spark-test")
result = write_sharded(df, config, key_col="id", value_spec=ValueSpec.binary_col("payload"))
# Verify
assert result.stats.rows_written > 0
assert len(result.winners) == 8
print(f"Manifest: {result.manifest_ref}")
- [ ] num_dbs=8 completes successfully
- [ ] num_dbs=64 completes successfully
- [ ] num_dbs=256 completes successfully
- [ ] Manifest JSON is valid and readable
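One way to tick the manifest item is to fetch the manifest object and confirm that it parses. The sketch below assumes only that the manifest is a UTF-8 encoded JSON object; it makes no assumption about shardyfusion's manifest schema, and `check_manifest_json` is a hypothetical helper, not part of the library.

```python
import json


def check_manifest_json(raw: bytes) -> dict:
    """Parse downloaded manifest bytes and return the decoded JSON object.

    Raises ValueError if the payload is not valid UTF-8 JSON or is not an object.
    """
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"manifest is not valid UTF-8 JSON: {exc}") from exc
    if not isinstance(doc, dict):
        # Assumption: a manifest is a JSON object, not a list or scalar.
        raise ValueError(f"expected a JSON object, got {type(doc).__name__}")
    return doc
```

How the bytes are downloaded (for example with the AWS CLI or an S3 client) is left to the tester.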
Read Test
from shardyfusion import ShardedReader
reader = ShardedReader(
s3_prefix="s3://my-org-shardyfusion-test/spark-test",
local_root="/tmp/reader-test",
)
value = reader.get(42)
batch = reader.multi_get([1, 2, 3, 42, 100])
info = reader.snapshot_info()
print(info)
reader.close()
- [ ] get() returns correct values
- [ ] multi_get() returns all expected keys
- [ ] snapshot_info() shows correct metadata
- [ ] Reader works from a separate machine (not the EMR cluster)
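To check that get() returns correct values, a sample of the written records can be compared against what the reader returns. This is a minimal sketch: `find_mismatches` is a hypothetical helper, and it assumes only that the reader exposes `get(key)` and that returned values can be compared with `==` to the bytes that were written.

```python
from typing import Any, Dict, Mapping, Protocol, Tuple


class SupportsGet(Protocol):
    """Anything with a get(key) method, such as the reader used above."""

    def get(self, key: Any) -> Any: ...


def find_mismatches(
    reader: SupportsGet, expected: Mapping[Any, Any]
) -> Dict[Any, Tuple[Any, Any]]:
    """Return {key: (expected, actual)} for every key whose value differs."""
    mismatches: Dict[Any, Tuple[Any, Any]] = {}
    for key, want in expected.items():
        got = reader.get(key)
        if got != want:
            mismatches[key] = (want, got)
    return mismatches
```

An empty result means every sampled key round-tripped; a non-empty result lists the keys to investigate.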
Cross-Writer Test
- [ ] Write with Spark, read with Python reader on a different host
- [ ] Routing produces identical shard assignments
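The routing item can be checked by recording which shard each key lands in on both sides and diffing the two mappings. The sketch below assumes the tester has already collected two `key -> shard id` dictionaries; how they are obtained from each writer or reader is not specified here, and `diff_assignments` is a hypothetical helper.

```python
from typing import Any, Dict


def diff_assignments(first: Dict[Any, int], second: Dict[Any, int]) -> Dict[str, Any]:
    """Compare two key -> shard-id mappings and report every disagreement."""
    shared = first.keys() & second.keys()
    return {
        # Keys routed to different shards by the two sides.
        "differing": {k: (first[k], second[k]) for k in shared if first[k] != second[k]},
        # Keys seen by only one side (sorted by repr so mixed key types work).
        "only_in_first": sorted(first.keys() - second.keys(), key=repr),
        "only_in_second": sorted(second.keys() - first.keys(), key=repr),
    }
```

Routing is identical when `differing` is empty and both `only_in_*` lists are empty.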
AWS S3 + Dask
Prerequisites
- Python 3.11-3.13 environment with pip install "shardyfusion[writer-dask]"
- Use pip install "shardyfusion[writer-dask-sqlite]" to exercise the SQLite backend instead
- S3 credentials configured via environment variables or ~/.aws/credentials
Test
import dask.dataframe as dd
from shardyfusion import WriteConfig, ValueSpec
from shardyfusion.writer.dask import write_sharded
# pdf is assumed to be an existing pandas DataFrame with "id" and "payload" columns
ddf = dd.from_pandas(pdf, npartitions=4)
config = WriteConfig(num_dbs=8, s3_prefix="s3://my-org-shardyfusion-test/dask-test")
result = write_sharded(ddf, config, key_col="id", value_spec=ValueSpec.binary_col("payload"))
- [ ] Hash sharding completes
- [ ] CEL sharding completes (with ShardingSpec(strategy=ShardingStrategy.CEL, ...))
- [ ] Rate limiting verified (max_writes_per_second=1000)
- [ ] Read test with ShardedReader passes
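A simple way to verify rate limiting is to compare the observed write rate with the configured cap. The sketch below is an assumption-laden check: it treats `max_writes_per_second` as a cap on the aggregate rate (if the limit is applied per shard or per worker, scale it accordingly), and both `within_rate_limit` and its `tolerance` parameter are hypothetical.

```python
def within_rate_limit(
    rows_written: int,
    elapsed_s: float,
    max_writes_per_second: float,
    tolerance: float = 0.10,
) -> bool:
    """Return True if the observed rows/sec stays within the cap plus a tolerance."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    observed = rows_written / elapsed_s
    # Allow a small overshoot for bursts and timer granularity.
    return observed <= max_writes_per_second * (1.0 + tolerance)
```

For example, 10,000 rows written in 10 seconds is 1,000 rows/sec, which passes against a cap of 1000, while the same rows in 5 seconds would not.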
AWS S3 + Ray
Prerequisites
- Python 3.11-3.13 environment with pip install "shardyfusion[writer-ray]"
- Ray cluster or local mode
Test
import ray
from shardyfusion import WriteConfig, ValueSpec
from shardyfusion.writer.ray import write_sharded
ds = ray.data.from_items([{"id": i, "payload": b"data"} for i in range(10000)])
config = WriteConfig(num_dbs=8, s3_prefix="s3://my-org-shardyfusion-test/ray-test")
result = write_sharded(ds, config, key_col="id", value_spec=ValueSpec.binary_col("payload"))
- [ ] Hash sharding completes
- [ ] Range sharding completes
- [ ] shuffle_strategy restored after write (check DataContext.shuffle_strategy)
- [ ] Read test with ShardedReader passes
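The shuffle_strategy item amounts to reading an attribute before the write, running the write, and confirming the attribute has its original value afterwards. The helper below is a generic sketch of that check; `run_and_check_restored` is hypothetical and works on any object, so it does not depend on Ray being installed.

```python
from typing import Any, Callable


def run_and_check_restored(obj: Any, attr: str, action: Callable[[], Any]) -> Any:
    """Run action() and assert that obj.<attr> is unchanged afterwards.

    Returns whatever action() returns. Exceptions from action() propagate.
    """
    before = getattr(obj, attr)
    result = action()
    after = getattr(obj, attr)
    if after != before:
        raise AssertionError(f"{attr} changed from {before!r} to {after!r}")
    return result
```

With Ray, the object would presumably be `ray.data.DataContext.get_current()`, the attribute `"shuffle_strategy"`, and the action a lambda wrapping the `write_sharded(...)` call shown above.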
AWS S3 + Python Writer
Test
from shardyfusion import WriteConfig
from shardyfusion.writer.python import write_sharded
records = [{"id": i, "payload": f"value-{i}".encode()} for i in range(10000)]
config = WriteConfig(num_dbs=4, s3_prefix="s3://my-org-shardyfusion-test/python-test")
# Single-process
result = write_sharded(records, config, key_fn=lambda r: r["id"], value_fn=lambda r: r["payload"])
# Multi-process
result = write_sharded(records, config, key_fn=lambda r: r["id"], value_fn=lambda r: r["payload"], parallel=True)
- [ ] Single-process mode completes
- [ ] Parallel mode completes
- [ ] Rate limiting verified
- [ ] Read test passes
GCS with S3-Compatible API
Prerequisites
- GCS bucket with interoperability access keys
- HMAC keys generated in GCS console
Configuration
from shardyfusion import WriteConfig
from shardyfusion.credentials import StaticCredentialProvider
from shardyfusion.type_defs import S3ConnectionOptions
credential_provider = StaticCredentialProvider(
access_key_id="GOOG...",
secret_access_key="...",
)
connection_options: S3ConnectionOptions = {
"endpoint_url": "https://storage.googleapis.com",
"addressing_style": "path",
}
config = WriteConfig(
num_dbs=4,
s3_prefix="s3://my-gcs-bucket/shardyfusion-test",
credential_provider=credential_provider,
s3_connection_options=connection_options,
)
- [ ] Write completes via GCS interoperability layer
- [ ] Manifest readable from GCS
- [ ] Reader works with GCS S3-compatible endpoint
Write Throughput (rows/sec)
| Backend | num_dbs=8 | num_dbs=64 | num_dbs=256 |
|---|---|---|---|
| Spark (EMR m5.xlarge x4) | _____ | _____ | _____ |
| Dask (local, 4 workers) | _____ | _____ | _____ |
| Ray (local, 4 workers) | _____ | _____ | _____ |
| Python (single-process) | _____ | _____ | _____ |
| Python (parallel) | _____ | _____ | _____ |
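The throughput cells can be filled in by timing each write and dividing the rows written by the elapsed wall-clock time. A minimal sketch follows; `timed` and `rows_per_second` are hypothetical helpers, and wall-clock timing of a single run is an assumption about how the numbers should be collected.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(action: Callable[[], T]) -> Tuple[T, float]:
    """Run action() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def rows_per_second(rows: int, elapsed_s: float) -> float:
    """Convert a row count and a duration into rows/sec."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return rows / elapsed_s
```

A typical use would wrap the write call, as in `result, elapsed = timed(lambda: write_sharded(...))`, and then pass `result.stats.rows_written` and `elapsed` to `rows_per_second`.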
Read Latency (ms)
| Operation | p50 | p95 | p99 |
|---|---|---|---|
| get() (warm cache) | _____ | _____ | _____ |
| get() (cold) | _____ | _____ | _____ |
| multi_get(100 keys) | _____ | _____ | _____ |
| refresh() | _____ | _____ | _____ |
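The percentile columns can be produced by timing an operation repeatedly and taking nearest-rank percentiles of the samples. The sketch below is one way to do it; the helper names are hypothetical, and the nearest-rank method is a choice made here, not something the checklist prescribes.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def measure_latencies_ms(op: Callable[[], object], n: int) -> List[float]:
    """Call op() n times and return the per-call latencies in ms, sorted ascending."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        op()
        samples.append((time.perf_counter() - start) * 1000.0)
    return sorted(samples)


def percentile(sorted_samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile of an ascending-sorted sequence (p in 1..100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = math.ceil(p * len(sorted_samples) / 100)
    return sorted_samples[max(rank, 1) - 1]


def summarize(sorted_samples: Sequence[float]) -> Dict[str, float]:
    """Return the three percentiles used in the latency table."""
    return {f"p{p}": percentile(sorted_samples, p) for p in (50, 95, 99)}
```

For a row such as get() (warm cache), `op` would be something like `lambda: reader.get(42)` after a first warm-up call; a few thousand samples are needed for p99 to be meaningful.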
Refresh Under Load
- [ ] 10 threads calling get() during refresh() complete with no errors
- [ ] Old readers closed only after all in-flight reads complete
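The first item above can be exercised with a small thread harness that keeps calling get() while refresh() runs once, collecting any exception raised on either side. This is a sketch under stated assumptions: `hammer_during_refresh` is hypothetical, and it assumes only that the reader exposes `get(key)` and `refresh()`.

```python
import threading
from typing import Any, Iterable, List


def hammer_during_refresh(reader: Any, keys: Iterable[Any], n_threads: int = 10) -> List[BaseException]:
    """Call reader.get() from n_threads while reader.refresh() runs once.

    Returns every exception raised by get() or refresh(); an empty list is a pass.
    """
    keys = list(keys)
    errors: List[BaseException] = []
    lock = threading.Lock()
    stop = threading.Event()
    start = threading.Barrier(n_threads + 1)  # workers plus the refreshing thread

    def worker() -> None:
        start.wait()
        while True:  # always do at least one pass, even if refresh() is very fast
            for key in keys:
                try:
                    reader.get(key)
                except Exception as exc:
                    with lock:
                        errors.append(exc)
            if stop.is_set():
                break

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    start.wait()
    try:
        reader.refresh()
    except Exception as exc:
        with lock:
            errors.append(exc)
    finally:
        stop.set()
        for t in threads:
            t.join()
    return errors
```

This only detects errors surfaced as exceptions; whether old readers are closed after in-flight reads complete still has to be confirmed separately, for example from logs.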
Cost Estimates
| Scenario | S3 PUT | S3 GET | Data Transfer | Approximate Cost |
|---|---|---|---|---|
| Write 1M rows / 8 shards | ~16 PUTs | n/a | ~100 MB | < $0.01 |
| Write 10M rows / 64 shards | ~640 PUTs | n/a | ~1 GB | < $0.10 |
| Read 10K gets | n/a | ~10K GETs | ~10 MB | < $0.01 |
| Full test suite (all backends) | ~2K PUTs | ~50K GETs | ~5 GB | < $1.00 |
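The approximate costs can be sanity-checked with simple arithmetic: requests are billed per thousand and transfer per GB. The default rates in the sketch below are assumptions (roughly S3 Standard list prices in us-east-1 at the time of writing, with all transfer priced as internet egress); they vary by region and change over time, so check current pricing before relying on them.

```python
def estimate_s3_cost_usd(
    puts: int,
    gets: int,
    transfer_gb: float,
    put_per_1k: float = 0.005,     # assumed USD per 1,000 PUT requests
    get_per_1k: float = 0.0004,    # assumed USD per 1,000 GET requests
    transfer_per_gb: float = 0.09, # assumed USD per GB transferred out
) -> float:
    """Rough upper-bound cost: request charges plus data transfer."""
    request_cost = puts / 1000 * put_per_1k + gets / 1000 * get_per_1k
    return request_cost + transfer_gb * transfer_per_gb
```

Under these assumed rates, the full test suite row (2K PUTs, 50K GETs, 5 GB) comes to about $0.48, which is consistent with the "< $1.00" bound in the table.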
Cleanup
After testing, remove test data:
aws s3 rm s3://my-org-shardyfusion-test/ --recursive