Cloud Testing Checklist

This document provides a structured checklist for validating shardyfusion against real cloud infrastructure. All tests below are manual — they are not part of the automated CI pipeline.

AWS S3 + Spark (EMR)

Prerequisites

  • AWS account with EMR and S3 permissions
  • S3 bucket provisioned (e.g. s3://my-org-shardyfusion-test/)
  • EMR cluster with Spark 3.5 or 4.x, Python 3.11-3.13, Java 17

Write Test

On the EMR master node, install the Spark writer extra:

pip install "shardyfusion[writer-spark]"

Then, in a PySpark session (where df is an existing Spark DataFrame with "id" and "payload" columns):
from shardyfusion import WriteConfig, ValueSpec
from shardyfusion.writer.spark import write_sharded

config = WriteConfig(num_dbs=8, s3_prefix="s3://my-org-shardyfusion-test/spark-test")
result = write_sharded(df, config, key_col="id", value_spec=ValueSpec.binary_col("payload"))

# Verify
assert result.stats.rows_written > 0
assert len(result.winners) == 8
print(f"Manifest: {result.manifest_ref}")
  • [ ] num_dbs=8 completes successfully
  • [ ] num_dbs=64 completes successfully
  • [ ] num_dbs=256 completes successfully
  • [ ] Manifest JSON is valid and readable

Read Test

from shardyfusion import ShardedReader

reader = ShardedReader(
    s3_prefix="s3://my-org-shardyfusion-test/spark-test",
    local_root="/tmp/reader-test",
)

value = reader.get(42)
batch = reader.multi_get([1, 2, 3, 42, 100])
info = reader.snapshot_info()
print(info)
reader.close()
  • [ ] get() returns correct values
  • [ ] multi_get() returns all expected keys
  • [ ] snapshot_info() shows correct metadata
  • [ ] Reader works from a separate machine (not the EMR cluster)

Cross-Writer Test

  • [ ] Write with Spark, read with Python reader on a different host
  • [ ] Routing produces identical shard assignments

AWS S3 + Dask

Prerequisites

  • Python 3.11-3.13 environment with pip install "shardyfusion[writer-dask]"
  • Use pip install "shardyfusion[writer-dask-sqlite]" to exercise the SQLite backend instead
  • S3 credentials configured via env vars or ~/.aws/credentials

Test

import dask.dataframe as dd
from shardyfusion import WriteConfig, ValueSpec
from shardyfusion.writer.dask import write_sharded

ddf = dd.from_pandas(pdf, npartitions=4)  # pdf: an existing pandas DataFrame with "id" and "payload" columns
config = WriteConfig(num_dbs=8, s3_prefix="s3://my-org-shardyfusion-test/dask-test")
result = write_sharded(ddf, config, key_col="id", value_spec=ValueSpec.binary_col("payload"))
  • [ ] Hash sharding completes
  • [ ] CEL sharding completes (with ShardingSpec(strategy=ShardingStrategy.CEL, ...))
  • [ ] Rate limiting verified (max_writes_per_second=1000)
  • [ ] Read test with ShardedReader passes

AWS S3 + Ray

Prerequisites

  • Python 3.11-3.13 environment with pip install "shardyfusion[writer-ray]"

  • Ray cluster or local mode

Test

import ray
from shardyfusion import WriteConfig, ValueSpec
from shardyfusion.writer.ray import write_sharded

ds = ray.data.from_items([{"id": i, "payload": b"data"} for i in range(10000)])
config = WriteConfig(num_dbs=8, s3_prefix="s3://my-org-shardyfusion-test/ray-test")
result = write_sharded(ds, config, key_col="id", value_spec=ValueSpec.binary_col("payload"))
  • [ ] Hash sharding completes
  • [ ] Range sharding completes
  • [ ] shuffle_strategy restored after write (check DataContext.shuffle_strategy)
  • [ ] Read test with ShardedReader passes

AWS S3 + Python Writer

Test

from shardyfusion import WriteConfig
from shardyfusion.writer.python import write_sharded

records = [{"id": i, "payload": f"value-{i}".encode()} for i in range(10000)]
config = WriteConfig(num_dbs=4, s3_prefix="s3://my-org-shardyfusion-test/python-test")

# Single-process
result = write_sharded(records, config, key_fn=lambda r: r["id"], value_fn=lambda r: r["payload"])

# Multi-process
result = write_sharded(records, config, key_fn=lambda r: r["id"], value_fn=lambda r: r["payload"], parallel=True)
  • [ ] Single-process mode completes
  • [ ] Parallel mode completes
  • [ ] Rate limiting verified
  • [ ] Read test passes

GCS with S3-Compatible API

Prerequisites

  • GCS bucket with interoperability access keys
  • HMAC keys generated in GCS console

Configuration

from shardyfusion.credentials import StaticCredentialProvider
from shardyfusion.type_defs import S3ConnectionOptions

credential_provider = StaticCredentialProvider(
    access_key_id="GOOG...",
    secret_access_key="...",
)
connection_options: S3ConnectionOptions = {
    "endpoint_url": "https://storage.googleapis.com",
    "addressing_style": "path",
}

config = WriteConfig(
    num_dbs=4,
    s3_prefix="s3://my-gcs-bucket/shardyfusion-test",
    credential_provider=credential_provider,
    s3_connection_options=connection_options,
)
  • [ ] Write completes via GCS interoperability layer
  • [ ] Manifest readable from GCS
  • [ ] Reader works with GCS S3-compatible endpoint

Performance Benchmarks

Write Throughput (rows/sec)

| Backend | num_dbs=8 | num_dbs=64 | num_dbs=256 |
| --- | --- | --- | --- |
| Spark (EMR m5.xlarge x4) | _____ | _____ | _____ |
| Dask (local, 4 workers) | _____ | _____ | _____ |
| Ray (local, 4 workers) | _____ | _____ | _____ |
| Python (single-process) | _____ | _____ | _____ |
| Python (parallel) | _____ | _____ | _____ |

Read Latency (ms)

| Operation | p50 | p95 | p99 |
| --- | --- | --- | --- |
| get() (warm cache) | _____ | _____ | _____ |
| get() (cold) | _____ | _____ | _____ |
| multi_get(100 keys) | _____ | _____ | _____ |
| refresh() | _____ | _____ | _____ |

Refresh Under Load

  • [ ] 10 threads calling get() during refresh() — no errors
  • [ ] Old readers closed only after all in-flight reads complete

Cost Estimates

| Scenario | S3 PUT | S3 GET | Data Transfer | Approximate Cost |
| --- | --- | --- | --- | --- |
| Write 1M rows / 8 shards | ~16 PUTs | - | ~100MB | < $0.01 |
| Write 10M rows / 64 shards | ~640 PUTs | - | ~1GB | < $0.10 |
| Read 10K gets | - | ~10K GETs | ~10MB | < $0.01 |
| Full test suite (all backends) | ~2K PUTs | ~50K GETs | ~5GB | < $1.00 |

Cleanup

After testing, remove test data:

aws s3 rm s3://my-org-shardyfusion-test/ --recursive