FrameX is an Arrow-backed Python library for parallel dataframe and array processing on a single machine.
It combines:
- Pandas-like tabular APIs (DataFrame, Series, GroupBy)
- NumPy-compatible chunked arrays (NDArray with NumPy protocol support)
- Arrow-native storage/interop (to_arrow, Parquet/IPC I/O)
- Eager execution with optional lazy pipelines (.lazy().collect())
- Runtime backends for local threads/processes plus optional Ray/Dask executors
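To make the eager-versus-lazy distinction in the list above concrete, here is a minimal plain-Python sketch of a deferred pipeline. The `LazyPipeline` class and its `filter`/`with_column` methods are hypothetical names invented for illustration; they are not FrameX's `LazyFrame` API, and the real library may plan and execute work quite differently.

```python
from typing import Callable, Iterable, List


class LazyPipeline:
    """Hypothetical illustration of deferred execution; not FrameX's LazyFrame."""

    def __init__(self, rows: Iterable[dict]):
        self._rows = rows
        self._steps: List[Callable[[Iterable[dict]], Iterable[dict]]] = []

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyPipeline":
        # Only record the step; no row is inspected yet.
        self._steps.append(lambda rows: (r for r in rows if predicate(r)))
        return self

    def with_column(self, name: str, fn: Callable[[dict], object]) -> "LazyPipeline":
        # Record a step that adds a derived column to every row.
        self._steps.append(lambda rows: ({**r, name: fn(r)} for r in rows))
        return self

    def collect(self) -> List[dict]:
        # Run all recorded steps in order and materialize the result.
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
        return list(rows)


rows = [
    {"group": "a", "value": 10},
    {"group": "a", "value": 20},
    {"group": "b", "value": 30},
]
pipeline = (
    LazyPipeline(rows)
    .filter(lambda r: r["value"] >= 20)
    .with_column("double", lambda r: r["value"] * 2)
)
result = pipeline.collect()  # work happens only here
print(result)
```

In an eager style each call would return finished data immediately; in the lazy style sketched here, the chained calls only build a plan and `collect()` triggers the work.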
FrameX is aimed at local analytics workflows that are bigger than comfortable single-threaded scripts but do not yet require distributed infrastructure.
Typical fit:
- ETL and analytics pipelines on medium-to-large local datasets
- feature engineering workflows that mix table and array operations
- migration paths from Pandas scripts where API familiarity matters
From PyPI:
pip install pyframe-xpy
From source:
git clone https://github.com/aeiwz/FrameX.git
cd FrameX
pip install -e .
Requirements:
- Python >=3.10
- Core dependencies: pyarrow, numpy
- Optional compatibility: pandas (pip install pyframe-xpy[pandas_compat])
import framex as fx
df = fx.DataFrame(
{
"group": ["a", "a", "b"],
"value": [10, 20, 30],
"is_refund": [False, True, False],
}
)
result = (
df.filter(~df["is_refund"])
.groupby("group")
.agg({"value": ["sum", "mean", "count"]})
.sort("value_sum", ascending=False)
)
print(result.to_pandas())
Top-level imports:
import framex as fx
Main objects and helpers:
- fx.DataFrame, fx.Series, fx.Index, fx.LazyFrame
- fx.NDArray, fx.array(...)
- fx.read_parquet, fx.write_parquet, fx.read_ipc, fx.write_ipc, fx.read_csv, fx.write_csv
- fx.read_json, fx.write_json, fx.read_ndjson, fx.write_ndjson
- fx.read_file, fx.write_file for format auto-detection
- Compression:
  - transparent extension-based compression for read_file/write_file
  - supported wrappers: .gz, .bz2, .xz, .zip, and .zst/.zstd (when zstandard is installed)
- fx.from_pandas, fx.from_dask, fx.from_ray, fx.from_dataframe
- fx.get_config, fx.set_backend, fx.set_workers, fx.set_serializer, fx.set_kernel_backend
- fx.set_array_backend for auto/NumExpr/Numba/JAX/PyTorch/CuPy acceleration modes
- fx.recommend_best_performance_config() to inspect hardware-tuned settings
- fx.auto_configure_hardware() to apply best-performance config automatically
- fx.StreamProcessor for micro-batch streaming pipelines
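The extension-based compression mentioned above can be pictured with a small standard-library sketch. This is not FrameX's implementation: `open_by_extension` and the `_OPENERS` table are hypothetical names, and the sketch covers only `.gz`, `.bz2`, and `.xz` in text mode (`.zip` and `.zst`/`.zstd` are left out to keep it short and dependency-free).

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Hypothetical suffix-to-opener table (illustration only).
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_by_extension(path, mode="rt"):
    """Open a text file, choosing a (de)compressor from the last suffix."""
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "events.csv.gz"
    # Writing through the helper compresses because of the .gz suffix.
    with open_by_extension(target, "wt") as fh:
        fh.write("group,value\na,10\n")
    # Reading through the helper decompresses transparently.
    with open_by_extension(target, "rt") as fh:
        round_trip = fh.read()
    # The raw bytes on disk start with the gzip magic number.
    with open(target, "rb") as fh:
        magic = fh.read(2)

print(round_trip)
```

Unknown suffixes fall back to the built-in `open`, so the same call works for uncompressed files.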
Acceleration extras:
pip install pyframe-xpy[accel] # numexpr + numba
pip install pyframe-xpy[gpu] # cupy (CUDA)
pip install pyframe-xpy[ml_accel] # jax + pytorch
pip install pyframe-xpy[pandas_fast] # modin backend
pip install pyframe-xpy[distributed] # Dask + Ray distributed/HPC backends
pip install zstandard # .zst/.zstd file compression
Backend notes:
- fx.set_backend("threads" | "processes" | "ray" | "dask" | "hpc")
- Ray and Dask execution backends require their respective runtimes to be installed/available.
- HPC mode ("hpc") uses cluster-oriented execution via Dask or Ray:
  - FRAMEX_HPC_ENGINE=dask|ray
  - FRAMEX_DASK_SCHEDULER_ADDRESS=<tcp://...> to connect existing Dask clusters
  - FRAMEX_RAY_ADDRESS=<ray://...> to connect existing Ray clusters
  - optional SLURM bootstrap: FRAMEX_DASK_SLURM=1 (requires dask-jobqueue)
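As a rough illustration of how the HPC environment variables above fit together, the sketch below gathers them into one dictionary and validates the engine name. `describe_hpc_env` is a hypothetical helper written for this README, not a FrameX function, and the scheduler address is a placeholder; how FrameX itself reads and defaults these variables is not specified here.

```python
import os
from typing import Mapping, Optional


def describe_hpc_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the HPC-related variables into one dict (illustration only)."""
    env = os.environ if env is None else env
    engine = env.get("FRAMEX_HPC_ENGINE")
    # The notes above list only these two engines.
    if engine is not None and engine not in ("dask", "ray"):
        raise ValueError(f"FRAMEX_HPC_ENGINE must be 'dask' or 'ray', got {engine!r}")
    return {
        "engine": engine,
        "dask_scheduler_address": env.get("FRAMEX_DASK_SCHEDULER_ADDRESS"),
        "ray_address": env.get("FRAMEX_RAY_ADDRESS"),
        "dask_slurm": env.get("FRAMEX_DASK_SLURM") == "1",
    }


# Placeholder values; pass nothing to inspect the real process environment.
settings = describe_hpc_env(
    {
        "FRAMEX_HPC_ENGINE": "dask",
        "FRAMEX_DASK_SCHEDULER_ADDRESS": "tcp://127.0.0.1:8786",
    }
)
print(settings)
```

A check like this can be run before calling fx.set_backend("hpc") to confirm that the variables are set as intended.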
Test support notes:
- Some tests are optional-backend gated and intentionally skipped when deps are not installed.
- Typical skip reasons: missing dask.distributed, dask.dataframe, ray, or ray.data.
- Run full optional matrix locally:
pip install pyframe-xpy[distributed]
pytest -q
Canonical docs are in docs/documents:
- Overview
- Features
- Getting Started
- Installation
- Tutorial: ETL Pipeline
- Tutorial: NumPy NDArray Interop
- Use Cases
- Configuration Guide
- Performance Test
- Architecture
- API Reference
- Roadmap
- FAQ
The docs website lives in website (Next.js App Router).
Main docs routes:
- http://localhost:3000/docs/features
- http://localhost:3000/docs/tutorial_etl_pipeline
- http://localhost:3000/docs/use_cases
- http://localhost:3000/docs/configuration_guide
- http://localhost:3000/docs/performance_test
Run locally:
cd website
npm install
npm run dev
Production build:
npm run build
npm run start
Install dev dependencies:
pip install -e .[dev]
Run tests:
pytest
Benchmark code and generated reports are in benchmarks.
Run the full benchmark suite (includes in-terminal progress bar and report generation):
python3 -m benchmarks.benchmark_suite
Run workload capability matrix checks:
python3 -m benchmarks.check_framex_workloads
Benchmark outputs are written to benchmarks/results:
- benchmark_results.json
- benchmark_results.csv
- benchmark_report.md
- framex_workload_check.json
- performance_speedup.png
- parallel_processing_scaling.png
- multiprocessing_scaling.png
- memory_peak_rss.png
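After a benchmark run, a quick check can confirm that every output file listed above was produced. The sketch below is a hypothetical convenience script, not part of the benchmarks package; `missing_outputs` is an invented name, and a temporary directory stands in for benchmarks/results so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# File names taken from the list of benchmark outputs above.
EXPECTED_OUTPUTS = [
    "benchmark_results.json",
    "benchmark_results.csv",
    "benchmark_report.md",
    "framex_workload_check.json",
    "performance_speedup.png",
    "parallel_processing_scaling.png",
    "multiprocessing_scaling.png",
    "memory_peak_rss.png",
]


def missing_outputs(results_dir):
    """Return the expected output files that are absent from results_dir."""
    results_dir = Path(results_dir)
    return [name for name in EXPECTED_OUTPUTS if not (results_dir / name).is_file()]


# Demo: a temporary directory stands in for benchmarks/results.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "benchmark_results.json").write_text("{}", encoding="utf-8")
    still_missing = missing_outputs(tmp)

print(still_missing)
```

Against a real checkout, calling `missing_outputs("benchmarks/results")` from the repository root should return an empty list once both benchmark commands have completed.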
FrameX is pre-1.0 (current version 0.1.2) and in active development.
- APIs are usable and documented
- compatibility/performance behavior will continue to evolve
- pin versions for production-critical workloads