FrameX

FrameX is an Arrow-backed Python library for parallel dataframe and array processing on a single machine.

It combines:

  • Pandas-like tabular APIs (DataFrame, Series, GroupBy)
  • NumPy-compatible chunked arrays (NDArray with NumPy protocol support)
  • Arrow-native storage/interop (to_arrow, Parquet/IPC I/O)
  • Eager execution with optional lazy pipelines (.lazy().collect())
  • Runtime backends for local threads/processes plus optional Ray/Dask executors
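
As a rough mental model for the local thread backend, chunked parallel execution can be sketched with the standard library alone. This only illustrates the general split / map / combine idea and is not FrameX's internal implementation; `split_into_chunks` and `parallel_sum` are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def parallel_sum(values, chunk_size=1000, workers=4):
    """Sum each chunk on a worker thread, then combine the partial sums."""
    chunks = split_into_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


total = parallel_sum(list(range(10_000)), chunk_size=2_500)
print(total)  # 49995000
```

Pure-Python work like this is limited by the interpreter lock on standard CPython builds, so threads mainly pay off when the per-chunk work releases it (as many native kernels do); that is one reason a library might also offer a process backend.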

Why FrameX

FrameX is aimed at local analytics workflows that have outgrown single-threaded scripts but do not yet require distributed infrastructure.

Typical fit:

  • ETL and analytics pipelines on medium-to-large local datasets
  • feature engineering workflows that mix table and array operations
  • migration paths from Pandas scripts where API familiarity matters

Installation

From PyPI:

pip install pyframe-xpy

From source:

git clone https://github.com/aeiwz/FrameX.git
cd FrameX
pip install -e .

Requirements:

  • Python >=3.10
  • Core dependencies: pyarrow, numpy
  • Optional compatibility: pandas (pip install "pyframe-xpy[pandas_compat]")

Quick Start

import framex as fx

df = fx.DataFrame(
    {
        "group": ["a", "a", "b"],
        "value": [10, 20, 30],
        "is_refund": [False, True, False],
    }
)

result = (
    df.filter(~df["is_refund"])
      .groupby("group")
      .agg({"value": ["sum", "mean", "count"]})
      .sort("value_sum", ascending=False)
)

print(result.to_pandas())
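
To make the expected result concrete, the same filter / group / aggregate / sort steps can be traced in plain Python without FrameX installed. This is a reference computation only; the `value_mean` and `value_count` keys are assumed by analogy with the `value_sum` column used in the Quick Start, and the actual FrameX output columns may be named differently.

```python
from collections import defaultdict

rows = [
    {"group": "a", "value": 10, "is_refund": False},
    {"group": "a", "value": 20, "is_refund": True},
    {"group": "b", "value": 30, "is_refund": False},
]

# Step 1: keep only non-refund rows (the filter step).
kept = [row for row in rows if not row["is_refund"]]

# Step 2: collect the values of each group (the groupby step).
values_by_group = defaultdict(list)
for row in kept:
    values_by_group[row["group"]].append(row["value"])

# Step 3: compute sum, mean, and count per group (the agg step).
summary = [
    {
        "group": group,
        "value_sum": sum(values),
        "value_mean": sum(values) / len(values),
        "value_count": len(values),
    }
    for group, values in values_by_group.items()
]

# Step 4: order groups by their summed value, largest first (the sort step).
summary.sort(key=lambda record: record["value_sum"], reverse=True)

for record in summary:
    print(record)
```

With the sample data, the refunded row (a, 20) is dropped, so group b (sum 30) sorts ahead of group a (sum 10).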

Core API

Top-level imports:

import framex as fx

Main objects and helpers:

  • fx.DataFrame, fx.Series, fx.Index, fx.LazyFrame
  • fx.NDArray, fx.array(...)
  • fx.read_parquet, fx.write_parquet, fx.read_ipc, fx.write_ipc, fx.read_csv, fx.write_csv
  • fx.read_json, fx.write_json, fx.read_ndjson, fx.write_ndjson
  • fx.read_file, fx.write_file for format auto-detection
  • fx.from_pandas, fx.from_dask, fx.from_ray, fx.from_dataframe
  • fx.get_config, fx.set_backend, fx.set_workers, fx.set_serializer, fx.set_kernel_backend
  • fx.set_array_backend for auto/NumExpr/Numba/JAX/PyTorch/CuPy acceleration modes
  • fx.recommend_best_performance_config() to inspect hardware-tuned settings
  • fx.auto_configure_hardware() to apply best-performance config automatically
  • fx.StreamProcessor for micro-batch streaming pipelines

Compression:

  • transparent extension-based compression for read_file / write_file
  • supported wrappers: .gz, .bz2, .xz, .zip, and .zst/.zstd (when zstandard is installed)
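
The extension-based compression described for read_file / write_file can be pictured as a lookup from file suffix to an opener. The sketch below uses only the standard library and is not FrameX's implementation; `open_text` is a hypothetical helper, and it covers only .gz, .bz2, and .xz (.zip is an archive format and .zst needs the third-party zstandard package, so both are left out here).

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Map a compression suffix to the matching standard-library opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path, mode="rt"):
    """Open a text file, choosing a (de)compressor from the last suffix.

    Only text modes ("rt", "wt", "at") are supported in this sketch.
    """
    suffix = Path(path).suffix.lower()
    opener = OPENERS.get(suffix, open)  # fall back to the plain built-in open
    return opener(path, mode, encoding="utf-8")


# Round-trip the same text through each supported wrapper.
text = "a,b\n1,2\n"
round_trips = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data.csv", "data.csv.gz", "data.csv.bz2", "data.csv.xz"):
        path = Path(tmp) / name
        with open_text(path, "wt") as handle:
            handle.write(text)
        with open_text(path, "rt") as handle:
            round_trips[name] = handle.read()

print(all(value == text for value in round_trips.values()))
```

Only the final suffix is inspected, so data.csv.gz is treated as gzip-wrapped and the inner .csv would still be available for format detection in a second step.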

Optional extras (acceleration, distributed backends, compression):

pip install "pyframe-xpy[accel]"        # numexpr + numba
pip install "pyframe-xpy[gpu]"          # cupy (CUDA)
pip install "pyframe-xpy[ml_accel]"     # jax + pytorch
pip install "pyframe-xpy[pandas_fast]"  # modin backend
pip install "pyframe-xpy[distributed]"  # Dask + Ray distributed/HPC backends
pip install zstandard                   # .zst/.zstd file compression

Backend notes:

  • fx.set_backend("threads" | "processes" | "ray" | "dask" | "hpc")
  • Ray and Dask execution backends require their respective runtimes to be installed/available.
  • HPC mode ("hpc") uses cluster-oriented execution via Dask or Ray:
    • FRAMEX_HPC_ENGINE=dask|ray
    • FRAMEX_DASK_SCHEDULER_ADDRESS=<tcp://...> to connect existing Dask clusters
    • FRAMEX_RAY_ADDRESS=<ray://...> to connect existing Ray clusters
    • optional SLURM bootstrap: FRAMEX_DASK_SLURM=1 (requires dask-jobqueue)
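
One way to keep the HPC environment variables above consistent is to build them in a small helper before selecting the backend. This is a sketch under assumptions: `hpc_environment` is a hypothetical helper, the scheduler address is a placeholder, and treating FRAMEX_DASK_SLURM as Dask-only is inferred from its name. The README does not say when FrameX reads these variables, so setting them before importing framex is the cautious choice.

```python
import os

VALID_ENGINES = ("dask", "ray")


def hpc_environment(engine, address=None, use_slurm=False):
    """Build the FRAMEX_* variables for HPC mode as a plain dict."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"engine must be one of {VALID_ENGINES}, got {engine!r}")

    env = {"FRAMEX_HPC_ENGINE": engine}

    if address is not None:
        # Each engine has its own address variable.
        if engine == "dask":
            env["FRAMEX_DASK_SCHEDULER_ADDRESS"] = address
        else:
            env["FRAMEX_RAY_ADDRESS"] = address

    if use_slurm:
        # Assumption: the SLURM bootstrap only applies to the Dask engine.
        if engine != "dask":
            raise ValueError("use_slurm requires engine='dask'")
        env["FRAMEX_DASK_SLURM"] = "1"

    return env


# Placeholder address; replace it with the address of an existing cluster.
env = hpc_environment("dask", address="tcp://scheduler.example:8786")
os.environ.update(env)
print(env)
```

After this, fx.set_backend("hpc") would select HPC mode as listed in the backend notes.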

Test support notes:

  • Some tests are optional-backend gated and intentionally skipped when deps are not installed.
  • Typical skip reasons: missing dask.distributed, dask.dataframe, ray, or ray.data.
  • Run the full optional matrix locally:
pip install "pyframe-xpy[distributed]"
pytest -q
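
Before running the suite, it can help to see which optional modules are present, since those determine which tests are skipped. The check below is a standard-library sketch, not part of FrameX; `is_importable` is a hypothetical helper. It only locates modules without fully importing them, so a module that is found but fails at import time would still be reported as available.

```python
from importlib.util import find_spec

# Modules named in the typical skip reasons above.
OPTIONAL_MODULES = ("dask.distributed", "dask.dataframe", "ray", "ray.data")


def is_importable(module_name):
    """Return True if the module can be located on the current path."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False


missing = [name for name in OPTIONAL_MODULES if not is_importable(name)]

if missing:
    print("Tests depending on these modules will likely be skipped:", missing)
else:
    print("All optional backend modules were found.")
```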

Documentation

Canonical docs are in docs/documents.

Website (Docs UI)

The docs website lives in website (Next.js App Router).

Main docs routes (available once the local dev server is running):

  • http://localhost:3000/docs/features
  • http://localhost:3000/docs/tutorial_etl_pipeline
  • http://localhost:3000/docs/use_cases
  • http://localhost:3000/docs/configuration_guide
  • http://localhost:3000/docs/performance_test

Run locally:

cd website
npm install
npm run dev

Production build:

npm run build
npm run start

Development

Install dev dependencies:

pip install -e ".[dev]"

Run tests:

pytest

Benchmarks

Benchmark code and generated reports are in benchmarks.

Run the full benchmark suite (shows an in-terminal progress bar and generates the reports):

python3 -m benchmarks.benchmark_suite

Run workload capability matrix checks:

python3 -m benchmarks.check_framex_workloads

Benchmark outputs are written to benchmarks/results:

  • benchmark_results.json
  • benchmark_results.csv
  • benchmark_report.md
  • framex_workload_check.json
  • performance_speedup.png
  • parallel_processing_scaling.png
  • multiprocessing_scaling.png
  • memory_peak_rss.png
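
After a benchmark run, a quick check that every listed artifact was written can catch a partially failed run. The helper below is a hypothetical convenience, not part of the benchmarks package; the file names are taken from the list above, and the demo uses a temporary directory rather than a real benchmarks/results folder.

```python
import tempfile
from pathlib import Path

EXPECTED_OUTPUTS = (
    "benchmark_results.json",
    "benchmark_results.csv",
    "benchmark_report.md",
    "framex_workload_check.json",
    "performance_speedup.png",
    "parallel_processing_scaling.png",
    "multiprocessing_scaling.png",
    "memory_peak_rss.png",
)


def missing_outputs(results_dir="benchmarks/results"):
    """Return the expected output files that are not present in results_dir."""
    root = Path(results_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]


# Demo: simulate a run that produced only two of the eight files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "benchmark_results.json").write_text("{}", encoding="utf-8")
    (Path(tmp) / "benchmark_report.md").write_text("# report\n", encoding="utf-8")
    still_missing = missing_outputs(tmp)

print(len(still_missing), "files missing")
```

In a real checkout, calling missing_outputs() with no arguments from the repository root checks benchmarks/results directly.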

Project Status

FrameX is pre-1.0 (currently 0.1.2) and under active development.

  • APIs are usable and documented
  • compatibility/performance behavior will continue to evolve
  • pin versions for production-critical workloads

License

MIT