KevinMoonLab/PF-GAP

PF-GAP


PF-GAP is a flexible, extensible framework for proximity-based learning on time series and structured data. It builds on the original Proximity Forest (PF) model and introduces:

  • GAP proximities
    • Supervised imputation (test and train sets)
    • Intra-class outlier scores
    • Can be returned for visualization, use as an SVM kernel, etc.
  • Custom distance functions (in Python, Maple, or Java)
  • Support for multivariate and variable-length time series
  • Parallel training and proximity computation
  • Flexible data formatting and imputation options
  • Regression (Extrinsic)
  • Customizable node purity measures and aggregation schemes

📚 Table of Contents

  1. Installation
  2. Repository Structure
  3. Quickstart
  4. Usage
  5. Demo Notebooks
  6. Data Format
  7. Output Files
  8. Citation

🛠 Installation

Requirements

  • Java 17+
  • Recommended: Python 3.8+ (tested with Python 3.13)
  • Python packages (for running the demo files):
  pip install numpy pandas matplotlib scikit-learn aeon
  • Optional: Maple 2016+ (for Maple-based distance functions)

📂 Repository Structure

PF-GAP/
├── PFGAP/                  # Java source code
├── docs/                   # Project feature and usage documentation
│   ├── custom_distances/   # Documentation and examples for custom Java distances
│   ├── demo/               # Demo scripts (converted from notebooks), toy data, example Maple/Python
│   └── *.md                # Markdown files for feature documentation
├── Application/
│   ├── PFGAP.jar           # Compiled Java executable
│   └── PF_wrapper.py       # Python interface to PFGAP.jar
└── README.md

⚡ Quickstart

Download the PFGAP.jar file. For convenience, also download the PF_wrapper.py file, which lets you call the JAR from Python.

🚀 Usage

For more detailed descriptions, please refer to the documentation.

Training

Use PF_wrapper.train() to train a proximity forest:

import PF_wrapper as PF

PF.train(
    train_file="Data/GunPoint_TRAIN.tsv",
    model_name="Spartacus",
    return_proximities=True,
    output_directory="training_output",
    entry_separator="\t"
)

Prediction

Use PF_wrapper.predict() to evaluate a saved model on a test set:

PF.predict(
    model_name="training_output/Spartacus",
    testfile="Data/GunPoint_TEST.tsv",
    entry_separator="\t"
)

Imputation

PF-GAP supports iterative imputation for both training and test sets:

PF.train(
    train_file="Data/differentlengths.txt",
    test_file="Data/differentlengths_test.txt",
    train_labels="Data/differentlabels.txt",
    test_labels="Data/differentlabels_test.txt",
    impute_training_data=True,
    return_imputed_training=True,
    impute_testing_data=True,
    return_imputed_testing=True,
    impute_iterations=5,
    data_dimension=2,
    entry_separator=",",
    array_separator=":"
)

Custom Distances

You can define your own distance function in:

  • Java: compile one or more .class or .jar files. See docs/custom_distances for more information.
  • Python: PythonDistance.py with a function Distance(list1, list2)
  • Maple: MapleDistance.mpl with a function Distance(list1, list2)

Specify the custom distance source using:

distances=["javadistance:customdistance.class"]

or

distances=["javadistance:userdistances.jar:customdistance"]

or

distances=["python"]  # or ["maple"]
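
As a minimal sketch of the Python option, a PythonDistance.py file might look like the following. Only the file name and the Distance(list1, list2) signature come from the description above. The choice of Euclidean distance, the assumption that both inputs are flat, equal-length numeric sequences, and the float return value are illustrative assumptions rather than PF-GAP's documented contract.

```python
# PythonDistance.py (hypothetical contents; only the function name and
# its two arguments are taken from the README)
import math


def Distance(list1, list2):
    """Euclidean distance between two equal-length numeric sequences.

    ASSUMPTION: PF-GAP passes each series as a flat sequence of numbers
    and expects a single non-negative float back.
    """
    if len(list1) != len(list2):
        raise ValueError("Distance expects sequences of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(list1, list2)))
```

A distance for variable-length or multivariate series would need different handling of the inputs; the docs folder is the place to confirm what the function actually receives.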

📊 Demo Notebooks

| Demo | Description |
|------|-------------|
| demo_gunpoint.py | Classic PF classification on UCR GunPoint dataset |
| demo_multi_impute.py | Imputation on multivariate time series with missing values |
| demo_load_japanese.py | Large-scale multivariate classification with variable-length sequences |
| demo_regression.py | Time Series Extrinsic Regression on the FloodModeling1 dataset |

Example MDS Visualization

[Figure: Demo MDS GunPoint Train]


📄 Data Format

PF-GAP supports flexible input formats:

  • UCR-style .tsv files (label + data in one file)
  • Custom delimited files with:
    • entry_separator (e.g., ",", " ")
    • array_separator (e.g., ":" for 2D arrays)

For multivariate or 3D data, use data_dimension=2.
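
To make the two separators concrete, here is a small sketch that serializes one multivariate series to a single line and parses it back. The layout is an assumption made for illustration: values within a channel are joined by entry_separator, and channels are joined by array_separator. The helper names format_series and parse_series are hypothetical and not part of PF_wrapper; consult the docs folder for the layout PF-GAP actually expects.

```python
def format_series(channels, entry_separator=",", array_separator=":"):
    """Serialize one multivariate series (a list of channels) to one line.

    ASSUMPTION: values inside a channel are joined by entry_separator and
    channels are joined by array_separator.
    """
    return array_separator.join(
        entry_separator.join(str(v) for v in channel) for channel in channels
    )


def parse_series(line, entry_separator=",", array_separator=":"):
    """Inverse of format_series under the same assumed layout."""
    return [
        [float(v) for v in channel.split(entry_separator)]
        for channel in line.strip().split(array_separator)
    ]


# One series with two channels of three time steps each.
line = format_series([[1.0, 2.0, 3.0], [0.5, 0.25, 0.125]])
print(line)  # 1.0,2.0,3.0:0.5,0.25,0.125
```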


📂 Output Files

Depending on options, PF-GAP may generate:

| File | Description |
|------|-------------|
| Predictions.txt | Predictions on test set |
| Predictions_saved.txt | Predictions from a saved model |
| TrainingProximities.txt | Proximity matrix for training set |
| TestTrainProximities.txt | Proximities between test and train |
| outlier_scores.txt | Intra-class outlier scores |
| imputed_train.txt | Imputed training data (if requested) |
| imputed_test.txt | Imputed test data (if requested) |

Use PF_wrapper.getArray(filename) to load proximity or outlier arrays.

🔹 Outlier Scores

  • Set return_training_outlier_scores=True to compute intra-class outlier scores for the training set.
  • These are saved to outlier_scores.txt in the output directory.
  • Use PF_wrapper.getArray(output_directory + "outlier_scores.txt") to load them as a NumPy array.
  • Note that outlier scores are not supported for regression.
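
Once the scores are loaded, one simple way to inspect them is to rank training instances by score. The sketch below uses a made-up array in place of the result of PF_wrapper.getArray, and it assumes that a larger score means a more outlying instance. The helper top_outliers is hypothetical, not part of PF_wrapper.

```python
import numpy as np


def top_outliers(scores, k=3):
    """Return the indices of the k largest scores, highest first.

    ASSUMPTION: larger scores indicate more outlying instances.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    order = np.argsort(scores)[::-1]  # descending by score
    return order[:k]


# Stand-in for the loaded outlier_scores.txt array; these values are invented.
scores = np.array([0.8, 1.1, 12.5, 0.9, 7.3, 1.0])
print(top_outliers(scores, k=2))  # [2 4]
```

The returned indices refer to rows of the training set, so they can be used to pull the corresponding series for closer inspection.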

🔹 Imputed Data

  • If impute_training_data=True and return_imputed_training=True, the imputed training set is saved to:
[output_directory]/[train_file].txt
  • Similarly, return_imputed_testing=True saves:
[output_directory]/[test_file].txt
  • These files preserve the original format and delimiters.

🔹 Proximity Matrices

  • If return_proximities=True, proximity matrices are saved to:

    • TrainingProximities.txt (train vs. train)
    • TestTrainProximities.txt (test vs. train)
  • These are used internally for imputation and outlier detection, but can also be used for:

    • MDS or PHATE visualization
    • Clustering
    • Custom analysis

Load them with:

p = PF_wrapper.getArray(str(output_directory) + "TrainingProximities.txt")

or:

pt = PF_wrapper.getArray(str(output_directory) + "TestTrainProximities.txt")
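
As a rough illustration of the visualization use case, the sketch below embeds a square train-vs-train proximity matrix in two dimensions. It uses classical (Torgerson) MDS written directly in NumPy rather than a library MDS routine, and it assumes that proximities can be turned into dissimilarities as 1 minus the proximity scaled by its maximum. That conversion, the toy matrix, and the function name classical_mds are all assumptions for the example.

```python
import numpy as np


def classical_mds(proximities, n_components=2):
    """Embed points from a square proximity matrix with classical MDS."""
    p = np.asarray(proximities, dtype=float)
    p = (p + p.T) / 2.0                  # symmetrize
    d = 1.0 - p / p.max()                # ASSUMED proximity-to-dissimilarity map
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    b = -0.5 * j @ (d ** 2) @ j          # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    idx = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))  # drop negative eigenvalues
    return eigvecs[:, idx] * scale


# Toy stand-in for TrainingProximities.txt: points 0 and 1 are close, 2 is far.
p = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
coords = classical_mds(p)
print(coords.shape)  # (3, 2)
```

The resulting coordinates can be passed to a scatter plot and colored by class label to produce a figure similar to the MDS example shown earlier.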

📖 Citation

If you use PF-GAP in your work, please cite the appropriate paper(s) from the following list:

Ben Shaw, Jake S. Rhodes, Soukaina Filali Boubrahimi, and Kevin R. Moon. Forest Proximities for Time Series, IntelliSys 2025
arXiv preprint

Ben Shaw, Adam Rustad, Sofia Pelagalli Maia, Jake S. Rhodes, and Kevin R. Moon. The Generalized Proximity Forest, ACDSA 2026
arXiv preprint
