PF-GAP is a flexible, extensible framework for proximity-based learning on time series and structured data. It builds on the original Proximity Forest (PF) model and introduces:
- GAP proximities
- Supervised imputation (test and train sets)
- Intra-class outlier scores
- Returnable proximities for visualization, SVM kernels, etc.
- Custom distance functions (in Python, Maple, or Java)
- Support for multivariate and variable-length time series
- Parallel training and proximity computation
- Flexible data formatting and imputation options
- Regression (Extrinsic)
- Customizable node purity measures and aggregation schemes
Requirements:
- Java 17+
- Recommended: Python 3.8+ (tested with Python 3.13)
- Python packages (for running the demo files):

      pip install numpy pandas matplotlib scikit-learn aeon

- Optional: Maple 2016+ (for Maple-based distance functions)
    PF-GAP/
    ├── PFGAP/                  # Java source code
    ├── docs/                   # Project feature and usage documentation
    │   ├── custom_distances/   # Documentation and examples for custom Java distances
    │   ├── demo/               # Demo scripts (converted from notebooks), toy data, example Maple/Python
    │   └── *.md                # Markdown files for feature documentation
    ├── Application/
    │   ├── PFGAP.jar           # Compiled Java executable
    │   └── PF_wrapper.py       # Python interface to PFGAP.jar
    └── README.md

Simply download the PFGAP.jar file. For convenience, also download the PF_wrapper.py file to call PF-GAP from Python.
For more detailed descriptions, please refer to the documentation.
Use PF_wrapper.train() to train a proximity forest:
    import PF_wrapper as PF

    PF.train(
        train_file="Data/GunPoint_TRAIN.tsv",
        model_name="Spartacus",
        return_proximities=True,
        output_directory="training_output",
        entry_separator="\t"
    )

Use PF_wrapper.predict() to evaluate a saved model on a test set:
    PF.predict(
        model_name="training_output/Spartacus",
        testfile="Data/GunPoint_TEST.tsv",
        entry_separator="\t"
    )

PF-GAP supports iterative imputation for both training and test sets:
    PF.train(
        train_file="Data/differentlengths.txt",
        test_file="Data/differentlengths_test.txt",
        train_labels="Data/differentlabels.txt",
        test_labels="Data/differentlabels_test.txt",
        impute_training_data=True,
        return_imputed_training=True,
        impute_testing_data=True,
        return_imputed_testing=True,
        impute_iterations=5,
        data_dimension=2,
        entry_separator=",",
        array_separator=":"
    )

Custom Distances

You can define your own distance function in:
- Java: compile one or more .class or .jar files. See docs/custom_distances for more information.
- Python: PythonDistance.py with a function Distance(list1, list2)
- Maple: MapleDistance.mpl with a function Distance(list1, list2)
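As a minimal sketch of the Python option, a PythonDistance.py could define Distance as a plain Euclidean distance. The file and function names come from the list above. That both arguments arrive as equal-length flat sequences of numbers, and that a single float should be returned, are assumptions made for illustration and are not confirmed by the PF-GAP documentation.

```python
# PythonDistance.py: illustrative sketch only.
import math


def Distance(list1, list2):
    """Euclidean distance between two equal-length numeric sequences.

    Assumption: each series is passed in as a flat sequence of numbers.
    """
    if len(list1) != len(list2):
        raise ValueError("this sketch only handles equal-length series")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(list1, list2)))
```

For example, Distance([0.0, 0.0], [3.0, 4.0]) returns 5.0. Variable-length or multivariate series would need an elastic or per-dimension measure instead.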
Specify the custom distance source using:
    distances=["javadistance:customdistance.class"]

or

    distances=["javadistance:userdistances.jar:customdistance"]

or

    distances=["python"]  # or ["maple"]

| Demo | Description |
|---|---|
| demo_gunpoint.py | Classic PF classification on UCR GunPoint dataset |
| demo_multi_impute.py | Imputation on multivariate time series with missing values |
| demo_load_japanese.py | Large-scale multivariate classification with variable-length sequences |
| demo_regression.py | Time Series Extrinsic Regression on the FloodModeling1 dataset |
PF-GAP supports flexible input formats:
- UCR-style .tsv files (label + data in one file)
- Custom delimited files with:
  - entry_separator (e.g., "," or " ")
  - array_separator (e.g., ":" for 2D arrays)
For multivariate or 3D data, use data_dimension=2.
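To make the UCR-style layout concrete, the sketch below writes a toy univariate dataset with one instance per row: the class label first, then the series values, all tab-separated. The helper name write_ucr_tsv and the toy values are hypothetical. That PF-GAP reads such a file with entry_separator="\t" is inferred from the GunPoint training example, not separately verified.

```python
import csv


def write_ucr_tsv(path, labels, series):
    """Write one row per instance: label first, then the values, tab-separated."""
    if len(labels) != len(series):
        raise ValueError("labels and series must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for label, values in zip(labels, series):
            writer.writerow([label, *values])


if __name__ == "__main__":
    # Hypothetical toy data: two instances of length three.
    write_ucr_tsv("toy_TRAIN.tsv", [1, 2], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
```

Rows may have different lengths in this sketch, since each row is written independently.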
Depending on options, PF-GAP may generate:
| File | Description |
|---|---|
| Predictions.txt | Predictions on test set |
| Predictions_saved.txt | Predictions from a saved model |
| TrainingProximities.txt | Proximity matrix for training set |
| TestTrainProximities.txt | Proximities between test and train |
| outlier_scores.txt | Intra-class outlier scores |
| imputed_train.txt | Imputed training data (if requested) |
| imputed_test.txt | Imputed test data (if requested) |
Use PF_wrapper.getArray(filename) to load proximity or outlier arrays.
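Because the set of generated files depends on the options used, it can help to check which outputs actually exist before loading them. The sketch below uses only the file names listed above. The helper name find_outputs is hypothetical and is not part of PF_wrapper.

```python
from pathlib import Path

# File names as listed in the output table of this README.
EXPECTED_OUTPUTS = [
    "Predictions.txt",
    "Predictions_saved.txt",
    "TrainingProximities.txt",
    "TestTrainProximities.txt",
    "outlier_scores.txt",
    "imputed_train.txt",
    "imputed_test.txt",
]


def find_outputs(output_directory):
    """Return {file name: full path} for each expected output that exists."""
    directory = Path(output_directory)
    return {
        name: directory / name
        for name in EXPECTED_OUTPUTS
        if (directory / name).is_file()
    }
```

Building paths with pathlib also avoids a missing path separator when output_directory does not end in a slash. A found path can then be passed on as a string, for example str(find_outputs("training_output")["TrainingProximities.txt"]).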
🔹 Outlier Scores
- Set return_training_outlier_scores=True to compute intra-class outlier scores for the training set.
- These are saved to outlier_scores.txt in the output directory.
- Use PF_wrapper.getArray(output_directory + "outlier_scores.txt") to load them as a NumPy array.
- Note that outlier scores are not supported for regression.
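Once loaded, the scores can be ranked within each class to surface the most unusual training instances. The sketch below assumes one score per training instance, stored in the same order as the training labels, and that a higher score means a more outlying instance. Both are assumptions about the file contents, and the helper name top_outliers is hypothetical.

```python
def top_outliers(scores, labels, k=3):
    """Return {label: [(index, score), ...]} with the k highest scores per class."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    by_class = {}
    for index, (score, label) in enumerate(zip(scores, labels)):
        by_class.setdefault(label, []).append((index, float(score)))
    return {
        label: sorted(items, key=lambda item: item[1], reverse=True)[:k]
        for label, items in by_class.items()
    }
```

The function accepts plain lists as well as NumPy arrays, so the array returned by PF_wrapper.getArray can be passed in directly.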
🔹 Imputed Data
- If impute_training_data=True and return_imputed_training=True, the imputed training set is saved to:

      [output_directory]/[train_file].txt

- Similarly, return_imputed_testing=True saves:

      [output_directory]/[test_file].txt

- These files preserve the original format and delimiters.
🔹 Proximity Matrices
- If return_proximities=True, proximity matrices are saved to:
  - TrainingProximities.txt (train vs. train)
  - TestTrainProximities.txt (test vs. train)
- These are used internally for imputation and outlier detection, but can also be used for:
  - MDS or PHATE visualization
  - Clustering
  - Custom analysis
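As one way to visualize a training proximity matrix, the sketch below computes a classical (Torgerson) MDS embedding with NumPy alone. It is a simple stand-in technique, not PF-GAP's own tooling. Treating larger proximities as more similar, and converting them to dissimilarities as 1 minus the max-scaled proximity, are assumptions. The function name classical_mds is hypothetical.

```python
import numpy as np


def classical_mds(proximities, n_components=2):
    """Embed a square proximity matrix into n_components dimensions."""
    prox = np.asarray(proximities, dtype=float)
    prox = 0.5 * (prox + prox.T)  # symmetrize in case the matrix is not symmetric
    largest = prox.max()
    if largest <= 0:
        raise ValueError("proximity matrix has no positive entries")
    dist = 1.0 - prox / largest   # assumed proximity-to-dissimilarity conversion
    np.fill_diagonal(dist, 0.0)   # every instance is at distance 0 from itself
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

The returned array has one row per instance and can be plotted with matplotlib, for example by coloring each point by its class label.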
Load the proximity matrices with:

    p = PF_wrapper.getArray(str(output_directory) + "TrainingProximities.txt")

or:

    pt = PF_wrapper.getArray(str(output_directory) + "TestTrainProximities.txt")

If you use PF-GAP in your work, please cite the appropriate paper(s) from the following list:
- Ben Shaw, Jake S. Rhodes, Soukaina Filali Boubrahimi, and Kevin R. Moon. Forest Proximities for Time Series. IntelliSys 2025 (arXiv preprint).
- Ben Shaw, Adam Rustad, Sofia Pelagalli Maia, Jake S. Rhodes, and Kevin R. Moon. The Generalized Proximity Forest. ACDSA 2026 (arXiv preprint).