3 changes: 2 additions & 1 deletion .gitignore
@@ -10,4 +10,5 @@ htmlcov
.env
.env.test
metrics.json
predictions.csv
predictions.csv
backend/app/evaluation/datasets/*.csv
49 changes: 31 additions & 18 deletions backend/README.md
@@ -97,48 +97,61 @@ If you use GitHub Actions the tests will run automatically.

We can benchmark validators like PII Remover and Lexical Slur Detection on curated datasets.

Download the dataset from [here](https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89). This contains multiple folders, one for each validator. Each folder contains a testing dataset in csv format for the validator. Download these csv files and store it in `backend/app/evaluation/datasets/` folder. Once the datasets have been stored, we can run the evaluation script for each validator.
Download the dataset from [Google Drive](https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89). This contains multiple folders, one for each validator, and each folder contains that validator's testing dataset in CSV format. Download these CSV files and store them in `backend/app/evaluation/datasets/`.

For lexical slur match, ban list and gender assumption bias, testing doesn't make much sense cause these are deterministic. However, we curated a dataset for lexical slur match for use in toxicity detection validator later on.
Important: each `run.py` expects a specific filename, so dataset files must be named exactly as below:
- `app/evaluation/lexical_slur/run.py` expects `lexical_slur_testing_dataset.csv`
- `app/evaluation/pii/run.py` expects `pii_detection_testing_dataset.csv`
- `app/evaluation/gender_assumption_bias/run.py` expects `gender_bias_assumption_dataset.csv`

Once these files are in place with the exact names above, run the evaluation scripts.

Unit tests for lexical slur match, ban list, and gender assumption bias validators have limited value because their logic is deterministic. However, curated datasets exist for lexical slur match and gender assumption bias to benchmark accuracy and latency. The lexical slur dataset will also be used in future toxicity detection workflows.

Each validator produces:
- predictions.csv – row-level outputs for debugging and analysis
- metrics.json – aggregated accuracy + performance metrics

Standardized output structure:
```
```text
app/evaluation/outputs/
lexical_slur/
predictions.csv
metrics.json
gender_assumption_bias/
predictions.csv
metrics.json
pii_remover/
predictions.csv
metrics.json
```
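
For orientation, the payload each runner passes to `write_json` (and therefore the shape of `metrics.json`) looks roughly like the sketch below. The keys mirror the runners in this PR; the numbers are purely illustrative, and the PII runner reports `entity_metrics` instead of the binary `metrics` block.

```python
# Illustrative only: keys mirror the write_json payloads in this PR,
# values are made up. The PII runner uses "entity_metrics" instead of "metrics".
example_metrics = {
    "guardrail": "lexical_slur",
    "num_samples": 500,
    "metrics": {
        "tp": 240, "tn": 245, "fp": 5, "fn": 10,
        "precision": 0.98, "recall": 0.96, "f1": 0.97,
    },
    "performance": {
        "latency_ms": {"mean": 1.42, "p95": 2.10, "max": 5.33},
        "memory_mb": 12.5,
    },
}
```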

- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py`

Expected outputs:
```
app/evaluation/outputs/lexical_slur/
├── predictions.csv
└── metrics.json
- To run all evaluation scripts together, use:
```bash
bash scripts/run_all_evaluations.sh
```
This script runs the evaluators in sequence:
- `app/evaluation/lexical_slur/run.py`
- `app/evaluation/pii/run.py`
- `app/evaluation/gender_assumption_bias/run.py`
Comment on lines +129 to +136

⚠️ Potential issue | 🟡 Minor

Fenced code block inside a list item must be indented to maintain list context.

On GitHub (CommonMark), an unindented ``` at column 0 terminates the enclosing list item. The "This script runs the evaluators in sequence:" paragraph and the sub-list (lines 133–136) are then rendered as a separate top-level block rather than as a continuation of the bullet.

📝 Proposed fix — indent the code block and continuation text
-
-- To run all evaluation scripts together, use:
-```bash
-bash scripts/run_all_evaluations.sh
-```
-This script runs the evaluators in sequence:
-- `app/evaluation/lexical_slur/run.py`
-- `app/evaluation/pii/run.py`
-- `app/evaluation/gender_assumption_bias/run.py`
+
+To run all evaluation scripts together:
+
+```bash
+bash scripts/run_all_evaluations.sh
+```
+
+This script runs the evaluators in sequence:
+- `app/evaluation/lexical_slur/run.py`
+- `app/evaluation/pii/run.py`
+- `app/evaluation/gender_assumption_bias/run.py`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 129 - 136, The fenced code block in
backend/README.md breaks the enclosing list item; fix by making the code block
and the following paragraph part of the same list item—either indent the fenced
block and continuation lines by four spaces (so the ```bash block and the "This
script runs the evaluators in sequence:" paragraph remain inside the bullet) or
replace the list entry using the proposed reflow: move the "To run all
evaluation scripts together:" line before the fenced block, wrap the block with
```bash ... ```, and then list the three evaluator paths
(`app/evaluation/lexical_slur/run.py`, `app/evaluation/pii/run.py`,
`app/evaluation/gender_assumption_bias/run.py`) as sub-list items; ensure the
script name scripts/run_all_evaluations.sh is included exactly in the fenced
block.


predictions.csv contains row-level inputs, predictions, and labels.

metrics.json contains binary classification metrics and performance stats (latency + peak memory).

- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py`

- To evaluate PII Validator, run the PII evaluation script: `python app/evaluation/pii/run.py`

Expected outputs:
```
app/evaluation/outputs/pii_remover/
├── predictions.csv
└── metrics.json
```
predictions.csv contains original text, anonymized output, ground-truth masked text
`predictions.csv` contains original text, anonymized output, ground-truth masked text

`metrics.json` contains entity-level precision, recall, and F1 per PII type.
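
For reference, a hypothetical slice of that entity-level report, following the keys produced by `finalize_entity_metrics` in this PR (the entity type names and numbers below are assumptions, not taken from the dataset):

```python
# Hypothetical example: keys follow finalize_entity_metrics in this PR;
# entity type names and counts are made up for illustration.
example_entity_report = {
    "EMAIL_ADDRESS": {"tp": 118, "fp": 3, "fn": 2,
                      "precision": 0.98, "recall": 0.98, "f1": 0.98},
    "PHONE_NUMBER": {"tp": 95, "fp": 7, "fn": 9,
                     "precision": 0.93, "recall": 0.91, "f1": 0.92},
}
```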

- To evaluate Gender Assumption Bias Validator, run: `python app/evaluation/gender_assumption_bias/run.py`

`predictions.csv` contains biased and neutral samples with predicted outcomes for each.

metrics.json contains entity-level precision, recall, and F1 per PII type.
`metrics.json` contains binary classification metrics and performance stats (latency + peak memory).
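
To eyeball all results in one place after the runs finish, a minimal sketch (assuming execution from `backend/` and the default output layout shown above):

```python
import json
from pathlib import Path

# Minimal sketch: print each validator's metrics.json after the evaluations
# have been run. Assumes the default output layout under app/evaluation/outputs.
outputs = Path("app/evaluation/outputs")
for metrics_file in sorted(outputs.glob("*/metrics.json")):
    print(metrics_file.parent.name, json.loads(metrics_file.read_text()))
```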

## Validator configuration guide

6 changes: 3 additions & 3 deletions backend/app/evaluation/common/helper.py
@@ -31,9 +31,9 @@ def compute_binary_metrics(y_true, y_pred):
"tn": tn,
"fp": fp,
"fn": fn,
"precision": precision,
"recall": recall,
"f1": f1,
"precision": round(precision, 2),
"recall": round(recall, 2),
"f1": round(f1, 2),
}


69 changes: 69 additions & 0 deletions backend/app/evaluation/gender_assumption_bias/run.py
@@ -0,0 +1,69 @@
from pathlib import Path
import pandas as pd
from guardrails.validators import FailResult

from app.core.validators.gender_assumption_bias import GenderAssumptionBias
from app.evaluation.common.helper import (
    compute_binary_metrics,
    Profiler,
    write_csv,
    write_json,
)

BASE_DIR = Path(__file__).resolve().parent.parent
OUT_DIR = BASE_DIR / "outputs" / "gender_assumption_bias"

df = pd.read_csv(BASE_DIR / "datasets" / "gender_bias_assumption_dataset.csv")

validator = GenderAssumptionBias()

with Profiler() as p:
df["biased_result"] = (
df["biased input"]
.astype(str)
.apply(lambda x: p.record(lambda t: validator.validate(t, metadata=None), x))
)

df["neutral_result"] = (
df["neutral output"]
.astype(str)
.apply(lambda x: p.record(lambda t: validator.validate(t, metadata=None), x))
)

# For biased input → should FAIL (1)
df["biased_pred"] = df["biased_result"].apply(lambda r: int(isinstance(r, FailResult)))

# For neutral output → should PASS (0)
df["neutral_pred"] = df["neutral_result"].apply(
    lambda r: int(isinstance(r, FailResult))
)

df["biased_true"] = 1
df["neutral_true"] = 0

y_true = list(df["biased_true"]) + list(df["neutral_true"])
y_pred = list(df["biased_pred"]) + list(df["neutral_pred"])

metrics = compute_binary_metrics(y_true, y_pred)

write_csv(
    df.drop(columns=["biased_result", "neutral_result"]),
    OUT_DIR / "predictions.csv",
)

write_json(
    {
        "guardrail": "gender_assumption_bias",
        "num_samples": len(df) * 2,  # because evaluating both sides
        "metrics": metrics,
        "performance": {
            "latency_ms": {
                "mean": round(sum(p.latencies) / len(p.latencies), 2),
                "p95": round(sorted(p.latencies)[int(len(p.latencies) * 0.95)], 2),
                "max": round(max(p.latencies), 2),
            },
            "memory_mb": round(p.peak_memory_mb, 2),
        },
    },
    OUT_DIR / "metrics.json",
)
8 changes: 4 additions & 4 deletions backend/app/evaluation/lexical_slur/run.py
@@ -39,11 +39,11 @@
"metrics": metrics,
"performance": {
"latency_ms": {
"mean": sum(p.latencies) / len(p.latencies),
"p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
"max": max(p.latencies),
"mean": round(sum(p.latencies) / len(p.latencies), 2),
"p95": round(sorted(p.latencies)[int(len(p.latencies) * 0.95)], 2),
"max": round(max(p.latencies), 2),
},
"memory_mb": p.peak_memory_mb,
"memory_mb": round(p.peak_memory_mb, 2),
},
},
OUT_DIR / "metrics.json",
6 changes: 3 additions & 3 deletions backend/app/evaluation/pii/entity_metrics.py
@@ -75,9 +75,9 @@ def finalize_entity_metrics(stats: Dict[str, dict]) -> Dict[str, dict]:
"tp": tp,
"fp": fp,
"fn": fn,
"precision": precision,
"recall": recall,
"f1": f1,
"precision": round(precision, 2),
"recall": round(recall, 2),
"f1": round(f1, 2),
}

return report
15 changes: 13 additions & 2 deletions backend/app/evaluation/pii/run.py
@@ -4,7 +4,7 @@

from app.core.validators.pii_remover import PIIRemover
from app.evaluation.pii.entity_metrics import compute_entity_metrics
from app.evaluation.common.helper import write_csv, write_json
from app.evaluation.common.helper import Profiler, write_csv, write_json

BASE_DIR = Path(__file__).resolve().parent.parent
OUT_DIR = BASE_DIR / "outputs" / "pii_remover"
@@ -21,7 +21,10 @@ def run_pii(text: str) -> str:
    return text


df["anonymized"] = df["source_text"].astype(str).apply(run_pii)
with Profiler() as p:
df["anonymized"] = (
df["source_text"].astype(str).apply(lambda x: p.record(run_pii, x))
)

entity_report = compute_entity_metrics(
df["target_text"],
@@ -36,6 +39,14 @@ def run_pii(text: str) -> str:
"guardrail": "pii_remover",
"num_samples": len(df),
"entity_metrics": entity_report,
"performance": {
"latency_ms": {
"mean": round(sum(p.latencies) / len(p.latencies), 2),
"p95": round(sorted(p.latencies)[int(len(p.latencies) * 0.95)], 2),
"max": round(max(p.latencies), 2),
},
"memory_mb": round(p.peak_memory_mb, 2),
},
Comment on lines +42 to +49

⚠️ Potential issue | 🟡 Minor

Guard against empty p.latencies to avoid ZeroDivisionError / IndexError.

This performance block is newly added. If the input CSV is empty (or all rows are somehow filtered), p.latencies will be an empty list, and Line 44 (sum(...) / len(...)) will raise ZeroDivisionError, Line 45 will raise IndexError, and Line 46 will raise ValueError. The same issue exists in lexical_slur/run.py.

The cleanest fix is the Profiler.summary() helper suggested in my comment on lexical_slur/run.py, which centralizes the guard and eliminates duplication. If you'd rather keep things inline, a simple if p.latencies: guard suffices.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/pii/run.py` around lines 42 - 49, The performance
block currently assumes p.latencies is non-empty which can raise
ZeroDivisionError/IndexError/ValueError; either implement and call a centralized
Profiler.summary() helper that returns safe values (e.g., mean, p95, max as
numbers or None) for an empty latencies list and use that result here, or wrap
the existing computation in an if p.latencies: guard and otherwise set "mean",
"p95", "max" to None (or 0 per project convention); apply the same change to
lexical_slur/run.py so both places reference the safe summary implementation or
the same guard logic.
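
To make the suggestion concrete, one possible shape for the centralized guard is sketched below. This is not code from the PR: the helper name, its placement, and the `None` fallbacks are assumptions. It relies only on the `latencies` list and `peak_memory_mb` value that the runners in this diff already read off the profiler, and it mirrors their existing p95 formula.

```python
# Hypothetical helper, not part of this PR: centralizes the empty-latencies
# guard. It could live on Profiler (e.g. as a summary() method) as suggested;
# shown here as a standalone function for clarity.
def performance_summary(latencies, peak_memory_mb):
    """Build the 'performance' block, tolerating an empty latencies list."""
    if not latencies:
        return {
            "latency_ms": {"mean": None, "p95": None, "max": None},
            "memory_mb": round(peak_memory_mb, 2),
        }
    ordered = sorted(latencies)
    p95_index = min(int(len(ordered) * 0.95), len(ordered) - 1)
    return {
        "latency_ms": {
            "mean": round(sum(ordered) / len(ordered), 2),
            "p95": round(ordered[p95_index], 2),
            "max": round(ordered[-1], 2),
        },
        "memory_mb": round(peak_memory_mb, 2),
    }
```

Each runner could then set `"performance": performance_summary(p.latencies, p.peak_memory_mb)` in its payload instead of computing the block inline.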

    },
    OUT_DIR / "metrics.json",
)
25 changes: 25 additions & 0 deletions backend/scripts/run_all_evaluations.sh
@@ -0,0 +1,25 @@
#!/usr/bin/env bash

set -euo pipefail

BACKEND_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
EVAL_DIR="$BACKEND_DIR/app/evaluation"

RUNNERS=(
"$EVAL_DIR/lexical_slur/run.py"
"$EVAL_DIR/pii/run.py"
"$EVAL_DIR/gender_assumption_bias/run.py"
)

echo "Running validator evaluations..."
echo "Backend dir: $BACKEND_DIR"

for runner in "${RUNNERS[@]}"; do
name="$(basename "$(dirname "$runner")")"
echo ""
echo "==> [$name] $runner"
uv run python "$runner"
done

echo ""
echo "All validator evaluations completed."