Added gender assumption bias evaluation #59

Open

rkritika1508 wants to merge 6 commits into main from feat/evaluation-update

Conversation

rkritika1508 commented Feb 19, 2026

Summary

Target issue is #58.
Explain the motivation for making this change. What existing problem does the pull request solve?

  • Added a script run_all_evaluations.sh to run evaluations for all validators.
  • Added a dataset for the gender assumption bias validator. It contains 300 samples in Hindi, English, and Hinglish. The dataset can be found here.
  • Added code to evaluate the performance of the gender assumption bias validator on the curated dataset.

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add any other information the reviewer may need.

Summary by CodeRabbit

  • New Features

    • Added a gender-assumption bias evaluator that produces predictions, metrics, and performance stats.
    • Added a single-command runner to execute all evaluators in sequence.
  • Improvements

    • PII evaluation and bias runs now record latency and memory; performance stats included in outputs.
    • Numeric outputs (precision, recall, F1, latency, memory) are rounded for clearer reports.
  • Documentation

    • README updated with exact dataset filenames and standardized outputs layout.
  • Chores

    • Updated ignore patterns to exclude dataset CSV files.

rkritika1508 marked this pull request as ready for review on February 19, 2026 at 12:24

coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Adds a gender assumption bias evaluation runner and a shell orchestrator, introduces a Profiler and rounding for metrics in helpers and evaluators, updates PII and lexical slur runners to emit performance stats, updates docs and .gitignore to ignore dataset CSVs, and writes standardized predictions/metrics outputs.

Changes

  • Configuration (/.gitignore): Added ignore pattern backend/app/evaluation/datasets/*.csv; preserved the existing predictions.csv ignore entry.
  • Gender Assumption Bias Evaluation (backend/app/evaluation/gender_assumption_bias/run.py): New evaluation script: loads the dataset, runs GenderAssumptionBias on biased and neutral inputs, profiles latency/memory, computes binary metrics, and writes predictions.csv and metrics.json (including latency and memory).
  • Evaluation Runner (backend/scripts/run_all_evaluations.sh): New bash script to run all evaluation runners sequentially with strict error handling and per-runner status output.
  • Common helpers / profiling (backend/app/evaluation/common/helper.py): Adds Profiler; compute_binary_metrics now rounds precision, recall, and f1 to 2 decimals; helper exports updated to include Profiler. An illustrative profiler sketch follows this list.
  • PII evaluation (backend/app/evaluation/pii/run.py, backend/app/evaluation/pii/entity_metrics.py): The PII runner now uses Profiler to record latencies and peak memory; entity metrics are rounded to 2 decimals; the metrics JSON is augmented with performance stats.
  • Lexical slur evaluation (backend/app/evaluation/lexical_slur/run.py): Latency and memory performance numbers in the output JSON are rounded to 2 decimal places.
  • Documentation (backend/README.md): Docs updated with explicit dataset filenames/placement, instructions for running the new gender_assumption_bias evaluator, a standardized evaluation outputs layout, and a reference to run_all_evaluations.sh.
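
For orientation, here is a minimal sketch of what a profiler of this kind could look like. It is illustrative only: it assumes time.perf_counter and tracemalloc, and is not the actual contents of backend/app/evaluation/common/helper.py.

```python
import time
import tracemalloc


class Profiler:
    """Illustrative sketch only; the real helper in
    backend/app/evaluation/common/helper.py may differ in API and units."""

    def __init__(self) -> None:
        self.latencies: list[float] = []  # per-call latency in milliseconds
        self.peak_memory_mb: float = 0.0

    def start(self) -> None:
        # Begin tracking allocations and note the wall-clock start time.
        tracemalloc.start()
        self._t0 = time.perf_counter()

    def stop(self) -> None:
        # Record elapsed time in ms and the peak traced memory in MB.
        self.latencies.append((time.perf_counter() - self._t0) * 1000)
        _, peak = tracemalloc.get_traced_memory()
        self.peak_memory_mb = max(self.peak_memory_mb, peak / (1024 * 1024))
        tracemalloc.stop()
```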

Sequence Diagram

sequenceDiagram
    participant Script as "run.py"
    participant Dataset as "Dataset Loader"
    participant Validator as "GenderAssumptionBias\nValidator"
    participant Profiler as "Profiler"
    participant Metrics as "Metrics Compute"
    participant Output as "CSV/JSON\nWriter"

    Script->>Dataset: Load dataset
    Script->>Validator: Initialize validator

    loop for each sample (biased & neutral)
        Script->>Profiler: start recording
        Script->>Validator: Evaluate input
        Validator-->>Script: Prediction
        Script->>Profiler: stop recording
    end

    Script->>Metrics: Compute binary metrics (precision, recall, f1)
    Metrics-->>Script: Metrics
    Script->>Output: Write predictions.csv
    Script->>Output: Write metrics.json (latency & memory stats)
    Output-->>Script: Complete
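
The same flow as a condensed, self-contained sketch. The dataset and output filenames follow the walkthrough; the directory constants, the "text" and "label" column names, the accuracy-only metric, and the stub predict_bias() are placeholders, not the repository's GenderAssumptionBias validator or its compute_binary_metrics helper.

```python
# Condensed sketch of the diagrammed flow; not the repository's run.py.
import json
import time
from pathlib import Path

import pandas as pd

BASE_DIR = Path("backend/app/evaluation")  # assumed location
OUT_DIR = Path("backend/app/evaluation/outputs/gender_assumption_bias")


def predict_bias(text: str) -> int:
    """Stand-in for the validator call; returns 1 for 'biased', 0 otherwise."""
    return 0


def main() -> None:
    df = pd.read_csv(BASE_DIR / "datasets" / "gender_bias_assumption_dataset.csv")
    latencies, preds = [], []
    for text in df["text"]:
        t0 = time.perf_counter()
        preds.append(predict_bias(text))
        latencies.append((time.perf_counter() - t0) * 1000)  # ms

    OUT_DIR.mkdir(parents=True, exist_ok=True)
    df.assign(prediction=preds).to_csv(OUT_DIR / "predictions.csv", index=False)

    accuracy = sum(p == y for p, y in zip(preds, df["label"])) / len(df) if len(df) else 0.0
    mean_latency = round(sum(latencies) / len(latencies), 2) if latencies else 0.0
    metrics = {"accuracy": round(accuracy, 2), "latency_ms_mean": mean_latency}
    (OUT_DIR / "metrics.json").write_text(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    main()
```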

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • nishika26

Poem

🐇 I hop through rows and time each thoughtful line,
I tally bias checks and round each metric fine,
I sniff the CSVs, profile pings, and memory too,
I write the preds and metrics — neat, concise, and true,
Bravo — the evaluators hum; I nibble a carrot or two.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title accurately describes the main change, a new gender assumption bias evaluation with a dedicated run script, dataset support, and evaluation code.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
backend/app/evaluation/gender_assumption_bias/run.py (2)

61-62: p95 index is off-by-one under the standard nearest-rank convention.

int(len(p.latencies) * 0.95) truncates, yielding the value above the 95th percentile marker (e.g., index 570 out of 600, which is the 95.17th percentile). The standard nearest-rank formula is ceil(p * N) - 1. For an evaluation script the difference is negligible, but using numpy or the corrected index avoids the discrepancy.

♻️ Proposed fix using math.ceil
+import math
 ...
-"p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
+"p95": sorted(p.latencies)[math.ceil(len(p.latencies) * 0.95) - 1],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/gender_assumption_bias/run.py` around lines 61 - 62,
The p95 calculation in run.py uses int(len(p.latencies) * 0.95) which truncates
and yields an off-by-one index under the nearest-rank convention; update the p95
computation for p.latencies in the block that builds the dict with "mean" and
"p95" to use the nearest-rank formula (ceil(0.95 * N) - 1) or use a proper
percentile routine (e.g., numpy.percentile) to select the correct index,
ensuring you reference p.latencies and replace the current int(...) index
expression with the corrected ceil-based index or a call to numpy.percentile.

16-68: Wrap top-level execution in an if __name__ == "__main__" guard.

All 50+ lines of evaluation logic execute at import time. This prevents importing any symbol from this module in tests or tooling without triggering file I/O, network, and CSV writes.

♻️ Proposed refactor
+def main():
     df = pd.read_csv(BASE_DIR / "datasets" / "gender_bias_assumption_dataset.csv")
     validator = GenderAssumptionBias()
     ...
     write_json(...)

+if __name__ == "__main__":
+    main()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/gender_assumption_bias/run.py` around lines 16 - 68,
The module runs heavy evaluation logic at import time (creating Profiler,
reading CSV into df, running validator.validate, computing metrics, and calling
write_csv/write_json); wrap that top-level execution in a main guard by moving
the existing script body into a function (e.g., def main(): ...) and then add if
__name__ == "__main__": main(), ensuring symbols like Profiler,
GenderAssumptionBias/validator, compute_binary_metrics, write_csv and write_json
remain importable without triggering file I/O or network calls; preserve
behavior and variable names (df, p, y_true/y_pred, metrics) inside main and keep
only definitions/imports at module scope.
backend/scripts/run_all_evaluations.sh (1)

17-22: Add cd "$BACKEND_DIR" before the loop to guarantee uv run picks up the correct project context.

uv run discovers pyproject.toml by ascending from the CWD. If this script is invoked from the repo root (e.g., bash backend/scripts/run_all_evaluations.sh), uv may resolve a different project or fail to find any environment entirely.

♻️ Proposed fix
+cd "$BACKEND_DIR"
+
 for runner in "${RUNNERS[@]}"; do
   name="$(basename "$(dirname "$runner")")"
   echo ""
   echo "==> [$name] $runner"
   uv run python "$runner"
 done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/scripts/run_all_evaluations.sh` around lines 17 - 22, The script
loops over RUNNERS and calls `uv run python "$runner"` but doesn't set the
working directory, so `uv` may ascend from the caller CWD and pick the wrong
pyproject; change the script to `cd` into the backend project before the loop
(e.g., `cd "$BACKEND_DIR"` or use `pushd "$BACKEND_DIR"`/`popd` around the loop)
so that `uv run python` runs from the intended project context and still restore
the original CWD afterward if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/gender_assumption_bias/run.py`:
- Line 17: Remove the debug print statement by deleting the standalone
print(BASE_DIR, OUT_DIR) call in run.py; if runtime visibility of
BASE_DIR/OUT_DIR is required, replace it with a proper logger call (e.g.,
logger.debug) referencing BASE_DIR and OUT_DIR instead of using print.
- Line 16: Wrap the pd.read_csv call that loads
"gender_bias_assumption_dataset.csv" (the df = pd.read_csv(BASE_DIR / "datasets"
/ "gender_bias_assumption_dataset.csv") statement) in a try/except catching
FileNotFoundError, and in the except log or raise a clear, actionable error that
names the missing file, where it was expected (BASE_DIR / "datasets"), and
instructs the operator to download the dataset from the Google Drive location;
ensure the handler either raises a RuntimeError with that message or calls the
module's logger/exit path (e.g., logger.error(...); sys.exit(1)) so the failure
is user-friendly.

---

Nitpick comments:
In `@backend/app/evaluation/gender_assumption_bias/run.py`:
- Around line 61-62: The p95 calculation in run.py uses int(len(p.latencies) *
0.95) which truncates and yields an off-by-one index under the nearest-rank
convention; update the p95 computation for p.latencies in the block that builds
the dict with "mean" and "p95" to use the nearest-rank formula (ceil(0.95 * N) -
1) or use a proper percentile routine (e.g., numpy.percentile) to select the
correct index, ensuring you reference p.latencies and replace the current
int(...) index expression with the corrected ceil-based index or a call to
numpy.percentile.
- Around line 16-68: The module runs heavy evaluation logic at import time
(creating Profiler, reading CSV into df, running validator.validate, computing
metrics, and calling write_csv/write_json); wrap that top-level execution in a
main guard by moving the existing script body into a function (e.g., def main():
...) and then add if __name__ == "__main__": main(), ensuring symbols like
Profiler, GenderAssumptionBias/validator, compute_binary_metrics, write_csv and
write_json remain importable without triggering file I/O or network calls;
preserve behavior and variable names (df, p, y_true/y_pred, metrics) inside main
and keep only definitions/imports at module scope.

In `@backend/scripts/run_all_evaluations.sh`:
- Around line 17-22: The script loops over RUNNERS and calls `uv run python
"$runner"` but doesn't set the working directory, so `uv` may ascend from the
caller CWD and pick the wrong pyproject; change the script to `cd` into the
backend project before the loop (e.g., `cd "$BACKEND_DIR"` or use `pushd
"$BACKEND_DIR"`/`popd` around the loop) so that `uv run python` runs from the
intended project context and still restore the original CWD afterward if
necessary.

rkritika1508 self-assigned this Feb 19, 2026
nishika26 linked an issue Feb 20, 2026 that may be closed by this pull request

nishika26 commented Feb 20, 2026

@rkritika1508, I ran the script, and here are my inputs:
1) All the output values should be rounded off to two decimal places.
2) I don't see a record of latency in the PII remover.
3) Shouldn't we have a script for the ban list validator as well, at least for the demo, so we have accuracy and latency numbers?

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
backend/app/evaluation/common/helper.py (1)

19-23: Four independent passes over the same data to compute TP/TN/FP/FN.

Each confusion-matrix counter zips and iterates the full sequence separately. A single pass would be clearer and would do ~4× fewer iterations. Fine for small evaluation datasets, but easy to tighten.

♻️ Single-pass alternative
 def compute_binary_metrics(y_true, y_pred):
-    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
-    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
-    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
-    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
+    tp = tn = fp = fn = 0
+    for yt, yp in zip(y_true, y_pred, strict=True):
+        if yt == 1 and yp == 1:
+            tp += 1
+        elif yt == 0 and yp == 0:
+            tn += 1
+        elif yt == 0 and yp == 1:
+            fp += 1
+        else:
+            fn += 1
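
The derived metrics then follow the standard definitions. A hedged sketch (the function name and the zero-division guards are assumptions; the rounding to 2 decimals matches the walkthrough):

```python
def finalize_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    # Standard binary-classification formulas, rounded to 2 decimals.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
    }
```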
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/common/helper.py` around lines 19 - 23,
compute_binary_metrics currently makes four separate passes over y_true/y_pred
to compute tp, tn, fp, fn which is inefficient; replace those four sum(...)
expressions with a single loop that iterates once over zip(y_true, y_pred,
strict=True) and increments tp, tn, fp, fn accordingly (e.g., if yt==1 and
yp==1: tp +=1, etc.). Update the body of compute_binary_metrics to initialize
the four counters before the loop and compute them in that single pass, leaving
the function's return/signature unchanged.
backend/app/evaluation/lexical_slur/run.py (1)

42-46: Rounding looks good; consider extracting the latency summary into a shared helper.

The round(..., 2) additions address the reviewer feedback. However, the same latency-stats block (mean / p95 / max + memory) is duplicated verbatim in pii/run.py (and likely gender_assumption_bias/run.py). A small helper on Profiler (e.g. summary() -> dict) would eliminate this repetition and keep future format changes in one place.

♻️ Example: add a summary method to Profiler

Add to backend/app/evaluation/common/helper.py:

class Profiler:
    # ... existing methods ...

    def summary(self) -> dict:
        n = len(self.latencies)
        if n == 0:
            return {"latency_ms": {"mean": 0, "p95": 0, "max": 0}, "memory_mb": 0}
        return {
            "latency_ms": {
                "mean": round(sum(self.latencies) / n, 2),
                "p95": round(sorted(self.latencies)[int(n * 0.95)], 2),
                "max": round(max(self.latencies), 2),
            },
            "memory_mb": round(self.peak_memory_mb, 2),
        }

Then each run.py simply calls "performance": p.summary().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/lexical_slur/run.py` around lines 42 - 46, The latency
+ memory summary block is duplicated across runs; add a summary() method to the
Profiler class that returns the same dict shape (handle zero-length latencies by
returning zeros), computing mean, p95, and max with round(..., 2) and rounding
peak_memory_mb to 2 decimals, and then replace the inline block in
lexical_slur/run.py (and other run.py files like pii/run.py and
gender_assumption_bias/run.py) with a call to p.summary() where p is the
Profiler instance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/pii/run.py`:
- Around line 42-49: The performance block currently assumes p.latencies is
non-empty which can raise ZeroDivisionError/IndexError/ValueError; either
implement and call a centralized Profiler.summary() helper that returns safe
values (e.g., mean, p95, max as numbers or None) for an empty latencies list and
use that result here, or wrap the existing computation in an if p.latencies:
guard and otherwise set "mean", "p95", "max" to None (or 0 per project
convention); apply the same change to lexical_slur/run.py so both places
reference the safe summary implementation or the same guard logic.

In `@backend/README.md`:
- Around line 153-165: Add a language specifier to the fenced output block in
README.md (change the open fence to use ```text or ```console for the
app/evaluation/outputs/gender_assumption_bias/ tree) so it satisfies MD040 and
renders correctly, and ensure the closing ``` remains. Also add a short
documented subsection for the new backend/scripts/run_all_evaluations.sh script:
state its purpose (orchestrates all per-validator runs), provide the basic usage
(how to invoke the script), and list expected outputs (where aggregated
validator outputs and combined metrics are written, e.g.,
app/evaluation/outputs/ and any combined metrics file), so readers know to run
that script in addition to individual validator run.py commands.
- Line 100: Replace the non-descriptive link text "[here](...)" in the README
with a meaningful label (e.g., "validator datasets on Google Drive" or
"validation CSV dataset folder") so the link conveys its target; update the
sentence that instructs users to download the CSVs into
backend/app/evaluation/datasets/ to use that descriptive label instead of
"here".
- Line 109: The README sentence contradicts itself by calling "lexical slur
match, ban list, gender assumption bias" deterministic (so testing "doesn't make
much sense") while immediately stating curated datasets exist and that a
dedicated run.py (gender assumption bias benchmark) is a main deliverable;
update the paragraph to remove the contradiction and fix wording: clarify which
validators are deterministic (if any), state that curated datasets were created
specifically for benchmarking lexical slur match and gender assumption bias and
that run.py executes the gender assumption bias evaluation, and replace "cause"
with "because" for clarity; reference the terms "lexical slur match", "ban
list", "gender assumption bias", and "run.py" so editors can locate and reword
the README accordingly.

---

Nitpick comments:
In `@backend/app/evaluation/common/helper.py`:
- Around line 19-23: compute_binary_metrics currently makes four separate passes
over y_true/y_pred to compute tp, tn, fp, fn which is inefficient; replace those
four sum(...) expressions with a single loop that iterates once over zip(y_true,
y_pred, strict=True) and increments tp, tn, fp, fn accordingly (e.g., if yt==1
and yp==1: tp +=1, etc.). Update the body of compute_binary_metrics to
initialize the four counters before the loop and compute them in that single
pass, leaving the function's return/signature unchanged.

In `@backend/app/evaluation/lexical_slur/run.py`:
- Around line 42-46: The latency + memory summary block is duplicated across
runs; add a summary() method to the Profiler class that returns the same dict
shape (handle zero-length latencies by returning zeros), computing mean, p95,
and max with round(..., 2) and rounding peak_memory_mb to 2 decimals, and then
replace the inline block in lexical_slur/run.py (and other run.py files like
pii/run.py and gender_assumption_bias/run.py) with a call to p.summary() where p
is the Profiler instance.

Comment on lines +42 to +49
"performance": {
"latency_ms": {
"mean": round(sum(p.latencies) / len(p.latencies), 2),
"p95": round(sorted(p.latencies)[int(len(p.latencies) * 0.95)], 2),
"max": round(max(p.latencies), 2),
},
"memory_mb": round(p.peak_memory_mb, 2),
},

⚠️ Potential issue | 🟡 Minor

Guard against empty p.latencies to avoid ZeroDivisionError / IndexError.

This performance block is newly added. If the input CSV is empty (or all rows are somehow filtered), p.latencies will be an empty list, and Line 44 (sum(...) / len(...)) will raise ZeroDivisionError, Line 45 will raise IndexError, and Line 46 will raise ValueError. The same issue exists in lexical_slur/run.py.

The cleanest fix is the Profiler.summary() helper suggested in my comment on lexical_slur/run.py, which centralizes the guard and eliminates duplication. If you'd rather keep things inline, a simple if p.latencies: guard suffices.
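
A minimal sketch of the inline alternative (the dict shape mirrors the block quoted above; the helper name latency_stats and the use of None for empty runs are assumptions):

```python
def latency_stats(latencies: list[float]) -> dict:
    # Hypothetical helper; guards against an empty list before computing stats.
    if not latencies:
        return {"mean": None, "p95": None, "max": None}
    ordered = sorted(latencies)
    return {
        "mean": round(sum(ordered) / len(ordered), 2),
        "p95": round(ordered[int(len(ordered) * 0.95)], 2),
        "max": round(ordered[-1], 2),
    }
```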

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/pii/run.py` around lines 42 - 49, The performance
block currently assumes p.latencies is non-empty which can raise
ZeroDivisionError/IndexError/ValueError; either implement and call a centralized
Profiler.summary() helper that returns safe values (e.g., mean, p95, max as
numbers or None) for an empty latencies list and use that result here, or wrap
the existing computation in an if p.latencies: guard and otherwise set "mean",
"p95", "max" to None (or 0 per project convention); apply the same change to
lexical_slur/run.py so both places reference the safe summary implementation or
the same guard logic.

rkritika1508 force-pushed the feat/evaluation-update branch from 0e1a079 to 660e2de on February 20, 2026 08:30

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/README.md (1)

115-156: Triplicate output-structure documentation is a maintenance liability.

The same output paths appear in three different formats within ~40 lines:

  1. "Standardized output structure" (lines 115–127, indented tree)
  2. "Expected aggregate outputs" (lines 140–149, flat paths, inside the run_all_evaluations.sh block)
  3. Per-validator "Expected outputs" blocks (lines 151–156, 163–168, 175–180, box-drawing trees)

Every future output-path change must be applied in all three places. Consider keeping just the consolidated "Standardized output structure" at lines 115–127 and, in each per-validator section, replacing the redundant block with a single sentence referencing it (e.g., "Outputs are written to app/evaluation/outputs/gender_assumption_bias/ — see Standardized output structure.").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 115 - 156, The README currently repeats the
same output paths three times; remove the duplicate "Expected aggregate outputs"
and each per-validator "Expected outputs" block and keep only the consolidated
"Standardized output structure" section; in the per-validator sections for
lexical_slur, pii_remover, and gender_assumption_bias replace the removed trees
with a single sentence such as "Outputs are written to
app/evaluation/outputs/<validator_name>/ — see the 'Standardized output
structure' section" and optionally add an internal link to that section (e.g.,
"#running-evaluation-tests") so future path changes are maintained in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/README.md`:
- Line 100: Fix the missing space after the period following the Google Drive
link in the README sentence; update the README.md text fragment "[Google
Drive](https://drive.google.com/...)." so that it becomes "[Google
Drive](https://drive.google.com/...). This contains..." (i.e., add a single
space after the period) to separate the two sentences properly.

---

Nitpick comments:
In `@backend/README.md`:
- Around line 115-156: The README currently repeats the same output paths three
times; remove the duplicate "Expected aggregate outputs" and each per-validator
"Expected outputs" block and keep only the consolidated "Standardized output
structure" section; in the per-validator sections for lexical_slur, pii_remover,
and gender_assumption_bias replace the removed trees with a single sentence such
as "Outputs are written to app/evaluation/outputs/<validator_name>/ — see the
'Standardized output structure' section" and optionally add an internal link to
that section (e.g., "#running-evaluation-tests") so future path changes are
maintained in one place.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/README.md (1)

142-154: Inconsistent per-validator output descriptions and a trailing hard-break space.

Line 142 has a trailing space after the backtick command (a space before the newline), which Markdown renders as a <br> — likely unintentional.

Additionally, lines 146–148 (PII) and 152–154 (gender bias) each carry specific predictions.csv / metrics.json descriptions, but lexical slur has none, creating an inconsistency. Either add an analogous block for lexical slur or drop the per-validator inline descriptions and rely solely on the generic summary at lines 111–113.

📝 Remove trailing space on line 142
-- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py` 
+- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 142 - 154, Remove the trailing space after
the backticked command `python app/evaluation/lexical_slur/run.py` and add a
short per-validator output description for the lexical slur evaluator to match
the others: state what `predictions.csv` and `metrics.json` contain for lexical
slur (e.g., samples with labels/predictions in `predictions.csv` and evaluation
metrics in `metrics.json`), or alternatively remove the per-validator
descriptions for PII and Gender Assumption Bias so all validators rely on the
generic summary; reference the command `python
app/evaluation/lexical_slur/run.py`, and the filenames `predictions.csv` and
`metrics.json` when making the edit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/README.md`:
- Around line 129-136: The fenced code block in backend/README.md breaks the
enclosing list item; fix by making the code block and the following paragraph
part of the same list item—either indent the fenced block and continuation lines
by four spaces (so the ```bash block and the "This script runs the evaluators in
sequence:" paragraph remain inside the bullet) or replace the list entry using
the proposed reflow: move the "To run all evaluation scripts together:" line
before the fenced block, wrap the block with ```bash ... ```, and then list the
three evaluator paths (`app/evaluation/lexical_slur/run.py`,
`app/evaluation/pii/run.py`, `app/evaluation/gender_assumption_bias/run.py`) as
sub-list items; ensure the script name scripts/run_all_evaluations.sh is
included exactly in the fenced block.

---

Duplicate comments:
In `@backend/README.md`:
- Line 100: Fix the missing space in backend/README.md by updating the sentence
that currently reads "[Google
Drive](https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89).This
contains" to include a space after the period so it becomes "...89). This
contains"; locate and edit the sentence in the README content describing where
to download CSV datasets and ensure proper spacing between the two sentences.
- Line 109: Update the README and evaluation runner so the ban list validator is
either included in the automated evaluation or explicitly documented as
deferred: add the ban-list evaluation invocation to run_all_evaluations.sh (the
script that currently calls the three evaluators) and update the README section
mentioning curated datasets to list the ban list evaluation script and expected
outputs (accuracy/latency), or alternatively add a clear note in README
explaining why the ban list evaluation is omitted and create a tracking issue;
reference the ban list validator name used in the repo and the
run_all_evaluations.sh entry to locate where to add the call and documentation.

---

Nitpick comments:
In `@backend/README.md`:
- Around line 142-154: Remove the trailing space after the backticked command
`python app/evaluation/lexical_slur/run.py` and add a short per-validator output
description for the lexical slur evaluator to match the others: state what
`predictions.csv` and `metrics.json` contain for lexical slur (e.g., samples
with labels/predictions in `predictions.csv` and evaluation metrics in
`metrics.json`), or alternatively remove the per-validator descriptions for PII
and Gender Assumption Bias so all validators rely on the generic summary;
reference the command `python app/evaluation/lexical_slur/run.py`, and the
filenames `predictions.csv` and `metrics.json` when making the edit.

Comment on lines +129 to +136
- To run all evaluation scripts together, use:
```bash
bash scripts/run_all_evaluations.sh
```
This script runs the evaluators in sequence:
- `app/evaluation/lexical_slur/run.py`
- `app/evaluation/pii/run.py`
- `app/evaluation/gender_assumption_bias/run.py`

⚠️ Potential issue | 🟡 Minor

Fenced code block inside a list item must be indented to maintain list context.

On GitHub (CommonMark), an unindented ``` at column 0 terminates the enclosing list item. The "This script runs the evaluators in sequence:" paragraph and the sub-list (lines 133–136) are then rendered as a separate top-level block rather than as a continuation of the bullet.

📝 Proposed fix — indent the code block and continuation text
-
-- To run all evaluation scripts together, use:
-```bash
-bash scripts/run_all_evaluations.sh
-```
-This script runs the evaluators in sequence:
-- `app/evaluation/lexical_slur/run.py`
-- `app/evaluation/pii/run.py`
-- `app/evaluation/gender_assumption_bias/run.py`
+
+To run all evaluation scripts together:
+
+```bash
+bash scripts/run_all_evaluations.sh
+```
+
+This script runs the evaluators in sequence:
+- `app/evaluation/lexical_slur/run.py`
+- `app/evaluation/pii/run.py`
+- `app/evaluation/gender_assumption_bias/run.py`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 129 - 136, The fenced code block in
backend/README.md breaks the enclosing list item; fix by making the code block
and the following paragraph part of the same list item—either indent the fenced
block and continuation lines by four spaces (so the ```bash block and the "This
script runs the evaluators in sequence:" paragraph remain inside the bullet) or
replace the list entry using the proposed reflow: move the "To run all
evaluation scripts together:" line before the fenced block, wrap the block with
```bash ... ```, and then list the three evaluator paths
(`app/evaluation/lexical_slur/run.py`, `app/evaluation/pii/run.py`,
`app/evaluation/gender_assumption_bias/run.py`) as sub-list items; ensure the
script name scripts/run_all_evaluations.sh is included exactly in the fenced
block.
