Added gender assumption bias evaluation #59

Open

rkritika1508 wants to merge 6 commits into main from feat/evaluation-update

Conversation

rkritika1508 commented Feb 19, 2026

Summary

Target issue is #58.
Explain the motivation for making this change. What existing problem does the pull request solve?

  • Added a script run_all_evaluations.sh to run evaluations for all validators.
  • Added a dataset for the gender assumption bias validator. It contains 300 samples in Hindi, English, and Hinglish. The dataset can be found here.
  • Added code to evaluate the performance of the gender assumption bias validator on the curated dataset.

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add any other information the reviewer may need.

Summary by CodeRabbit

  • New Features

    • Added a gender-assumption bias evaluator that produces predictions, metrics, and performance stats.
    • Added a single-command runner to execute all evaluators in sequence.
  • Improvements

    • PII evaluation and bias runs now record latency and memory; performance stats included in outputs.
    • Numeric outputs (precision, recall, F1, latency, memory) are rounded for clearer reports.
  • Documentation

    • README updated with exact dataset filenames and standardized outputs layout.
  • Chores

    • Updated ignore patterns to exclude dataset CSV files.

rkritika1508 marked this pull request as ready for review on February 19, 2026 at 12:24

coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Adds a gender assumption bias evaluation runner and a shell orchestrator, introduces a Profiler and rounding for metrics in helpers and evaluators, updates PII and lexical slur runners to emit performance stats, updates docs and .gitignore to ignore dataset CSVs, and writes standardized predictions/metrics outputs.

Changes

  • Configuration (/.gitignore): Added ignore pattern backend/app/evaluation/datasets/*.csv; preserved the existing predictions.csv ignore entry.
  • Gender Assumption Bias Evaluation (backend/app/evaluation/gender_assumption_bias/run.py): New evaluation script: loads the dataset, runs GenderAssumptionBias on biased and neutral inputs, profiles latency/memory, computes binary metrics, and writes predictions.csv and metrics.json (including latency and memory).
  • Evaluation Runner (backend/scripts/run_all_evaluations.sh): New bash script to run all evaluation runners sequentially with strict error handling and per-runner status output.
  • Common helpers / profiling (backend/app/evaluation/common/helper.py): Adds Profiler; compute_binary_metrics now rounds precision, recall, and f1 to 2 decimals; helper exports updated to include Profiler. An illustrative profiler sketch follows this list.
  • PII evaluation (backend/app/evaluation/pii/run.py, backend/app/evaluation/pii/entity_metrics.py): The PII runner now uses Profiler to record latencies and peak memory; entity metrics are rounded to 2 decimals; the metrics JSON is augmented with performance stats.
  • Lexical slur evaluation (backend/app/evaluation/lexical_slur/run.py): Latency and memory performance numbers in the output JSON are rounded to 2 decimal places.
  • Documentation (backend/README.md): Docs updated with explicit dataset filenames/placement, instructions for running the new gender_assumption_bias evaluator, a standardized evaluation outputs layout, and a reference to run_all_evaluations.sh.
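
For orientation, here is a minimal sketch of what a profiler of this kind could look like. It is illustrative only: it assumes time.perf_counter and tracemalloc, and is not the actual contents of backend/app/evaluation/common/helper.py.

```python
import time
import tracemalloc


class Profiler:
    """Illustrative sketch only; the real helper in
    backend/app/evaluation/common/helper.py may differ in API and units."""

    def __init__(self) -> None:
        self.latencies: list[float] = []  # per-call latency in milliseconds
        self.peak_memory_mb: float = 0.0

    def start(self) -> None:
        # Begin tracking allocations and note the wall-clock start time.
        tracemalloc.start()
        self._t0 = time.perf_counter()

    def stop(self) -> None:
        # Record elapsed time in ms and the peak traced memory in MB.
        self.latencies.append((time.perf_counter() - self._t0) * 1000)
        _, peak = tracemalloc.get_traced_memory()
        self.peak_memory_mb = max(self.peak_memory_mb, peak / (1024 * 1024))
        tracemalloc.stop()
```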

Sequence Diagram

sequenceDiagram
    participant Script as "run.py"
    participant Dataset as "Dataset Loader"
    participant Validator as "GenderAssumptionBias\nValidator"
    participant Profiler as "Profiler"
    participant Metrics as "Metrics Compute"
    participant Output as "CSV/JSON\nWriter"

    Script->>Dataset: Load dataset
    Script->>Validator: Initialize validator

    loop for each sample (biased & neutral)
        Script->>Profiler: start recording
        Script->>Validator: Evaluate input
        Validator-->>Script: Prediction
        Script->>Profiler: stop recording
    end

    Script->>Metrics: Compute binary metrics (precision, recall, f1)
    Metrics-->>Script: Metrics
    Script->>Output: Write predictions.csv
    Script->>Output: Write metrics.json (latency & memory stats)
    Output-->>Script: Complete
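
The same flow as a condensed, self-contained sketch. The dataset and output filenames follow the walkthrough; the directory constants, the "text" and "label" column names, the accuracy-only metric, and the stub predict_bias() are placeholders, not the repository's GenderAssumptionBias validator or its compute_binary_metrics helper.

```python
# Condensed sketch of the diagrammed flow; not the repository's run.py.
import json
import time
from pathlib import Path

import pandas as pd

BASE_DIR = Path("backend/app/evaluation")  # assumed location
OUT_DIR = Path("backend/app/evaluation/outputs/gender_assumption_bias")


def predict_bias(text: str) -> int:
    """Stand-in for the validator call; returns 1 for 'biased', 0 otherwise."""
    return 0


def main() -> None:
    df = pd.read_csv(BASE_DIR / "datasets" / "gender_bias_assumption_dataset.csv")
    latencies, preds = [], []
    for text in df["text"]:
        t0 = time.perf_counter()
        preds.append(predict_bias(text))
        latencies.append((time.perf_counter() - t0) * 1000)  # ms

    OUT_DIR.mkdir(parents=True, exist_ok=True)
    df.assign(prediction=preds).to_csv(OUT_DIR / "predictions.csv", index=False)

    accuracy = sum(p == y for p, y in zip(preds, df["label"])) / len(df) if len(df) else 0.0
    mean_latency = round(sum(latencies) / len(latencies), 2) if latencies else 0.0
    metrics = {"accuracy": round(accuracy, 2), "latency_ms_mean": mean_latency}
    (OUT_DIR / "metrics.json").write_text(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    main()
```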

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • nishika26

Poem

🐇 I hop through rows and time each thoughtful line,
I tally bias checks and round each metric fine,
I sniff the CSVs, profile pings, and memory too,
I write the preds and metrics — neat, concise, and true,
Bravo — the evaluators hum; I nibble a carrot or two.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title accurately describes the main change, a new gender assumption bias evaluation with a dedicated run script, dataset support, and evaluation code.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
backend/app/evaluation/gender_assumption_bias/run.py (2)

61-62: p95 index is off-by-one under the standard nearest-rank convention.

int(len(p.latencies) * 0.95) truncates, yielding the value above the 95th percentile marker (e.g., index 570 out of 600, which is the 95.17th percentile). The standard nearest-rank formula is ceil(p * N) - 1. For an evaluation script the difference is negligible, but using numpy or the corrected index avoids the discrepancy.

♻️ Proposed fix using math.ceil
+import math
 ...
-"p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
+"p95": sorted(p.latencies)[math.ceil(len(p.latencies) * 0.95) - 1],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/gender_assumption_bias/run.py` around lines 61 - 62,
The p95 calculation in run.py uses int(len(p.latencies) * 0.95) which truncates
and yields an off-by-one index under the nearest-rank convention; update the p95
computation for p.latencies in the block that builds the dict with "mean" and
"p95" to use the nearest-rank formula (ceil(0.95 * N) - 1) or use a proper
percentile routine (e.g., numpy.percentile) to select the correct index,
ensuring you reference p.latencies and replace the current int(...) index
expression with the corrected ceil-based index or a call to numpy.percentile.

16-68: Wrap top-level execution in an if __name__ == "__main__" guard.

All 50+ lines of evaluation logic execute at import time. This prevents importing any symbol from this module in tests or tooling without triggering file I/O, network, and CSV writes.

♻️ Proposed refactor
+def main():
     df = pd.read_csv(BASE_DIR / "datasets" / "gender_bias_assumption_dataset.csv")
     validator = GenderAssumptionBias()
     ...
     write_json(...)

+if __name__ == "__main__":
+    main()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/gender_assumption_bias/run.py` around lines 16 - 68,
The module runs heavy evaluation logic at import time (creating Profiler,
reading CSV into df, running validator.validate, computing metrics, and calling
write_csv/write_json); wrap that top-level execution in a main guard by moving
the existing script body into a function (e.g., def main(): ...) and then add if
__name__ == "__main__": main(), ensuring symbols like Profiler,
GenderAssumptionBias/validator, compute_binary_metrics, write_csv and write_json
remain importable without triggering file I/O or network calls; preserve
behavior and variable names (df, p, y_true/y_pred, metrics) inside main and keep
only definitions/imports at module scope.
backend/scripts/run_all_evaluations.sh (1)

17-22: Add cd "$BACKEND_DIR" before the loop to guarantee uv run picks up the correct project context.

uv run discovers pyproject.toml by ascending from the CWD. If this script is invoked from the repo root (e.g., bash backend/scripts/run_all_evaluations.sh), uv may resolve a different project or fail to find any environment entirely.

♻️ Proposed fix
+cd "$BACKEND_DIR"
+
 for runner in "${RUNNERS[@]}"; do
   name="$(basename "$(dirname "$runner")")"
   echo ""
   echo "==> [$name] $runner"
   uv run python "$runner"
 done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/scripts/run_all_evaluations.sh` around lines 17 - 22, The script
loops over RUNNERS and calls `uv run python "$runner"` but doesn't set the
working directory, so `uv` may ascend from the caller CWD and pick the wrong
pyproject; change the script to `cd` into the backend project before the loop
(e.g., `cd "$BACKEND_DIR"` or use `pushd "$BACKEND_DIR"`/`popd` around the loop)
so that `uv run python` runs from the intended project context and still restore
the original CWD afterward if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/gender_assumption_bias/run.py`:
- Line 17: Remove the debug print statement by deleting the standalone
print(BASE_DIR, OUT_DIR) call in run.py; if runtime visibility of
BASE_DIR/OUT_DIR is required, replace it with a proper logger call (e.g.,
logger.debug) referencing BASE_DIR and OUT_DIR instead of using print.
- Line 16: Wrap the pd.read_csv call that loads
"gender_bias_assumption_dataset.csv" (the df = pd.read_csv(BASE_DIR / "datasets"
/ "gender_bias_assumption_dataset.csv") statement) in a try/except catching
FileNotFoundError, and in the except log or raise a clear, actionable error that
names the missing file, where it was expected (BASE_DIR / "datasets"), and
instructs the operator to download the dataset from the Google Drive location;
ensure the handler either raises a RuntimeError with that message or calls the
module's logger/exit path (e.g., logger.error(...); sys.exit(1)) so the failure
is user-friendly.

---

Nitpick comments:
In `@backend/app/evaluation/gender_assumption_bias/run.py`:
- Around line 61-62: The p95 calculation in run.py uses int(len(p.latencies) *
0.95) which truncates and yields an off-by-one index under the nearest-rank
convention; update the p95 computation for p.latencies in the block that builds
the dict with "mean" and "p95" to use the nearest-rank formula (ceil(0.95 * N) -
1) or use a proper percentile routine (e.g., numpy.percentile) to select the
correct index, ensuring you reference p.latencies and replace the current
int(...) index expression with the corrected ceil-based index or a call to
numpy.percentile.
- Around line 16-68: The module runs heavy evaluation logic at import time
(creating Profiler, reading CSV into df, running validator.validate, computing
metrics, and calling write_csv/write_json); wrap that top-level execution in a
main guard by moving the existing script body into a function (e.g., def main():
...) and then add if __name__ == "__main__": main(), ensuring symbols like
Profiler, GenderAssumptionBias/validator, compute_binary_metrics, write_csv and
write_json remain importable without triggering file I/O or network calls;
preserve behavior and variable names (df, p, y_true/y_pred, metrics) inside main
and keep only definitions/imports at module scope.

In `@backend/scripts/run_all_evaluations.sh`:
- Around line 17-22: The script loops over RUNNERS and calls `uv run python
"$runner"` but doesn't set the working directory, so `uv` may ascend from the
caller CWD and pick the wrong pyproject; change the script to `cd` into the
backend project before the loop (e.g., `cd "$BACKEND_DIR"` or use `pushd
"$BACKEND_DIR"`/`popd` around the loop) so that `uv run python` runs from the
intended project context and still restore the original CWD afterward if
necessary.

rkritika1508 self-assigned this Feb 19, 2026
nishika26 linked an issue Feb 20, 2026 that may be closed by this pull request

nishika26 commented Feb 20, 2026

@rkritika1508, I ran the script, and here are my inputs:
1) All the output values should be rounded off to two decimal places.
2) I don't see a record of latency in the PII remover.
3) Shouldn't we have a script for the ban list validator as well, at least for the demo, so we have accuracy and latency numbers?

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
backend/app/evaluation/common/helper.py (1)

19-23: Four independent passes over the same data to compute TP/TN/FP/FN.

Each confusion-matrix counter zips and iterates the full sequence separately. A single pass would be clearer and would do ~4× fewer iterations. Fine for small evaluation datasets, but easy to tighten.

♻️ Single-pass alternative
 def compute_binary_metrics(y_true, y_pred):
-    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
-    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
-    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
-    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
+    tp = tn = fp = fn = 0
+    for yt, yp in zip(y_true, y_pred, strict=True):
+        if yt == 1 and yp == 1:
+            tp += 1
+        elif yt == 0 and yp == 0:
+            tn += 1
+        elif yt == 0 and yp == 1:
+            fp += 1
+        else:
+            fn += 1
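
The derived metrics then follow the standard definitions. A hedged sketch (the function name and the zero-division guards are assumptions; the rounding to 2 decimals matches the walkthrough):

```python
def finalize_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    # Standard binary-classification formulas, rounded to 2 decimals.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
    }
```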
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/common/helper.py` around lines 19 - 23,
compute_binary_metrics currently makes four separate passes over y_true/y_pred
to compute tp, tn, fp, fn which is inefficient; replace those four sum(...)
expressions with a single loop that iterates once over zip(y_true, y_pred,
strict=True) and increments tp, tn, fp, fn accordingly (e.g., if yt==1 and
yp==1: tp +=1, etc.). Update the body of compute_binary_metrics to initialize
the four counters before the loop and compute them in that single pass, leaving
the function's return/signature unchanged.
backend/app/evaluation/lexical_slur/run.py (1)

42-46: Rounding looks good; consider extracting the latency summary into a shared helper.

The round(..., 2) additions address the reviewer feedback. However, the same latency-stats block (mean / p95 / max + memory) is duplicated verbatim in pii/run.py (and likely gender_assumption_bias/run.py). A small helper on Profiler (e.g. summary() -> dict) would eliminate this repetition and keep future format changes in one place.

♻️ Example: add a summary method to Profiler

Add to backend/app/evaluation/common/helper.py:

class Profiler:
    # ... existing methods ...

    def summary(self) -> dict:
        n = len(self.latencies)
        if n == 0:
            return {"latency_ms": {"mean": 0, "p95": 0, "max": 0}, "memory_mb": 0}
        return {
            "latency_ms": {
                "mean": round(sum(self.latencies) / n, 2),
                "p95": round(sorted(self.latencies)[int(n * 0.95)], 2),
                "max": round(max(self.latencies), 2),
            },
            "memory_mb": round(self.peak_memory_mb, 2),
        }

Then each run.py simply calls "performance": p.summary().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/lexical_slur/run.py` around lines 42 - 46, The latency
+ memory summary block is duplicated across runs; add a summary() method to the
Profiler class that returns the same dict shape (handle zero-length latencies by
returning zeros), computing mean, p95, and max with round(..., 2) and rounding
peak_memory_mb to 2 decimals, and then replace the inline block in
lexical_slur/run.py (and other run.py files like pii/run.py and
gender_assumption_bias/run.py) with a call to p.summary() where p is the
Profiler instance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/pii/run.py`:
- Around line 42-49: The performance block currently assumes p.latencies is
non-empty which can raise ZeroDivisionError/IndexError/ValueError; either
implement and call a centralized Profiler.summary() helper that returns safe
values (e.g., mean, p95, max as numbers or None) for an empty latencies list and
use that result here, or wrap the existing computation in an if p.latencies:
guard and otherwise set "mean", "p95", "max" to None (or 0 per project
convention); apply the same change to lexical_slur/run.py so both places
reference the safe summary implementation or the same guard logic.

In `@backend/README.md`:
- Around line 153-165: Add a language specifier to the fenced output block in
README.md (change the open fence to use ```text or ```console for the
app/evaluation/outputs/gender_assumption_bias/ tree) so it satisfies MD040 and
renders correctly, and ensure the closing ``` remains. Also add a short
documented subsection for the new backend/scripts/run_all_evaluations.sh script:
state its purpose (orchestrates all per-validator runs), provide the basic usage
(how to invoke the script), and list expected outputs (where aggregated
validator outputs and combined metrics are written, e.g.,
app/evaluation/outputs/ and any combined metrics file), so readers know to run
that script in addition to individual validator run.py commands.
- Line 100: Replace the non-descriptive link text "[here](...)" in the README
with a meaningful label (e.g., "validator datasets on Google Drive" or
"validation CSV dataset folder") so the link conveys its target; update the
sentence that instructs users to download the CSVs into
backend/app/evaluation/datasets/ to use that descriptive label instead of
"here".
- Line 109: The README sentence contradicts itself by calling "lexical slur
match, ban list, gender assumption bias" deterministic (so testing "doesn't make
much sense") while immediately stating curated datasets exist and that a
dedicated run.py (gender assumption bias benchmark) is a main deliverable;
update the paragraph to remove the contradiction and fix wording: clarify which
validators are deterministic (if any), state that curated datasets were created
specifically for benchmarking lexical slur match and gender assumption bias and
that run.py executes the gender assumption bias evaluation, and replace "cause"
with "because" for clarity; reference the terms "lexical slur match", "ban
list", "gender assumption bias", and "run.py" so editors can locate and reword
the README accordingly.

---

Nitpick comments:
In `@backend/app/evaluation/common/helper.py`:
- Around line 19-23: compute_binary_metrics currently makes four separate passes
over y_true/y_pred to compute tp, tn, fp, fn which is inefficient; replace those
four sum(...) expressions with a single loop that iterates once over zip(y_true,
y_pred, strict=True) and increments tp, tn, fp, fn accordingly (e.g., if yt==1
and yp==1: tp +=1, etc.). Update the body of compute_binary_metrics to
initialize the four counters before the loop and compute them in that single
pass, leaving the function's return/signature unchanged.

In `@backend/app/evaluation/lexical_slur/run.py`:
- Around line 42-46: The latency + memory summary block is duplicated across
runs; add a summary() method to the Profiler class that returns the same dict
shape (handle zero-length latencies by returning zeros), computing mean, p95,
and max with round(..., 2) and rounding peak_memory_mb to 2 decimals, and then
replace the inline block in lexical_slur/run.py (and other run.py files like
pii/run.py and gender_assumption_bias/run.py) with a call to p.summary() where p
is the Profiler instance.

Comment on lines +42 to +49
"performance": {
"latency_ms": {
"mean": round(sum(p.latencies) / len(p.latencies), 2),
"p95": round(sorted(p.latencies)[int(len(p.latencies) * 0.95)], 2),
"max": round(max(p.latencies), 2),
},
"memory_mb": round(p.peak_memory_mb, 2),
},

⚠️ Potential issue | 🟡 Minor

Guard against empty p.latencies to avoid ZeroDivisionError / IndexError.

This performance block is newly added. If the input CSV is empty (or all rows are somehow filtered), p.latencies will be an empty list, and Line 44 (sum(...) / len(...)) will raise ZeroDivisionError, Line 45 will raise IndexError, and Line 46 will raise ValueError. The same issue exists in lexical_slur/run.py.

The cleanest fix is the Profiler.summary() helper suggested in my comment on lexical_slur/run.py, which centralizes the guard and eliminates duplication. If you'd rather keep things inline, a simple if p.latencies: guard suffices.
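
A minimal sketch of the inline alternative (the dict shape mirrors the block quoted above; the helper name latency_stats and the use of None for empty runs are assumptions):

```python
def latency_stats(latencies: list[float]) -> dict:
    # Hypothetical helper; guards against an empty list before computing stats.
    if not latencies:
        return {"mean": None, "p95": None, "max": None}
    ordered = sorted(latencies)
    return {
        "mean": round(sum(ordered) / len(ordered), 2),
        "p95": round(ordered[int(len(ordered) * 0.95)], 2),
        "max": round(ordered[-1], 2),
    }
```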

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/pii/run.py` around lines 42 - 49, The performance
block currently assumes p.latencies is non-empty which can raise
ZeroDivisionError/IndexError/ValueError; either implement and call a centralized
Profiler.summary() helper that returns safe values (e.g., mean, p95, max as
numbers or None) for an empty latencies list and use that result here, or wrap
the existing computation in an if p.latencies: guard and otherwise set "mean",
"p95", "max" to None (or 0 per project convention); apply the same change to
lexical_slur/run.py so both places reference the safe summary implementation or
the same guard logic.

rkritika1508 force-pushed the feat/evaluation-update branch from 0e1a079 to 660e2de on February 20, 2026 08:30

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/README.md (1)

115-156: Triplicate output-structure documentation is a maintenance liability.

The same output paths appear in three different formats within ~40 lines:

  1. "Standardized output structure" (lines 115–127, indented tree)
  2. "Expected aggregate outputs" (lines 140–149, flat paths, inside the run_all_evaluations.sh block)
  3. Per-validator "Expected outputs" blocks (lines 151–156, 163–168, 175–180, box-drawing trees)

Every future output-path change must be applied in all three places. Consider keeping just the consolidated "Standardized output structure" at lines 115–127 and, in each per-validator section, replacing the redundant block with a single sentence referencing it (e.g., "Outputs are written to app/evaluation/outputs/gender_assumption_bias/ — see Standardized output structure.").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 115 - 156, The README currently repeats the
same output paths three times; remove the duplicate "Expected aggregate outputs"
and each per-validator "Expected outputs" block and keep only the consolidated
"Standardized output structure" section; in the per-validator sections for
lexical_slur, pii_remover, and gender_assumption_bias replace the removed trees
with a single sentence such as "Outputs are written to
app/evaluation/outputs/<validator_name>/ — see the 'Standardized output
structure' section" and optionally add an internal link to that section (e.g.,
"#running-evaluation-tests") so future path changes are maintained in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/README.md`:
- Line 100: Fix the missing space after the period following the Google Drive
link in the README sentence; update the README.md text fragment "[Google
Drive](https://drive.google.com/...)." so that it becomes "[Google
Drive](https://drive.google.com/...). This contains..." (i.e., add a single
space after the period) to separate the two sentences properly.

---

Nitpick comments:
In `@backend/README.md`:
- Around line 115-156: The README currently repeats the same output paths three
times; remove the duplicate "Expected aggregate outputs" and each per-validator
"Expected outputs" block and keep only the consolidated "Standardized output
structure" section; in the per-validator sections for lexical_slur, pii_remover,
and gender_assumption_bias replace the removed trees with a single sentence such
as "Outputs are written to app/evaluation/outputs/<validator_name>/ — see the
'Standardized output structure' section" and optionally add an internal link to
that section (e.g., "#running-evaluation-tests") so future path changes are
maintained in one place.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/README.md (1)

142-154: Inconsistent per-validator output descriptions and a trailing hard-break space.

Line 142 has a trailing space after the backtick command (a space before the newline), which Markdown renders as a <br> — likely unintentional.

Additionally, lines 146–148 (PII) and 152–154 (gender bias) each carry specific predictions.csv / metrics.json descriptions, but lexical slur has none, creating an inconsistency. Either add an analogous block for lexical slur or drop the per-validator inline descriptions and rely solely on the generic summary at lines 111–113.

📝 Remove trailing space on line 142
-- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py` 
+- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 142 - 154, Remove the trailing space after
the backticked command `python app/evaluation/lexical_slur/run.py` and add a
short per-validator output description for the lexical slur evaluator to match
the others: state what `predictions.csv` and `metrics.json` contain for lexical
slur (e.g., samples with labels/predictions in `predictions.csv` and evaluation
metrics in `metrics.json`), or alternatively remove the per-validator
descriptions for PII and Gender Assumption Bias so all validators rely on the
generic summary; reference the command `python
app/evaluation/lexical_slur/run.py`, and the filenames `predictions.csv` and
`metrics.json` when making the edit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/README.md`:
- Around line 129-136: The fenced code block in backend/README.md breaks the
enclosing list item; fix by making the code block and the following paragraph
part of the same list item—either indent the fenced block and continuation lines
by four spaces (so the ```bash block and the "This script runs the evaluators in
sequence:" paragraph remain inside the bullet) or replace the list entry using
the proposed reflow: move the "To run all evaluation scripts together:" line
before the fenced block, wrap the block with ```bash ... ```, and then list the
three evaluator paths (`app/evaluation/lexical_slur/run.py`,
`app/evaluation/pii/run.py`, `app/evaluation/gender_assumption_bias/run.py`) as
sub-list items; ensure the script name scripts/run_all_evaluations.sh is
included exactly in the fenced block.

---

Duplicate comments:
In `@backend/README.md`:
- Line 100: Fix the missing space in backend/README.md by updating the sentence
that currently reads "[Google
Drive](https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89).This
contains" to include a space after the period so it becomes "...89). This
contains"; locate and edit the sentence in the README content describing where
to download CSV datasets and ensure proper spacing between the two sentences.
- Line 109: Update the README and evaluation runner so the ban list validator is
either included in the automated evaluation or explicitly documented as
deferred: add the ban-list evaluation invocation to run_all_evaluations.sh (the
script that currently calls the three evaluators) and update the README section
mentioning curated datasets to list the ban list evaluation script and expected
outputs (accuracy/latency), or alternatively add a clear note in README
explaining why the ban list evaluation is omitted and create a tracking issue;
reference the ban list validator name used in the repo and the
run_all_evaluations.sh entry to locate where to add the call and documentation.

---

Nitpick comments:
In `@backend/README.md`:
- Around line 142-154: Remove the trailing space after the backticked command
`python app/evaluation/lexical_slur/run.py` and add a short per-validator output
description for the lexical slur evaluator to match the others: state what
`predictions.csv` and `metrics.json` contain for lexical slur (e.g., samples
with labels/predictions in `predictions.csv` and evaluation metrics in
`metrics.json`), or alternatively remove the per-validator descriptions for PII
and Gender Assumption Bias so all validators rely on the generic summary;
reference the command `python app/evaluation/lexical_slur/run.py`, and the
filenames `predictions.csv` and `metrics.json` when making the edit.

Comment on lines +129 to +136
- To run all evaluation scripts together, use:
```bash
bash scripts/run_all_evaluations.sh
```
This script runs the evaluators in sequence:
- `app/evaluation/lexical_slur/run.py`
- `app/evaluation/pii/run.py`
- `app/evaluation/gender_assumption_bias/run.py`

⚠️ Potential issue | 🟡 Minor

Fenced code block inside a list item must be indented to maintain list context.

On GitHub (CommonMark), an unindented ``` at column 0 terminates the enclosing list item. The "This script runs the evaluators in sequence:" paragraph and the sub-list (lines 133–136) are then rendered as a separate top-level block rather than as a continuation of the bullet.

📝 Proposed fix — indent the code block and continuation text
-
-- To run all evaluation scripts together, use:
-```bash
-bash scripts/run_all_evaluations.sh
-```
-This script runs the evaluators in sequence:
-- `app/evaluation/lexical_slur/run.py`
-- `app/evaluation/pii/run.py`
-- `app/evaluation/gender_assumption_bias/run.py`
+
+To run all evaluation scripts together:
+
+```bash
+bash scripts/run_all_evaluations.sh
+```
+
+This script runs the evaluators in sequence:
+- `app/evaluation/lexical_slur/run.py`
+- `app/evaluation/pii/run.py`
+- `app/evaluation/gender_assumption_bias/run.py`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 129 - 136, The fenced code block in
backend/README.md breaks the enclosing list item; fix by making the code block
and the following paragraph part of the same list item—either indent the fenced
block and continuation lines by four spaces (so the ```bash block and the "This
script runs the evaluators in sequence:" paragraph remain inside the bullet) or
replace the list entry using the proposed reflow: move the "To run all
evaluation scripts together:" line before the fenced block, wrap the block with
```bash ... ```, and then list the three evaluator paths
(`app/evaluation/lexical_slur/run.py`, `app/evaluation/pii/run.py`,
`app/evaluation/gender_assumption_bias/run.py`) as sub-list items; ensure the
script name scripts/run_all_evaluations.sh is included exactly in the fenced
block.
