Conversation
📝 Walkthrough

Adds a gender assumption bias evaluation runner and a shell orchestrator, introduces a Profiler and rounding for metrics in helpers and evaluators, updates PII and lexical slur runners to emit performance stats, updates docs and .gitignore to ignore dataset CSVs, and writes standardized predictions/metrics outputs.
Sequence Diagram

    sequenceDiagram
        participant Script as "run.py"
        participant Dataset as "Dataset Loader"
        participant Validator as "GenderAssumptionBias\nValidator"
        participant Profiler as "Profiler"
        participant Metrics as "Metrics Compute"
        participant Output as "CSV/JSON\nWriter"
        Script->>Dataset: Load dataset
        Script->>Validator: Initialize validator
        loop for each sample (biased & neutral)
            Script->>Profiler: start recording
            Script->>Validator: Evaluate input
            Validator-->>Script: Prediction
            Script->>Profiler: stop recording
        end
        Script->>Metrics: Compute binary metrics (precision, recall, f1)
        Metrics-->>Script: Metrics
        Script->>Output: Write predictions.csv
        Script->>Output: Write metrics.json (latency & memory stats)
        Output-->>Script: Complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (3)
backend/app/evaluation/gender_assumption_bias/run.py (2)
61-62: p95 index is off-by-one under the standard nearest-rank convention.
`int(len(p.latencies) * 0.95)` truncates, yielding the value above the 95th percentile marker (e.g., index 570 out of 600, which is the 95.17th percentile). The standard nearest-rank formula is `ceil(p * N) - 1`. For an evaluation script the difference is negligible, but using `numpy` or the corrected index avoids the discrepancy.

♻️ Proposed fix using math.ceil
+import math
 ...
-"p95": sorted(p.latencies)[int(len(p.latencies) * 0.95)],
+"p95": sorted(p.latencies)[math.ceil(len(p.latencies) * 0.95) - 1],

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/gender_assumption_bias/run.py` around lines 61 - 62, The p95 calculation in run.py uses int(len(p.latencies) * 0.95) which truncates and yields an off-by-one index under the nearest-rank convention; update the p95 computation for p.latencies in the block that builds the dict with "mean" and "p95" to use the nearest-rank formula (ceil(0.95 * N) - 1) or use a proper percentile routine (e.g., numpy.percentile) to select the correct index, ensuring you reference p.latencies and replace the current int(...) index expression with the corrected ceil-based index or a call to numpy.percentile.
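For reference, a small self-contained sketch of the two alternatives the comment mentions (corrected nearest-rank index vs. a library percentile); the latency values below are synthetic, not from the project:

```python
import math

import numpy as np

# Synthetic latencies purely for illustration: 600 samples, values 1.0 .. 600.0
ordered = sorted(float(i) for i in range(1, 601))

truncated_p95 = ordered[int(len(ordered) * 0.95)]               # index 570 -> 571.0 (the 95.17th percentile)
nearest_rank_p95 = ordered[math.ceil(len(ordered) * 0.95) - 1]  # index 569 -> 570.0 (nearest-rank p95)
interpolated_p95 = float(np.percentile(ordered, 95))            # numpy's default linear interpolation

print(truncated_p95, nearest_rank_p95, interpolated_p95)
```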
16-68: Wrap top-level execution in an `if __name__ == "__main__"` guard.

All 50+ lines of evaluation logic execute at import time. This prevents importing any symbol from this module in tests or tooling without triggering file I/O, network, and CSV writes.
♻️ Proposed refactor
+def main():
     df = pd.read_csv(BASE_DIR / "datasets" / "gender_bias_assumption_dataset.csv")
     validator = GenderAssumptionBias()
     ...
     write_json(...)
+
+if __name__ == "__main__":
+    main()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/gender_assumption_bias/run.py` around lines 16 - 68, The module runs heavy evaluation logic at import time (creating Profiler, reading CSV into df, running validator.validate, computing metrics, and calling write_csv/write_json); wrap that top-level execution in a main guard by moving the existing script body into a function (e.g., def main(): ...) and then add if __name__ == "__main__": main(), ensuring symbols like Profiler, GenderAssumptionBias/validator, compute_binary_metrics, write_csv and write_json remain importable without triggering file I/O or network calls; preserve behavior and variable names (df, p, y_true/y_pred, metrics) inside main and keep only definitions/imports at module scope.
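As a generic illustration of the guard pattern the comment asks for (a toy module, not the project's runner; the file name, path, and body here are hypothetical):

```python
# toy_runner.py - hypothetical module showing the import-time guard
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent  # cheap constants can stay at module scope


def main() -> None:
    # All heavy work (dataset I/O, the evaluation loop, metric writes) lives here,
    # so `import toy_runner` in a test never triggers it.
    dataset = BASE_DIR / "datasets" / "example.csv"
    print(f"Would evaluate samples from {dataset}")


if __name__ == "__main__":
    main()
```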
backend/scripts/run_all_evaluations.sh (1)

17-22: Add `cd "$BACKEND_DIR"` before the loop to guarantee `uv run` picks up the correct project context.
`uv run` discovers `pyproject.toml` by ascending from the CWD. If this script is invoked from the repo root (e.g., `bash backend/scripts/run_all_evaluations.sh`), `uv` may resolve a different project or fail to find any environment entirely.

♻️ Proposed fix
+cd "$BACKEND_DIR" + for runner in "${RUNNERS[@]}"; do name="$(basename "$(dirname "$runner")")" echo "" echo "==> [$name] $runner" uv run python "$runner" done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/scripts/run_all_evaluations.sh` around lines 17 - 22, The script loops over RUNNERS and calls `uv run python "$runner"` but doesn't set the working directory, so `uv` may ascend from the caller CWD and pick the wrong pyproject; change the script to `cd` into the backend project before the loop (e.g., `cd "$BACKEND_DIR"` or use `pushd "$BACKEND_DIR"`/`popd` around the loop) so that `uv run python` runs from the intended project context and still restore the original CWD afterward if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/gender_assumption_bias/run.py`:
- Line 17: Remove the debug print statement by deleting the standalone
print(BASE_DIR, OUT_DIR) call in run.py; if runtime visibility of
BASE_DIR/OUT_DIR is required, replace it with a proper logger call (e.g.,
logger.debug) referencing BASE_DIR and OUT_DIR instead of using print.
- Line 16: Wrap the pd.read_csv call that loads
"gender_bias_assumption_dataset.csv" (the df = pd.read_csv(BASE_DIR / "datasets"
/ "gender_bias_assumption_dataset.csv") statement) in a try/except catching
FileNotFoundError, and in the except log or raise a clear, actionable error that
names the missing file, where it was expected (BASE_DIR / "datasets"), and
instructs the operator to download the dataset from the Google Drive location;
ensure the handler either raises a RuntimeError with that message or calls the
module's logger/exit path (e.g., logger.error(...); sys.exit(1)) so the failure
is user-friendly.
---
Nitpick comments:
In `@backend/app/evaluation/gender_assumption_bias/run.py`:
- Around line 61-62: The p95 calculation in run.py uses int(len(p.latencies) *
0.95) which truncates and yields an off-by-one index under the nearest-rank
convention; update the p95 computation for p.latencies in the block that builds
the dict with "mean" and "p95" to use the nearest-rank formula (ceil(0.95 * N) -
1) or use a proper percentile routine (e.g., numpy.percentile) to select the
correct index, ensuring you reference p.latencies and replace the current
int(...) index expression with the corrected ceil-based index or a call to
numpy.percentile.
- Around line 16-68: The module runs heavy evaluation logic at import time
(creating Profiler, reading CSV into df, running validator.validate, computing
metrics, and calling write_csv/write_json); wrap that top-level execution in a
main guard by moving the existing script body into a function (e.g., def main():
...) and then add if __name__ == "__main__": main(), ensuring symbols like
Profiler, GenderAssumptionBias/validator, compute_binary_metrics, write_csv and
write_json remain importable without triggering file I/O or network calls;
preserve behavior and variable names (df, p, y_true/y_pred, metrics) inside main
and keep only definitions/imports at module scope.
In `@backend/scripts/run_all_evaluations.sh`:
- Around line 17-22: The script loops over RUNNERS and calls `uv run python
"$runner"` but doesn't set the working directory, so `uv` may ascend from the
caller CWD and pick the wrong pyproject; change the script to `cd` into the
backend project before the loop (e.g., `cd "$BACKEND_DIR"` or use `pushd
"$BACKEND_DIR"`/`popd` around the loop) so that `uv run python` runs from the
intended project context and still restore the original CWD afterward if
necessary.
@rkritika1508, ran the script, and here are my inputs -
Actionable comments posted: 4
🧹 Nitpick comments (2)
backend/app/evaluation/common/helper.py (1)
19-23: Four independent passes over the same data to compute TP/TN/FP/FN.

Each confusion-matrix counter zips and iterates the full sequence separately. A single pass would be clearer and ~4× fewer iterations. Fine for small evaluation datasets, but easy to tighten.
♻️ Single-pass alternative
 def compute_binary_metrics(y_true, y_pred):
-    tp = sum((yt == 1 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
-    tn = sum((yt == 0 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
-    fp = sum((yt == 0 and yp == 1) for yt, yp in zip(y_true, y_pred, strict=True))
-    fn = sum((yt == 1 and yp == 0) for yt, yp in zip(y_true, y_pred, strict=True))
+    tp = tn = fp = fn = 0
+    for yt, yp in zip(y_true, y_pred, strict=True):
+        if yt == 1 and yp == 1:
+            tp += 1
+        elif yt == 0 and yp == 0:
+            tn += 1
+        elif yt == 0 and yp == 1:
+            fp += 1
+        else:
+            fn += 1

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/common/helper.py` around lines 19 - 23, compute_binary_metrics currently makes four separate passes over y_true/y_pred to compute tp, tn, fp, fn which is inefficient; replace those four sum(...) expressions with a single loop that iterates once over zip(y_true, y_pred, strict=True) and increments tp, tn, fp, fn accordingly (e.g., if yt==1 and yp==1: tp +=1, etc.). Update the body of compute_binary_metrics to initialize the four counters before the loop and compute them in that single pass, leaving the function's return/signature unchanged.
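Assembled into a standalone function (only the counting loop comes from the diff above; the precision/recall/F1 math, the rounding, and the return shape are assumptions for illustration), the single-pass version could look like this:

```python
def compute_binary_metrics(y_true, y_pred):
    # One pass over the paired labels instead of four generator expressions.
    tp = tn = fp = fn = 0
    for yt, yp in zip(y_true, y_pred, strict=True):
        if yt == 1 and yp == 1:
            tp += 1
        elif yt == 0 and yp == 0:
            tn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        else:
            fn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0

    return {
        "accuracy": round(accuracy, 2),
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
    }


# Quick sanity check with toy labels
print(compute_binary_metrics([1, 0, 1, 0], [1, 1, 1, 0]))
```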
backend/app/evaluation/lexical_slur/run.py (1)

42-46: Rounding looks good; consider extracting the latency summary into a shared helper.

The `round(..., 2)` additions address the reviewer feedback. However, the same latency-stats block (mean / p95 / max + memory) is duplicated verbatim in `pii/run.py` (and likely `gender_assumption_bias/run.py`). A small helper on `Profiler` (e.g. `summary() -> dict`) would eliminate this repetition and keep future format changes in one place.

♻️ Example: add a summary method to Profiler
Add to `backend/app/evaluation/common/helper.py`:

    class Profiler:
        # ... existing methods ...

        def summary(self) -> dict:
            n = len(self.latencies)
            if n == 0:
                return {"latency_ms": {"mean": 0, "p95": 0, "max": 0}, "memory_mb": 0}
            return {
                "latency_ms": {
                    "mean": round(sum(self.latencies) / n, 2),
                    "p95": round(sorted(self.latencies)[int(n * 0.95)], 2),
                    "max": round(max(self.latencies), 2),
                },
                "memory_mb": round(self.peak_memory_mb, 2),
            }

Then each `run.py` simply calls `"performance": p.summary()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/lexical_slur/run.py` around lines 42 - 46, The latency + memory summary block is duplicated across runs; add a summary() method to the Profiler class that returns the same dict shape (handle zero-length latencies by returning zeros), computing mean, p95, and max with round(..., 2) and rounding peak_memory_mb to 2 decimals, and then replace the inline block in lexical_slur/run.py (and other run.py files like pii/run.py and gender_assumption_bias/run.py) with a call to p.summary() where p is the Profiler instance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/pii/run.py`:
- Around line 42-49: The performance block currently assumes p.latencies is
non-empty which can raise ZeroDivisionError/IndexError/ValueError; either
implement and call a centralized Profiler.summary() helper that returns safe
values (e.g., mean, p95, max as numbers or None) for an empty latencies list and
use that result here, or wrap the existing computation in an if p.latencies:
guard and otherwise set "mean", "p95", "max" to None (or 0 per project
convention); apply the same change to lexical_slur/run.py so both places
reference the safe summary implementation or the same guard logic.
In `@backend/README.md`:
- Around line 153-165: Add a language specifier to the fenced output block in
README.md (change the open fence to use ```text or ```console for the
app/evaluation/outputs/gender_assumption_bias/ tree) so it satisfies MD040 and
renders correctly, and ensure the closing ``` remains. Also add a short
documented subsection for the new backend/scripts/run_all_evaluations.sh script:
state its purpose (orchestrates all per-validator runs), provide the basic usage
(how to invoke the script), and list expected outputs (where aggregated
validator outputs and combined metrics are written, e.g.,
app/evaluation/outputs/ and any combined metrics file), so readers know to run
that script in addition to individual validator run.py commands.
- Line 100: Replace the non-descriptive link text "[here](...)" in the README
with a meaningful label (e.g., "validator datasets on Google Drive" or
"validation CSV dataset folder") so the link conveys its target; update the
sentence that instructs users to download the CSVs into
backend/app/evaluation/datasets/ to use that descriptive label instead of
"here".
- Line 109: The README sentence contradicts itself by calling "lexical slur
match, ban list, gender assumption bias" deterministic (so testing "doesn't make
much sense") while immediately stating curated datasets exist and that a
dedicated run.py (gender assumption bias benchmark) is a main deliverable;
update the paragraph to remove the contradiction and fix wording: clarify which
validators are deterministic (if any), state that curated datasets were created
specifically for benchmarking lexical slur match and gender assumption bias and
that run.py executes the gender assumption bias evaluation, and replace "cause"
with "because" for clarity; reference the terms "lexical slur match", "ban
list", "gender assumption bias", and "run.py" so editors can locate and reword
the README accordingly.
---
Nitpick comments:
In `@backend/app/evaluation/common/helper.py`:
- Around line 19-23: compute_binary_metrics currently makes four separate passes
over y_true/y_pred to compute tp, tn, fp, fn which is inefficient; replace those
four sum(...) expressions with a single loop that iterates once over zip(y_true,
y_pred, strict=True) and increments tp, tn, fp, fn accordingly (e.g., if yt==1
and yp==1: tp +=1, etc.). Update the body of compute_binary_metrics to
initialize the four counters before the loop and compute them in that single
pass, leaving the function's return/signature unchanged.
In `@backend/app/evaluation/lexical_slur/run.py`:
- Around line 42-46: The latency + memory summary block is duplicated across
runs; add a summary() method to the Profiler class that returns the same dict
shape (handle zero-length latencies by returning zeros), computing mean, p95,
and max with round(..., 2) and rounding peak_memory_mb to 2 decimals, and then
replace the inline block in lexical_slur/run.py (and other run.py files like
pii/run.py and gender_assumption_bias/run.py) with a call to p.summary() where p
is the Profiler instance.
| "performance": { | ||
| "latency_ms": { | ||
| "mean": round(sum(p.latencies) / len(p.latencies), 2), | ||
| "p95": round(sorted(p.latencies)[int(len(p.latencies) * 0.95)], 2), | ||
| "max": round(max(p.latencies), 2), | ||
| }, | ||
| "memory_mb": round(p.peak_memory_mb, 2), | ||
| }, |
Guard against empty p.latencies to avoid ZeroDivisionError / IndexError.
This performance block is newly added. If the input CSV is empty (or all rows are somehow filtered), p.latencies will be an empty list, and Line 44 (sum(...) / len(...)) will raise ZeroDivisionError, Line 45 will raise IndexError, and Line 46 will raise ValueError. The same issue exists in lexical_slur/run.py.
The cleanest fix is the Profiler.summary() helper suggested in my comment on lexical_slur/run.py, which centralizes the guard and eliminates duplication. If you'd rather keep things inline, a simple if p.latencies: guard suffices.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/evaluation/pii/run.py` around lines 42 - 49, The performance
block currently assumes p.latencies is non-empty which can raise
ZeroDivisionError/IndexError/ValueError; either implement and call a centralized
Profiler.summary() helper that returns safe values (e.g., mean, p95, max as
numbers or None) for an empty latencies list and use that result here, or wrap
the existing computation in an if p.latencies: guard and otherwise set "mean",
"p95", "max" to None (or 0 per project convention); apply the same change to
lexical_slur/run.py so both places reference the safe summary implementation or
the same guard logic.
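A self-contained sketch of the inline-guard option (the Profiler attribute names and the None fallback are assumptions; the project may prefer 0 per its own convention):

```python
from dataclasses import dataclass, field


@dataclass
class FakeProfiler:
    # Stand-in for the project's Profiler; attribute names are assumed.
    latencies: list[float] = field(default_factory=list)
    peak_memory_mb: float = 0.0


def performance_block(p: FakeProfiler) -> dict:
    if p.latencies:
        latency_stats = {
            "mean": round(sum(p.latencies) / len(p.latencies), 2),
            "p95": round(sorted(p.latencies)[int(len(p.latencies) * 0.95)], 2),
            "max": round(max(p.latencies), 2),
        }
    else:
        # Empty dataset: report None instead of raising ZeroDivisionError / IndexError.
        latency_stats = {"mean": None, "p95": None, "max": None}
    return {"latency_ms": latency_stats, "memory_mb": round(p.peak_memory_mb, 2)}


print(performance_block(FakeProfiler()))                        # empty case
print(performance_block(FakeProfiler([12.0, 15.5, 9.3], 41.2)))  # normal case
```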
Force-pushed from 0e1a079 to 660e2de.
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/README.md (1)
115-156: Triplicate output-structure documentation is a maintenance liability.

The same output paths appear in three different formats within ~40 lines:
- "Standardized output structure" (lines 115–127, indented tree)
- "Expected aggregate outputs" (lines 140–149, flat paths, inside the
run_all_evaluations.shblock)- Per-validator "Expected outputs" blocks (lines 151–156, 163–168, 175–180, box-drawing trees)
Every future output-path change must be applied in all three places. Consider keeping just the consolidated "Standardized output structure" at lines 115–127 and, in each per-validator section, replacing the redundant block with a single sentence referencing it (e.g., "Outputs are written to `app/evaluation/outputs/gender_assumption_bias/` — see Standardized output structure.").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/README.md` around lines 115 - 156, The README currently repeats the same output paths three times; remove the duplicate "Expected aggregate outputs" and each per-validator "Expected outputs" block and keep only the consolidated "Standardized output structure" section; in the per-validator sections for lexical_slur, pii_remover, and gender_assumption_bias replace the removed trees with a single sentence such as "Outputs are written to app/evaluation/outputs/<validator_name>/ — see the 'Standardized output structure' section" and optionally add an internal link to that section (e.g., "#running-evaluation-tests") so future path changes are maintained in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/README.md`:
- Line 100: Fix the missing space after the period following the Google Drive
link in the README sentence; update the README.md text fragment "[Google
Drive](https://drive.google.com/...)." so that it becomes "[Google
Drive](https://drive.google.com/...). This contains..." (i.e., add a single
space after the period) to separate the two sentences properly.
---
Nitpick comments:
In `@backend/README.md`:
- Around line 115-156: The README currently repeats the same output paths three
times; remove the duplicate "Expected aggregate outputs" and each per-validator
"Expected outputs" block and keep only the consolidated "Standardized output
structure" section; in the per-validator sections for lexical_slur, pii_remover,
and gender_assumption_bias replace the removed trees with a single sentence such
as "Outputs are written to app/evaluation/outputs/<validator_name>/ — see the
'Standardized output structure' section" and optionally add an internal link to
that section (e.g., "#running-evaluation-tests") so future path changes are
maintained in one place.
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/README.md (1)
142-154: Inconsistent per-validator output descriptions and a trailing hard-break space.

Line 142 has a trailing space after the backtick command (before the newline), which Markdown renders as a `<br>` — likely unintentional.

Additionally, lines 146–148 (PII) and 152–154 (gender bias) each carry specific `predictions.csv`/`metrics.json` descriptions, but lexical slur has none, creating an inconsistency. Either add an analogous block for lexical slur or drop the per-validator inline descriptions and rely solely on the generic summary at lines 111–113.

📝 Remove trailing space on line 142
-- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py` 
+- To evaluate Lexical Slur Validator, run the offline evaluation script: `python app/evaluation/lexical_slur/run.py`

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/README.md` around lines 142 - 154, Remove the trailing space after the backticked command `python app/evaluation/lexical_slur/run.py` and add a short per-validator output description for the lexical slur evaluator to match the others: state what `predictions.csv` and `metrics.json` contain for lexical slur (e.g., samples with labels/predictions in `predictions.csv` and evaluation metrics in `metrics.json`), or alternatively remove the per-validator descriptions for PII and Gender Assumption Bias so all validators rely on the generic summary; reference the command `python app/evaluation/lexical_slur/run.py`, and the filenames `predictions.csv` and `metrics.json` when making the edit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/README.md`:
- Around line 129-136: The fenced code block in backend/README.md breaks the
enclosing list item; fix by making the code block and the following paragraph
part of the same list item—either indent the fenced block and continuation lines
by four spaces (so the ```bash block and the "This script runs the evaluators in
sequence:" paragraph remain inside the bullet) or replace the list entry using
the proposed reflow: move the "To run all evaluation scripts together:" line
before the fenced block, wrap the block with ```bash ... ```, and then list the
three evaluator paths (`app/evaluation/lexical_slur/run.py`,
`app/evaluation/pii/run.py`, `app/evaluation/gender_assumption_bias/run.py`) as
sub-list items; ensure the script name scripts/run_all_evaluations.sh is
included exactly in the fenced block.
---
Duplicate comments:
In `@backend/README.md`:
- Line 100: Fix the missing space in backend/README.md by updating the sentence
that currently reads "[Google
Drive](https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89).This
contains" to include a space after the period so it becomes "...89). This
contains"; locate and edit the sentence in the README content describing where
to download CSV datasets and ensure proper spacing between the two sentences.
- Line 109: Update the README and evaluation runner so the ban list validator is
either included in the automated evaluation or explicitly documented as
deferred: add the ban-list evaluation invocation to run_all_evaluations.sh (the
script that currently calls the three evaluators) and update the README section
mentioning curated datasets to list the ban list evaluation script and expected
outputs (accuracy/latency), or alternatively add a clear note in README
explaining why the ban list evaluation is omitted and create a tracking issue;
reference the ban list validator name used in the repo and the
run_all_evaluations.sh entry to locate where to add the call and documentation.
---
Nitpick comments:
In `@backend/README.md`:
- Around line 142-154: Remove the trailing space after the backticked command
`python app/evaluation/lexical_slur/run.py` and add a short per-validator output
description for the lexical slur evaluator to match the others: state what
`predictions.csv` and `metrics.json` contain for lexical slur (e.g., samples
with labels/predictions in `predictions.csv` and evaluation metrics in
`metrics.json`), or alternatively remove the per-validator descriptions for PII
and Gender Assumption Bias so all validators rely on the generic summary;
reference the command `python app/evaluation/lexical_slur/run.py`, and the
filenames `predictions.csv` and `metrics.json` when making the edit.
- To run all evaluation scripts together, use:
```bash
bash scripts/run_all_evaluations.sh
```
This script runs the evaluators in sequence:
- `app/evaluation/lexical_slur/run.py`
- `app/evaluation/pii/run.py`
- `app/evaluation/gender_assumption_bias/run.py`
Fenced code block inside a list item must be indented to maintain list context.
On GitHub (CommonMark), an unindented ``` at column 0 terminates the enclosing list item. The "This script runs the evaluators in sequence:" paragraph and the sub-list (lines 133–136) are then rendered as a separate top-level block rather than as a continuation of the bullet.
📝 Proposed fix — indent the code block and continuation text
-
-- To run all evaluation scripts together, use:
-```bash
-bash scripts/run_all_evaluations.sh
-```
-This script runs the evaluators in sequence:
-- `app/evaluation/lexical_slur/run.py`
-- `app/evaluation/pii/run.py`
-- `app/evaluation/gender_assumption_bias/run.py`
+
+To run all evaluation scripts together:
+
+```bash
+bash scripts/run_all_evaluations.sh
+```
+
+This script runs the evaluators in sequence:
+- `app/evaluation/lexical_slur/run.py`
+- `app/evaluation/pii/run.py`
+- `app/evaluation/gender_assumption_bias/run.py`🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/README.md` around lines 129 - 136, The fenced code block in
backend/README.md breaks the enclosing list item; fix by making the code block
and the following paragraph part of the same list item—either indent the fenced
block and continuation lines by four spaces (so the ```bash block and the "This
script runs the evaluators in sequence:" paragraph remain inside the bullet) or
replace the list entry using the proposed reflow: move the "To run all
evaluation scripts together:" line before the fenced block, wrap the block with
```bash ... ```, and then list the three evaluator paths
(`app/evaluation/lexical_slur/run.py`, `app/evaluation/pii/run.py`,
`app/evaluation/gender_assumption_bias/run.py`) as sub-list items; ensure the
script name scripts/run_all_evaluations.sh is included exactly in the fenced
block.
Summary
Target issue is #58.
Explain the motivation for making this change. What existing problem does the pull request solve?
- `run_all_evaluations.sh` to run evaluations for all validators.

Checklist
Before submitting a pull request, please ensure that you mark these tasks.
- `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
New Features
Improvements
Documentation
Chores