Fix metadata and refactor evaluation pipeline #60

Open
ivyONS wants to merge 9 commits into main from sa-619-metadata-fix

@ivyONS ivyONS commented Mar 17, 2026

✨ Summary

Make sure the metadata is used throughout the evaluation pipeline.

📜 Changes Introduced

Update argument parsing and how the parsed arguments are propagated into the metadata.
Make default values more transparent and capture them in the metadata.
Some parameters (such as model_location) were not being used; this is now fixed.
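
For reviewers, here is a minimal sketch of the idea behind the first two points, not the code in this PR. The short flags mirror the STAGE 1 invocation in run script; the long option names, the default values, and the `stage_args` / `args_at_default` metadata keys are assumptions made up for illustration.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with explicit defaults (long names and defaults are hypothetical)."""
    parser = argparse.ArgumentParser(description="Hypothetical stage script")
    parser.add_argument("-r", "--restart", action="store_true", default=False)
    parser.add_argument("-n", "--name", default="STG1")
    parser.add_argument("-b", "--batch-size", type=int, default=100)
    parser.add_argument("-i", "--input-file", required=True)
    parser.add_argument("-m", "--metadata-json", required=True)
    parser.add_argument("-o", "--output-folder", required=True)
    return parser


def args_to_metadata(parser, args, base_metadata):
    """Return a copy of the metadata with the parsed arguments recorded in it."""
    values = vars(args)
    merged = dict(base_metadata)
    # Record every argument, including the ones the user never typed.
    merged["stage_args"] = dict(values)
    # List arguments whose value equals the parser default. This cannot tell
    # "not supplied" apart from "supplied with the default value".
    merged["args_at_default"] = sorted(
        key for key, value in values.items() if value == parser.get_default(key)
    )
    return merged


if __name__ == "__main__":
    demo_parser = build_parser()
    demo_args = demo_parser.parse_args(
        ["-b", "50", "-i", "in.csv", "-m", "meta.json", "-o", "out"]
    )
    print(json.dumps(args_to_metadata(demo_parser, demo_args, {}), indent=2))
```

Writing the merged dictionary back out with the stage output means a later stage, or a later reader, can see which values were in effect without re-deriving the defaults from the source.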

To gain control over the embedding parameters and to enable the classifai test, refactor the STAGE1 script to use a class instance instead of going through the API.

The instructions on how to run the pipeline have been updated accordingly.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

Run the pipeline (as per the README instructions) on a small test dataset.
Please test checkpoints by interrupting and restarting.
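
To make it concrete what the interrupt-and-restart test should exercise, here is a rough sketch of batch-level checkpointing. It is an illustration under assumptions, not the pipeline's implementation: the function name, the JSON checkpoint layout, and the reading of the restart flag as "resume from the last saved batch" are all hypothetical.

```python
import json
from pathlib import Path


def process_in_batches(rows, batch_size, checkpoint_path, restart=False, process=str.upper):
    """Process rows batch by batch, saving progress after every completed batch."""
    checkpoint_path = Path(checkpoint_path)
    state = {"next_index": 0, "results": []}
    # Assumed meaning of the restart flag: pick up from the saved checkpoint if one exists.
    if restart and checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())

    for start in range(state["next_index"], len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Finish the whole batch before touching the state, so an interrupt
        # mid-batch leaves the state exactly as it was at the last checkpoint.
        batch_results = [process(row) for row in batch]
        state["results"].extend(batch_results)
        state["next_index"] = start + len(batch)
        # Write to a temporary file and rename it, so an interrupt during the
        # write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path.with_name(checkpoint_path.name + ".tmp")
        tmp_path.write_text(json.dumps(state))
        tmp_path.replace(checkpoint_path)

    return state["results"]
```

With a scheme like this, interrupting during a batch and rerunning with the restart flag should redo only the unfinished batch, and the final output should match an uninterrupted run.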

@ivyONS ivyONS marked this pull request as draft March 17, 2026 21:08
@ivyONS ivyONS marked this pull request as ready for review March 18, 2026 14:43

@peter-spencer-ons peter-spencer-ons left a comment

Left a couple of comments


echo "RUNNING: STAGE 1"
- "$SCRIPT_DIR"/stage_1_add_semantic_search.py -n "STG1" -b "$batch_size" "$input_csv" "$input_metadata_json" "$output_folder"
+ "$SCRIPT_DIR"/stage_1_add_semantic_search.py -r -n "STG1" -b "$batch_size" -i "$input_file" -m "$input_metadata_json" -o "$output_folder"

@peter-spencer-ons peter-spencer-ons Mar 20, 2026

Do we want the restart flag (-r) to be used by default? If so, why is it not also used in stage 4 and the later stages?

@peter-spencer-ons peter-spencer-ons left a comment

Left a comment within the code.

2 participants