Context
There is no automated test that exercises the full pipeline from DwC-A to trained model. The individual CLI commands have some test coverage, but integration issues (column mismatches, missing files between steps, incorrect shard patterns) are only caught by running the pipeline manually.
Proposed Changes
Add an end-to-end test that runs the full species classifier pipeline with a tiny dataset:
- Use a small DwC-A fixture (or a subset of an existing one) with ~10-20 images across 3-5 species
- Run all pipeline steps:
fetch-images -> verify-images -> clean-dataset -> build_species_list.py -> split-dataset -> create-webdataset -> train-model
- Train for only 1-3 epochs to keep runtime short
- Use the small-dataset config values:
MIN_INSTANCES=0, --val-frac 0.3, --test-frac 0.2 (already documented as commented-out alternatives in scripts/train_species_classifier.sh)
- Assert that key outputs exist and are valid:
category_map.json has expected species, split CSVs are non-empty, webdataset tar files are created, model checkpoint is saved
- Add GitHub workflows for running the full e2e test locally and in the Docker SLURM environment
This could run in CI (CPU-only; training will be slow but feasible for 1-3 epochs on a tiny dataset) or as a local smoke test.
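The output assertions listed above could be sketched roughly as follows. This is only an illustration, not the project's actual test code: the helper name `find_output_problems`, the split file names (`train.csv`, `val.csv`, `test.csv`), the `*.tar` shard pattern, the checkpoint extensions, and the assumption that `category_map.json` is a JSON object containing the species names as keys or values are all hypothetical and would need to be matched to what the pipeline really writes.

```python
import csv
import json
import tarfile
from pathlib import Path

# Hypothetical checkpoint extensions; adjust to whatever train-model saves.
CHECKPOINT_SUFFIXES = (".pt", ".pth", ".ckpt")


def find_output_problems(out_dir, expected_species,
                         split_names=("train", "val", "test")):
    """Return a list of problems found in out_dir; empty means all checks passed."""
    out_dir = Path(out_dir)
    problems = []

    # Assumption: category_map.json is a JSON object whose keys or values
    # include the species names. The real schema may differ.
    map_path = out_dir / "category_map.json"
    if not map_path.is_file():
        problems.append("missing category_map.json")
    else:
        category_map = json.loads(map_path.read_text(encoding="utf-8"))
        if not isinstance(category_map, dict):
            problems.append("category_map.json is not a JSON object")
        else:
            names = {str(k) for k in category_map}
            names |= {str(v) for v in category_map.values()}
            missing = sorted(set(expected_species) - names)
            if missing:
                problems.append(f"category_map.json lacks species: {missing}")

    # Assumption: one CSV per split, named <split>.csv, with a header row.
    for split in split_names:
        csv_path = out_dir / f"{split}.csv"
        if not csv_path.is_file():
            problems.append(f"missing {csv_path.name}")
            continue
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        if len(rows) < 2:  # header only, or completely empty
            problems.append(f"{csv_path.name} has no data rows")

    # Assumption: webdataset shards are *.tar files somewhere under out_dir.
    shards = [p for p in out_dir.rglob("*.tar") if tarfile.is_tarfile(str(p))]
    if not shards:
        problems.append("no readable webdataset tar shards found")

    # A checkpoint counts only if it exists and is not zero bytes.
    checkpoints = [
        p for p in out_dir.rglob("*")
        if p.suffix in CHECKPOINT_SUFFIXES and p.is_file() and p.stat().st_size > 0
    ]
    if not checkpoints:
        problems.append("no model checkpoint found")

    return problems
```

In a pytest-based e2e test this could be called once after the last pipeline step, for example `assert find_output_problems(tmp_path, expected) == []`, so that a failure lists every missing or empty output at once instead of stopping at the first one.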
Related
scripts/train_species_classifier.sh contains the pipeline steps and small-dataset config