Fix duplicates #131

Draft
christinehc wants to merge 54 commits into main from fix_duplicates

Collaborator

@christinehc christinehc commented Feb 19, 2026

Note: IN PROGRESS

Ideally, I would like to update the build manifest to use figshare files exclusively, but not all GitHub data has been uploaded to figshare yet. If we would rather merge these changes now and open a separate branch to update the manifest, that's fine too.

Summary

Fixes duplication of Sample_ID values and updates the pipeline to fully produce zebrafish benchmark dose curve response files using the new data files. Includes several quality-of-life improvements to the code.

Changelog

  • Fixes an error that assigned distinct Sample_IDs to non-unique samples (i.e. duplicates)
  • Loads (most) data from figshare
  • Checks schemas for output data, including zebrafish. Updated zebrafish schemas accordingly, using the is_a construct as a template for Dose, Fits, and BMD types.
  • Tests the BMDRC pipeline on actual data. Note that the BMDRC package has been updated accordingly to stay consistent with the pipeline.
  • Updates ontology mappings
  • Adds link to chemical structure figure for portal
  • Fixes figshare uploading so that files are fully uploaded and downloadable
  • Adds DataLoader class to handle data loading, from figshare or otherwise
  • Adds DataManifest class to handle interaction with build manifest (data/srp_build_files.csv), including handling file versions
  • Changes sigGeneStats to the new, longer format (NOTE: sigGeneStats does not appear to be used in the pipeline currently; this may need a fix?)
  • Minor changes: improved formatting, refactored code for readability, added full documentation, added secrets, updated schema

Issues

changelog:
- Previously, samples with the same metadata parameters were each assigned their own unique sample ID because duplicate samples were grouped improperly. The code now groups duplicates correctly, so sample IDs are assigned only to unique samples.
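
A minimal sketch of the grouping idea described above, not the code in this PR: rows that share the same metadata values receive the same `Sample_ID`. The function name `assign_sample_ids`, the metadata columns (`project`, `chemical`, `concentration`), and the integer ID scheme are all hypothetical and only serve the illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def assign_sample_ids(
    rows: Iterable[Dict[str, Hashable]],
    key_columns: List[str],
    start: int = 1,
) -> List[Dict[str, Hashable]]:
    """Give rows with identical key metadata the same Sample_ID (hypothetical helper)."""
    seen: Dict[Tuple[Hashable, ...], int] = {}
    out: List[Dict[str, Hashable]] = []
    next_id = start
    for row in rows:
        # The tuple of metadata values identifies a unique sample.
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen[key] = next_id
            next_id += 1
        # Duplicates reuse the ID that was issued the first time the key appeared.
        out.append({**row, "Sample_ID": seen[key]})
    return out


# Illustrative rows only; these are not values from the real data files.
samples = [
    {"project": "A", "chemical": "BaP", "concentration": 1.0},
    {"project": "A", "chemical": "BaP", "concentration": 1.0},  # duplicate of the first row
    {"project": "A", "chemical": "BaP", "concentration": 5.0},
]
with_ids = assign_sample_ids(samples, ["project", "chemical", "concentration"])
```

Keying on the full metadata tuple means the number of distinct IDs equals the number of distinct samples, which is the property the fix is meant to restore.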
changelog:
- add bmdrc code to main build pipeline via `fitCurveFiles` function. validated to work locally
changelog:
- style: reformat and lint code using ruff
- refactor: change output filename format
- refactor: file cleanup, e.g. move large list/dict params to a separate params file to make the code easier to follow.
- style: add some documentation
- feat: add new CLI args for specifying filename format
changelog:
- Manifest handler object for interfacing with and downloading files from the manifest now exists.
- Schema parser to pull columns and slots from schema classes.
addresses #103

changelog:
- zebrafish data now builds with: (1) pre-analyzed chemical LPR data; (2) raw chemical morphology data, and (3) pre-processed sample data.
changelog:
- fixed issue with sample IDs not correctly populating (i.e. 1004 used instead of 1004-1)
- enable FSES files from figshare to also be handled and loaded correctly
changelog:
- scrapped the type union attempts and replaced with `linkml:Any` ranges. Addresses #121
- included the new `test_template` and `test_method` attributes for the samplesToChemicals class. Fixes #119
changelog:
- feat: `src.data.load_figshare_url` now directly loads data from a Figshare URL, accessing the `FigshareDataLoader` object, for convenience. Utility function `src.figshare_url_to_id` extracts figshare IDs from URLs as a helper function.
- feat: Now requires dotenv to load environmental variables (for storing secrets/access tokens locally)
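
For illustration, a URL-to-ID helper of the kind mentioned above could look roughly like the sketch below. It is a stand-in, not the implementation in this PR: the name `figshare_url_to_id_sketch` is hypothetical, the example URL uses a placeholder ID, and it assumes the ID is the last purely numeric path segment (versioned URLs that end in a version number would need extra handling).

```python
from urllib.parse import urlparse


def figshare_url_to_id_sketch(url: str) -> int:
    """Return the last all-digit path segment of a URL as an integer ID.

    Assumption: the figshare ID is the final numeric segment of the path.
    """
    path = urlparse(url).path
    numeric_segments = [seg for seg in path.split("/") if seg.isdecimal()]
    if not numeric_segments:
        raise ValueError(f"No numeric ID found in URL: {url!r}")
    return int(numeric_segments[-1])


# Placeholder URL; the ID below is made up for the example.
file_id = figshare_url_to_id_sketch("https://figshare.com/ndownloader/files/12345678")
```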
changelog:
- refactor: use `src.data.figshare_url_to_id` utility function to load data from figshare. File retrieval now pulls fully from the manifest and loads from figshare (this aligns the code with planned future development to pull files exclusively from figshare)
- refactor: minor changes, such as renaming variables (`build_script.runSampMap`) to be more human-readable
- refactor: use dotenv to load variables (for local execution of pipeline)
- fix: enforce "NULL" value for any `NaN` values in tables. fixes #118
- refactor: use token/variable instead of exposing comptox API key
- docs: added missing docstring
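
The `NaN` to "NULL" enforcement above can be sketched with the standard library alone. This is an assumption-laden illustration rather than the pipeline's code: the helper name `nan_to_null` and the example rows are hypothetical, and tables are represented as lists of dicts instead of data frames (with pandas, `DataFrame.fillna("NULL")` would be the usual equivalent).

```python
import math
from typing import Any, Dict, List


def nan_to_null(rows: List[Dict[str, Any]], null_token: str = "NULL") -> List[Dict[str, Any]]:
    """Return a copy of the rows with every float NaN replaced by null_token."""

    def clean(value: Any) -> Any:
        # Only float NaN is replaced; None and empty strings are left untouched.
        if isinstance(value, float) and math.isnan(value):
            return null_token
        return value

    return [{key: clean(value) for key, value in row.items()} for row in rows]


# Illustrative rows only.
table = [
    {"Sample_ID": "1004-1", "concentration": 1.5},
    {"Sample_ID": "1004-2", "concentration": float("nan")},
]
cleaned = nan_to_null(table)
```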
changelog:
- feat: add schema checks and update related functions to correctly parse class name from file name types with and without underscores.
- refactor: rename `res` variable -> `result` for clearer code
@sgosline sgosline marked this pull request as draft March 4, 2026 19:08
Member

sgosline commented Mar 4, 2026

I converted PR to draft status, just to keep track :)