Fix duplicates by christinehc · Pull Request #131 · PNNL-CompBio/srpAnalytics

christinehc · 2026-02-19T23:18:37Z

Note: IN PROGRESS

Ideally, I would like to update the build manifest to use figshare files exclusively, but not all GitHub data has been uploaded to figshare yet. However, if we want to merge these changes and open a separate branch to update the manifest, that's fine too.

Summary

Fixes duplication of Sample_ID and updates the pipeline to fully produce zebrafish benchmark dose response curve files using the new data files. Includes several quality-of-life improvements to the code.

Changelog

- Fixes an error that created too many unique Sample_IDs from non-unique (duplicate) samples
- Loads (most) data from figshare
- Checks schemas for output data, including zebrafish. Zebrafish schemas are updated accordingly, using the is_a construct as a template for the Dose, Fits, and BMD types.
- Tests the BMDRC pipeline on actual data. The BMDRC package has been updated accordingly to stay consistent with the pipeline.
- Updates ontology mappings
- Adds link to chemical structure figure for portal
- Fixes figshare uploading so that files are fully uploaded and downloadable
- Adds DataLoader class to handle data loading, from figshare or otherwise
- Adds DataManifest class to handle interaction with the build manifest (data/srp_build_files.csv), including handling file versions
- Changes sigGeneStats to new longer format (NOTE: sigGeneStats does not appear to be used in the pipeline currently; this may need to be fixed)
- Minor changes: improved formatting, refactored code for readability, added full documentation, added secrets, updated schema
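
For reviewers unfamiliar with the idea of version handling in a build manifest, here is a minimal sketch of picking the newest entry per file. It is not the `DataManifest` implementation from this PR. The column names `name`, `version`, and `url` are assumptions made for illustration; the real layout of data/srp_build_files.csv may differ.

```python
import csv
import io
from typing import Dict


def latest_files(manifest_csv: str) -> Dict[str, Dict[str, str]]:
    """Return the highest-version row for each file name in a manifest.

    Assumes hypothetical columns `name`, `version` (integer), and `url`.
    """
    latest: Dict[str, Dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(manifest_csv)):
        name = row["name"]
        # Keep the row only if it is the first or a newer version.
        if name not in latest or int(row["version"]) > int(latest[name]["version"]):
            latest[name] = row
    return latest


# Illustrative manifest text with made-up entries.
example = (
    "name,version,url\n"
    "chemicals,1,https://example.org/chemicals_v1.csv\n"
    "chemicals,2,https://example.org/chemicals_v2.csv\n"
    "samples,1,https://example.org/samples_v1.csv\n"
)
for name, row in latest_files(example).items():
    print(name, row["version"], row["url"])
```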

Issues

- Fixes #107: Validate ontology links
- Fixes #102: Handle problematic (null/duplicate) values
- Fixes #109: Add chemical structure PNG links
- Fixes #111: Evaluate and integrate bmdrc pipeline with real data
- Addresses #106: Add endpoint metadata mapping to pipeline
- Addresses #110: Update zf data schemas
- Addresses #105: Resolve figshare upload error
- Fixes #115: Incorporate zebrafish sample data into pipeline
- Fixes #103: Update pipeline with new data files
- Addresses #121: Fix LinkML schema to handle NULL values
- Fixes #119: Add columns to samplesToChemicals
- Fixes #118: Pipeline output changes
- Fixes #130: Add Response column to Fits data
- Fixes #120: Necessary Changes to sigGeneStats

changelog: - Previously, samples with the same metadata parameters were each assigned their own unique sample ID because duplicate samples were grouped improperly. Sample ID assignment has been corrected to require uniqueness.
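
The intended behavior can be illustrated with a small sketch: rows that share the same identifying metadata should map to one ID rather than receiving a new ID each. This is not the code from the PR. The field names in `KEY_FIELDS` and the starting ID are hypothetical, chosen only to make the example runnable.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical metadata fields that together identify one sample.
KEY_FIELDS = ("SampleName", "LocationName", "date_sampled")


def assign_sample_ids(
    rows: Iterable[Dict[str, str]], start: int = 1000
) -> List[Dict[str, object]]:
    """Give rows with identical key metadata the same Sample_ID."""
    ids: Dict[Tuple[str, ...], int] = {}
    out: List[Dict[str, object]] = []
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        # A new ID is issued only the first time a key is seen.
        if key not in ids:
            ids[key] = start + len(ids)
        out.append({**row, "Sample_ID": ids[key]})
    return out


rows = [
    {"SampleName": "A", "LocationName": "Site1", "date_sampled": "2020-01-01"},
    {"SampleName": "A", "LocationName": "Site1", "date_sampled": "2020-01-01"},
    {"SampleName": "B", "LocationName": "Site1", "date_sampled": "2020-01-01"},
]
print([r["Sample_ID"] for r in assign_sample_ids(rows)])
```

Keying the lookup on the full tuple of metadata fields is what prevents duplicates from inflating the number of IDs.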

changelog: - add bmdrc code to main build pipeline via `fitCurveFiles` function. validated to work locally

changelog:
- style: reformat and lint code using ruff
- refactor: change output filename format
- refactor: file cleanup, e.g. move large list/dict params to separate params file to make code easier to follow
- style: add some documentation
- feat: add new CLI args for specifying filename format

changelog:
- Manifest handler object for interfacing with and downloading files from the manifest now exists.
- Schema parser to pull columns and slots from schema classes.

addresses #103 changelog: - zebrafish data now builds with: (1) pre-analyzed chemical LPR data; (2) raw chemical morphology data, and (3) pre-processed sample data.

changelog:
- fixed issue with sample IDs not correctly populating (i.e. 1004 used instead of 1004-1)
- enable FSES files from figshare to also be handled and loaded correctly

changelog:
- scrapped the type union attempts and replaced with `linkml:Any` ranges. Addresses #121
- included the new `test_template` and `test_method` attributes for the samplesToChemicals class. Fixes #119

changelog:
- feat: `src.data.load_figshare_url` now directly loads data from a Figshare URL, accessing the `FigshareDataLoader` object, for convenience. Utility function `src.figshare_url_to_id` extracts figshare IDs from URLs as a helper function.
- feat: Now requires dotenv to load environmental variables (for storing secrets/access tokens locally)
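
As a rough illustration of what extracting an ID from a figshare URL might involve, here is a standalone sketch. It is not the `figshare_url_to_id` function from this PR. It assumes the ID is the first all-digit path segment of the URL (so a trailing version number is ignored); the function name and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def figshare_url_to_id_sketch(url: str) -> int:
    """Return the first all-digit path segment of a URL as an integer.

    Assumption: the figshare ID is the first numeric path segment, and
    any later numeric segment is a version number.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    for part in parts:
        if part.isascii() and part.isdigit():
            return int(part)
    raise ValueError(f"No numeric figshare ID found in URL: {url!r}")


# Made-up URLs used only to exercise the function.
print(figshare_url_to_id_sketch("https://figshare.com/ndownloader/files/12345678"))
print(figshare_url_to_id_sketch("https://figshare.com/articles/dataset/example/7654321/2"))
```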

changelog:
- refactor: use `src.data.figshare_url_to_id` utility function to load data from figshare. Flexibly modify file retrieval to pull fully from the manifest and load from figshare (this aligns the code with planned future development to pull files only from figshare)
- refactor: minor changes, such as renaming variables (`build_script.runSampMap`) to be more human-readable
- refactor: use dotenv to load variables (for local execution of pipeline)
- fix: enforce "NULL" value for any `NaN` values in tables. fixes #118
- refactor: use token/variable instead of exposing comptox API key
- docs: added missing docstring
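
The "NULL" enforcement can be pictured with a small standard-library sketch. This is not the code from the PR, which may well operate on data frames instead. Here a table is assumed to be a list of dictionaries, and `enforce_null` and `is_missing` are hypothetical names.

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def enforce_null(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Replace every missing cell with the literal string "NULL"."""
    return [
        {key: ("NULL" if is_missing(value) else value) for key, value in row.items()}
        for row in rows
    ]


table = [{"chemical_id": 1, "BMD10": float("nan"), "note": None, "model": "logistic"}]
print(enforce_null(table))
```

Writing an explicit "NULL" string keeps missing values visible in the output tables instead of leaving empty cells.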

changelog:
- feat: add schema checks and update related functions to correctly parse class name from file name types with and without underscores.
- refactor: rename `res` variable -> `result` for clearer code

sgosline · 2026-03-04T19:08:20Z

I converted PR to draft status, just to keep track :)

christinehc added 30 commits November 26, 2025 11:40

feat: include bmd files in workflow

6da2a34

build: use ubuntu-latest for actions

c3856e2

build: fix container name

a9ed382

build: fix container name in docker pull

9538052

build: include bmd in artifacts upload. force continue upload

36255b1

build: create new figshare articles for each upload

ec654f8

build: use correct category IDs for metadata

52c51a8

build: use secret for project ID. specify dataset.

04a07e4

build: remove tags from metadata

067e460

build: fix variable name

1d5910d

chore: update ontology mappings. fixes #107

e33f5f0

fix: include link to structure image png. fixes #109

7119ffe

fix: generate image links only for valid IDs. addresses #109

b9155d4

feat: adapt bmdrc code for zf data. fixes #111

b8e7e68

chore: rename zfBMDS -> zfBMDs to align with bmdrc output

c929b11

chore: add new image_link col to schema

2c4d044

docs: update CLI flag for output directory

56c4f3a

feat: test github actions for figshare data download

b879bd8

build: trigger workflow on push

94140a3

feat: create manifest handler and schema parser. addresses #106

f3967fb

feat: create figshare data downloader. addresses #106

f6c5a47

fix: update schema for zebrafish data. addresses #110

e97cd28

chore,format,refactor: clean up unused code and update docstrings

4013769

refactor: simplify mappings scripts and remove reliance on params

4a4cf3d

feat,fix,build: update main build (new files/zf pipeline). addresses #105, #115

84ef109

build: edit dose classes to reflect sample vs. chem classes

139805f

build: update manifest with new files (+ figshare)

9ad5097

chore: add clarifying comments

60c4c19

christinehc added 24 commits January 19, 2026 23:50

feat: build database with zebrafish files. fixes #10, #115

c2974b3

chore: correct path to schema file

e5a00eb

refactor: implement minor changes for readability

4bf7569

build: remove hardcode and use secrets

ec35a86

fix: repair sample build script. enable new files. fixes #103

c766086

build: update docker command

03da584

build: create figshare article only after successful build

df1b86b

build: include manifest in docker build

47a2d65

build: fix entrypoints. create tmp figshare dirs

c3867c0

build: specify token for figshare API

6c99ede

build: create tmp figshare dir. copy src code

768d4c3

build: test figshare auth using github secrets

fe2e3a1

fix: temporarily prevent linkml errors by using Any type. addresses #121

f2d340e

build: include new samplesToChemicals columns. fixes #119

753722d

build: include new description cols in chemicals

7022345

chore: pipe temporary output to tmp/, not /tmp

8027037

feat: include new data build files. fixes #111, closes #119, closes #115

542882b

fix: add data post-bmdrc fixes. addresses #130

2bcdfb4

fix: check schema flexibly for zf output files

9120d73

chore: add response col to fits. addresses #130

65e499b

feat,refactor: add response column and schema checks. fixes #130

265114b

fix: update sigGeneStats format. fixes #120

68474c4

sgosline marked this pull request as draft March 4, 2026 19:08

christinehc wants to merge 54 commits into main from fix_duplicates

2 participants
