Normalize taxon_by handling to match area_by/period_by in cohort grouping#997
Normalize taxon_by handling to match area_by/period_by in cohort grouping#997shauryam2807 wants to merge 10 commits intomalariagen:masterfrom
Conversation
|
Hello |
malariagen_data/anoph/frq_base.py
Outdated
| # Copy the specified area_by column to a new "area" column. | ||
| df_samples["area"] = df_samples[area_by] | ||
|
|
||
| # Copy the specified taxon_by column to a new "taxon" column, |
There was a problem hiding this comment.
The issue is that "taxon" is already a column. It is why it is treated differently.
|
Ah @jonbrenas , I see! Because df_samples already has a taxon column with real data, overwriting it destroys that data. To fix #808, should we normalize it into a new column name like "cohort_taxon", or should we ditch the normalization approach altogether since taxon can't be treated exactly like area and period? |
|
Thanks @shauryam2807, I think using a new column name is the best solution. |
|
Hello Changes in this update:
On a separate note — as a GSoC 2026 applicant who has applied for the MalariaGEN project, I just wanted to reiterate how much I've been enjoying contributing to the codebase! While waiting for the final results, I plan to continue addressing issues and helping out wherever I can. Please let me know if this looks good to merge or if there's anything else you'd like me to address here! |
Fixes #808
Problem
_prep_samples_for_cohort_grouping()normalizesarea_by→"area"andperiod_by→"period"by copying user-specified columns into standard internal columns. However,taxon_byis treated inconsistently — it keeps the original column name instead of being normalized to"taxon".This forces
_build_cohorts_from_sample_grouping()to accepttaxon_byas a parameter and maintain two separate label-generation code paths (one for the default"taxon"column, one for custom columns).Re: discussion in #694 (comment): #694 (comment)
Solution (Option A, per @jonbrenas)
Normalize
taxon_byto a standard"taxon"column in_prep_samples_for_cohort_grouping(), consistent witharea_byandperiod_by. This:taxon_byin_build_cohorts_from_sample_grouping()*_byparameters consistent in behaviorChanges
malariagen_data/anoph/frq_base.py— Add taxon column normalization in_prep_samples_for_cohort_grouping(); removetaxon_byparam from_build_cohorts_from_sample_grouping(); simplify label logicmalariagen_data/anoph/snp_frq.py— Update groupby and removetaxon_byfrom_build_cohorts_from_sample_grouping()callsmalariagen_data/anoph/cnv_frq.py— Samemalariagen_data/anoph/hap_frq.py— Sametests/anoph/test_frq_base.py— Add tests for taxon normalizationTesting
taxon_by="taxon"works unchanged (backward compat) ✅taxon_bycreates standard"taxon"column ✅ruff checkpasses ✅