Skip to content

Normalize taxon_by handling to match area_by/period_by in cohort grouping#997

Open
shauryam2807 wants to merge 10 commits intomalariagen:masterfrom
shauryam2807:GH808-normalize-taxon-by
Open

Normalize taxon_by handling to match area_by/period_by in cohort grouping#997
shauryam2807 wants to merge 10 commits intomalariagen:masterfrom
shauryam2807:GH808-normalize-taxon-by

Conversation

@shauryam2807
Copy link
Copy Markdown
Contributor

Fixes #808

Problem

_prep_samples_for_cohort_grouping() normalizes area_by"area" and period_by"period" by copying user-specified columns into standard internal columns. However, taxon_by is treated inconsistently — it keeps the original column name instead of being normalized to "taxon".

This forces _build_cohorts_from_sample_grouping() to accept taxon_by as a parameter and maintain two separate label-generation code paths (one for the default "taxon" column, one for custom columns).

Re: discussion in #694 (comment): #694 (comment)

Solution (Option A, per @jonbrenas)

Normalize taxon_by to a standard "taxon" column in _prep_samples_for_cohort_grouping(), consistent with area_by and period_by. This:

  • Removes the need for taxon_by in _build_cohorts_from_sample_grouping()
  • Simplifies label generation to a single code path
  • Makes all three *_by parameters consistent in behavior

Changes

  • malariagen_data/anoph/frq_base.py — Add taxon column normalization in _prep_samples_for_cohort_grouping(); remove taxon_by param from _build_cohorts_from_sample_grouping(); simplify label logic
  • malariagen_data/anoph/snp_frq.py — Update groupby and remove taxon_by from _build_cohorts_from_sample_grouping() calls
  • malariagen_data/anoph/cnv_frq.py — Same
  • malariagen_data/anoph/hap_frq.py — Same
  • tests/anoph/test_frq_base.py — Add tests for taxon normalization

Testing

  • Default taxon_by="taxon" works unchanged (backward compat) ✅
  • Custom taxon_by creates standard "taxon" column ✅
  • Original custom column is preserved ✅
  • Label generation works correctly ✅
  • ruff check passes ✅

@shauryam2807
Copy link
Copy Markdown
Contributor Author

Hello
Hi @jonbrenas,
When you get a chance, could you take a look at this PR?
Let me know if anything needs improvement—I’ll update it right away.

# Copy the specified area_by column to a new "area" column.
df_samples["area"] = df_samples[area_by]

# Copy the specified taxon_by column to a new "taxon" column,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that "taxon" is already a column. It is why it is treated differently.

@shauryam2807
Copy link
Copy Markdown
Contributor Author

Ah @jonbrenas , I see! Because df_samples already has a taxon column with real data, overwriting it destroys that data. To fix #808, should we normalize it into a new column name like "cohort_taxon", or should we ditch the normalization approach altogether since taxon can't be treated exactly like area and period?

@jonbrenas
Copy link
Copy Markdown
Collaborator

Thanks @shauryam2807, I think using a new column name is the best solution.

@shauryam2807
Copy link
Copy Markdown
Contributor Author

shauryam2807 commented Apr 10, 2026

Hello
Hi @jonbrenas,
Thanks for the suggestion! I've updated the implementation to use new column names (cohort_taxon, cohort_area, cohort_period) inside _prep_samples_for_cohort_grouping, ensuring the original user metadata columns are never overwritten.

Changes in this update:

  • _prep_samples_for_cohort_grouping: Now writes to cohort_taxon, cohort_area, and cohort_period instead of overwriting taxon, area, and period.
  • _build_cohorts_from_sample_grouping: Renames these back to taxon, area, and period after aggregation, keeping the downstream API (datasets, plots, etc.) fully unchanged.
  • Downstream Updates: Updated groupby() calls in snp_frq.py, hap_frq.py, and cnv_frq.py to use the new internal column names.
  • Tests: Updated test_frq_base.py to match the new column names and behavior.
  • Linting & Formatting:
    • Fixed duplicate constant definitions in af1.py that were causing the pre-commit linting step to fail.
    • Applied ruff format across all modified files to ensure CI passes.

On a separate note — as a GSoC 2026 applicant who has applied for the MalariaGEN project, I just wanted to reiterate how much I've been enjoying contributing to the codebase! While waiting for the final results, I plan to continue addressing issues and helping out wherever I can.

Please let me know if this looks good to merge or if there's anything else you'd like me to address here!

@shauryam2807 shauryam2807 requested a review from jonbrenas April 10, 2026 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reassess use of "private" columns in prep_samples_for_cohort_grouping and build_cohorts_from_sample_grouping

2 participants