Summary
There is a TODO in malariagen_data/anoph/dipclust.py indicating that gene_cnv() still needs to be migrated into the modular CNV “data-layer” mixin architecture used by the rest of malariagen_data/anoph/.
Specifically, gene_cnv() is currently implemented in malariagen_data/anoph/cnv_frq.py (as part of AnophelesCnvFrequencyAnalysis) but is conceptually a data access method (returns an xr.Dataset of modal copy number by gene). This creates architectural inconsistency and forces dipclust.py to call it in a way that bypasses the standard class-hierarchy pattern.
How to generate this issue
- Open malariagen_data/anoph/dipclust.py.
- Jump to the TODO block at ~lines 401–404:
  - It explicitly says gene_cnv() needs to be migrated to the AnophelesCnvData class so it can be found in the class hierarchy.
- Confirm where gene_cnv() lives today:
  - Search in malariagen_data/anoph/ and you’ll find gene_cnv() implemented in malariagen_data/anoph/cnv_frq.py.
  - Verify that malariagen_data/anoph/cnv_data.py (which defines AnophelesCnvData) does not contain gene_cnv().
This mismatch between dipclust.py (TODO + type ignore) and the actual method location is the root problem.
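The manual search above can also be scripted. The following is a minimal sketch, assuming a local checkout of the repository; find_method_definitions is a hypothetical helper name used for illustration and is not part of malariagen_data.

```python
import re
from pathlib import Path


def find_method_definitions(package_dir, method_name):
    """Return (file name, line number) pairs where `method_name` is defined.

    Hypothetical helper for illustration: it scans only the *.py files
    directly inside `package_dir` and matches lines of the form
    `def <method_name>(`.
    """
    pattern = re.compile(r"^\s*def\s+" + re.escape(method_name) + r"\s*\(")
    hits = []
    for path in sorted(Path(package_dir).glob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if pattern.match(line):
                hits.append((path.name, lineno))
    return hits
```

Calling find_method_definitions("malariagen_data/anoph", "gene_cnv") from the root of a checkout should, if the description above is accurate, report a hit in cnv_frq.py and none in cnv_data.py.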
Problem / Current state
dipclust.py contains:
- a TODO noting gene_cnv() must be migrated to AnophelesCnvData
- a self.gene_cnv(...) # type: ignore call inside _dipclust_cnv_bar_trace()
gene_cnv() is implemented in malariagen_data/anoph/cnv_frq.py (within AnophelesCnvFrequencyAnalysis), not in malariagen_data/anoph/cnv_data.py (AnophelesCnvData).
As a result:
- The API surface is inconsistent: a “gene copy number data access” method is tied to a “frequency analysis” class rather than the CNV data mixin.
- Some analysis modules/components rely on inheritance via specific analysis classes instead of the intended data-layer hierarchy.
- Type-checking / linting hints (the # type: ignore) show the hierarchy is not clean for method discovery.
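The situation can be illustrated with a stripped-down sketch. The class names below are toy stand-ins (the Toy prefix marks them as hypothetical), the method bodies return placeholder strings rather than real CNV data, and the sketch assumes the dipclust class declares only the data mixin as a base. Only the shape of the hierarchy is meant to mirror what is described above.

```python
class ToyCnvData:
    """Stand-in for the CNV data mixin; note that it has no gene_cnv()."""

    def cnv_hmm(self):
        return "cnv-hmm-dataset"


class ToyCnvFrequencyAnalysis(ToyCnvData):
    """Stand-in for the frequency analysis class that currently owns gene_cnv()."""

    def gene_cnv(self, region):
        return f"gene-cnv-dataset:{region}"


class ToyDipClustAnalysis(ToyCnvData):
    """Stand-in for the dipclust mixin, which only declares the data base class."""

    def cnv_bar_trace(self, region):
        # A static type checker cannot find gene_cnv() on this class or its
        # declared bases, hence the suppression comment.
        return self.gene_cnv(region)  # type: ignore


class ToyApi(ToyCnvFrequencyAnalysis, ToyDipClustAnalysis):
    """Concrete class combining the mixins; gene_cnv() resolves only here."""


# Works at runtime because the combined class has the frequency mixin in its MRO.
print(ToyApi().cnv_bar_trace("2R"))

# Fails at runtime when the frequency mixin is absent.
try:
    ToyDipClustAnalysis().cnv_bar_trace("2R")
except AttributeError as exc:
    print("AttributeError:", exc)
```

The call succeeds only because of how the final class happens to be composed, which is the kind of hidden coupling the TODO points at.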
Why this is important (Impact)
The current placement increases fragility and maintenance cost:
- Architectural consistency: Other CNV “data” methods live in cnv_data.py under AnophelesCnvData, but gene_cnv() does not, which breaks the intended pattern.
- Onboarding & extensibility: Adding or reusing gene-level CNV datasets in new modules becomes harder because developers must know that gene_cnv() lives in an analysis mixin rather than the data mixin.
- Inheritance correctness: If a future class inherits only AnophelesCnvData (and not AnophelesCnvFrequencyAnalysis), it will not naturally get gene_cnv(), even though it is a data access method.
- Maintainability: Moving logic into the correct layer reduces the chances of duplicated logic and future “workarounds” (type: ignore, direct calls, or special-case imports).
Proposed fix
- Move gene_cnv() (and its internal helper, e.g. _gene_cnv)
  - From malariagen_data/anoph/cnv_frq.py
  - Into malariagen_data/anoph/cnv_data.py
  - As methods on AnophelesCnvData.
- Update callers to use the standard hierarchy
  - malariagen_data/anoph/cnv_frq.py:
    - Adjust AnophelesCnvFrequencyAnalysis so it calls the moved gene_cnv() implementation (directly or via a shared helper).
  - malariagen_data/anoph/dipclust.py:
    - Remove the TODO and remove the # type: ignore if possible after the method is available on the proper class hierarchy.
- Keep public behavior stable
  - Preserve the existing gene_cnv() signature and returned dataset structure so downstream plots/analyses don’t break.
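As a rough picture of the target layout, here is a sketch using toy stand-ins. All class names, the toy_amplification_fraction method, and the dictionary returned in place of a dataset are hypothetical and do not reproduce the real implementation. The point is only that the method and its helper live on the data class, and the analysis classes pick them up by ordinary inheritance.

```python
class ToyCnvData:
    """Stand-in for the CNV data mixin after the move."""

    def _gene_cnv(self, region):
        # Placeholder for the internal helper that does the real work;
        # a plain dict stands in for the returned dataset.
        return {"region": region, "CN_mode": [2, 3, 2]}

    def gene_cnv(self, region):
        # Public data access method, now owned by the data layer.
        return self._gene_cnv(region)


class ToyCnvFrequencyAnalysis(ToyCnvData):
    """Analysis layer: consumes gene_cnv() instead of defining it."""

    def toy_amplification_fraction(self, region):
        modes = self.gene_cnv(region)["CN_mode"]
        # Fraction of placeholder values greater than 2 (toy threshold).
        return sum(1 for cn in modes if cn > 2) / len(modes)


class ToyDipClustAnalysis(ToyCnvData):
    """Dipclust layer: the call now resolves through its declared base class."""

    def cnv_bar_trace(self, region):
        return self.gene_cnv(region)  # no suppression comment needed
```

With this layout neither analysis class depends on the other, so either can be used or tested with only the data mixin present.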
Tests / Acceptance criteria
- Existing CNV/dipclust-related test suites pass.
- Add/adjust unit tests to verify:
  - AnophelesCnvData exposes gene_cnv() (or that instances in the expected hierarchy do).
  - AnophelesDipClustAnalysis no longer needs the # type: ignore for self.gene_cnv(...) (or at least no runtime break occurs).
  - Output schema of gene_cnv() is unchanged (coords/data_vars expected by downstream code).
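Two of these checks (where the method is defined, and whether its signature is preserved) can be expressed with the standard library alone. The sketch below is one possible shape for such tests. The helpers defining_class and same_signature, the toy classes, and the parameter list are all illustrative assumptions rather than code or signatures taken from the package. In a real test the toy classes would be replaced by the imported library classes and a recorded pre-migration signature.

```python
import inspect


def defining_class(cls, method_name):
    """Return the first class in cls's MRO that defines method_name, or None."""
    for base in cls.__mro__:
        if method_name in vars(base):
            return base
    return None


def same_signature(func_a, func_b):
    """Compare parameter names, kinds, defaults, and annotations of two callables."""
    return inspect.signature(func_a) == inspect.signature(func_b)


# Toy stand-ins so the sketch runs on its own; the parameter list is illustrative.
class ToyCnvData:
    def gene_cnv(self, region, sample_sets=None, max_coverage_variance=0.2):
        return None


class ToyDipClustAnalysis(ToyCnvData):
    pass


def legacy_gene_cnv(self, region, sample_sets=None, max_coverage_variance=0.2):
    """Stands in for the signature recorded before the migration."""
    return None


def test_gene_cnv_lives_on_data_class():
    assert defining_class(ToyDipClustAnalysis, "gene_cnv") is ToyCnvData


def test_gene_cnv_signature_unchanged():
    assert same_signature(ToyCnvData.gene_cnv, legacy_gene_cnv)
```

The output schema check is not covered here; it would compare the coords and data_vars of the returned dataset against the names that downstream code expects.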
Implementation approach (high level)
- Extract the current gene_cnv() / _gene_cnv code from cnv_frq.py.
- Paste into AnophelesCnvData in cnv_data.py with minimal changes.
- Update imports so cnv_data.py has access to whatever dependencies gene_cnv() currently uses (e.g., Region, _parse_multi_region, _cn_mode, genome feature access, etc.).
- Refactor AnophelesCnvFrequencyAnalysis to call self.gene_cnv(...) from the data mixin.
- Update dipclust.py to call gene_cnv() without the migration warning/type workaround.
Reference
- TODO location: malariagen_data/anoph/dipclust.py (~lines 401–404)
- Current implementation location: malariagen_data/anoph/cnv_frq.py
- Intended target location: malariagen_data/anoph/cnv_data.py (AnophelesCnvData)