Fix #1283: Warn and handle missing FORMAT fields in snp_calls_to_vcf() by Yashsingh045 · Pull Request #1284 · malariagen/malariagen-data-python

Yashsingh045 · 2026-04-10T14:20:59Z

Description

This PR fixes issue #1283 in the snp_calls_to_vcf() method, where requested FORMAT fields (GQ, AD, MQ) were silently dropped if the underlying data was missing or unreadable. It adds user-facing warnings and ensures VCF headers are dynamically updated to accurately reflect the available data.

Problem Statement

When a user explicitly requested optional FORMAT fields via the fields parameter, the exporter would silently fall back to missing values (.) if the data arrays were absent from the dataset or if computation failed for specific chunks.

Impact:

This led to silent failures: downstream tools would behave unexpectedly because of empty fields, and VCF files were generated with misleading headers declaring fields that contained no real data.

Solution / Changes

Explicit Warnings: Added warnings.warn() calls using the standard UserWarning category to notify users when a requested field is omitted due to missing data.
Header Filtering: Reorganized the initialization logic to extract data arrays before writing the VCF header. This allows the tool to filter the output_fields list so that only fields with valid data arrays are declared in the ##FORMAT header lines.
Fail-Safe Computation: Updated the chunk iteration loop to emit a warning if specific data blocks cannot be read (e.g., due to missing Zarr files on disk), while maintaining VCF structure by writing sentinel . values.
Imports: Added import warnings to malariagen_data/anoph/to_vcf.py.
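
The header-filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `filter_format_fields`, the `FORMAT_ARRAYS` mapping, and the use of a plain dict in place of the real dataset are all assumptions made for the example. Only the array names (call_GQ, call_AD, call_MQ) and the warning text come from the diff.

```python
import warnings

# Assumed mapping from FORMAT field to backing array name, based on the
# array names mentioned in this PR.
FORMAT_ARRAYS = {"GQ": "call_GQ", "AD": "call_AD", "MQ": "call_MQ"}


def filter_format_fields(fields, dataset):
    """Return (output_fields, arrays), warning about each unavailable field.

    `dataset` is any mapping that raises KeyError for a missing array;
    a plain dict stands in for the real dataset here.
    """
    output_fields = []
    arrays = {}
    for field in fields:
        array_name = FORMAT_ARRAYS.get(field)
        if array_name is None:
            # Fields without an optional backing array (e.g. GT) pass through.
            output_fields.append(field)
            continue
        try:
            arrays[field] = dataset[array_name]
        except KeyError:
            warnings.warn(
                f"Requested FORMAT field '{field}' not found in dataset. "
                f"{field} values will be omitted from the output VCF.",
                UserWarning,
                stacklevel=2,
            )
            continue
        output_fields.append(field)
    return output_fields, arrays


# Toy dataset with call_GQ present and call_AD absent.
toy_dataset = {"call_GQ": [[30, 40]]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept_fields, kept_arrays = filter_format_fields(("GT", "GQ", "AD"), toy_dataset)
```

Because the filtered list is produced before any header is written, the ##FORMAT lines and the FORMAT column can both be derived from `kept_fields`, so they cannot disagree with each other.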

Closes

Fixes #1283

Copilot

Pull request overview

This PR improves SnpVcfExporter.snp_calls_to_vcf() to avoid silently producing misleading VCF FORMAT headers/columns when optional FORMAT arrays (e.g., call_GQ, call_AD, call_MQ) are missing or unreadable, by warning users and dynamically filtering declared FORMAT fields.

Changes:

Load optional FORMAT arrays up front and drop unavailable requested FORMAT fields with warnings.warn().
Write VCF ##FORMAT headers based on the filtered output_fields rather than the originally requested fields.
Emit warnings when optional FORMAT chunks fail to compute, while still writing . sentinel values to preserve VCF structure.
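
As a rough sketch of the chunk-level fallback summarized in the last point, assuming a hypothetical loader callable and a broad `except Exception` (the exceptions actually caught in to_vcf.py are not shown on this page):

```python
import warnings


def compute_optional_chunk(load_chunk, field, start, stop, n_samples):
    """Return rows for variants [start, stop), or None-filled rows on failure.

    `load_chunk` is a hypothetical callable standing in for the real chunk
    computation; None marks values the writer would render as '.'.
    """
    try:
        return load_chunk(start, stop)
    except Exception as exc:  # assumption: the real code may be narrower
        warnings.warn(
            f"Could not read FORMAT field '{field}' for variants "
            f"{start}:{stop} ({exc!r}); writing '.' for this chunk.",
            UserWarning,
            stacklevel=2,
        )
        return [[None] * n_samples for _ in range(stop - start)]


def broken_loader(start, stop):
    # Simulates a chunk whose backing file is missing on disk.
    raise FileNotFoundError("chunk file missing")


with warnings.catch_warnings(record=True) as chunk_warnings:
    warnings.simplefilter("always")
    fallback_rows = compute_optional_chunk(broken_loader, "GQ", 0, 2, n_samples=2)
```

Returning a block with the expected shape keeps the per-variant loop unchanged, so the VCF row structure is preserved even when one chunk is unreadable.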


Copilot · 2026-04-10T14:24:59Z

malariagen_data/anoph/to_vcf.py

+        include_gq = "GQ" in output_fields
+        include_ad = "AD" in output_fields
+        include_mq = "MQ" in output_fields
+        format_str = ":".join(output_fields)


format_str is built from output_fields (user-supplied order), but the per-sample parts list is assembled in a fixed order (GT, then GQ, then AD, then MQ). If a caller provides fields in a different order (e.g. ("GT","MQ","GQ")), the FORMAT column will not match the sample value ordering, producing an invalid/misleading VCF. Consider either enforcing a canonical field order when building output_fields/format_str, or (preferably) constructing per-sample values by iterating over output_fields so the ordering always stays consistent.
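
One way to keep the two orderings consistent, sketched here under the assumption that per-sample values are available as a dict keyed by field name (the helper `format_sample` is hypothetical and not part of the PR):

```python
def format_sample(output_fields, values):
    """Join one sample's values in the same order as the FORMAT column."""
    parts = []
    for field in output_fields:
        value = values.get(field)
        if value is None:
            parts.append(".")  # VCF missing-value sentinel
        elif isinstance(value, (list, tuple)):
            # Multi-valued fields such as AD are comma-separated.
            parts.append(",".join(str(v) for v in value))
        else:
            parts.append(str(value))
    return ":".join(parts)


# The caller supplies a non-canonical order; both strings follow it.
output_fields = ["GT", "MQ", "GQ"]
format_str = ":".join(output_fields)
sample_str = format_sample(
    output_fields, {"GT": "0/1", "GQ": 35, "MQ": 60, "AD": (10, 4)}
)
```

Here `format_str` is `GT:MQ:GQ` and `sample_str` is `0/1:60:35`, so the sample values line up with the FORMAT column regardless of the order the caller chose. AD is ignored because it is not in `output_fields`.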

Copilot · 2026-04-10T14:24:59Z

malariagen_data/anoph/to_vcf.py

+        if "GQ" in fields:
+            try:
+                gq_data = ds["call_GQ"].data
+            except KeyError:
+                warnings.warn(
+                    "Requested FORMAT field 'GQ' not found in dataset. "
+                    "GQ values will be omitted from the output VCF.",
+                    UserWarning,
+                    stacklevel=2,
+                )
+                output_fields.remove("GQ")
+
+        if "AD" in fields:
+            try:
+                ad_data = ds["call_AD"].data
+            except KeyError:
+                warnings.warn(
+                    "Requested FORMAT field 'AD' not found in dataset. "
+                    "AD values will be omitted from the output VCF.",
+                    UserWarning,
+                    stacklevel=2,
+                )
+                output_fields.remove("AD")


New behavior (omitting requested FORMAT fields when the backing arrays are missing, and emitting warnings) isn’t covered by tests. Since there are already VCF exporter tests, it would be good to add cases asserting: (1) a requested missing field (e.g. GQ) triggers a warning, (2) the corresponding ##FORMAT header line is not written, and (3) the FORMAT column/sample values reflect the filtered field list.
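
The three assertions could look roughly like the sketch below. It runs against `toy_export`, a deliberately simplified, hypothetical stand-in that emits only header lines plus FORMAT and sample columns. A real test would instead call snp_calls_to_vcf() with the repository's existing fixtures, and could use `pytest.warns(UserWarning)` in place of `warnings.catch_warnings`.

```python
import warnings


def toy_export(dataset, fields):
    """Hypothetical stand-in for the exporter; returns simplified VCF text."""
    output_fields = list(fields)
    for field in ("GQ", "AD", "MQ"):
        if field in output_fields and f"call_{field}" not in dataset:
            warnings.warn(
                f"Requested FORMAT field '{field}' not found in dataset.",
                UserWarning,
                stacklevel=2,
            )
            output_fields.remove(field)
    # Simplified header lines: real ##FORMAT lines carry more attributes.
    lines = [f"##FORMAT=<ID={f}>" for f in output_fields]
    fmt = ":".join(output_fields)
    for i in range(len(dataset["call_GT"])):
        sample = ":".join(str(dataset[f"call_{f}"][i]) for f in output_fields)
        lines.append(f"{fmt}\t{sample}")
    return "\n".join(lines)


def test_missing_gq_warns_and_is_filtered():
    dataset = {"call_GT": ["0/1", "1/1"], "call_MQ": [60, 58]}  # no call_GQ
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        text = toy_export(dataset, ("GT", "GQ", "MQ"))

    # (1) The missing field triggers a UserWarning that names it.
    assert any(
        issubclass(w.category, UserWarning) and "GQ" in str(w.message)
        for w in caught
    )
    # (2) No ##FORMAT header line is written for the missing field.
    assert "##FORMAT=<ID=GQ" not in text
    # (3) FORMAT column and sample values reflect the filtered field list.
    data_lines = [ln for ln in text.splitlines() if not ln.startswith("##")]
    assert data_lines == ["GT:MQ\t0/1:60", "GT:MQ\t1/1:58"]


test_missing_gq_warns_and_is_filtered()
```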

Yashsingh045 · 2026-04-10T14:25:15Z

Hey @jonbrenas ,
Could you please review this PR when you have time?
Thanks

Yashsingh045 · 2026-04-14T06:51:28Z

Hey @jonbrenas ,
Just a gentle reminder,
could you please review this PR when you have time?
Thanks

jonbrenas

Also, Copilot has some comments.

jonbrenas · 2026-04-14T12:01:38Z

malariagen_data/anoph/to_vcf.py

+                )
+                output_fields.remove("MQ")
+
+        include_gq = "GQ" in output_fields


These should come before the if-statement.

Fix malariagen#1283: Warn when FORMAT fields are missing in VCF export

09d9463

Copilot AI review requested due to automatic review settings April 10, 2026 14:21

Copilot started reviewing on behalf of Yashsingh045 April 10, 2026 14:21

Yashsingh045 mentioned this pull request Apr 10, 2026

snp_calls_to_vcf() silently drops requested FORMAT fields when underlying data arrays don't exist #1283

Open

Copilot AI reviewed Apr 10, 2026

jonbrenas requested changes Apr 14, 2026

Yashsingh045 wants to merge 1 commit into malariagen:master from Yashsingh045:GH1283-warn-missing-vcf-fields
