Skip to content

Fix #1283: Warn and handle missing FORMAT fields in snp_calls_to_vcf()#1284

Open
Yashsingh045 wants to merge 1 commit intomalariagen:masterfrom
Yashsingh045:GH1283-warn-missing-vcf-fields
Open

Fix #1283: Warn and handle missing FORMAT fields in snp_calls_to_vcf()#1284
Yashsingh045 wants to merge 1 commit intomalariagen:masterfrom
Yashsingh045:GH1283-warn-missing-vcf-fields

Conversation

@Yashsingh045
Copy link
Copy Markdown
Contributor

Description

This PR fixes an issue #1283 ,
In the snp_calls_to_vcf() method where requested FORMAT fields (GQ, AD, MQ) were silently dropped if the underlying data was missing or unreadable. It adds user-facing warnings and ensures VCF headers are dynamically updated to accurately reflect available data.

Problem Statement

When a user explicitly requested optional FORMAT fields via the fields parameter, the exporter would silently fallback to missing values (.) if the data arrays were absent from the dataset or if computation failed for specific chunks.

Impact:

This led to "silent failure" where downstream tools would behave unexpectedly due to empty fields, and VCF files were generated with misleading headers that declared fields which contained no real data.

Solution / Changes

  • Explicit Warnings: Added warnings.warn() calls using the standard UserWarning category to notify users when a requested field is omitted due to missing data.
  • Header Filtering: Reorganized the initialization logic to extract data arrays before writing the VCF header. This allows the tool to filter the output_fields list so that only fields with valid data arrays are declared in the ##FORMAT header lines.
  • Fail-Safe Computation: Updated the chunk iteration loop to emit a warning if specific data blocks cannot be read (e.g., due to missing Zarr files on disk), while maintaining VCF structure by writing sentinel . values.
  • Imports: Added import warnings to malariagen_data/anoph/to_vcf.py.

Closes

Fixes #1283

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves SnpVcfExporter.snp_calls_to_vcf() to avoid silently producing misleading VCF FORMAT headers/columns when optional FORMAT arrays (e.g., call_GQ, call_AD, call_MQ) are missing or unreadable, by warning users and dynamically filtering declared FORMAT fields.

Changes:

  • Load optional FORMAT arrays up front and drop unavailable requested FORMAT fields with warnings.warn().
  • Write VCF ##FORMAT headers based on the filtered output_fields rather than the originally requested fields.
  • Emit warnings when optional FORMAT chunks fail to compute, while still writing . sentinel values to preserve VCF structure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +149 to +152
include_gq = "GQ" in output_fields
include_ad = "AD" in output_fields
include_mq = "MQ" in output_fields
format_str = ":".join(output_fields)
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format_str is built from output_fields (user-supplied order), but the per-sample parts list is assembled in a fixed order (GT, then GQ, then AD, then MQ). If a caller provides fields in a different order (e.g. ("GT","MQ","GQ")), the FORMAT column will not match the sample value ordering, producing an invalid/misleading VCF. Consider either enforcing a canonical field order when building output_fields/format_str, or (preferably) constructing per-sample values by iterating over output_fields so the ordering always stays consistent.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +135
if "GQ" in fields:
try:
gq_data = ds["call_GQ"].data
except KeyError:
warnings.warn(
"Requested FORMAT field 'GQ' not found in dataset. "
"GQ values will be omitted from the output VCF.",
UserWarning,
stacklevel=2,
)
output_fields.remove("GQ")

if "AD" in fields:
try:
ad_data = ds["call_AD"].data
except KeyError:
warnings.warn(
"Requested FORMAT field 'AD' not found in dataset. "
"AD values will be omitted from the output VCF.",
UserWarning,
stacklevel=2,
)
output_fields.remove("AD")
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behavior (omitting requested FORMAT fields when the backing arrays are missing, and emitting warnings) isn’t covered by tests. Since there are already VCF exporter tests, it would be good to add cases asserting: (1) a requested missing field (e.g. GQ) triggers a warning, (2) the corresponding ##FORMAT header line is not written, and (3) the FORMAT column/sample values reflect the filtered field list.

Copilot uses AI. Check for mistakes.
@Yashsingh045
Copy link
Copy Markdown
Contributor Author

Hey @jonbrenas ,
Could you please review this PR when you have time?
Thanks

@Yashsingh045
Copy link
Copy Markdown
Contributor Author

Hey @jonbrenas ,
Just a gentle reminder,
could you please review this PR when you have time?
Thanks

Copy link
Copy Markdown
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, Copilot has some comments.

)
output_fields.remove("MQ")

include_gq = "GQ" in output_fields
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should come before the if-statement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

snp_calls_to_vcf() silently drops requested FORMAT fields when underlying data arrays don't exist

3 participants