Fix #1283: Warn and handle missing FORMAT fields in snp_calls_to_vcf()#1284
Fix #1283: Warn and handle missing FORMAT fields in snp_calls_to_vcf()#1284Yashsingh045 wants to merge 1 commit intomalariagen:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves SnpVcfExporter.snp_calls_to_vcf() to avoid silently producing misleading VCF FORMAT headers/columns when optional FORMAT arrays (e.g., call_GQ, call_AD, call_MQ) are missing or unreadable, by warning users and dynamically filtering declared FORMAT fields.
Changes:
- Load optional FORMAT arrays up front and drop unavailable requested FORMAT fields with
warnings.warn(). - Write VCF
##FORMATheaders based on the filteredoutput_fieldsrather than the originally requestedfields. - Emit warnings when optional FORMAT chunks fail to compute, while still writing
.sentinel values to preserve VCF structure.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| include_gq = "GQ" in output_fields | ||
| include_ad = "AD" in output_fields | ||
| include_mq = "MQ" in output_fields | ||
| format_str = ":".join(output_fields) |
There was a problem hiding this comment.
format_str is built from output_fields (user-supplied order), but the per-sample parts list is assembled in a fixed order (GT, then GQ, then AD, then MQ). If a caller provides fields in a different order (e.g. ("GT","MQ","GQ")), the FORMAT column will not match the sample value ordering, producing an invalid/misleading VCF. Consider either enforcing a canonical field order when building output_fields/format_str, or (preferably) constructing per-sample values by iterating over output_fields so the ordering always stays consistent.
| if "GQ" in fields: | ||
| try: | ||
| gq_data = ds["call_GQ"].data | ||
| except KeyError: | ||
| warnings.warn( | ||
| "Requested FORMAT field 'GQ' not found in dataset. " | ||
| "GQ values will be omitted from the output VCF.", | ||
| UserWarning, | ||
| stacklevel=2, | ||
| ) | ||
| output_fields.remove("GQ") | ||
|
|
||
| if "AD" in fields: | ||
| try: | ||
| ad_data = ds["call_AD"].data | ||
| except KeyError: | ||
| warnings.warn( | ||
| "Requested FORMAT field 'AD' not found in dataset. " | ||
| "AD values will be omitted from the output VCF.", | ||
| UserWarning, | ||
| stacklevel=2, | ||
| ) | ||
| output_fields.remove("AD") |
There was a problem hiding this comment.
New behavior (omitting requested FORMAT fields when the backing arrays are missing, and emitting warnings) isn’t covered by tests. Since there are already VCF exporter tests, it would be good to add cases asserting: (1) a requested missing field (e.g. GQ) triggers a warning, (2) the corresponding ##FORMAT header line is not written, and (3) the FORMAT column/sample values reflect the filtered field list.
|
Hey @jonbrenas , |
|
Hey @jonbrenas , |
jonbrenas
left a comment
There was a problem hiding this comment.
Also, Copilot has some comments.
| ) | ||
| output_fields.remove("MQ") | ||
|
|
||
| include_gq = "GQ" in output_fields |
There was a problem hiding this comment.
These should come before the if-statement.
Description
This PR fixes an issue #1283 ,
In the snp_calls_to_vcf() method where requested FORMAT fields (GQ, AD, MQ) were silently dropped if the underlying data was missing or unreadable. It adds user-facing warnings and ensures VCF headers are dynamically updated to accurately reflect available data.
Problem Statement
When a user explicitly requested optional FORMAT fields via the fields parameter, the exporter would silently fallback to missing values (.) if the data arrays were absent from the dataset or if computation failed for specific chunks.
Impact:
This led to "silent failure" where downstream tools would behave unexpectedly due to empty fields, and VCF files were generated with misleading headers that declared fields which contained no real data.
Solution / Changes
Closes
Fixes #1283