pinin4fjords left a comment
Minor bits and pieces, but the general approach looks reasonable to me. Exomiser seems to require those versions, so they shouldn't go in ext.args, and it makes sense to put them in the tuples as you've done.
tuple val(meta), path(vcf), path(ped), val(assembly), path(phenopacket), path(analysis_script)
tuple val(meta2), path(reference_cache, stageAs: 'exomiser_data/*'), val(reference_version)
tuple val(meta3), path(phenotype_cache, stageAs: 'exomiser_data/*'), val(phenotype_version)
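For anyone trying the module out, here is a minimal sketch of how a caller might wire up these three inputs. The process name `EXOMISER`, the include path, and every `params.*` name are assumptions made for illustration; none of them come from the PR itself.

```nextflow
// Assumed include path and process name; adjust to wherever the module ends up.
include { EXOMISER } from './modules/nf-core/exomiser/main'

workflow {
    // Per-case inputs: one tuple per case, matching the first input declaration.
    ch_case = Channel.of([
        [ id: 'case1' ],                 // meta map (hypothetical id)
        file(params.vcf),
        file(params.ped),
        params.assembly,                 // passed through as val(assembly)
        file(params.phenopacket),
        file(params.analysis_script)
    ])

    // Caches as value channels so every case can reuse the same data.
    // The version strings travel with the cache directories in the same tuple.
    ch_reference = Channel.value([ [ id: 'reference' ], file(params.reference_cache), params.reference_version ])
    ch_phenotype = Channel.value([ [ id: 'phenotype' ], file(params.phenotype_cache), params.phenotype_version ])

    EXOMISER(ch_case, ch_reference, ch_phenotype)
}
```

Keeping each version next to its cache in the tuple means the two cannot drift apart, which is presumably the point of not putting them in ext.args.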
This stuff looks reasonable to me, I don't really have a better idea
tuple val(meta), path("*.json"), emit: json
tuple val(meta), path("*.html"), emit: html
tuple val(meta), path("*.parquet"), emit: parquet
tuple val(meta), path("*.vcf"), emit: vcf
I would say we need to bgzip this vcf before we output it
I agree. The thing is, I'm only just testing this tool and I haven't had the opportunity to look further into it.
Chances are it's already compressed and I just missed it in the docs
The VCF file is tabix-indexed and exomiser ranked alleles can be extracted using grep
Yes, it's bgzipped (otherwise it can't be indexed, afaik), so perfect.
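If the VCF really does come out bgzipped and tabix-indexed, the output globs may need to follow suit. A sketch of what that could look like, assuming the files end in `.vcf.gz` and `.vcf.gz.tbi` (worth confirming against an actual run) and using a hypothetical `tbi` emit name:

```nextflow
    output:
    // Assumption: Exomiser writes <prefix>.vcf.gz plus a <prefix>.vcf.gz.tbi index.
    // A "*.vcf" glob would not match a file ending in ".vcf.gz".
    tuple val(meta), path("*.vcf.gz")    , emit: vcf
    tuple val(meta), path("*.vcf.gz.tbi"), emit: tbi
```

Emitting the index on its own channel lets downstream modules take the VCF and index together without re-indexing.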
tuple val(meta2), path(reference_cache, stageAs: 'exomiser_data/*'), val(reference_version)
tuple val(meta3), path(phenotype_cache, stageAs: 'exomiser_data/*'), val(phenotype_version)
Would it make sense to have a separate module that takes care of loading this data properly? That's what we did for PCGR. It would be exomiser/getreference: it would download the data and create the required exomiser.data-directory=/data/exomiser-data, which could then simply be an input to this module.
IMO the data is too big to be loaded on the fly. It comes to about 50 GB of reference data in total.
Yes, that's why I would handle it separately, or at least that's how we do it for the VEP cache etc. Either we add the data to the VEP cache thingy @maxulysse built, or we create a module that a pipeline can use to load it (see PCGR in the variantprioritization pipeline). For testing, we then subsample the cache to chr22 etc. (I did that for PCGR as well).
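To make the idea concrete, a rough sketch of what such a helper module could look like. Everything here is hypothetical: the process name `EXOMISER_GETREFERENCE`, the assumption that the data arrives as a zip archive, and the presence of `unzip` in the container. It is meant as a starting point for discussion, not a tested implementation.

```nextflow
// Hypothetical helper: unpacks a pre-downloaded (or remotely staged) data archive
// into the exomiser_data/ layout that the main module stages its caches under.
process EXOMISER_GETREFERENCE {
    tag "${version}"

    input:
    // archive: assumed to be a zip of one Exomiser data release;
    // a URL can be passed here and Nextflow will stage it as a file.
    tuple val(meta), path(archive), val(version)

    output:
    // Emit the unpacked directory together with its version string,
    // so it can be fed straight into the main module's cache inputs.
    tuple val(meta), path("exomiser_data/*"), val(version), emit: cache

    script:
    """
    mkdir -p exomiser_data
    unzip -q ${archive} -d exomiser_data
    """
}
```

For tests, the same process could be pointed at a small chr22-only archive, so the full data set never has to be unpacked in CI.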
POC module PR for exomiser. I'd like some more eyes on this before I put more time into it, to figure out the best way to handle reference data and inputs.
PR checklist
Closes #XXX
- topic: versions - See version_topics
- label
- nf-core modules test <MODULE> --profile docker
- nf-core modules test <MODULE> --profile singularity
- nf-core modules test <MODULE> --profile conda
- nf-core subworkflows test <SUBWORKFLOW> --profile docker
- nf-core subworkflows test <SUBWORKFLOW> --profile singularity
- nf-core subworkflows test <SUBWORKFLOW> --profile conda