Skip to content

Big slice imp modules#11034

Open
SkyLexS wants to merge 17 commits intonf-core:masterfrom
SkyLexS:big_slice_imp_modules
Open

Big slice imp modules#11034
SkyLexS wants to merge 17 commits intonf-core:masterfrom
SkyLexS:big_slice_imp_modules

Conversation

@SkyLexS
Copy link
Copy Markdown
Contributor

@SkyLexS SkyLexS commented Mar 24, 2026

PR checklist

Closes #XXX

  • This update includes new arguments for the bigslice tool.
  • If you've fixed a bug or added code that should be tested, add tests!
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions - See version_topics
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • [ x Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

Comment on lines -17 to -18
// WARN: Version information not provided by tool on CLI. Please update this string when bumping container versions.
tuple val("${task.process}"), val('bigslice'), val("2.0.2"), topic: versions, emit: versions_bigslice
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working on it rn sorry

@SkyLexS SkyLexS requested a review from camlloyd March 30, 2026 06:35
Copy link
Copy Markdown
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vagkaratzas any thoughts as well?

Comment on lines +28 to +33
def export_tsv_cmd = args2 ? """
bigslice \\
--export-tsv ${prefix}/result/tsv_export \\
--program_db_folder ${hmmdb} \\
${prefix}
""" : ''
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this implementation, what is args2 trying to do here to activate the execution?

Shouldn't you just have a boolean input value channel that, if true, just injects --export-tsv in to the main command? Why the whole extra command?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, ext.args can have more than one param - value pair inside. You don't need a new args* value for each param possible. --export-tsv can be passed and checked through ext.args

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--export-tsv is a separate BiG-SLiCE subcommand that must be called as a second bigslice invocation after clustering completes, so it cannot simply be appended to the main command. Two approaches exist: using ext.args2 to explicitly separate the two commands (clear intent, but adds a non-standard extra args variable), or detecting --export-tsv inside ext.args, stripping it from the main command and running it as a post-step (current approach), which keeps a single ext.args entry point. If you have any other suggestions please tell :')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then yeah I would go for a boolean input channel, and inject the second command (with $args2) if requested.
For variable-based command injection, keep the command on one line rather than across multiple though

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely add a new nf-test case testing this new funcitonality.

I am confused as to which method to choose, because I can see the --export-tsv being used twice; both here --export-tsv ${prefix}/result/tsv_export \\ and in the ${export_tsv_cmd} command afterwards.

I'd definitely skip args2, and parse ext.args for finding it though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given it fundamentally changes the command, I feel input channel is clearer/more explicit

@SkyLexS SkyLexS requested a review from jfy133 April 1, 2026 11:16
@SkyLexS
Copy link
Copy Markdown
Contributor Author

SkyLexS commented Apr 1, 2026

no more results folder it will be sample/(bigslice output) fully working parameters with catch for errors if you have any suggestions please tell

// Flatten the GBK directory into a list of individual GBK files with meta
input[0] = UNTAR_GBK.out.untar.map { meta, dir ->
def gbk_files = []
dir.eachFileRecurse { if (it.name.endsWith('.gbk')) gbk_files << it }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +49 to +51
mv ${prefix}/result/data.db ${prefix}/data.db
mv ${prefix}/result/tmp ${prefix}/tmp
rm -rf ${prefix}/result
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to do these move operations (and actually it's not recommened apparently), youc an just emit prefix as is

@SkyLexS SkyLexS requested a review from jfy133 April 1, 2026 14:35
Copy link
Copy Markdown
Contributor

@famosab famosab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments to your PR.

Comment on lines +79 to +84
{ assert resultDir.isDirectory() },
{ assert file("${resultDir}/data.db").exists() },
{ assert resultDir.list().any { it.endsWith('.fa') || new File("${resultDir}/tmp").exists() } },
{ assert snapshot(
process.out.findAll { key, val -> key.startsWith("versions")}
).match() }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add other files to this snapshot as well? Ideally we want all outputs to be at least present by name in the snapshot.

Comment on lines +11 to +13
tuple val(meta), path(bgc, stageAs: 'bgc_files/s*/*')
path(hmmdb)
val(export_tsv)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put all these inputs into one tuple? That will make sure you're sure that EVERY time everything comes together in the right combination.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmdb is a shared reference database (not sample-specific) and export_tsv is a boolean flag (not a file), so neither belongs in the sample tuple. In nf-core, tuples group a meta map with the data files of that specific sample mixing in shared resources or behaviour flags would break this convention.

--program_db_folder ${hmmdb} \\
${prefix}

${export_tsv_cmd}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this tsv then as optional output and have it be accessible for downstream analyses?

@SkyLexS
Copy link
Copy Markdown
Contributor Author

SkyLexS commented Apr 2, 2026

@famosab
The *.fa files in result/tmp/ are non-deterministic, bigslice generates them based on HMM scoring, and the exact set of hits varies between runs/environments due to floating-point differences. Snapshotting their names will always break on CI. So instead we just assert that at least one .fa exists there, which is enough to confirm the tool ran correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants