Skip to content

fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485

Merged
madeline-scyphers merged 4 commits into
WrightonLabCSU:devfrom
ruben1294:fix/format-kegg-database-pipeline
May 14, 2026
Merged

fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485
madeline-scyphers merged 4 commits into
WrightonLabCSU:devfrom
ruben1294:fix/format-kegg-database-pipeline

Conversation

@ruben1294
Copy link
Copy Markdown

Summary

This PR fixes six bugs found while running the KEGG database formatting
and annotation pipeline (FORMAT_KEGG_DB process). Together they cause
the pipeline to fail when format_kegg = true or produce
silently wrong results. All fixes were validated end-to-end on an HPC
cluster running Nextflow with Docker.


Bug 1 — format_kegg_database.py: top-level scikit-bio import crashes containers without it

File: bin/format_kegg_database.py

from skbio import ... was imported at module level. Any container that does not have scikit-bio installed raises ImportError on startup, before any real work is done. scikit-bio is only needed when a gene_ko_link file is actually processed, so the imports are now moved into the two functions that require them: process_kegg and generate_modified_kegg_fasta.


Bug 2 — format_kegg_database.py: MMseqs2 DB written as kegg.<date>.mmsdb, expected as kegg.mmsdb

File: bin/format_kegg_database.py

process_kegg() previously wrote the MMseqs2 database as:

kegg_mmseqs_db = path.join(output_dir, "kegg.%s.mmsdb" % download_date)
# e.g. kegg/kegg.2018-12-30.mmsdb

However, modules/local/annotate/mmseqs_search.nf symlinks the database directory and then references the database as ${db_name}.mmsdb (where db_name is the directory name "kegg"), so it expects exactly kegg/kegg.mmsdb. The date suffix caused a silent "file not found" error at annotation time.

This is fixed by removing the date suffix:

kegg_mmseqs_db = path.join(output_dir, "kegg.mmsdb")

Bug 3 — format_kegg_database.py: --skip_gene_ko_link argparse uses type=bool instead of action="store_true"

File: bin/format_kegg_database.py

type=bool does not work as a flag. argparse passes the literal string "False" to bool(), and bool("False") == True, so any invocation with --skip_gene_ko_link would receive the wrong value.

# Before (broken):
parser.add_argument("--skip_gene_ko_link", type=bool, default=False, ...)

# After (fixed):
parser.add_argument("--skip_gene_ko_link", action="store_true", ...)

Bug 4 — format_kegg_db.nf: container python_scikit-bio_scipy does not include mmseqs2

File: modules/local/database/format_kegg_db.nf

The FORMAT_KEGG_DB process called mmseqs createdb and mmseqs createindex
but ran inside python_scikit-bio_scipy, which does not ship mmseqs2.
This caused an immediate FileNotFoundError: mmseqs.

The container is now aligned with the rest of the pipeline and uses an image that includes mmseqs2:

// Before:
container "community.wave.seqera.io/library/python_scikit-bio_scipy:0f89a100e990daf2"

// After:
container "community.wave.seqera.io/library/python_pandas_hmmer_mmseqs2_pruned:d2c88b719ab1322c"

Bug 5 — format_kegg_db.nf: bash condition if [ ${skip_gene_ko_link} ] is always true

File: modules/local/database/format_kegg_db.nf

In bash, if [ "0" ] evaluates to true because any non-empty string is truthy. As a result, the process always took the --skip_gene_ko_link branch and silently ignored the gene_ko_link file even when it was provided.

This is fixed by comparing explicitly to the string "true" (see also Bug 6):

# Before:
if [ ${skip_gene_ko_link} ]; then

# After:
if [ "${skip_gene_ko_link}" = "true" ]; then

Bug 6 — dram.nf: skip_gene_ko_link passed as integer 1/0 instead of string "true"/"false"

File: workflows/dram.nf

The value passed to FORMAT_KEGG_DB was:

// Before:
skip_gene_ko_link = params.skip_gene_ko_link ? 1 : 0

This produces the strings "1" or "0" inside the bash script. As noted in Bug 5, if [ "0" ] is true in bash. The fix in format_kegg_db.nf requires the value to be exactly "true" or "false":

// After:
skip_gene_ko_link = params.skip_gene_ko_link ? "true" : "false"

Files changed

File Bugs fixed
bin/format_kegg_database.py #1 (lazy scikit-bio import), #2 (kegg.mmsdb name), #3 (argparse flag)
modules/local/database/format_kegg_db.nf #4 (container with mmseqs2), #5 (bash condition)
workflows/dram.nf #6 (boolean string value)

1. Move scikit-bio imports inside functions (lazy import)
   - `from skbio import ...` was at module level, causing ImportError in
     Docker containers that do not have scikit-bio installed (e.g.
     python_pandas_hmmer_mmseqs2_pruned). The import is only needed when
     a gene_ko_link file is actually processed, so it is now placed inside
     the two functions that use it: process_kegg() and
     generate_modified_kegg_fasta().

2. Fix MMseqs2 output database name (kegg.mmsdb, not kegg.<date>.mmsdb)
   - The database was written as kegg.<download_date>.mmsdb but
     modules/local/annotate/mmseqs_search.nf expects the file to be named
     exactly kegg.mmsdb (it constructs the path as ${db_name}.mmsdb where
     db_name is the parent directory name "kegg"). The date suffix caused a
     "No such file or directory" error at annotation time.

3. Fix --skip_gene_ko_link argparse definition
   - Using `type=bool` does NOT work as a flag: argparse passes the string
     "False" / "True" to bool(), and bool("False") == True. Replaced with
     `action="store_true"` so the flag behaves as intended.
1. Replace container that lacks mmseqs2
   - FORMAT_KEGG_DB used python_scikit-bio_scipy which does not include
     mmseqs2. The process calls mmseqs createdb / createindex, so it fails
     immediately with "No such file or directory: mmseqs". Replaced with
     python_pandas_hmmer_mmseqs2_pruned, which already carries mmseqs2 and
     is used by other annotation processes in the pipeline.

2. Fix bash condition for skip_gene_ko_link
   - The Nextflow value passed to the process is the string "0" or "1"
     (see dram.nf). In bash, `if [ "0" ]` evaluates to TRUE because any
     non-empty string is truthy. FORMAT_KEGG_DB therefore always ran the
     --skip_gene_ko_link branch, ignoring the gene_ko_link file.
     Fixed with an explicit string comparison:
       if [ "${skip_gene_ko_link}" = "true" ]
     (see companion fix in workflows/dram.nf)
The companion fix for the bash condition in format_kegg_db.nf requires
that skip_gene_ko_link be the string "true" or "false" rather than the
integer 1 or 0.

In bash:
  `if [ "0" ]`   -> true  (non-empty string)
  `if [ "false" ]` -> true  (still non-empty - also wrong)

The correct pattern used in format_kegg_db.nf is:
  `if [ "${skip_gene_ko_link}" = "true" ]`

which requires this value to be exactly the string "true" or "false".
Changed `params.skip_gene_ko_link ? 1 : 0` to
`params.skip_gene_ko_link ? "true" : "false"`.
@ruben1294 ruben1294 force-pushed the fix/format-kegg-database-pipeline branch from ade8a2d to f4b2394 Compare March 19, 2026 16:54
from glob import glob
import logging
import subprocess
from skbio import write as write_sequence, read as read_sequence
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't error in general because the container image specifies it. So I am gonna keep it up here because though technically only necessary with some logic flows, I think it is easier to read and nearly 100% of users will be using profiles (either container runtimes or conda) that will come with all dependencies.

errorStrategy 'finish'

conda "${moduleDir}/environment.yml"
container "community.wave.seqera.io/library/python_scikit-bio_scipy:0f89a100e990daf2"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching the mmseqs missing here. It should go in the environment file, and then we use wave cli to generate this container uri. So I went ahead and did that and updated it.

@madeline-scyphers
Copy link
Copy Markdown
Member

Just made some small updates, thanks for all this!

@madeline-scyphers madeline-scyphers merged commit e5439e8 into WrightonLabCSU:dev May 14, 2026
1 check passed
@github-project-automation github-project-automation Bot moved this from To Sort to Done in DRAM May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants