fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline by ruben1294 · Pull Request #485 · WrightonLabCSU/DRAM

ruben1294 · 2026-02-24T19:59:58Z

Summary

This PR fixes six bugs found while running the KEGG database formatting
and annotation pipeline (FORMAT_KEGG_DB process). Together they cause
the pipeline to fail when format_kegg = true or produce
silently wrong results. All fixes were validated end-to-end on an HPC
cluster running Nextflow with Docker.

Bug 1 — `format_kegg_database.py`: top-level `scikit-bio` import crashes containers without it

File: bin/format_kegg_database.py

from skbio import ... was imported at module level. Any container that does not have scikit-bio installed raises ImportError on startup, before any real work is done. scikit-bio is only needed when a gene_ko_link file is actually processed, so the imports are now moved into the two functions that require them: process_kegg and generate_modified_kegg_fasta.

Bug 2 — `format_kegg_database.py`: MMseqs2 DB written as `kegg.<date>.mmsdb`, expected as `kegg.mmsdb`

File: bin/format_kegg_database.py

process_kegg() previously wrote the MMseqs2 database as:

kegg_mmseqs_db = path.join(output_dir, "kegg.%s.mmsdb" % download_date)
# e.g. kegg/kegg.2018-12-30.mmsdb

However, modules/local/annotate/mmseqs_search.nf symlinks the database directory and then references the database as ${db_name}.mmsdb (where db_name is the directory name "kegg"), so it expects exactly kegg/kegg.mmsdb. The date suffix caused a silent "file not found" error at annotation time.

This is fixed by removing the date suffix:

kegg_mmseqs_db = path.join(output_dir, "kegg.mmsdb")

Bug 3 — `format_kegg_database.py`: `--skip_gene_ko_link` argparse uses `type=bool` instead of `action="store_true"`

File: bin/format_kegg_database.py

type=bool does not work as a flag. argparse passes the literal string "False" to bool(), and bool("False") == True, so any invocation with --skip_gene_ko_link would receive the wrong value.

# Before (broken):
parser.add_argument("--skip_gene_ko_link", type=bool, default=False, ...)

# After (fixed):
parser.add_argument("--skip_gene_ko_link", action="store_true", ...)

Bug 4 — `format_kegg_db.nf`: container `python_scikit-bio_scipy` does not include `mmseqs2`

File: modules/local/database/format_kegg_db.nf

The FORMAT_KEGG_DB process called mmseqs createdb and mmseqs createindex
but ran inside python_scikit-bio_scipy, which does not ship mmseqs2.
This caused an immediate FileNotFoundError: mmseqs.

The container is now aligned with the rest of the pipeline and uses an image that includes mmseqs2:

// Before:
container "community.wave.seqera.io/library/python_scikit-bio_scipy:0f89a100e990daf2"

// After:
container "community.wave.seqera.io/library/python_pandas_hmmer_mmseqs2_pruned:d2c88b719ab1322c"

Bug 5 — `format_kegg_db.nf`: bash condition `if [ ${skip_gene_ko_link} ]` is always true

File: modules/local/database/format_kegg_db.nf

In bash, if [ "0" ] evaluates to true because any non-empty string is truthy. As a result, the process always took the --skip_gene_ko_link branch and silently ignored the gene_ko_link file even when it was provided.

This is fixed by comparing explicitly to the string "true" (see also Bug 6):

# Before:
if [ ${skip_gene_ko_link} ]; then

# After:
if [ "${skip_gene_ko_link}" = "true" ]; then

Bug 6 — `dram.nf`: `skip_gene_ko_link` passed as integer `1`/`0` instead of string `"true"`/`"false"`

File: workflows/dram.nf

The value passed to FORMAT_KEGG_DB was:

// Before:
skip_gene_ko_link = params.skip_gene_ko_link ? 1 : 0

This produces the strings "1" or "0" inside the bash script. As noted in Bug 5, if [ "0" ] is true in bash. The fix in format_kegg_db.nf requires the value to be exactly "true" or "false":

// After:
skip_gene_ko_link = params.skip_gene_ko_link ? "true" : "false"

Files changed

File	Bugs fixed
`bin/format_kegg_database.py`	#1 (lazy scikit-bio import), #2 (`kegg.mmsdb` name), #3 (argparse flag)
`modules/local/database/format_kegg_db.nf`	#4 (container with mmseqs2), #5 (bash condition)
`workflows/dram.nf`	#6 (boolean string value)

1. Move scikit-bio imports inside functions (lazy import) - `from skbio import ...` was at module level, causing ImportError in Docker containers that do not have scikit-bio installed (e.g. python_pandas_hmmer_mmseqs2_pruned). The import is only needed when a gene_ko_link file is actually processed, so it is now placed inside the two functions that use it: process_kegg() and generate_modified_kegg_fasta(). 2. Fix MMseqs2 output database name (kegg.mmsdb, not kegg.<date>.mmsdb) - The database was written as kegg.<download_date>.mmsdb but modules/local/annotate/mmseqs_search.nf expects the file to be named exactly kegg.mmsdb (it constructs the path as ${db_name}.mmsdb where db_name is the parent directory name "kegg"). The date suffix caused a "No such file or directory" error at annotation time. 3. Fix --skip_gene_ko_link argparse definition - Using `type=bool` does NOT work as a flag: argparse passes the string "False" / "True" to bool(), and bool("False") == True. Replaced with `action="store_true"` so the flag behaves as intended.

1. Replace container that lacks mmseqs2 - FORMAT_KEGG_DB used python_scikit-bio_scipy which does not include mmseqs2. The process calls mmseqs createdb / createindex, so it fails immediately with "No such file or directory: mmseqs". Replaced with python_pandas_hmmer_mmseqs2_pruned, which already carries mmseqs2 and is used by other annotation processes in the pipeline. 2. Fix bash condition for skip_gene_ko_link - The Nextflow value passed to the process is the string "0" or "1" (see dram.nf). In bash, `if [ "0" ]` evaluates to TRUE because any non-empty string is truthy. FORMAT_KEGG_DB therefore always ran the --skip_gene_ko_link branch, ignoring the gene_ko_link file. Fixed with an explicit string comparison: if [ "${skip_gene_ko_link}" = "true" ] (see companion fix in workflows/dram.nf)

The companion fix for the bash condition in format_kegg_db.nf requires that skip_gene_ko_link be the string "true" or "false" rather than the integer 1 or 0. In bash: `if [ "0" ]` -> true (non-empty string) `if [ "false" ]` -> true (still non-empty - also wrong) The correct pattern used in format_kegg_db.nf is: `if [ "${skip_gene_ko_link}" = "true" ]` which requires this value to be exactly the string "true" or "false". Changed `params.skip_gene_ko_link ? 1 : 0` to `params.skip_gene_ko_link ? "true" : "false"`.

madeline-scyphers · 2026-05-14T19:33:58Z

 from glob import glob
 import logging
 import subprocess
-from skbio import write as write_sequence, read as read_sequence


This shouldn't error in general because the container image specifies it. So I am gonna keep it up here because though technically only necessary with some logic flows, I think it is easier to read and nearly 100% of users will be using profiles (either container runtimes or conda) that will come with all dependencies.

madeline-scyphers · 2026-05-14T19:35:43Z

    errorStrategy 'finish'

    conda "${moduleDir}/environment.yml"
-    container "community.wave.seqera.io/library/python_scikit-bio_scipy:0f89a100e990daf2"


Thanks for catching the mmseqs missing here. It should go in the environment file, and then we use wave cli to generate this container uri. So I went ahead and did that and updated it.

madeline-scyphers · 2026-05-14T19:41:12Z

Just made some small updates, thanks for all this!

github-project-automation Bot added this to DRAM Feb 24, 2026

github-project-automation Bot moved this to To Sort in DRAM Feb 24, 2026

ruben1294 added 3 commits March 19, 2026 09:53

ruben1294 force-pushed the fix/format-kegg-database-pipeline branch from ade8a2d to f4b2394 Compare March 19, 2026 16:54

madeline-scyphers reviewed May 14, 2026

View reviewed changes

Small updates to PR

77b79ae

madeline-scyphers merged commit e5439e8 into WrightonLabCSU:dev May 14, 2026
1 check passed

github-project-automation Bot moved this from To Sort to Done in DRAM May 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485

fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485
madeline-scyphers merged 4 commits into
WrightonLabCSU:devfrom
ruben1294:fix/format-kegg-database-pipeline

ruben1294 commented Feb 24, 2026

Uh oh!

madeline-scyphers May 14, 2026

Uh oh!

madeline-scyphers May 14, 2026

Uh oh!

madeline-scyphers commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ruben1294 commented Feb 24, 2026

Summary

Bug 1 — format_kegg_database.py: top-level scikit-bio import crashes containers without it

Bug 2 — format_kegg_database.py: MMseqs2 DB written as kegg.<date>.mmsdb, expected as kegg.mmsdb

Bug 3 — format_kegg_database.py: --skip_gene_ko_link argparse uses type=bool instead of action="store_true"

Bug 4 — format_kegg_db.nf: container python_scikit-bio_scipy does not include mmseqs2

Bug 5 — format_kegg_db.nf: bash condition if [ ${skip_gene_ko_link} ] is always true

Bug 6 — dram.nf: skip_gene_ko_link passed as integer 1/0 instead of string "true"/"false"

Files changed

Uh oh!

madeline-scyphers May 14, 2026

Choose a reason for hiding this comment

Uh oh!

madeline-scyphers May 14, 2026

Choose a reason for hiding this comment

Uh oh!

madeline-scyphers commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bug 1 — `format_kegg_database.py`: top-level `scikit-bio` import crashes containers without it

Bug 2 — `format_kegg_database.py`: MMseqs2 DB written as `kegg.<date>.mmsdb`, expected as `kegg.mmsdb`

Bug 3 — `format_kegg_database.py`: `--skip_gene_ko_link` argparse uses `type=bool` instead of `action="store_true"`

Bug 4 — `format_kegg_db.nf`: container `python_scikit-bio_scipy` does not include `mmseqs2`

Bug 5 — `format_kegg_db.nf`: bash condition `if [ ${skip_gene_ko_link} ]` is always true

Bug 6 — `dram.nf`: `skip_gene_ko_link` passed as integer `1`/`0` instead of string `"true"`/`"false"`