Conversation

@quantumsteve
Collaborator

First attempt at replacing cpu-benchmark with a nearly identical test in stress-ng.

Differences:

  • cpu-benchmark samples a full sphere, while stress-ng samples a single quadrant (see the sketch after this list).
  • cpu-benchmark uses a "terrible" but efficient RNG, while stress-ng offers many options; I chose "lcg" to start with.
  • cpu-benchmark batches cpu-work into 1000000 samples, while stress-ng uses ~16384 samples.
  • stress-ng defines samples and ops as int32_t, while cpu-benchmark defines samples as int64_t.
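
For reference, here is a minimal Python sketch of the quadrant-style estimate driven by a simple LCG; the LCG constants, seed, and 16384-sample batch are illustrative assumptions, not the exact values used by stress-ng or cpu-benchmark.

```python
# Sketch of a quadrant-based Monte Carlo estimate of pi driven by a
# linear congruential generator (LCG). Constants and batch size are
# illustrative, not the exact stress-ng/cpu-benchmark values.

LCG_A = 6364136223846793005   # example 64-bit LCG multiplier (Knuth MMIX)
LCG_C = 1442695040888963407   # example increment
LCG_MASK = (1 << 64) - 1


def lcg_floats(seed):
    """Yield pseudo-random floats in [0, 1) from a 64-bit LCG."""
    state = seed & LCG_MASK
    while True:
        state = (LCG_A * state + LCG_C) & LCG_MASK
        yield state / float(1 << 64)


def estimate_pi(samples=16384, seed=12345):
    """Estimate pi by sampling the unit square and counting hits in the quarter circle."""
    rng = lcg_floats(seed)
    inside = 0
    for _ in range(samples):
        x, y = next(rng), next(rng)
        if x * x + y * y <= 1.0:
            inside += 1
    # quarter-circle area / unit-square area = pi / 4
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_pi())
```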

stress-ng launches multiple processes, which changes the logic for stopping parent and child processes. A quick search recommended psutil.
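
A minimal sketch, assuming psutil, of how the parent/child cleanup could look; the helper name terminate_process_tree is mine for illustration, not existing wfbench code.

```python
import psutil


def terminate_process_tree(pid, timeout=3):
    """Terminate a process and all of its children (e.g. the stress-ng hogs).

    Sends SIGTERM to the whole tree first, then SIGKILL to anything still
    alive after `timeout` seconds.
    """
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    procs = parent.children(recursive=True) + [parent]
    for proc in procs:
        try:
            proc.terminate()
        except psutil.NoSuchProcess:
            pass
    _, alive = psutil.wait_procs(procs, timeout=timeout)
    for proc in alive:
        proc.kill()
```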

@quantumsteve quantumsteve marked this pull request as draft February 9, 2026 20:10
@rafaelfsilva rafaelfsilva added this to the v1.5 milestone Feb 10, 2026
@rafaelfsilva
Member

Hi @quantumsteve, it seems the tests for Dask are hanging. Could you please take a look at them? Thanks!

@henricasanova
Contributor

henricasanova commented Feb 10, 2026

Confirming that tests hang locally (not only on the GitHub runner). Let me know if you need help/guidance regarding this, as I implemented the testing infrastructure for translators/loggers. Oh, and other tests hang, so it's not specific to Dask, which is good as it means it should be easier to diagnose/fix. AND, some tests fail, like for the Bash translator, which should also be relatively easy to diagnose. All the testing code is in tests/test_helpers.py and tests/translators_loggers/.

@quantumsteve
Collaborator Author

Yes, I'm struggling to run some tests locally. It looks like the containers are unable to write in my /tmp directory. Do I need to change permissions?

@henricasanova
Contributor

The permissions should be fine, as I have dealt with that as well. I'll take a look today and let you know what I find out. Testing with Docker is pretty finicky due to users/permissions.

@henricasanova
Contributor

One thing I noticed is that psutil wasn't listed in pyproject.toml. That "fixed" the bash executor test, in that now it hangs like the others :)

@henricasanova
Contributor

henricasanova commented Feb 10, 2026

Ok, news. Connecting to the container and running wfbench by hand, not involving wfcommons or any of my test infrastructure, hangs:

bin/wfbench --name split_fasta_00000001 --percent-cpu 1.0 --cpu-work 1 
[WfBench][09:57:07][INFO] Starting split_fasta_00000001 Benchmark
[WfBench][09:57:07][INFO] Starting CPU and Memory Benchmarks for split_fasta_00000001...
stress-ng: info:  [311] defaulting to a 1 day, 0 secs run per stressor
stress-ng: info:  [311] dispatching hogs: 10 monte-carlo
stress-ng: info:  [312] monte-carlo: pi   ~ 2.6666666666667 vs 3.1415926535898 using lcg (average of 1 runs)
stress-ng: info:  [311] skipped: 0
stress-ng: info:  [311] passed: 10: monte-carlo (10)
stress-ng: info:  [311] failed: 0
stress-ng: info:  [311] metrics untrustworthy: 0
stress-ng: info:  [311] successful run completed in 0.06 secs

That should be easy to diagnose, and I'll look at it soon-ish.

@henricasanova
Contributor

Ok, so the culprit is io_proc.join(), which hangs. Also, I am noticing 76 stress-ng processes, and 3 stress-ng zombie processes while this hangs. I assume that's fine/intended, but thought I'd mention it.

@henricasanova
Contributor

@quantumsteve A fix is to call io_proc.kill() right before the io_proc.join(), because I don't believe the I/O process can actually terminate on its own. I see you had a commented-out io_proc.terminate() before the join, so perhaps you had thought of that. With that fix, the execution does complete. BUT, it leaves behind tons of zombie stress-ng-vm processes, which is of course not good. With this information, you can likely fix your code now. What do you think?
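
Something like the following sketch is what I have in mind (io_proc as in your code; the reaping helper, the name matching, and the timeout are my own illustration, not a tested patch):

```python
import psutil


def stop_io_and_reap(io_proc):
    """Stop the I/O benchmark process and reap leftover stress-ng children.

    io_proc is assumed to be a multiprocessing.Process. kill() before join()
    avoids the hang; explicitly waiting on the stress-ng descendants keeps
    them from lingering as zombies.
    """
    io_proc.kill()   # the I/O loop never terminates on its own
    io_proc.join()

    # Kill any stress-ng processes spawned under this benchmark, then wait
    # on them so they are reaped rather than left as zombies.
    stressors = []
    for child in psutil.Process().children(recursive=True):
        try:
            if "stress-ng" in child.name():
                child.kill()
                stressors.append(child)
        except psutil.NoSuchProcess:
            pass
    psutil.wait_procs(stressors, timeout=3)
```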
