Thank you Dylan! Parallelizing the tests is much appreciated; Jenkins was very slow to get through the first few tests in the past.
Yes, a few of the tests (the first 5-6 in the testing) might benefit from 12-16 cores, but most could be run on 4 or fewer. I believe @matteocantiello can give you the access you need. There might have been some recent changes to mesa_test, so @wmwolf, maybe you can confirm the GitHub runner is running the recent mesa_test version correctly. |
|
You can see the current test runtimes here. These are all over an hour, and test 5 in particular is 20 hours:
Is the scaling with more cores linear? What's your goal for how long this should take? I also notice it's getting errors like: But seems to successfully log in (it should be using the same credentials as the old system). |
|
I'm not exactly sure about the astero errors. But as far as timing, MESA should scale linearly with number of cores up to about 12 cores. This linear scaling is not fully realized in all tests though. In particular the longer tests (those taking > 1-2 hours) should probably be run on 12-16 cores, and ideally finish in < 2-3 hours each. The c13_pocket and ppisn tests are the worst offenders, I think, taking up to 5 hours on some computers. Here is some example timing from main, although I can't confirm whether these tests were run with the [ci optional] flag, so they might not be running the full test_suite. Perhaps @warrickball or someone who's tested on their cluster recently can provide more accurate timing or expectations. |
|
Do you want to hard-code certain test numbers to run on 16 cores, or just run all of them that way? |
|
I think hardcoding is best, as it might waste a lot of resources to make the other tests run on more than 4 cores, but someone other than me might provide better direction. |
|
Hardcoding might be best, just to get the big tests done in reasonable time, but I think running everything on 8 cores each will be about as good. At the start, the 5 slow tests will make their way to completion, after which the smaller ones will start filling in on whichever subset of 8 cores becomes available, so nothing would be left idle. 4 cores would just about work too because, from your table above, it looks like the slowest test is about 1/4 of the total runtime, so it'd finish just as everything else does, having used the other 12 cores. Fewer cores would leave resources idle, waiting for the slowest test. Running everything on all 16 cores would also clearly keep the system perfectly busy, but at the cost of the smaller tests not scaling particularly well. I'll see if I can generate a full table of my most recent runtimes. |
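The packing argument above can be sketched as a small simulation. This is only an illustration under stated assumptions: the workload is hypothetical (one test worth 80 core-hours, a quarter of the total, plus 60 small tests), scaling is assumed perfectly linear up to 12 cores and flat beyond that, and `makespan` is a made-up helper name, not part of mesa_test or the Jenkinsfile.

```python
import heapq


def makespan(core_hours, total_cores, cores_per_test, max_useful_cores=12):
    """Wall-clock hours to run tests, in order, on a fixed pool of cores.

    core_hours: work per test in core-hours, most expensive first.
    Assumption: perfect linear scaling up to max_useful_cores, none beyond.
    """
    slots = total_cores // cores_per_test
    effective = min(cores_per_test, max_useful_cores)
    # Each heap entry is the time at which one slot becomes free.
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    finish = 0.0
    for work in core_hours:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + work / effective
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical workload: the slow test is 1/4 of the 320 core-hour total.
work = [80.0] + [4.0] * 60

for cores in (2, 4, 8, 16):
    print(cores, "cores per test:", round(makespan(work, 16, cores), 2), "hours")
```

With these made-up numbers, 4 and 8 cores per test both keep all 16 cores busy until the end, 2 cores per test leaves the pool waiting on the slow test, and 16 cores per test is slower because the small tests are assumed not to benefit past 12 cores.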
|
Here's a crude stab at my numbers. Apologies for times in H:M:S! Latest BlueBEAR runtimes Top 5: Full results: |
|
Thanks @warrickball! Is this timing from a normal or full test? The timing might be substantially different between the two. |
|
Full test. |
|
To be clear, the jenkins numbers are 4 cores each, 4 tests in parallel, 16 cores total. I'll try with 8 cores each instead.
seems like tests don't use NPROCS? Also increase to 6 tests parallel.
|
As for |
|
I assume the mesa_test version is coming from this line? https://github.com/MESAHub/mesa/blob/main/jenkins/Dockerfile#L25 I did a run with 6 parallel tests, 8 cores per test, and test 5 took 10 hours, so just about half. At that point this test becomes the bottleneck, as other tests had finished an hour earlier, so maybe 4x8 is a good compromise. Also, the tests are setting |
That sounds about right; I've confirmed that test 5 is the c13_pocket. There might be a way to make this test less expensive, but that can be done outside of this PR.
This is correct. |
|
@wmwolf mesa_test 1.2.0 says it's working now, but is producing a traceback: From my perspective, this is ready to merge. |
|
Actually it turns out that traceback is causing all tests to fail. Regardless, since everything is already failing on the old jenkins system due to #977, and things are generally working on the new one, I'm going to go ahead and remove the old project and let you sort out the mesa_test issues and merging this in. |
|
One broad comment is that the intention in the past has been to periodically re-order the tests so that the most expensive ones appear first in do1_test_source. So you should be able to be agnostic about which test numbers require the most cores, other than to perhaps dedicate more resources to the first 5-10 tests. I used to run 16 cores for the first 10 tests and 8 cores for everything else. We can separately try to reorder the tests so that the most expensive ones do indeed appear within the first 5 tests or so (see #1021). @warrickball's timing seems to be generally consistent with my expectations, and suggests that we should be moving the core collapse related models up near the front of do1_test_source. I suppose |
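The "more cores for the first N tests" policy described above could look roughly like the following. This is a sketch only: `cores_for_test` and `env_for_test` are hypothetical helpers, the 10/16/8 defaults are taken from the comment above, and it is an assumption that the test scripts pick up their thread count from the standard OpenMP variable `OMP_NUM_THREADS`.

```python
import os


def cores_for_test(test_number, n_heavy=10, heavy_cores=16, light_cores=8):
    """Core count for a 1-based position in do1_test_source.

    Assumes the most expensive tests are listed first, so only the
    position matters, not the test name.
    """
    if test_number < 1:
        raise ValueError("test numbers are 1-based")
    return heavy_cores if test_number <= n_heavy else light_cores


def env_for_test(test_number):
    """Copy of the current environment with the thread count set.

    Assumption: the test honors OMP_NUM_THREADS.
    """
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(cores_for_test(test_number))
    return env


for n in (1, 10, 11, 50):
    print(n, cores_for_test(n))
```

Because the decision depends only on position, reordering do1_test_source (as discussed in #1021) would change which tests get the larger allocation without touching this logic.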
|
Hello @dylex, thank you for helping us get jenkins running tests again! I agree this is almost ready to merge, but I have a couple of questions first.
|
It looks like the mesa_test 1.2.0 version is producing a traceback and failing (see above).
I currently have it just using 8 cores for each test. It's basically just a shell script so you can make it use the cores however you want: https://github.com/MESAHub/mesa/blob/jenkins/jenkins/Jenkinsfile#L47
As I understand it the
The only thing you should need us for is if you need to change which builds are run (which repository, branches, permissions). You can adjust the build and environment otherwise. This is standard Jenkins scripted pipeline -- the only thing special is the
The old system is being decommissioned, but you're no longer using it. There are no plans to remove the new one, and it's structured more resiliently than the old one, so it should survive better if we have hardware failures. This of course presumes that someone involved with this project is still at Flatiron. |
|
Awesome, thank you for the detailed response! Once we get the mesa_test 1.2.0 situation sorted, I think we can merge this PR. |
|
I looked at the error message more carefully. The issue happened when trying to submit the logs to the logs server hosted at Flatiron. As we can see, the individual tests did get submitted to the testhub without issue. I've heard from @evbauer that he's having problems submitting logs to the log server as well. This is weird, as that process hasn't really changed recently. @dylex: would you be able to look into the server logs of the logs server (a fun tongue twister) to see if you can see requests that are denied? In the meantime, I'm going to push through v1.2.1 of |
|
I don't see any denied requests, just successful stores: But I do see a few errors similar to this: (All somewhere in filesystem access, but some in different calls.) However, they all seem to be before these errors. If this is trying to connect to a different host at FI, it may be being blocked at the firewall. Let me see if I can add an exception. |

We have a new jenkins system with more flexible resources (but fewer overall runners). Merging this change will switch to the new system. (If someone could give me admin on this repo, I can set up the new webhook as well.)
This is a significant structural change to run all the tests in a single container in parallel. It's currently taking a day on 16 cores (about 380 core hours). We could tweak the resources and parallelism if needed.
I also fixed a problem in the Dockerfile introduced by PR #977 7e71e41 that broke the old jenkins builds as well.
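As a quick check on the resource figures quoted in this PR, the arithmetic can be written out as below. The helper names are illustrative only, and the "ideal" wall time assumes perfectly linear scaling, which the thread notes is not fully realized for every test.

```python
def core_hours(wall_hours, cores):
    """Total work consumed by a run, in core-hours."""
    return wall_hours * cores


def ideal_wall_hours(work_core_hours, cores):
    """Wall time if the work scaled perfectly linearly (an assumption)."""
    return work_core_hours / cores


# A full run: about a day on 16 cores, quoted as "about 380 core hours".
full_run = core_hours(24, 16)

# Test 5: 20 hours at 4 cores per test, as reported in the thread.
test5_work = core_hours(20, 4)
test5_at_8 = ideal_wall_hours(test5_work, 8)

print(full_run)     # 384
print(test5_work)   # 80
print(test5_at_8)   # 10.0
```

The 10-hour figure matches the 8-cores-per-test run reported in the thread, which is consistent with near-linear scaling for that test between 4 and 8 cores.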