jenkins: migration to new k8s-based jenkins system by dylex · Pull Request #1013 · MESAHub/mesa

dylex · 2026-05-26T16:14:17Z

We have a new jenkins system with more flexible resources (but fewer overall runners). Merging this change will switch to the new system. (If someone could give me admin on this repo, I can set up the new webhook as well.)

This is a significant structural change to run all the tests in a single container in parallel. It's currently taking a day on 16 cores (about 380 core hours). We could tweak the resources and parallelism if needed.

I also fixed a problem in the Dockerfile, introduced by PR #977 (7e71e41), that broke the old jenkins builds as well.

Debraheem · 2026-05-26T16:48:04Z

Thank you Dylan! Parallelizing the tests is much appreciated; Jenkins was very slow to get through the first few tests in the past.

> We could tweak the resources and parallelism if needed.

Yes, a few of the tests (the first 5-6 in the testing) might benefit from 12-16 cores, but most could be run on 4 or fewer.

I believe @matteocantiello can give you the access you need. There might have been some recent changes to mesa_test so @wmwolf maybe you can confirm the github runner is running the recent mesa_test version correctly.

dylex · 2026-05-26T17:18:22Z

You can see the current test runtimes here. These are all over an hour, and test 5 in particular is 20 hours:

Test	Runtime (s)	Command
5	72439.477	mesa_test test 5 --force-logs
2	34449.641	mesa_test test 2 --force-logs
40	26983.783	mesa_test test 40 --force-logs
1	15398.585	mesa_test test 1 --force-logs
52	15389.556	mesa_test test 52 --force-logs
20	14725.373	mesa_test test 20 --force-logs
7	13393.096	mesa_test test 7 --force-logs
9	11360.891	mesa_test test 9 --force-logs
3	10900.387	mesa_test test 3 --force-logs
10	10290.912	mesa_test test 10 --force-logs
8	9967.901	mesa_test test 8 --force-logs
24	7253.871	mesa_test test 24 --force-logs
53	6490.134	mesa_test test 53 --force-logs
11	6173.845	mesa_test test 11 --force-logs
12	5276.727	mesa_test test 12 --force-logs
16	4452.366	mesa_test test 16 --force-logs
92	4349.950	mesa_test test 92 --force-logs
31	4328.200	mesa_test test 31 --force-logs

Is the scaling with more cores linear? What's your goal for how long this should take?

I also notice it's getting errors like:

Failed to submit astero_gyre for commit a062e9c47337cd99a6cf04a6174a0e6bfffb72ce

But it seems to log in successfully (it should be using the same credentials as the old system).

Debraheem · 2026-05-26T17:36:30Z

I'm not exactly sure about the astero errors. As for timing, MESA should scale linearly with the number of cores up to about 12 cores, although this linear scaling is not fully realized in all tests. In particular, the longer tests (those taking more than 1-2 hours) should probably be run on 12-16 cores, and ideally finish in under 2-3 hours each. I think c13_pocket and ppisn are the worst offenders, taking up to 5 hours on some computers. Here is some example timing from main, although I can't confirm whether these tests were run with the [ci optional] flag, so they might not be running the full test_suite.

Perhaps @warrickball or someone who's tested on their cluster recently can provide more accurate timing or expectations.

dylex · 2026-05-26T18:16:09Z

Do you want to hard-code certain test numbers to run on 16 cores, or just run all of them that way?

Debraheem · 2026-05-26T18:18:54Z

I think hardcoding is best, as it might waste a lot of resources to make the other tests run on more than 4 cores, but someone other than me might provide better direction.

warrickball · 2026-05-26T20:24:53Z

Hardcoding might be best, just to get the big tests done in reasonable time, but I think running everything on 8 cores each will be about as good. At the start, the 5 slow tests will make their way to completion, after which the smaller ones will start filling in on whichever subset of 8 cores becomes available, so nothing would be left idle.

4 cores would just about work too because from your table above, it looks like the slowest test is about 1/4 of the total runtime, so it'd finish just as everything else does, having used the other 12 cores. Fewer cores would leave resources idle, waiting for the slowest test c13_pocket to finish.

Running everything on all 16 cores would also clearly keep the system perfectly busy but at the cost of the smaller tests not scaling particularly well.

I'll see if I can generate a full table of my most recent runtimes.
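
The scheduling argument above can be checked with a small simulation. The sketch below is an illustration only and is not part of the Jenkinsfile: it takes the 18 runtimes from dylex's table (measured at 4 cores per test, so only the 4-slot case matches a measured configuration), starts each test on whichever slot frees up first, and reports the resulting wall time. The function name `makespan` is hypothetical, and tests shorter than an hour are not in the table and are ignored.

```python
import heapq

# Runtimes in seconds from the Jenkins table above (4 cores per test).
# Only tests longer than an hour are listed there; shorter tests are omitted.
RUNTIMES = [
    72439.477, 34449.641, 26983.783, 15398.585, 15389.556, 14725.373,
    13393.096, 11360.891, 10900.387, 10290.912, 9967.901, 7253.871,
    6490.134, 6173.845, 5276.727, 4452.366, 4349.950, 4328.200,
]


def makespan(runtimes, slots):
    """Wall time when each test starts on the first slot that becomes free."""
    free_at = [0.0] * slots  # time at which each slot is next idle
    heapq.heapify(free_at)
    for runtime in runtimes:
        start = heapq.heappop(free_at)  # earliest available slot
        heapq.heappush(free_at, start + runtime)
    return max(free_at)


if __name__ == "__main__":
    total = sum(RUNTIMES)
    longest = max(RUNTIMES)
    print(f"longest / total = {longest / total:.2f}")
    for slots in (2, 4, 6):
        print(f"{slots} slots: {makespan(RUNTIMES, slots) / 3600:.1f} h")
```

For these 18 runtimes and 4 slots, the wall time equals the runtime of the longest single test: the other three slots finish the remaining tests before test 5 completes, which is the situation described above.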

warrickball · 2026-05-26T20:39:30Z

Here's a crude stab at my numbers. Apologies for times in H:M:S!

Latest BlueBEAR runtimes

Top 5:

star/c13_pocket                                 1:38:46
star/1M_pre_ms_to_wd                              52:52
star/hb_2M                                        38:48
star/ppisn                                        28:00
star/make_pre_ccsn_13bvn                          22:23

Full results:

star/c13_pocket                                 1:38:46
star/1M_pre_ms_to_wd                              52:52
star/hb_2M                                        38:48
star/ppisn                                        28:00
star/make_pre_ccsn_13bvn                          22:23
star/20M_z2m2_high_rotation                       22:23
star/make_co_wd                                   21:36
star/5M_cepheid_blue_loop                         12:16
star/zams_to_cc_80                                11:59
star/15M_dynamo                                   11:05
star/cburn_inward                                  9:25
star/pisn                                          9:05
star/make_o_ne_wd                                  8:52
star/custom_rates                                  8:23
star/1M_thermohaline                               8:18
star/1.4M_ms_op_mono                               5:45
star/wd_stable_h_burn                              5:16
star/radiative_levitation                          4:55
star/tzo                                           4:39
star/starspots                                     4:28
star/high_z                                        4:17
star/ns_c                                          4:13
star/conserve_angular_momentum                     4:08
star/1.5M_with_diffusion                           4:08
star/ns_he                                         4:05
star/wd_aic                                        4:02
star/16M_conv_premix                               3:56
star/wd_c_core_ignition                            3:47
star/simplex_solar_calibration                     3:46
star/hot_cool_wind                                 3:46
star/wd_nova_burst                                 3:41
star/relax_composition_j_entropy                   3:40
star/custom_colors                                 3:40
star/make_he_wd                                    3:31
star/7M_prems_to_AGB                               3:30
star/make_brown_dwarf                              3:15
star/ns_h                                          3:13
star/wd_he_shell_ignition                          3:10
star/gyre_in_mesa_rsg                              3:09
star/hse_riemann                                   3:04
star/accreted_material_j                           3:00
star/make_planets                                  2:57
star/1.3M_ms_high_Z                                2:48
star/high_rot_darkening                            2:44
star/magnetic_braking                              2:39
star/wd_cool_0.6M                                  2:38
star/16M_predictive_mix                            2:38
star/high_mass                                     2:37
star/split_burn_big_net                            2:36
star/rsp_Type_II_Cepheid                           2:35
star/check_redo                                    2:34
star/low_z                                         2:33
star/make_metals                                   2:28
star/R_CrB_star                                    2:22
star/carbon_kh                                     2:21
star/adjust_net                                    2:21
star/rsp_check_2nd_crossing                        2:14
star/conductive_flame                              2:11
star/gyre_in_mesa_spb                              2:08
star/gyre_in_mesa_envelope                         2:06
star/gyre_in_mesa_bcep                             2:06
star/gyre_in_mesa_ms                               2:04
star/12M_pre_ms_to_core_collapse                  19:49
star/ccsn_IIp                                     19:37
star/20M_pre_ms_to_core_collapse                  17:59
star/gyre_in_mesa_wd                               1:59
star/semiconvection                                1:56
star/rsp_Cepheid                                   1:56
star/conv_core_cpm                                 1:56
star/check_pulse_atm                               1:56
star/extended_convective_penetration               1:51
star/diffusion_smoothness                          1:45
star/rsp_Delta_Scuti                               1:42
star/irradiated_planet                             1:42
star/wd_diffusion                                  1:37
star/make_env                                      1:36
star/rsp_save_and_load_file                        1:33
star/rsp_RR_Lyrae                                  1:33
star/rsp_gyre                                      1:33
star/rsp_BLAP                                      1:30
star/rsp_BEP                                       1:30
star/make_sdb                                      1:15
star/twin_studies                                  1:13
star/timing                                        1:03
star/make_zams_low_mass                            1:01
star/wd_acc_small_dm                               1:00
star/other_physics_hooks                           0:57
star/make_zams_ultra_high_mass                     0:55
star/make_zams                                     0:52
star/T_tau_gradr                                   0:48

binary/wind_fed_bhhmxb                             7:16
binary/double_bh                                   5:12
binary/jdot_ml_check                               3:51
binary/star_plus_point_mass                        3:41
binary/evolve_both_stars                           3:41
binary/star_plus_point_mass_explicit_mdot          1:53
binary/jdot_ls_check                               1:50
binary/jdot_gr_check                               1:02

astero/fast_simplex                                5:09
astero/fast_newuoa                                 3:44
astero/example_astero                              3:14
astero/fast_scan_grid                              2:18
astero/fast_from_file                              2:18
astero/astero_gyre                                 2:02
astero/astero_adipls                               1:50
astero/surface_effects                             0:52
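
Because these times are in H:M:S (or M:S) while the Jenkins table is in seconds, a small converter makes the two comparable. This is a minimal sketch; the function name `to_seconds` is illustrative, and the sample rows are copied from the table above.

```python
def to_seconds(stamp):
    """Convert 'H:M:S' or 'M:S' to a number of seconds."""
    seconds = 0
    for field in stamp.split(":"):
        seconds = seconds * 60 + int(field)
    return seconds


# A few rows copied from the table above.
rows = [
    ("star/gyre_in_mesa_ms", "2:04"),
    ("star/12M_pre_ms_to_core_collapse", "19:49"),
    ("star/c13_pocket", "1:38:46"),
]

if __name__ == "__main__":
    # Sort on the numeric value rather than on the text, longest first.
    for name, stamp in sorted(rows, key=lambda r: to_seconds(r[1]), reverse=True):
        print(f"{name:40s} {to_seconds(stamp):6d}")
```

Sorting on the converted value places 19:49 ahead of 2:04, whereas comparing the strings as text does not.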

Debraheem · 2026-05-26T20:42:27Z

Thanks @warrickball! Is this timing from a normal or full test? The timing might be substantially different between the two.

warrickball · 2026-05-26T20:45:07Z

Full test.

dylex · 2026-05-26T22:05:26Z

To be clear, the jenkins numbers are 4 cores each, 4 tests in parallel, 16 cores total. I'll try with 8 cores each instead.

wmwolf · 2026-05-27T03:08:05Z

As for mesa_test, yes there is an important change. Update to the latest one (now 1.2.0) so that the URL it tries to submit to is correct. The old one was hard-coded to Heroku rather than using testhub.mesastar.org. The actual usage has not changed, though, so just updating it should suffice.

dylex · 2026-05-27T13:19:01Z

I assume the mesa_test version is coming from this line? https://github.com/MESAHub/mesa/blob/main/jenkins/Dockerfile#L25
I can try changing that to 1.2.0.

I did a run with 6 parallel tests, 8 cores per test, and test 5 took 10 hours, so just about half. At that point this test becomes the bottleneck, as other tests had finished an hour earlier, so maybe 4x8 is a good compromise.

Also, the tests are setting NPROCS and OMP_NUM_THREADS, but it looks like NPROCS is only used for the build, and the tests are only using openmp -- is that correct? If so I'll change it to set NPROCS correctly for the build, and only OMP_NUM_THREADS for the tests.

@wmwolf

Debraheem · 2026-05-27T13:58:02Z

> and test 5 took 10 hours

That sounds about right. I've confirmed that test 5 is c13_pocket. There might be a way to make this test less expensive, but that can be done outside of this PR.

> Also, the tests are setting NPROCS and OMP_NUM_THREADS, but it looks like NPROCS is only used for the build, and the tests are only using openmp -- is that correct? If so I'll change it to set NPROCS correctly for the build, and only OMP_NUM_THREADS for the tests.

This is correct.
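
The split confirmed here can be sketched as two separate environments: `NPROCS` for the build and `OMP_NUM_THREADS` for each test. This is a hedged Python illustration rather than the shell code in the Jenkinsfile; the helper names `env_for_build` and `env_for_test` and the core counts are assumptions made for the example.

```python
import os


def env_for_build(total_cores, base=None):
    """Environment for the build step: NPROCS sets the build parallelism."""
    env = dict(os.environ if base is None else base)
    env["NPROCS"] = str(total_cores)
    return env


def env_for_test(cores_per_test, base=None):
    """Environment for one test: only the OpenMP thread count is set."""
    env = dict(os.environ if base is None else base)
    env.pop("NPROCS", None)  # not read by the tests, per the discussion above
    env["OMP_NUM_THREADS"] = str(cores_per_test)
    return env


if __name__ == "__main__":
    # Assumed layout: 16 cores for the build, then 8 cores per test.
    print("build NPROCS =", env_for_build(16)["NPROCS"])
    print("test OMP_NUM_THREADS =", env_for_test(8)["OMP_NUM_THREADS"])
```

Either dictionary could then be passed as the `env` argument of `subprocess.run` when launching the build or a `mesa_test test N` command.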

dylex · 2026-05-27T18:19:59Z

@wmwolf mesa_test 1.2.0 says it's working now, but is producing a traceback:

Successfully submitted instance of 12M_pre_ms_to_core_collapse for commit 143706cf122255c021600b0f5b495ea3e40b9f35.
Changed MESA_DIR back to /home/jenkins/agent/workspace/CCA_mesa_jenkins.
/usr/lib/ruby/2.7.0/net/http.rb:960:in `initialize': execution expired (Net::OpenTimeout)
	from /usr/lib/ruby/2.7.0/net/http.rb:960:in `open'
	from /usr/lib/ruby/2.7.0/net/http.rb:960:in `block in connect'
	from /usr/lib/ruby/2.7.0/timeout.rb:105:in `timeout'
	from /usr/lib/ruby/2.7.0/net/http.rb:958:in `connect'
	from /usr/lib/ruby/2.7.0/net/http.rb:943:in `do_start'
	from /usr/lib/ruby/2.7.0/net/http.rb:932:in `start'
	from /usr/lib/ruby/2.7.0/net/http.rb:1483:in `request'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:457:in `submit_logs'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:503:in `submit_test_log'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:441:in `submit_instance'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/bin/mesa_test:74:in `block in test'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:862:in `with_mesa_dir'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/bin/mesa_test:60:in `test'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor/command.rb:28:in `run'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor/invocation.rb:127:in `invoke_command'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor.rb:538:in `dispatch'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor/base.rb:584:in `start'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/bin/mesa_test:470:in `<top (required)>'
	from /usr/local/bin/mesa_test:23:in `load'
	from /usr/local/bin/mesa_test:23:in `<main>'
Email, password, and computer name accepted

https://jenkins-new.flatironinstitute.org/job/CCA/job/mesa/job/jenkins/16/stages/?start-byte=0&selected-node=40#log-40-29

From my perspective, this is ready to merge.

dylex · 2026-05-28T16:02:36Z

Actually it turns out that traceback is causing all tests to fail.

Regardless, since everything is already failing on the old jenkins system due to #977, and things are generally working on the new one, I'm going to go ahead and remove the old project and let you sort out the mesa_test issues and merging this in.

wmwolf · 2026-05-28T16:42:31Z

@dylex : those tests did get submitted, interestingly. See here. I think we're just getting a timeout where the testhub isn't responding with a 200 status fast enough. I'll look into it more carefully today, though.

evbauer · 2026-05-28T17:40:39Z

One broad comment is that the intention in the past has been to periodically re-order the tests so that the most expensive ones appear first in do1_test_source. So you should be able to be agnostic about which test numbers require the most cores, other than to perhaps dedicate more resources to the first 5-10 tests. I used to run 16 cores for the first 10 tests and 8 cores for everything else. We can separately try to reorder the tests so that the most expensive ones do indeed appear within the first 5 tests or so (see #1021).

@warrickball's timing seems to be generally consistent with my expectations, and suggests that we should be moving the core collapse related models up near the front of do1_test_source. I suppose c13_pocket should be the first test unless we have plans to get it to run significantly faster.
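
The position-based allocation described above can be written as a one-line rule. This is a minimal sketch, assuming tests are numbered from 1 in `do1_test_source` order; the function name and its defaults are illustrative and simply mirror the 16/8 split mentioned above.

```python
def cores_for_test(test_number, heavy_count=10, heavy_cores=16, light_cores=8):
    """Cores to give a test based only on its position in the test list."""
    if test_number < 1:
        raise ValueError("test numbers start at 1")
    return heavy_cores if test_number <= heavy_count else light_cores


if __name__ == "__main__":
    for n in (1, 10, 11, 92):
        print(n, cores_for_test(n))
```

A rule like this only works as intended if the expensive tests really are at the front of the list, which is why the reordering and the core allocation go together.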

Debraheem · 2026-05-29T14:49:39Z

Hello @dylex, thank you for helping us get jenkins running tests again! I agree this is almost ready to merge, but I have a couple of questions first.

Why is the continuous integration for jenkins still failing for this commit?

If we change the ordering so that the longest tests come first in the test_suite, as per Evan's comments and "update testing order for timing" (#1021), will Jenkins adjust the cores automatically? Can we (the devs) do this, or will we need to contact you to adjust it?
If we add tests to or remove tests from the test_suite, or need to update the mesasdk, will we also need to contact you? What would need to be changed?
Is there any documentation you can leave us with on how to manage the above items or other issues you foresee, in case anything else changes and we can't reach you?
How long will Jenkins remain online? Is there a timeline? I know it's slowly being decommissioned.

dylex · 2026-05-29T17:16:26Z

> Why is the continuous integration for jenkins still failing for this commit?

It looks like the mesa_test 1.2.0 version is producing a traceback and failing (see above).

> If we change the ordering so that the longest tests come first in the test_suite, as per Evan's comments and "update testing order for timing" (#1021), will Jenkins adjust the cores automatically? Can we (the devs) do this, or will we need to contact you to adjust it?

I currently have it just using 8 cores for each test. It's basically just a shell script so you can make it use the cores however you want: https://github.com/MESAHub/mesa/blob/jenkins/jenkins/Jenkinsfile#L47

> If we add tests to or remove tests from the test_suite, or need to update the mesasdk, will we also need to contact you? What would need to be changed?

As I understand it the count_tests part discovers all the tests. I didn't actually change any of that part, just how the parallelization was working.

> Is there any documentation you can leave us with on how to manage the above items or other issues you foresee, in case anything else changes and we can't reach you?

The only thing you should need us for is if you need to change which builds are run (which repository, branches, permissions). You can adjust the build and environment otherwise. This is a standard Jenkins scripted pipeline -- the only thing special is the buildPod call, which we'll document on the wiki once we finish migrating other projects and things stabilize. You can always contact scicomp@flatironinstitute.org if something comes up.

> How long will Jenkins remain online? Is there a timeline? I know it's slowly being decommissioned.

The old system is being decommissioned, but you're no longer using it. There are no plans to remove the new one, and it's structured more resiliently than the old one, so it should survive better if we have hardware failures.

This of course presumes that someone involved with this project is still at Flatiron.

Debraheem · 2026-05-29T17:21:31Z

Awesome, thank you for the detailed response! Once we get the mesa_test 1.2.0 situation sorted, I think we can merge this PR.

wmwolf · 2026-05-29T21:23:43Z

I looked at the error message more carefully. The issue happened when trying to submit the logs to the logs server hosted at Flatiron. As we can see, the individual tests did get submitted to the testhub without issue. I've heard from @evbauer that he's having problems submitting logs to the log server as well. This is weird, as that process hasn't really changed recently. @dylex : would you be able to look into the server logs of the logs server (a fun tongue twister) to see if you can see requests that are denied?

In the meantime, I'm going to push through v1.2.1 of mesa_test that will have better error handling and reporting for these situations so it can fail more gracefully.
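
One way such graceful failure could look is sketched below. It is written in Python rather than mesa_test's Ruby and is not based on mesa_test's actual code: the log upload is wrapped so that a timeout is reported but does not turn an already-submitted test result into a failure. `submit_logs_safely` and the `upload` callable are hypothetical names.

```python
def submit_logs_safely(upload, retries=2):
    """Call upload(); return True on success, False if every attempt times out.

    `upload` is any zero-argument callable that raises TimeoutError
    when the log server does not answer in time.
    """
    for attempt in range(1, retries + 1):
        try:
            upload()
            return True
        except TimeoutError as err:
            # Report the problem and retry instead of aborting the whole run.
            print(f"log upload attempt {attempt}/{retries} timed out: {err}")
    return False


if __name__ == "__main__":
    def always_times_out():
        # Stand-in for an upload that never gets a response.
        raise TimeoutError("execution expired")

    ok = submit_logs_safely(always_times_out)
    print("logs uploaded" if ok else "logs not uploaded; test result kept")
```

The caller decides what a `False` return means; here it is treated as a warning, so a slow or unreachable log server does not mark the test itself as failed.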

dylex · 2026-05-29T21:43:21Z

I don't see any denied requests, just successful stores:

[2026-05-29 19:39:17,674] INFO in uploads: Receiving data from rusty for commit 8f7291920adcb57789009a65a01f457002a29de8
[2026-05-29 19:39:17,678] INFO in uploads: Saving file /var/www/mesa-logs/uploads/8f7291920adcb57789009a65a01f457002a29de8/rusty/make_zams_low_mass/mk.txt
[2026-05-29 19:39:17,679] INFO in uploads: Saving file /var/www/mesa-logs/uploads/8f7291920adcb57789009a65a01f457002a29de8/rusty/make_zams_low_mass/out.txt

But I do see a few errors similar to this:

[2026-05-24 08:16:10 +0000] [13] [ERROR] Error handling request /
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 177, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1536, in __call__
    return self.wsgi_app(environ, start_response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/www/mesa-logs/uploads.py", line 43, in dir_listing
    files = os.listdir(abs_path)
            ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
    sys.exit(1)
SystemExit: 1

(All somewhere in filesystem access, but some in different calls.) However, they all seem to be before these errors.

However, if this is trying to connect to a different host at FI, this may be being blocked at the firewall. Let me see if I can add an exception.

jenkins: migration to new k8s-based jenkins system

5086d5f

This is a significant structural change to run all the tests in a single container in parallel.

Debraheem requested review from matteocantiello and wmwolf May 26, 2026 16:48

Debraheem requested a review from warrickball May 26, 2026 17:36

dylex added 2 commits May 26, 2026 18:06

jenkins: use 8 cores per test

a580795

jenkins: set OMP_NUM_THREADS=cores_per_test

60602ab

seems like tests don't use NPROCS? Also increase to 6 tests parallel.

jenkins: update mesa_test to 1.2.0

be4bbee

As per @wmwolf

jenkins: don't set NPROCS, try 5 parallel

143706c

evbauer mentioned this pull request May 28, 2026

update testing order for timing #1021

Open
