
jenkins: migration to new k8s-based jenkins system #1013

Open
dylex wants to merge 5 commits into main from jenkins


@dylex
Collaborator

dylex commented May 26, 2026

We have a new jenkins system with more flexible resources (but fewer overall runners). Merging this change will switch to the new system. (If someone could give me admin on this repo, I can set up the new webhook as well.)

This is a significant structural change to run all the tests in a single container in parallel. It's currently taking a day on 16 cores (about 380 core hours). We could tweak the resources and parallelism if needed.
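
The core-hour figure above is simply wall time multiplied by the number of cores held. A minimal sketch of that arithmetic, where `core_hours` is a hypothetical helper name and "a day" is assumed to mean 24 hours of wall time:

```python
def core_hours(wall_hours, cores):
    """Core hours consumed by a job that holds `cores` cores for `wall_hours` hours."""
    return wall_hours * cores


# One day of wall time on the 16-core container described above.
print(core_hours(24, 16))  # 384, consistent with "about 380 core hours"
```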

I also fixed a problem in the Dockerfile introduced by PR #977 7e71e41 that broke the old jenkins builds as well.

@Debraheem
Member

Debraheem commented May 26, 2026

Thank you Dylan! Parallelizing the tests is much appreciated; Jenkins was very slow to get through the first few tests in the past.

We could tweak the resources and parallelism if needed.

Yes, a few of the tests (the first 5-6 in the testing order) might benefit from 12-16 cores, but most could be run on 4 or fewer.

I believe @matteocantiello can give you the access you need. There might have been some recent changes to mesa_test, so @wmwolf, maybe you can confirm that the GitHub runner is running the recent mesa_test version correctly.

@dylex
Collaborator Author

dylex commented May 26, 2026

You can see the current test runtimes here. These are all over an hour, and test 5 in particular is 20 hours:

| Test | Runtime (s) | Command |
| --- | --- | --- |
| 5 | 72439.477 | mesa_test test 5 --force-logs |
| 2 | 34449.641 | mesa_test test 2 --force-logs |
| 40 | 26983.783 | mesa_test test 40 --force-logs |
| 1 | 15398.585 | mesa_test test 1 --force-logs |
| 52 | 15389.556 | mesa_test test 52 --force-logs |
| 20 | 14725.373 | mesa_test test 20 --force-logs |
| 7 | 13393.096 | mesa_test test 7 --force-logs |
| 9 | 11360.891 | mesa_test test 9 --force-logs |
| 3 | 10900.387 | mesa_test test 3 --force-logs |
| 10 | 10290.912 | mesa_test test 10 --force-logs |
| 8 | 9967.901 | mesa_test test 8 --force-logs |
| 24 | 7253.871 | mesa_test test 24 --force-logs |
| 53 | 6490.134 | mesa_test test 53 --force-logs |
| 11 | 6173.845 | mesa_test test 11 --force-logs |
| 12 | 5276.727 | mesa_test test 12 --force-logs |
| 16 | 4452.366 | mesa_test test 16 --force-logs |
| 92 | 4349.950 | mesa_test test 92 --force-logs |
| 31 | 4328.200 | mesa_test test 31 --force-logs |

Is the scaling with more cores linear? What's your goal for how long this should take?

I also notice it's getting errors like:

Failed to submit astero_gyre for commit a062e9c47337cd99a6cf04a6174a0e6bfffb72ce

But it seems to log in successfully (it should be using the same credentials as the old system).

@Debraheem
Member

I'm not exactly sure about the astero errors. As for timing, MESA should scale linearly with the number of cores up to about 12 cores, though this linear scaling is not fully realized in all tests. In particular, the longer tests (those taking more than 1-2 hours) should probably be run on 12-16 cores and ideally finish in under 2-3 hours each. I think c13_pocket and ppisn are the worst offenders, taking up to 5 hours on some computers. Here is some example timing from main, although I can't confirm whether these tests were run with the [ci optional] flag, so they might not be running the full test_suite.

Perhaps @warrickball or someone who's tested on their cluster recently can provide more accurate timing or expectations.

Debraheem requested a review from warrickball May 26, 2026 17:36
@dylex
Collaborator Author

dylex commented May 26, 2026

Do you want to hard-code certain test numbers to run on 16 cores, or just run all of them that way?

@Debraheem
Member

I think hardcoding is best, as it might waste a lot of resources to make the other tests run on more than 4 cores, but someone other than me might provide better direction.

@warrickball
Contributor

Hardcoding might be best, just to get the big tests done in reasonable time, but I think running everything on 8 cores each will be about as good. At the start, the 5 slow tests will make their way to completion, after which the smaller ones will start filling in on whichever subset of 8 cores becomes available, so nothing would be left idle.

4 cores would just about work too because from your table above, it looks like the slowest test is about 1/4 of the total runtime, so it'd finish just as everything else does, having used the other 12 cores. Fewer cores would leave resources idle, waiting for the slowest test c13_pocket to finish.

Running everything on all 16 cores would also clearly keep the system perfectly busy but at the cost of the smaller tests not scaling particularly well.
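
As a rough check of the scheduling argument above, the sketch below replays the runtimes from the Jenkins table (4 cores per test) through a simple greedy scheduler with 4 slots, i.e. 16 cores in total. The `makespan` helper and the rule that each test starts on whichever slot frees up first are illustrative assumptions, not a description of how the Jenkinsfile dispatches tests. Tests under an hour are not in the table and are ignored here, and the sketch does not model how runtimes change with core count.

```python
import heapq

# Runtimes in seconds at 4 cores per test, copied from the Jenkins table above.
RUNTIMES = [
    72439.477, 34449.641, 26983.783, 15398.585, 15389.556, 14725.373,
    13393.096, 11360.891, 10900.387, 10290.912, 9967.901, 7253.871,
    6490.134, 6173.845, 5276.727, 4452.366, 4349.950, 4328.200,
]


def makespan(runtimes, slots):
    """Wall time when each test starts on whichever slot frees up first."""
    finish_times = [0.0] * slots  # min-heap of per-slot finish times
    for runtime in runtimes:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + runtime)
    return max(finish_times)


total = sum(RUNTIMES)
slowest = max(RUNTIMES)
print(f"slowest test share of listed runtime: {slowest / total:.0%}")
print(f"4 slots: {makespan(RUNTIMES, 4) / 3600:.1f} h")
print(f"slowest test alone: {slowest / 3600:.1f} h")
```

With these numbers the slowest test accounts for roughly a quarter of the listed runtime, so four slots finish at the same time as test 5 alone.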

I'll see if I can generate a full table of my most recent runtimes.

@warrickball
Contributor

Here's a crude stab at my numbers. Apologies for times in H:M:S!

Latest BlueBEAR runtimes

Top 5:

star/c13_pocket                                 1:38:46
star/1M_pre_ms_to_wd                              52:52
star/hb_2M                                        38:48
star/ppisn                                        28:00
star/make_pre_ccsn_13bvn                          22:23
Full results:
star/c13_pocket                                 1:38:46
star/1M_pre_ms_to_wd                              52:52
star/hb_2M                                        38:48
star/ppisn                                        28:00
star/make_pre_ccsn_13bvn                          22:23
star/20M_z2m2_high_rotation                       22:23
star/make_co_wd                                   21:36
star/5M_cepheid_blue_loop                         12:16
star/zams_to_cc_80                                11:59
star/15M_dynamo                                   11:05
star/cburn_inward                                  9:25
star/pisn                                          9:05
star/make_o_ne_wd                                  8:52
star/custom_rates                                  8:23
star/1M_thermohaline                               8:18
star/1.4M_ms_op_mono                               5:45
star/wd_stable_h_burn                              5:16
star/radiative_levitation                          4:55
star/tzo                                           4:39
star/starspots                                     4:28
star/high_z                                        4:17
star/ns_c                                          4:13
star/conserve_angular_momentum                     4:08
star/1.5M_with_diffusion                           4:08
star/ns_he                                         4:05
star/wd_aic                                        4:02
star/16M_conv_premix                               3:56
star/wd_c_core_ignition                            3:47
star/simplex_solar_calibration                     3:46
star/hot_cool_wind                                 3:46
star/wd_nova_burst                                 3:41
star/relax_composition_j_entropy                   3:40
star/custom_colors                                 3:40
star/make_he_wd                                    3:31
star/7M_prems_to_AGB                               3:30
star/make_brown_dwarf                              3:15
star/ns_h                                          3:13
star/wd_he_shell_ignition                          3:10
star/gyre_in_mesa_rsg                              3:09
star/hse_riemann                                   3:04
star/accreted_material_j                           3:00
star/make_planets                                  2:57
star/1.3M_ms_high_Z                                2:48
star/high_rot_darkening                            2:44
star/magnetic_braking                              2:39
star/wd_cool_0.6M                                  2:38
star/16M_predictive_mix                            2:38
star/high_mass                                     2:37
star/split_burn_big_net                            2:36
star/rsp_Type_II_Cepheid                           2:35
star/check_redo                                    2:34
star/low_z                                         2:33
star/make_metals                                   2:28
star/R_CrB_star                                    2:22
star/carbon_kh                                     2:21
star/adjust_net                                    2:21
star/rsp_check_2nd_crossing                        2:14
star/conductive_flame                              2:11
star/gyre_in_mesa_spb                              2:08
star/gyre_in_mesa_envelope                         2:06
star/gyre_in_mesa_bcep                             2:06
star/gyre_in_mesa_ms                               2:04
star/12M_pre_ms_to_core_collapse                  19:49
star/ccsn_IIp                                     19:37
star/20M_pre_ms_to_core_collapse                  17:59
star/gyre_in_mesa_wd                               1:59
star/semiconvection                                1:56
star/rsp_Cepheid                                   1:56
star/conv_core_cpm                                 1:56
star/check_pulse_atm                               1:56
star/extended_convective_penetration               1:51
star/diffusion_smoothness                          1:45
star/rsp_Delta_Scuti                               1:42
star/irradiated_planet                             1:42
star/wd_diffusion                                  1:37
star/make_env                                      1:36
star/rsp_save_and_load_file                        1:33
star/rsp_RR_Lyrae                                  1:33
star/rsp_gyre                                      1:33
star/rsp_BLAP                                      1:30
star/rsp_BEP                                       1:30
star/make_sdb                                      1:15
star/twin_studies                                  1:13
star/timing                                        1:03
star/make_zams_low_mass                            1:01
star/wd_acc_small_dm                               1:00
star/other_physics_hooks                           0:57
star/make_zams_ultra_high_mass                     0:55
star/make_zams                                     0:52
star/T_tau_gradr                                   0:48

binary/wind_fed_bhhmxb                             7:16
binary/double_bh                                   5:12
binary/jdot_ml_check                               3:51
binary/star_plus_point_mass                        3:41
binary/evolve_both_stars                           3:41
binary/star_plus_point_mass_explicit_mdot          1:53
binary/jdot_ls_check                               1:50
binary/jdot_gr_check                               1:02

astero/fast_simplex                                5:09
astero/fast_newuoa                                 3:44
astero/example_astero                              3:14
astero/fast_scan_grid                              2:18
astero/fast_from_file                              2:18
astero/astero_gyre                                 2:02
astero/astero_adipls                               1:50
astero/surface_effects                             0:52
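
To compare these H:M:S values with the Jenkins table, which is in seconds, the times can be converted. A minimal sketch, where `hms_to_seconds` is a hypothetical helper and entries are assumed to be either `H:MM:SS` or `M:SS` as in the listing above:

```python
def hms_to_seconds(text):
    """Convert 'H:MM:SS' or 'M:SS' into whole seconds."""
    seconds = 0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


print(hms_to_seconds("1:38:46"))  # star/c13_pocket -> 5926
print(hms_to_seconds("52:52"))    # star/1M_pre_ms_to_wd -> 3172
```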

@Debraheem
Member

Thanks @warrickball! Is this timing from a normal or full test? The timing might be substantially different between the two.

@warrickball
Contributor

Full test.

@dylex
Collaborator Author

dylex commented May 26, 2026

To be clear, the jenkins numbers are 4 cores each, 4 tests in parallel, 16 cores total. I'll try with 8 cores each instead.

dylex added 2 commits May 26, 2026 18:06
seems like tests don't use NPROCS?  Also increase to 6 tests parallel.
@wmwolf
Member

wmwolf commented May 27, 2026

As for mesa_test, yes there is an important change. Update to the latest one (now 1.2.0) so that the URL it tries to submit to is correct. The old one was hard-coded to Heroku rather than using testhub.mesastar.org. The actual usage has not changed, though, so just updating it should suffice.

@dylex
Collaborator Author

dylex commented May 27, 2026

I assume the mesa_test version is coming from this line? https://github.com/MESAHub/mesa/blob/main/jenkins/Dockerfile#L25
I can try changing that to 1.2.0.

I did a run with 6 parallel tests, 8 cores per test, and test 5 took 10 hours, so just about half. At that point this test becomes the bottleneck, as the other tests had finished an hour earlier, so maybe 4x8 is a good compromise.

Also, the tests are setting NPROCS and OMP_NUM_THREADS, but it looks like NPROCS is only used for the build, and the tests are only using openmp -- is that correct? If so I'll change it to set NPROCS correctly for the build, and only OMP_NUM_THREADS for the tests.

@Debraheem
Member

and test 5 took 10 hours

That sounds about right; I've confirmed that test 5 is c13_pocket. There might be a way to make this test less expensive, but that can be done outside of this PR.

Also, the tests are setting NPROCS and OMP_NUM_THREADS, but it looks like NPROCS is only used for the build, and the tests are only using openmp -- is that correct? If so I'll change it to set NPROCS correctly for the build, and only OMP_NUM_THREADS for the tests.

This is correct.
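
The split confirmed above (NPROCS for the build, OMP_NUM_THREADS for the tests) could be expressed roughly as follows. This is only a sketch in Python rather than the shell used in the Jenkinsfile; the helper names `env_for_build`, `env_for_test`, and `command_for_test` are hypothetical, and the command line is the one shown in the runtime table earlier in the thread.

```python
import os


def env_for_build(nprocs):
    """Environment for the compile step: NPROCS controls build parallelism."""
    return {**os.environ, "NPROCS": str(nprocs)}


def env_for_test(threads):
    """Environment for one test run: the tests only use OpenMP threads."""
    return {**os.environ, "OMP_NUM_THREADS": str(threads)}


def command_for_test(test_number):
    """Command line for a single test, as listed in the runtime table."""
    return ["mesa_test", "test", str(test_number), "--force-logs"]


print(command_for_test(5))
print(env_for_test(8)["OMP_NUM_THREADS"])
```

A caller could then pass both to `subprocess.run(command_for_test(5), env=env_for_test(8))`, so that each parallel test gets its own thread count without touching the build settings.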

@dylex
Collaborator Author

dylex commented May 27, 2026

@wmwolf mesa_test 1.2.0 says it's working now, but is producing a traceback:

Successfully submitted instance of 12M_pre_ms_to_core_collapse for commit 143706cf122255c021600b0f5b495ea3e40b9f35.
Changed MESA_DIR back to /home/jenkins/agent/workspace/CCA_mesa_jenkins.
/usr/lib/ruby/2.7.0/net/http.rb:960:in `initialize': execution expired (Net::OpenTimeout)
	from /usr/lib/ruby/2.7.0/net/http.rb:960:in `open'
	from /usr/lib/ruby/2.7.0/net/http.rb:960:in `block in connect'
	from /usr/lib/ruby/2.7.0/timeout.rb:105:in `timeout'
	from /usr/lib/ruby/2.7.0/net/http.rb:958:in `connect'
	from /usr/lib/ruby/2.7.0/net/http.rb:943:in `do_start'
	from /usr/lib/ruby/2.7.0/net/http.rb:932:in `start'
	from /usr/lib/ruby/2.7.0/net/http.rb:1483:in `request'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:457:in `submit_logs'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:503:in `submit_test_log'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:441:in `submit_instance'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/bin/mesa_test:74:in `block in test'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/lib/mesa_test.rb:862:in `with_mesa_dir'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/bin/mesa_test:60:in `test'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor/command.rb:28:in `run'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor/invocation.rb:127:in `invoke_command'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor.rb:538:in `dispatch'
	from /var/lib/gems/2.7.0/gems/thor-1.3.2/lib/thor/base.rb:584:in `start'
	from /var/lib/gems/2.7.0/gems/mesa_test-1.2.0/bin/mesa_test:470:in `<top (required)>'
	from /usr/local/bin/mesa_test:23:in `load'
	from /usr/local/bin/mesa_test:23:in `<main>'
Email, password, and computer name accepted

https://jenkins-new.flatironinstitute.org/job/CCA/job/mesa/job/jenkins/16/stages/?start-byte=0&selected-node=40#log-40-29

From my perspective, this is ready to merge.

@dylex
Collaborator Author

dylex commented May 28, 2026

Actually it turns out that traceback is causing all tests to fail.

Regardless, since everything is already failing on the old jenkins system due to #977, and things are generally working on the new one, I'm going to go ahead and remove the old project and let you sort out the mesa_test issues and merging this in.

@wmwolf
Member

wmwolf commented May 28, 2026

@dylex : those tests did get submitted, interestingly. See here. I think we're just getting a timeout where the testhub isn't responding with a 200 status fast enough. I'll look into it more carefully today, though.

@evbauer
Member

evbauer commented May 28, 2026

One broad comment is that the intention in the past has been to periodically re-order the tests so that the most expensive ones appear first in do1_test_source. So you should be able to be agnostic about which test numbers require the most cores, other than to perhaps dedicate more resources to the first 5-10 tests. I used to run 16 cores for the first 10 tests and 8 cores for everything else. We can separately try to reorder the tests so that the most expensive ones do indeed appear within the first 5 tests or so (see #1021).

@warrickball's timing seems to be generally consistent with my expectations, and suggests that we should be moving the core collapse related models up near the front of do1_test_source. I suppose c13_pocket should be the first test unless we have plans to get it to run significantly faster.
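
The position-based allocation described above (more cores for the first few tests, fewer for the rest) can be sketched as a small lookup. The function name `cores_for_test` is hypothetical, and the defaults of 16 cores for the first 10 tests and 8 cores otherwise are taken from the past setup mentioned in this comment, not from the current Jenkinsfile.

```python
def cores_for_test(test_number, heavy_tests=10, heavy_cores=16, default_cores=8):
    """Pick a core count from a test's position in do1_test_source.

    Assumes tests are numbered from 1 and ordered most expensive first.
    """
    if test_number < 1:
        raise ValueError("test numbers start at 1")
    return heavy_cores if test_number <= heavy_tests else default_cores


for number in (1, 10, 11, 92):
    print(number, cores_for_test(number))
```

This only works as intended if the ordering is kept up to date, which is the point of the reordering proposed in #1021.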

@Debraheem
Member

Hello @dylex, thank you for helping us get jenkins running tests again! I agree this is almost ready to merge, but I have a couple of questions first.

  1. Why is the continuous integration for jenkins still failing for this commit?
     [Screenshot 2026-05-29 at 10 43 23 AM]
  2. If we change the ordering so that the longest tests come first in the test_suite, as per Evan's comments and "update testing order for timing" #1021, will Jenkins adjust the cores automatically? Can we (the devs) do this, or will we need to contact you to adjust this?
  3. If we add tests to or remove tests from the test_suite, or need to update the mesasdk, will we also need to contact you? What would need to change?
  4. Is there any documentation you can leave us with on how to manage the above items or other issues you might see happening in the future, in case anything else changes and we can't reach you?
  5. How long will Jenkins remain online? Is there a timeline? I know it's slowly being decommissioned.

@dylex
Collaborator Author

dylex commented May 29, 2026

  1. Why is the continuous integration for jenkins still failing for this commit?

It looks like the mesa_test 1.2.0 version is producing a traceback and failing (see above).

  2. If we change the ordering so that the longest tests come first in the test_suite, as per Evan's comments and "update testing order for timing" #1021, will Jenkins adjust the cores automatically? Can we (the devs) do this, or will we need to contact you to adjust this?

I currently have it just using 8 cores for each test. It's basically just a shell script so you can make it use the cores however you want: https://github.com/MESAHub/mesa/blob/jenkins/jenkins/Jenkinsfile#L47

  3. If we add tests to or remove tests from the test_suite, or need to update the mesasdk, will we also need to contact you? What would need to change?

As I understand it the count_tests part discovers all the tests. I didn't actually change any of that part, just how the parallelization was working.

  4. Is there any documentation you can leave us with on how to manage the above items or other issues you might see happening in the future, in case anything else changes and we can't reach you?

The only thing you should need us for is if you need to change which builds are run (which repository, branches, permissions). You can adjust the build and environment otherwise. This is a standard Jenkins scripted pipeline -- the only thing special is the buildPod call, which we'll document on the wiki once we finish migrating other projects and things stabilize. You can always contact scicomp@flatironinstitute.org if something comes up.

  5. How long will Jenkins remain online? Is there a timeline? I know it's slowly being decommissioned.

The old system is being decommissioned, but you're no longer using it. There are no plans to remove the new one, and it's structured more resiliently than the old one, so it should survive better if we have hardware failures.

This of course presumes that someone involved with this project is still at Flatiron.

@Debraheem
Member

Awesome, thank you for the detailed response! Once we get the mesa_test 1.2.0 situation sorted, I think we can merge this PR.

@wmwolf
Member

wmwolf commented May 29, 2026

I looked at the error message more carefully. The issue happened when trying to submit the logs to the logs server hosted at Flatiron. As we can see, the individual tests did get submitted to the testhub without issue. I've heard from @evbauer that he's having problems submitting logs to the log server as well. This is weird, as that process hasn't really changed recently. @dylex: would you be able to look into the server logs of the logs server (a fun tongue twister) to see if there are requests that are denied?

In the meantime, I'm going to push through v1.2.1 of mesa_test that will have better error handling and reporting for these situations so it can fail more gracefully.
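
The kind of graceful failure described above can be illustrated with a short sketch. This is not the mesa_test implementation (mesa_test is a Ruby gem, and the v1.2.1 changes are not shown in this thread); `submit_logs_safely` and `always_times_out` are hypothetical names, and Python's built-in `TimeoutError` stands in for Ruby's `Net::OpenTimeout`.

```python
def submit_logs_safely(submit, label):
    """Call `submit()` and report a timeout instead of letting it propagate.

    Returns True if the upload succeeded and False if it timed out.
    """
    try:
        submit()
    except TimeoutError as err:
        print(f"Warning: log upload for {label} timed out: {err}")
        return False
    return True


def always_times_out():
    """Stand-in for an upload that cannot reach the logs server."""
    raise TimeoutError("execution expired")


print(submit_logs_safely(always_times_out, "12M_pre_ms_to_core_collapse"))
```

Handled this way, a failed log upload would be reported without marking a test that was already submitted to the testhub as failed.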

@dylex
Collaborator Author

dylex commented May 29, 2026

I don't see any denied requests, just successful stores:

[2026-05-29 19:39:17,674] INFO in uploads: Receiving data from rusty for commit 8f7291920adcb57789009a65a01f457002a29de8
[2026-05-29 19:39:17,678] INFO in uploads: Saving file /var/www/mesa-logs/uploads/8f7291920adcb57789009a65a01f457002a29de8/rusty/make_zams_low_mass/mk.txt
[2026-05-29 19:39:17,679] INFO in uploads: Saving file /var/www/mesa-logs/uploads/8f7291920adcb57789009a65a01f457002a29de8/rusty/make_zams_low_mass/out.txt

But I do see a few errors similar to this:

[2026-05-24 08:16:10 +0000] [13] [ERROR] Error handling request /
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 177, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1536, in __call__
    return self.wsgi_app(environ, start_response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/www/mesa-logs/uploads.py", line 43, in dir_listing
    files = os.listdir(abs_path)
            ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
    sys.exit(1)
SystemExit: 1

(All somewhere in filesystem access, but some in different calls.) However, they all seem to be from before these errors.

However, if this is trying to connect to a different host at FI, it may be blocked at the firewall. Let me see if I can add an exception.
