Compute worker - Improve status update and logs #2223

Open
Didayolo wants to merge 2 commits into develop from fix-worker-logs

Conversation

@Didayolo
Member

@Didayolo Didayolo commented Feb 28, 2026

Description

  1. Make _update_status robust to failures
  2. Log traceback instead of uninformative messages
  3. Set the status to Failed on any exception, so submissions do not get stuck in Running
  4. Add "best effort" logging with a push_logs method
  5. Remove the container on failure to avoid orphan containers
  6. Some minor fixes
  7. Improve show_progress clarity

Overall, this change should fix the "stuck in Running" bug and allow more logs to reach the platform's front-end.
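
To illustrate points 1 to 5, here is a minimal sketch of the intended failure handling. It is not the code from this PR: `SketchRun`, `api.patch_status`, `api.push_logs` and `container.remove` are hypothetical stand-ins, and only the names `_update_status` and `push_logs` come from the description above. The idea is that reporting helpers never raise, so any exception in the job still ends in a Failed status, a pushed traceback and a removed container.

```python
import logging
import traceback

logger = logging.getLogger("compute_worker_sketch")


class SketchRun:
    """Hypothetical stand-in for the worker's run object (not the real class)."""

    def __init__(self, api, container):
        self.api = api              # assumed client with patch_status() and push_logs()
        self.container = container  # assumed handle with remove()
        self.status = None

    def _update_status(self, status, details=None):
        """Report a status; a reporting failure is logged, never re-raised."""
        self.status = status
        try:
            self.api.patch_status(status, details)
            return True
        except Exception:
            logger.exception("Failed to update submission status to %s", status)
            return False

    def push_logs(self, text):
        """Best-effort log forwarding: record any failure and carry on."""
        try:
            self.api.push_logs(text)
        except Exception:
            logger.exception("Could not push logs to the platform")

    def _remove_container(self):
        """Remove the container so a failed run leaves no orphan behind."""
        try:
            self.container.remove()
        except Exception:
            logger.exception("Container removal failed")

    def execute(self, job):
        """Run job(); any exception leads to Failed instead of a stuck Running."""
        self._update_status("Running")
        try:
            job()
        except Exception:
            tb = traceback.format_exc()  # full traceback, not a generic message
            logger.error(tb)
            self.push_logs(tb)
            self._update_status("Failed", tb)
            self._remove_container()
            return False
        self._update_status("Finished")
        return True
```

With this shape, a 404 from the platform while patching the status is logged but does not hide the original error or skip the cleanup.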


Warning: it looks like submissions never pass through the Scoring status, so we do not get scoring logs. I need to check this.

Issues this PR resolves

A checklist for hand testing

  • Check successful and failing submissions
  • Check logs inside container and on the platform

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCI tests are passing
  • Ready to merge

@Didayolo Didayolo mentioned this pull request Feb 28, 2026
15 tasks
@Didayolo
Member Author

Didayolo commented Feb 28, 2026

CircleCI error:

E           AssertionError: Locator expected to be visible
E           Actual value: None
E           Error: element(s) not found 
E           Call log:
E             - Expect "to_be_visible" with timeout 2000ms
E             - waiting for get_by_role("cell", name="Finished")

test_submission.py:46: AssertionError
=========================== short test summary info ============================
FAILED test_submission.py::test_basic[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_v15[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_irisV15_code[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_irisV15_result[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_v18[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name="Finished")
============== 5 failed, 6 passed, 2 skipped in 220.01s (0:03:40) ==============

Exited with code exit status 1

When I manually run the E2E test competitions and submissions, I do get the Finished state, though.

Some logs in the artefacts:

django-1          | 2026-02-28 05:06:29.206 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/163/
django-1          | 2026-02-28 05:06:29.207 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/163/
compute_worker-1  | 2026-02-28 05:06:29.208 | ERROR    | compute_worker:_update_submission:545 - Submission patch failed with status = 404, and response = 
compute_worker-1  | b'{"detail":"No Submission matches the given query."}'
compute_worker-1  | 2026-02-28 05:06:29.208 | ERROR    | compute_worker:_update_status:561 - Failed to update submission status to Failed: Failure updating submission data.
compute_worker-1  | Traceback (most recent call last):
compute_worker-1  | 
compute_worker-1  |   File "/app/compute_worker.py", line 683, in _get_bundle
compute_worker-1  |     with ZipFile(bundle_file, "r") as z:
compute_worker-1  |          │       └ '/codabench/uPK-421_sID-163__pqs91qp9/bundles/tmpz9en08b8'
compute_worker-1  |          └ <class 'zipfile.ZipFile'>
compute_worker-1  | 
compute_worker-1  |   File "/root/.local/share/uv/python/cpython-3.9.20-linux-x86_64-gnu/lib/python3.9/zipfile.py", line 1268, in __init__
compute_worker-1  |     self._RealGetContents()
compute_worker-1  |     │    └ <function ZipFile._RealGetContents at 0x7e44bca6daf0>
compute_worker-1  |     └ <zipfile.ZipFile [closed]>
compute_worker-1  |   File "/root/.local/share/uv/python/cpython-3.9.20-linux-x86_64-gnu/lib/python3.9/zipfile.py", line 1335, in _RealGetContents
compute_worker-1  |     raise BadZipFile("File is not a zip file")
compute_worker-1  |           └ <class 'zipfile.BadZipFile'>
compute_worker-1  | 2026-02-28 05:06:29.220 | INFO     | compute_worker:_update_submission:539 - Updating submission @ http://django:8000/api/submissions/164/ with data = {'status': 'Running', 'status_details': 'ingestion_hostname-local_worker', 'secret': '43efdd09-8cae-4c48-a67b-1528667d0123'}
django-1          | 2026-02-28 05:06:29.237 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/164/
django-1          | 2026-02-28 05:06:29.237 | WARNING  | django.utils.log:log_response:246 - Not Found: /api/submissions/164/
compute_worker-1  | 2026-02-28 05:06:29.238 | ERROR    | compute_worker:_update_submission:545 - Submission patch failed with status = 404, and response = 
compute_worker-1  | b'{"detail":"No Submission matches the given query."}'
compute_worker-1  | 2026-02-28 05:06:29.238 | ERROR    | compute_worker:_update_status:561 - Failed to update submission status to Running: Failure updating submission data.
compute_worker-1  | Traceback (most recent call last):
compute_worker-1  | 
compute_worker-1  |   File "/.venv/bin/celery", line 10, in <module>
compute_worker-1  |     sys.exit(main())
compute_worker-1  |     │   │    └ <function main at 0x7e44bcf471f0>
compute_worker-1  |     │   └ <function Worker.__call__.<locals>.exit at 0x7e44bb2dda60>
compute_worker-1  |     └ <module 'sys' (built-in)>
compute_worker-1  |   File "/.venv/lib/python3.9/site-packages/celery/__main__.py", line 15, in main
compute_worker-1  |     sys.exit(_main())
compute_worker-1  |     │   │    └ <function main at 0x7e44bbac69d0>
compute_worker-1  |     │   └ <function Worker.__call__.<locals>.exit at 0x7e44bb2dda60>
compute_worker-1  |     └ <module 'sys' (built-in)>
compute_worker-1  |   File "/.venv/lib/python3.9/site-packages/celery/bin/celery.py", line 213, in main
compute_worker-1  |     return celery(auto_envvar_prefix="CELERY")
compute_worker-1  |            └ <DYMGroup celery>
