CodeProject.AI randomly enters "Lost Contact" state after several hours (Ubuntu Server + GTX 750 Ti) #360
Unanswered
guilloteGNU asked this question in Q&A
Replies: 0 comments
Good morning.
I have a very particular issue with CodeProject.AI and I haven't been able to find any explanation for it.
I run a homemade server that works as a media server, NAS, and security server.
Hardware
I am using a bare-metal installation of CodeProject.AI because I was never able to get it working correctly inside Docker. Apparently there are some communication issues between Python, Docker, and the GPU (this is a very old GPU).
I use MotionEye for camera surveillance. Between 12 AM and 6 AM, whenever MotionEye detects movement it sends 5 images separated by 500 ms to CodeProject.AI through a webhook service.
Each image is analyzed in approximately 115 ms using YOLOv8 medium with half precision always disabled. If a person is detected, I receive a Telegram notification.
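For anyone trying to reproduce the load pattern without MotionEye, the burst described above can be approximated with a small client. This is only a sketch under assumptions: it assumes the server listens on the default address `http://localhost:32168` and that the standard object-detection route `/v1/vision/detection` is used, with a response containing a `predictions` list of `label`/`confidence` entries. The helper names (`build_multipart`, `detect`, `person_detected`, `analyze_burst`) and the 0.4 confidence threshold are made up for illustration and are not part of my actual setup.

```python
import json
import time
import urllib.request
import uuid

# Assumption: default CodeProject.AI Server address and detection route.
CPAI_URL = "http://localhost:32168/v1/vision/detection"


def build_multipart(image_bytes, min_confidence=0.4):
    """Build a multipart/form-data body carrying one JPEG in the 'image' field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="min_confidence"\r\n\r\n'
        f"{min_confidence}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="image"; filename="frame.jpg"\r\n'
        "Content-Type: image/jpeg\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + image_bytes + tail, f"multipart/form-data; boundary={boundary}"


def detect(image_bytes, url=CPAI_URL, timeout=10):
    """POST one image to the server and return the decoded JSON response."""
    body, content_type = build_multipart(image_bytes)
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def person_detected(result, min_confidence=0.4):
    """True if the response reports success and contains a confident 'person'."""
    if not result.get("success"):
        return False
    return any(
        p.get("label") == "person" and p.get("confidence", 0) >= min_confidence
        for p in result.get("predictions", [])
    )


def analyze_burst(frames, detect_fn=detect, delay=0.5):
    """Send frames one by one, `delay` seconds apart; stop at the first person."""
    for i, frame in enumerate(frames):
        if person_detected(detect_fn(frame)):
            return True
        if i < len(frames) - 1:
            time.sleep(delay)
    return False
```

Calling `analyze_burst` with five JPEG byte strings mimics one MotionEye event. A timeout or connection error raised by `detect` during such a burst would be a useful data point for when exactly the server stops answering.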
Originally I was using Xubuntu Minimal 24 without issues (using YOLOv5 if I remember correctly). Later I migrated to Ubuntu Server because I wanted something more "server-like". That is when the problems started.
Suddenly CPAI stopped working and the web interface showed the "Lost Contact" message.
At first I thought maybe the NVIDIA drivers had been updated to an incompatible version, so I downgraded everything to Ubuntu Server 22.
The behavior has been extremely inconsistent.
The failure pattern is roughly as follows:
Usually around 5 AM, when public transportation starts operating, bus headlights trigger MotionEye detection. CPAI detects a person and sends me Telegram notifications normally.
This can work correctly for several consecutive days, or sometimes less than a single day.
Then, around 7:30 AM, if I check the CPAI web interface, I see the "Lost Contact" message.
Changing the model size does not help. Switching back to YOLOv5 does not help either. Running inference on CPU instead of GPU also does not solve it (although I never tested CPU-only operation for long periods).
Once CPAI enters the "Lost Contact" state, it never recovers.
The only way I have found to bring it back online is:
I also disabled Ubuntu cron jobs and unattended upgrades that normally run around 6 AM.
I also tried the YOLOv8 small and tiny models, with exactly the same result.
Recently I decided to create a cron job that starts the CPAI service at 11:50 PM and stops it at 6:10 AM. Surprisingly, this seems to work correctly.
However, if I manually start the service again at 7:30 AM for testing, it immediately starts already in "Lost Contact" state.
Even more strangely, during the next night it may work perfectly again and continue sending Telegram notifications during the early morning.
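To make the schedule explicit, the window enforced by the cron job (start at 11:50 PM, stop at 6:10 AM) crosses midnight, so a plain "start <= now < stop" comparison does not describe it. A minimal sketch of the check follows; the function name `in_service_window` is hypothetical and only illustrates the logic. It is not what cron itself runs.

```python
from datetime import time

# Window taken from the cron schedule described above.
START = time(23, 50)  # service started at 11:50 PM
STOP = time(6, 10)    # service stopped at 6:10 AM


def in_service_window(now, start=START, stop=STOP):
    """Return True if `now` (a datetime.time) falls inside the run window.

    Handles windows that wrap past midnight, where start > stop.
    """
    if start <= stop:
        return start <= now < stop
    return now >= start or now < stop
```

By this definition the manual test at 7:30 AM falls outside the window in which the service has been observed to work, while the 5 AM detections fall inside it.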
My current theory is that something remains resident in memory after the failure, such as a locked temporary file, a stale CUDA context, or some GPU-related process, but nothing unusual appears in the logs or in nvidia-smi.
The logs show no actual errors.
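One way to look for leftover GPU processes more systematically is to capture the compute-process list from `nvidia-smi` right after stopping the service and compare it with a known-good state. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-compute-apps` CSV output. The function names are illustrative, and on older cards the memory column may be reported as `[N/A]`, so it is kept as a string.

```python
import subprocess

# nvidia-smi query for processes currently holding a compute (CUDA) context.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' lines into a list of dicts."""
    apps = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue  # blank line or unexpected output
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        if not pid.strip().isdigit():
            continue
        apps.append({
            "pid": int(pid),
            "name": name.strip(),
            "used_memory_mib": mem.strip(),  # may be "[N/A]" on old GPUs
        })
    return apps


def list_compute_apps():
    """Run nvidia-smi and return the processes that hold a compute context."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

If `list_compute_apps()` still returns a Python process after the CPAI service has been stopped, that would support the stale-context theory; an empty list would point elsewhere.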
I am also attaching screenshots showing:
Has anyone experienced something similar with older NVIDIA GPUs or Ubuntu Server installations?