CumulusMX Hangs, Slowdowns, and Failure Modes

radilly
Posts: 123
Joined: Fri 17 Jul 2015 11:01 am
Weather Station: Ambient WS-2080
Operating System: Raspberry Pi 3, OS Buster Lite
Location: McMurray, PA, US

Post by radilly »

I'm working on a sort of CumulusMX "watchdog" script to run alongside Cumulus on a Raspberry Pi 3. Basically, I'm having to restart CumulusMX periodically, and I'd like to automate this as much as possible. I'm using systemd to manage CumulusMX (e.g. https://cumulus.hosiene.co.uk/viewtopic.php?f=27&t=14753). I've been trying to observe and note various failure modes, although often the status of interest isn't quite a failure per se. I've found that systemd will restart CumulusMX quickly if the process (mono, actually) is killed. I plan to share what I learn in case others are interested.

Here are some things I am looking at:
* The <#DataStopped> tag is in a standalone file, which will return either 1 or 0
* The realtime.txt file on my (hosted) web server should update every 30 seconds in my configuration
* A bunch of html and json files should update every 5 minutes in my configuration
* The Raspbian O/S should not get bogged down
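
A minimal Python sketch of how the first three checks might be coded. It assumes the <#DataStopped> file holds just "1" or "0", and that a file's age on the web server is taken from an HTTP Last-Modified header. The function names and the margin of 3 missed updates are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

REALTIME_INTERVAL = 30    # seconds between realtime.txt updates (from the list above)
PAGE_INTERVAL = 5 * 60    # seconds between html/json updates (from the list above)
MARGIN = 3                # assumption: tolerate up to 3 missed updates


def data_stopped(file_text):
    """Interpret the processed <#DataStopped> file: '1' means data has stopped."""
    return file_text.strip() == "1"


def age_seconds(last_modified, now=None):
    """Seconds elapsed since an HTTP Last-Modified header value."""
    now = now or datetime.now(timezone.utc)
    return (now - parsedate_to_datetime(last_modified)).total_seconds()


def is_stale(last_modified, interval, now=None, margin=MARGIN):
    """True when a file is older than `margin` expected update intervals."""
    return age_seconds(last_modified, now) > margin * interval
```

With these defaults, realtime.txt would be flagged once it is more than 90 seconds old, and the html and json files once they are more than 15 minutes old.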

In general, a pretty reliable indicator of an issue (which I've not yet automated) seems to be the Dashboard web page, i.e. the default page at http://*:8998/. The local time widget (sample below) normally refreshes every couple of seconds, but at times I note that it hangs for longer than that. It looks like the realtime.txt file and the html and json files may not be updating on the server, but I'm still investigating.

A note on bogged down: One time when I checked on CMX I had a tough time doing much in my usual ssh shell (I pretty much run the Pi headless, without the GUI, and do my admin from an ssh shell). Top looked very unusual...

Code: Select all

top - 08:30:43 up 2 days, 12:33,  3 users,  load average: 15.56, 13.18, 12.78
Tasks: 142 total,   1 running, 141 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.6 us, 19.1 sy,  0.0 ni, 69.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    945512 total,   766604 used,   178908 free,   146700 buffers
KiB Swap:   102396 total,        0 used,   102396 free.   352756 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  529 root      20   0  628000 192208  17652 S  80.1 20.3   1807:22 mono

Typically, I see the CPU usage for mono below 10% and the load average pretty close to 0. Mem usage typically runs around 8%. I'm not monitoring that (yet), but it seemed like a one-off case. (I'm running Raspbian on a stock Pi 3 with few, if any, OS mods.) Oh, and I'm writing Python code because that's big in the embedded (and Pi) world ... and I'm trying to get more comfortable with it.

I'll share more as I figure it out...

Bob
CumulusMX_Local_Time_Widget.jpg
Other Notes

I noticed when I was running the GUI and CumulusMX (and little else) my Pi was using a little swap space which I wanted to avoid. I now boot without the GUI and I'm using only about 50% of memory and am not close to swapping!

Code: Select all

$ free
             total       used       free     shared    buffers     cached
Mem:        945512     413904     531608      42248     144092     165656
-/+ buffers/cache:     104156     841356
Swap:       102396          0     102396
I already have a watchdog running for my (aging) web camera. I added a relay module to the Pi to automate what I had previously done by hand - power-cycle the camera. On my (hosted) web server a cron job already tracks the time since the last image upload (nominally a still image should be uploaded every 5 minutes) and sends me an email if over a certain age. The watchdog pulls down a text file which has the number of seconds since the last image update, and if over a threshold, it power-cycles the web cam. I'm using a copy of that script to tinker with the CMX watchdog.
Cheers,
Bob
bab5871
Posts: 28
Joined: Mon 09 May 2016 3:42 pm
Weather Station: Davis Vantage Pro 2 Plus
Operating System: Raspbian RPi3
Location: Ballston Lake NY

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by bab5871 »

I've noticed that when mine goes down, it's actually the USB connection that drops. Even after restarting Cumulus at that point, it would be looking for ttyUSB0 while the station is now on ttyUSB1, so I have to reboot the Pi anyway to get it back on ttyUSB0. How would you account for that?
ConligWX
Posts: 1570
Joined: Mon 19 May 2014 10:45 pm
Weather Station: Davis vPro2+ w/DFARS + AirLink
Operating System: Ubuntu 22.04 LTS
Location: Bangor, NI

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by ConligWX »

I think your best option is to find out why your CumulusMX stops rather than have something "restart" it once it has stopped.

My CumulusMX runs 24/7 without any restart scripts to do this.

Usually lockups are - as said previously - USB issues. Zip up your MXDIAG folder and post it here. Perhaps Steve can take a look when he has time to diagnose anything.

USB issues can be fixed by installing a USB hub between the Raspberry Pi and the weather console. Ferrite cores on the ends of the USB cables help too.

Also please state what version of Mono you are running.
Regards Simon

https://www.conligwx.org - @conligwx
Davis Vantage Pro2 Plus with Daytime FARS • WeatherLink Live • Davis AirLink • PurpleAir •


Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by radilly »

I've cobbled together a couple of scripts to monitor various things as I try to sort out what might cause an issue. I agree with @Toxic17 that the desirable thing is to eliminate the issues as they arise if possible.

One thing I'm watching at the moment is memory usage. Cumulus has been up for about 4 days (under systemd, following a reboot) and I've watched the system memory usage (using the free command) increase from around 22% to 76%. For a few days, I've been watching top - sorted by VIRT. The mono process seems to have the largest footprint, but the top %MEM number has stayed below 10%. I've not yet seen where that memory is going!

I'll let it run to around 90%, then I plan to try a systemctl restart and check the memory usage. If it doesn't drop much, that would suggest Cumulus/mono isn't the culprit. I'll have to do some reading on top's %MEM versus what free is reporting. I don't see how top's per-process memory numbers account for 76% usage at the system level.

I'll post a little analysis / summary when I get that far.

My scripts are pretty "hacky" as I'm learning Python along the way (and resorting to bash at times). I'm also learning GitHub and git (though I've worked with numerous source control systems over the years). I see a number of GitHub references in the fora and I am willing to share some of my little hacks if they'd be of value to others... Any advice?

Cheers,
Bob

"top" snippet, sorted by %MEM. At about the 35th process, top reports 0.0%. Summing the non-zero values yields ~ 20% - nowhere near 76%. I have to do some reading, but if anyone has a thought...

Code: Select all

top - 09:38:27 up 4 days, 17:43,  2 users,  load average: 0.00, 0.01, 0.00
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.2 sy,  0.0 ni, 99.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    945512 total,   710288 used,   235224 free,   403976 buffers
KiB Swap:   102396 total,        0 used,   102396 free.   191320 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  542 root      20   0   91328  60540  17660 S   1.6  6.4  90:55.90 mono
13045 colord    20   0   47828  12448   8312 S   0.0  1.3   0:00.50 colord
  535 root      20   0   13424   9512   6044 S   0.0  1.0   0:10.85 webcamwatch.py
13510 pi        20   0   13152   9480   5952 S   0.0  1.0   5:44.88 python
  537 root      20   0   41252   6916   6024 S   0.0  0.7   0:00.14 ModemManager

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by ConligWX »

Linux uses RAM differently than Windows, for instance.

A nice explanation here: http://www.linuxatemyram.com/

If you could zip up the MXDiag folder and post the zip file, Steve and a few others could take a look to see if there is anything obvious in the MX files to help diagnose the issue.

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by radilly »

Toxic17 wrote:Linux uses RAM differently than Windows, for instance.

A nice explanation here: http://www.linuxatemyram.com/

If you could zip up the MXDiag folder and post the zip file, Steve and a few others could take a look to see if there is anything obvious in the MX files to help diagnose the issue.

That is a nice page. The statements at the bottom are particularly helpful. (http://www.linuxnix.com/find-ram-size-in-linuxunix/ is helpful also.) I've not yet gotten used to working with the relatively small amount of RAM on the Pi. (I've never seen a system where there is not at least a little swapping.) Swapping was what I originally focused on, but then I was tinkering to see if I could anticipate when that was about to be an issue. I definitely wasn't looking at it reasonably.

I've not spent much time in the MXDiag folder, but I'm not yet convinced there's a CMX-related issue. (I appreciate the offer and will keep it in mind.)

I did notice though, that there are batches of messages like

Code: Select all

... Sensor contact lost; ignoring outdoor data

That's something I do want to pay attention to. My previous WS had the option to hard-wire the sensors to the base (eliminating battery changes), but my current one is strictly RF. It goes weeks without losing the connection, then seems to go through periods where it keeps dropping. I can monitor the most recent log for that, but for my WS I'm not aware of a way to force an RF resync other than a button press. I guess I could hack the hardware...
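
Monitoring the most recent log for that message could be sketched in Python as below. This is a sketch only: the function names are hypothetical, and it assumes the newest file in the diagnostics folder is the log of interest and that the message text matches the line quoted above.

```python
from pathlib import Path

MESSAGE = "Sensor contact lost"   # text of the log message quoted above


def newest_log(diag_dir):
    """Return the most recently modified file in diag_dir, or None if empty."""
    files = [p for p in Path(diag_dir).iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def count_contact_lost(log_path):
    """Count the lines in one log file that report lost sensor contact."""
    with open(log_path, errors="replace") as log:
        return sum(1 for line in log if MESSAGE in line)
```

A watchdog could call count_contact_lost(newest_log(...)) on each pass and raise an alert when the count grows between passes. The folder path is left to the caller because it depends on the install location.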

Thanks much!
Bob

I prefer Unix as a rule, but I noticed Win 7 categorizes buffering and cache usage as "Standby" (and "In Use" excludes that, which seems clearer ... for a noob). :oops:

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by radilly »

FYI - using the free space number on the buffers/cache line of free, mem usage looks like it's staying in the 10-12% range after a couple of days.
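
For reference, the same figure can be computed from /proc/meminfo. The sketch below (function names are hypothetical) subtracts free memory, buffers, and cache from the total, which is what the "-/+ buffers/cache" line of the older free output reports as used. With the numbers from the free output earlier in this thread (945512 total, 531608 free, 144092 buffers, 165656 cached), that gives 104156 kB, or about 11%.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def used_percent(info):
    """Percent of RAM in use, not counting buffers and page cache."""
    used = info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]
    return 100.0 * used / info["MemTotal"]
```

On the Pi, the input would be the contents of /proc/meminfo, read with open("/proc/meminfo").read().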

Thanks Simon!

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by radilly »

I'm a little concerned about load average at the moment, and I suspect it's tied to CMX or mono. Here's a plot of the past 3 hours or so from a log I'm maintaining.
pi_proc_loading_20171009_145006.jpg
Normally, as you can see on the left, the load average barely registers. Then I seem to see some cycling. I ran for a number of days with very low numbers until yesterday. I did a systemctl restart on CMX yesterday evening and the load seemed to drop back to what I expected - until mid-day today. It got back to this condition much more quickly than it did after the last reboot. Because restarting CMX seemed to drop the loading - albeit temporarily - CMX may not be the whole cause, but it certainly seems to contribute.**

I've looked at top, but I've not spotted any processes using CPU or memory that come close to mono. I have seen the %CPU attributed to mono hit 57% when the load average was high over the past hour. In my samples, it's generally 0.0% with an occasional blip where it jumps to 5 or 10% very briefly.

I'm going to let this run for a while this time and see if it actually causes a noticeable slow down.

I also note that swap usage is 0 and memory usage (according to free) is running 13% +/-.

Any ideas?

Thanks,
Bob

** I've disabled the GUI / Desktop, and other than several python or bash scripts tracking various things, CMX should have most of the system resources to use. This is a quad-core Pi 3 so I'm flagging any (1 minute) load average sample over 4.0.
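
The flagging rule above could be sketched in Python like this. The function names are hypothetical. os.getloadavg() is a standard library call that returns the 1-, 5-, and 15-minute load averages on Unix systems.

```python
import os

THRESHOLD = 4.0   # one per core on the quad-core Pi 3, per the note above


def load_is_high(load1, threshold=THRESHOLD):
    """True when a 1-minute load average sample exceeds the threshold."""
    return load1 > threshold


def sample_load():
    """Return (load1, flagged) for the current 1-minute load average."""
    load1, _, _ = os.getloadavg()
    return load1, load_is_high(load1)
```

Under this rule a sample of exactly 4.0 is not flagged, since the test is strictly "over 4.0".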

My monitors save the first 20 lines of top output (default mostly), sorted by the RES column, every 5 minutes. Another is watching various aspects of the system and CMX, sampling every 24 secs - my realtime interval. (My PWS transmits outdoor data every 48 secs, so I chose 2X that rate.)

Finally, top -H sometimes shows a load of mono threads, like the 50-60 I can see on the screen, and maybe more. That seems odd to me but I won't pretend to know much about mono...

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Post by radilly »

Mono Threads
I've been tweaking what I monitor and it seems to me mono is misbehaving for me - though CMX might be contributing. I'm running mono 3.2.8 which seems to be the general recommendation.

The local dashboard seems to be updating every 3 seconds when I watch it, and observations seem to be posting to my web server and Weather Underground (though the latter shows a flat spot in the temp graph for about 30 minutes, where there's normally a little "jitter" in the readings). Since the Pi's only real task is Cumulus, it doesn't yet seem to be causing much of a functional problem. My goal, though, is to try to anticipate a problem and either restart CMX or reboot the Pi, to avoid getting to the point where performance degrades.

The load average has been creeping up, but it cycles (as illustrated in the graph above). A few hours ago I saw the one-minute average range from as low as 0.41 to as high as 11.05. Looking at top, 2 things concern me:
  • Some swap space is being used. In my setup that might be hitting the SD card, though if the green LED is blinking, I'm not seeing it. (Most of the IO is going to an SSD.) In my experience once this starts it seldom decreases.
  • The mono process is showing almost 80 %CPU (which I believe means 80% of one core). After a reboot/restart it mostly reports 0%, with an occasional blip to 5-6%.

Code: Select all

top - 09:50:25 up 7 days, 13:18,  2 users,  load average: 4.46, 3.73, 4.13
Tasks: 141 total,   1 running, 140 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.3 us, 12.9 sy,  0.0 ni, 79.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    945512 total,   904156 used,    41356 free,   363488 buffers
KiB Swap:   102396 total,      232 used,   102164 free.   112260 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  540 root      20   0  720100 360148  18176 S  79.2 38.1   5630:58 mono
30544 pi        20   0    5108   2580   2164 R   0.7  0.3   0:00.57 top
29304 pi        20   0   12196   3716   3000 S   0.3  0.4   0:00.74 sshd

I'm monitoring the number of mono threads every 10 minutes. A snippet of that log below shows 333 mono threads running at this point. I'm not aware of anything besides CMX running under mono. After a restart I saw 14 or 15 threads for some days. Then it jumped to ~ 90, then ~ 180, then ~ 270 where it stayed for a few more days. (Interesting pattern, 1*90, 2*90, and 3*90 approx.) It's been around 333 for a day +/-.

Code: Select all

     0  10/18/17 09:24:01 -0400
     1  top - 09:24:01 up 7 days, 12:52,  2 users,  load average: 6.75, 5.13, 4.74
     2  Threads: 478 total,   4 running, 474 sleeping,   0 stopped,   0 zombie
     3  %Cpu(s):  5.5 us, 10.5 sy,  0.0 ni, 83.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
     4  KiB Mem:    945512 total,   902760 used,    42752 free,   362776 buffers
     5  KiB Swap:   102396 total,      232 used,   102164 free.   112156 cached Mem
     6
     7    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     8    540 root      20   0  720100 359792  18176 S  0.0 38.1   0:43.82 mono
     9    632 root      20   0  720100 359792  18176 S  0.0 38.1   1:31.85 mono
    10    737 root      20   0  720100 359792  18176 S  0.0 38.1   0:01.83 mono
    11    823 root      20   0  720100 359792  18176 S  0.0 38.1   8:44.81 mono
    12    824 root      20   0  720100 359792  18176 S  0.0 38.1   0:01.55 mono
    13    825 root      20   0  720100 359792  18176 S  0.0 38.1   0:47.39 mono
    14    826 root      20   0  720100 359792  18176 S  0.0 38.1  21:34.51 mono
    15    827 root      20   0  720100 359792  18176 S  0.0 38.1   3:04.46 mono
 . . . . . . .
   335  24498 root      20   0  720100 359792  18176 S  0.0 38.1   1:28.23 mono
   336  24499 root      20   0  720100 359792  18176 S  0.0 38.1   1:27.57 mono
   337  24500 root      20   0  720100 359792  18176 S  0.0 38.1   1:28.44 mono
   338  29422 root      20   0  720100 359792  18176 S  0.0 38.1   0:12.30 mono
   339  30284 root      20   0  720100 359792  18176 S  0.0 38.1   0:01.49 mono
   340  30292 root      20   0  720100 359792  18176 S  0.0 38.1   0:00.18 mono
   341  22969 pi        20   0    4636   2656   2460 S  0.0  0.3   0:00.34 top_mono.sh
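
An alternative to counting lines of top output is to read the thread count the Linux kernel reports in /proc/<pid>/status. This is a different technique from the top-based log above, and the function names in this Python sketch are hypothetical.

```python
def parse_thread_count(status_text):
    """Extract the 'Threads:' value from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no Threads: line found")


def thread_count(pid):
    """Thread count the Linux kernel reports for one process."""
    with open(f"/proc/{pid}/status") as status:
        return parse_thread_count(status.read())
```

Calling thread_count() with the PID of the mono process and logging the result on a timer would give the same trend without parsing top output.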
Cumulus has been up for about a week, and it looks like I rebooted the same day. If I recall, the processor loading clears with a CMX restart, but gets into this cycling within a few days (rather than a week) without a reboot. I may try that again to see...

Code: Select all

● cumulusmx.service - CumulusMX Service
   Loaded: loaded (/lib/systemd/system/cumulusmx.service; enabled)
   Active: active (running) since Tue 2017-10-10 20:31:50 EDT; 1 weeks 0 days ago
 Main PID: 540 (mono)
   CGroup: /system.slice/cumulusmx.service
           └─540 /usr/bin/mono /mnt/root/home/pi/Cumulus_MX/CumulusMX.exe

Oct 10 20:31:55 raspberrypi_02 mono[540]: Fine Offset station found
Oct 10 20:31:55 raspberrypi_02 mono[540]: Connected to station
Oct 10 20:31:57 raspberrypi_02 mono[540]: Cumulus running at: http://*:8998/
Oct 10 20:31:57 raspberrypi_02 mono[540]: (Replace * with any IP address on this machine, or localhost)
Oct 10 20:31:57 raspberrypi_02 mono[540]: Starting web socket server on port 8002
Oct 10 20:31:57 raspberrypi_02 mono[540]: 10/10/2017 8:31:57 PM
Oct 10 20:31:57 raspberrypi_02 mono[540]: Type Ctrl-C to terminate
Oct 17 23:05:01 raspberrypi_02 mono[540]: Warning: Degraded allocation.  Consider increasing nursery-size if th...ists.
Oct 17 23:10:01 raspberrypi_02 mono[540]: Warning: Degraded allocation.  Consider increasing nursery-size if th...ists.
Oct 18 00:05:01 raspberrypi_02 mono[540]: Warning: Repeated degraded allocation.  Consider increasing nursery-size.

With the service in place, a reboot starts Cumulus pretty reliably, and a reboot is surprisingly quick. My sense is that, for the time required, a reboot seems to forestall some of what I'm seeing for 2-3 times as long as a CMX restart does.

I thought I had resolved the nursery-size warning with this in the service definition...

Code: Select all

Environment=MONO_GC_PARAMS=nursery-size=8m
but maybe I need to increase it further.
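
To judge whether a larger nursery-size helps, the warnings could be counted before and after the change. The Python sketch below uses a hypothetical function name and assumes the log lines are supplied as text, for example from the output of journalctl -u cumulusmx.service. It counts the "degraded allocation" warnings and how many of them are the "Repeated" variant.

```python
def summarize_warnings(lines):
    """Return (total, repeated) counts of mono degraded-allocation warnings."""
    total = repeated = 0
    for line in lines:
        lowered = line.lower()
        if "degraded allocation" in lowered:
            total += 1
            if "repeated degraded allocation" in lowered:
                repeated += 1
    return total, repeated
```

For the three warning lines in the service status output above, this returns a total of 3 with 1 repeated.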

Bottom line at this point is it seems to be getting the job done, other than the little blip at WU (MXdiags showed 1 exception near the end of that blip period). I may just keep an eye on it during the day and see if a functional issue surfaces. Swap usage is very small, though increasing slowly. :?

Bob