Latest Cumulus release v1.9.4 (build 1099) - Nov 28 2014
Latest Cumulus MX release - v3.0.0 build 3043 Jan 20 2017. See this post for download

CumulusMX Hangs, Slowdowns, and Failure Modes

Discussion of version 3 of Cumulus, which runs on Windows, Linux, and OS X. All Cumulus MX queries in here, please.
radilly
Posts: 13
Joined: Fri Jul 17, 2015 11:01 am
Weather Station: Ambient WS-2080
Operating System: Raspian Jessie
Location: McMurray, PA, US

CumulusMX Hangs, Slowdowns, and Failure Modes

Postby radilly » Tue Aug 22, 2017 2:29 pm

I'm working on a sort of CumulusMX "watchdog" script to run alongside Cumulus on a Raspberry Pi 3. Basically, I'm having to restart CumulusMX periodically, and I'd like to automate this as much as possible. I'm using systemd to manage CumulusMX (e.g. viewtopic.php?f=27&t=14753). I've been trying to observe and note the various failure modes, although often the status of interest isn't quite a failure per se. I've found that systemd will restart CumulusMX quickly if the process (actually mono) is killed. I plan to share what I learn in case others are interested.

Here are some things I am looking at:
* The <#DataStopped> tag is in a standalone file, which will return either 1 or 0
* The realtime.txt file on my (hosted) web server should update every 30 seconds in my configuration
* A bunch of html and json files should update every 5 minutes in my configuration
* The Raspbian O/S should not get bogged down
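
A minimal sketch of how the first three checks in the list above might be combined in Python. The thresholds, the function names, and the reading of 1 as "data has stopped" are assumptions made for illustration; fetching the files from the web server is left out, so the functions take already-fetched values.

```python
import time

# Assumed thresholds: allow a few missed updates before raising an alarm.
REALTIME_MAX_AGE = 3 * 30      # realtime.txt should update every 30 seconds
PAGES_MAX_AGE = 3 * 5 * 60     # html/json files should update every 5 minutes


def data_stopped(tag_text):
    """Interpret the standalone <#DataStopped> file (assumes 1 means stopped)."""
    return tag_text.strip() == "1"


def is_stale(last_modified, max_age, now=None):
    """True if a file last modified at `last_modified` (epoch seconds) is too old."""
    if now is None:
        now = time.time()
    return (now - last_modified) > max_age


def find_problems(tag_text, realtime_mtime, pages_mtime, now=None):
    """Return a list of problem descriptions; an empty list means all is well."""
    problems = []
    if data_stopped(tag_text):
        problems.append("DataStopped is 1")
    if is_stale(realtime_mtime, REALTIME_MAX_AGE, now):
        problems.append("realtime.txt is stale")
    if is_stale(pages_mtime, PAGES_MAX_AGE, now):
        problems.append("html/json files are stale")
    return problems
```

Keeping the decision logic separate from the downloading makes it easy to test with fixed timestamps before wiring it to the real server.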

In general, a pretty reliable indicator of an issue (which I've not yet automated) seems to be the Dashboard web page, i.e. the default page at http://*:8998/. The local time widget (sample below) normally refreshes every couple of seconds, but at times I've noticed it hang for longer than that. It looks like the realtime.txt file and the html and json files may not be updating on the server when that happens, but I'm still investigating.

A note on bogged down: One time when I checked on CMX I had a tough time doing much in my usual ssh shell (I pretty much run the Pi headless, without the GUI, and do my admin from an ssh shell). Top looked very unusual...

Code:

top - 08:30:43 up 2 days, 12:33,  3 users,  load average: 15.56, 13.18, 12.78
Tasks: 142 total,   1 running, 141 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.6 us, 19.1 sy,  0.0 ni, 69.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    945512 total,   766604 used,   178908 free,   146700 buffers
KiB Swap:   102396 total,        0 used,   102396 free.   352756 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  529 root      20   0  628000 192208  17652 S  80.1 20.3   1807:22 mono


Typically, I see the CPU usage for mono below 10% and the load average pretty close to 0. Memory usage typically runs around 8%. I'm not monitoring that (yet), but it seemed like a one-off case. (I'm running Raspbian on a stock Pi 3 with few, if any, OS mods.) Oh, and I'm writing Python code because that's big in the embedded (and Pi) world ... and I'm trying to get more comfortable with it.
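
A minimal sketch of a "bogged down" check based on the load average, in case it is worth monitoring after all. The threshold of twice the core count and the function names are assumptions; only `os.getloadavg()` is a standard library call.

```python
import os

CPU_CORES = 4                  # the Pi 3 has four cores
LOAD_LIMIT = 2.0 * CPU_CORES   # assumed threshold, tune to taste


def is_bogged_down(load_averages, limit=LOAD_LIMIT):
    """True if both the 5- and 15-minute load averages exceed the limit."""
    _one_min, five_min, fifteen_min = load_averages
    return five_min > limit and fifteen_min > limit


def check_now():
    """Check the live system; os.getloadavg() returns the 1-, 5-, 15-minute averages."""
    return is_bogged_down(os.getloadavg())
```

With the figures from the top output above (15.56, 13.18, 12.78) this check would flag the system, while the usual near-zero load would not. Using the longer averages avoids reacting to a brief spike.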

I'll share more as I figure it out...

Bob

[Attachment: CumulusMX_Local_Time_Widget.jpg]


Other Notes

I noticed when I was running the GUI and CumulusMX (and little else) my Pi was using a little swap space which I wanted to avoid. I now boot without the GUI and I'm using only about 50% of memory and am not close to swapping!

Code:

$ free
             total       used       free     shared    buffers     cached
Mem:        945512     413904     531608      42248     144092     165656
-/+ buffers/cache:     104156     841356
Swap:       102396          0     102396


I already have a watchdog running for my (aging) web camera. I added a relay module to the Pi to automate what I had previously done by hand - power-cycle the camera. On my (hosted) web server a cron job already tracks the time since the last image upload (nominally a still image should be uploaded every 5 minutes) and sends me an email if over a certain age. The watchdog pulls down a text file which has the number of seconds since the last image update, and if over a threshold, it power-cycles the web cam. I'm using a copy of that script to tinker with the CMX watchdog.
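
The webcam watchdog logic described above can be sketched roughly as follows. This is not the actual script: the 900-second threshold, the function names, and the `set_relay` callable (standing in for whatever drives the relay module) are hypothetical, and downloading the text file is left out.

```python
import time


def needs_power_cycle(age_text, threshold_seconds=900):
    """Decide from the server's text file whether the camera looks stuck.

    `age_text` is the file content: seconds since the last image upload.
    Unparseable content returns False so a bad download never toggles the relay.
    """
    try:
        age = int(age_text.strip())
    except ValueError:
        return False
    return age > threshold_seconds


def power_cycle(set_relay, off_seconds=5, sleep=time.sleep):
    """Cut power, wait, then restore it.

    `set_relay` is a hypothetical callable taking True (on) or False (off).
    """
    set_relay(False)
    sleep(off_seconds)
    set_relay(True)
```

Passing the relay function and the sleep function in as arguments lets the same logic be tried on a desktop machine without any GPIO hardware attached.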

bab5871
Posts: 11
Joined: Mon May 09, 2016 3:42 pm
Weather Station: Davis Vantage Vue
Operating System: Raspbian Jesse RPi3
Location: Ballston Lake NY

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Postby bab5871 » Wed Aug 30, 2017 5:30 pm

I've noticed that when mine goes down, it's actually the USB connection that drops. Even if I restart Cumulus at that point, it would be looking for ttyUSB0 while the station is now on ttyUSB1, so I have to reboot the Pi anyway to get it back on ttyUSB0. How would you account for that?
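
One possible angle on this, sketched under assumptions: on most Linux systems udev maintains persistent symlinks under /dev/serial/by-id that keep the same name even when the device moves from ttyUSB0 to ttyUSB1. The function name below is hypothetical, and whether CumulusMX accepts such a path as its port setting has not been verified here. A watchdog could at least use the mapping to see where the station currently lives.

```python
import os

BY_ID_DIR = "/dev/serial/by-id"   # udev-managed symlinks on most Linux systems


def stable_serial_ports(by_id_dir=BY_ID_DIR):
    """Map each persistent by-id path to the device node it currently points at."""
    if not os.path.isdir(by_id_dir):
        return {}
    ports = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # realpath follows the symlink to e.g. /dev/ttyUSB0 or /dev/ttyUSB1
        ports[link] = os.path.realpath(link)
    return ports
```

If the directory is missing (no USB serial device attached, or a system without these symlinks), the function returns an empty dict rather than raising.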

Toxic17
Posts: 500
Joined: Mon May 19, 2014 10:45 pm
Weather Station: Davis Vantage Pro2 Plus
Operating System: Debian 9.1 Stretch
Location: Bangor, NI

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Postby Toxic17 » Wed Aug 30, 2017 10:02 pm

I think your best option is to find out why your CumulusMX stops rather than have something restart it once it has stopped.

My CumulusMX runs 24/7 without any restart scripts to do this.

Usually lockups are, as said previously, USB issues. Zip up your MXDiag folder and post it here. Perhaps Steve can take a look when he has time and diagnose the problem.

USB issues can be fixed by installing a USB hub between the Raspberry Pi and the weather console. Ferrite cores on the ends of the USB cables help too.

Also please state what version of Mono you are running.
Regards Simon
https://www.conligwx.org
https://www.conligwx.org/pws/
Davis Vantage Pro2+ - CumulusMX v3.0.0 (build 3043) + Saratoga/PWS

radilly
Posts: 13
Joined: Fri Jul 17, 2015 11:01 am
Weather Station: Ambient WS-2080
Operating System: Raspian Jessie
Location: McMurray, PA, US

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Postby radilly » Tue Sep 12, 2017 2:12 pm

I've cobbled together a couple of scripts to monitor various things as I try to sort out what might cause an issue. I agree with @Toxic17 that the desirable thing is to eliminate the issues as they arise if possible.

One thing I'm watching at the moment is memory usage. Cumulus has been up for about 4 days (under systemd, following a reboot) and I've watched the system memory usage (using the free command) increase from around 22% to 76%. For a few days, I've been watching top - sorted by VIRT. The mono process seems to have the largest footprint, but the top %MEM number has stayed below 10%. I've not yet seen where that memory is going!

I'll let it run to around 90% then I plan to try a systemctl restart and check the memory usage. If it doesn't drop much that would suggest Cumulus/mono isn't the culprit. I'll have to do some reading on top's %MEM versus what free is reporting. I don't see where the top proc mem usage numbers account for 76% usage at a system level.

I'll post a little analysis / summary when I get that far.

My scripts are pretty "hacky" as I'm learning Python along the way (and resorting to bash at times). I'm also learning GitHub and git (though I've worked with numerous source control systems over the years). I see a number of GitHub references in the fora, and I am willing to share some of my little hacks if they'd be of value to others... Any advice?

Cheers,
Bob

"top" snippet, sorted by %MEM. At about the 35th process, top reports 0.0%. Summing the non-zero values yields ~ 20% - nowhere near 76%. I have to do some reading, but if anyone has a thought...

Code:

top - 09:38:27 up 4 days, 17:43,  2 users,  load average: 0.00, 0.01, 0.00
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.2 sy,  0.0 ni, 99.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    945512 total,   710288 used,   235224 free,   403976 buffers
KiB Swap:   102396 total,        0 used,   102396 free.   191320 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  542 root      20   0   91328  60540  17660 S   1.6  6.4  90:55.90 mono
13045 colord    20   0   47828  12448   8312 S   0.0  1.3   0:00.50 colord
  535 root      20   0   13424   9512   6044 S   0.0  1.0   0:10.85 webcamwatch.py
13510 pi        20   0   13152   9480   5952 S   0.0  1.0   5:44.88 python
  537 root      20   0   41252   6916   6024 S   0.0  0.7   0:00.14 ModemManager

Toxic17
Posts: 500
Joined: Mon May 19, 2014 10:45 pm
Weather Station: Davis Vantage Pro2 Plus
Operating System: Debian 9.1 Stretch
Location: Bangor, NI

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Postby Toxic17 » Tue Sep 12, 2017 3:22 pm

Linux uses RAM in a different way than Windows, for instance.

A nice explanation is here: http://www.linuxatemyram.com/

If you could zip up the MXDiag folder and post the zip file, Steve and a few others could take a look to see if there is anything obvious in the MX files to help diagnose the issue.
Regards Simon
https://www.conligwx.org
https://www.conligwx.org/pws/
Davis Vantage Pro2+ - CumulusMX v3.0.0 (build 3043) + Saratoga/PWS

radilly
Posts: 13
Joined: Fri Jul 17, 2015 11:01 am
Weather Station: Ambient WS-2080
Operating System: Raspian Jessie
Location: McMurray, PA, US

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Postby radilly » Wed Sep 13, 2017 2:21 pm

Toxic17 wrote: Linux uses RAM in a different way than Windows, for instance.

A nice explanation is here: http://www.linuxatemyram.com/

If you could zip up the MXDiag folder and post the zip file, Steve and a few others could take a look to see if there is anything obvious in the MX files to help diagnose the issue.


That is a nice page. The statements at the bottom are particularly helpful. (http://www.linuxnix.com/find-ram-size-in-linuxunix/ also.) I've not yet gotten used to working with the relatively small amount of RAM on the Pi. (I've never seen a system where there is not at least a little swapping.) Swapping was what I originally focused on, but then I was tinkering to see if I could anticipate when that was about to be an issue. I definitely wasn't looking at it reasonably.

I've not spent much time in the MXDiag folder, but I'm not yet convinced there's a CMX-related issue. (I appreciate the offer and will keep it in mind.)

I did notice though, that there are batches of messages like

Code:

... Sensor contact lost; ignoring outdoor data

That's something I do want to pay attention to. My previous WS had the option to hard-wire the sensors to the base (eliminating battery changes), but my current one is strictly RF. It goes weeks without losing the connection, then seems to go through periods where it keeps dropping. I can monitor the most recent log for that, but for my WS, I'm not aware of a way to force an RF resynch other than a button press. I guess I could hack the hardware...
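
A minimal sketch of how the most recent log could be scanned for these messages. Only the "Sensor contact lost" text comes from the log excerpt above; the function names are hypothetical, and the caller supplies the lines (for example from an open log file), since no log file name is assumed here.

```python
MARKER = "Sensor contact lost"   # text taken from the log message quoted above


def count_contact_lost(lines, marker=MARKER):
    """Count how many log lines report a lost sensor contact."""
    return sum(1 for line in lines if marker in line)


def longest_run(lines, marker=MARKER):
    """Length of the longest run of consecutive lines containing the marker."""
    best = current = 0
    for line in lines:
        if marker in line:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best
```

The run length is a rough way to tell an isolated dropout from one of the periods where the connection keeps dropping.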

Thanks much!
Bob

I prefer Unix as a rule, but I noticed Win 7 categorizes buffering and cache usage as "Standby" (and "In Use" excludes that, which seems clearer ... for a noob). :oops:

radilly
Posts: 13
Joined: Fri Jul 17, 2015 11:01 am
Weather Station: Ambient WS-2080
Operating System: Raspian Jessie
Location: McMurray, PA, US

Re: CumulusMX Hangs, Slowdowns, and Failure Modes

Postby radilly » Thu Sep 14, 2017 6:45 pm

FYI - using the free space number on the buffers/cache line of free, mem usage looks like it's staying in the 10-12% range after a couple of days.
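
The same number can be computed in a script by subtracting buffers and cache from the used figure. A minimal sketch, assuming the standard `MemTotal`, `MemFree`, `Buffers` and `Cached` fields of /proc/meminfo; the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key.strip()] = int(rest.split()[0])
    return info


def percent_used_excluding_cache(info):
    """Memory in use by programs, ignoring buffers and page cache, as a percentage."""
    used = info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]
    return 100.0 * used / info["MemTotal"]
```

Plugging in the figures from the earlier top output (945512 total, 235224 free, 403976 buffers, 191320 cached) gives 114992 KiB, or about 12%, even though top reported 710288 KiB (roughly 75%) as used. On the Pi itself the text would come from reading /proc/meminfo.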

Thanks Simon!

