emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Tue, 12/01/2016 - 03:09

I bought two emonTx V3's last summer, and have them both feeding an emonBase via RF.

One works perfectly. The other one runs for a few weeks and then appears to freeze. Its data is no longer appearing on the emonPi.

Other than putting on a timer to periodically repower it (not a good thing), what else can I do? I don't see a replacement circuit board for sale on the site.

Tks.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by Robert Wall on Tue, 12/01/2016 - 09:53.

Are both running the same sketch?
Have you any idea where/why it locks up - might it be temperature-related, for example? Or a massive burst of interference from another appliance?

The first thing I would do is take a hand lens and very carefully look at the soldering on the radio module. That is the likeliest area for a fault that would stop it completely.

The assembled emonTx V3 is available from the Online Store without a case and accessories. I would not recommend trying to obtain a bare PCB and transferring the components.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by pb66 on Tue, 12/01/2016 - 12:36.

Also, when you say "appears to freeze. Its data is no longer appearing on the emonPi.", is it possible the emonTx is still sending but the packets are not being received? If there is usually a flashing LED, can you still see it flash (assuming it's not battery powered)?

What do you have connected to this emonTx (and is that the same as the other)? Depending on the era of the emonTx, is it pulse counting, or running a continuous sampling sketch with accumulating Wh's that could cause a roll-over issue? Do you have temperature sensors, and if so, how many?

Again depending on era (last summer, so this is the most probable), do you have up to 6 unpopulated temp sensor values all reporting 0? This could cause "bitslip" at the receiver (emonPi).

If the LED is flashing and the firmware/hardware combo fits, we can try to check for discarded packets.

Paul

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Tue, 12/01/2016 - 17:31.

Those are a lot of good thoughts from both responses. Thank you.

Both emonTx's are as they came from the factory from last year, with no sketches uploaded since purchase. They are V3 but I am not sure of the firmware version - is there a way to tell?

The two units are id 10 (the failing one) and 9 and are feeding the same emonHub. Both units are indoors, sitting side by side on a shelf, so temperature shouldn't be an issue - or if it was, I would expect problems with both units. The failing unit has 4 current sensors, a pulse counter (optical) and one temperature sensor. The non-failing unit has 4 current sensors only. I don't know the firmware versions, but both units were purchased last summer (27 July, 14 Aug) - the failing one first and then the other one.

Unfortunately, I had re-powered it before I thought to check the LED, but I believe from prior failures that the LED on the emonTx stops flashing.

Because I have two units feeding into a single emonBase, and one is working, I don't think RF interference is an issue or it would hit both TX's, and for the same reason I don't think the receiving unit is the problem.

Per roll-over - the pulse counter does roll over, but that appears to not be a problem.

Regarding unpopulated temp sensors - the working unit has no temp sensors, the failing unit has one.

As far as discarded packets... there are other devices on the same frequency (I'm not sure what they are, but I can see them on a spectrum analyzer). This results in a fair number of discarded packets during normal operation. If the emonTx was having its packets discarded, I'm not sure how I would tell the difference. Also, I would think bit-slip would be a problem all the time, not just show up after it has been running a few weeks and then lose absolutely every packet.

I think it is most likely that the emonTx is simply stopping running or its RF board croaked. Unfortunately, since I fixed it without looking for the flashing LED, I can't be certain. I believe that in past failures, the LED had stopped flashing, though.

Does the software use a watchdog, as is common in embedded systems? The ATMega series has a built-in watchdog timer that will reset the CPU if the firmware doesn't "pet" the watchdog periodically, but it has to be enabled and the petting software has to be done in a way that any persistent problem will result in no petting.

I will take a close look at the RF unit on the failing unit for solder issues, as mentioned. If I can't solve it, I guess I'll just cough up the 60 pounds and shipping to the US. I notice that the store no longer offers node 9 or 10 options. I guess I'll have to program the new one to 10, or redefine my 6 sensors to 8.

Below is log output for one packet from the unit that fails:

2016-01-12 17:04:07,719 DEBUG    RFM2Pi     7 NEW FRAME : OK 10 0 0 25 0 4 0 253 255 131 46 0 0 0 0 0 0 0 0 0 0 0 0 135 33 (-39)
2016-01-12 17:04:07,722 DEBUG    RFM2Pi     7 Timestamp : 1452618247.72
2016-01-12 17:04:07,723 DEBUG    RFM2Pi     7 From Node : 10
2016-01-12 17:04:07,724 DEBUG    RFM2Pi     7    Values : [0, 25, 4, -3, 119.07000000000001, 0, 0, 0, 0, 0, 0, 8583]
2016-01-12 17:04:07,725 DEBUG    RFM2Pi     7      RSSI : -39
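
For anyone following along, the relationship between the raw "NEW FRAME" bytes and the "Values" line above can be reproduced with a short sketch. It assumes the payload is a run of little-endian signed 16-bit integers after the node ID, and that the fifth value is scaled by 0.01; both are inferred from this one log excerpt rather than taken from the firmware source, and `decode_frame` is a hypothetical helper name.

```python
import struct

def decode_frame(frame_line, scales=None):
    """Decode 'OK <node> <bytes...> (<rssi>)' into (node, values, rssi).

    Assumption: the payload is a sequence of little-endian signed 16-bit
    integers, which is what the log excerpt in this thread suggests.
    """
    fields = frame_line.split()
    if fields[0] != "OK":
        raise ValueError("not an OK frame")
    node = int(fields[1])
    rssi = int(fields[-1].strip("()"))
    payload = bytes(int(b) for b in fields[2:-1])
    count = len(payload) // 2
    raw = struct.unpack("<%dh" % count, payload[:count * 2])
    scales = scales or {}
    # Apply an optional per-index scale factor (index 4 -> 0.01 is inferred).
    values = [v * scales.get(i, 1) for i, v in enumerate(raw)]
    return node, values, rssi

frame = "OK 10 0 0 25 0 4 0 253 255 131 46 0 0 0 0 0 0 0 0 0 0 0 0 135 33 (-39)"
node, values, rssi = decode_frame(frame, scales={4: 0.01})
print(node, values, rssi)
```

Decoded this way, the six consecutive zero values in the middle of the payload (12 zero bytes) are easy to spot.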

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by Robert Wall on Tue, 12/01/2016 - 23:14.

"I notice that the store no longer offers node 9 or 10 options."

If you speak nicely to the shop, they might be able to program one to nodeIDs 9 & 10 for you.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by pb66 on Tue, 12/01/2016 - 23:56.

I would not be surprised if it was the run of 12 zeros (96 zero bits) causing the dropouts, or even a slightly different loop time causing this emonTx to be blocked by the other.
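
To put a number on that run of zeros, here is a minimal sketch that counts the longest stretch of consecutive 0 bits in a payload. It takes the node 10 payload bytes from the log excerpt earlier in the thread and assumes bits go out most significant bit first; it does not model the preamble, sync bytes, header or CRC that the radio adds on air, so treat it as an illustration only. `longest_zero_run` is a made-up helper name.

```python
def longest_zero_run(payload):
    """Return the longest run of consecutive 0 bits, scanning MSB-first."""
    longest = current = 0
    for byte in payload:
        for shift in range(7, -1, -1):
            if (byte >> shift) & 1:
                current = 0          # a 1 bit ends the current run
            else:
                current += 1
                longest = max(longest, current)
    return longest

# Payload bytes of node 10 as logged (node ID and RSSI removed).
payload_node10 = bytes([0, 0, 25, 0, 4, 0, 253, 255, 131, 46] + [0] * 12 + [135, 33])
print(longest_zero_run(payload_node10))
```

Running the same function over a payload where the unused temperature slots hold a non-zero placeholder shows how much shorter the longest run becomes.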

Unfortunately the emonPi firmware does not allow the "discarded" packets to be logged; once they fail the CRC they are forgotten. Normally you could set "quiet = false" in emonhub and we could see for sure what was occurring.

Do you by any chance have another "RFM" device like a rfm2pi or jeelink?

The node 9 or 10 choice in the shop was from the days prior to the DIP switches being fitted, when it required an edit to the sketch. Now you have the choice of 2 node IDs on every emonTx (4 on an emonTH); you just flick the switch.

Do you have a programmer? The later firmware for the emonTx uses non-zero values for unused temp sensors, so that possibility could be removed with a routine update.

Paul

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Wed, 13/01/2016 - 01:39.

@pb66... This hypothesis doesn't explain why it works okay for weeks and then stops working, but starts immediately after being repowered.

If the data is transmitted async (which I thought it was), each character would have a start and one or more stop bits. One of those would not be zero (I don't remember which these days - I think the stop bit). Async transmission can send zeroes forever without losing sync. Synchronous protocols use other techniques (bit stuffing or sync bytes) to deal with repeated zeros to guarantee there are enough transitions that the UART/USART can maintain sync.

On the other hand, you say they use non-zero values now. Was that due to this issue, or because a zero value is a legitimate value and they don't want confusion between a no-data value and a zero temperature value?

I noticed that the new emonTx's have the DIP switches. I don't know what mine have (I haven't opened them up). It's interesting that they only have two choices - two switches allow for 4 choices. I guess they didn't want to confuse people with binary - not a bad idea.

Tks for the thoughts.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by Robert Wall on Wed, 13/01/2016 - 01:57.

You'll find details of the message format on the JeeLabs website. There are no 'characters' as such and no start and stop bits; it's transmitted as one long bit stream, and not Manchester encoded.

On the emonTx, one DIP switch is used for voltage selection (230/110 V), the second for nodeID.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Wed, 13/01/2016 - 03:01.

Wow! Raw bits, no synchronization! Looking at the board I have, it looks like it has a 32kHz frequency reference - ceramic resonator or crystal, I can't tell. Either should be good enough to not require sync for quite a few bits, assuming the receiver behaves itself when slicing the bits. Thanks for the info.

Anyway, I don't think that's the problem with my system, unless somehow the clock on the defective emonTx drifts very slowly, but consistently, and recovers after a 1 second power bounce. I think that's unlikely.

On the dip switch, thanks for the correction.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dwmw2 on Mon, 18/01/2016 - 12:30.

If you want to look at wireless transmissions, you can use a cheap USB DVB-T receiver (under £10 from eBay) and rtl_433. I haven't yet got it actually decoding the emonTx signals but we're working on it.

It can certainly do the basic captures, which users could submit and someone with more clue could interpret for them.

https://github.com/merbanan/rtl_433_tests/pull/67 has a couple of example captures; there's a waveform image there too.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Mon, 18/01/2016 - 20:23.

Thanks. I looked at the received bit stream some time back just by putting a digital oscilloscope on the right pin coming out of the receiver on the emonBase. However, I don't know if that is a faithful reproduction, or if the RF board is decoding the bit stream and reencoding it. It's a bit opaque to me what is happening on the RF boards.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dwmw2 on Tue, 19/01/2016 - 14:34.

The RF board will definitely be decoding the bit stream. You'll get nice clear bits from it, with all the edges in the right places. If the problem is in the RF and timing side, you'll not see it at all; just corrupt bits.

With a RTL2832 DVB receiver, you can see the actual waveform. As shown in the picture in that pull request for example, at https://raw.githubusercontent.com/merbanan/rtl_433_tests/d0770373ae5ba18...

Although we don't yet have rtl_433 decoding the data (qv), you could at least compare the working and non-working transmissions.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Tue, 19/01/2016 - 17:38.

Thanks. The SDR radio does look like a nifty tool.

In my case, all transmissions work, except after a while, one transmitter's data goes missing. I am going to wait for the next failure to see if the problem is that the unit just locks up, as I suspect, as opposed to continuing to transmit but not having the data received properly. Then, I might try the SDR. Or I might give up and buy a new emonTx.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Tue, 19/01/2016 - 21:32.

Since you're probing, if you find there's nothing coming out on the RF side, it might also be worth looking at /CS, SCLK and /IRQ. Activity there would indicate the AVR is still hammering away at it. Lack of activity would be less conclusive, but since it happens rarely, the more data you can collect the better.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dwmw2 on Fri, 22/01/2016 - 14:59.

FWIW the emonTx support is now merged into rtl_433 upstream so you should be able to use that for diagnosis of wireless issues fairly easily.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Sat, 23/01/2016 - 00:31.

Can you give more details? I'm not sure what you mean.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Sun, 24/01/2016 - 03:37.

My system is now back in the state where my emonTx (#10) appears to be doing nothing but emonTx (#9) is working. Both are V3.2.

It is AC powered, and from my reading, that means the red LED should flash every once in a while. It is not.

It is also not showing up in the emoncms.log on the pi, but the other emonTx (#9) is showing. That emonTx (#9) also is not blinking a red light, but it is DC powered so maybe it isn't supposed to.

The end of the log is below.

Any more ideas of what to try?

16-1-24 03:33:47 MQTT INFO Received mqtt message: emonhub/rx/9/values 4,0,31,11,5.27,14.1,0,0,0,0,0,0
2016-1-24 03:33:50 MQTT INFO Reloading config
2016-1-24 03:33:56 MQTT INFO Reloading config
2016-1-24 03:33:59 MQTT INFO Received mqtt message: emonhub/rx/9/values 3,0,32,10,5.31,14.1,0,0,0,0,0,0
2016-1-24 03:34:02 MQTT INFO Reloading config
2016-1-24 03:34:08 MQTT INFO Reloading config
2016-1-24 03:34:09 FEEDWRITER INFO PHPTimeSeries bytes written: 0
2016-1-24 03:34:09 FEEDWRITER INFO PHPFina bytes written: 260
2016-1-24 03:34:14 MQTT INFO Reloading config
2016-1-24 03:34:20 MQTT INFO Reloading config
2016-1-24 03:34:25 MQTT INFO Received mqtt message: emonhub/rx/9/values 3,0,32,10,5.22,14.1,0,0,0,0,0,0
2016-1-24 03:34:26 MQTT INFO Reloading config
2016-1-24 03:34:32 MQTT INFO Reloading config
2016-1-24 03:34:37 MQTT INFO Received mqtt message: emonhub/rx/9/values 2,0,32,11,5.28,14.1,0,0,0,0,0,0
2016-1-24 03:34:38 MQTT INFO Reloading config
2016-1-24 03:34:44 MQTT INFO Reloading config
2016-1-24 03:34:50 MQTT INFO Reloading config
2016-1-24 03:34:50 MQTT INFO Received mqtt message: emonhub/rx/9/values 2,0,30,10,5.3,14.1,0,0,0,0,0,0
2016-1-24 03:34:56 MQTT INFO Reloading config

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Sun, 24/01/2016 - 09:27.

Are you able to probe signals with your scope? If so, check out those signals I mentioned above to see if there's any action on the SPI bus.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by pb66 on Sun, 24/01/2016 - 13:27.

Did you not check if the LED was flashing whilst the unit was working? Without seeing the exact firmware it's difficult to know if "not flashing when not working" is helpful. What you are looking for is an "LED flash when not working" as a sign of life or a "change in behavior when it stops working".

The emoncms log you provided only shows node 9 continuing to work, not node 10 stopping, but it's unlikely to show anything more than that it just stopped. The most useful log for this type of debugging is emonhub.log; the loglevel should be set to "DEBUG" in emonhub.conf. In emonhub.conf you should also set "quiet = false" in the [[[runtimesettings]]] of the [[RFM2Pi]] section.
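
Once emonhub.log is at DEBUG level, the moment a node goes quiet can be read off the "From Node" lines. The following is a rough sketch, assuming the line layout seen in the emonhub excerpts posted in this thread; the regular expression and the `last_seen` helper are illustrative, not part of emonhub.

```python
import re
from datetime import datetime

# Matches e.g. "2016-01-24 18:17:58,856 DEBUG    RFM2Pi     2963 From Node : 9"
FROM_NODE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\s+DEBUG\s+\S+\s+\d+\s+From Node : (\d+)"
)

def last_seen(lines):
    """Map each node ID to the timestamp of its most recent accepted frame."""
    seen = {}
    for line in lines:
        match = FROM_NODE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            seen[int(match.group(2))] = stamp
    return seen

# Sample lines in the format emonhub writes; in practice, pass an open log file.
sample = [
    "2016-01-12 17:04:07,724 DEBUG    RFM2Pi     7 From Node : 10",
    "2016-01-24 18:17:58,856 DEBUG    RFM2Pi     2963 From Node : 9",
    "2016-01-24 18:18:11,530 DEBUG    RFM2Pi     2964 From Node : 9",
]
for node_id, stamp in sorted(last_seen(sample).items()):
    print(node_id, stamp)
```

A node whose last-seen time is far older than the others is the one that has stopped being received, and that time tells you where to look in the log for discarded frames.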

The "bitslip" issue was really the only reason the positive fault indication of 300 was added to current sketches. I am not going to try and convince you it IS that when it is only one of several possibilities. The pattern of occurrence is a little odd but does not rule it out. Previously I was only guessing you had more than one available temp sensor value in the payload, but your log above confirms 6 unused temp sensors. The fact that the device in question has one occupied would suggest it is less likely to occur, but the fact that this one is ACAC powered and the other ACDC powered could counter that theory. Certainly, by adding pulse counting into the mix on ACAC, you may be stretching things a little further than on the other device with less load and a better supply.

What is that single temperature sensor reading? Does it drop to zero degrees every 2 weeks, causing a longer run of zeros and therefore bitslip?

That load and power supply could also make the other node a more dominant RF signal and/or give it a slightly faster loop without the pulse counting or temperature. This could mean that when the 2 devices' RF payloads clash, the other device prevails, causing a long period of no data from this one that, if left long enough, may play out and resume.

The emonTx v3.2 had an rfm12, so I don't think the addition of a temp sensor and pulse sensor alone should cause a power supply issue, but payload, workload, ambient temperature, supply voltage, temperature sensor reading, component tolerances and any variation in RF behavior of its brother, the receiver or "other devices" introduce enough variables to warrant skipping the "in-theory" debugging and getting some actual data to work with.

What we are looking for is packets beginning with a "?" in emonhub.log.

What is your emonBase? rfm12 or rfm69? Some firmwares (including emonPi) will not pass the discarded packets to emonhub. If this is the case you can probably update the firmware quite easily. What OEM image are you running?

Paul

EDIT - "Both are V3.2", are you sure? The v3.4's were launched before the emonPi. Did you buy all your stuff in one go or did you have a working system before the emonPi? I ask as the emonPi will not output the "discarded packets". Originally I believed you had an emonPi, but the recent mention of v3.2 emonTx's pre-dates the emonPi, so my reply today is based on you having a RFM2Pi board. If you do only have the emonPi we could temporarily load a RFM2Pi sketch to it for debugging (will need to recompile a hex for 16MHz).

Also, regardless of the outcome of this thread, as both units are side by side, it may be a good idea to consider using the ACDC to power both units and the ACAC as a reference only to both units. That way they both get a stable 5vdc with plenty of headroom and all 8 ct's are more accurate: 4 from introducing an AC signal to reference and the other 4 from an undistorted (by power demand) reference.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Sun, 24/01/2016 - 18:40.

Hi,

Thanks for the correction on which log to use. I will include log output below.

I will respond to the rest interpolated in your message below. First, though, some new info, from tests suggested by our exchange...

I put my spectrum analyzer on 433.92 and can see the packets. The ones from the working node are far stronger than the others in the environment, so I set the squelch so that only those packets produce audio for me to hear. The strong signals are always correlated with the working emonTx (#9) - per the emonhub logs.

Then, I unpowered #9, and reset #10 (the bad one) by removing and returning power to it. Now I can hear #10 on the spectrum analyzer when its red LED blinks, and its packets are showing up in the log OK.

This is very strong evidence that #10 had simply quit transmitting, for whatever reason, and it resumed after I repowered it. FWIW it blinked 10 times (if I counted right) when I repowered it.

Also, now that it is working, the LED is flashing with each transmission. It was not flashing before.

Assuming that it is simply stopping, as shown, what do I do? Buy a new one? Send it back for repair? Or more analysis and experimenting?

... on to the other analysis...

"Did you not check if the LED was flashing whilst the unit was working? Without seeing the exact firmware it's difficult to know if 'not flashing when not working' is helpful. What you are looking for is an 'LED flash when not working' as a sign of life or a 'change in behavior when it stops working'."

Yeah, that was dumb of me. I just assumed it was supposed to flash, and had a memory of it flashing, but it's been a few months so I wasn't sure. [after re-powering, it now flashes with each transmission]

"The emoncms log you provided only shows node 9 continuing to work, not node 10 stopping, but it's unlikely to show anything more than that it just stopped. The most useful log for this type of debugging is emonhub.log; the loglevel should be set to 'DEBUG' in emonhub.conf. In emonhub.conf you should also set 'quiet = false' in the [[[runtimesettings]]] of the [[RFM2Pi]] section."

Oops, wrong log. I am now including output from the emonhub log - before repowering node #10. It shows lots of bad packets, but those are from other devices not part of this system.

"The 'bitslip' issue was really the only reason the positive fault indication of 300 was added to current sketches. I am not going to try and convince you it IS that when it is only one of several possibilities. The pattern of occurrence is a little odd but does not rule it out. Previously I was only guessing you had more than one available temp sensor value in the payload, but your log above confirms 6 unused temp sensors. The fact that the device in question has one occupied would suggest it is less likely to occur, but the fact that this one is ACAC powered and the other ACDC powered could counter that theory. Certainly, by adding pulse counting into the mix on ACAC, you may be stretching things a little further than on the other device with less load and a better supply."

I cannot dispute the experience - obviously that change wasn't just put in for fun. I was just surprised to run into a protocol that didn't provide adequate transitions for bit synchronization, and I was guessing about the accuracy of the clocks. I'm glad it's fixed in the latest version.

I don't see why it would work fine for a week or two and then experience a bit slip on every packet forever. What I see is that data comes in properly until a time after which it comes in not at all [verified by looking at the graphs]. I don't see the data start to be bad and drift into always bad - it is just bad for up to several weeks at a time.

"What is that single temperature sensor reading? Does it drop to zero degrees every 2 weeks, causing a longer run of zeros and therefore bitslip?"

The temp since the most recent failure has been well above either 0F or 0C. I live in Phoenix, AZ, but it does occasionally get down to 0C. BTW, it turns out I have a temp sensor on both emonTx's, so I have a record of the temperatures.

"That load and power supply could also make the other node a more dominant RF signal and/or give it a slightly faster loop without the pulse counting or temperature. This could mean that when the 2 devices' RF payloads clash, the other device prevails, causing a long period of no data from this one that, if left long enough, may play out and resume."

That's a good thought. However, I have had perfect outages (absolutely no data from #10) that last for weeks. Also, I just now unpowered the other emonTx and still see no packets from #10, so I think we can rule that possibility out.

...

"What we are looking for is packets beginning with a '?' in emonhub.log.

What is your emonBase? rfm12 or rfm69? Some firmwares (including emonPi) will not pass the discarded packets to emonhub. If this is the case you can probably update the firmware quite easily. What OEM image are you running?"

The base is an rfm69 at 433.92 MHz. Lots of discarded packets are showing up in the emonhub log.

"EDIT - 'Both are V3.2', are you sure? The v3.4's were launched before the emonPi. Did you buy all your stuff in one go or did you have a working system before the emonPi? I ask as the emonPi will not output the 'discarded packets'. Originally I believed you had an emonPi, but the recent mention of v3.2 emonTx's pre-dates the emonPi, so my reply today is based on you having a RFM2Pi board. If you do only have the emonPi we could temporarily load a RFM2Pi sketch to it for debugging (will need to recompile a hex for 16MHz)."

Sorry for the confusion - I called it a "pi" because there's a pi in it. I don't have an emonPi. I have an emonBase. My base and the Tx that is having problems were purchased in July of 2014 and the Tx that is not failing was purchased on Aug 14, 2014. The firmware version isn't mentioned in the receipts and I don't know how to find it, but I think it's 3.2. Is there a way to tell?

"Also, regardless of the outcome of this thread, as both units are side by side, it may be a good idea to consider using the ACDC to power both units and the ACAC as a reference only to both units. That way they both get a stable 5vdc with plenty of headroom and all 8 ct's are more accurate: 4 from introducing an AC signal to reference and the other 4 from an undistorted (by power demand) reference."

Good idea. I will do that.

Paul, thanks for all the good ideas!

John

2016-01-24 18:17:35,037 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 33 220 194 224 175 100 76 246 66 188 146 92 32 105 19 208 255 63 251 237 129 (-93)
2016-01-24 18:17:36,150 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 165 134 165 61 250 31 200 61 163 161 61 218 70 243 247 66 60 232 114 108 6 (-92)
2016-01-24 18:17:36,356 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 156 198 105 62 98 220 228 52 221 127 159 27 5 156 3 97 226 228 160 65 129 (-92)
2016-01-24 18:17:36,762 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 16 9 95 160 153 14 93 52 231 195 64 35 47 21 238 32 224 184 122 192 89 (-91)
2016-01-24 18:17:37,870 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 131 116 111 247 25 243 1 123 180 135 52 175 234 70 134 140 128 145 245 202 213 (-92)
2016-01-24 18:17:38,780 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 155 172 103 7 255 32 122 40 59 220 121 3 44 158 177 108 42 51 147 215 223 (-93)
2016-01-24 18:17:44,204 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 10 185 175 248 222 206 230 176 45 159 2 227 6 220 226 57 74 188 235 176 15 (-91)
2016-01-24 18:17:46,214 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 9 2 0 0 0 0 5 0 18 2 151 0 0 0 0 0 0 0 0 0 0 (-37)
2016-01-24 18:17:49,130 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 176 121 163 36 230 168 205 58 5 250 37 72 92 55 97 40 146 100 139 148 31 (-93)
2016-01-24 18:17:50,655 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 186 209 141 35 138 52 210 247 180 23 135 6 195 60 94 228 176 98 173 80 238 (-90)
2016-01-24 18:17:53,566 INFO     emoncmsorg sending: http://emoncms.org/input/bulk.json?apikey=E-M-O-N-C-M-S-A-P-I-K-E-Y&data=[[1453659453,9,2,0,32,7,5.2700000000000005,15.100000000000001,0,0,0,0,0,0,-37]]&sentat=1453659473
2016-01-24 18:17:53,884 DEBUG    emoncmsorg acknowledged receipt with 'ok' from http://emoncms.org
2016-01-24 18:17:54,820 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 15 124 63 218 171 1 129 99 157 240 235 166 174 79 17 110 190 31 138 188 115 (-91)
2016-01-24 18:17:54,928 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 29 154 206 254 117 150 24 240 241 126 53 128 27 68 160 37 236 98 108 214 248 (-94)
2016-01-24 18:17:57,003 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 147 187 143 93 162 2 229 108 254 181 253 107 228 229 211 190 130 158 160 194 28 (-93)
2016-01-24 18:17:57,817 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 14 199 179 34 227 103 209 110 142 227 122 136 133 31 57 41 226 91 227 0 131 (-95)
2016-01-24 18:17:58,123 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 4 173 43 43 246 98 28 102 32 253 42 181 54 44 120 112 114 250 137 39 103 (-90)
2016-01-24 18:17:58,851 DEBUG    RFM2Pi     2963 NEW FRAME : OK 9 2 0 0 0 33 0 4 0 18 2 151 0 0 0 0 0 0 0 0 0 0 0 0 0 (-36)
2016-01-24 18:17:58,855 DEBUG    RFM2Pi     2963 Timestamp : 1453659478.85
2016-01-24 18:17:58,856 DEBUG    RFM2Pi     2963 From Node : 9
2016-01-24 18:17:58,857 DEBUG    RFM2Pi     2963    Values : [2, 0, 33, 4, 5.3, 15.100000000000001, 0, 0, 0, 0, 0, 0]
2016-01-24 18:17:58,857 DEBUG    RFM2Pi     2963      RSSI : -36
2016-01-24 18:17:58,858 INFO     RFM2Pi     Publishing: emonhub/rx/9/values 2,0,33,4,5.3,15.1,0,0,0,0,0,0
2016-01-24 18:17:58,861 DEBUG    RFM2Pi     2963 adding frame to buffer => [1453659478, 9, 2, 0, 33, 4, 5.3, 15.100000000000001, 0, 0, 0, 0, 0, 0, -36]
2016-01-24 18:17:58,862 DEBUG    RFM2Pi     2963 Sent to channel' : ToEmonCMS
2016-01-24 18:18:02,827 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 148 46 61 23 96 219 71 232 255 191 93 198 219 51 101 140 170 123 227 39 255 (-90)
2016-01-24 18:18:08,697 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 27 (-93)
2016-01-24 18:18:11,527 DEBUG    RFM2Pi     2964 NEW FRAME : OK 9 2 0 0 0 34 0 4 0 18 2 151 0 0 0 0 0 0 0 0 0 0 0 0 0 (-38)
2016-01-24 18:18:11,530 DEBUG    RFM2Pi     2964 Timestamp : 1453659491.53
2016-01-24 18:18:11,530 DEBUG    RFM2Pi     2964 From Node : 9
2016-01-24 18:18:11,531 DEBUG    RFM2Pi     2964    Values : [2, 0, 34, 4, 5.3, 15.100000000000001, 0, 0, 0, 0, 0, 0]
2016-01-24 18:18:11,532 DEBUG    RFM2Pi     2964      RSSI : -38
2016-01-24 18:18:11,533 INFO     RFM2Pi     Publishing: emonhub/rx/9/values 2,0,34,4,5.3,15.1,0,0,0,0,0,0
2016-01-24 18:18:11,534 DEBUG    RFM2Pi     2964 adding frame to buffer => [1453659491, 9, 2, 0, 34, 4, 5.3, 15.100000000000001, 0, 0, 0, 0, 0, 0, -38]
2016-01-24 18:18:11,535 DEBUG    RFM2Pi     2964 Sent to channel' : ToEmonCMS
2016-01-24 18:18:13,445 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 69 72 185 181 232 254 139 87 188 197 200 253 216 241 28 73 48 26 167 153 42 (-97)
2016-01-24 18:18:21,324 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 13 148 239 192 255 194 174 51 126 98 139 64 110 167 50 61 1 178 238 47 60 (-95)
2016-01-24 18:18:22,332 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 55 223 81 89 123 172 159 21 24 231 42 127 183 172 85 203 75 59 170 97 123 (-91)
2016-01-24 18:18:23,025 INFO     emoncmsorg sending: http://emoncms.org/myip/set.json?apikey=E-M-O-N-C-M-S-A-P-I-K-E-Y
2016-01-24 18:18:23,645 INFO     emoncmsorg sending: http://emoncms.org/input/bulk.json?apikey=E-M-O-N-C-M-S-A-P-I-K-E-Y&data=[[1453659478,9,2,0,33,4,5.3,15.100000000000001,0,0,0,0,0,0,-36],[1453659491,9,2,0,34,4,5.3,15.100000000000001,0,0,0,0,0,0,-38]]&sentat=1453659503
2016-01-24 18:18:23,956 DEBUG    emoncmsorg acknowledged receipt with 'ok' from http://emoncms.org
2016-01-24 18:18:23,984 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 40 11 224 200 32 217 246 31 62 213 93 83 116 106 231 71 248 173 164 59 207 (-92)
2016-01-24 18:18:24,190 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 9 1 0 0 0 0 5 0 18 2 151 0 0 0 0 0 0 0 0 0 0 (-39)
2016-01-24 18:18:25,937 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 133 111 62 242 155 222 82 158 0 123 132 29 163 100 24 182 232 79 82 30 63 (-94)
2016-01-24 18:18:35,423 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 27 0 245 72 113 73 228 24 187 61 79 150 109 7 43 64 133 145 225 150 227 (-93)
2016-01-24 18:18:36,857 DEBUG    RFM2Pi     2965 NEW FRAME : OK 9 2 0 0 0 33 0 5 0 18 2 151 0 0 0 0 0 0 0 0 0 0 0 0 0 (-38)
2016-01-24 18:18:36,860 DEBUG    RFM2Pi     2965 Timestamp : 1453659516.86
2016-01-24 18:18:36,861 DEBUG    RFM2Pi     2965 From Node : 9
2016-01-24 18:18:36,863 DEBUG    RFM2Pi     2965    Values : [2, 0, 33, 5, 5.3, 15.100000000000001, 0, 0, 0, 0, 0, 0]
2016-01-24 18:18:36,864 DEBUG    RFM2Pi     2965      RSSI : -38
2016-01-24 18:18:36,865 INFO     RFM2Pi     Publishing: emonhub/rx/9/values 2,0,33,5,5.3,15.1,0,0,0,0,0,0
2016-01-24 18:18:36,867 DEBUG    RFM2Pi     2965 adding frame to buffer => [1453659516, 9, 2, 0, 33, 5, 5.3, 15.100000000000001, 0, 0, 0, 0, 0, 0, -38]
2016-01-24 18:18:36,868 DEBUG    RFM2Pi     2965 Sent to channel' : ToEmonCMS
2016-01-24 18:18:36,976 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 137 249 191 122 49 197 231 79 54 165 207 26 188 221 171 26 127 151 79 198 139 (-95)
2016-01-24 18:18:47,165 DEBUG    RFM2Pi     Discarding RX frame 'unreliable content'? 29 219 227 207 175 64 125 36 110 4 56 187 3 114 195 80 56 222 117 200 179 (-93)

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by pb66 on Sun, 24/01/2016 - 20:56.

"It shows lots of bad packets, but those are from other devices not part of this system."

Agreed, most of the discarded packets are garbage and the RSSI of most is -90 or weaker, but there is one in that excerpt:

2016-01-24 18:17:46,214 DEBUG RFM2Pi Discarding RX frame 'unreliable content'? 9 2 0 0 0 0 5 0 18 2 151 0 0 0 0 0 0 0 0 0 0 (-37)

That is supposed to be a valid packet and I strongly suspect "bitslip", but that's just one packet and the wrong device.
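For anyone who wants to check the byte-to-value mapping by hand, here is a minimal sketch. It is not emonHub's actual decoder. It assumes the payload is a run of little-endian signed 16-bit integers, which is consistent with frame 2965 in the log, and `decodeInt16LE` and `frame2965Payload` are names made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode a payload as little-endian signed 16-bit values.
// Returns an empty vector if the byte count is odd (cannot be split into pairs).
std::vector<int16_t> decodeInt16LE(const std::vector<uint8_t>& payload) {
    std::vector<int16_t> values;
    if (payload.size() % 2 != 0) {
        return values;
    }
    values.reserve(payload.size() / 2);
    for (std::size_t i = 0; i < payload.size(); i += 2) {
        // Low byte first, then high byte.
        const uint16_t raw =
            static_cast<uint16_t>(payload[i] | (payload[i + 1] << 8));
        values.push_back(static_cast<int16_t>(raw));
    }
    return values;
}

// Payload bytes of frame 2965 from the log, with the node id (9) and RSSI removed.
std::vector<uint8_t> frame2965Payload() {
    return {2, 0, 0, 0, 33, 0, 5, 0, 18, 2, 151, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
}
```

With the logged payload this gives 2, 0, 33, 5, 530, 151 followed by zeros, which matches the "Values" line once 530 and 151 are scaled to 5.3 and 15.1. The 0.01 and 0.1 scale factors are inferred from the log, not taken from the emonHub configuration.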

From your findings it certainly sounds like the emonTx is just stopping rather than the RF signal being lost, dropped or malformed. I would certainly consider some other things before rushing out to buy a new one though: firmware and power supplies (as suggested above), at least.

Do you have a programmer?

v3.2 is the hardware version, not the firmware version. Using the programmer you may be able to determine the installed firmware, or at least install a known/latest version.

I had no idea the outages were lasting weeks, that is a very definite stop...dead! So firmware or hardware are fast becoming stronger suspects.

Paul

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Sun, 24/01/2016 - 22:25.

Paul... sorry I didn't make it clear sooner about the long-lasting failure.

I will try the dc supply in addition to the AC reference.

Could this really be a firmware issue? Usually this sort of failure, at least on a system with a watchdog, is a hardware failure. But if it doesn't use the watchdog right, a firmware error can cause it. Do you know if others have reported similar problems and they have been found to be firmware?

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Sun, 24/01/2016 - 22:56.

I've seen a lot of AVR hangs, and I think I can honestly say they've always been firmware loops... usually f/w looping on external h/w that has failed to respond (which may be because that small piece of h/w has failed or reset unexpectedly). The most recent one I worked on was a Vcc glitch that was low enough to reset a device on the SPI bus, but not low enough to trigger the AVR BOD. Further investigation revealed it was a power supply problem.

And yes, you're right: the AVR wdog is very useful for diagnosing such stuff, especially if you run it in its two-stage mode. On the first firing, you can capture machine state (including the PC) and store it away in non-volatile storage, then on the second firing, you can let it reset the AVR. Even then you really need a way for your f/w to be able to reset all the external h/w, otherwise you run the risk of the AVR starting afresh, but the various h/w it's talking to all being in some previous state.

AFAIK, JeeLib uses the wdog to wake the AVR from a deep sleep. I'm not sure that it then continues to use it to ensure the system is running normally once awake, but it's all open-source so you could check (unless anyone else knows?)

If you assume for a minute your AVR is still running, then you need to go looking at why it would stop blinking the LED. I'm not familiar with the RF module and the JeeLib stuff, but a quick glance in there suggests it runs quite a state machine, and there are several places it hammers away on the RF module until a specific event happens. That's why I proposed you look to see if there's still activity on the SPI bus during the hang.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Sun, 24/01/2016 - 23:45.

I agree with your analysis of AVR and watchdogs. I've done a bunch of commercial products where the product had to keep running no matter what - as long as it had power, and I used the watchdog to assure that. With thousands of installations, I've never heard a complaint where it was locked up and repowering it solved it.

The Red LED is driven directly by the ATMega328, so it's a strong (but not conclusive) argument that the processor is stopping. Given that, I don't think there are productive diagnostics I can do beyond this. If the behavior persists with the DC supply (and I expect it will), then I'll have to buy a new RFu328 or a new emonTx V3.4 (or maybe, if they sell it, a new emonTx populated board).

Thanks

John

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Mon, 25/01/2016 - 05:49.

I don't share your confidence that the watchdog is enabled. At any rate, it's pretty easy to test. Simply add this line to the main loop of the sketch:

while (millis() > 60000);

If after a minute the device just hangs, with no more LED flashing, then we've replicated your symptoms without any hardware issue. If it reboots, then we know the watchdog is indeed watching and an infinite loop is unlikely to be causing what you're seeing.

Assuming for a minute the wdog isn't enabled, then any loop that has the possibility of running forever could be the cause of your symptoms. There seem to be plenty of them in the RF code:

while (!rf12_canSend())
  rf12_recvDone(); // keep the driver state machine going, ignore incoming

Most of them seem to bang on the RF module waiting for it to do as expected, but if it doesn't I can't see any escape clause. So putting your scope probe on the SPI signals might help you pinpoint what's broken.
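One possible escape clause, as a sketch only: bound the wait with a timeout. `waitWithTimeout`, `StuckRadio` and `stuckRadioGivesUp` are names invented here, not part of JeeLib. On the emonTx the three callables would wrap rf12_canSend(), rf12_recvDone() and millis(); the `StuckRadio` struct is just a host-side stand-in so the logic can be tried without hardware.

```cpp
#include <cstdint>

// Poll `ready()` until it returns true or `timeoutMs` elapses.
// `poll()` is called between checks (the role rf12_recvDone() plays above);
// `nowMs()` returns a millisecond tick (the role millis() plays on Arduino).
// Returns true if the condition was met, false on timeout.
template <typename ReadyFn, typename PollFn, typename ClockFn>
bool waitWithTimeout(ReadyFn ready, PollFn poll, ClockFn nowMs,
                     uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    while (!ready()) {
        // Unsigned subtraction stays correct when the tick counter wraps.
        if (static_cast<uint32_t>(nowMs() - start) >= timeoutMs) {
            return false;
        }
        poll();
    }
    return true;
}

// Hypothetical stand-in for a radio that never becomes ready.
struct StuckRadio {
    uint32_t clockMs = 0;
    uint32_t polls = 0;
    bool canSend() const { return false; }
    void recvDone() { ++polls; }
    uint32_t now() { return clockMs += 10; }  // each call advances 10 ms
};

// Returns true if the bounded wait gave up on the stuck radio.
bool stuckRadioGivesUp(StuckRadio& radio, uint32_t timeoutMs) {
    return !waitWithTimeout([&] { return radio.canSend(); },
                            [&] { radio.recvDone(); },
                            [&] { return radio.now(); },
                            timeoutMs);
}
```

What the sketch should do when the wait times out is left open. Logging where it gave up would at least show which call the RF module stopped answering.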

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by pb66 on Mon, 25/01/2016 - 15:00.

As far as I'm aware the watchdog is only used in the JeeLib sleepy functions but I cannot be absolutely sure.

Since we do not know exactly what firmware you are using, it is difficult to know if it has been revised at any point. I have looked at the repo, but a reorganisation in summer last year prevents searching the commit history.

There have been revisions to the JeeLib library and I cannot rule out any other changes since the compilation of a sketch of unconfirmed revision. Even if I had bought the device recently and there were no known issues or revisions to the firmware, I would still try a free, quick and simple firmware update long before I would shell out postage to return a device, let alone buy a new one.

In fact if I had a programmer (still unsure if you do) connecting it may tell us the firmware revision and at this point in the game, whether it did or it didn't, or whether it was the latest or not, I would still update it so that I knew exactly what was installed for future reference.

As highlighted by dBC, it could be hanging on an rf12 function. That could be brought about if there is a power supply issue that only affects the RFM module. I don't think they are as accommodating as the AVRs with respect to voltage, so just using the 5 V DC supply for both emonTx's may well fix the issue.

Paul

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by Robert Wall on Mon, 25/01/2016 - 15:49.

"Most of them seem to bang on the RF module waiting for it to do as expected, but if it doesn't I can't see any escape clause. "

"it could be hanging on an rf12 function"

Both seconded. Almost all the reported problems with the sketches hanging have turned out to be at calls to the RF module, and (from memory) all of those have been the RF module not responding due to a bad joint.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by pb66 on Mon, 25/01/2016 - 16:00.

Robert - I had wondered at first glance if that was supposed to be

"Most of them seem to bang on about the RF module waiting for it to do as expected,"

perhaps referring to the likes of you and me :-)

Paul

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by Robert Wall on Mon, 25/01/2016 - 16:25.

You could be right! But no, I think it meant hang (the keys are diagonally adjacent).

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Mon, 25/01/2016 - 21:48.

It was more like: I'm going to bang on this ball joint with my hammer until it does what I want.

If anyone out there with some OEM h/w and a couple of minutes to spare could briefly add:

while (millis() > 60000);

to their main loop, we'd know for sure if the watchdog is enabled (at least in the f/w version they're running). It should either reset every minute (watchdog enabled) or hang indefinitely after a minute (no watchdog). A quick browse of https://github.com/jcw/jeelib/blob/master/Ports.cpp makes me think it isn't. All the WDTCSR manipulations seem to be setting it up for deep sleep wakeup, and not system resets, but I could have easily missed something in other modules.

If the radio module does occasionally get into some state that requires a hard reset, then enabling the AVR watchdog won't help. That'll break the AVR out of its immediate loop, but without a mechanism for the freshly-restarted f/w to force a hard reset of the radio module, it's likely to quickly end up back in the loop.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Tue, 26/01/2016 - 03:44.

I will be interested to see if someone runs that test.

As for a hard reset... I would hope that on startup, the AVR hard resets all peripheral devices that need resetting. That's standard practice in systems that need watchdogs - without doing that, the watchdog is really a waste of time. For example, brownouts can cause logic other than the CPU to get in bad states.

It looks like the production code is probably built from emonTxFirmware-master 2/emonTxV3/RFM/emonTxV3.2/emonTxV3_2_DiscreteSampling although the comment in the repo has a different URL, one that is broken.

That code uses an object called Sleepy and assigns the watchdog interrupt to a function in Sleepy. That function, in the emon version of JeeLib, just counts.

But... I don't know what mode the watchdog is in. These things have 3 different programmable modes, and also 3 fuses! One programmable mode would allow the code I saw to still have a hard reset: the "Interrupt and System Reset" mode. But, I cannot tell from reading the code what is going on - at least not without reading the Arduino framework code.
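For reference, a rough sketch of how the mode follows from the WDIE and WDE bits of WDTCSR and the WDTON fuse, as I read the watchdog configuration table in the ATmega328 datasheet. The helper names (`classifyWdtMode`, `canResetOnHang`) are made up for illustration, it runs on a PC rather than on the AVR, and it ignores the fine print (for example the interaction with the WDRF flag in MCUSR).

```cpp
#include <cstdint>

// Bit positions in WDTCSR as given in the ATmega328 datasheet.
constexpr uint8_t kWDIE = 6;  // watchdog interrupt enable
constexpr uint8_t kWDE  = 3;  // watchdog system reset enable

enum class WdtMode { Stopped, Interrupt, SystemReset, InterruptThenReset };

// Hypothetical helper: classify the watchdog mode from a WDTCSR value and
// the state of the WDTON fuse (true = programmed, i.e. watchdog always on).
WdtMode classifyWdtMode(uint8_t wdtcsr, bool wdtonProgrammed) {
    if (wdtonProgrammed) {
        return WdtMode::SystemReset;  // fuse forces reset mode regardless of bits
    }
    const bool wdie = (wdtcsr >> kWDIE) & 1u;
    const bool wde  = (wdtcsr >> kWDE) & 1u;
    if (wde && wdie) return WdtMode::InterruptThenReset;
    if (wde)         return WdtMode::SystemReset;
    if (wdie)        return WdtMode::Interrupt;
    return WdtMode::Stopped;
}

// Only modes that end in a system reset can recover from an infinite loop.
bool canResetOnHang(WdtMode mode) {
    return mode == WdtMode::SystemReset ||
           mode == WdtMode::InterruptThenReset;
}
```

If the sketch only ever sets WDIE (interrupt mode, used for sleep wakeup), then by this reading a hung main loop would not be reset.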

Anyway, here's the link to the full datasheet for the ATMega328.

So, if anyone knows, or does the experiment, I'd love to know whether there really is a working watchdog in this beast - that is, working in the sense of resetting the processor in case of a loop.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Tue, 26/01/2016 - 04:40.

"I would hope that on startup, the AVR hard resets all peripheral devices"

I can't see any provision for the AVR to hard reset the RF module in the schematic, can you?

"without doing that, the watchdog is really a waste of time"

Maybe that's why nobody has bothered to enable it (other than to use it for deep sleep wakeups).

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by meso on Tue, 26/01/2016 - 04:40.

I don't see a way for it to reset the RF board. But... I don't know if the RF board needs resetting, since I don't know what's on it and whether it has a CPU with a watchdog. Mysteries abound.

However, even if it has a failure mode, there's no reason for the Arduino CPU to not protect itself with a reset on watchdog. Maybe not every failure can be caught, but it is still good practice.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Tue, 26/01/2016 - 04:59.

"Mysteries abound."

There may be one less mystery if you probe the SPI signals next time it hangs.

"there's no reason for the Arduino CPU to not protect itself with a reset on watchdog. Maybe not every failure can be caught, but it is still good practice."

Hey, it was you who declared "without doing that, the watchdog is really a waste of time". If you think it'll help, then enable it. You know what they say about opensource... if you don't like it, rewrite it.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by emjay on Tue, 26/01/2016 - 09:13.

@meso,

"I don't know if the RF board needs resetting, since I don't know what's on it and whether it has a CPU with a watchdog"

The RF Modules have an RF engine driven by a state machine - some state transitions have timers associated, but there is no formal watchdog "reset" to a known state. 'Power on Reset' has this function.

"One works perfectly. The other one runs for a few weeks and then appears to freeze."

Identical code and same base hardware, sitting on the same shelf you said. Once you have eliminated the obvious (e.g. swapping power sources between them) surely this points to a hardware issue? Then implementing your version of a watchdog would simply mask the problem?

Good enough for Government work I suppose - look forward to seeing your results.

Re: emonTx locks up after a week or two, needs to be repowered

Submitted by dBC on Tue, 26/01/2016 - 11:22.

Actually, if meso implements it well, the watchdog can be a very useful diagnostic tool. Even if it's not possible to bring a wedged RF module back to life (due to nothing being connected to the module's /RESET pin) you can gather lots of useful information about where the AVR got wedged, and that will often point to what's broken/flaky.

On my AVR-based hardware I keep lots of health status (shown below), and additionally if there has been a WDOG reset, I also record:

  ...
  uint32_t wdog_pc;
  uint32_t last_known_pid;
  uint8_t  wdog_fw_version_maj;
  uint8_t  wdog_fw_version_min;
  uint8_t  wdog_link_status;
  uint8_t  wdog_portc;
  uint8_t  wdog_porta;
} XXX_health_t;

Knowing the AVR PC at the time the wdog fired, the last known process-id, the link state of all the ethernet ports and which SPI /CS's were asserted has helped me isolate h/w failures. Although as I mentioned above, that was just the first step... the ultimate culprit in that case was the power-supply glitching Vcc, but it was all this wdog info that led me down that path.
