RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Tue, 13/10/2015 - 17:39

I've been investigating the emonpi sent back by Jez Wingfield as discussed in this thread "EmonPi losing connection to EmonTX and EmonTH". The issue Jez Wingfield reported had similarities to one part of the issue John Cantor reported from a monitoring installation discussed here. We had several nodes stop reporting for several hours there too although that wasn't the focus of the discussion.

After reading through the threads again, these were, I think, the main findings. Please correct me if I have missed anything, and thank you all for your input on this:

1. Paul Burnell identified adding a delay after waking the RFM69 module as one solution that improved reliability in the "emonTH Unreliable reading timing?" thread.
2. Ian Davies and emjay identified the long string of zeros in the emontx packet as an issue in the thread "Data loss due to RF packets getting corrupted", because it lets the bit slicing go out of sync.
3. Paul Burnell also identified that an RF data packet interval longer than the fixed interval of the feed engines also produces missing datapoints: a 62s emonth update rate and an 11.2s emontx update rate result in a missing packet every 30 and ~9 packets respectively.

I've been running a series of tests here, first comparing the returned emonpi, thanks to Jez Wingfield, with an off the shelf emonpi. I then applied the data timing correction (3) and removed the unused temp sensor entries (2). I also tested recording the data via the HTTP connection between emonhub and emoncms on the pi and the MQTT connection that was added earlier this year. The results so far show a significant improvement and an interesting result on the HTTP vs MQTT routes.

Before application of timing correction and removal of extra temp sensors:

100% SUCCESS RATE = 2024 packets

Control EmonPi

    JeeLib Pi2 1630 http 81%
    JeeLib Pi2 1571 mqtt 78%

EmonPi returned:

    JeeLib Pi1 1384 http 68%
    JeeLib Pi1 1342 mqtt 66%

After application of timing correction and removal of extra temp sensors:

100% SUCCESS RATE = 576 packets

Control EmonPi

    JeeLib Pi2 549 http 95%
    JeeLib Pi2 535 mqtt 93%

EmonPi returned:

    JeeLib Pi1 431 http 75%
    JeeLib Pi1 416 mqtt 72%

I'm not sure why some packets are getting lost via the internal mqtt route. I have a feeling it's the MQTT php client I'm using, but I need to do some more investigation.

Should I make the emontxv3 packet change its length depending on the number of temperature sensors connected? This would require modifications to the emonhub decoder as more temperature sensors are added. Alternatively, I could set the temperature values to an out-of-range value if they are not connected, such as -99.

I'm not sure if the above gets us closer to the answer as to what was causing the nodes to stop updating for several hours, I've yet to replicate this issue in these tests.

LowPowerLabs
To add another thing to the mix: a couple of weeks ago, I thought I'd try testing the LowPowerLabs RFM69 library https://github.com/LowPowerLab/RFM69 because I read that it might be more reliable. It's also a dedicated RFM69 library and has encryption built in. The following tests also include ACKs, which may account for the very slight difference in performance vs the control emonpi after the timing correction and removal of extra temp sensors. More testing needed.

run1:
LowPowerLabs http 573 99%
LowPowerLabs mqtt 547 95%

run2:
LowPowerLabs http 2005 99%
LowPowerLabs mqtt 1867 92%

I've uploaded a series of LowPowerLab's examples here: https://github.com/openenergymonitor/emonLPL

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by pb66 on Tue, 13/10/2015 - 20:13.

Hi Trystan

Where are you counting successful packets? Ideally each portion of the journey needs to be assessed independently. There is a strong possibility some packets are being lost within emoncms itself. I have experienced some "missing data" despite having a confirmed receipt in emonhub.log. A similar thing is apparent in this Re: Software only solution for feeding weather data thread, which is improved by double posting.

The difference in packet loss between http and mqtt could be affected by different posting rates. On the emonPi it appears both use a fire-and-forget method, however the http is throttled to post every 30secs and the mqtt is not. A dropped http post is therefore more significant, although possibly less likely due to reduced traffic.

Although ultimately the overall success rate is the target to improve, I think the rf performance needs to be measured and assessed before any processing, preferably at the serial port to have enough meaning to evaluate the rf libs. I have written a draft debug mode for the JeeInterfacer of emonhub to "jeebug" the received packets in tandem with emonhub operating as normal, it timestamps and counts each packet before saving to a log in csv so that can be tailed, attached to a forum post or opened in a spreadsheet for analysis.

The temperature sensor count could be reduced to 4 to reduce the errors to an acceptable level. We never seem to have any issue with packet loss when 4 CTs are at zero, so I think 6 temp sensors is just pushing the limit a bit, although the length of the packets is still unnecessary traffic which could also impact the success rate.

Having said that, I am a big fan of using a positive fault indicator for absent or unreadable sensors. A custom emoncms "temperature" process could "update feed if input less than 300" for example, so that only good values are recorded yet 300+ numbers would still get updated in the inputs page as a form of error code.
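To illustrate the fault indicator idea, here is a minimal Python sketch of the "update feed if input less than 300" rule. It is not an existing emoncms process; the names `FAULT_THRESHOLD` and `feed_update` are made up for the example, and the threshold of 300 is simply the value suggested above.

```python
FAULT_THRESHOLD = 300.0  # assumed: readings at or above this are error codes


def feed_update(value):
    """Return the value to log to the feed, or None to skip the update.

    Good readings (below the threshold) are logged. Fault codes are not
    logged to the feed, but the caller can still show them on the inputs
    page as an error indication.
    """
    return value if value < FAULT_THRESHOLD else None


if __name__ == "__main__":
    for reading in (21.5, 300.0, 22.1):
        print(reading, "->", feed_update(reading))
```

The raw input is left untouched, so the fault code remains visible even though the feed only ever receives plausible temperatures.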

Using any encryption will, I would expect, automatically alleviate the "run of zeros errors", so just using the rf69 specific methods of jeelib would probably improve things significantly too (but won't fully support rf12's, although there is now a rf12_compat mode for rf69 too). Have you tried the rf12 lib encryption in JeeLib? I've seen it but not tried it.

Did you determine why the returned emonPi is performing significantly poorer than the "off the shelf" emonPi? It certainly looks like there are a wide range of smaller issues that need to be tackled one by one rather than finding any one fix all solution.

Paul

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by Kaed on Tue, 13/10/2015 - 21:12.

Good work!

For the last 3 days I have had this idea of connecting nodes using the RFM69 library, but I haven't managed to write the sketch correctly...

For use in my project, please could you put sketches for the emonTx and emonBase that use MQTT-format data communication on GitHub? Thanks.

Kaed

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Thu, 15/10/2015 - 11:36.

Thanks Paul

I'm counting packets at the graph output. If all the packets get through we should see a complete graph with no missing datapoints, and using the skip missing checkbox it's easy to see how much did get through vs how many datapoints should be there.

I like the sound of the debug mode you mention and counting the success rate earlier on in the chain. Would it be possible to record something like a count of missed packets? It would need to expect a certain number of packets with a record of the expected interval, or have a counter being sent from each node.

I'd be happy to use a positive fault indicator. Should we go for 300? Or maybe the HTTP status code 204 No Content, but maybe that's just a bit confusing.

I haven't worked out yet why the returned emonpi is performing poorer. I tried adding a bit of solder to all the radio module connections, double checked for solder bridges etc., but the performance is still worse.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by pb66 on Fri, 16/10/2015 - 00:22.

Hi Trystan,

Yes, for the moment I have an incrementing counter in each node (just for testing). This can just be a byte, and we can ignore the rollover as we are only looking for missing packets: it's easy to use the received increment, if it's greater than 1, to count dropped packets, and that can be configured to count per node rather than just totals. This could be paired with the rssi as a success percentage (link quality?) on each packet. I was also looking at passing RFM stats under their own node id, either the baseid (but that may conflict with the emonPi base/node) or node 31, since that is never actually used.
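The counter approach can be sketched in a few lines of Python. This is only an illustration of the idea, not code from emonhub or any sketch; the function names are made up, and it assumes a single-byte counter that wraps from 255 back to 0.

```python
def dropped_between(prev_count, curr_count, modulus=256):
    """Packets lost between two consecutively received counter values.

    The counter is one byte, so the difference is taken modulo 256 to
    handle the rollover. An increment of exactly 1 means nothing was lost.
    """
    return (curr_count - prev_count - 1) % modulus


def count_dropped(counters):
    """Total packets lost over the counter values received from one node."""
    return sum(dropped_between(a, b) for a, b in zip(counters, counters[1:]))


if __name__ == "__main__":
    received = [253, 254, 1, 2, 5]  # 255 and 0 lost, then 3 and 4 lost
    print(count_dropped(received))
```

One limitation of a byte counter is that a gap of 256 or more packets cannot be told apart from a shorter one, so long outages are undercounted.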

I only use 300 out of habit since MartinR used it in his sketches, I will give it some thought as to what may be better.

Regarding the emonPi, are you using the same SD card? Have you tried swapping them over to determine if it's software or hardware related?

Paul

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by emjay on Fri, 16/10/2015 - 13:48.

Trystan,

"I haven't worked out yet why the returned emonpi is performing poorer, I tried adding a bit of solder to all the radio module connections, double checked for solder bridges etc but the performance is still worse."

Do you have access to a spectrum visualiser? I suspect that the crystal on the marginal module is out at the edge of the tolerance band. Since the same crystal defines both Tx and Rx centre frequency, forcing an ACK will display the remote packet and the ACK response spectra in sequence.

This is adequate if you have a spare JeeLink around.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Sat, 24/10/2015 - 11:24.

I've upgraded the emonTxV3_4_DiscreteSampling firmware to send its data in just under 10s and set the status code for no temperature sensors to 3000 (will show as 300 in emoncms once the x0.1 node decoding in emonhub is applied). The transmit timing is also adjusted for when there are a different number of CT's connected, ACAC and DS18B20 temperature sensors.

https://github.com/openenergymonitor/emonTxFirmware/tree/master/emonTxV3...

We will aim to start shipping emonTxV3's with this firmware next week.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Tue, 17/11/2015 - 11:40.

Resuming testing on this, a couple of results for the latest EmonTH code overnight:

emonTH_DHT22_DS18B20_RFM69CW_Pulse, dht22 only, v2.6, MQTT: 1022 out of 1092: 94%
emonTH_DHT22_DS18B20_RFM69CW_Pulse, dht22 only, v2.6, HTTP: 1071 out of 1095: 98%

Next I'm going to add an emontx v3 and run both as an ongoing test.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by pb66 on Tue, 17/11/2015 - 13:12.

Those numbers look much better. Are you still counting in emoncms using a fixed interval feed?

The recent change to make the interval just shy of a minute rather than slightly over, so that every recorded fixed interval in emoncms is present, actually makes it impossible to get 100% of the data and only gives the appearance of "no missing packets".

For example (on an emonTx), if you were using a fixed interval of 10 secs and a send interval of 9 secs, every datapoint in emoncms will be present, but every 90secs there will be 2 packets received in a single fixed interval, so one will be dropped. So even if it appeared 100% successful, it could only ever be 90% successful at most.

Whereas previously, for example using a fixed interval of 10 secs and a send interval of 11 secs, it is possible every packet is recorded without fail, but every 110 secs a fixed interval datapoint is not used, giving the appearance of a 91% success rate when in actual fact it is 100%.

Assuming all the sent packets were to arrive at emoncms, the send interval can play quite a small part in either the actual success rate or just the perceived success rate, as long as the difference in intervals is minimal. The longer intervals, e.g. a 1min fixed interval in emoncms and a 59sec send interval, could give up to a 98.3% success rate (seen as 100%), while at a 61 sec send interval a 100% success rate is possible but will give the appearance of a 98.3% success rate due to the occasional empty datapoint.

I only mention this so that we can be aware of the regular dropped (or perceived absent) packet during development, and also to say that if you are counting at source and your (http) tests are on 59sec/1min intervals, you are unlikely to see any better than ~98%. But if counting in emoncms then 98% of 98% would be 96%.
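The arithmetic in the examples above can be checked with a short Python sketch. It assumes every packet arrives and that each fixed-interval slot keeps at most one value; the function name `best_case_rates` is made up for illustration.

```python
def best_case_rates(send_interval, feed_interval):
    """Return (actual, apparent) best-case success rates as fractions.

    actual:   share of sent packets that can be stored in the feed
    apparent: share of feed slots that end up holding a value
    """
    actual = min(1.0, send_interval / feed_interval)
    apparent = min(1.0, feed_interval / send_interval)
    return actual, apparent


if __name__ == "__main__":
    for send, feed in ((9, 10), (11, 10), (59, 60), (61, 60)):
        actual, apparent = best_case_rates(send, feed)
        print(f"send {send}s, feed {feed}s: "
              f"actual {actual:.1%}, apparent {apparent:.1%}")
```

Sending faster than the feed interval caps the actual rate while the feed looks complete; sending slower keeps every packet but leaves the occasional empty slot.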

Paul

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Tue, 17/11/2015 - 14:03.

Thanks Paul, yes, quite right and yes using fixed interval feed as my counting method.

I've been thinking about adding an acknowledgment check for nodes that are powered all the time, to see if it's possible to achieve feeds with no gaps and improve the reliability by that final few percent, or by more for systems in higher noise environments or suffering lower reliability for other reasons.

I've got example transmitter and receiver code (attached) with ack's working following examples by Jean Claude Wippler here:

http://jeelabs.org/2010/12/11/rf12-acknowledgements
https://github.com/jcw/jeelib/blob/master/examples/RF12/roomNode/roomNod...

I'll create an emontx v3 example with this built in and an emonpi receiver, and continue with the testing.

I'd like to get to a point where we can say with some certainty what level of packet reliability we expect from the system.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Tue, 17/11/2015 - 17:41.

I've created a version of the standard EmonTxV3 firmware that waits for an ACK and am now testing this in parallel with the current firmware. https://github.com/openenergymonitor/emonTxFirmware/tree/master/emonTxV3...

There is also a minor fix to the emonpi firmware to remove the serial print when it sends an ack.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Thu, 19/11/2015 - 14:57.

Some more test results, 1x EmonTH and 2x EmonTx's one with acknowledgments.
Test duration 25 hours and 30 mins.

Reliability recorded by counting total number of NAN values in the feed data directly using this script: https://github.com/emoncms/usefulscripts/blob/master/integritycheck/miss...

EmonTH temperature MQTT 60s 124 missing, 91.9%
EmonTH temperature HTTP 60s 101 missing, 93.4%

EmonTx temperature MQTT 10s 749 missing, 91.8%
EmonTx temperature HTTP 10s 593 missing, 93.5%

EmonTx RMS Voltage MQTT 10s             749 missing, 91.8%
EmonTx RMS Voltage HTTP 10s             593 missing, 93.5%
EmonTx RMS Voltage MQTT 10s with ACK    458 missing, 95.0%
EmonTx RMS Voltage HTTP 10s with ACK    377 missing, 95.9%

Jeelink test transmitter 5s with ACK 716 missing, 96.1%

The reliability seems to have dropped by about 2% from the previous run, which had one EmonTH running. I'm also surprised that adding acknowledgments did not improve things to a greater extent; I was hoping for near 100%. I will try and dig into this some more. I will try recording jumps in an incrementing counter as you suggest, Paul.
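For reference, the percentages above follow from the test duration, the feed interval and the number of missing values. The Python sketch below shows the calculation; it is not the linked integrity check script, and the function names are made up for illustration.

```python
import math


def count_missing(values):
    """Count datapoints that are absent (None) or NaN in a list of values."""
    return sum(1 for v in values if v is None or math.isnan(v))


def expected_points(duration_s, interval_s):
    """Number of fixed-interval datapoints a full feed would hold."""
    return duration_s // interval_s


def success_rate(missing, expected):
    """Percentage of expected datapoints that were actually recorded."""
    return 100.0 * (expected - missing) / expected


if __name__ == "__main__":
    duration = 25 * 3600 + 30 * 60           # 25 hours 30 mins in seconds
    expected = expected_points(duration, 10)  # 10s feed
    print(expected, round(success_rate(749, expected), 1))
```

With a 25.5 hour test, a 10s feed should hold 9180 points and a 60s feed 1530, which reproduces the 91.8% and 91.9% MQTT figures listed above.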

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by Bramco on Sat, 28/11/2015 - 15:24.

Wasn't sure whether I should start a new thread, but I have had a couple of complete drop-outs in the comms to my system. This is a new pi2 with a new rfm69pi on 868MHz and running V9 rc2. The emonTX has been running for the last couple of years without any hitches. This is doing PV diversion, Martin's PLL sketch with 5 DS18B20 temperature sensors. So it will be rfm12, not rfm69.

On several occasions recently the inputs have just stopped. Rebooting brings them back of course, but when you are away for a while you can't necessarily do the reboot unless you have set up your router to allow remote access.

I know the emonTX is still running because I can see the effects of the PV diversion - the tank gets hotter and also when I reboot the inputs become active.

Trystan, did you get anywhere with this reliability issue, or should I go back to the rfm12pi that I had on my previous system?

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by pb66 on Sun, 29/11/2015 - 21:46.

Bramco - The drop-outs may be the rfm2pi locking out, and rebooting the Pi also restarts the rfm2pi; see the latter part of the RFM12PI receiver goes hard down sometimes thread for more info. The test would be to ssh in and reset the rfm2pi by pulsing the reset pin (pin7 of the Pi's gpio).

The issues here may apply, but it would be unusual that no packets get through. These issues normally arise from clashes and/or bit syncing issues. The bit syncing is unlikely unless you have all the temp sensors disconnected (zero values), and although the extra sensitivity of the rfm69 may introduce some clashing, some data would still get through.

Paul

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by Bramco on Mon, 30/11/2015 - 07:44.

Paul, thanks for the link, daft isn't it that I had actually contributed to that thread earlier this year but couldn't find it yesterday through search. Anyway, I'll continue this there.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Thu, 10/12/2015 - 14:53.

I've been continuing with the testing on this, every few days/week setting new tests and adjusting them when I realise a better way to run them. As before, I'm keen to get to a point where we have a clear understanding of what to expect in terms of the reliability of the radio link.

Test 1: short packet, message id, replies and failure count (test in the office)

I put together a basic test node which sends just 3 unsigned longs: message id, number of retries and recorded failures. I then had a jeelink with a receiver sketch that printed this packet out to serial and recorded the number of failed packets from its perspective: by counting failures as any gaps in the message id.

Testing with jeelib, 12 byte packet every 5s, ~4m apart, 5 retries and a 10ms wait time for an ack and another 10ms before trying to send a packet again:

total number of messages: 29080
total retries: 19271
total failures (node): 933
total failures (receiver): 932
success rate: 96.8%
retry rate: 66%
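The two rates above appear to be derived from the totals as follows. This is a small Python sketch of that arithmetic under my reading of the figures; `link_stats` is a made-up name.

```python
def link_stats(messages, retries, failures):
    """Return (success %, retry %) from the totals reported by the node.

    success: share of messages that were eventually acknowledged
    retry:   retries per message sent, which can exceed 100% on a poor
             link because each message may be retried more than once
    """
    success = 100.0 * (messages - failures) / messages
    retry = 100.0 * retries / messages
    return success, retry


if __name__ == "__main__":
    success, retry = link_stats(messages=29080, retries=19271, failures=933)
    print(f"success rate: {success:.1f}%  retry rate: {retry:.0f}%")
```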

Test 2: Long packet (test at home)

As I mentioned here I've been working on adding continuous sampling to the emontx and emonpi code. One of my goals for the new firmware is to also record accumulating watt-hours on the emontx to reduce dependency on reliable radio and internet connection for accurate kwh data.

Adding 4x Long type watt hour values to the emontx packet, which already has 6x temperature values, extends the packet to a total of 42 bytes (decoder: h,h,h,h,h,L,L,L,L,L,h,h,h,h,h,h). Alongside this I wanted to record message id, number of retries and failure count, bringing the packet length to a total of 54 bytes (66 bytes is the jeelib limit, low power labs limit 61 bytes).
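The packet lengths can be checked with Python's struct module, whose format characters match the decoder string quoted above (h = 2-byte signed int, L = 4-byte unsigned long). This is just a size check, assuming a packed little-endian layout with no padding; the constant names are made up.

```python
import struct

# "<" selects little-endian with standard sizes and no alignment padding.
BASE_FORMAT = "<" + "h" * 5 + "L" * 5 + "h" * 6
# Three extra unsigned longs: message id, number of retries, failure count.
TEST_FORMAT = BASE_FORMAT + "L" * 3


def payload_size(fmt):
    """Size in bytes of a payload described by a struct format string."""
    return struct.calcsize(fmt)


if __name__ == "__main__":
    print(payload_size(BASE_FORMAT), payload_size(TEST_FORMAT))
```

Both sizes fit within the 61 and 66 byte limits mentioned above, but the longer packet leaves little headroom for further values.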

I noticed that the emontx would crash after a number of hours, start resetting and stop (other nodes connected to the same emonpi continued ok). The packet reliability was also quite bad, so I manually set some of the CT channels I was not using to be non-zero. The emontx kept crashing, so I thought I'd try a low power labs version. Initially the low power labs version would not run, it kept resetting, at which point I realised that this was of course a power issue with the longer packet and powering the emontx from the acac supply. I adjusted the transmit power level down on the low power labs version and it started working reliably.

I then adjusted both the low power labs test and the jeelib test to run as similar code as I could: same number of retry attempts: 2, same ack wait time: 40ms, same delay between retry attempts: 50ms. Low power labs implements encryption so I did not non-zero any of the values; the jeelib test ran with temperatures set to 3000. The power level on the low power labs test was set to 10. I'm not sure what the power level is on the jeelib test. Both were set to a 9.8s transmit time.

The results of this test were:

Low Power Labs:

test duration 15.7 hours

total message count: 5710
total retries: 915
total failures (emontx): 256
total missing emoncms: 472
total points at 10s data interval: 5659

emontx success rate: 95.5%
emoncms success rate: 91.7%
retry rate: 16%

Jeelib

emontx crashed after 3 hours:
message count: 1076
number of retries: 1371
number of failures (emontx): 299
total datapoints in emoncms if full: 1058
total datapoints in emoncms recorded: 841
total failures (emoncms): 217

The success rate over this period from emoncms perspective was 79.5%
The success rate from the emontx's perspective was 72.2%
The number of retries was 127%

I then restarted the test, moving both emontx's to the exact same location, increased the wait between retries to 100ms and disabled the LED on both, to see if the jeelib test would run for longer. This test is ongoing.

Test 3: short packet again (test in the office) jeelib

Updated tx to have 40ms ack wait time, 2 retries and 100ms retry delay.

total number of messages: 14768
total retries: 7208
total failures (node): 316
total failures (receiver): 314
success rate: 97.8%
retry rate: 49%

This testing is still ongoing so there's no conclusion yet. I need to investigate the power supply question a bit more on the emontx. It should be possible to set the power level with jeelib in addition to lowpowerlabs with uint16_t rf12_control(uint16_t cmd). Perhaps I should run a comparison with a smaller packet length too. It would be useful to be able to get all the data out of the emontx, watt hours and temperature values, but maybe there's another way to do it: breaking the packet down into multiple packets, requiring an additional power supply when we want everything recorded, or reducing the power level.

We've discussed the goal to have encryption on the rfm network for some time and encryption has the added bonus of solving the zero value problem. The low power labs library also resulted in smaller retry rates.

Personally I'm edging towards using the low power labs library as it's designed with the rfm69 in mind, but it would be a big task to make the switch as it doesn't support the rfm12 and so would create its own problems with backwards compatibility.

I'll keep at the parallel testing for a while. I'll see if I can sort the power issue on the continuous sampling long packet test and report back soon.

I'm now building quite a collection of lowpowerlabs examples here for anyone interested:
https://github.com/openenergymonitor/emonLPL

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by Bramco on Thu, 10/12/2015 - 17:08.

Trystan,

A couple of things spring to mind. Firstly, is there any reason all the data has to be sent together? Could you send alternate packets with the data split into 2 blocks, so that you are sending short packets? These seem to have a better success rate. The TX could send as if it is 2 nodes: one for temperatures, one for power, for example.

Secondly, although I haven't looked at the libraries, I'm guessing there is a fair amount of code you never use. Could you not strip the libraries down to only the essential part, or even make your own libraries? I haven't looked at the code for a while but I'm pretty sure Martin's PLL code didn't use the libraries, so it may have the kernel of what's needed at the sending end.

It would be good to find out why the jeelibs version crashed as this may possibly have something to do with the issues.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Fri, 11/12/2015 - 13:14.

Thanks Bramco, yes that could be a possibility. Perhaps in the long term we need to develop the way we send data further. One of the challenges at the moment is the need to define decoders in emonhub.conf. I'd like to explore the possibility of sending the decoder definition as a configuration packet at startup to make this step easier, which would then require a way to identify whether a packet was a data packet or a configuration packet. This would need perhaps one byte at the start of a packet, which could then identify up to 256 different packets from one node. That would be a very large change, however, but something to think about.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by Robert Wall on Fri, 11/12/2015 - 13:55.

"Could you not strip the libraries down to only the essential part..."

Transmit-only? Like I did in the 3-phase sketch?

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by pb66 on Fri, 11/12/2015 - 14:44.

I haven't had much time to delve much deeper into this recently, but I was asked (at the 11th hour) to install a 3ph system not so long back. Due to the short deadline and no ongoing access I had to put something together that was fairly bullet-proof without thorough testing.

I installed 3 x emonTx v3.4's running a cut down version of the discrete sampling sketch v2.0. I simply removed all the temp sensing and pulse counting etc. to reduce the packet size and also reduce any variation in the send interval due to interruptions etc. I adjusted the send interval to target 10s, and when tested individually the timestamps assigned by emonhub were consistently ~9.98s.

The packet is power1, power2, power3, power4, voltage and a packet counter that simply gets incremented immediately after each send.

The 3 emonTx's are installed in a box with a Raspberry Pi and RFM69Pi (and Wi-Fi dongle) all the emonTx's are AC-AC powered, one off each phase and the Pi has a AC-DC power supply, so signal strength is very strong, consistent and as far as I could tell no outside influence.

There are no ACK's in play and there is no synchronization beyond initially waiting 2secs between powering each emonTx up over 2 months ago.

Due to a proxy server this data didn't start to flow or get recorded until the 25th Nov, The pulse counters are simply logged to a timeseries feed at emoncms.org and last night I exported the pulse count feeds and found these results.

               Counted   Missing   Total    Success Rate
node/phase 1   125275    12011     137286   91%
node/phase 2   125265    9213      134478   93%
node/phase 3   124026    14803     138829   89%
TOTALS         374566    36027     410593   91%

The "missing" are a sum of all the counter increments greater than 1. This success rate wasn't a surprise and with 3 nodes and no ACK or sync a 10% error can be at least expected even if not acceptable.

The real eye opener came when looking closer at the intervals etc

Node   Time span (s)   Interval count   Average interval   Min interval   Max interval
1      1330376         137286           9.690544s          4s             1523s
2      1330402         134478           9.893083s          4s             603s
3      1330417         138829           9.583135s          4s             1415s

Subtracting the 1st timestamp from the last to get the time spanned, then dividing by the packets accounted for above (logged or missing), gives us an average logging interval of 9.73s, which is only 0.25s out. However, many of the recorded intervals are well under 10s, especially when following a late arrival, e.g. 14secs then 6secs. Since there is no ACK or coding for retries, syncing or buffering, the sends must be getting delayed automatically due to traffic.
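The per-node averages in the table can be reproduced from the time span and the packet counts. The Python sketch below uses the figures from the two tables; the names `NODES` and `average_interval` are made up for illustration.

```python
# (time span in seconds, packets counted, packets missing) per node,
# copied from the tables above.
NODES = {
    1: (1330376, 125275, 12011),
    2: (1330402, 125265, 9213),
    3: (1330417, 124026, 14803),
}


def average_interval(time_span_s, counted, missing):
    """Average send interval: time spanned divided by all packets sent,
    whether they were logged or inferred as missing from the counter."""
    return time_span_s / (counted + missing)


if __name__ == "__main__":
    for node, (span, counted, missing) in NODES.items():
        print(node, f"{average_interval(span, counted, missing):.6f}s")
```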

This revelation makes me question if the use of acks (and/or controlled retries) is actually imposing a limit and adding to the failures. It will certainly cause "missing" datapoints.

When used with a fixed interval feed, the extended intervals caused by "delays" will cause empty datapoints, and the shortened intervals caused by a catchup could (unless perfectly aligned) get overwritten by the next arrival and ignored.

There are a vast number of intervals greater than 10s, but less than 0.15% (that's well under a sixth of one percent) of the intervals recorded were greater than 30s, so the massive "max intervals" were a bit misleading at first. What they do show is prolonged "outages", presumably due to the slow move through a clashing cycle caused by similar send intervals, as the posting always resumes with the correct count and there is no resetting or user intervention at all. I am also using original emonhub with buffering, and there is never more than one emonTx "not reporting" at any one time.

Plus, in many instances when the "blocked" emonTx reappears another is blocked instantly. If you look at the last page of the attached spreadsheet I have highlighted a few examples in colour; in the first, in blue, you can see node 1 and 3 jostle for dominance, each blocking the other.

These results seem to suggest any attempt to set the send interval to match the fixed interval is pointless if there is any chance the packet will not reach the receiver at the first attempt. This would also be the case in your own examples, as a 9.8s interval delayed by waiting for an ack then delaying for a retry, twice over, will probably be over 10 seconds and result in a missing datapoint anyway.

I'm sure the success rate would be heavily impacted if the packet length was extended for the Wh values or temperatures, but I also think dividing the packet would increase the traffic and result in many more successful half packets, while the impact on the whole data may not improve much, especially as the overhead size (and processing time) is increased for the same size data. It would be interesting to test that theory. One instance where it could work is IF temperature and/or Wh totals were being sent less frequently, e.g. every 60s with power every 10s (this could also be done using 2 packet sizes for the same node with some mods to emonhub, cycling 5 short then 1 long packet).
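The "5 short then 1 long" cycle could look something like this on the sending side. This is a Python sketch of the scheduling only, with made-up names; the real thing would live in the node's sketch and would need the emonhub changes mentioned above.

```python
from itertools import cycle, islice


def packet_schedule(short_per_long=5):
    """Repeat forever: 'short' short_per_long times, then one 'long'.

    A short packet would carry power only; a long packet would also
    carry the slower-changing temperature and Wh values.
    """
    return cycle(["short"] * short_per_long + ["long"])


if __name__ == "__main__":
    # With a 10s send interval the long packet goes out once every 60s.
    print(list(islice(packet_schedule(), 12)))
```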

This is a link to the spreadsheet I was using (in my dropbox, as it's 28Mb so I couldn't upload it here). It's not labelled up all that well but should be self explanatory: for each feed there is a timestamp and a count, to which I've added a calculated "interval" and "packets missed".

(or the feeds used are 95037,95038 and 95407 so you can check the data directly on the emoncms server.)

Paul

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by Bramco on Fri, 11/12/2015 - 16:17.

Robert, yes, that was what I was thinking. I haven't looked at the libraries but from experience there are fewer bugs in a smaller code base.

Paul, interesting analysis. And it confirms what Trystan is seeing which begs the question can you ever get to 100% reliability with the rfm systems. Almost makes you think it might be worth switching to an ESP8266 based system with WiFi.

Also I wouldn't change emonhub, I'd just switch to thinking of the nodes as virtual, i.e. they can be housed on one physical device.

In my case for instance I have one emonTX doing PV diversion and monitoring my electric as well as measuring temperatures on my heat bank. So actually logically they would be better to be sent as two packets with different node ids even though one physical device is doing the work.

Also Trystan, given things are not as reliable as we'd like, it may make things less reliable to start relying on configuration packages. You'd have to build in quite a lot of error correction and failsafe into things if you do that, e.g. what happens after a power failure, or a restart of one of the systems etc. Personally I haven't any nodes defined in emonhub, I just use the default. With my CH system which is on an ESP8266 I've had to do quite a bit of code to handle various system outages and I'm still not sure I've got everything covered.

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by pb66 on Fri, 11/12/2015 - 16:53.

"One of the challenges at the moment is the need to define decoders in emonhub.conf"

I agree, however this "need" is imposed by emoncms's nodes module and fixed payloads, not emonhub.

Sometime ago I had a working example of a payload structure that could be defined in the node using signed or unsigned 1,2,4 & 8 byte ints/longs plus floats and then be automatically decoded AND scaled by emonhub with no previous knowledge of the payload at a cost of 1byte per value, which also worked to prevent long zero runs as each values data was separated by the key byte.

I have since scrapped that idea as the names etc. still need allocating, and it really doesn't make sense to me to "define" the payload in the one place that is not editable, i.e. an uneditable sketch usually connected by RF.

I prefer to define the payloads in emonhub if anywhere (it shouldn't be mandatory), not just to match pre-defined payloads but to actually define them.

I am moving towards a system of having a "catalog" of possible variables that could be returned by a particular node (like the continuous sampling sketch) that I can add to the emonhub.conf to actually define the remote node's payload. By setting varX, varY, varZ in emonhub, this would be sent to the node and the payload changed to return those variables. Any change to the payload could then be edited in one place, executed in the node and updated to emoncms, eliminating any chance of mismatched payloads. It keeps things flexible, and the payloads need only be as long as required (at that time) with no additional payload overhead, just the occasional update packet.

By opening the door on "2-way RF comms" for control packets and payload definition, we could also look at calibration and other settings. For battery devices this could be done using a "message pending" flagged ack to change the sleep pattern for receipt of the message.

Paul

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Mon, 14/12/2015 - 17:29.

Thanks a lot for the input on this Paul. I'm not sure that I understand where your minimum update times of 4s are coming from, if the packets are being timestamped in emonhub? The longer-than-the-interval updates make sense I think.

For systems with acks: it should be possible to set the interval and number of retries to always ensure the timing is within 9.8s, i.e. sending every 9.4s with 2x retries at 40ms ack wait + 100ms retry delay = max time of 420ms. The min time should then be 9.4s and max time 9.82s?
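A minimal sketch of that arithmetic, assuming the 420ms figure comes from three attempts (the first send plus 2 retries), each costing one ack wait and one retry delay. The function name `send_window` is made up for illustration.

```python
def send_window(interval_s, retries, ack_wait_ms, retry_delay_ms):
    """Return (min_time_s, max_time_s, worst_case_ms) between delivered packets.

    Assumption: every attempt, including the first, can cost a full
    ack wait plus a retry delay before the packet finally gets through.
    """
    attempts = 1 + retries
    worst_case_ms = attempts * (ack_wait_ms + retry_delay_ms)
    return interval_s, interval_s + worst_case_ms / 1000.0, worst_case_ms

min_s, max_s, worst_ms = send_window(9.4, retries=2, ack_wait_ms=40, retry_delay_ms=100)
print(worst_ms)                              # 420
print(round(min_s, 2), round(max_s, 2))      # 9.4 9.82
```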

I wonder if there would be a way to add a slight random factor that would avoid aligned collisions? Let's say a 100ms variation in send interval?

I'd like to check, via the message counter method, how much adding the ACK check actually reduces the number of failures.
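To illustrate the effect of such a random factor, here is a rough simulation of two nodes that start perfectly aligned. Everything in it is an assumption for illustration: the jitter is taken as 0 to 100ms added to each interval, a "collision" is two sends within a nominal 5ms air time of each other, and the function names are made up.

```python
import random

def send_times(n, base_ms, jitter_ms, rng):
    """Cumulative send times (ms) for n packets with uniform 0..jitter_ms added."""
    t, times = 0.0, []
    for _ in range(n):
        t += base_ms + rng.uniform(0, jitter_ms)
        times.append(t)
    return times

def count_collisions(a, b, airtime_ms=5.0):
    """Count packet pairs sent within airtime_ms of each other (assumed air time)."""
    return sum(1 for x, y in zip(a, b) if abs(x - y) < airtime_ms)

n = 100
# No jitter: both nodes stay aligned and clash on every packet.
fixed = count_collisions(send_times(n, 9400, 0, random.Random(1)),
                         send_times(n, 9400, 0, random.Random(2)))
# With jitter: the two schedules drift apart, so far fewer sends coincide.
jittered = count_collisions(send_times(n, 9400, 100, random.Random(1)),
                            send_times(n, 9400, 100, random.Random(2)))
print(fixed, jittered)
```

Without jitter every one of the 100 packets clashes; with jitter the count depends on the random draws, so no figure is quoted here.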

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by pb66 on Mon, 14/12/2015 - 23:09.

I'm not sure I "understand" exactly why/how they occur myself, but I suspect JeeLib or the rfm itself is pausing or retrying and these attempts appear to go on for up to 6 seconds.

See the attached clip of 9mins from feed 95037 (90 x timeseries datapoints 1448440750 to 1448441671). The "interval" is derived by subtracting the previous timestamp from the current one, and the "missing" by subtracting the previous record's count from the current one, less 1 for the expected increment.
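The two derived columns can be sketched as below. The function name `annotate` and the sample records are invented for illustration; they are not the values from feed 95037.

```python
def annotate(records):
    """records: list of (timestamp_s, count) tuples in arrival order.

    Returns one row per record after the first, with:
      interval = current timestamp - previous timestamp
      missing  = current count - previous count - 1 (the expected increment)
    """
    rows = []
    for (prev_t, prev_c), (t, c) in zip(records, records[1:]):
        rows.append({
            "timestamp": t,
            "count": c,
            "interval": t - prev_t,
            "missing": c - prev_c - 1,
        })
    return rows

# Invented sample: packet 292 never arrives, so the gap doubles to 20s.
sample = [(1000, 290), (1010, 291), (1030, 293), (1040, 294)]
for row in annotate(sample):
    print(row["count"], row["interval"], row["missing"])
# 291 10 0
# 293 20 1
# 294 10 0
```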

The timestamp is set by emonhub upon arrival from the rfm69pi, the feed is timeseries so the timestamps are not changed in emoncms.

The count from "259" to "352" is consistent apart from "292", "315", "317" and "350" missing; there is an additional gap of around 10s for each of those missing packets (marked in red).

There are also a number of values coming in late (marked in blue) with corresponding shorter "catch up" intervals. The count is ok, and the "catch-up" packet continues the intervals as they were prior to the "late" arrival. So the packets seem to be produced and initially sent at 10s intervals, but it appears the delayed packets do not affect the next packet; somehow each is getting "delayed" independently in transit.

In my emonTx sketch I'm using a "calculate and send if greater than interval" in the main loop. Timestamps aside, the success rate is pretty good for 3 unmanaged emonTx's in a box, so much so that the timestamp could almost be ignored in favour of just using the "count". After all, the actual timestamp we should be recording is when the data was recorded, not when it was received; the count increment tells us which 10s slot it belongs to.

The adjustment to the interval to allow for the possible retries you suggest might work for the battery devices sending every minute or so, as 0.25 to 0.5 of a sec in 60 secs only represents a 0.4 to 0.8% adjustment. But on a 10 second interval, dropping to 9.4 secs dictates a 94% success rate, even if no packets are lost, unless retries are needed. That effectively makes a success rate of over 94% dependent on the first send attempts failing, because if successful there is a greater chance the packet will be overwritten within the same fixed interval in emoncms. IMO it would be better to aim for 10s and then, IF delayed, the data rolls over to the next fixed interval. Yes, that will result in an empty data point, but at least all the data is retained and 100% success is at least a possibility.
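Using the count to place each reading in its 10s slot could look something like the sketch below. The function name `slot_timestamp`, the reference packet and the sample values are all hypothetical, and it assumes the counter never resets or wraps and that the node really does produce one packet per 10s.

```python
def slot_timestamp(ref_timestamp, ref_count, count, interval_s=10):
    """Nominal time a packet was produced, derived from its message count.

    ref_timestamp / ref_count: a packet known to have arrived on time.
    """
    return ref_timestamp + (count - ref_count) * interval_s

# Reference packet 290 arrived on time at t=1000.
# Packet 291 arrives 6s late (t=1016), packet 292 on time (t=1020).
arrivals = [(1016, 291), (1020, 292)]
for arrival, count in arrivals:
    print(count, arrival, slot_timestamp(1000, 290, count))
# 291 1016 1010
# 292 1020 1020
```

The late packet is put back in its own slot instead of crowding the next one.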
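The 94% figure can be checked with a small simulation. It assumes perfectly regular arrivals with no lost packets, and that only one value per fixed 10s slot survives (any later arrival in the same slot overwrites the earlier one). The function name `stored_fraction` is made up; times are in integer milliseconds to avoid rounding surprises.

```python
def stored_fraction(send_interval_ms, slot_ms=10000, packets=1000):
    """Fraction of sent packets that end up as distinct fixed-interval datapoints."""
    slots = set()
    for n in range(packets):
        arrival_ms = n * send_interval_ms
        slots.add(arrival_ms // slot_ms)   # a later packet in the same slot overwrites
    return len(slots) / packets

print(stored_fraction(9400))    # 0.94
print(stored_fraction(9300))    # 0.93  (a further 100ms reduction)
print(stored_fraction(10000))   # 1.0
```

So with a 9.4s send interval roughly one packet in 17 lands in an already filled slot, even when nothing is lost over the air.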

Using "if millis() >= last update + 10s" and no deep sleep on 10s update devices, and a small interval reduction on 60s update devices using Sleepy, might be the way to go. But using Sleepy "lose some time" and interrupts will still make the interval unpredictable, and if data is being overwritten or the device is sending more frequently due to inaccurate time keeping, it may actually be more battery friendly to keep millis() going and not use Sleepy "lose some time". The efficient use of battery life has to be measured in usable data rather than time in the device: 6 months of good data is better than 9 months of patchy data, plus you can always extend the interval if you want more life.

I guess it boils down to making sure every transmission counts being the most efficient use of resources, whether it be battery power for the emonTH or "air space" for the more frequent emonTx's.

The 100ms random bit might help the long aligned clashes, but again the variation might take you into the next datapoint unless you reduce the interval even further. Another 100ms on a 10s interval means 93 to 94%. That's not far off the results of my 3 unmanaged emonTx's in a box or your tests above.

Paul

EDIT - added a spreadsheet of the 9mins of data as the png didn't scale so well (added ".txt" to upload, save as 95037.xlsx)

Re: RF69 reliability, timing, temp sensors, mqtt & lowpowerlabs

Submitted by TrystanLea on Wed, 16/12/2015 - 14:38.

Thanks Paul, good points. I guess the most important thing is that the graphs appear complete rather than a 100% success rate from the transmitter's perspective, hence why I wasn't really worried about sending at 9.4s resulting in an overwrite every 16 packets. But yes, good point, and yes, battery life would also be a very important consideration.

A couple more tests: two emontx v3.4's overnight on jeelib, 2x retries, 40ms ack wait time, 100ms retry time.

Emontx 1 running discrete sampling code: 5622 msg, 5469 retries, 1343 fail, 76%
via emoncms graph:
@evening 1929/2279 = 85%
@2kw 1130/1338 = 84%
@50w 969/1136 = 85%

Emontx 2 running continuous sampling code: 5687 msg, 4354 retries, 1153 fail, 80%
via emoncms graph:
@evening 2051/2279 = 89%
@2kw 1152/1338 = 86%
@50w 1022/1136 = 90%

The failure rate is greatly affected by temperature. I'll get a screenshot of this later; it's quite pronounced.
