Bug #2380
closedlogical channels can get stuck if BTS sends no RELEASE INDICATION for SACCH / T3109 can be disabled
100%
Description
Every so often we received bug reports from users indicating "stuck" logical channels in the BSC, which accumulat over time and don't get automatically recycled. Unfortunately I cannot seem to find any issue here in redmine about it, but I just have experienced something similar here during testing with osmo-bts-virtual, virtphy and my L1CTL/LAPDm test cases from osmo-ttcn3-hacks.
From the BSC side, formerly-used channels would still show up in "show lchan", even on a completely idle cell:
OpenBSC> show lchan BTS 0, TRX 0, Timeslot 0, Lchan 1: Type SDCCH Connection: 0, State: RELEASE REQUESTED BS Power: 3 dBm, MS Power: 36 dBm Channel Mode / Codec: signalling No Subscriber Bound IP: 0.0.0.0 Port 0 RTP_TYPE2=0 CONN_ID=0 Measurement Report: Flags: DLinval RXL-FULL-ul: -110 dBm, RXL-SUB-ul: -110 dBm RXQ-FULL-ul: 0, RXQ-SUB-ul: 0
I can not trigger this with a well-behaving MS (e.g. OsmocomBB performing a LU). In that case, the lchan is released as expected (rsl_lu_normal.pcapng)
However, from a L1CTL based test case that doesn't properly establish LAPDm on the radio channel, it can be reproduced quite easily ( (rsl_est_fefe_fail.pcapng).
Initial comparison of the pcap traces shows that the erroneous case (causing stuck lchan) doesn't ever show a RELEASE INDICATION from BTS to BSC. This is not a surprise, as there is no functional LAPDm entity in the test case, and hence there is no established SACCH that could be released.
Files
Related issues
Updated by laforge over 6 years ago
- Related to Bug #49: T3109 not implemented added
Updated by laforge over 6 years ago
- Related to Bug #50: Abnormal channel release handling is wrong added
Updated by laforge over 6 years ago
- Related to Bug #1610: RF channel release skipping T3109 added
Updated by laforge over 6 years ago
- Status changed from New to In Progress
While the test case without any real LAPDm instance on the MS side is of course fairly artificial, this can of course also occur in normal operation, in at least the following cases:
- MS runs out of coverage, out of power or reboots before being able to send LAPDm DISC (genreating RELEASE INDICATION at BTS)
- radio interference at the time of the LAPDm DISC
- malicious or buggy MS
This might explain some very sporadic failure cases in real-world setups
Updated by laforge over 6 years ago
- % Done changed from 0 to 50
The problem was that in the specific config file I was using T3109 was set to 0. And without T3109 being ever started, the erroneous channel release procedure doesn't complete. I still think this is a bug. we either should reject a value of zero, or otherwise remain in a clean / sane state, no matter what the user configures.
Updated by laforge over 6 years ago
- Subject changed from logical channels can get stuck if BTS sends no RELEASE INDICATION for SACCH to logical channels can get stuck if BTS sends no RELEASE INDICATION for SACCH / T3109 can be disabled
- % Done changed from 50 to 80
See patches https://gerrit.osmocom.org/3305 and https://gerrit.osmocom.org/3306 adressing this problem.
Updated by laforge over 6 years ago
Please note that even the osmo-gsm-tester config file templates suffer from this, see https://gerrit.osmocom.org/3308
Updated by ipse over 6 years ago
My personal thank you for fixing this. It's been a very annoying issue in our installations.
Updated by laforge over 6 years ago
On Thu, Jul 20, 2017 at 10:27:16AM +0000, ipse [REDMINE] wrote:
My personal thank you for fixing this. It's been a very annoying issue in our installations.
Did you have T3109 set to 0 in your config? If not, it might be another
bug, so don't thank me too soon.
It's of course always appreciated if you could help us expanding the
test coverage :)
Updated by zecke over 6 years ago
On 20. Jul 2017, at 01:22, laforge [REDMINE] <redmine@lists.osmocom.org> wrote:
Issue #2380 has been updated by laforge.
% Done changed from 0 to 50
The problem was that in the specific config file I was using T3109 was set to 0. And without T3109 being ever started, the erroneous channel release procedure doesn't complete. I still think this is a bug. we either should reject a value of zero, or otherwise remain in a clean / sane state, no matter what the user configures.
Keeping this 0 (== not do anything) behavior was a bad compatibility decision... IIRC we (I) made timers configurable before we implemented them and I didn't want to reject t3109 0. :(
But there is a (delayed) SACCH release indication issue with the lchan code. I posted twice to the ML but won't have time to fix it right now (SACCH release indication arriving after the timeout will do bad things).
Updated by laforge over 6 years ago
- Status changed from In Progress to Closed
- % Done changed from 80 to 100
Updated by ipse over 6 years ago
laforge wrote:
On Thu, Jul 20, 2017 at 10:27:16AM +0000, ipse [REDMINE] wrote:
My personal thank you for fixing this. It's been a very annoying issue in our installations.
Did you have T3109 set to 0 in your config? If not, it might be another
bug, so don't thank me too soon.
Yes, they were set to 0, because we copy-pasted that part of the config from examples. I've manually changed them to the defaults from your patch - we'll see if we see stuck channels again in the next few days.
Updated by ipse over 6 years ago
We've monitored our installation were we changed the timer values to the reasonable defaults and amount of stuck channels seem to be better but is not 100% gone. After a surge in traffic we've got one BTS which has 2 stuck channels. BTS is still running with those channels stuck - let me know if you want me to provide some more debugging info about it.
PS I didn't restart osmo-nitb or osmo-bts's after setting timer values, as I assume they are applied immediately to new channels. Is that correct?
Updated by ipse over 6 years ago
We've just got another stuck channel in the following state, so it looks like the issue is not completely resolved:
OpenBSC> show lchan 6 1 0 BTS 6, TRX 1, Timeslot 0, Lchan 0: Type NONE Connection: 0, State: BROKEN UNUSABLE Error reason: activation timeout BS Power: 35 dBm, MS Power: 33 dBm Channel Mode / Codec: signalling No Subscriber Bound IP: 0.0.0.0 Port 0 RTP_TYPE2=0 CONN_ID=0 Measurement Report: Flags: DLinval RXL-FULL-ul: -110 dBm, RXL-SUB-ul: -110 dBm RXQ-FULL-ul: 0, RXQ-SUB-ul: 0
Interestingly, so far I've only seen TCH channels stuck and I've never seen SDCCH channels stuck.
I tried to analyze whether something else was happening at the time of the issue and it looks like another BTS had a spike in traffic making both of its 2 TRX saturated AND we had relatively high load average on the server at the same time (1.2 vs normal 0.2-0.5) which makes me think it's IO related. I think what may be happening is that a (relatively) high time waiting for our SQLite DB may have led to a timeout and stuck the channel. Not sure how realistic this is.
Happy to run some more debugging or provide NITB/BTS logs if there is any interest in fixing this.
Updated by ipse over 6 years ago
PS I also see an OML/RSL fail at around the same time, so I wonder if this is related. I failed to find this moment in the logs after a quick look, though, so can't correlate. =|
Updated by zecke over 6 years ago
@ipse/all: Please see: http://lists.osmocom.org/pipermail/openbsc/2017-April/010560.html.
One issue is if the SACCH deactivation ack arrives after the timeout (e.g. because it is delayed through TCP) we will try to release the channel again and as it fails mark the channel as broken. It should only be a couple of hours of work (it can be easily simulated with my Smalltalk fakebts code or newer approaches, ttcn-3?) but sadly the combination of day job + commute keep me busy enough.
Alex, I hope that fairwaves finds the time to move the lchan to use a FSM or can sponsor that work? I am afraid that my non OpenBSC/Osmocom paid work and commute don't leave too much time to fix such issues.
Updated by keith over 6 years ago
Rhizomatica installations have had
timer t3109 4
for as long as I have been monitoring this issue.
After patches from zecke a few months back, we now recover all stuck or "broken" channels except for those resulting from NACK on activation.
I currently use a cronjob to wait for a quiet moment and then drop the bts when these accumulate.
I found it hard to impossible to reproduce anything in the production environment, or to track down any identifiable cause, so I kind of gave up on it for the moment.