Bug #4061

Prolonged remaining in state with RSL link gone, OML link open

Added by keith about 1 month ago. Updated 17 days ago.

Status: Closed
Priority: High
Assignee:
Category: -
Target version: -
Start date: 06/14/2019
Due date:
% Done: 100%
Spec Reference:

Description

Possibly due to an unreliable Abis link (WiFi), osmo-bts seems to get into a state
where the OML link is open but RSL is down. I believe osmo-bts used to detect this and exit, subsequently being restarted by the OS.

"drop bts connection X oml" from the bsc side does not do anything.

I will try to investigate more and add info here each time I find it happening.

root@sysmobts-v2:~# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 172.16.0.11:51368       172.16.0.1:3002         ESTABLISHED

History

#1 Updated by keith about 1 month ago

VTY info when osmo-bts is in this state:

OsmoBTS# show version
OsmoBTS 0.8.1.202-e1da (OsmoBTS).
OsmoBTS# show bts 0
BTS 0 is of FIXME type in band GSM850, has CI 0 LAC 0, BSIC 0 and 1 TRX
  Description: (null)
  Unit ID: 1000/0/0, OML Stream ID 0x00
  NM State: Oper 'NULL', Admin 'unknown 0x0', Avail 'Dependency'
  Site Mgr NM State: Oper 'Enabled', Admin 'unknown 0x0', Avail 'OK'
  Paging: Queue size 200, occupied 0, lifetime 0s
  AGCH: Queue limit 0, occupied 0, dropped 0, merged 0, rejected 0, ag-res 0, non-res 0
  CBCH backlog queue length: 0
  Paging: queue length 0, buffer space 200
  OML Link state: disconnected.
  TRX 0
    phy 0 0.0 dsp 0.0.0 fpga 0.0.0
  Features:
    001 GPRS                                    
    002 EGPRS                                   
    006 OML Alerts                              
    007 AGCH/PCH proportional allocation        
    009 Fullrate speech V1                      
    010 Halfrate speech V1                      
    011 Fullrate speech EFR                     
    012 Fullrate speech AMR                     
    013 Halfrate speech AMR                     
  base transceiver station:
   Received paging requests (Abis):        0 (0/s 0/m 0/h 0/d)
   Dropped paging requests (Abis):        0 (0/s 0/m 0/h 0/d)
   Sent paging requests (Um):        0 (0/s 0/m 0/h 0/d)
   Received RACH requests (Um):        0 (0/s 0/m 0/h 0/d)
   Dropped RACH requests (Um):        0 (0/s 0/m 0/h 0/d)
   Received RACH requests (Handover):        0 (0/s 0/m 0/h 0/d)
   Received RACH requests (CS/Abis):        0 (0/s 0/m 0/h 0/d)
   Received RACH requests (PS/PCU):        0 (0/s 0/m 0/h 0/d)
   Received AGCH requests (Abis):        0 (0/s 0/m 0/h 0/d)
   Sent AGCH requests (Abis):        0 (0/s 0/m 0/h 0/d)
   Sent AGCH DELETE IND (Abis):        0 (0/s 0/m 0/h 0/d)
OsmoBTS# show trx
TRX 0 of BTS 0 is on ARFCN 0
Description: (null)
  RF Nominal Power: 37000 dBm, reduced by 0 dB, resulting BS power: 37000 dBm
  NM State: Oper 'Disabled', Admin 'Unlocked', Avail 'OK'
  RSL State: disconnected
  Baseband Transceiver NM State: Oper 'NULL', Admin 'unknown 0x0', Avail 'OK'
  IPA stream ID: 0x00

#2 Updated by laforge about 1 month ago

  • Assignee set to Hoernchen
  • Priority changed from Normal to High

@Hoernchen, please look into this. I guess we need a proper review of the existing code, possibly resulting in the introduction of a proper FSM that takes care of connection failures.

In general, our policy for OsmoBTS has always (since its creation) been to "fail fast", i.e. to terminate the process and let it re-spawn once the OML link is down. I'm not aware of any change of the related code in recent months/years.

For RSL, one can argue that it could/should reconnect while keeping OsmoBTS running, but for OML a restart is the "safe" choice, as the RF carrier will be down during reconfiguration anyway. A restart of the process will (via our systemd service on osmo-bts-{sysmo,lc15,oc2g}) reset all state by reloading the FPGA bitstream and the DSP image, ensuring we always start from a 100% defined, clean state.

#3 Updated by laforge about 1 month ago

#4 Updated by laforge about 1 month ago

side note: it might also make sense to have a look how a nanoBTS behaves in comparison.

#5 Updated by Hoernchen about 1 month ago

I can confirm that osmo-bts does not detect a broken TCP connection if I drop the packets using iptables. A TCP connection with default settings will only time out after multiple hours, so this sounds like a reasonable explanation for the issue. osmo-bsc supports TCP keepalive via the config setting, but this is currently ignored by osmo-bts; it is only used by the BSC callback function in libosmo-abis.

I've pushed a patch for libosmo-abis at https://gerrit.osmocom.org/c/libosmo-abis/+/14564 that allows using the usual timeout setting for ipa clients like osmo-bts, which fixes the issue for me.
Example config lines for osmo-bts:

e1_input
 e1_line 0 driver ipa
 e1_line 0 port 0
 e1_line 0 keepalive 10 2 5
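
For readers unfamiliar with TCP keepalive, here is a minimal sketch of the Linux socket options such a setting typically ends up configuring. This is not the libosmo-abis code: `enable_tcp_keepalive` is a hypothetical helper, and it assumes the three numbers on the config line mean idle time in seconds, number of probes, and probe interval in seconds, in that order.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper, not the libosmo-abis API: enable TCP keepalive on an
 * existing TCP socket using the Linux-specific TCP_KEEP* options. */
static int enable_tcp_keepalive(int fd, int idle_s, int num_probes, int interval_s)
{
	int on = 1;

	/* Turn keepalive probing on for this socket. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	/* Seconds the connection must be idle before the first probe is sent. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
		return -1;
	/* Number of unanswered probes before the connection is dropped. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &num_probes, sizeof(num_probes)) < 0)
		return -1;
	/* Seconds between two consecutive probes. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) < 0)
		return -1;
	return 0;
}
```

Under that assumed mapping, `enable_tcp_keepalive(fd, 10, 2, 5)` would let the kernel declare an idle, silently broken connection dead after roughly 10 + 2 * 5 = 20 seconds instead of hours.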

#6 Updated by Hoernchen about 1 month ago

I've added TCP_USER_TIMEOUT to the patch, too. Keepalive only applies to idle connections, whereas this timeout applies to unacked data. The manpage says: "[...] failure may take up to 20 minutes with the current system defaults in a normal WAN environment."
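
As an illustration of the second option, a minimal sketch of setting TCP_USER_TIMEOUT on a Linux socket follows. `set_tcp_user_timeout` is a hypothetical helper and the timeout value is left to the caller; the value actually used by the patch is not stated here.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18 /* Linux option number, for older libc headers */
#endif

/* Hypothetical helper, not the libosmo-abis API: limit how long transmitted
 * data may remain unacknowledged before the kernel closes the connection.
 * The option takes its value in milliseconds. */
static int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}
```

This complements keepalive: keepalive probes cover the case where nothing is being sent, while the user timeout covers the case where data is queued but never acknowledged.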

#7 Updated by laforge about 1 month ago

Please also note that we don't have to rely on TCP keepalives or anything like that,
as there's the IPA PING/PONG mechanism as part of the IPA CCM sub-layer. Both sides should
actually send PING messages in periodic intervals, and give up if they don't receive a PONG
from the peer. It might be that we only implemented the "PING responder" part.

I recently added ipa_keepalive_fsm_start to libosmo-abis. It's used only in osmo-remsim,
but should probably be used by virtually any of our programs that implement IPA
multiplex, starting from BTS (Abis), BSC (Abis), MSC (SCCPlite, GSUP), HLR (GSUP),
all our CTRL interfaces, ... - but anything beyond RSL+OML in BTS+BSC is out of scope for this
ticket. I'll add separate tickets about this.
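
To make the PING/PONG idea concrete, here is a small self-contained sketch of the initiator side. It does not use or reproduce ipa_keepalive_fsm_start or any other libosmo-abis function; every name below (`ipa_build_ping`, `struct ka_state`, `ka_tick`, ...) is made up for illustration, and the one-second tick granularity is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPA_PROTO_CCM 0xfe /* IPA stream identifier carrying CCM messages */
#define IPA_CCM_PING  0x00 /* CCM message type: PING */
#define IPA_CCM_PONG  0x01 /* CCM message type: PONG */

/* Write a 4-byte IPA CCM PING frame: 16-bit big-endian payload length (1),
 * stream identifier, then the one-byte CCM message type.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t ipa_build_ping(uint8_t *buf, size_t buf_len)
{
	if (buf_len < 4)
		return 0;
	buf[0] = 0x00;
	buf[1] = 0x01;
	buf[2] = IPA_PROTO_CCM;
	buf[3] = IPA_CCM_PING;
	return 4;
}

enum ka_action { KA_NONE, KA_SEND_PING, KA_LINK_DEAD };

struct ka_state {
	unsigned int ping_interval_s; /* quiet time before a PING is sent */
	unsigned int pong_timeout_s;  /* how long to wait for the PONG */
	unsigned int elapsed_s;       /* seconds since last PING or PONG */
	bool awaiting_pong;
};

static void ka_init(struct ka_state *ka, unsigned int ping_interval_s,
		    unsigned int pong_timeout_s)
{
	ka->ping_interval_s = ping_interval_s;
	ka->pong_timeout_s = pong_timeout_s;
	ka->elapsed_s = 0;
	ka->awaiting_pong = false;
}

/* Call once per second; tells the caller what to do next. */
static enum ka_action ka_tick(struct ka_state *ka)
{
	ka->elapsed_s++;
	if (ka->awaiting_pong)
		return ka->elapsed_s >= ka->pong_timeout_s ? KA_LINK_DEAD : KA_NONE;
	if (ka->elapsed_s >= ka->ping_interval_s) {
		ka->awaiting_pong = true;
		ka->elapsed_s = 0;
		return KA_SEND_PING;
	}
	return KA_NONE;
}

/* Call when a CCM PONG arrives from the peer. */
static void ka_pong_received(struct ka_state *ka)
{
	ka->awaiting_pong = false;
	ka->elapsed_s = 0;
}
```

On KA_SEND_PING the caller would transmit the frame from `ipa_build_ping()`; on KA_LINK_DEAD it would tear the link down, which for OML in osmo-bts would fit the "fail fast" policy described earlier. Unlike TCP keepalive, this check runs end to end at the application layer, so it also catches a peer whose TCP stack still answers but whose IPA handling is stuck.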

#8 Updated by Hoernchen 18 days ago

  • % Done changed from 0 to 100

I'll close this for the time being. The patch that adds the TCP keepalive/timeouts for IPA clients was merged, so unless you happen to have a weird connection that intercepts TCP connections, the existing keepalive implementation will cover the BSC/BTS RSL/OML case.

#9 Updated by Hoernchen 17 days ago

  • Status changed from New to Closed
