Bug #2560

closed

osmo-bts-trx crash with sigabrt

Added by msuraev over 6 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Low
Assignee:
Category: -
Target version: -
Start date: 10/06/2017
Due date:
% Done: 90%
Spec Reference:

Description

It's been reported that osmo-bts-trx crashes under certain conditions on Ubuntu 16.04 x86_64.

Attached are a crash file (details can be extracted with apport-unpack), config files and pcaps.


Files

_usr_local_osmo-bts_src_osmo-bts-trx_osmo-bts-trx.0.crash (210 KB) msuraev, 10/06/2017 03:14 PM
openbsc_osmobsc.conf (6.22 KB) msuraev, 10/06/2017 03:16 PM
osmo-bts.cfg (624 Bytes) msuraev, 10/06/2017 03:16 PM
osmo-bts-trx core dumped.pcap (61.7 KB) msuraev, 10/06/2017 03:16 PM
osmo-bts-trx (1.5 MB) msuraev, 10/09/2017 02:44 PM
Actions #1

Updated by msuraev over 6 years ago

Backtrace:

Core was generated by `/usr/local/osmo-bts/src/osmo-bts-trx/osmo-bts-trx -c /root/osmocom_files/osmo-b'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
325     localealias.c: No such file or directory.
(gdb) bt
#0  0x00007fd6a8e24428 in read_alias_file (fname=<optimized out>, fname_len=<optimized out>) at localealias.c:325
#1  0x0000000000413692 in down_fom (msg=0xbf7570, bts=0xc2e620) at oml.c:1128
#2  down_oml (bts=0xc2e620, msg=0xbf7570) at oml.c:1437
#3  0x000000000042532d in sign_link_cb (msg=<optimized out>) at abis.c:166
#4  0x00007fd6a97f2dc4 in ?? ()
#5  0x0000000000cba648 in ?? ()
...

Actions #2

Updated by msuraev over 6 years ago

Actions #3

Updated by laforge over 6 years ago

  • Priority changed from Normal to High
Actions #4

Updated by msuraev over 6 years ago

  • % Done changed from 0 to 50

The workaround in gerrit 4232 should prevent the crash. The reason, as pointed out by Neels on the ML, is the OSMO_ASSERT(trx); in trx_phy_instance(). This is triggered in osmo-bts-trx (although I did not manage to reproduce it locally) when an attribute request arrives at a time when the TRX is not yet available. I suspect this is due to a missing/in-progress connection to osmo-trx.

The right fix would be to only reply when the TRX is available. But that would require either storing the request and properly plugging the responder into TRX init, or making sure that the TRX is always available by delaying the osmo-bts-trx connection to the BSC until it's ready. Not sure if either is worth pursuing ATM.

Alternatively, the BSC can detect this situation and re-request attributes later on (not sure at which point though).

The downside of the workaround in gerrit 4232 is that some TRX-specific attributes might not be reported to the BSC. So far it's purely informational: the only thing we do with the response is logging.
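
To illustrate the idea of replying without TRX-specific attributes instead of hitting an assert, here is a minimal sketch. It is not the code from gerrit 4232: the types `sketch_bts` and `sketch_trx`, the `phy_ready` flag and both functions are hypothetical stand-ins, and the real osmo-bts structures look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_TRX 4

/* Hypothetical, simplified stand-ins for the real osmo-bts structures. */
struct sketch_trx {
	int nr;
	int phy_ready;	/* assumption: 1 once the link to the transceiver is up */
};

struct sketch_bts {
	struct sketch_trx *trx[MAX_TRX];	/* NULL slot: TRX not allocated */
};

enum attr_result {
	ATTR_OK,	/* TRX-specific attributes were included */
	ATTR_EMPTY	/* reply sent without TRX-specific attributes */
};

/* Return the TRX with the given number, or NULL if it does not exist. */
struct sketch_trx *sketch_trx_num(const struct sketch_bts *bts, int nr)
{
	assert(bts != NULL);	/* a NULL bts is a caller bug, not a runtime condition */
	if (nr < 0 || nr >= MAX_TRX)
		return NULL;
	return bts->trx[nr];
}

/* Handle an attribute request without aborting when the TRX is missing. */
enum attr_result sketch_handle_get_attr(const struct sketch_bts *bts, int trx_nr)
{
	const struct sketch_trx *trx = sketch_trx_num(bts, trx_nr);

	if (trx == NULL || !trx->phy_ready) {
		fprintf(stderr, "TRX %d not available, replying without TRX attributes\n",
			trx_nr);
		return ATTR_EMPTY;
	}
	return ATTR_OK;
}
```

The sketch keeps an assert only for a programming error (a NULL `bts` pointer) and treats a missing or not-yet-ready TRX as a normal runtime condition, which matches the trade-off described above: no crash, but possibly fewer attributes in the response.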

Actions #5

Updated by msuraev over 6 years ago

  • Status changed from New to In Progress
Actions #6

Updated by laforge over 6 years ago

Hi Max,

On Thu, Oct 12, 2017 at 12:13:40PM +0000, msuraev [REDMINE] wrote:

> The workaround in gerrit 4232 should prevent the crash. The reason, as
> pointed out by Neels on the ML, is the OSMO_ASSERT(trx); in
> trx_phy_instance(). This is triggered in osmo-bts-trx (although I did
> not manage to reproduce it locally) when an attribute request arrives
> at a time when the TRX is not yet available. I suspect this is due to
> a missing/in-progress connection to osmo-trx.
>
> The right fix would be to only reply when the TRX is available. But
> that would require either storing the request and properly plugging
> the responder into TRX init, or making sure that the TRX is always
> available by delaying the osmo-bts-trx connection to the BSC until
> it's ready. Not sure if either is worth pursuing ATM.

There is an alternative: Simply reply with an error, or with an empty
response ("no attributes"). Crashing osmo-bts-trx is not a good way of
handling this.

> Alternatively, the BSC can detect this situation and re-request
> attributes later on (not sure at which point though).

It could be a periodic timer with something like 3 tries, after which
point the OML connection is dropped.

> The downside of the workaround in gerrit 4232 is that some
> TRX-specific attributes might not be reported to the BSC. So far it's
> purely informational: the only thing we do with the response is
> logging.

Yes, but that's just the status quo. We need this to work properly,
and/or fail gracefully in order to be able to use the attributes. Let's
not create a chicken-and-egg situation here, where in the future we'll
then say "well yes, ideally we could use the attributes, but then
they aren't reported reliably".

Actions #7

Updated by msuraev over 6 years ago

laforge wrote:

> There is an alternative: Simply reply with an error, or with an empty
> response ("no attributes").

That's what the patch in gerrit 4232 does.

> It could be a periodic timer with something like 3 tries, after which
> point the OML connection is dropped.

I'm not sure how reliable it would be: after the OML connection is dropped, osmo-bts will be restarted (by systemd for example), try to connect again and so on.

Anyway, to implement it properly I have to reproduce the crash first.

Actions #8

Updated by msuraev over 6 years ago

  • Status changed from In Progress to Stalled
  • % Done changed from 50 to 60

Gerrit 4232 has been merged.

Actions #9

Updated by laforge about 6 years ago

  • Assignee deleted (msuraev)
Actions #10

Updated by laforge about 6 years ago

  • Assignee set to lynxis
Actions #11

Updated by laforge almost 6 years ago

  • Priority changed from High to Low
Actions #12

Updated by laforge over 4 years ago

  • Assignee deleted (lynxis)
Actions #13

Updated by pespin over 4 years ago

  • Status changed from Stalled to Feedback
  • Assignee set to pespin
  • % Done changed from 60 to 90

trx can only be NULL there if gsm_bts_trx_num(mo->bts, mo->obj_inst.trx_nr) returns NULL, and that can only happen for 2 reasons:
  • BSC/BTS is not correctly configured and it's asking for a TRX which is not allocated by BTS
  • the TRX was not yet created and added to bts->trx_list during gsm_bts_trx_alloc().

First case is a config issue and it's already fixed (the crash).
Second case: let's check if it's possible with current code base:

gsm_bts_trx_alloc() is called in two places:
  • For first TRX: during gsm_bts_alloc() (but initialized later during bts_init()). That's called in main.c really early on before OML is available, so we are safe here.
  • For other TRX: during "trx <0-254>" cmd from VTY. Called during main.c vty_read_config_file(). OML conn is only started later in main.c during abis_open(), so we are safe here too.

So imho this is no longer an issue and the ticket can be closed. I'll close it later if nobody disagrees.

Actions #14

Updated by pespin almost 4 years ago

  • Status changed from Feedback to Closed

Closing, since nobody disagreed for 6 months, and I never had this kind of crash while operating an osmo-bts-trx.
