Project

General

Profile

Bug #4215

osmo-bts-trx: DL burst towards osmo-trx dopped due to enqueued CMD NOHANDOVER

Added by pespin about 2 months ago. Updated 7 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
osmo-bts-trx
Target version:
-
Start date:
09/27/2019
Due date:
% Done:

100%

Spec Reference:

Description

Ass seen in BTS_Tests.TC_deact_sacch.pcap, when BSC sends over RSL a DEACTIVATE SACCH message to the BTS, the BTS enqueues a CMD NOHANDOVER. After that, we can see several of these messages in osmo-bts-trx log:

20386    09:02:08.628490    Sep 27, 2019 09:02:08.628490000 CEST    172.18.9.20    33148    172.18.9.10    4729    GSMTAP    203    phy0.0: Enqueuing TRX control command 'CMD NOHANDOVER 2 0' 
20387    09:02:08.628506    Sep 27, 2019 09:02:08.628506000 CEST    172.18.9.20    33148    172.18.9.10    4729    GSMTAP    189    phy0.0: Sending control 'CMD NOHANDOVER 2 0' 
...
20389    09:02:08.628780    Sep 27, 2019 09:02:08.628780000 CEST    172.18.9.20    33148    172.18.9.10    4729    GSMTAP    193    phy0.0: Tx burst (hdr_ver=1): tn=0 fn=1401 pwr=0
20390    09:02:08.628794    Sep 27, 2019 09:02:08.628794000 CEST    172.18.9.20    33148    172.18.9.10    4729    GSMTAP    191    phy0.0: Ignoring TX data, transceiver offline. 
[repeated]
...
20406    09:02:08.629022    Sep 27, 2019 09:02:08.629022000 CEST    172.18.9.20    33148    172.18.9.10    4729    GSMTAP    193    phy0.0: Response message: 'RSP NOHANDOVER 0 2 0' 

I have also seen this kind of behavior before when running osmo-bts-trx+osmo-trx locally.

So the problem seems to be in osmo-bts-trx trx_if.c, in trx_if_send_burst():

    /* we must be sure that we have clock, and we have sent all control
     * data */
    if (transceiver_available && llist_empty(&l1h->trx_ctrl_list)) {
        send(l1h->trx_ofd_data.fd, buf, nbits + 6, 0);
    } else
        LOGPPHI(l1h->phy_inst, DTRX, LOGL_ERROR, "Ignoring TX data, transceiver offline.\n");

So IIUC, that check is there to make sure we don't send DL bursts while the TRX is stopped or still in progress of being configured, but I guess it should have its DL bursts dropped while already running and having a NOHANDOVER command.

History

#1 Updated by pespin about 2 months ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 20

I started doing some clean up in order to sort out real need for different states variables to be able to clrearly indicate in code above when we should send or not data.

remote: https://gerrit.osmocom.org/c/osmo-bts/+/15613 struct gsm_bts: Add model_priv pointer handing bts_model specific data
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15614 bts-trx: Allocate struct osmo_trx_clock_state as part of bts-trx private data
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15615 bts-trx: Move transceiver_available as part of bts-trx private data

#2 Updated by pespin about 2 months ago

  • Status changed from In Progress to Feedback
  • % Done changed from 20 to 90

Next bunch of refactor patches to fix the issue:
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15627 bts-trx: vty: Print phy link state in cmd 'show transceiver'
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15628 bts-trx: trx_set_bts(): Avoid double loop by checking current trx
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15629 bts-trx: Rework code handling poweron state
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15630 bts-trx: Don't reset transceiver_available in scheduler_trx.c
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15615 bts-trx: Get rid of messy transceiver_available state handler
remote: https://gerrit.osmocom.org/c/osmo-bts/+/15631 bts-trx: Drop unused func check_transceiver_availability()

Once merged the ticket can be closed (issue is fixed as a side-effect of last patch).

#3 Updated by ipse about 2 months ago

Just want to make sure - after all these patches, is `osmo-bts-trx` still monitoring that `osmo-trx` is "alive"? By monitoring CLK INDs or somehow else? I.e. will it notice if `osmo-trx` suddenly crash?

#4 Updated by pespin about 1 month ago

During CMD POWERON command, the TRX is expected to create all required sockets and start sending clock indications against osmo-bts-trx, and once done, asnwer with RSP POWERON 0.
At that point, "powered" bool in BTS becomes true, until CMD POWEROFF is sent (and RSP POWEROFF 0 is received).

Once TRX is "powered", osmo-bts-trx starts sending bursts to TRX:

    if (trx_if_powered(l1h)) {
         send(l1h->trx_ofd_data.fd, buf, nbits + 6, 0);    <--- I just detected we don't check return code here, I'll submit a patch.

If osmo-trx crashes, its sockets will become unavailable and trx_data_read_cb() in osmo-bts-trx will eventually fail, detecting thus that TRX is unavailable. I think thought that apart from logging the errr it doesn't report it to upper layers. That's something we may want to improve (by calling for instance bts_shutdown()).

What really stops the BTS if TRX becomes unavailable is bts_shutdown() in scheduler_trx.c. First IND CLOCK received in trx_sched_clock() triggers trx_setup_clock(), which sets up a timerfd that triggers from time to time to push frames to upper layers. if IND CLOCK stopped being receivied, wall clock vs ind clock skews too much and trx_fn_timer_cb() exists the BTS with bts_shutdown(bts, "No clock from osmo-trx");.

You can easily see it by running osmo-bts-trx and osmo-trx, then at some point killing osmo-trx. osmo-bts-trx will stop with the "No clock from osmo-trx" message.

Now the question is: What happens if TRX answers with RSP POWERON 0 and sockets are available but no CLOCK IND is ever set? Then the timerfd is never set, so trx_fn_timer_cb() is never called and probably BTS never exits and thinks everything's fine, but the scheduler is really paused (so for instance no DL bursts are sent against the TRX). So it seems we need to fix this case by adding some kind of timing in osmo-bts-trx to timeout if first CLOCK IND is not send after X seconds, then exit. (I tried the scenario by commenting out CLOCK ind write() in osmo-trx).

But this issue didn't come from last patches merged, afaiu it was already there. I'll provide patches for that shortly.

#5 Updated by pespin about 1 month ago

Improvement to log send/recv() errors:
https://gerrit.osmocom.org/c/osmo-bts/+/15702 bts-trx: Log TRXC and TRXD socket recv()/send() failures

I'll work on the timeout if no IND CLOCK is never received later.

#6 Updated by pespin about 1 month ago

Timeout after POWERON and no CLOCK IND ever received is here:
https://gerrit.osmocom.org/c/osmo-bts/+/15706 bts-trx: Time out if no clock ind recvd after RSP POWERON

Once merged I think we can close the ticket if ipse has not other concerns.

#7 Updated by pespin 7 days ago

  • Status changed from Feedback to Resolved
  • % Done changed from 90 to 100

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)