BSC sends COMPLETE L3 before RESET
At least in SCCPlite, we've received a protocol trace from a customer that looks like this:
- IPA CCM handshake
- SCCP CR with BSSMAP COMPLETE L3 INFO
- another SCCP CR with BSSMAP COMPLETE L3 INFO
- only then a SCCP UDT with BSSMAP RESET
The Reset procedure should be the first thing to happen after the A link comes up, before any user data is exchanged. The SCCP CR messages in the example above should ideally be queued (or else discarded) until the RESET procedure completes. Discarding is probably the easier option, since queueing would have to involve timeouts (what if the RESET takes 5 minutes to complete?), ...
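A minimal sketch of the "discard until RESET completes" option, written as a standalone model rather than actual Osmocom code. All names here (`a_link`, `a_link_gate_msg`, the `A_LINK_*` states) are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical link states, not taken from the osmo-bsc/osmo-msc sources. */
enum a_link_state {
	A_LINK_DOWN,		/* no signalling link available */
	A_LINK_RESET_PENDING,	/* link is up, RESET procedure not yet completed */
	A_LINK_READY,		/* RESET procedure completed, user data allowed */
};

enum gate_verdict { GATE_FORWARD, GATE_DISCARD };

struct a_link {
	enum a_link_state state;
	unsigned int discarded;	/* messages dropped while the link was not ready */
};

/* Decide whether a message may pass. Only BSSMAP RESET itself is let through
 * while the RESET procedure is still pending; everything else is dropped. */
static enum gate_verdict a_link_gate_msg(struct a_link *link, bool is_reset_msg)
{
	assert(link != NULL);

	if (link->state == A_LINK_READY)
		return GATE_FORWARD;
	if (link->state == A_LINK_RESET_PENDING && is_reset_msg)
		return GATE_FORWARD;

	link->discarded++;
	return GATE_DISCARD;
}
```

Discarding keeps the gate stateless apart from a counter. A queueing variant would additionally need a bounded queue and a timer to expire stale entries, which is the extra complexity mentioned above.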
https://gerrit.osmocom.org/c/libosmo-netif/+/15403 stream: Introduce API osmo_stream_cli_is_connected
https://gerrit.osmocom.org/c/libosmo-netif/+/15404 stream: Fix scheduling of queued messages during connecting state
- % Done changed from 0 to 60
https://gerrit.osmocom.org/c/libosmo-sccp/+/15405 ss7: Do not queue messages if stream is not connected
Helpful call stack:
sccp_sclc_user_sap_down_nofree
xua_gen_encode_and_send
xua_gen_msg_cl
sccp_scrc_rx_sclc_msg
sua_addr_parse
scrc_local_out_common
scrc_node_12
gen_mtp_transfer_req_xua
sua2sccp_tx_m3ua
osmo_ss7_user_mtp_xfer_req
m3ua_hmdc_rx_from_l2
hmrt_message_for_routing
ipa_tx_xua_as
xua_as_transmit_msg
osmo_ss7_asp_send
osmo_stream_cli_send/osmo_stream_srv_send
- Category set to A interface
- Status changed from In Progress to Feedback
- % Done changed from 60 to 70
More related commits:
remote: https://gerrit.osmocom.org/c/osmo-bsc/+/15406 a_reset.c: Don't wait 2 seconds to send first BSSMAP RESET
remote: https://gerrit.osmocom.org/c/osmo-bsc/+/15407 bsc: gsm_08_08.c: Remove repeated conn not null check
I could not find the exact culprit of the issue; according to what I understand from the code, it should not happen at all. I think it may happen if the BSC<->MSC conn was already established at some previous point and the MSC side then got restarted without the BSC knowing about it yet, so the upper layers still think the conn is active and those CL3 Info messages can be sent. And since those are not answered, at some point this condition from a_reset.c triggers, sending the BSSMAP RESET:
if (reset_ctx->conn_loss_counter >= BAD_CONNECTION_THRESOLD)
But I'm just speculating. It's difficult to say, because the bsc logs related to the pcap file don't match (e.g. the src port of the connection and the timestamps differ), and I also lack the previous context in the pcap file, so it's almost impossible to know exactly what's going on.
I think the best option is to stall this ticket and, once the fixes submitted above are merged, try again and get more data to better figure out the issue.
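The suspected mechanism can be sketched as a small standalone model. This is not the a_reset.c code: the struct, the `model_*` helpers and the threshold value of 3 are assumptions for illustration, and only `conn_loss_counter` and `BAD_CONNECTION_THRESOLD` (spelled as in the source) are names taken from this ticket:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Identifier spelled as in a_reset.c; the value 3 is an assumption. */
#define BAD_CONNECTION_THRESOLD 3

/* Hypothetical, reduced stand-in for the reset context. */
struct reset_ctx_model {
	unsigned int conn_loss_counter;	/* consecutive failed connections */
	unsigned int resets_sent;	/* how many RESETs the model triggered */
};

/* A connection succeeded: the link is evidently alive, forget past failures. */
static void model_conn_success(struct reset_ctx_model *ctx)
{
	assert(ctx != NULL);
	ctx->conn_loss_counter = 0;
}

/* A connection failed (e.g. an SCCP CR was never answered).
 * Returns true if this failure pushed the counter over the threshold,
 * i.e. a new RESET procedure would be started. */
static bool model_conn_fail(struct reset_ctx_model *ctx)
{
	assert(ctx != NULL);
	ctx->conn_loss_counter++;
	if (ctx->conn_loss_counter >= BAD_CONNECTION_THRESOLD) {
		ctx->conn_loss_counter = 0;
		ctx->resets_sent++;
		return true;
	}
	return false;
}
```

Under this model, a trace showing a few unanswered SCCP CRs followed by a BSSMAP RESET is what one would expect if the peer restarted silently, which matches the sequence in the customer trace.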
#8 Updated by pespin about 2 months ago
I checked again whether osmo-bsc could forward a COMPL L3 message before having done the reset, and again I was unable to find how that could happen.
a_reset.c keeps the SCCP link state in an FSM, and it can be checked with a_reset_conn_ready(), which can only return true with "reset_fsm->state == ST_CONN".
Then, here's the code path when a COMPL L3 message is received in BSC through RSL from BTS:
bsc_compl_l3
bsc_find_msc (doesn't check with a_reset_conn_ready(), but that's expected, since a more fine-grained USSD is sent to the subscriber later in complete_layer3())
complete_layer3
osmo_bsc_sigtran_new_conn
a_reset_conn_ready returns false
[When false is returned above, complete_layer3() calls bsc_send_ussd_no_srv() and returns without forwarding the message.]
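The gating described in this code path can be modelled as follows. This is a simplified standalone sketch, not the osmo-bsc implementation: `ST_CONN` is the state name quoted above, while `ST_DISC`, `msc_model` and the `model_*` functions are hypothetical stand-ins:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ST_CONN is the state named in the ticket; ST_DISC is a hypothetical
 * placeholder for "any state other than ST_CONN". */
enum reset_state_model { ST_DISC, ST_CONN };

struct msc_model {
	enum reset_state_model state;
	unsigned int forwarded;		/* COMPL L3 messages sent on to the MSC */
	unsigned int ussd_no_srv;	/* subscribers told "no service" instead */
};

/* Stand-in for a_reset_conn_ready(): true only in ST_CONN. */
static bool model_conn_ready(const struct msc_model *msc)
{
	assert(msc != NULL);
	return msc->state == ST_CONN;
}

/* Stand-in for the complete_layer3() decision: forward the message only
 * if the reset FSM reports the link as ready, otherwise notify the
 * subscriber and drop it. Returns true if the message was forwarded. */
static bool model_complete_layer3(struct msc_model *msc)
{
	assert(msc != NULL);
	if (!model_conn_ready(msc)) {
		msc->ussd_no_srv++;
		return false;
	}
	msc->forwarded++;
	return true;
}
```

With this structure there is no path on which a COMPL L3 is forwarded while the FSM is outside ST_CONN, which is why the trace is more plausibly explained by the BSC still believing it is in ST_CONN after the peer restarted.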
So I go back to what I said in my last comment. I think the resets seen afterwards were sent because conn_loss_counter was incremented through calls to a_reset.c:a_reset_conn_fail() until it reached the threshold:
reset_ctx->conn_loss_counter >= BAD_CONNECTION_THRESOLD
I don't think it's worth spending more time on this topic until we find a setup where we can clearly see this issue again and get proper traces with pcaps, since afaiu those resets could be expected.
laforge what do you think?