Bug #3859
closed
SGs FSM doesn't consider disconnected HLR
100%
Description
When an SGs LU REQ arrives from the MME while no HLR is connected, we get:
<0011> sgs_server.c:185 SGs socket bound to r=NULL<->l=0.0.0.0:29118
Mon Mar 25 17:20:23 2019 DLSS7 <001e> osmo_ss7.c:1283 0: ASP Restart for server not implemented yet!
Mon Mar 25 17:20:23 2019 DMNCC <0004> msc_main.c:604 Using internal MNCC handler.
Mon Mar 25 17:20:23 2019 DLGLOBAL <0012> telnet_interface.c:104 Available via telnet 0.0.0.0 4254
Mon Mar 25 17:20:23 2019 DSMPP <000c> smpp_smsc.c:1012 SMPP at 0.0.0.0 2775
Mon Mar 25 17:20:23 2019 DLCTRL <0019> control_if.c:911 CTRL at 0.0.0.0 4255
Mon Mar 25 17:20:23 2019 DLSMS <0018> sms_queue.c:250 Attempting to send 20 SMS
Mon Mar 25 17:20:23 2019 DLSMS <0018> sms_queue.c:234 SMS queue: no SMS to be sent
Mon Mar 25 17:20:23 2019 DLSMS <0018> sms_queue.c:261 Sending SMS done (0 attempted)
Mon Mar 25 17:20:23 2019 DLSMS <0018> sms_queue.c:317 SMSqueue added 0 messages in 0 rounds
Mon Mar 25 17:20:23 2019 DLMGCP <0022> mgcp_client.c:716 MGCP client: using endpoint domain '@mgw'
Mon Mar 25 17:20:23 2019 DLMGCP <0022> mgcp_client.c:791 MGCP GW connection: r=127.0.0.1:2427<->l=127.0.0.1:2727
Mon Mar 25 17:20:23 2019 DMSC <0006> msc_main.c:372 CS7 Instance identifiers: A = Iu = 0
Mon Mar 25 17:20:23 2019 DLSCCP <001f> sccp_user.c:397 OsmoMSC-A-Iu: Using SS7 instance 0, pc:0.23.1
Mon Mar 25 17:20:23 2019 DLSCCP <001f> sccp_user.c:415 OsmoMSC-A-Iu: Using AS instance as-clnt-OsmoMSC-A
Mon Mar 25 17:20:23 2019 DLSCCP <001f> sccp_user.c:420 OsmoMSC-A-Iu: Creating default route
Mon Mar 25 17:20:23 2019 DLSCCP <001f> sccp_user.c:476 OsmoMSC-A-Iu: Using ASP instance asp-clnt-OsmoMSC-A
Mon Mar 25 17:20:23 2019 DLSS7 <001e> osmo_ss7.c:471 0: Creating SCCP instance
Mon Mar 25 17:20:23 2019 DBSSAP <0010> a_iface.c:674 Initalizing SCCP connection to stp...
Mon Mar 25 17:20:27 2019 DSGS <0011> sgs_server.c:123 r=192.168.122.186:37270<->l=192.168.122.1:29118: Accepted new SGs connection
Mon Mar 25 17:24:41 2019 DSGS <0011> fsm.c:320 SGs-VLR-RESET(262-42-8001-01)[0x55fda7c789d0]{unknown 0}: Allocated
Mon Mar 25 17:24:41 2019 DSGS <0011> fsm.c:320 SGs-UE(num:0)[0x55fda7c760f0]{SGs-NULL}: Allocated
Mon Mar 25 17:24:41 2019 DSGS <0011> vlr_sgs_fsm.c:359 SGs-UE(num:0)[0x55fda7c760f0]{SGs-NULL}: state_chg to SGs-NULL
Mon Mar 25 17:24:41 2019 DVLR <000e> vlr.c:438 set IMSI on subscriber; IMSI=262423203001508 id=262423203001508
Mon Mar 25 17:24:41 2019 DVLR <000e> vlr.c:391 New subscr, IMSI: 262423203001508
Mon Mar 25 17:24:41 2019 DVLR <000e> vlr.c:438 set IMSI on subscriber; IMSI=262423203001508 id=262423203001508
Mon Mar 25 17:24:41 2019 DSGS <0011> vlr_sgs.c:96 SGs-UE(num:0)[0x55fda7c760f0]{SGs-NULL}: Received Event RX_LU_FROM_MME
Mon Mar 25 17:24:41 2019 DSGS <0011> vlr_sgs_fsm.c:55 SGs-UE(num:0)[0x55fda7c760f0]{SGs-NULL}: state_chg to SGs-LA-UPDATE-PRESENT
Mon Mar 25 17:24:41 2019 DVLR <000e> gsm_04_08.c:1767 SUBSCR(IMSI-262423203001508:TMSInew-0x8611AEA5) VLR: update for IMSI=262423203001508 (MSISDN=, used=1)
Mon Mar 25 17:24:41 2019 DVLR <000e> vlr.c:192 GSUP tx: 04010862423202031005f8280102
Mon Mar 25 17:24:41 2019 DLGSUP <001c> gsup_client.c:353 GSUP not connected, unable to send 04 01 08 62 42 32 02 03 10 05 f8 28 01 02
Mon Mar 25 17:24:41 2019 DSGS <0011> vlr_sgs_fsm.c:65 SGs-UE(num:0)[0x55fda7c760f0]{SGs-LA-UPDATE-PRESENT}: (sub IMSI-262423203001508:TMSInew-0x8611AEA5) HLR LU request failed
Mon Mar 25 17:24:55 2019 DVLR <000e> vlr.c:438 set IMSI on subscriber; IMSI=262423203001508 id=262423203001508
Mon Mar 25 17:24:55 2019 DSGS <0011> vlr_sgs.c:96 SGs-UE(num:0)[0x55fda7c760f0]{SGs-LA-UPDATE-PRESENT}: Received Event RX_LU_FROM_MME
Mon Mar 25 17:24:55 2019 DSGS <0011> vlr_sgs.c:96 SGs-UE(num:0)[0x55fda7c760f0]{SGs-LA-UPDATE-PRESENT}: Event RX_LU_FROM_MME not permitted
Even after many minutes, there is no timeout or any other visible recovery. We have to handle such cases, as the HLR may become unreachable at any time, at least temporarily. What does the spec say? Shouldn't we return something to the MME in this case?
Updated by laforge about 5 years ago
What makes the problem even worse: if the HLR later recovers and the MME sends another LU REQ, we get:
Mon Mar 25 17:31:03 2019 DSGS <0011> vlr_sgs.c:96 SGs-UE(num:0)[0x55fda7c760f0]{SGs-LA-UPDATE-PRESENT}: Received Event RX_LU_FROM_MME
Mon Mar 25 17:31:03 2019 DSGS <0011> vlr_sgs.c:96 SGs-UE(num:0)[0x55fda7c760f0]{SGs-LA-UPDATE-PRESENT}: Event RX_LU_FROM_MME not permitted
so a temporary HLR outage will break CSFB for an apparently indefinite time :/
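To illustrate why the FSM appears to stay stuck, here is a minimal, self-contained model of the behaviour seen in the logs. It is only a sketch: the state and event names are hypothetical stand-ins, not the identifiers used in osmo-msc, and the `handles_hlr_failure` switch is an assumption added to contrast the stuck case with one possible recovery path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the per-UE SGs FSM. None of these
 * names are osmo-msc identifiers. */
enum ue_state { ST_NULL, ST_LA_UPDATE_PRESENT, ST_ASSOCIATED };
enum ue_event { EV_RX_LU_FROM_MME, EV_LU_ACCEPTED, EV_HLR_FAILURE, EV_COUNT };

struct ue_fsm {
	enum ue_state state;
	bool handles_hlr_failure; /* false models the behaviour in the log */
};

/* Bitmask of the events that are permitted in the current state. */
static uint32_t permitted_events(const struct ue_fsm *fi)
{
	uint32_t mask = 0;

	switch (fi->state) {
	case ST_NULL:
	case ST_ASSOCIATED:
		mask |= 1u << EV_RX_LU_FROM_MME;
		break;
	case ST_LA_UPDATE_PRESENT:
		mask |= 1u << EV_LU_ACCEPTED;
		if (fi->handles_hlr_failure)
			mask |= 1u << EV_HLR_FAILURE;
		break;
	}
	return mask;
}

/* Returns false when the event is not permitted in the current state. */
static bool ue_fsm_dispatch(struct ue_fsm *fi, enum ue_event ev)
{
	assert(ev < EV_COUNT);
	if (!(permitted_events(fi) & (1u << ev)))
		return false;

	switch (ev) {
	case EV_RX_LU_FROM_MME:
		fi->state = ST_LA_UPDATE_PRESENT;
		break;
	case EV_LU_ACCEPTED:
		fi->state = ST_ASSOCIATED;
		break;
	case EV_HLR_FAILURE:
		fi->state = ST_NULL; /* recovery path: a fresh LU REQ is accepted again */
		break;
	default:
		return false;
	}
	return true;
}
```

With `handles_hlr_failure` set to false, a first `EV_RX_LU_FROM_MME` moves the model to `ST_LA_UPDATE_PRESENT`, and every later `EV_RX_LU_FROM_MME` is rejected because nothing ever leaves that state. This mirrors the "not permitted" log lines above.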
Updated by dexter about 5 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 30
I found a way to reproduce the problem above using TTCN3. I also investigated the spec and found out that we are supposed to send an SGsAP-RESET-INDICATION to the MME in those cases. (see also: 3GPP TS 29.118 5.7 VLR failure procedure).
I have some code ready that triggers the sending of an SGsAP-RESET-INDICATION when the HLR (VLR) fails. This works so far. We now need a TTCN3 test that responds to the SGsAP-RESET-INDICATION properly.
Updated by dexter about 5 years ago
- % Done changed from 30 to 90
There is now a TTCN3 test that provokes the problem. See the following patches:
https://gerrit.osmocom.org/#/c/osmo-ttcn3-hacks/+/13556 SGsAP_Templates: Remove invalid template.
https://gerrit.osmocom.org/#/c/osmo-ttcn3-hacks/+/13557 MSC_Tests: allow disabeling GSUP
https://gerrit.osmocom.org/#/c/osmo-ttcn3-hacks/+/13558 MSC_Tests: Add testcase to simulate VLR/HLR failure (SGsAP)
There are several problems in the MSC. For one, the code in the VLR did not report the failure back to the SGs-related code in the MSC. I have added a flag so that the actual MSC code becomes aware of the failure. When the flag is set, the reset procedure is carried out. This works well with the TTCN3 test so far.
https://gerrit.osmocom.org/#/c/osmo-msc/+/13559 sgs_iface: detect and react to VLR/HLR failure
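The flag-based approach described above could look roughly like the following. This is a sketch under stated assumptions, not the code from the patch: every type and function name here (`gsup_link`, `lu_result`, `vlr_request_lu`, `sgs_handle_lu_req`) is hypothetical and only models the idea that the VLR side reports the failure and the SGs side reacts by starting the reset procedure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; they do not exist in osmo-msc. */
struct gsup_link {
	bool connected; /* is the GSUP connection to the HLR up? */
};

struct lu_result {
	bool hlr_failure; /* the "flag": set when the LU request could not reach the HLR */
};

/* Stand-in for the VLR side: report the failure instead of only logging it. */
static struct lu_result vlr_request_lu(const struct gsup_link *hlr)
{
	struct lu_result res = { .hlr_failure = !hlr->connected };
	return res;
}

enum sgs_action {
	SGS_WAIT_FOR_HLR_RESULT,   /* request is on its way, keep waiting */
	SGS_SEND_RESET_INDICATION, /* start the reset procedure towards the MME */
};

/* Stand-in for the SGs side: evaluate the flag and decide what to do. */
static enum sgs_action sgs_handle_lu_req(const struct gsup_link *hlr)
{
	assert(hlr != NULL);

	struct lu_result res = vlr_request_lu(hlr);
	if (res.hlr_failure)
		return SGS_SEND_RESET_INDICATION;
	return SGS_WAIT_FOR_HLR_RESULT;
}
```

The point of returning the flag rather than acting inside the VLR stand-in is that the SGs-specific reaction stays in the SGs-related code, which matches the split described in the comment above.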
Updated by dexter almost 5 years ago
The patches that add the TTCN3 tests are all merged. The MSC part still needs some review:
https://gerrit.osmocom.org/#/c/osmo-msc/+/13559 sgs_iface: detect and react to VLR/HLR failure
Updated by laforge almost 5 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
patch now reviewed/rebased/merged