Bug #4619
closedttcn3-bsc-test: all LCLS test cases broken since build #987
90%
Description
Check out https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/test_results_analyzer/.
This is probably related to my recent IPA/RSL emulation refactoring changes:
Changes
1. centos-repo-install-test: new image (detail)
2. debian-repo-install-test: move scripts to osmo-ci (detail)
3. ttcn3-bts-test: enable 3 additional transceivers for BTS#0 (detail)
Here is an extract from the build artifacts of build#987:
06:26:01.572279 921 RSL_Emulation.ttcn:556 Matching on port IPA_PT succeeded: matched 06:26:01.572304 921 RSL_Emulation.ttcn:556 Receive operation on port IPA_PT succeeded, message from IPA0-RSL-IPA(920): @IPA_Emulation.ASP_RSL_Unitdata : { conn_id := 1, streamId := IPAC_PROTO_RSL_TRX0 (0), rsl := { msg_disc := { msg_group := RSL_MDISC_IPACCESS (63), transparent := false }, msg_type := RSL_MT_IPAC_CRCX (112), ies := { { iei := RSL_IE_CHAN_NR (1), body := { chan_nr := { u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } } }, { iei := RSL_IE_IPAC_SPEECH_MODE (244), body := { ipa_speech_mo de := { reserved := '00'B, mode := RSL_IPA_SPM_RECVONLY (1), codec := RSL_IPA_CODEC_FR (0) } } }, { iei := RSL_IE_IPAC_RTP_PAYLOAD (242), body := { ipa_rtp_pt := 3 } } } } } id 27 06:26:01.572325 921 RSL_Emulation.ttcn:556 Message with id 27 was extracted from the queue of IPA_PT. 06:26:01.572351 921 RSL_Emulation.ttcn:199 No Dchan handler for 0{ u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } 06:26:01.572469 923 MSC_ConnectionHandler.ttcn:784 dec_PDU_ML3_NW_MS(): Decoded @MobileL3_Types.PDU_ML3_NW_MS: { discriminator := '0110'B, tiOrSkip := { skipIndicator := '0000'B }, msgs := { rrm := { assignmentCommand := { messageType := '00101110'B, descrOf1stChAfterTime := { timeslotNumber := '010'B, channelTypeandTDMAOffset := '00001 'B, octet3 := '43'O ("C"), octet4 := '67'O ("g") }, PowerCommand := { powerlevel := '00111'B, fPC_EP := '0'B, ePC_Mode := '0'B, spare_1 := '0'B }, frequencyList_at := om it, cellChannelDescr := omit, descrMultislotAllocation := omit, modeOf1stChannel := { elementIdentifier := '63'O ("c"), mode := '01'O }, channelSet2 := omit, channelSet3 := omit, channelSet4 := omit, channelSet5 := omit, channelSet6 := omit, channelSet7 := omit, channelSet8 := omit, descrOf2ndChAfterTime := omit, modeOf2ndChannel := omi t, mobileAllocation_at := omit, startingTime := omit, frequencyList_bt := omit, descrOf1stCh_bt := omit, descrOf2ndCh_bt := omit, frequencyChannelSequence := omit, mobil eAllocation_bt := omit, cipherModeSetting := omit, vGCS_TargetModeIndication := omit, multiRateConfiguration := omit, vGCS_Ciphering_Parameters := omit, extendedTSCSet_a fterTime := omit, extendedTSCSet_beforeTime := omit } } } } 06:26:01.572471 921 RSL_Emulation.ttcn:563 setverdict(fail): none -> fail reason: "RSL for unknown Dchan", new component reason: "RSL for unknown Dchan" 06:26:01.572505 921 RSL_Emulation.ttcn:564 Stopping MTC. The current test case will be terminated. 06:26:01.572526 921 RSL_Emulation.ttcn:564 Stopping test component execution.
Note that ttcn3-bsc-test-latest is not affected.
Files
Related issues
Updated by fixeria over 3 years ago
This is probably related to my recent IPA/RSL emulation refactoring changes:
...
Note that ttcn3-bsc-test-latest is not affected.
Actually, this should not be the case. If that was a regression in ttcn3-bsc-test, then LCLS tests in *-latest would fail too.
Updated by fixeria over 3 years ago
Interestingly enough, some test cases pass when running on my machine:
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_only pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_break pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_clear pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr pass BSC_Tests_LCLS.TC_lcls_bts_connect_break Summary: pass->FAIL: 6 pass: 8 skip: 149
All libraries, osmo-bsc and ttcn3-bts-test (running natively) were compiled from the recent master.
Updated by fixeria over 3 years ago
Hmm, running the same group of test cases again gives different results:
pass BSC_Tests_LCLS.TC_lcls_gcr_only pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc pass BSC_Tests_LCLS.TC_lcls_connect_break pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_clear pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr pass BSC_Tests_LCLS.TC_lcls_bts_connect_break Summary: pass->FAIL: 7 pass: 7 skip: 149
so the test results have sporadic nature...
Updated by fixeria over 3 years ago
Running LCLS test cases third time without restarting osmo-bsc resulted in a segmentation fault:
Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376 backtrace() returned 11 addresses /usr/local/lib/libosmocore.so.12(osmo_panic+0xce) [0x7f6da763758e] osmo-bsc(+0x35a25) [0x55707098ca25] osmo-bsc(+0x3ada5) [0x557070991da5] /usr/local/lib/libosmocore.so.12(+0x13104) [0x7f6da7630104] /usr/local/lib/libosmocore.so.12(osmo_timers_update+0xc2) [0x7f6da762a062] /usr/local/lib/libosmocore.so.12(+0xd5d2) [0x7f6da762a5d2] /usr/local/lib/libosmocore.so.12(osmo_select_main_ctx+0xf) [0x7f6da762ad2f] osmo-bsc(+0x195f1) [0x5570709705f1] /usr/lib/libc.so.6(__libc_start_main+0xf2) [0x7f6da743f002] osmo-bsc(+0x1973e) [0x55707097073e] signal 6 received
I'll try to reproduce this and investigate the reason. Buiding with "-O0 -ggdb" now...
Updated by fixeria over 3 years ago
This is reproducible, just run BSC_Tests_LCLS.control three times. Here is a backtrace:
Program received signal SIGABRT pwndbg> bt #0 0x00007ffff6767355 in raise () from /usr/lib/libc.so.6 #1 0x00007ffff6750853 in abort () from /usr/lib/libc.so.6 #2 0x00007ffff72b8593 in osmo_panic_default (args=0x7fffffffdc30, fmt=<optimized out>) at ../../../src/libosmocore/src/panic.c:49 #3 osmo_panic (fmt=<optimized out>) at ../../../src/libosmocore/src/panic.c:84 #4 0x0000555555be6e52 in conn_get_bts (conn=0x622000057160) at ../../include/osmocom/bsc/gsm_data.h:1376 #5 0x0000555555c1edc8 in assignment_fsm_timer_cb (fi=0x612000060220) at assignment_fsm.c:758 #6 0x00007ffff72b1104 in fsm_tmr_cb (data=0x612000060220) at ../../../src/libosmocore/src/fsm.c:325 #7 0x00007ffff72ab062 in osmo_timers_update () at ../../../src/libosmocore/src/timer.c:257 #8 0x00007ffff72ab5d2 in _osmo_select_main (polling=0) at ../../../src/libosmocore/src/select.c:260 #9 0x00007ffff72abd2f in osmo_select_main_ctx (polling=<optimized out>) at ../../../src/libosmocore/src/select.c:291 #10 0x0000555555e1b81b in main (argc=3, argv=0x7fffffffe1b8) at osmo_bsc_main.c:953 #11 0x00007ffff6752002 in __libc_start_main () from /usr/lib/libc.so.6 #12 0x0000555555b61bbe in _start ()
Updated by fixeria over 3 years ago
Among with that, we seem to have a memory leak. This is what I see after running one of the LCLS tests several times:
OsmoBSC# show talloc-context application 3 tree 0x60b000000780 talloc report on 'osmo-bsc' (total 1852081 bytes in 738 blocks) ipa contains 1013233 bytes in 68 blocks (ref 0) 0x60b000000780 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f000222460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f000206460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f0001ea460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f0001ce460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f0001b2460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f000196460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00017a460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00015e460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f000142460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f000126460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00010a460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f0000ee460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f0000d2460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f0000b6460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00009a460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00007e460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f000062460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f000046460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00002a460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00001c460 struct e1inp_line contains 48240 bytes in 3 blocks (ref 0) 0x62f00000e460 struct ipa_server_link contains 96 bytes in 2 blocks (ref 0) 0x6100000028a0 struct ipa_server_link contains 96 bytes in 2 blocks (ref 0) 0x6100000027a0
Updated by neels over 3 years ago
fixeria wrote:
#5 0x0000555555c1edc8 in assignment_fsm_timer_cb (fi=0x612000060220) at assignment_fsm.c:758
this looks like a timeout at a point where the lchan has already been torn down, so that code path should not OSMO_ASSERT(conn->lchan) but rather error handle.
(still doesn't explain the root cause of why the situation comes about)
Updated by fixeria over 3 years ago
this looks like a timeout at a point where the lchan has already been torn down
Yes, this (usually) happens right after the test case execution is finished. Looks like a race condition to me.
Updated by fixeria over 3 years ago
Among with that, we seem to have a memory leak.
And not just one memory leak, here is another one:
OsmoBSC# show talloc-context application 5 tree 0x6180000040e0 talloc report on 'osmo-bsc' (total 1900481 bytes in 745 blocks) struct gsm_network contains 777050 bytes in 584 blocks (ref 0) 0x6160000000e0 struct mgcp_client contains 3488 bytes in 71 blocks (ref 0) 0x6180000040e0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00009f160 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00009f090 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000abc80 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000ab870 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000aab70 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000aaaa0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000a4d30 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000a4c60 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000a4510 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000a4100 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00009b6e0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00009b610 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00009aec0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00009aab0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d000095080 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d000094ad0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000917a0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d0000916d0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00007efe0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00007ef10 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00007e7c0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00007e3b0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00008ae00 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00008ad30 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00008a5e0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00008a1d0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d000083f80 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d000083eb0 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d000083760 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d000083350 struct mgcp_response_pending contains 40 bytes in 1 blocks (ref 0) 0x60d00007aad0
Updated by fixeria over 3 years ago
And not just one memory leak, here is another one:
This might be related to:
DRLL ERROR mgcp_client_fsm.c:525 MGCP_CONN(to-MSC)[0x61200000bc20]{ST_READY}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX... DLCLS ERROR osmo_bsc_lcls.c:539 LCLS[0x612000010120]{NO_LONGER_LS}: transition to state NOT_POSSIBLE_LS not permitted! DMSC INFO osmo_bsc_sigtran.c:388 Tx MSC: BSSMAP: LCLS-NOTIFICATION DRLL ERROR mgcp_client_fsm.c:525 MGCP_CONN(to-MSC)[0x612000010d20]{ST_READY}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX...
or
DLMGCP ERROR mgcp_client.c:700 Failed to read: r=127.0.0.1:2427<->l=127.0.0.1:2727: 111='Connection refused' DLMGCP ERROR mgcp_client.c:700 Failed to read: r=127.0.0.1:2427<->l=127.0.0.1:2727: 111='Connection refused'
I could not find the reason why it's leaked, but found another problem while looking at the code:
https://gerrit.osmocom.org/c/osmo-mgw/+/18880 libosmo-mgcp-client: fix use-after-free in mgcp_client_tx()
Updated by fixeria over 3 years ago
I could not find the reason why it's leaked, but found another problem while looking at the code:
Ok, this change fixes the problem for me:
https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak: clear pending queue
Updated by fixeria over 3 years ago
- Status changed from New to In Progress
Interestingly enough, reverting "src/input/ipaccess.c: set TCP_NODELAY" [1] makes the LCLS test results much more stable:
pass BSC_Tests_LCLS.TC_lcls_gcr_only pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr pass BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch pass BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_break pass BSC_Tests_LCLS.TC_lcls_connect_clear pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr pass BSC_Tests_LCLS.TC_lcls_bts_connect_break Summary: pass->FAIL: 1 pass: 13 skip: 149
[1] https://git.osmocom.org/libosmo-abis/commit/?id=62725d0b58a4842ce96ac9107c565de40fe4e945
Updated by fixeria over 3 years ago
https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak: clear pending queue
As was recommended by neels, I took a different approach based on timers:
https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak in case if no response is received
Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376
This problem is getting even more annoying...
Updated by fixeria over 3 years ago
Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376
According to the backtrace, the crash originates from assignment_fsm_timer_cb(). This callback tries to update some counters:
int assignment_fsm_timer_cb(struct osmo_fsm_inst *fi)
{
struct gsm_subscriber_connection *conn = assignment_fi_conn(fi);
assignment_count_result(CTR_ASSIGNMENT_TIMEOUT); // New: b5ccf09fc4 2020-06-08 14:36:46
assignment_fail(GSM0808_CAUSE_EQUIPMENT_FAILURE, "Timeout");
return 0;
}
so trying to follow (Ctrl+] in Vim) assignment_count_result() brings us to the macros hell:
#define assignment_count_result(counter) do { \
if (!conn->assignment.result_rate_ctr_done) { \
assignment_count(counter); /* <--- We go here */ \
conn->assignment.result_rate_ctr_done = true; \
} else \
LOG_ASSIGNMENT(conn, LOGL_DEBUG, \
"result rate counter already recorded, NOT counting as: %s %s\n", \
bsc_ctr_description[BSC_##counter].name, \
bsc_ctr_description[BSC_##counter].description); \
} while(0)
following the rabbit hole, we end up here:
/* Assume presence of local var 'conn' as struct gsm_subscriber_connection */
#define assignment_count(counter) do { \
struct gsm_bts *bts = conn_get_bts(conn); /* <--- Here is the party pooper! */ \
LOG_ASSIGNMENT(conn, LOGL_DEBUG, "incrementing rate counter: %s %s\n", \
bsc_ctr_description[BSC_##counter].name, \
bsc_ctr_description[BSC_##counter].description); \
rate_ctr_inc(&conn->network->bsc_ctrs->ctr[BSC_##counter]); \
rate_ctr_inc(&bts->bts_ctrs->ctr[BTS_##counter]); \
} while(0)
so I think I found culprit:
commit b5ccf09fc4042c7fb1fdaaa6263961c40b32564e Author: Daniel Willmann <dwillmann@sysmocom.de> Date: Mon Jun 8 14:36:46 2020 +0200 Count assignment rates per BTS as well This adds the assignment counters for the BTS as well and changes the assignment_count() macro to increase both the counters for the BSC as well as the BTS. Related: SYS#4877 Change-Id: I0009e51d4caf68e762138d98e2e23d49acc3cc1a
Updated by fixeria over 3 years ago
Here is a potential fix:
https://gerrit.osmocom.org/c/osmo-bsc/+/18907 assignment_fsm: fix assignment_count(): conn->lchan may be NULL
the bug will be tracked in a separate ticket (see #4620), neels will complete the patch.
Updated by fixeria over 3 years ago
- % Done changed from 0 to 20
I executed BSC_Tests_LCLS.TC_lcls_gcr_only a few times and made packet captures of successful and unsuccessful outcomes.
Comparing the both captures reveals an interesting detail:
- in case of a successful outcome:
- the TC sends CM Service Request with cause MO call
- the IUT sends MGCP CRCX to the virtual MGW
- the IUT activates a TCH channel, TC ACKnowledges
- the IUT sends Assignment Command
- the TC responds with Assignment Complete
- the TC responds with MGCP 200 OK
- the IUT sends RSL ip.access CRCX, TC ACKnowledges
- the IUT sends RSL ip.access MDCX, TC ACKnowledges
- the IUT sends MGCP MDCX to the virtual MGW, TC responds with MGCP 200 OK
- the IUT sends MGCP CRCX to the virtual MGW, TC responds with MGCP 200 OK
- in case of a unsuccessful outcome:
- the TC sends CM Service Request with cause MO call
- the IUT sends MGCP CRCX to the virtual MGW
- the IUT activates a TCH channel, TC ACKnowledges
- the IUT sends Assignment Command
- the TC responds with MGCP 200 OK
- the IUT sends RSL ip.access CRCX, TC does not respond :/
so the key difference is that the IUT sends RSL ip.access CRCX before getting Assignment Complete from the TC. I am not familiar with these things, so my best guess is that the IUT is not doing anything wrong, it's just because the test case does not expect RSL ip.access CRCX. Looks like a race condition caused by decreased delays on the RSL link (thanks to TCP_NODELAY).
Updated by fixeria over 3 years ago
- File lcls_ass_crcx.pcap lcls_ass_crcx.pcap added
Here is what happens on the TITAN's side in case of an unsuccessful outcome:
IPA0-RSL-IPA(8)@DELL: dec_PDU_IPA(): Stream before decoding: '0008007E70010AF410F203'O IPA0-RSL-IPA(8)@DELL: dec_PDU_IPA(): Decoded @IPA_Types.PDU_IPA: { lengthInd := 8, streamId := IPAC_PROTO_RSL_TRX0 (0), streamIdExt := omit, payloadData := '7E70010AF410F203'O } IPA0-RSL-IPA(8)@DELL: Matching on port IPA_PORT IPAC_PROTO_RSL_TRX0 (0) with IPAC_PROTO_CCM (254) unmatched: First message in the queue does not match the template: IPA0-RSL-IPA(8)@DELL: Matching on port IPA_PORT succeeded: matched IPA0-RSL-IPA(8)@DELL: dec_RSL_Message(): Stream before decoding: '7E70010AF410F203'O IPA0-RSL-IPA(8)@DELL: dec_RSL_Message(): Decoded @RSL_Types.RSL_Message: { msg_disc := { msg_group := RSL_MDISC_IPACCESS (63), transparent := false }, msg_type := RSL_MT_IPAC_CRCX (112), ies := { { iei := RSL_IE_CHAN_NR (1), body := { chan_nr := { u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } } }, { iei := RSL_IE_IPAC_SPEECH_MODE (244), body := { ipa_speech_mode := { reserved := '00'B, mode := RSL_IPA_SPM_RECVONLY (1), codec := RSL_IPA_CODEC_FR (0) } } }, { iei := RSL_IE_IPAC_RTP_PAYLOAD (242), body := { ipa_rtp_pt := 3 } } } } IPA0-RSL-RSL(9)@DELL: No Dchan handler for trx_nr=0 and chan_nr={ u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } TC_lcls_gcr_only-B(11)@DELL: Rx L3 from net: { discriminator := '0110'B, tiOrSkip := { skipIndicator := '0000'B }, msgs := { rrm := { assignmentCommand := { messageType := '00101110'B, descrOf1stChAfterTime := { timeslotNumber := '010'B, channelTypeandTDMAOffset := '00001'B, octet3 := '43'O ("C"), octet4 := '67'O ("g") }, PowerCommand : = { powerlevel := '00111'B, fPC_EP := '0'B, ePC_Mode := '0'B, spare_1 := '0'B }, frequencyList_at := omit, cellChannelDescr := omit, descrMultislotAllocation := omit, mo deOf1stChannel := { elementIdentifier := '63'O ("c"), mode := '01'O }, channelSet2 := omit, channelSet3 := omit, channelSet4 := omit, channelSet5 := omit, channelSet6 := omit, channelSet7 := omit, channelSet8 := omit, descrOf2ndChAfterTime := omit, modeOf2ndChannel := omit, mobileAllocation_at := omit, startingTime := omit, frequencyLis t_bt := omit, descrOf1stCh_bt := omit, descrOf2ndCh_bt := omit, frequencyChannelSequence := omit, mobileAllocation_bt := omit, cipherModeSetting := omit, vGCS_TargetMode Indication := omit, multiRateConfiguration := omit, vGCS_Ciphering_Parameters := omit, extendedTSCSet_afterTime := omit, extendedTSCSet_beforeTime := omit } } } }
so it fails because RSL ip.access CRCX is somehow received before the Assignment Command. Usually, on receipt of an Assignment Command the test case calls f_rslem_register(chan_nr), so the RSL Emulation knows were to dispatch messages related to a given chan_nr. In this case the order of messages is messed up, so the RSL Emulation has no idea where to dispatch the received RSL ip.access CRCX message.
This is very odd, given that in Wireshark the order looks correct (see attached capture).
Updated by fixeria over 3 years ago
- Status changed from In Progress to Feedback
- % Done changed from 20 to 40
This is very odd, given that in Wireshark the order looks correct (see attached capture).
Ok, I guess I know what's happening. The test case consists of several components running in parallel. Even if the messages are received in the correct order (I am pretty sure they are), there is still a risk that the RSL Emulation component (IPA0-RSL-RSL@DELL) would process RSL ip.access CRCX faster than the test case component (TC_lcls_gcr_only-B(11)@DELL) would process Assignment Command and register a new expectation in the RSL Emulation component. The TCP_NODELAY itself is not a problem, it just makes osmo-bsc send RSL ip.access CRCX faster, so we hit this race condition more often than without this option.
So this is clearly a problem of the test suite, and I don't know how to solve this problem. It's unfortunately the dark side of concurrency. Probably we can implement a queue in the RSL Emulation component, that would absorb all messages for which there is currently no handler. As soon as a new chan_nr/handler is registered, it would iterate over the queue and dispatch stalled messages.
laforge what do you think?
Updated by fixeria over 3 years ago
P.S. I believe that sporadic test case failures in ttcn3-bsc-test-sccplite [1] are also caused by this problem. All of them fail with:
RSL for unknown Dchan BSC_Tests.ttcn:4501 BSC_Tests control part BSC_Tests.ttcn:2176 TC_assignment_codec_fr testcase
[1] https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-sccplite/test_results_analyzer/
Updated by laforge over 3 years ago
I think we already have the capability to temporarily suspend processing RSL messages
during hand-over, see f_rslem_suspend(). Maybe this approach can also be used here?
Updated by fixeria over 3 years ago
laforge wrote:
I think we already have the capability to temporarily suspend processing RSL messages
during hand-over, see f_rslem_suspend(). Maybe this approach can also be used here?
Thanks for a hint. Unfortunately, I doubt that f_rslem_suspend() would help here. If we suspend processing of RSL messages, we would not get the Assignment Command, and thus wound not send Assignment Complete to the IUT. I did a quick test and indeed it breaks the assignment :/
I'll implement a queue. Given that I am now familiar with the RSL Emulation code, I think it will be easy.
Updated by fixeria over 3 years ago
- % Done changed from 40 to 90
Hooray! This change makes all LCLS test cases pass on my machine with TCP_NODELAY enabled:
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18923 library/RSL_Emulation: implement waiting queue for DCHAN messages
Updated by fixeria over 3 years ago
Just to summarize all comments unrelated to the LCLS test case failures:
- "Assert failed conn->lchan" - tracked in #4620, should be fixed by https://gerrit.osmocom.org/c/osmo-bsc/+/18907
- MGCP memleak - no separate ticket, should be fixed by https://gerrit.osmocom.org/c/osmo-mgw/+/18881
- A-bis memleak - tracked in #4624, pespin will take care
Updated by fixeria over 3 years ago
- % Done changed from 90 to 100
As was discussed with laforge, I re-implemented queuing, so it needs to be enabled and disabled explicitly by calling f_rslem_dchan_queue_enable() and f_rslem_dchan_queue_dispatch():
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18947 library/RSL_Emulation: fix comment near f_rslem_suspend()
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18923 library/RSL_Emulation: implement waiting queue for DChan messages
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18948 MSC_ConnectionHandler: fix race condition in f_establish_fully()
pass BSC_Tests_LCLS.TC_lcls_gcr_only pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr pass BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch pass BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc pass BSC_Tests_LCLS.TC_lcls_connect_break pass BSC_Tests_LCLS.TC_lcls_connect_clear pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr pass BSC_Tests_LCLS.TC_lcls_bts_connect_break Summary: pass: 14 skip: 149
Updated by fixeria over 3 years ago
- Priority changed from High to Low
I was hoping to see all LCLS test cases green in ttcn3-bsc-test build#1019 and ttcn3-centos-bsc-test build#24, but only some of them became green. The rest started to fail due to a different reason (not seen before):
"MSC_ConnectionHandler.ttcn:1034 : Timeout waiting for COMPLETE LAYER 3 INFORMATION" BSC_Tests_LCLS.ttcn:745 BSC_Tests_LCLS control part BSC_Tests_LCLS.ttcn:419 TC_lcls_gcr_bway_codec_mismatch testcase
As it turns out, this a regression caused by:
https://gerrit.osmocom.org/c/osmo-bsc/+/18907 fix crashes due to OSMO_ASSERT(conn->lchan)
20200624220227798 DMSC ERROR gsm_08_08.c:488 Compl L3 without lchan 20200624220227798 DLINP ERROR ipaccess.c:393 Bad signalling message, sign_link returned error: Operation not permitted.
that should have been fixed by:
https://gerrit.osmocom.org/c/osmo-bsc/+/18993 fix segfault introduced by recent segfault fix
Unfortunately, the recent shutdown FSM implementation in osmo-bts somehow broke osmo-bts-omldummy, so currently all test cases fail:
https://jenkins.osmocom.org/jenkins/view/TTCN3-centos/job/TTCN3-centos-bsc-test/26/
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/1020/
https://lists.osmocom.org/pipermail/openbsc/2020-June/013195.html
Since they all pass on my machine with older osmo-bts-omldummy, setting priority to Low.
I'll close this ticket as soon as we see all LCLS test cases green in Jenkins.
Updated by fixeria over 3 years ago
- Status changed from Feedback to Stalled
I'll close this ticket as soon as we see all LCLS test cases green in Jenkins.
Somehow 2/3 test cases still fail. No more time to work on this, will try to investigate later.
"MSC_ConnectionHandler.ttcn:1038 : Timeout waiting for COMPLETE LAYER 3 INFORMATION" BSC_Tests_LCLS.ttcn:742 BSC_Tests_LCLS control part BSC_Tests_LCLS.ttcn:262 TC_lcls_gcr_only testcase
P.S. I just tried to run all LCLS test cases on my machine (natively and in Docker), all pass.
Updated by fixeria over 3 years ago
Huh, this may (potentially) be caused by the test cases we run before LCLS group, e.g. the following ones:
/* at bottom as they might crash OsmoBSC before OS#3182 is fixed */ execute( TC_early_conn_fail() ); execute( TC_late_conn_fail() );
Updated by fixeria over 3 years ago
- Status changed from Stalled to In Progress
- Priority changed from Low to High
Updated by fixeria over 3 years ago
- Status changed from In Progress to New
- Assignee changed from fixeria to neels
- Priority changed from High to Normal
- % Done changed from 100 to 80
Quick *.pcap analysis (latest vs master) shows that the breakage seems to be caused by the round-robin MSC selection.
Successful test case execution, see https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-latest/645/artifact/logs/bsc-tester/BSC_Tests_LCLS.TC_lcls_gcr_only.pcap.gz:
1401 3.222635 172.18.2.20 → 172.18.2.203 RSL 100 IMMEDIATE ASSIGN COMMAND (CCCH) (RR) Immediate Assignment 1402 3.225732 172.18.2.203 → 172.18.2.20 RSL 96 ESTablish INDication (DTAP) (MM) CM Service Request 1410 3.225935 172.18.2.20 → 172.18.2.203 GSMTAP 160 MSC Tx MSC COMPL L3 1411 3.225956 172.18.2.20 → 172.18.2.203 GSMTAP 235 MSC Initializing resources for new SIGTRAN connection to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP... 1417 3.226055 172.18.2.20 → 172.18.2.203 GSMTAP 176 MSC Allocated new connection id: 98 1418 3.226086 172.18.2.20 → 172.18.2.203 GSMTAP 221 MSC Opening new SIGTRAN connection (id=98) to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP 1451 3.226594 172.18.2.200 → 172.18.2.203 GSMTAP 205 LSS7 XUA_AS(virt-msc0){AS_ACTIVE}: Received Event AS-TRANSFER.req 1452 3.226607 172.18.2.200 → 172.18.2.203 GSMTAP 165 LINP connected read/write 1453 3.226612 172.18.2.200 → 172.18.2.203 GSMTAP 157 LINP sending data 1454 3.226623 187 → 185 BSSAP 160 CR (BSSMAP) Complete Layer 3 Information (DTAP) (MM) CM Service Request
Unsuccessful test case execution, see https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/lastCompletedBuild/artifact/logs/bsc-tester/BSC_Tests_LCLS.TC_lcls_gcr_only.pcap.gz:
342 3.160835 172.18.2.20 → 172.18.2.203 RSL 100 IMMEDIATE ASSIGN COMMAND (CCCH) (RR) Immediate Assignment 343 3.163572 172.18.2.203 → 172.18.2.20 RSL 96 ESTablish INDication (DTAP) (MM) CM Service Request 351 3.163697 172.18.2.20 → 172.18.2.203 GSMTAP 160 MSC Tx MSC COMPL L3 352 3.163709 172.18.2.20 → 172.18.2.203 GSMTAP 211 MSC New subscriber IMSI-001019876543210: MSC round-robin selects msc 1 353 3.163721 172.18.2.20 → 172.18.2.203 GSMTAP 231 MSC Initializing resources for new SCCP connection to MSC: RI=SSN_PC,PC=0.0.2,SSN=BSSAP... 358 3.163767 172.18.2.20 → 172.18.2.203 GSMTAP 177 MSC Allocated new connection id: 133 359 3.163776 172.18.2.20 → 172.18.2.203 GSMTAP 218 MSC Opening new SCCP connection (id=133) to MSC: RI=SSN_PC,PC=0.0.2,SSN=BSSAP 391 3.164242 172.18.2.200 → 172.18.2.203 GSMTAP 203 LSS7 XUA_AS(virt-msc1){AS_DOWN}: Received Event AS-TRANSFER.req 392 3.164251 172.18.2.200 → 172.18.2.203 GSMTAP 208 LSS7 XUA_AS(virt-msc1){AS_DOWN}: Event AS-TRANSFER.req not permitted
So AFAIU, the BSC is trying to send a Complete Layer 3 Information to a different MSC (PC=0.23.1 vs PC=0.0.2), so the test case never gets it it.
Assigning to neels since he has been working on the related topic.
Updated by neels over 3 years ago
Found out that the failure in TC_lcls_gcr_only is related to leftover state from a previous MSC pooling test:
'msc 1' still appears connected to osmo-bsc, even though the ttcn3 init omits connecting that MSC.
Now I need to pinpoint what exactly the right mechanism would be for osmo-bsc to realize that an MSC is no longer connected.
I realize that I haven't paid enough attention to that aspect yet.
However, the TC_lcls_gcr_bway_connect test properly picks 'msc 0' and still fails, not sure yet of the reason.
Updated by laforge over 3 years ago
On Fri, Aug 07, 2020 at 01:44:00PM +0000, neels [REDMINE] wrote:
Now I need to pinpoint what exactly the right mechanism would be for osmo-bsc to realize that an MSC is no longer connected.
Without any active calls (SCCP connections) you don't really know.
The only situations are:- SCCP connection timers timing out (if there are any active SCCP connections)
- SCCP network informing the user about unavailability of a given point code
- we don't imlpement this in Osmocom so far, but there are messages like
M3UA DUNA/DAVA that can be used by an STP to notify M3UA ASPs like the BSC about
availability/unavailability of a given point code. - libosmo-sigtan would have to parse such messages and send N-NOTICE.indication
from the SCCP provider to the SCCP user (local SCCP user like MSC, BSC, ...),
as well as N-INFO.indication for active SCCP connections
- we don't imlpement this in Osmocom so far, but there are messages like
Updated by neels over 3 years ago
- Related to Feature #4701: implement OsmoSTP notification of peers disconnecting, e.g. for OsmoBSC to detect that a specific MSC in the pool is disconnected added
Updated by neels over 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 80 to 100
The LCLS test breakage is fixed by https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/19554 by setting 'no allow-attach' on MSCs that are not intended to be used by a test, during f_init().
Detecting SCCP detach of MSCs should be a separate issue from this one: #4701
Closing this issue since all LCLS tests pass again since the patch was merged.
Updated by neels over 3 years ago
- Status changed from Resolved to In Progress
- % Done changed from 100 to 90
re-opening, apparently I broke the 'latest' build with the 'master' fix
Updated by neels over 3 years ago
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/19567
will fast track this change to fix the 'latest' builds.
Updated by laforge over 3 years ago
- Status changed from In Progress to Resolved
can be considered resolved. both nightly and latest LCLS tests pass again.
Updated by fixeria almost 3 years ago
- Related to Bug #4899: ttcn3-bsc-test-latest BSC_Tests_LCLS failing since a few days ago added