Project

General

Profile

Bug #4619

ttcn3-bsc-test: all LCLS test cases broken since build #987

Added by fixeria 17 days ago. Updated 9 days ago.

Status:
Stalled
Priority:
Low
Assignee:
Target version:
-
Start date:
06/17/2020
Due date:
% Done:

100%

Spec Reference:
Tags:

Description

Check out https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/test_results_analyzer/.

This is probably related to my recent IPA/RSL emulation refactoring changes:

Changes
1. centos-repo-install-test: new image (detail)
2. debian-repo-install-test: move scripts to osmo-ci (detail)
3. ttcn3-bts-test: enable 3 additional transceivers for BTS#0 (detail)

Here is an extract from the build artifacts of build#987:

06:26:01.572279 921 RSL_Emulation.ttcn:556 Matching on port IPA_PT succeeded:  matched                                                                                   
06:26:01.572304 921 RSL_Emulation.ttcn:556 Receive operation on port IPA_PT succeeded, message from IPA0-RSL-IPA(920): @IPA_Emulation.ASP_RSL_Unitdata : { conn_id := 1, 
streamId := IPAC_PROTO_RSL_TRX0 (0), rsl := { msg_disc := { msg_group := RSL_MDISC_IPACCESS (63), transparent := false }, msg_type := RSL_MT_IPAC_CRCX (112), ies := { { 
iei := RSL_IE_CHAN_NR (1), body := { chan_nr := { u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } } }, { iei := RSL_IE_IPAC_SPEECH_MODE (244), body := { ipa_speech_mo
de := { reserved := '00'B, mode := RSL_IPA_SPM_RECVONLY (1), codec := RSL_IPA_CODEC_FR (0) } } }, { iei := RSL_IE_IPAC_RTP_PAYLOAD (242), body := { ipa_rtp_pt := 3 } } }
 } } id 27                                                                                                                                                               
06:26:01.572325 921 RSL_Emulation.ttcn:556 Message with id 27 was extracted from the queue of IPA_PT.                                                                    
06:26:01.572351 921 RSL_Emulation.ttcn:199 No Dchan handler for 0{ u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 }                                                    
06:26:01.572469 923 MSC_ConnectionHandler.ttcn:784 dec_PDU_ML3_NW_MS(): Decoded @MobileL3_Types.PDU_ML3_NW_MS: { discriminator := '0110'B, tiOrSkip := { skipIndicator :=
 '0000'B }, msgs := { rrm := { assignmentCommand := { messageType := '00101110'B, descrOf1stChAfterTime := { timeslotNumber := '010'B, channelTypeandTDMAOffset := '00001
'B, octet3 := '43'O ("C"), octet4 := '67'O ("g") }, PowerCommand := { powerlevel := '00111'B, fPC_EP := '0'B, ePC_Mode := '0'B, spare_1 := '0'B }, frequencyList_at := om
it, cellChannelDescr := omit, descrMultislotAllocation := omit, modeOf1stChannel := { elementIdentifier := '63'O ("c"), mode := '01'O }, channelSet2 := omit, channelSet3
 := omit, channelSet4 := omit, channelSet5 := omit, channelSet6 := omit, channelSet7 := omit, channelSet8 := omit, descrOf2ndChAfterTime := omit, modeOf2ndChannel := omi
t, mobileAllocation_at := omit, startingTime := omit, frequencyList_bt := omit, descrOf1stCh_bt := omit, descrOf2ndCh_bt := omit, frequencyChannelSequence := omit, mobil
eAllocation_bt := omit, cipherModeSetting := omit, vGCS_TargetModeIndication := omit, multiRateConfiguration := omit, vGCS_Ciphering_Parameters := omit, extendedTSCSet_a
fterTime := omit, extendedTSCSet_beforeTime := omit } } } }                     
06:26:01.572471 921 RSL_Emulation.ttcn:563 setverdict(fail): none -> fail reason: "RSL for unknown Dchan", new component reason: "RSL for unknown Dchan"                 
06:26:01.572505 921 RSL_Emulation.ttcn:564 Stopping MTC. The current test case will be terminated.                                                                       
06:26:01.572526 921 RSL_Emulation.ttcn:564 Stopping test component execution.

Note that ttcn3-bsc-test-latest is not affected.

lcls_ass_crcx.pcap lcls_ass_crcx.pcap 468 Bytes fixeria, 06/19/2020 01:09 AM

History

#1 Updated by fixeria 17 days ago

This is probably related to my recent IPA/RSL emulation refactoring changes:
...
Note that ttcn3-bsc-test-latest is not affected.

Actually, this should not be the case. If that was a regression in ttcn3-bsc-test, then LCLS tests in *-latest would fail too.

#2 Updated by fixeria 17 days ago

Interestingly enough, some test cases pass when running on my machine:

pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_only
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_break
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_clear
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass->FAIL: 6
  pass: 8
  skip: 149

All libraries, osmo-bsc and ttcn3-bts-test (running natively) were compiled from the recent master.

#3 Updated by fixeria 17 days ago

Hmm, running the same group of test cases again gives different results:

pass BSC_Tests_LCLS.TC_lcls_gcr_only
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass->FAIL BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass BSC_Tests_LCLS.TC_lcls_connect_break
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_clear
pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass->FAIL BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass->FAIL: 7
  pass: 7
  skip: 149

so the test results have sporadic nature...

#4 Updated by fixeria 17 days ago

Running LCLS test cases third time without restarting osmo-bsc resulted in a segmentation fault:

Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376
backtrace() returned 11 addresses
/usr/local/lib/libosmocore.so.12(osmo_panic+0xce) [0x7f6da763758e]
osmo-bsc(+0x35a25) [0x55707098ca25]
osmo-bsc(+0x3ada5) [0x557070991da5]
/usr/local/lib/libosmocore.so.12(+0x13104) [0x7f6da7630104]
/usr/local/lib/libosmocore.so.12(osmo_timers_update+0xc2) [0x7f6da762a062]
/usr/local/lib/libosmocore.so.12(+0xd5d2) [0x7f6da762a5d2]
/usr/local/lib/libosmocore.so.12(osmo_select_main_ctx+0xf) [0x7f6da762ad2f]
osmo-bsc(+0x195f1) [0x5570709705f1]
/usr/lib/libc.so.6(__libc_start_main+0xf2) [0x7f6da743f002]
osmo-bsc(+0x1973e) [0x55707097073e]
signal 6 received

I'll try to reproduce this and investigate the reason. Buiding with "-O0 -ggdb" now...

#5 Updated by fixeria 17 days ago

This is reproducible, just run BSC_Tests_LCLS.control three times. Here is a backtrace:

Program received signal SIGABRT
pwndbg> bt
#0  0x00007ffff6767355 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff6750853 in abort () from /usr/lib/libc.so.6
#2  0x00007ffff72b8593 in osmo_panic_default (args=0x7fffffffdc30, fmt=<optimized out>) at ../../../src/libosmocore/src/panic.c:49
#3  osmo_panic (fmt=<optimized out>) at ../../../src/libosmocore/src/panic.c:84
#4  0x0000555555be6e52 in conn_get_bts (conn=0x622000057160) at ../../include/osmocom/bsc/gsm_data.h:1376
#5  0x0000555555c1edc8 in assignment_fsm_timer_cb (fi=0x612000060220) at assignment_fsm.c:758
#6  0x00007ffff72b1104 in fsm_tmr_cb (data=0x612000060220) at ../../../src/libosmocore/src/fsm.c:325
#7  0x00007ffff72ab062 in osmo_timers_update () at ../../../src/libosmocore/src/timer.c:257
#8  0x00007ffff72ab5d2 in _osmo_select_main (polling=0) at ../../../src/libosmocore/src/select.c:260
#9  0x00007ffff72abd2f in osmo_select_main_ctx (polling=<optimized out>) at ../../../src/libosmocore/src/select.c:291
#10 0x0000555555e1b81b in main (argc=3, argv=0x7fffffffe1b8) at osmo_bsc_main.c:953
#11 0x00007ffff6752002 in __libc_start_main () from /usr/lib/libc.so.6
#12 0x0000555555b61bbe in _start ()

#6 Updated by fixeria 17 days ago

Among with that, we seem to have a memory leak. This is what I see after running one of the LCLS tests several times:

OsmoBSC# show talloc-context application 3 tree 0x60b000000780
talloc report on 'osmo-bsc' (total 1852081 bytes in 738 blocks)
    ipa                            contains 1013233 bytes in  68 blocks (ref 0) 0x60b000000780
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000222460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000206460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0001ea460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0001ce460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0001b2460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000196460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00017a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00015e460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000142460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000126460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00010a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0000ee460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0000d2460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f0000b6460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00009a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00007e460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000062460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f000046460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00002a460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00001c460
      struct e1inp_line              contains  48240 bytes in   3 blocks (ref 0) 0x62f00000e460
      struct ipa_server_link         contains     96 bytes in   2 blocks (ref 0) 0x6100000028a0
      struct ipa_server_link         contains     96 bytes in   2 blocks (ref 0) 0x6100000027a0

#7 Updated by neels 17 days ago

fixeria wrote:

#5 0x0000555555c1edc8 in assignment_fsm_timer_cb (fi=0x612000060220) at assignment_fsm.c:758

this looks like a timeout at a point where the lchan has already been torn down, so that code path should not OSMO_ASSERT(conn->lchan) but rather error handle.

(still doesn't explain the root cause of why the situation comes about)

#8 Updated by fixeria 17 days ago

this looks like a timeout at a point where the lchan has already been torn down

Yes, this (usually) happens right after the test case execution is finished. Looks like a race condition to me.

#9 Updated by fixeria 17 days ago

Among with that, we seem to have a memory leak.

And not just one memory leak, here is another one:

OsmoBSC# show talloc-context application 5 tree 0x6180000040e0
talloc report on 'osmo-bsc' (total 1900481 bytes in 745 blocks)
  struct gsm_network             contains 777050 bytes in 584 blocks (ref 0) 0x6160000000e0
    struct mgcp_client             contains   3488 bytes in  71 blocks (ref 0) 0x6180000040e0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009f160
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009f090
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000abc80
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000ab870
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000aab70
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000aaaa0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4d30
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4c60
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4510
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000a4100
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009b6e0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009b610
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009aec0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00009aab0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000095080
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000094ad0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000917a0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d0000916d0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007efe0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007ef10
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007e7c0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007e3b0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008ae00
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008ad30
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008a5e0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00008a1d0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083f80
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083eb0
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083760
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d000083350
      struct mgcp_response_pending   contains     40 bytes in   1 blocks (ref 0) 0x60d00007aad0

#10 Updated by fixeria 17 days ago

And not just one memory leak, here is another one:

This might be related to:

DRLL ERROR mgcp_client_fsm.c:525 MGCP_CONN(to-MSC)[0x61200000bc20]{ST_READY}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX...
DLCLS ERROR osmo_bsc_lcls.c:539 LCLS[0x612000010120]{NO_LONGER_LS}: transition to state NOT_POSSIBLE_LS not permitted!
DMSC INFO osmo_bsc_sigtran.c:388 Tx MSC: BSSMAP: LCLS-NOTIFICATION
DRLL ERROR mgcp_client_fsm.c:525 MGCP_CONN(to-MSC)[0x612000010d20]{ST_READY}: MGW/DLCX: abrupt FSM termination with connections still present, sending unconditional DLCX...

or

DLMGCP ERROR mgcp_client.c:700 Failed to read: r=127.0.0.1:2427<->l=127.0.0.1:2727: 111='Connection refused'
DLMGCP ERROR mgcp_client.c:700 Failed to read: r=127.0.0.1:2427<->l=127.0.0.1:2727: 111='Connection refused'

I could not find the reason why it's leaked, but found another problem while looking at the code:

https://gerrit.osmocom.org/c/osmo-mgw/+/18880 libosmo-mgcp-client: fix use-after-free in mgcp_client_tx()

#11 Updated by fixeria 17 days ago

I could not find the reason why it's leaked, but found another problem while looking at the code:

Ok, this change fixes the problem for me:

https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak: clear pending queue

#12 Updated by fixeria 17 days ago

  • Status changed from New to In Progress

Interestingly enough, reverting "src/input/ipaccess.c: set TCP_NODELAY" [1] makes the LCLS test results much more stable:

pass BSC_Tests_LCLS.TC_lcls_gcr_only
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass->FAIL BSC_Tests_LCLS.TC_lcls_connect_break
pass BSC_Tests_LCLS.TC_lcls_connect_clear
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass->FAIL: 1
  pass: 13
  skip: 149

[1] https://git.osmocom.org/libosmo-abis/commit/?id=62725d0b58a4842ce96ac9107c565de40fe4e945

#13 Updated by fixeria 16 days ago

https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak: clear pending queue

As was recommended by neels, I took a different approach based on timers:

https://gerrit.osmocom.org/c/osmo-mgw/+/18881 libosmo-mgcp-client: fix memleak in case if no response is received

Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376

This problem is getting even more annoying...

#14 Updated by fixeria 16 days ago

Assert failed conn->lchan ../../../../src/osmo-bsc/include/osmocom/bsc/gsm_data.h:1376

According to the backtrace, the crash originates from assignment_fsm_timer_cb(). This callback tries to update some counters:

int assignment_fsm_timer_cb(struct osmo_fsm_inst *fi)
{
        struct gsm_subscriber_connection *conn = assignment_fi_conn(fi);
        assignment_count_result(CTR_ASSIGNMENT_TIMEOUT); // New: b5ccf09fc4 2020-06-08 14:36:46
        assignment_fail(GSM0808_CAUSE_EQUIPMENT_FAILURE, "Timeout");
        return 0;
}

so trying to follow (Ctrl+] in Vim) assignment_count_result() brings us to the macros hell:

#define assignment_count_result(counter) do { \
                if (!conn->assignment.result_rate_ctr_done) { \
                        assignment_count(counter); /* <--- We go here */ \
                        conn->assignment.result_rate_ctr_done = true; \
                } else \
                        LOG_ASSIGNMENT(conn, LOGL_DEBUG, \
                                       "result rate counter already recorded, NOT counting as: %s %s\n", \
                                       bsc_ctr_description[BSC_##counter].name, \
                                       bsc_ctr_description[BSC_##counter].description); \
        } while(0)

following the rabbit hole, we end up here:

/* Assume presence of local var 'conn' as struct gsm_subscriber_connection */
#define assignment_count(counter) do { \
                struct gsm_bts *bts = conn_get_bts(conn); /* <--- Here is the party pooper! */ \
                LOG_ASSIGNMENT(conn, LOGL_DEBUG, "incrementing rate counter: %s %s\n", \
                               bsc_ctr_description[BSC_##counter].name, \
                               bsc_ctr_description[BSC_##counter].description); \
                rate_ctr_inc(&conn->network->bsc_ctrs->ctr[BSC_##counter]); \
                rate_ctr_inc(&bts->bts_ctrs->ctr[BTS_##counter]); \
        } while(0)

so I think I found culprit:

commit b5ccf09fc4042c7fb1fdaaa6263961c40b32564e
Author: Daniel Willmann <dwillmann@sysmocom.de>
Date:   Mon Jun 8 14:36:46 2020 +0200

    Count assignment rates per BTS as well

    This adds the assignment counters for the BTS as well and changes the
    assignment_count() macro to increase both the counters for the BSC as
    well as the BTS.

    Related: SYS#4877
    Change-Id: I0009e51d4caf68e762138d98e2e23d49acc3cc1a

#15 Updated by fixeria 16 days ago

Here is a potential fix:

https://gerrit.osmocom.org/c/osmo-bsc/+/18907 assignment_fsm: fix assignment_count(): conn->lchan may be NULL

the bug will be tracked in a separate ticket (see #4620), neels will complete the patch.

#16 Updated by fixeria 16 days ago

  • % Done changed from 0 to 20

I executed BSC_Tests_LCLS.TC_lcls_gcr_only a few times and made packet captures of successful and unsuccessful outcomes.
Comparing the both captures reveals an interesting detail:

  • in case of a successful outcome:
    • the TC sends CM Service Request with cause MO call
    • the IUT sends MGCP CRCX to the virtual MGW
    • the IUT activates a TCH channel, TC ACKnowledges
    • the IUT sends Assignment Command
    • the TC responds with Assignment Complete
    • the TC responds with MGCP 200 OK
    • the IUT sends RSL ip.access CRCX, TC ACKnowledges
    • the IUT sends RSL ip.access MDCX, TC ACKnowledges
    • the IUT sends MGCP MDCX to the virtual MGW, TC responds with MGCP 200 OK
    • the IUT sends MGCP CRCX to the virtual MGW, TC responds with MGCP 200 OK
  • in case of a unsuccessful outcome:
    • the TC sends CM Service Request with cause MO call
    • the IUT sends MGCP CRCX to the virtual MGW
    • the IUT activates a TCH channel, TC ACKnowledges
    • the IUT sends Assignment Command
    • the TC responds with MGCP 200 OK
    • the IUT sends RSL ip.access CRCX, TC does not respond :/

so the key difference is that the IUT sends RSL ip.access CRCX before getting Assignment Complete from the TC. I am not familiar with these things, so my best guess is that the IUT is not doing anything wrong, it's just because the test case does not expect RSL ip.access CRCX. Looks like a race condition caused by decreased delays on the RSL link (thanks to TCP_NODELAY).

#17 Updated by fixeria 16 days ago

Here is what happens on the TITAN's side in case of an unsuccessful outcome:

IPA0-RSL-IPA(8)@DELL: dec_PDU_IPA(): Stream before decoding: '0008007E70010AF410F203'O                                                                                   
IPA0-RSL-IPA(8)@DELL: dec_PDU_IPA(): Decoded @IPA_Types.PDU_IPA: { lengthInd := 8, streamId := IPAC_PROTO_RSL_TRX0 (0), streamIdExt := omit, payloadData := '7E70010AF410F203'O }                                                                                                                                                                 
IPA0-RSL-IPA(8)@DELL: Matching on port IPA_PORT IPAC_PROTO_RSL_TRX0 (0) with IPAC_PROTO_CCM (254) unmatched: First message in the queue does not match the template:     
IPA0-RSL-IPA(8)@DELL: Matching on port IPA_PORT succeeded:  matched                                                                                                      
IPA0-RSL-IPA(8)@DELL: dec_RSL_Message(): Stream before decoding: '7E70010AF410F203'O                                                                                     
IPA0-RSL-IPA(8)@DELL: dec_RSL_Message(): Decoded @RSL_Types.RSL_Message: { msg_disc := { msg_group := RSL_MDISC_IPACCESS (63), transparent := false }, msg_type := RSL_MT_IPAC_CRCX (112), ies := { { iei := RSL_IE_CHAN_NR (1), body := { chan_nr := { u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 } } }, { iei := RSL_IE_IPAC_SPEECH_MODE (244), body := { ipa_speech_mode := { reserved := '00'B, mode := RSL_IPA_SPM_RECVONLY (1), codec := RSL_IPA_CODEC_FR (0) } } }, { iei := RSL_IE_IPAC_RTP_PAYLOAD (242), body := { ipa_rtp_pt := 3 } } } }

IPA0-RSL-RSL(9)@DELL: No Dchan handler for trx_nr=0 and chan_nr={ u := { ch0 := RSL_CHAN_NR_Bm_ACCH (1) }, tn := 2 }

TC_lcls_gcr_only-B(11)@DELL: Rx L3 from net: { discriminator := '0110'B, tiOrSkip := { skipIndicator := '0000'B }, msgs := { rrm := { assignmentCommand := { messageType 
:= '00101110'B, descrOf1stChAfterTime := { timeslotNumber := '010'B, channelTypeandTDMAOffset := '00001'B, octet3 := '43'O ("C"), octet4 := '67'O ("g") }, PowerCommand :
= { powerlevel := '00111'B, fPC_EP := '0'B, ePC_Mode := '0'B, spare_1 := '0'B }, frequencyList_at := omit, cellChannelDescr := omit, descrMultislotAllocation := omit, mo
deOf1stChannel := { elementIdentifier := '63'O ("c"), mode := '01'O }, channelSet2 := omit, channelSet3 := omit, channelSet4 := omit, channelSet5 := omit, channelSet6 :=
 omit, channelSet7 := omit, channelSet8 := omit, descrOf2ndChAfterTime := omit, modeOf2ndChannel := omit, mobileAllocation_at := omit, startingTime := omit, frequencyLis
t_bt := omit, descrOf1stCh_bt := omit, descrOf2ndCh_bt := omit, frequencyChannelSequence := omit, mobileAllocation_bt := omit, cipherModeSetting := omit, vGCS_TargetMode
Indication := omit, multiRateConfiguration := omit, vGCS_Ciphering_Parameters := omit, extendedTSCSet_afterTime := omit, extendedTSCSet_beforeTime := omit } } } }

so it fails because RSL ip.access CRCX is somehow received before the Assignment Command. Usually, on receipt of an Assignment Command the test case calls f_rslem_register(chan_nr), so the RSL Emulation knows were to dispatch messages related to a given chan_nr. In this case the order of messages is messed up, so the RSL Emulation has no idea where to dispatch the received RSL ip.access CRCX message.

This is very odd, given that in Wireshark the order looks correct (see attached capture).

#18 Updated by fixeria 16 days ago

  • Status changed from In Progress to Feedback
  • % Done changed from 20 to 40

This is very odd, given that in Wireshark the order looks correct (see attached capture).

Ok, I guess I know what's happening. The test case consists of several components running in parallel. Even if the messages are received in the correct order (I am pretty sure they are), there is still a risk that the RSL Emulation component (IPA0-RSL-RSL@DELL) would process RSL ip.access CRCX faster than the test case component (TC_lcls_gcr_only-B(11)@DELL) would process Assignment Command and register a new expectation in the RSL Emulation component. The TCP_NODELAY itself is not a problem, it just makes osmo-bsc send RSL ip.access CRCX faster, so we hit this race condition more often than without this option.

So this is clearly a problem of the test suite, and I don't know how to solve this problem. It's unfortunately the dark side of concurrency. Probably we can implement a queue in the RSL Emulation component, that would absorb all messages for which there is currently no handler. As soon as a new chan_nr/handler is registered, it would iterate over the queue and dispatch stalled messages.

laforge what do you think?

#19 Updated by fixeria 16 days ago

P.S. I believe that sporadic test case failures in ttcn3-bsc-test-sccplite [1] are also caused by this problem. All of them fail with:

RSL for unknown Dchan
      BSC_Tests.ttcn:4501 BSC_Tests control part
      BSC_Tests.ttcn:2176 TC_assignment_codec_fr testcase

[1] https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-sccplite/test_results_analyzer/

#20 Updated by laforge 15 days ago

I think we already have the capability to temporarily suspend processing RSL messages
during hand-over, see f_rslem_suspend(). Maybe this approach can also be used here?

#21 Updated by fixeria 15 days ago

laforge wrote:

I think we already have the capability to temporarily suspend processing RSL messages
during hand-over, see f_rslem_suspend(). Maybe this approach can also be used here?

Thanks for a hint. Unfortunately, I doubt that f_rslem_suspend() would help here. If we suspend processing of RSL messages, we would not get the Assignment Command, and thus wound not send Assignment Complete to the IUT. I did a quick test and indeed it breaks the assignment :/

I'll implement a queue. Given that I am now familiar with the RSL Emulation code, I think it will be easy.

#22 Updated by fixeria 15 days ago

  • % Done changed from 40 to 90

Hooray! This change makes all LCLS test cases pass on my machine with TCP_NODELAY enabled:

https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18923 library/RSL_Emulation: implement waiting queue for DCHAN messages

#23 Updated by fixeria 15 days ago

Just to summarize all comments unrelated to the LCLS test case failures:

#24 Updated by fixeria 13 days ago

  • % Done changed from 90 to 100

As was discussed with laforge, I re-implemented queuing, so it needs to be enabled and disabled explicitly by calling f_rslem_dchan_queue_enable() and f_rslem_dchan_queue_dispatch():

https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18947 library/RSL_Emulation: fix comment near f_rslem_suspend()
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18923 library/RSL_Emulation: implement waiting queue for DChan messages
https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/18948 MSC_ConnectionHandler: fix race condition in f_establish_fully()

pass BSC_Tests_LCLS.TC_lcls_gcr_only
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_codec_mismatch
pass BSC_Tests_LCLS.TC_lcls_gcr_nomatch_bway_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_cfg
pass BSC_Tests_LCLS.TC_lcls_gcr_unsuppported_csc
pass BSC_Tests_LCLS.TC_lcls_gcr_bway_dont_connect_csc
pass BSC_Tests_LCLS.TC_lcls_connect_break
pass BSC_Tests_LCLS.TC_lcls_connect_clear
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect
pass BSC_Tests_LCLS.TC_lcls_bts_gcr_bway_connect_hr
pass BSC_Tests_LCLS.TC_lcls_bts_connect_break

Summary:
  pass: 14
  skip: 149

#25 Updated by laforge 12 days ago

Patch merged

#26 Updated by fixeria 10 days ago

  • Priority changed from High to Low

I was hoping to see all LCLS test cases green in ttcn3-bsc-test build#1019 and ttcn3-centos-bsc-test build#24, but only some of them became green. The rest started to fail due to a different reason (not seen before):

"MSC_ConnectionHandler.ttcn:1034 : Timeout waiting for COMPLETE LAYER 3 INFORMATION" 
      BSC_Tests_LCLS.ttcn:745 BSC_Tests_LCLS control part
      BSC_Tests_LCLS.ttcn:419 TC_lcls_gcr_bway_codec_mismatch testcase

As it turns out, this a regression caused by:

https://gerrit.osmocom.org/c/osmo-bsc/+/18907 fix crashes due to OSMO_ASSERT(conn->lchan)

20200624220227798 DMSC ERROR gsm_08_08.c:488 Compl L3 without lchan
20200624220227798 DLINP ERROR ipaccess.c:393 Bad signalling message, sign_link returned error: Operation not permitted.

that should have been fixed by:

https://gerrit.osmocom.org/c/osmo-bsc/+/18993 fix segfault introduced by recent segfault fix

Unfortunately, the recent shutdown FSM implementation in osmo-bts somehow broke osmo-bts-omldummy, so currently all test cases fail:

https://jenkins.osmocom.org/jenkins/view/TTCN3-centos/job/TTCN3-centos-bsc-test/26/
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/1020/
https://lists.osmocom.org/pipermail/openbsc/2020-June/013195.html

Since they all pass on my machine with older osmo-bts-omldummy, setting priority to Low.
I'll close this ticket as soon as we see all LCLS test cases green in Jenkins.

#27 Updated by fixeria 9 days ago

  • Status changed from Feedback to Stalled

I'll close this ticket as soon as we see all LCLS test cases green in Jenkins.

Somehow 2/3 test cases still fail. No more time to work on this, will try to investigate later.

https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/1024/testReport/(root)/BSC_Tests_LCLS/

"MSC_ConnectionHandler.ttcn:1038 : Timeout waiting for COMPLETE LAYER 3 INFORMATION" 
      BSC_Tests_LCLS.ttcn:742 BSC_Tests_LCLS control part
      BSC_Tests_LCLS.ttcn:262 TC_lcls_gcr_only testcase

P.S. I just tried to run all LCLS test cases on my machine (natively and in Docker), all pass.

#28 Updated by fixeria 9 days ago

Huh, this may (potentially) be caused by the test cases we run before LCLS group, e.g. the following ones:

/* at bottom as they might crash OsmoBSC before OS#3182 is fixed */
execute( TC_early_conn_fail() );
execute( TC_late_conn_fail() );

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)