Bug #2823

Use bsc_subscr_conn_fsm in BSC

Added by dexter 5 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
01/08/2018
Due date:
% Done:

100%

Estimated time:
Spec Reference:

Description

On laforge/fsm a draft FSM implementation can be found to make the handling of the subscriber connection safer and stateful.

osmo-bsc.cfg 8.96 KB dexter, 03/19/2018 12:53 PM

History

#1 Updated by dexter 5 months ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 30

#2 Updated by dexter 5 months ago

I have read through the code of the FSM implementation. There is still a lot of stuff that needs to be completed, especially the handover stuff. I am not entirely sure if this FSM is breaking existing handover features. That would be good to know for sure first.

Also, there are a lot of signals sent from random code locations. That's probably intentional, but maybe it would be better to call a function inside bsc_subscr_conn_fsm.c that then sends the signal, like we did with the MGCP FSMs. This would allow us to run some checks and do some assertions.
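A minimal sketch of such a wrapper, assuming hypothetical names throughout (conn_fsm, conn_fsm_signal and the CONN_EV_* events are made up for illustration; the real code would work on struct osmo_fsm_inst and dispatch through libosmocore). The point is only that every caller goes through one function, which can validate the request before the event reaches the FSM:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; the real code would operate on
 * struct osmo_fsm_inst and deliver events via libosmocore. */
enum conn_event { CONN_EV_ASSIGNMENT_CMD, CONN_EV_CLEAR_CMD, CONN_EV_COUNT };

struct conn_fsm {
    int delivered[CONN_EV_COUNT];  /* events accepted so far, per type */
};

/* Single entry point for every caller: validate first, then deliver.
 * Returns 0 on success, -1 if the request was rejected. */
int conn_fsm_signal(struct conn_fsm *fi, enum conn_event ev, const void *data)
{
    if (fi == NULL || (unsigned int)ev >= CONN_EV_COUNT)
        return -1;
    /* Assumption: an assignment command always carries a payload. */
    if (ev == CONN_EV_ASSIGNMENT_CMD && data == NULL)
        return -1;
    fi->delivered[ev]++;
    return 0;
}
```

With this shape, a caller elsewhere in the code cannot inject an event with a missing payload; the check lives in one place instead of at every call site.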

The FSM had a small bug that prevented it from working properly when making calls. I fixed that and it seems to run really well!

#3 Updated by dexter 5 months ago

I have tested the internal handover with the FSM. It works fine, so the FSM did not break anything. Also, the FSM now cleans up/frees correctly. The synchronization with the MGCP FSM is still problematic. We cannot run them separately; they need to be synchronized because the port/IP information for the RTP streams becomes known at different times.

#4 Updated by dexter 5 months ago

While integrating the FSM I ran into some trouble, nothing that couldn't be fixed, but after discussion with Harald we decided to go a bit further and clean up the FSM that controls the MGCP side. The vision is to have a generalized FSM that can be used in osmo-bsc, osmo-msc and in any other project that somehow needs to interact with an MGW. The FSM works as a child FSM of some parent (in the bsc case this is the GSCON FSM).

For now the FSM has two states, ST_READY and ST_WAIT. In ST_READY we wait for an action to perform, e.g. EV_MDCX. Then we call the mgcp_client, the mgcp_client calls the callback, and the callback sends back an EV_MDCX_RESP. The code in ST_WAIT gets executed and a signal is sent back to the parent FSM. This works the same for DLCX and CRCX, but CRCX is an exception in one respect: when the function to execute the CRCX is called, the FSM is created and the EV_CRCX is issued automatically. This is the only way to perform a CRCX, so once the FSM has come up successfully we can be sure that the connection exists.

In case of error the FSM shall unlink from its parent, inform the parent via the term event and then safely destroy itself. This allows us to tear down the SCCP connection quickly while the MGCP cleanups are still in progress. It's also much simpler for the parent FSM, since the parent does not have to synchronize.

The API is just a set of three functions: one for CRCX (sets up the FSM and does the actual CRCX), one for MDCX and one for DLCX (also takes care of cleaning up everything).

/*! allocate FSM, and create a new connection on the MGW.
 *  \param[in] mgcp MGCP client descriptor.
 *  \param[in] parent_fi Parent FSM instance.
 *  \param[in] parent_term_evt Event to be sent to parent when terminating.
 *  \param[in] parent_evt Event to be sent to parent when operation is done.
 *  \param[in] endpoint Optional endpoint identifier string.
 *  \param[in] addr Optional ip-address string.
 *  \param[in] port Optional ip-port number.
 *  \returns newly-allocated, initialized and registered FSM instance, NULL on error. */
struct osmo_fsm_inst *mgcp_conn_create(struct mgcp_client *mgcp, struct osmo_fsm_inst *parent_fi, uint32_t parent_term_evt,
                       uint32_t parent_evt, char *endpoint, char *addr, uint16_t port)

/*! modify an existing connection on the MGW.
 *  \param[in] fi FSM instance.
 *  \param[in] parent_evt Event to be sent to parent when operation is done.
 *  \param[in] addr New ip-address string.
 *  \param[in] port New ip-port number.
 *  \returns 0 on success, -EINVAL on error. */
int mgcp_conn_modify(struct osmo_fsm_inst *fi, uint32_t parent_evt, char *addr, uint16_t port)

/*! delete existing connection on the MGW, destroy FSM afterwards.
 *  \param[in] fi FSM instance.
 *  \returns 0 on success, -EINVAL on error. */
int mgcp_conn_delete(struct osmo_fsm_inst *fi)

The signals that are sent back to the parent FSM will always contain a pointer to the private data of the FSM. From there the parent can get the relevant information (ports, IP address, endpoint identifier etc.).

I am directly implementing the FSM in osmo-mgw, so it won't be in osmo-bsc, osmo-bsc will just use it and we may also decide to use it in osmo-msc as well once we are confident that it works.

#5 Updated by laforge 5 months ago

On Fri, Jan 12, 2018 at 05:43:27PM +0000, dexter [REDMINE] wrote:

For now the FSM has two states ST_READY and ST_WAIT.

I would make sure to introduce more states. In the initial state, until the CRCX is acknowledged,
no MDCX is permitted. MDCX is only permitted after a successful CRCX response has been received,
so you need to differentiate this state from the other?

The API is just a set of three functions: one for CRCX (sets up the FSM and does the actual CRCX), one for MDCX and one for DLCX (also takes care of cleaning up everything).

struct osmo_fsm_inst *mgcp_conn_create(struct mgcp_client *mgcp, struct osmo_fsm_inst *parent_fi, uint32_t parent_term_evt,
uint32_t parent_evt, char *endpoint, char *addr, uint16_t port)

I think the API should be more future-proof and flexible. Let's make sure that at least the addr/port
parameters are passed in via some struct pointer. This way we can extend it later on without breaking
the ABI.

The client will soon / should for example provide the SDP related bits such as which codec shall be used for that connection, ... - and once we start doing this, we don't want to break the API again and again.

It might make sense to have a struct for the local (mgw) side of the connection and another (or a copy of the same struct?) for the remote (bts/core network) side. If only the "local" is passed in while the "remote" struct is NULL, then the CRCX only does the "bind" operation. If both local+remote are specified, it performs bind+connect. This should map to the concepts of MGCP (local / remote connection options).
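A minimal sketch of that idea, with hypothetical names (conn_peer, crcx_mode and crcx_select_mode are not part of any existing API). It only shows how a NULL "remote" can select bind-only while both pointers select bind+connect:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct describing one side of an RTP connection. */
struct conn_peer {
    char addr[16];   /* IPv4 address string, sized like INET_ADDRSTRLEN */
    uint16_t port;   /* RTP port */
};

enum crcx_mode { CRCX_INVALID, CRCX_BIND_ONLY, CRCX_BIND_CONNECT };

/* Decide what the CRCX should do from which sides were passed in. */
enum crcx_mode crcx_select_mode(const struct conn_peer *local,
                                const struct conn_peer *remote)
{
    if (local == NULL)
        return CRCX_INVALID;        /* assumption: local side is mandatory */
    return remote ? CRCX_BIND_CONNECT : CRCX_BIND_ONLY;
}
```

Passing both sides as struct pointers also means new members (e.g. codec information) can be added later without changing the function signature.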

I am directly implementing the FSM in osmo-mgw, so it won't be in osmo-bsc, osmo-bsc will just use it and we may also decide to use it in osmo-msc as well once we are confident that it works.

great.

#6 Updated by dexter 5 months ago

  • % Done changed from 30 to 40

I would make sure to introduce more states...

I see, I have now moved more of the logic into separate states. In the end I think it also makes the code easier to understand. We have now:

    ST_CRCX,
    ST_CRCX_RESP,
    ST_READY,
    ST_MDCX_RESP,
    ST_DLCX_RESP,

ST_CRCX is entered immediately on startup. It sends the CRCX message to the MGW and enters ST_CRCX_RESP. When the response is received, the FSM fires an EV_CRCX_RESP to tell the parent that the CRCX was successful. Then it waits in ST_READY for further operations. For the other operations, the mechanism is nearly the same. When e.g. an EV_MDCX is received, a message is generated and sent to the MGW. We enter ST_MDCX_RESP and wait for the MGW response. When the response is received, we fire the matching event and return to ST_READY.
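The transitions described above can be sketched as a plain lookup function. This is not the libosmocore FSM code; EV_DLCX, EV_DLCX_RESP and the ST_DONE marker are assumptions added for illustration (the FSM instance is destroyed after DLCX, so there is no real state for it):

```c
#include <assert.h>

enum state {
    ST_CRCX, ST_CRCX_RESP, ST_READY, ST_MDCX_RESP, ST_DLCX_RESP,
    ST_DONE  /* hypothetical marker: instance is gone after DLCX */
};

enum event {
    EV_CRCX, EV_CRCX_RESP, EV_MDCX, EV_MDCX_RESP,
    EV_DLCX, EV_DLCX_RESP  /* DLCX event names are assumed */
};

/* Returns the next state, or -1 if the event is not permitted in 'st'. */
int fsm_next(enum state st, enum event ev)
{
    switch (st) {
    case ST_CRCX:
        return ev == EV_CRCX ? ST_CRCX_RESP : -1;
    case ST_CRCX_RESP:
        return ev == EV_CRCX_RESP ? ST_READY : -1;
    case ST_READY:
        if (ev == EV_MDCX)
            return ST_MDCX_RESP;
        if (ev == EV_DLCX)
            return ST_DLCX_RESP;
        return -1;
    case ST_MDCX_RESP:
        return ev == EV_MDCX_RESP ? ST_READY : -1;
    case ST_DLCX_RESP:
        return ev == EV_DLCX_RESP ? ST_DONE : -1;
    default:
        return -1;
    }
}
```

Note that an EV_MDCX is rejected until the CRCX response has moved the FSM to ST_READY, which is the property asked for in the previous comment.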

Let's make sure that at least the addr/port

The parameters that relate to the connection are now encapsulated in struct mgcp_conn_info (see below). The same struct is used to return results via the event pointer. This allows us to keep the struct with the context information entirely private. I have removed it from the header file now, so that it is inaccessible from outside.

...If both local+remote are specified...

On the CRCX we may or may not pass in a struct mgcp_conn_info *conn_info. If it is not passed, then we know that the user only wants to bind. If it is there, we know that the user wants to do bind+connect. The result is returned with the event pointer. (I do not really get what you mean. Why would we want to pass remote and local to the create function? To tell it a memory location where it can write to? I have reserved fixed memory in the context struct now. I think this is more convenient, since we can return the pointer with the signal and the caller does not have to take care of the memory.)

/*! Connection information. This struct organizes the connection information
 *  of one connection side (either remote or local). It is used to pass parameters
 *  (local) to the FSM and get responses (remote) from the FSM as pointer
 *  attached to the FSM event */
struct mgcp_conn_info {
    /*! RTP connection IP-Address (string) */
    char addr[INET_ADDRSTRLEN];

    /*! RTP connection IP-Port */
    uint16_t port;
};

/*! allocate FSM, and create a new connection on the MGW.
 *  \param[in] mgcp MGCP client descriptor.
 *  \param[in] parent_fi Parent FSM instance.
 *  \param[in] parent_term_evt Event to be sent to parent when terminating.
 *  \param[in] parent_evt Event to be sent to parent when operation is done.
 *  \param[in] endpoint Endpoint identifier string.
 *  \param[in] conn_info Optional connection information (ip, port...).
 *  \returns newly-allocated, initialized and registered FSM instance, NULL on error. */
struct osmo_fsm_inst *mgcp_conn_create(struct mgcp_client *mgcp, struct osmo_fsm_inst *parent_fi, uint32_t parent_term_evt,
                       uint32_t parent_evt, char *endpoint, struct mgcp_conn_info *conn_info)

/*! modify an existing connection on the MGW.
 *  \param[in] fi FSM instance.
 *  \param[in] parent_evt Event to be sent to parent when operation is done.
 *  \param[in] conn_info New connection information (ip, port...).
 *  \returns 0 on success, -EINVAL on error. */
int mgcp_conn_modify(struct osmo_fsm_inst *fi, uint32_t parent_evt, struct mgcp_conn_info *conn_info)

/*! delete existing connection on the MGW, destroy FSM afterwards.
 *  \param[in] fi FSM instance. */
void mgcp_conn_delete(struct osmo_fsm_inst *fi)

So far I have the CRCX and the error handling (timeout, cleanup...) working. For debugging I created a small test FSM that acts as parent. There is also the question of what should happen if someone tries to execute an operation while the FSM is busy. Under normal conditions this should never happen, since child and parent are synchronized through events. But for DLCX we need something. I have solved this with a flag. If someone tries to trigger a DLCX while the FSM is busy, it just sets the flag and does nothing. When the FSM is done with the pending operation, it checks the flag. If it finds the flag set, it continues with the DLCX operation.
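A minimal sketch of the deferred-DLCX flag, assuming a simplified context struct (mgcp_ctx, request_dlcx and operation_done are hypothetical names, and the counter only exists so the behaviour can be observed):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified context of the MGCP client FSM. */
struct mgcp_ctx {
    bool busy;          /* an operation is waiting for the MGW response */
    bool dlcx_pending;  /* a DLCX was requested while busy */
    int dlcx_started;   /* number of DLCX operations actually started */
};

static void start_dlcx(struct mgcp_ctx *ctx)
{
    ctx->busy = true;
    ctx->dlcx_started++;
}

/* Called by the user: start the DLCX now, or remember it for later. */
void request_dlcx(struct mgcp_ctx *ctx)
{
    if (ctx->busy) {
        ctx->dlcx_pending = true;
        return;
    }
    start_dlcx(ctx);
}

/* Called when the pending operation has completed. */
void operation_done(struct mgcp_ctx *ctx)
{
    ctx->busy = false;
    if (ctx->dlcx_pending) {
        ctx->dlcx_pending = false;
        start_dlcx(ctx);
    }
}
```

Requesting the DLCX several times while busy still results in a single DLCX, since the flag is only a boolean.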

#7 Updated by dexter 5 months ago

The patch is up for review: https://gerrit.osmocom.org/#/c/5881/

However, I noticed that I still get a segfault when the parent FSM is freed. Otherwise I think it should be fine by now and we are ready to try it out in real life.

I am a bit unhappy with struct mgcp_conn_info. We have the two members endpoint and call_id there. These are misplaced when doing an MDCX. On CRCX they are fine: we need to specify the call_id and an endpoint (with wildcard). On MDCX we specify neither an endpoint nor a call_id, but the members exist. So if someone specifies the endpoint, it is at least checked. The call_id is an unsigned int. On MDCX it is ignored, since I cannot detect whether it is correctly specified or not. This is kind of a cosmetic issue. The alternative would be several different struct types, which would not be nice either.

#8 Updated by dexter 5 months ago

I am now using the client FSM with bsc_subscr_conn_fsm. Voice calls work without crashing, also the endpoints seem to be released without problems. However, the whole thing is not ready yet. I still need some error handling and return code checking.

#9 Updated by dexter 5 months ago

  • % Done changed from 40 to 60

I have merged the changes I made with the latest state of laforge/fsm and pushed the changes to pmaier/fsm. This was a bit bumpy since laforge/fsm changed here and there while I worked out my changes, but now it's in sync again. I hope I did not break anything again.

I am currently working on the problem with the rogue callers of osmo_bsc_sigtran_send(). We must ensure that osmo_bsc_sigtran_send() is called from nowhere except from bsc_subscr_conn_fsm.c and even there it may only be called when an sccp connection is open.

I have identified the following locations:

./src/osmo-bsc/osmo_bsc_api.c:52:    osmo_bsc_sigtran_send(conn, resp);

./src/osmo-bsc/osmo_bsc_bssap.c:679:    osmo_bsc_sigtran_send(conn, resp);
called from bssmap_handle_cipher_mode() on bad cipher mode.

./src/osmo-bsc/osmo_bsc_bssap.c:852:    osmo_bsc_sigtran_send(conn, resp); called from bssmap_handle_assignm_req() on bad assignment => ok!
./src/osmo-bsc/osmo_bsc_api.c:478:    osmo_bsc_sigtran_send(conn, resp); CLEAR REQUEST
This is also problematic if the bsc_api can issue uncontrolled clear requests
at any time. Should be fixed now, since it now uses a signal to the FSM.

./src/osmo-bsc/osmo_bsc_api.c:158:    queue_msg_or_return(resp);  SAPI n REJECT
./src/osmo-bsc/osmo_bsc_api.c:169:    queue_msg_or_return(resp);  CIPHER MODE COMPLETE

./src/osmo-bsc/osmo_bsc_api.c:462:    queue_msg_or_return(resp); ASSIGNMENT FAIL
This is a bug: the assignment failure message should not be generated here; instead
we should dispatch GSCON_EV_RR_ASS_FAIL to the FSM so that the FSM can take care
of this properly.

./src/osmo-bsc/osmo_bsc_api.c:491:    queue_msg_or_return(resp);  CLASSMARK UPDATE

I need some advice on the handling of the MGW. Since we now have the FSM-based interface for the MGW, it's a lot easier to handle the MGW connections. But even now we end up with three extra states: ST_WAIT_CRCX_BTS, ST_WAIT_MDCX_BTS and ST_WAIT_CRCX_MSC. Between ST_WAIT_CRCX_BTS and ST_WAIT_MDCX_BTS there also sits ST_WAIT_ASS_CMP. Over the whole MGW handling and assignment period we must pass DTAP traffic and also be ready to handle other SCCP traffic. I think it would look a lot nicer if the MGCP handling had its own FSM. However, then we gain even more complexity, and this separate FSM would then be responsible for a lot of GSCON-related stuff. I am not sure if this is so helpful. In any case, once we add handover again, GSCON will gain another two states for the MDCX procedure (or maybe not, if we do a smart reordering of the states). I think it would be good if we could discuss this issue tomorrow. There are also some open questions regarding the FSM.

#10 Updated by dexter 5 months ago

As discussed recently:

  • The MGCP part will get its own FSM
  • The assignment phase will be split up in lchan allocation and mode modify phase. At the moment we just call gsm0808_assign_req() which does not offer the flexibility we need here.

Last week I ran a few TTCN3 tests against the current state and found a few crashes. Those are fixed now. It looks pretty stable now. Also, it's now impossible to send connection-oriented sigtran messages when the FSM is in an "unconnected" state.
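A minimal sketch of such a guard, assuming a reduced set of states and hypothetical names (conn, conn_send and sccp_is_open are made up; which states count as "connected" here is an assumption for illustration). The send path checks the FSM state first and refuses to send when no SCCP connection is open:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced, illustrative set of connection states. */
enum conn_state { ST_INIT, ST_WAIT_CC, ST_ACTIVE, ST_WAIT_CRCX_BTS };

struct conn {
    enum conn_state state;
    int msgs_sent;      /* counts messages handed to the lower layer */
};

/* Assumption: the SCCP connection is usable once the confirm has arrived. */
static bool sccp_is_open(const struct conn *conn)
{
    switch (conn->state) {
    case ST_ACTIVE:
    case ST_WAIT_CRCX_BTS:
        return true;
    default:
        return false;
    }
}

/* Returns 0 if the message was sent, -1 if it was refused. */
int conn_send(struct conn *conn, const char *msg)
{
    if (conn == NULL || msg == NULL)
        return -1;
    if (!sccp_is_open(conn))
        return -1;      /* unconnected: drop instead of sending */
    conn->msgs_sent++;
    return 0;
}
```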

#11 Updated by laforge 4 months ago

  • Project changed from OsmoMSC to OsmoBSC
  • Category deleted (A interface (general))
  • Priority changed from Normal to Urgent

#12 Updated by dexter 4 months ago

Current status: I am working on distinguishing the assignments with voice channels from the assignments which only request a signaling channel (since we do not support CSD, we should reject any requests for data channels).

This is the group of tests I am currently working on:

#BSC_Tests.TC_assignment_cic_only    # Error
#BSC_Tests.TC_assignment_csd        # Pass (Passes because we reject any ass-req. other than voice)
#BSC_Tests.TC_assignment_ctm        # Pass (Passes because we reject any ass-req. other than voice)
#BSC_Tests.TC_assignment_sign        # Fail

#BSC_Tests.TC_assignment_fr_a5_0    # Pass
#BSC_Tests.TC_assignment_fr_a5_1_codec_missing # Pass
#BSC_Tests.TC_assignment_fr_a5_1    # Pass
#BSC_Tests.TC_assignment_fr_a5_3    # Pass
#BSC_Tests.TC_assignment_fr_a5_4    # Pass

The following tasks relate to this task as well:

https://osmocom.org/issues/2782 Bug #2782 OsmoBSC sends BSSMAP ASSIGNMENT COMPLETE before RSL MODE MODIFY succeeds
https://osmocom.org/issues/2823 Bug #2823 Use bsc_subscr_conn_fsm in BSC
http://osmocom.org/issues/2768 Bug #2768 OsmoBSC doesn't perform MGCP DLCX in all cases of channel release
https://osmocom.org/issues/2283 Bug #2283 Inter-BSC hand-over is missing (BSC side)
http://osmocom.org/issues/2898 Bug #2898 OsmoBSC can generate BSSMAP ASSIGNMENT FAIL after BSSMAP ASSIGNMENT COMPLETE

#13 Updated by dexter 4 months ago

Also related:
https://osmocom.org/issues/2936 Bug #2936 Fix TTCN3 Test BSC_Tests.TC_assignment_sign

#14 Updated by dexter 4 months ago

Note: deleted pmaier/fsm, pmaier/fsm now holds the current state based on laforge/fsm. The branch has also been rebased to current master.

#15 Updated by dexter 4 months ago

Note: the current state (rebased to master) can be found on pmaier/fsm3. However, the rebase did not go all too smoothly; I had a lot of merge conflicts. I did a quick manual test and it seems to work fine, including the VTY-triggered handover. But I am not sure about the recent handover-related changes. We need to check on that.

#16 Updated by dexter 4 months ago

  • % Done changed from 60 to 70

I have resolved the merge conflicts and sent the patches which were not directly related to GSCON to review. For the remaining patches on pmaier/fsm I suggest to squash them into a single patch.

Next steps:

  • For now I got handover working. However, we might consider moving the handling of T3103 into the FSM.
  • To support Inter-BSC handover we need to be able to accept SCCP connections from the MSC. There is also a TTCN3 test for that already: BSC_Tests.TC_outbound_connect

#17 Updated by dexter 4 months ago

This is the current status when testing neels/fsm3:

#BSC_Tests.control
#BSC_Tests.TC_chan_act_noreply            #Pass
#BSC_Tests.TC_chan_act_ack_noest        #Pass
#BSC_Tests.TC_chan_act_ack_est_ind_noreply    #Pass
#BSC_Tests.TC_chan_act_ack_est_ind_refused    #Pass
#BSC_Tests.TC_chan_act_nack            #Pass
#BSC_Tests.TC_chan_exhaustion            #Pass
#BSC_Tests.TC_ctrl                #Pass
#BSC_Tests.TC_chan_rel_rll_rel_ind        #Fail (known problem, postponed)
#BSC_Tests.TC_chan_rel_conn_fail        #Pass
#BSC_Tests.TC_chan_rel_hard_clear        #Pass
#BSC_Tests.TC_chan_rel_hard_rlsd        #Pass
#BSC_Tests.TC_chan_rel_a_reset            #Pass

#BSC_Tests.TC_rll_est_ind_inact_lchan        #Pass
#BSC_Tests.TC_rll_est_ind_inval_sapi1        #Pass
#BSC_Tests.TC_rll_est_ind_inval_sapi3        #Pass
#BSC_Tests.TC_rll_est_ind_inval_sacch        #Pass

#BSC_Tests.TC_outbound_connect            #Error (The BSC must accept a connection)
#BSC_Tests.TC_assignment_cic_only        #Pass
#BSC_Tests.TC_assignment_csd            #Pass
#BSC_Tests.TC_assignment_ctm            #Pass
#BSC_Tests.TC_assignment_sign            #Fail
#BSC_Tests.TC_assignment_fr_a5_0        #Pass
#BSC_Tests.TC_assignment_fr_a5_1_codec_missing    #Pass
#BSC_Tests.TC_assignment_fr_a5_1        #Pass
#BSC_Tests.TC_assignment_fr_a5_3        #Pass
#BSC_Tests.TC_assignment_fr_a5_4        #Pass

#BSC_Tests.TC_paging_imsi_nochan         #Pass
#BSC_Tests.TC_paging_tmsi_nochan        #Pass
#BSC_Tests.TC_paging_tmsi_any            #Pass
#BSC_Tests.TC_paging_tmsi_sdcch            #Pass
#BSC_Tests.TC_paging_tmsi_tch_f            #Pass
#BSC_Tests.TC_paging_tmsi_tch_hf        #Pass
#BSC_Tests.TC_paging_imsi_nochan_cgi        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lac_ci        #Pass
#BSC_Tests.TC_paging_imsi_nochan_ci        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lai        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lac        #Pass
#BSC_Tests.TC_paging_imsi_nochan_all        #Pass
#BSC_Tests.TC_paging_imsi_nochan_plmn_lac_rnc    #Pass
#BSC_Tests.TC_paging_imsi_nochan_rnc        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lac_rnc    #Pass
#BSC_Tests.TC_paging_imsi_nochan_lacs        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lacs_empty    #Pass
#BSC_Tests.TC_paging_imsi_a_reset        #Pass
#BSC_Tests.TC_paging_imsi_load            #Fail
#BSC_Tests.TC_paging_counter            #Pass

#BSC_Tests.TC_rsl_drop_counter            #Pass
#BSC_Tests.TC_classmark                #Pass
#BSC_Tests.TC_unsol_ass_fail            #Pass
#BSC_Tests.TC_unsol_ass_compl            #Pass
#BSC_Tests.TC_unsol_ho_fail            #Pass
#BSC_Tests.TC_err_82_short_msg            #Pass
#BSC_Tests.TC_err_84_unknown_msg        #Fail (should trigger an RR status but does not)
#BSC_Tests.TC_ho_int                #Fail (regression ?)

BSC_Tests.TC_ho_int seems to be indeed a regression. All other tests look fine when compared against the current jenkins result.

#18 Updated by neels 4 months ago

dexter wrote:

BSC_Tests.TC_ho_int seems to be indeed a regression. All other tests look fine when compared against the current jenkins result.

I see TC_ho_int ending "inconclusive" and see the same on jenkins. What do you see when using the branch?

My biggest hindrance for testing the branch is that the osmo-bsc.git test suite doesn't build and hence 'make install' doesn't work. I'd need to work around that, by omitting the tests, but I'd prefer if we got those working instead. Obviously some code needs to move around to get the tests to build, still, and no point in testing thoroughly if we know it needs refactoring anyway.

#19 Updated by dexter 4 months ago

  • % Done changed from 70 to 80

Not too much progress during the last week. The internal handover code had a bug which got caught thanks to the TC_ho_int test. The current status is on pmaier/fsm4.

neels: I can get TC_ho_int to pass, but it only works sometimes. I have the feeling that there are race conditions in the TTCN3 code. The behavior I observe is not specific to pmaier/fsm4, the current master exhibits the same behavior.

neels: I think we are almost there. When you can manage to get the handover unit-tests to pass again we can squash all patches into one and submit to review.

#20 Updated by dexter 4 months ago

  • % Done changed from 80 to 90

The GSCON-related commits are now squashed and tested using TTCN3 and a manual test. The current state can be found on pmaier/fsm5. The patch is up for review in gerrit now: https://gerrit.osmocom.org/7142

#21 Updated by dexter 3 months ago

The current state that matches the state that is currently in review can be found on pmaier/fsm6

#22 Updated by dexter 3 months ago

The patch got merged, but unfortunately we see a couple regressions. I have now re-run all the tests with the current master (osmo-bsc + libs + ttcn3-testsuite). This is the result:

#BSC_Tests.TC_ctrl_msc_connection_status    #Pass
#BSC_Tests.TC_ctrl_msc0_connection_status    #Pass
#BSC_Tests.TC_ctrl                #Pass

#BSC_Tests.TC_chan_act_noreply            #Pass
#BSC_Tests.TC_chan_act_ack_noest        #Pass
#BSC_Tests.TC_chan_act_ack_est_ind_noreply    #Pass
#BSC_Tests.TC_chan_act_ack_est_ind_refused    #Pass
#BSC_Tests.TC_chan_act_nack            #Pass
#BSC_Tests.TC_chan_exhaustion            #Pass
#BSC_Tests.TC_chan_rel_rll_rel_ind        #Pass
#BSC_Tests.TC_chan_rel_conn_fail        #Pass
#BSC_Tests.TC_chan_rel_hard_clear        #Pass
#BSC_Tests.TC_chan_rel_hard_rlsd        #Pass
#BSC_Tests.TC_chan_rel_a_reset            #Pass

#BSC_Tests.TC_outbound_connect            #Pass

#BSC_Tests.TC_assignment_cic_only        #Pass
#BSC_Tests.TC_assignment_csd            #Pass
#BSC_Tests.TC_assignment_ctm            #Pass
#BSC_Tests.TC_assignment_sign            #Fail
BSC_Tests.TC_assignment_fr_a5_0        #Pass (jenkins: fail)
#BSC_Tests.TC_assignment_fr_a5_1_codec_missing    #Pass
BSC_Tests.TC_assignment_fr_a5_1        #Pass (jenkins: fail)
BSC_Tests.TC_assignment_fr_a5_3        #Pass (jenkins: fail)
BSC_Tests.TC_assignment_fr_a5_4        #Pass (jenkins: fail)

#BSC_Tests.TC_rll_est_ind_inact_lchan        #Pass
#BSC_Tests.TC_rll_est_ind_inval_sapi1        #Pass
#BSC_Tests.TC_rll_est_ind_inval_sapi3        #Pass
#BSC_Tests.TC_rll_est_ind_inval_sacch        #Pass

#BSC_Tests.TC_paging_imsi_nochan         #Pass
#BSC_Tests.TC_paging_tmsi_nochan        #Pass
#BSC_Tests.TC_paging_tmsi_any            #Pass
#BSC_Tests.TC_paging_tmsi_sdcch            #Pass
#BSC_Tests.TC_paging_tmsi_tch_f            #Pass
#BSC_Tests.TC_paging_tmsi_tch_hf        #Pass
#BSC_Tests.TC_paging_imsi_nochan_cgi        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lac_ci        #Pass
#BSC_Tests.TC_paging_imsi_nochan_ci        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lai        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lac        #Pass (jenkins: fail)
#BSC_Tests.TC_paging_imsi_nochan_all        #Pass
#BSC_Tests.TC_paging_imsi_nochan_plmn_lac_rnc    #Fail (regression)
#BSC_Tests.TC_paging_imsi_nochan_rnc        #Fail (regression)
#BSC_Tests.TC_paging_imsi_nochan_lac_rnc    #Fail (regression)
#BSC_Tests.TC_paging_imsi_nochan_lacs        #Pass
#BSC_Tests.TC_paging_imsi_nochan_lacs_empty    #Pass
#BSC_Tests.TC_paging_imsi_a_reset        #Pass
#BSC_Tests.TC_paging_imsi_load            #Pass
#BSC_Tests.TC_paging_counter            #Pass

#BSC_Tests.TC_rsl_drop_counter            #Pass
#BSC_Tests.TC_rsl_unknown_unit_id        #Pass

#BSC_Tests.TC_oml_unknown_unit_id        #Pass

#BSC_Tests.TC_classmark                #Pass
#BSC_Tests.TC_unsol_ass_fail            #Pass
#BSC_Tests.TC_unsol_ass_compl            #Pass
#BSC_Tests.TC_unsol_ho_fail            #Pass
#BSC_Tests.TC_err_82_short_msg            #Pass
#BSC_Tests.TC_err_84_unknown_msg        #Pass
#BSC_Tests.TC_ho_int                #Pass (jenkins: fail)

#BSC_Tests.TC_bssap_rlsd_does_not_cause_bssmap_reset        #Pass
#BSC_Tests.TC_bssmap_clear_does_not_cause_bssmap_reset   #Fail
#BSC_Tests.TC_ms_rel_ind_does_not_cause_bssmap_reset     #Fail

What worries me the most is that the TC_assignment_... tests (marked with jenkins: fail) are failing on jenkins. The tests work fine locally. I have checked the build artefacts from jenkins and one can clearly see that the assignment request immediately returns an assignment failure. In the log we can see that apparently the connection to the emulated MGW fails (mgcp_client.c:466 Failed to read: 111/Connection refused):

Mon Mar 19 07:54:23 2018 DMSC <0008> osmo_bsc_sigtran.c:296 Opening new SIGTRAN connection (id=16) to MSC: RI=SSN_PC,PC=0.23.1,SSN=BSSAP
Mon Mar 19 07:54:23 2018 DMSC <0008> bsc_subscr_conn_fsm.c:257 SUBSCR_CONN[0x2134b70]{INIT}: state_chg to WAIT_CC
Mon Mar 19 07:54:23 2018 DMSC <0008> osmo_bsc_sigtran.c:189 SUBSCR_CONN[0x2134b70]{WAIT_CC}: Received Event MO-CONNECT.cfm
Mon Mar 19 07:54:23 2018 DMSC <0008> bsc_subscr_conn_fsm.c:290 SUBSCR_CONN[0x2134b70]{WAIT_CC}: state_chg to ACTIVE
Mon Mar 19 07:54:23 2018 DMSC <0008> osmo_bsc_bssap.c:806 SUBSCR_CONN[0x2134b70]{ACTIVE}: Received Event ASSIGNMENT_CMD
Mon Mar 19 07:54:23 2018 DMSC <0008> bsc_subscr_conn_fsm.c:317 SUBSCR_CONN[0x2134b70]{ACTIVE}: Channel assignment: chan_mode=SPEECH_V1, full_rate=1
Mon Mar 19 07:54:23 2018 DMSC <0008> bsc_subscr_conn_fsm.c:341 SUBSCR_CONN[0x2134b70]{ACTIVE}: state_chg to WAIT_CRCX_BTS
Mon Mar 19 07:54:23 2018 DRLL <0000> fsm.c:264 MGCP_CONN[0x2138390]{ST_CRCX}: Allocated
Mon Mar 19 07:54:23 2018 DRLL <0000> fsm.c:294 MGCP_CONN[0x2138390]{ST_CRCX}: is child of SUBSCR_CONN[0x2134b70]
Mon Mar 19 07:54:23 2018 DRLL <0000> mgcp_client_fsm.c:591 MGCP_CONN[0x2138390]{ST_CRCX}: Received Event EV_CRCX
Mon Mar 19 07:54:23 2018 DRLL <0000> mgcp_client_fsm.c:210 MGCP_CONN[0x2138390]{ST_CRCX}: state_chg to ST_CRCX_RESP
Mon Mar 19 07:54:23 2018 DLMGCP <0021> mgcp_client.c:466 Failed to read: 111/Connection refused
Mon Mar 19 07:54:27 2018 DRLL <0000> fsm.c:184 MGCP_CONN[0x2138390]{ST_CRCX_RESP}: Timeout of T1
Mon Mar 19 07:54:27 2018 DRLL <0000> mgcp_client_fsm.c:449 MGCP_CONN[0x2138390]{ST_CRCX_RESP}: Terminating (cause = OSMO_FSM_TERM_REGULAR)
Mon Mar 19 07:54:27 2018 DRLL <0000> mgcp_client_fsm.c:449 MGCP_CONN[0x2138390]{ST_CRCX_RESP}: Removing from parent SUBSCR_CONN[0x2134b70]
Mon Mar 19 07:54:27 2018 DRLL <0000> mgcp_client_fsm.c:449 MGCP_CONN[0x2138390]{ST_CRCX_RESP}: Freeing instance
Mon Mar 19 07:54:27 2018 DRLL <0000> fsm.c:346 MGCP_CONN[0x2138390]{ST_CRCX_RESP}: Deallocated
Mon Mar 19 07:54:27 2018 DMSC <0008> mgcp_client_fsm.c:449 SUBSCR_CONN[0x2134b70]{WAIT_CRCX_BTS}: Received Event MGW_FAILURE_BTS
Mon Mar 19 07:54:27 2018 DMSC <0008> bsc_subscr_conn_fsm.c:885 SUBSCR_CONN[0x2134b70]{WAIT_CRCX_BTS}: state_chg to ACTIVE
Mon Mar 19 07:54:27 2018 DLINP <0013> input/ipaccess.c:243 Sign link vanished, dead socket
Mon Mar 19 07:54:27 2018 DLINP <0013> input/ipaccess.c:71 Forcing socket shutdown with no signal link set
Mon Mar 19 07:54:27 2018 DLMI <0015> bsc_init.c:409 Lost some E1 TEI link: 1 0x7f969d156070

I am currently a bit lost here. Maybe there is something wrong with the environment? I am not sure. The failure is in the mgcp_client.c, which also was used before.

(I have attached my osmo-bsc.cfg, but it is not much different to the config file we use in docker-playground)

#23 Updated by laforge 3 months ago

did you try to run the docker containers locally? This should get you closer to the
jenkins setup, and ideally reproduce the behavior.

Btw: the tests are now running much more stable/reproducible. I haven't seen
any tests that sometimes pass, sometimes fail anymore. So at least their probability
has been reduced significantly. If there still are such occasions, please point
them out to me so they can be fixed.

#24 Updated by dexter 3 months ago

I think trying it with docker is the next thing that should be tried. I have already installed the community edition of docker, but it seems to have some problems with the permissions. I will figure out tomorrow what's wrong there.

Thanks for adding the vc_CTRL_IPA.stop. I will keep some records on the stability from now on.

#25 Updated by neels 3 months ago

I'm running the tests in docker locally and can reproduce the failures.
The root reason:

DLMGCP ERROR mgcp_client.c:466 Failed to read: 111/Connection refused

Taking a closer look at TC_assignment_fr_a5_0:
The last "successful" jenkins build was https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test/114/
but when looking at those logs, the test is actually inconclusive:
"MTC@561edcaab304: Test case TC_assignment_fr_a5_0 finished. Verdict: inconc reason: Timeout waiting for ASSIGNMENT COMPLETE"
The point being that our jenkins still thinks inconclusive tests are successful :/
The failure has been around probably from the start of dockerized bsc tests.

docker-playground/ttcn3-bsc-test/osmo-bsc.cfg contains no mgw configuration, hence osmo-bsc will expect to reach an MGW at 127.0.0.1.
Since osmo-bsc is run in the osmo-bsc-master docker container and the ttcn3 test case in the ttcn3-bsc-test container with an IP address of 172.18.2.203,
this cannot possibly work.

Conclusion 1: add mgw config to osmo-bsc.cfg
https://gerrit.osmocom.org/7400

Next, in BSC_Tests.ttcn, it says

function f_init_mgcp(charstring id) runs on test_CT {
        id := id & "-MGCP";

        var MGCPOps ops := {
                create_cb := refers(MGCP_Emulation.ExpectedCreateCallback),
                unitdata_cb := refers(MGCP_Emulation.DummyUnitdataCallback)
        };
        var MGCP_conn_parameters mgcp_pars := {
                callagent_ip := mp_bsc_ip,
                callagent_udp_port := -1,
                mgw_ip := mp_test_ip,
                mgw_udp_port := 2427
        };

        vc_MGCP := MGCP_Emulation_CT.create(id);
        vc_MGCP.start(MGCP_Emulation.main(ops, mgcp_pars, id));
}

and
modulepar {
        /* IP address at which the BSC can be reached */
        charstring mp_bsc_ip := "127.0.0.1";
[...]
        /* IP address at which the test binds */
        charstring mp_test_ip := "127.0.0.1";

Hence the MGCP emulation seems to also bind to 127.0.0.1 while it should be reachable by the "remote" osmo-bsc-main docker container.

Conclusion 2: add mp_test_ip := 0.0.0.0 to BSC_Tests.cfg.
https://gerrit.osmocom.org/7401

With these patches, the tests using MGCP pass
(and I actually have a chance of establishing a TCH/F to test inter-BSC handover).

So, for the record, the FSM has not introduced a regression here, the docker tests concerning MGCP have always been broken.
The FSM has actually improved handling of the error by replying properly with Assignment Failure, which was until then missing.

#26 Updated by dexter 3 months ago

As discussed with neels yesterday I fixed a couple of cosmetic issues in the already merged gscon patch.

remote:   https://gerrit.osmocom.org/7420 cosmetic: remove unused enum members
remote:   https://gerrit.osmocom.org/7421 cosmetic: fix typo
remote:   https://gerrit.osmocom.org/7422 cosmetic: fix argument order of forward_dtap()
remote:   https://gerrit.osmocom.org/7423 cosmetic: remove needless fixme note.
remote:   https://gerrit.osmocom.org/7424 cosmetic: fix incomplete sentence in comment.
remote:   https://gerrit.osmocom.org/7425 Cosmetic: fix missing semicolon after osmo-assert
remote:   https://gerrit.osmocom.org/7426 cosmetic: remove dead code and obsolete fixmes
remote:   https://gerrit.osmocom.org/7427 cosmetic: remove old, already commented-out code
remote:   https://gerrit.osmocom.org/7428 cosmetic: remove dead code

With 7428 I am not entirely sure. It removes a lot of dead code and code that has been commented out. I also removed the related VTY commands since they serve no purpose anymore. However, this could cause fallout for some users who still have those commands in their config files.

#27 Updated by neels 3 months ago

As mentioned, what's causing current serious fallout is the removal of bs11-config and ipaccess-config from the build:
our debian package feeds of osmo-bsc are continuously broken because of that.
I believe lynxis mentioned wanting to look at it? talk to him.

And/or please take a look whether you can easily make them compile and re-add them again,
which would resolve the issue.

thx!

#28 Updated by dexter 3 months ago

neels:

I have added the dependencies introduced by GSCON and added stubs for what I could not resolve:

remote:   https://gerrit.osmocom.org/7460 ipaccess: make ipaccess-config build again
remote:   https://gerrit.osmocom.org/7461 bs11: make bs11_config build again

#29 Updated by dexter 3 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

Since the GSCON patches are merged, we should continue with the remaining problems in separate tasks, just like #3109. I have set this to resolved now.
