Bug #6302

closed

ttcn3-hnbgw-test-latest regression (IUT segmentation fault)

Added by fixeria 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Immediate
Assignee:
Target version: -
Start date: 12/12/2023
Due date:
% Done: 100%
Spec Reference:

Description

Starting from December 5th, we're seeing regressions in ttcn3-hnbgw-test latest:

https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-hnbgw-test-latest/535/ (+28 failures)

All affected testcases fail due to a DTE:

MTC@6005ed7d57b8: setverdict(fail): none -> fail reason: ""VTY Timeout for prompt: enable"", new component reason: ""VTY Timeout for prompt: enable"" 

We can also see a coredump file in the artifacts:

https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-hnbgw-test-latest/535/artifact/logs/hnbgw/core

I managed to reproduce the problem locally and examined the coredump in gdb:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f75debf1489 in on_success (data=<optimized out>, ci=0x562ca9a8f690) at ./src/libosmo-mgcp-client/mgcp_client_endpoint_fsm.c:539
539     ./src/libosmo-mgcp-client/mgcp_client_endpoint_fsm.c: No such file or directory.

(gdb) bt
#0  0x00007f75debf1489 in on_success (data=<optimized out>, ci=0x562ca9a8f690) at ./src/libosmo-mgcp-client/mgcp_client_endpoint_fsm.c:539
#1  osmo_mgcpc_ep_fsm_handle_ci_events (fi=<optimized out>, event=<optimized out>, data=<optimized out>) at ./src/libosmo-mgcp-client/mgcp_client_endpoint_fsm.c:957
#2  0x00007f75deaf2ef0 in _osmo_fsm_inst_dispatch (fi=0x562ca9a8be50, event=0, data=0x562ca9a9603c, file=0x7f75debf5ba3 "mgcp_client_fsm.c", line=446) at ./src/core/fsm.c:875
#3  0x00007f75deaf2ef0 in _osmo_fsm_inst_dispatch (fi=0x562ca9a95c10, event=3, data=0x562ca9a95d40, file=0x7f75debf5ba3 "mgcp_client_fsm.c", line=429) at ./src/core/fsm.c:875
#4  0x00007f75debe2817 in mgcp_client_handle_response (mgcp=0x562ca9a7d970, pending=0x562ca9a8b840, response=<optimized out>) at ./src/libosmo-mgcp-client/mgcp_client.c:246
#5  0x00007f75debe2dc4 in mgcp_client_rx (mgcp=mgcp@entry=0x562ca9a7d970, msg=msg@entry=0x562ca9a96380) at ./src/libosmo-mgcp-client/mgcp_client.c:741
#6  0x00007f75debe3da7 in mgcp_do_read (fd=0x562ca9a7ddb0) at ./src/libosmo-mgcp-client/mgcp_client.c:771
#7  0x00007f75deb0f241 in osmo_wqueue_bfd_cb (fd=0x562ca9a7ddb0, what=1) at ./src/core/write_queue.c:47
#8  0x00007f75deb00a94 in poll_disp_fds (n_fd=<optimized out>) at ./src/core/select.c:419
#9  _osmo_select_main (polling=polling@entry=0) at ./src/core/select.c:457
#10 0x00007f75deb00ba6 in osmo_select_main_ctx (polling=polling@entry=0) at ./src/core/select.c:513
#11 0x0000562ca98dc6e2 in main (argc=3, argv=0x7ffc2bf14318) at ./src/osmo-hnbgw/osmo_hnbgw_main.c:317

Looks like the problem is actually in libosmo-mgcp-client rather than in osmo-hnbgw?

ii  libosmo-mgcp-client12:amd64        1.12.1                         amd64        libosmo-mgcp-client: Osmocom's Media Gateway Control Protocol client utilities
ii  libosmocore                        1.9.2                          amd64        Open Source MObile COMmunications CORE library (metapackage)
ii  osmo-hnbgw                         1.5.0                          amd64        OsmoHNBGW: Osmocom Home Node B Gateway
Actions #1

Updated by fixeria 5 months ago

  • Assignee changed from fixeria to pespin

This is a NULL pointer dereference in libosmo-mgcp-client:

(gdb) frame 0
#0  0x00007f17c9fa0489 in on_success (data=<optimized out>, ci=0x55ad22c0f630) at ./src/libosmo-mgcp-client/mgcp_client_endpoint_fsm.c:539
539     in ./src/libosmo-mgcp-client/mgcp_client_endpoint_fsm.c
539         osmo_mgcpc_ep_fsm_check_state_chg_after_response(ci->ep->fi);

(gdb) p ci->ep
$1 = (struct osmo_mgcpc_ep *) 0x0

Looks like it's the new testcase TC_rab_assign_mgw_iuup_addr_chg triggering a segfault:

commit af74650899947067f7c8556e0929342b426e2f8a
Author: Pau Espin Pedrol <pespin@sysmocom.de>
Date:   Wed Nov 29 16:18:28 2023 +0100

    hnbgw: Introduce test TC_rab_assign_mgw_iuup_addr_chg

I think we should either:

  • not execute TC_rab_assign_mgw_iuup_addr_chg for the -latest,
  • back-port fixes from master (if there were any).

pespin assigning to you.

Actions #2

Updated by neels 5 months ago

  • Status changed from New to In Progress
  • Assignee changed from pespin to neels
Actions #3

Updated by neels 5 months ago

  • % Done changed from 0 to 90

My results:

It is a genuine bug in libosmo-mgcp-client.
It never showed before because we don't seem to trigger a fatal error in MGCP event handling in any other tests.

The new test returns an MDCX OK. libosmo-mgcp-client then emits the notify.fi event_success, and osmo-hnbgw, on receiving it, aborts the call: it doesn't like some details of the MDCX response and shuts everything down.
So handling the MGCP response deallocates the osmo_mgcpc_ep while the notify event is still being dispatched.

The code continues to want to check the endpoint's state after emitting the notify.fi event, which accesses a NULL pointer ci->ep...

But it actually is a use-after-free of ci itself, too,
because osmo-hnbgw "fails" to set up osmo_fsm_set_dealloc_ctx(OTC_SELECT).
With address sanitizer, it bugs out even before dereferencing ci->ep == NULL.
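
To illustrate the NULL-pointer part of the failure described above, here is a minimal C sketch. It is not the libosmo-mgcp-client code and not the submitted patch: the names endpoint, conn_item, notify_ok, notify_and_teardown and on_success_guarded are hypothetical stand-ins. It shows a notify callback that tears down the endpoint, and a caller that re-checks ci->ep before touching it again. The guard only covers the ci->ep == NULL case; it does nothing about a use-after-free of ci itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real types; not the library's definitions. */
struct endpoint {
	int state_checks; /* how often the post-response state check ran */
};

struct conn_item {
	struct endpoint *ep;                  /* NULL once the endpoint is torn down */
	void (*notify)(struct conn_item *ci); /* user callback, may tear down the endpoint */
};

/* User callback that accepts the response and leaves the endpoint alone. */
void notify_ok(struct conn_item *ci)
{
	(void)ci;
}

/* User callback that rejects the response and releases the endpoint. */
void notify_and_teardown(struct conn_item *ci)
{
	ci->ep = NULL;
}

/* Guarded variant of the success handler: re-check ci->ep after notifying.
 * Returns true if the follow-up state check ran, false if it was skipped. */
bool on_success_guarded(struct conn_item *ci)
{
	ci->notify(ci);
	if (!ci->ep)
		return false; /* an unguarded handler would dereference ci->ep here */
	ci->ep->state_checks++;
	return true;
}
```

With notify_ok the state check runs as before; with notify_and_teardown it is skipped instead of crashing.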

Then, the most recent osmo-hnbgw commit teaches osmo-hnbgw to not error on receiving that MDCX.
Hence the bug is no longer triggered.
https://gerrit.osmocom.org/c/osmo-hnbgw/+/35168 I936a50fed38a201c4a8da99b40f07082049e5157

But we still have the bug in libosmo-mgcp-client.
The patch to fix it contains another long description in a code comment, explaining the fix.
https://gerrit.osmocom.org/c/osmo-mgw/+/35349

I also have a patch here that sets osmo-hnbgw to use osmo_fsm_set_dealloc_ctx(OTC_SELECT),
but now that osmo-mgcp-client was fixed without the need for it, I didn't submit it.
It also would not solve the bug by itself (old code still fails to check ci->ep != NULL).
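
The idea behind deferring deallocation can be sketched in plain C as follows. This is only an illustration of the general technique, assuming a simple fixed-size pending list; it does not use or reproduce osmo_fsm_set_dealloc_ctx() or talloc, and the names defer_ctx, defer_free and defer_flush are hypothetical. Objects released during event handling are queued and only freed once the handler has returned, so pointers such as ci stay valid for the rest of the handler.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DEFERRED 8

/* Hypothetical deferred-free list, flushed once per main loop iteration. */
struct defer_ctx {
	void *pending[MAX_DEFERRED];
	size_t count;
};

/* Queue ptr for later release instead of freeing it right away.
 * Returns 0 on success, -1 if the list is full (ptr is then left untouched). */
int defer_free(struct defer_ctx *ctx, void *ptr)
{
	if (ctx->count >= MAX_DEFERRED)
		return -1;
	ctx->pending[ctx->count++] = ptr;
	return 0;
}

/* Free everything queued so far; call this after event handlers have returned.
 * Returns the number of objects released. */
size_t defer_flush(struct defer_ctx *ctx)
{
	size_t released = ctx->count;

	for (size_t i = 0; i < ctx->count; i++)
		free(ctx->pending[i]);
	ctx->count = 0;
	return released;
}
```

A handler would call defer_free() where it previously freed the object, and the main loop would call defer_flush() after dispatching. As noted above, this alone would not replace a ci->ep != NULL check.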

Actions #4

Updated by neels 5 months ago

now what about the -latest test failure ...

  • backport osmo-mgw fix, creating a 1.12 branch, to fix a ttcn3-hnbgw-latest run of a test intended for hnbgw-master?
    (actually backport to fix a DoS vector)
  • fundamentally change the fact that we run tests intended for master on the latest release and at the same time expect 'latest' to pass, meaning that we constantly have to add conditions to the ttcn3 files?
    Ironically, this time the -latest failure uncovered a genuine bug =)
    But usually it doesn't and it's just extra work IMHO, rant over.
Actions #5

Updated by laforge 4 months ago

On Tue, Dec 12, 2023 at 11:57:45PM +0000, neels wrote:

  • fundamentally change the fact that we run tests intended for master on the latest release and at the same time expect 'latest' to pass, meaning that we constantly have to add conditions to the ttcn3 files?
we generally don't expect all 'master' tests to pass on latest. Obviously there are things supported
in (or fixed in) master, which won't work on latest. However, we expect that
  • old tests (that didn't change test expectations) continue to pass on latest
  • no test (old or new, passing or non-passing) ever segfaults the IUT. This means that we must fix latest and make patch releases
Actions #6

Updated by neels 4 months ago

so we should:

  • backport osmo-mgw fix, creating a 1.12 branch, to fix a segfault.

Actions #7

Updated by laforge 4 months ago

  • Assignee changed from neels to osmith
Actions #8

Updated by laforge 4 months ago

  • Priority changed from High to Immediate
Actions #9

Updated by osmith 4 months ago

Merged the patch from Neels to master and prepared a patch release:

Actions #10

Updated by osmith 4 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100