Bug #5255
closed
ttcn3-bsc-test-latest: CBSP and LCLS test cases fail since build #1095
Added by fixeria over 2 years ago. Updated over 2 years ago.
100%
Description
It looks like some test case(s) cause a segmentation fault of the IUT:
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-latest/1095/artifact/logs/bsc/core
so the remaining CBSP/LCLS test cases cannot talk to it anymore:
Stacktrace
"VTY Timeout for prompt: enable"
BSC_Tests_LCLS.ttcn:742 BSC_Tests_LCLS control part
BSC_Tests_LCLS.ttcn:254 TC_lcls_gcr_only testcase
Updated by fixeria over 2 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 20
I found the culprit:
20210930074737198 DLGLOBAL <0015> logging_vty.c:1113 TTCN3 f_logp(): TC_lost_sdcch_during_assignment() start
Segmentation fault (core dumped)
This test case was introduced quite recently:
commit 92cfa1c45ae1cb52d5aefb774f93468fef607417
Author: Neels Hofmeyr <nhofmeyr@sysmocom.de>
Date:   Tue Sep 28 18:29:44 2021 +0200

    bsc: add TC_lost_sdcch_during_assignment()
and the aim is to reproduce a segfault described in SYS#5627.
Updated by fixeria over 2 years ago
- Status changed from In Progress to Stalled
- % Done changed from 20 to 40
I decided to back-port a patch fixing the segfault and create a patch release (1.7.0 -> 1.7.1):
https://gerrit.osmocom.org/c/osmo-bsc/+/25753 assignment_fsm: Check for conn->lchan
osmith, pespin, may I ask one of you to help with creating the actual patch release? I used to have a Docker image with Debian and all the tools needed for osmo-release.sh, but then ran 'docker system prune --all' and lost it.
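For illustration only, here is a minimal sketch of the kind of guard that the patch title "assignment_fsm: Check for conn->lchan" suggests. The struct layouts and the function name send_assignment_command() are hypothetical stand-ins, not the actual osmo-bsc code; only the error text is taken from the osmo-bsc log output quoted in this issue.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the osmo-bsc structures. */
struct lchan { int nr; };
struct conn { struct lchan *lchan; };

/* Assumed helper: returns 0 on success, -1 if the conn has no lchan. */
static int send_assignment_command(const struct conn *conn)
{
	/* Bail out instead of dereferencing a missing lchan. */
	if (!conn || !conn->lchan) {
		fprintf(stderr, "Unable to send RR Assignment Command: conn without lchan\n");
		return -1;
	}
	printf("sending RR Assignment Command via lchan %d\n", conn->lchan->nr);
	return 0;
}
```

Calling send_assignment_command() on a conn whose lchan pointer is NULL then reports an error to the caller rather than crashing the process.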
Updated by osmith over 2 years ago
- Status changed from Stalled to Resolved
- % Done changed from 40 to 100
Sure, done: https://git.osmocom.org/osmo-bsc/commit/?h=1.7.1
Updated by fixeria over 2 years ago
- Status changed from Resolved to In Progress
- % Done changed from 100 to 80
Unfortunately, latest osmo-bsc still crashes when TC_lost_sdcch_during_assignment is being executed:
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-latest/1109/artifact/logs/bsc/core
This time we get a bit further and see some more logging:
20211014074531510 DLGLOBAL <0015> logging_vty.c:1113 TTCN3 f_logp(): TC_lost_sdcch_during_assignment() start
20211014074531758 DAS <0011> assignment_fsm.c:618 assignment(msc0-conn198_subscr-IMSI-001019876543210_0-0-1-TCH_F-0)[0x5579ec6b3070]{WAIT_RR_ASS_COMPLETE}: (bts=0,trx=0,ts=1,ss=0) Assignment failed in state WAIT_RR_ASS_COMPLETE, cause EQUIPMENT FAILURE: Unable to send RR Assignment Command: conn without lchan
20211014074531758 DAS <0011> assignment_fsm.c:148 assignment(msc0-conn198_subscr-IMSI-001019876543210_0-0-1-TCH_F-0)[0x5579ec6b3070]{WAIT_RR_ASS_COMPLETE}: (bts=0,trx=0,ts=1,ss=0) Assignment failed
20211014074531758 DMSC <0007> assignment_fsm.c:149 SUBSCR_CONN(msc0-conn198_subscr-IMSI-001019876543210)[0x5579ec69bd60]{CLEARING}: Event ASSIGNMENT_END not permitted
20211014074531759 DCHAN <000f> lchan_fsm.c:837 lchan(0-0-1-TCH_F-0)[0x5579ec6acf70]{WAIT_RF_RELEASE_ACK}: transition to state WAIT_RLL_RTP_ESTABLISH not permitted!
20211014074531779 DLMGCP <0025> mgcp_client.c:691 Cannot find matching MGCP transaction for trans_id 420
20211014074533758 DCHAN <000f> lchan_fsm.c:81 lchan(0-0-1-TCH_F-0)[0x5579ec6acf70]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) lchan allocation failed in state WAIT_RF_RELEASE_ACK: Timeout
20211014074533759 DCHAN <000f> lchan_fsm.c:116 lchan(0-0-1-TCH_F-0)[0x5579ec6acf70]{WAIT_RF_RELEASE_ACK}: (type=TCH_F) Signalling Assignment FSM of error (lchan allocation failed in state WAIT_RF_RELEASE_ACK: Timeout)
Segmentation fault (core dumped)
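The "transition to state WAIT_RLL_RTP_ESTABLISH not permitted!" message in the log above comes from a state machine refusing a state change. As a rough sketch of how such a check can work, assume each state carries a bitmask of the states it may move to. The enum values, the mask contents and the function name fsm_state_change() below are made up for illustration and do not reproduce the real lchan FSM definition.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical subset of lchan states, named after the log output. */
enum lchan_state {
	ST_WAIT_RLL_RTP_ESTABLISH,
	ST_WAIT_RF_RELEASE_ACK,
	ST_UNUSED,
	ST_COUNT
};

/* For each state, a bitmask of permitted next states (assumed values). */
static const uint32_t out_state_mask[ST_COUNT] = {
	[ST_WAIT_RLL_RTP_ESTABLISH] = (1u << ST_WAIT_RF_RELEASE_ACK),
	[ST_WAIT_RF_RELEASE_ACK]    = (1u << ST_UNUSED),
	[ST_UNUSED]                 = (1u << ST_WAIT_RLL_RTP_ESTABLISH),
};

/* Change *cur to next only if the current state permits that transition. */
static bool fsm_state_change(enum lchan_state *cur, enum lchan_state next)
{
	if (!(out_state_mask[*cur] & (1u << next))) {
		fprintf(stderr, "transition to state %d not permitted!\n", (int)next);
		return false;
	}
	*cur = next;
	return true;
}
```

In this sketch a rejected transition leaves the state unchanged, so code that runs afterwards must not assume the transition took place.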
Updated by fixeria over 2 years ago
- Status changed from In Progress to Feedback
- Assignee changed from fixeria to neels
Unfortunately, latest osmo-bsc still crashes when TC_lost_sdcch_during_assignment is being executed: [...]
neels could you please take a look? I was trying to figure out why it still segfaults, but could not find anything suspicious.
Updated by fixeria over 2 years ago
Interestingly enough, I cannot reproduce the segfault locally with osmo-bsc 1.7.1-0-gf20b3086a.
Updated by neels over 2 years ago
fixeria wrote:
Unfortunately, latest osmo-bsc still crashes when TC_lost_sdcch_during_assignment is being executed: [...]
neels could you please take a look? I was trying to figure out why it still segfaults, but could not find anything suspicious.
osmo-bsc does not crash for me anymore during this test, using current master, where pmaier's fix is merged.
The test also passes on jenkins. Where / how did you still see a crash?
Updated by fixeria over 2 years ago
neels wrote:
osmo-bsc does not crash for me anymore during this test, using current master, where pmaier's fix is merged.
The test also passes on jenkins. Where / how did you still see a crash?
Recent master does not crash, but the latest release (1.7.1) does; see for instance:
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-latest/1113/artifact/logs/bsc/
1.7.1 is basically a patch release with pmaier's fix applied. And somehow it still segfaults on Jenkins.
Updated by fixeria over 2 years ago
Good news: I managed to reproduce the segfault in a docker container by running it this way:
docker run -it --rm --network=host -v osmo-ttcn3-hacks:/data fixeria/osmo-bsc-latest /usr/bin/osmo-bsc -c /data/bsc/osmo-bsc.cfg
and I am even getting the same logging output. Here is a backtrace:
#0 _lchan_on_activation_failure (lchan=lchan@entry=0x7f037ea25748, activ_for=<optimized out>, for_conn=0x0, line=line@entry=1574, file=0x563e3f8d910d "lchan_fsm.c") at lchan_fsm.c:117
#1 0x0000563e3f882317 in _lchan_on_activation_failure (line=1574, file=0x563e3f8d910d "lchan_fsm.c", for_conn=<optimized out>, activ_for=<optimized out>, lchan=0x7f037ea25748) at lchan_fsm.c:1574
#2 lchan_fsm_timer_cb (fi=0x563e401a3d00) at lchan_fsm.c:1574
#3 0x00007f037ddd5f16 in fsm_tmr_cb (data=0x563e401a3d00) at fsm.c:325
#4 0x00007f037ddd01a6 in osmo_timers_update () at timer.c:273
#5 0x00007f037ddd0b67 in _osmo_select_main (polling=0) at select.c:373
#6 0x00007f037ddd0ce6 in osmo_select_main_ctx (polling=<optimized out>) at select.c:434
#7 0x0000563e3f81e6bf in main (argc=<optimized out>, argv=<optimized out>) at osmo_bsc_main.c:1039
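Frame #0 of the backtrace shows for_conn=0x0 inside _lchan_on_activation_failure(), which points to a NULL pointer being dereferenced in the failure handler. The following is a hedged sketch of that pattern with a guard added. The types, the assignment_failures counter and the function name on_activation_failure() are hypothetical, and the guard is an assumption about how such a crash can be avoided, not a copy of the real fix.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins; not the real osmo-bsc types. */
struct conn { int assignment_failures; };
struct lchan { const char *name; };

/* Called when lchan activation fails. for_conn may be NULL if the conn
 * was already released. Returns 1 if a conn was notified, 0 otherwise. */
static int on_activation_failure(const struct lchan *lchan, struct conn *for_conn)
{
	fprintf(stderr, "lchan(%s): activation failed\n", lchan->name);
	if (!for_conn)
		return 0; /* nobody left to notify; a dereference here would crash */
	for_conn->assignment_failures++;
	return 1;
}
```

With the guard in place, a timeout that fires after the conn is gone is logged and otherwise ignored.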
Updated by fixeria over 2 years ago
- Status changed from Feedback to Stalled
- Assignee changed from neels to osmith
- % Done changed from 80 to 90
We need to back-port another change from the recent master:
commit dfd7bef6644d0c0837f7e5498bc5c86362b668dc
Author: Vadim Yanitskiy <vyanitskiy@sysmocom.de>
Date:   Sun Jul 11 13:19:22 2021 +0600

    lchan_fsm: fix potential NULL-pointer dereference

    Change-Id: I373855b95f8bde0ce8f9c2ae7bf95c9135d33484
    Related: SYS#5526
I submitted a cherry-pick to Gerrit:
https://gerrit.osmocom.org/c/osmo-bsc/+/25836 lchan_fsm: fix potential NULL-pointer dereference
And again, I would need some help from osmith to create a patch release. This time 1.7.2.
Updated by fixeria over 2 years ago
I also cherry-picked both patches to the '2021q1' branch:
https://gerrit.osmocom.org/c/osmo-bsc/+/25837 assignment_fsm: Check for conn->lchan [NEW]
https://gerrit.osmocom.org/c/osmo-bsc/+/25838 lchan_fsm: fix potential NULL-pointer dereference [NEW]
Updated by fixeria over 2 years ago
- Assignee changed from osmith to pespin
Oliver is on holiday this week; Pau agreed to help (thanks!).
Updated by pespin over 2 years ago
- Status changed from Stalled to Feedback
- Assignee changed from pespin to fixeria
Tag 1.7.2 pushed with commit "lchan_fsm: fix potential NULL-pointer dereference" in it.
Reassigning to fixeria.
Updated by fixeria over 2 years ago
- Status changed from Feedback to Resolved
- % Done changed from 90 to 100
Good news: latest osmo-bsc (1.7.2) does not crash anymore:
https://jenkins.osmocom.org/jenkins/view/TTCN3-centos/job/TTCN3-centos-bsc-test-latest/228/ (no core file, -36 failures)
https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bsc-test-latest/1116/ (no core file, -36 failures)