Bug #5665

open

ERROR: files left in build directory after distclean: ./tests/core

Added by fixeria over 1 year ago. Updated 5 months ago.

Status: Stalled
Priority: Normal
Assignee:
Target version:
-
Start date: 08/28/2022
Due date:
% Done: 20%

Spec Reference:

Description

Recently we have been observing sporadic master-* job failures on Jenkins. There is always a coredump file, which makes the 'distcleancheck' target fail:

ERROR: files left in build directory after distclean:
./tests/core
make[1]: Leaving directory '/build/libosmocore-1.7.0.26-862dd/_build/sub'
make[1]: *** [Makefile:1010: distcleancheck] Error 1
make: *** [Makefile:941: distcheck] Error 1

This is not specific to libosmocore: I saw master-osmo-{bsc,msc} failing with the same verdict too.
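To spot such leftovers before distcleancheck trips over them, one could scan the build tree for ELF core files. The following is only a sketch: find_core_files() and is_elf_core() are hypothetical helpers, not part of any Osmocom tooling, and the check relies solely on the ELF header (magic bytes plus e_type == ET_CORE).

```python
import os
import struct

ET_CORE = 4  # e_type value the ELF specification assigns to core files


def is_elf_core(header):
    """Return True if the bytes start with an ELF header of type ET_CORE."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        return False
    # EI_DATA (byte 5) selects the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type == ET_CORE


def find_core_files(build_dir):
    """Hypothetical helper: return a sorted list of ELF core files below build_dir."""
    found = []
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as f:
                    if is_elf_core(f.read(18)):
                        found.append(path)
            except OSError:
                continue  # unreadable file, skip it
    return found if not found else sorted(found)
```

Calling find_core_files("_build/sub") after a test run would then list files such as ./tests/core, regardless of what they are named.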


Files

core (4.38 MB): osmo-bsc coredump from build 21867, fixeria, 05/23/2023 12:08 PM

Related issues

Related to Core testing infrastructure - Bug #5642: (Jenkins) Collect artifacts on master build failures (Resolved, osmith, 08/09/2022)
Related to Core testing infrastructure - Bug #5858: osmo-python-tests: Ignore, at least clean up 'tmp_dummy_sock' (Feedback, 01/15/2023)

Actions #1

Updated by fixeria over 1 year ago

  • Related to Bug #5642: (Jenkins) Collect artifacts on master build failures added
Actions #2

Updated by fixeria over 1 year ago

Thanks to osmith, Jenkins does collect artifacts on master build failures now.

Yesterday master-osmo-msc failed with a coredump file present in the artifacts:

https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-msc/32327/

$ file Downloads/core 
Downloads/core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './src/osmo-msc/osmo-msc -c /build/osmo-msc-1.9.0.16-62977/tests/../doc/examples', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: './src/osmo-msc/osmo-msc', platform: 'x86_64'

Note that -c /build/osmo-msc-1.9.0.16-62977/tests/../doc/examples points to a folder, not a configuration file.

Actions #3

Updated by arehbein about 1 year ago

  • Related to Bug #5858: osmo-python-tests: Ignore, at least clean up 'tmp_dummy_sock' added
Actions #4

Updated by laforge 11 months ago

is this still an active issue?

Actions #5

Updated by fixeria 11 months ago

laforge wrote in #note-4:

is this still an active issue?

Yes, I am still seeing it from time to time. But not as often as before. One example below:

https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21747/

Actions #7

Updated by fixeria 11 months ago

Looking at one of the coredump files more closely (from osmo-bsc build 21867, find attached):

fixeria@DELL:~$ file /tmp/core 
/tmp/core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './src/osmo-bsc/osmo-bsc -r tmp_dummy_sock -c /build/osmo-bsc-1.10.0.164-1f3bf/t', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: './src/osmo-bsc/osmo-bsc', platform: 'x86_64'

Note the -r tmp_dummy_sock in the command line. A quick grep in osmo-bsc.git yields the following:

tests/ctrl_test_runner.py:        return ["./src/osmo-bsc/osmo-bsc", "-r", "tmp_dummy_sock", "-c",
tests/ctrl_test_runner.py:        return (4249, "./src/osmo-bsc/osmo-bsc", "OsmoBSC", "bsc")
tests/ctrl_test_runner.py:        return ["./src/osmo-bsc/osmo-bsc", "-r", "tmp_dummy_sock", "-c",
tests/ctrl_test_runner.py:        return (4248, "./src/osmo-bsc/osmo-bsc", "OsmoBSC", "bsc")
tests/ctrl_test_runner.py:        return ["./src/osmo-bsc/osmo-bsc", "-r", "tmp_dummy_sock", "-c",
tests/ctrl_test_runner.py:        return (4249, "./src/osmo-bsc/osmo-bsc", "OsmoBSC", "bsc")
Actions #8

Updated by fixeria 11 months ago

fixeria wrote in #note-7:

Note the -r tmp_dummy_sock in the command line. A quick grep in osmo-bsc.git yields the following: [...]

Let's try to narrow the problem down to a specific ctrl_test_runner.py run triggering the crash:

https://gerrit.osmocom.org/c/python/osmo-python-tests/+/32929 osmoutil: print return code in end_proc() [NEW]

I expect to see a non-zero return code (132) in case of a segfault.

Actions #9

Updated by osmith 11 months ago

As discussed in chat: a few patches to attach a tarball of the whole workspace when this happens, so we get relevant binaries and libraries:
https://gerrit.osmocom.org/q/topic:workspace.tar.xz

Actions #10

Updated by fixeria 11 months ago

There was another unexpected failure:

https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21915/

however the artifacts contain no workspace.tar.xz. I found out why and submitted fixes:

https://gerrit.osmocom.org/c/osmo-bsc/+/33067 fixup: contrib/jenkins: create workspace.tar.xz on error [NEW]
https://gerrit.osmocom.org/c/osmo-msc/+/33068 fixup: contrib/jenkins: create workspace.tar.xz on error [NEW]

Actions #11

Updated by fixeria 11 months ago

  • Status changed from New to In Progress
Actions #12

Updated by fixeria 11 months ago

So I pulled recent debian:bullseye, extracted the workspace archive and mounted it at /build:

$ docker run -v ./build:/build -it --rm debian:bullseye
root@66453c9d6868:/# apt update
root@66453c9d6868:/# apt install --no-install-suggests --no-install-recommends gdb

root@66453c9d6868:/# cd /build/
root@66453c9d6868:/build# find . -type f -name core
./osmo-msc-1.10.0.53-912f/_build/sub/core
root@66453c9d6868:/build# find . -type f -name osmo-msc
./src/osmo-msc/osmo-msc

root@66453c9d6868:/build# gdb -q ./src/osmo-msc/osmo-msc
Reading symbols from ./src/osmo-msc/osmo-msc...
(gdb) core ./osmo-msc-1.10.0.53-912f/_build/sub/core
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libmnl.so.0.2.0 during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libsctp.so.1.0.18 during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libtalloc.so.2.3.1 during file-backed mapping note processing
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/sms.db-shm during file-backed mapping note processing
[New LWP 231973]
Core was generated by `./src/osmo-msc/osmo-msc -c /build/osmo-msc-1.10.0.53-912f/tests/../doc/examples'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fcb8dad6d67 in ?? ()

After installing the missing dependencies, gdb is still unable to tell us what happened :/

root@66453c9d6868:/build# apt install --no-install-suggests --no-install-recommends libmnl-dev libsctp-dev libtalloc-dev
root@66453c9d6868:/build# gdb -q ./src/osmo-msc/osmo-msc
Reading symbols from ./src/osmo-msc/osmo-msc...
(gdb) core ./osmo-msc-1.10.0.53-912f/_build/sub/core
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc during file-backed mapping note processing
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/sms.db-shm during file-backed mapping note processing
[New LWP 231973]
Core was generated by `./src/osmo-msc/osmo-msc -c /build/osmo-msc-1.10.0.53-912f/tests/../doc/examples'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fcb8dad6d67 in ?? ()

My best guess is that /build/src/osmo-msc/osmo-msc is actually not the right binary: the one that generated the coredump must be /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc, but it's not present in the archive. Perhaps it was removed by distcleancheck.

Actions #13

Updated by fixeria 11 months ago

fixeria wrote in #note-8:

Let's try to narrow the problem down to a specific ctrl_test_runner.py run triggering the crash:

https://gerrit.osmocom.org/c/python/osmo-python-tests/+/32929 osmoutil: print return code in end_proc() [NEW]

I expect to see a non-zero return code (132) in case of a segfault.

In the Console output of today's failed osmo-msc-master we see:

Process returned code: -11

If I read https://en.wikipedia.org/wiki/Signal_(IPC)#Default_action correctly, signal 11 is SIGSEGV.
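For reference, Python's subprocess module reports a child killed by signal N as return code -N on POSIX systems, whereas a shell would report 128+N (139 for SIGSEGV). A small sketch for decoding such values follows; describe_returncode() is a hypothetical helper, not part of osmo-python-tests.

```python
import signal


def describe_returncode(rc):
    """Describe a subprocess.Popen return code in human-readable form."""
    if rc < 0:
        # On POSIX, Popen reports death by signal N as -N.
        try:
            return "killed by %s" % signal.Signals(-rc).name
        except ValueError:
            return "killed by unknown signal %d" % -rc
    return "exited with status %d" % rc


print(describe_returncode(-11))  # killed by SIGSEGV (on Linux)
```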

fixeria wrote in #note-12:

My best guess is that /build/src/osmo-msc/osmo-msc is actually not the right binary: the one that generated the coredump must be /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc, but it's not present in the archive. Perhaps it was removed by distcleancheck.

So I am wondering: what if we raise an exception from tests/{ctrl,vty}_test_runner.py? Would that abort distcleancheck and prevent it from removing the binary?

Actions #14

Updated by fixeria 11 months ago

https://gerrit.osmocom.org/c/python/osmo-python-tests/+/33139 osmoutil: return proc's return code from end_proc() [NEW]
https://gerrit.osmocom.org/c/osmo-msc/+/33141 tests/{ctrl,vty}_test_runner.py: raise an exception if proc's rc != 0 [NEW]
https://gerrit.osmocom.org/c/osmo-bsc/+/33140 tests/{ctrl,vty}_test_runner.py: raise an exception if proc's rc != 0 [NEW]
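
The idea behind these patches, roughly sketched (this is not the actual patch content): return the process's exit code from the helper that stops it, and raise from tearDown() if it is non-zero, so that unittest reports an ERROR and the test run fails right away. The names end_proc_checked() and ExampleProcTest are made up for illustration, and the child process is a stand-in rather than osmo-bsc or osmo-msc.

```python
import subprocess
import sys
import unittest


def end_proc_checked(proc, timeout=10.0):
    """Wait for proc to exit; raise if its return code is non-zero."""
    rc = proc.wait(timeout=timeout)
    if rc != 0:
        raise Exception("Process returned %d" % rc)
    return rc


class ExampleProcTest(unittest.TestCase):
    """Hypothetical test case; the child here is a stand-in process."""

    def setUp(self):
        self.proc = subprocess.Popen([sys.executable, "-c", "pass"])

    def tearDown(self):
        # An exception raised here is reported by unittest as an ERROR.
        end_proc_checked(self.proc)

    def test_nothing(self):
        pass
```

A child that dies from a signal yields a negative return code, so the exception message carries the signal number.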

Actions #15

Updated by fixeria 11 months ago

  • Status changed from In Progress to Stalled

Once the patches are merged, we need to wait for another coredump...

Actions #16

Updated by fixeria 11 months ago

  • Status changed from Stalled to In Progress

We are lucky today, didn't have to wait weeks for this to happen again:

https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21986/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/a1=default,a2=default,a3=default,a4=default,label=osmocom-master/21986/

With my recent patches, make distcleancheck is aborted early:

======================================================================
ERROR: testTrxArfcn (__main__.TestCtrlBSC)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/osmo-bsc-1.10.0.171-2a69/_build/sub/tests/../../../tests/ctrl_test_runner.py", line 151, in tearDown
    TestCtrlBase.tearDown(self)
  File "/build/osmo-bsc-1.10.0.171-2a69/_build/sub/tests/../../../tests/ctrl_test_runner.py", line 68, in tearDown
    raise Exception("Process returned %d" % rc)
Exception: Process returned -6

This time it's signal 6, which corresponds to SIGABRT (with osmo-msc it was SIGSEGV).

Also this time we see two binaries in the workspace archive:

root@d1f585d86a48:/# find . -type f -name core
./build/osmo-bsc-1.10.0.171-2a69/_build/sub/core
root@d1f585d86a48:/# find . -type f -name osmo-bsc
./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc
./build/src/osmo-bsc/osmo-bsc
Actions #17

Updated by fixeria 11 months ago

  • % Done changed from 0 to 20

Here we go:

root@d1f585d86a48:/# gdb -q ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc
Reading symbols from ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc...
(No debugging symbols found in ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc)
(gdb) core-file ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/core
[New LWP 57556]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./src/osmo-bsc/osmo-bsc -r tmp_dummy_sock -c /build/osmo-bsc-1.10.0.171-2a69/te'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f4a99f14ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f4a99f14ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4a99efe537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f4a99f563a8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f4a99f5d69a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f4a99f5e3cc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f4a99f5e503 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f4a99f60395 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007f4a99f61da4 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007f4a9a15cecc in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#9  0x00007f4a9a15d852 in talloc_named_const () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#10 0x00007f4a9a197900 in msgb_alloc_c () from /build/deps/install/lib/libosmocore.so.20
#11 0x00007f4a9a197a45 in msgb_alloc () from /build/deps/install/lib/libosmocore.so.20
#12 0x00007f4a9a270759 in msgb_alloc_headroom () from /build/deps/install/lib/libosmogsm.so.18
#13 0x00007f4a9a27295e in ipa_msg_alloc () from /build/deps/install/lib/libosmogsm.so.18
#14 0x00007f4a9a272448 in ipa_msg_recv_buffered () from /build/deps/install/lib/libosmogsm.so.18
#15 0x00007f4a9a2d702e in handle_control_read () from /build/deps/install/lib/libosmoctrl.so.0
#16 0x00007f4a9a1b6cd0 in osmo_wqueue_bfd_cb () from /build/deps/install/lib/libosmocore.so.20
#17 0x00007f4a9a1a2f59 in poll_disp_fds () from /build/deps/install/lib/libosmocore.so.20
#18 0x00007f4a9a1a3066 in _osmo_select_main () from /build/deps/install/lib/libosmocore.so.20
#19 0x00007f4a9a1a30d3 in osmo_select_main_ctx () from /build/deps/install/lib/libosmocore.so.20
#20 0x000056103b889b02 in main ()

So now we know that it's testTrxArfcn somehow crashing osmo-bsc.

Actions #18

Updated by fixeria 11 months ago

The same backtrace with libc6-dbg installed:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f4a99efe537 in __GI_abort () at abort.c:79
#2  0x00007f4a99f563a8 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f4a9a074390 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f4a99f5d69a in malloc_printerr (str=str@entry=0x7f4a9a072028 "corrupted double-linked list") at malloc.c:5347
#4  0x00007f4a99f5e3cc in unlink_chunk (p=p@entry=0x56103be33830, av=0x7f4a9a0aab80 <main_arena>) at malloc.c:1460
#5  0x00007f4a99f5e503 in malloc_consolidate (av=av@entry=0x7f4a9a0aab80 <main_arena>) at malloc.c:4494
#6  0x00007f4a99f60395 in _int_malloc (av=av@entry=0x7f4a9a0aab80 <main_arena>, bytes=bytes@entry=1435) at malloc.c:3699
#7  0x00007f4a99f61da4 in __GI___libc_malloc (bytes=1435) at malloc.c:3058
#8  0x00007f4a9a15cecc in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#9  0x00007f4a9a15d852 in talloc_named_const () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#10 0x00007f4a9a197900 in msgb_alloc_c () from /build/deps/install/lib/libosmocore.so.20
#11 0x00007f4a9a197a45 in msgb_alloc () from /build/deps/install/lib/libosmocore.so.20
#12 0x00007f4a9a270759 in msgb_alloc_headroom () from /build/deps/install/lib/libosmogsm.so.18
#13 0x00007f4a9a27295e in ipa_msg_alloc () from /build/deps/install/lib/libosmogsm.so.18
#14 0x00007f4a9a272448 in ipa_msg_recv_buffered () from /build/deps/install/lib/libosmogsm.so.18
#15 0x00007f4a9a2d702e in handle_control_read () from /build/deps/install/lib/libosmoctrl.so.0

corrupted double-linked list does not look good, looks like a heap overflow?

Actions #19

Updated by fixeria 11 months ago

With this little patch I was able to reproduce the osmo-bsc crash locally:

diff --git a/tests/ctrl_test_runner.py b/tests/ctrl_test_runner.py
index 780842634..7b2880c47 100755
--- a/tests/ctrl_test_runner.py
+++ b/tests/ctrl_test_runner.py
@@ -830,9 +830,18 @@ if __name__ == '__main__':
     print("confpath %s, workdir %s" % (confpath, workdir))
     os.chdir(workdir)
     print("Running tests for specific control commands")
-    suite = unittest.TestSuite()
-    add_bsc_test(suite, workdir, TestCtrlBSC)
-    add_bsc_test(suite, workdir, TestCtrlBSCNeighbor)
-    add_bsc_test(suite, workdir, TestCtrlBSCNeighborCell)
-    res = unittest.TextTestRunner(verbosity=verbose_level).run(suite)
-    sys.exit(len(res.errors) + len(res.failures))
+    #suite = unittest.TestSuite()
+    #suite.addTest(TestCtrlBSC('testTrxArfcn'))
+    #add_bsc_test(suite, workdir, TestCtrlBSC)
+    #add_bsc_test(suite, workdir, TestCtrlBSCNeighbor)
+    #add_bsc_test(suite, workdir, TestCtrlBSCNeighborCell)
+    for i in range(1000):
+        suite = unittest.TestSuite()
+        #suite.addTest(TestCtrlBSC('testTrxArfcn'))
+        #suite.addTest(TestCtrlBSCNeighborCell('testCtrlListBTS'))
+        # ERROR: testBtsGenerateSystemInformation (__main__.TestCtrlBSC.testBtsGenerateSystemInformation)
+        suite.addTest(TestCtrlBSC('testBtsGenerateSystemInformation'))
+        res = unittest.TextTestRunner(verbosity=verbose_level).run(suite)
+        assert len(res.errors) == 0
+        assert len(res.failures) == 0
+    sys.exit(0)

It does not matter which testcase I am executing in a loop, the crash happens sporadically (just wait ~20-60 seconds and you'll see a new coredump in coredumpctl).

(gdb) bt
#0  0x00007f5ef69d126c in ?? () from /usr/lib/libc.so.6
#1  0x00007f5ef6981a08 in raise () from /usr/lib/libc.so.6
#2  0x00007f5ef696a538 in abort () from /usr/lib/libc.so.6
#3  0x00007f5ef696b2db in ?? () from /usr/lib/libc.so.6
#4  0x00007f5ef69db1b7 in ?? () from /usr/lib/libc.so.6
#5  0x00007f5ef69dbcce in ?? () from /usr/lib/libc.so.6
#6  0x00007f5ef69deb14 in ?? () from /usr/lib/libc.so.6
#7  0x00007f5ef69df82a in malloc () from /usr/lib/libc.so.6
#8  0x00007f5ef6b3eb60 in ?? () from /usr/lib/libtalloc.so.2
#9  0x00007f5ef6b3ecde in talloc_named_const () from /usr/lib/libtalloc.so.2
#10 0x00007f5ef6ed4220 in msgb_alloc_c (ctx=<optimized out>, size=size@entry=1203, name=name@entry=0x7f5ef6fc4057 "IPA Multiplex") at msgb.c:77
#11 0x00007f5ef6ed430b in msgb_alloc (size=size@entry=1203, name=name@entry=0x7f5ef6fc4057 "IPA Multiplex") at msgb.c:110
#12 0x00007f5ef6f97df8 in msgb_alloc_headroom (size=size@entry=1203, headroom=headroom@entry=3, name=name@entry=0x7f5ef6fc4057 "IPA Multiplex")
    at ../../include/osmocom/core/msgb.h:556
#13 0x00007f5ef6f991d8 in ipa_msg_alloc (headroom=3, headroom@entry=0) at ipa.c:715
#14 0x00007f5ef6f997f6 in ipa_msg_recv_buffered (fd=11, rmsg=rmsg@entry=0x7ffda14f1c00, tmp_msg=tmp_msg@entry=0x55ce207b1010) at ipa.c:596
#15 0x00007f5ef701b2ce in handle_control_read (bfd=0x55ce207b0fb0) at control_if.c:353
#16 0x00007f5ef6eea7af in osmo_wqueue_bfd_cb (fd=0x55ce207b0fb0, what=1) at write_queue.c:47
#17 0x00007f5ef6edc73e in poll_disp_fds (n_fd=n_fd@entry=9) at select.c:419
#18 0x00007f5ef6edc7e6 in _osmo_select_main (polling=<optimized out>) at select.c:457
#19 0x00007f5ef6edc89b in osmo_select_main_ctx (polling=<optimized out>) at select.c:513
#20 0x000055ce1fd96cfa in main ()
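
The same run-it-in-a-loop idea can be sketched outside of unittest as well: start a command repeatedly and stop at the first run that is killed by a signal. This is a generic illustration assuming a POSIX system (negative return codes for signals); run_until_crash() is a hypothetical helper and the command to run is up to the caller.

```python
import subprocess
import sys


def run_until_crash(cmd, max_runs=1000):
    """Run cmd up to max_runs times.

    Returns (iteration, returncode) for the first run killed by a signal,
    or None if no run was killed by a signal.
    """
    for i in range(max_runs):
        rc = subprocess.run(cmd).returncode
        if rc < 0:  # negative means "killed by signal -rc" on POSIX
            return i, rc
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 loop.py <command> [args...]
    print(run_until_crash(sys.argv[1:]))
```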
Actions #20

Updated by fixeria 5 months ago

  • Status changed from In Progress to Stalled

Today I saw a potentially related master build failure for master-osmo-sgsn (it's SGSN for the first time!):

https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-sgsn/44257/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-sgsn/44257/IU=--enable-iu,WITH_MANUALS=1,a3=default,a4=default,label=osmocom-master/

ERROR: files left in build directory after distclean:
./core
make[1]: *** [Makefile:751: distcleancheck] Error 1
make[1]: Leaving directory '/build/osmo-sgsn-1.11.0.1-e746b/_build/sub'
make: *** [Makefile:680: distcheck] Error 1

I guess it's also related to the CTRL tests somehow:

$ file Downloads/core 
Downloads/core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './src/sgsn/osmo-sgsn -c /build/osmo-sgsn-1.11.0.1-e746b/tests/../doc/examples/o', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: './src/sgsn/osmo-sgsn', platform: 'x86_64'