Bug #5665
ERROR: files left in build directory after distclean: ./tests/core
Status: open (20% done)
Description
Recently we have been observing sporadic master-* job failures on Jenkins. There is always a coredump file, which makes the 'distcleancheck' target fail:
ERROR: files left in build directory after distclean: ./tests/core
make[1]: Leaving directory '/build/libosmocore-1.7.0.26-862dd/_build/sub'
make[1]: *** [Makefile:1010: distcleancheck] Error 1
make: *** [Makefile:941: distcheck] Error 1
This is not specific to libosmocore: I saw master-osmo-{bsc,msc} failing with the same verdict too.
Updated by fixeria over 1 year ago
- Related to Bug #5642: (Jenkins) Collect artifacts on master build failures added
Updated by fixeria over 1 year ago
Thanks to osmith, Jenkins does collect artifacts on master build failures now.
Yesterday master-osmo-msc failed with a coredump file present in the artifacts:
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-msc/32327/
$ file Downloads/core
Downloads/core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './src/osmo-msc/osmo-msc -c /build/osmo-msc-1.9.0.16-62977/tests/../doc/examples', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: './src/osmo-msc/osmo-msc', platform: 'x86_64'
Note that -c /build/osmo-msc-1.9.0.16-62977/tests/../doc/examples points to a folder, not a configuration file.
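When several cores need to be triaged, the two interesting fields of the file(1) output (the command line after "from" and the "execfn" path) can be pulled out with a small script. This is only a sketch: the function name parse_core_description is made up for illustration, the regular expressions assume the output layout shown above, and the "from" field may be cut short by file(1).

```python
import re


def parse_core_description(text):
    """Extract the command line and execfn from file(1) output for a core file.

    Hypothetical helper; assumes the "from '...'" and "execfn: '...'" layout
    seen in the output quoted in this ticket.
    """
    cmdline = re.search(r"from '([^']*)'", text)
    execfn = re.search(r"execfn: '([^']*)'", text)
    return {
        "cmdline": cmdline.group(1) if cmdline else None,
        "execfn": execfn.group(1) if execfn else None,
    }


# Sample copied from the file(1) output quoted above.
sample = (
    "Downloads/core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), "
    "SVR4-style, from './src/osmo-msc/osmo-msc -c "
    "/build/osmo-msc-1.9.0.16-62977/tests/../doc/examples', real uid: 1000, "
    "effective uid: 1000, real gid: 1000, effective gid: 1000, "
    "execfn: './src/osmo-msc/osmo-msc', platform: 'x86_64'"
)

info = parse_core_description(sample)
print(info["execfn"])
print(info["cmdline"])
```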
Updated by fixeria 7 months ago
Yesterday both master-osmo-bsc and master-osmo-msc failed, again due to the mysterious core file:
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21867/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21867/a1=default,a2=default,a3=default,a4=default,label=osmocom-master/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21870/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21870/a1=default,a2=default,a3=default,a4=default,label=osmocom-master/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-msc/37243/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-msc/37243/IU=--enable-iu,WITH_MANUALS=1,a3=default,a4=default,label=osmocom-master/
Updated by fixeria 7 months ago
Looking at one of the coredump files more closely (from osmo-bsc build 21867, find attached):
fixeria@DELL:~$ file /tmp/core
/tmp/core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './src/osmo-bsc/osmo-bsc -r tmp_dummy_sock -c /build/osmo-bsc-1.10.0.164-1f3bf/t', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: './src/osmo-bsc/osmo-bsc', platform: 'x86_64'
Note the -r tmp_dummy_sock in the command line. A quick grep in osmo-bsc.git yields the following:
tests/ctrl_test_runner.py: return ["./src/osmo-bsc/osmo-bsc", "-r", "tmp_dummy_sock", "-c",
tests/ctrl_test_runner.py: return (4249, "./src/osmo-bsc/osmo-bsc", "OsmoBSC", "bsc")
tests/ctrl_test_runner.py: return ["./src/osmo-bsc/osmo-bsc", "-r", "tmp_dummy_sock", "-c",
tests/ctrl_test_runner.py: return (4248, "./src/osmo-bsc/osmo-bsc", "OsmoBSC", "bsc")
tests/ctrl_test_runner.py: return ["./src/osmo-bsc/osmo-bsc", "-r", "tmp_dummy_sock", "-c",
tests/ctrl_test_runner.py: return (4249, "./src/osmo-bsc/osmo-bsc", "OsmoBSC", "bsc")
Updated by fixeria 7 months ago
fixeria wrote in #note-7:
> Note the -r tmp_dummy_sock in the command line. A quick grep in osmo-bsc.git yields the following: [...]
Let's try to narrow the problem down to a specific ctrl_test_runner.py run triggering the crash:
https://gerrit.osmocom.org/c/python/osmo-python-tests/+/32929 osmoutil: print return code in end_proc() [NEW]
I expect to see a non-zero return code (132) in case of a segfault.
Updated by osmith 7 months ago
As discussed in chat: here are a few patches to attach a tarball of the whole workspace when this happens, so we get the relevant binaries and libraries:
https://gerrit.osmocom.org/q/topic:workspace.tar.xz
Updated by fixeria 7 months ago
There was another unexpected failure:
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21915/
However, the artifacts contain no workspace.tar.xz. I found out why and submitted fixes:
https://gerrit.osmocom.org/c/osmo-bsc/+/33067 fixup: contrib/jenkins: create workspace.tar.xz on error [NEW]
https://gerrit.osmocom.org/c/osmo-msc/+/33068 fixup: contrib/jenkins: create workspace.tar.xz on error [NEW]
Updated by fixeria 6 months ago
- Status changed from New to In Progress
This time osmo-msc generated a coredump:
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-msc/37438/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-msc/37438/IU=--enable-iu,WITH_MANUALS=1,a3=default,a4=default,label=osmocom-master/
and fortunately we have the workspace.tar.xz! Taking a look.
Updated by fixeria 6 months ago
So I pulled a recent debian:bullseye, extracted the workspace archive and mounted it at /build:
$ docker run -v ./build:/build -it --rm debian:bullseye
root@66453c9d6868:/# apt update
root@66453c9d6868:/# apt install --no-install-suggests --no-install-recommends gdb
root@66453c9d6868:/# cd /build/
root@66453c9d6868:/build# find . -type f -name core
./osmo-msc-1.10.0.53-912f/_build/sub/core
root@66453c9d6868:/build# find . -type f -name osmo-msc
./src/osmo-msc/osmo-msc
root@66453c9d6868:/build# gdb -q ./src/osmo-msc/osmo-msc
Reading symbols from ./src/osmo-msc/osmo-msc...
(gdb) core ./osmo-msc-1.10.0.53-912f/_build/sub/core
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libmnl.so.0.2.0 during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libsctp.so.1.0.18 during file-backed mapping note processing
warning: Can't open file /usr/lib/x86_64-linux-gnu/libtalloc.so.2.3.1 during file-backed mapping note processing
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/sms.db-shm during file-backed mapping note processing
[New LWP 231973]
Core was generated by `./src/osmo-msc/osmo-msc -c /build/osmo-msc-1.10.0.53-912f/tests/../doc/examples'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fcb8dad6d67 in ?? ()
After installing the missing dependencies, gdb is still unable to tell us what happened :/
root@66453c9d6868:/build# apt install --no-install-suggests --no-install-recommends libmnl-dev libsctp-dev libtalloc-dev
root@66453c9d6868:/build# gdb -q ./src/osmo-msc/osmo-msc
Reading symbols from ./src/osmo-msc/osmo-msc...
(gdb) core ./osmo-msc-1.10.0.53-912f/_build/sub/core
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc during file-backed mapping note processing
warning: Can't open file /build/osmo-msc-1.10.0.53-912f/_build/sub/sms.db-shm during file-backed mapping note processing
[New LWP 231973]
Core was generated by `./src/osmo-msc/osmo-msc -c /build/osmo-msc-1.10.0.53-912f/tests/../doc/examples'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fcb8dad6d67 in ?? ()
My best guess is that /build/src/osmo-msc/osmo-msc is actually not the right binary: the one that generated the coredump must be /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc, but it's not present in the archive. Perhaps it was removed by distcleancheck.
Updated by fixeria 6 months ago
fixeria wrote in #note-8:
> Let's try to narrow the problem down to a specific ctrl_test_runner.py run triggering the crash:
> https://gerrit.osmocom.org/c/python/osmo-python-tests/+/32929 osmoutil: print return code in end_proc() [NEW]
> I expect to see a non-zero return code (132) in case of a segfault.
In the Console output of today's failed osmo-msc-master we see:
Process returned code: -11
If I read https://en.wikipedia.org/wiki/Signal_(IPC)#Default_action correctly, signal 11 is SIGSEGV.
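For reference, Python's subprocess module reports a child killed by signal N as return code -N on POSIX, whereas a shell such as bash reports 128 + N (so 139 for SIGSEGV). A minimal sketch of decoding such a value follows; the function name describe_returncode is hypothetical and not part of osmoutil.

```python
import signal


def describe_returncode(rc):
    """Describe a subprocess.Popen-style return code (hypothetical helper)."""
    if rc is None:
        return "still running"
    if rc < 0:
        # On POSIX, a negative value -N means the child was killed by signal N.
        try:
            return "killed by %s" % signal.Signals(-rc).name
        except ValueError:
            return "killed by unknown signal %d" % -rc
    return "exited with status %d" % rc


# -11 is the value seen in the console output above.
print(describe_returncode(-11))
print(describe_returncode(0))
```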
fixeria wrote in #note-12:
> My best guess is that /build/src/osmo-msc/osmo-msc is actually not the right binary, the one that generated the coredump must be /build/osmo-msc-1.10.0.53-912f/_build/sub/src/osmo-msc/osmo-msc, but it's not present in the archive - perhaps was removed by distcleancheck.
So I am wondering: what if we raise an exception from tests/{ctrl,vty}_test_runner.py? Would that abort distcleancheck and prevent it from removing the binary?
Updated by fixeria 6 months ago
https://gerrit.osmocom.org/c/python/osmo-python-tests/+/33139 osmoutil: return proc's return code from end_proc() [NEW]
https://gerrit.osmocom.org/c/osmo-msc/+/33141 tests/{ctrl,vty}_test_runner.py: raise an exception if proc's rc != 0 [NEW]
https://gerrit.osmocom.org/c/osmo-bsc/+/33140 tests/{ctrl,vty}_test_runner.py: raise an exception if proc's rc != 0 [NEW]
Updated by fixeria 6 months ago
- Status changed from Stalled to In Progress
We are lucky today, didn't have to wait weeks for this to happen again:
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/21986/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-bsc/a1=default,a2=default,a3=default,a4=default,label=osmocom-master/21986/
With my recent patches, make distcleancheck is aborted early:
======================================================================
ERROR: testTrxArfcn (__main__.TestCtrlBSC)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/osmo-bsc-1.10.0.171-2a69/_build/sub/tests/../../../tests/ctrl_test_runner.py", line 151, in tearDown
    TestCtrlBase.tearDown(self)
  File "/build/osmo-bsc-1.10.0.171-2a69/_build/sub/tests/../../../tests/ctrl_test_runner.py", line 68, in tearDown
    raise Exception("Process returned %d" % rc)
Exception: Process returned -6
This time it's signal 6, which corresponds to SIGABRT (with osmo-msc it was SIGSEGV).
Also this time we see two binaries in the workspace archive:
root@d1f585d86a48:/# find . -type f -name core
./build/osmo-bsc-1.10.0.171-2a69/_build/sub/core
root@d1f585d86a48:/# find . -type f -name osmo-bsc
./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc
./build/src/osmo-bsc/osmo-bsc
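The same lookup can be scripted when several workspace archives have to be checked. The sketch below is a rough Python counterpart of `find . -type f -name NAME`. The helper name find_files is made up, and the demo runs against a throwaway directory tree rather than a real workspace.

```python
import tempfile
from pathlib import Path


def find_files(root, name):
    """Return sorted paths of regular files called `name` below `root`."""
    return sorted(p for p in Path(root).rglob(name) if p.is_file())


# Demo on a temporary tree that mimics the layout of an extracted workspace.
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "pkg" / "_build" / "sub"
    (sub / "src").mkdir(parents=True)
    (sub / "core").write_bytes(b"")
    (sub / "src" / "daemon").write_bytes(b"")
    found = [p.relative_to(tmp).as_posix() for p in find_files(tmp, "core")]
    print(found)
```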
Updated by fixeria 6 months ago
- % Done changed from 0 to 20
Here we go:
root@d1f585d86a48:/# gdb -q ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc
Reading symbols from ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc...
(No debugging symbols found in ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/src/osmo-bsc/osmo-bsc)
(gdb) core-file ./build/osmo-bsc-1.10.0.171-2a69/_build/sub/core
[New LWP 57556]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./src/osmo-bsc/osmo-bsc -r tmp_dummy_sock -c /build/osmo-bsc-1.10.0.171-2a69/te'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f4a99f14ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f4a99f14ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4a99efe537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f4a99f563a8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f4a99f5d69a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f4a99f5e3cc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f4a99f5e503 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f4a99f60395 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007f4a99f61da4 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007f4a9a15cecc in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#9  0x00007f4a9a15d852 in talloc_named_const () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#10 0x00007f4a9a197900 in msgb_alloc_c () from /build/deps/install/lib/libosmocore.so.20
#11 0x00007f4a9a197a45 in msgb_alloc () from /build/deps/install/lib/libosmocore.so.20
#12 0x00007f4a9a270759 in msgb_alloc_headroom () from /build/deps/install/lib/libosmogsm.so.18
#13 0x00007f4a9a27295e in ipa_msg_alloc () from /build/deps/install/lib/libosmogsm.so.18
#14 0x00007f4a9a272448 in ipa_msg_recv_buffered () from /build/deps/install/lib/libosmogsm.so.18
#15 0x00007f4a9a2d702e in handle_control_read () from /build/deps/install/lib/libosmoctrl.so.0
#16 0x00007f4a9a1b6cd0 in osmo_wqueue_bfd_cb () from /build/deps/install/lib/libosmocore.so.20
#17 0x00007f4a9a1a2f59 in poll_disp_fds () from /build/deps/install/lib/libosmocore.so.20
#18 0x00007f4a9a1a3066 in _osmo_select_main () from /build/deps/install/lib/libosmocore.so.20
#19 0x00007f4a9a1a30d3 in osmo_select_main_ctx () from /build/deps/install/lib/libosmocore.so.20
#20 0x000056103b889b02 in main ()
So now we know that it's testTrxArfcn somehow crashing osmo-bsc.
Updated by fixeria 6 months ago
The same backtrace with libc6-dbg installed:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f4a99efe537 in __GI_abort () at abort.c:79
#2  0x00007f4a99f563a8 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f4a9a074390 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f4a99f5d69a in malloc_printerr (str=str@entry=0x7f4a9a072028 "corrupted double-linked list") at malloc.c:5347
#4  0x00007f4a99f5e3cc in unlink_chunk (p=p@entry=0x56103be33830, av=0x7f4a9a0aab80 <main_arena>) at malloc.c:1460
#5  0x00007f4a99f5e503 in malloc_consolidate (av=av@entry=0x7f4a9a0aab80 <main_arena>) at malloc.c:4494
#6  0x00007f4a99f60395 in _int_malloc (av=av@entry=0x7f4a9a0aab80 <main_arena>, bytes=bytes@entry=1435) at malloc.c:3699
#7  0x00007f4a99f61da4 in __GI___libc_malloc (bytes=1435) at malloc.c:3058
#8  0x00007f4a9a15cecc in ?? () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#9  0x00007f4a9a15d852 in talloc_named_const () from /usr/lib/x86_64-linux-gnu/libtalloc.so.2
#10 0x00007f4a9a197900 in msgb_alloc_c () from /build/deps/install/lib/libosmocore.so.20
#11 0x00007f4a9a197a45 in msgb_alloc () from /build/deps/install/lib/libosmocore.so.20
#12 0x00007f4a9a270759 in msgb_alloc_headroom () from /build/deps/install/lib/libosmogsm.so.18
#13 0x00007f4a9a27295e in ipa_msg_alloc () from /build/deps/install/lib/libosmogsm.so.18
#14 0x00007f4a9a272448 in ipa_msg_recv_buffered () from /build/deps/install/lib/libosmogsm.so.18
#15 0x00007f4a9a2d702e in handle_control_read () from /build/deps/install/lib/libosmoctrl.so.0
"corrupted double-linked list" does not look good: could it be a heap overflow?
Updated by fixeria 6 months ago
With this little patch I was able to reproduce the osmo-bsc crash locally:
diff --git a/tests/ctrl_test_runner.py b/tests/ctrl_test_runner.py
index 780842634..7b2880c47 100755
--- a/tests/ctrl_test_runner.py
+++ b/tests/ctrl_test_runner.py
@@ -830,9 +830,18 @@ if __name__ == '__main__':
     print("confpath %s, workdir %s" % (confpath, workdir))
     os.chdir(workdir)
     print("Running tests for specific control commands")
-    suite = unittest.TestSuite()
-    add_bsc_test(suite, workdir, TestCtrlBSC)
-    add_bsc_test(suite, workdir, TestCtrlBSCNeighbor)
-    add_bsc_test(suite, workdir, TestCtrlBSCNeighborCell)
-    res = unittest.TextTestRunner(verbosity=verbose_level).run(suite)
-    sys.exit(len(res.errors) + len(res.failures))
+    #suite = unittest.TestSuite()
+    #suite.addTest(TestCtrlBSC('testTrxArfcn'))
+    #add_bsc_test(suite, workdir, TestCtrlBSC)
+    #add_bsc_test(suite, workdir, TestCtrlBSCNeighbor)
+    #add_bsc_test(suite, workdir, TestCtrlBSCNeighborCell)
+    for i in range(1000):
+        suite = unittest.TestSuite()
+        #suite.addTest(TestCtrlBSC('testTrxArfcn'))
+        #suite.addTest(TestCtrlBSCNeighborCell('testCtrlListBTS'))
+        # ERROR: testBtsGenerateSystemInformation (__main__.TestCtrlBSC.testBtsGenerateSystemInformation)
+        suite.addTest(TestCtrlBSC('testBtsGenerateSystemInformation'))
+        res = unittest.TextTestRunner(verbosity=verbose_level).run(suite)
+        assert len(res.errors) == 0
+        assert len(res.failures) == 0
+    sys.exit(0)
It does not matter which testcase I execute in a loop: the crash happens sporadically (just wait ~20-60 seconds and you'll see a new coredump in coredumpctl).
(gdb) bt
#0  0x00007f5ef69d126c in ?? () from /usr/lib/libc.so.6
#1  0x00007f5ef6981a08 in raise () from /usr/lib/libc.so.6
#2  0x00007f5ef696a538 in abort () from /usr/lib/libc.so.6
#3  0x00007f5ef696b2db in ?? () from /usr/lib/libc.so.6
#4  0x00007f5ef69db1b7 in ?? () from /usr/lib/libc.so.6
#5  0x00007f5ef69dbcce in ?? () from /usr/lib/libc.so.6
#6  0x00007f5ef69deb14 in ?? () from /usr/lib/libc.so.6
#7  0x00007f5ef69df82a in malloc () from /usr/lib/libc.so.6
#8  0x00007f5ef6b3eb60 in ?? () from /usr/lib/libtalloc.so.2
#9  0x00007f5ef6b3ecde in talloc_named_const () from /usr/lib/libtalloc.so.2
#10 0x00007f5ef6ed4220 in msgb_alloc_c (ctx=<optimized out>, size=size@entry=1203, name=name@entry=0x7f5ef6fc4057 "IPA Multiplex") at msgb.c:77
#11 0x00007f5ef6ed430b in msgb_alloc (size=size@entry=1203, name=name@entry=0x7f5ef6fc4057 "IPA Multiplex") at msgb.c:110
#12 0x00007f5ef6f97df8 in msgb_alloc_headroom (size=size@entry=1203, headroom=headroom@entry=3, name=name@entry=0x7f5ef6fc4057 "IPA Multiplex") at ../../include/osmocom/core/msgb.h:556
#13 0x00007f5ef6f991d8 in ipa_msg_alloc (headroom=3, headroom@entry=0) at ipa.c:715
#14 0x00007f5ef6f997f6 in ipa_msg_recv_buffered (fd=11, rmsg=rmsg@entry=0x7ffda14f1c00, tmp_msg=tmp_msg@entry=0x55ce207b1010) at ipa.c:596
#15 0x00007f5ef701b2ce in handle_control_read (bfd=0x55ce207b0fb0) at control_if.c:353
#16 0x00007f5ef6eea7af in osmo_wqueue_bfd_cb (fd=0x55ce207b0fb0, what=1) at write_queue.c:47
#17 0x00007f5ef6edc73e in poll_disp_fds (n_fd=n_fd@entry=9) at select.c:419
#18 0x00007f5ef6edc7e6 in _osmo_select_main (polling=<optimized out>) at select.c:457
#19 0x00007f5ef6edc89b in osmo_select_main_ctx (polling=<optimized out>) at select.c:513
#20 0x000055ce1fd96cfa in main ()
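The loop from the patch above can also be written as a reusable helper, so that the runner script itself does not have to be edited by hand. This is only a sketch: run_until_failure and FlakyDemo are hypothetical names, and the demo test fails deterministically on its third run, standing in for the sporadic crash.

```python
import io
import unittest


def run_until_failure(case_cls, method_name, max_runs=1000):
    """Run one test method repeatedly.

    Returns the 1-based index of the first unsuccessful run, or None if all
    max_runs runs passed.
    """
    for i in range(1, max_runs + 1):
        suite = unittest.TestSuite([case_cls(method_name)])
        # Send the runner's report to a buffer to keep the loop quiet.
        runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
        if not runner.run(suite).wasSuccessful():
            return i
    return None


class FlakyDemo(unittest.TestCase):
    runs = 0  # class-level counter shared across test instances

    def test_flaky(self):
        FlakyDemo.runs += 1
        self.assertNotEqual(FlakyDemo.runs, 3)  # fails on the third run only


print(run_until_failure(FlakyDemo, "test_flaky", max_runs=10))
```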
Updated by fixeria 27 days ago
- Status changed from In Progress to Stalled
Today I saw a potentially related master build failure for master-osmo-sgsn (it's SGSN for the first time!):
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-sgsn/44257/
https://jenkins.osmocom.org/jenkins/view/master/job/master-osmo-sgsn/44257/IU=--enable-iu,WITH_MANUALS=1,a3=default,a4=default,label=osmocom-master/
ERROR: files left in build directory after distclean: ./core
make[1]: *** [Makefile:751: distcleancheck] Error 1
make[1]: Leaving directory '/build/osmo-sgsn-1.11.0.1-e746b/_build/sub'
make: *** [Makefile:680: distcheck] Error 1
I guess it's also related to the CTRL tests somehow:
$ file Downloads/core
Downloads/core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './src/sgsn/osmo-sgsn -c /build/osmo-sgsn-1.11.0.1-e746b/tests/../doc/examples/o', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: './src/sgsn/osmo-sgsn', platform: 'x86_64'