Project

General

Profile

Bug #4456

SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with musl

Added by laforge 20 days ago. Updated 19 days ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
remsim-client
Target version:
-
Start date:
03/16/2020
Due date:
% Done:

20%


Description

I haven't seen this problem on Debian/x86 or Raspbian/arm builds so far. But it's visible on OpenWRT musl builds on ARM.

Thread 1 "osmo-remsim-cli" received signal SIGABRT, Aborted.
__restore_sigs (set=set@entry=0xbefff8b8) at src/signal/block.c:44
44      }
(gdb) bt
#0  __restore_sigs (set=set@entry=0xbefff8b8) at src/signal/block.c:44
#1  0xb6fc92b4 in raise (sig=sig@entry=6) at src/signal/raise.c:11
#2  0xb6faaa70 in abort () at src/exit/abort.c:14
#3  0xb6e7b928 in talloc_abort (reason=0xb6e7de45 "Bad talloc magic value - unknown value")
    at ../talloc.c:500
#4  0xb6e7bb70 in talloc_abort_unknown_value () at ../talloc.c:513
#5  talloc_chunk_from_ptr (ptr=<optimized out>) at ../talloc.c:529
#6  0xb6e7c2d0 in __talloc_with_prefix (context=0x29d50, size=28, prefix_len=0, tc_ret=0xbefff9ac)
    at ../talloc.c:743
#7  0xb6e7c678 in __talloc (tc=0xbefff9a4, size=size@entry=28, context=<optimized out>)
    at ../talloc.c:804
#8  _talloc_named_const (context=<optimized out>, size=size@entry=28, name=0xb6ef18f6 "struct osmo_fd")
    at ../talloc.c:961
#9  0xb6e7c6d4 in _talloc_zero (ctx=<optimized out>, size=size@entry=28, name=<optimized out>)
    at ../talloc.c:2402
#10 0xb6ef0f30 in osmo_usb_added_cb (fd=10, events=<optimized out>, user_data=0x0) at osmo_libusb.c:86
#11 0xb6ec704c in usbi_add_pollfd ()
   from /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libusb-1.0.so.0
#12 0xb6eca794 in op_open ()
   from /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libusb-1.0.so.0
#13 0xb6ec38c0 in libusb_open ()
   from /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libusb-1.0.so.0
#14 0xb6ef160c in osmo_libusb_open_claim_interface (ctx=ctx@entry=0x0, luctx=luctx@entry=0x0, 
    ifm=0xbefffa90, ifm@entry=0xbefffa88) at osmo_libusb.c:390
#15 0x000129fc in client_user_main (bc=bc@entry=0x263c0) at user_simtrace2.c:415
#16 0x00012280 in main (argc=15, argv=0xbefffc44) at remsim_client_main.c:166

Related issues

Related to libosmocore - Bug #4062: vty tests fails on arm (raspberry pi)Resolved06/16/2019

History

#1 Updated by laforge 20 days ago

ok, this seems to related to thread-local-storage. I remember we had hit a gcc bug about this some time ago (#4062)

(gdb) frame 10
#10 0xb6ef0f30 in osmo_usb_added_cb (fd=10, events=<optimized out>, user_data=0x0) at osmo_libusb.c:86
86              struct osmo_fd *ofd = talloc_zero(OTC_GLOBAL, struct osmo_fd);
(gdb) p OTC_GLOBAL
Cannot find thread-local storage for Thread 1740.1740, shared library /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libosmocore.so.12:
Remote target failed to process qGetTLSAddr request

Given that this build was created using gcc-7.3.0, we would actually expect it to have fixed thread-local-storage in shared libraries. weird...

#2 Updated by laforge 20 days ago

  • Related to Bug #4062: vty tests fails on arm (raspberry pi) added

#3 Updated by laforge 20 days ago

Attempting to use the workaround of using -mtls-dialect=gnu2 when building libosmocore should solve the problem.

....but:

Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13
Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13
Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13

It seems that the dynamic linker used by musl doesn't support this at all. I'm a bit at a loss, about (I think) 20 years after introduction of thread-local storage. sigh

#4 Updated by dalias 20 days ago

The TLS thing looks like it might be a red herring. GDB is just unable/unwilling to find TLS by calling __tls_get_addr in the tracee and wants a libthreaddb.so coerced into the tracee's memory space, which we don't have for musl.

With that said, all TLS access in musl, whether you use TLSDESC or not, is async-signal-safe and always has been. Moreover, ARM TLSDESC has been supported since 1.1.21, so you must be using a rather old version. There were some bugs affecting TLS allocation/alignment on ARM that were fixed in the past few years, so it might be one of them that's affecting you; I'd try upgrading and see. But it's also plausible that it's something completely different.

If you can't upgrade and just need to patch, let me know and I'll try to find the relevant commits.

#5 Updated by dalias 20 days ago

Actually I forget whether libthreaddb.so is loaded into the tracee/target or on the host running gdb, but in either case we don't have one for musl yet, and if we implement one it will just use generic public interfaces like __tls_get_addr that gdb would ideally already be using itself. Getting gdb to do decent multithreaded debugging without it is a bit of a battle but it works with moderate functionality.

#6 Updated by lynxis 20 days ago

laforge which package feed did you use? I would like to reproduce it. On which platform do you tried it?

#7 Updated by laforge 19 days ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 20

dalias wrote:

The TLS thing looks like it might be a red herring.

it might, but given that we have had the exact same problem at the exact same part of the code on other ARM32 platforms before (See #4062), it would be quite a coincidence. And I'm referring to the actual bug causing the SIGABRT while accessing __thread variables in shared libraries, not to the problems gdb has resolving their address.

GDB is just unable/unwilling to find TLS by calling __tls_get_addr in the tracee and wants a libthreaddb.so coerced into the tracee's memory space, which we don't have for musl.

Yes, I'm aware of the fact that gdb not resolving the thread-local address is a separate issue (and not the source of my complaint)

With that said, all TLS access in musl, whether you use TLSDESC or not, is async-signal-safe and always has been. Moreover, ARM TLSDESC has been supported since 1.1.21, so you must be using a rather old version.

It is 1.1.20 on this platform (and AM335x phyCORE based custom board).

There were some bugs affecting TLS allocation/alignment on ARM that were fixed in the past few years, so it might be one of them that's affecting you; I'd try upgrading and see. But it's also plausible that it's something completely different.

I'm not at liberty to upgrade the base OS, but I could try it just to see if we can narrow it down.

If you can't upgrade and just need to patch, let me know and I'll try to find the relevant commits.

Thanks a lot for your quick feedback here. I'll let you know if we have to resort to that.

The fun part is that the target program is not multi-threaded at all. It's just that our core library (libosmocore) is prepared for multi-threaded programs by declaring various global state (like the talloc contexts; talloc must have one of those per thread) as thread-local.

#8 Updated by laforge 19 days ago

lynxis wrote:

laforge which package feed did you use? I would like to reproduce it. On which platform do you tried it?

this is a phycore-am335x based board (you probably know which one). No official package feeds are used, it's a fully custom build.

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)