Project

General

Profile

Bug #4456

SIGABRT while opening USB device in osmo-remsim-client-st2 on OpenWRT with musl

Added by laforge 2 months ago. Updated 6 days ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
remsim-client
Target version:
-
Start date:
03/16/2020
Due date:
% Done:

40%


Description

I haven't seen this problem on Debian/x86 or Raspbian/arm builds so far. But it's visible on OpenWRT musl builds on ARM.

Thread 1 "osmo-remsim-cli" received signal SIGABRT, Aborted.
__restore_sigs (set=set@entry=0xbefff8b8) at src/signal/block.c:44
44      }
(gdb) bt
#0  __restore_sigs (set=set@entry=0xbefff8b8) at src/signal/block.c:44
#1  0xb6fc92b4 in raise (sig=sig@entry=6) at src/signal/raise.c:11
#2  0xb6faaa70 in abort () at src/exit/abort.c:14
#3  0xb6e7b928 in talloc_abort (reason=0xb6e7de45 "Bad talloc magic value - unknown value")
    at ../talloc.c:500
#4  0xb6e7bb70 in talloc_abort_unknown_value () at ../talloc.c:513
#5  talloc_chunk_from_ptr (ptr=<optimized out>) at ../talloc.c:529
#6  0xb6e7c2d0 in __talloc_with_prefix (context=0x29d50, size=28, prefix_len=0, tc_ret=0xbefff9ac)
    at ../talloc.c:743
#7  0xb6e7c678 in __talloc (tc=0xbefff9a4, size=size@entry=28, context=<optimized out>)
    at ../talloc.c:804
#8  _talloc_named_const (context=<optimized out>, size=size@entry=28, name=0xb6ef18f6 "struct osmo_fd")
    at ../talloc.c:961
#9  0xb6e7c6d4 in _talloc_zero (ctx=<optimized out>, size=size@entry=28, name=<optimized out>)
    at ../talloc.c:2402
#10 0xb6ef0f30 in osmo_usb_added_cb (fd=10, events=<optimized out>, user_data=0x0) at osmo_libusb.c:86
#11 0xb6ec704c in usbi_add_pollfd ()
   from /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libusb-1.0.so.0
#12 0xb6eca794 in op_open ()
   from /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libusb-1.0.so.0
#13 0xb6ec38c0 in libusb_open ()
   from /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libusb-1.0.so.0
#14 0xb6ef160c in osmo_libusb_open_claim_interface (ctx=ctx@entry=0x0, luctx=luctx@entry=0x0, 
    ifm=0xbefffa90, ifm@entry=0xbefffa88) at osmo_libusb.c:390
#15 0x000129fc in client_user_main (bc=bc@entry=0x263c0) at user_simtrace2.c:415
#16 0x00012280 in main (argc=15, argv=0xbefffc44) at remsim_client_main.c:166

Related issues

Related to libosmocore - Bug #4062: vty tests fails on arm (raspberry pi)Resolved06/16/2019

History

#1 Updated by laforge 2 months ago

ok, this seems to related to thread-local-storage. I remember we had hit a gcc bug about this some time ago (#4062)

(gdb) frame 10
#10 0xb6ef0f30 in osmo_usb_added_cb (fd=10, events=<optimized out>, user_data=0x0) at osmo_libusb.c:86
86              struct osmo_fd *ofd = talloc_zero(OTC_GLOBAL, struct osmo_fd);
(gdb) p OTC_GLOBAL
Cannot find thread-local storage for Thread 1740.1740, shared library /home/laforge/openwrt/scripts/../staging_dir/target-arm_cortex-a8+vfpv3_musl_eabi/root-omap/usr/lib/libosmocore.so.12:
Remote target failed to process qGetTLSAddr request

Given that this build was created using gcc-7.3.0, we would actually expect it to have fixed thread-local-storage in shared libraries. weird...

#2 Updated by laforge 2 months ago

  • Related to Bug #4062: vty tests fails on arm (raspberry pi) added

#3 Updated by laforge 2 months ago

Attempting to use the workaround of using -mtls-dialect=gnu2 when building libosmocore should solve the problem.

....but:

Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13
Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13
Error relocating /usr/lib/libosmogsm.so.13: unsupported relocation type 13

It seems that the dynamic linker used by musl doesn't support this at all. I'm a bit at a loss, about (I think) 20 years after introduction of thread-local storage. sigh

#4 Updated by dalias 2 months ago

The TLS thing looks like it might be a red herring. GDB is just unable/unwilling to find TLS by calling __tls_get_addr in the tracee and wants a libthreaddb.so coerced into the tracee's memory space, which we don't have for musl.

With that said, all TLS access in musl, whether you use TLSDESC or not, is async-signal-safe and always has been. Moreover, ARM TLSDESC has been supported since 1.1.21, so you must be using a rather old version. There were some bugs affecting TLS allocation/alignment on ARM that were fixed in the past few years, so it might be one of them that's affecting you; I'd try upgrading and see. But it's also plausible that it's something completely different.

If you can't upgrade and just need to patch, let me know and I'll try to find the relevant commits.

#5 Updated by dalias 2 months ago

Actually I forget whether libthreaddb.so is loaded into the tracee/target or on the host running gdb, but in either case we don't have one for musl yet, and if we implement one it will just use generic public interfaces like __tls_get_addr that gdb would ideally already be using itself. Getting gdb to do decent multithreaded debugging without it is a bit of a battle but it works with moderate functionality.

#6 Updated by lynxis 2 months ago

laforge which package feed did you use? I would like to reproduce it. On which platform do you tried it?

#7 Updated by laforge 2 months ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 20

dalias wrote:

The TLS thing looks like it might be a red herring.

it might, but given that we have had the exact same problem at the exact same part of the code on other ARM32 platforms before (See #4062), it would be quite a coincidence. And I'm referring to the actual bug causing the SIGABRT while accessing __thread variables in shared libraries, not to the problems gdb has resolving their address.

GDB is just unable/unwilling to find TLS by calling __tls_get_addr in the tracee and wants a libthreaddb.so coerced into the tracee's memory space, which we don't have for musl.

Yes, I'm aware of the fact that gdb not resolving the thread-local address is a separate issue (and not the source of my complaint)

With that said, all TLS access in musl, whether you use TLSDESC or not, is async-signal-safe and always has been. Moreover, ARM TLSDESC has been supported since 1.1.21, so you must be using a rather old version.

It is 1.1.20 on this platform (and AM335x phyCORE based custom board).

There were some bugs affecting TLS allocation/alignment on ARM that were fixed in the past few years, so it might be one of them that's affecting you; I'd try upgrading and see. But it's also plausible that it's something completely different.

I'm not at liberty to upgrade the base OS, but I could try it just to see if we can narrow it down.

If you can't upgrade and just need to patch, let me know and I'll try to find the relevant commits.

Thanks a lot for your quick feedback here. I'll let you know if we have to resort to that.

The fun part is that the target program is not multi-threaded at all. It's just that our core library (libosmocore) is prepared for multi-threaded programs by declaring various global state (like the talloc contexts; talloc must have one of those per thread) as thread-local.

#8 Updated by laforge 2 months ago

lynxis wrote:

laforge which package feed did you use? I would like to reproduce it. On which platform do you tried it?

this is a phycore-am335x based board (you probably know which one). No official package feeds are used, it's a fully custom build.

#9 Updated by laforge 6 days ago

  • % Done changed from 20 to 40

After upgrading musl to 1.1.23, I am still seeing a SIGABRT problem in this code path.

It turns out that modern versions of libtalloc randomize the magic "key" by which they identify every chunk of talloc memory. This presumably is a mechanism to make it harder to attack, as every time you start a process, that magic value changes. Initialization of this
talloc_magic' happens inside talloc_lib_init(), which is an __attribute__((constructor)) function.

Now in libosmocore, we use talloc from within other __attribute__((constructor)) functions, particularly also osmo_ctx_init() here in this particular codepath. So if the constructors are called in the "wrong" order (libosmocore before talloc), then we get the following sequence of events:
  1. libosmocore allocating some memory object in talloc. Talloc uses the compiled-in default TALLOC_MAGIC value for identifying those chunks
  2. talloc constructor changing the magic value (in a global variable) to some randomized value
  3. any later follow-up code that wants to allocate memory (using a context/chunk with the old magic value) will abort() as the magic value is not what it expects

This problem is aggravated by the fact that talloc_lib_init() will unconditionally use a new random value if called repeatedly, rather than checking if the current value is already a random value. In other words: it's not safe to simply call talloc_lib_init() ahead of any memory allocation function in a constructor.

Interestingly, gcc supports a priority value associated with constructor functions, where a low value means high priority and a high value means low priority: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-Attributes.html

So I will try to use a large numeric priority value in libosmocore and see if that adresses the problem.

#10 Updated by laforge 6 days ago

I tried it with __attribute__((constructor(65535))) in libosmocore. However, the bug still persists.

To my big surprise, I couldn't find any notion of priority handling in https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c - could it be they simply ignore the gcc specification of this attribute?

#11 Updated by laforge 6 days ago

I submitted my findings to https://www.openwall.com/lists/musl/2020/05/21/7

Meanwhile, I hacked an explicit call to osmo_ctx_init() into the main function of osmo-remsim-client, making libosmocore re-allocate the contexts at that point. This gets me beyond the talloc-abort, but it's of course not a solution - particularly since musl is hostile towards people trying to implement quirks by not offering a MUSL or the like.

#12 Updated by laforge 6 days ago

Also see https://github.com/bminor/glibc/blob/e3022f4bcd69eb9f103a6de626a1e9e343fc7ada/elf/dl-init.c#L109

  /* Stupid users forced the ELF specification to be changed.  It now
     says that the dynamic loader is responsible for determining the
     order in which the constructors have to run.  The constructors
     for all dependencies of an object must run before the constructor
     for the object itself.  Circular dependencies are left unspecified.
     This is highly questionable since it puts the burden on the dynamic
     loader which has to find the dependencies at runtime instead of
     letting the user do it right.  Stupidity rules!  */

clearly the author of those lines is very unhappy, but the spec is the spec...

#13 Updated by laforge 6 days ago

Ok, already in 1997 the specification (System V Application Binary Interface, Edition 4.1) on this was clear:

Initialization and Termination Functions

After the dynamic linker has built the process image and performed the reloca-
tions, each shared object gets the opportunity to execute some initialization code.
All shared object initializations happen before the executable file gains control.
Before the initialization code for any object A is called, the initialization code for
any other objects that object A depends on are called. For these purposes, an
object A depends on another object B, if B appears in A’s list of needed objects
(recorded in the DT_NEEDED entries of the dynamic structure). The order of ini-
tialization for circular dependencies is undefined.
So the constructor of libtalloc must be called before the constructor of libosmocore, as long as there is
  • no circular dependency (there certainly is not)
  • libtalloc appears in libosmcoore's list of needed objects (DT_NEEDED) which is true.

So even without any priority, the constructors must be called in the intuitively correct order. So to me this still is a linker bug.

#14 Updated by laforge 6 days ago

the dependency information seems quite [obviously] right: libtalloc doesn't depend on libosmocore, but libosmocore depends on libtalloc.

# ldd ./libosmocore.so
        ldd (0xb6f46000)
        libtalloc.so.2 => /usr/lib/libtalloc.so.2 (0xb6efd000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb6ee2000)
        libc.so => ldd (0xb6f46000)

# ldd ./libtalloc.so
        ldd (0xb6f5b000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb6f29000)
        libc.so => ldd (0xb6f5b000)

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)