Bug #4181
closed
osmo-trx-uhd: Crash during physical unplug of device
0%
Description
I was running my network using osmo-trx-uhd with an Ettus B200 and I unplugged the device. Got this:
Thu Aug 29 20:55:42 2019 DMAIN <0000> Transceiver.cpp:1113 [tid=139657724520192] ClockInterface: sending IND CLOCK 738780
Thu Aug 29 20:55:43 2019 DMAIN <0000> Transceiver.cpp:1113 [tid=139657724520192] ClockInterface: sending IND CLOCK 738997
Thu Aug 29 20:55:44 2019 DMAIN <0000> Transceiver.cpp:1113 [tid=139657724520192] ClockInterface: sending IND CLOCK 739213
terminate called after throwing an instance of 'uhd::io_error'
  what(): EnvironmentError: IOError: usb rx6 transfer status: LIBUSB_TRANSFER_NO_DEVICE
[ERROR] [UHD] signal 6 received
An unexpected exception was caught in a task loop.
The task loop will now exit, things may not work.
EnvironmentError: IOError: usb rx8 transfer status: LIBUSB_TRANSFER_NO_DEVICE
talloc report on 'OsmoTRX' (total 5246 bytes in 21 blocks)
    /home/pespin/dev/sysmocom/git/libosmocore/src/rate_ctr.c:234 contains 512 bytes in 1 blocks (ref 0) 0x6160000057e0
    /home/pespin/dev/sysmocom/git/osmo-trx/CommonLibs/trx_rate_ctr.cpp:276 contains 8 bytes in 1 blocks (ref 0) 0x60b0000c3130
    /home/pespin/dev/sysmocom/git/osmo-trx/CommonLibs/trx_rate_ctr.cpp:275 contains 32 bytes in 1 blocks (ref 0) 0x60c000023620
    telnet_connection contains 1 bytes in 1 blocks (ref 0) 0x60b0000c2370
    logging contains 4303 bytes in 11 blocks (ref 0) 0x60b0000155a0
    struct trx_ctx contains 390 bytes in 4 blocks (ref 0) 0x6140000006a0
    msgb contains 0 bytes in 1 blocks (ref 0) 0x608000005f80
full talloc report on 'OsmoTRX' (total 5246 bytes in 21 blocks)
...
./run_out.sh: line 12: 28572 Aborted (core dumped) $@
(./run_out.sh is the bash script I use to launch osmo-trx-uhd).
So it seems we are not handling a UHD exception in UHDDevice, which ends up aborting the entire process. We should handle it and stop the osmo-trx-uhd process gracefully through the osmo signal available for that purpose.
Updated by pespin over 4 years ago
- Status changed from New to Closed
As far as I understand, the UHD code is run in a separate thread created by UHD's task::make: https://files.ettus.com/manual/classuhd_1_1task.html
task_handler = task::make( boost::bind(&libusb_session_impl::libusb_event_handler_task, this, _context));
As a result, we have no way to catch C++ exceptions thrown in that thread. The uncaught exception ends up in std::terminate ("terminate called after throwing an instance of 'uhd::io_error'"), which calls abort(), and that sends SIGABRT to the process ("signal 6 received").
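To illustrate why the catch cannot be added on the osmo-trx side, here is a minimal sketch in plain C++ (no UHD involved). The helpers run_and_capture() and describe() are hypothetical names made up for this example; they are not part of UHD or osmo-trx. The point is only that the try/catch has to sit inside the thread's own entry function, which in our case is UHD code.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: runs `body` in a new thread and returns whatever
// exception escaped it, or a null exception_ptr if nothing was thrown.
std::exception_ptr run_and_capture(const std::function<void()> &body)
{
    std::exception_ptr captured;
    std::thread worker([&] {
        try {
            body();
        } catch (...) {
            // This catch must live inside the thread's entry function.
            // Without it the exception leaves the thread, the runtime calls
            // std::terminate(), and the default terminate handler calls abort().
            captured = std::current_exception();
        }
    });
    worker.join();  // join() also makes the write to `captured` visible here
    return captured;
}

// Hypothetical helper: turns a captured exception into its what() text.
std::string describe(const std::exception_ptr &e)
{
    if (!e)
        return "";
    try {
        std::rethrow_exception(e);
    } catch (const std::exception &ex) {
        return ex.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

A caller could then check the result of run_and_capture() after the thread has finished and start a clean shutdown. A try/catch placed around the code that merely creates the thread would never see the exception.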
The current osmo-trx signal handler:
static void sig_handler(int signo)
{
	if (gshutdown)
		/* We are in the middle of shutdown process, avoid any kind of extra action like printing */
		return;

	fprintf(stderr, "signal %d received\n", signo);
	switch (signo) {
	case SIGINT:
	case SIGTERM:
		fprintf(stderr, "shutting down\n");
		gshutdown = true;
		break;
	case SIGABRT:
	case SIGUSR1:
		talloc_report(tall_trx_ctx, stderr);
		talloc_report_full(tall_trx_ctx, stderr);
		break;
	case SIGUSR2:
		talloc_report_full(tall_trx_ctx, stderr);
		break;
	case SIGHUP:
		log_targets_reopen();
	default:
		break;
	}
}
Unfortunately there doesn't seem to be a way to handle this cleanly and shut down properly in this scenario (abort() called). According to POSIX:
The abort() function shall cause abnormal process termination to occur, unless the signal SIGABRT is being caught and the signal handler does not return.
I don't see how we can keep the signal handler from returning (other than by stopping/cancelling the thread or making it block forever), especially since we use signalfd, so the thread executing the SIGABRT handler is probably the main one. It's probably better to let it continue so a core dump is generated (not sure if it makes much sense though, since other threads keep running anyway...).
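The POSIX behaviour quoted above can be checked with a small standalone sketch (plain POSIX and C++, no osmo-trx code). All function names here are made up for the example. A forked child installs a SIGABRT handler and calls abort(), and the parent inspects the wait status.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Handler that simply returns, like the SIGABRT case in sig_handler().
void returning_handler(int) {}

// Handler that never returns: it ends the process with an exit code chosen
// for this example. _exit() is async-signal-safe, unlike exit().
void exiting_handler(int) { _exit(42); }

// Forks a child that installs `handler` for SIGABRT and then calls abort().
// Returns the child's wait status, or -1 if fork() failed.
int abort_in_child(void (*handler)(int))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        std::signal(SIGABRT, handler);
        std::abort();  // never returns, whatever the handler does
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return status;
}
```

With returning_handler the child is still killed by SIGABRT (WIFSIGNALED is true and WTERMSIG is SIGABRT), which matches what we see in osmo-trx. With exiting_handler the child exits normally with code 42, but _exit() skips all cleanup, so it is not a graceful shutdown either.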
So I'm closing the ticket.