Bug #4995
Updated by laforge over 3 years ago
AF_PACKET sockets have the following "incredible" semantics: * even when marked write-able by select/poll, they still many return -1 / ENOBUFS in case the socket buffer and/or tx-queue of the driver is full * there is no way to safely/sanely wait for buffer space to become available again * the only option is to re-try until it finally succeeds, preferably after a "reasonable" amount of sleep, considering the data rate of the underlying transport medium I'm attaching a reproducer to demonstrate the problem, both with select() and without. See also SYS#5343 The problem is further increased due to the fact that libosmocore/ns2 doesn't even notice if this happens. The call chain looks like this: * @frame_relay.c:osmo_fr_tx_dlc()@ is used to transmit NS messages by the NS2 core * @frame_relay.c@ will do whatever handling internaly * @gprs_ns2fr.c:fr_tx_cb()@ is the call-back we register with the frame_relay.c cre ** puts msgb into osmo_wqueue * wqueue code calls @gprs_ns2_fr.c:handle_netif_write()@ as write-callback when the FD is write-able (always!) ** we directlry retunr te result of the write() syscall (which is -1 in case of error) ** @osmo_wqueue_bfd_cb()@ calls that write_cb(), but *** it expects a -errno type return value, not the return value of a syscall that's likely just -1. *** only treats "-EAGAIN" as a trigger to re-enqueue the just-dequeued message So all in all, we are using a write_queue, but it will never really queue anything, as the socket is always writable and we don't realize if the write actually fails. What makes this even worse: A shared write_queue for user traffic and Q.933 Q.931 LMI (or even NS-ALIVE) traffic is actively dangerous * if Q.933 Q.931 starts to fail ,the entire link will be marked dead at the FR level * if NS-ALIVE starts to fail, the NS-VC will be marked as DEAD Now the question is what to do about it: # as an absolute minimum, we should have a counter and/or error messages if ENOBUFS or any other error happens # users of osmo_wqueue should always return a "-errno" style return value. We should audit all our code, not just this example # there should be some notification of the uppwer layers/application when ENOBUFS happens # have separate queues or some other prioritization that prefers Q.933 Q.931 LMI traffic and NS-signaling (ALIVE/ALIVE-ACK) over all user traffic (NS-UNITDATA)