Bug #4995: handle ENOBUFS on write to AF_PACKET socket - libosmocore - Open Source Mobile Communications

Bug #4995

AF_PACKET sockets have the following "incredible" semantics: 
 * even when marked write-able by select/poll, they still many return -1 / ENOBUFS in case the socket buffer and/or tx-queue of the driver is full 
 * there is no way to safely/sanely wait for buffer space to become available again 
 * the only option is to re-try until it finally succeeds, preferably after a "reasonable" amount of sleep, considering the data rate of the underlying transport medium 

 I'm attaching a reproducer to demonstrate the problem, both with select() and without. 

 See also SYS#5343 

 The problem is further increased due to the fact that libosmocore/ns2 doesn't even notice if this happens. 

 The call chain looks like this: 
 * @frame_relay.c:osmo_fr_tx_dlc()@ is used to transmit NS messages by the NS2 core 
 * @frame_relay.c@ will do whatever handling internaly 
 * @gprs_ns2fr.c:fr_tx_cb()@ is the call-back we register with the frame_relay.c cre 
 ** puts msgb into osmo_wqueue 
 * wqueue code calls @gprs_ns2_fr.c:handle_netif_write()@ as write-callback when the FD is write-able (always!) 
 ** we directlry retunr te result of the write() syscall (which is -1 in case of error) 
 ** @osmo_wqueue_bfd_cb()@ calls that write_cb(), but 
 *** it expects a -errno type return value, not the return value of a syscall that's likely just -1. 
 *** only treats "-EAGAIN" as a trigger to re-enqueue the just-dequeued message 

 So all in all, we are using a write_queue, but it will never really queue anything, as the socket is always writable and we don't realize if the write actually fails. 

 What makes this even worse: A shared write_queue for user traffic and Q.933 Q.931 LMI (or even NS-ALIVE) traffic is actively dangerous 
 * if Q.933 Q.931 starts to fail ,the entire link will be marked dead at the FR level 
 * if NS-ALIVE starts to fail, the NS-VC will be marked as DEAD 

 Now the question is what to do about it: 
 # as an absolute minimum, we should have a counter and/or error messages if ENOBUFS or any other error happens 
 # users of osmo_wqueue should always return a "-errno" style return value.    We should audit all our code, not just this example 
 # there should be some notification of the uppwer layers/application when ENOBUFS happens 
 # have separate queues or some other prioritization that prefers Q.933 Q.931 LMI traffic and NS-signaling (ALIVE/ALIVE-ACK) over all user traffic (NS-UNITDATA)

Back

Project

General

Profile

Osmocom Libraries » libosmocore

Bug #4995