Bug #4995
handle ENOBUFS on write to AF_PACKET socket
Status: Closed
% Done: 100%
Description
- even when marked write-able by select/poll, an AF_PACKET socket may still return -1 / ENOBUFS from write() in case the socket buffer and/or the tx-queue of the driver is full
- there is no way to safely/sanely wait for buffer space to become available again
- the only option is to re-try until it finally succeeds, preferably after a "reasonable" amount of sleep, considering the data rate of the underlying transport medium
I'm attaching a reproducer to demonstrate the problem, both with select() and without.
See also SYS#5343
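A minimal sketch of the only workable strategy described above (this is an illustration, not the attached reproducer): retry the write() after a short sleep whenever it fails with ENOBUFS.

```c
/* Hypothetical sketch, not the attached reproducer: even after
 * select()/poll() report the fd writable, write() on an AF_PACKET
 * socket may still fail with ENOBUFS, so the caller has to sleep
 * and retry. */
#include <errno.h>
#include <unistd.h>

static ssize_t write_enobufs_retry(int fd, const void *buf, size_t len)
{
	for (;;) {
		ssize_t rc = write(fd, buf, len);
		if (rc >= 0)
			return rc;
		if (errno != ENOBUFS && errno != EAGAIN)
			return -errno;
		/* "reasonable" back-off; ideally derived from the
		 * transmit time of len bytes on the underlying medium */
		usleep(1000);
	}
}
```

The fixed 1 ms back-off here is arbitrary; as noted above, a real implementation should scale it with the data rate of the transport medium.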
The problem is further increased due to the fact that libosmocore/ns2 doesn't even notice if this happens.
The call chain looks like this:
- frame_relay.c:osmo_fr_tx_dlc() is used to transmit NS messages by the NS2 core
- frame_relay.c will do whatever handling internally
- gprs_ns2_fr.c:fr_tx_cb() is the call-back we register with the frame_relay.c core; it puts the msgb into an osmo_wqueue
- the wqueue code calls gprs_ns2_fr.c:handle_netif_write() as write-callback when the FD is write-able (always!)
- we directly return the result of the write() syscall (which is -1 in case of error)
- osmo_wqueue_bfd_cb() calls that write_cb(), but:
  - it expects a -errno type return value, not the return value of a syscall, which is likely just -1
  - it only treats -EAGAIN as a trigger to re-enqueue the just-dequeued message

So all in all, we are using a write_queue, but it will never really queue anything, as the socket is always writable, and we don't realize if the write actually fails.
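The mismatch in the call chain above could be fixed roughly like this (an illustrative sketch with a simplified signature; the real gprs_ns2_fr.c callback operates on struct osmo_fd / struct msgb, and this is not the merged patch):

```c
/* Illustrative fix: translate the raw write() result into a -errno
 * style value, so osmo_wqueue_bfd_cb() can distinguish -EAGAIN
 * (re-enqueue) from other errors such as -ENOBUFS.  Simplified
 * signature for clarity. */
#include <errno.h>
#include <unistd.h>

static int netif_write_cb(int fd, const void *data, size_t len)
{
	ssize_t rc = write(fd, data, len);

	if (rc < 0)
		return -errno;	/* e.g. -ENOBUFS or -EAGAIN, not just -1 */
	if ((size_t)rc < len)
		return -EIO;	/* short write on a packet socket: error */
	return 0;
}
```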
What makes this even worse: a shared write_queue for user traffic and Q.933 LMI (or even NS-ALIVE) traffic is actively dangerous:
- if Q.933 starts to fail, the entire link will be marked dead at the FR level
- if NS-ALIVE starts to fail, the NS-VC will be marked as DEAD
- as an absolute minimum, we should have a counter and/or error messages if ENOBUFS or any other error happens
- users of osmo_wqueue should always return a "-errno" style return value. We should audit all our code, not just this example
- there should be some notification of the upper layers/application when ENOBUFS happens
- have separate queues or some other prioritization that prefers Q.933 LMI traffic and NS-signaling (ALIVE/ALIVE-ACK) over all user traffic (NS-UNITDATA)
Files
Related issues
Updated by laforge over 3 years ago
- % Done changed from 0 to 10
After some thinking about possible approaches with priority queues or the like, I think I concluded:
- osmo_wqueue cannot be used at all due to its integration with select and assumption that select would only return if a socket is really write-able, ...
- instead of complex multiple priority queues, I think we should implement one new mechanism with the following policy
- enqueue Q.933 LMI at the head of the queue, instead of the tail
- enqueue NS-ALIVE/ACK at the head of the queue, instead of the tail
- enqueue NS-UNITDATA with BVCI==0 at the tail of the queue
- never enqueue NS-UNITDATA with BVCI!=0, i.e. user traffic on PTP BVCs
The dequeue mechanism would be entirely driven by the ENOBUFS returns and a related osmo_timer. So if we get ENOBUFS, we start a timer and retry later. The duration of the timer can be based on a rough estimate of the transmit time of the just-failed packet.
This should:
- keep our queue very short, and
- ensure that all essential signaling on all layers (LMI/NS/BVC) gets reliable delivery (even BVC-RESET for PTP BVCs happens on NS-BVCI 0), while
- dropping that traffic which we really can afford to drop (user plane)
- also ensure we don't get into buffer bloat
There should be a counter for the number of dropped packets, and ideally also some feedback to the higher layers so that we can incorporate it in the per-BVC flow control. I would consider the latter a second, possibly even optional step.
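The proposed enqueue policy and ENOBUFS retry delay could be sketched as follows. This is a self-contained illustration with a hand-rolled queue and made-up names; the actual implementation in libosmocore uses its own data structures (llist, osmo_timer):

```c
/* Sketch of the proposed policy: signaling at the head of the queue,
 * NS-UNITDATA on BVCI 0 at the tail, PTP user traffic never queued. */
#include <stdbool.h>
#include <stddef.h>

enum msg_kind {
	MSG_Q933_LMI,		/* Q.933 link management */
	MSG_NS_ALIVE,		/* NS-ALIVE / NS-ALIVE-ACK */
	MSG_NS_UNITDATA_SIG,	/* NS-UNITDATA with BVCI == 0 */
	MSG_NS_UNITDATA_PTP,	/* NS-UNITDATA with BVCI != 0 (user traffic) */
};

struct queued_msg {
	struct queued_msg *next;
	enum msg_kind kind;
};

struct msg_queue {
	struct queued_msg *head, *tail;
};

/* returns true if enqueued; user traffic on PTP BVCs is never queued */
static bool fr_enqueue(struct msg_queue *q, struct queued_msg *m)
{
	switch (m->kind) {
	case MSG_Q933_LMI:
	case MSG_NS_ALIVE:
		/* signaling goes to the head of the queue */
		m->next = q->head;
		q->head = m;
		if (!q->tail)
			q->tail = m;
		return true;
	case MSG_NS_UNITDATA_SIG:
		/* NS-UNITDATA on BVCI 0 goes to the tail */
		m->next = NULL;
		if (q->tail)
			q->tail->next = m;
		else
			q->head = m;
		q->tail = m;
		return true;
	default:
		return false;
	}
}

/* after ENOBUFS: retry roughly after the transmit time of the failed
 * packet, e.g. bitrate = 2048000 for an E1 line */
static unsigned int retry_delay_us(size_t pkt_bytes, unsigned int bitrate)
{
	return (unsigned int)(pkt_bytes * 8ULL * 1000000ULL / bitrate);
}
```

For a 1400-byte packet on a 2048 kbit/s E1 line this yields a retry delay of roughly 5.5 ms.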
Updated by laforge over 3 years ago
- Status changed from New to In Progress
I can actually reproduce the dropping of vital messages like Q.933 or NS-ALIVE/ACK under high load.
Using osmo-ns-dummy with the load-generator patch from https://gerrit.osmocom.org/c/libosmocore/+/22553 and this config snippet:
ns-traffic-generator foo nsei 1001 bvci 0 packet-size 1400 interval-us 2500 lsp 0 lsp-mode fixed
will try to transmit at more than twice the capacity of the E1 line and hence reproduce the problem very quickly. Within a minute or so, the FR/HDLC link will be declared as unreliable.
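As a sanity check of the "more than twice the capacity" claim (my arithmetic, not from the ticket): 1400-byte packets every 2500 microseconds amount to about 4.48 Mbit/s, versus the 2.048 Mbit/s of an unchannelized E1 line, i.e. an overload factor of roughly 2.2 (ignoring FR/HDLC framing overhead).

```c
/* Back-of-the-envelope check of offered load vs. link capacity;
 * ignores FR/HDLC framing overhead. */
static double overload_factor(double pkt_bytes, double interval_s,
			      double link_bps)
{
	return pkt_bytes * 8.0 / interval_s / link_bps;
}
```

With the values from the config snippet, overload_factor(1400, 2500e-6, 2048000) comes out at about 2.19.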
Updated by laforge over 3 years ago
- % Done changed from 10 to 70
laforge wrote:
After some thinking about possible approaches with priority queues or the like, I think I concluded:
- osmo_wqueue cannot be used at all due to its integration with select and assumption that select would only return if a socket is really write-able, ...
- instead of complex multiple priority queues, I think we should implement one new mechanism with the following policy
- enqueue Q.933 LMI at the head of the queue, instead of the tail
- enqueue NS-ALIVE/ACK at the head of the queue, instead of the tail
- enqueue NS-UNITDATA with BVCI==0 at the tail of the queue
- never enqueue NS-UNITDATA with BVCI!=0, i.e. user traffic on PTP BVCs
This is now implemented in https://gerrit.osmocom.org/c/libosmocore/+/22555
The "Q.933 / NS-ALIVE starvation" can no longer be observed when doing overload testing using this patch.
Updated by laforge over 3 years ago
- Status changed from In Progress to Resolved
- % Done changed from 70 to 100
Patch merged.
Updated by laforge over 3 years ago
- Related to Bug #4974: gbproxy-ttcn3-test over framerelay are unstable added