Feature #5751
io_uring support in libosmocore (open)
40%
Description
Traditionally our I/O abstraction in libosmocore has been select(). In libosmocore 1.5.0 (2020) we migrated over to poll() to support more than 1024 FDs and to avoid the extreme amount of fd-set memcpy()ing involved in the venerable select interface.
Now of course both select and poll are ancient unix interfaces for non-blocking I/O, and both come at a high cost for systems under high load.
Specifically, we are getting reports from osmo-bsc users indicating that a busy BSC with 100 BTS (400 TRX) is spending about 40% of its CPU cycles in the kernel-side sock_poll, tcp_poll and do_sys_poll.
There are other interfaces such as Linux AIO, POSIX AIO and epoll, but the brightest and shiniest new I/O interface on Linux is io_uring. Contrary to any of its predecessors, io_uring can, in the "worst" case, operate without any system calls at all anymore. io_uring recognizes that each syscall is associated with a rather high context switch cost.
io_uring consists of memory-mapped (between kernel and userspace process) queues for requests and completions, as well as lockless primitives to enqueue/dequeue from these.
The requests in the queue are requests like read N bytes from this file descriptor or write N bytes to that file descriptor. But io_uring can do much more (many other syscalls), though the read/write is the most relevant part to us.
We already have two io_uring users in the osmocom universe: the GTP and the UDP/RTP load generators I wrote some time ago. They manage their file descriptors internally.
This ticket is now about introducing io_uring support into libosmocore itself, in a way to enable all osmocom programs to use that shared infrastructure.
Conceptual differences
reading from a socket
Conceptually, the existing code typically works like this:
- register some socket file descriptor for read
- libosmocore includes it in the poll-set
- libosmocore calls poll()
- kernel returns from poll, indicating fd is readable
- libosmocore dispatches to the application call-back
- application allocates msgb, reads data from socket
- application processes data in msgb
With io_uring, this model needs to change to something like this:
- application tells us it wants to read from a socket
- libosmocore or application pre-allocate the msgb
- libosmocore uses liburing to add a read request to the io_uring submission queue
- kernel signals us at some point a completion event via io_uring / liburing
- libosmocore dispatches pre-filled msgb to application call-back
- application processes data in msgb
So as we can see, the responsibility for the actual reading transfers from application (or intermediate library like libosmo-netif / libosmo-sigtran) into library.
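The shift of responsibility can be sketched in plain C. This is only an illustration under assumptions: `struct demo_buf` stands in for a msgb, and all `demo_*` names are hypothetical and not part of libosmocore. The library-side function allocates the buffer, performs the read and hands the filled buffer to the application call-back, which never calls read() itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_BUF_SIZE 256

/* Hypothetical stand-in for a msgb: raw storage plus fill level. */
struct demo_buf {
	unsigned char data[DEMO_BUF_SIZE];
	size_t len;
};

/* Completion-style call-back: the buffer arrives already filled. */
typedef void (*demo_read_cb)(int fd, int rc, struct demo_buf *buf, void *priv);

/* Library side: allocate the buffer, perform the read, dispatch, release.
 * With a poll backend this would run once the fd is reported readable. */
static int demo_lib_read_and_dispatch(int fd, demo_read_cb cb, void *priv)
{
	struct demo_buf *buf;
	ssize_t rc;

	assert(cb != NULL);
	buf = calloc(1, sizeof(*buf));
	if (!buf)
		return -1;

	rc = read(fd, buf->data, sizeof(buf->data));
	buf->len = rc > 0 ? (size_t)rc : 0;
	cb(fd, (int)rc, buf, priv);	/* application only processes the data */
	free(buf);
	return rc < 0 ? -1 : 0;
}

/* Application side: no read() call, only processing of the result. */
struct demo_app_state {
	int last_rc;
	char last[DEMO_BUF_SIZE + 1];
};

static void demo_app_read_cb(int fd, int rc, struct demo_buf *buf, void *priv)
{
	struct demo_app_state *st = priv;

	(void)fd;
	st->last_rc = rc;
	memcpy(st->last, buf->data, buf->len);
	st->last[buf->len] = '\0';
}
```

A caller could, for example, create a pipe(), write a few bytes into it and then invoke `demo_lib_read_and_dispatch(read_end, demo_app_read_cb, &state)`. With an io_uring backend the read would instead be submitted ahead of time and the call-back would fire on completion, but the application-facing call-back could stay the same.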
writing to a socket
Conceptually, the existing code typically works like this:
- register some socket file descriptor for write
- libosmocore includes it in the poll-set
- libosmocore calls poll()
- kernel returns from poll, indicating fd is writeable
- libosmocore dispatches to the application call-back
- application writes the data from the msgb to the socket and frees the msgb
With io_uring, this model needs to change to something like this:
- application tells us it wants to write to a socket, including the msgb
- libosmocore uses liburing to add a write request to the io_uring submission queue
- kernel signals us at some point a completion event via io_uring / liburing
- libosmocore releases the msgb with msgb_free()
Again, the actual reading/writing passes into the library and out of the scope of the application (or an intermediate library like libosmo-netif / libosmo-sigtran).
Related issues
Updated by laforge about 1 year ago
I like the idea of splitting this into two separate sub-tasks:
- introduce the conceptual API changes of having the actual read/write done inside libosmocore; then start to port applications over to that new API
- subsequently (and fully optionally) introduce an io_uring backend to libosmocore so it can benefit from the related performance improvements.
By splitting this up into two parts, we can more easily pinpoint any related problems, as we can test one part without the other.
Furthermore, on any older systems that don't have kernels with io_uring support, we can simply not use it, as the second step is independent of the first step. The applications simply always use the same API; whether or not libosmocore uses io_uring becomes an implementation detail unknown to the applications.
Updated by laforge about 1 year ago
- Related to Feature #5752: io_uring support in libosmo-sigtran added
Updated by laforge about 1 year ago
- Related to Feature #5753: io_uring support in libosmo-netif added
Updated by laforge about 1 year ago
- Related to Feature #5754: io_uring support in libosmo-mgcp-client added
Updated by laforge about 1 year ago
- Related to Feature #5755: io_uring support in osmo-bsc added
Updated by laforge about 1 year ago
- Related to Bug #5756: io_uring support in libosmo-abis added
Updated by laforge about 1 year ago
For an existing example of how to use io_uring in the osmocom context, check out rtp-load-gen at https://gitea.osmocom.org/cellular-infrastructure/osmo-mgw/src/branch/laforge/rtp-load-gen/contrib/rtp-load-gen and grep for io_uring_ to see the various API calls. There's also https://gitea.osmocom.org/cellular-infrastructure/gtp-load-gen
- io_uring_get_sqe returns an unused submission queue entry
- io_uring_prep_read and io_uring_prep_write fill that submission queue entry with a fd, pointer to data + length
- io_uring_submit submits whatever submission queue entries have been prepared
- io_uring tutorial at https://unixism.net/loti/tutorial/index.html
- liburing code at https://github.com/axboe/liburing
The libosmocore integration with the existing select/poll would likely be done via an eventfd. So applications will continue to use osmo_select_main() etc. and can use any number of their file descriptors as they did so far. But libosmocore will internally register an eventfd with the existing select/poll API, so that any time io_uring wants to notify us about completions, it marks that eventfd as readable, triggering our select/poll loop to handle those completion events. So why is this faster? Because there will be one such eventfd-poll-trigger for a virtually unlimited number of io_uring completion events, as opposed to one poll+read/write syscall for each of them.
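The coalescing effect of the eventfd can be illustrated with a small sketch that does not use io_uring at all. As an assumption for illustration, userspace writes to the eventfd here stand in for the notifications the kernel would raise, and the `demo_*` names are hypothetical. The point shown is only that any number of signals collapse into a single poll() wake-up and a single read().

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal n events on the eventfd, one write per event.
 * (Stand-in for notifications that would come from the kernel.) */
static int demo_signal_completions(int efd, unsigned int n)
{
	uint64_t one = 1;
	unsigned int i;

	assert(efd >= 0);
	for (i = 0; i < n; i++) {
		if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
			return -1;
	}
	return 0;
}

/* One non-blocking poll() plus one read() drains the whole counter.
 * Returns the number of signalled events, 0 if none, -1 on error. */
static int64_t demo_drain_once(int efd)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count = 0;
	int rc;

	rc = poll(&pfd, 1, 0);
	if (rc < 0)
		return -1;
	if (rc == 0 || !(pfd.revents & POLLIN))
		return 0;
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

In an actual integration the main loop would, after such a wake-up, walk the io_uring completion queue and dispatch every pending completion, rather than rely on the eventfd counter value.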
Updated by Hoernchen about 1 year ago
Please keep in mind that IORING_REGISTER_IOWQ_AFF is a fairly recent feature, so unless that exists, "automatically" turning on uring support when available leads to a bunch of threads (as for the number and other details, https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/ is worth a read) that just end up somewhere, without easy ways to move them to a specific CPU.
Updated by laforge about 1 year ago
On Wed, Nov 09, 2022 at 01:58:54PM +0000, Hoernchen wrote:
> Please keep in mind that IORING_REGISTER_IOWQ_AFF is a fairly recent feature, so unless that exists, "automatically" turning on uring support when available leads to a bunch of threads (as for the number and other details, https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/ is worth a read) that just end up somewhere, without easy ways to move them to a specific CPU.
AFAICT there are no kernel threads created for socket read/write, as sockets support non-blocking operation.
99.9% of all I/O we are doing is on sockets (UDP, TCP, SCTP, Unix) for talking to other network elements or
the user via VTY/CTRL. There is a bit of file I/O when reading config files (not worth optimizing anyway) and from osmo-hlr / osmo-msc for the respective database, which is accessed in blocking I/O anyway.
Updated by laforge about 1 year ago
- Related to Feature #5766: use Linux kernel KCM for IPA header? added
Updated by laforge about 1 year ago
- Assignee changed from laforge to daniel
Update: I've been playing for a few days with some of the concepts and trying to bring all our requirements in-line toward the first step (new API that can support poll and later io_uring backend).
I've handed this over to daniel now as he has more time available right now and indicated an interest in this topic. We just had a call where I explained my thoughts and the latest results on how I think it should all be put together.
I'm of course available whenever feedback/questions arise.
Updated by laforge about 1 year ago
summary of some of my ideas / thoughts on the new I/O provider so far:
- modes. The new I/O provider will need to offer the following modes:
  - read/write (e.g. TCP sockets for IPA OML/RSL/GSUP as well as CBSP, VTY, CTRL, ...)
  - recvfrom/sendto (e.g. UDP sockets used for RTP, GTP, MGCP, ...)
    - io_uring doesn't directly support those syscalls. However, it does support recvmsg/sendmsg, which is a superset of recvfrom/sendto combined with readv/writev
    - we have to convert recvfrom/sendto by API users (applications) to recvmsg/sendmsg
  - sctp_recvfrom/sctp_sendto (SCTP sockets for anything M3UA/SUA/sigtran)
    - this API from libsctp is just a 20-line wrapper around normal recvmsg/sendmsg calls
    - we have to re-implement this wrapper in our io_uring code
- introduction of a new struct osmo_io_fd which will be used instead of osmo_fd, containing
  - fd
  - const char *name for application to provide a human-readable name of the FD (in case I/O provider wants to log something)
  - parameters for msgb_alloc (headroom, context, size)
  - a built-in write-queue with semantics like osmo_wqueue
  - call-back functions for the user application (read/write completion call-backs)
  - priv/priv_nr for context of application (like osmo_fd)
- write operation
  - application does something like osmo_io_write(struct osmo_io_fd *, struct msgb *)
  - I/O provider enqueues any write into write queue and marks FD as "wants to write"
  - io_uring backend
    - would check if write is pending completion. If not, submit first entry of write_queue to io_uring
    - at some later point, I/O provider io_uring backend is notified via osmo_fd-wrapped-eventfd that io_uring has completed something
    - once I/O provider io_uring backend identifies a write has completed, it will call the io_fd->write_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) call-back
  - classic poll backend
    - would now check if OSMO_FD_WRITE is active. If not, set it.
    - gets notified that osmo_fd is writable
    - issues normal non-blocking write() syscall
    - calls the io_fd->write_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) call-back
  - the application can now act based on rc (short write, negative error, dead socket, etc)
  - once call-back returns, I/O provider does msgb_free(msg)
- read operation
  - application notifies I/O provider that it wants to read from osmo_io_fd
  - io_uring backend
    - allocates a msgb (using parameters provided by application stored in osmo_io_fd)
    - submits a read() syscall to io_uring submission queue pointing to msgb memory
    - completion is handled just like the write completion via osmo_fd-wrapped-eventfd
    - io_fd->read_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) is called
  - classic poll backend
    - enables OSMO_FD_READ on socket
    - gets notified that osmo_fd is readable once data is available
    - allocates a msgb (using parameters provided by application stored in osmo_io_fd)
    - issues normal non-blocking read() syscall
    - io_fd->read_cb(struct osmo_io_fd *fd, int rc, struct msgb *msg) is called
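The write path of the classic poll backend described above could look roughly like the following sketch. Everything here is an assumption for illustration: `struct demo_msg` stands in for a msgb, `struct demo_io_fd` is a heavily reduced guess at what struct osmo_io_fd might contain, `want_write` stands in for OSMO_FD_WRITE, and none of the `demo_*` functions exist in libosmocore.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_MSG_MAX 64

/* Hypothetical stand-in for struct msgb. */
struct demo_msg {
	struct demo_msg *next;
	size_t len;
	unsigned char data[DEMO_MSG_MAX];
};

struct demo_io_fd;
typedef void (*demo_write_cb)(struct demo_io_fd *iofd, int rc, struct demo_msg *msg);

/* Hypothetical, reduced version of the struct described in the list above. */
struct demo_io_fd {
	int fd;
	const char *name;			/* human-readable name for logging */
	struct demo_msg *txq_head, *txq_tail;	/* built-in write queue */
	int want_write;				/* stands in for OSMO_FD_WRITE */
	demo_write_cb write_cb;			/* write completion call-back */
	void *priv;				/* application context */
};

static struct demo_msg *demo_msg_alloc(const void *data, size_t len)
{
	struct demo_msg *msg;

	if (len > DEMO_MSG_MAX)
		return NULL;
	msg = calloc(1, sizeof(*msg));
	if (!msg)
		return NULL;
	memcpy(msg->data, data, len);
	msg->len = len;
	return msg;
}

/* Application entry point: only enqueue and flag "wants to write". */
static void demo_io_write(struct demo_io_fd *iofd, struct demo_msg *msg)
{
	assert(iofd != NULL && msg != NULL);
	msg->next = NULL;
	if (iofd->txq_tail)
		iofd->txq_tail->next = msg;
	else
		iofd->txq_head = msg;
	iofd->txq_tail = msg;
	iofd->want_write = 1;
}

/* Poll backend step, to be run once the fd is reported writable:
 * dequeue, write(), call the completion call-back, then free. */
static void demo_poll_backend_writable(struct demo_io_fd *iofd)
{
	struct demo_msg *msg = iofd->txq_head;
	ssize_t rc;

	if (!msg) {
		iofd->want_write = 0;
		return;
	}
	iofd->txq_head = msg->next;
	if (!iofd->txq_head) {
		iofd->txq_tail = NULL;
		iofd->want_write = 0;	/* queue drained, stop polling for write */
	}
	rc = write(iofd->fd, msg->data, msg->len);
	if (iofd->write_cb)
		iofd->write_cb(iofd, (int)rc, msg);
	free(msg);	/* the I/O provider, not the application, frees */
}

/* Example application call-back: sums up written bytes in *priv. */
static void demo_count_cb(struct demo_io_fd *iofd, int rc, struct demo_msg *msg)
{
	(void)msg;
	if (rc > 0)
		*(int *)iofd->priv += rc;
}
```

An io_uring backend would replace only `demo_poll_backend_writable()`: it would submit the head of the queue to the submission queue and run the same call-back and free once the completion arrives. The application-facing `demo_io_write()` would not change.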
Updated by laforge about 1 year ago
For the {send,recv}{to,from,msg}() family of calls, we need to extend the above slightly. In addition to the raw msgb, we have metadata like the struct sockaddr to send to.
I originally thought we could push this to the front of the msgb headroom, but sockaddr_storage is already 128 bytes, and adding the struct msghdr, struct iovec etc. quickly adds up to something like 200 bytes. Since msgb size (including headroom) is limited to 16bit (historical mistake), I'm not sure if it's the right way.
I then decided to go for a struct serialized_msghdr which we allocate at the time the user issues e.g. a osmo_io_sendto(struct osmo_io_fd *, struct msgb *msg, int flags, const struct sockaddr *dest_addr, socklen_t addrlen) call. The function would then copy the provided parameters into that heap-allocated serialized_msghdr, and enqueue that (instead of the pure msg) into the in-memory transmit queue. Once the actual sendmsg call is performed (async via io_uring or directly via syscall), we dequeue that msghdr and make use of it. On completion we call the user completion call-back and then free the serialized_msghdr as well as the msgb afterwards.
The same approach also works for the recvmsg/recvfrom case, where we can have an application call-back like void (*recvfrom_cb)(struct osmo_io_fd *iofd, int rc, struct msgb *msg, struct sockaddr *src_addr, socklen_t *addrlen);
Equally this approach works for sctp_sendmsg/sctp_recvmsg as those are just wrappers with different function arguments that all get encoded into a struct msghdr.
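A rough sketch of the serialized-msghdr idea, under stated assumptions: the struct layout and all `demo_*` names are hypothetical, the payload pointer stands in for the msgb, and the send is done directly via sendmsg() rather than through io_uring. It shows how sendto()-style arguments can be copied into one heap object whose embedded struct msghdr stays valid until the send actually happens.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical layout; the field names are assumptions. */
struct demo_serialized_msghdr {
	struct msghdr hdr;		/* what sendmsg() consumes */
	struct sockaddr_storage addr;	/* copy of the destination address */
	struct iovec iov;		/* points at the payload (msgb data) */
	int flags;
};

/* Copy sendto()-style parameters into one heap-allocated object.
 * The payload itself is not copied; the caller keeps it alive until sent. */
static struct demo_serialized_msghdr *
demo_msghdr_from_sendto(const void *data, size_t len, int flags,
			const struct sockaddr *dest, socklen_t addrlen)
{
	struct demo_serialized_msghdr *smh;

	assert(dest != NULL);
	if (addrlen > sizeof(struct sockaddr_storage))
		return NULL;
	smh = calloc(1, sizeof(*smh));
	if (!smh)
		return NULL;

	memcpy(&smh->addr, dest, addrlen);
	smh->iov.iov_base = (void *)data;
	smh->iov.iov_len = len;
	smh->hdr.msg_name = &smh->addr;
	smh->hdr.msg_namelen = addrlen;
	smh->hdr.msg_iov = &smh->iov;
	smh->hdr.msg_iovlen = 1;
	smh->flags = flags;
	return smh;
}

/* Dequeue step: perform the send, then release the serialized header. */
static ssize_t demo_msghdr_send(int fd, struct demo_serialized_msghdr *smh)
{
	ssize_t rc = sendmsg(fd, &smh->hdr, smh->flags);

	free(smh);
	return rc;
}

/* Helper for trying it out: bind fd to an ephemeral UDP port on 127.0.0.1
 * and report the chosen address in *out. */
static int demo_bind_loopback(int fd, struct sockaddr_in *out)
{
	socklen_t len = sizeof(*out);

	memset(out, 0, sizeof(*out));
	out->sin_family = AF_INET;
	out->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	out->sin_port = 0;
	if (bind(fd, (struct sockaddr *)out, sizeof(*out)) < 0)
		return -1;
	return getsockname(fd, (struct sockaddr *)out, &len);
}
```

Because the address and iovec live inside the same allocation as the msghdr, the object can sit in a transmit queue for an arbitrary time without dangling pointers, which is what an asynchronous submission would require.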
Updated by daniel about 1 year ago
- % Done changed from 0 to 30
An update on the io_uring osmo_io progress so far:
The WIP commits are in libosmocore.git branch daniel/io_uring.
https://gitea.osmocom.org/osmocom/libosmocore/src/branch/daniel/io_uring
What's done
I managed to get a basic version of osmo_io working with the poll backend and also have a working backend for io_uring.
With it, the NS2 UDP socket uses osmo_io in sendto()/recvfrom() mode. The control interface is also using osmo_io, complete with IPA parsing/segmentation (in read()/write() mode).
With this, the ttcn3 osmo-gbproxy tests (which also use the ctrl_if) as well as make distcheck pass.
libosmocore currently tries to build with uring support unless passed `--disable-uring` during configure. The default will be io_uring if it's enabled.
The environment variable `LIBOSMO_IO_BACKEND` can be used to switch backends at runtime. Setting it to something other than "IO_URING" will use the poll/osmo_fd backend. This can be verified by setting the new DIO loglevel to DEBUG and watching for the message:
"iofd(<name>) using backend poll/uring"
Open issues
- Porting over the ipa.c/ipaccess.c code in libosmo-abis will be a significant amount of work since quite a few functions get direct access to an osmo_fd or even a plain fd. They then write()/send() directly to those, which will need to move to a tx queue-aware model.
- libosmo-netif has some similar issues in its ipa code, but in general looks much better because the osmo_stream api already uses a tx_queue internally and matches the callback api of osmo_io much better.
- sctp support is not implemented in osmo_io yet. This will be a wrapper around send/recvmsg, so shouldn't be too complicated.
API notes
The osmo_io api currently has a _setup function that takes and registers a plain fd and returns a newly allocated struct osmo_io_fd *. This worked ok for ctrl_if and gprs_ns2, but I noticed in a couple places in libosmo-abis that the osmo_fd struct (with callbacks, data, ...) is initialized in one part of the code with the fd set to -1 and only registered in another when the fd is actually present.
Right now the osmo_io assumes that the fd is configured/connected/... correctly and will not do anything there except try to read/write from it. This should be ok for now and you can always get the raw fd and do some get/setsockopts on there.
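The runtime backend switch via `LIBOSMO_IO_BACKEND` mentioned in this update can be pictured with the following sketch. It is not the libosmocore code: the enum, the function names and the `uring_built_in` flag (standing in for a build without `--disable-uring`) are assumptions, and falling back to poll when io_uring is not compiled in is a guess at sensible behaviour.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum demo_backend {
	DEMO_BACKEND_POLL,
	DEMO_BACKEND_IO_URING,
};

/* Decide on a backend from the environment value.
 * - io_uring not built in: always poll (assumption)
 * - variable unset:        io_uring, the default when enabled
 * - "IO_URING":            io_uring
 * - anything else:         poll/osmo_fd backend */
static enum demo_backend demo_select_backend(const char *env_value,
					     int uring_built_in)
{
	if (!uring_built_in)
		return DEMO_BACKEND_POLL;
	if (env_value == NULL)
		return DEMO_BACKEND_IO_URING;
	if (strcmp(env_value, "IO_URING") == 0)
		return DEMO_BACKEND_IO_URING;
	return DEMO_BACKEND_POLL;
}

/* Convenience wrapper reading the variable named in the ticket. */
static enum demo_backend demo_backend_from_env(int uring_built_in)
{
	return demo_select_backend(getenv("LIBOSMO_IO_BACKEND"), uring_built_in);
}

/* Name as it might appear in a debug log line. */
static const char *demo_backend_name(enum demo_backend b)
{
	return b == DEMO_BACKEND_IO_URING ? "uring" : "poll";
}
```

Keeping the decision in one pure function makes it easy to test every combination without touching the process environment.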
Updated by laforge 9 months ago
- Priority changed from Urgent to Immediate
This ticket has been in need of updates for months. The branch has not seen any commits since early December. Yet from spoken status reports I know there has been more recent activity.
Please make sure to update the relevant tickets and keep pushing the current branches, thanks.
Updated by osmith 4 months ago
- % Done changed from 30 to 40
The patch has been merged to master:
https://gerrit.osmocom.org/c/libosmocore/+/32536
I've adjusted infrastructure to fix failing builds related to the new liburing dependency:
https://gerrit.osmocom.org/q/topic:osmo-io+author:osmith%2540sysmocom.de+-is:merged
Updated by laforge 23 days ago
- Checklist item sctp support in osmo_io added
- Assignee changed from daniel to laforge
regarding the high-level aspects of SCTP support, see some updates in #5752#note-9
One of the unexpected problems is that msgb_sctp_{ppid,stream} is defined in libosmo-netif and hence is not available in libosmocore. We hence cannot use those existing definitions to pass parameters around in msgb :/ - and as usual, moving stuff between libraries is hard as it might break users and lead to duplicate definitions, etc.