Bug #2507

RnD: failure to create pcu socket, because path is too long to fit in struct sockaddr_un

Added by neels 11 months ago. Updated 9 months ago.

Two runs in a row on RnD saw

05:12:38.580338 run          osmo-bts-trx(pid=23979): ERR: Terminated: ERROR {rc=1}  [trial-101↪aoip_sms:trx-b200↪osmo-bts-trx↪osmo-bts-trx(pid=23979)]
05:12:38.599545 run          osmo-bts-trx(pid=23979): stderr:
| (launched: 2017-09-08_05:12:38.510355)
| 20170908051238536 DLCTRL <0017> control_if.c:788 CTRL at 4238
| 20170908051238536 DLGLOBAL <0010> telnet_interface.c:102 telnet at 4241
| 20170908051238536 DPCU <0009> pcu_sock.c:895 Could not create /home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200/ unix socket: Address already in use
| PCU L1 socket failed 

Investigate why it happened and whether it persists.
AFAICT it should never happen, because the socket lives in a dir created specifically for each test run.



#1 Updated by neels 11 months ago

  • Priority changed from Normal to Urgent

#2 Updated by neels 11 months ago

interestingly enough didn't happen in
...I don't know what to make of it...

#3 Updated by neels 11 months ago

Ah! It seems we hit a maximum path length.
The path seems to get truncated, and instead of the path
a shorter version gets used:

This is coincidentally exactly at the dir boundary, but I see other socket files there:

/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-sysmo [cell5000]
/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/sms:trx-sysmocell5 [000]

suggesting the limit is 107 (weird number).

My guess is that if two tests end up with the same truncated socket file, it would fail.
But in this case we hit exactly the dir name of the test and hence find an existing dir at the place.

And that explains why it started showing up at exactly trial #100: before that, 'trial-99' was one char shorter, so the socket files were created inside the dir, with names just one letter long.
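
The arithmetic can be checked with a short sketch. It assumes the 107-byte limit guessed above and a hypothetical socket file name `pcu_bts` (the ticket does not state the real name). `truncate_path` and `socket_path` are illustrative helpers, not part of osmo-gsm-tester, and the same run timestamp is reused for 'trial-99' purely for comparison.

```python
SUN_PATH_USABLE = 107  # limit inferred above from the truncated socket names

BASE = "/home/jenkins/workspace/osmo-gsm-tester_manual-run-all"
RUN = "run.2017-09-08_05-08-33"
SOCK = "pcu_bts"  # hypothetical socket file name, not taken from the ticket


def truncate_path(path, limit=SUN_PATH_USABLE):
    """Mimic the observed truncation: keep only the first `limit` bytes."""
    return path.encode()[:limit].decode(errors="ignore")


def socket_path(trial):
    """Build the full socket path for a given trial dir name."""
    return f"{BASE}/{trial}/{RUN}/aoip_sms:trx-b200/{SOCK}"


for trial in ("trial-99", "trial-101"):
    print(trial, "->", truncate_path(socket_path(trial)))
```

For 'trial-101' the truncated result ends exactly at the trailing slash of the test dir, which already exists, hence "Address already in use". For 'trial-99' one character of the file name survives, so a one-letter socket file gets created inside the dir.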

The truncation also seems to happen on prod, but with the job numbers being 4 digits, we don't hit the dir exactly.
Truncations are visible at the end, in the tar warnings:

+ tar czf /home/jenkins/workspace/osmo-gsm-tester_run/trial-2748-run.tgz run.2017-09-08_05-26-39
tar: run.2017-09-08_05-26-39/aoip_sms\:trx-b200/mo_mt_sms.: socket ignored
tar: run.2017-09-08_05-26-39/sms\:trx-b200/ socket ignored
tar: run.2017-09-08_05-26-39/aoip_sms\:trx-sysmocell5000/m: socket ignored
tar: run.2017-09-08_05-26-39/sms\:trx-sysmocell5000/mo_mt_: socket ignored

#4 Updated by neels 11 months ago

As we see in the error message, osmo-bts still has the full path.
It feeds it to libosmocore/src/socket.c osmo_sock_unix_init().
This does:

        struct sockaddr_un local;
        strncpy(local.sun_path, socket_path, sizeof(local.sun_path));
        local.sun_path[sizeof(local.sun_path) - 1] = '\0';

and whaddaya know, x86_64-linux-gnu/sys/un.h

struct sockaddr_un
  {
    __SOCKADDR_COMMON (sun_);
    char sun_path[108];   /* Path name.  */
  };

That's the 107 we see above plus the NUL terminator.

We simply cannot create sockets with path names of this size!
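
One way to catch this early on the tester side is a length check before the path is written into any config. This is only a sketch: `check_unix_socket_path` is a hypothetical helper, and the 108 is the Linux x86_64 value from the header quoted above (other platforms may differ).

```python
import os

SUN_PATH_SIZE = 108  # sizeof(sun_path) in the Linux x86_64 header quoted above


def check_unix_socket_path(path):
    """Return `path` unchanged, or raise ValueError if it cannot fit in
    sun_path together with the terminating NUL byte."""
    encoded = os.fsencode(path)
    if len(encoded) >= SUN_PATH_SIZE:
        raise ValueError(
            f"unix socket path is {len(encoded)} bytes, "
            f"maximum is {SUN_PATH_SIZE - 1}: {path!r}"
        )
    return path
```

Failing loudly here would have pointed at the real cause instead of the misleading "Address already in use".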

Instead, we could put a relative path in the osmo-bts.cfg, because we know the CWD of the osmo-bts binary (just tested, it works in principle).
We can shorten the path using os.path.relpath(socket_path, osmo_bts_cwd).

Until we do (or RnD job numbers hit 1000), that particular test will continue to fail.
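
A minimal sketch of the relpath idea, assuming the socket sits in the test dir and osmo-bts runs in a subdir of it. Both the `pcu_bts` file name and the `osmo-bts-trx` CWD are assumptions for illustration; `relative_socket_path` is a hypothetical helper.

```python
import os


def relative_socket_path(socket_path, process_cwd):
    """Return socket_path relative to the dir the process runs in."""
    return os.path.relpath(socket_path, process_cwd)


test_dir = ("/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/"
            "trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200")
osmo_bts_cwd = os.path.join(test_dir, "osmo-bts-trx")  # assumed CWD
socket_path = os.path.join(test_dir, "pcu_bts")        # assumed file name

print(relative_socket_path(socket_path, osmo_bts_cwd))
```

The relative form only stays valid as long as the process does not change its working directory after startup.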

#5 Updated by neels 11 months ago

  • Subject changed from RnD: failure to create pcu socket to RnD: failure to create pcu socket, because path is too long to fit in struct sockaddr_un

#6 Updated by pespin 11 months ago

Wow, good catch, I was unaware of this limitation with unix sockets.

I'm not entirely sure using relpaths for it is a good idea. I mean, it works from the point of view of osmo-bts, but what about osmo-pcu? I ask because I created a commit to share the pcu socket between them in the pespin/gprs branch: See and grep for "pcu_socket_path".

We could make pcu_socket_path return an absolute path (as it is done now), and then inside OsmoPcu and OsmoBts classes translate that into a relpath. Do you agree on doing so?

#7 Updated by laforge 11 months ago

You could simply use something like mkdtemp() to create a unique temporary directory
and then create the pcu_sock in that directory?

I don't think there's any reason to give the socket a semantic name, as it's not something
you are archiving (like a log file or an artefact).
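
In Python this could look roughly like the following. It is a sketch of the suggestion, not the patch that was eventually merged: `bind_in_temp_dir`, the `pcu_` prefix and the `pcu_bts` file name are assumptions, and a stream socket is used only to keep the example portable.

```python
import os
import shutil
import socket
import tempfile


def bind_in_temp_dir(name="pcu_bts"):
    """Bind a unix socket inside a fresh temp dir; return (socket, dir)."""
    # mkdtemp() creates a unique dir, typically under /tmp, so the
    # resulting path stays far below the sun_path limit.
    sock_dir = tempfile.mkdtemp(prefix="pcu_")
    sock_path = os.path.join(sock_dir, name)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind(sock_path)
    except OSError:
        srv.close()
        shutil.rmtree(sock_dir)
        raise
    return srv, sock_dir


srv, sock_dir = bind_in_temp_dir()
print(srv.getsockname())
srv.close()
shutil.rmtree(sock_dir)  # the socket is not an artefact, so clean up
```

The caller owns the directory and has to remove it when the test run ends, since mkdtemp() never cleans up by itself.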

#8 Updated by pespin 11 months ago

  • Status changed from New to Feedback

Patch submitted using mkdtemp() in to solve the issue.

#9 Updated by pespin 10 months ago

  • Status changed from Feedback to Resolved

Patch was merged solving the issue.

#10 Updated by laforge 9 months ago

  • Status changed from Resolved to Closed
