Bug #2507

RnD: failure to create pcu socket, because path is too long to fit in struct sockaddr_un

Added by neels 12 days ago. Updated about 6 hours ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
Start date:
09/08/2017
Due date:
% Done:

0%

Spec Reference:

Description

Two runs in a row on RnD saw

05:12:38.580338 run          osmo-bts-trx(pid=23979): ERR: Terminated: ERROR {rc=1}  [trial-101↪aoip_sms:trx-b200↪osmo-bts-trx↪osmo-bts-trx(pid=23979)]
05:12:38.599545 run          osmo-bts-trx(pid=23979): stderr:
| (launched: 2017-09-08_05:12:38.510355)
| 20170908051238536 DLCTRL <0017> control_if.c:788 CTRL at 127.0.0.1 4238
| 20170908051238536 DLGLOBAL <0010> telnet_interface.c:102 telnet at 127.0.0.1 4241
| 20170908051238536 DPCU <0009> pcu_sock.c:895 Could not create /home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200/mo_mt_sms.py/osmo-bts-trx/pcu_bts unix socket: Address already in use
| PCU L1 socket failed 

Investigate why it happened / whether this persists...
AFAICT it should never happen because it is a dir location created specifically for each test run.

aoip_sms:trx-b200 mo_mt_sms.py
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_manual-run-all/101/console
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_manual-run-all/100/console

History

#1 Updated by neels 12 days ago

  • Priority changed from Normal to Urgent

#2 Updated by neels 12 days ago

Interestingly enough, it didn't happen in https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_manual-run-all/99/
...I don't know what to make of it...

#3 Updated by neels 12 days ago

Ah! It seems we hit a maximum path length.
The path seems to get truncated, and instead of the path
/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200/mo_mt_sms.py/osmo-bts-trx/pcu_bts
a shorter version gets used:
/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200/

This is coincidentally exactly at the dir boundary, but I see other socket files there:

/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-sysmo [cell5000]
/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/sms:trx-sysmocell5 [000]

suggesting the limit is 107 (weird number).

My guess is that if two tests end up with the same truncated socket file, it would fail.
But in this case the truncated path is exactly the dir name of the test, and hence we find an existing dir at that location.

And that explains why it started showing up at exactly test #100: before that, 'trial-99' was one char shorter, so the truncated path ended one char inside the dir and created a socket file with a one-letter name there.

The truncation also seems to happen on prod, but with the job numbers being 4 digits, we don't hit the dir exactly.
The truncations are visible at the end, in the tar warnings:

+ tar czf /home/jenkins/workspace/osmo-gsm-tester_run/trial-2748-run.tgz run.2017-09-08_05-26-39
tar: run.2017-09-08_05-26-39/aoip_sms\:trx-b200/mo_mt_sms.: socket ignored
tar: run.2017-09-08_05-26-39/sms\:trx-b200/mo_mt_sms.py/os: socket ignored
tar: run.2017-09-08_05-26-39/aoip_sms\:trx-sysmocell5000/m: socket ignored
tar: run.2017-09-08_05-26-39/sms\:trx-sysmocell5000/mo_mt_: socket ignored

#4 Updated by neels 12 days ago

As we see in the error message, osmo-bts still has the full path.
It feeds it to libosmocore/src/socket.c osmo_sock_unix_init().
This does:

        struct sockaddr_un local;
[...]
        strncpy(local.sun_path, socket_path, sizeof(local.sun_path));
        local.sun_path[sizeof(local.sun_path) - 1] = '\0';

and whaddaya know, x86_64-linux-gnu/sys/un.h

struct sockaddr_un
  {
    __SOCKADDR_COMMON (sun_);
    char sun_path[108];   /* Path name.  */
  };

That's the 107 we see above, plus the NUL terminator.
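
The truncation can be reproduced without libosmocore. The following is only a rough Python sketch of what the strncpy() plus forced NUL terminator does to the path; the helper name truncate_like_strncpy and the constant SUN_PATH_SIZE are made up for illustration, and the 108 is taken from the glibc header quoted above (other platforms may differ).

```python
SUN_PATH_SIZE = 108  # sizeof(local.sun_path) per the sys/un.h excerpt above


def truncate_like_strncpy(socket_path: str, size: int = SUN_PATH_SIZE) -> str:
    """Mimic strncpy() into sun_path followed by forcing the last byte to NUL.

    Hypothetical helper for illustration, not part of libosmocore.
    """
    raw = socket_path.encode()
    # At most size - 1 bytes survive; the final byte is always the terminator.
    return raw[: size - 1].decode(errors="replace")


full_path = (
    "/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/"
    "run.2017-09-08_05-08-33/aoip_sms:trx-b200/mo_mt_sms.py/"
    "osmo-bts-trx/pcu_bts"
)

if __name__ == "__main__":
    print(len(full_path))
    print(truncate_like_strncpy(full_path))
```

Feeding it the path from the error message gives back the test directory with a trailing slash, which is why the bind reports "Address already in use".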

We simply cannot create sockets with path names of this size!

Instead, we could put a relative path in osmo-bts.cfg, because we know the CWD of the osmo-bts binary (just tested, it works in principle).
We can shorten the path with os.path.relpath(socket_path, osmo_bts_cwd).
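
A minimal sketch of that relpath idea, assuming the CWD of osmo-bts is the per-process run dir (the osmo-bts-trx directory in the path above). The function name socket_path_for_cfg and the SUN_PATH_MAX constant are hypothetical and not taken from osmo-gsm-tester.

```python
import os.path

SUN_PATH_MAX = 107  # 108-byte sun_path minus the NUL terminator


def socket_path_for_cfg(socket_path: str, process_cwd: str) -> str:
    """Return socket_path relative to process_cwd, refusing oversized results.

    Hypothetical helper; assumes the process is really started in process_cwd.
    """
    rel = os.path.relpath(socket_path, process_cwd)
    if len(rel.encode()) > SUN_PATH_MAX:
        raise ValueError("socket path still too long for sun_path: %r" % rel)
    return rel


if __name__ == "__main__":
    # Assumed CWD of the osmo-bts-trx process (an assumption of this sketch).
    cwd = (
        "/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/"
        "run.2017-09-08_05-08-33/aoip_sms:trx-b200/mo_mt_sms.py/osmo-bts-trx"
    )
    print(socket_path_for_cfg(cwd + "/pcu_bts", cwd))
```

Note that the relative path only resolves correctly for a process whose CWD matches, so any other process opening the same socket would need its own relative path.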

Until we do (or RnD job numbers hit 1000), that particular test will continue to fail.

#5 Updated by neels 12 days ago

  • Subject changed from RnD: failure to create pcu socket to RnD: failure to create pcu socket, because path is too long to fit in struct sockaddr_un

#6 Updated by pespin 12 days ago

Wow, good catch, I was unaware of this limitation with unix sockets.

I'm not entirely sure using relpaths for it is a good idea. I mean, it works from the point of view of osmo-bts, but what about osmo-pcu? I ask because I created a commit to share the pcu socket between them in the pespin/gprs branch: see https://git.osmocom.org/osmo-gsm-tester/commit/?h=pespin/gprs&id=80c6d83a0069f5dbb1f354b5b823f76aeabadf56 and grep for "pcu_socket_path".

We could make pcu_socket_path return an absolute path (as it is done now), and then translate that into a relpath inside the OsmoPcu and OsmoBts classes. Do you agree with doing so?

#7 Updated by laforge 12 days ago

You could simply use something like mkdtemp() to create a unique temporary directory
and then create the pcu_sock in that directory?

I don't think there's any reason to give the socket a semantic name, as it's not something
you are archiving (like a log file or an artefact).
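
A rough sketch of the mkdtemp() approach in Python, using tempfile.mkdtemp() from the standard library. The helper names, the "ogt-" prefix and the socket file name are assumptions for illustration only; the actual patch may look different.

```python
import os
import shutil
import socket
import tempfile

SUN_PATH_MAX = 107  # 108-byte sun_path minus the NUL terminator


def make_socket_dir(prefix: str = "ogt-") -> str:
    """Create a unique, short temporary directory to hold socket files."""
    return tempfile.mkdtemp(prefix=prefix)


def short_socket_path(sock_dir: str, name: str = "pcu_bts") -> str:
    """Build a socket path inside sock_dir and check it fits into sun_path."""
    path = os.path.join(sock_dir, name)
    if len(os.fsencode(path)) > SUN_PATH_MAX:
        raise ValueError("socket path too long for sun_path: %r" % path)
    return path


if __name__ == "__main__":
    sock_dir = make_socket_dir()
    try:
        path = short_socket_path(sock_dir)
        # Bind a unix socket there to show that the path is usable.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
            srv.bind(path)
            print(path)
    finally:
        # The temp dir is not an artefact, so remove it after the run.
        shutil.rmtree(sock_dir)
```

Because the directory lives under the system temp location, its length no longer depends on the jenkins workspace, job number or test name, and both processes can be handed the same absolute path.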

#8 Updated by pespin 12 days ago

  • Status changed from New to Feedback

Patch submitted using mkdtemp() in https://gerrit.osmocom.org/#/c/3894/ to solve the issue.

#9 Updated by pespin about 6 hours ago

  • Status changed from Feedback to Resolved

The patch was merged, solving the issue.
