Bug #2507

closed

RnD: failure to create pcu socket, because path is too long to fit in struct sockaddr_un

Added by neels over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: Urgent
Assignee:
Target version: -
Start date: 09/08/2017
Due date:
% Done: 0%
Spec Reference:

Description

Two runs in a row on RnD saw

05:12:38.580338 run          osmo-bts-trx(pid=23979): ERR: Terminated: ERROR {rc=1}  [trial-101↪aoip_sms:trx-b200↪osmo-bts-trx↪osmo-bts-trx(pid=23979)]
05:12:38.599545 run          osmo-bts-trx(pid=23979): stderr:
| (launched: 2017-09-08_05:12:38.510355)
| 20170908051238536 DLCTRL <0017> control_if.c:788 CTRL at 127.0.0.1 4238
| 20170908051238536 DLGLOBAL <0010> telnet_interface.c:102 telnet at 127.0.0.1 4241
| 20170908051238536 DPCU <0009> pcu_sock.c:895 Could not create /home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200/mo_mt_sms.py/osmo-bts-trx/pcu_bts unix socket: Address already in use
| PCU L1 socket failed

Investigate why it happened / whether this persists...
AFAICT it should never happen because it is a dir location created specifically for each test run.

aoip_sms:trx-b200 mo_mt_sms.py
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_manual-run-all/101/console
https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_manual-run-all/100/console

Actions #1

Updated by neels over 6 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by neels over 6 years ago

Interestingly enough, this didn't happen in https://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_manual-run-all/99/
...I don't know what to make of it...

Actions #3

Updated by neels over 6 years ago

Ah! It seems we hit a maximum path length.
The path seems to get truncated, and instead of the path
/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200/mo_mt_sms.py/osmo-bts-trx/pcu_bts
a shorter version gets used:
/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-b200/

This is coincidentally exactly at the dir boundary, but I see other socket files there:

/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/aoip_sms:trx-sysmo [cell5000]
/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/run.2017-09-08_05-08-33/sms:trx-sysmocell5 [000]

suggesting the limit is 107 (weird number).

My guess is that if two tests end up with the same truncated socket file, the second one would fail.
But in this case the truncated path is exactly the dir name of the test, so we find an existing dir at that place.

That also explains why it started showing up at exactly run #100: before that, 'trial-99' was one char shorter, so the truncated path named a socket file inside the dir with a one-letter name.
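
The one-char difference between 'trial-99' and 'trial-101' can be checked with a short sketch. It only simulates a cut at 107 bytes; the helper name `truncated` is hypothetical, and the trial-99 path is an assumption that reuses the trial-101 timestamp (the real one would differ but has the same length).

```python
SUN_PATH_MAX = 107  # usable path bytes, as observed in the truncated names above

def truncated(path: str) -> str:
    """Simulate cutting a socket path down to the first 107 bytes."""
    return path.encode()[:SUN_PATH_MAX].decode()

# Path layout taken from the error message; only the trial name varies.
template = ("/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/{trial}/"
            "run.2017-09-08_05-08-33/aoip_sms:trx-b200/"
            "mo_mt_sms.py/osmo-bts-trx/pcu_bts")

path_101 = truncated(template.format(trial="trial-101"))
path_99 = truncated(template.format(trial="trial-99"))

print(path_101)  # ends with 'aoip_sms:trx-b200/' : the existing test dir
print(path_99)   # ends with 'aoip_sms:trx-b200/m' : a one-letter file in that dir
```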

The truncation also seems to happen on prod, but with the job numbers being 4 digits, we don't hit the dir exactly.
The truncations are visible at the end, in the tar warnings:

+ tar czf /home/jenkins/workspace/osmo-gsm-tester_run/trial-2748-run.tgz run.2017-09-08_05-26-39
tar: run.2017-09-08_05-26-39/aoip_sms\:trx-b200/mo_mt_sms.: socket ignored
tar: run.2017-09-08_05-26-39/sms\:trx-b200/mo_mt_sms.py/os: socket ignored
tar: run.2017-09-08_05-26-39/aoip_sms\:trx-sysmocell5000/m: socket ignored
tar: run.2017-09-08_05-26-39/sms\:trx-sysmocell5000/mo_mt_: socket ignored

Actions #4

Updated by neels over 6 years ago

As we see in the error message, osmo-bts still has the full path.
It feeds it to libosmocore/src/socket.c osmo_sock_unix_init().
This does:

        struct sockaddr_un local;
[...]
        strncpy(local.sun_path, socket_path, sizeof(local.sun_path));
        local.sun_path[sizeof(local.sun_path) - 1] = '\0';

and whaddaya know, x86_64-linux-gnu/sys/un.h

struct sockaddr_un
  {
    __SOCKADDR_COMMON (sun_);
    char sun_path[108];   /* Path name.  */
  };

That's the 107 we see above, plus the NUL terminator.

We simply cannot create sockets with path names of this size!
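
A pre-flight check along these lines could flag the problem before osmo-bts is launched. This is only a sketch: `fits_sun_path` is a hypothetical helper (not part of libosmocore or osmo-gsm-tester), and the 108 is the x86_64 Linux value quoted above, which may differ on other platforms.

```python
import os

SUN_PATH_SIZE = 108  # sizeof(sun_path) per the sys/un.h excerpt above

def fits_sun_path(socket_path: str) -> bool:
    """True if the path plus its NUL terminator fits into sun_path."""
    return len(os.fsencode(socket_path)) < SUN_PATH_SIZE

# The full path from the error message in the description.
full = ("/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/"
        "run.2017-09-08_05-08-33/aoip_sms:trx-b200/"
        "mo_mt_sms.py/osmo-bts-trx/pcu_bts")

print(len(full), fits_sun_path(full))
print(fits_sun_path("pcu_bts"))
```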

Instead, we could put a relative path into osmo-bts.cfg, because we know the CWD of the osmo-bts binary (just tested, it works in principle).
We can shorten the path with os.path.relpath(socket_path, osmo_bts_cwd).

Until we do (or RnD job numbers hit 1000), that particular test will continue to fail.
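
For illustration, the relpath idea would look roughly like this. It is a sketch that assumes osmo-bts is launched with its own run dir as CWD; the variable names are the ones used above, not necessarily those in osmo-gsm-tester.

```python
import os.path

# Assumption: the process run dir doubles as the CWD of the osmo-bts binary.
osmo_bts_cwd = ("/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/"
                "run.2017-09-08_05-08-33/aoip_sms:trx-b200/"
                "mo_mt_sms.py/osmo-bts-trx")
socket_path = os.path.join(osmo_bts_cwd, "pcu_bts")

# The relative form is what would be written into osmo-bts.cfg.
rel = os.path.relpath(socket_path, osmo_bts_cwd)
print(rel)  # pcu_bts
```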

Actions #5

Updated by neels over 6 years ago

  • Subject changed from RnD: failure to create pcu socket to RnD: failure to create pcu socket, because path is too long to fit in struct sockaddr_un
Actions #6

Updated by pespin over 6 years ago

Wow, good catch, I was unaware of this limitation with unix sockets.

I'm not entirely sure using relpaths for it is a good idea. I mean, it works from the point of view of osmo-bts, but what about osmo-pcu? I'm thinking about it because I created a commit to share the pcu socket between them in the pespin/gprs branch: see https://git.osmocom.org/osmo-gsm-tester/commit/?h=pespin/gprs&id=80c6d83a0069f5dbb1f354b5b823f76aeabadf56 and grep for "pcu_socket_path".

We could make pcu_socket_path return an absolute path (as it is done now), and then inside OsmoPcu and OsmoBts classes translate that into a relpath. Do you agree on doing so?
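
A rough sketch of that idea, to show why each side needs its own translation. The `osmo-pcu` run dir name and the helper `socket_path_for` are hypothetical; only `pcu_socket_path` is a name from the commit mentioned above, and its body here is an assumption.

```python
import os.path

test_dir = ("/home/jenkins/workspace/osmo-gsm-tester_manual-run-all/trial-101/"
            "run.2017-09-08_05-08-33/aoip_sms:trx-b200/mo_mt_sms.py")
bts_cwd = os.path.join(test_dir, "osmo-bts-trx")
pcu_cwd = os.path.join(test_dir, "osmo-pcu")  # hypothetical run dir name

def pcu_socket_path() -> str:
    """Shared absolute path, as returned today."""
    return os.path.join(bts_cwd, "pcu_bts")

def socket_path_for(cwd: str) -> str:
    """Translate the shared absolute path relative to one process's CWD."""
    return os.path.relpath(pcu_socket_path(), cwd)

print(socket_path_for(bts_cwd))  # short name inside the BTS run dir
print(socket_path_for(pcu_cwd))  # has to climb out of the PCU run dir first
```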

Actions #7

Updated by laforge over 6 years ago

You could simply use something like mkdtemp() to create a unique temporary directory
and then create the pcu_sock in that directory?

I don't think there's any reason to give the socket a semantic name, as it's not something
you are archiving (like a log file or an artefact).
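
In Python terms this could look like the sketch below. It assumes the system temp dir resolves to a short path (e.g. /tmp); the `pcu_` prefix and the SOCK_STREAM socket type are arbitrary choices for the demo, not taken from osmo-bts.

```python
import os
import socket
import tempfile

# Unique, short directory; its name carries no meaning and is not archived.
sock_dir = tempfile.mkdtemp(prefix="pcu_")
sock_path = os.path.join(sock_dir, "pcu_bts")

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    srv.bind(sock_path)  # succeeds because the path is far below the limit
    bound = os.path.exists(sock_path)
    print(sock_path, len(sock_path), bound)
finally:
    # Clean up the socket file and the temporary directory again.
    srv.close()
    if os.path.exists(sock_path):
        os.unlink(sock_path)
    os.rmdir(sock_dir)
```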

Actions #8

Updated by pespin over 6 years ago

  • Status changed from New to Feedback

Patch submitted using mkdtemp() in https://gerrit.osmocom.org/#/c/3894/ to solve the issue.

Actions #9

Updated by pespin over 6 years ago

  • Status changed from Feedback to Resolved

The patch was merged, solving the issue.

Actions #10

Updated by laforge over 6 years ago

  • Status changed from Resolved to Closed