Project

General

Profile

Actions

Bug #2561

closed

osmo-mgw OBS nightly builds failing for 3 consecutive days

Added by laforge over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
-
Start date:
10/09/2017
Due date:
% Done:

100%

Spec Reference:

Description

I've been receiving OBS build failure notifications like this for about 3 days, so I think since friday:

Visit
https://build.opensuse.org/package/live_build_log/network:osmocom:nitb-split:nightly/osmo-mgw/xUbuntu_16.0
4/x86_64

Package network:osmocom:nitb-split:nightly/osmo-mgw failed to build in xUbuntu_16.04/x86_64

Check out the package for editing:
  osc checkout network:osmocom:nitb-split:nightly osmo-mgw

Last lines of build log:
[  108s]
[  108s] Makefile:693: recipe for target 'check-local' failed
[  108s] make[5]: *** [check-local] Error 1
[  108s] make[5]: Leaving directory '/usr/src/packages/BUILD/tests'
[  108s] Makefile:546: recipe for target 'check-am' failed
[  108s] make[4]: *** [check-am] Error 2
[  108s] make[4]: Leaving directory '/usr/src/packages/BUILD/tests'
[  108s] Makefile:398: recipe for target 'check-recursive' failed
[  108s] make[3]: *** [check-recursive] Error 1
[  108s] make[3]: Leaving directory '/usr/src/packages/BUILD/tests'
[  108s] Makefile:506: recipe for target 'check-recursive' failed
[  108s] make[2]: *** [check-recursive] Error 1
[  108s] make[2]: Leaving directory '/usr/src/packages/BUILD'
[  108s] Makefile:797: recipe for target 'check' failed
[  108s] make[1]: *** [check] Error 2
[  108s] make[1]: Leaving directory '/usr/src/packages/BUILD'
[  108s] dh_auto_test: make -j1 check returned exit code 2
[  108s] debian/rules:17: recipe for target 'build' failed
[  108s] make: *** [build] Error 2
[  108s] dpkg-buildpackage: error: debian/rules build gave error exit status 2
[  108s]
[  108s] lamb68 failed "build osmo-mgw_1.0.2.20171009.dsc" at Mon Oct  9 02:58:31 UTC 2017.
[  108s]
[  108s] ### VM INTERACTION START ###
[  111s] [  103.628587] reboot: Power down
[  111s] ### VM INTERACTION END ###
[  111s]
[  111s] lamb68 failed "build osmo-mgw_1.0.2.20171009.dsc" at Mon Oct  9 02:58:34 UTC 2017.
[  111s]

The builds fail consistently on all architectures, so we must have introduced some regression there. I'm assigning it to neels, as he has committed/pushed the most recent changes that co-incide with the time of failure. Please investigate (feel free to reassign in the sysmocom team, if you think somebody else is better suited to fix it), thanks!


Files

log.txt log.txt 183 KB lynxis, 10/11/2017 05:49 AM
Actions #1

Updated by neels over 6 years ago

  • Status changed from New to Feedback
  • Assignee changed from neels to lynxis

The decisive output is:

[  105s] Regression tests.
[  105s] 
[  105s]   1: legacy_mgcp                                     ok
[  105s]   2: legacy_mgcp-trans                               skipped (testsuite.at:12)
[  105s]   3: mgcp_client                                     ok
[  105s]   4: mgcp                                            FAILED (testsuite.at:27)

But I cannot reproduce the mgcp test failure.

The OBS builds (still) don't output the test logs and I am not sure how we can obtain them.
Something like the cat-testlogs.sh our jenkins builds are doing would be very helpful.
https://git.osmocom.org/osmo-ci/tree/scripts/cat-testlogs.sh

I would like to redirect to lynxis to probe whether we can get the test logs, and to dexter for an mgcp_test.c fix.

Actions #2

Updated by lynxis over 6 years ago

neels: I've extended the log similiar to libosmocore [1]. I've manual deployed this patch to OBS. [1] or attached log.txt now shows the exact problem.

dexter: can you take a look on the test output?

[1] https://gerrit.osmocom.org/#/c/4212/
[2] https://build.opensuse.org/build/network:osmocom:nitb-split:nightly/Debian_9.0/x86_64/osmo-mgw/_log

Actions #3

Updated by lynxis over 6 years ago

  • Status changed from Feedback to In Progress
Actions #4

Updated by dexter over 6 years ago

  • Assignee changed from dexter to lynxis

I am sorry, without further information such as log outputs I can not help much.

Actions #5

Updated by neels over 6 years ago

  • Assignee changed from lynxis to neels

dexter, there is a link right there with log outputs, you should have been able to investigate.

Actions #6

Updated by neels over 6 years ago

anyway, here is one again https://build.opensuse.org/package/live_build_log/network:osmocom:nitb-split:nightly/osmo-mgw/xUbuntu_16.10/x86_64
which was also posted on the chat. So it seems like you have other reasons to not look into this.

Actions #7

Updated by neels over 6 years ago

  • Assignee changed from neels to dexter

thx

Actions #8

Updated by dexter over 6 years ago

Actions #9

Updated by dexter over 6 years ago

  • Assignee changed from dexter to neels

It seems that it is a problem with the override of sendto(). I also wonder why
the legacy test does not fail since it is very similar to the mgcp test that
fails. If the following solution works out we might want to apply it there
as well.

In mgcp_test.c one can find an override of sendto:

#define MGCP_DUMMY_LOAD 0x23
static int dummy_packets = 0;
/* override and forward */
ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
        const struct sockaddr *dest_addr, socklen_t addrlen)
{
    typedef ssize_t (*sendto_t)(int, const void *, size_t, int,
            const struct sockaddr *, socklen_t);
    static sendto_t real_sendto = NULL;
    uint32_t dest_host = htonl(((struct sockaddr_in *)dest_addr)->sin_addr.s_addr);
    int      dest_port = htons(((struct sockaddr_in *)dest_addr)->sin_port);

    if (!real_sendto)
        real_sendto = dlsym(RTLD_NEXT, "sendto");

    if (len == 1 && ((const char *)buf)[0] == MGCP_DUMMY_LOAD ) {
        fprintf(stderr, "Dummy packet to 0x%08x:%d, msg length %zu\n%s\n\n",
               dest_host, dest_port,
               len, osmo_hexdump(buf, len));
        dummy_packets += 1;
    }

    return real_sendto(sockfd, buf, len, flags, dest_addr, addrlen);
}

I have problems to understand this. There is a function pointer real_sendto
that is set to NULL, only to detect a few lines later that it is really NULL
and populate it then with the symbol of the sendto (the real sendto). However,
this happens only once and real_sendto is static. Its either a lengthy process
or it must not happen twice - however.

The interesting part is that the override sendo is counting the transmitted
dummy packets.

After the counting the packets are forwared using the real_sendto. This is
why we see the packets in wireshark. Probably we want to remove this so that
no actual packets leave. At least when I comment the real_sendto out the
test still passes. So that would be a way to get rid of the packets that
OBS probably won't like.

Lets have a look at the log output:

[   80s] <0010> mgcp_network.c:136 endpoint:5 Failed to send dummy RTP packet.
[   80s] [0;m<0010> mgcp_protocol.c:747 CRCX: endpoint:5 connection successfully created
[   80s] [0;m<0010> mgcp_protocol.c:522 CRCX: creating new connection ...
[   80s] [0;m<0010> mgcp_sdp.c:300 Got media info via SDP: port 5904, payload 97 (GSM-EFR), duration 40, addr 123.12.12.123
[   80s] [0;mPolicy CB got state 1 on endpoint 1
[   80s] Dummy packet to 0x7b0c0c7b:5904, msg length 1
[   80s] 23 
[   80s] 
[   80s] <0010> mgcp_network.c:136 endpoint:1 Failed to send dummy RTP packet.
[   80s] [0;m<0010> mgcp_protocol.c:747 CRCX: endpoint:1 connection successfully created
[   80s] [0;m<0010> mgcp_osmux.c:682 All Osmux circuits are in use!
[   80s] [0;m--- expout    2017-10-13 13:15:05.388398199 +0000
[   80s] +++ /usr/src/packages/BUILD/tests/testsuite.dir/at-groups/4/stdout    2017-10-13 13:15:05.436398197 +0000
[   80s] @@ -16,12 +16,12 @@
[   80s]  Testing MDCX1
[   80s]  Testing MDCX2
[   80s]  Testing CRCX
[   80s] -Dummy packets: 2
[   80s] +Dummy packets: 1

We can see that the sendto() override is executed. Then we spot a message
from mgcp_network.c. This could mean that the dummy sendto returned with
an error code. Probably because the real_sendto returned an eror.

The dummy packet transmission terminates on this function:

int mgcp_send_dummy(struct mgcp_endpoint *endp, struct mgcp_conn_rtp *conn)
{
    static char buf[] = { MGCP_DUMMY_LOAD };
    int rc;
    int was_rtcp = 0;

    OSMO_ASSERT(endp);
    OSMO_ASSERT(conn);

    LOGP(DLMGCP, LOGL_DEBUG,
         "endpoint:%x sending dummy packet...\n", ENDPOINT_NUMBER(endp));
    LOGP(DLMGCP, LOGL_DEBUG, "endpoint:%x conn:%s\n",
         ENDPOINT_NUMBER(endp), mgcp_conn_dump(conn->conn));

    rc = mgcp_udp_send(conn->end.rtp.fd, &conn->end.addr,
               conn->end.rtp_port, buf, 1);

    if (rc == -1)
        goto failed;

    if (endp->tcfg->omit_rtcp)
        return rc;

    was_rtcp = 1;
    rc = mgcp_udp_send(conn->end.rtcp.fd, &conn->end.addr,
               conn->end.rtcp_port, buf, 1);

    if (rc >= 0)
        return rc;

failed:
    LOGP(DLMGCP, LOGL_ERROR,
         "endpoint:%x Failed to send dummy %s packet.\n",
         ENDPOINT_NUMBER(endp), was_rtcp ? "RTCP" : "RTP");

    return -1;
}

Here we can see that the mgcp_udp_send() should normally execute twice. This
matches the observed behaviour. We see the output from the sendto() override,
wich executes ok until the point where it outputs. We also count the
dummy_packets one up.

Probably the real_sendto at the end of the sendto() override fails. This
eventually leads into a bad return code of mgcp_udp_send() and we will send
to failed: This is why we see the error message immediately after. By this
we omit the second mgcp_udp_send(), which would normally give us the second
dummy dummy packet.

Since the test still passes when the real_sendto is removed we possibly may
try this. A patch is available for review: https://gerrit.osmocom.org/4260

Actions #10

Updated by neels over 6 years ago

  • Assignee changed from neels to dexter

dexter wrote:

I have problems to understand this. There is a function pointer real_sendto
that is set to NULL, only to detect a few lines later that it is really NULL
and populate it then with the symbol of the sendto (the real sendto). However,
this happens only once and real_sendto is static. Its either a lengthy process
or it must not happen twice - however.

The code doesn't want to load the symbol using the dynamic linker more than once.
What I, too, fail to understand is why the test goes to all this trouble when
removing this doesn't seem to affect the test in the slightest.

The interesting part is that the override sendo is counting the transmitted
dummy packets. [...] we see the error message [...] would normally give us the second
dummy dummy packet.

Indeed, this should be the actual cause.
Let's try to clarify why the test was even sending out packets in the first place.
Ask zecke??

Actions #11

Updated by dexter over 6 years ago

I will drop the dummy packets completely as demanded in #2574. I wrote Holger an Email since we potentially have the same problem in legacy_mgcp as well.

Actions #12

Updated by zecke over 6 years ago

neels wrote:

The code doesn't want to load the symbol using the dynamic linker more than once.
What I, too, fail to understand is why the test goes to all this trouble when
removing this doesn't seem to affect the test in the slightest.

We want to observe the side-effect (and we don't want to have a "network interface" indirection in the code).

Actions #13

Updated by dexter over 6 years ago

  • Status changed from In Progress to Feedback
  • Assignee changed from dexter to neels

neels Is the problem still present?

Actions #14

Updated by laforge over 6 years ago

Philipp,

On Mon, Oct 23, 2017 at 09:07:42PM +0000, dexter [REDMINE] wrote:

neels Is the problem still present?

would you please simply look at the OBS status/output yourself rathern
than assigning this rather mundane task to a colleague?

Actions #15

Updated by neels over 6 years ago

  • Assignee changed from neels to zecke
  • Priority changed from Urgent to Low
  • % Done changed from 0 to 100

dexter wrote:

neels Is the problem still present?

Patch is merged, and osmo-mgw builds successfully, as can be seen at https://build.opensuse.org/project/show/network:osmocom:nitb-split:nightly

Do we have a conclusion yet on whether we should re-add sending of packets in the test?
Personally, I would keep them disabled, and refer to Harald's TTCN3 setup to test for network operation related leaks and side effects.
zecke, would that be ok from your viewpoint?

Actions #16

Updated by neels over 6 years ago

  • % Done changed from 100 to 90
Actions #17

Updated by laforge about 6 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from zecke to neels
  • % Done changed from 90 to 100

no feedback about re-enabling the sendign in the unit tests, let's close this.

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)