
Bug #2325

sporadic crash of osmo-bts-trx in osmo-gsm-tester runs

Added by neels 2 months ago. Updated about 1 month ago.

Status: Stalled
Priority: Normal
Assignee: -
Category: osmo-bts-trx
Target version: -
Start date: 06/13/2017
Due date: -
% Done: 20%
Spec Reference: -

Description

The osmo-bts-trx process sometimes dies prematurely during osmo-gsm-tester runs using the Ettus B210.
The cause is not yet clear; no core file seems to be created.
For example, see http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/591/


Related issues

Related to OsmoGSMTester - Feature #2327: document NTP as cause for failing osmo-bts-trx runs in osmo-gsm-manuals? Rejected 06/14/2017
Related to OsmoBTS - Bug #2339: osmo-bts-trx uses non-monotonic system clock for frame number timer Closed 06/24/2017
Related to OsmoTRX - Bug #2344: OsmoTRX is not using SCHED_RR In Progress 06/29/2017

History

#1 Updated by neels 2 months ago

This happens quite frequently, apparently in more than one out of ten runs.

#2 Updated by neels 2 months ago

Instances of this are seen in the above-mentioned run 591, and again in 595, 597 and 599.
Each run has two trx runs, so roughly four out of 20 runs hit the "Process ended prematurely" for osmo-bts-trx.

#3 Updated by neels 2 months ago

  • Related to Bug #2321: osmo-gsm-tester: store properly coredump files when a process crashes added

#4 Updated by neels 2 months ago

  • Related to deleted (Bug #2321: osmo-gsm-tester: store properly coredump files when a process crashes)

#5 Updated by neels 2 months ago

The sporadic "crashes" are actually intentional shutdown by osmo-bts-trx:

20170614021016906 DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx

The end of the log shows "Shutdown timer expired"; that timer has a three-second expiry time.
About 30 log lines above that, the shutdown reason is logged.

The "No clock" is logged about two seconds after the OML has successfully set up the BTS.
Immediately following that, the OML is torn down again, concluding in

Shutdown timer expired

I am not sure how to fix this; I'll have to ask the osmo-trx guys.

#6 Updated by neels 2 months ago

  • Status changed from New to In Progress
  • Assignee set to neels

https://gerrit.osmocom.org/2909 <-- sets l1c logging level to notice

results:

20170614032014399 DRSL <0000> rsl.c:2333 (bts=0,trx=0,ts=0,ss=2) Fwd RLL msg EST_IND from LAPDm to A-bis
20170614032018533 DL1C <0006> scheduler_trx.c:1451 PC clock skew: elapsed uS 4136730
20170614032018533 DOML <0001> bts.c:208 Shutting down BTS 0, Reason No clock from osmo-trx
20170614032018533 DL1C <0006> scheduler.c:240 Exit scheduler for trx=0
20170614032018533 DL1C <0006> scheduler.c:216 Init scheduler for trx=0
20170614032018533 DOML <0001> oml.c:280 OC=RADIO-CARRIER INST=(00,00,ff) AVAIL STATE OK -> Off line
[...]
Shutdown timer expired

I asked on openbsc@: http://lists.osmocom.org/pipermail/openbsc/2017-June/010802.html

#7 Updated by neels 2 months ago

  • Related to Feature #2327: document NTP as cause for failing osmo-bts-trx runs in osmo-gsm-manuals? added

#8 Updated by neels 2 months ago

Vadim indicated NTP as a possible cause for the clock skews. I have switched off the ntp service on the main unit for now. It will come back upon reboot, so we may need to switch it off again. If this proves to be the cause of the failures, we can uninstall ntp completely.

#9 Updated by laforge about 2 months ago

  • Category set to osmo-bts-trx
FYI:
  • osmo-bts uses gettimeofday() to determine how much system time has elapsed between two FN indications from OsmoTRX (see the sketch below)
  • the maximum permitted value is 50 * 4.615 ms, i.e. 50 TDMA frames, totalling 230750 us, i.e. 230.75 ms
  • the actual skew as reported here is 4.13 seconds
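
To make the check concrete, here is a minimal sketch of such an elapsed-time check in C; all names are hypothetical, not the actual osmo-bts code, and the 50-frame limit is taken from the numbers above:

#include <sys/time.h>
#include <stdbool.h>
#include <stdint.h>

#define GSM_FRAME_US    4615                    /* one TDMA frame = 4.615 ms */
#define MAX_FN_SKEW_US  (50 * GSM_FRAME_US)     /* 230750 us = 230.75 ms */

static struct timeval last_fn_ind;

/* Call on every FN indication; returns true if the wall-clock time since
 * the previous indication exceeds 50 TDMA frames.  Note gettimeofday()
 * follows the system clock, which NTP may step (see #2339 for the
 * CLOCK_MONOTONIC follow-up). */
static bool fn_clock_skewed(void)
{
        struct timeval now;
        int64_t elapsed_us;

        gettimeofday(&now, NULL);
        elapsed_us = (int64_t)(now.tv_sec - last_fn_ind.tv_sec) * 1000000
                   + (now.tv_usec - last_fn_ind.tv_usec);
        last_fn_ind = now;

        return elapsed_us > MAX_FN_SKEW_US;
}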

I assume that the -r or --realtime argument is used when starting osmo-bts-trx; at least http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/591/ indicates so.

#10 Updated by laforge about 2 months ago

  • Related to Bug #2339: osmo-bts-trx uses non-monotonic system clock for frame number timer added

#11 Updated by laforge about 2 months ago

#12 Updated by laforge about 2 months ago

  • Assignee changed from neels to pespin
  • % Done changed from 0 to 20

I've just tried to reproduce this on my laptop (no APU board here at home), but couldn't.

I've used 'stress-ng' with various options, such as stress-ng --vm 4 --exec 4 --hdd 4 --netdev 4 --sync-file 4, to generate a combination of I/O and CPU load.

osmo-bts-trx was running with real-time priority and was completely unaffected, even at a system load > 30.

When running without real-time priority, I got the occasional

<0006> scheduler_trx.c:1587 FN timer expire_count=3: We missed 2 timers

but that's not 44 frame periods, and osmo-bts-trx continues to live.

Let's try that on an APU board next.

#13 Updated by laforge about 2 months ago

Tried to reproduce this on a PCEngines APU1D4 with Debian 8. No success.

Even when stressing the system using stress-ng --vm 4 --hdd 4 --cpu 4 --open 4 --sem 4 --sock 4 --float 2 --io 2 --timer 2 while running osmo-trx and osmo-bts-trx, I was not able to get more than the occasional

<0006> scheduler_trx.c:1718 We were 1 FN slower than TRX, compensated
<0006> scheduler_trx.c:1704 We were 1 FN faster than TRX, compensating
<0006> scheduler_trx.c:1718 We were 2 FN slower than TRX, compensated
<0006> scheduler_trx.c:1571 FN timer expire_count=2: We missed 1 timers

which is all expected. So whatever it is, it must be something specific to the osmo-gsm-tester setup. Maybe it's enabling excessive logging to a file or stderr, and that logging then blocks the process?

#14 Updated by laforge about 2 months ago

During startup, I get a couple of

<0006> scheduler_trx.c:1704 We were 39 FN faster than TRX, compensating

but that's before all initialization is done, i.e. before the system is really running.

#15 Updated by laforge about 2 months ago

Looking at http://git.osmocom.org/osmo-gsm-tester/tree/src/osmo_gsm_tester/templates/osmo-bts-trx.cfg.tmpl, it doesn't look like there is any file-based logging, nor is there a lot of logging enabled to begin with.

#16 Updated by laforge about 2 months ago

  • Related to Bug #2344: OsmoTRX is not using SCHED_RR added

#17 Updated by pespin about 2 months ago

We do redirect stderr and stdout to files, as can be seen here: http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/ws/trial-1036/last_run/sms:trx/mo_mt_sms.py/osmo-bts-trx/osmo-bts-trx/stderr/*view*/

However, the amount of data written doesn't seem particularly big. I'll nevertheless filter the output a lot more and run some tests to make sure that's not the issue.

#18 Updated by pespin about 2 months ago

Disabling logging to stderr doesn't fix the issue.

On the other hand, I found that I'm not able to reproduce the issue unless I power up a modem to connect to the BTS.

#19 Updated by pespin about 2 months ago

I have seen osmo-bts-trx stop with a failure at least once, printing that 65 timerfd events had expired before we read them. Each event is triggered roughly every 120/26 ms (one TDMA frame, about 4.615 ms), which means it took us more than 300 ms to read the FD after it was first signaled. The only explanations I can think of are that the system is really congested or that this process is starving.
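
For context on where such a count comes from: with a periodic timerfd, read() returns the number of expirations since the last read, so reading 65 means the process did not get around to the fd for about 65 frame periods. A minimal sketch of a once-per-frame timer, illustrative only and not the actual osmo-bts-trx code:

#include <sys/timerfd.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct itimerspec its = {
                .it_value    = { .tv_sec = 0, .tv_nsec = 4615000 }, /* 4.615 ms */
                .it_interval = { .tv_sec = 0, .tv_nsec = 4615000 },
        };
        int fd = timerfd_create(CLOCK_MONOTONIC, 0);
        uint64_t expired;

        timerfd_settime(fd, 0, &its, NULL);

        for (;;) {
                /* blocks until at least one expiration; returns the count */
                if (read(fd, &expired, sizeof(expired)) != sizeof(expired))
                        break;
                if (expired > 1)
                        printf("FN timer expire_count=%" PRIu64 ": missed %" PRIu64 " timers\n",
                               expired, expired - 1);
        }
        return 0;
}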

I checked the priorities on the osmo-gsm-tester main unit while running the tests:

# for p in $(ls /proc/$(pidof osmo-trx)/task/); do chrt -p $p; done
pid 4214's current scheduling policy: SCHED_OTHER
pid 4214's current scheduling priority: 0
pid 4236's current scheduling policy: SCHED_OTHER
pid 4236's current scheduling priority: 0
pid 4237's current scheduling policy: SCHED_OTHER
pid 4237's current scheduling priority: 0
pid 4257's current scheduling policy: SCHED_OTHER
pid 4257's current scheduling priority: 0
pid 4287's current scheduling policy: SCHED_OTHER
pid 4287's current scheduling priority: 0
pid 4337's current scheduling policy: SCHED_RR
pid 4337's current scheduling priority: 43
pid 4338's current scheduling policy: SCHED_RR
pid 4338's current scheduling priority: 44
pid 4339's current scheduling policy: SCHED_RR
pid 4339's current scheduling priority: 45
pid 4340's current scheduling policy: SCHED_RR
pid 4340's current scheduling priority: 42
pid 4341's current scheduling policy: SCHED_RR
pid 4341's current scheduling priority: 40

# for p in $(ls /proc/$(pidof osmo-bts-trx)/task/); do chrt -p $p; done
pid 4570's current scheduling policy: SCHED_RR
pid 4570's current scheduling priority: 1

Going by http://man7.org/linux/man-pages/man7/sched.7.html , this means the priorities of the osmo-trx threads are a lot higher than the one used by osmo-bts-trx, and I see those threads taking a lot of CPU, so it could indeed be that the osmo-bts-trx process is starving (it tries to keep up thanks to SCHED_RR, but even then not well enough).

I'll try to do some tests (see the syscall sketch below):
- Increase the osmo-bts-trx rt priority to 50 or similar.
- Pin osmo-bts-trx to one core and all threads of osmo-trx to another core.
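
For reference, the two tests boil down to these syscalls; a sketch with illustrative names, not the actual osmo-bts-trx or osmo-gsm-tester code:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* roughly what an "-r <prio>" option requests; needs root or CAP_SYS_NICE */
static int set_rr_priority(int rt_prio)
{
        struct sched_param param = { .sched_priority = rt_prio };

        if (sched_setscheduler(0, SCHED_RR, &param) != 0) {
                perror("sched_setscheduler");
                return -1;
        }
        return 0;
}

/* what "taskset -c <cpu>" does for one thread (taskset -a covers all threads) */
static int pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return -1;
        }
        return 0;
}

int main(void)
{
        if (set_rr_priority(50) == 0 && pin_to_cpu(0) == 0)
                puts("SCHED_RR prio 50, pinned to CPU 0");
        return 0;
}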

#20 Updated by pespin about 2 months ago

Setting SCHED_RR with priority 50 on osmo-bts-trx (-r 50) didn't improve the issue.
Pinning osmo-trx to one core and osmo-bts-trx to the other didn't help either (taskset -a -c $cpunum $binary).

The issue seems to occur during the initial seconds, when both osmo-bts-trx and osmo-trx are powered on and go through the configuration phase. Once the PH-RTS.ind and PH-DATA.req messages start appearing, the jitter is normal and it works nicely.

I have the impression that during that initial phase, the FN doesn't advance in osmo-trx for long periods while it does in osmo-bts-trx (we keep sending stuff). From osmo-bts-trx ("elapsed_fn= 0" means the IND CLOCK we received contained the same FN as the previous one):

20170704122819041 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us= 185576, elapsed_fn=  0, error_us=+185576
20170704122819041 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -622844 us (elapsed_fn=-50)
20170704122819041 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819042 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=    516, elapsed_fn=  0, error_us= +516
20170704122819042 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -623360 us (elapsed_fn=-50)
20170704122819042 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819043 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=   1317, elapsed_fn=  0, error_us=+1317
20170704122819043 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -624678 us (elapsed_fn=-50)
20170704122819043 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819044 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=   1329, elapsed_fn=  0, error_us=+1329
20170704122819045 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -626007 us (elapsed_fn=-50)
20170704122819045 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819045 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=    392, elapsed_fn=  0, error_us= +392
20170704122819045 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -626400 us (elapsed_fn=-50)
20170704122819045 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819045 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=    410, elapsed_fn=  0, error_us= +410
20170704122819045 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -626810 us (elapsed_fn=-50)
20170704122819045 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819046 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=    342, elapsed_fn=  0, error_us= +342
20170704122819046 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -627153 us (elapsed_fn=-50)
20170704122819046 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819046 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=    386, elapsed_fn=  0, error_us= +386
20170704122819046 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -627539 us (elapsed_fn=-50)
20170704122819046 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819047 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=    534, elapsed_fn=  0, error_us= +534
20170704122819047 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -628074 us (elapsed_fn=-50)
20170704122819047 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating
20170704122819047 DL1C <0006> scheduler_trx.c:1677 TRX Clock Ind: elapsed_us=    457, elapsed_fn=  0, error_us= +457
20170704122819047 DL1C <0006> scheduler_trx.c:1695 GSM clock jitter: -628531 us (elapsed_fn=-50)
20170704122819047 DL1C <0006> scheduler_trx.c:1704 We were 50 FN faster than TRX, compensating

#21 Updated by laforge about 2 months ago

pespin wrote:

The issue seems to occur during the initial seconds, when both osmo-bts-trx and osmo-trx are powered on and go through the configuration phase. Once the PH-RTS.ind and PH-DATA.req messages start appearing, the jitter is normal and it works nicely.

Then please establish the details of that early phase. I think:

  • osmo-trx should not start sending TIME indications via UDP before it is "stable" and able to send those TIME indications at proper intervals
  • osmo-bts-trx should not start the internal FN timer until the first TIME message has been received from osmo-trx.

I have the impression that during that initial phase, the FN doesn't advance in osmo-trx for long periods while it does in osmo-bts-trx (we keep sending stuff). From osmo-bts-trx [...]

Why are we sending bursts at all if osmo-trx is not yet ready? I think that's the problem.

#22 Updated by pespin about 2 months ago

Indeed, I was writing the following in the ticket, but you were faster and posted before I submitted it:

Enabling the info level for DTRX confirms what I was saying: the clock number in IND CLOCK is not updated until we power on the TRX with "CMD POWERON" and "RSP POWERON 0" is received.

When it fails, it does so before receiving the "RSP POWERON 0". Now the question is who's doing it wrong:
- osmo-trx: it should update the clock while powered off (though that arguably makes no sense, because it is not actually transmitting).
- osmo-bts-trx: it should not do the clock sanity check during the period between "CMD POWEROFF" and "RSP POWERON".

I'll do as you suggested. I am actually not sure whether we are sending bursts or only doing the clock sanity checks; I'll look into it.

#23 Updated by pespin about 2 months ago

I just realized that during the poweroff timeslice (between POWEROFF and RSP POWERON) we basically receive an IND CLOCK just after every "RSP *" message.

#24 Updated by pespin about 2 months ago

I pushed an osmo-trx commit to gerrit [1]; it seems to help a lot with the stability of osmo-gsm-tester runs with osmo-bts-trx.

I think it still doesn't completely fix the issue, though; some work needs to be done on the osmo-bts-trx side too. For instance, with an earlier version of the osmo-trx patch I once got something like the line below, which probably indicates something is really wrong with how we set up the timers:

FN timer expire_count=1302: We missed 1301 timers

[1] https://gerrit.osmocom.org/#/c/3120/
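
The actual change is the gerrit patch above; purely to illustrate the idea, gating clock indications on the power state could look like this (all names hypothetical, not the actual osmo-trx code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* stand-ins for the real TRXC control and clock socket writers */
static void ctrl_send(const char *msg)  { printf("ctrl <- %s\n", msg); }
static void clock_send_ind(uint32_t fn) { printf("clock <- IND CLOCK %u\n", fn); }

static bool powered_on;

static void handle_cmd_poweron(void)
{
        /* ... bring up the radio interface ... */
        powered_on = true;
        ctrl_send("RSP POWERON 0");
}

static void on_frame_tick(uint32_t fn)
{
        if (!powered_on)
                return;         /* no IND CLOCK while powered off */
        clock_send_ind(fn);
}

int main(void)
{
        on_frame_tick(100);     /* suppressed: still powered off */
        handle_cmd_poweron();
        on_frame_tick(101);     /* now forwarded */
        return 0;
}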

#25 Updated by pespin about 2 months ago

  • Status changed from In Progress to Feedback

Yesterday evening I built the osmo-trx used by osmo-gsm-tester with this patch applied, and since then (build 1146) osmo-gsm-tester with osmo-bts-trx seems a lot more stable: http://jenkins.osmocom.org/jenkins/view/osmo-gsm-tester/job/osmo-gsm-tester_run/

The issues I saw in osmo-bts-trx with the expire_count may have appeared due to a previous version of the osmo-trx patch, which sent the first clock indication too early. Let's wait for the osmo-trx patch to be reviewed/merged and see how osmo-gsm-tester evolves. If we see more sporadic failures, we'll have new state and symptoms, and I'll spend more time fixing those as a next step.

#26 Updated by laforge about 1 month ago

  • Status changed from Feedback to Stalled
