Project

General

Profile

Actions

Bug #3273

closed

osmo-bts-octphy failing to start in prod setup

Added by pespin almost 6 years ago. Updated over 5 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Target version:
-
Start date:
05/16/2018
Due date:
% Done:

50%

Spec Reference:

Description

Since a few days ago, the osmo-bts-octphy tests fail during startup of the OCTBTS:

(launched: 2018-05-16_13:19:33.210299)
<0006> l1_if.c:770 model_init()
[0;m20180516131933306 [1;33mDLCTRL[0;m <0017> control_if.c:866 CTRL at 127.0.0.1 4238
[0;m20180516131933306 [1;33mDLGLOBAL[0;m <0010> telnet_interface.c:104 telnet at 127.0.0.1 4241
[0;m20180516131933307 [1;33mDLINP[0;m <0012> input/ipaccess.c:891 enabling ipaccess BTS mode, OML connecting to 10.42.42.7:3002
[0;m20180516131933307 [1;34mDABIS[0;m <000d> abis.c:229 Input Signal 4 received
[0;m20180516131933307 [1;32mDL1C[0;m <0006> phy_link.c:58 PHY link state change shutdown -> connecting
[0;m20180516131933307 [1;33mDL1C[0;m <0006> l1_if.c:1740 Opening L1 interface for OctPHY (enp2s0)
[0;m20180516131933321 [1;33mDLINP[0;m <0012> input/ipa.c:131 10.42.42.7:3002 connection done
[0;m20180516131933321 [1;33mDLINP[0;m <0012> input/ipaccess.c:712 received ID get from 8/0/0
[0;m20180516131933322 [1;32mDABIS[0;m <000d> abis.c:107 OML Signalling link up
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=SITE-MANAGER INST=(ff,ff,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=BTS INST=(00,ff,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=GPRS-NSE INST=(00,ff,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=GPRS-CELL INST=(00,ff,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=GPRS-NSVC INST=(00,00,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=GPRS-NSVC INST=(00,01,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=RADIO-CARRIER INST=(00,00,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,00) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,01) Tx STATE CHG REP
[0;m[1;36m20180516131933322 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,02) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,03) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,04) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,05) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,06) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=RADIO-CARRIER INST=(00,00,ff) AVAIL STATE Power off -> OK
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=RADIO-CARRIER INST=(00,00,ff) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=RADIO-CARRIER INST=(00,00,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:467 OC=RADIO-CARRIER INST=(00,00,ff) Tx SW ACT REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) AVAIL STATE Power off -> OK
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:467 OC=BASEBAND-TRANSCEIVER INST=(00,00,ff) Tx SW ACT REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,00) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,00) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,00) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,01) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,01) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,01) Tx STATE CHG REP
[0;m[1;36m20180516131933323 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,02) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,02) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,02) Tx STATE CHG REP
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,03) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,03) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,03) Tx STATE CHG REP
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,04) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,04) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,04) Tx STATE CHG REP
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,05) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,05) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,05) Tx STATE CHG REP
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,06) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,06) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,06) Tx STATE CHG REP
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:348 OC=CHANNEL INST=(00,00,07) AVAIL STATE Power off -> Dependency
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=CHANNEL INST=(00,00,07) OPER STATE NULL -> Disabled
[0;m[1;36m20180516131933324 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=CHANNEL INST=(00,00,07) Tx STATE CHG REP
[0;m[1;36m20180516131933325 [1;34mDOML[0;m[1;36m <0001> oml.c:548 OC=BTS(01) INST=(ff,ff,ff) [0;m[1;36mRx GET ATTR
[0;m[1;36m20180516131933325 [1;32mDOML[0;m[1;36m <0001> oml.c:277 BTS Tx Get Attribute Response
[0;m[1;36m20180516131933380 [1;34mDOML[0;m[1;36m <0001> oml.c:548 OC=BASEBAND-TRANSCEIVER(04) INST=(00,00,ff) [0;m[1;36mRx GET ATTR
[0;m[1;36m20180516131933380 [1;32mDOML[0;m[1;36m <0001> oml.c:277 BASEBAND-TRANSCEIVER Tx Get Attribute Response
[0;m[1;36m20180516131933380 [1;31mDOML[0;m[1;36m <0001> oml.c:229 O&M Get Attributes [0], Manufacturer Dependent State is unsupported by TRX.
[0;m[1;36m20180516131933382 [1;34mDOML[0;m[1;36m <0001> oml.c:1001 OC=SITE-MANAGER(00) INST=(ff,ff,ff) [0;m[1;36mRx OPSTART
[0;m[1;36m20180516131933382 [1;34mDOML[0;m[1;36m <0001> oml.c:1012 ... automatic ACK, OP state already was Enabled
[0;m[1;36m20180516131933383 [1;34mDOML[0;m[1;36m <0001> oml.c:580 OC=BTS(01) INST=(00,ff,ff) [0;m[1;36mRx SET BTS ATTR
[0;m[1;36m20180516131933383 [1;33mDOML[0;m[1;36m <0001> oml.c:681 Ignoring T200[0] (150 ms) as sent by BSC due to suspected LAPDm bug!
[0;m[1;36m20180516131933383 [1;33mDOML[0;m[1;36m <0001> oml.c:681 Ignoring T200[1] (180 ms) as sent by BSC due to suspected LAPDm bug!
[0;m[1;36m20180516131933383 [1;33mDOML[0;m[1;36m <0001> oml.c:681 Ignoring T200[2] (180 ms) as sent by BSC due to suspected LAPDm bug!
[0;m[1;36m20180516131933383 [1;33mDOML[0;m[1;36m <0001> oml.c:681 Ignoring T200[3] (1680 ms) as sent by BSC due to suspected LAPDm bug!
[0;m[1;36m20180516131933383 [1;33mDOML[0;m[1;36m <0001> oml.c:681 Ignoring T200[4] (520 ms) as sent by BSC due to suspected LAPDm bug!
[0;m[1;36m20180516131933384 [1;33mDOML[0;m[1;36m <0001> oml.c:681 Ignoring T200[5] (165 ms) as sent by BSC due to suspected LAPDm bug!
[0;m[1;36m20180516131933384 [1;33mDOML[0;m[1;36m <0001> oml.c:681 Ignoring T200[6] (1680 ms) as sent by BSC due to suspected LAPDm bug!
[0;m[1;36m20180516131933384 [1;34mDOML[0;m[1;36m <0001> oml.c:451 Sending FOM ACK.
[0;m[1;36m20180516131933385 [1;34mDOML[0;m[1;36m <0001> oml.c:1029 OC=BTS(01) INST=(00,ff,ff) [0;m[1;36mRx CHG ADM STATE
[0;m[1;36m20180516131933385 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=BTS INST=(00,ff,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933386 [1;34mDOML[0;m[1;36m <0001> oml.c:1001 OC=BTS(01) INST=(00,ff,ff) [0;m[1;36mRx OPSTART
[0;m[1;36m20180516131933386 [1;32mDOML[0;m[1;36m <0001> oml.c:355 OC=BTS INST=(00,ff,ff) OPER STATE NULL -> Enabled
[0;m[1;36m20180516131933386 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=BTS INST=(00,ff,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933430 [1;34mDOML[0;m[1;36m <0001> oml.c:1001 OC=RADIO-CARRIER(02) INST=(00,00,ff) [0;m[1;36mRx OPSTART
[0;m[1;36m20180516131933432 [1;34mDOML[0;m[1;36m <0001> oml.c:750 OC=RADIO-CARRIER(02) INST=(00,00,ff) [0;m[1;36mRx SET RADIO CARRIER ATTR
[0;m[1;36m20180516131933432 [1;32mDOML[0;m[1;36m <0001> oml.c:782 Set RF Max Power Reduction = 0 dBm
[0;m20180516131933432 [1;32mDL1C[0;m <0006> tx_power.c:248 power_ramp_start(cur=0, tgt=0)
[0;m[1;36m20180516131933432 [1;34mDOML[0;m[1;36m <0001> oml.c:451 Sending FOM ACK.
[0;m[1;36m20180516131933433 [1;34mDOML[0;m[1;36m <0001> oml.c:1029 OC=RADIO-CARRIER(02) INST=(00,00,ff) [0;m[1;36mRx CHG ADM STATE
[0;m[1;36m20180516131933434 [1;33mDOML[0;m[1;36m <0001> oml.c:1055 ADM state already was Unlocked
[0;m[1;36m20180516131933434 [1;32mDOML[0;m[1;36m <0001> oml.c:313 OC=RADIO-CARRIER INST=(00,00,ff) Tx STATE CHG REP
[0;m[1;36m20180516131933435 [1;34mDOML[0;m[1;36m <0001> oml.c:1001 OC=RADIO-CARRIER(02) INST=(00,00,ff) [0;m[1;36mRx OPSTART
[0;m20180516131933435 [1;32mDL1C[0;m <0006> l1_oml.c:1228 Tx APP-INFO.req
[0;m20180516131933435 [1;32mDL1C[0;m <0006> l1_oml.c:1159 Tx APP-INFO-SYSTEM.req
[0;m20180516131933436 [1;32mDL1C[0;m <0006> l1_oml.c:1364 Tx TRX-OPEN.req(trx=0, rf_port=0, arfcn=868, center=868, tsc=7, rx_gain=70, tx_atten=0)
[0;m20180516131934384 [1;31mDLINP[0;m <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
[0;m20180516131934433 [1;32mDL1C[0;m <0006> tx_power.c:181 ramping TRX board output power to 0 mdBm.
[0;m20180516131935385 [1;31mDLINP[0;m <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
[0;m20180516131936385 [1;31mDLINP[0;m <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
[0;m20180516131937386 [1;31mDLINP[0;m <0012> e1_input.c:235 abis_sendmsg: msg->dst == NULL: 0c 12 01 90 0f ff ff 
[0;m20180516131938321 [1;31mDL1C[0;m <0006> l1_if.c:203 Timeout waiting for L1 primitive TRX-CLOSE-ALL

Attach a run.tar.gz from osmo-gsm-tester showing the issue.


Files

trial-1304-run.tgz trial-1304-run.tgz 12.1 MB pespin, 05/16/2018 02:28 PM
Actions #1

Updated by dexter almost 6 years ago

  • Status changed from New to Feedback
  • Assignee changed from dexter to pespin
  • % Done changed from 0 to 50

I have now checked back on this problem. The DSP (both!) were crashed, so that they became unresponsive. A powercycle helped to get it working again.

In order to test I temporarily connected the LTHW Octbts to my test laptop. I got the exact same results as shown in the log above. So it is definitely nothing wrong with the tester.

I fear that this problem will come back from time to time. Both DSPs have their reset lines connected to a GPIO of the Phytec board. Those reset lines can be controlled via dsp_rst.sh, which comes pre-installed on the phytec board. The reset procedure would look like:

dsp_rst.sh on p
dsp_rst.sh off p
dsp_rst.sh on s
dsp_rst.sh off s

Where p = Primary DSP, s = Secondary DSP, on = Reset is asserted, off = Reset is de-asserted

My suggestion is to integrate the reset procedure in the testing process. Maybe we only should reset only when necessary. We have discussed earlier if it makes sense to reset the DSPs before each test, but we came to the conclusion that this is not wise as this is not a normal operating condition. The dsp firmware should survive multiple restarts of openbts.

Actions #2

Updated by pespin almost 6 years ago

Ok Thanks, let's not add the reset by default before each tests unless we see it's really necessary. Good to have it documented here, so we can come back to it if it happens again. I'll close the ticket once we have next test run passing with octphy.

Actions #3

Updated by dexter almost 6 years ago

I have now inspected the log files in the jenkins build artefacts (run.2018-06-25_10-03-06). The current problem seems not to be octphy related. osmo-bts-octphy fails to start because it has problems to bind the telnet/ctrl interface. I have looked into all octphy related tests. The error is the same everywhere. I do not know the environment well enough, but maybe there were some changes or there is still another instance running when osmo-bsc-octphy starts.

(launched: 2018-06-25_11:24:21.828663)
<0006> l1_if.c:770 model_init()
[0;m20180625112421929 [1;31mDLGLOBAL[0;m <0010> socket.c:360 unable to bind socket:10.42.42.52:4238: Cannot assign requested address
[0;m20180625112421930 [1;31mDLGLOBAL[0;m <0010> socket.c:371 no suitable addr found for: 10.42.42.52:4238
[0;m20180625112421930 [1;31mDLGLOBAL[0;m <0010> socket.c:360 unable to bind socket:10.42.42.52:4241: Cannot assign requested address
[0;m20180625112421930 [1;31mDLGLOBAL[0;m <0010> socket.c:371 no suitable addr found for: 10.42.42.52:4241
[0;m20180625112421930 [1;31mDLGLOBAL[0;m <0010> telnet_interface.c:100 Cannot bind telnet at 10.42.42.52 4241
[0;mError initializing telnet
Actions #4

Updated by pespin almost 6 years ago

  • Assignee changed from pespin to dexter

Previous issues fixed.

Current issue:

[0;m[1;36m20180627051913862 [1;34mDOML[0;m[1;36m <0001> oml.c:1001 OC=RADIO-CARRIER(02) INST=(00,00,ff) [0;m[1;36mRx OPSTART
[0;m20180627051913862 [1;32mDL1C[0;m <0006> l1_oml.c:1228 Tx APP-INFO.req
[0;m20180627051913862 [1;32mDL1C[0;m <0006> l1_oml.c:1159 Tx APP-INFO-SYSTEM.req
[0;m20180627051913862 [1;32mDL1C[0;m <0006> l1_oml.c:1364 Tx TRX-OPEN.req(trx=0, rf_port=0, arfcn=868, center=868, tsc=7, rx_gain=70, tx_atten=0)
[0;m20180627051913863 [1;32mDL1C[0;m <0006> l1_oml.c:1180 Rx APP-INFO.resp (name='octsdr_gsm', desc='Software Define Radio', ver='02.11.00-B1927', ver_hdr='02.11.00-B1927')
[0;m20180627051913864 [1;32mDL1C[0;m <0006> l1_oml.c:1128 Rx APP-INFO-SYSTEM.resp (platform='Opus2', version='OCTSYS_VERSION=01.02.26.B1;OCTODK_VERSION=01.17.05-B7;OCTADF_VERSION=04.10.01-B3387;')
[0;m20180627051913864 [1;32mDL1C[0;m <0006> l1_oml.c:1131 Note: compiled with multi-trx support.
[0;m20180627051913864 [1;32mDL1C[0;m <0006> l1_oml.c:1291 TRX-OPEN.resp(trx=0) = cOCTVC1_GSM_RC_TRX_RADIO_CONFIG
[0;m20180627051913865 [1;31mDL1C[0;m <0006> l1_oml.c:1296 TRX-OPEN failed: cOCTVC1_GSM_RC_TRX_RADIO_CONFIG
Actions #5

Updated by dexter almost 6 years ago

  • Assignee changed from dexter to pespin

I have tested the octbts locally again. I got the same error. A powercycle helped to resolve the problem. The octbts is now up and running again.

(I just wonder if there is a license option that limits the runtime...)

Actions #6

Updated by pespin over 5 years ago

  • Assignee changed from pespin to dexter

It seems the Firmware crashed again. It basically crashes every time we run a round of tests in one jenkins job. We should investigate how/when the firmware crashes (if it's our fault) and contact OCTOBTS support to get it fixed.

Actions #7

Updated by dexter over 5 years ago

Unfortunately I do not have a response from Octasic yet.

I think I will leave test board I have on my desktop on over night to see if the problem is even connected to any outer factors.

Is there a way to tell which test is causing the fault. Does it hang always at the same test or is it always different?

Actions #8

Updated by pespin over 5 years ago

The way to check if a specific test is triggering the issue is to restart the firmware (having it in a working state), then running the hourly set of tests in osmo-gsm-tester, then check the logs to see the first test failing, and repeating the entire procedure some times. If it always fails in the same test, it means probably either that one or the previous one is crashing the firmware. Then manually reproducing same test can help discover more information. However, it could be that we are unable to manually trigger the issue due to timing constrains (ie. osmo-gsm-tester triggering events quickly automatically as a response to other events).

Actions #9

Updated by dexter over 5 years ago

  • Status changed from Feedback to Stalled
Actions #10

Updated by dexter over 5 years ago

  • Status changed from Stalled to Resolved
Actions #11

Updated by dexter over 5 years ago

  • Status changed from Resolved to Rejected
Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)