Bug #6375
closeddefault TCP user timeout is way too low
Added by laforge 3 months ago. Updated about 1 month ago.
100%
Description
The default USERT_IMEOUT for TCP keep-alive (which is active by default) should have been:
val = 1000 * line->keepalive_num_probes * line->keepalive_probe_interval + line->keepalive_idle_timeout;
Which should have been 1000 * 10 * 3 + 30 which expands to roughly 30 seconds.
However, I guess there's two bugs in the code, looking at it:
- there should have been parenthesis around the + operator (line->keepalive_probe_interval + line->keepalive_idle_timeout) as the keepalive_idle_timeout is in seconds, not milli-seconds.
- in the default case, all those values are configured to -1 (E1INP_USE_DEFAULT). This means we're using
1000 * -1 * -1 + -1 = 999
, i.e. just below aq second which clearly is not enough for a lossy satellite back-haul.
This is actaully rather critical, a proposed untested fix is in https://gerrit.osmocom.org/c/libosmo-abis/+/36079
Files
testsuite-netem-1200ms.pcap | testsuite-netem-1200ms.pcap | 200 KB | fixeria, 03/17/2024 08:30 PM | ||
osmo-bts-netem-1200ms.pcap | osmo-bts-netem-1200ms.pcap | 483 KB | fixeria, 03/17/2024 08:30 PM | ||
osmo-bsc-netem-1200ms.pcap | osmo-bsc-netem-1200ms.pcap | 289 KB | fixeria, 03/17/2024 08:58 PM |
Checklist
- add log output for the timeout values when they are applied
- back-port + patch releases for 1.4.x and 1.5.x
- add a news entry
Related issues
Updated by laforge 3 months ago
- Checklist item add log output for the timeout values when they are applied added
- Checklist item automatic testing with non-zero packet-loss/latency/jitter added
what we learn from this:
- our automatic testing is insufficient. We are not ever testing over realistic non-terrestrial back-haul latencies. We should try running at least some of our BTS / BSC tests over an Abis link with simulated latency and non-zero packet loss
- it's not a good idea to make a feature like this the default without actually even manually testing once if the timeout actually expires after the expected 30s, instead of the current < 1s
- intermediate computed values should potentially be printed/logged
Updated by osmith 3 months ago
- % Done changed from 0 to 20
automatic testing with non-zero packet-loss/latency/jitter
I'm looking into reproducing the bug by running the ttcn3 testsuite with netem, and adding a separate jenkins job that runs the testsuite with netem. Then we can see if there are additional failures, and compare with the regular run of the testsuite.
Updated by osmith 3 months ago
- Related to Feature #4541: add ttcn3 tests for situations of Abis latency spikes, where chan act/deact ACK messages arrive late added
Updated by laforge 2 months ago
- Status changed from In Progress to Resolved
- % Done changed from 20 to 100
Applied in changeset libosmo-abis|12fae9aeebd86e242abd5c682a12fd1c9da68497.
Updated by osmith 2 months ago
- % Done changed from 30 to 40
I'm having a hard time getting e.g. osmo-bts into a state where it loses the abis link.
What I have tried:- use netem to just slow down a specific test, which should in theory result in losing the TCP connection if the delay is >= 1s and the fix is not applied to libosmo-abis
- picked TC_est_dchan from the BTS testsuite, which has some abis traffic
- however this resulted in hitting unrelated timeouts in the testsuite, so I have tried another approach: writing a separate ttcn3 test that adds a delay in the TCP connection to get into the error state. As I looked closer into this, I realized that it is probably quite some effort to do this at the TCP level with our TTCN-3 testsuite. So I went back to the netem approach.
- I have adjusted a timeout in the testsuite that caused the unrelated failure, and then the test took significantly longer and was able to pass with an artificial delay of 900 ms, 1000ms, 1500ms.
- Unfortunately the test also passes with the delay if libosmo-abis is not patched...
So looking further into reproducing it. My plan is to make the patch releases after verifying that the default timeout now works as intended.
EDIT: tc command I'm using: "tc qdisc add dev eth0 root netem delay 1500ms"
Updated by laforge about 2 months ago
can we please get the back-ports to 1.4.x and 1.5.x and related news items out of the door soon-ish?
Updated by fixeria about 2 months ago
- % Done changed from 40 to 50
laforge wrote in #note-9:
can we please get the back-ports to 1.4.x and 1.5.x and related news items out of the door soon-ish?
Below are the related patches:
v1.4.2¶
https://gerrit.osmocom.org/c/libosmo-abis/+/36286 [cosmetic] input/ipaccess.c: We use tab indent, not spaces [wtf] [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36277 input/ipaccess: Make sure to say "TCP keepalive" [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36278 Fix critical bug in default TCP keepalive user timeout [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36279 Bump version: 1.4.1.3-b8a4 → 1.4.2 [NEW]
v1.5.2¶
https://gerrit.osmocom.org/c/libosmo-abis/+/36280 [cosmetic] input/ipaccess.c: We use tab indent, not spaces [wtf] [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36281 input/ipaccess: Make sure to say "TCP keepalive" [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36282 Fix critical bug in default TCP keepalive user timeout [NEW]
https://gerrit.osmocom.org/c/libosmo-abis/+/36283 Bump version: 1.5.1.3-0eed → 1.5.2 [NEW]
Updated by fixeria about 2 months ago
- Checklist item back-port + patch releases for 1.4.x and 1.5.x set to Done
All patches have been merged, patch releases tagged.
You mentioned the news items, so here is a template for review:
A critical bug has been discovered in libosmo-abis: the calculation of the TCP keepalive user timeout value was incorrect. This issue remained unnoticed for quite some time, ever since the initial support for TCP_USER_TIMEOUT
was merged back in Jun 2019 (libosmo-abis >= 0.7.0). However, since it was not yet enabled by default at that time, only users with explicit TCP keepalive configuration in their config files would be affected. The TCP keepalive was later enabled by default in Nov 2022 (libosmo-abis >= 1.4.0), impacting all users and resulting in a significantly shorter TCP_USER_TIMEOUT
value than expected.
The problem has been fixed in master. Patch releases tagged for two latest minor versions:
Release | Tag |
---|---|
1.4.2 | https://gitea.osmocom.org/osmocom/libosmo-abis/releases/tag/1.4.2 |
1.5.2 | https://gitea.osmocom.org/osmocom/libosmo-abis/releases/tag/1.5.2 |
This issue primarily impacted production users operating IPA connections (A-bis/OML, A-bis/RSL, GSUP) over links with real-life delays, while it is unlikely to have affected the lab users.
laforge is this looking good to you?
Updated by laforge about 2 months ago
Please rephrase below:
On Sat, Mar 16, 2024 at 06:06:02AM +0000, fixeria wrote:
This issue primarily impacted production users operating IPA connections (A-bis/OML, A-bis/RSL, GSUP) over links with real-life delays, while it is unlikely to have affected the lab users.
over "links with delays with RTT approaching one second, such as overloaded/flakey wifi links or VSAT
back-haul. It is unlikely to have affected users with non-overloaded terrestrial/wired back-haul".
Apart from that: all good!
Updated by fixeria about 2 months ago
- Checklist item add a news entry added
Updated by fixeria about 2 months ago
osmith wrote in #note-7:
I'm having a hard time getting e.g. osmo-bts into a state where it loses the abis link.
Likewise, I did some experiments and yet was unable to make TTCN-3 test cases fail by introducing artificial delays.
What I have tried:
- use netem to just slow down a specific test, which should in theory result in losing the TCP connection if the delay is >= 1s and the fix is not applied to libosmo-abis
- picked TC_est_dchan from the BTS testsuite, which has some abis traffic
- however this resulted in hitting unrelated timeouts in the testsuite, so I have tried another approach: writing a separate ttcn3 test that adds a delay in the TCP connection to get into the error state. As I looked closer into this, I realized that it is probably quite some effort to do this at the TCP level with our TTCN-3 testsuite. So I went back to the netem approach.
- I have adjusted a timeout in the testsuite that caused the unrelated failure, and then the test took significantly longer and was able to pass with an artificial delay of 900 ms, 1000ms, 1500ms.
- Unfortunately the test also passes with the delay if libosmo-abis is not patched...
So looking further into reproducing it. My plan is to make the patch releases after verifying that the default timeout now works as intended.
EDIT: tc command I'm using: "tc qdisc add dev eth0 root netem delay 1500ms"
Below is what I did:
- modify the the
ttcn3-bts-test
image to include theiproute2
package, which provides the tc-netem - build the
osmo-bts-master
image with an old/unpatched version of libosmo-abis.git (v1.5.1) - enable the TCP keepalive, which is disabled by default in
osmo-bts.cfg
diff --git a/ttcn3-bts-test/generic/osmo-bts.confmerge b/ttcn3-bts-test/generic/osmo-bts.confmerge
index 9015904..17bd5ab 100644
--- a/ttcn3-bts-test/generic/osmo-bts.confmerge
+++ b/ttcn3-bts-test/generic/osmo-bts.confmerge
@@ -1,5 +1,7 @@
line vty
bind 0.0.0.0
+e1_input
+ e1_line 0 keepalive
phy 0
osmotrx ip local 172.18.9.20
osmotrx ip remote 172.18.9.21
- patch
ttcn3-bts-test/jenkins.sh
- to use custom images (note the
:netem
tag) - to execute
ttcn3-bts-test
with--cap-add=NET_ADMIN
(needed for tc-netem) - to configure an artificial delay before starting
ttcn3-bts-test
- to use custom images (note the
diff --git a/ttcn3-bts-test/jenkins.sh b/ttcn3-bts-test/jenkins.sh
index f932b4a..e374044 100755
--- a/ttcn3-bts-test/jenkins.sh
+++ b/ttcn3-bts-test/jenkins.sh
@@ -50,7 +50,7 @@ start_bts() {
-e "SLEEP_BEFORE_RESPAWN=$sleep_time_respawn" \
--name ${BUILD_TAG}-bts -d \
$DOCKER_ARGS \
- $REPO_USER/osmo-bts-$IMAGE_SUFFIX \
+ $REPO_USER/osmo-bts-$IMAGE_SUFFIX:netem \
/bin/sh -c "/usr/local/bin/respawn.sh osmo-bts-$variant -c /data/osmo-bts.gen.cfg >>/data/osmo-bts.log 2>&1"
}
@@ -117,8 +117,10 @@ start_testsuite() {
-v $VOL_BASE_DIR/bts-tester-${variant}:/data \
-v $VOL_BASE_DIR/unix:/data/unix \
--name ${BUILD_TAG}-ttcn3-bts-test \
+ --cap-add=NET_ADMIN \
$DOCKER_ARGS \
- $REPO_USER/ttcn3-bts-test
+ $REPO_USER/ttcn3-bts-test:netem \
+ /bin/sh -c "tc qdisc add dev eth0 root netem delay 1000ms; ttcn3-docker-run bts BTS_Tests;"
}
set_pcuif_version() {
With that, I am able to execute BTS_Tests.TC_est_dchan
and it's passing reliably (without any additional changes) with 500ms, 1000ms, 1500ms.
I confirmed the delay is applied correctly by running ping. The testcase execution time also increases significantly.
As a side effect, for some unknown reason, the testsuite fails to start tcpdump with those modifications:
fixeria@fixeria-dev:~/docker-playground/ttcn3-bts-test$ ls -la /tmp/tmp.IjrR8l78tv/bts-tester-generic/ total 7732 drwxr-xr-x 3 fixeria fixeria 4096 Mar 17 08:04 . drwxr-xr-x 11 fixeria fixeria 4096 Mar 17 08:04 .. -rw-r--r-- 1 fixeria fixeria 751 Mar 17 08:04 BTS_Tests.cfg -rw-r--r-- 1 root root 7882935 Mar 17 08:04 BTS_Tests.TC_est_dchan.merged -rw-r--r-- 1 root root 0 Mar 17 08:02 BTS_Tests.TC_est_dchan.netcat.stderr -rw-r--r-- 1 root root 151 Mar 17 08:02 BTS_Tests.TC_est_dchan.pcap.stderr -rw-r--r-- 1 root root 151 Mar 17 08:02 BTS_Tests.TC_est_dchan.pcap.stdout -rw-r--r-- 1 root root 0 Mar 17 08:04 BTS_Tests.TC_est_dchan.talloc -rw-r--r-- 1 root root 204 Mar 17 08:04 junit-xml-22.log drwxr-xr-x 2 root root 4096 Mar 17 08:02 unix fixeria@fixeria-dev:~/docker-playground/ttcn3-bts-test$ cat /tmp/tmp.IjrR8l78tv/bts-tester-generic/BTS_Tests.TC_est_dchan.pcap.stderr Capturing on 'any' dumpcap: The file to which the capture would be saved ("/data/BTS_Tests.TC_est_dchan.pcap") could not be opened: Permission denied. fixeria@fixeria-dev:~/docker-playground/ttcn3-bts-test$ cat /tmp/tmp.IjrR8l78tv/bts-tester-generic/BTS_Tests.TC_est_dchan.pcap.stdout Capturing on 'any' dumpcap: The file to which the capture would be saved ("/data/BTS_Tests.TC_est_dchan.pcap") could not be opened: Permission denied.
Updated by fixeria about 2 months ago
fixeria wrote in #note-14:
With that, I am able to execute
BTS_Tests.TC_est_dchan
and it's passing reliably (without any additional changes) with 500ms, 1000ms, 1500ms.
I did some experiments, running osmo-bsc and osmo-bts with TCP keepalive locally (not in Docker) and I was able to reproduce the problem: osmo-bts does terminate the A-bis link.
Updated by fixeria about 2 months ago
- File testsuite-netem-1200ms.pcap testsuite-netem-1200ms.pcap added
- File osmo-bts-netem-1200ms.pcap osmo-bts-netem-1200ms.pcap added
I was able to record a few PCAPs (by manually running tcpdump in containers), attaching them for further analysis.
One is recorded on the IUTs side, another is on the testsuite's side.
Updated by fixeria about 2 months ago
- File osmo-bsc-netem-1200ms.pcap osmo-bsc-netem-1200ms.pcap added
- % Done changed from 50 to 60
fixeria wrote in #note-15:
I did some experiments, running osmo-bsc and osmo-bts with TCP keepalive locally (not in Docker) and I was able to reproduce the problem: osmo-bts does terminate the A-bis link.
Ok, I was also able to reproduce the problem in Docker by introducing artificial delay on the A-bis/OML (not RSL) link. This was a bit tricky, because having the delay during the OML bootstrapping would slow down the RSL connection establishment a lot and cause the guard timer to fire, so this had to be done manually in the right time. Attaching another PCAP (recorded on the BSC's side) for analysis.
I think it does not affect the RSL link because, compared to the OML link, it's not idling for enough time. At least the RF RESource INDication
is being sent every so often for each TRX instance. The OML link, on the other hand, remains idle most of the time after the bootstrapping is completed.
Updated by fixeria about 2 months ago
Yesterday I gave ttcn3-bts-test a spin with simulated delay of 600ms on the A-bis/RSL interface.
fixeria@fixeria-dev:/tmp/tmp.VKcOSQ8db6/bts-tester-generic$ ls -lah | grep TC_chan_act_stress -rw-r--r-- 1 root root 870K Mar 18 21:20 BTS_Tests.TC_chan_act_stress.merged -rw-r--r-- 1 root root 0 Mar 18 19:27 BTS_Tests.TC_chan_act_stress.netcat.stderr
Judging by this output, executing generic profile of the testsuite took ~2h. Executing the hopping profile would be +2h, so ~4h in total.
According to Jenkins, right now it takes ~2h to execute both profiles, so adding a delay of 600ms would roughly double the testsuite execution time.
As was agreed during the Jitsi call today, I will experiment further with lower delay values (<100ms) and other testsuites.
Updated by fixeria about 2 months ago
- Checklist item deleted (
automatic testing with non-zero packet-loss/latency/jitter) - Status changed from In Progress to Resolved
- % Done changed from 60 to 100
I created a separate ticket #6411 for the TTCN-3 testing, this ticket can be closed.
Updated by osmith about 1 month ago
Update for OE: https://gerrit.osmocom.org/c/meta-telephony/+/36517