Bug #4312
open
GSUP keepalives / connection loss detection
0%
Description
With an unreliable back-haul mesh between villages, the GSUP
connection cannot be considered reliable either. We should expect to see
TCP stalls caused by packet loss and similar effects.
Have you considered this in your implementation and/or done any testing
on simulated lossy networks to ensure that we properly use either TCP
keepalives or IPA application-level PING/PONG to detect lost connections
and recover from such situations (by closing the old connection and
establishing a new one)?
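For the TCP option, a minimal sketch of enabling keepalive on a socket is shown below. It is written in Python for brevity and is not taken from the implementation under discussion; the function name `enable_tcp_keepalive` and the timer values are assumptions chosen for illustration. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are Linux-specific, so the sketch only sets them when the platform exposes them.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Enable TCP keepalive on `sock` (hypothetical helper, illustrative values).

    Returns the approximate time in seconds after which an idle, dead
    connection is reported as broken: idle time plus one interval per probe.
    """
    # Turn keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timers are Linux-specific; skip them where unavailable,
    # in which case the system-wide defaults apply instead.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    return idle_s + interval_s * probes
```

Note that keepalive probes are only sent on an otherwise idle connection. While unacknowledged data is in flight, detection is governed by TCP retransmission timeouts instead, which is one argument for the application-level PING/PONG alternative.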
Unreliable networks can easily be simulated with the Linux built-in
'tc netem' queueing discipline, which provides configurable packet loss,
latency and jitter.
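As a rough illustration, a test harness could build the netem commands like this. The helper names `netem_add_cmd` and `netem_del_cmd`, the interface name and the impairment values are all assumptions. The helpers only construct the argument lists; actually running them (e.g. via `subprocess.run`) requires root privileges.

```python
def netem_add_cmd(dev, loss_pct=0, delay_ms=0, jitter_ms=0):
    """Build a 'tc qdisc add ... netem' command line as an argv list.

    Hypothetical helper: `dev` is the network interface to impair.
    """
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]
        # netem takes the jitter as an optional second value after the delay.
        if jitter_ms:
            cmd.append(f"{jitter_ms}ms")
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]
    return cmd


def netem_del_cmd(dev):
    """Build the command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]
```

For example, `netem_add_cmd("eth0", loss_pct=10, delay_ms=200, jitter_ms=50)` yields the equivalent of `tc qdisc add dev eth0 root netem delay 200ms 50ms loss 10%`.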
I also saw some comments / code related to "if a second connection using
the same IPA ID arrives, we're screwed" (paraphrasing here). I would
expect this not to be uncommon, even if every MSC/HLR out there is
configured correctly, precisely because e.g. the remote MSC/HLR has already
decided that the TCP/GSUP connection is dead and starts to reconnect after
performing a local-end release, while the "local" MSC/HLR still thinks the
old connection is alive. If the old connection "wins" (i.e. is preferred),
I see potential trouble here.
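To make the concern concrete, here is a sketch of the opposite policy, where the newest connection for a given IPA ID wins and the stale one is closed. This is only one possible approach and not a description of what the current code does; `PeerRegistry` and its methods are hypothetical names, and a connection is assumed to be any object with a `close()` method.

```python
class PeerRegistry:
    """Maps an IPA ID to its current connection (hypothetical sketch)."""

    def __init__(self):
        self._by_id = {}

    def register(self, ipa_id, conn):
        """Register `conn` for `ipa_id`, closing any older connection.

        Returns the replaced connection, or None if there was none.
        """
        old = self._by_id.get(ipa_id)
        if old is not None and old is not conn:
            # The peer has reconnected, so the old connection is presumed dead.
            old.close()
        self._by_id[ipa_id] = conn
        return old

    def unregister(self, ipa_id, conn):
        """Remove `conn` only if it is still the current one for `ipa_id`.

        This keeps a late teardown of the stale connection from evicting
        the connection that replaced it.
        """
        if self._by_id.get(ipa_id) is conn:
            del self._by_id[ipa_id]

    def lookup(self, ipa_id):
        return self._by_id.get(ipa_id)
```

The identity check in `unregister` matters for exactly the race described above: the close event for the old connection may be processed after the new connection has already been registered.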
Situations like that probably warrant some carefully designed tests that
reproduce exactly those conditions.
Goals:
a) ensuring that keepalive on either TCP or IPA is enabled and works, and
b) creating situations where the same peer establishes a second, new connection
while the old one has not yet been torn down (timeout not expired yet, FIN packets
lost, ...)
(Keeping as one issue because these aspects are tightly related...)