On Tue, Feb 21, 2023 at 03:39:24AM +0000, neels wrote:
This is a question about SCCP concepts for link loss detection.
If I want to be strict to my understanding of SS7, what you are asking doesn't exist.
A signaling link is 20 miles below SCCP in the protocol stack. SCCP should never be concerned with that. A link is part of a linkset. From the SS7 PoV, a linkset should already contain multiple links, and there can and should be multiple routes (via multiple linksets) to reach any given destination point code.
In the SIGTRAN world, each of those SS7 links is mapped to an SCTP association, which in turn can and should have multiple IP addresses on each side. And below that you have one (or multiple!) IP networks, each of which should have dynamic routing, with multiple routes. And yet below that you can have redundant Ethernet links.
So the deep stack provides ample opportunity at every layer for telecom companies, in their traditional "you can never have too much reliability" attitude, to end up in situations where the loss of an SS7 link does not matter to the applications.
From a layering point of view, SCCP doesn't even know (nor should it know) what a signaling link, SCTP or (god forbid) IP is. All it is aware of is point codes and global titles, and whether or not those are reachable right now.
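To make the layering argument a bit more concrete, here is a minimal sketch of how link state collapses into the single "reachable or not" fact that SCCP gets to see. It is a toy model in Python, not libosmo-sigtran code; `Linkset`, `RoutingTable` and `pc_reachable` are names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Linkset:
    """A group of parallel signaling links towards one adjacent node."""
    name: str
    links_up: Dict[str, bool] = field(default_factory=dict)

    def available(self) -> bool:
        # A linkset stays usable as long as at least one of its links is up.
        return any(self.links_up.values())


@dataclass
class RoutingTable:
    """Hypothetical MTP-level view: point code -> linksets that can reach it."""
    routes: Dict[int, List[Linkset]] = field(default_factory=dict)

    def pc_reachable(self, pc: int) -> bool:
        # This single boolean per point code is all SCCP is aware of.
        return any(ls.available() for ls in self.routes.get(pc, []))
```

In this model a point code with two linksets only becomes unreachable once every link in both linksets is down; losing any individual link changes nothing that SCCP could observe.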
I immediately see lots of low level DLINP, DLSS7 and some DLSCCP logging showing that an SCTP SHUTDOWN event was processed, and that the XUA AS restarts and tries to reconnect. But none of this makes its way up into osmo-hnbgw.
I think it's a question whether it should. Whether a non-STP entity in the SS7 network should generate SCCP USER SAP primitives that normally are originated by (in my understanding) STP/SCPs.
Furthermore, SCCP state is not reset on any loss of a single (or even all) underlying SS7 links. They could recover some milliseconds or seconds later, and both sides would continue as if nothing happened. So closing SCCP connections just because an underlying signalling link has potentially temporarily disappeared could possibly lead to error amplification.
Look at an analogy: Do all your TCP connections close just because you disconnect your Ethernet link? No. Do all your TCP sockets get notified of the Ethernet link loss? No. Why? Because TCP runs on top of IP, and TCP has no notion of what happens below IP. All that TCP would note is that at some point its own timeouts are triggering, if neither IP nor underlying transport layers are recovering in time. Note that TCP timeouts / keepalives are entirely optional, and it is valid for an open TCP connection to stay that way indefinitely, until/unless either side starts to transmit something, which will eventually lead to timeouts if ACKs for that are never received.
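As a small illustration of the "keepalives are optional" point: on a freshly created TCP socket the keepalive option is off, and the application has to opt in explicitly. A minimal Python sketch follows; the helper name `enable_keepalive` and the 60-second idle default are arbitrary choices for illustration, and `TCP_KEEPIDLE` is platform-specific, so it is only set where the platform exposes it.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60) -> None:
    """Opt a TCP socket in to keepalive probes (they are off by default)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE (idle time before the first probe) exists on Linux;
    # skip it on platforms that do not define it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
```

Without such an opt-in, an idle connection whose path has silently died is only noticed once one side tries to send data and its retransmissions time out.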
After about 15 minutes(!), I receive sccp_sap_up(N-DISCONNECT.indication) on the SCCP connection.
Yes, that's the normal SCCP level timeout. Which can of course (from the SS7 point of view) be reconfigured by any user as they please.
- Is it expected to take this long, given that an SCTP SHUTDOWN is detected in libosmo-sigtran immediately?
I would presume the default of 15 minutes is taken directly out of the SCCP specs. It's only a bug if our default is outside what Q.7xx specifies.
My idea was to trigger a LINK_LOST event on the SCCP link to the CN, i.e. a signal that the entire SCCP layer is gone. Does that exist, conceptually?
No. SCCP should not care about a given single SS7 link, see above. If at all, conceptually one could think of a notification when the last of all potential SS7 links that could route to a given destination point code is lost, which is what we can indicate using N-PCSTATE indications.
In contrast, when the RUA side of osmo-hnbgw sees an SCTP SHUTDOWN, osmo-hnbgw immediately registers that all HNB are disconnected, by means of the read cb() passed to osmo_stream_srv_create().
You're comparing RUA, a protocol running directly over SCTP, with an application that runs over a complex, multi-layered SS7 signaling network. That is like comparing sending raw Ethernet frames with using a TCP socket. Of course they will behave differently.
Also, an SCTP SHUTDOWN doesn't happen at the time an (Ethernet, or whatever physical) link is lost. That shutdown would only happen once SCTP itself has either been disconnected, or determined (via its own internal timeouts) that the association is dead.
Here I immediately see an N-PCSTATE.indication containing a DUNA (Destination Unavailable) coming up the SCCP user SAP, one each for the MSC and the SGSN point-code.
This reflects my understanding of SS7.
osmo-hnbgw ignores N-PCSTATE so far, I guess it might be a good idea to implement acting on the DUNA messages. I see now that we can simply read out prim->u.pcstate.
It could, if it wanted to. However, keep in mind that immediately closing all SCCP connections could lead to error amplification (see above). So, if at all, it might make sense to have a timer for each destination point code, starting on a negative N-PCSTATE.ind and stopping at a positive one. But at that point you're basically replicating functionality that could be achieved via lower SCCP connection-level timeouts, right?
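For what it's worth, the per-point-code timer idea could look roughly like the sketch below. This is only a Python model of the logic, not osmo-hnbgw or libosmo-sigtran code: the class and method names are hypothetical, and time is passed in explicitly instead of using a real timer API so that the behaviour is easy to follow.

```python
from typing import Dict, List


class PcGraceTimers:
    """Per-point-code grace timers driven by N-PCSTATE-like events.

    All names are hypothetical; 'now' is supplied by the caller.
    """

    def __init__(self, grace_s: float) -> None:
        self.grace_s = grace_s
        self.deadline: Dict[int, float] = {}  # point code -> expiry time

    def pc_unavailable(self, pc: int, now: float) -> None:
        # Negative indication: start the timer unless one is already running.
        self.deadline.setdefault(pc, now + self.grace_s)

    def pc_available(self, pc: int) -> None:
        # Positive indication: the outage was transient, cancel the timer.
        self.deadline.pop(pc, None)

    def expired(self, now: float) -> List[int]:
        # Point codes that stayed unreachable for the whole grace period.
        # Only for these would the application release its connections.
        gone = sorted(pc for pc, t in self.deadline.items() if now >= t)
        for pc in gone:
            del self.deadline[pc]
        return gone
```

A point code that comes back within the grace period never shows up in `expired()`, so a short outage does not tear down any SCCP connections; that is the error-amplification guard in a nutshell.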
I think the application (SCCP user) behavior to those indications is user-defined, i.e. up to the application.
So in summary:
- when a remote entity behind STP goes bust, I can already now trigger my LINK_LOST event when osmo-hnbgw sees a PCSTATE indicating that the remote point-code for CS / PS CN becomes unavailable.
Again, it's not a link that is lost, but the reachability of a given point code that has (potentially temporarily) changed. This would typically mean that the last SS7 link of the last SS7 linkset on the last potential path/route between the HNBGW and the given point code is lost. So all redundancy mechanisms at all layers on all potential redundant routes have failed.
- when the first SCTP hop goes bust (kill osmo-stp), maybe we can implement some prim going up the SCCP user SAP? Would that also be a DUNA, based on active SCCP conns' remote point-code, or is that a hacky layer violation?
I think that could potentially make sense. It's a bit of a question of whether we think the SCCP on an SSP (end node) should behave towards its local user as an SGP (STP, router) would. Feel free to check the relevant Q.7xx and see if you can find any guidance on that.