Bug #4351

smpp: Deliver-SM error response from SMPP handler does not result in RP-ERROR to the MS

Added by neels about 2 months ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 01/09/2020
Due date:
% Done: 100%
Resolution:

Description

Apparently osmo-nitb was able to directly return an RP-ERROR to the MS when delivery via SMPP returned an error, but osmo-msc fails to do that.
To test this, I used an esme.py script that always returns RINVSRCADR when osmo-msc sends out a Deliver-SM via SMPP.
I also tried to make sure the conn is kept alive by adding a use count on the msc_a conn, to no avail.

In attached smpp_failure.pcapng (view with a wireshark that is able to display gsmtap_log), see these packets:

120: MS dispatches MO SMS to osmo-msc
137: trans tid-11 is created
157: SMR changes state to WAIT_TO_TX_RP_ACK
169: SMR already triggers RP-ACK and
172: changes to state IDLE
187: SMPP Deliver_sm is sent from osmo-msc to the esme script
188: RP-ACK already goes back to the MS (too fast)
216: SMS transaction and SMC,SMR get freed (too soon)
223: SMPP error response comes back to osmo-msc (too late)
228: osmo-msc notices that the transaction is already gone, nothing left to do

In another trace, the timing was seen such that the Deliver_sm error response came before the SMS transaction was freed,
but then the RP-ACK already had been triggered and the SMR ignores the RP-ERROR that wants to get sent with "unhandled at this state (IDLE)".
See packet 204 in smpp_failure_2.pcapng -- it's the same thing, the RP-ACK has already happened before SMPP gets a chance to respond.
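
The ordering problem seen in the second trace can be reproduced in a toy model. This is a minimal sketch, not the real SMR implementation: the names `smr_sketch` and `smr_dispatch` are hypothetical, and only the two states named in the trace (WAIT_TO_TX_RP_ACK and IDLE) are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the SMR entity; only the two states
 * that appear in the trace are modelled. */
enum smr_state { SMR_IDLE, SMR_WAIT_TO_TX_RP_ACK };
enum smr_request { SMR_REQ_RP_ACK, SMR_REQ_RP_ERROR };

struct smr_sketch {
    enum smr_state state;
    bool rp_ack_sent;
    bool rp_error_sent;
};

/* Returns true if the request was handled, false if it was ignored
 * (the "unhandled at this state (IDLE)" case). */
static bool smr_dispatch(struct smr_sketch *smr, enum smr_request req)
{
    assert(smr != NULL);

    /* Only an entity still waiting to transmit may send RP-ACK/RP-ERROR. */
    if (smr->state != SMR_WAIT_TO_TX_RP_ACK)
        return false;

    if (req == SMR_REQ_RP_ACK)
        smr->rp_ack_sent = true;
    else
        smr->rp_error_sent = true;

    smr->state = SMR_IDLE;
    return true;
}
```

In this model, dispatching the RP-ACK first moves the entity to IDLE, so an RP-ERROR request triggered later by the SMPP error response is ignored, which matches what smpp_failure_2.pcapng shows.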

smpp_failure.pcapng (79.3 KB) neels, 01/09/2020 03:43 AM
smpp_failure_2.pcapng (68 KB) neels, 01/09/2020 03:43 AM

History

#1 Updated by neels about 2 months ago

code trees used: tags neels/os4351 in osmo-hlr and osmo-msc

#2 Updated by neels about 2 months ago

#3 Updated by keith about 1 month ago

BTW, you don't even need an esme connected, just configure an SMPP default-route and leave it unconnected. osmo-msc will lose the SMS in the void and send RP-ACK anyway.

#4 Updated by keith about 1 month ago

This might be part of what is missing.
https://gerrit.osmocom.org/#/c/openbsc/+/3900/

Also maybe #2390 is related?

#5 Updated by keith about 1 month ago

Please take a look at gsm_04_11.c:gsm411_rx_rp_ud()

link to current master: https://gerrit.osmocom.org/plugins/gitiles/osmo-msc/+/4a5ba81f7d057373ed44ab66169faa1f9d7b46ea/src/libmsc/gsm_04_11.c#700

    rc = gsm340_rx_tpdu(trans, msg, rph->msg_ref);
    if (rc == 0)
        return gsm411_send_rp_ack(trans, rph->msg_ref);
    else if (rc > 0)
        return gsm411_send_rp_error(trans, rph->msg_ref, rc);
    else
        return rc;
}

So we'll send RP-ACK based on the return from gsm340_rx_tpdu() but if we look at this (and compare to openbsc) we can see that the return value from sms_route_mt_sms() is probably nearly always overwritten at line 641.

In openbsc there were two checks right after the call to sms_route_mt_sms():

    rc = sms_route_mt_sms(conn, gsms);

    /* This SMS got routed through SMPP. */
    if (gsms->smpp.esme)
        return -EINPROGRESS;

    if (!gsms->receiver)
        return rc;

Not the complete story by any means.. but it's a start?
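
To make the return-code convention explicit, here is a minimal sketch of how the caller quoted above interprets `rc`. The enum and function names are hypothetical; only the three-way split (0, positive, negative) is taken from the snippet.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels for what the caller does with the return code. */
enum rp_action {
    RP_ACTION_SEND_ACK,   /* rc == 0: RP-ACK is sent right away */
    RP_ACTION_SEND_ERROR, /* rc > 0: RP-ERROR is sent, rc is the cause */
    RP_ACTION_NONE_YET,   /* rc < 0: nothing is sent at this point */
};

static enum rp_action rp_action_for_rc(int rc)
{
    if (rc == 0)
        return RP_ACTION_SEND_ACK;
    if (rc > 0)
        return RP_ACTION_SEND_ERROR;
    return RP_ACTION_NONE_YET;
}

/* The RP cause is only meaningful for a positive return code. */
static int rp_cause_for_rc(int rc)
{
    assert(rc > 0);
    return rc;
}
```

Under this convention, a negative value such as -EINPROGRESS is the only result that keeps both RP-ACK and RP-ERROR available for later, for example once the ESME has answered.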

#6 Updated by fixeria about 1 month ago

BTW, you don't even need an esme connected, just configure an SMPP default-route and leave it unconnected. osmo-msc will lose the SMS in the void and send RP-ACK anyway.

This looks like a complete TTCN-3 test case scenario ;)

#7 Updated by fixeria about 1 month ago

  • % Done changed from 0 to 30

One important detail about sms_route_mt_sms() is that we use an osmo_wqueue for sending SMPP messages: the result of gsm340_rx_tpdu() may reflect the return value of the enqueuing function smpp_cmd_enqueue(), which would be 0 even if the remote SMPP peer is not connected.

Another important detail about sms_route_mt_sms(): if SMPP is enabled and a matching route is found, it sends the SMS immediately. If SMPP is not enabled, it tries to find the destination subscriber in the local VLR, but never sends the SMS immediately, leaving this task up to the SMS queue.

An unrelated problem was noticed while looking through the code: smpp_route() may return a CC (Call Control) specific cause (e.g. GSM48_CC_CAUSE_UNASSIGNED_NR), while we're dealing with SMS. This cause may (theoretically) end up in an RP-ERROR message and confuse the user.

Regarding sending RP-ACK immediately from gsm411_rx_rp_ud(), as mentioned by keith, I think it is correct: we actually acknowledge reception and storage of the SMS, even if we failed to find an SMPP route or the local subscriber. Take a look at the end of gsm340_rx_tpdu(), where we immediately override the result of sms_route_mt_sms() by calling gsm340_rx_sms_submit(). In the end, the loss of SMPP peer might be temporary, and we may want to try again later...

216: SMS transaction and SMC,SMR get freed (too soon)
223: SMPP error response comes back to osmo-msc (too late)
228: osmo-msc notices that the transaction is already gone, nothing left to do

I think the main problem is that SMPP error response triggers RP-ACK or RP-ERROR. This shall not happen unless the MSC is acting as a transceiver, i.e. does not store the SMS in its database. Instead, we should send a negative delivery report - that's it.

I hope you will find this analysis helpful.

#8 Updated by keith about 1 month ago

fixeria wrote:

One important detail about sms_route_mt_sms() is that we use an osmo_wqueue for sending SMPP messages: the result of gsm340_rx_tpdu() may reflect the return value of the enqueuing function smpp_cmd_enqueue(), which would be 0 even if the remote SMPP peer is not connected.

Hmmm. Whatever way it was (is) working with osmo-nitb, this is not the case.

Another important detail about sms_route_mt_sms(): if SMPP is enabled and a matching route is found, it sends the SMS immediately. If SMPP is not enabled, it tries to find the destination subscriber in the local VLR, but never sends the SMS immediately, leaving this task up to the SMS queue.

I guess a question here is also whether we at some point remove the SMS queue from the MSC altogether, moving it into the SMSC for example. In the meantime, having smpp-first and default-route in the esme config should (and does in nitb) transfer control of whether we ACK or ERROR the SMS to the ESME, essentially bypassing the SMS queue for MO SMS. Yes, if the message is ultimately destined for the local MSC, it will come back to us and end up in the MSC's queue, but that's a different issue. For MO SMS, we should consider that maybe the SMS is destined for another MSC/HLR or for another (non-osmo) gateway.

An unrelated problem was noticed while looking through the code: smpp_route() may return a CC (Call Control) specific cause (e.g. GSM48_CC_CAUSE_UNASSIGNED_NR), while we're dealing with SMS. This cause may (theoretically) end up in an RP-ERROR message and confuse the user.

I think the codes are the "same" to the ME; it's just the constant that is used in the osmo code, so there is no fear of confusing the (ME) user, only the coder. That is why I sent https://gerrit.osmocom.org/#/c/osmo-msc/+/16806/ (which you saw).

Regarding sending RP-ACK immediately from gsm411_rx_rp_ud(), as mentioned by keith, I think it is correct: we actually acknowledge reception and storage of the SMS, even if we failed to find an SMPP route or the local subscriber.

I certainly don't want to do this in production. If the user (the person with the ME) attempts to send an unrouteable SMS, I don't want to store it and try to forward it later; I want to reject it. This is what I see on commercial networks. Take, for example, the scenario where the message is unrouteable due to lack of funds in the user's account. Yes, it's possible to send a (non)-delivery report later, but maybe the user (ME) didn't request delivery reports.

I don't think we should ever accept SMS from an ME when we know we are not going to deliver it.

#9 Updated by fixeria about 1 month ago

Hmmm. Whatever way it was (is) working with osmo-nitb, this is not the case.

Well. yes. There is a little piece of code that OpenBSC has while OsmoMSC does not:

https://git.osmocom.org/openbsc/tree/openbsc/src/libmsc/gsm_04_11.c#n521

    /* This SMS got routed through SMPP. */
    if (gsms->smpp.esme)
        return -EINPROGRESS;

    if (!gsms->receiver)
        return rc;

What is the impact?

  • Positive: if SMPP is enabled and a MO SMS was successfully enqueued to the Tx queue, we return a negative number => the SMR transaction will still be waiting for RP-ACK / RP-ERROR, which we send on receipt of the answer from ESME;
  • Negative: if SMPP is enabled and a MO SMS was successfully enqueued to the Tx queue (not yet forwarded to ESME), OpenBSC would not store it in the database. Thus if for some reason forwarding fails, we lose that SMS. The sender would receive RP-ERROR though.
  • Negative: if SMPP is not enabled, and the receiver is not (yet) attached, OpenBSC would not store that SMS in the database, and immediately return RP-ERROR.

So adding the first condition back to OsmoMSC would most likely help you to get the old behaviour.
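
The difference between the two code paths can be sketched as follows. This is a simplified model under stated assumptions, not the real gsm340_rx_tpdu(): `sms_sketch` stands in for the two fields of the SMS structure that the checks look at, `store_sms_sketch()` is a hypothetical placeholder for storing the message that always succeeds, and the final "store" step in the OpenBSC-like variant is assumed from the discussion above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields used by the two checks. */
struct sms_sketch {
    bool routed_via_esme; /* models gsms->smpp.esme being set */
    bool has_receiver;    /* models gsms->receiver being set */
};

/* Hypothetical placeholder for storing the SMS; always succeeds here. */
static int store_sms_sketch(const struct sms_sketch *sms)
{
    (void)sms;
    return 0;
}

/* Behaviour described for OsmoMSC before the fix: the routing result
 * is replaced by the result of storing the message. */
static int rx_tpdu_overwriting(const struct sms_sketch *sms, int route_rc)
{
    assert(sms != NULL);
    (void)route_rc; /* never checked, so a routing error cannot propagate */
    return store_sms_sketch(sms);
}

/* Behaviour described for OpenBSC: two checks right after routing. */
static int rx_tpdu_with_checks(const struct sms_sketch *sms, int route_rc)
{
    assert(sms != NULL);

    if (sms->routed_via_esme)
        return -EINPROGRESS; /* wait for the answer from the ESME */
    if (!sms->has_receiver)
        return route_rc;     /* propagate the routing result */
    return store_sms_sketch(sms);
}
```

With the overwriting variant every message ends in a return value of 0 and therefore an immediate RP-ACK, while the variant with checks keeps the SMR waiting whenever the message went out via SMPP.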

#10 Updated by fixeria about 1 month ago

I think the codes are the "same" to the ME; it's just the constant that is used in the osmo code, so there is no fear of confusing the (ME) user, only the coder. That is why I sent https://gerrit.osmocom.org/#/c/osmo-msc/+/16806/ (which you saw).

In the current code a cause value returned by sms_route_mt_sms() will never end up in RP-ERROR, because (as I already mentioned) we override this value and never check/use it (see the second 'if' statement in my previous comment) :/

If the user (the person with the ME) attempts to send an unrouteable SMS, I don't want to store it and try to forward it later; I want to reject it.

What do you mean by 'unrouteable'? If I understand correctly, this is a MO SMS for which we failed to find a route. Yes, it would make sense to reject such messages. But the current approach is to fall back to the inter-MSC delivery attempt if there is no suitable route. In other words, OsmoMSC would look up the destination MSISDN in its VLR and try to deliver the message directly. That's why it might still make sense to store the message - the receiver may not be attached.

On the other hand, we may probably want to make this behaviour configurable. AFAIR, we already have so-called 'transaction mode' for ESME originated SMS when OsmoMSC does not store the received message and tries to deliver it immediately.

Yes, it's possible to send a (non)-delivery report later, but maybe the user (ME) didn't request delivery reports.

You will definitely get a negative delivery report even if you did not request it. Only positive reports are optional.

#11 Updated by keith about 1 month ago

fixeria wrote:

I think the codes are the "same" to the ME; it's just the constant that is used in the osmo code, so there is no fear of confusing the (ME) user, only the coder. That is why I sent https://gerrit.osmocom.org/#/c/osmo-msc/+/16806/ (which you saw).

In the current code a cause value returned by sms_route_mt_sms() will never end up in RP-ERROR, because (as I already mentioned) we override this value and never check/use it (see the second 'if' statement in my previous comment) :/

Yep, I think we both (asynchronously) noted this one :)

To reconfirm: in the nitb it does work; I spent some time on this. What was also interesting is that most modern MEs don't care what the code is (they just blindly retry, up to eight times). The Nokia 6770, however, displays quite a range of error messages on the screen and backs off immediately for some, but that's another topic.

I started tracing the changes since the MSC split, and some things were done, undone, etc.

I'm experimenting with returning nitb behaviour although I don't know yet:

a) if this is wanted by everyone.

b) if I'm going to run into a bsc<->msc communication problem.

If the user (the person with the ME) attempts to send an unrouteable SMS, I don't want to store it and try to forward it later; I want to reject it.

What do you mean by 'unrouteable'?

I defined it in my comment as "unroutable due to lack of funds". It could also be an invalid dest address, +79130000000 for example. That's invalid and unrouteable, right? (Maybe we don't know it, but some upstream would.) Although maybe that's an ambiguous example. How about just 1234? We might know straight away that we cannot route that.

Let's just say: anything starting with + is unrouteable if we have no international SMS gateway.
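
That last rule is simple enough to sketch as a pre-routing check. Everything here is an assumption for illustration: `check_mo_destination()` and the macro name are hypothetical, and rejecting with RP cause 1 ("unassigned number" in 3GPP TS 24.011) is just one plausible choice. Only the "+" rule is modelled; deciding about something like 1234 would need a real numbering plan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* RP cause 1 is "unassigned (unallocated) number" in 3GPP TS 24.011;
 * the macro name is hypothetical. */
#define SKETCH_RP_CAUSE_UNASSIGNED_NR 1

/* Returns 0 if the destination may be routed, or an RP cause to
 * reject the MO SMS with before it is accepted from the ME. */
static int check_mo_destination(const char *dest, bool have_intl_gateway)
{
    assert(dest != NULL);

    if (dest[0] == '\0')
        return SKETCH_RP_CAUSE_UNASSIGNED_NR;
    /* "+" marks an international number; without a gateway it cannot go anywhere. */
    if (dest[0] == '+' && !have_intl_gateway)
        return SKETCH_RP_CAUSE_UNASSIGNED_NR;
    return 0;
}
```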

If I understand correctly, this is a MO SMS for which we failed to find a route. Yes, it would make sense to reject such messages. But the current approach is to fall back to the inter-MSC delivery attempt if there is no suitable route. In other words, OsmoMSC would look up the destination MSISDN in its VLR and try to deliver the message directly. That's why it might still make sense to store the message - the receiver may not be attached.

OK :-)
I think this (valid) logic comes from only looking at the osmo CNI as a self-contained entity. If one forgets about the concept of a "receiver" being "attached" for a moment, and considers that maybe the destination address is a shortcode that responds with weather information, or the current market price of a kg of coffee, i.e. something outside of the osmo infrastructure, you can see different needs.

Another thing is that the in-MSC database can be a problem: how do you get a message out of it, once it is in there, if that message now needs to be delivered to a completely separate osmo core network? Well, this was my solution (and it's not so pretty):
https://dev.rhizomatica.org:8888/rhizomatica/rccn/blob/master/rccn/sqs.py

Also, lynxis has voiced concerns about the SMS queueing mechanism (and I have observed that it delays delivery by random, sometimes large, amounts of time, even for MEs that are connected at the time of sending), and IIRC neels looked at it and found it cryptic to follow. In short, it needs refactoring or, again, we just remove it altogether...

On the other hand, we may probably want to make this behaviour configurable. AFAIR, we already have so-called 'transaction mode' for ESME originated SMS when OsmoMSC does not store the received message and tries to deliver it immediately.

Yep. And as I was saying, we also need to decide at some point whether or not osmo-msc is eventually ONLY going to have "transaction mode".

Yes, it's possible to send a (non)-delivery report later, but maybe the user (ME) didn't request delivery reports.

You will definitely get a negative delivery report even if you did not request it. Only positive reports are optional.

Oh, ok that's good. I didn't check the spec on that, but I trust you! In osmo-msc, we don't support negative delivery reports though.

Thanks!

#12 Updated by keith about 1 month ago

fixeria wrote:

Hmmm. Whatever way it was (is) working with osmo-nitb, this is not the case.

Well. yes. There is a little piece of code that OpenBSC has while OsmoMSC does not:

Exactly as I pointed out previously...

  • Negative: if SMPP is enabled and a MO SMS was successfully enqueued to the Tx queue (not yet forwarded to ESME), OpenBSC would not store it in the database. Thus if for some reason forwarding fails, we lose that SMS. The sender would receive RP-ERROR though.

BTW, I think that's fine. As I don't have a good store and forward solution, this is what I do for inter-village SMS at TIC:

Village A sends SMS to number in village B but village B is offline/unreachable.

SMS->NITB->SMPP->ESME... trying to contact village B.... fail...->error to SMPP->RP_ERROR.

I disagree with you in a small way: the end result of that is that the SMS is not lost at all! It's now "queued". Where? In the ME!! And what's more, the user is under no illusion that the SMS was sent and delivered. Some MEs will auto-retry. Things like the AOSP Messaging app require manual intervention, but at least the message is clearly marked with an Attention! icon that shows it was not sent.

In the case you're describing, where the local ESME process has died for some reason, we should also return RP_ERROR to the ME. If you're going to store that in the database, you then need a mechanism to go through the database and attempt re-delivery to the ESME. We don't have that; we can ONLY attempt delivery to a connected ME. See the problem?

So adding the first condition back to OsmoMSC would most likely help you to get the old behaviour.

Yep, as I said in my previous comment, I will test options and report or post patches.

#13 Updated by keith about 1 month ago

keith wrote:

If you're going to store that in the database, you then need a mechanism to go through the database and attempt re-delivery to the ESME. We don't have that; we can ONLY attempt delivery to a connected ME. See the problem?

BTW, Maybe we want that?

I.E. Have database storage in the MSC between BSC and SMSC and vice versa?

#14 Updated by keith about 1 month ago

fixeria wrote:

One important detail about sms_route_mt_sms() is that we use an osmo_wqueue for sending SMPP messages: the result of gsm340_rx_tpdu() may reflect the return value of the enqueuing function smpp_cmd_enqueue(), which would be 0 even if the remote SMPP peer is not connected.

Hi, I'm trying to understand this, (apologies if I'm wrong)

First, smpp_cmd_enqueue() is not the osmo_wqueue-related function; rather, it sets a timer and callback to handle the case where the ESME does not respond within 5 seconds.

Anyway, if the remote SMPP peer is not connected, we really should not get as far as calling PACK_AND_SEND and osmo_wqueue_enqueue(), because further up the chain smpp_route() will have returned GSM411_RP_CAUSE_MO_NET_OUT_OF_ORDER to sms_route_mt_sms() and therefore to gsm340_rx_tpdu().

Anyway, as we've both noted various times, the return value of sms_route_mt_sms() is totally ignored due to patches to osmo-msc since the split that remove the checks on rc after the call.
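
The call chain described above can be reduced to a toy model. This is a sketch under stated assumptions rather than the actual smpp_route() / sms_route_mt_sms() code: `esme_sketch` and `route_mo_sms_sketch()` are hypothetical names, and the counter merely stands in for the transmit queue. RP cause 38 is "network out of order" in 3GPP TS 24.011.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* RP cause 38 is "network out of order" in 3GPP TS 24.011;
 * the macro name is hypothetical. */
#define SKETCH_RP_CAUSE_NET_OUT_OF_ORDER 38

/* Hypothetical stand-in for an ESME and its transmit queue. */
struct esme_sketch {
    bool connected;
    unsigned int tx_queue_len;
};

/* Returns 0 once the message is queued towards the ESME, or an RP
 * cause if routing fails before anything is enqueued. */
static int route_mo_sms_sketch(struct esme_sketch *esme)
{
    assert(esme != NULL);

    if (!esme->connected)
        return SKETCH_RP_CAUSE_NET_OUT_OF_ORDER; /* enqueue is never reached */

    esme->tx_queue_len++;
    return 0;
}
```

The point of the sketch is that the error is available as a return value before any queueing happens; it only gets lost if the caller does not look at that value.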

#15 Updated by keith about 1 month ago

OK, so there are two of Pablo's commits to openbsc that did not make it into osmo-msc

Specific and critical to this issue is 9051421d75eed22c02f01b373cfab58dbadcd4b5
a.k.a. https://gerrit.osmocom.org/#/c/openbsc/+/3900/

and 968a6c2365c0f772fa65ebe66466715d6861e7fc
https://gerrit.osmocom.org/#/c/openbsc/+/3899/

#16 Updated by keith about 1 month ago

  • % Done changed from 30 to 60

I've prepared and manually tested a patch based on forward-porting https://gerrit.osmocom.org/#/c/openbsc/+/3900/ and this seems to solve the issue as per this ticket title.

A question now is maybe, do we want to extend the options for behaviour when the ESME responds with an error or is unavailable?

The inline help in the vty says:
smpp-first Try SMPP routes before the subscriber DB
but this is not correct, as it really means "smpp-only".

Should we implement:
smpp-only - i.e. try SMPP and send RP-ERROR or RP-ACK to the ME accordingly?
smpp-first - i.e. do what the current inline help says?
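
The two proposed options could be told apart roughly like this. It is only a sketch of the proposal, not existing osmo-msc behaviour or VTY code: the enum and function names are hypothetical, and "smpp-first" is modelled literally as the inline help describes it (SMPP first, then the subscriber DB).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy values mirroring the two proposed options. */
enum smpp_policy_sketch { SMPP_POLICY_ONLY, SMPP_POLICY_FIRST };

/* Hypothetical outcomes for a MO SMS after the SMPP attempt. */
enum mo_sms_outcome {
    MO_SMS_RP_ACK,            /* ESME accepted the message */
    MO_SMS_RP_ERROR,          /* reject towards the ME */
    MO_SMS_TRY_SUBSCRIBER_DB, /* fall back to local delivery */
};

static enum mo_sms_outcome mo_sms_outcome_for(enum smpp_policy_sketch policy,
                                              bool esme_accepted)
{
    assert(policy == SMPP_POLICY_ONLY || policy == SMPP_POLICY_FIRST);

    if (esme_accepted)
        return MO_SMS_RP_ACK;
    if (policy == SMPP_POLICY_ONLY)
        return MO_SMS_RP_ERROR;
    return MO_SMS_TRY_SUBSCRIBER_DB;
}
```

The only case where the two options differ is an ESME error or an unavailable ESME: smpp-only rejects towards the ME, while smpp-first would continue with the subscriber DB.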

The second patch I mention above is trivial but I sent https://gerrit.osmocom.org/#/c/osmo-msc/+/16839/ anyway

#17 Updated by keith about 1 month ago

  • % Done changed from 60 to 90

#18 Updated by fixeria about 1 month ago

I have prepared a TTCN-3 test case:

https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/16882/ MSC/SMPP: introduce TC_smpp_mo_sms_rp_error for OS#4351

It fails with the current master, but should theoretically pass with your change applied.

I have also corrected the test case expectations in TC_smpp_mo_sms:

https://gerrit.osmocom.org/c/osmo-ttcn3-hacks/+/16883/

#19 Updated by keith about 1 month ago

  • Status changed from New to Closed
  • Assignee set to keith
  • % Done changed from 90 to 100

Patch and test case are merged - closing.
