Feature #1592

closed

VLR in libmsc, to connect to HLR asynchronously

Added by laforge about 8 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Urgent
Assignee:
Category: -
Start date: 02/23/2016
Due date:
% Done: 100%
Spec Reference:

Description

OsmoNITB is currently still stuck with an internal synchronous HLR database. Access needs to be made asynchronous in order to support an external subscriber database or even a real HLR via a MAP gateway/proxy. The use of GSUP as the protocol towards HLR+AUC should be investigated, in order to be in line with what OsmoSGSN does.

Even without the need for external HLR, this synchronous database access is leading to problems, if
  • another process is opening the database and thereby blocking OsmoNITB, or
  • the system (particularly file-system I/O) is getting slow and thereby blocking OsmoNITB

Related issues

Related to OsmoSGSN - Feature #1644: Use the new/upcoming external HLR created for OsmoNITB (Closed, laforge, 03/11/2016)
Related to OsmoSGSN - Feature #1645: mechanism for enabling/disabling GPRS on per-user basis (Closed, msuraev, 03/11/2016)
Related to OsmoNITB - Bug #1591: libdbi is buggy and slow, get rid of it (Closed, neels, 02/23/2016)
Related to OpenBSC - Bug #30: sqlite3 database / asynchronous access to it (Closed, neels)
Related to OsmoNITB - Feature #1711: 3G Auth (Closed, neels, 05/14/2016)
Related to OsmoNITB - Support #1922: comprehensive test of MSC subscriber connection and request handling (Closed, neels, 01/18/2017)
Blocks OsmoHLR - Feature #1643: programmatic access to new asynchronous external HLR (New, 03/11/2016)

Actions #2

Updated by laforge about 8 years ago

  • Blocks Feature #1643: programmatic access to new asynchronous external HLR added
Actions #3

Updated by laforge about 8 years ago

  • Related to Feature #1644: Use the new/upcoming external HLR created for OsmoNITB added
Actions #4

Updated by laforge about 8 years ago

  • Related to Feature #1645: mechanism for enabling/disabling GPRS on per-user basis added
Actions #5

Updated by laforge about 8 years ago

  • Assignee set to laforge
Actions #6

Updated by laforge almost 8 years ago

  • Status changed from New to In Progress
Actions #7

Updated by laforge almost 8 years ago

  • Related to Bug #1591: libdbi is buggy and slow, get rid of it added
Actions #8

Updated by laforge almost 8 years ago

  • Related to Bug #30: sqlite3 database / asynchronous access to it added
Actions #9

Updated by laforge almost 8 years ago

  • Priority changed from High to Urgent
Actions #10

Updated by neels almost 8 years ago

Actions #11

Updated by laforge over 7 years ago

  • Assignee changed from laforge to neels
Actions #12

Updated by laforge over 7 years ago

  • Target version set to Asynchronous HLR+AUC for CS
Actions #13

Updated by neels over 7 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from Asynchronous HLR / database access to VLR in libmsc, to connect to HLR asynchronously

The HLR is being developed at https://git.osmocom.org/osmo-hlr/.

The OsmoNITB related part is the upcoming VLR that connects to the HLR.
See branch neels/vlr in openbsc https://git.osmocom.org/openbsc/log/?h=neels/vlr

Hence there will be no database access in OsmoNITB. Thus I'm changing the subject of this issue
to name the VLR explicitly.

An integral reason to have the VLR is that it will allow handling of 3G authentication tokens, for #1711.

Actions #14

Updated by neels over 7 years ago

  • % Done changed from 0 to 30
Status update:
  • took over the users/laforge/vlr branch, now continued as neels/vlr
  • libvlr will use the GSUP client so far kept within openbsc's gprs/ subdir.
    Freed GSUP and the loosely connected OAP client implementations by moving into libcommon.
  • Improved test coverage and fixed a bug in OAP while at it.
  • OAP message composition part has moved to libosmocore, with a new, separate test suite.
  • Got the vlr code to compile for the first time!
    Numerous fixes and additions were necessary to achieve this.
    Particularly in the vlr_test.c, I don't really know what I'm doing yet.
    It seems to be based on an older libvlr API; I got it to compile but not sure whether
    it does anything useful. I'll probably leave it aside for now.
  • The vlr_subscriber struct will gradually replace the gsm_subscriber.
    So far, gsm_subscriber.vsub points at a corresponding vlr_subscriber,
    which allows moving users of gsm_subscriber to vlr_subscriber gradually.
    When all are migrated, the intermediate step of subscr->vsub-> will collapse to just vsub->.
    See comments at gsm_subscriber for details and progress.
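
A minimal sketch of the gradual migration described in the last bullet. The two structs are heavily reduced, hypothetical stand-ins (the real ones carry far more state), and the accessor names subscr_msisdn() and vsub_msisdn() are made up for illustration. The point is only that an unmigrated caller can keep taking a gsm_subscriber while already reading through vsub, so that the later switch to vlr_subscriber is mechanical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the new VLR-owned record. */
struct vlr_subscriber {
	char imsi[16];
	char msisdn[16];
};

/* Hypothetical, reduced stand-in for the legacy record. */
struct gsm_subscriber {
	struct vlr_subscriber *vsub;	/* link used during the migration */
};

/* Not yet migrated: takes the legacy struct, reads via subscr->vsub->. */
static const char *subscr_msisdn(const struct gsm_subscriber *subscr)
{
	assert(subscr && subscr->vsub);
	return subscr->vsub->msisdn;
}

/* Migrated: the intermediate hop has collapsed to just vsub->. */
static const char *vsub_msisdn(const struct vlr_subscriber *vsub)
{
	assert(vsub);
	return vsub->msisdn;
}
```

Both accessors return the same storage, so callers can be moved over one at a time without changing behavior.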

I think I will next simply try to run the bastard and see where it takes me,
trying to punch my way through to a first location update...

Actions #15

Updated by neels over 7 years ago

GSUP+OAP related VLR-prep patches currently waiting in gerrit:
openbsc: https://gerrit.osmocom.org/1381 thru https://gerrit.osmocom.org/1396
libosmocore: https://gerrit.osmocom.org/1375 thru https://gerrit.osmocom.org/1380

Actions #16

Updated by neels over 7 years ago

The openbsc patches will build only when G#1375 is merged to libosmocore.

Actions #17

Updated by neels over 7 years ago

  • Fixed wrong osmo_ prefixes in the openbsc GSUP and OAP patches (not submitted for review yet, see 4. below).
  • Added proper build system for osmo-hlr: added autoconf/automake and created jenkins builds for gerrit as well as master.
  • Fixed GSUP client init in osmo-nitb: so far it was passed a NULL hostname, so I threw a quick VTY config at it (on the neels/vlr branch).

I have been slightly distracted by the GSUP/OAP migration and connected issues,
a side track to the actual VLR development. It took some time to sort things out,
but now the patch reviews may take as long as they need without impeding progress.

The patch merging plan:

1. (✓) fix DLGSUP logging category: merge https://gerrit.osmocom.org/1402
  • (✓) tag this as libosmocore 0.9.5
2. (✓) merge osmo-hlr follow-up for DLGSUP: https://gerrit.osmocom.org/1403
  • ( ) depend on libosmocore 0.9.5 in configure.ac (patch waiting)
3. (✓) merge OAP addition to libosmocore: https://gerrit.osmocom.org/1375 https://gerrit.osmocom.org/1378 https://gerrit.osmocom.org/1379 https://gerrit.osmocom.org/1380
  • (✓) tag this as libosmocore 0.9.6
  • (✓) (DLOAP had the same problem as DLGSUP à la G#1402, fixed in submitted OAP patches)
4. (✓) push the updated GSUP and OAP related patches for openbsc for review (ready and waiting)
  • (✓) depend on libosmocore versions 0.9.5 and 0.9.6 as appropriate
    I4f245a7d78d0889b37084c52478372bddb8289d6 I2f06aaa6eb54eafa860cfed8e72e41d82ff1c4cf

5. (✓) review and merge, then start pushing more directly VLR related patches for review, as they become ready

Actions #18

Updated by neels over 7 years ago

Another distraction: found bounds checking bugs in libosmocore's logging API, submitted patches with unit tests.
I intended to spend half an hour on it, but found more problems after I started.
I should avoid distractions like these and create redmine issues instead :/

The problems have slight effects on the GSUP/OAP patches in queue,
which could also have been worked around pretty quickly.

Actions #19

Updated by neels over 7 years ago

Excellent: got my first Location Updating Reject =)

Trivially implemented the gsm_subscriber_conn->master_fsm.
Not entirely sure yet why, but Harald's code dispatches events to it.

Also implemented the msc_vlr_subscr_assoc() step.

Fixed various null pointers and false assumptions (mostly introduced by me before).

Next: add valid subscriber data to the HLR's db and continue towards a LU Accept.

Actions #20

Updated by neels over 7 years ago

  • % Done changed from 30 to 40

Success: got first Location Updating Accept, and a subsequent CM Service Request (for USSD).
Both with 2G authentication, without ciphering.

I'm surprised by how fast the authentication went through. From segfaults, missing
initializations and false asserts, things jumped directly through to successful auth. Very nice.
Probably Harald fixed it all with "dry" tests of the auth code before I took over.

Limitations / next up:
  • authentication is still hardcoded mandatory, need to make it optional.
  • haven't tested LU without auth yet.
  • the USSD *#100# returns: "Your extension is " [sic]
    so: need to move extension lookup to the vlr_subscriber.
Actions #21

Updated by neels over 7 years ago

oh, and

20161216155824845 DMM <0002> ../../../src/libmsc/gsm_04_08.c:877 IMSI DETACH INDICATION: MI(IMSI)=901700000004620
20161216155824846 DMM <0002> ../../../src/libmsc/gsm_04_08.c:912 Unknown Subscriber ?!?
20161216155824846 DLGLOBAL <001e> ../../src/fsm.c:384 Trying to dispatch event 2 to non-existing FSM Instance!
20161216155824851 DLGLOBAL <001e> ../../src/backtrace.c:47 backtrace() returned 11 addresses
20161216155824851 DLGLOBAL <001e> ../../src/backtrace.c:57     /usr/local/lib/libosmocore.so.7(_osmo_fsm_inst_dispatch+0x2ca) [0x7ffff7776d0a]

(but osmo-nitb keeps running, no crash)

Actions #22

Updated by neels over 7 years ago

Success: got first Location Updating without authentication.

This basically lacked merely a permission for a state transition from VLR_ULA_S_IDLE to VLR_ULA_S_WAIT_HLR_UPD,
for the case that all information is available and no authentication is necessary.

But I took a little detour to avoid a number of segfaults that the above missing state transition was causing,
in order to randomly improve robustness:

  • Made sure a subscriber <-> lu_fsm association can fail without self destruction.
  • Made sure that when a subscr conn is freed, the FSM instances are properly terminated
    instead of being freed quietly along with the conn's talloc context; particularly so that an orphaned
    LU attempt doesn't leave an invalid lu_fsm pointer in the vlr_subscriber when the conn is discarded.
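
The missing transition permission mentioned above can be pictured with a per-state bitmask of allowed successor states, similar in spirit to the out_state_mask member of libosmocore's struct osmo_fsm_state. The following is a standalone sketch and not the libvlr code: VLR_ULA_S_WAIT_AUTH, the table contents and ula_state_chg() are assumptions for illustration; only VLR_ULA_S_IDLE and VLR_ULA_S_WAIT_HLR_UPD are names from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ula_state {
	VLR_ULA_S_IDLE,
	VLR_ULA_S_WAIT_AUTH,	/* illustrative third state, an assumption */
	VLR_ULA_S_WAIT_HLR_UPD,
	ULA_NUM_STATES
};

#define S(x) (UINT32_C(1) << (x))

/* For each state: bitmask of the states that may be entered next. */
static const uint32_t ula_out_state_mask[ULA_NUM_STATES] = {
	[VLR_ULA_S_IDLE] = S(VLR_ULA_S_WAIT_AUTH)
			 | S(VLR_ULA_S_WAIT_HLR_UPD),	/* the no-auth shortcut */
	[VLR_ULA_S_WAIT_AUTH] = S(VLR_ULA_S_WAIT_HLR_UPD),
	[VLR_ULA_S_WAIT_HLR_UPD] = S(VLR_ULA_S_IDLE),
};

/* Change state only if permitted; on refusal *cur is left untouched. */
static bool ula_state_chg(enum ula_state *cur, enum ula_state next)
{
	assert(cur && *cur < ULA_NUM_STATES && next < ULA_NUM_STATES);
	if (!(ula_out_state_mask[*cur] & S(next)))
		return false;
	*cur = next;
	return true;
}
```

Without the S(VLR_ULA_S_WAIT_HLR_UPD) bit in the IDLE entry, a Location Updating that needs no authentication would be refused at the very first step.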

Todo: So far we used to explicitly request the IMEI from the MS, which seems to not happen anymore.
Clarify whether we actually don't want to ask for the IMEI anymore, or whether that's a bug.

Actions #23

Updated by laforge over 7 years ago

On Mon, Dec 19, 2016 at 01:13:36AM +0000, neels [REDMINE] wrote:

Todo: So far we used to explicitly request the IMEI from the MS, which seems to not happen anymore.
Clarify whether we actually don't want to ask for the IMEI anymore, or whether that's a bug.

should be a policy decision (i.e. operator configuration). We can ignore
it for now, it is more like a gimmick than a primary requirement.

--
- Harald Welte http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)

Actions #24

Updated by neels over 7 years ago

actually, the extension is reported empty by *#100# mostly because, in the osmo-hlr:

        /* FIXME: deal with encoding the following data */
        gsup.msisdn_enc;
        gsup.hlr_enc;

i.e. the MSISDN isn't yet sent over GSUP at all...

Actions #25

Updated by neels over 7 years ago

implemented MSISDN encoding and proper decoding in GSUP messaging from osmo-hlr to libvlr.
Now *#100# works (for both with and without authentication).

Found another problem: when authentication is required, the CM Service Accept and
reply to requests are sent before the MS replies with the Authentication Response:

No.     Time           Source                Destination Protocol Length 
   3368 388.392552000  192.168.0.125         192.168.0.132         RSL      79     CHANnel ReQuireD 
   3370 388.392659000  192.168.0.132         192.168.0.125         RSL      95     CHANnel ACTIVation 
   3371 388.401590000  192.168.0.125         192.168.0.132         RSL      76     CHANnel ACTIVation ACKnowledge 
   3372 388.401635000  192.168.0.132         192.168.0.125         RSL      98     IMMEDIATE ASSIGN COMMAND (CCCH) (RR) Immediate Assignment 
   3382 388.720842000  192.168.0.125         192.168.0.132         RSL      94     ESTablish INDication (DTAP) (MM) CM Service Request 
   3383 388.721343000  192.168.0.132         192.168.0.125         RSL      97     DATA REQuest (DTAP) (MM) Authentication Request 
   3385 388.722016000  192.168.0.132         192.168.0.125         RSL      80     DATA REQuest (DTAP) (MM) CM Service Accept 
   3389 388.956097000  192.168.0.125         192.168.0.132         RSL      97     DATA INDication (DTAP) (RR) Classmark Change 
   3407 389.897628000  192.168.0.125         192.168.0.132         RSL      155    DATA INDication (DTAP) (RR) Utran Classmark Change 
   3416 390.133178000  192.168.0.125         192.168.0.132         RSL      91     DATA INDication (DTAP) (RR) GPRS Suspension Request 
   3426 390.603823000  192.168.0.125         192.168.0.132         RSL/GSM MAP 106    DATA INDication (DTAP) (SS) Register (GSM MAP) invoke processUnstructuredSS-Request 
   3428 390.604183000  192.168.0.132         192.168.0.125         RSL/GSM MAP 121    DATA REQuest (DTAP) (SS) Release Complete (GSM MAP) returnResultLast processUnstructuredSS-Request
                                                                                                                                             ^ USSD reply: "Your extension is 46071" 
   3432 390.839038000  192.168.0.125         192.168.0.132         RSL      84     DATA INDication (DTAP) (MM) Authentication Response 
   3628 401.666229000  192.168.0.125         192.168.0.132         RSL      78     RELease INDication 

Next: figure out the cause...

Actions #26

Updated by neels over 7 years ago

neels wrote:

Next: figure out the cause...

simple: upon starting the access request fsm, we go on invoking the old code.
When that is disabled, the CM Service Request is lacking. So the answer is:
not yet implemented. Doing that now.

Actions #27

Updated by neels over 7 years ago

Success: CM Service Accept is now sent only after authentication (if authentication is required).

But this drew my attention towards subscriber conn validation: we happily reply to service
requests even before the conn has proven itself worthy (e.g. USSD requests).

My plan is to extend the "master_fsm" at the subscriber conn. So far it has meaningless
states. I am enhancing it to reflect whether a conn is new and unvalidated, ready to
be served or in release.

enum subscr_conn_fsm_state {
        SUBSCR_CONN_S_NEW,
        SUBSCR_CONN_S_ACCEPTED,
        SUBSCR_CONN_S_REJECTED,  /* maybe collapse with _RELEASING? */
        SUBSCR_CONN_S_RELEASING,
        SUBSCR_CONN_S_RELEASED,
};

Then I want to add checks in most of the gsm0408_rcv_* cases to reject all requests
unless the subscr conn is in state SUBSCR_CONN_S_ACCEPTED. It shall reach _ACCEPTED only
when all policy has been satisfied (authentication? ciphering? IMEI known? and such more).

Basically only the LU, CM Service and <insert more initial requests that should be allowed> requests
may go ahead in state SUBSCR_CONN_S_NEW.
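
A rough sketch of such a gate, assuming the enum above is used as-is. The message classes and the function msg_allowed() are hypothetical: real code would classify by the 04.08 protocol discriminator and message type. The set of "initial" messages is also an assumption; Authentication Response and Ciphering Mode Complete are included here only because validation could not finish without them.

```c
#include <assert.h>
#include <stdbool.h>

enum subscr_conn_fsm_state {
	SUBSCR_CONN_S_NEW,
	SUBSCR_CONN_S_ACCEPTED,
	SUBSCR_CONN_S_REJECTED,
	SUBSCR_CONN_S_RELEASING,
	SUBSCR_CONN_S_RELEASED,
};

/* Hypothetical coarse classification of incoming messages. */
enum msg_class {
	MSG_LOC_UPD_REQ,
	MSG_CM_SERV_REQ,
	MSG_AUTH_RESP,
	MSG_CIPH_MODE_COMPL,
	MSG_USSD_REQ,
	MSG_SMS,
};

/* Decide whether a message may be processed in the given conn state. */
static bool msg_allowed(enum subscr_conn_fsm_state st, enum msg_class m)
{
	switch (m) {
	case MSG_LOC_UPD_REQ:
	case MSG_CM_SERV_REQ:
	case MSG_AUTH_RESP:
	case MSG_CIPH_MODE_COMPL:
		/* part of validating the conn: fine unless rejected/releasing */
		return st == SUBSCR_CONN_S_NEW || st == SUBSCR_CONN_S_ACCEPTED;
	default:
		/* everything else has to wait for full acceptance */
		return st == SUBSCR_CONN_S_ACCEPTED;
	}
}
```

With a check like this, a USSD request arriving before authentication has completed would be dropped instead of answered.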

Actions #28

Updated by laforge over 7 years ago

On Mon, Dec 19, 2016 at 11:28:12PM +0000, neels [REDMINE] wrote:

Then I want to add checks in most of the gsm0408_rcv_* cases to reject all requests
unless the subscr conn is in state SUBSCR_CONN_S_ACCEPTED. It shall reach _ACCEPTED only
when all policy has been satisfied (authentication? ciphering? IMEI known? and such more).

Basically only the LU, CM Service and <insert more initial requests that should be allowed> requests
may go ahead in state SUBSCR_CONN_S_NEW.

Sounds great. Implementation-wise, it might make sense to do this at one
central location, before the dispatch into the individual gsm0408_rcv*
functions happens.

Something like an explicit check for the common MM procedures
(CM/LU/AUTH/CIPH) early on (and let them pass all the time), and
then one if (state != S_ACCEPTED) exit before dispatching to all of the
remaining functions.

I think this is more or less what is implemented in the SGSN, where
only very few messages are permitted/processed unless a mm_context for
the subscriber has been created/established. And all other messages are
not processed unless that mm context exists.


Actions #29

Updated by neels over 7 years ago

Creating a unit test that fakes incoming 04.08 messages and wraps GSUP tx/rx functions
in order to test the subscriber connection's state transitions and accept/reject behavior.

So far I've been doing it with live BTS and MS, but in the long run having a unit test will
actually reduce the effort (besides all the obvious benefits of a unit test).

We do also have a vlr_test.c on the vlr branch, which so far acts as a client and needs
an osmo-hlr running at localhost. My aim was to test the subscriber connection's FSM, so
I started off creating a new subscr_conn_test.c, but now it seems incorporating all the
VLR's FSMs as well will be fairly easy, with the GSUP functions wrapped away as fake
implementations. We'll see whether one of the tests will end up absorbing the other.

Actions #30

Updated by neels over 7 years ago

(I'm writing this down in a slightly unstructured way to try and help clarify the landscape for myself.
There's a sort of conclusion below, following the voluminous prose describing details...)

(btw, the master_fsm is the subscr_conn fsm, I guess it needs a rename)

The unit test is very helpful in uncovering remaining ambiguities about instance life cycles and ownership.

For example, during the vlr_proc_acc_req FSM startup, the FSM immediately starts by a PR_ARQ_E_START event,
and in case the IMSI cannot be found, this directly terminates the FSM with VLR_PR_ARQ_RES_UNIDENT_SUBSCR.
So the pointer returned by vlr_proc_acc_req() has already been deallocated before that function exits.
I might want to return NULL when the FSM's state is DONE, but it has already deallocated itself from within the
state's action function. i.e. when that happens, the only way to find out that it is gone is by an event
dispatched to a parent FSM. If this stays this way, it means that when we start an FSM, we can't keep a pointer
to it without being ready for termination events and cleanup of that pointer -- right from the start of
creating the FSM in the first place. If the FSM allocating function returns a pointer to the FSM instance,
we need to be aware of FSM termination even before we assign that FSM inst pointer to a local var
(or e.g. to some subscr_conn->my_parq_fsm member). So it might make sense to avoid terminating from within.
In the LU FSM, this is solved by way of the subscr_assoc vlr.ops callback.

Another interesting design question is how to communicate validity/invalidity to the 04.08 layer code.

We have these situations:
  • after a CM Service Accept, the conn should stay open awaiting the actual service request by the MS;
  • after a LU success we immediately close the conn;
  • and after a Paging Response's successful PARQ we look up any pending actions waiting for the MS.

How to communicate / where to draw the line between VLR and MSC?
What deserves to be a vlr.ops entry, what can be done directly, what is communicated via conn->master_fsm events?

There are various aspects / approaches with their own implementation details...

  • Add vlr.ops callbacks for everything coming back from the VLR to 04.08.
  • Have events to the subscr_conn master_fsm (which lives in MSC land) with appropriate actions in its
    action functions. E.g. instead of calling vlr.ops.tx_cm_service_accept() from within PARQ FSM, rather
    dispatch an event to the subscr_conn FSM, which calls gsm48_tx_mm_serv_acc() directly.
    But we anyway have vlr.ops that send 04.08 messages while they are still busy figuring out
    IMEI, TMSI, authen etc etc.
  • VLR sends to the subscriber_conn's master_fsm such events that merely say "LU finished" or "PARQ finished"
    and on the subscr conn side figure out whether that was successful or not and what actions should happen.
    A basic problem is that we'd like to e.g. know the vlr_proc_acc_fsm's type value (LU or CM Service Req?)
    but to send the "finished" event to the subscriber_conn's master_fsm, the proc_acc_fsm has actually
    already terminated and deallocated all of its priv data. The event to the master_fsm could pass a
    static struct along that has the relevant priv data copied, but this is not possible from FSM termination,
    since osmo_fsm_inst_term() first deallocates and then dispatches the event. Not really that beautiful.
  • Have fine-grained events to the subscr_conn's master_fsm like LU_SUCCESS, LU_FAIL, CM_SERVICE_SUCCESS,
    CM_SERVICE_FAIL, PAGING_SUCCESS, PAGING_FAIL, such that there's a separate event for each kind
    of action to be taken by the subscr_conn -- this way the event itself conveys all information and
    we don't need to keep the FSM's priv data. All Accept/Reject messaging has happened within the FSM
    via the vlr.ops callbacks. The subscr_conn doesn't really keep pointers to the FSM instances (besides
    being children of the master_fsm), so they can discard themselves without having to clear pointers
    elsewhere.
    Potential problems here are explosion of nr of events (need to stay <= 32), and another thing
    is that we can't easily figure out whether there are multiple LU / CM Service / Paging responses
    overlapping, so if the MS sends various requests repeatedly, we would create as many FSMs that would
    run in parallel, instead of being able to discard the previous request first.
    Could be solved by a SUBSCR_CONN_S_VALIDATION_STARTED state in which we discard the previous/the new
    request(?) by clearing up the master_fsm, or somesuch.
  • Keep pointers to the FSM instances in subscr_conn, and do not discard FSM instances from within
    themselves (except the "nested inner" FSMs). When the LU or PARQ FSMs reach their "done" state,
    they send an event to the subscr_conn FSM and don't terminate themselves. The subscr_conn can
    then figure out from the FSM's private data what the intention and results are and carry on
    appropriately. When all is done, terminate the LU / PARQ FSMs.
    This way we reduce the nr of events and it's also easy to look up whether one of them is already
    running.
  • There could simply be the "conn accepted" event, be it from LU, CM Serv Req or Paging Response,
    the VLR internally took care of 04.08 messaging, and regardless of how conn acceptance came about,
    the subscr_conn looks up all and any pending actions to take for a subscriber. Upon first receiving
    a CM Service Request, we set a flag that tells us to not terminate the conn immediately; this
    doesn't come from the PARQ FSM but is remembered before we even start the PARQ FSM.
    Any failure to accept from any FSM also clears this flag.
  • We now have reference counting on the subscr_conn. So each FSM being started could increase the
    ref count, whenever it is done we decrease the ref count, and as soon as the count is zero we
    release the conn; we could have a ref count API so that the FSMs can directly increase/decrease
    counts without having to know the actual struct gsm_subscriber_connection.
    But when the CM Service Request's success terminates the FSM to tell the subscr_conn about success,
    the ref count would decrease to zero, the conn would be discarded and we wouldn't wait for the
    MS to actually send the service request it has waiting.
    To solve, the subscr_conn master_fsm could add another ref count if it intends to keep the conn
    open, e.g. one count for pending paging transactions, one count awaiting the successful CM
    Service Ack; these counts are decreased when appropriate events take place.
    It's basically identical to just taking the appropriate action when events are received.
  • Instead we could forget about subscr_conn ref counts and let the subscr_conn->master_fsm decide.
    As soon as the conn is accepted (be it due to successful LU, CM Service Req or Paging Response with
    authen/ciph), the subscr_conn goes through all llists of pending operations, and closes itself
    and the conn by transitioning to the RELEASED state when nothing pending is found. After async
    actions (e.g. sending SMS?), the master_fsm could receive a BUMP event to re-check whether it is
    through yet. A flag indicates whether a CM Service Request was seen, meaning that we will wait
    for the master_fsm's state timeout instead of closing an idle conn immediately.
  • The current situation is that we have a bit of a mix: both a subscr_con_put() to zero as well
    as a master_fsm->S_RELEASED transition will discard the subscr_conn, but the two concepts aren't
    really working together: when we transition to RELEASED, the conn is discarded even if the
    ref count on the conn is not yet zero. Again, the conn->master_fsm could stay in the ACCEPTED
    state until the ref count goes to zero, but we do also need a mechanism to decide that this
    conn is invalid and needs to go now. i.e. we anyway need cleanup of pending requests, and
    this kind of obsoletes the idea of ref counts implicitly keeping the conn open.
  • recursion problems? Sudden conn release event cleans up pending items, ref count reaches zero, thus
    dispatches another conn release event? (if this shows up, could solve with an intermediate
    RELEASING state; currently I only have NEW, ACCEPTED and RELEASED)
  • We could have a ref list API that includes error callbacks to discard a ref prematurely.
    Like a struct embedded in each referencing struct, linking itself to the ref list.
    This would not be a nice-and-simple ref count anymore, but more like a garbage collection "framework".
    It could have benefits, but I doubt that we really need to go that far. Having
    tailored code to check whether a subscr_conn still has pending items is probably easier.
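
On the "need to stay <= 32" remark in the fine-grained events bullet: libosmocore's FSM states carry a 32-bit in_event_mask, so each event has to fit into one bit of a uint32_t. A small standalone sketch of that constraint, using the event names from the bullet; lu_in_event_mask and event_permitted() are made-up names, and nothing here is libosmocore API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum conn_event {
	LU_SUCCESS,
	LU_FAIL,
	CM_SERVICE_SUCCESS,
	CM_SERVICE_FAIL,
	PAGING_SUCCESS,
	PAGING_FAIL,
	CONN_NUM_EVENTS
};

/* One bit per event: adding a 33rd event would break the build here. */
_Static_assert(CONN_NUM_EVENTS <= 32, "events must fit into a 32-bit mask");

#define EV(e) (UINT32_C(1) << (e))

/* Events a hypothetical "LU in progress" state would accept. */
static const uint32_t lu_in_event_mask = EV(LU_SUCCESS) | EV(LU_FAIL);

/* True if the event's bit is set in the state's mask. */
static bool event_permitted(uint32_t in_event_mask, enum conn_event e)
{
	assert(e < CONN_NUM_EVENTS);
	return (in_event_mask & EV(e)) != 0;
}
```

Six events are far from the limit, but a separate success/failure pair per procedure uses up the bits twice as fast as a single "finished" event would.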

sort-of conclusion:
After writing this I tend towards:

  • have simply a "conn accepted" event sent from all of the LU/PARQ FSMs to the subscr_conn FSM
    (or a CN_CLOSE in case of failure to accept)
  • subscr_conn FSM looks up all pending actions and closes conn if done, receives BUMP event for async
  • all 04.08 messaging via vlr.ops from within the FSMs (instead of from subscr_conn FSM upon events)
  • the subscr_conn master_fsm manages conn lifetime, don't need subscr_conn ref counts
  • subscr_conn keeps flag whether CM Service Request was seen, in order to not close immediately
    (any conn acceptance failure clears this)
  • subscr_conn does not keep pointers to FSM instances, rejects multiple concurrent LU/PARQ
    by means of an intermediate CHECKING state that allows no further LU/CM Service/Paging response.
    If any concurrent requests appear, the whole conn is closed down.
    (Need to refine this in case we want to e.g. allow a paging response on the same conn that has
    already sent a LU and/or CM Service Request... is this needed? I'd like to assume that each of
    these are always received on a fresh new subscr conn to make things simpler.)
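
To make the tentative design above easier to discuss, here is a compact sketch of it, under stated assumptions: all names (struct conn, CONN_S_*, CONN_E_*, conn_dispatch()) are hypothetical, pending actions are reduced to a counter, and any request arriving on a conn that is no longer NEW closes the conn, which is the simplest reading of "the whole conn is closed down".

```c
#include <assert.h>
#include <stdbool.h>

enum conn_state { CONN_S_NEW, CONN_S_CHECKING, CONN_S_ACCEPTED, CONN_S_RELEASED };

enum conn_event {
	CONN_E_CM_SERVICE_REQ,	/* CM Service Request received */
	CONN_E_OTHER_REQ,	/* LU or Paging Response received */
	CONN_E_ACCEPTED,	/* "conn accepted" from the LU/PARQ FSM */
	CONN_E_CN_CLOSE,	/* failure to accept */
	CONN_E_BUMP,		/* an async action finished, re-check */
};

struct conn {
	enum conn_state state;
	bool cm_service_req_seen;	/* keep open for the MS's actual request */
	int pending;			/* stand-in for the lists of pending actions */
};

/* Close the conn if nothing is pending and no service request is expected. */
static void conn_release_if_idle(struct conn *c)
{
	if (c->pending == 0 && !c->cm_service_req_seen)
		c->state = CONN_S_RELEASED;
}

static void conn_dispatch(struct conn *c, enum conn_event ev)
{
	switch (ev) {
	case CONN_E_CM_SERVICE_REQ:
	case CONN_E_OTHER_REQ:
		if (c->state != CONN_S_NEW) {
			/* concurrent LU/PARQ: close the whole conn */
			c->cm_service_req_seen = false;
			c->state = CONN_S_RELEASED;
			break;
		}
		c->cm_service_req_seen = (ev == CONN_E_CM_SERVICE_REQ);
		c->state = CONN_S_CHECKING;
		break;
	case CONN_E_ACCEPTED:
		if (c->state != CONN_S_CHECKING)
			break;
		c->state = CONN_S_ACCEPTED;
		conn_release_if_idle(c);
		break;
	case CONN_E_BUMP:
		if (c->state == CONN_S_ACCEPTED)
			conn_release_if_idle(c);
		break;
	case CONN_E_CN_CLOSE:
		c->cm_service_req_seen = false;
		c->state = CONN_S_RELEASED;
		break;
	}
}
```

In this sketch a successful LU closes the conn right away, an accepted CM Service Request keeps it open, and a paging response keeps it open until a BUMP finds nothing pending. State timeouts, which would eventually close an idle conn after a CM Service Request, are left out.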

Ideas/preferences on any aspects are very welcome.

Actions #31

Updated by neels over 7 years ago

neels wrote:

For example, during the vlr_proc_acc_req FSM startup, the FSM immediately starts by a PR_ARQ_E_START event,
and in case the IMSI cannot be found, this directly terminates the FSM with VLR_PR_ARQ_RES_UNIDENT_SUBSCR.
So the pointer returned by vlr_proc_acc_req() has already been deallocated before that function exits.

The best way to get around this problem is to not dispatch any events in the allocating function.
First return the fi pointer, store, and only then dispatch events.
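
A standalone sketch of that ordering, not using libosmocore. All names here (struct fsm_inst, parq_alloc(), parq_start(), conn->parq_fi) are hypothetical. For simplicity the termination callback runs before the instance is freed, which differs from the osmo_fsm_inst_term() ordering described earlier; the sketch only shows why storing the pointer before dispatching the first event keeps it from dangling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct fsm_inst;
typedef void (*term_cb_t)(struct fsm_inst *fi, void *parent);

struct fsm_inst {
	bool imsi_known;
	term_cb_t term_cb;	/* tells the parent that fi is going away */
	void *parent;
};

struct conn {
	struct fsm_inst *parq_fi;	/* hypothetical pointer to the child FSM */
};

/* Allocate only; deliberately dispatch nothing in here. */
static struct fsm_inst *parq_alloc(void *parent, term_cb_t cb, bool imsi_known)
{
	struct fsm_inst *fi = calloc(1, sizeof(*fi));
	if (!fi)
		return NULL;
	fi->imsi_known = imsi_known;
	fi->term_cb = cb;
	fi->parent = parent;
	return fi;
}

/* START event: may terminate and free the instance immediately. */
static void parq_start(struct fsm_inst *fi)
{
	if (!fi->imsi_known) {
		fi->term_cb(fi, fi->parent);
		free(fi);
	}
}

/* Parent-side cleanup: drop the stored pointer when the child terminates. */
static void conn_parq_terminated(struct fsm_inst *fi, void *parent)
{
	struct conn *conn = parent;
	if (conn->parq_fi == fi)
		conn->parq_fi = NULL;
}

/* Store the pointer first, dispatch second. */
static void conn_start_parq(struct conn *conn, bool imsi_known)
{
	conn->parq_fi = parq_alloc(conn, conn_parq_terminated, imsi_known);
	if (conn->parq_fi)
		parq_start(conn->parq_fi);
	/* conn->parq_fi is now either valid or NULL, never dangling */
}
```

If parq_alloc() dispatched the START event itself, the caller would receive a pointer that may already have been freed, with no chance to have registered it for cleanup.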

Anyway, in most cases it seems like I will get away with not storing any explicit child pointers.

In order to be able to tell the subscr_conn whether a LU has been successful, I distinguished the
lu_compl_fsm's termination into a success and a failure event, did the same with the lu_fsm, and
now the lu_compl_fsm's result trickles through to the subscr_conn fsm. Looking good so far.

In other words, a basic scheme for nested FSMs communicating outwards is worked out.
I'm using DONE states' onenter() events instead of "brutally" terminating FSMs.
The parent FSM may then decide to terminate children, or even keep them until own termination.

I'm trying to move more towards "formal" FSM parts (onenter() and events) instead of external
"magic" functions.

The subscr_conn unit test shall grow to cover "all" cases of LU, CM Service Request, Paging, with
permutations of with-/without-authentication, -ciphering, -TMSI, -IMEI etc.. When those are worked
out, real world operation should also be ready... will verify a match every now and then.

Today, Steve asked me to estimate when I'd be done. It already does a lot of things well, but the
engine parts are still spread out on the workshop floor. I'm expecting stuff to be useful within
January (alpha), and I'd say the end of the first quarter of 2017 for beta stability.

(But frankly I have no real idea whether that is optimistic or pessimistic. It feels like I'll be
able to move fast, but these feelings tend to mislead.)

Actions #32

Updated by neels over 7 years ago

It turns out that some of my conclusions aren't as practical as I first thought.
I won't use as many words this time, but take the previous comments with a pinch of salt.

I'm figuring out more and more details in my understanding of BSC, MSC, VLR and HLR interactions.
The unit test so far has LU, CM Service and USSD with/out authentication and ciphering.
Currently I'm deciding when and how to discard subscriber conns; maybe a ref count will help after all.

Actions #33

Updated by neels about 7 years ago

re "when and how to discard subscriber conns": on the neels/vlr branch, a subscr conn is now
owned by the conn_fsm and discarded when entering the RELEASED state.

The only exception is that libbsc still wants to discard a conn when compl_l3() returned
failure. Placed a temporary workaround in the form of an ownership flag, which will go away
when we separate the MSC subscriber connection struct from the BSC (commented in-code).

The various unit tests are looking good with this ownership scheme, ensuring that a conn is
discarded implicitly or stays open as needed, and checking when "non-initial" messages are allowed.
Also added Subscriber Detach messages (which used to crash the NITB before).

next up:

  • actually add Paging as a unit test (the counterpart to CM Service Request in the Acc Req FSM)
    paired with MT SMS.
  • so far both a gsm_subscriber and a vlr_subscriber are around, neutralize the gsm_subscriber.
  • duplicate all unit tests with TMSI-required (and continue with other special cases).
  • add timeouts to all conn_fsm states and verify effectiveness, with faked time.
  • add more unit tests for rejection in various stages.

I often feel tempted to also digress and separate the MSC's subscriber conn struct from the BSC's,
but so far I'm resisting that to push the VLR forward instead.

Q: It seems that the Ciphering Mode Complete is basically optional in the specs. If the MS
doesn't send this in reply to a Ciphering Mode Command, the conn stays un-ciphered. The ciphering
unit test so far omits the Ciphering Mode Complete, it could be enhanced to actually cipher.
I'm thinking it might be a nice feature to enforce waiting for the MS to start ciphering before e.g.
sending SMS upon a paging response, but so far it seems that that's not part of the specs' state
machines? --> check that later / compare with 3G where, IIRC, we do wait for ciphering complete.

Actions #34

Updated by laforge about 7 years ago

On Wed, Jan 04, 2017 at 03:15:13AM +0000, neels [REDMINE] wrote:

Q: It seems that the Ciphering Mode Complete is basically optional in the specs. If the MS
doesn't send this in reply to a Ciphering Mode Command, the conn stays un-ciphered.

where do you get this from? Rel 1998 04.08 states:

When the appropriate action on the CIPHERING MODE COMMAND has been
taken, the mobile station sends back a CIPHERING MODE COMPLETE message.
If the "cipher response" field of the cipher response information
element in the CIPHERING MODE COMMAND message specified "IMEI must be
included" the mobile station shall include its IMEISV in the CIPHERING
MODE COMPLETE message.

The erroneous cases (MSC sends unsupported algorithm, ciphering already
enabled) should send back a RR STATUS message with "Protocol Error
unspecified". Also, the BSC could reject the A-Interface CIPH MODE CMD
in case of errors.

Even in 44.018 Release 12 Chapter 3.4.7:


Whenever the mobile station receives a valid CIPHERING MODE COMMAND
message, it shall, if a SIM is present
and considered valid by the ME and the ciphering key sequence number
stored on the SIM indicates that a ciphering
key is available, load the ciphering key stored on the SIM into the ME.
A valid CIPHERING MODE COMMAND
message is defined to be one of the following:
- one that indicates "start ciphering" and is received by the mobile
station in the "not ciphered" mode;
- one that indicates "no ciphering" and is received by the MS in the
"not ciphered" mode; or
- one that indicates "no ciphering" and is received by the mobile
station in the "ciphered" mode.

Other CIPHERING MODE COMMAND messages shall be regarded as erroneous, an
RR STATUS message with cause "Protocol error unspecified" shall be
returned, and no further action taken.

[...]

When the appropriate action on the CIPHERING MODE COMMAND has been
taken, the mobile station sends back a CIPHERING MODE COMPLETE message


From the VLR point-of-view, 29.002 defines MAP_SET_CIPHERING_MODE as
an unconfirmed service. But that doesn't mean that the MSC will
communicate with the MS in absence of ciphering being confirmed by the MS.

The ciphering unit test so far omits the Ciphering Mode Complete, it
could be enhanced to actually cipher. I'm thinking it might be a nice
feature to enforce waiting for the MS to start ciphering before e.g.
sending SMS upon a paging response, but so far it seems that that's
not part of the specs' state machines? --> check that later / compare
with 3G where, IIRC, we do wait for ciphering complete.

Clearly, if a network has a policy to use encryption, no CC, SMS, or
other user data should be communicated unless the ciphering has been
established [and there is a common cipher supported by MS and BTS, and
that cipher is permitted by operator policy].

See 43.020 Ch. 4.5 "No information elements for which protection is
needed must be sent before the ciphering and deciphering processes are
operating."

Regards,
Harald

--
- Harald Welte <> http://laforge.gnumonks.org/ ============================================================================
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)

Actions #35

Updated by neels about 7 years ago

I got the impression from two things: a dim memory of a conversation with Holger
which I may have misunderstood (something like, when ciphering is enabled, the MS may still send
unencrypted messages until it sends the ciphering mode complete, meaning that we sort of have
to expect both encrypted and unencrypted messages to come in).
And secondly from the FSM graphs in the 3GPP specs lacking a Ciphering Mode Complete event.

details{
3GPP TS 23.012 version 6.4.0 Release 6, page 11 in the FSM graph "process Update_Location_Area_MSC",
there is a "Cipher Command" arrowed-box on the bottom right, sending a Ciphering Mode
Command to the MS, and then the FSM goes on to state "Wait_For_TMSI".
On the next page, where the Wait_For_TMSI continues, there is no item to
receive a ciphering mode complete from the MS. Hence we also don't have such a wait
state in our FSMs yet. Basically the same in "procedure Location_Update_Completion_VLR", p.23,
and in TS 23.018 "Procedure Process_Access_Request_MSC" p.29.
It looks like the "New TMSI accepted" (23.012, p.12, 4th from the bottom) and
"Wait_For_TMSI_Reallocation" (23.018, p.29) imply that ciphering is enabled,
or that the VLR doesn't care about the outcome of ciphering.
}

Looking into it now, I found some more info:

In TS 43.020 chapter 5 "Synthetic Summary" (p.29) the graph shows that ciphering
should be established before the "Location Updating Complete" message.

I guess 43.020 p.33 "Scheme 1 (concluded)" explains why I'm confused:
there's a "start ciphering" from the VLR to the BSS/MSC, and the BSS/MSC handles
the Ciphering Mode Command and waits for Ciphering Mode Complete. So the VLR level
does not get any reply on the ciphering completion. Does that mean we should have
a separate FSM outside of the VLR? I assume it would be simpler to integrate this
in our existing VLR FSMs instead.

Currently our VLR ops have .tx_lu_acc() to directly send a LU Accept to the MS.
According to spec that would instead be a kind of .lu_success() op to tell
the BSS/MSC to complete ciphering, and only after that we would tx a LU Accept.

I would instead add another state after the "Set Ciphering Mode" and
before we call ops.tx_lu_acc(), awaiting an event back from the MSC code
when the Ciphering Mode Complete is received:

43.020 spec:

  MS          BSS/MSC                  VLR
                 <---start ciphering----
                 <---LU acc-------------
   <--Ciph Cmd----
   ---Ciph Cmpl-->
   <--LU Acc------
   ---TMSI Cmpl-->
                 ---TMSI ack----------->

my plan:

  MS          BSS/MSC                     VLR
                 <---ops.set_ciph_mode()---
   <--Ciph Cmd----
   ---Ciph Cmpl-->
                 ----ciph ack------------->
                 <---ops.tx_lu_acc()-------
   <--LU Acc------
   ---TMSI Cmpl-->
                 ----TMSI ack------------->
                 <---conn ACCEPTED --------

This makes more sense to me: the existing FSM in libvlr can simply
handle the waiting, instead of adding state in libmsc. Do you agree?
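The extra wait state from the plan above can be sketched as a tiny state machine. This is only an illustration under assumptions: the state, event and struct names (`lu_state`, `lu_event`, `lu_fsm`) are hypothetical and are not the libvlr FSM definitions, and the `lu_accept_sent` flag merely stands in for the point where ops.tx_lu_acc() would be called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states and events, named for this sketch only. */
enum lu_state {
	LU_ST_WAIT_CIPH,      /* Ciphering Mode Command sent, waiting for ciph ack */
	LU_ST_WAIT_TMSI_CMPL, /* LU Accept sent, waiting for TMSI Realloc Complete */
	LU_ST_DONE,           /* conn accepted */
	LU_ST_FAILED,         /* conn lost before completion */
};

enum lu_event {
	LU_EV_CIPH_ACK,  /* MSC code saw Ciphering Mode Complete */
	LU_EV_TMSI_ACK,  /* MSC code saw TMSI Reallocation Complete */
	LU_EV_CONN_LOST, /* radio connection released */
};

struct lu_fsm {
	enum lu_state state;
	bool lu_accept_sent; /* stands in for calling ops.tx_lu_acc() */
};

/* Called right after the VLR asked the MSC to set the ciphering mode. */
void lu_fsm_start(struct lu_fsm *f)
{
	f->state = LU_ST_WAIT_CIPH;
	f->lu_accept_sent = false;
}

void lu_fsm_dispatch(struct lu_fsm *f, enum lu_event ev)
{
	switch (f->state) {
	case LU_ST_WAIT_CIPH:
		if (ev == LU_EV_CIPH_ACK) {
			/* Only now may the LU Accept (and with it the TMSI) go out. */
			f->lu_accept_sent = true;
			f->state = LU_ST_WAIT_TMSI_CMPL;
		} else if (ev == LU_EV_CONN_LOST) {
			f->state = LU_ST_FAILED;
		}
		/* Any other event is ignored while ciphering is unconfirmed. */
		break;
	case LU_ST_WAIT_TMSI_CMPL:
		if (ev == LU_EV_TMSI_ACK)
			f->state = LU_ST_DONE;
		else if (ev == LU_EV_CONN_LOST)
			f->state = LU_ST_FAILED;
		break;
	default:
		/* DONE and FAILED are terminal in this sketch. */
		break;
	}
}
```

The point of the sketch is the ordering: nothing is sent to the MS between the Ciphering Mode Command and the ciph ack, so a TMSI ack arriving early leaves the state unchanged.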

Side thought: am I integrating libmsc with libvlr too closely?
Should I instead keep struct vlr_subscriber out of libmsc to keep the
entities separate? Could this bite us in the future like the MSC split effort?

Actions #36

Updated by laforge about 7 years ago

On Wed, Jan 04, 2017 at 03:12:16PM +0000, neels [REDMINE] wrote:

I got the impression from two things: a dim memory of a conversation with Holger
which I may have misunderstood (something like, when ciphering is enabled, the MS may still send
unencrypted messages until it sends the ciphering mode complete, meaning that we sort of have
to expect both encrypted and unencrypted messages to come in).

Well, you have to consider (as always) that messages are still in
flight. So the time the VLR/MSC sends the CIPH MOD CMD to the BSC is
not the same that the BSC sends the RSL to the BTS and not the same as
the BTS sends it (the first time) to the MS, or the time one or multiple
LAPDm retransmissions happen, etc. So it might very well be a second or
so until the MS has received the message. Meanwhile, it may keep
sending unencrypted frames.

However, the above 'ciphering synchronization' is entirely implemented
in the BTS, as it needs to enable downlink and uplink encryption at
different moments in time, etc.

Messages on Abis or A carry no indication whether they were encrypted
on the air interface or not. The only way to know is implicitly by
whether they were received before or after the Cipher Mode Complete (or
the corresponding equivalent on Abis/A).

In TS 43.020 chapter 5 "Synthetic Summary" (p.29) the graph shows that ciphering
should be established before the "Location Updating Complete" message.

We shouldn't send a TMSI reallocation (explicit or implicit inside a LU
ACCEPT) before ciphering is known to be active. Otherwise we leak the
TMSI in cleartext, rendering the use of the TMSI completely useless.

Also, we shouldn't proceed with any other non-MM procedures/services
like MO/MT SMS, calls, USSD, etc. until ciphering is known to be active.

I guess 43.020 p.33 "Scheme 1 (concluded)" explains why I'm confused:
there's a "start ciphering" from the VLR to the BSS/MSC, and the BSS/MSC handles
the Ciphering Mode Command and waits for Ciphering Mode Complete. So the VLR level
does not get any reply on the ciphering completion. Does that mean we should have
a separate FSM outside of the VLR? I assume it would be simpler to integrate this
in our existing VLR FSMs instead.

Yes, in the spec it's out of scope for the VLR. If the MSC would never
receive a ciphering mode complete, the MSC would probably simply release
the radio connection at some point, and the VLR would get an
indication of that.

Currently our VLR ops have .tx_lu_acc() to directly send a LU Accept to the MS.
According to spec that would instead be a kind of .lu_success() op to tell
the BSS/MSC to complete ciphering, and only after that we would tx a LU Accept.

That's one of the points where I think it adds abstraction for the sake
of abstraction, with little/no benefit? Hence I decided to put it
directly into a call-back function. The call-back was needed to
differentiate the MSC/SGSN use case, where the LU ACCEPT message is
formatted differently.

If you have reason to change that, feel free to do so. The code has
evolved after you took it over.

my plan:

looks great!

Side thought: am I integrating libmsc with libvlr too closely?
Should I instead keep struct vlr_subscriber out of libmsc to keep the
entities separate? Could this bite us in the future like the MSC split effort?

  • the MSC can fully depend on libvlr
  • libvlr should not depend on the MSC (in terms of [later?] SGSN re-use of libvlr)
  • the MSC split is about the A interface between MSC and BSC, and not
    between MSC and VLR

Actions #37

Updated by neels about 7 years ago

  • % Done changed from 40 to 50

- paging is implemented and tested (without auth, pending unit test for paging with auth and ciph)
- ciphering wait states as above are implemented and work for both LU and CM Service Request,
verified by subscr_conn unit test.

Actions #38

Updated by neels about 7 years ago

- paging with auth and ciph implemented, verified by subscr_conn_test.

Actions #39

Updated by neels about 7 years ago

next up:

  • duplicate all unit tests with TMSI-required (so far all are with IMSI only).
  • so far both a gsm_subscriber and a vlr_subscriber are around, neutralize the gsm_subscriber.
  • add timeouts to all conn_fsm states and verify effectiveness, with faked time.
  • add more unit tests for rejection in various stages.

Actions #40

Updated by neels about 7 years ago

unit tests now verify that the VLR basically can do:

  • Location Updating
  • CM Service Request (with USSD)
  • Paging (with SMS delivery)
  • Detach

with these combinations:

identity handling (columns): use IMSI only | use TMSI | require IMEI | TMSI+IMEI | retrieve IMEISV
security (rows): no auth | auth | auth+ciphering

So far only A5/1 is tested.

next up:

  • neutralize the gsm_subscriber.
  • more tests (timeouts, rejections)
  • add unit test with 3G auth+ciph

Actions #41

Updated by neels about 7 years ago

BTW, the neels/vlr branch is getting excessively long. It might be good to collapse its commits
to simply add the resulting VLR code, with changes to pre-existing code in separate commits.

Actions #42

Updated by neels about 7 years ago

On the road to using only vlr_subscriber in libmsc.
gsm_subscriber.extension, .imsi and .tmsi are removed, ensuring that all callers use gsm_subscriber.vsub->*.

This has highlighted use of struct gsm_subscriber in libbsc+osmo-bsc as well as osmo-sgsn.
To properly separate the realms, I created a BSC subscriber (struct bsc_sub) to be used in libbsc,
as well as a GPRS subscriber (struct gprs_sub) with the necessary few separate API functions.
Thus each subscriber struct neatly contains only those elements needed in that area.
Each area has its own llist and ref counting -- slight code dup, but IMHO it's worth it.
(Will resolve sgsn_test failure tomorrow, probably a minor GPRS subscriber detail;
probably some other details in hiding, needs testing with a real BTS.)

This naturally creates a separation where I convert a vlr_subscriber to a bsc_subscriber to pass to BSC API,
in order to do paging. One could see it as unnecessary copying of IMSI/TMSI, but it reflects the
road forward to an A interface, where this "conversion" happens via an A message being sent to the BSC.

I can now completely dismantle the gsm_subscriber and finally replace with a vlr_subscriber
without affecting BSC nor GPRS lands. There is no more confusion possible on who uses what when.

next up: go on until gsm_subscriber no longer exists ... and verify that things still work with real hardware.

Actions #43

Updated by neels about 7 years ago

sgsn_test failure fixed.

Ran test with physical sysmoBTS and phone, and (to my positive surprise) things basically just work.
Some minor problems exist, of course.

The VLR closes the CM Service Request too soon when authorization is off,
and paging is not being kicked off.

(details: so far I set a flag that a CM Service Request came in, keeping the conn open,
but as soon as the first request was served, this flag is turned back off. Now the phone
sends a UTRAN Classmark Change that for some reason gets passed on and thus discards the conn.
Notably, when authorization is required, the UTRAN Classmark Change is sent in the middle
of the auth messaging, and a CM Service Request thus works as expected, with the conn
staying open until the first actual request by the MS has been served.)

LU works, CM Service Request works (with auth), USSD works,
GPRS works (interesting because of the new struct gprs_sub).

next up:
  • I'll read up in the specs when the CN shall "officially" close a conn after a CM Service Request.
  • I'll fix the paging trigger,
  • and then test voice calls; the MO call setup looks promising already.
  • The assign-tmsi config apparently doesn't kick through to the VLR cfg yet.
  • I should add subscr_conn_tests for LU-with-TMSI.

Actions #44

Updated by neels about 7 years ago

We need to clarify subscriber IDs: previously, with only one hlr.sqlite3 for the NITB also
storing the SMS, there was one central subscriber ID which always made perfect sense.
With SMS still stored in the old db, but the VLR and HLR being different entities, we
need to either sync those IDs, or rather move away from using IDs altogether, for the
benefit of another key (IMSI?) to match things up.

At least that's why SMS aren't dispatched from the queue and paging isn't even remotely attempted:
we're looking up SMS in the DB by ID, but the old gsm_subscriber->id is not populated from the VLR
because I so far intended for it to go away entirely / to stay within the VLR... will see about that.

To clarify, the paging in the subscr_conn unit test is working, because there, sending of an SMS is
kicked off explicitly; so this is just about why a real setup doesn't kick off sms to cause paging.

(Why voice call paging isn't working may be due to other reasons, not checked that yet.)

Actions #45

Updated by neels about 7 years ago

The voice call paging doesn't happen simply because gsm_subscriber->lac is no longer populated.
So deciding to first carry on with neutralizing gsm_subscriber completely, because most likely
these issues will be resolved in the course of that anyway.

Actions #46

Updated by neels about 7 years ago

As said before, subscriber->id needs some thought.
Related are also subscriber->authorized, ->lac, possibly others.
We previously kept these values in our hlr.sqlite3.

The osmo-hlr could add these to the DB, but also the GSUP protocol would need to be extended.
So some thought should go into where we're going from here.

  • id: osmo-hlr has a unique subscriber id in its database. We also have VTY commands to access
    subscribers by id, SMS storage API uses the id, and vlr_subscriber has an ->id. We'd need to
    add the id to GSUP's Insert Data to allow using the HLR's subscriber id everywhere.
  • authorized: previously, we blocked even basic access if the IMSI didn't match certain criteria
    that allowed thwarting invalid IMSIs in policy 'closed'.
    Apart from MCC+MNC matching the IMSI, we had the authorized flag to allow specific subscribers
    to talk to the MSC. So far the VLR code acts like the 'open' policy and leaves it up to the HLR
    to reject IMSIs as unknown. Setting a subscriber as authorized is achieved by adding it to
    osmo-hlr's database; GSUP communication is always necessary to find out.
    Is this sufficient? In other words, should we drop our IMSI based access control rules?
    If not, should 'authorized' go into osmo-hlr or stay with the VLR somehow, e.g. in the sqlite db
    where we still keep the pending SMSes? (this SMS storage might go away in future, though...)
    I think it would be good to have a mechanism in osmo-hlr to have a subscriber record that is
    marked not-authorized.
  • LAC: IIUC the HLR should remember where a subscriber was last seen. So we need to add this
    to osmo-hlr and the GSUP protocol ... right??
  • name: I never really understood the point of the subscriber->name. Should we keep it?
    add to HLR and GSUP?

BTW, what about OsmoNITB's 'subscriber create' command? We will drop this and say
"go to osmo-hlr instead", right?

Actions #47

Updated by neels about 7 years ago

neels wrote:

  • id: [...] vlr_subscriber has an ->id.

correction: vlr_subscriber doesn't actually have an id field (yet)

Actions #48

Updated by neels about 7 years ago

idea aka premature optimisation: convert the IMSI to int64_t to use as numeric id:
with max 15 decimal digits, an IMSI as integer fits in an int64_t [1].
This would optimize comparing IDs (instead of using IMSIs) because it needs no strcmp().

It would lose the benefit of short subscriber IDs as VTY interface: even if there were
only 100 IMSIs in your system, the IDs would still have 15 digits, and the user might
as well just use 'subscriber imsi 123...' instead.

So it would mostly be a feat to still allow using a subscriber->id field with
legacy API (previously using unsigned long long args, i.e. large enough),
with IMSI based prevention of ID collisions.

I think I'll just do this until we have decided on dropping the ID vs. using the one
from the HLR.

[1] hex(999999999999999) == '0x38d7ea4c67fff', being 12-and-a-half nibbles or 50 bits.

Actions #49

Updated by laforge about 7 years ago

On Fri, Jan 13, 2017 at 12:33:17AM +0000, neels [REDMINE] wrote:

idea aka premature optimisation: convert the IMSI to int64_t to use as numeric id:
with max 15 decimal digits, an IMSI as integer fits in an int64_t [1].
This would optimize comparing IDs (instead of using IMSIs) because it needs no strcmp().

I'd say let's postpone such unrelated changes for now. I doubt that
string compare is the kind of performance we need to worry about. You
can start with replacing all the linear lists with more efficient data
structures first. But even that only as needed (as profiling shows) and
not urgently right now. Thanks :)

Actions #50

Updated by neels about 7 years ago

laforge wrote:

idea aka premature optimisation: convert the IMSI to int64_t to use as numeric id:

I'd say let's postpone such unrelated changes for now. I doubt that

sorry, writing "optimisation" was probably misleading. I'm not replacing the IMSI string,
rather looking for a simple way to have a subscriber->id. Particularly the sms_queue.c
relies on a numeric id that it can increment, using SQL queries to find all pending
SMS with an id > N to round robin. So in addition to the string IMSI, I'm now having
a vlr_subscriber->id that is simply populated by atoll(vsub->imsi). It spares us
having to rewire large parts of the code to work without an id for now.
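A minimal sketch of deriving such a numeric id from the IMSI string, as a stricter variant of the plain atoll() call. The helper name `imsi_to_id` is hypothetical and not part of libvlr; unlike atoll(), it rejects anything that is not 1 to 15 decimal digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define IMSI_MAX_DIGITS 15

/* Hypothetical helper: return the IMSI as a number, or -1 if the string
 * is empty, longer than 15 characters, or contains a non-digit. */
int64_t imsi_to_id(const char *imsi)
{
	size_t len = strlen(imsi);
	int64_t id = 0;
	size_t i;

	if (len == 0 || len > IMSI_MAX_DIGITS)
		return -1;

	for (i = 0; i < len; i++) {
		if (!isdigit((unsigned char)imsi[i]))
			return -1;
		/* At most 15 digits, so the value stays below 10^15 < 2^50. */
		id = id * 10 + (imsi[i] - '0');
	}
	return id;
}
```

One caveat of any string-to-integer mapping, atoll() included: leading zeros are lost, so "001010000000001" and "1010000000001" yield the same id. That only matters if IMSIs of different lengths can coexist in the same system.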

I'm much more curious about what you think of extending GSUP and/or the osmo-hlr
for authorized, lac, name; see above https://osmocom.org/issues/1592#note-46
There's still time for that, but it would be good to know the general direction.

Actions #51

Updated by neels about 7 years ago

Now I see more answers by mail:

No, just remove the notion of an ID. It's an artefact of using a SQL
database with a unique row identifier.

above comment #50 stands ... we're so far still using SQL for the SMS storage.
We could of course move that to an in-memory llist, but then all SMS would
be lost by a VLR restart. I can drop the ID later.

to talk to the MSC. So far the VLR code acts like the 'open' policy and leaves it up to the HLR
to reject IMSIs as unknown. Setting a subscriber as authorized is achieved by adding it to
osmo-hlr's database; GSUP communication is always necessary to find out.

correct.

Is this sufficient? In other words, should we drop our IMSI based access control rules?

yes. yes. All this policy can be implemented in the HLR.

ack, thanks.

  • LAC: IIUC the HLR should remember where a subscriber was last seen. So we need to add this
    to osmo-hlr and the GSUP protocol ... right??

The LAC is not stored in the HLR, it is stored in the VLR. We introduce
this split among other things to align with standards, so let's keep it
that way. The HLR only stores the VLR address from which the last
location update was seen (which I believe it either already does, or it
is implicit as right now we only support one VLR connected to
osmo-gsup-hlr).

See TS 03.08 for a listing of which data is stored at which network
element.

At a VLR restart, all LAC information is lost (it is volatile), and the
VLR will need to page in all LACs in case of a MT transaction. Over
time, with LU's and MO and MT transactions going on, the LAC per
subscriber information is re-built.

Ah, indeed, there is a vlr_number in the hlr.db. Thanks for the explanation;
I could have looked it up myself, but indeed it's very helpful to get the
right factoid without searching around.

  • name: I never really understood the point of the subscriber->name.

the point is that it makes debugging a lot easier if you are looking at
log files and see the name of the subscriber, rather than just a MSISDN
or IMSI.

Should we keep it? add to HLR and GSUP?

Yes, but as an optional field. So if the HLR is osmo-gsup-hlr and it
provides this feature, then we use it. If not (e.g. in the case of a
GSUP-MAP translator) the field is empty and we have to fall back on
printing the IMSI (like now).

ack. First step is to prefer the name in vlr_sub_name() for log output,
and at some point will extend GSUP. There will be no path sending the
name back to the HLR; as above, the user has to go to the HLR db.
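The "prefer the name, fall back to the IMSI" rule for log output could look roughly like this. The struct and function names (`sub_info`, `sub_display_name`) are placeholders for illustration and do not reproduce the actual vlr_sub_name() code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical subset of subscriber data relevant for logging. */
struct sub_info {
	char imsi[16]; /* up to 15 digits plus NUL */
	char name[32]; /* optional; empty if the HLR did not provide one */
};

/* Pick the most informative label for log lines. */
const char *sub_display_name(const struct sub_info *sub)
{
	if (!sub)
		return "unknown";
	if (sub->name[0] != '\0')
		return sub->name;
	/* No name available, e.g. behind a GSUP-MAP translator. */
	return sub->imsi;
}
```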

BTW, what about OsmoNITB's 'subscriber create' command? We will drop this and say
"go to osmo-hlr instead", right?

yes. For the first many years of osmo-nitb, there was no 'subscriber
create' anyway.

ack.

(I'm starting to see that once the VLR is merged and the MSC is split, we will see
a lot of users' complaints about removed features =) ... I still see ways to preserve
some, like pointers between the MSC subscriber and the BSC subscriber for more
informative logging / 'show' commands / statistics in osmo-nitb. But even that
will be hard if we have a proper separation by A-interface.)

Actions #52

Updated by neels about 7 years ago

note to self: don't forget to look at what subscr_expire_callback() did and adapt in libvlr

Actions #53

Updated by neels about 7 years ago

We also have the 'token' auth policy, with libmsc/token_auth.c having a hardcoded
SMS text:

#define TOKEN_SMS_TEXT "HAR 2009 GSM.  Register at http://har2009.gnumonks.org/ " \
                        "Your IMSI is %s, auth token is %08X, phone no is %s." 

It appears that this hasn't been used in a long time and can be dropped?
Nevertheless I dimly remember a use case where a vessel in international waters
would like to hand out tokens to guests to allow easy GSM access... related?

For now I'm completely '#if 0' disabling token_auth.c.

Actions #54

Updated by laforge about 7 years ago

On Fri, Jan 13, 2017 at 04:36:24PM +0000, neels [REDMINE] wrote:

We also have the 'token' auth policy, with libmsc/token_auth.c having a hardcoded
SMS text:

just remove it, it hasn't been used since 2009.

For now I'm completely '#if 0' disabling token_auth.c.

remove it completely.

Actions #55

Updated by neels about 7 years ago

Q: how long do we keep a subscriber in the VLR? i.e. how long will we remember the TMSI we assigned?
Details: Upon LU, we fetch a subscriber's details from osmo-hlr via GSUP.
We assign a new TMSI. Requests may be served and so on, and finally the subscriber sends a DETACH.
At this point I would assume that we remove the vlr_subscriber from the VLR's in-memory list.
Only, if the subscriber comes back and asks for a LU with the previous TMSI, we no longer have it.
Of course we can go on and send an ID request, and the IMSI will be sent by the subscriber.
In the unit tests I would like to play through the case where there is a LU using a previous TMSI,
and I notice that after a proper DETACH, we would always ask for the IMSI again.
Should we keep a subscriber some longer? with a timeout? Or, maybe, store the last TMSI in osmo-hlr?
How large do we want the in-memory list of the VLR to become? Do we have "infinite" space there?
If we keep any and all subscribers, we should probably push a subscriber to the start of the list
with each activity, to avoid iterating all non-active subscribers for every single MM message...?
Opinions/hints welcome.
For now I'll deallocate vlr_subscribers upon IMSI DETACH, and will comment-out the DETACH in the
unit test to play through a TMSI LU that omits an ID request for the IMSI.

Actions #56

Updated by laforge about 7 years ago

Hi neels,

On Tue, Jan 17, 2017 at 09:37:10PM +0000, neels [REDMINE] wrote:

and I notice that after a proper DETACH, we would always ask for the IMSI again.

Yes, that's how the design works. You can disable IMSI DETACH
functionality in system information, and then you would suddenly only
have implicit DETACH after the periodic LU is missing.

As IMSI DETACH is specified to be inherently insecure (there's no way to
authenticate it, let alone require authentication), I would think that a
safe/sane network configuration these days disables IMSI DETACH anyway.

Should we keep a subscriber some longer? with a timeout? Or, maybe, store the last TMSI in osmo-hlr?

It is clear that from a GSM architectural point of view, the HLR has no
business in storing the TMSI, so that is not the proper response.

How large do we want the in-memory list of the VLR to become? Do we have "infinite" space there?

I think we have two aspects here: Limiting the memory footprint, and
worrying about lookup times. With stupid linear list iteration, I
wouldn't want to keep more than needed. If we'd use a hash table hashed
by TMSI, I would say keep it until a user-defined memory limit is
reached and then evict/replace the oldest entries first?
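A rough sketch of the idea of a table hashed by TMSI with oldest-first eviction. All names here (`tmsi_cache`, `tmsi_entry`, `max_entries`) are made up for illustration, an entry count stands in for the user-defined memory limit, and the eviction scans all buckets for simplicity; a real implementation would more likely keep a separate age-ordered list.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TMSI_BUCKETS 64

struct tmsi_entry {
	uint32_t tmsi;
	uint64_t last_used;      /* logical clock value of the last activity */
	struct tmsi_entry *next; /* bucket chain */
};

struct tmsi_cache {
	struct tmsi_entry *bucket[TMSI_BUCKETS];
	size_t count;
	size_t max_entries;      /* stands in for the memory limit */
	uint64_t clock;
};

/* Multiplicative hash; the top 6 bits select one of 64 buckets. */
static unsigned int tmsi_hash(uint32_t tmsi)
{
	return (uint32_t)(tmsi * 2654435761u) >> 26;
}

/* Look up a TMSI and mark the entry as recently used. */
struct tmsi_entry *tmsi_cache_find(struct tmsi_cache *c, uint32_t tmsi)
{
	struct tmsi_entry *e;

	for (e = c->bucket[tmsi_hash(tmsi)]; e; e = e->next) {
		if (e->tmsi == tmsi) {
			e->last_used = ++c->clock;
			return e;
		}
	}
	return NULL;
}

/* Remove the entry with the smallest last_used value (O(n) scan). */
static void tmsi_cache_evict_oldest(struct tmsi_cache *c)
{
	struct tmsi_entry **oldest = NULL;
	struct tmsi_entry **pp;
	unsigned int i;

	for (i = 0; i < TMSI_BUCKETS; i++) {
		for (pp = &c->bucket[i]; *pp; pp = &(*pp)->next) {
			if (!oldest || (*pp)->last_used < (*oldest)->last_used)
				oldest = pp;
		}
	}
	if (oldest) {
		struct tmsi_entry *victim = *oldest;
		*oldest = victim->next;
		free(victim);
		c->count--;
	}
}

/* Add a TMSI, evicting the oldest entry first if the limit is reached. */
struct tmsi_entry *tmsi_cache_add(struct tmsi_cache *c, uint32_t tmsi)
{
	struct tmsi_entry *e = tmsi_cache_find(c, tmsi);
	unsigned int h = tmsi_hash(tmsi);

	if (e)
		return e;
	if (c->max_entries == 0)
		return NULL;
	if (c->count >= c->max_entries)
		tmsi_cache_evict_oldest(c);

	e = calloc(1, sizeof(*e));
	if (!e)
		return NULL;
	e->tmsi = tmsi;
	e->last_used = ++c->clock;
	e->next = c->bucket[h];
	c->bucket[h] = e;
	c->count++;
	return e;
}
```

Lookups touch only one bucket chain, so the cost per MM message no longer grows with the number of inactive subscribers the way a linear list walk does.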

Actions #57

Updated by neels about 7 years ago

General status update:

The unit test shows that the new libvlr is working for all core operations:
  • Location Updating, CM Service Requests, Paging
  • Without authentication, with authentication, with authentication and ciphering
  • Without TMSI, with TMSI, for known and for unknown TMSI sent by MS

Operation with real BTS and phone hardware did not work at the last attempt because the
gsm_subscriber struct had not yet been fully replaced by the new vlr_subscriber.
In the last week, gsm_subscriber has been almost completely removed, which
has highlighted places in the code that still needed to be switched over
to using vlr_subscriber -- done now. So it's looking good for real tests as well.

Immediate tasks:
  • completely 100% remove gsm_subscriber from the code base (almost done).
  • confirm with real equipment that the VLR now indeed works as expected.
  • remove other parts that are now merely dead code.
Pending after that:
  • stability work: ensure that various failure scenarios are handled well, both with unit tests as well as real equipment.
  • code review: ensure that legacy code paths are disabled / deprecated / show sensible error messages.
  • start merging libvlr changes to openbsc master.

Actions #58

Updated by neels about 7 years ago

  • Related to Support #1922: comprehensive test of MSC subscriber connection and request handling added

Actions #59

Updated by msuraev about 7 years ago

  • Related to Feature #1860: Include Ubuntu 16.04 LTS in nightly builds added

Actions #60

Updated by neels about 7 years ago

  • Related to deleted (Feature #1860: Include Ubuntu 16.04 LTS in nightly builds)

Actions #61

Updated by neels about 7 years ago

RIP gsm_subscriber, it now no longer exists on the vlr branch (replaced by vlr_subscriber, bsc_sub and gprs_sub).

In effect, tests with real sysmoBTS and actual phones show that paging for voice calls is now working,
as are the voice calls.

Remaining problems:

SMS

With SMS, the problem is still the legacy db. Obtaining an SMS from the DB entails a JOIN with the no longer
populated Subscriber table (to pick only those that are active, i.e. lac >= 0).
I'll have to rethink the SMS code.

How to deal with SMS that have no active subscriber: in the long term we will need a plan
for undelivered SMS that show up with a different VLR, so more of a quick hack is
probably sufficient at this point, as long as we assume that there is only one VLR.

The SMS code should also be moved away from libdbi to native sqlite.
I will see whether doing both changes in one go makes sense.
The database file should probably also be renamed to sms.db.

CM Service Request

I still discard the conn as soon as any DTAP is served after a CM Service Request,
in effect they are thus broken without authentication (because of the classmark
change being sent) and also with both auth+ciph (because of some other message
being sent before the actual SMS or USSD); in short, CM Service Request only
works when authentication is enabled and ciphering is not.
-> Find out from the specs what the policy should be. Failing that "ignore"
exactly those messages that currently close the conn too early.

Next: fix CM Service Request for all cases, should be easy.

Actions #62

Updated by neels about 7 years ago

neels wrote:

CM Service Request
auth+ciph (because of some other message being sent before the actual SMS or USSD)

(Actually, only SMS fail to be sent, a USSD service request works with auth+ciph.
Will find out the difference and fix.)

Actions #63

Updated by neels about 7 years ago

neels wrote:

(Actually, only SMS fail to be sent, a USSD service request works with auth+ciph.
Will find out the difference and fix.)

The difference is actually the phone: the 2G phone that sends no UTRAN Classmark Change works fine.
With the smart phone the odd UTRAN Classmark Change causes the CM Service Request to close early.
So ignoring that message should fix it, reading the specs can't hurt either.

Actions #64

Updated by neels about 7 years ago

If I interpret Figure 4.1a / 3GPP TS 04.08 correctly, I should wait for the first message
after a CM Service Accept, but it's a bit of a stretch to interpret it that way.

I'm so far "ignoring" all RR messages in the sense that they don't close the
CM Service Request initiated connection. It solves the problem; maybe further
insights will change this later.

Another detail regarding SMS: if the recipient is not active in the VLR (even though
it exists in the HLR), we currently reject the SMS with "Number not assigned".
It should probably ask the HLR and store the SMS for later instead.

Actions #65

Updated by neels about 7 years ago

(development is ongoing, mostly concerning UMTS AKA in MSC+VLR -> #1711)

Actions #66

Updated by neels about 7 years ago

Added mechanism where the MSC tells the VLR that a connection has timed out
and the VLR FSMs end gracefully so that e.g. a LU Reject is still sent.
This is similar to the old anchor timer of 5 seconds, needs to be consolidated
with the VLR timeouts / the various T timers from the specs; i.e. at some
point the VLR can be the timeout source -- still todo.

The subscr_conn vs. FSM inst deallocation is a bit tricky; if the FSM
is allocated in the conn's ctx, the FSM's cleanup function may already
free the FSM instance, after which its own free() causes a double free.
This was solved by placing the subscr_conn FSM instance under the
gsm_network context instead of the subscr_conn itself.
Once the BSC is split off, this could be changed to put the conn under
the FSM's talloc context (i.e. "reversed").

Actions #67

Updated by neels about 7 years ago

Taking a closer look at the timeouts in place in the code prior to libvlr integration: mostly there is
the above-mentioned 5 second anchor timer and a loc_operation->updating_timer.

Let's look at the behavior of the current master branch, before libvlr.

LU:

conn is established
5 seconds "anchor" timeout (hardcoded)
LU Request
5 seconds loc_operation->updating_timer (hardcoded) (If another LU comes in during this period it is directly rejected)
LU is complete, including potentially ID Request + Response and the entire Authentication and TMSI Reallocation Complete; or alternatively a LU Reject.

So there are 5 seconds timeout until the first LU Request on a conn, and after that,
if ID, Auth and TMSI negotiation take longer than another 5 seconds, the LU is rejected.
This is not any T<N> timer, just a hardcoded 5.

CM Service:

conn is established
5 seconds "anchor" timeout (hardcoded)
CM Service Request, authorization and all, up to the first CC, SMS or USSD request

i.e. a CM Service Request is stricter in that everything up to the first
"meaningful" request from the subscriber must happen within the 5 second
"anchor" timer. No "T" timers involved.

Paging:

This is fundamentally different in that the CN requests the MS to come back
with a Paging Response, and there are T timers involved to manage the time the
BSC+BTS+MS levels may take to 'wake up'. There is no open connection waiting.

For the sake of the Subscriber Connection FSM, I'm thus only looking at the
Paging Response part.

conn is established
5 seconds "anchor" timeout (hardcoded)
Paging Response, authorization and all, request sent to MS up to the first CC, SMS (or USSD) reply from the MS

Hence a Paging Response is even stricter still, where after the entire Paging
Response business, another DTAP goes out to the MS and needs to be replied
to within the 5 second hardcoded "anchor" timer.

What this means for the VLR

First the current state of the branch:

We have a subscr_conn->conn_fsm, a state machine that is in state NEW until the
valid subscriber on the other end is established. A successful Authorization
including ID Request(s) and TMSI negotiation leads to a conn_fsm state
ACCEPTED. For the FSM to reach the ACCEPTED state, at which DTAP messaging may
commence, I currently have a 5 second timer (since posting above comment).

For a LU, this is stricter than before, where technically between first conn
establishment and LU Request nearly 5 seconds were allowed to pass, followed by
another near 5 seconds until the subscriber is "ACCEPTED".

(I wonder whether the LU Request itself is actually the cause for creating the
conn and whether it is realistically possible to have a delay between
establishing the conn and receiving the LU Request. -->later)

For CM Service Request and Paging, this is more lenient than before. Now the
timeout stops as soon as ID + authenticity + TMSI are established. The MS may
now take as long as it wants to do DTAP (which hopefully starts other timers).

There are timeouts passed to some VLR states, but since they are 10 seconds and
far surpass the outer conn_fsm's 5 second timeout, they will never have any
effect. The FSMs would have been deallocated before half the time has passed.

Where we want to be:

Do we really need individual timeouts in the VLR's FSMs? It currently doesn't
seem so. If we have the conn_fsm timing out with, say, 10 seconds until the
ACCEPTED state is reached, the VLR's FSMs don't need to bother with it.

We do in fact want to avoid the following scenario: say the conn_fsm reaches
ACCEPTED and hence the timeout is lifted, and then the MS never replies. No
DTAP has kicked off and no timers have started, the conn stays open forever. So
probably the conn_fsm should have another timeout between the ACCEPTED state
and another, new state meaning "started the first real request/response
negotiation with its own timeouts", say a "COMMUNICATING" state. Say 5 seconds
until ACCEPTED, another 5 seconds until COMMUNICATING.
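
The two-stage timeout proposed above can be sketched as follows. This is only an illustrative model in Python, not the openbsc C code: the class and method names (ConnFsm, accepted, communicating, tick) and the RELEASED state are hypothetical, and time is passed in explicitly instead of using real timers.

```python
from enum import Enum, auto


class ConnState(Enum):
    NEW = auto()
    ACCEPTED = auto()
    COMMUNICATING = auto()
    RELEASED = auto()  # hypothetical terminal state for this sketch


class ConnFsm:
    """Hypothetical model of the conn_fsm timeouts, not the real code."""

    def __init__(self, now, accept_timeout=5.0, comm_timeout=5.0):
        self.accept_timeout = accept_timeout
        self.comm_timeout = comm_timeout
        self.state = ConnState.NEW
        # Deadline for reaching the next state; None means no timer runs.
        self.deadline = now + accept_timeout

    def accepted(self, now):
        """ID, authentication and TMSI are settled: start the second timer."""
        if self.state is ConnState.NEW:
            self.state = ConnState.ACCEPTED
            self.deadline = now + self.comm_timeout

    def communicating(self):
        """The first real request/response exchange has started."""
        if self.state is ConnState.ACCEPTED:
            self.state = ConnState.COMMUNICATING
            self.deadline = None  # from here on, timeouts come from elsewhere

    def tick(self, now):
        """Release the conn if the current deadline has passed."""
        if self.deadline is not None and now >= self.deadline:
            self.state = ConnState.RELEASED
            self.deadline = None
        return self.state
```

With the defaults, a conn that is accepted at t=3 but never starts a request is released at t=8, while a conn that reached COMMUNICATING is no longer subject to either timer.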

(Instead of hardcoding these timeouts, it can't hurt to make them VTY
configurable.)

When DTAP negotiation starts, say an SMS is delivered with its various ACK
dances, there will be timeouts involved. It seems that it would be kind of nice
to also handle these timeouts in the conn_fsm in the COMMUNICATING state. So
that when in the COMMUNICATING state, arbitrary code could tell this state to
now timeout in N seconds. When the timeout expires, the conn is torn down. Then
again, if e.g. SMS delivery times out, we probably want some callback to clean
up things, and instead of placing this cb pointer in the conn_fsm, we might as
well start individual osmo_timer instances which, when they time out, simply
send a BUMP event to the conn_fsm. (This BUMP already has the effect that the
conn's state concerning transactions, silent_call etc is checked and the conn
is closed if nothing more is pending.) If anything "harmful" happens, arbitrary
code can also send the conn_fsm a CN_CLOSE event to immediately trash the conn.
The COMMUNICATING state could BUMP every N seconds as a fallback for bugs, but
if said bugs fail to clean up the conn's state, any BUMPs would be in vain.
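
A minimal sketch of the BUMP vs. CN_CLOSE idea, again in Python and under assumptions: the Conn class, its fields and the sms_delivery_timed_out callback are hypothetical stand-ins, and the individual timers are represented only by the callback they would invoke.

```python
class Conn:
    """Hypothetical stand-in for a subscriber connection."""

    def __init__(self):
        self.transactions = set()  # e.g. pending CC or SMS transactions
        self.silent_call = False
        self.open = True

    def in_use(self):
        return bool(self.transactions) or self.silent_call

    def bump(self):
        """BUMP: close the conn only if nothing is pending any more."""
        if self.open and not self.in_use():
            self.open = False
        return self.open

    def cn_close(self):
        """CN_CLOSE: discard the conn immediately, pending work or not."""
        self.transactions.clear()
        self.silent_call = False
        self.open = False


def sms_delivery_timed_out(conn, trans_id):
    """Timer callback owned by the SMS code: clean up first, then BUMP."""
    conn.transactions.discard(trans_id)
    return conn.bump()
```

The sketch also shows the limitation noted above: if a buggy code path never removes its transaction, bump() keeps returning True and the conn stays open no matter how often it is bumped.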

Once a conn is COMMUNICATING, the conn is expected to stay open and timeouts
have to come from elsewhere. It could make sense to have FSMs for voice calls,
SMS and USSD messaging, with their own timeouts.

Also: add tests to ensure we disallow concurrent LU Req / CM Service / Paging
Responses, i.e. reject whichever comes second while any other of them is
already busy (like seen in LU of the pre-VLR code). Also to be on the safe
side, make sure that any second subscriber connection is rejected while another
one is still busy for a given subscriber (of course making double sure that a
broken conn can never stick around forever, so that we will never lock out a
subscriber by accident).
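
The "reject whichever comes second" rule could be modelled roughly like this. It is a hedged sketch: SubscriberGuard and its methods are hypothetical names, and keying by IMSI is an assumption made for the example.

```python
class SubscriberGuard:
    """Hypothetical per-subscriber record of the one procedure in progress."""

    def __init__(self):
        self._busy = {}  # IMSI -> name of the procedure currently running

    def begin(self, imsi, procedure):
        """Return True if accepted, False if another procedure is busy."""
        if imsi in self._busy:
            return False
        self._busy[imsi] = procedure
        return True

    def end(self, imsi):
        """Must run on completion, rejection and timeout alike."""
        self._busy.pop(imsi, None)
```

A test along these lines would start a LU for a subscriber, check that a CM Service Request for the same subscriber is rejected, and check that it is accepted again once end() has run, including after a timeout, so that the subscriber is never locked out.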

Actions #68

Updated by neels about 7 years ago

Thinking about timeouts: if we route RTP via an MGCP gateway (osmo-bsc_mgcp),
i.e. the BSC/RNC level sends the voice stream directly to a separate process, how
will the MSC be able to verify that an actual call is ongoing? What if the RTP
just stops due to some equipment failure and no call ending messages are ever
received by the MSC? How do we prevent this from staying a zombie conn forever?
(I assume this will turn out to be no problem, but I can't see it at the moment.)

Actions #69

Updated by laforge about 7 years ago

On Sat, Feb 11, 2017 at 10:44:25PM +0000, neels [REDMINE] wrote:

Thinking about timeouts: if we route RTP via an MGCP gateway (osmo-bsc_mgcp),
i.e. the BSC/RNC level sends the voice stream directly to a separate process, how
will the MSC be able to verify that an actual call is ongoing? What if the RTP
just stops due to some equipment failure and no call ending messages are ever
received by the MSC? How do we prevent this from staying a zombie conn forever?
(I assume this will turn out to be no problem, but I can't see it at the moment.)

The BSC will get periodic measurement reports from the BTS via RSL as
long as the channel is open, and it will get RSL signalling if the radio
channel is lost. This in turn translates to a release of the associated
A/SCCP connection towards the MSC. Do you think that's insufficient?
None of the above requires any signalling messages.

Actions #70

Updated by neels about 7 years ago

laforge wrote:

The BSC will get periodic measurement reports from the BTS via RSL as
long as the channel is open, and it will get RSL signalling if the radio
channel is lost. This in turn translates to a release of the associated
A/SCCP connection towards the MSC. Do you think that's insufficient?

If the BSC never tells the MSC that a call is still ongoing, we can't have a safety timeout in the MSC to tear down a stale connection. I'm trying to get it waterproof, but maybe this is too paranoid. In other words, once a voice call is open, we simply don't timeout on the conn. In the NITB we could bump the timeout on measurement reports, but in OsmoCSCN/3G that's no longer possible. So I guess we simply have to rely on BSC/RNC to not leave any stale connections, which is fair enough. I'll ask again if I see any problems with that later...

Actions #71

Updated by laforge about 7 years ago

On Mon, Feb 13, 2017 at 02:12:52PM +0000, neels [REDMINE] wrote:

If the BSC never tells the MSC that a call is still ongoing, we can't
have a safety timeout in the MSC to tear down a stale connection. I'm
trying to get it waterproof, but maybe this is too paranoid. In other
words, once a voice call is open, we simply don't timeout on the conn.

seems to make sense to me.

In the NITB we could bump the timeout on measurement reports, but in
OsmoCSCN/3G that's no longer possible. So I guess we simply have to
rely on BSC/RNC to not leave any stale connections, which is fair
enough. I'll ask again if I see any problems with that later...

Please note that SCCP (and SUA) have Rx+Tx interval timers for each
established SCCP connection. The default is 7/15 seconds. So if the
BSC loses some connection, the assumption is that it also loses the
1:1 mapped SCCP signalling connection and thus there would be a time-out
on SCCP level which leads to a release of the SCCP connection and thus
also a release of the related MSC state. So if the BSC loses state of
a single SCCP connection or the BSC is re-started or whatever else, SCCP
T(iar) and T(ias) will time-out. Assuming that the SCCP/SUA stack we
use on the MSC side will handle those correctly (which I'm working on
right now), we should be safe.

So the only case that I can see not covered is a BSC that loses the
radio/RSL connection on Abis but keeps the SCCP connection to the MSC.
And that should clearly be fixed in the BSC.
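
As an illustration of the T(ias)/T(iar) mechanism mentioned above, here is a rough Python model of one end of an SCCP connection. It is a sketch under assumptions, not the SCCP/SUA stack's API: the class and method names are hypothetical, the timer values are passed in without units, and "send_it" stands for sending an inactivity test message.

```python
class SccpInactivity:
    """Hypothetical model of the send/receive inactivity timers of one conn."""

    def __init__(self, now, t_ias, t_iar):
        self.t_ias = t_ias    # send inactivity timer
        self.t_iar = t_iar    # receive inactivity timer
        self.last_tx = now
        self.last_rx = now
        self.released = False

    def on_rx(self, now):
        """Any message received on the connection restarts T(iar)."""
        self.last_rx = now

    def on_tx(self, now):
        """Any message sent on the connection restarts T(ias)."""
        self.last_tx = now

    def poll(self, now):
        """Return "release", "send_it" or None for the current time."""
        if self.released:
            return None
        if now - self.last_rx >= self.t_iar:
            self.released = True
            return "release"
        if now - self.last_tx >= self.t_ias:
            self.last_tx = now  # the inactivity test itself counts as sent
            return "send_it"
        return None
```

If the peer has lost the connection state, it no longer sends anything for this connection, on_rx() is never called, and poll() eventually returns "release", at which point the related MSC state can be freed as well.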

Actions #72

Updated by ipse about 7 years ago

Do measurement reports include the lack of incoming RTP from the MSC, aka downlink audio? I was under the impression that we can only react to uplink audio loss using measurement reports.

Actions #73

Updated by laforge about 7 years ago

On Mon, Feb 13, 2017 at 06:52:40PM +0000, ipse [REDMINE] wrote:

Do measurement report include lack of incoming RTP from MSC aka
downlink audio?

measurement reports are about the Um interface. And as there's no RTP
on the Um interface: No. It doesn't belong in there.

I was under impression we can only react to uplink
audio loss using measurement reports.

Can you point me to anything in the 3GPP specs that would
require/recommend a MSC or BSC to tear down the signalling connection
(or even only the voice call using CC signalling) in the case of loss of
voice frames? This is not a trick question, I'm really curious if you
ever saw something like that.

I think the call control signalling is completely independent of the
voice frames, and generally it is the assumption that the user will hang
up in case he's not satisfied with the voice performance?

If one wanted to do something like this, I could see the following options here:
  • the BSC releasing the channel if it doesn't receive downlink RTP for a
    to-be-defined time
  • the MGW / rtp-proxy / ... noticing the absence of DL voice by
    interpreting the RTCP that the BTS sends for DL RTP. Further action
    could then be to somehow signal/indicate this to the MSC?
Actions #74

Updated by ipse about 7 years ago

laforge wrote:

On Mon, Feb 13, 2017 at 06:52:40PM +0000, ipse [REDMINE] wrote:

Do measurement report include lack of incoming RTP from MSC aka
downlink audio?

measurement reports are about the Um interface. And as there's no RTP
on the Um interface: No. It doesn't belong in there.

Yes, that's exactly how I know it. I probably misread your comment, as I thought you suggested they carry an indication of the status of the voice data flow aside from the Q value.

I was under impression we can only react to uplink
audio loss using measurement reports.

Can you point me to anything in the 3GPP specs that would
require/recommend a MSC or BSC to tear down the signalling connection
(or even only the voice call using CC signalling) in the case of loss of
voice frames? This is not a trick question, I'm really curious if you
ever saw something like that.

No, I've never seen anything like this. My thinking comes from the VoIP world, where it's common to monitor the presence of RTP data and assume a call is dropped when no RTP arrives for X seconds. Keep-alives are optional in SIP, so a dead endpoint can only be reliably detected by monitoring RTP data, which is assumed to be coming constantly (unless paused by re-INVITE).

So in our case I think the question is whether we assume that the MSC has its own way to reliably detect call failure at the far end and reliably signal this to the BSC.

An interesting question is how this is going to work with the MNCC-SIP interface when the MSC is not a real MSC. We can probably assume that a SIP PBX will detect call failures on its own (e.g. by RTP loss) and will signal this over SIP/MNCC. But then we must ensure that the SIP-MNCC interface is very reliable or mandate keep-alive support. But that's probably a separate discussion.

Actions #75

Updated by neels about 7 years ago

  • % Done changed from 50 to 70

Progress: implemented SMS delivery on the VLR branch, by round-robin on SMS recipient MSISDN with VLR lookups for subscriber data.
Previous code used the subscriber data present in the SQL to find a pending SMS for an attached subscriber in one SQL query.
Now we need to pair up the SMS data from the SQLite db with the subscriber data in the RAM; this is now implemented.
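
The round-robin pairing described above could look roughly like the following Python sketch. Everything in it is assumed for illustration: the table layout (`sms` with `id`, `dest`, `sent`), the function name, and the VLR being a plain set of attached MSISDNs; the real schema and VLR API differ.

```python
import sqlite3


def next_sms_for_attached(db, vlr, last_msisdn=""):
    """Return (sms_id, dest_msisdn) of the next deliverable SMS, or None.

    Walks pending recipients in MSISDN order, starting after last_msisdn
    and wrapping around once, skipping recipients not attached in the VLR.
    """
    rows = db.execute(
        "SELECT MIN(id), dest FROM sms WHERE sent = 0 "
        "GROUP BY dest ORDER BY dest"
    ).fetchall()
    # Rotate so that recipients after last_msisdn are tried first.
    after = [r for r in rows if r[1] > last_msisdn]
    before = [r for r in rows if r[1] <= last_msisdn]
    for sms_id, dest in after + before:
        if dest in vlr:  # subscriber data lives in RAM, not in the SQL db
            return sms_id, dest
    return None
```

The caller would remember the MSISDN it served last and pass it back in as last_msisdn, so that one recipient with many pending SMS cannot starve the others.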

SMS delivery is verified to work with the neels/vlr branch using real equipment (sysmoBTS, Galaxy S4m).
Scalability is on a different page and will probably have to wait until we implement a separate SMSC.

So, the VLR is looking good: at this point we have verified voice calls, SMS, USSD and GPRS working.
Basic end-to-end tests in openbsc/tests/msc_vlr/ verify graceful timeout and error handling (could add more tests).
(GPRS: I haven't actually tried connecting the SGSN with osmo-hlr yet, it works because of 'auth-policy accept-all')

Actions #76

Updated by neels about 7 years ago

Implemented the COMMUNICATING state described above, with several msc_vlr regression tests to verify it.

Actions #77

Updated by neels about 7 years ago

  • % Done changed from 70 to 90

The 3G branch has been rebased on top of the VLR branch and works.
In other words, the VLR is capable of handling both our 2G NITB and 3G MSC code bases, providing async R99 auth.
We will certainly still be touching this very central code in the future, but it is basically done and working.

I would technically set this to 100%, but leaving at 90 to resolve merging to the master branch.

Actions #78

Updated by neels about 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

It's ready on openbsc.git's vlr_2G branch, and we've decided to not merge it to openbsc.git's master.
Decisions when and where it will become a master branch are independent from the implementation, which is done.

Actions #79

Updated by laforge almost 7 years ago

  • Status changed from Resolved to Closed