Bug #5563

open

OsmoMSC sometimes stalls for dozens of seconds in a production deployment

Added by laforge almost 2 years ago. Updated over 1 year ago.

Status: Stalled
Priority: High
Assignee:
Category: SMS
Target version: -
Start date: 05/14/2022
Due date:
% Done: 50%
Resolution:
Spec Reference:

Description

When we take a long-term (8-hour) bpftrace capture of the delay between successive calls to poll() (made from libosmocore/src/select.c) in osmo-msc, we get the following histogram (units in milliseconds):

@poll: 
[0]               532245 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                13088 |@                                                   |
[2, 4)              5621 |                                                    |
[4, 8)              5566 |                                                    |
[8, 16)             2746 |                                                    |
[16, 32)            5282 |                                                    |
[32, 64)            5262 |                                                    |
[64, 128)           6139 |                                                    |
[128, 256)         14273 |@                                                   |
[256, 512)         18357 |@                                                   |
[512, 1K)          13806 |@                                                   |
[1K, 2K)            4222 |                                                    |
[2K, 4K)            1331 |                                                    |
[4K, 8K)             450 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             5 |                                                    |
[64K, 128K)           17 |                                                    |
[128K, 256K)           2 |                                                    |
So as we can see:
  • the majority of delays are very short (sub-millisecond up to 128 ms)
  • there is a smaller peak in the range of 128 ms to 1 s (surprisingly long)
  • there are still several thousand instances where the delay is in the 1 s..4 s interval (too long!)
  • there are rare occasions where we don't return to poll for 32, 64, or even more than 128 seconds (crazy!)
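To put numbers on the bullets above, the bucket counts can be summed per threshold. This is a minimal sketch that only re-adds the counts from the @poll histogram; the names `POLL_HIST` and `count_at_least` are made up for illustration, and it assumes bpftrace's power-of-two buckets (so "1K" is 1024 ms, roughly one second).

```python
# Bucket lower bounds (ms) and counts copied from the @poll histogram.
# Assumption: "1K" in the bpftrace output means 1024, "2K" 2048, and so on.
POLL_HIST = [
    (0, 532245), (1, 13088), (2, 5621), (4, 5566), (8, 2746),
    (16, 5282), (32, 5262), (64, 6139), (128, 14273), (256, 18357),
    (512, 13806), (1024, 4222), (2048, 1331), (4096, 450),
    (8192, 0), (16384, 0), (32768, 5), (65536, 17), (131072, 2),
]


def count_at_least(hist, threshold_ms):
    """Number of samples in buckets whose lower bound is >= threshold_ms."""
    return sum(count for lower, count in hist if lower >= threshold_ms)


total = count_at_least(POLL_HIST, 0)
for label, threshold in [(">= 128 ms", 128), (">= ~1 s", 1024), (">= ~32 s", 32768)]:
    n = count_at_least(POLL_HIST, threshold)
    print(f"{label:>10}: {n:7d} of {total} ({100.0 * n / total:.3f}%)")
```

The thresholds are bucket edges, so the "1 s" and "32 s" rows are really 1.024 s and 32.768 s.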

If we contrast this with the time spent in individual dbi_conn_queryf calls, a single slow query is clearly not the culprit:

@dbi_query: 
[0]                37008 |@                                                   |
[1]              1640233 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)           1245771 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[4, 8)             21406 |                                                    |
[8, 16)              325 |                                                    |
[16, 32)              71 |                                                    |
[32, 64)              17 |                                                    |

So the longest DB query took on the order of 32..63 ms. Not good, but not a problem either, since all the MSC time-outs (MM, CC, SMS, BSSAP, SCCP, ...) are in the multi-second range.
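While no single query is slow, the same histogram also gives rough bounds on the cumulative time spent in dbi_conn_queryf. The sketch below multiplies each bucket count by the bucket's lower and upper edge. It assumes the @dbi_query histogram covers the same 8-hour window as the poll trace, which is not stated explicitly; `DBI_HIST` and `TRACE_SECONDS` are illustrative names.

```python
# (lower_ms, upper_ms, count) copied from the @dbi_query histogram.
DBI_HIST = [
    (0, 1, 37008), (1, 2, 1640233), (2, 4, 1245771), (4, 8, 21406),
    (8, 16, 325), (16, 32, 71), (32, 64, 17),
]
# Assumption: the query trace spans the same 8 hours as the poll trace.
TRACE_SECONDS = 8 * 3600

queries = sum(count for _, _, count in DBI_HIST)
# Every sample is at least its bucket's lower edge and below its upper edge.
low_s = sum(lower * count for lower, _, count in DBI_HIST) / 1000.0
high_s = sum(upper * count for _, upper, count in DBI_HIST) / 1000.0

print(f"queries: {queries} (about {queries / TRACE_SECONDS:.0f} per second on average)")
print(f"cumulative query time: between {low_s:.0f} s and {high_s:.0f} s")
print(f"share of wall-clock time: {100 * low_s / TRACE_SECONDS:.0f}% to "
      f"{100 * high_s / TRACE_SECONDS:.0f}%")
```

These are coarse bucket-edge bounds, not measurements, and they say nothing about how the queries are distributed over time.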

So now we have to find out whether the stalls are

  1. due to excessive system load (like I/O) outside of osmo-msc, or
  2. due to something osmo-msc is doing by itself (like issuing thousands of database queries of several milliseconds each) without going through the libosmocore poll main loop.
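As a plausibility check for the second hypothesis, one can work out how many back-to-back queries it would take to fill a stall of a given length. This is plain arithmetic, not a measurement: the per-query durations (1, 2 and 4 ms) are illustrative values picked from the dominant buckets of the query histogram, and `queries_to_fill` is a hypothetical helper.

```python
import math


def queries_to_fill(stall_ms, per_query_ms):
    """Smallest number of sequential queries whose total time covers stall_ms."""
    return math.ceil(stall_ms / per_query_ms)


# Stall lengths taken from the observations: 1 s, 4 s, 32 s and 128 s.
for stall_s in (1, 4, 32, 128):
    row = ", ".join(
        f"{queries_to_fill(stall_s * 1000, ms)} x {ms} ms" for ms in (1, 2, 4)
    )
    print(f"{stall_s:>4} s stall: {row}")
```

For example, a 32-second stall would need 16000 consecutive 2 ms queries executed without returning to the main loop.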

Related issues

Related to OsmoMSC - Bug #5564: blocking database I/O by SMS database (Stalled, laforge, 05/15/2022)
Related to OsmoMSC - Bug #5559: OsmoMSC at 100% CPU and unresponsive for up to several minutes! (Stalled, laforge, 05/13/2022)
Related to OsmoMSC - Feature #5566: avoid using synchronous = FULL (Resolved, laforge, 05/17/2022)
