Project

General

Profile

Actions

Bug #5671

closed

rate_ctr: strange per-minute/hour/day e1 tx/rx frame counts

Added by cquirin over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
08/30/2022
Due date:
% Done:

100%

Spec Reference:

Description

I am seeing this in osmo-e1d

  E1 Tx Frames multiplexed: 899771261 (8000/s 712004/m 35520752/h 892818365/d)
  E1 Rx Frames demultiplexed: 899771232 (8000/s 712000/m 35520752/h 892818352/d)

8000 * 60 = 480000 right?
712000 / 8000 = 89

Actions #1

Updated by laforge over 1 year ago

this is the vty output of the libosmocore rate_ctr API. so it may very well be a bug in libosmocore affecting all our counters.

There's one global timer in libosmocore which iterates over all rate_ctr and computes the per-minute/hour/day aggregate stats.

Actions #2

Updated by laforge over 1 year ago

Looking at the output of some of my long-running osmo-e1d instances:

          tx:frames_muxed: 80257523645 (8011/s 512011/m 49440128/h 748802000/d) E1 Tx Frames multiplexed
        rx:frames_demuxed: 80257529819 (8008/s 512008/m 49440136/h 748802000/d) E1 Rx Frames demultiplexed

that's also odd:
  • 512011/60 = 8533/s
  • 49440128/3600 = 13733/s
  • 748802000/86400 = 8667/s

So clearly something odd is going on here.

Actions #3

Updated by laforge over 1 year ago

So the libosmocore/rate_ctr works like this:
  • schedule a libosmocore timer 1s in the future
  • timer call-back is rate_ctr_group_intv()
    • increment timer_ticks
    • update the per-second interval stats
    • if timer_ticks % 60 == 0, update the per-minute stats
    • etc. for the higher intervals
  • re-arm the timer 1s in the future
So this has the following problems:
  • the timer expiration might be delayed by the scheduler
  • the processing of the timer takes some time
  • thus, in reality, the timer expiration is likely always "some amount more" than 1s, where "some amount" is the sum of the scheduler delay and the sum of the processing time of the timer callback function

so ideally, we would use a periodic timer expiring regularly at 1s intervals rather than multiple consecutive relative 1s timers. That way, the error doesn't accumulate over time.

When rate_ctr was written, we didn't yet have support for periodic timerfd in libosmocore yet, so we had to use what was available (osmo_timer, which are not repeating).

So this explains why there is some amount of error. However, it does not explain the magnitude of error we're seeing (13733/s is way more than 8533/s).

Actions #4

Updated by cquirin over 1 year ago

How about doing a gettimeofday() each time the counter get actually updated? Some math should then allow for the correct values to be deduced. That can be done with or without FP.

Actions #5

Updated by laforge over 1 year ago

On Wed, Aug 31, 2022 at 02:47:48PM +0000, cquirin wrote:

How about doing a gettimeofday() each time the counter get actually updated? Some math should then allow for the correct values to be deduced. That can be done with or without FP.

using a timerfd avoids the need for any of this. Even if the system is
stalled for several seconds, it will tell you how many timer intervals
(1s in our case) have expired meanwhile. We use this
in all the critical timing paths of other osmocom software, just not
for the rate_ctr.

However, I still don't think this is the root cause of what we're
seeing.

Actions #6

Updated by laforge over 1 year ago

  • Project changed from osmo-e1d to libosmocore
  • Subject changed from strange per-minute/hour/day e1 tx/rx frame counts to rate_ctr: strange per-minute/hour/day e1 tx/rx frame counts
Actions #7

Updated by cquirin over 1 year ago

rate_ctl.c:307 contains the following code

/* update the rate of the next bigger interval.  This will                                                                                                                                                                          
 * be overwritten when that next larger interval expires */
if (intv + 1 < ARRAY_SIZE(ctr->intv))
ctr->intv[intv+1].rate += ctr->intv[intv].rate;

This code is reasonable when the very first interval has not yet elapsed, as it incrementally fills the higher values with some kind of estimates.

However, as soon as the rate has been computed the regular way, by differentiation, the code should no longer run.

I commented the code to check. I have rates of 0 for the first interval, of course, (meaning that the per day value stays 0 for a whole day) but I do have reasonable values as soon as the first interval has elapsed.

Actions #8

Updated by laforge over 1 year ago

  • Assignee changed from laforge to msuraev

I really didn't find the time so far, maybe something Max could look into?

Actions #9

Updated by msuraev over 1 year ago

  • Status changed from New to In Progress

Let's convert this to timerfd first and than troubleshoot and fix remaining inaccuracies.

Actions #10

Updated by laforge over 1 year ago

On Wed, Nov 23, 2022 at 11:10:59AM +0000, msuraev wrote:

Let's convert this to timerfd first and than troubleshoot and fix remaining inaccuracies.

those are actually two separate topics. The last update from cquirin handles a generic
logic bug int he code, which is IMHO higher priority than any "inaccoracies.

Actions #11

Updated by msuraev over 1 year ago

laforge wrote in #note-10:

those are actually two separate topics. The last update from cquirin handles a generic
logic bug int he code, which is IMHO higher priority than any "inaccoracies.

I completely agree regarding priorities. From the implementation point of view it's simply easier to fix the code which won't be affected by rewrite right away.

Actions #12

Updated by msuraev over 1 year ago

  • % Done changed from 0 to 50

The preliminary fix is available in https://gerrit.osmocom.org/q/topic:ratectr
N. B: it hasn't been thoroughly tested yet.

Actions #13

Updated by msuraev over 1 year ago

  • % Done changed from 50 to 60

What would be better semantics for long-running rate counters?

For instance, day counter should be:
- 0 until 24 hours has passed
- updated once with the value from hourly counter
- updated every time hourly counter fires

The current code is kinda confusing so I wonder how the rate estimation code should be changed. Should I adjust the estimation to run only for 0 (never updated) counters? Or should I try to implement estimation to run in-between counter updates?

Which variant is more convenient and less confusing?

Actions #14

Updated by msuraev over 1 year ago

  • % Done changed from 60 to 80

After checking rate estimation in other monitoring soft it seems like suggestion by cquirin makes perfect sense: dropping the errorneous rate estimator and getting 0 for day counter until 24 hours passed.

The fix is under review in https://gerrit.osmocom.org/c/libosmocore/+/30304

Additionally, one can check https://gerrit.osmocom.org/c/libosmocore/+/30301/ for example setup which uses statsd and netdata to obtain rate counter values independently from libosmocore's rate counter/vty code.

Actions #15

Updated by msuraev over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

The fix is merged, related doc update is available via https://gerrit.osmocom.org/c/osmo-gsm-manuals/+/30715

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)