Project

General

Profile

Actions

Bug #5671

open

rate_ctr: strange per-minute/hour/day e1 tx/rx frame counts

Added by cquirin 3 months ago. Updated 12 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
08/30/2022
Due date:
% Done:

60%

Spec Reference:

Description

I am seeing this in osmo-e1d

  E1 Tx Frames multiplexed: 899771261 (8000/s 712004/m 35520752/h 892818365/d)
  E1 Rx Frames demultiplexed: 899771232 (8000/s 712000/m 35520752/h 892818352/d)

8000 * 60 = 480000 right?
712000 / 8000 = 89

Actions #1

Updated by laforge 3 months ago

this is the vty output of the libosmocore rate_ctr API. so it may very well be a bug in libosmocore affecting all our counters.

There's one global timer in libosmocore which iterates over all rate_ctr and computes the per-minute/hour/day aggregate stats.

Actions #2

Updated by laforge 3 months ago

Looking at the output of some of my long-running osmo-e1d instances:

          tx:frames_muxed: 80257523645 (8011/s 512011/m 49440128/h 748802000/d) E1 Tx Frames multiplexed
        rx:frames_demuxed: 80257529819 (8008/s 512008/m 49440136/h 748802000/d) E1 Rx Frames demultiplexed

that's also odd:
  • 512011/60 = 8533/s
  • 49440128/3600 = 13733/s
  • 748802000/86400 = 8667/s

So clearly something odd is going on here.

Actions #3

Updated by laforge 3 months ago

So the libosmocore/rate_ctr works like this:
  • schedule a libosmocore timer 1s in the future
  • timer call-back is rate_ctr_group_intv()
    • increment timer_ticks
    • update the per-second interval stats
    • if timer_ticks % 60 == 0, update the per-minute stats
    • etc. for the higher intervals
  • re-arm the timer 1s in the future
So this has the following problems:
  • the timer expiration might be delayed by the scheduler
  • the processing of the timer takes some time
  • thus, in reality, the timer expiration is likely always "some amount more" than 1s, where "some amount" is the sum of the scheduler delay and the sum of the processing time of the timer callback function

so ideally, we would use a periodic timer expiring regularly at 1s intervals rather than multiple consecutive relative 1s timers. That way, the error doesn't accumulate over time.

When rate_ctr was written, we didn't yet have support for periodic timerfd in libosmocore yet, so we had to use what was available (osmo_timer, which are not repeating).

So this explains why there is some amount of error. However, it does not explain the magnitude of error we're seeing (13733/s is way more than 8533/s).

Actions #4

Updated by cquirin 3 months ago

How about doing a gettimeofday() each time the counter get actually updated? Some math should then allow for the correct values to be deduced. That can be done with or without FP.

Actions #5

Updated by laforge 3 months ago

On Wed, Aug 31, 2022 at 02:47:48PM +0000, cquirin wrote:

How about doing a gettimeofday() each time the counter get actually updated? Some math should then allow for the correct values to be deduced. That can be done with or without FP.

using a timerfd avoids the need for any of this. Even if the system is
stalled for several seconds, it will tell you how many timer intervals
(1s in our case) have expired meanwhile. We use this
in all the critical timing paths of other osmocom software, just not
for the rate_ctr.

However, I still don't think this is the root cause of what we're
seeing.

Actions #6

Updated by laforge 3 months ago

  • Project changed from osmo-e1d to libosmocore
  • Subject changed from strange per-minute/hour/day e1 tx/rx frame counts to rate_ctr: strange per-minute/hour/day e1 tx/rx frame counts
Actions #7

Updated by cquirin 3 months ago

rate_ctl.c:307 contains the following code

/* update the rate of the next bigger interval.  This will                                                                                                                                                                          
 * be overwritten when that next larger interval expires */
if (intv + 1 < ARRAY_SIZE(ctr->intv))
ctr->intv[intv+1].rate += ctr->intv[intv].rate;

This code is reasonable when the very first interval has not yet elapsed, as it incrementally fills the higher values with some kind of estimates.

However, as soon as the rate has been computed the regular way, by differentiation, the code should no longer run.

I commented the code to check. I have rates of 0 for the first interval, of course, (meaning that the per day value stays 0 for a whole day) but I do have reasonable values as soon as the first interval has elapsed.

Actions #8

Updated by laforge 17 days ago

  • Assignee changed from laforge to msuraev

I really didn't find the time so far, maybe something Max could look into?

Actions #9

Updated by msuraev 16 days ago

  • Status changed from New to In Progress

Let's convert this to timerfd first and than troubleshoot and fix remaining inaccuracies.

Actions #10

Updated by laforge 16 days ago

On Wed, Nov 23, 2022 at 11:10:59AM +0000, msuraev wrote:

Let's convert this to timerfd first and than troubleshoot and fix remaining inaccuracies.

those are actually two separate topics. The last update from cquirin handles a generic
logic bug int he code, which is IMHO higher priority than any "inaccoracies.

Actions #11

Updated by msuraev 15 days ago

laforge wrote in #note-10:

those are actually two separate topics. The last update from cquirin handles a generic
logic bug int he code, which is IMHO higher priority than any "inaccoracies.

I completely agree regarding priorities. From the implementation point of view it's simply easier to fix the code which won't be affected by rewrite right away.

Actions #12

Updated by msuraev 15 days ago

  • % Done changed from 0 to 50

The preliminary fix is available in https://gerrit.osmocom.org/q/topic:ratectr
N. B: it hasn't been thoroughly tested yet.

Actions #13

Updated by msuraev 12 days ago

  • % Done changed from 50 to 60

What would be better semantics for long-running rate counters?

For instance, day counter should be:
- 0 until 24 hours has passed
- updated once with the value from hourly counter
- updated every time hourly counter fires

The current code is kinda confusing so I wonder how the rate estimation code should be changed. Should I adjust the estimation to run only for 0 (never updated) counters? Or should I try to implement estimation to run in-between counter updates?

Which variant is more convenient and less confusing?

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)