Metrics roll up

To improve query performance and reduce loading times, Kloudfuse computes and aggregates metrics data in 5-minute intervals directly from the data stream. Depending on the time span of the query, Kloudfuse calculates results either from raw data or from rolled up data. For shorter time spans, Kloudfuse continues to use raw metrics, because aggregation can smooth out the data and hide important signals, such as outliers.

  • Metric roll up is off by default. To enable it, contact Kloudfuse Support.

  • By default, Kloudfuse rolls up metric data in 5 minute intervals. You can configure that interval.

Benefits

The primary benefit of this approach is reduced I/O cost, because Kloudfuse reads aggregated metrics instead of raw values. The pre-calculated aggregates improve query performance, and quicker calculation means that dashboards and graphs load faster. Additionally, it is relatively inexpensive to increase retention times for these aggregated metrics.

Consider a raw data stream with an interval of 15 or 30 seconds, and compare the number of records that each query processes against the number it processes when using metrics pre-aggregated at 5-minute and 10-minute intervals. When the data stream is at 15 or 30 seconds, using rolled up (pre-aggregated) metrics at 5 minutes improves efficiency by reducing the data to retrieve by a factor of 20 or 10, respectively. With a roll up interval of 10 minutes, the reduction is by a factor of 40 or 20, respectively.
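The reduction factors quoted above follow from the ratio of the roll up interval to the raw interval. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of Kloudfuse.

```python
def record_count(duration_s: int, interval_s: int) -> int:
    """Number of data points one series produces over duration_s at a fixed interval."""
    return duration_s // interval_s


def reduction_factor(raw_interval_s: int, rollup_interval_s: int) -> float:
    """How many raw data points collapse into one rolled up record."""
    return rollup_interval_s / raw_interval_s


SIX_HOURS_S = 6 * 60 * 60

for raw_s in (15, 30):
    # Raw records for a single metric over a 6-hour query.
    print(f"raw {raw_s}s over 6h: {record_count(SIX_HOURS_S, raw_s)} records")
    for rollup_s in (300, 600):
        factor = reduction_factor(raw_s, rollup_s)
        print(f"  roll up at {rollup_s // 60} min: {factor:.0f}x fewer records")
```

The factor depends only on the two intervals, so it stays the same for every query duration.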

Number of stored records, by query duration

| Query duration | 1 metric, raw 15s | 1 metric, raw 30s | 1 metric, rolled up 5 min | 1 metric, rolled up 10 min | 200 metrics, raw 15s | 200 metrics, raw 30s | 200 metrics, rolled up 5 min | 200 metrics, rolled up 10 min |
| 6h | 1,440 | 720 | 120 | 60 | 288 K | 144 K | 24 K | 12 K |
| 2d | 11,520 | 5,760 | 960 | 480 | 2,304 K | 1,152 K | 192 K | 96 K |
| 7d | 40,320 | 20,160 | 3,360 | 1,680 | 8,064 K | 4,032 K | 672 K | 336 K |
| 2w | 80,640 | 40,320 | 6,720 | 3,360 | 16,128 K | 8,064 K | 1,344 K | 672 K |
| 1mo=30d | 172,800 | 86,400 | 14,400 | 7,200 | 34,560 K | 17,280 K | 2,880 K | 1,440 K |
| 1y=365.25d | 2,104 K | 1,052 K | 175 K | 88 K | 420,768 K | 210,384 K | 35,064 K | 17,532 K |

Compare the disk I/O times with raw metrics vs. with rolled up metrics.

[Figure: Select metrics from the Kloudfuse plane]

For details on how Kloudfuse implements metric roll up, see the following sections: Ingestion, Processing, Calculation, Storage, and Queries; Metrics processing and storage; and Metrics queries.

Drawbacks

Metrics roll up adds some storage overhead, which may require additional disks when you plan to retain large amounts of historical data. It also uses more in-memory resources than raw metrics alone.

Ingestion, Processing, Calculation, Storage, and Queries

The following diagram illustrates how Kloudfuse handles metrics, from live stream processing to queries from dashboards and alerts. It assumes the default roll up interval of 5 minutes.

[Figure: Metrics processing, storage, and queries]

Metrics processing and storage

Refer to the upper part of the diagram in the Ingestion, Processing, Calculation, Storage, and Queries section. This workflow illustrates the default roll up setting of 5 minutes. The numbers in light blue circles correspond to these steps:

  1. Kloudfuse gets live stream time series data from your environment, either through agents or from cloud sources.

  2. The Ingester Service pre-processes the data stream, and routes it to Kafka as kf_metrics_topic.

  3. Kafka handles the same data stream in two parallel processes:

    1. It forwards kf_metrics_topic directly to Pinot.

    2. It uses kf_metrics_topic to extract roll-up metrics:

      1. Kafka sends kf_metrics_topic to the Metrics Transformer.

      2. The Metrics Transformer computes aggregations and markers for 5-minute intervals, publishes them as kf_metrics_rollup_topic, and sends that topic back to Kafka.

      3. Kafka forwards kf_metrics_rollup_topic to Pinot.

  4. Pinot handles the topics in the following manner:

    1. Raw metrics:

      1. The Metrics Decoder receives kf_metrics_topic, performs necessary calculations, and writes it to table kf_metrics.

      2. The table columns are name (of metric), timestamp, labels, value, and le.

    2. Rolled up metrics:

      1. The Metrics Rollup Decoder receives kf_metrics_rollup_topic, performs necessary calculations and aggregations, and writes it to the table kf_metrics_rollup.

      2. The table columns are name (of metric), timestamp, labels, sum, count, min, max, counter, first, first_ts, and le.

      3. The aggregations sum, count, min, and max are calculated over the raw values, which correspond to the value column of the kf_metrics table.

      4. Kloudfuse uses counter (the last counter value, which accounts for resets within the rollup window), first (the first value encountered in the bucket), and first_ts (the timestamp of first) to ensure data integrity.
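As an illustration of the rolled up columns listed above, the following sketch groups the raw points of a single series into 5-minute buckets and computes sum, count, min, max, first, and first_ts for each bucket. It is not the Metrics Transformer implementation: the names RollupRow and roll_up are hypothetical, the bucket alignment to multiples of the interval is an assumption, and the counter column is left out because the exact reset handling is not described here.

```python
from dataclasses import dataclass

ROLLUP_INTERVAL_S = 300  # default 5-minute roll up window


@dataclass
class RollupRow:
    """One rolled up record for one series (hypothetical shape)."""
    timestamp: int  # start of the bucket, epoch seconds (assumed alignment)
    sum: float
    count: int
    min: float
    max: float
    first: float    # first value encountered in the bucket
    first_ts: int   # timestamp of that first value


def roll_up(points, interval_s=ROLLUP_INTERVAL_S):
    """Aggregate (epoch_seconds, value) pairs of one series into buckets."""
    buckets = {}
    for ts, value in sorted(points):
        start = ts - ts % interval_s  # align to the start of the bucket
        row = buckets.get(start)
        if row is None:
            # The first point in a bucket seeds every aggregate.
            buckets[start] = RollupRow(start, value, 1, value, value, value, ts)
        else:
            row.sum += value
            row.count += 1
            row.min = min(row.min, value)
            row.max = max(row.max, value)
    return [buckets[start] for start in sorted(buckets)]


rows = roll_up([(0, 1.0), (15, 3.0), (299, 2.0), (300, 5.0)])
for row in rows:
    print(row)
```

Storing sum and count rather than a precomputed average lets a query combine several buckets into a correct average over a longer step.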

Metrics queries

Refer to the lower part of the diagram in the Ingestion, Processing, Calculation, Storage, and Queries section. The numbers in the dark blue circles correspond to these steps:

  1. Kloudfuse gets a query request either from the Kloudfuse UI or from the Grafana UI.
    This may be triggered by starting the Metrics interface, loading dashboards, changing and reloading dashboards and reports, changing the time picker values, and so on.

  2. The Query Service determines, based on the time interval (more or less than 2 days) or step size (more or less than 5 minutes), from which table it should read metrics, and issues the appropriate read requests.

  3. The Query Service receives results for all queries from:

    • Raw metrics, in table kf_metrics

    • Rolled up metrics, in table kf_metrics_rollup

  4. The Query Service combines the results and forwards them to the requesting UI.

Configuration

Enabling Metrics Rollup

Kloudfuse optimizes metrics through roll up in the background. This feature is disabled by default. To turn on this feature, use the following configuration in the custom_values.yaml file.

    global:
      metrics:
        rollupEnabled: true

Rollup Metrics in Query Service

The Query Service uses rolled up metrics by default for all queries that span a time interval of 2 days or longer, or that use a step size of 5 minutes or larger. For shorter intervals, the Query Service uses raw metrics. The switch is transparent to the user. For example, if you use a time picker to select the time interval, the Query Service switches to rolled up metrics when the interval reaches 2 days, and back to raw metrics when the interval drops below 2 days. You can adjust these thresholds in the custom_values.yaml file.

    query-service:
      config:
        Metric:
          RollupTimerangeThresholdSecs: 172800
          RollupStepThresholdSecs: 300
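The table selection described in this section can be sketched as follows, using the default thresholds. This is an illustration only, not the Query Service code: the function name choose_table is hypothetical, and the inclusive comparisons and the "either condition" logic are assumptions based on the wording above.

```python
ROLLUP_TIMERANGE_THRESHOLD_SECS = 172800  # 2 days, default
ROLLUP_STEP_THRESHOLD_SECS = 300          # 5 minutes, default


def choose_table(start_s: int, end_s: int, step_s: int) -> str:
    """Pick the table to read from for a query over [start_s, end_s]."""
    long_range = (end_s - start_s) >= ROLLUP_TIMERANGE_THRESHOLD_SECS
    coarse_step = step_s >= ROLLUP_STEP_THRESHOLD_SECS
    # Assumption: either condition alone is enough to use rolled up data.
    if long_range or coarse_step:
        return "kf_metrics_rollup"
    return "kf_metrics"


print(choose_table(0, 3600, 15))         # one hour, 15-second step
print(choose_table(0, 7 * 86400, 60))    # one week, 1-minute step
```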

The Query Service uses rolled up metrics when it is appropriate. To turn off the roll up option for queries, use the following configuration in the custom_values.yaml file:

    query-service:
      config:
        Metric:
          AutoUseRollup: false

Retention

By default, rolled up metrics have the same retention policy as raw metrics. To change the retention, typically to a longer period, add the settings retentionTimeRollupValue and retentionTimeRollupUnit to the global.retentionPolicy section of the custom_values.yaml file.

Mixed data

To handle mixed data, where some (likely older) data has no corresponding rolled up data, the Query Service looks up the minimum timestamp in the kf_metrics_rollup table. If the query time range contains that minimum timestamp, the Query Service splits the query into multiple component queries. The older part of the range reads from the kf_metrics table, and the newer part reads from the rolled up table kf_metrics_rollup. The Query Service then merges the responses before returning the result.
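The split described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name split_query is hypothetical, time ranges are treated as half-open intervals in epoch seconds, and min_rollup_ts stands for the minimum timestamp found in kf_metrics_rollup (None when that table is empty).

```python
def split_query(start_s, end_s, min_rollup_ts):
    """Return component queries as (table, start, end) tuples, oldest first."""
    if min_rollup_ts is None or end_s <= min_rollup_ts:
        # No rolled up data covers this range: read raw metrics only.
        return [("kf_metrics", start_s, end_s)]
    if start_s >= min_rollup_ts:
        # The whole range has rolled up data.
        return [("kf_metrics_rollup", start_s, end_s)]
    # The range contains the minimum timestamp: split at that point.
    return [
        ("kf_metrics", start_s, min_rollup_ts),
        ("kf_metrics_rollup", min_rollup_ts, end_s),
    ]


for table, start, end in split_query(0, 1000, 400):
    print(table, start, end)
```

Because the component ranges do not overlap, merging the responses amounts to concatenating them in time order.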