Metrics roll up
To improve query performance and reduce loading times, Kloudfuse computes and aggregates metrics data in 5 minute intervals directly from the data stream. Depending on the time span of the query, Kloudfuse calculates results either from raw data or from rolled up data. For shorter time spans, Kloudfuse continues to use raw metrics, because aggregation can smooth out the data and hide important signals, such as outliers.
Metric roll up is off by default. To enable it, contact Kloudfuse Support.
By default, Kloudfuse rolls up metric data in 5 minute intervals. You can configure that interval.
Benefits
The primary benefit of this approach is reduced I/O cost, because Kloudfuse reads aggregated metrics instead of raw values. These pre-calculated aggregates improve query performance, and faster calculation means that dashboards and graphs load more quickly. Additionally, it is relatively inexpensive to increase retention times for aggregated metrics.
Consider raw data streams with intervals of 15 or 30 seconds, and compare the number of records that each query processes against the number of records in metrics data that is pre-aggregated at 5 minute and 10 minute intervals. When the data stream is at 15 or 30 seconds, using rolled up (pre-aggregated) metrics at 5 minutes reduces the number of records to retrieve by a factor of 20 or 10, respectively. With a roll up interval of 10 minutes, the number of records decreases by a factor of 40 or 20, respectively.
Number of stored records, by query duration:

Query duration | 1 metric, raw 15s | 1 metric, raw 30s | 1 metric, rolled up 5 min | 1 metric, rolled up 10 min | 200 metrics, raw 15s | 200 metrics, raw 30s | 200 metrics, rolled up 5 min | 200 metrics, rolled up 10 min |
---|---|---|---|---|---|---|---|---|
6h | 1,440 | 720 | 72 | 36 | 288 K | 144 K | 14.4 K | 7.2 K |
2d | 11,520 | 5,760 | 576 | 288 | 2,304 K | 1,152 K | 115.2 K | 57.6 K |
7d | 40,320 | 20,160 | 2,016 | 1,008 | 8,064 K | 4,032 K | 403.2 K | 201.6 K |
2w | 80,640 | 40,320 | 4,032 | 2,016 | 16,128 K | 8,064 K | 806.4 K | 403.2 K |
1mo=30d | 172,800 | 86,400 | 8,640 | 4,320 | 34,560 K | 17,280 K | 1,728 K | 864 K |
1y=365.25d | 2,104 K | 1,052 K | 105 K | 53 K | 420,768 K | 210,384 K | 21,038 K | 10,519 K |
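The reduction factors quoted above come from dividing the roll up interval by the raw sampling interval, and the raw record counts come from dividing the query duration by the sampling interval. The following is a minimal sketch of that arithmetic; the function names are hypothetical and are not part of Kloudfuse.

```python
def records_per_series(duration_secs: int, interval_secs: int) -> int:
    """Number of stored points for one series over a query window."""
    return duration_secs // interval_secs


def reduction_factor(raw_interval_secs: int, rollup_interval_secs: int) -> float:
    """How many raw points collapse into one rolled up point."""
    return rollup_interval_secs / raw_interval_secs


SIX_HOURS = 6 * 60 * 60

print(records_per_series(SIX_HOURS, 15))        # raw points at a 15 s interval
print(records_per_series(SIX_HOURS, 15) * 200)  # same window across 200 metrics
print(reduction_factor(15, 300))                # 15 s raw vs. 5 min roll up
print(reduction_factor(30, 600))                # 30 s raw vs. 10 min roll up
```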
Compare the disk I/O times with raw metrics vs. with rolled up metrics.
For details on how Kloudfuse implements metric roll up, see Ingestion, Processing, Calculation, Storage, and Queries, Metrics Processing and Storage, and Metrics Queries.
Drawbacks
Metrics roll up adds some storage overhead, and you may need to add disks if you plan to retain large amounts of historical data. It also uses more in-memory resources than raw metrics.
Ingestion, Processing, Calculation, Storage, and Queries
The following diagram illustrates how Kloudfuse handles metrics, from live stream processing to queries from dashboards and alerts. It assumes the default roll up interval of 5 minutes.
Metrics processing and storage
Refer to the upper part of the diagram in the Ingestion, Processing, Calculation, Storage, and Queries section. This workflow illustrates the default roll up setting of 5 minutes. The numbers in light blue circles correspond to these steps:
1. Kloudfuse gets live stream time series data from your environment, either through agents or from cloud sources.
2. The Ingester Service pre-processes the data stream, and routes it to Kafka as `kf_metrics_topic`.
3. Kafka handles the same data stream in two parallel processes:
   - It forwards `kf_metrics_topic` directly to Pinot.
   - It uses `kf_metrics_topic` to extract roll-up metrics:
     - Kafka sends `kf_metrics_topic` to the Metrics Transformer.
     - The Metrics Transformer creates `kf_metrics_rollup_topic`, which contains aggregations and markers for 5 minute intervals, and sends it back to Kafka.
     - Kafka forwards `kf_metrics_rollup_topic` to Pinot.
4. Pinot handles the topics in the following manner:
   - Raw metrics:
     - The Metrics Decoder receives `kf_metrics_topic`, performs necessary calculations, and writes the results to the table `kf_metrics`.
     - The table columns are `name` (of metric), `timestamp`, `labels`, `value`, and `le`.
   - Rolled up metrics:
     - The Metrics Rollup Decoder receives `kf_metrics_rollup_topic`, performs necessary calculations and aggregations, and writes the results to the table `kf_metrics_rollup`.
     - The table columns are `name` (of metric), `timestamp`, `labels`, `sum`, `count`, `min`, `max`, `counter`, `first`, `first_ts`, and `le`.
     - The aggregations `sum`, `count`, `min`, and `max` are calculated over the raw `value` in the other table.
     - Kloudfuse uses `counter` (last counter value that accounts for resets within the rollup window), `first` (first value encountered in the bucket), and `first_ts` (timestamp of `first`) to ensure data integrity.
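To make the rolled up columns concrete, here is a minimal sketch of how raw samples for a single series could be folded into 5 minute buckets. It is an illustration under stated assumptions, not the Metrics Transformer's actual code: the `RollupRow` class and `rollup` function are hypothetical, and the reset handling for `counter` (treat any drop in value as a reset and carry the pre-reset total forward) is an assumed interpretation of "accounts for resets".

```python
from dataclasses import dataclass

ROLLUP_SECS = 300  # default 5 minute roll up interval


@dataclass
class RollupRow:
    timestamp: int   # start of the bucket, in epoch seconds
    sum: float
    count: int
    min: float
    max: float
    counter: float   # ASSUMPTION: last value plus totals seen before resets
    first: float     # first value encountered in the bucket
    first_ts: int    # timestamp of that first value


def rollup(samples, interval=ROLLUP_SECS):
    """Aggregate time-ordered (timestamp, value) samples of one series."""
    rows = {}
    last_seen = {}  # previous raw value per bucket, used to detect resets
    for ts, value in samples:
        bucket = ts - ts % interval
        row = rows.get(bucket)
        if row is None:
            rows[bucket] = RollupRow(bucket, value, 1, value, value, value, value, ts)
        else:
            row.sum += value
            row.count += 1
            row.min = min(row.min, value)
            row.max = max(row.max, value)
            if value < last_seen[bucket]:
                # ASSUMPTION: a drop is a counter reset, so keep what was
                # accumulated before the reset and add the new value.
                row.counter += value
            else:
                row.counter += value - last_seen[bucket]
        last_seen[bucket] = value
    return [rows[b] for b in sorted(rows)]


samples = [(0, 10.0), (15, 12.0), (30, 3.0), (300, 5.0)]
for row in rollup(samples):
    print(row)
```

The first three samples fall into the bucket that starts at 0, and the last sample starts a new bucket at 300. The drop from 12.0 to 3.0 is the kind of reset that a plain `max` or last value would hide, which is why the sketch tracks `counter` separately.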
Metrics queries
Refer to the lower part of the diagram in the Ingestion, Processing, Calculation, Storage, and Queries section. The numbers in the dark blue circles correspond to these steps:
1. Kloudfuse gets a query request either from the Kloudfuse UI or from the Grafana UI. This may be triggered by starting the Metrics interface, loading dashboards, changing and reloading dashboards and reports, changing the time picker values, and so on.
2. The Query Service determines, based on the time interval (more or less than 2 days) or step size (more or less than 5 minutes), which table it should read metrics from, and issues the appropriate read requests.
3. The Query Service receives results for all queries from:
   - Raw metrics, in table `kf_metrics`
   - Rolled up metrics, in table `kf_metrics_rollup`
4. The Query Service combines the results and forwards them to the UI that made the request.
Configuration
Enabling Metrics Rollup
Kloudfuse optimizes metrics through roll up in the background. This feature is disabled by default. To turn on this feature, use the following configuration in the `custom_values.yaml` file.

    global:
      metrics:
        rollupEnabled: true
Rollup Metrics in Query Service
The Query Service uses rolled up metrics by default on all queries that span a time interval of 2 days or longer, or that have a step size of 5 minutes or larger. For shorter intervals, the Query Service uses raw metrics. The switch is transparent to the user. For example, if you use a time picker to select the time interval, the Query Service switches to rolled up metrics when the interval reaches 2 days, and back to raw metrics when the interval decreases to less than 2 days. You can adjust both thresholds in the `custom_values.yaml` file.

    query-service:
      config:
        Metric:
          RollupTimerangeThresholdSecs: 172800
          RollupStepThresholdSecs: 300
The Query Service uses rolled up metrics when it is appropriate. To turn off the roll up option for queries, use the following configuration in the `custom_values.yaml` file:

    query-service:
      config:
        Metric:
          AutoUseRollup: false
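The table selection rule described in this section can be sketched as a single function. This is an illustration only: `choose_table` is a hypothetical name, the constants mirror the default values of `RollupTimerangeThresholdSecs` and `RollupStepThresholdSecs`, and the inclusive comparison at exactly 2 days or 5 minutes is an assumption based on the "or longer" and "or larger" wording.

```python
ROLLUP_TIMERANGE_THRESHOLD_SECS = 172800  # 2 days
ROLLUP_STEP_THRESHOLD_SECS = 300          # 5 minutes


def choose_table(range_secs: int, step_secs: int, auto_use_rollup: bool = True) -> str:
    """Pick the table a query should read from."""
    if not auto_use_rollup:
        # Mirrors AutoUseRollup: false, where queries always read raw metrics.
        return "kf_metrics"
    if (range_secs >= ROLLUP_TIMERANGE_THRESHOLD_SECS
            or step_secs >= ROLLUP_STEP_THRESHOLD_SECS):
        return "kf_metrics_rollup"
    return "kf_metrics"


print(choose_table(range_secs=6 * 3600, step_secs=15))       # short range, fine step
print(choose_table(range_secs=7 * 86400, step_secs=60))      # long range
print(choose_table(range_secs=6 * 3600, step_secs=600))      # coarse step
print(choose_table(7 * 86400, 60, auto_use_rollup=False))    # roll up disabled
```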
Retention
By default, rolled up metrics have the same retention policy as raw metrics. To change the retention period, typically to a longer one, add the settings `retentionTimeRollupValue` and `retentionTimeRollupUnit` to the `global.retentionPolicy` section of the `custom_values.yaml` file.
Mixed data
To handle mixed data, where some (likely older) data has no corresponding rolled up data, the Query Service looks up the minimum timestamp in the `kf_metrics_rollup` table. If the query time range contains the minimum timestamp, the Query Service splits the query into multiple component queries. The older data uses the `kf_metrics` table, and the newer data uses the rolled up table `kf_metrics_rollup`. The Query Service then merges the responses before returning the result.
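The split described above can be sketched as follows. The function name `split_query` and the half-open `[start, end)` range convention are assumptions made for illustration; the source does not specify how the Query Service represents component queries.

```python
def split_query(start: int, end: int, min_rollup_ts: int):
    """Return (table, start, end) parts for a query over [start, end)."""
    if min_rollup_ts <= start:
        # Rolled up data covers the whole range.
        return [("kf_metrics_rollup", start, end)]
    if min_rollup_ts >= end:
        # No rolled up data exists yet for this range.
        return [("kf_metrics", start, end)]
    # The range straddles the first rolled up timestamp: read older data
    # from the raw table and newer data from the roll up table.
    return [
        ("kf_metrics", start, min_rollup_ts),
        ("kf_metrics_rollup", min_rollup_ts, end),
    ]


for part in split_query(start=1_000, end=10_000, min_rollup_ts=4_000):
    print(part)
```

Each part would then be queried separately, and the responses merged in timestamp order before the result is returned.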