Configure metric alerts

 

Alert Types

First, select the type of alert (the detection method). Kloudfuse supports the following alert types for metric streams.

Threshold Alert

A threshold alert compares metric values to a static threshold. On each evaluation, the alert calculates the average, minimum, maximum, or sum over the selected period and checks whether the result is above or below the threshold. This is the standard alert case, where you know what sorts of values are unexpected.
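
The evaluation described above can be sketched roughly as follows. This is a minimal illustration, not Kloudfuse's implementation: the function name, the aggregate keys, and the "above"/"below" operator strings are all hypothetical.

```python
from statistics import mean

# Hypothetical aggregate names mirroring the average/minimum/maximum/sum options.
AGGREGATES = {"avg": mean, "min": min, "max": max, "sum": sum}

def threshold_breached(values, aggregate, operator, threshold):
    """Aggregate the samples in the evaluation window and compare to a static threshold."""
    agg_value = AGGREGATES[aggregate](values)
    if operator == "above":
        return agg_value > threshold
    if operator == "below":
        return agg_value < threshold
    raise ValueError(f"unknown operator: {operator}")

# CPU usage samples (percent) collected over the selected period.
cpu_window = [72.0, 88.0, 95.0, 91.0]
print(threshold_breached(cpu_window, "avg", "above", 85.0))  # True: the average is 86.5
```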

Change Alert

A change alert evaluates the difference between a metric's current value and its value N minutes ago. Each evaluation calculates the raw difference (not the absolute value) between the series now and the series N minutes ago, then computes the average, minimum, maximum, or sum of that difference over the selected period. An alert is triggered when this computed series crosses the threshold.
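
As a rough sketch of that calculation, assuming one sample per minute so that "N minutes ago" is simply N positions earlier in the list (the function name and sample data are hypothetical):

```python
from statistics import mean

def change_series(samples, shift):
    """Signed difference between each sample and the one `shift` steps earlier."""
    return [samples[i] - samples[i - shift] for i in range(shift, len(samples))]

# One sample per minute; compare each point with the value 5 minutes earlier.
requests_per_min = [100, 102, 101, 99, 100, 140, 150, 155, 160, 158]
diffs = change_series(requests_per_min, shift=5)
print(diffs)             # [40, 48, 54, 61, 58]
print(mean(diffs) > 50)  # True: the average change is 52.2
```

Because the difference is signed, a drop in the metric produces negative values, so a "below" threshold can be used to alert on decreases.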

Outliers Alert

An outlier alert detects when a member of a group (e.g., hosts, availability zones, partitions) is behaving unusually compared to the rest. Each evaluation checks whether all groups are clustered together, exhibiting the same behavior. An alert is triggered whenever at least one group diverges from the rest of the groups. Review the available options in the configuration section before setting up this alert.

Anomaly Alert

An anomaly alert uses past behavior to detect when a metric is behaving abnormally, taking into account trends and seasonal day-of-week and time-of-day patterns. Review the available options in the configuration section before setting up this alert.

Forecast Alert

A forecast alert predicts the future behavior of a metric and compares it to a static threshold. An alert is triggered whenever the metric is forecast to cross the threshold in the future. See the details of each algorithm when selecting the algorithm name in "Set Conditions" below. Review the available options in the configuration section before setting up this alert.

Creating an Alert

Step 1: Choose the detection method

Choose the detection type from the available tabs (Threshold, Change, Outliers, Anomaly, or Forecast).

 

Step 2: Define the search query

Use the query builder to construct a query that measures metrics, groups by one or more dimensions, and so on. The underlying query is a PromQL query. Optionally, configure additional functions to apply on top of the query (for an explanation of the available functions, please see here).

 

Step 3: Populate Condition and Evaluation

Threshold Alert

  • Populate the Condition section by defining the:

    • aggregate to be used on the query result from the drop-down.

    • query or expression from the drop-down

    • thresholds that should be breached for the alert to be firing

  • Populate the Evaluation section by defining the:

    • evaluation frequency, which determines how often the alert expression or query is evaluated (it must be a multiple of 10 seconds; for example, 30s or 1m), and

    • the duration for which the condition must be true before an alert fires

(Note: Once a condition is breached, the alert goes into the “Pending” state. If the condition remains breached for the duration specified in “For”, the alert transitions to the “Firing” state; otherwise, it reverts to the “Normal” state.)
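
The state transitions in the note can be sketched as a small state machine. This is an illustrative model only: the function names are hypothetical, and two behaviors are assumptions not stated above (a firing alert returns to "Normal" as soon as the condition clears, and a "For" duration of 0 fires on the first breach).

```python
def evaluate(breached, pending_since, now, for_seconds):
    """Return (state, pending_since) after one evaluation at time `now` (in seconds)."""
    if not breached:
        return "Normal", None
    if pending_since is None:
        pending_since = now  # first breached evaluation starts the "For" timer
    if now - pending_since >= for_seconds:
        return "Firing", pending_since
    return "Pending", pending_since

def run(breaches, interval_seconds, for_seconds):
    """Evaluate a sequence of breach results at a fixed evaluation frequency."""
    if interval_seconds % 10 != 0:
        raise ValueError("evaluation frequency must be a multiple of 10 seconds")
    states, pending_since = [], None
    for i, breached in enumerate(breaches):
        state, pending_since = evaluate(breached, pending_since, i * interval_seconds, for_seconds)
        states.append(state)
    return states

# Evaluate every 30s with "For" set to 60s.
print(run([False, True, True, True, False], interval_seconds=30, for_seconds=60))
# ['Normal', 'Pending', 'Pending', 'Firing', 'Normal']
```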

Change Alert

  • Populate the Condition section by defining the:

    • aggregate to be used on the query result from the drop-down.

    • change type from the drop-down:

      • change - the threshold is compared to the difference between the new value and the old value

      • change % - the threshold is compared to the percentage change relative to the old value ((new - old) / old)

    • query or expression from the drop-down.

    • thresholds that must be breached for the alert to fire

  • Populate the Evaluation section by defining the duration for which the condition must be true before an alert fires.
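
To make the two change types concrete, here is a small sketch of how the compared value differs between them. The function name is hypothetical, and multiplying by 100 is an assumption about how the percentage threshold is entered; the formula itself is the (new - old) / old given above.

```python
def change_value(old, new, change_type):
    """Value that the threshold is compared against for each change type."""
    if change_type == "change":
        return new - old
    if change_type == "change %":
        if old == 0:
            raise ZeroDivisionError("percentage change is undefined when the old value is 0")
        # Assumption: the threshold is entered as a percentage, hence the factor of 100.
        return (new - old) / old * 100
    raise ValueError(f"unknown change type: {change_type}")

print(change_value(200, 250, "change"))    # 50
print(change_value(200, 250, "change %"))  # 25.0
```

The same 50-unit increase would be a 25% change from 200 but only a 5% change from 1000, which is why "change %" tends to suit metrics whose baseline varies widely.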

Outliers Alert

  • Choose the outlier detection algorithm from the available list and define the duration for which the outlier condition must hold before an alert fires.

    • DBSCAN

      • Details: DBSCAN is a clustering algorithm that groups similar data points together. Read more here.

      • Required Parameters:

        • Tolerance: Use this parameter to tune how far a group must deviate before it counts as an outlier. Think of it as the percentage difference from the median value. For example, without knowing the actual CPU usage values, you can say that a replica pod is an outlier if it uses 50% (tolerance = 0.50) more CPU than the median CPU usage across all replicas.
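
The following sketch illustrates only the "percentage difference from the median" reading of Tolerance. It is not DBSCAN and not Kloudfuse's implementation; the function name and pod data are hypothetical, and it assumes a positive median and treats deviations in either direction as outliers.

```python
from statistics import median

def outliers_by_tolerance(usage_by_group, tolerance):
    """Groups whose value differs from the median by more than `tolerance` (as a fraction)."""
    med = median(usage_by_group.values())
    return sorted(
        name for name, value in usage_by_group.items()
        if abs(value - med) > tolerance * med
    )

# CPU usage (cores) per replica pod; the median is 0.41.
cpu_by_pod = {"pod-a": 0.40, "pod-b": 0.42, "pod-c": 0.38, "pod-d": 0.70}
print(outliers_by_tolerance(cpu_by_pod, tolerance=0.50))  # ['pod-d']
```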

Anomaly Alert

  • Choose the anomaly detection algorithm from the available list and define the duration for which the anomalous condition must hold before an alert fires.

    • Basic

      • Details: Use when metrics have no repeating seasonal pattern. Basic uses a simple lagging rolling quantile computation to determine the range of expected values. It uses little data and adjusts quickly to changing conditions but has no knowledge of seasonal behavior or longer trends.

      • Parameters

        • Window: rollup time duration to use.

        • Bound: Deviation bound for acceptable values. Input values outside the acceptable bounds are considered anomalous. A bound of 1 means that anomalous values deviate from the input values by more than 1 standard deviation (above or below).

        • Band:

          • upper: Use if only values higher than the upper bound are to be considered anomalous.

          • lower: Use if only values lower than the lower bound are to be considered anomalous.

          • both: Use if values higher than the upper bound and values lower than the lower bound are to be considered anomalous.

    • Agile

      • Details: Use when metrics are seasonal and expected to shift. The algorithm quickly adjusts to metric level shifts. A robust version of the SARIMA algorithm, it incorporates the immediate past into its predictions, allowing quick updates for level shifts at the expense of being less robust to recent, long-lasting anomalies.

      • Parameters

        • Window: rollup time duration to use.

        • Bound: Deviation bound for acceptable values. Input values outside the acceptable bounds are considered anomalous. A bound of 1 means that anomalous values deviate from the input values by more than 1 standard deviation (above or below).

        • Band:

          • upper: Use if only values higher than the upper bound are to be considered anomalous.

          • lower: Use if only values lower than the lower bound are to be considered anomalous.

          • both: Use if values higher than the upper bound and values lower than the lower bound are to be considered anomalous.

    • Robust

      • Details: Use when seasonal metrics are expected to be stable and slow level shifts are considered anomalies. A seasonal-trend decomposition algorithm, it is stable, and its predictions remain constant even through long-lasting anomalies, at the expense of taking longer to respond to intended level shifts (for example, if the level of a metric shifts due to a code change).

      • Parameters

        • Window: rollup time duration to use.

        • Bound: Deviation bound for acceptable values. Input values outside the acceptable bounds are considered anomalous. A bound of 1 means that anomalous values deviate from the input values by more than 1 standard deviation (above or below).

        • Model: Use the additive model when the seasonal component does not vary with the level of the time series. Use the multiplicative model if the seasonal component is proportional to the level of the time series.

        • Period: Its value should be less than or equal to the selected Window, in minutes.

        • Band:

          • upper: Use if only values higher than the upper bound are to be considered anomalous.

          • lower: Use if only values lower than the lower bound are to be considered anomalous.

          • both: Use if values higher than the upper bound and values lower than the lower bound are to be considered anomalous.

    • RRCF (Robust Random Cut Forest)

      • Details: Use when static thresholds are not viable. The RRCF algorithm is stable in the presence of both seasonality and trend, as long as the parameters are chosen so that the input data captures the seasonality and the trend.

      • Parameters

        • Global history: Time window to use for the rolling dataset (from the metric query done over this time window). At any point in time, the RRCF algorithm captures the signal behavior seen over this time window (essentially to capture trend).

        • Local history: Time window to use for capturing the signal behavior in the recent past (essentially to capture seasonality).
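
The Bound and Band parameters shared by the Basic, Agile, and Robust algorithms can be illustrated with a simplified sketch. It uses the mean and standard deviation of a recent window as a stand-in for each algorithm's expected range, which is an assumption made for brevity (Basic, for instance, is described above as using a rolling quantile). The function name and data are hypothetical.

```python
from statistics import mean, pstdev

def is_anomalous(history, value, bound, band="both"):
    """Check `value` against mean +/- bound * standard deviation of `history`."""
    center = mean(history)
    spread = pstdev(history)
    upper = center + bound * spread
    lower = center - bound * spread
    if band == "upper":
        return value > upper
    if band == "lower":
        return value < lower
    if band == "both":
        return value > upper or value < lower
    raise ValueError(f"unknown band: {band}")

# Recent window of samples: mean 10.5, standard deviation about 1.12.
window = [10, 12, 11, 9, 10, 12, 11, 9]
print(is_anomalous(window, 14, bound=2, band="upper"))  # True: above roughly 12.74
print(is_anomalous(window, 14, bound=2, band="lower"))  # False: only low values count
print(is_anomalous(window, 7, bound=2, band="both"))    # True: below roughly 8.26
```

A larger Bound widens the acceptable range and produces fewer alerts; restricting Band to "upper" or "lower" helps when only one direction of deviation matters (for example, only unusually high latency).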


Forecast Alert

  • Select the algorithm to use from the list of available options and define the duration for which the forecasted value must breach the threshold before an alert fires.

    • Linear

      • Details: Use the linear algorithm for metrics that have steady trends but no repeating seasonal pattern. It performs a robust linear regression over the entire history.

      • Parameters

        • History duration: the amount of past data to use for making the forecast.

        • Forecast duration: how far into the future, from now, to predict the value of the time series.

    • Seasonal

      • Details: Use when there is seasonality in the metric. This uses double exponential smoothing (Holt-Winters).

      • Parameters

        • History duration: the amount of past data to use for making the forecast.

        • Seasonality factor: Smoothing factor to use for the prediction. The lower the smoothing factor, the more importance is given to old data. The value must be between 0 and 1.

        • Trend factor: Trend factor to use for the prediction. The higher the trend factor, the more weight is given to trends in the data. The value must be between 0 and 1.
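
Both forecast approaches can be sketched in a few lines. These are textbook versions under stated assumptions, not Kloudfuse's implementation: the linear sketch uses ordinary least squares rather than a robust regression, and the smoothing sketch maps "Seasonality factor" to the level smoothing factor and "Trend factor" to the trend smoothing factor. All function and parameter names are hypothetical, and the forecast duration is expressed in sample steps.

```python
def linear_forecast(values, steps_ahead):
    """Fit a least-squares line through the history and extrapolate it."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
        / sum((x - x_mean) ** 2 for x in range(n))
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)

def double_exponential_forecast(values, smoothing, trend, steps_ahead):
    """Double exponential smoothing: track a level and a trend, then extrapolate."""
    level, slope = values[0], values[1] - values[0]
    for y in values[1:]:
        prev_level = level
        level = smoothing * y + (1 - smoothing) * (level + slope)
        slope = trend * (level - prev_level) + (1 - trend) * slope
    return level + steps_ahead * slope

# Disk usage (percent) growing by 2 points per sample.
disk_usage = [50, 52, 54, 56, 58]
predicted = linear_forecast(disk_usage, steps_ahead=10)
print(predicted)         # 78.0
print(predicted > 75.0)  # True: forecast to cross a 75% threshold
print(double_exponential_forecast(disk_usage, smoothing=0.5, trend=0.5, steps_ahead=10))  # 78.0
```

On perfectly linear data the two methods agree; they diverge on noisy data, where the smoothing and trend factors control how strongly recent samples pull the forecast.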


Step 4: Populate Name and Title details

  • Choose the folder in which the alert definition should be saved. (If you need a separate folder, create one using the “new folder” option in the drop-down menu.)

  • Rule Name: set a descriptive name for the rule.

  • Group Name: Specify a group name. Rules within a group are run sequentially at regular intervals, with the same evaluation time.

  • Populate title and summary with variables to include additional information in the alert.

Step 5: Configure a contact point

  • Choose how notifications are sent to your teams (email, Slack, PagerDuty, etc.). Choose an existing contact point from the drop-down menu to receive notifications when this alert fires, or create a new one. To configure a new contact point, please see the details for each type of contact point in this section. Once done, click “Create Rule”.