Max Latency vs P99 Latency

TL;DR: P99 (and other percentile) latencies are estimated by extrapolation from bucketed data. Max latency is recorded directly from source spans and involves no extrapolation. As a result, the reported P99 can sometimes be higher than the reported Max.

A common observation on Kloudfuse screens with RED metrics is that the P99 latency for a service or span is sometimes higher than the recorded Max latency. This is a side effect of the metric bucketing system, which relies on extrapolation to compute P99 and other percentile latencies. The effect is most visible when the number of data points is low.

The image below represents latency data flowing into the Kloudfuse system for an arbitrary time window. As the data streams in, Kloudfuse keeps track of the Max value seen during each evaluation interval (for example, the Max for every minute of data). This Max value is stored per evaluation interval, so it is always an exact value taken from a real span.

image-20240514-172907.png
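
The per-interval Max tracking described above can be sketched in a few lines of Python. This is only an illustration, not Kloudfuse's implementation: the function name `max_per_interval`, the `(timestamp, latency)` sample shape, and the 60-second interval (taken from the "every minute" example) are all assumptions.

```python
def max_per_interval(samples, interval_s=60):
    """Return {interval_start_seconds: max_latency_ms}.

    `samples` is an iterable of (timestamp_seconds, latency_ms) pairs.
    The sample shape and the 60 s default are hypothetical.
    """
    maxima = {}
    for ts, latency_ms in samples:
        # Align the timestamp to the start of its evaluation interval.
        start = int(ts // interval_s) * interval_s
        # Keep the largest value seen so far; the stored Max is always
        # one of the real observed latencies.
        if start not in maxima or latency_ms > maxima[start]:
            maxima[start] = latency_ms
    return maxima


samples = [(0, 120.0), (15, 340.0), (59, 95.0), (61, 410.0), (119, 30.0)]
result = max_per_interval(samples)
print(result)  # {0: 340.0, 60: 410.0}
```

Because each stored value is an observed latency, the Max never needs to be estimated.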

The P99 latency calculation, on the other hand, uses the bucketing logic shown in the image below. Instead of storing the exact latency values over a time period, the system adds each point to a larger bucket and keeps only a count per bucket. Once a latency value is mapped to a bucket, the system no longer has the granularity to tell where exactly the point lies within that bucket. For example, the single point recorded in Bucket 3 below could be at 101ms or at 999ms. In this example, when a query is made to calculate P99, the system first determines that the P99 latency falls in Bucket 3, and then extrapolates to estimate where in the bucket it lies. If the real point sits near the low end of the bucket and the estimate lands near the high end, the estimated P99 ends up higher than the recorded Max.

image-20240514-172929.png
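
The following minimal sketch reproduces the effect. It rests on several assumptions that are not confirmed by Kloudfuse: the bucket upper bounds (`BOUNDS_MS`) are made up so that Bucket 3 covers roughly 100ms to 1000ms, and the position inside the bucket is estimated by linear interpolation, the approach used by Prometheus-style histogram quantiles. The function names are hypothetical, and the real estimator may differ.

```python
import bisect

# Hypothetical bucket upper bounds in ms: Bucket 1 = (0, 10],
# Bucket 2 = (10, 100], Bucket 3 = (100, 1000].
BOUNDS_MS = [10.0, 100.0, 1000.0]


def bucket_counts(latencies, bounds=BOUNDS_MS):
    """Count how many latencies fall into each bucket."""
    counts = [0] * len(bounds)
    for value in latencies:
        idx = bisect.bisect_left(bounds, value)  # first bound >= value
        # Values above the last bound are clamped into the last bucket.
        counts[min(idx, len(bounds) - 1)] += 1
    return counts


def estimate_quantile(q, counts, bounds=BOUNDS_MS):
    """Estimate quantile q from bucket counts only (assumed method:
    linear interpolation between the bucket's lower and upper bound)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no data points")
    rank = q * total
    cumulative = 0
    lower = 0.0
    for count, upper in zip(counts, bounds):
        if count > 0 and cumulative + count >= rank:
            fraction = (rank - cumulative) / count
            return lower + (upper - lower) * fraction
        cumulative += count
        lower = upper
    return bounds[-1]


latencies = [5.0, 8.0, 40.0, 60.0, 150.0]  # true Max is 150 ms
counts = bucket_counts(latencies)           # [2, 2, 1]
p99 = estimate_quantile(0.99, counts)
print(max(latencies), round(p99, 1))        # 150.0 955.0
```

The only point in Bucket 3 is at 150ms, but the estimator knows just that one point lies somewhere in that bucket, so with five points its P99 estimate lands near the top of the bucket, far above the true Max.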

What can be done to fix this?

  • Increase the time range of the query so that it covers more data points. With more points in each bucket, the extrapolated values move closer to the real values.

  • Add more granular buckets through configuration. Each bucket stands in for all the points within it, so the narrower the buckets, the less information is lost.
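
To illustrate the second option, the sketch below estimates P99 for the same five latencies with a coarse and a finer bucket layout. Both layouts are invented for this example, `estimate_p99` is a hypothetical helper, and linear interpolation within a bucket is an assumption about the estimator rather than a description of Kloudfuse internals.

```python
import bisect


def estimate_p99(latencies, bounds):
    """Bucket the latencies, then estimate P99 from the counts alone
    (assumed method: linear interpolation inside the bucket)."""
    counts = [0] * len(bounds)
    for value in latencies:
        idx = min(bisect.bisect_left(bounds, value), len(bounds) - 1)
        counts[idx] += 1
    rank = 0.99 * len(latencies)
    cumulative, lower = 0, 0.0
    for count, upper in zip(counts, bounds):
        if count > 0 and cumulative + count >= rank:
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative, lower = cumulative + count, upper
    return bounds[-1]


latencies = [5.0, 8.0, 40.0, 60.0, 150.0]  # true Max is 150 ms

coarse = [10.0, 100.0, 1000.0]                          # hypothetical
fine = [10.0, 25.0, 50.0, 100.0, 200.0, 400.0, 1000.0]  # hypothetical

p99_coarse = estimate_p99(latencies, coarse)
p99_fine = estimate_p99(latencies, fine)
print(round(p99_coarse, 1), round(p99_fine, 1))  # 955.0 195.0
```

In this example the finer layout still reports a P99 above the Max of 150ms, but the gap shrinks from about 805ms to about 45ms, because the bucket containing the slowest point is only 100ms wide.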

In addition to the above, the Kloudfuse team has a longer-term plan to minimize the delta between real and extrapolated values.
