...
Source-Level Diagnosis: Grouping fingerprints by source allows you to understand which parts of your system generate specific log patterns. For example, if a certain error fingerprint is seen predominantly from one service (such as an authentication service), that service is likely the source of the issue. Without grouping by source, you may miss the root cause.
Resource Allocation and Scaling: If one particular source (like an API gateway or database) is generating a disproportionate number of fingerprints, it may indicate a bottleneck or resource contention issue. Understanding this allows for more targeted scaling or resource allocation to that part of the system to ensure overall system health.
Faster Troubleshooting: When logs are grouped by source, it becomes much easier to identify which part of the system is responsible for certain issues. If you know that certain fingerprints correspond to recurring problems (e.g., database errors, network issues, etc.), tracking those patterns by source helps you focus on the right area quickly.
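As a minimal sketch of this idea, the grouping itself is just a nested count of fingerprints per source; the service names and fingerprint labels below are hypothetical stand-ins for whatever your pipeline extracts:

```python
from collections import Counter, defaultdict

# Hypothetical parsed log records: each has a source and a fingerprint
# (a normalized pattern extracted from the raw message).
logs = [
    {"source": "auth-service", "fingerprint": "ERR_TOKEN_EXPIRED"},
    {"source": "auth-service", "fingerprint": "ERR_TOKEN_EXPIRED"},
    {"source": "api-gateway", "fingerprint": "ERR_UPSTREAM_TIMEOUT"},
    {"source": "auth-service", "fingerprint": "ERR_DB_CONN"},
]

def fingerprints_by_source(records):
    """Count how often each fingerprint appears per source."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["source"]][rec["fingerprint"]] += 1
    return counts

counts = fingerprints_by_source(logs)
# Here auth-service dominates ERR_TOKEN_EXPIRED, which is the signal
# that points investigation at that service rather than the gateway.
```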
Average of a Duration/Number Facet
...
Identify Bottlenecks and Latency Trends: If your logs contain durations (e.g., response times for API requests, transaction times, query execution times), calculating the average duration over time helps identify performance trends. For example, if the average duration of an API call is gradually increasing, this may signal that something in the system is slowing down and requires optimization (e.g., database queries taking longer or network latency increasing).
Estimate Resource Requirements: Knowing the average duration of specific processes or operations (e.g., API calls, data processing tasks) helps estimate resource requirements. For example, if the average duration of a batch job is increasing over time, it may indicate that more CPU or memory resources are needed to handle the load. By calculating averages, teams can plan for future scaling needs and ensure that the system can handle increasing load without performance degradation.
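A minimal sketch of the calculation, assuming duration samples have already been tagged with a time bucket (the minute indices and millisecond values below are illustrative):

```python
from statistics import mean

# Hypothetical (minute_bucket, duration_ms) samples for one API endpoint.
samples = [(0, 120), (0, 130), (1, 150), (1, 170), (2, 200), (2, 220)]

def average_duration_per_window(samples):
    """Average the duration facet within each time bucket."""
    buckets = {}
    for minute, duration in samples:
        buckets.setdefault(minute, []).append(duration)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}

averages = average_duration_per_window(samples)
# A steadily rising series (125 -> 160 -> 210 here) is the latency trend
# that would prompt optimization or capacity planning.
```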
...
Error Rate Formula
...
Use Cases:
Failure Detection: A spike in the error rate could indicate that a system component has failed or is malfunctioning. For example, a sudden rise in errors across the logs could point to a service crash, a network failure, or a hardware issue (e.g., disk failures). Quickly catching these spikes allows teams to react faster and bring the system back to normal operation.
Trend Analysis: Over time, monitoring the error rate helps identify trends that might not be immediately apparent. Gradual increases in error rates, even if subtle, can signal an issue that needs to be addressed (e.g., a misconfigured system or slowly degrading performance). Monitoring these trends allows teams to take action before a small issue becomes a major failure.
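In its simplest form, the error rate for an interval is the number of error logs divided by the total number of logs in that interval, expressed as a percentage. A short sketch with made-up per-interval counts:

```python
def error_rate(error_count, total_count):
    """Error rate as a percentage of all logs in the interval."""
    if total_count == 0:
        return 0.0
    return 100.0 * error_count / total_count

# Hypothetical (errors, total) counts per interval; the jump in the
# last interval is the kind of spike that failure detection flags.
intervals = [(5, 1000), (6, 1100), (48, 1050)]
rates = [error_rate(e, t) for e, t in intervals]
```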
Advanced Functions
Anomaly on Count of Error Logs
In the image above, around 8:40 there is a sudden, sharp spike in error logs that breaches the gray band. The anomaly is highlighted in red to indicate that the error count has exceeded the expected range, suggesting an unusual event such as a system malfunction, a deployment issue, or an unexpected traffic surge causing increased errors.
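One simple way to approximate such an expected-range band is a rolling mean plus or minus a few standard deviations of recent counts; the window size, multiplier, and error counts below are illustrative, not the product's actual algorithm:

```python
from statistics import mean, stdev

def anomalies(series, window=5, k=3.0):
    """Flag indices where a point breaches a rolling mean +/- k*stdev band,
    a simple stand-in for the gray expected-range band."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# Steady per-minute error counts with one sharp spike (like the 8:40 event).
counts = [10, 12, 11, 13, 12, 11, 95, 12]
# Only the spike at index 6 breaches the band.
```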
Outlier
In this scenario, error logs are monitored across various sources within a distributed system. Two namespaces are marked as outliers, meaning their error log rates differ significantly from those of the other namespaces. This suggests potential issues within these specific components, such as increased load, configuration problems, or code changes that are causing higher-than-normal errors.
This outlier detection allows teams to prioritize investigation into these specific sources, helping to identify and resolve issues before they impact the broader system.
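One simple way to flag such outliers is to compare each source's error rate against the group median, using the median absolute deviation (MAD) as a robust scale; the namespace names, rates, and threshold below are hypothetical:

```python
from statistics import median

def outlier_namespaces(rates, threshold=3.0):
    """Flag namespaces whose error rate deviates from the group median
    by more than `threshold` times the median absolute deviation."""
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [ns for ns, v in rates.items() if abs(v - med) / mad > threshold]

# Hypothetical error-log rates (errors/min) per namespace; two of them
# sit far above the rest and would be flagged for investigation.
rates = {"payments": 2.1, "checkout": 1.9, "search": 2.0,
         "auth": 14.5, "inventory": 11.8, "catalog": 2.2}
```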
Forecast
Log Math Operator to Scale the Y-Axis Down
...