APM (Traces)
Kloudfuse Application Performance Monitoring (APM) gives deep visibility into your applications with out-of-the-box performance dashboards for services, resources, and databases to monitor requests, errors, throughput, and latency. Distributed traces seamlessly correlate to different sessions, hosts, and containers.
APM Terms and Concepts
The APM UI provides many tools to troubleshoot application performance and correlate it throughout the product, which helps you find and resolve issues in distributed systems.
CONCEPT | DESCRIPTION |
---|---|
Service | Services are the building blocks of modern microservice architectures - broadly a service groups together endpoints, queries, or jobs to build your application. |
Resource | Resources represent a particular domain of a customer application - they are typically an instrumented web endpoint, database query, or background job. |
Alerts | APM metric alerts work like regular metric monitors, but with controls tailored specifically to APM. Use these to receive alerts at the service level on hits, errors, and a variety of latency measures. |
Trace | A trace is used to track the time spent by an application processing a request and the status of this request. Each trace consists of one or more spans. |
Span | A span represents a logical unit of work in a distributed system for a given period. Multiple spans construct a trace. |
Service entry span | A span is a service entry span when it is the entry point method for a request to a service. In the Kloudfuse APM UI, you can recognize it on a flame graph because its immediate parent is drawn in a different color. |
Trace root span | A span is the root span when it is the entry point method for the trace. Its start marks the beginning of the trace. |
Trace metrics | Trace metrics are automatically collected and kept with a 15-month retention policy similar to other Kloudfuse metrics. They can be used to identify and alert on hits, errors, or latency. Statistics and metrics are always calculated based on all traces and are not impacted by ingestion controls. |
Span tags | Tag spans in the form of key-value pairs to correlate a request in the Trace View or filter in Analytics. |
Execution Time | Total time that a span is considered ‘active’ (not waiting for a child span to complete), scaled according to the number of concurrent active spans. |
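The difference between a trace root span and a service entry span can be sketched in a few lines of Python. This is a minimal illustration, not Kloudfuse's implementation: the span dictionaries and their field names (`id`, `parent_id`, `service`) are hypothetical, and it assumes a span counts as a service entry span when it has no parent or when its parent belongs to a different service.

```python
def classify_spans(spans):
    """Return (root_span_ids, service_entry_span_ids) for a single trace.

    Each span is a dict with hypothetical keys: id, parent_id, service.
    """
    by_id = {span["id"]: span for span in spans}
    roots, entries = set(), set()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent is None:
            # No parent in this trace: the span starts the trace
            # and is also the entry point into its own service.
            roots.add(span["id"])
            entries.add(span["id"])
        elif parent["service"] != span["service"]:
            # The request crossed a service boundary at this span.
            entries.add(span["id"])
    return roots, entries


trace = [
    {"id": "a", "parent_id": None, "service": "web-store"},
    {"id": "b", "parent_id": "a", "service": "auth-service"},
    {"id": "c", "parent_id": "b", "service": "auth-service"},
    {"id": "d", "parent_id": "c", "service": "payment-db"},
]
roots, entries = classify_spans(trace)
print(sorted(roots), sorted(entries))
```

Here span `c` is neither a root nor an entry span, because its parent runs in the same service.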
Sending traces to Kloudfuse data plane:
Please refer to https://docs.datadoghq.com/tracing/trace_collection/ if you are using the Kloudfuse agent.
Please refer to Traces using Otel Collector if you are using the OpenTelemetry Collector.
In the OpenTelemetry Collector’s configuration, set the otlphttp section as follows:
otlphttp:
  tls:
    insecure: true
  traces_endpoint: http://<kloudfuse ingress IP>/ingester/otlp/traces
If you are configuring an https endpoint:
otlphttp:
  tls:
    insecure_skip_verify: true
  traces_endpoint: https://<kloudfuse ingress IP>/ingester/otlp/traces
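The two otlphttp variants differ only in the URL scheme and the TLS option. The following Python sketch generates either one. The helper name `otlphttp_exporter` and the address `203.0.113.10` (a documentation-only placeholder) are hypothetical. It also assumes the https variant uses an `https://` URL together with `insecure_skip_verify`.

```python
import json


def otlphttp_exporter(ingress, use_https=False):
    """Build the otlphttp section for a given Kloudfuse ingress address."""
    scheme = "https" if use_https else "http"
    # Plain http disables TLS entirely; https keeps TLS but skips
    # certificate verification, mirroring the two snippets above.
    tls = {"insecure_skip_verify": True} if use_https else {"insecure": True}
    return {
        "otlphttp": {
            "tls": tls,
            "traces_endpoint": f"{scheme}://{ingress}/ingester/otlp/traces",
        }
    }


# Placeholder ingress address; substitute your own.
print(json.dumps(otlphttp_exporter("203.0.113.10", use_https=True), indent=2))
```

The output is JSON, which is also valid YAML, so it can be pasted into a collector configuration for comparison.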
If you are using the elastic agent, set the host under apm-server as follows:
apm-server:
  host: "https://<kloudfuse ingress IP>/ingester/"
Services Observability:
Services List:
After instrumenting your application, the Services List is your main landing page for APM data.
The services list is a high-level view of all services reporting from your infrastructure. Services are the building blocks of modern microservice architectures - broadly a service groups together endpoints, queries, or jobs for the purposes of scaling instances. Some examples:
A group of URL endpoints may be grouped together under an API service.
A group of DB queries may be grouped together within one database service.
A group of periodic jobs configured in the crond service.
The screenshot below shows a distributed microservice system for an e-commerce site builder. There’s a web-store, ad-server, payment-db, and auth-service, all represented as services in APM.
Filtering the service list
Filter the services list by:
Environment
Kubernetes tags (cluster, namespace, etc.)
Cloud tags (region, availability zone, etc.)
Services type, version and language
All services can be found in the Service List and visually represented on the Service Map. Each service has its own Service page where trace metrics like throughput, latency, and error rates can be viewed and inspected. Use these metrics to create dashboard widgets, create alerts, and see the performance of every resource such as a web endpoint or database query belonging to the service.
Services types
Every service monitored by your application is associated with a type. Kloudfuse automatically determines this type based on the span.type attribute attached to your spans. The type specifies the name of the application or framework that the Kloudfuse Agent is integrating with.
For example, if you use the official Flask Integration, the Type is set to “Web”. If you are monitoring a custom application, the Type appears as “Custom”.
The type of the service can be one of:
Cache
Custom
DB
Http
Queue
Web
Some integrations alias to types. For example, Postgres, MySQL, and Cassandra map to the type “DB”. Redis and Memcache integrations map to the type “Cache”.
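The aliasing described above amounts to a lookup table. The sketch below is illustrative only: the `TYPE_ALIASES` table and the `service_type` helper are hypothetical, the lowercase integration keys are assumed spellings, and falling back to “Custom” for unknown integrations is an assumption based on the custom-application example. Kloudfuse itself derives the type from the span.type attribute.

```python
# Hypothetical alias table built from the examples in this section.
TYPE_ALIASES = {
    "postgres": "DB",
    "mysql": "DB",
    "cassandra": "DB",
    "redis": "Cache",
    "memcache": "Cache",
    "flask": "Web",
}


def service_type(integration):
    """Map an integration name to a service type (assumed fallback: Custom)."""
    return TYPE_ALIASES.get(integration.lower(), "Custom")


for name in ("Postgres", "Redis", "Flask", "my-batch-job"):
    print(name, "->", service_type(name))
```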
Service columns
The service list contains the following columns:
Requests: Total number of requests traced (per second)
Median/p75/p90/p95/p99/Max Latency: Median/p75/p90/p95/p99/Max latency of your traced requests
Error Rate: Number of traced requests (per second) that ended with an error
Apdex: Apdex score of the service. Learn more about Apdex.
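For orientation, the standard Apdex formula is (satisfied + tolerating / 2) / total. A request is satisfied when its latency is at most a threshold T, and tolerating when it is at most 4T. The sketch below applies that standard definition. The function name and the 500 ms threshold are hypothetical, and the source does not state how Kloudfuse configures the threshold or treats errored requests.

```python
def apdex(latencies_ms, threshold_ms):
    """Standard Apdex score in [0, 1], or None when there are no requests."""
    if not latencies_ms:
        return None
    satisfied = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    tolerating = sum(
        1 for ms in latencies_ms if threshold_ms < ms <= 4 * threshold_ms
    )
    # Requests slower than 4 * threshold are "frustrated" and add nothing.
    return (satisfied + tolerating / 2) / len(latencies_ms)


# Two satisfied, one tolerating, one frustrated request at T = 500 ms.
print(apdex([100, 200, 600, 2500], threshold_ms=500))  # 0.625
```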
Overview
Selecting a service on the services page leads you to the detailed service page. A service is a set of processes that do the same job - for example a web framework or database (read more about how services are defined in Getting Started with APM).
On this page, you can consult:
Out-of-the-box graphs
Resources associated to this service
Additional tabs
Deployments, Error Tracking, Traces, and more
Out-of-the-box graphs
Kloudfuse provides out-of-the-box graphs for any given Service:
Requests - Choose to display:
The Total amount of requests and errors
The amount of Requests per second
Latency - Choose to display:
The Median/p75/p90/p95/p99/Max latency of your traced requests
The Latency distribution
The Apdex score for web services; learn more about Apdex
Error - Choose to display:
The Total amount of errors
The amount of Errors per second
The % Error Rate
Dependency Map:
The Dependency Map showing upstream and downstream services.
Sub-services: When there are multiple services involved, a fourth graph (in the same toggle option as the Dependency Map) breaks down the Total time spent, % of time spent, or Avg time per request of your service by services or type.
This represents the total, relative, and average time spent by traces in downstream services from the current service to the other services or type.
Note: For services like Postgres or Redis, which are “final” operations that do not call other services, there is no sub-services graph.
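The three sub-services views (total, relative, and average time) can be illustrated with a small aggregation. This is a sketch under stated assumptions, not Kloudfuse's computation: each trace is represented as a list of hypothetical `(service, time_ms)` pairs, and “Avg time per request” is assumed to mean total time divided by the number of traces.

```python
from collections import defaultdict


def time_spent_breakdown(traces):
    """Aggregate time spent per service across a list of traces.

    Each trace is a list of (service, time_ms) pairs (hypothetical shape).
    """
    totals = defaultdict(float)
    for trace in traces:
        for service, time_ms in trace:
            totals[service] += time_ms
    grand_total = sum(totals.values())
    return {
        service: {
            "total_ms": total,
            "percent": 100.0 * total / grand_total if grand_total else 0.0,
            "avg_ms_per_request": total / len(traces),
        }
        for service, total in totals.items()
    }


traces = [
    [("web-store", 30), ("payment-db", 70)],
    [("web-store", 50), ("payment-db", 50)],
]
for service, stats in time_spent_breakdown(traces).items():
    print(service, stats)
```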
Resources
See Requests, Latency, and Error graphs broken down by resource to identify problematic resources. Resources are particular actions for your services (typically individual endpoints or queries). Read more in Getting Started with APM.
Below, there’s a list of resources associated with your service. Sort the resources for this service by requests, latency, errors, and time, to identify areas of high traffic or potential trouble. Note that these metric columns are configurable (see image below).
Click on a resource to open a side panel that displays the resource’s out-of-the-box graphs (requests, errors, and latency), a resource dependency map, and a span summary table, which toggles between upstream and downstream spans.
Use keyboard navigation keys to toggle between resources on the Resources list and compare resources in a service. To view the full resource page, click Open Full Page.
Columns
Choose what to display in your resources list:
Requests: Absolute amount of requests traced
Requests per second: Absolute amount of requests traced per second
Median/p75/p90/p95/p99/Max Latency: The Median/p75/p90/p95/p99/Max latency of your traced requests
Errors: Absolute amount of errors traced
Errors per second: Absolute amount of errors traced per second
Error Rate: Percent of spans traced that were errors
Traces
View the list of traces associated with the service in the traces tab, which is already filtered on your service, environment, and operation name. Drill down to problematic spans using core facets such as status, resource, and error type. For more information, click a span to view a flame graph of its trace and more details.
Resource page
A resource is a particular action for a given service (typically an individual endpoint or query). Read more about resources in Getting Started with APM. For each resource, APM automatically generates a dashboard page covering:
Key health metrics
Monitor status for all monitors associated with this service
List and metrics for all resources associated with this service
Out-of-the-box graphs
Kloudfuse provides out-of-the-box graphs for any given resource:
Requests - Choose to display:
The Total amount of requests
The amount of Requests per second
Latency - Choose to display:
The Median/p75/p90/p95/p99/Max latency of your traced requests
Error - Choose to display:
The Total amount of errors
The amount of Errors per second
The % Error Rate
Sub-Services: When there are multiple services involved, a fourth graph is available that breaks down the Total time spent, % of time spent, or Avg time per request of your service by services or type.
This represents the total/relative/average time spent by traces from the current service to the other services or type.
Note: For services like Postgres or Redis, which are “final” operations that do not call other services, there is no sub-services graph.
Dependency Map
You can also view a map of all of a resource’s upstream and downstream service dependencies. With the Dependency Map, you can quickly see the flow of services with spans that go through the specific resource (such as endpoints or database queries) end-to-end.
Hover over a node to view metrics of each service including requests/second, error rate, and average latency.
Span summary
For a given resource, Kloudfuse APM provides you a span analysis breakdown of all matching traces:
The displayed metrics represent, per span:
Avg Spans/trace: Average number of occurrences of the span, for traces including the current resource, where the span is present at least once.
% of Traces: Percentage of traces including the current resource where the span is present at least once.
Avg Duration: Average duration of the span, for traces including the current resource, where the span is present at least once.
Avg % Exec Time: Average ratio of execution time for which the span was active, for traces including the current resource, where the span is present at least once.
Note: A span is considered active when it’s not waiting for a child span to complete, scaled according to the number of concurrent active spans. The active spans at a given time, for a given trace, are all the currently executing leaf spans (in other words, spans without children).
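The definition of execution time in the note above can be sketched as follows. This is an illustrative reading of that definition, not Kloudfuse's implementation, and the span fields (`id`, `parent_id`, `start`, `end`) are hypothetical. The timeline is cut at every span start and end. In each slice, the active spans are the running spans that have no running child, and the slice duration is shared equally among them.

```python
from collections import defaultdict


def execution_times(spans):
    """Return {span_id: execution_time} for the spans of a single trace."""
    boundaries = sorted({t for s in spans for t in (s["start"], s["end"])})
    exec_time = defaultdict(float)
    for lo, hi in zip(boundaries, boundaries[1:]):
        # Spans that cover the whole slice [lo, hi].
        running = [s for s in spans if s["start"] <= lo and s["end"] >= hi]
        running_ids = {s["id"] for s in running}
        # A running span with a running child is waiting, not active.
        waiting = {s["parent_id"] for s in running if s["parent_id"] in running_ids}
        active = [s for s in running if s["id"] not in waiting]
        for s in active:
            exec_time[s["id"]] += (hi - lo) / len(active)
    return dict(exec_time)


trace = [
    {"id": "root", "parent_id": None, "start": 0, "end": 100},
    {"id": "a", "parent_id": "root", "start": 10, "end": 40},
    {"id": "b", "parent_id": "root", "start": 20, "end": 60},
]
print(execution_times(trace))  # {'root': 50.0, 'a': 20.0, 'b': 30.0}
```

In this example the per-span execution times add up to the 100 time units the trace lasted, because every slice is distributed among whichever spans are active in it.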
Traces
Consult the list of traces associated with this resource in the Trace search modal already filtered on your environment, service, operation, and resource name:
Trace View
View an individual trace to see all of its spans and associated metadata. Each trace can be viewed either as a flame graph or as a list (grouped by service or host).
Calculate the breakdown of execution time and adjust the color scheme by either service or host.
To get a closer look at the flame graph, zoom in by scrolling:
If you are analyzing a trace that reports an error, the error gets a dedicated display, provided you follow the special-meaning tag rules. When submitting your traces, you can add attributes to the meta parameter.
Some attributes have special meanings that lead to a dedicated display or specific behavior in Kloudfuse:
ATTRIBUTE | DESCRIPTION |
---|---|
| Allows specific SQL query formatting and display in Kloudfuse’s UI. |
| Allows dedicated display for error message. |
| Allows dedicated display for error types. Available types include, for instance, in Python |
| Allows a better display of the stack trace of an exception in Kloudfuse’s UI (red boxes, etc…) |
Request Flow Map
Request flow maps combine two key features of Kloudfuse APM: the service map and live exploring, to help you understand and track request paths through your stack. Quickly identify noisy services and choke points, or how many database calls are generated by a request to a specific endpoint.
No additional configuration is required to use these flow maps, and they are powered by your ingested spans. Scope your traces to any combination of tags and generate a dynamic map that represents the flow of requests between every service. The map is automatically generated based on your search criteria, and will regenerate live after any changes.
Navigating the request flow map
Hover over an edge that connects two services to see metrics for requests, errors, and latency for requests between those two services that match the query parameters.
The highest throughput connections are highlighted to show the most common path.
The current request flow map is a great way to generate a live architecture diagram, or one scoped to a specific user flow.
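Conceptually, each edge of a request flow map counts the spans whose parent belongs to a different service. The sketch below derives such edges from a list of spans and picks the highest-throughput connection. It is a simplified illustration with hypothetical span fields (`id`, `parent_id`, `service`), not a description of how Kloudfuse builds the map.

```python
from collections import Counter


def service_edges(spans):
    """Count calls between services as (caller, callee) -> number of spans."""
    by_id = {s["id"]: s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get(s["parent_id"])
        # Only parent/child pairs that cross a service boundary form an edge.
        if parent is not None and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return edges


spans = [
    {"id": 1, "parent_id": None, "service": "web-store"},
    {"id": 2, "parent_id": 1, "service": "auth-service"},
    {"id": 3, "parent_id": 1, "service": "payment-db"},
    {"id": 4, "parent_id": 1, "service": "payment-db"},
    {"id": 5, "parent_id": 1, "service": "web-store"},
]
edges = service_edges(spans)
busiest_edge, call_count = edges.most_common(1)[0]
print(busiest_edge, call_count)  # ('web-store', 'payment-db') 2
```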
Trace metrics capture request counts, error counts, and latency measures. They are calculated based on 100% of the application’s traffic, regardless of any trace ingestion sampling configuration. Ensure that you have full visibility into your application’s traffic by using these metrics to spot potential errors on a service or a resource, and by creating dashboards, monitors, and SLOs.
Note: If your applications and services are instrumented with OpenTelemetry libraries and you set up sampling at the SDK level and/or at the collector level, APM metrics are calculated based on the sampled set of data.
Trace metrics
Trace metrics are generated for service entry spans and certain operations depending on integration language. For example, the Django integration produces trace metrics from spans that represent various operations (1 root span for the Django request, 1 for each middleware, and 1 for the view).
The trace metrics namespace is formatted as:
Trace execution time histograms.
trace_execution_time_ms_bucket
trace_execution_time_ms_count
trace_execution_time_ms_sum
trace_span_count
Trace edge level metrics.
trace_edge_error_total
trace_edge_spans_total
trace_edge_latency_ms
trace_edge_latency_ms_bucket
trace_edge_latency_ms_count
trace_edge_latency_ms_sum
Span level metrics
span_total
span_error_total
span_latency_ms_bucket
span_latency_ms_sum
span_latency_ms_count
With the following definitions:
Trace metrics tags: possible tags are env, version, resource, parent_service_name, parent_span_name, service_name, span_name, and any Agent tags (including the host and Kubernetes tags). Note: Tags set on spans do not count and will not be available as tags for your trace metrics.
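As a rough illustration of how the span level metrics combine, the sketch below derives an error rate and an average latency from two samples of the counters. It assumes, without confirmation from this page, that span_total and span_error_total are cumulative counters and that the _sum and _count series follow the common histogram convention (sum of observed latencies and number of observations). The sample values and the helper name are hypothetical.

```python
def span_summary(before, after):
    """Derive error rate and average latency between two counter samples."""
    spans = after["span_total"] - before["span_total"]
    errors = after["span_error_total"] - before["span_error_total"]
    latency_sum = after["span_latency_ms_sum"] - before["span_latency_ms_sum"]
    latency_count = after["span_latency_ms_count"] - before["span_latency_ms_count"]
    return {
        "error_rate_percent": 100.0 * errors / spans if spans else 0.0,
        "avg_latency_ms": latency_sum / latency_count if latency_count else 0.0,
    }


# Hypothetical samples taken at the start and end of a time window.
before = {"span_total": 1000, "span_error_total": 10,
          "span_latency_ms_sum": 50000, "span_latency_ms_count": 1000}
after = {"span_total": 1200, "span_error_total": 20,
         "span_latency_ms_sum": 62000, "span_latency_ms_count": 1200}
print(span_summary(before, after))
```

In practice these values would be filtered by the tags listed above, for example service_name and span_name, before being compared.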