APM (Traces)

Kloudfuse Application Performance Monitoring (APM) gives deep visibility into your applications with out-of-the-box performance dashboards for services, resources, and databases to monitor requests, errors, throughput, and latency. Distributed traces seamlessly correlate to different sessions, hosts, and containers.

APM Terms and Concepts

The APM UI provides many tools to troubleshoot application performance and correlate it throughout the product, which helps you find and resolve issues in distributed systems.

CONCEPT

DESCRIPTION

Service

Services are the building blocks of modern microservice architectures. Broadly, a service groups together endpoints, queries, or jobs to build your application.

Resource

Resources represent a particular domain of a customer application. They are typically an instrumented web endpoint, database query, or background job.

Alerts

APM metric alerts work like regular metric monitors, but with controls tailored specifically to APM. Use these to receive alerts at the service level on hits, errors, and a variety of latency measures.

Trace

A trace is used to track the time spent by an application processing a request and the status of this request. Each trace consists of one or more spans.

Span

A span represents a logical unit of work in a distributed system for a given period. Multiple spans construct a trace.

Service entry span

A span is a service entry span when it is the entry point method for a request to a service. In the Kloudfuse APM UI, you can recognize a service entry span on a flame graph because its immediate parent is drawn in a different color.

Trace root span

A span is the root span when it is the entry point method for the trace. Its start marks the beginning of the trace.

Trace metrics

Trace metrics are automatically collected and kept with a 15-month retention policy similar to other Kloudfuse metrics. They can be used to identify and alert on hits, errors, or latency. Statistics and metrics are always calculated based on all traces and are not impacted by ingestion controls.

Span tags

Tag spans in the form of key-value pairs to correlate a request in the Trace View or filter in Analytics.

Execution Time

Total time that a span is considered ‘active’ (not waiting for a child span to complete), scaled according to the number of concurrent active spans.
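The relationship between traces, spans, root spans, and service entry spans can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the Kloudfuse data model: the Span class, its field names, and the sample trace are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical fields chosen for illustration only.
    span_id: str
    parent_id: Optional[str]  # None for the trace root span
    service: str
    name: str


def root_span(spans):
    """Return the span with no parent: the entry point of the whole trace."""
    return next(s for s in spans if s.parent_id is None)


def is_service_entry_span(span, spans_by_id):
    """A span enters a service when it has no parent or its parent is in another service."""
    parent = spans_by_id.get(span.parent_id)
    return parent is None or parent.service != span.service


# One trace made of four spans across three services.
trace = [
    Span("a", None, "web-store", "GET /checkout"),
    Span("b", "a", "web-store", "render_cart"),
    Span("c", "a", "auth-service", "verify_token"),
    Span("d", "c", "payment-db", "SELECT orders"),
]
spans_by_id = {s.span_id: s for s in trace}
entry_spans = [s.span_id for s in trace if is_service_entry_span(s, spans_by_id)]
```

Here span "a" is both the trace root span and the entry span for web-store, while "b" is not an entry span because its parent belongs to the same service.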

 

Sending traces to Kloudfuse data plane:

Please refer to https://docs.datadoghq.com/tracing/trace_collection/ if you are using the Kloudfuse agent.

Please refer to Traces using Otel Collector if you are using the otel collector.

In the OpenTelemetry Collector's configuration, set the otlphttp section as follows:

otlphttp:
  tls:
    insecure: true
  traces_endpoint: http://<kloudfuse ingress IP>/ingester/otlp/traces

If configuring for an https endpoint:

otlphttp:
  tls:
    insecure_skip_verify: true
  traces_endpoint: https://<kloudfuse ingress IP>/ingester/otlp/traces
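If you generate collector configuration from a script, the two otlphttp variants above can be produced by one small helper. The following is a minimal sketch: the function name and the sample address are hypothetical, and it assumes the HTTPS variant uses an https:// URL.

```python
def otlphttp_exporter_config(ingress: str, use_https: bool) -> dict:
    """Build the otlphttp settings shown above for a given ingress address."""
    scheme = "https" if use_https else "http"
    # Plain HTTP sets `insecure`; HTTPS here skips certificate verification.
    tls = {"insecure_skip_verify": True} if use_https else {"insecure": True}
    return {
        "otlphttp": {
            "tls": tls,
            "traces_endpoint": f"{scheme}://{ingress}/ingester/otlp/traces",
        }
    }


# 203.0.113.10 is a placeholder address, not a real Kloudfuse ingress.
http_config = otlphttp_exporter_config("203.0.113.10", use_https=False)
https_config = otlphttp_exporter_config("203.0.113.10", use_https=True)
```

The resulting dictionary mirrors the YAML structure, so it can be serialized with whichever YAML library your tooling already uses.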

If you are using the elastic agent, set the host under apm-server as follows:

apm-server:
  host: "https://<kloudfuse ingress IP>/ingester/"

Services Observability:

APM

Services List:

 

After instrumenting your application, the Services List is your main landing page for APM data.

The services list is a high-level view of all services reporting from your infrastructure. Broadly, a service groups together endpoints, queries, or jobs for the purposes of scaling instances. Some examples:

  • A group of URL endpoints may be grouped together under an API service.

  • A group of DB queries may be grouped together under one database service.

  • A group of periodic jobs may be configured in the crond service.

The screenshot below shows a distributed microservice system for an e-commerce site builder. The web-store, ad-server, payment-db, and auth-service are all represented as services in APM.

 

Filtering the service list

Filter the services list by:

  • Environment

  • Kubernetes tags (cluster, namespace, etc.)

  • Cloud tags (region, availability zone, etc.)

  • Service type, version, and language

All services can be found in the Service List and visually represented on the Service Map. Each service has its own Service page where trace metrics like throughput, latency, and error rates can be viewed and inspected. Use these metrics to create dashboard widgets, create alerts, and see the performance of every resource such as a web endpoint or database query belonging to the service.

Service types

Every service monitored by your application is associated with a type. Kloudfuse automatically determines this type based on the span.type attribute attached to your spans. The type specifies the name of the application or framework that the Kloudfuse Agent is integrating with.

For example, if you use the official Flask Integration, the Type is set to “Web”. If you are monitoring a custom application, the Type appears as “Custom”.

The type of the service can be one of:

  • Cache

  • Custom

  • DB

  • Http

  • Queue

  • Web

Some integrations alias to types. For example, Postgres, MySQL, and Cassandra map to the type “DB”. Redis and Memcache integrations map to the type “Cache”.
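The aliasing described above amounts to a lookup with a fallback. The sketch below only illustrates that idea with the integrations named in this section; the table and function are hypothetical, and Kloudfuse itself derives the type from the span.type attribute.

```python
# Illustrative mapping built from the examples in this section only.
INTEGRATION_TYPES = {
    "flask": "Web",
    "postgres": "DB",
    "mysql": "DB",
    "cassandra": "DB",
    "redis": "Cache",
    "memcache": "Cache",
}


def service_type(integration: str) -> str:
    """Return the service type for an integration, or "Custom" when it is unknown."""
    return INTEGRATION_TYPES.get(integration.lower(), "Custom")


types = [service_type(name) for name in ("Postgres", "Redis", "Flask", "my-batch-job")]
```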

Service columns

The service list contains the following columns:

  • Requests: Number of requests traced per second

  • Median/p75/p90/p95/p99/Max Latency: Median/p75/p90/p95/p99/Max latency of your traced requests

  • Error Rate: Number of traced requests per second that ended with an error

  • Apdex: Apdex score of the service. Learn more about Apdex.
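For reference, the standard Apdex formula counts requests at or below a target threshold T as satisfied, requests between T and 4T as tolerating, and the rest as frustrated. The sketch below applies that general definition; the threshold Kloudfuse uses is not stated here, so the 500 ms value and the sample latencies are assumptions.

```python
def apdex(latencies_ms, threshold_ms):
    """Standard Apdex: (satisfied + tolerating / 2) / total requests."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    satisfied = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    tolerating = sum(
        1 for latency in latencies_ms if threshold_ms < latency <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


# Three satisfied requests, one tolerating (800 ms), one frustrated (2500 ms).
score = apdex([120, 300, 450, 800, 2500], threshold_ms=500)
```

With these inputs the score is (3 + 0.5) / 5 = 0.7.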

Overview

Selecting a service on the services page leads you to the detailed service page. A service is a set of processes that do the same job, for example a web framework or database (read more about how services are defined in Getting Started with APM).

On this page:

  • Out-of-the-box graphs

  • Resources associated with this service

  • Additional tabs

    • Deployments, Error Tracking, Traces, and more

Out-of-the-box graphs

Kloudfuse provides out-of-the-box graphs for any given Service:

  • Requests - Choose to display:

    • The Total amount of requests and errors

    • The amount of Requests per second

  • Latency - Choose to display:

    • The Median/p75/p90/p95/p99/Max latency of your traced requests

    • The Latency distribution

    • The Apdex score for web services; learn more about Apdex

  • Error - Choose to display:

    • The Total amount of errors

    • The amount of Errors per second

    • The % Error Rate

  • Dependency Map:

    • The Dependency Map showing upstream and downstream services.

  • Sub-services: When there are multiple services involved, a fourth graph (in the same toggle option as the Dependency Map) breaks down the Total time spent, % of time spent, and Avg time per request of your service by service or type.

    This represents the total, relative, and average time that traces spend in the downstream services called from the current service, grouped by service or type.

    Note: For services like Postgres or Redis, which are “final” operations that do not call other services, there is no sub-services graph.
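The three sub-services measures can be illustrated with a small aggregation. This is a sketch under stated assumptions, not the Kloudfuse implementation: it sums raw downstream span durations per service, and the function name and sample values are hypothetical.

```python
from collections import defaultdict


def time_by_service(downstream_spans, request_count):
    """downstream_spans: iterable of (service, duration_ms) pairs for one service's traces."""
    totals = defaultdict(float)
    for service, duration_ms in downstream_spans:
        totals[service] += duration_ms
    grand_total = sum(totals.values())
    return {
        service: {
            "total_ms": total,                             # Total time spent
            "pct_of_time": 100.0 * total / grand_total,    # % of time spent
            "avg_ms_per_request": total / request_count,   # Avg time per request
        }
        for service, total in totals.items()
    }


# Two requests to the current service produced three downstream spans.
breakdown = time_by_service(
    [("auth-service", 40.0), ("payment-db", 25.0), ("payment-db", 35.0)],
    request_count=2,
)
```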

Resources

See Requests, Latency, and Error graphs broken down by resource to identify problematic resources. Resources are particular actions for your services (typically individual endpoints or queries). Read more in Getting Started with APM.

Below, there’s a list of resources associated with your service. Sort the resources for this service by requests, latency, errors, and time, to identify areas of high traffic or potential trouble. Note that these metric columns are configurable (see image below).

Click on a resource to open a side panel that displays the resource’s out-of-the-box graphs (requests, errors, and latency), a resource dependency map, and a span summary table, which toggles between upstream and downstream spans.

Use keyboard navigation keys to toggle between resources on the Resources list and compare resources in a service. To view the full resource page, click Open Full Page.

Columns

Choose what to display in your resources list:

  • Requests: Absolute amount of requests traced

  • Requests per second: Absolute amount of requests traced per second

  • Median/p75/p90/p95/p99/Max Latency: The Median/p75/p90/p95/p99/Max latency of your traced requests

  • Errors: Absolute amount of errors traced

  • Errors per second: Absolute amount of errors traced per second

  • Error Rate: Percent of spans traced that were errors
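To make the column definitions concrete, the sketch below derives them from a list of traced spans for one resource. The input shape, the function names, and the nearest-rank percentile method are assumptions made for illustration; the product may compute percentiles differently.

```python
import math


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (one of several common definitions)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def resource_columns(spans, window_seconds):
    """spans: list of (duration_ms, is_error) pairs observed during the time window."""
    latencies = sorted(duration for duration, _ in spans)
    errors = sum(1 for _, is_error in spans if is_error)
    return {
        "requests": len(spans),
        "requests_per_second": len(spans) / window_seconds,
        "errors": errors,
        "errors_per_second": errors / window_seconds,
        "error_rate_pct": 100.0 * errors / len(spans),
        "median_ms": nearest_rank(latencies, 50),
        "p95_ms": nearest_rank(latencies, 95),
        "max_ms": latencies[-1],
    }


# Four spans seen in a two-second window, one of which ended with an error.
columns = resource_columns(
    [(12.0, False), (15.0, False), (20.0, True), (250.0, False)],
    window_seconds=2,
)
```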

Traces

View the list of traces associated with the service in the traces tab, which is already filtered on your service, environment, and operation name. Drill down to problematic spans using core facets such as status, resource, and error type. For more information, click a span to view a flame graph of its trace and more details.

Resource page

A resource is a particular action for a given service (typically an individual endpoint or query). Read more about resources in Getting Started with APM. For each resource, APM automatically generates a dashboard page covering:

  • Key health metrics

  • Monitor status for all monitors associated with this service

  • List and metrics for all resources associated with this service

Out-of-the-box graphs

Kloudfuse provides out-of-the-box graphs for any given resource:

  • Requests - Choose to display:

    • The Total amount of requests

    • The amount of Requests per second

  • Latency - Choose to display:

    • The Median/p75/p90/p95/p99/Max latency of your traced requests

  • Error - Choose to display:

    • The Total amount of errors

    • The amount of Errors per second

    • The % Error Rate

  • Sub-Services: When there are multiple services involved, a fourth graph is available that breaks down the Total time spent, % of time spent, and Avg time per request of your service by service or type.

    This represents the total, relative, and average time that traces spend in the other services called from the current service, grouped by service or type.

    Note: For services like Postgres or Redis, which are “final” operations that do not call other services, there is no sub-services graph.

Dependency Map

You can also view a map of all of a resource’s upstream and downstream service dependencies. With the Dependency Map, you can quickly see the flow of services with spans that go through the specific resource (such as endpoints or database queries) end-to-end.

Hover over a node to view metrics of each service including requests/second, error rate, and average latency.

Span summary

For a given resource, Kloudfuse APM provides you a span analysis breakdown of all matching traces:

The displayed metrics represent, per span:

Avg Spans/trace

Average number of occurrences of the span, for traces including the current resource, where the span is present at least once.

% of Traces

Percentage of traces including the current resource where the span is present at least once.

Avg Duration

Average duration of the span, for traces including the current resource, where the span is present at least once.

Avg % Exec Time

Average ratio of execution time for which the span was active, for traces including the current resource, where the span is present at least once.
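The first three span summary metrics can be sketched as follows. This is an illustration under assumptions, not the product's implementation: each trace is reduced to a list of (span name, duration) pairs, the average duration is taken over every occurrence of the span, and all names are hypothetical.

```python
def span_summary(traces, span_name):
    """traces: list of traces, each a list of (span_name, duration_ms) pairs."""
    durations_per_trace = [
        [duration for name, duration in trace if name == span_name]
        for trace in traces
    ]
    # Keep only the traces where the span is present at least once.
    present = [durations for durations in durations_per_trace if durations]
    if not present:
        return None
    occurrences = sum(len(durations) for durations in present)
    return {
        "avg_spans_per_trace": occurrences / len(present),
        "pct_of_traces": 100.0 * len(present) / len(traces),
        "avg_duration_ms": sum(sum(durations) for durations in present) / occurrences,
    }


traces = [
    [("GET /checkout", 120.0), ("SELECT orders", 10.0), ("SELECT orders", 30.0)],
    [("GET /checkout", 80.0), ("SELECT orders", 20.0)],
    [("GET /checkout", 50.0)],
    [("GET /checkout", 60.0)],
]
summary = span_summary(traces, "SELECT orders")
```

In this sample the query span appears three times across two of the four traces, so it averages 1.5 spans per trace, is present in 50% of traces, and has an average duration of 20 ms.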

Note: A span is considered active when it’s not waiting for a child span to complete, scaled according to the number of concurrent active spans. The active spans at a given time, for a given trace, are all the currently executing leaf spans (in other words, spans without children).
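One way to read this definition is as a sweep over the trace timeline: in each time slice, the spans that are running and have no running child share the slice equally. The sketch below implements that reading; it is an interpretation of the note, and the input shape and names are hypothetical.

```python
from collections import defaultdict


def execution_time(spans):
    """spans: dict of span_id -> (parent_id, start_ms, end_ms). Returns span_id -> execution time."""
    boundaries = sorted({t for _, start, end in spans.values() for t in (start, end)})
    exec_time = defaultdict(float)
    for lo, hi in zip(boundaries, boundaries[1:]):
        # Spans that cover the whole slice [lo, hi].
        running = {sid for sid, (_, start, end) in spans.items() if start <= lo and hi <= end}
        # A span is waiting while any of its children is running.
        waiting = {spans[sid][0] for sid in running}
        active = running - waiting
        for sid in active:
            # Concurrent active spans split the slice between them.
            exec_time[sid] += (hi - lo) / len(active)
    return dict(exec_time)


# A root span with two overlapping children.
times = execution_time({
    "root": (None, 0, 100),
    "db": ("root", 10, 50),
    "cache": ("root", 30, 70),
})
```

Here the root is active for 40 ms (before and after its children), and the two children receive 30 ms each because they split the 20 ms during which they overlap. The execution times add up to the 100 ms trace duration.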

Traces

View the list of traces associated with this resource in the Trace search modal, already filtered on your environment, service, operation, and resource name.

Trace View

View an individual trace to see all of its spans and associated metadata. Each trace can be viewed either as a flame graph or as a list (grouped by service or host).

Calculate the breakdown of execution time and adjust the color scheme by either service or host.

To get a closer look at the flame graph, zoom in by scrolling.

If you are analyzing a trace that reports an error, the error gets a dedicated display when its span follows the special-meaning attribute rules below. When submitting your traces, you can add attributes to the meta parameter.

Some attributes have special meanings that lead to a dedicated display or specific behavior in Kloudfuse:

ATTRIBUTE

DESCRIPTION

sql.query

Allows specific SQL query formatting and display in Kloudfuse’s UI.

error.msg

Allows dedicated display for error message.

error.type

Allows dedicated display for error types. Available types include, for instance, in Python ValueError or Exception and in Java ClassNotFoundException or NullPointerException.

error.stack

Allows a better display of the stack trace of an exception in Kloudfuse's UI (red boxes, etc.).
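As an illustration, the error attributes above could be filled in from a caught Python exception like this. The helper name is hypothetical, and how the resulting dictionary is attached to a span depends on the tracing library you use.

```python
import traceback


def error_meta(exc: BaseException) -> dict:
    """Build the special-meaning error attributes from an exception."""
    return {
        "error.msg": str(exc),
        "error.type": type(exc).__name__,
        "error.stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


try:
    int("not-a-number")
except ValueError as exc:
    meta = error_meta(exc)
```

The meta dictionary now holds the message, the exception class name (ValueError), and the formatted stack trace under the three attribute keys.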

Request Flow Map

Request flow maps combine two key features of Kloudfuse APM: the service map and live exploring, to help you understand and track request paths through your stack. Quickly identify noisy services and choke points, or how many database calls are generated by a request to a specific endpoint.

No additional configuration is required to use these flow maps, and they are powered by your ingested spans. Scope your traces to any combination of tags and generate a dynamic map that represents the flow of requests between every service. The map is automatically generated based on your search criteria, and will regenerate live after any changes.

  • Hover over an edge that connects two services to see metrics for requests, errors, and latency for requests between those two services that match the query parameters.

  • The highest throughput connections are highlighted to show the most common path.

  • The current request flow map is a great way to generate a live architecture diagram, or one scoped to a specific user flow.
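Conceptually, a request flow map is an aggregation of spans into edges between services. The sketch below shows one way to derive such edges from parent and child spans; the span dictionary shape and the function name are assumptions, and this is not how Kloudfuse builds the map internally.

```python
from collections import defaultdict


def flow_edges(spans):
    """spans: list of dicts with id, parent_id, service, duration_ms, and error keys."""
    by_id = {span["id"]: span for span in spans}
    edges = defaultdict(lambda: {"requests": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent is None or parent["service"] == span["service"]:
            continue  # only calls that cross a service boundary form an edge
        edge = edges[(parent["service"], span["service"])]
        edge["requests"] += 1
        edge["errors"] += int(span["error"])
        edge["total_ms"] += span["duration_ms"]
    return dict(edges)


spans = [
    {"id": 1, "parent_id": None, "service": "web-store", "duration_ms": 120.0, "error": False},
    {"id": 2, "parent_id": 1, "service": "auth-service", "duration_ms": 40.0, "error": False},
    {"id": 3, "parent_id": 1, "service": "payment-db", "duration_ms": 25.0, "error": False},
    {"id": 4, "parent_id": 1, "service": "payment-db", "duration_ms": 35.0, "error": True},
]
edges = flow_edges(spans)
# The edge with the most requests corresponds to the highlighted, most common path.
busiest = max(edges, key=lambda edge: edges[edge]["requests"])
```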

 

Trace metrics capture request counts, error counts, and latency measures. They are calculated based on 100% of the application's traffic, regardless of any trace ingestion sampling configuration. Use these metrics to spot potential errors on a service or a resource, and to create dashboards, monitors, and SLOs with full visibility into your application's traffic.

Note: If your applications and services are instrumented with OpenTelemetry libraries and you set up sampling at the SDK level and/or at the collector level, APM metrics are calculated based on the sampled set of data.

Trace metrics

Trace metrics are generated for service entry spans and certain operations depending on integration language. For example, the Django integration produces trace metrics from spans that represent various operations (1 root span for the Django request, 1 for each middleware, and 1 for the view).

Trace metrics are published under the following names, grouped by level:

Trace execution time histograms

  • trace_execution_time_ms_bucket

  • trace_execution_time_ms_count

  • trace_execution_time_ms_sum

  • trace_span_count

 

Trace edge level metrics

  • trace_edge_error_total

  • trace_edge_spans_total

  • trace_edge_latency_ms

  • trace_edge_latency_ms_bucket

  • trace_edge_latency_ms_count

  • trace_edge_latency_ms_sum

 

Span level metrics

  • span_total

  • span_error_total

  • span_latency_ms_bucket

  • span_latency_ms_sum

  • span_latency_ms_count

 

Trace metrics can carry the following tags: env, version, resource, parent_service_name, parent_span_name, service_name, span_name, and any Agent tags (including the host and Kubernetes tags).

Note: Tags set on spans do not count and are not available as tags on your trace metrics.
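The _sum, _count, and _total suffixes suggest Prometheus-style histograms and counters. Assuming that convention holds (it is not stated in this section), an average latency and an error ratio can be derived from the span level metrics as sketched below. The function names and sample values are hypothetical.

```python
def average_latency_ms(latency_sum_ms: float, latency_count: float) -> float:
    """Average latency from span_latency_ms_sum and span_latency_ms_count."""
    return latency_sum_ms / latency_count if latency_count else 0.0


def error_ratio(error_total: float, span_total: float) -> float:
    """Fraction of spans that ended in error, from span_error_total and span_total."""
    return error_total / span_total if span_total else 0.0


# Hypothetical values read for one service_name over the same time window.
avg_ms = average_latency_ms(latency_sum_ms=12500.0, latency_count=500)
ratio = error_ratio(error_total=5, span_total=500)
```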