APM Service Identification and Metrics

APM Service Identification

Releases 2.6.0 and later introduce a configurable Service Identity, which logically separates services based on multiple configurable labels. For example, “Service A” running in the “production” environment can now be distinguished from “Service A” running in the “staging” environment by including “environment” in the Service Identity Labels configuration.

The default configuration identifies services by the following pre-defined labels: "availability_zone", "cloud_account_id", "kube_cluster_name", "kube_namespace", "project", "region", "kf_platform", "service_name".

Kloudfuse users should carefully plan the Service Identity configuration, as it defines the granularity at which service-level metrics are tracked in the Kloudfuse system.
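
The effect of the label set on granularity can be illustrated with a minimal sketch. This is not Kloudfuse code: the function service_identity, the sample span labels, and the choice to treat a missing label as an empty string are all assumptions made for illustration. Only the default label names come from the configuration described above.

```python
# Illustrative sketch only; not the Kloudfuse implementation.
# The label names are the documented defaults; everything else is assumed.
DEFAULT_IDENTITY_LABELS = (
    "availability_zone", "cloud_account_id", "kube_cluster_name",
    "kube_namespace", "project", "region", "kf_platform", "service_name",
)


def service_identity(span_labels, identity_labels=DEFAULT_IDENTITY_LABELS):
    """Build a hashable identity from the configured labels.

    Labels that are absent from the span are treated as empty strings
    (an assumption of this sketch).
    """
    return tuple((name, span_labels.get(name, "")) for name in identity_labels)


# Two hypothetical spans that differ only in "environment".
prod = {"service_name": "Service A", "environment": "production", "region": "us-west-2"}
stage = {"service_name": "Service A", "environment": "staging", "region": "us-west-2"}

# With the default labels, both spans map to the same service identity.
same_by_default = service_identity(prod) == service_identity(stage)

# Adding "environment" to the label set separates them.
labels_with_env = DEFAULT_IDENTITY_LABELS + ("environment",)
same_with_env = (
    service_identity(prod, labels_with_env) == service_identity(stage, labels_with_env)
)
```

In this sketch, every label added to the identity can split one service into several, which is why the label set determines how many distinct service-level metric series are tracked.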

APM Service Identification (Special Case: Database Service)

Database services are derived from the client spans emitted by services making database calls. Database services inherit the service identification labels from the calling service. One important difference is that the Database Service Name is constructed using attributes of the calling service’s client span.

The Database Service Name is constructed as follows:

  • For OTLP

    • {kf_db_system} + "://" + {kf_server_address} + "/" + {kf_db_name}

    • Where the kf* attributes are computed as follows:

      • “db.system” → kf_db_system

      • “server.address” or “network.peer.address” → kf_server_address

      • “db.name” → kf_db_name

  • For Elastic APM

    • {kf_db_system} + "://" + {kf_server_address} + "/" + {kf_db_name}

    • Where the kf* attributes are computed as follows:

      • “span.subtype” (part before the “.” delimiter) → kf_db_system

      • "span.destination.service.resource" or "span.context.destination.service.resource" → kf_server_address
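
The naming rules above can be sketched as follows. This is a hedged illustration, not the Kloudfuse implementation: the function names, the preference for the first listed attribute when both alternatives are present, and the use of an empty string for a missing attribute are assumptions. The source of kf_db_name for Elastic APM is not listed above, so the sketch takes it as an explicit argument.

```python
# Illustrative sketch only; function names and fallback behaviour are assumed.

def first_present(attrs, *keys):
    """Return the first non-empty attribute among keys, or "" if none is set."""
    for key in keys:
        value = attrs.get(key)
        if value:
            return value
    return ""


def otlp_db_service_name(attrs):
    """Build the database service name from OTLP client-span attributes."""
    kf_db_system = first_present(attrs, "db.system")
    kf_server_address = first_present(attrs, "server.address", "network.peer.address")
    kf_db_name = first_present(attrs, "db.name")
    return f"{kf_db_system}://{kf_server_address}/{kf_db_name}"


def elastic_db_service_name(attrs, kf_db_name=""):
    """Build the database service name from Elastic APM span fields.

    kf_db_name is passed in because its source field is not specified
    in the text above.
    """
    # Keep only the part of span.subtype before the "." delimiter.
    kf_db_system = first_present(attrs, "span.subtype").split(".", 1)[0]
    kf_server_address = first_present(
        attrs,
        "span.destination.service.resource",
        "span.context.destination.service.resource",
    )
    return f"{kf_db_system}://{kf_server_address}/{kf_db_name}"


# Hypothetical client spans.
otlp_name = otlp_db_service_name(
    {"db.system": "postgresql", "server.address": "db.internal", "db.name": "orders"}
)
elastic_name = elastic_db_service_name(
    {"span.subtype": "mysql.query", "span.destination.service.resource": "db.internal:3306"},
    kf_db_name="orders",
)
```

With these hypothetical inputs, otlp_name is "postgresql://db.internal/orders" and elastic_name is "mysql://db.internal:3306/orders".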

APM Service Metrics

The Kloudfuse APM solution provides multiple features that are derived from span data. Some of these features, such as Service Map and Service List, rely on metrics that the Kloudfuse system generates internally. The image below shows an overview of the pipeline that generates these metrics from span data.

image-20240612-174312.png

 

The metrics internally generated by the Kloudfuse system are:

  • Edge Latency Metrics (edge_latency_*) - These metrics capture RED (Rate, Errors, Duration) metrics for Parent → Child edges based on span data. The edge metrics have a fixed set of labels determined by the Kloudfuse team, and this set is subject to change in future releases.

    • edge_latency_count - represents the cumulative count of edges received (including dangling edges, where either the parent or the child is missing). This metric can be used to construct requests-per-second and related queries.

    • edge_latency_sum - represents the cumulative sum of the duration/latency data received for the processed edges. This metric, combined with edge_latency_count, can be used to construct Average-Duration-per-second and related queries.

    • In addition to the above, three other metrics capture histogram buckets per time period (edge_latency_bucket, edge_latency_min, edge_latency_max). These metrics are used to calculate the P-* duration/latency (e.g. P99) and the Min/Max values for specified time periods.

  • Node Latency Metrics (service_latency_*) - The Node metrics are similar to the edge metrics, but are derived directly from individual server spans. In other words, these metrics are emitted without having to establish parent-child relationships in the span data. The Node metrics are also emitted at the cardinality of Services: the aggregation is available at the granularity of a Service Identity, and not at a more granular level (e.g. there is no span-name granularity).
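
As a minimal sketch of how the cumulative edge_latency_count and edge_latency_sum metrics can be combined, the function below derives a request rate and an average duration from two samples of each counter taken at the start and end of a time window. The function name, the sample values, and the assumption that the counters did not reset within the window are illustrative; the unit of the duration depends on how the metric is emitted and is not specified here.

```python
# Illustrative sketch only; assumes no counter reset within the window.

def rate_and_average(count_start, count_end, sum_start, sum_end, window_seconds):
    """Derive requests per second and average duration from cumulative counters.

    count_* are samples of a cumulative count (e.g. edge_latency_count);
    sum_* are samples of a cumulative duration sum (e.g. edge_latency_sum).
    """
    requests = count_end - count_start          # edges observed in the window
    total_duration = sum_end - sum_start        # duration accumulated in the window
    requests_per_second = requests / window_seconds
    # Avoid division by zero when no edges were observed.
    average_duration = total_duration / requests if requests else 0.0
    return requests_per_second, average_duration


# Hypothetical samples taken 60 seconds apart.
rps, avg = rate_and_average(
    count_start=1000, count_end=1600,
    sum_start=250.0, sum_end=400.0,
    window_seconds=60,
)
```

Here 600 edges over 60 seconds give 10 requests per second, and 150 units of accumulated duration over 600 edges give an average of 0.25 per edge. The same arithmetic would apply to the corresponding service_latency_* counters, assuming they follow the same cumulative semantics.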