Kloudfuse Control Plane

The Kloudfuse Control Plane displays information about the current Kloudfuse (Kfuse) cluster instance, allowing you to monitor the cluster, its nodes, and the various components of the system. Currently, the control plane is informational only; it lets you monitor various metrics of the system.

To access the control plane, click the Control Plane banner on the main UI page.

 

Kloudfuse Cluster Overview

This section shows the following information about the Kfuse cluster instance:

  • Cluster information - this table shows the Kfuse software version, the namespace in which the stack is deployed, and the Helm chart deployment revision.

  • Cluster storage usage - total storage used by the cluster datastore across the 4 main streams (Metrics, Events, Logs and Traces). This captures only the persistent data on the datastore; it does not include ephemeral or temporary storage such as message queue storage. The value displayed covers the lifetime of the cluster and does not depend on the time range selected in the time picker drop-down menu.

  • Nodes - this table has information about node hostnames, instance type and the state of the nodes.

  • Services - this table lists all the services deployed as part of the Kfuse cluster installation, and their state.

  • Service Status - an aggregation panel that looks at the status of all services and displays one of the following:

    • OK - if all the pods are up and running.

    • DEGRADED - if one or more pods (but not all) are down.

    • DOWN - if all pods are down.

  • Pods in Failed State - if the Service Status panel shows the DEGRADED state, refer to this table to see which pods are not running.
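The Service Status aggregation described above can be sketched as follows. This is only an illustration of the OK / DEGRADED / DOWN rule, not the panel's actual implementation; the function names and the `pod_running` mapping (pod name to a running flag) are hypothetical.

```python
from typing import List, Mapping


def service_status(pod_running: Mapping[str, bool]) -> str:
    """Aggregate per-pod running flags into OK, DEGRADED, or DOWN.

    `pod_running` is a hypothetical input: pod name -> True if the pod is running.
    """
    if not pod_running:
        # Assumption: an empty pod list is treated as an error, not as a status.
        raise ValueError("no pods to aggregate")
    running = sum(1 for up in pod_running.values() if up)
    if running == len(pod_running):
        return "OK"
    if running == 0:
        # Checked before DEGRADED, because "all pods down" also satisfies
        # "one or more pods down".
        return "DOWN"
    return "DEGRADED"


def failed_pods(pod_running: Mapping[str, bool]) -> List[str]:
    """Return the names of pods that are not running, sorted for stable display."""
    return sorted(name for name, up in pod_running.items() if not up)
```

For example, with hypothetical pods `{"ingester-0": True, "query-0": False}` the status is DEGRADED and `failed_pods` returns `["query-0"]`, which mirrors how the Pods in Failed State table complements the Service Status panel.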

Monitored Cluster

This table lists the clusters that are sending data to the Kfuse cluster using the Datadog agent. Currently, this table only monitors data received from the Datadog agent. Future versions will extend this table to display all agent types supported by the Kfuse stack.

Stream Overview

This section has stat panels that display information for each of the 4 streams: Metrics, Events, Logs and Traces.

  • Total space used - The first row has total space used per stream on the backend datastore.

Note that total space used is the value accumulated since the cluster was installed, so it does not change if you pick a different time range in the time picker drop-down menu.

  • Space per sample - This is an approximation of how much space is used per sample (for instance, metric sample or log line) for each stream. This value is expressed in bytes. A lower value in these panels directly corresponds to a lower cluster storage footprint across all streams.

  • Throughput - An approximation of throughput for each stream. For instance, for the logs stream this is captured as log lines per second.
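As a rough illustration of how the space-per-sample and throughput figures relate to raw counts, the sketch below assumes that space per sample is total bytes divided by the number of samples, and that throughput is samples divided by elapsed seconds. The exact calculation used by the panels is not documented here, so both formulas and the function names are assumptions.

```python
def bytes_per_sample(total_bytes: int, sample_count: int) -> float:
    """Approximate space per sample in bytes (assumed: total bytes / samples)."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return total_bytes / sample_count


def throughput_per_second(sample_count: int, elapsed_seconds: float) -> float:
    """Approximate throughput (assumed: samples / elapsed seconds)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sample_count / elapsed_seconds


# Illustrative numbers only: 1,000,000 bytes stored for 50,000 log lines
# ingested over 100 seconds.
space = bytes_per_sample(1_000_000, 50_000)   # 20.0 bytes per log line
rate = throughput_per_second(50_000, 100.0)   # 500.0 log lines per second
```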

Service Overview

This section has panels and sparkline charts for some of the main services used by the Kfuse stack.

  • Ingester Latency - these panels track the latency of the ingester for each of the 4 main streams. These correlate to ingestion speed.

  • Query Service Latency - Query services are what the UI sends its queries to. These charts indicate how quickly the Kfuse stack can render data on the UI.

  • Kafka Consumer Lag - Kafka is the message queue used by Kfuse. If the lag is growing, ingestion may have stalled somewhere in the pipeline, and the UI will stop showing newer metrics, events, logs, or traces. Consumer lag is plotted across all partitions.

  • Pinot Table Ingestion Rate - Pinot is the backend datastore used by Kfuse. A higher ingestion rate corresponds to faster ingest and higher ingestion throughput.

  • HTTP Error Count - This tracks how many requests to the ingester or query service failed with an error.
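To make the Kafka Consumer Lag panel easier to interpret: for a single partition, consumer lag is the difference between the log end offset and the consumer group's committed offset. The sketch below sums that difference across partitions and flags a strictly growing series. Summing across partitions, the "strictly increasing" test, and the function names are assumptions made for illustration; the panel itself may aggregate differently.

```python
from typing import Mapping, Sequence


def total_consumer_lag(
    end_offsets: Mapping[int, int],
    committed_offsets: Mapping[int, int],
) -> int:
    """Sum (log end offset - committed offset) over all partitions.

    A partition with no committed offset is treated as committed at 0
    (an assumption for this sketch). Negative differences are clamped to 0.
    """
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def lag_is_growing(lag_samples: Sequence[int]) -> bool:
    """Return True if every sample is larger than the previous one."""
    if len(lag_samples) < 2:
        return False
    return all(later > earlier for earlier, later in zip(lag_samples, lag_samples[1:]))


# Illustrative offsets for two partitions: lag is 10 + 50 = 60.
lag = total_consumer_lag({0: 100, 1: 250}, {0: 90, 1: 200})
```

A series such as `[10, 20, 35]` would be flagged as growing, which is the pattern that suggests ingestion has stalled somewhere in the pipeline.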

 

In addition to the main Kloudfuse Overview dashboard, a set of detailed dashboards per stream (Metrics, Events, Logs, Traces), a System dashboard, and dashboards for Pinot and Kafka are available under the Control Plane banner on the main UI page.

 

Detailed Dashboards

  • Metrics - This dashboard has charts for the metrics stream pipeline (Ingester → Kafka → Decoder → Pinot). These charts cover both ingestion and query paths.

  • Events - This dashboard has charts for the events stream pipeline (Ingester → Kafka → Decoder → Pinot). These charts cover both ingestion and query paths.

  • Logs - This dashboard has charts for the logs stream pipeline (Ingester → Kafka → Logs-Parser → Kafka → Decoder → Pinot). These charts cover both ingestion and query paths.

  • Traces - This dashboard has charts for the traces stream pipeline (Ingester → Kafka → Decoder → Pinot). These charts cover both ingestion and query paths.

  • System - This dashboard shows pod CPU usage, disk I/O read/write rates, and read/write IOPS.

  • Pinot - This dashboard has charts for Pinot Ingestion Rate, Freshness Lag, Realtime Table Count, Query QPS, Query Latency, Consuming Latency, Segment Build, Controller CPU metrics, and JVM metrics, as well as charts for Server memory, segment buffer, etc.

  • Kafka - This dashboard has charts regarding various Kafka metrics.

These charts can also be viewed in the Grafana UI, which is accessible from the main Kloudfuse UI page.