Troubleshooting

Networking

Kfuse is unreachable from external host

Symptom

Unable to access Kfuse from the external IP/host or DNS.

curl http://EXTERNAL_IP curl: (28) Failed to connect to XX.XX.XX.XX port 80 after 129551 ms: Connection timed out curl https://EXTERNAL_IP --insecure curl: (28) Failed to connect to XX.XX.XX.XX port 443 after 129551 ms: Connection timed out

Resolution

Ensure that the security group or firewall policy for the Kubernetes cluster, node, and VPC endpoint allows external incoming traffic.

 

Packet is getting dropped by ingress-nginx

Symptom

ingress-nginx logs client intended to send too large body error.

2023/03/06 05:38:22 [error] 43#43: *128072996 client intended to send too large body: 1097442 bytes, client: XXXX, server: _, request: "POST /ingester/v1/fluent_bit HTTP/1.1", host: "XXXX"

Resolution

ingress-nginx can be configured to accept larger request body size. The default is 1m. Upgrade Kfuse with the following section in the custom values file.

ingress-nginx: controller: config: proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable any limit.>

 

Pinot

Pinot Server Realtime Pods in Crash Loop Back Off

Symptoms

  • Container logs shows the following JFR initialization errors:

  • Pinot server realtime disk usage is at 100%.

Resolution

DeepStore access issues

Symptoms

  • Pinot-related jobs are stuck in crash loop back-off (e.g., kfuse-set-tag-hook, pinot-metrics-table-creation, etc).

  • Pinot-controller logs deep store access-related exception.

    • On AWS S3, the exception has the following format

Resolution

  • Refer to Configure GCP/AWS/Azure Object Store for Pinot for setting the access for Pinot.

    • On GCP, ensure that the secret has correct access to the cloud storage bucket.

    • On AWS S3, if the node does not have permission to the S3 bucket, then ensure that the access key and secret access key is populated

  • If Pinot has the correct access credentials to the deep store, then the configured bucket will have the directory created that matches the dataDir.



Getting ideal state and external view for segments from pinot-controller

Enable port-forward for pinot-controller by running:

Ensure that pinot-controller-0 pod is running and fully up by running kubectl get pods

Dump the ideal state and external view for segments for by running:

If you do not have jq or an equivalent tool installed already, follow the installation instructions from here.

Replace <tableName> with one of the following:

  • Metrics: kf_metrics_REALTIME

  • Events: kf_events_REALTIME

  • Logs: kf_logs_REALTIME

  • Traces: kf_traces_REALTIME

For instance, to get the ideal state and external view for logs table, copy-paste the following commands:

Realtime usage is increasing continuously

The pinot-server-realtime persistent volume usage increases continuously if there’s any disconnect for segment movement. This is something that has been partially addressed in version 2.6.5. There are two ways to verify the behaviour,

  • Since v2.6.5, there is an alert in place for notification if the pvc usage is above 40%

  • You can navigate to Kloudfuse Overview → System dashboards an verify from the PV Used Space panel to see the graph for pinot-server-realtime

Remediation

To remediate the situation it is recommended to restart the pinot-realtime & offline servers with following command.

If you find out that the PV usage has reached 100% and cannot be restarted gracefully, you need to increase the pvc size of pinot-realtime pvcs by 10% or so to accommodate the increased requirement and restart the pinot-server offline & realtime.

Storage

Increasing the existing PVC size

Note that for Azure, PremiumV2_LRS does not currently support online resizing of PVC. Please follow the steps in Troubleshooting | Resizing PVC on Azure

 

In certain scenarios, you might face a requirement to increase the size of pvc. You can use the resize_pvc.sh script for doing it.

Example if you want increase the size of kafka stateful set pvcs to 100GB in kfuse namespace

Resizing PVC on Azure

On Azure, the PremiumV2_LRS disk needs to be in unattached state before it can be resized.

Follow these steps to resize PVC on Azure.

  1. Cordon all the nodes

  2. Delete the statefulset

  3. Verify that the corresponding disk is in unattached state in Azure Portal. See example screenshot.

    Screenshot 2024-04-04 at 9.03.41 AM.png
  4. Patch all the PVC with the desired size.

  5. Uncordon the node

  6. Update custom_values.yaml with the disk size for the statefulset disk.

  7. Run helm upgrade of kfuse with the updated custom_values.yaml

     

Fluent-Bit

Duplicate logs show up in Kfuse stack

Symptoms

You notice that there are duplicate logs with the same timestamp and log event in kfuse stack. But if you check the application logs (either on the host or in the container), there is no evidence of duplication. This issue happens only when the agent is Fluent-Bit.

Resolution

If you look at Fluent-Bit logs you’ll notice the following error in the logs:

This seems like a known issue with Fluent-Bit. Refer to these 2 issues in Fluent-Bit repo here and here. This happens with the default buffer size for the tail plugin. A workaround is to increase the max buffer size by adding Buffer_Chunk_Size and Buffer_Max_Size to the tail plugin configuration.

This configuration is per tail plugin. So if you have multiple tail plugin configurations, then you'll need to add the buffer configuration to every tail plugin.

Additional info

One way to deduce that the duplication was introduced in the Fluent-Bit’s tail plugin is to add a randomly generated number/string as part of the Fluent-Bit record. This will show up as a log facet in the kfuse stack. If the duplicate log lines all have different numbers/strings, then it confirms the theory that the duplication happened in the Fluent-Bit agent. In order to get a randomly generated number/string, add the following filter to your Fluent-Bit config:

Datadog Agent

Kube_cluster_name label does not show in Kfuse stack

Symptom

MELT data ingested from Datadog agent is missing the kube_cluster_name label.

Resolution

There is a known issue in Datadog agent cluster name detection that requires the cluster agent to be up. If the agent starts up before the cluster agent, then it fails to detect the cluster name. See https://github.com/DataDog/datadog-agent/issues/24406 .

 

A workaround is to do a rollout restart of the Datadog agent daemonset.

kubectl rollout restart daemonset datadog-agent

Access denied while creating Alert / Contact point

Symptom

A non admin (SSO user) may get a permission error when creating Alert / Contact point as follows.

{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}

or

{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}

Resolution

Workaround: Login as admin to create contact point / alert as an SSO user isn’t provided permissions to create contact points or alerts manually.