Troubleshooting

Networking

Kfuse is unreachable from external host

Symptom

Unable to access Kfuse from the external IP/host or DNS.

curl http://EXTERNAL_IP
curl: (28) Failed to connect to XX.XX.XX.XX port 80 after 129551 ms: Connection timed out
curl https://EXTERNAL_IP --insecure
curl: (28) Failed to connect to XX.XX.XX.XX port 443 after 129551 ms: Connection timed out

Resolution

Ensure that the security group or firewall policy for the Kubernetes cluster, node, and VPC endpoint allows external incoming traffic.
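After changing the firewall rules, a short connect timeout makes it quicker to re-test than waiting for the default timeout shown in the symptom. The following is a minimal sketch, assuming curl is installed; the function name check_kfuse_endpoint and the 5-second timeout are illustrative choices, not part of Kfuse.

```shell
# Probe HTTP and HTTPS on the external address with a short connect timeout.
# check_kfuse_endpoint is a hypothetical helper name, not part of Kfuse.
check_kfuse_endpoint() {
  host="$1"
  for url in "http://$host" "https://$host"; do
    # -s: quiet, -o /dev/null: discard the body, -w: print only the status code.
    if code=$(curl -s -o /dev/null -w '%{http_code}' --connect-timeout 5 --insecure "$url"); then
      echo "$url reachable (HTTP $code)"
    else
      # $? still holds the exit status of curl here (28 means timed out).
      echo "$url unreachable (curl exit $?)"
    fi
  done
}
```

Call it as check_kfuse_endpoint EXTERNAL_IP. A curl exit code of 28 on both ports usually means packets are still being dropped before they reach the cluster.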

 

Packet is getting dropped by ingress-nginx

Symptom

ingress-nginx logs a "client intended to send too large body" error.

2023/03/06 05:38:22 [error] 43#43: *128072996 client intended to send too large body: 1097442 bytes, client: XXXX, server: _, request: "POST /ingester/v1/fluent_bit HTTP/1.1", host: "XXXX"

Resolution

ingress-nginx can be configured to accept a larger request body size. The default is 1m. Upgrade Kfuse with the following section in the custom values file.

ingress-nginx:
  controller:
    config:
      proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable any limit.>
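The byte count in the error message gives a lower bound for proxy-body-size. The sketch below reads ingress-nginx log lines and prints the smallest whole-megabyte value that would have admitted the largest rejected request. The function name min_proxy_body_size is hypothetical, and how much headroom to add on top of the result is a judgment call.

```shell
# Read ingress-nginx log lines on stdin and print the smallest whole-megabyte
# proxy-body-size that would have admitted the largest rejected request.
# min_proxy_body_size is a hypothetical helper name.
min_proxy_body_size() {
  sed -n 's/.*client intended to send too large body: \([0-9][0-9]*\) bytes.*/\1/p' |
    sort -n | tail -n 1 |
    awk '{ printf "%dm\n", int(($1 + 1048575) / 1048576) }'
}
```

Pipe the controller logs into it, for example kubectl logs <INGRESS_NGINX_POD> | min_proxy_body_size, where the pod name is a placeholder for your installation. For the 1097442-byte request in the symptom above it prints 2m.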

 

Pinot

Pinot Server Realtime Pods in Crash Loop Back Off

Symptoms

  • Container logs show the following JFR initialization errors:

    jdk.jfr.internal.dcmd.DCmdException: Could not use /var/pinot/server/data/jfr as repository. Unable to create JFR repository directory using base location (/var/pinot/server/data/jfr)
    Error occurred during initialization of VM
    Failure when starting JFR on_create_vm_2
  • Pinot server realtime disk usage is at 100%.

Resolution

  • In Kfuse version 2.6.5 or earlier

    • The pinot server realtime runs out of disk space if Pinot is unable to move the segments to the offline server. In some cases, the offline servers hit an exception, stop handling new messages, and need to be restarted:

      kubectl rollout restart -n kfuse statefulset pinot-server-offline
    • The persistent disk attached to Pinot Server Realtime needs to be increased. Refer to Troubleshooting | Increasing the existing PVC size.

  • From Kfuse version 2.6.7 onwards, there is no need to resize the pinot server realtime disks. Follow these steps:

    1. Restart pinot-server-offline.

    2. Edit the pinot-server-realtime StatefulSet and either remove the DISK_BALLOON env variable or set it to false.

    3. Wait for pinot server realtime to start up and finish moving segments to the offline servers.

    4. Edit the pinot-server-realtime StatefulSet again and set the DISK_BALLOON env variable back to true.
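The four steps for version 2.6.7 onwards can be sketched with kubectl as shown below. This assumes DISK_BALLOON is set directly as a container env variable on the StatefulSet (so kubectl set env can change it) and that the StatefulSets live in the kfuse namespace, as in the restart command earlier in this section. The function names are hypothetical.

```shell
# Steps 1-2: restart the offline servers and disable the disk balloon.
# Function names are hypothetical; the namespace defaults to kfuse.
pinot_realtime_recover_start() {
  ns="${1:-kfuse}"
  kubectl rollout restart -n "$ns" statefulset pinot-server-offline
  kubectl set env -n "$ns" statefulset/pinot-server-realtime DISK_BALLOON=false
  # Step 3 (first half): wait for the realtime pods to come back up.
  kubectl rollout status -n "$ns" statefulset/pinot-server-realtime
}

# Step 4: run only after segments have finished moving to the offline servers.
pinot_realtime_recover_finish() {
  ns="${1:-kfuse}"
  kubectl set env -n "$ns" statefulset/pinot-server-realtime DISK_BALLOON=true
}
```

The split into two functions is deliberate: the sketch does not try to detect when segment movement has completed, so the second function should be run manually once disk usage on the realtime servers has dropped.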

DeepStore access issues

Symptoms

  • Pinot-related jobs are stuck in crash loop back-off (e.g., kfuse-set-tag-hook, pinot-metrics-table-creation).

  • pinot-controller logs a deep store access-related exception.

    • On AWS S3, the exception has the following format:

      Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: MAYE68P6SYZMTTMP, Extended Request ID: L7mSpEzHz9gdxZQ8iNM00jKtoXYhkNrUzYntbbGkpFmUF+tQ8zL+fTpjJRlp2MDLNvhaVYCie/Q=)

Resolution

  • Refer to Configure GCP/AWS/Azure Object Store for Pinot for setting the access for Pinot.

    • On GCP, ensure that the secret has correct access to the cloud storage bucket.

    • On AWS S3, if the node does not have permission to the S3 bucket, then ensure that the access key and secret access key are populated:

      pinot:
        deepStore:
          enabled: true
          type: "s3"
          useSecret: true
          createSecret: true
          dataDir: "s3://[REPLACE BUCKET HERE]/kfuse/controller/data"
          s3:
            region: "YOUR REGION"
            accessKey: "YOUR AWS ACCESS KEY"
            secretKey: "YOUR AWS SECRET KEY"
  • If Pinot has the correct access credentials to the deep store, the configured bucket will contain a directory that matches dataDir.
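On AWS, one way to confirm that the dataDir directory exists is to list it with the AWS CLI. This is a minimal sketch, assuming the aws CLI is installed and configured with credentials that can read the bucket. Note that it uses your local credentials, not the ones Pinot uses, so it only confirms that the directory exists. The function name check_deep_store_dir is hypothetical.

```shell
# List the deep store directory that Pinot creates once its credentials work.
# check_deep_store_dir is a hypothetical helper; pass the same value as dataDir.
check_deep_store_dir() {
  data_dir="${1%/}/"   # normalize to a trailing slash so `ls` lists the contents
  if aws s3 ls "$data_dir" >/dev/null 2>&1; then
    echo "found: $data_dir"
  else
    echo "missing or inaccessible: $data_dir"
    return 1
  fi
}
```

Call it with the dataDir value from the custom values file, for example check_deep_store_dir "s3://[REPLACE BUCKET HERE]/kfuse/controller/data".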



Getting ideal state and external view for segments from pinot-controller

Enable port-forward for pinot-controller by running:

kubectl port-forward pinot-controller-0 9000:9000

Ensure that the pinot-controller-0 pod is running and fully up by running kubectl get pods.

Dump the ideal state and external view for segments by running:

curl "http://localhost:9000/tables/<tableName>/idealstate" | jq > ideal_state.json 2>&1
curl "http://localhost:9000/tables/<tableName>/externalview" | jq > external_state.json 2>&1

If you do not have jq or an equivalent tool installed already, follow the installation instructions from here.

Replace <tableName> with one of the following:

  • Metrics: kf_metrics_REALTIME

  • Events: kf_events_REALTIME

  • Logs: kf_logs_REALTIME

  • Traces: kf_traces_REALTIME

For instance, to get the ideal state and external view for the logs table, copy-paste the following commands:

curl "http://localhost:9000/tables/kf_logs_REALTIME/idealstate" | jq > ideal_state.json 2>&1
curl "http://localhost:9000/tables/kf_logs_REALTIME/externalview" | jq > external_state.json 2>&1
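To collect both views for all four tables in one pass, the commands above can be wrapped in a loop. This is a sketch, assuming the port-forward to pinot-controller-0 is already running and jq is installed; the function name dump_pinot_views and the output file naming are illustrative.

```shell
# Dump ideal state and external view for every Kfuse realtime table.
# Assumes the port-forward to pinot-controller-0 is already running.
# dump_pinot_views is a hypothetical helper name.
dump_pinot_views() {
  for table in kf_metrics_REALTIME kf_events_REALTIME kf_logs_REALTIME kf_traces_REALTIME; do
    for view in idealstate externalview; do
      # -S sorts object keys so the two files for a table can be diffed.
      curl -s "http://localhost:9000/tables/$table/$view" | jq -S . > "${table}_${view}.json"
    done
  done
}
```

Because the keys are sorted, running diff on the two files for a table (for example kf_logs_REALTIME_idealstate.json and kf_logs_REALTIME_externalview.json) generally points at segments whose actual state has not caught up with the desired one.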

Realtime usage is increasing continuously

The pinot-server-realtime persistent volume usage increases continuously if segment movement to the offline servers is disrupted. This has been partially addressed in version 2.6.5. There are two ways to verify the behaviour:

  • Since v2.6.5, an alert fires when the PVC usage is above 40%.

  • Navigate to Kloudfuse Overview → System dashboards and check the PV Used Space panel for the pinot-server-realtime graph.

Remediation

To remediate the situation, restart the pinot realtime and offline servers with the following command:

kubectl rollout restart sts pinot-server-offline pinot-server-realtime

If the PV usage has reached 100% and the servers cannot be restarted gracefully, increase the size of the pinot-realtime PVCs by about 10% to accommodate the increased requirement, then restart the pinot-server offline and realtime StatefulSets.
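The 10% increase can be computed with a small helper. This is a sketch that only handles sizes expressed in Gi and rounds up to a whole Gi; the function name grow_pvc_size is hypothetical.

```shell
# Print a PVC size increased by 10%, rounded up to a whole Gi.
# grow_pvc_size is a hypothetical helper; it only handles sizes given in Gi.
grow_pvc_size() {
  current="${1%Gi}"
  echo "$(( ($current * 110 + 99) / 100 ))Gi"
}
```

For example, grow_pvc_size 100Gi prints 110Gi and grow_pvc_size 25Gi prints 28Gi. The result can then be used as the new size when resizing the pinot-server-realtime PVCs.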

Storage

Increasing the existing PVC size

Note that for Azure, PremiumV2_LRS does not currently support online resizing of PVC. Please follow the steps in Troubleshooting | Resizing PVC on Azure.

 

In certain scenarios, you might need to increase the size of a PVC. You can use the resize_pvc.sh script to do so.

For example, to increase the size of the kafka StatefulSet PVCs to 100Gi in the kfuse namespace:

sh resize_pvc.sh kafka 100Gi kfuse
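After the script finishes, it can be useful to confirm that the requested and provisioned sizes agree. The sketch below prints both for PVCs whose name contains a given pattern; the function name show_pvc_sizes is hypothetical, and matching on the PVC name is an assumption about how the PVCs are named.

```shell
# Show requested vs. provisioned size for PVCs whose name contains a pattern.
# show_pvc_sizes is a hypothetical helper name.
show_pvc_sizes() {
  pattern="$1"
  ns="${2:-kfuse}"
  kubectl get pvc -n "$ns" \
    -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,ACTUAL:.status.capacity.storage |
    awk -v p="$pattern" 'NR == 1 || index($1, p)'
}
```

For the example above, show_pvc_sizes kafka kfuse lists the kafka PVCs. If ACTUAL still shows the old size, the resize may not have completed yet.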

Resizing PVC on Azure

On Azure, the PremiumV2_LRS disk needs to be in unattached state before it can be resized.

Follow these steps to resize PVC on Azure.

  1. Cordon all the nodes

    kubectl cordon <NODE>
  2. Delete the statefulset

    kubectl delete sts <STATEFULSET>
  3. Verify that the corresponding disk is in unattached state in Azure Portal. See example screenshot.

    [Screenshot: disk shown in unattached state in Azure Portal]
  4. Patch all the PVCs with the desired size.

    kubectl patch pvc <PVC> --patch '{"spec": {"resources": {"requests": {"storage": "<SIZE>"}}}}'
  5. Uncordon the nodes

    kubectl uncordon <NODE>
  6. Update custom_values.yaml with the disk size for the statefulset disk.

  7. Run helm upgrade of kfuse with the updated custom_values.yaml. This step is optional now as the script recreates the sts.
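The kubectl steps above can be sketched as two shell functions, split at the point where the disk state has to be checked manually in Azure Portal. This is a sketch under assumptions: the function names are hypothetical, the namespace argument is an addition (the commands above omit it), and the PVC names must be supplied by you. Deleting a StatefulSet leaves its PVCs in place by default.

```shell
# Steps 1-2: cordon every node, then delete the StatefulSet so its disks detach.
# Function names are hypothetical.
azure_detach_disks() {
  sts="$1"
  ns="${2:-kfuse}"
  for node in $(kubectl get nodes -o name); do
    kubectl cordon "${node#node/}"   # strip the "node/" prefix from -o name output
  done
  kubectl delete sts "$sts" -n "$ns"
}

# Steps 4-5: once the portal shows the disks as unattached, patch each PVC
# and uncordon the nodes again.
azure_patch_pvcs() {
  size="$1"
  ns="$2"
  shift 2
  for pvc in "$@"; do
    kubectl patch pvc "$pvc" -n "$ns" \
      --patch "{\"spec\": {\"resources\": {\"requests\": {\"storage\": \"$size\"}}}}"
  done
  for node in $(kubectl get nodes -o name); do
    kubectl uncordon "${node#node/}"
  done
}
```

Usage would look like azure_detach_disks <STATEFULSET> kfuse, followed by azure_patch_pvcs <SIZE> kfuse <PVC> ... after verifying step 3. Note that the second function uncordons every node, including any that were cordoned for other reasons before you started.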

     

Fluent-Bit

Duplicate logs show up in Kfuse stack

Symptoms

You notice duplicate logs with the same timestamp and log event in the Kfuse stack, but the application logs (either on the host or in the container) show no evidence of duplication. This issue happens only when the agent is Fluent-Bit.

Resolution

If you look at Fluent-Bit logs you’ll notice the following error in the logs:

[error] [in_tail] file=<path_to_filename> requires a larger buffer size, lines are too long. Skipping file

This appears to be a known issue with Fluent-Bit. Refer to these two issues in the Fluent-Bit repo here and here. It occurs with the default buffer size for the tail plugin. A workaround is to increase the max buffer size by adding Buffer_Chunk_Size and Buffer_Max_Size to the tail plugin configuration.

[INPUT]
    Name              tail
    Path              <file_path_to_tail>
    Tag               <tag>
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   8M

This configuration is per tail plugin. So if you have multiple tail plugin configurations, then you'll need to add the buffer configuration to every tail plugin.
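To find out which tail inputs need the larger buffers, the affected file paths can be pulled out of the Fluent-Bit logs. This sketch matches the error format shown above; the function name skipped_tail_files is hypothetical.

```shell
# Read Fluent-Bit logs on stdin and print each file that in_tail skipped
# because a line exceeded the buffer. skipped_tail_files is a hypothetical name.
skipped_tail_files() {
  sed -n 's/.*\[in_tail\] file=\(.*\) requires a larger buffer size.*/\1/p' | sort -u
}
```

Pipe the agent logs into it, for example kubectl logs <FLUENT_BIT_POD> | skipped_tail_files, where the pod name is a placeholder. Each path printed belongs to a tail input that needs the buffer settings.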

Additional info

One way to deduce that the duplication was introduced in the Fluent-Bit’s tail plugin is to add a randomly generated number/string as part of the Fluent-Bit record. This will show up as a log facet in the kfuse stack. If the duplicate log lines all have different numbers/strings, then it confirms the theory that the duplication happened in the Fluent-Bit agent. In order to get a randomly generated number/string, add the following filter to your Fluent-Bit config:

[FILTER]
    Name    lua
    Match   *
    Call    append_rand_number
    Code    function append_rand_number(tag, timestamp, record) math.randomseed(os.clock()*100000000000); new_record = record; new_record["rand_id"] = tostring(math.random(1, 1000000000)); return 1, timestamp, new_record end

Datadog Agent

Kube_cluster_name label does not show in Kfuse stack

Symptom

MELT data ingested from Datadog agent is missing the kube_cluster_name label.

Resolution

There is a known issue in Datadog agent cluster name detection that requires the cluster agent to be up. If the agent starts up before the cluster agent, it fails to detect the cluster name. See "If dd-agent starts before cluster agent, cluster name returns empty" (Issue #24406, DataDog/datadog-agent).

 

A workaround is to do a rollout restart of the Datadog agent daemonset.

kubectl rollout restart daemonset datadog-agent
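Since the restart only helps if the cluster agent is already up, it can be worth waiting for it first. This is a minimal sketch; the function name restart_dd_agent is hypothetical, and the cluster agent Deployment name depends on your installation, so it is passed in as an argument rather than assumed.

```shell
# Wait for the cluster agent, then restart the node agents so they can
# detect the cluster name. restart_dd_agent is a hypothetical helper; the
# cluster agent Deployment name depends on your install and must be passed in.
restart_dd_agent() {
  cluster_agent_deploy="$1"
  kubectl rollout status deployment "$cluster_agent_deploy" &&
    kubectl rollout restart daemonset datadog-agent &&
    kubectl rollout status daemonset datadog-agent
}
```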

Access denied while creating Alert / Contact point

Symptom

A non-admin (SSO) user may get a permission error like the following when creating an alert or contact point.

{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}

or

{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}

Resolution

Workaround: Log in as admin to create the contact point or alert, because SSO users are not granted permission to create contact points or alerts manually.

 
