1 Networking
- 1.1 Kfuse is unreachable from external host
  - 1.1.1 Symptom
  - 1.1.2 Resolution
- 1.2 Packet is getting dropped by ingress-nginx
  - 1.2.1 Symptom
  - 1.2.2 Resolution
2 Pinot
- 2.1 Pinot Server Realtime Pods in Crash Loop Back Off
  - 2.1.1 Symptoms
  - 2.1.2 Resolution
- 2.2 DeepStore access issues
  - 2.2.1 Symptoms
  - 2.2.2 Resolution
- 2.3 Getting ideal state and external view for segments from pinot-controller
- 2.4 Realtime usage is increasing continuously
  - 2.4.1 Remediation
3 Storage
- 3.1 Increasing the existing PVC size
  - 3.1.1 Resizing PVC on Azure
4 Fluent-Bit
- 4.1 Duplicate logs show up in Kfuse stack
  - 4.1.1 Symptoms
  - 4.1.2 Resolution
  - 4.1.3 Additional info
5 Datadog Agent
- 5.1 Kube_cluster_name label does not show in Kfuse stack
  - 5.1.1 Symptom
  - 5.1.2 Resolution
6 Access denied while creating Alert / Contact point
- 6.1 Symptom
- 6.2 Resolution

Networking

Kfuse is unreachable from external host

Symptom

Unable to access Kfuse from the external IP/host or DNS.

curl http://EXTERNAL_IP 
curl: (28) Failed to connect to XX.XX.XX.XX port 80 after 129551 ms: Connection timed out

curl https://EXTERNAL_IP --insecure
curl: (28) Failed to connect to XX.XX.XX.XX port 443 after 129551 ms: Connection timed out

Resolution

Ensure that the security group or firewall policy for the Kubernetes cluster, node, and VPC endpoint allows external incoming traffic.
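
To tell a firewall drop apart from other failures, the curl exit code can be mapped to a likely cause. The following is a minimal sketch; the function names `explain_curl_exit` and `probe` are hypothetical helpers, not part of Kfuse, and the 5-second timeout is an arbitrary choice. Exit codes 6, 7, and 28 are standard curl exit codes.

```shell
# Map a curl exit code to a likely cause (hypothetical helper).
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "could not resolve host: check DNS" ;;
    7)  echo "failed to connect: host refused the connection or is unreachable" ;;
    28) echo "timed out: traffic is probably dropped by a firewall or security group" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}

# Probe one URL with a short connect timeout and print the diagnosis.
probe() {
  rc=0
  curl --silent --output /dev/null --insecure --connect-timeout 5 "$1" || rc=$?
  explain_curl_exit "$rc"
}
```

For example, `probe http://EXTERNAL_IP` followed by `probe https://EXTERNAL_IP` checks both ports shown in the symptom. A timeout on both usually points at the security group or firewall rather than at Kfuse itself.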

Packet is getting dropped by ingress-nginx

Symptom

ingress-nginx logs a "client intended to send too large body" error.

2023/03/06 05:38:22 [error] 43#43: *128072996 client intended to send too large body: 1097442 bytes, client: XXXX, server: _, request: "POST /ingester/v1/fluent_bit HTTP/1.1", host: "XXXX"

Resolution

ingress-nginx can be configured to accept a larger request body size. The default is 1m. Upgrade Kfuse with the following section in the custom values file.

ingress-nginx:
  controller:
    config:
      proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable any limit.>
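
To pick a value, it can help to compare the byte count from the error log against a candidate limit. This is a rough sketch; `size_to_bytes` and `body_fits` are hypothetical helpers that only understand plain byte counts and the k/m suffixes.

```shell
# Convert an nginx-style size (plain bytes, or a k/m suffix) to bytes.
size_to_bytes() {
  num=${1%[kKmM]}
  case "$1" in
    *[kK]) echo $((num * 1024)) ;;
    *[mM]) echo $((num * 1024 * 1024)) ;;
    *)     echo "$num" ;;
  esac
}

# Exit 0 when a request body of $1 bytes fits within limit $2.
# A limit of 0 means "no limit", matching the note in the values file.
body_fits() {
  limit=$(size_to_bytes "$2")
  [ "$limit" -eq 0 ] || [ "$1" -le "$limit" ]
}
```

With the log line above, `body_fits 1097442 1m` fails because 1m is 1048576 bytes, while `body_fits 1097442 8m` succeeds.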

Pinot

Pinot Server Realtime Pods in Crash Loop Back Off

Symptoms

- Container logs show the following JFR initialization errors:
  jdk.jfr.internal.dcmd.DCmdException: Could not use /var/pinot/server/data/jfr as repository. Unable to create JFR repository directory using base location (/var/pinot/server/data/jfr)
  Error occurred during initialization of VM
  Failure when starting JFR on_create_vm_2
- Pinot server realtime disk usage is at 100%.

Resolution

In Kfuse version 2.6.5 or earlier:
- Pinot server realtime runs out of disk space if Pinot is unable to move segments to the offline servers. In some cases, the offline servers hit an exception, stop handling new messages, and need to be restarted:
  kubectl rollout restart -n kfuse statefulset pinot-server-offline
- The persistent disk attached to Pinot Server Realtime needs to be increased. Refer to Troubleshooting | Increasing the existing PVC size.
From Kfuse version 2.6.7 onwards, there is no need to resize the pinot server realtime disks. Follow these steps:
1. Restart pinot-server-offline.
2. Edit the pinot-server-realtime sts and either remove the DISK_BALLOON env variable or set it to false.
3. Wait for pinot server realtime to start up and finish moving segments to the offline servers.
4. Edit the pinot-server-realtime sts and set the DISK_BALLOON env variable back to true.
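
The steps for 2.6.7 onwards can be sketched with `kubectl set env` instead of editing the sts by hand. This is a sketch under assumptions: it assumes Kfuse runs in the kfuse namespace and that DISK_BALLOON is set directly on the statefulset's pod template; the function names and the `KUBECTL`/`NS` variables are hypothetical.

```shell
# Override KUBECTL (for example KUBECTL="echo kubectl") to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"
NS="${NS:-kfuse}"   # assumption: Kfuse is installed in the kfuse namespace

# $1 is "true" or "false". Changing the env normally triggers a rollout.
set_disk_balloon() {
  $KUBECTL set env statefulset/pinot-server-realtime -n "$NS" DISK_BALLOON="$1"
}

recover_realtime() {
  $KUBECTL rollout restart statefulset pinot-server-offline -n "$NS"    # step 1
  set_disk_balloon false                                                # step 2
  # Step 3: this only waits for the pods to start; segment movement
  # still has to be confirmed separately (for example via disk usage).
  $KUBECTL rollout status statefulset pinot-server-realtime -n "$NS"
  echo "Once segments have moved to the offline servers, run: set_disk_balloon true"  # step 4
}
```

Step 4 is left as a manual call on purpose, since the rollout status does not show whether segment movement has finished.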

DeepStore access issues

Symptoms

Pinot-related jobs are stuck in crash loop back-off (e.g., kfuse-set-tag-hook, pinot-metrics-table-creation, etc).
Pinot-controller logs a deep store access-related exception.
- On AWS S3, the exception has the following format:
  Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: MAYE68P6SYZMTTMP, Extended Request ID: L7mSpEzHz9gdxZQ8iNM00jKtoXYhkNrUzYntbbGkpFmUF+tQ8zL+fTpjJRlp2MDLNvhaVYCie/Q=)

Resolution

Refer to https://kloudfuse.atlassian.net/wiki/spaces/EX/pages/724664352 for setting the access for Pinot.
- On GCP, ensure that the secret has correct access to the cloud storage bucket.
- On AWS S3, if the node does not have permission to access the S3 bucket, ensure that the access key and secret access key are populated:
  pinot:
    deepStore:
      enabled: true
      type: "s3"
      useSecret: true
      createSecret: true
      dataDir: "s3://[REPLACE BUCKET HERE]/kfuse/controller/data"
      s3:
        region: "YOUR REGION"
        accessKey: "YOUR AWS ACCESS KEY"
        secretKey: "YOUR AWS SECRET KEY"
If Pinot has the correct access credentials to the deep store, then the configured bucket will have the directory created that matches the dataDir.
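
One way to check for that directory on AWS is to list the dataDir with the AWS CLI. The sketch below assumes the AWS CLI is installed and is run with the same credentials that Pinot uses (otherwise the result says nothing about Pinot's access); `s3_bucket`, `s3_prefix`, and `check_deepstore` are hypothetical helpers.

```shell
# Split an s3:// URI such as the dataDir value into bucket and key prefix.
s3_bucket() { rest=${1#s3://}; echo "${rest%%/*}"; }
s3_prefix() { rest=${1#s3://}; echo "${rest#*/}"; }

# List the deep store directory; an empty listing or an AccessDenied
# error suggests Pinot has not been able to write to it.
check_deepstore() {
  aws s3 ls "s3://$(s3_bucket "$1")/$(s3_prefix "$1")/"
}
```

For example, `check_deepstore "s3://<bucket>/kfuse/controller/data"`, with `<bucket>` replaced by the bucket configured in dataDir.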

Getting ideal state and external view for segments from `pinot-controller`

Enable port-forward for pinot-controller by running:

kubectl port-forward pinot-controller-0 9000:9000

Ensure that the pinot-controller-0 pod is running and fully up by running kubectl get pods.

Dump the ideal state and external view for segments by running:

curl "http://localhost:9000/tables/<tableName>/idealstate" | jq > ideal_state.json 2>&1
curl "http://localhost:9000/tables/<tableName>/externalview" | jq > external_state.json 2>&1

If you do not have jq or an equivalent tool installed already, follow the jq installation instructions first.

Replace <tableName> with one of the following:

Metrics: kf_metrics_REALTIME
Events: kf_events_REALTIME
Logs: kf_logs_REALTIME
Traces: kf_traces_REALTIME

For instance, to get the ideal state and external view for the logs table, copy-paste the following commands:

curl "http://localhost:9000/tables/kf_logs_REALTIME/idealstate" | jq > ideal_state.json 2>&1
curl "http://localhost:9000/tables/kf_logs_REALTIME/externalview" | jq > external_state.json 2>&1
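
The table lookup and the two curl calls can be wrapped in a small helper so the same dump works for any stream. This is a sketch; `pinot_table_for` and `dump_segment_state` are hypothetical names, the output file names are chosen for illustration, and the port-forward from above must already be running.

```shell
# Map a stream name to its Pinot table, using the table list above.
pinot_table_for() {
  case "$1" in
    metrics|events|logs|traces) echo "kf_${1}_REALTIME" ;;
    *) echo "unknown stream: $1" >&2; return 1 ;;
  esac
}

# Dump ideal state and external view for one stream into two JSON files.
dump_segment_state() {
  table=$(pinot_table_for "$1") || return 1
  curl -s "http://localhost:9000/tables/${table}/idealstate"   | jq . > "${1}_ideal_state.json"
  curl -s "http://localhost:9000/tables/${table}/externalview" | jq . > "${1}_external_view.json"
}
```

Running `dump_segment_state traces` would write traces_ideal_state.json and traces_external_view.json to the current directory.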

Realtime usage is increasing continuously

The pinot-server-realtime persistent volume usage increases continuously if segment movement to the offline servers is disrupted. This has been partially addressed in version 2.6.5. There are two ways to verify the behaviour:

- Since v2.6.5, an alert fires if the PVC usage is above 40%.
- Navigate to Kloudfuse Overview → System dashboards and check the PV Used Space panel for the pinot-server-realtime graph.

Remediation

To remediate the situation, restart the Pinot offline and realtime servers with the following command.

kubectl rollout restart sts pinot-server-offline pinot-server-realtime

If the PV usage has reached 100% and the servers cannot be restarted gracefully, increase the size of the pinot-server-realtime PVCs by about 10% to accommodate the increased requirement, then restart the Pinot offline and realtime servers.
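
The 10% increase can be computed rather than estimated. A minimal sketch, where `grow_gi` is a hypothetical helper that works in whole Gi and rounds up:

```shell
# Print the new size in Gi after adding a percentage, rounded up.
# $1 = current size in Gi, $2 = percent to add
grow_gi() {
  echo $(( ($1 * (100 + $2) + 99) / 100 ))
}
```

For a 100Gi volume, `grow_gi 100 10` prints 110, so the new request would be 110Gi. For a 25Gi volume it prints 28, since 27.5 is rounded up.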

Storage

Increasing the existing PVC size

Note that on Azure, PremiumV2_LRS does not currently support online resizing of PVCs. Follow the steps in https://kloudfuse.atlassian.net/wiki/spaces/EX/pages/786038817/Troubleshooting#Resizing-PVC-on-Azure instead.

In certain scenarios, you may need to increase the size of a PVC. You can use the resize_pvc.sh script to do so.

For example, to increase the size of the kafka statefulset PVCs to 100Gi in the kfuse namespace:

sh resize_pvc.sh kafka 100Gi kfuse

Resizing PVC on Azure

On Azure, a PremiumV2_LRS disk needs to be in unattached state before it can be resized.

Follow these steps to resize a PVC on Azure.

1. Cordon all the nodes.
   kubectl cordon <NODE>
2. Delete the statefulset.
   kubectl delete sts <STATEFULSET>
3. Verify that the corresponding disk is in unattached state in Azure Portal. See example screenshot.
4. Patch all the PVCs with the desired size.
   kubectl patch pvc <PVC> --patch '{"spec": {"resources": {"requests": {"storage": "'<SIZE>'" }}}}'
5. Uncordon the nodes.
   kubectl uncordon <NODE>
6. Update custom_values.yaml with the disk size for the statefulset disk.
7. Run helm upgrade of kfuse with the updated custom_values.yaml. This step is optional now as the script recreates the sts.
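
The quoting in the patch command is easy to get wrong, so the patch body can be generated by a helper. This is a sketch; `pvc_patch_json` and `patch_pvc` are hypothetical names, and the namespace argument is an addition not present in the command above.

```shell
# Build the JSON patch body for a given storage size, e.g. 100Gi.
pvc_patch_json() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}

# Patch one PVC. $1 = PVC name, $2 = new size, $3 = namespace
patch_pvc() {
  kubectl patch pvc "$1" -n "$3" --patch "$(pvc_patch_json "$2")"
}
```

For example, `patch_pvc <PVC> 100Gi kfuse` applies the same patch as the manual command, once per PVC of the statefulset.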

Fluent-Bit

Duplicate logs show up in Kfuse stack

Symptoms

Duplicate logs with the same timestamp and log event appear in the Kfuse stack, but the application logs (either on the host or in the container) show no evidence of duplication. This issue happens only when the agent is Fluent-Bit.

Resolution

The Fluent-Bit logs contain the following error:

[error] [in_tail] file=<path_to_filename> requires a larger buffer size, lines are too long. Skipping file

This appears to be a known issue with Fluent-Bit, tracked in two issues in the Fluent-Bit repository. It happens with the default buffer size for the tail plugin. A workaround is to increase the maximum buffer size by adding Buffer_Chunk_Size and Buffer_Max_Size to the tail plugin configuration.

[INPUT]
    Name              tail
    Path              <file_path_to_tail>
    Tag               <tag>
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   8M

This configuration is per tail plugin. So if you have multiple tail plugin configurations, then you'll need to add the buffer configuration to every tail plugin.
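
To find out which tail inputs need the larger buffers, the affected file paths can be pulled out of the Fluent-Bit logs. A minimal sketch, assuming the error format shown above; `files_needing_bigger_buffer` is a hypothetical helper that reads log lines on stdin.

```shell
# Print the unique file paths that triggered the buffer-size error.
files_needing_bigger_buffer() {
  sed -n 's/.*\[in_tail] file=\(.*\) requires a larger buffer size.*/\1/p' | sort -u
}
```

It can be fed from the agent logs, for example `kubectl logs <fluent-bit-pod> | files_needing_bigger_buffer`, where the pod name is a placeholder. Each path in the output belongs to a tail input whose Path matches it.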

Additional info

One way to deduce that the duplication was introduced in the Fluent-Bit’s tail plugin is to add a randomly generated number/string as part of the Fluent-Bit record. This will show up as a log facet in the kfuse stack. If the duplicate log lines all have different numbers/strings, then it confirms the theory that the duplication happened in the Fluent-Bit agent. In order to get a randomly generated number/string, add the following filter to your Fluent-Bit config:

[FILTER]
   Name lua
   Match *
   Call append_rand_number
   Code function append_rand_number(tag, timestamp, record) math.randomseed(os.clock()*100000000000); new_record = record; new_record["rand_id"] = tostring(math.random(1, 1000000000)); return 1, timestamp, new_record end

Datadog Agent

Kube_cluster_name label does not show in Kfuse stack

Symptom

MELT data ingested from Datadog agent is missing the kube_cluster_name label.

Resolution

There is a known issue in Datadog agent cluster name detection that requires the cluster agent to be up. If the agent starts up before the cluster agent, then it fails to detect the cluster name. See https://github.com/DataDog/datadog-agent/issues/24406 .

A workaround is to do a rollout restart of the Datadog agent daemonset.

kubectl rollout restart daemonset datadog-agent

Access denied while creating Alert / Contact point

Symptom

A non-admin (SSO) user may get one of the following permission errors when creating an alert or contact point.

{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}

or

{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}

Resolution

Workaround: Log in as an admin to create the contact point or alert, because SSO users are not granted permission to create contact points or alerts manually.
