Networking
Kfuse is unreachable from external host
Symptom
Unable to access Kfuse from the external IP/host or DNS.
curl http://EXTERNAL_IP
curl: (28) Failed to connect to XX.XX.XX.XX port 80 after 129551 ms: Connection timed out

curl https://EXTERNAL_IP --insecure
curl: (28) Failed to connect to XX.XX.XX.XX port 443 after 129551 ms: Connection timed out
Resolution
Ensure that the security group or firewall policy for the Kubernetes cluster, node, and VPC endpoint allows external incoming traffic.
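As an illustration on AWS (the security group ID is a placeholder), you can inspect the inbound rules of the relevant security group with the AWS CLI:

# Shows inbound rules; ports 80/443 should be open to the intended sources.
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissions'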
Packets are getting dropped by ingress-nginx
Symptom
ingress-nginx logs a client intended to send too large body error.
2023/03/06 05:38:22 [error] 43#43: *128072996 client intended to send too large body: 1097442 bytes, client: XXXX, server: _, request: "POST /ingester/v1/fluent_bit HTTP/1.1", host: "XXXX"
Resolution
ingress-nginx can be configured to accept a larger request body size; the default is 1m. Upgrade Kfuse with the following section in the custom values file.
ingress-nginx:
  controller:
    config:
      proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable any limit.>
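After upgrading, one way to confirm the new limit took effect is to check the rendered nginx configuration inside the controller pod; the deployment name below is an assumption and may differ in your install:

# proxy-body-size is rendered as client_max_body_size in nginx.conf.
kubectl exec -n kfuse deploy/kfuse-ingress-nginx-controller -- \
  grep client_max_body_size /etc/nginx/nginx.conf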
Pinot
Pinot Server Realtime Pods in Crash Loop Back Off
Symptoms
Container logs show the following JFR initialization errors:
jdk.jfr.internal.dcmd.DCmdException: Could not use /var/pinot/server/data/jfr as repository. Unable to create JFR repository directory using base location (/var/pinot/server/data/jfr)
Error occurred during initialization of VM
Failure when starting JFR on_create_vm_2
Pinot server realtime disk usage is at 100%.
Resolution
In Kfuse version 2.6.5 or earlier
The pinot server realtime runs out of disk space if Pinot is unable to move segments to the offline servers. In some cases, the offline servers hit an exception, stop handling new messages, and need to be restarted:
kubectl rollout restart -n kfuse statefulset pinot-server-offline
The persistent disk attached to Pinot Server Realtime needs to be increased. Refer to https://kloudfuse.atlassian.net/wiki/spaces/EX/pages/786038817/Troubleshooting#Increasing-the-existing-PVC-size
From Kfuse version 2.6.7 onwards, there is no need to resize the pinot server realtime disks. Instead, follow these steps:
Restart pinot-server-offline.
Edit the pinot-server-realtime StatefulSet and remove the BALLOON_DISK env variable or set it to false (see the sketch after these steps).
Wait for pinot-server-realtime to start up and finish moving segments to the offline servers.
Edit the pinot-server-realtime StatefulSet and set the BALLOON_DISK env variable back to true.
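A minimal sketch of toggling BALLOON_DISK with kubectl set env, assuming the kfuse namespace:

# Step 2: disable ballooning while segments drain to the offline servers.
kubectl set env statefulset/pinot-server-realtime -n kfuse BALLOON_DISK=false
# Step 4: once segment movement completes, re-enable it.
kubectl set env statefulset/pinot-server-realtime -n kfuse BALLOON_DISK=true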
DeepStore access issues
Symptoms
Pinot-related jobs are stuck in crash loop back-off (e.g., kfuse-set-tag-hook, pinot-metrics-table-creation).
Pinot-controller logs a deep store access-related exception.
On AWS S3, the exception has the following format:
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: MAYE68P6SYZMTTMP, Extended Request ID: L7mSpEzHz9gdxZQ8iNM00jKtoXYhkNrUzYntbbGkpFmUF+tQ8zL+fTpjJRlp2MDLNvhaVYCie/Q=)
Resolution
Refer to Configure GCP/AWS/Azure Object Store for Pinot for setting the access for Pinot.
On GCP, ensure that the secret has correct access to the cloud storage bucket.
On AWS S3, if the node does not have permission to the S3 bucket, then ensure that the access key and secret access key are populated:
pinot:
  deepStore:
    enabled: true
    type: "s3"
    useSecret: true
    createSecret: true
    dataDir: "s3://[REPLACE BUCKET HERE]/kfuse/controller/data"
    s3:
      region: "YOUR REGION"
      accessKey: "YOUR AWS ACCESS KEY"
      secretKey: "YOUR AWS SECRET KEY"
If Pinot has the correct access credentials to the deep store, a directory matching the dataDir will be created in the configured bucket.
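One quick way to check for that directory, assuming the AWS CLI is configured with the same credentials (bucket name is a placeholder):

# Lists the controller data directory that Pinot should have created.
aws s3 ls s3://[REPLACE BUCKET HERE]/kfuse/controller/data/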
Getting ideal state and external view for segments from pinot-controller
Enable port-forward for pinot-controller by running:
kubectl port-forward pinot-controller-0 9000:9000
Ensure that the pinot-controller-0 pod is running and fully up by running kubectl get pods.
Dump the ideal state and external view for segments by running:
curl "http://localhost:9000/tables/<tableName>/idealstate" | jq > ideal_state.json 2>&1 curl "http://localhost:9000/tables/<tableName>/externalview" | jq > external_state.json 2>&1
If you do not have jq or an equivalent tool installed already, follow the installation instructions from here.
Replace <tableName> with one of the following:
Metrics: kf_metrics_REALTIME
Events: kf_events_REALTIME
Logs: kf_logs_REALTIME
Traces: kf_traces_REALTIME
For instance, to get the ideal state and external view for the logs table, copy-paste the following commands:
curl "http://localhost:9000/tables/kf_logs_REALTIME/idealstate" | jq > ideal_state.json 2>&1 curl "http://localhost:9000/tables/kf_logs_REALTIME/externalview" | jq > external_state.json 2>&1
Realtime usage is increasing continuously
The pinot-server-realtime persistent volume usage increases continuously if there is any disruption to segment movement. This has been partially addressed in version 2.6.5. There are two ways to verify the behaviour:
Since v2.6.5, there is an alert in place that notifies when the PVC usage is above 40%.
Navigate to Kloudfuse Overview → System dashboards and check the PV Used Space panel for the pinot-server-realtime graph.
Remediation
To remediate the situation, restart the pinot realtime and offline servers with the following command:
kubectl rollout restart sts pinot-server-offline pinot-server-realtime
If the PV usage has reached 100% and the servers cannot be restarted gracefully, increase the size of the pinot-server-realtime PVCs by 10% or so to accommodate the increased requirement, then restart pinot-server-offline and pinot-server-realtime.
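For example, using the resize_pvc.sh script described under Storage below; the 110Gi target is a placeholder, so pick a size roughly 10% above the current one:

# Grow the pinot-server-realtime PVCs, then restart both statefulsets.
sh resize_pvc.sh pinot-server-realtime 110Gi kfuse
kubectl rollout restart sts pinot-server-offline pinot-server-realtime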
Storage
Increasing the existing PVC size
Note that for Azure, PremiumV2_LRS does not currently support online resizing of PVC. Please follow the steps in https://kloudfuse.atlassian.net/wiki/spaces/EX/pages/786038817/Troubleshooting#Resizing-PVC-on-Azure
In certain scenarios, you might need to increase the size of a PVC. You can use the resize_pvc.sh script to do so.
For example, to increase the size of the kafka StatefulSet PVCs to 100Gi in the kfuse namespace:
sh resize_pvc.sh kafka 100Gi kfuse
Resizing PVC on Azure
On Azure, the PremiumV2_LRS disk needs to be in an unattached state before it can be resized.
Follow these steps to resize PVC on Azure.
Cordon all the nodes
kubectl cordon <NODE>
Delete the statefulset
kubectl delete sts <STATEFULSET>
Verify that the corresponding disk is in an unattached state in the Azure Portal.
Patch all the PVCs with the desired size.
kubectl patch pvc <PVC> --patch '{"spec": {"resources": {"requests": {"storage": "'<SIZE>'" }}}}'
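If the statefulset has several PVCs, a sketch of patching them all at once (the namespace, statefulset prefix, and size are placeholders):

# Patches every PVC whose name contains the statefulset prefix.
for pvc in $(kubectl get pvc -n kfuse -o name | grep <STATEFULSET>); do
  kubectl patch -n kfuse "$pvc" --patch '{"spec": {"resources": {"requests": {"storage": "<SIZE>"}}}}'
done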
Uncordon the nodes
kubectl uncordon <NODE>
Update custom_values.yaml with the new disk size for the statefulset disk.
Run helm upgrade of Kfuse with the updated custom_values.yaml.
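A sketch of that upgrade, assuming the release is named kfuse in the kfuse namespace; <KFUSE_CHART> stands in for your actual chart reference (local .tgz or repo path):

helm upgrade kfuse <KFUSE_CHART> -n kfuse -f custom_values.yaml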
Fluent-Bit
Duplicate logs show up in Kfuse stack
Symptoms
You notice duplicate logs with the same timestamp and log event in the Kfuse stack, but if you check the application logs (either on the host or in the container), there is no evidence of duplication. This issue happens only when the agent is Fluent-Bit.
Resolution
If you look at the Fluent-Bit logs, you'll notice the following error:
[error] [in_tail] file=<path_to_filename> requires a larger buffer size, lines are too long. Skipping file
This seems like a known issue with Fluent-Bit; refer to these two issues in the Fluent-Bit repo, here and here. It happens with the default buffer size for the tail plugin. A workaround is to increase the max buffer size by adding Buffer_Chunk_Size and Buffer_Max_Size to the tail plugin configuration.
[INPUT]
    Name              tail
    Path              <file_path_to_tail>
    Tag               <tag>
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   8M
This configuration is per tail plugin. So if you have multiple tail plugin configurations, then you'll need to add the buffer configuration to every tail plugin, as in the sketch below.
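For illustration, a config with two tail inputs, each carrying its own buffer settings; the paths and tags are placeholders:

[INPUT]
    Name              tail
    Path              /var/log/app/*.log
    Tag               app.*
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   8M

[INPUT]
    Name              tail
    Path              /var/log/nginx/*.log
    Tag               nginx.*
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   8M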
Additional info
One way to deduce that the duplication was introduced in Fluent-Bit's tail plugin is to add a randomly generated number/string as part of the Fluent-Bit record. This will show up as a log facet in the Kfuse stack. If the duplicate log lines all have different numbers/strings, then it confirms that the duplication happened in the Fluent-Bit agent. To get a randomly generated number/string, add the following filter to your Fluent-Bit config:
[FILTER]
    Name  lua
    Match *
    Call  append_rand_number
    Code  function append_rand_number(tag, timestamp, record) math.randomseed(os.clock()*100000000000); new_record = record; new_record["rand_id"] = tostring(math.random(1, 1000000000)); return 1, timestamp, new_record end
Datadog Agent
Kube_cluster_name label does not show in Kfuse stack
Symptom
MELT data ingested from the Datadog agent is missing the kube_cluster_name label.
Resolution
There is a known issue in Datadog agent cluster name detection that requires the cluster agent to be up. If the agent starts up before the cluster agent, then it fails to detect the cluster name. See https://github.com/DataDog/datadog-agent/issues/24406.
A workaround is to do a rollout restart of the Datadog agent daemonset.
kubectl rollout restart daemonset datadog-agent
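To confirm the cluster name is detected after the restart, one option is to inspect the agent's resolved runtime config; the pod name below is a placeholder from kubectl get pods:

# Prints the cluster_name entry from the agent's resolved configuration.
kubectl exec -it datadog-agent-xxxxx -- agent config | grep cluster_name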
Access denied while creating Alert / Contact point
Symptom
A non-admin (SSO) user may get a permission error like the following when creating an alert or contact point.
{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}
or
{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}
Resolution
Workaround: Log in as admin to create the contact point or alert, since an SSO user isn't granted permissions to create contact points or alerts manually.