...
Code Block
ingress-nginx:
  controller:
    config:
      proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable any limit.>
Pinot
Pinot Server Realtime Pods in Crash Loop Back Off
Symptoms
Container logs show the following JFR initialization errors:
Code Block
jdk.jfr.internal.dcmd.DCmdException: Could not use /var/pinot/server/data/jfr as repository. Unable to create JFR repository directory using base location (/var/pinot/server/data/jfr)
Error occurred during initialization of VM
Failure when starting JFR on_create_vm_2
Pinot server realtime disk usage is at 100%.
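Disk usage can be checked from inside the pod to confirm this symptom. The following is a minimal sketch, not an official Kfuse procedure. It assumes the pod is named pinot-server-realtime-0 (the usual StatefulSet ordinal naming), that the namespace is kfuse, and that the data volume is mounted at /var/pinot/server/data, the path shown in the JFR error. The function name is hypothetical. If the container is crash-looping, the exec may fail because there is no running container to attach to.

```shell
# Assumed names: adjust the namespace and pod to match your cluster.
NS="${NS:-kfuse}"
POD="${POD:-pinot-server-realtime-0}"

# Prints filesystem usage for the Pinot data path inside the realtime pod.
check_realtime_disk() {
  kubectl exec -n "$NS" "$POD" -- df -h /var/pinot/server/data
}
```

Run `check_realtime_disk` and look at the Use% column for the data volume.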
Resolution
In Kfuse version 2.6.5 or earlier
The Pinot server realtime pods run out of disk space when Pinot cannot move segments to the offline servers. In some cases, the offline servers hit an exception, stop handling new messages, and need to be restarted:
Code Block kubectl rollout restart -n kfuse statefulset pinot-server-offline
The persistent disk attached to Pinot Server Realtime needs to be increased. Refer to https://kloudfuse.atlassian.net/wiki/spaces/EX/pages/edit-v2/786038817#Increasing-the-existing-PVC-size
From Kfuse version 2.6.7 onwards, there is no need to resize the Pinot server realtime disks. Follow these steps instead:
Restart pinot-server-offline.
Edit the pinot-server-realtime StatefulSet and either remove the BALLOON_DISK env variable or set it to false.
Wait for the Pinot server realtime pods to start up and finish moving segments to the offline servers.
Edit the pinot-server-realtime StatefulSet again and set the BALLOON_DISK env variable back to true.
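The steps above can be sketched with kubectl as shown below. This is an illustration under assumptions, not the documented Kfuse procedure: the function names are hypothetical, the namespace is assumed to be kfuse, and `kubectl set env` is used in place of a manual edit. By default `kubectl set env` applies to every container in the pod template, so add `-c <container>` if only one container should change. If the variable is managed through Helm values, a later upgrade may overwrite the change.

```shell
NS="${NS:-kfuse}"   # assumed namespace

recover_realtime() {
  # 1. Restart the offline servers so they resume accepting segments.
  kubectl rollout restart -n "$NS" statefulset pinot-server-offline
  # 2. Turn the balloon disk off; changing the pod template triggers a rollout.
  kubectl set env -n "$NS" statefulset/pinot-server-realtime BALLOON_DISK=false
  # 3. Block until the realtime pods are back up.
  kubectl rollout status -n "$NS" statefulset/pinot-server-realtime
}

restore_balloon() {
  # 4. Run only after segments have finished moving to the offline servers.
  kubectl set env -n "$NS" statefulset/pinot-server-realtime BALLOON_DISK=true
}
```

`kubectl rollout status` only confirms that the pods are ready. It does not confirm that segment movement has finished, so check disk usage or the Pinot logs before calling `restore_balloon`.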
DeepStore access issues
Symptoms
Pinot-related jobs are stuck in crash loop back-off (e.g., kfuse-set-tag-hook, pinot-metrics-table-creation).
The pinot-controller logs contain a deep store access-related exception.
On AWS S3, the exception has the following format:
Code Block Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: MAYE68P6SYZMTTMP, Extended Request ID: L7mSpEzHz9gdxZQ8iNM00jKtoXYhkNrUzYntbbGkpFmUF+tQ8zL+fTpjJRlp2MDLNvhaVYCie/Q=)
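One way to look for this exception is to filter the controller logs. This is a sketch under assumptions: the pod name pinot-controller-0 and the kfuse namespace are guesses to adjust for your cluster, and the function name is hypothetical.

```shell
NS="${NS:-kfuse}"                                        # assumed namespace
CONTROLLER_POD="${CONTROLLER_POD:-pinot-controller-0}"   # assumed pod name

# Prints controller log lines that mention an S3 exception or an access denial.
# grep exits non-zero when nothing matches, meaning no such error was logged.
find_deepstore_errors() {
  kubectl logs -n "$NS" "$CONTROLLER_POD" | grep -E 'S3Exception|Access Denied'
}
```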
Resolution
Refer to Configure GCP/AWS/Azure Object Store for Pinot to set up deep store access for Pinot.
On GCP, ensure that the secret has the correct access to the cloud storage bucket.
On AWS S3, if the node does not have permission to access the S3 bucket, ensure that the access key and secret access key are populated:
Code Block
pinot:
  deepStore:
    enabled: true
    type: "s3"
    useSecret: true
    createSecret: true
    dataDir: "s3://[REPLACE BUCKET HERE]/kfuse/controller/data"
    s3:
      region: "YOUR REGION"
      accessKey: "YOUR AWS ACCESS KEY"
      secretKey: "YOUR AWS SECRET KEY"
If Pinot has the correct access credentials to the deep store, the configured bucket will contain a directory that matches the dataDir.
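On AWS, this check can be done with the AWS CLI. This is a minimal sketch, assuming the CLI is installed and is using credentials that can read the bucket. BUCKET is a placeholder for the bucket in your dataDir, and the function name is hypothetical.

```shell
# BUCKET is a placeholder; substitute the bucket name from your dataDir.
BUCKET="${BUCKET:-REPLACE_BUCKET_HERE}"

# Lists the deep store directory; an empty result or an error suggests
# Pinot has not been able to write to the bucket.
list_deepstore_dir() {
  aws s3 ls "s3://${BUCKET}/kfuse/controller/data/"
}
```

Note that a successful listing from your workstation only shows that the directory exists. It does not prove that the credentials configured for Pinot are the ones that created it.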
...
If PV usage has reached 100% and the pods cannot be restarted gracefully, increase the size of the pinot-realtime PVCs by about 10% to accommodate the increased requirement, then restart the Pinot offline and realtime servers.
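The 10% increase can be sketched as follows. The helper names are hypothetical, the namespace is assumed to be kfuse, and the PVC name must come from your own cluster (for example from `kubectl get pvc -n kfuse`). Patching the size in place only works when the storage class allows volume expansion. Where Kloudfuse documents a PVC resize procedure for your version, follow that instead.

```shell
NS="${NS:-kfuse}"   # assumed namespace

# Prints a size about 10% larger, rounded up, keeping the unit suffix
# (e.g. 100Gi -> 110Gi). Handles integer sizes only.
grow_ten_percent() {
  size="$1"
  num="${size%%[A-Za-z]*}"    # numeric part, e.g. 100
  unit="${size#"$num"}"       # unit suffix, e.g. Gi
  echo "$(( (num * 11 + 9) / 10 ))${unit}"
}

# Requests the new size on one PVC: resize_pvc <pvc-name> <new-size>
resize_pvc() {
  pvc="$1"
  new_size="$2"
  kubectl patch pvc "$pvc" -n "$NS" \
    -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${new_size}\"}}}}"
}
```

For example, `resize_pvc <pvc-name> "$(grow_ten_percent 100Gi)"` requests 110Gi for the named PVC.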
...
Duplicate logs show up in Kfuse stack
Symptoms
Duplicate logs with the same timestamp and log event appear in the Kfuse stack, but the application logs (either on the host or in the container) show no evidence of duplication. This issue happens only when the agent is Fluent-Bit.
Resolution
The Fluent-Bit logs show the following error:
...
MELT data ingested from Datadog agent is missing the kube_cluster_name label.
Resolution
Cluster name detection in the Datadog agent has a known issue: it requires the cluster agent to be up. If the agent starts before the cluster agent, it fails to detect the cluster name. See https://github.com/DataDog/datadog-agent/issues/24406.
...
kubectl rollout restart daemonset datadog-agent
Access denied while creating Alert / Contact point
Symptom
A non-admin (SSO) user may get a permission error like the following when creating an alert or contact point:
{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}
or
{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}
Resolution
Workaround: Log in as admin to create the contact point or alert. SSO users are not granted permission to create contact points or alerts manually.