...

Code Block
ingress-nginx:
  controller:
    config:
      proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable any limit.>

Pinot

Pinot Server Realtime Pods in Crash Loop Back Off

Symptoms

  • Container logs show the following JFR initialization errors:

    Code Block
    jdk.jfr.internal.dcmd.DCmdException: Could not use /var/pinot/server/data/jfr as repository. Unable to create JFR repository directory using base location (/var/pinot/server/data/jfr)
    Error occurred during initialization of VM
    Failure when starting JFR on_create_vm_2
  • Pinot server realtime disk usage is at 100%.

Resolution

  • In Kfuse version 2.6.5 or earlier

  • From Kfuse version 2.6.7 onwards, there is no need to resize the pinot server realtime disks. Follow these steps:

    1. Restart pinot-server-offline.

    2. Edit the pinot-server-realtime sts and either remove the BALLOON_DISK env variable or set it to false.

    3. Wait for pinot server realtime to start up and finish moving segments to the offline servers.

    4. Edit the pinot-server-realtime sts again and set the BALLOON_DISK env variable back to true.
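The steps above can be sketched with standard kubectl commands. This is a minimal sketch, not the official procedure: it assumes the StatefulSets are named pinot-server-offline and pinot-server-realtime, that they live in the kfuse namespace, and that BALLOON_DISK is set directly on the StatefulSet pod template. The function names are hypothetical helpers.

```shell
#!/bin/bash
# Assumption: Kfuse is installed in the "kfuse" namespace; override NAMESPACE if not.
NAMESPACE="${NAMESPACE:-kfuse}"

# Step 1: restart the offline servers.
restart_offline() {
  kubectl rollout restart statefulset pinot-server-offline -n "$NAMESPACE"
}

# Steps 2 and 4: set BALLOON_DISK on the realtime StatefulSet.
# $1 is "true" or "false".
set_balloon_disk() {
  kubectl set env statefulset/pinot-server-realtime -n "$NAMESPACE" "BALLOON_DISK=$1"
}

# Step 3 (partial): wait for the realtime pods to become ready.
# This only checks pod readiness; it does not confirm that segment
# movement to the offline servers has finished.
wait_realtime() {
  kubectl rollout status statefulset/pinot-server-realtime -n "$NAMESPACE"
}
```

A possible sequence is `restart_offline`, `set_balloon_disk false`, `wait_realtime`, and then, once segments have moved, `set_balloon_disk true`.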

DeepStore access issues

Symptoms

  • Pinot-related jobs are stuck in crash loop back-off (e.g., kfuse-set-tag-hook, pinot-metrics-table-creation).

  • The pinot-controller logs show a deep store access exception.

    • On AWS S3, the exception has the following format:

      Code Block
      Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: MAYE68P6SYZMTTMP, Extended Request ID: L7mSpEzHz9gdxZQ8iNM00jKtoXYhkNrUzYntbbGkpFmUF+tQ8zL+fTpjJRlp2MDLNvhaVYCie/Q=)

Resolution

  • Refer to Configure GCP/AWS/Azure Object Store for Pinot to set up deep store access for Pinot.

    • On GCP, ensure that the secret has correct access to the cloud storage bucket.

    • On AWS S3, if the node does not have permission to access the S3 bucket, ensure that the access key and secret access key are populated:

      Code Block
      pinot:
          deepStore:
            enabled: true
            type: "s3"
            useSecret: true
            createSecret: true
            dataDir: "s3://[REPLACE BUCKET HERE]/kfuse/controller/data"
            s3:
              region: "YOUR REGION"
              accessKey: "YOUR AWS ACCESS KEY"
              secretKey: "YOUR AWS SECRET KEY"
  • If Pinot has the correct access credentials to the deep store, the configured bucket will contain a directory that matches the dataDir.
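One way to confirm that the dataDir directory exists on S3 is to list it with the AWS CLI. The sketch below assumes the AWS CLI is installed and is run with the same credentials that Pinot uses; otherwise the result says nothing about Pinot's own access. The function name check_deepstore_dir is a hypothetical helper.

```shell
#!/bin/bash
# List the deep store directory that matches dataDir.
# $1: the dataDir value, e.g. s3://<bucket>/kfuse/controller/data
check_deepstore_dir() {
  case "$1" in
    s3://*)
      # Strip any trailing slash, then add exactly one so the prefix is listed as a directory.
      aws s3 ls "${1%/}/"
      ;;
    *)
      echo "dataDir must start with s3://" >&2
      return 1
      ;;
  esac
}
```

An empty listing or an Access Denied error suggests that the credentials or bucket permissions still need attention.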

...

If PV usage has reached 100% and the Pinot servers cannot be restarted gracefully, increase the size of the pinot-realtime PVCs by about 10% to accommodate the increased requirement, then restart the pinot-server offline and realtime pods.
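The 10% increase can be computed with simple integer arithmetic. This is a small sketch that assumes the current PVC size is a whole number of Gi; grow_size is a hypothetical helper, and the result is rounded up so the new size is never smaller than requested.

```shell
#!/bin/bash
# Print a size pct% larger than the current one, rounded up to a whole Gi.
# $1: current size in Gi (integer), $2: growth percentage (defaults to 10)
grow_size() {
  current=$1
  pct=${2:-10}
  # Adding 99 before dividing by 100 rounds the result up.
  echo "$(( (current * (100 + pct) + 99) / 100 ))Gi"
}
```

For example, `grow_size 100` prints `110Gi` and `grow_size 256` prints `282Gi`. The current size can be read with `kubectl get pvc -n kfuse`.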

...

In certain scenarios, you may need to increase the size of a PVC. You can use the resize_pvc.sh script below to do so.

#!/bin/bash
# Usage: ./resize_pvc.sh <statefulset name> <size> [namespace]
set -x
sts_name=$1
size=$2
namespace=$3

# https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/volume-expansion#using_volume_expansion
# If the storageclass is not resizeable, add the following line at the top level to make it resizeable:
# allowVolumeExpansion: true

if [ -z "$sts_name" ] || [ -z "$size" ]; then
  echo "Usage: ./resize_pvc.sh <statefulset name> <size> [namespace]"
  exit 1
fi

# Default to the kfuse namespace when none is given.
if [ -z "$namespace" ]; then
  namespace="kfuse"
fi

# Patch the PVCs of every pod owned by the statefulset.
patched=0
for pod in $(kubectl get pods -n "$namespace" -o 'custom-columns=NAME:.metadata.name,CONTROLLER:.metadata.ownerReferences[].name' | grep "${sts_name}\$" | awk '{print $1}')
do
  for pvc in $(kubectl get pods -n "$namespace" "$pod" -o 'custom-columns=PVC:.spec.volumes[].persistentVolumeClaim.claimName' | grep -v PVC)
  do
    echo "Patching $pvc"
    echo "kubectl patch pvc $pvc -n $namespace --patch '{\"spec\": {\"resources\": {\"requests\": {\"storage\": \"$size\" }}}}'"
    if ! kubectl patch pvc "$pvc" -n "$namespace" --patch '{"spec": {"resources": {"requests": {"storage": "'"$size"'" }}}}'; then
      echo "Failed to patch pvc. Cannot move forward."
      exit 1
    fi
    patched=1
  done
done

# Delete the statefulset once, and only if at least one PVC was patched.
# --cascade=orphan leaves the existing pods running.
if [ "$patched" -eq 1 ]; then
  echo "kubectl delete sts $sts_name --cascade=orphan -n $namespace"
  kubectl delete sts "$sts_name" --cascade=orphan -n "$namespace"
  echo "Run helm upgrade to redeploy the statefulset with the updated disk size."
  echo "If resizing the PVC on the observe cluster, use the ToT of the staging branch,"
  echo "update the observe.yaml locally and then do the helm upgrade with the"
  echo "checked-in version."
  echo "Check in the updated observe.yaml on the main branch only so that it gets picked up"
  echo "on the next full upgrade of the observe cluster."
fi
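The script only works when the storage class behind each PVC allows volume expansion, as its comments note. The sketch below checks that before resizing. It assumes kubectl access to the cluster; pvc_expandable is a hypothetical helper, and kfuse is used as the default namespace to match the script.

```shell
#!/bin/bash
# Succeed only if the storage class backing a PVC has allowVolumeExpansion set to true.
# $1: PVC name, $2: namespace (defaults to kfuse, as in resize_pvc.sh)
pvc_expandable() {
  # Look up the storage class name recorded on the PVC.
  sc=$(kubectl get pvc "$1" -n "${2:-kfuse}" -o jsonpath='{.spec.storageClassName}') || return 1
  # Storage classes are cluster-scoped, so no namespace is needed here.
  [ "$(kubectl get storageclass "$sc" -o jsonpath='{.allowVolumeExpansion}')" = "true" ]
}
```

If the check fails, the storage class can usually be made resizeable with `kubectl patch storageclass <name> -p '{"allowVolumeExpansion": true}'`, provided the underlying provisioner supports expansion.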
Example: to increase the size of the kafka stateful set PVCs to 100GB in the kfuse namespace:

...

Duplicate logs show up in Kfuse stack

Symptoms

Duplicate logs with the same timestamp and log event appear in the Kfuse stack, but the application logs (either on the host or in the container) show no evidence of duplication. This issue happens only when the agent is Fluent-Bit.

Resolution

If you look at Fluent-Bit logs you’ll notice the following error in the logs:

...

MELT data ingested from Datadog agent is missing the kube_cluster_name label.

Resolution

There is a known issue in Datadog agent cluster name detection that requires the cluster agent to be up. If the agent starts up before the cluster agent, then it fails to detect the cluster name. See https://github.com/DataDog/datadog-agent/issues/24406 .

...

kubectl rollout restart daemonset datadog-agent

Access denied while creating Alert / Contact point

Symptom

A non-admin (SSO) user may get one of the following permission errors when creating an alert or contact point.

{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}

or

{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}

Resolution

Workaround: Log in as admin to create the contact point or alert. SSO users are not granted permission to create contact points or alerts manually.