DR Backup/Recovery

Kloudfuse is installed on a Kubernetes cluster (GKE, EKS, etc.) with a single Helm chart. The data is stored in persistent volumes and backed up to cloud storage (GCS, S3, etc.). Backup and recovery for region/availability zone failures is handled through the respective cloud provider’s Kubernetes cluster backup and recovery feature.

Google Cloud Backup and Recovery for GKE

To handle region/availability zone failures, we recommend the following steps:

  • Create a GKE cluster as per the installation docs in the primary region/availability zone

  • Create another GKE cluster as per the installation docs in the failover region/availability zone

  • For DNS/TLS setup, run the prerequisite steps on both GKE clusters as per the instructions here.

  • Enable GKE backup and recovery for the above clusters as per the GKE instructions

  • Create a GCS bucket with cross-region/availability zone access as needed.

  • Install the Kloudfuse Helm chart in the primary GKE cluster (see the installation instructions).

  • For any secrets and config maps created by the user (e.g. for SSO/SAML setup, TLS certs, etc.), add the following label:

    app.kubernetes.io/instance: kfuse
  • Set up a GKE backup plan based on the RPO and retention you need. The plan automatically creates backups of the Kloudfuse installation to meet the selected RPO. Backups are retained for the given period and deleted afterwards.

    gcloud beta container backup-restore backup-plans create kloudfuse-backup-plan \
      --project=<project> \
      --location=<location of the primary gke cluster> \
      --cluster=<name of the primary gke cluster> \
      --selected-applications=<namespace>/kloudfuse \
      --include-secrets \
      --target-rpo-minutes=60 \
      --backup-retain-days=1

Note: The minimum RPO that can be set with the --target-rpo-minutes option is 60 minutes. If you need an RPO lower than 60 minutes, use the --cron-schedule option instead. The two options, --target-rpo-minutes and --cron-schedule, cannot be used together in one backup plan.

  • To set up the GKE backup plan with --cron-schedule instead, pass a cron expression. For example, "10 3 * * *" creates a backup at 3:10 AM every day. All times are interpreted as UTC.

    gcloud beta container backup-restore backup-plans create kloudfuse-backup-plan \
      --project=<project> \
      --location=<location of the primary gke cluster> \
      --cluster=<name of the primary gke cluster> \
      --selected-applications=<namespace>/kloudfuse \
      --include-secrets \
      --cron-schedule="10 3 * * *" \
      --backup-retain-days=1
  • Set up a GKE restore plan that targets the failover GKE cluster.

  • On failure of the primary region/availability zone (this needs to be detected out of band), the administrator restores Kloudfuse to the failover GKE cluster.

  • If you’re using a regional static IP address for your load balancer, additional steps are required:

    • Update the load balancer IP in the Kloudfuse ingress to the regional static IP for the failover region.

    • Update the DNS record to point to this new static IP.

  • If you’re using a global static IP address for your load balancer, ensure that the failed GKE cluster is fenced off; otherwise, if the primary GKE cluster recovers, it might accidentally bind to the same static IP address again.
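
The labelling, restore-plan, and restore steps above can be sketched as a small dry-run shell script. This is an illustrative sketch, not a definitive procedure: the restore plan name kloudfuse-restore-plan, the helper function names, and all sample values are hypothetical, and the backup plan, cluster, and backup are assumed to be passed as full resource paths. The gcloud flags shown are those of the backup-restore command group as we understand them; confirm them with `gcloud beta container backup-restore restore-plans create --help` before use. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Add the kfuse instance label to a user-created secret or config map.
# $1 = kind (secret|configmap), $2 = resource name, $3 = namespace
label_resource() {
  run kubectl label "$1" "$2" app.kubernetes.io/instance=kfuse \
    --namespace="$3" --overwrite
}

# Create a restore plan that targets the failover cluster.
# $1 = project, $2 = restore plan location, $3 = backup plan resource path,
# $4 = failover cluster resource path, $5 = Kloudfuse namespace
# Assumption: "kloudfuse-restore-plan" is a hypothetical name; the restore
# mode and volume policy are example choices for an empty failover cluster.
create_restore_plan() {
  run gcloud beta container backup-restore restore-plans create kloudfuse-restore-plan \
    --project="$1" \
    --location="$2" \
    --backup-plan="$3" \
    --cluster="$4" \
    --selected-applications="$5/kloudfuse" \
    --namespaced-resource-restore-mode=fail-on-conflict \
    --volume-data-restore-policy=restore-volume-data-from-backup
}

# Trigger a restore from a specific backup after the primary has failed.
# $1 = restore name, $2 = project, $3 = restore plan location,
# $4 = backup resource path
trigger_restore() {
  run gcloud beta container backup-restore restores create "$1" \
    --project="$2" \
    --location="$3" \
    --restore-plan=kloudfuse-restore-plan \
    --backup="$4"
}

# Example invocation with hypothetical values (prints commands only).
label_resource secret sso-saml-secret kfuse
create_restore_plan my-project us-east1 \
  projects/my-project/locations/us-central1/backupPlans/kloudfuse-backup-plan \
  projects/my-project/locations/us-east1/clusters/kfuse-failover \
  kfuse
```

Reviewing the printed commands before running with DRY_RUN=0 is a cheap way to catch a wrong project, location, or cluster path. Which locations are valid for the restore plan relative to the backup plan and the target cluster should be checked against the Backup for GKE documentation.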