GCP Alerting Policy for failed GKE CronJob-CodePudding

What would be the best way to set up a GCP monitoring alert policy for a Kubernetes CronJob failing? I haven't been able to find any good examples out there.

Right now, I have an OK solution based on monitoring logs in the Pod with ERROR severity. I've found this to be quite flaky, however. Sometimes a job will fail for some ephemeral reason outside my control (e.g., an external server returning a temporary 500) and on the next retry, the job runs successfully.

What I really need is an alert that is only triggered when a CronJob is in a persistent failed state. That is, Kubernetes has tried rerunning the whole thing, multiple times, and it's still failing. Ideally, it could also handle situations where the Pod wasn't able to come up either (e.g., downloading the image failed).

Any ideas here?

Thanks.

CodePudding user response：

First of all, confirm the GKE’s version that you are running. For that, the following commands are going to help you to identify the GKE’s default version and the available versions too:

Default version.

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.defaultVersion)"

Available versions.

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.validVersions)"

Now that you know your GKE’s version and based on what you want is an alert that is only triggered when a CronJob is in a persistent failed state, GKE Workload Metrics was the GCP’s solution that used to provide a fully managed and highly configurable solution for sending to Cloud Monitoring all Prometheus-compatible metrics emitted by GKE workloads (such as a CronJob or a Deployment for an application). But, as it is right now deprecated in GKE 1.24 and was replaced with Google Cloud Managed Service for Prometheus, then this last is the best option you’ve got inside of GCP, as it lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.

Plus, you have 2 options from the outside of GCP: Prometheus as well and Ranch’s Prometheus Push Gateway.

Finally and just FYI, it can be done manually by querying for the job and then checking it's start time, and compare that to the current time, this way, with bash:

START_TIME=$(kubectl -n=your-namespace get job your-job-name -o json | jq '.status.startTime')
echo $START_TIME

Or, you are able to get the job’s current status as a JSON blob, as follows:

kubectl -n=your-namespace get job your-job-name -o json | jq '.status'

You can see the following thread for more reference too.