How to monitor different Spark jobs on the same cluster/SparkContext on Databricks?


I want to have a monitoring and alerting system in place (with a tool such as Datadog) that can fetch metrics and logs from my Spark applications on Databricks. The thing is, to avoid spinning up, running, and killing hundreds or even thousands of job clusters every day, it is better to reuse existing clusters for similar data-extraction jobs.

To get Databricks and Spark metrics into Datadog, I have tried the following:

  1. Change the SparkSession.builder.appName within each notebook: doesn't work, since it cannot be changed after the cluster has started; by default it will always be "Databricks Shell".
  2. Set a cluster-wide tag and unset it after the job has ended: can lead to mismatched tags when jobs run concurrently. Also, I didn't find a clear way to "append" a tag there.
  3. Somehow fetch the Databricks Job/Run ID from Datadog: I have no clue how to do this.

It seems to me that this should be feasible, since every Spark job on the same SparkSession is named after my Databricks Job/Run ID. I just have to figure out how to identify it in Datadog.
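
To illustrate what I mean, here is a minimal sketch of tagging Spark jobs from inside each notebook so they carry the job/run ID. The notebook-context lookup through dbutils is an assumption on my part (it relies on internals rather than a documented API), and the "jobId"/"runId" tag keys are likewise assumed:

import json

# dbutils and spark are the globals Databricks provides inside a notebook.
# When the notebook runs as a job, the context JSON usually carries the IDs.
ctx = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
job_id = ctx.get("tags", {}).get("jobId", "interactive")   # assumed key
run_id = ctx.get("tags", {}).get("runId", "interactive")   # assumed key

# Standard PySpark API: everything submitted after this call is grouped
# under this ID in the Spark UI and in Spark's job-level metrics.
spark.sparkContext.setJobGroup(f"job-{job_id}-run-{run_id}",
                               f"Databricks job {job_id}, run {run_id}")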

Thoughts? Anything silly I might be missing to achieve this?

CodePudding user response:

There are several points here:

  • When you use existing interactive clusters to run jobs, you incur higher costs: automated (job) clusters cost 15 cents/DBU vs. 56 cents/DBU for interactive clusters
  • When you run jobs with different libraries, etc., you may end up with library conflicts
  • You can't change tags on existing clusters
  • Concurrent jobs may affect each other's performance

So I would really recommend using separate automated clusters. If you want to reuse nodes and have shorter startup times, you can use instance pools.
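
As a rough sketch (instance URL, token, pool ID, and notebook path are placeholders, assuming Jobs API 2.0 and an already-created pool), a job definition can point its new_cluster at an instance pool like this:

import requests

databricks_instance = "<databricks-instance>"
headers = {
    "Authorization": "Bearer <databricks-access-token>",
    "Content-Type": "application/json",
}

# Automated (job) cluster that draws its nodes from an existing instance pool,
# so most of the VM provisioning time is skipped at startup.
job_spec = {
    "name": "data-extraction-example",
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",
        "instance_pool_id": "<instance-pool-id>",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/path/to/extraction-notebook"},
}

response = requests.post(f"{databricks_instance}/api/2.0/jobs/create",
                         headers=headers, json=job_spec)
print(response.json())  # contains the new job_id on success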

If you want to monitor resource usage, etc., I would recommend looking into the Overwatch project, which can collect data from different sources (cluster logs, APIs, and so on) and build a unified view of performance, costs, etc. One of its advantages is that you can attribute costs, resource load, and so on down to individual users, notebooks, and Spark jobs. It's not a "classical" real-time monitoring tool, but it is already used by many customers.

CodePudding user response:

I'm not sure I fully understand your use case, but you can use simple Python code to get the job ID via the REST API.

import requests

# Workspace URL and a personal access token with permission to read job runs
databricks_instance = "<databricks-instance>"

# Fetch the details of a single run (run_id 39347 is just an example)
url = f"{databricks_instance}/api/2.0/jobs/runs/get?run_id=39347"

headers = {
  'Authorization': 'Bearer <databricks-access-token>',
  'Content-Type': 'application/json'
}

response = requests.get(url, headers=headers).json()
print(response)
print(response['job_id'])
print(response['start_time'])  # epoch milliseconds
print(response['end_time'])    # epoch milliseconds
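
If the end goal is to get these values into Datadog, one option is to push them as a tagged metric with the datadog Python client. This is only a sketch, assuming the package is installed and valid API/app keys are available; the metric name is made up:

from datadog import initialize, api

initialize(api_key="<datadog-api-key>", app_key="<datadog-app-key>")

# start_time/end_time from the Jobs API are epoch milliseconds.
duration_seconds = (response["end_time"] - response["start_time"]) / 1000.0

api.Metric.send(
    metric="databricks.job.run_duration_seconds",   # made-up metric name
    points=duration_seconds,
    tags=[f"job_id:{response['job_id']}", "run_id:39347"],
)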