Log worker ID when using PySpark UDF


I have a compute-heavy Python function that I wrap into a PySpark UDF and run on about 100 rows of data. Looking at CPU utilization, it seems that some worker nodes are not utilized at all. I realize this could have a multitude of causes, and I am trying to debug it.

Inside the UDF, I am already logging various statistics (e.g. the start and finish time of each UDF execution). Is there any way to log the worker node ID as well? The intention is to verify that the tasks are evenly distributed across all worker nodes.

The worker's IP address, or any other unique identifier that I can log inside the UDF, would work as well.

CodePudding user response:

The following works:

import socket

def my_udf_func(params):
    # your code here
    host = socket.gethostname()  # host name of the worker node running this UDF call

You can then either return host as part of the return value (e.g. in a dict) or write it to your logs. The host name provided by Databricks combines the cluster name and the IP address of the worker node, for example:

0927-152944-dorky406-10-20-136-4

10-20-136-4 in this case is the worker node's IP address (10.20.136.4).

socket.getsockname() seems to be inconsistent - I would not recommend using it.
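For reference, here is a minimal, self-contained sketch of that approach, assuming a plain SparkSession; heavy_compute(), my_udf_func and the column names are illustrative placeholders, not from the original post:

import socket

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("udf-host-logging").getOrCreate()

def heavy_compute(x):
    # stand-in for the actual compute-heavy work
    return str(x * x)

@udf(returnType=StructType([
    StructField("result", StringType()),
    StructField("host", StringType()),
]))
def my_udf_func(x):
    host = socket.gethostname()  # host name of the worker executing this row
    return {"result": heavy_compute(x), "host": host}

df = spark.range(100).withColumn("out", my_udf_func("id"))

# Count how many rows each worker handled to check how evenly the work is spread.
df.select("out.host").groupBy("host").count().show(truncate=False)

The final groupBy gives a per-worker row count, which directly answers the question of whether the rows are being distributed evenly across worker nodes.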
