I have a compute-heavy Python function that I wrap into a PySpark UDF and run on about 100 rows of data. Looking at CPU utilization, it seems that some worker nodes are not utilized at all. I realize this could have a multitude of reasons, and I am trying to debug it.
Inside the UDF, I am already logging various statistics (e.g. start and finish time of each UDF execution). Is there any way to log the worker node ID as well? The intention is to make sure that the work is evenly distributed across all worker nodes.
I guess the IP of the worker, or any other unique identifier that I can log inside the UDF, would work as well.
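For reference, the logging I have now looks roughly like this (heavy_computation and timed_udf_func are placeholder names, the real work is more involved):

import logging
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

logger = logging.getLogger(__name__)

def heavy_computation(x):
    # stand-in for the actual compute-heavy work
    time.sleep(1)
    return float(x) ** 2

def timed_udf_func(x):
    # log start and finish time of each UDF execution
    start = time.time()
    result = heavy_computation(x)
    logger.info("udf call: start=%.3f finish=%.3f", start, time.time())
    return result

timed_udf = udf(timed_udf_func, DoubleType())

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("result", timed_udf("id"))
df.collect()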
CodePudding user response:
The following works:
import socket

def my_udf_func(params):
    # your code here
    host = socket.gethostname()
You can then either return host as part of the UDF's return value (e.g. in a dict) or write it to your logs. The host name provided by Databricks is the cluster name followed by the IP address of the worker node, for example:
0927-152944-dorky406-10-20-136-4
where 10-20-136-4 is the IP address.
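As a rough sketch of the "return it in the result" option - this assumes a struct return type and a toy squaring computation, so adapt the names and types to your actual UDF:

import socket

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

result_schema = StructType([
    StructField("value", DoubleType()),
    StructField("host", StringType()),
])

def my_udf_func(x):
    # your compute-heavy work goes here
    value = float(x) ** 2
    # record which worker executed this call
    return {"value": value, "host": socket.gethostname()}

my_udf = udf(my_udf_func, result_schema)

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("out", my_udf("id"))

# count how many rows each worker processed
df.groupBy("out.host").count().show(truncate=False)

Grouping by the returned host column gives you a quick picture of how the rows were spread over the workers.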
socket.getsockname() seems to be inconsistent - I would not recommend using it.
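If you want more than the hostname, pyspark.TaskContext is another option: inside a task you can read the partition and task attempt IDs, which is useful here because with only ~100 rows the data may end up in fewer partitions than you have workers. A rough sketch (again with a placeholder computation):

import logging
import socket

from pyspark import TaskContext

logger = logging.getLogger(__name__)

def my_udf_func(x):
    # TaskContext.get() returns None on the driver and a context object inside executor tasks
    tc = TaskContext.get()
    logger.info(
        "host=%s partition=%s task_attempt=%s",
        socket.gethostname(),
        tc.partitionId() if tc else "n/a",
        tc.taskAttemptId() if tc else "n/a",
    )
    # your code here
    return float(x) ** 2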