I came across the lambda line below in a long Python Jupyter notebook using PySpark, and I am trying to understand it. Can you explain what it does?
parse = udf(lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z', StringType())
CodePudding user response:
udf(
    lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z',
    StringType()
)
udf
in PySpark wraps a Python function so that it is run for every row of a Spark DataFrame. From the documentation:
Creates a user defined function (UDF).
New in version 1.3.0.
Parameters:
- f : function
  python function if used as a standalone function
- returnType : pyspark.sql.types.DataType or str
  the return type of the user-defined function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
The returnType
here is StringType(), so the UDF returns a string. Removing it, we get the function body we're interested in:
lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
In order to find out what the given lambda function does, you can create a regular function from it. You may need to add imports too.
import datetime
from datetime import timedelta
def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
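As a side note, the [:-3] slice relies on isoformat() including a six-digit microsecond field, which it does whenever the microsecond component is non-zero; dropping the last three digits leaves millisecond precision. A small deterministic check (using a fixed datetime instead of utcnow(), so the output is reproducible):

```python
import datetime

# A fixed timestamp so the result is reproducible
fixed = datetime.datetime(2022, 6, 17, 4, 16, 36, 212566)

iso = fixed.isoformat()      # '2022-06-17T04:16:36.212566'
millis = iso[:-3] + 'Z'      # trim microseconds to milliseconds, append UTC marker
print(millis)                # 2022-06-17T04:16:36.212Z
```

Be aware of one edge case: if the microsecond component happens to be exactly zero, isoformat() omits the fractional part entirely, and [:-3] would then cut into the seconds.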
To really see what's going on, you can create a variable for every element and print them.
import datetime
from datetime import timedelta
def my_func(x):
    v1 = datetime.datetime.utcnow()
    v2 = timedelta(hours=x)
    v3 = v1 - v2
    v4 = v3.isoformat()
    v5 = v4[:-3]
    v6 = v5 + 'Z'
    for e in (v1, v2, v3, v4, v5):
        print(e)
    return v6
print(my_func(3))
# 2022-06-17 07:16:36.212566
# 3:00:00
# 2022-06-17 04:16:36.212566
# 2022-06-17T04:16:36.212566
# 2022-06-17T04:16:36.212
# 2022-06-17T04:16:36.212Z
This way you see how the result changes after every step. You can print whatever you want at any step you need, e.g. print(type(v4)).
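If you want to go one step further and make the logic unit-testable, you can factor the current time out as a parameter, since utcnow() is the only non-deterministic part. The function name to_millis_z below is just an illustration, not something from the notebook:

```python
import datetime
from datetime import timedelta

def to_millis_z(hours_back, now=None):
    """Return (now - hours_back hours) as an ISO-8601 string with
    millisecond precision and a trailing 'Z' (UTC marker)."""
    if now is None:
        now = datetime.datetime.utcnow()
    ts = now - timedelta(hours=hours_back)
    return ts.isoformat()[:-3] + 'Z'

# With an injected "now" the result is deterministic:
fixed_now = datetime.datetime(2022, 6, 17, 7, 16, 36, 212566)
print(to_millis_z(3, fixed_now))  # 2022-06-17T04:16:36.212Z
```

This is exactly what the original lambda computes, just written so you can check the output without depending on the wall clock.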