I have the following function:
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType, ArrayType
def f(row):
    ...
    <compute my_field>
    print(f'my_field: {my_field}; type(my_field): {type(my_field)}')
    return str(my_field), StringType()
f_udf = udf(f)
new_df = df.withColumn('new_field', f_udf(struct([df[column] for column in df.columns if column != 'reserved'])))
Here's a sample of what gets printed out -
my_field: erfSSSWqd; type(my_field): <class 'str'>
and here is new_df
+--------------+----------------------------+
|field         |new_field                   |
+--------------+----------------------------+
|WERWERV511    |[Ljava.lang.Object;@280692a3|
|WEQMNHV381    |[Ljava.lang.Object;@3ee30d9c|
|FSLQCXV881    |[Ljava.lang.Object;@16cbf3a9|
|SDTEHLV980    |[Ljava.lang.Object;@54e6686 |
|SDFWERV321    |[Ljava.lang.Object;@72377b29|
+--------------+----------------------------+
But I would expect strings in the new_field column.
It looks like the types are all right; in fact, I don't even need to wrap my_field with str(), but I did that just in case.
Does anybody know how to fix this?
CodePudding user response:
Instead of returning the tuple str(my_field), StringType(), return only the value str(my_field). Because udf(f) defaults the return type to string, the returned tuple most likely arrives on the JVM side as an object array, and that array's toString() is what shows up as [Ljava.lang.Object;@... in your column.
Moreover, you can specify the return type of your UDF explicitly as the second parameter:
f_udf = udf(f, StringType())
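Putting both changes together, a minimal sketch of your snippet would look like this (the <compute my_field> body stays as a placeholder from your question):

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType

def f(row):
    <compute my_field>
    return str(my_field)  # return only the value, not a (value, type) tuple

# declare the return type explicitly as the second argument
f_udf = udf(f, StringType())

new_df = df.withColumn(
    'new_field',
    f_udf(struct([df[column] for column in df.columns if column != 'reserved']))
)

After that, new_df.select('new_field').show(truncate=False) should display the actual string values instead of the Java object references.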
Let me know if this works for you.