I have the following table
id | country_mapping
--------------------
1 | {"GBR/bla": 1,
"USA/bla": 2}
And I want to create a column that contains the following:
id | source_countries
--------------------
1 | ["GBR", "USA"]
And I need this to be done via a pandas UDF, so I created the following:
import pyspark.sql.functions as F
@F.pandas_udf("string")
def func(s):
    return s.apply(lambda x: [y.split("/")[0] for y in x])
I thought this would work, because if I run this code in pure pandas it gives what I need:
import pandas as pd
s = pd.Series([["GBR/1", "USA/2"], ["ITA/1", "FRA/2"]])
s.apply(lambda x: [y.split("/")[0] for y in x])
gives
Out[1]:
0    [GBR, USA]
1    [ITA, FRA]
dtype: object
But when I run
df.withColumn('source_countries',
              func(F.map_keys(F.col("country_mapping")))).collect()
it fails with the following error:
PythonException: An exception was thrown from a UDF: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object'
I'm confused as to why, and how to fix my pandas UDF.
CodePudding user response:
Instead of pandas_udf, you can just use udf in a similar way:
from pyspark.sql import functions as F
from pyspark.sql import types as T
def func(v):
    return [x.split('/')[0] for x in v]
(df
.withColumn('source_countries', F.udf(func, T.ArrayType(T.StringType()))(F.map_keys(F.col('country_mapping'))))
.show(10, False)
)
# +---+----------------------------+----------------+
# |id |country_mapping             |source_countries|
# +---+----------------------------+----------------+
# |1  |{USA/bla -> 2, GBR/bla -> 1}|[USA, GBR]      |
# +---+----------------------------+----------------+
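The plain udf also avoids the Arrow serialization path entirely, so the declared ArrayType(StringType()) only needs to match the lists the Python function returns. If you would rather keep a pandas_udf, here is a minimal sketch of the same logic with a matching return type, assuming Spark 3.x with pyarrow installed (extract_countries is just an illustrative name):

import pandas as pd
import pyspark.sql.functions as F

# Sketch: declare the type that is actually returned (an array of strings),
# rather than "string" as in the original UDF.
@F.pandas_udf("array<string>")
def extract_countries(s: pd.Series) -> pd.Series:
    return s.apply(lambda keys: [k.split("/")[0] for k in keys])

df.withColumn("source_countries",
              extract_countries(F.map_keys("country_mapping"))).show(truncate=False)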
CodePudding user response:
The answer to this question is that, currently, all Spark SQL data types are supported by Arrow-based conversion except MapType, ArrayType of TimestampType, and nested StructType.
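Another way to sidestep the Arrow conversion is to drop the Python UDF altogether and do the key extraction with built-in higher-order functions. A minimal sketch, assuming Spark 3.1+ where transform is available as a DataFrame function (on 2.4/3.0 the same expression can be written with F.expr):

from pyspark.sql import functions as F

# Pure Spark SQL approach: no Python UDF, hence no Arrow conversion involved.
df.withColumn(
    "source_countries",
    F.transform(
        F.map_keys("country_mapping"),          # array of keys like "GBR/bla"
        lambda k: F.split(k, "/").getItem(0),   # keep the part before the first "/"
    ),
).show(truncate=False)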