How to coalesce every element of a join in PySpark

I have an array of join arguments (columns):

attrs = ['surname', 'name', 'patronymic', 'birth_date',
         'doc_type', 'doc_series', 'doc_number']

I'm trying to join two tables like this, but I need to coalesce each join column for the join to behave correctly (rows won't match if the join columns contain nulls):

new_df = pre_df.join(res_df, attrs, how='leftanti')

I've tried listing every condition manually, but is there another way to do this?

CodePudding user response:

If you are trying to combine two datasets that share the same columns, you don't need a join but a union. Try df = df.unionByName(df2).
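A minimal runnable sketch of that suggestion (the toy data and column names are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frames with the same columns in different order
df1 = spark.createDataFrame([("Ivanov", "Ivan")], ["surname", "name"])
df2 = spark.createDataFrame([("Petr", "Petrov")], ["name", "surname"])

# unionByName matches columns by name rather than position,
# so the differing column order is handled correctly
df = df1.unionByName(df2)
df.show()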

CodePudding user response:

So I've figured this out:

join_attrs = [F.coalesce(pre_df[elem], F.lit('')) == F.coalesce(res_df[elem], F.lit('')) for elem in attrs]
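For context, here is a minimal end-to-end sketch of how this plugs into the anti-join (the toy data and the shortened attribute list are assumptions):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

attrs = ['surname', 'name']  # shortened list for the sketch

pre_df = spark.createDataFrame(
    [("Ivanov", None), ("Petrov", "Petr")],
    "surname string, name string")
res_df = spark.createDataFrame(
    [("Ivanov", None)],
    "surname string, name string")

# Coalesce each side to '' so nulls compare equal instead of
# failing the equality test; join() ANDs the list of conditions
join_attrs = [F.coalesce(pre_df[elem], F.lit('')) ==
              F.coalesce(res_df[elem], F.lit('')) for elem in attrs]

# Left anti join: keep only pre_df rows with no match in res_df
new_df = pre_df.join(res_df, join_attrs, how='leftanti')
new_df.show()  # only the ("Petrov", "Petr") row remains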

The null-safe equality operator also works, though I'm not sure which is faster:

join_attrs = [pre_df[elem].eqNullSafe(res_df[elem]) for elem in attrs]
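Note that the two conditions are not strictly equivalent: eqNullSafe treats NULL = NULL as true while still keeping NULL distinct from every real value, whereas the coalesce trick maps NULL to '' and therefore conflates a null with a genuine empty string. If empty strings can occur in the data, eqNullSafe is the safer choice.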