I have a big dataset with many columns:
df =
my_id  attr_1  attr_2  ...  attr_n
13900  null    USA     ...  384.24
13900  null    UK      ...  399.24
13999  3467    USA     ...  314.25
13911  3556    CND     ...  386.77
13922  5785    USA     ...  684.21
I also have a smaller dataframe whose first column is null:
df_2 =
col_1  col_2
null   13900
null   13999
null   34002
I want to add a new column to df that indicates whether the respective my_id is present in df_2:
my_id  attr_1  attr_2  ...  attr_n  check
13900  null    USA     ...  384.24  yes
13900  null    UK      ...  399.24  yes
13999  3467    USA     ...  314.25  yes
13911  3556    CND     ...  386.77  no
13922  5785    USA     ...  684.21  no
I was thinking of left joining df_2 to df, creating a column that is yes when col_2 is populated and no when it isn't, and then dropping col_2, but is there a more elegant way?
CodePudding user response:
Your reasoning is correct: you can do a left join and then, using the conditional function when, derive the check column based on the left-joined column. A sample could look something like this:
from pyspark.sql.functions import col, when, lit

# 1. Do a left join, matching my_id in df against col_2 in df_2
df_3 = df.join(df_2, col("my_id") == col("col_2"), how="left")

# 2. Derive the `check` column from the left-joined col_2
df_3 = df_3.withColumn("check", when(col("col_2").isNotNull(), lit("yes")).otherwise(lit("no")))
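To match the desired output exactly, you can then drop the columns that came from df_2, as you already suggested in the question. A minimal sketch, continuing from the df_3 above:

# 3. Drop the helper columns from df_2 so only the new `check` column remains
df_3 = df_3.drop("col_1", "col_2")
df_3.show()

df_3.show() should then print the same rows as the expected output in the question, with check set to yes for my_id values 13900 and 13999 and no for the others.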