I have a big dataset with many columns:
df =
my_id  attr_1  attr_2  ...  attr_n
13900  null    USA     ...  384.24
13900  null    UK      ...  399.24
13999  3467    USA     ...  314.25
13911  3556    CND     ...  386.77
13922  5785    USA     ...  684.21
I also have a smaller dataframe whose first column is null:
df_2 =
col_1  col_2
null   13900
null   13999
null   34002
I want to add a new column to df that indicates whether the respective my_id is present in df_2:
my_id  attr_1  attr_2  ...  attr_n  check
13900  null    USA     ...  384.24  yes
13900  null    UK      ...  399.24  yes
13999  3467    USA     ...  314.25  yes
13911  3556    CND     ...  386.77  no
13922  5785    USA     ...  684.21  no
I was thinking of left joining df_2 to df, creating a column that is yes when col_2 is populated and no when it isn't, and then dropping col_2, but is there a more elegant way?
CodePudding user response:
Your reasoning is correct: you can do a left join and then, using the conditional function when, derive the check column based on the left-joined column. A sample could look something like this:
from pyspark.sql.functions import col, when, lit

# 1. Do a left join, matching my_id in df against col_2 in df_2
df_3 = df.join(df_2, col("my_id") == col("col_2"), how="left")

# 2. Derive the `check` column from the left-joined col_2
df_3 = df_3.withColumn("check", when(col("col_2").isNotNull(), lit("yes")).otherwise(lit("no")))
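To match the desired output exactly, you can then drop the columns that came from df_2, as you already suggested in the question. A minimal sketch, continuing from the df_3 above:

# 3. Drop the helper columns from df_2 so only the new `check` column remains
df_3 = df_3.drop("col_1", "col_2")
df_3.show()

df_3.show() should then print the same rows as the expected output in the question, with check set to yes for my_id values 13900 and 13999 and no for the others.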