I have a dataset
col_id col_2 col_3 col_id_b
ABC111 shfhs 34775 null
ABC112 shfhe 34775 DEF345
ABC112 shfhs 34775 GFR563
ABC112 shfgh 34756 TRS572
ABC113 shfdh 34795 null
ABC114 shfhs 34770 null
I am trying to create a new column that is identical to col_id_b
, except that the nulls take the value of the corresponding col_id
from that row. So:
col_id col_2 col_3 col_id_b col_new
ABC111 shfhs 34775 null ABC111
ABC112 shfhe 34775 DEF345 DEF345
ABC112 shfhs 34775 GFR563 GFR563
ABC112 shfgh 34756 TRS572 TRS572
ABC113 shfdh 34795 null ABC113
ABC114 shfhs 34770 null ABC114
I know about:
df.select(coalesce(df["col_id"], df["col_id_b"])).show()
But in my case there are my rows where both are not-null. How do I introduce this condition?
CodePudding user response:
Just invert the order of the columns:
df.select(coalesce(col('col_id_b'), col('col_id')))
coalesce
returns the first column that is not null; so if you specify col_id_b
first, it this is not null, you will have col_id_b
, otherwise col_id
.