PySpark - change dataframe column value based on its existence in other dataframe

I am relatively new to Spark and haven't found a solution for this. I found several similar questions, but couldn't figure out how to apply them to my use case.

I have two dataframes. The first is based on a CSV which looks like this (displayed as a table):

id    license_no
2005  1011
2006  1022
2007  3911

The second dataframe is based on a CSV which looks like this:

license_no  active
1011        y
1022        y
3911        n

I need to check whether the license_no value exists in the second dataframe. If it exists and active=y, then I need to add a prefix (99 at the beginning) to both id and license_no in the first dataframe - for example, license_no 1011 exists in the second dataframe and is active, so its id in the first dataframe should be changed to 992005 and its license_no to 991011. Otherwise (the value doesn't exist, or it isn't active), 88 should be added.

The dataframe should look like this after the transformation:

id      license_no
992005  991011
992006  991022
882007  883911

I am not sure what the best solution for this is. Can I do this transformation directly in one Spark command?

CodePudding user response:

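For reference, a minimal setup sketch, assuming the two CSVs are read into df and df1 with a header row (the file names below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths: the first CSV holds id/license_no, the second holds license_no/active
df = spark.read.csv('ids.csv', header=True, inferSchema=True)
df1 = spark.read.csv('licenses.csv', header=True, inferSchema=True)
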
from pyspark.sql.functions import when, col, concat, lit

# Join the two dataframes on license_no (left join keeps every row of df)
s = df.join(df1, how='left', on='license_no')

# Conditionally concat the prefix using a list comprehension over the columns:
# active == 'y' gets the '99' prefix, everything else gets '88'
s.select(*[when(col('active') == 'y', concat(lit('99'), col(x).cast('string')))
           .otherwise(concat(lit('88'), col(x).cast('string'))).alias(x)
           for x in s.columns if x != 'active']).show()


+----------+------+
|license_no|    id|
+----------+------+
|    991011|992005|
|    991022|992006|
|    883911|882007|
+----------+------+
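Note that because this is a left join, a license_no that is missing from the second dataframe ends up with active as null, so it falls into the otherwise branch and gets the 88 prefix, just like rows where active = 'n'.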
