PySpark - change dataframe column value based on its existence in other dataframe

I am relatively new to Spark and haven't found a solution for this. I found several similar questions, but couldn't figure out how to apply them to my use case.

I have two dataframes. The first is based on a CSV which looks like this (displayed as a table):

id    license_no
2005  1011
2006  1022
2007  3911

The second dataframe is based on a CSV which looks like this:

license_no  active
1011        y
1022        y
3911        n

I need to check whether the license_no value exists in the second dataframe. If it exists and active=y, then I need to add a prefix (99 at the beginning) to both id and license_no in the first dataframe - for example, license_no 1011 exists in the second dataframe and is active, so its id in the first dataframe should be changed to 992005 and its license_no to 991011. Otherwise (the value doesn't exist, or it isn't active), 88 should be added.

The dataframe should look like this after the transformation:

id      license_no
992005  991011
992006  991022
882007  883911

I am not sure what the best solution for this is. Can I do this transformation directly in one Spark command?

CodePudding user response:

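For reference, a minimal setup sketch, assuming the two CSVs are read into df and df1 with a header row (the file names below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths: the first CSV holds id/license_no, the second holds license_no/active
df = spark.read.csv('ids.csv', header=True, inferSchema=True)
df1 = spark.read.csv('licenses.csv', header=True, inferSchema=True)
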
from pyspark.sql.functions import when, col, concat, lit

# Join the two dataframes on license_no (left join keeps every row of df)
s = df.join(df1, how='left', on='license_no')

# Conditionally concat the prefix using a list comprehension over the columns:
# active == 'y' gets the '99' prefix, everything else gets '88'
s.select(*[when(col('active') == 'y', concat(lit('99'), col(x).cast('string')))
           .otherwise(concat(lit('88'), col(x).cast('string'))).alias(x)
           for x in s.columns if x != 'active']).show()


+----------+------+
|license_no|    id|
+----------+------+
|    991011|992005|
|    991022|992006|
|    883911|882007|
+----------+------+
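Note that because this is a left join, a license_no that is missing from the second dataframe ends up with active as null, so it falls into the otherwise branch and gets the 88 prefix, just like rows where active = 'n'.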
