I am relatively new in using Spark and didn't find out solution for this. I found several similar questions related to this but didn't find the way how to bundle this in my use-case.
I have two dataframes, first is based on CSV which looks like this (displayed as table):
id | license_no |
---|---|
2005 | 1011 |
2006 | 1022 |
2007 | 3911 |
Second dataframe is based on CSV which looks like this:
license_no | active |
---|---|
1011 | y |
1022 | y |
3911 | n |
I need to check does license_no value exists in second dataframe, and if it exists and active=y, then I need to add some prefix (99 in the beginning) to its id and license_no in first dataframe - for example, license_id exists in second dataframe and it is active, so its ID in first dataframe should be changed to 992005 and license_no do 991011. If doesn't exits 88 should be added.
Dataframe should look like this after transformations:
id | license_no |
---|---|
992005 | 991011 |
992006 | 991022 |
882007 | 883911 |
I am not sure what is best solution for this, can I directly do this transformation in one spark command?
CodePudding user response:
#Join
s = df.join(df1,how='left',on='license_no')
#Contitionally concat prefix using list squares
s.select(*[when(col('active')=='y',concat(lit('99'),str(x))).otherwise(concat(lit('88'),str(x))).alias(x) for x in s.columns if x!='active']).show()
---------- ------
|license_no| id|
---------- ------
| 991011|992005|
| 991022|992006|
| 883911|882007|
---------- ------