Home > database >  Pyspark: match columns from two different dataframes and add value
Pyspark: match columns from two different dataframes and add value

Time:11-19

I am trying to compare the values of two columns that exist in different dataframes to create a new dataframe based on the matching of the criteria:

df1=

| id |
| -- |
| 1  |
| 2  |
| 3  |
| 4  | 
| 5  |

df2 =

| id |
| -- |
| 2  |
| 5  |
| 1  |

So, I want to add an 'x' in the is_used field when the field of df2 exists in the field of df1, else add 'NA', to generate a result dataframe like this:

df3 =

| id | is_used |
| -- | ------- |
| 1  |    X    |
| 2  |    X    |
| 3  |    NA   |
| 4  |    NA   |
| 5  |    X    |

I have tried this way, but the selection criteria places an "X" in all columns:

df3 = df3.withColumn('is_used', F.when(
    condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
    value = 'NA'
).otherwise('X'))

I would appreciate any help

CodePudding user response:

Try with fullouter join:

df3 = (
    df1.join(df2.alias("df2"), df1.id == df2.id, "fullouter")
    .withColumn(
        "is_used",
        F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
    )
    .drop(F.col("df2.id"))
    .orderBy(F.col("id"))
)

Result:

 --- -------                                                                    
|id |is_used|
 --- ------- 
|1  |X      |
|2  |X      |
|3  |NA     |
|4  |NA     |
|5  |X      |
 --- ------- 

CodePudding user response:

Try the following code, it would give you a similar result and you can make the rest of the changes:


df3 = df1.alias("df1").\
    join(df2.alias("df2"), (df1.id==df2.id), how='left').\
    withColumn('is_true', F.when(df1.id == df2.id,F.lit("X")).otherwise(F.lit("NA"))).\ 
    select("df1.*","is_true")

df3.show()

  • Related