How to compare two dataframes and extract unmatched rows in pyspark?

Time:05-10

Hi, I have two dataframes: a parent dataframe and an incremental dataframe. I want to extract the records that are present in the incremental dataframe but not in the parent dataframe, based on the key column.

Example:

Key Column : call_id

parent_dataframe:

call_id    call_nm    src
100        QC         Darzalex MM
105        XY         INVOKANA
107        CZ         Simponi RA
117        NM         Guselkumab PSA
118        YC         STELARA
126        RF         INVOKANA

Incremental Dataframe:

call_id    call_nm    src
118        YC         STELARA
126        RF         INVOKANA
131        VG         STELARA
135        IJ         Stelara CD

Unmatched Dataframe:

call_id    call_nm    src
131        VG         STELARA
135        IJ         Stelara CD

CodePudding user response:

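Use a left_anti join with the incremental dataframe on the left. A left_anti join checks whether each left-hand row's join key exists in the right-hand dataframe and keeps only the left-hand rows that have no match. Only the left-hand dataframe's columns are returned.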

Incremental.join(parent_dataframe, on='call_id', how='left_anti').show()

+-------+-------+----------+
|call_id|call_nm|       src|
+-------+-------+----------+
|    135|     IJ|Stelara CD|
|    131|     VG|   STELARA|
+-------+-------+----------+
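
To make the left_anti semantics concrete, here is a minimal plain-Python sketch (no Spark needed) of what the join computes on the sample data from the question, keyed on call_id. The helper name left_anti and the list-of-dicts representation are assumptions made for illustration, not part of the PySpark API. The sketch also simplifies: it ignores distribution and Spark's handling of null keys.

```python
# Plain-Python illustration of left_anti semantics; not the Spark implementation.
# Rows are dicts built from the sample tables in the question.
parent_rows = [
    {"call_id": 100, "call_nm": "QC", "src": "Darzalex MM"},
    {"call_id": 105, "call_nm": "XY", "src": "INVOKANA"},
    {"call_id": 107, "call_nm": "CZ", "src": "Simponi RA"},
    {"call_id": 117, "call_nm": "NM", "src": "Guselkumab PSA"},
    {"call_id": 118, "call_nm": "YC", "src": "STELARA"},
    {"call_id": 126, "call_nm": "RF", "src": "INVOKANA"},
]
incremental_rows = [
    {"call_id": 118, "call_nm": "YC", "src": "STELARA"},
    {"call_id": 126, "call_nm": "RF", "src": "INVOKANA"},
    {"call_id": 131, "call_nm": "VG", "src": "STELARA"},
    {"call_id": 135, "call_nm": "IJ", "src": "Stelara CD"},
]


def left_anti(left, right, key):
    """Hypothetical helper: rows of `left` whose `key` value is absent from `right`."""
    right_keys = {row[key] for row in right}  # keys present on the right side
    return [row for row in left if row[key] not in right_keys]


unmatched = left_anti(incremental_rows, parent_rows, key="call_id")
for row in unmatched:
    print(row["call_id"], row["call_nm"], row["src"])
# 131 VG STELARA
# 135 IJ Stelara CD
```

The order of the arguments matters: putting the parent dataframe on the left would instead return the parent rows that are missing from the incremental dataframe (call_id 100, 105, 107 and 117 here).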