Comparative dataframe in Pyspark

Time:05-25

I have two dataframes, one coming from a database with the following fields:

name id
bakarery 010203040000150
store 010203040000160
market 010203040000180
hospital 010203040000190
bakery 010203040000200

And a second dataframe that I need to compare it against in order to update the IDs:

name id
bakarery 1020304050
store 010203040000160
market 010203040000180
hospital 3040506070
bakery 010203040000200

I need to create a third dataframe containing only the rows whose IDs need updating: matching on name, a row should be included if its ID in the second dataframe differs from its ID in the first.

How can I do this?

Expected output:

name id
bakarery 1020304050
hospital 3040506070

CodePudding user response:

Assuming the first dataframe is df1 and the second is df2:

df2.join(df1, on="name").where(df1["id"] != df2["id"]).show()

+--------+----------+---------------+
|    name|        id|             id|
+--------+----------+---------------+
|bakarery|1020304050|010203040000150|
|hospital|3040506070|010203040000190|
+--------+----------+---------------+

Alternatively, since subtract returns the rows of df2 that do not appear in df1:

df2.subtract(df1).show()
+--------+----------+
|    name|        id|
+--------+----------+
|bakarery|1020304050|
|hospital|3040506070|
+--------+----------+

CodePudding user response:

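import pandas as pd

# One column per name, one row holding the IDs.
d = {'bakery': '010203040000150', 'store': '010203040000160'}
df1 = pd.DataFrame(data=d, index=[0])
d1 = {'bakery': '1020304050', 'store': '010203040000160'}
df2 = pd.DataFrame(data=d1, index=[0])

# Boolean frame: True where the two frames hold the same ID.
df3 = df1 == df2
# Blank out the cells that differ, then fill them back in from df2,
# so df4 ends up holding df2's IDs.
df4 = df2.mask(~df3).fillna(df2)
df4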

       bakery            store
0  1020304050  010203040000160

The code above was only run on a small sample, but it should do the job.
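For data shaped like the tables in the question (one row per name), a merge on name may be closer to what is being asked. This is a rough sketch only, assuming both frames fit in memory as pandas dataframes; the suffixes "_new" and "_old" and the variable names are illustrative.

```python
import pandas as pd

# df1: the data coming from the database.
df1 = pd.DataFrame({
    "name": ["bakarery", "store", "market", "hospital", "bakery"],
    "id": ["010203040000150", "010203040000160", "010203040000180",
           "010203040000190", "010203040000200"],
})

# df2: the data holding the updated IDs.
df2 = pd.DataFrame({
    "name": ["bakarery", "store", "market", "hospital", "bakery"],
    "id": ["1020304050", "010203040000160", "010203040000180",
           "3040506070", "010203040000200"],
})

# Inner merge on name; the suffixes tell the two id columns apart.
merged = df2.merge(df1, on="name", suffixes=("_new", "_old"))

# Keep only the rows whose ID changed, and only the new ID.
changed = (
    merged.loc[merged["id_new"] != merged["id_old"], ["name", "id_new"]]
    .rename(columns={"id_new": "id"})
    .reset_index(drop=True)
)
```

Here changed holds only the name and new id of the rows whose IDs differ between the two frames.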
