Home > Enterprise >  Compare 2 csv in pandas upon columns
Compare 2 csv in pandas upon columns

Time:07-21

I am using this code of pandas to get difference of 2 csv. I want to find all rows in 1.csv where userdid is same both csv and did is different.

How can i achieve this?

>>> import pandas as pd
>>> df1 = pd.read_csv('1.csv')
>>> df2 = pd.read_csv('3.csv')
>>> df1.apply(tuple, 1).isin(df2.apply(tuple, 1))

expected

R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"

1.csv

did,userdid,ResumeStatus,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"

3.csv

did,userdid,email,status,modified,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,[email protected],536870912,"2022-05-02 21:50:15.813","2022-05-02 21:50:15.907"
RD71J16YLWTXRDRV1YG,U3Y5VV6BP3SQQ5B8VHV,j.com,536870912,"2021-06-01 16:02:54.853","2021-06-01 16:02:52.15"

CodePudding user response:

Given:

df1

                   did              userdid ResumeStatus        sysinserteddt
0  R3X2PB661CST2HJQWVM  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03
1  R2N86Q6KXW9K1T0Q7PT  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03

df3

                   did              userdid        email     status                 modified            sysinserteddt
0  R3X2PB661CST2HJQWVM  U3Y5VV6BP3SQQ5B8VHV  [email protected]  536870912  2022-05-02 21:50:15.813  2022-05-02 21:50:15.907
1  RD71J16YLWTXRDRV1YG  U3Y5VV6BP3SQQ5B8VHV        j.com  536870912  2021-06-01 16:02:54.853   2021-06-01 16:02:52.15

Doing:

# All entries from df1 where userdid is in df3, but did is not~
df = df1[df1[['did', 'userdid']].isin(df3).eq([False, True]).all(axis=1)]
print(df)

Output:

                   did              userdid ResumeStatus        sysinserteddt
1  R2N86Q6KXW9K1T0Q7PT  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03

To CSV:

df.to_csv('file.csv', header=False, index=False)

# We can see what that looks like:
print(df.to_csv(header=False, index=False))

Output:

R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,2021-01-01 00:27:03

For more info... read the documentation.

CodePudding user response:

You can first inner-join df2 and df1 on userid and then just query your data frame by all rows where did from df1 does not equal did from df2 (named did_temp):

################################# Data Frames #################################
import pandas as pd

ls1 = ['did,userdid,ResumeStatus,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"', 'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"']
ls1 = [i.split(',') for i in ls1]
df1 = pd.DataFrame(ls1[1:], columns=ls1[0])

ls2 = ['did,userdid,email,status,modified,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,[email protected],536870912,2022-05-02 21:50:15.813,2022-05-02 21:50:15.907', 'RD71J16YLWTXRDRV1YG,U00RT61PD6SHSH2PTL,j.com,536870912,2021-06-01 16:02:54.853,2021-06-01 16:02:52.15']
ls2 = [i.split(',') for i in ls2]

df2 = pd.DataFrame(ls2[1:], columns=ls2[0])
###############################################################################

df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp'])

Output:

    did                 userdid             ResumeStatus   sysinserteddt
1   R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active         "2021-01-01 00:27:03"

If you want a string output, you can do the following:

df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
','.join(df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp']).values[0])

Output:

'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"'

CodePudding user response:

This can simply be achieved using

>>> A = pd.read_csv("1.csv");
>>> B = pd.read_csv("3.csv");
>>> df =  (pd.merge(A, B, on='did', how='left'))
>>> df1 = df[df['email'].isna()].drop('email', axis=1)
>>> print (df1)
                   did            userdid_x ResumeStatus      sysinserteddt_x userdid_y  status modified sysinserteddt_y
1  R2N86Q6KXW9K1T0Q7PT  U3Y5VV6BP3SQQ5B8VHV       Active  2021-01-01 00:27:03       NaN     NaN      NaN             NaN
  • Related