I am using this code of pandas to get difference of 2 csv. I want to find all rows in 1.csv where userdid is same both csv and did is different.
How can i achieve this?
>>> import pandas as pd
>>> df1 = pd.read_csv('1.csv')
>>> df2 = pd.read_csv('3.csv')
>>> df1.apply(tuple, 1).isin(df2.apply(tuple, 1))
expected
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
1.csv
did,userdid,ResumeStatus,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"
3.csv
did,userdid,email,status,modified,sysinserteddt
R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,[email protected],536870912,"2022-05-02 21:50:15.813","2022-05-02 21:50:15.907"
RD71J16YLWTXRDRV1YG,U3Y5VV6BP3SQQ5B8VHV,j.com,536870912,"2021-06-01 16:02:54.853","2021-06-01 16:02:52.15"
CodePudding user response:
Given:
df1
did userdid ResumeStatus sysinserteddt
0 R3X2PB661CST2HJQWVM U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03
df3
did userdid email status modified sysinserteddt
0 R3X2PB661CST2HJQWVM U3Y5VV6BP3SQQ5B8VHV [email protected] 536870912 2022-05-02 21:50:15.813 2022-05-02 21:50:15.907
1 RD71J16YLWTXRDRV1YG U3Y5VV6BP3SQQ5B8VHV j.com 536870912 2021-06-01 16:02:54.853 2021-06-01 16:02:52.15
Doing:
# All entries from df1 where userdid is in df3, but did is not~
df = df1[df1[['did', 'userdid']].isin(df3).eq([False, True]).all(axis=1)]
print(df)
Output:
did userdid ResumeStatus sysinserteddt
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03
To CSV:
df.to_csv('file.csv', header=False, index=False)
# We can see what that looks like:
print(df.to_csv(header=False, index=False))
Output:
R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,2021-01-01 00:27:03
For more info... read the documentation.
CodePudding user response:
You can first inner-join df2
and df1
on userid
and then just query your data frame by all rows where did
from df1
does not equal did
from df2
(named did_temp
):
################################# Data Frames #################################
import pandas as pd
ls1 = ['did,userdid,ResumeStatus,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"', 'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"']
ls1 = [i.split(',') for i in ls1]
df1 = pd.DataFrame(ls1[1:], columns=ls1[0])
ls2 = ['did,userdid,email,status,modified,sysinserteddt', 'R3X2PB661CST2HJQWVM,U3Y5VV6BP3SQQ5B8VHV,[email protected],536870912,2022-05-02 21:50:15.813,2022-05-02 21:50:15.907', 'RD71J16YLWTXRDRV1YG,U00RT61PD6SHSH2PTL,j.com,536870912,2021-06-01 16:02:54.853,2021-06-01 16:02:52.15']
ls2 = [i.split(',') for i in ls2]
df2 = pd.DataFrame(ls2[1:], columns=ls2[0])
###############################################################################
df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp'])
Output:
did userdid ResumeStatus sysinserteddt
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active "2021-01-01 00:27:03"
If you want a string output, you can do the following:
df_out = pd.merge(df1, df2[['did', 'userdid']].rename(columns={'did': 'did_temp'}), on=['userdid'], how="inner")
','.join(df_out[df_out['did'].ne(df_out['did_temp'])].drop(columns=['did_temp']).values[0])
Output:
'R2N86Q6KXW9K1T0Q7PT,U3Y5VV6BP3SQQ5B8VHV,Active,"2021-01-01 00:27:03"'
CodePudding user response:
This can simply be achieved using
>>> A = pd.read_csv("1.csv");
>>> B = pd.read_csv("3.csv");
>>> df = (pd.merge(A, B, on='did', how='left'))
>>> df1 = df[df['email'].isna()].drop('email', axis=1)
>>> print (df1)
did userdid_x ResumeStatus sysinserteddt_x userdid_y status modified sysinserteddt_y
1 R2N86Q6KXW9K1T0Q7PT U3Y5VV6BP3SQQ5B8VHV Active 2021-01-01 00:27:03 NaN NaN NaN NaN