I have two dataframes A and B that contain different sets of patient data, and need to append certain columns from B to A - however only for those rows that contain information from the same patient and visit, i.e. where A and B have a matching value in two particular columns. B is longer than A, not all rows in A are contained in B. I don't know how this would be possible without looping, but many people discourage from looping over pandas dataframes (apart from the fact that my loop solution does not work because "Can only compare identically-labeled Series objects"). I read the options here How to iterate over rows in a DataFrame in Pandas but don't see which one I could apply here and would appreciate any tips!
Toy example (the actual dataframe has about 300 rows):
dict_A = {
'ID': ['A_190792','X_210392','B_050490','F_311291','K_010989'],
'Visit_Date': ['2010-10-31','2011-09-24','2010-30-01','2012-01-01','2013-08-13'],
'Score': [30, 23, 42, 23, 31],
}
A = pd.DataFrame(dict_A)
dict_B = {
'ID': ['G_090891','A_190792','Z_060791','X_210392','B_050490','F_311291','K_010989','F_230989'],
'Visit_Date': ['2013-03-01','2010-10-31','2013-04-03','2011-09-24','2010-30-01','2012-01-01','2013-08-13','2014-09-09'],
'Diagnosis': ['F12', 'G42', 'F34', 'F90', 'G98','G87','F23','G42'],
}
B = pd.DataFrame(dict_B)
for idx, row in A.iterrows():
A.loc[row,'Diagnosis'] = B['Diagnosis'][(B['Visit_Date']==A['Visit_Date']) & (B['ID']==A['ID'])]
# Appends Diagnosis column from B to A for rows where ID and date match
I have seen this question Append Columns to Dataframe 1 Based on Matching Column Values in Dataframe 2 but the only answer is quite specific to it and also does not address the question whether a loop can/should be used or not
CodePudding user response:
i think you can use merge:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='outer')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30.0 G42
1 X_210392 2011-09-24 23.0 F90
2 B_050490 2010-30-01 42.0 G98
3 F_311291 2012-01-01 23.0 G87
4 K_010989 2013-08-13 31.0 F23
5 G_090891 2013-03-01 NaN F12
6 Z_060791 2013-04-03 NaN F34
7 F_230989 2014-09-09 NaN G42
'''
if you want to only A:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='left')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30 G42
1 X_210392 2011-09-24 23 F90
2 B_050490 2010-30-01 42 G98
3 F_311291 2012-01-01 23 G87
4 K_010989 2013-08-13 31 F23
'''