I have two dataframes as follows:
  patient id      A      B
0  EHR123456  0.415  0.230
1  EHR987653  0.854  0.705
2  EHR657483  0.364  0.881

     Mood    A    B
0  Type A  0.4  0.8
1  Type B  0.2  0.6
2  Type C  0.2  0.4
3  Type D  0.4  0.2
4  Type E  0.6  0.2
5  Type F  0.8  0.4
6  Type G  0.8  0.6
7  Type H  0.6  0.8
The first dataframe contains patients with results from test A and test B (about 10k rows). The second dataframe contains 8 pre-defined categories (mood) with values for A and B. How can I assign a category (mood) to every patient based on the closest match with one of the categories from the second dataframe (taking A and B into account)?
CodePudding user response:
You can use scipy.spatial.distance.cdist to compute the pairwise distances between the (A, B) values of the two dataframes (Euclidean distance by default), and then idxmin to find the closest match for each row. I recommend setting mood and user as the index. Example:
import pandas as pd
from scipy.spatial.distance import cdist

df0 = pd.DataFrame([
    ['user0', 0.1231, 0.12312],
    ['user1', 0.34534, 0.345],
], columns=['user', 'A', 'B']).set_index('user')

df1 = pd.DataFrame([
    ['mood0', 0.2, 0.2],
    ['mood1', 0.3, 0.3],
    ['mood2', 0.4, 0.4],
], columns=['mood', 'A', 'B']).set_index('mood')

# Rows are users, columns are moods; idxmin picks the nearest mood per user
pd.DataFrame(
    cdist(df0.values, df1.values),
    index=df0.index, columns=df1.index,
).idxmin(axis=1)
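Applied to the dataframes from the question, the same idea might look like the sketch below. The variable names patients and moods are placeholders, and it assumes the columns are named exactly as shown in the question. With about 10k patients and 8 moods the distance matrix is only 10k x 8, so memory should not be a concern.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data copied from the question; `patients` and `moods` are placeholder names
patients = pd.DataFrame({
    'patient id': ['EHR123456', 'EHR987653', 'EHR657483'],
    'A': [0.415, 0.854, 0.364],
    'B': [0.230, 0.705, 0.881],
})
moods = pd.DataFrame({
    'Mood': ['Type A', 'Type B', 'Type C', 'Type D',
             'Type E', 'Type F', 'Type G', 'Type H'],
    'A': [0.4, 0.2, 0.2, 0.4, 0.6, 0.8, 0.8, 0.6],
    'B': [0.8, 0.6, 0.4, 0.2, 0.2, 0.4, 0.6, 0.8],
})

# Euclidean distance matrix: one row per patient, one column per mood
dist = cdist(patients[['A', 'B']].to_numpy(), moods[['A', 'B']].to_numpy())

# Position of the nearest mood for each patient, mapped back to its label
nearest = dist.argmin(axis=1)
patients['Mood'] = moods['Mood'].to_numpy()[nearest]

print(patients)
```

Using positional argmin instead of idxmin avoids having to move the mood and patient columns into the index first.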
CodePudding user response:
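# df holds the patients, mood holds the 8 categories (both as in the question)
def get_mood(x):
    # Sum of absolute differences in A and B (Manhattan distance) to every mood
    s = mood.assign(delta=abs(mood.A - x.A) + abs(mood.B - x.B)). \
        sort_values(by='delta').iloc[0]
    # Alternative without sorting:
    # s = mood.loc[(abs(mood.A - x.A) + abs(mood.B - x.B)).idxmin()]
    return f'{s.Mood} ({s.A} : {s.B})'

df['Mood'] = df.apply(get_mood, axis=1)
print(df)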
Output:
  patient id      A      B                Mood
0  EHR123456  0.415  0.230  Type D (0.4 : 0.2)
1  EHR987653  0.854  0.705  Type G (0.8 : 0.6)
2  EHR657483  0.364  0.881  Type A (0.4 : 0.8)
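Since apply calls the function once per row, it may get slow on the full 10k patients. A possible vectorized variant of the same sum-of-absolute-differences rule, using NumPy broadcasting, is sketched below. It assumes the frames are named df and mood as above and rebuilds the sample data so that it runs on its own. It stores only the mood label rather than the formatted string.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'patient id': ['EHR123456', 'EHR987653', 'EHR657483'],
    'A': [0.415, 0.854, 0.364],
    'B': [0.230, 0.705, 0.881],
})
mood = pd.DataFrame({
    'Mood': ['Type A', 'Type B', 'Type C', 'Type D',
             'Type E', 'Type F', 'Type G', 'Type H'],
    'A': [0.4, 0.2, 0.2, 0.4, 0.6, 0.8, 0.8, 0.6],
    'B': [0.8, 0.6, 0.4, 0.2, 0.2, 0.4, 0.6, 0.8],
})

# Shapes (n_patients, 1, 2) and (1, n_moods, 2) broadcast to (n_patients, n_moods, 2)
p = df[['A', 'B']].to_numpy()[:, None, :]
m = mood[['A', 'B']].to_numpy()[None, :, :]

# Manhattan distance: |dA| + |dB| for every patient/mood pair
delta = np.abs(p - m).sum(axis=2)

# Label of the closest mood per patient
df['Mood'] = mood['Mood'].to_numpy()[delta.argmin(axis=1)]

print(df)
```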