Categorise data in Python based on two columns

I have two dataframes as follows:

  patient id      A      B
0  EHR123456  0.415  0.230
1  EHR987653  0.854  0.705
2  EHR657483  0.364  0.881


     Mood    A    B
0  Type A  0.4  0.8
1  Type B  0.2  0.6
2  Type C  0.2  0.4
3  Type D  0.4  0.2
4  Type E  0.6  0.2
5  Type F  0.8  0.4
6  Type G  0.8  0.6
7  Type H  0.6  0.8

The first dataframe contains patients with results from test A and test B (about 10k rows). The second dataframe contains 8 pre-defined categories (mood) with values for A and B. How can I assign a category (mood) to every patient based on the closest match with one of the categories from the second dataframe (taking A and B into account)?

CodePudding user response：

You can use scipy.spatial.distance.cdist to compute the pairwise distances between the rows of the two dataframes using A and B (Euclidean distance by default), and then idxmin to pick the closest match for each row. I recommend setting mood and user as the index. Example:

import pandas as pd

df0 = pd.DataFrame([
    ['user0', 0.1231, 0.12312],
    ['user1', 0.34534, 0.345],
], columns=['user', 'A', 'B']).set_index('user')

df1 = pd.DataFrame([
    ['mood0', 0.2, 0.2],
    ['mood1', 0.3, 0.3],
    ['mood2', 0.4, 0.4],
], columns=['mood', 'A', 'B']).set_index('mood')

from scipy.spatial.distance import cdist

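# Rows are users, columns are moods; idxmin(axis=1) picks the nearest mood per user.
closest = pd.DataFrame(
    cdist(df0.values, df1.values),
    index=df0.index, columns=df1.index,
).idxmin(axis=1)
print(closest)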
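
As a rough sketch, the same idea applied to the dataframes from the question could look like the following. The names patients, moods, distances and nearest are my own and not from the original post, and the data is copied from the question.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Data copied from the question; the variable names are illustrative.
patients = pd.DataFrame({
    "patient id": ["EHR123456", "EHR987653", "EHR657483"],
    "A": [0.415, 0.854, 0.364],
    "B": [0.230, 0.705, 0.881],
})

moods = pd.DataFrame({
    "Mood": ["Type A", "Type B", "Type C", "Type D",
             "Type E", "Type F", "Type G", "Type H"],
    "A": [0.4, 0.2, 0.2, 0.4, 0.6, 0.8, 0.8, 0.6],
    "B": [0.8, 0.6, 0.4, 0.2, 0.2, 0.4, 0.6, 0.8],
})

# Pairwise Euclidean distances: one row per patient, one column per mood.
distances = cdist(patients[["A", "B"]].to_numpy(), moods[["A", "B"]].to_numpy())

# Position of the nearest mood for each patient, mapped back to its label.
nearest = distances.argmin(axis=1)
patients["Mood"] = moods["Mood"].to_numpy()[nearest]
print(patients)
```

Working on positions (argmin) instead of index labels avoids having to change the index of either dataframe.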

CodePudding user response：

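# `df` is the patient dataframe and `mood` is the category dataframe from the question.
def get_mood(x):
    # Sum of absolute differences in A and B to every category; keep the closest row.
    s = mood.assign(delta=abs(mood.A - x.A) + abs(mood.B - x.B)). \
        sort_values(by='delta').iloc[0]
    # s = mood.iloc[mood.assign(delta=abs(mood.A - x.A) + abs(mood.B - x.B)).delta.idxmin()]
    return f'{s.Mood} ({s.A} : {s.B})'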


df['Mood'] = df.apply(get_mood, axis=1)
print(df)

Output:

  patient id      A      B                Mood
0  EHR123456  0.415  0.230  Type D (0.4 : 0.2)
1  EHR987653  0.854  0.705  Type G (0.8 : 0.6)
2  EHR657483  0.364  0.881  Type A (0.4 : 0.8)
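
With about 10k rows, calling apply once per row may be slow. A possible vectorised variant of the same sum-of-absolute-differences matching is sketched below using NumPy broadcasting. It assumes df and mood hold the data from the question (rebuilt here so the snippet runs on its own), and it assigns only the category label, not the formatted string shown above.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "patient id": ["EHR123456", "EHR987653", "EHR657483"],
    "A": [0.415, 0.854, 0.364],
    "B": [0.230, 0.705, 0.881],
})

mood = pd.DataFrame({
    "Mood": ["Type A", "Type B", "Type C", "Type D",
             "Type E", "Type F", "Type G", "Type H"],
    "A": [0.4, 0.2, 0.2, 0.4, 0.6, 0.8, 0.8, 0.6],
    "B": [0.8, 0.6, 0.4, 0.2, 0.2, 0.4, 0.6, 0.8],
})

# (n_patients, 1, 2) - (1, n_moods, 2) broadcasts to (n_patients, n_moods, 2).
diff = df[["A", "B"]].to_numpy()[:, None, :] - mood[["A", "B"]].to_numpy()[None, :, :]

# Sum |dA| + |dB| over the last axis to get one distance per patient/mood pair.
delta = np.abs(diff).sum(axis=2)

# Column position of the smallest distance in each row, mapped to the mood label.
best = delta.argmin(axis=1)
df["Mood"] = mood["Mood"].to_numpy()[best]
print(df)
```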