Home > OS >  collapse hot encoded/dummied columns horizontally and then get max of combinations
collapse hot encoded/dummied columns horizontally and then get max of combinations

Time:02-21

I have these simplified data:

data = {'dummy_group_1_1':[1, 0, 0, 1],
        'dummy_group_1_2':[0, 1, 0, 0],
        'dummy_group_1_3':[0, 0, 1, 0],
        'dummy_group_2_1':[1, 0, 1, 0],
        'dummy_group_2_2':[0, 1, 0, 1],
        'number':[13.0, 15.0, 12.0, 12.0]}
haves = pd.DataFrame(data)
haves

enter image description here

Firstly, I would like to collapse columns horizontally for each "dummy variable group" dummy_group_1 and dummy_group_2 where the entry is 1. The result would be:

dummy_group_1, dummy_group_2, number

1, 1, 13
2, 2, 15
3, 1, 12
1, 2, 12 

There are many of these columns and I guess one could use something along those lines: haves.columns.str.startswith('dummy_group_1')? I would not know what to do to achieve this sorry ...

Having these intermediate results, I would like to get the dummy_group_1 and dummy_group_2 combination for the max value of "number" for the combination (tie situations take any). The results would be:

dummy_group_1, dummy_group_2, max

1, 1, 13
2, 2, 15
3, 1, 12

Is this possible?

PS:

Here is my miserable long wounded way of getting to step 1:

data = {'dummy_group_1_1':[1, 0, 0, 1],
        'dummy_group_1_2':[0, 1, 0, 0],
        'dummy_group_1_3':[0, 0, 1, 0],
        'dummy_group_2_1':[1, 0, 1, 0],
        'dummy_group_2_2':[0, 1, 0, 1],
        'number':[13.0, 15.0, 12.0, 12.0]}
haves = pd.DataFrame(data)
haves['surrogate_key'] = haves.reset_index().index
haves

group_1 = haves.loc[:, haves.columns.str.startswith('dummy_group_1') | haves.columns.str.startswith('surrogate_key')]
group_1 = pd.melt(group_1, id_vars=['surrogate_key']).query("value > 0")
group_1.drop('value', axis=1, inplace=True)
group_1['variable'] = group_1['variable'].str.replace('dummy_group_1_', '')
group_1.columns = group_1.columns.str.replace('variable', 'dummy_group_1')

group_2 = haves.loc[:, haves.columns.str.startswith('dummy_group_2') | haves.columns.str.startswith('surrogate_key')]
group_2 = pd.melt(group_2, id_vars=['surrogate_key']).query("value > 0")
group_2.drop('value', axis=1, inplace=True)
group_2['variable'] = group_2['variable'].str.replace('dummy_group_2_', '')
group_2.columns = group_2.columns.str.replace('variable', 'dummy_group_2')

numbers = haves[['surrogate_key', 'number']]

step1_data = pd.merge(numbers, group_1,  how='inner', left_on=['surrogate_key'], right_on = ['surrogate_key'])
step1_data = pd.merge(step1_data, group_2,  how='inner', left_on=['surrogate_key'], right_on = ['surrogate_key'])
step1_data 

Given parts of richardec's answer I can get to the final results doing:

step1_data.loc[step1_data.groupby('dummy_group_1')['number'].idxmax()]

which gives:

enter image description here

CodePudding user response:

Try something like this:

cols = ['number']
tmp = haves[haves.columns.difference(cols)]
tmp = pd.concat([tmp.apply(lambda col: col.map({1: int(col.name.split('_')[3])})).groupby(tmp.columns.str.split('_').str[2], axis=1).bfill().dropna(axis=1).astype(int), haves[cols]], axis=1)
tmp = tmp.loc[tmp.groupby('dummy_group_1_1')['number'].idxmax()]

Output:

>>> tmp
   dummy_group_1_1  dummy_group_2_1  number
0                1                1    13.0
1                2                2    15.0
2                3                1    12.0
  • Related