Home > Software engineering >  Pandas Create a Categorical Column After Condition
Pandas Create a Categorical Column After Condition

Time:11-23

I have a dataframe like this:

DURATION   CLUSTER   COEFF
3          0         0.34
3          1        -0.005
3          2         1
3          3         0.33 
4          0        -0.02
4          1        -0.28
4          2         0.22
4          3         0.48
5          0         0.65
5          1        -0.26
5          2         0.1
5          3         0.15

I want to create a RESULT categorical column according to the "COEFF" coefficients for each "DURATION". The one with the greatest "COEFF" value will be "First" and so on.

Desired output like this:

DURATION   CLUSTER   COEFF  RESULT
3          0         0.34   Second
3          1        -0.005  Fourth
3          2         1      First
3          3         0.33   Third
4          0        -0.02   Third
4          1        -0.28   Fourth
4          2         0.22   Second
4          3         0.48   First
5          0         0.65   First
5          1        -0.26   Fourth
5          2         0.1    Third
5          3         0.15   Second

Could you please help me about this?

CodePudding user response:

Use groupby.rank and map:

labels = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
df['RESULT'] = (df.groupby('DURATION')['COEFF']
                  .rank('dense', ascending=False).sub(1)
                  .map(dict(enumerate(labels)))
               )

Output:

    DURATION  CLUSTER  COEFF  RESULT
0          3        0  0.340  Second
1          3        1 -0.005  Fourth
2          3        2  1.000   First
3          3        3  0.330   Third
4          4        0 -0.020   Third
5          4        1 -0.280  Fourth
6          4        2  0.220  Second
7          4        3  0.480   First
8          5        0  0.650   First
9          5        1 -0.260  Fourth
10         5        2  0.100   Third
11         5        3  0.150  Second

CodePudding user response:

To go a bit further based on https://stackoverflow.com/a/74547858/7237062 excellent answer (I would not have found this myself that fast), I suggest using this Ordinal numbers replacement to completly automate the process.

import pandas as pd
# see answer https://stackoverflow.com/a/20007730/7237062, others exist
# code golfed version of an "ordinal" function (int -> ordinal string in english)
ordinal = lambda n: "%d%s" % (n,"tsnrhtdd"[(n//10!=1)*(n<4)*n::4])
# copy pasta of OP input data
df = pd.read_clipboard()  # let pandas read the clipboard
df["RESULT"] = (df.groupby('DURATION')['COEFF']
                  .rank('dense', ascending=False)
                  .sub(1) # mozway's answer so far !
                  .astype(int)
                    1 #  1 so ordinals start at 1 (instead of 0)
                  ).apply(ordinal) 

results:

    DURATION  CLUSTER  COEFF RESULT
0          3        0  0.340    2nd
1          3        1 -0.005    4th
2          3        2  1.000    1st
3          3        3  0.330    3rd
4          4        0 -0.020    3rd
5          4        1 -0.280    4th
6          4        2  0.220    2nd
7          4        3  0.480    1st
8          5        0  0.650    1st
9          5        1 -0.260    4th
10         5        2  0.100    3rd
11         5        3  0.150    2nd
  • Related