Home > Back-end >  How to create rank column based on multiple columns with groupby in pandas
How to create rank column based on multiple columns with groupby in pandas

Time:03-15

I have the following dataframe

import pandas as pd
import numpy as np
import datetime
import pandas as pd

foo = pd.DataFrame({'id': ['a','a','a','b','b','b','c','c', 'd'],
                    'buy': [datetime.date(2020,4,10), datetime.date(2020,4,10), datetime.date(2020,5,21), 
                              datetime.date(2020,8,28), datetime.date(2020,8,28), datetime.date(2021,2,25), 
                              datetime.date(2021,12,1), datetime.date(2021,12,1), datetime.date(2021,12,17)],
                    'use': [None, None, None, 
                              datetime.date(2020,8,30), None, datetime.date(2021,3,4), 
                              datetime.date(2020,7,8), datetime.date(2021,12,20), None]})

I would like to create a rank column, which will be grouped by id and will be ascending first by buy and then by use. If the use is None and the buy is the same, then the same rank should be given.

I tried:

foo['buy'] = pd.to_datetime(foo['buy'])
foo["rank"] = foo.groupby("id")[["buy", "use"]].rank(method="dense", ascending=True)

The above code works as expected for all ids except id=c

Any ideas why ?

UPDATE

The expected output should be:

import pandas as pd
import numpy as np
import datetime
import pandas as pd

foo = pd.DataFrame({'id': ['a','a','a','b','b','b','c','c', 'd'],
                    'buy': [datetime.date(2020,4,10), datetime.date(2020,4,10), datetime.date(2020,5,21), 
                              datetime.date(2020,8,28), datetime.date(2020,8,28), datetime.date(2021,2,25), 
                              datetime.date(2021,12,1), datetime.date(2021,12,1), datetime.date(2021,12,17)],
                    'use': [None, None, None, 
                              datetime.date(2020,8,30), None, datetime.date(2021,3,4), 
                              datetime.date(2020,7,8), datetime.date(2021,12,20), None], 
'rank': [1,1,2,1,1,2,1,2,1]})

CodePudding user response:

If I understood correctly your comments, within an id/buy group, any None makes the values have the same rank.

Thus, a first step is to mask all use values if there is any None within this group. Then, to rank on several columns you need to aggregate as tuple:

foo['rank'] = (foo
 .assign(use2=foo.groupby(['id', 'buy'])['use']
                 .transform(lambda s: '' if s.isna().any() else s)
         )
 [['buy', 'use2']].apply(tuple, axis=1)
 .groupby(foo['id']).rank(method='dense', na_option='top')
)

output:

  id        buy         use  rank
0  a 2020-04-10        None     1
1  a 2020-04-10        None     1
2  a 2020-05-21        None     2
3  b 2020-08-28  2020-08-30     1
4  b 2020-08-28        None     1
5  b 2021-02-25  2021-03-04     2
6  c 2021-12-01  2020-07-08     1
7  c 2021-12-01  2021-12-20     2
8  d 2021-12-17        None     1

CodePudding user response:

Not sure if there is a more efficient way, but this works:

foo['buy'] = pd.to_datetime(foo['buy'])
foo["rank"] = foo.groupby(["id"])[["buy", "use"]].rank(method="dense", ascending=True)
foo["rank2"] = foo.groupby(["id"])[["buy", "use"]].rank(method="first", ascending=True)

foo = foo.merge(foo.groupby('id')['use'].apply(lambda x: x.isnull().any()).to_frame('BOOL').reset_index(), on='id')
foo['rank_f'] = np.where(foo['BOOL'], foo['rank'], foo['rank2'])

And rank_f is the correct column

  • Related