I have the following dataframe
import pandas as pd
import numpy as np
import datetime
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','b','b','b','c','c', 'd'],
'buy': [datetime.date(2020,4,10), datetime.date(2020,4,10), datetime.date(2020,5,21),
datetime.date(2020,8,28), datetime.date(2020,8,28), datetime.date(2021,2,25),
datetime.date(2021,12,1), datetime.date(2021,12,1), datetime.date(2021,12,17)],
'use': [None, None, None,
datetime.date(2020,8,30), None, datetime.date(2021,3,4),
datetime.date(2020,7,8), datetime.date(2021,12,20), None]})
I would like to create a rank
column, which will be grouped by id
and will be ascending first by buy
and then by use
. If the use
is None
and the buy
is the same, then the same rank
should be given.
I tried:
foo['buy'] = pd.to_datetime(foo['buy'])
foo["rank"] = foo.groupby("id")[["buy", "use"]].rank(method="dense", ascending=True)
The above code works as expected for all id
s except id=c
Any ideas why ?
UPDATE
The expected output should be:
import pandas as pd
import numpy as np
import datetime
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','b','b','b','c','c', 'd'],
'buy': [datetime.date(2020,4,10), datetime.date(2020,4,10), datetime.date(2020,5,21),
datetime.date(2020,8,28), datetime.date(2020,8,28), datetime.date(2021,2,25),
datetime.date(2021,12,1), datetime.date(2021,12,1), datetime.date(2021,12,17)],
'use': [None, None, None,
datetime.date(2020,8,30), None, datetime.date(2021,3,4),
datetime.date(2020,7,8), datetime.date(2021,12,20), None],
'rank': [1,1,2,1,1,2,1,2,1]})
CodePudding user response:
If I understood correctly your comments, within an id/buy group, any None makes the values have the same rank.
Thus, a first step is to mask all use values if there is any None within this group. Then, to rank on several columns you need to aggregate as tuple:
foo['rank'] = (foo
.assign(use2=foo.groupby(['id', 'buy'])['use']
.transform(lambda s: '' if s.isna().any() else s)
)
[['buy', 'use2']].apply(tuple, axis=1)
.groupby(foo['id']).rank(method='dense', na_option='top')
)
output:
id buy use rank
0 a 2020-04-10 None 1
1 a 2020-04-10 None 1
2 a 2020-05-21 None 2
3 b 2020-08-28 2020-08-30 1
4 b 2020-08-28 None 1
5 b 2021-02-25 2021-03-04 2
6 c 2021-12-01 2020-07-08 1
7 c 2021-12-01 2021-12-20 2
8 d 2021-12-17 None 1
CodePudding user response:
Not sure if there is a more efficient way, but this works:
foo['buy'] = pd.to_datetime(foo['buy'])
foo["rank"] = foo.groupby(["id"])[["buy", "use"]].rank(method="dense", ascending=True)
foo["rank2"] = foo.groupby(["id"])[["buy", "use"]].rank(method="first", ascending=True)
foo = foo.merge(foo.groupby('id')['use'].apply(lambda x: x.isnull().any()).to_frame('BOOL').reset_index(), on='id')
foo['rank_f'] = np.where(foo['BOOL'], foo['rank'], foo['rank2'])
And rank_f
is the correct column