suppose that I have this dataframe
date id
0 2019-10-01 303
1 2019-10-01 303
2 2019-10-01 404
3 2019-10-01 404
4 2019-10-02 303
5 2019-10-02 303
6 2019-10-02 404
7 2019-10-02 404
I want to apply ranking so that I have this dataframe
date id id_rank
0 2019-10-01 303 1
1 2019-10-01 303 1
2 2019-10-01 404 1
3 2019-10-01 404 1
4 2019-10-02 303 2
5 2019-10-02 303 2
6 2019-10-02 404 2
7 2019-10-02 404 2
I tried pandas.groupby.rank()
with this code
df["id_rank"] = df.groupby(["index_date", "id"])["id"].rank(method="first")
which gets me closer to the wanted result
date id id_rank
0 2019-10-01 303 1
1 2019-10-01 303 1
2 2019-10-01 404 1
3 2019-10-01 404 1
4 2019-10-02 303 1
5 2019-10-02 303 1
6 2019-10-02 404 1
7 2019-10-02 404 1
I want to rank based on date. If the id is registered in the dataframe for the first day, it takes rank 1 and if it's registered again on the second day, it takes rank 2 and so on; rank 3 if the id is registered on a third day
CodePudding user response:
The ranking logic is not entirely clear. There are two ways it can be interpreted: (a) everything with the same date has the same rank or (b) the dates are ranked per id. Given the name id_rank
I assume that it is the second option.
If the ranking logic is (a), you can simply write
df["id_rank"] = df.date.rank(method="dense").astype("int")
The
method="dense"
argument results in the ranking per unique value, rather than an index to the sorted results as the normalrank()
function returns.If the ranking logic is (b), you can use a similar idea as in the first method, but you first group by the
id
before ranking:df["id_rank"] = df.groupby("id").date.rank(method="dense").astype("int")
To illustrate the difference, suppose you instead have the dataframe
date id
0 2019-10-01 303
1 2019-10-01 404
2 2019-10-01 404
3 2019-10-02 303
4 2019-10-02 404
5 2019-10-02 405
6 2019-10-03 404
7 2019-10-03 405
Then the result given by the two different options would be:
date id option_a option_b
0 2019-10-01 303 1 1
1 2019-10-01 404 1 1
2 2019-10-01 404 1 1
3 2019-10-02 303 2 2
4 2019-10-02 404 2 2
5 2019-10-02 405 2 1
6 2019-10-03 404 3 3
7 2019-10-03 405 3 2
The key difference being whether the ranks for id 405
starts at 1 or 2.