Home > Mobile >  How to apply ranking for values across groups not within the group itself
How to apply ranking for values across groups not within the group itself

Time:09-12

suppose that I have this dataframe

    date        id
0   2019-10-01  303
1   2019-10-01  303
2   2019-10-01  404
3   2019-10-01  404
4   2019-10-02  303
5   2019-10-02  303
6   2019-10-02  404
7   2019-10-02  404

I want to apply ranking so that I have this dataframe

    date        id   id_rank
0   2019-10-01  303  1
1   2019-10-01  303  1
2   2019-10-01  404  1
3   2019-10-01  404  1
4   2019-10-02  303  2
5   2019-10-02  303  2
6   2019-10-02  404  2
7   2019-10-02  404  2

I tried pandas.groupby.rank() with this code

df["id_rank"] = df.groupby(["index_date", "id"])["id"].rank(method="first")

which gets me closer to the wanted result

    date        id   id_rank
0   2019-10-01  303  1
1   2019-10-01  303  1
2   2019-10-01  404  1
3   2019-10-01  404  1
4   2019-10-02  303  1
5   2019-10-02  303  1
6   2019-10-02  404  1
7   2019-10-02  404  1

I want to rank based on date. If the id is registered in the dataframe for the first day, it takes rank 1 and if it's registered again on the second day, it takes rank 2 and so on; rank 3 if the id is registered on a third day

CodePudding user response:

The ranking logic is not entirely clear. There are two ways it can be interpreted: (a) everything with the same date has the same rank or (b) the dates are ranked per id. Given the name id_rank I assume that it is the second option.

  1. If the ranking logic is (a), you can simply write

    df["id_rank"] = df.date.rank(method="dense").astype("int")
    

    The method="dense" argument results in the ranking per unique value, rather than an index to the sorted results as the normal rank() function returns.

  2. If the ranking logic is (b), you can use a similar idea as in the first method, but you first group by the id before ranking:

    df["id_rank"] = df.groupby("id").date.rank(method="dense").astype("int")
    

To illustrate the difference, suppose you instead have the dataframe

         date   id
0  2019-10-01  303
1  2019-10-01  404
2  2019-10-01  404
3  2019-10-02  303
4  2019-10-02  404
5  2019-10-02  405
6  2019-10-03  404
7  2019-10-03  405

Then the result given by the two different options would be:

        date   id  option_a  option_b
0 2019-10-01  303         1         1
1 2019-10-01  404         1         1
2 2019-10-01  404         1         1
3 2019-10-02  303         2         2
4 2019-10-02  404         2         2
5 2019-10-02  405         2         1
6 2019-10-03  404         3         3
7 2019-10-03  405         3         2

The key difference being whether the ranks for id 405 starts at 1 or 2.

  • Related