Home > Software engineering >  Pandas Ranking for String Columns
Pandas Ranking for String Columns

Time:12-10

I have a sample dataframe like this. Basically I want to do ranking based on item_number and location_id. I could've done something like that in SQL using window function (dense_rank, over() partition by).

df = pd.DataFrame({'item_number': [1029980, 1029980, 1029980, 1029980, 1029980, 
                                   1029980, 1029980, 1029980, 1029980, 1029980],
                   'location_id': ['L3-25-AA-05-B', 'L3-25-AA-05-B', 'L3-25-AA-05-B', 'L3-25-AA-05-B', 'L3-25-AA-05-B', 
                                   'L4-25-AA-05-B', 'L4-25-AA-05-B','L4-25-AA-05-B', 'L4-25-AA-05-B', 'L4-25-AA-05-B'],
                   'Date': ['2021-10-01', '2021-10-02', '2021-10-03', '2021-10-04', '2021-10-05', 
                            '2021-10-01', '2021-10-02', '2021-10-03', '2021-10-04', '2021-10-05']})
item_number location_id Date
1029980 L3-25-AA-05-B 2021-10-01
1029980 L3-25-AA-05-B 2021-10-02
1029980 L3-25-AA-05-B 2021-10-03
1029980 L3-25-AA-05-B 2021-10-04
1029980 L3-25-AA-05-B 2021-10-05
1029980 L4-25-AA-05-B 2021-10-01
1029980 L4-25-AA-05-B 2021-10-02
1029980 L4-25-AA-05-B 2021-10-03
1029980 L4-25-AA-05-B 2021-10-04
1029980 L4-25-AA-05-B 2021-10-05

I want the data to be like this. The ranking is grouped by item_number and location_id. If the item_number and location_id are same then it's considered in a same group and ranking should be done based on the Date.

item_number location_id Date Rank
1029980 L3-25-AA-05-B 2021-10-01 5
1029980 L3-25-AA-05-B 2021-10-02 4
1029980 L3-25-AA-05-B 2021-10-03 3
1029980 L3-25-AA-05-B 2021-10-04 2
1029980 L3-25-AA-05-B 2021-10-05 1
1029980 L4-25-AA-05-B 2021-10-01 5
1029980 L4-25-AA-05-B 2021-10-02 4
1029980 L4-25-AA-05-B 2021-10-03 3
1029980 L4-25-AA-05-B 2021-10-04 2
1029980 L4-25-AA-05-B 2021-10-05 1

I have tried this code but it's giving an error as the columns are all string.

test['rank'] = test.groupby(['item_number','location_id']).rank()

The above code gave me this error.

DataError: No numeric types to aggregate

Can anyone help me in this regard please?

CodePudding user response:

In your case do

df['new'] = df.groupby(['item_number','location_id'])['Date'].rank(ascending=False)
0    5.0
1    4.0
2    3.0
3    2.0
4    1.0
5    5.0
6    4.0
7    3.0
8    2.0
9    1.0
Name: Date, dtype: float64

CodePudding user response:

IIUC, you can reverse the dataframe, groupby on item_number and location_id and cumcount Dates:

df['rank'] = df.groupby(['item_number','location_id'], as_index=False)['Date'].cumcount(ascending=False) 1

Output:

   item_number    location_id        Date  rank
0      1029980  L3-25-AA-05-B  2021-10-01     5
1      1029980  L3-25-AA-05-B  2021-10-02     4
2      1029980  L3-25-AA-05-B  2021-10-03     3
3      1029980  L3-25-AA-05-B  2021-10-04     2
4      1029980  L3-25-AA-05-B  2021-10-05     1
5      1029980  L4-25-AA-05-B  2021-10-01     5
6      1029980  L4-25-AA-05-B  2021-10-02     4
7      1029980  L4-25-AA-05-B  2021-10-03     3
8      1029980  L4-25-AA-05-B  2021-10-04     2
9      1029980  L4-25-AA-05-B  2021-10-05     1

CodePudding user response:

You could use:

test.dtypes

to see what type the columns are that you are using (if they are numeric of not) and then potentially use astype: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

test.astype({"item_number": "int"}).groupby(['item_number','location_id']).rank()

Though I'm not sure if that would work for location_id.

  • Related