Home > other >  Quickly ranking rows in very large dataframes
Quickly ranking rows in very large dataframes

Time:08-15

So I have a very large dataframe with over 500 rows and 100 columns involved, each row representing a person and column representing a performance metric:

[Person] [Metric 1] [Metric 2] ...
A          [num]      [num]
B          [num]      [num]       
C          [num]      [num]
D          [num]      [num]
...

What I want is as follows:

[Person] [Metric 1] [Ranking for 1][Metric 2][Ranking for 2] ...
A          [num]        [Rank]       [num]       [Rank]
B          [num]        [Rank]       [num]       [Rank]     
C          [num]        [Rank]       [num]       [Rank]
D          [num]        [Rank]       [num]       [Rank]
...

Now, I can use pandas.rank to do this for one column, but it seems really clunky when there are over 100 involved, not to mention having to move the columns around afterwards. Is there a quick way (maybe involving loops) to rank all 100 columns at once?

CodePudding user response:

You can use pd.DataFrame.rank:

out = (df
  .join(df.filter(like='Metric')
          .rank()
          .add_prefix('rank_')
        )
  .sort_index(axis=1, key=lambda x: x.str.extract(r'Metric(\d )', expand=False).fillna(0).astype(int))
)

NB. There are many ranking algorithms, check the documentation for the one you need.

Dummy example:

  Person  Metric1  rank_Metric1  Metric2  rank_Metric2
0      A        1           2.0        1           2.0
1      B        3           4.0        0           1.0
2      C        0           1.0        2           3.0
3      D        2           3.0        3           4.0

CodePudding user response:

Given this dataframe

   metric1  metric2  metric3
A        1        2        3
B        3        4        5
C        2        1        7

You can create each rank and a new column name

new_column_order = []

for index, column in enumerate(df.columns):
    df[f"rank {index 1}"] = df[column].rank()
    new_column_order.append(column)
    new_column_order.append(f"rank {index 1}")
df.columns = new_column_order

at the end of the for loop the dataframe is

   metric1  metric2  metric3  rank 1  rank 2  rank 3
A        1        2        3     1.0     2.0     1.0
B        3        4        5     3.0     3.0     2.0
C        2        1        7     2.0     1.0     3.0

and the list of columns reorganized (new_column_order)

['metric1', 'rank 1', 'metric2', 'rank 2', 'metric3', 'rank 3']
  • Related