How can I get percentile of column in dataframe considering only previous values? (Python)-CodePudding

I have a dataframe with a numeric column and I would link to calculate the percentile of the values in each row for that column considering only previous rows of the column. Here is an example:

 ------- 
| col_1 |
 ------- 
|  5    |
 ------- 
|  4    |
 ------- 
|  10   |
 ------- 
|  1    |
 ------- 
|  7    |
 -------

I would like to obtain a dataframe like this:

 ------- ------------ 
| col_1 | percentile |
 ------- ------------ 
|  5    |     100    |
 ------- ------------ 
|  4    |     50     |
 ------- ------------ 
|  10   |     100    |
 ------- ------------ 
|  1    |     25     |
 ------- ------------ 
|  7    |     80     |
 ------- ------------

How can I calculate it?

CodePudding user response：

Try as follows.

Use df.expanding with min_periods=1 to allow expanding window calculations.
For each window, we apply Expanding.rank with pct=True (and we multiply by 100).

We can assign the result directly to the new column percentile:

import pandas as pd

data = {'col_1':[5,4,10,1,7]}
df = pd.DataFrame(data)

df['percentile'] = df['col_1'].expanding(min_periods=1).rank(pct=True).mul(100)

print(df)

   col_1  percentile
0      5       100.0
1      4        50.0
2     10       100.0
3      1        25.0
4      7        80.0

Update: Expanding.rank was added to pandas in version 1.4.0. For earlier versions, you could for instance try:

temp = df['col_1'].expanding(min_periods=1).agg(['rank','count'])
df['percentile'] = (temp['rank']/temp['count']).mul(100)

print(df)

   col_1  percentile
0      5       100.0
1      4        50.0
2     10       100.0
3      1        25.0
4      7        80.0

Or, as a one-liner:

df['percentile'] = df['col_1'].expanding(min_periods=1)\
    .apply(lambda x: (x.rank()/x.count()).to_numpy()[-1]*100)