Home > Net >  Python - To create 2 new column with 25th and 75th percentile of several row values
Python - To create 2 new column with 25th and 75th percentile of several row values

Time:05-20

Here is how my df kind of looks like (with many more rows, and many more columns):

Index WTG1 WTG2 WTG3
1.5 61.25 -7.57 7.18
2 19.69 25.95 28.67
2.5 59.51 81.22 78.22
3 131.81 154.07 142.92

My objective is to get:

Index WTG1 WTG2 WTG3 25th Percentile 75th Percentile Mean
1.5 61.25 -7.57 7.18 (25th Percentile of 61.2, -7.57, 7.18) (75th Percentile of 61.2, -7.57, 7.18) (Avg. of 61.2, -7.57, 7.18)
2 19.6 25.95 28.67 (25th Percentile of 19.69, 25.95, 28.67) (75th Percentile of 19.69, 25.95, 28.67) (AVG. of 19.69, 25.95, 28.67)
2.5 59.51 81.22 78.22 (25th Percentile of 59.51, 81.22, 78.22) (75th Percentile of 59.51, 81.22, 78.22) (AVG. of 59.51, 81.22, 78.22)
3 131.81 154.07 142.92 (25th Percentile of 131.81, 154.07, 142.92) (75th Percentile of 131.81, 154.07, 142.92) (AVG. of 131.81, 154.07, 142.92)

I have been looking for a long time now and the best I can do it :

df['mean'] = df[['WTG1','WTG2','WTG3'].mean(axis=1)
df['25th Percentile'] = np.nanpercentile(df[['WTG1','WTG2','WTG3']],25)
df['75th Percentile'] = np.nanpercentile(df[['WTG1','WTG2','WTG3']],75)

The mean seems to work, have not been checking the values yet though.

But the percentiles are the real issues here... it seems that nanpercentile function works only on columns. It returns the same value on every line (which I guess is the respective 25th and 75th percentile value but of the whole df) for both percentiles columns, which is not what I attend to do.

I was able to find some alternatives but could not adapt them to my need, as:

perc75 = np.vectorize(lambda x: np.percentile(x, 75))
df['75th_percentile'] = perc75(df['WTG01'].values)

which work but only for one column.

or

df['25th_percentile'] = df['WTG1','WTG2','WTG3'].apply(lambda x: np.percentile(x, 25))

which does not work...

CodePudding user response:

I think you could transpose the DataFrame and apply df.describe()

import pandas as pd
df = pd.DataFrame({'WTG1': [61.25, 19.69, 59.51, 131.81],
                   'WTG2': [-7.57, 25.95, 81.22, 154.07],
                   'WTG3': [7.18, 28.67, 78.22, 142.92]
                   })
print(df)
print(df.T)

Output

     WTG1    WTG2    WTG3
0   61.25   -7.57    7.18
1   19.69   25.95   28.67
2   59.51   81.22   78.22
3  131.81  154.07  142.92

          0      1      2       3
WTG1  61.25  19.69  59.51  131.81
WTG2  -7.57  25.95  81.22  154.07
WTG3   7.18  28.67  78.22  142.92

In the question, you're trying to get the statistics for each row. After you transpose the DataFrame, you could look at columns instead, so you'd get the summary statistics of each column conveniently

print(df.T.describe())

Output

               0          1          2           3
count   3.000000   3.000000   3.000000    3.000000
mean   20.286667  24.770000  72.983333  142.933333
std    36.233778   4.604824  11.764269   11.130006
min    -7.570000  19.690000  59.510000  131.810000
25%    -0.195000  22.820000  68.865000  137.365000
50%     7.180000  25.950000  78.220000  142.920000
75%    34.215000  27.310000  79.720000  148.495000
max    61.250000  28.670000  81.220000  154.070000
  • Related