Home > Software design >  Calculating the mean for 4th quartile for multiple columns at an interval
Calculating the mean for 4th quartile for multiple columns at an interval

Time:04-07

I'm quite new to Python so this might be an easy question.

I'm trying to compare peaks in dataframe between columns over multiple regular intervals, the problems is that there is a delay in the peaks between the columns. The most logical solution to me would be to take the mean of the 4th quartile for each column.

My data looks something like this

    0      1     2 
0  -0,2   0,4  -0,5
1   0,3   0,8   0,4 
2  -0,5   0,4   0,7
3   0,4  -0,6   0,5
etc.

I would like to get mean values of the 4th quartile over every 18000 rows. But i'm already stuck before trying to incorporate the intervals, as when i try np.percentile(df,[100]) I only get a single value and I'm not sure if this is the highest value or the mean of the 4th quartile.

The outcome I had in mind would look something like this

               1    2    3
    0-18000   0,8  0,9  0,8
18001-36000   1,2  0,9  0,9
etc.

CodePudding user response:

My solution:

#generate random data
#note that len(df) is multiple of 18000
df = pd.DataFrame(np.random.rand(18000*4, 3))

df has shape 72000 rows × 3 columns.

#reshape the data to group rows of the same interval
X = df.values.reshape(-1, 18_000, df.shape[1])

X is a np.array of shape (4, 18000, 3).

#compute quartiles for every interval and for every column
quartiles = np.quantile(X, q=0.75, axis=1)

quartiles is a np.array of shape (4, 3), where 4 is the number of intervals, while 3 is the number of variables in the starting dataframe.

#mask (with nan) the values lower than the specific quartile
X[X<quartiles.reshape(quartiles.shape[0],1,quartiles.shape[1])] = np.nan

#compute the mean excluding nans and rebuild the dataframe
df_2 = pd.DataFrame(np.nanmean(X, axis=1))

df_2 contains the means of the 4th quartile values for variables 0, 1 and 2 (columns) in specific intervals (rows). My out-cam:

           0           1           2
0   0.876339    0.877293    0.874126
1   0.872108    0.877760    0.871988
2   0.877431    0.875684    0.874172
3   0.876306    0.877515    0.872103

CodePudding user response:

By definition, there are only 3 quartiles, which are cut-points separating the data evenly into four parts.

The upper quartile corresponds to the 75th percentile, and is a single value (not an interval).

Now, to get that threshold (the upper quartile) for each group of 18_000 rows, a simple way is:

by = 18_000
df.groupby(df.index // by).apply(pd.DataFrame.quantile, .75)

Example


# setup (note that in this case we don't have a multiple of 18_000,
# but that doesn't matter; the last group is simply smaller)
n = 100_000
by = 18_000
df = pd.DataFrame(np.random.uniform(0, 1, (n, 3)))

out = df.groupby(df.index // by).apply(pd.DataFrame.quantile, .75)

>>> out
0.75         0         1         2
0     0.750901  0.745639  0.750561
1     0.750213  0.749832  0.755330
2     0.750073  0.751385  0.752866
3     0.750258  0.754985  0.756654
4     0.741512  0.749209  0.748945
5     0.744568  0.746684  0.749249

Note

If you'd like to see where the values come from:

up = df.groupby(df.index // by).size().cumsum()
ab = np.c_[np.r_[0, up[:-1]], up - 1]
out = out.set_axis([f'{a}-{b}' for a, b in ab])

>>> out
0.75                0         1         2
0-17999      0.750901  0.745639  0.750561
18000-35999  0.750213  0.749832  0.755330
36000-53999  0.750073  0.751385  0.752866
54000-71999  0.750258  0.754985  0.756654
72000-89999  0.741512  0.749209  0.748945
90000-99999  0.744568  0.746684  0.749249

Second interpretation: averages of the upper quart

If instead, you are looking for the average values above the upper quartile (instead of the threshold itself):

def upper_quartile_avg(g):
    return g[g >= g.quantile(.75)].mean()

gb = df.groupby(df.index // by)
up = gb.size().cumsum()
ab = np.c_[np.r_[0, up[:-1]], up - 1]
out = gb.apply(upper_quartile_avg).set_axis([f'{a}-{b}' for a, b in ab])

>>> out
                    0         1         2
0-17999      0.875084  0.871676  0.876100
18000-35999  0.873781  0.876534  0.878645
36000-53999  0.874558  0.873917  0.876843
54000-71999  0.876250  0.877212  0.876409
72000-89999  0.868888  0.874781  0.874565
90000-99999  0.871974  0.873183  0.876113
  • Related