How to get business weekly averages per method in pandas-CodePudding

I have the following problem. Suppose I have a wide data Frame consisting of three columns (mock example follows below). Essentially, it consists of three factors, A, B and C for which I have certain values for each business day within a time range.

import pandas as pd
import numpy as np

index_d = pd.bdate_range(start='10/5/2022', end='10/27/2022')
index = np.repeat(index_d,3)
values = np.random.randn(3*len(index_d), 1)
columns_v = len(index_d)*["A","B","C"]
df = pd.DataFrame()
df["x"] = np.asarray(index)
df["y"] = values
df["factor"] = np.asarray([columns_v]).T

I would like to plot the business weekly averages of the the three factors along time. A business week goes from Monday to Friday. However, in the example above I start within a week and end within a week. That means the first weekly averages consist only of the data points on 5th, 6th and 7th of October. Similar for the last week. Ideally, the output should have the form

dt1 = dt.datetime.strptime("20221007", "%Y%m%d").date()
dt2 = dt.datetime.strptime("20221014", "%Y%m%d").date()
dt3 = dt.datetime.strptime("20221021", "%Y%m%d").date()
dt4 = dt.datetime.strptime("20221027", "%Y%m%d").date()

d = 3*[dt1, dt2, dt3, dt4]
values = np.random.randn(len(d), 1)
factors = 4*["A","B","C"]
df_output = pd.DataFrame()
df_output["time"] = d
df_output["values"] = values
df_output["factors"] = factors

I can then plot the weekly averages using seaborn as a lineplot with hue. Important to note is that the respective time value for weekly average is always the last business day in that week (Friday except for the last, where it is a Thursday).

I was thinking of groupby. However, given that my real data is much larger and has possibly some NaN I'm not sure how to do it. In particular with regards to the random start / end points that don't need to be Monday / Friday.

CodePudding user response：

You can add a column with the calendar week:

df['week'] = df.x.dt.isocalendar().week

Get a mask for all the Fridays, and for the last day:

last_of_week = (df.x.dt.isocalendar().day == 5).values
last_of_week[-1] = True

Get the actual dates:

last_days = df.x[last_of_week].unique()

Group by week and factor, take the mean:

res = df.groupby(['week', 'factor']).mean().reset_index()

Clean up:

res = res.drop('week', axis=1)
res['x'] = pd.Series(last_days).repeat(3).reset_index(drop=True)

CodePudding user response：

Try as follows:

res = df.groupby([pd.Grouper(key='x', freq='W-FRI'),df.factor])['y'].mean()\
    .reset_index(drop=False)

res = res.rename(columns={'x':'time','factor':'factors','y':'values'})

res['time'] = res.time.map(pd.merge_asof(df.x, res.time, left_on='x', 
                                         right_on='time', direction='forward')\
                           .groupby('time').last()['x']).astype(str)

print(res)

          time factors    values
0   2022-10-07       A  0.171228
1   2022-10-07       B -0.250432
2   2022-10-07       C -0.126960
3   2022-10-14       A  0.455972
4   2022-10-14       B  0.582900
5   2022-10-14       C  0.104652
6   2022-10-21       A -0.526221
7   2022-10-21       B  0.371007
8   2022-10-21       C  0.012099
9   2022-10-27       A -0.123510
10  2022-10-27       B -0.566441
11  2022-10-27       C -0.652455

Plot data:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()

fig, ax = plt.subplots(figsize=(8,5))
ax = sns.lineplot(data=res, x='time', y='values', hue='factors')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.show()

Result:

Explanation

First, apply df.groupby. Grouping by factor is of course easy; for the dates we can use pd.Grouper with freq parameter set to W-FRI (each week through to Friday), and then we want to get the mean for column y (NaN values will just be ignored).
In the next step, let's use df.rename to rename the columns.
We are basically done now, except for the fact that pd.Grouper will use each Friday (even if it isn't present in the actual set). E.g.:

print(res.time.unique())

['2022-10-07T00:00:00.000000000' '2022-10-14T00:00:00.000000000'
 '2022-10-21T00:00:00.000000000' '2022-10-28T00:00:00.000000000']

If you are OK with this, you can just start plotting (but see below). If you would like to get '2022-10-27' instead of '2022-10-28', we can combine Series.map applied to column time with pd.merge_asof,and perform another groupby to get last in column x. I.e. this will get us the closest match to each Friday within each week (so, in fact just Friday in all cases, except the last: 2022-10-17).
In either scenario, before plotting, make sure to turn the datetime values into strings: res['time'] = res['time'].astype(str)!