I have the following problem. Suppose I have a wide data Frame consisting of three columns (mock example follows below). Essentially, it consists of three factors, A
, B
and C
for which I have certain values for each business day within a time range.
import pandas as pd
import numpy as np
index_d = pd.bdate_range(start='10/5/2022', end='10/27/2022')
index = np.repeat(index_d,3)
values = np.random.randn(3*len(index_d), 1)
columns_v = len(index_d)*["A","B","C"]
df = pd.DataFrame()
df["x"] = np.asarray(index)
df["y"] = values
df["factor"] = np.asarray([columns_v]).T
I would like to plot the business weekly averages of the the three factors along time. A business week goes from Monday to Friday. However, in the example above I start within a week and end within a week. That means the first weekly averages consist only of the data points on 5th, 6th and 7th of October. Similar for the last week. Ideally, the output should have the form
dt1 = dt.datetime.strptime("20221007", "%Y%m%d").date()
dt2 = dt.datetime.strptime("20221014", "%Y%m%d").date()
dt3 = dt.datetime.strptime("20221021", "%Y%m%d").date()
dt4 = dt.datetime.strptime("20221027", "%Y%m%d").date()
d = 3*[dt1, dt2, dt3, dt4]
values = np.random.randn(len(d), 1)
factors = 4*["A","B","C"]
df_output = pd.DataFrame()
df_output["time"] = d
df_output["values"] = values
df_output["factors"] = factors
I can then plot the weekly averages using seaborn as a lineplot with hue
. Important to note is that the respective time value for weekly average is always the last business day in that week (Friday except for the last, where it is a Thursday).
I was thinking of groupby. However, given that my real data is much larger and has possibly some NaN
I'm not sure how to do it. In particular with regards to the random start / end points that don't need to be Monday / Friday.
CodePudding user response:
You can add a column with the calendar week:
df['week'] = df.x.dt.isocalendar().week
Get a mask for all the Fridays, and for the last day:
last_of_week = (df.x.dt.isocalendar().day == 5).values
last_of_week[-1] = True
Get the actual dates:
last_days = df.x[last_of_week].unique()
Group by week
and factor
, take the mean:
res = df.groupby(['week', 'factor']).mean().reset_index()
Clean up:
res = res.drop('week', axis=1)
res['x'] = pd.Series(last_days).repeat(3).reset_index(drop=True)
CodePudding user response:
Try as follows:
res = df.groupby([pd.Grouper(key='x', freq='W-FRI'),df.factor])['y'].mean()\
.reset_index(drop=False)
res = res.rename(columns={'x':'time','factor':'factors','y':'values'})
res['time'] = res.time.map(pd.merge_asof(df.x, res.time, left_on='x',
right_on='time', direction='forward')\
.groupby('time').last()['x']).astype(str)
print(res)
time factors values
0 2022-10-07 A 0.171228
1 2022-10-07 B -0.250432
2 2022-10-07 C -0.126960
3 2022-10-14 A 0.455972
4 2022-10-14 B 0.582900
5 2022-10-14 C 0.104652
6 2022-10-21 A -0.526221
7 2022-10-21 B 0.371007
8 2022-10-21 C 0.012099
9 2022-10-27 A -0.123510
10 2022-10-27 B -0.566441
11 2022-10-27 C -0.652455
Plot data:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
fig, ax = plt.subplots(figsize=(8,5))
ax = sns.lineplot(data=res, x='time', y='values', hue='factors')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.show()
Result:
Explanation
- First, apply
df.groupby
. Grouping byfactor
is of course easy; for the dates we can usepd.Grouper
withfreq
parameter set toW-FRI
(each week through to Friday), and then we want to get themean
for columny
(NaN
values will just be ignored). - In the next step, let's use
df.rename
to rename the columns. - We are basically done now, except for the fact that
pd.Grouper
will use each Friday (even if it isn't present in the actual set). E.g.:
print(res.time.unique())
['2022-10-07T00:00:00.000000000' '2022-10-14T00:00:00.000000000'
'2022-10-21T00:00:00.000000000' '2022-10-28T00:00:00.000000000']
- If you are OK with this, you can just start plotting (but see below). If you would like to get
'2022-10-27'
instead of'2022-10-28'
, we can combineSeries.map
applied to columntime
withpd.merge_asof
,and perform anothergroupby
to getlast
in columnx
. I.e. this will get us the closest match to each Friday within each week (so, in fact just Friday in all cases, except the last:2022-10-17
). - In either scenario, before plotting, make sure to turn the datetime values into strings:
res['time'] = res['time'].astype(str)
!