I have a dataframe like this:
observation_time temperature
2021-11-24 10:00:00+00:00 7.22
2021-11-24 10:30:00+00:00 7.33
2021-11-24 11:00:00+00:00 7.44
2021-11-24 11:30:00+00:00 7.50
2021-11-24 12:00:00+00:00 7.50
2021-11-24 12:30:00+00:00 7.50
2021-11-24 13:00:00+00:00 7.44
2021-11-24 13:30:00+00:00 7.61
2021-11-24 14:00:00+00:00 7.67
2021-11-24 14:30:00+00:00 7.78
...
2021-11-27 08:30:00+00:00 7.22
2021-11-27 09:00:00+00:00 7.33
2021-11-27 09:30:00+00:00 7.44
2021-11-27 10:00:00+00:00 7.50
with a time step of 30 minutes. In theory it should cover 3 days, but it's very likely that some data is missing. (The dataframe is just an example; I didn't include all the values.)
I would like to calculate the average over the 3 days for each time of day. For example, at 10:30 I need the temperature values for 2021-11-24, 2021-11-25 and 2021-11-26, but only if all 3 values exist; otherwise the result should be NaN.
I started to use groupby and the Grouper:
df[["observation_time", "temperature"]].groupby(pd.Grouper(key="observation_time", freq="30min", offset="0m", label="right")).mean()
or:
df[["observation_time", "temperature"]].groupby(pd.Grouper(key="observation_time", freq="24H", offset="0m", label="right")).mean()
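but, of course, neither is the right solution: the first bins by each 30-minute timestamp and the second by each day, so neither averages the same time of day across the days.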
Then I was trying to calculate the number of rows for each time value:
num = df[["observation_time", "temperature"]].groupby(df["observation_time"].dt.time)["temperature"].count().reset_index()
num.rename(columns={"observation_time": "observation_hour", "temperature": "count_temperature"}, inplace=True)
and merge this dataframe into the original one using the common column observation_hour:
df["observation_hour"] = df["observation_time"].dt.time
df = pd.merge(df, num, how="left")
so that the same dataframe holds the time, the temperature and how many times the same hour is repeated.
I could use groupby to group by the hours, but I don't know how to apply the condition on count_temperature.
I should add to this one:
df.groupby("observation_hour")["temperature"].mean()
something like if df["count_temperature"] == 3 somewhere, but I don't know where.
The final result should be something like (just an example):
observation_hour mean_temperature
00:00:00+00:00 7.22
00:30:00+00:00 7.44
01:00:00+00:00 NaN
01:30:00+00:00 7.44
...
22:00:00+00:00 7.44
22:30:00+00:00 NaN
23:00:00+00:00 7.44
23:30:00+00:00 NaN
with NaN wherever the mean can't be calculated because we don't have 3 temperature values.
Any idea?
CodePudding user response:
You can extract the time from the datetime column and group by time only. If a time slot has fewer than 3 observations, its mean is set to NaN:
import numpy as np
import pandas as pd

# All 48 half-hour slots of a day (periods=48 avoids listing 00:00 twice)
t = pd.date_range("2022-01-01", periods=48, freq="30min").time
grp = df.groupby(df["observation_time"].dt.time)
result = (
    grp["temperature"].mean()  # Mean temperature for each 30-min slot
    .mask(grp["temperature"].count() < 3, np.nan)  # Fewer than 3 non-missing values: NaN
    .reindex(t)  # Make sure every slot of the day is present
    .reset_index()
)
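If you prefer to keep the count next to the mean, as in your count_temperature attempt, the same idea can be sketched with a named aggregation. This is only a minimal example under assumptions: the toy data (3 full days with a single reading dropped at 01:00 on the second day) is made up for illustration, and the output column names observation_hour and mean_temperature are taken from your expected result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 days of 30-minute readings, with one reading dropped.
times = pd.date_range("2021-11-24", periods=3 * 48, freq="30min", tz="UTC")
df = pd.DataFrame({
    "observation_time": times,
    "temperature": np.linspace(7.0, 8.0, len(times)),
})
df = df[df["observation_time"] != pd.Timestamp("2021-11-25 01:00", tz="UTC")]

# Mean and number of non-missing readings for each time of day.
stats = (
    df.groupby(df["observation_time"].dt.time)["temperature"]
    .agg(mean_temperature="mean", count_temperature="count")
)

# Keep the mean only where all 3 days contributed a value.
all_slots = pd.date_range("2021-01-01", periods=48, freq="30min").time
result = (
    stats["mean_temperature"]
    .where(stats["count_temperature"] == 3)  # anything else becomes NaN
    .reindex(all_slots)                      # one row per half-hour slot
    .rename_axis("observation_hour")
    .reset_index()
)
print(result.head())
```

With this toy data, only the 01:00 slot should come out as NaN, because it is the only one with 2 readings instead of 3. Using count rather than size means rows whose temperature is itself NaN are not counted as observations.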