Pandas calculate mean using another column as condition

I have a dataframe like this:

         observation_time  temperature
2021-11-24 10:00:00+00:00  7.22
2021-11-24 10:30:00+00:00  7.33
2021-11-24 11:00:00+00:00  7.44
2021-11-24 11:30:00+00:00  7.50
2021-11-24 12:00:00+00:00  7.50
2021-11-24 12:30:00+00:00  7.50
2021-11-24 13:00:00+00:00  7.44
2021-11-24 13:30:00+00:00  7.61
2021-11-24 14:00:00+00:00  7.67
2021-11-24 14:30:00+00:00  7.78
...
2021-11-27 08:30:00+00:00  7.22
2021-11-27 09:00:00+00:00  7.33
2021-11-27 09:30:00+00:00  7.44
2021-11-27 10:00:00+00:00  7.50

with a time step of 30 minutes. In theory it should cover 3 days, but it's very likely that some data is missing. (The dataframe is just an example; I didn't include all the values.)

I would like to calculate the average over the 3 days for each time value. For example, at 10:30 I need the temperature values for the days 2021-11-24, 2021-11-25 and 2021-11-26, but only if all 3 values exist; otherwise the result should be NaN.

I started to use groupby and the Grouper:

df[["observation_time", "temperature"]].groupby(pd.Grouper(key="observation_time", freq="30min", offset="0m", label="right")).mean()

or:

df[["observation_time", "temperature"]].groupby(pd.Grouper(key="observation_time", freq="24H", offset="0m", label="right")).mean()

but, of course, they are not the right solutions.

Then I tried to count the number of rows for each time value:

num = df[["observation_time", "temperature"]].groupby(df["observation_time"].dt.time)["temperature"].count().reset_index()

num.rename(columns={"observation_time": "observation_hour", "temperature": "count_temperature"}, inplace=True)

and merge this dataframe to the original one using the common column observation_hour:

df["observation_hour"] = df["observation_time"].dt.time
df = pd.merge(df, num, how="left")

so that the same dataframe holds the time, the temperature and how many times the same hour is repeated.

I could use groupby to group by the hour, but I don't know how to apply the condition on count_temperature.

I should add to this one:

df.groupby("observation_hour")["temperature"].mean()

something like if df["count_temperature"] == 3 somewhere, I don't know where.

The final result should be something like (just an example):

observation_hour  mean_temperature
00:00:00+00:00  7.22
00:30:00+00:00  7.44
01:00:00+00:00  NaN
01:30:00+00:00  7.44
...
22:00:00+00:00  7.44
22:30:00+00:00  NaN
23:00:00+00:00  7.44
23:30:00+00:00  NaN

with NaN wherever the mean can't be calculated because we don't have 3 temperature values.

Any idea?

CodePudding user response:

You can extract the time from the datetime column and group by time only. If a time slot has fewer than 3 observations, its mean is set to NaN:

import numpy as np
import pandas as pd

# All 48 half-hour slots of a day (periods=48 avoids a duplicate 00:00 at the end)
t = pd.date_range("2022-01-01", periods=48, freq="30min").time

grp = df.groupby(df["observation_time"].dt.time)
result = (
    grp["temperature"].mean()     # Calculate the mean temperature for each 30-min slot
    .mask(grp.size() < 3, np.nan) # If the slot has fewer than 3 observations, make it NaN
    .reindex(t)                   # Make sure we have all slots of a day
    .reset_index()
)
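
To try the idea end to end, here is a minimal self-contained sketch. The sample data is made up (three days of evenly spaced temperatures with one reading deliberately removed), and the output column names observation_hour and mean_temperature are simply taken from the desired result in the question. It is a small variation on the snippet above: it uses where with count() == 3, so a row whose temperature is itself NaN does not count as an observation, and a slot with more than 3 readings is also rejected. Whether that stricter rule is what you want is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 3 full days of half-hourly readings (UTC).
times = pd.date_range("2021-11-24", periods=3 * 48, freq="30min", tz="UTC")
df = pd.DataFrame({
    "observation_time": times,
    "temperature": np.linspace(7.0, 8.0, len(times)),
})

# Simulate a gap: drop the 01:00 reading on the second day.
df = df[df["observation_time"] != pd.Timestamp("2021-11-25 01:00", tz="UTC")]

# Every half-hour slot of a day, so slots with no data at all still appear.
slots = pd.date_range("2022-01-01", periods=48, freq="30min").time

# Group the temperatures by time of day, ignoring the date.
grp = df.groupby(df["observation_time"].dt.time)["temperature"]

result = (
    grp.mean()
    .where(grp.count() == 3)          # keep the mean only if exactly 3 values exist
    .reindex(slots)                   # add any slot that has no rows at all
    .rename_axis("observation_hour")
    .reset_index(name="mean_temperature")
)

print(result.head())
```

With this sample data the 01:00:00 slot comes out as NaN, because only two of the three days have a reading for it, while every other slot keeps its 3-day mean.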