I have a dataframe with the column date_time in datetime64[ns] format:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 filename 235 non-null object
1 date_time 235 non-null datetime64[ns]
2 r 235 non-null float64
The df:
filename date_time r
0 01_ 2022-05-24 12:07:06 3.2E 05
1 01_ 2022-05-24 12:08:15 3.1E 05
2 01_ 2022-05-24 12:09:23 2.9E 05
3 02_ 2022-05-24 12:10:43 5.0E 06
4 04_ 2022-05-24 12:38:26 5.6E 06
.. ... ... ...
230 91_ 2022-05-26 09:57:50 8.9E 06
231 91_ 2022-05-26 09:59:06 8.3E 06
232 91_ 2022-05-26 10:00:23 8.5E 06
233 91_ 2022-05-26 10:01:40 9.0E 06
234 91_ 2022-05-26 10:02:57 9.1E 06
Calculating the mean of date_time, grouped by filename, only works using:
df_.groupby(["filename"]).agg(["mean"])
and not with:
df_.groupby(["filename"]).mean()
df_.groupby(["filename"]).agg("mean")
Why does it only work with df_.groupby(["filename"]).agg(["mean"])?
Below is the code with an example:
print("works with:")
print(df_.groupby(["filename"]).agg(["mean"]))
print ("doesn't work with: (no date_time column showing)")
print(df_.groupby(["filename"]).mean())
print(df_.groupby(["filename"]).agg("mean"))
OUT:
works with:
date_time r
mean mean
filename
01_ 2022-05-24 12:08:14.666666752 3.1E 05
02_ 2022-05-24 12:10:43.000000000 5.0E 06
04_ 2022-05-24 12:39:34.999999744 5.2E 06
05_ 2022-05-24 12:42:54.000000000 7.5E 04
06_ 2022-05-24 12:47:06.000000000 3.4E 05
... ... ...
87_ 2022-05-25 16:44:56.000000000 9.5E 06
88_ 2022-05-26 09:15:00.875000064 1.1E 05
89_ 2022-05-26 09:29:22.357143040 8.3E 06
90_ 2022-05-26 09:45:32.500000000 1.1E 05
91_ 2022-05-26 09:55:16.384615424 8.9E 06
[75 rows x 2 columns]
doesn't work with: (no date_time column showing)
r
filename
01_ 3.1E 05
02_ 5.0E 06
04_ 5.2E 06
05_ 7.5E 04
06_ 3.4E 05
... ...
87_ 9.5E 06
88_ 1.1E 05
89_ 8.3E 06
90_ 1.1E 05
91_ 8.9E 06
[75 rows x 1 columns]
r
filename
01_ 3.1E 05
02_ 5.0E 06
04_ 5.2E 06
05_ 7.5E 04
06_ 3.4E 05
... ...
87_ 9.5E 06
88_ 1.1E 05
89_ 8.3E 06
90_ 1.1E 05
91_ 8.9E 06
[75 rows x 1 columns]
CodePudding user response:
When having only a datetime column besides the grouped by column then it works in all cases. But when you have more columns, agg seems to only work on numeric values.
The doc strings say: https://github.com/pandas-dev/pandas/blob/df32e83f36bf485be803be2b87d23135be30540a/pandas/core/base.py#L301
if arg is a string, then try to operate on it:
- try to find a function (or attribute) on ourselves
- try to find a numpy function
They also mention: people may try to aggregate on a non-callable attribute but don't let them think they can pass args to it
I tried using np.mean and it gave the same result as when using ('mean').
This is as far as I reached. Hope this helps.
CodePudding user response:
I posted the issue here on GITHUB, and this was the solution provided:
The issue is in groupby param "numeric_only" = True (set as defualt), which includes only numeric data (removing the datetime64[ns] column).
It worked when passing a ["mean"] as list since: when using list or dict in agg, the DataFrame is broken up into Series before each function is applied.