Home > Blockchain >  Pandas groupby mean not working on datetime column
Pandas groupby mean not working on datetime column

Time:06-02

I have a dataframe with the column date_time in datetime64[ns] format:

 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   filename   235 non-null    object        
 1   date_time  235 non-null    datetime64[ns]
 2   r          235 non-null    float64  

The df:

    filename           date_time       r
0        01_ 2022-05-24 12:07:06 3.2E 05
1        01_ 2022-05-24 12:08:15 3.1E 05
2        01_ 2022-05-24 12:09:23 2.9E 05
3        02_ 2022-05-24 12:10:43 5.0E 06
4        04_ 2022-05-24 12:38:26 5.6E 06
..       ...                 ...     ...
230      91_ 2022-05-26 09:57:50 8.9E 06
231      91_ 2022-05-26 09:59:06 8.3E 06
232      91_ 2022-05-26 10:00:23 8.5E 06
233      91_ 2022-05-26 10:01:40 9.0E 06
234      91_ 2022-05-26 10:02:57 9.1E 06

Calculating the mean of date_time, grouped by filename, only works using:

df_.groupby(["filename"]).agg(["mean"])

and not with:

df_.groupby(["filename"]).mean()
df_.groupby(["filename"]).agg("mean")

Why does it only work with df_.groupby(["filename"]).agg(["mean"])?

Below is the code with an example:

print("works with:")
print(df_.groupby(["filename"]).agg(["mean"]))
print ("doesn't work with: (no date_time column showing)")
print(df_.groupby(["filename"]).mean())
print(df_.groupby(["filename"]).agg("mean"))

OUT: 

works with:
                             date_time       r
                                  mean    mean
filename                                      
01_      2022-05-24 12:08:14.666666752 3.1E 05
02_      2022-05-24 12:10:43.000000000 5.0E 06
04_      2022-05-24 12:39:34.999999744 5.2E 06
05_      2022-05-24 12:42:54.000000000 7.5E 04
06_      2022-05-24 12:47:06.000000000 3.4E 05
...                                ...     ...
87_      2022-05-25 16:44:56.000000000 9.5E 06
88_      2022-05-26 09:15:00.875000064 1.1E 05
89_      2022-05-26 09:29:22.357143040 8.3E 06
90_      2022-05-26 09:45:32.500000000 1.1E 05
91_      2022-05-26 09:55:16.384615424 8.9E 06

[75 rows x 2 columns]
doesn't work with: (no date_time column showing)
               r
filename        
01_      3.1E 05
02_      5.0E 06
04_      5.2E 06
05_      7.5E 04
06_      3.4E 05
...          ...
87_      9.5E 06
88_      1.1E 05
89_      8.3E 06
90_      1.1E 05
91_      8.9E 06

[75 rows x 1 columns]
               r
filename        
01_      3.1E 05
02_      5.0E 06
04_      5.2E 06
05_      7.5E 04
06_      3.4E 05
...          ...
87_      9.5E 06
88_      1.1E 05
89_      8.3E 06
90_      1.1E 05
91_      8.9E 06

[75 rows x 1 columns]

CodePudding user response:

When having only a datetime column besides the grouped by column then it works in all cases. But when you have more columns, agg seems to only work on numeric values.

The doc strings say: https://github.com/pandas-dev/pandas/blob/df32e83f36bf485be803be2b87d23135be30540a/pandas/core/base.py#L301

if arg is a string, then try to operate on it:

    - try to find a function (or attribute) on ourselves
    - try to find a numpy function

They also mention: people may try to aggregate on a non-callable attribute but don't let them think they can pass args to it

I tried using np.mean and it gave the same result as when using ('mean').

This is as far as I reached. Hope this helps.

CodePudding user response:

I posted the issue here on GITHUB, and this was the solution provided:

The issue is in groupby param "numeric_only" = True (set as defualt), which includes only numeric data (removing the datetime64[ns] column).

Doc GroupBy.mean

It worked when passing a ["mean"] as list since: when using list or dict in agg, the DataFrame is broken up into Series before each function is applied.

  • Related