Home > Enterprise >  Pandas Dataframe Timedelta mean() not showing correct values
Pandas Dataframe Timedelta mean() not showing correct values

Time:09-28

I've searched and searched and haven't found a solution or just an answer to why this is happening.

I'm trying to determine the mean() per Borough for Request_Closing_Time in my pandas dataframe. When I groupby Borough and then get the mean for Request Closing Time i'm getting very large values, 13 to 44 days and then even 87,426 days for one result. The max value is only 24 days and the average for the whole dataframe is only 4 hours. If I single out an individual Borough with the get_group command then I seem to get a more normal answer which I believe is correct.

Any ideas as to why values for mean are so far off when groupby is by Borough only?

In [283]:nyc_sr_calls_trimmed.describe()
Out[283]:        Unique Key     Request_Closing_Time       Incident Zip
         count  3.006980e 05    298534  298083.000000
         mean   3.130054e 07    0 days 04:18:51.832782865   10848.888645
         std    5.738547e 05    0 days 06:05:22.141833856   583.182081
         min    3.027948e 07    0 days 00:01:00 83.000000
         25%    3.080118e 07    0 days 01:16:33 10310.000000
         50%    3.130436e 07    0 days 02:42:55.500000  11208.000000
         75%    3.178446e 07    0 days 05:21:00 11238.000000
         max    3.231065e 07    24 days 16:52:22    11697.000000

In [284]: nyc_sr_calls_trimmed['Request_Closing_Time'].mean()
Out[284]:Timedelta('0 days 04:18:51.832782865')

In [285]:by_burrough = nyc_sr_calls_trimmed.groupby(['Borough'])
         by_burrough['Request_Closing_Time'].mean(numeric_only=None).dt.floor('s')

Out[285]:Borough
         BRONX              -13 days  03:21:47
         BROOKLYN           -13 days  03:18:24
         MANHATTAN          -36 days  14:38:51
         QUEENS             -16 days  07:37:05
         STATEN ISLAND      -44 days  22:01:24
         Unspecified     -87426 days  14:15:03
         Name: Request_Closing_Time, dtype: timedelta64[ns]

In [286]:b = by_burrough.get_group('QUEENS')
         b['Request_Closing_Time'].mean(numeric_only=False)

Out[286]:Timedelta('0 days 04:52:16.450111002')

Any help would be appreciated. I'm not sure what i'm not thinking of or accounting for. Thanks.

CodePudding user response:

"If I single out an individual Borough with the get_group command then I seem to get a more normal answer which I believe is correct".

Since you mentioned the above, something I would try is to get the mean of Request_Closing_Time for each Borough individually, as follows:

by_burrough["Request_Closing_Time"].apply(lambda df_group: df_group.mean(numeric_only=False))

or

by_burrough.apply(lambda df_group: df_group["Request_Closing_Time"].mean(numeric_only=False))
  • Related