I've searched and searched and haven't found a solution or just an answer to why this is happening.
I'm trying to determine the mean() per Borough for Request_Closing_Time in my pandas dataframe. When I groupby Borough and then get the mean for Request Closing Time i'm getting very large values, 13 to 44 days and then even 87,426 days for one result. The max value is only 24 days and the average for the whole dataframe is only 4 hours. If I single out an individual Borough with the get_group command then I seem to get a more normal answer which I believe is correct.
Any ideas as to why values for mean are so far off when groupby is by Borough only?
In [283]:nyc_sr_calls_trimmed.describe()
Out[283]: Unique Key Request_Closing_Time Incident Zip
count 3.006980e 05 298534 298083.000000
mean 3.130054e 07 0 days 04:18:51.832782865 10848.888645
std 5.738547e 05 0 days 06:05:22.141833856 583.182081
min 3.027948e 07 0 days 00:01:00 83.000000
25% 3.080118e 07 0 days 01:16:33 10310.000000
50% 3.130436e 07 0 days 02:42:55.500000 11208.000000
75% 3.178446e 07 0 days 05:21:00 11238.000000
max 3.231065e 07 24 days 16:52:22 11697.000000
In [284]: nyc_sr_calls_trimmed['Request_Closing_Time'].mean()
Out[284]:Timedelta('0 days 04:18:51.832782865')
In [285]:by_burrough = nyc_sr_calls_trimmed.groupby(['Borough'])
by_burrough['Request_Closing_Time'].mean(numeric_only=None).dt.floor('s')
Out[285]:Borough
BRONX -13 days 03:21:47
BROOKLYN -13 days 03:18:24
MANHATTAN -36 days 14:38:51
QUEENS -16 days 07:37:05
STATEN ISLAND -44 days 22:01:24
Unspecified -87426 days 14:15:03
Name: Request_Closing_Time, dtype: timedelta64[ns]
In [286]:b = by_burrough.get_group('QUEENS')
b['Request_Closing_Time'].mean(numeric_only=False)
Out[286]:Timedelta('0 days 04:52:16.450111002')
Any help would be appreciated. I'm not sure what i'm not thinking of or accounting for. Thanks.
CodePudding user response:
"If I single out an individual Borough with the get_group command then I seem to get a more normal answer which I believe is correct".
Since you mentioned the above, something I would try is to get the mean of Request_Closing_Time
for each Borough individually, as follows:
by_burrough["Request_Closing_Time"].apply(lambda df_group: df_group.mean(numeric_only=False))
or
by_burrough.apply(lambda df_group: df_group["Request_Closing_Time"].mean(numeric_only=False))