Pandas: Sum of column produces unexpected negative values or NaT when used on GroupBy object


I have a dataframe that contains run data of test cases. The most important metric in that dataframe is the column 'Elapsed Time', a timedelta that holds the run time of a specific test case.

The dataset looks like this (nothing is sorted, even if it might seem so, by the way):

      Test key          Started At         Finished At    Elapsed Time  Version
0    TEST-1676 2021-06-10 14:40:00 2021-06-10 15:24:00 0 days 00:44:00  8.0.1.0
1    TEST-1518 2021-06-11 12:14:00 2021-06-11 12:36:00 0 days 00:22:00  8.0.1.0
2    TEST-1518 2021-06-11 09:29:00 2021-06-11 09:44:00 0 days 00:15:00  8.0.1.0

...

1037 TEST-1140 2018-11-28 09:35:00 2018-11-28 10:35:00 0 days 01:00:00      NaN
1038 TEST-1138 2018-11-28 10:56:00 2018-11-28 11:08:00 0 days 00:12:00      NaN

My questions:

  • What am I doing wrong?
  • Why is the sum different when applied to all groups at once?
  • How come I get a negative timedelta when summing?

Edit:

Here is the code that produces the faulty output for me:

https://pastebin.com/50qPnnA0

CodePudding user response:

You have 8 rows with NaN as Version. By default, groupby drops the NaN group, hence the missing 6h 50min.

Use:

df_runs.groupby(['Version'], dropna=False)['Elapsed Time'].sum()

output:

Version
7.1.0.0   136 days 15:27:00
7.2.0.0    17 days 15:23:00
8.0.0.0     4 days 05:17:00
8.0.1.0     0 days 06:09:00
NaN         0 days 06:50:00
Name: Elapsed Time, dtype: timedelta64[ns]
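
For context, here is a minimal, self-contained sketch of that behaviour with made-up numbers (not the asker's data); note that the dropna parameter of groupby requires pandas 1.1 or newer:

import pandas as pd

# Toy data: one row has a missing Version.
df = pd.DataFrame({
    'Version': ['8.0.1.0', '8.0.1.0', None],
    'Elapsed Time': pd.to_timedelta(['00:44:00', '00:22:00', '00:15:00']),
})

# Default behaviour: the NaN group is dropped, so its 15 minutes vanish.
print(df.groupby('Version')['Elapsed Time'].sum())

# dropna=False keeps the NaN group in the result.
print(df.groupby('Version', dropna=False)['Elapsed Time'].sum())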

CodePudding user response:

After seeing mozway's answer and some more comments, it seems he didn't have any problems with the data.

I then checked my data for NaN values using:

df_na = df_runs[df_runs.isna().any(axis=1)]
df_na

which returned several rows that didn't have any dates filled in.

That's a failure in the given data: there shouldn't be ANY NaN values in the date columns, because a test run cannot finish without them.

However, this shouldn't matter for the sum() function, since NaN values are simply ignored. This is shown by using sum() on the individual groups, where it works.
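
For illustration, a plain column sum does skip NaT by default (toy series, not the asker's data):

import pandas as pd

# NaT entries are skipped by default (skipna=True), so the plain sum still works.
s = pd.to_timedelta(pd.Series(['00:10:00', None, '00:05:00']))
print(s.sum())              # 0 days 00:15:00
print(s.sum(skipna=False))  # NaT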

Why does this produce faulty values on my machine? - I don't know.

How did I fix it?

Either by dropping the NaN values or by replacing them with zeros:

# EITHER: drop NaN values
df_runs = df_runs.dropna()
# OR: replace NaN with Timedelta zero
df_runs['Elapsed Time'] = df_runs['Elapsed Time'].fillna(pd.Timedelta(0))
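
As a quick sanity check after either fix (a sketch, assuming df_runs is the frame from above), the per-group sums should now add up to the plain column total:

# Grouped sums and the overall total should agree once NaN handling is consistent.
per_group = df_runs.groupby(['Version'], dropna=False)['Elapsed Time'].sum()
print(per_group)
print('Total:', df_runs['Elapsed Time'].sum())
print('Sum of groups:', per_group.sum())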