I'm trying to aggregate a DataFrame across multiple columns, grouped by Date. Some cells contain NaN, which I would like to keep. My df looks like this:
import random
import numpy as np
import pandas as pd
rng = ['2015-02-24','2015-02-24','2015-02-24','2015-02-24','2015-02-24','2015-02-24',
'2015-02-25','2015-02-25','2015-02-25','2015-02-25','2015-02-25','2015-02-25']
rng = pd.to_datetime(rng)
Predicted = [random.randrange(-1, 50, 1) for i in range(12)]
Actual_data = [5,3,8,2,9,20, np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
Category = ['A','A','A','B','B','B','A','A','A','B','B','B']
df = pd.DataFrame({ 'Date': rng, 'Predicted' : Predicted, 'Actual': Actual_data, 'Category': Category})
df
Date Predicted Actual Category
0 2015-02-24 0 5.0 A
1 2015-02-24 36 3.0 A
2 2015-02-24 30 8.0 A
3 2015-02-24 33 2.0 B
4 2015-02-24 49 9.0 B
5 2015-02-24 42 20.0 B
6 2015-02-25 25 NaN A
7 2015-02-25 9 NaN A
8 2015-02-25 21 NaN A
9 2015-02-25 39 NaN B
10 2015-02-25 17 NaN B
11 2015-02-25 11 NaN B
I want to group by Date, summing Predicted and Actual and leaving Category out. For the groups that contain only NaN, I want to keep the NaN value rather than have it turned into 0. So in the end I would like to have this:
Date Predicted Actual
0 2015-02-24 190 47.0
1 2015-02-25 122 NaN
I tried things like:
df = df.groupby(['Date'])[['Predicted', 'Actual']].agg('sum').reset_index()
which looked like it worked, but then I saw that it transforms NaN to 0, which I don't want. I tried adding skipna=False, but it's not accepted as an argument here. I also updated to the latest version of pandas, as I read somewhere that the version might be the problem, but it made no difference. Could someone help me?
CodePudding user response:
You can use min_count=1, which means that at least 1 valid (non-NaN) value is required to perform the operation; otherwise the result is NaN:
df.groupby(['Date'])[['Predicted', 'Actual']].sum(min_count=1)
Output
Predicted Actual
Date
2015-02-24 124 47.0
2015-02-25 162 NaN
For why this works while skipna=False doesn't, see the discussion on GitHub.
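A minimal sketch of the min_count behavior on a toy Series (not the question's data), showing how the default sum turns an all-NaN group into 0.0 while min_count=1 preserves NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# By default, Series.sum() skips NaN, so an all-NaN series sums to 0.0.
default_sum = s.sum()

# With min_count=1, at least one non-NaN value is required;
# otherwise the result stays NaN.
guarded_sum = s.sum(min_count=1)

print(default_sum, guarded_sum)  # 0.0 nan
```

The same min_count keyword is what groupby(...).sum(min_count=1) forwards per group, which is why the 2015-02-25 row keeps its NaN.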
CodePudding user response:
Perhaps you could write a lambda function that returns the sum if any of the values are not null, and otherwise returns NaN:
df.groupby(['Date'])[['Predicted', 'Actual']].agg(lambda x: x.sum() if x.notna().any() else np.nan).reset_index()
Output
Date Predicted Actual
0 2015-02-24 174 47.0
1 2015-02-25 164 NaN
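A self-contained version of this lambda approach on a reduced toy frame (only Date and Actual, invented values) to show the per-group logic in isolation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-02-24", "2015-02-24", "2015-02-25"]),
    "Actual": [5.0, 3.0, np.nan],
})

# For each group: sum the non-NaN values if any exist, else return NaN.
out = (
    df.groupby("Date")["Actual"]
      .agg(lambda x: x.sum() if x.notna().any() else np.nan)
      .reset_index()
)
print(out)  # 2015-02-24 -> 8.0, 2015-02-25 -> NaN
```

Note the notna().any() guard: checking truthiness with any(x) directly would not work, since NaN itself is truthy in Python.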