Python - aggregate groupby multiple columns, skipping nan cells


I'm trying to aggregate a dataframe across multiple columns, grouped by Date. Some cells contain NaN, which I would like to keep. So my df looks like this:

import random
import numpy as np
import pandas as pd

rng = ['2015-02-24','2015-02-24','2015-02-24','2015-02-24','2015-02-24','2015-02-24',
       '2015-02-25','2015-02-25','2015-02-25','2015-02-25','2015-02-25','2015-02-25']
rng = pd.to_datetime(rng)
Predicted = [random.randrange(-1, 50, 1) for i in range(12)]
Actual_data = [5, 3, 8, 2, 9, 20, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
Category = ['A','A','A','B','B','B','A','A','A','B','B','B']

df = pd.DataFrame({ 'Date': rng, 'Predicted' : Predicted, 'Actual': Actual_data, 'Category': Category})

df

          Date  Predicted  Actual Category
0   2015-02-24         0      5.0        A
1   2015-02-24         36     3.0        A
2   2015-02-24         30     8.0        A
3   2015-02-24         33     2.0        B
4   2015-02-24         49     9.0        B
5   2015-02-24         42    20.0        B
6   2015-02-25         25     NaN        A
7   2015-02-25          9     NaN        A
8   2015-02-25         21     NaN        A
9   2015-02-25         39     NaN        B
10  2015-02-25         17     NaN        B
11  2015-02-25         11     NaN        B

I want to group by Date and sum Predicted and Actual, leaving Category out. I also want the groups that contain only NaN to keep the NaN value, rather than have it turned into 0. So in the end I would like to have this:

          Date  Predicted  Actual
0   2015-02-24        190    47.0
1   2015-02-25        122     NaN

I tried things like:

df = df.groupby(['Date'])[['Predicted', 'Actual']].agg('sum').reset_index()

which looked like it worked, but then I saw that it turns the NaN into 0, which I don't want. I tried adding skipna=False, but it says that isn't an accepted argument. I did update to the latest version of pandas, as I read somewhere that this might be the problem, but it made no difference. Could someone help me?
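For completeness, here is a minimal reproduction of the behaviour (the Predicted totals differ between runs because the values are random):

result = df.groupby(['Date'])[['Predicted', 'Actual']].sum().reset_index()
print(result)   # the all-NaN Actual group for 2015-02-25 comes out as 0.0, not NaN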

CodePudding user response:

You can use min_count=1, which means that at least 1 valid value is needed to perform the operation:

df.groupby(['Date'])[['Predicted', 'Actual']].sum(min_count=1)

Output

            Predicted  Actual
Date                         
2015-02-24        124    47.0
2015-02-25        162     NaN

For why this works while skipna=False doesn't, you can read the discussion on GitHub.
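As a quick illustration of what min_count does, here is a minimal sketch on a plain Series: by default the sum of an all-NaN column is 0.0, while min_count=1 turns it into NaN because no valid values are present.

s = pd.Series([np.nan, np.nan, np.nan])

print(s.sum())             # 0.0 -- default min_count=0 lets an all-NaN sum collapse to 0
print(s.sum(min_count=1))  # NaN -- at least one non-NaN value is required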

CodePudding user response:

Perhaps you could write a lambda function that returns the sum if any of the values are not null, and otherwise returns NaN:

df.groupby(['Date'])[['Predicted', 'Actual']].agg(lambda x: x.sum() if x.notna().any() else np.nan).reset_index()

Output

        Date  Predicted  Actual
0 2015-02-24        174    47.0
1 2015-02-25        164     NaN
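If you also want Date back as a regular column, as in the desired output, either approach can be followed by reset_index(). A minimal sketch using the min_count variant from the first answer (the Predicted totals will again differ between runs):

out = df.groupby('Date')[['Predicted', 'Actual']].sum(min_count=1).reset_index()
print(out)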