Home > Blockchain >  How to sum duplicate columns in dataframe and return nan if at least one value is nan
How to sum duplicate columns in dataframe and return nan if at least one value is nan

Time:12-02

I have a dataframe with duplicate columns (number not known a priori) like this example:

a a a b b
0 1 1 1 1 1
1 1 nan 1 1 1

I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.

I have tried this code:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()

The result i get is as follows, but it does not return NaN in the second row of column 'a'.

a b
0 3 2
1 2 2

In the documentation of pandas.DataFrame.sum, there is the skipna parameter which might suit my case. But I am using the function pandas.core.groupby.GroupBy.sum which does not have this parameter, but the min_count which does what i want but the number is not known in advance and would be different for each duplicate column.

For example, a min_count=3 solves the problem for column 'a', but obviously returns NaN on the whole of column 'b'.


The result I want to achieve is:

a b
0 3 2
1 nan 2

CodePudding user response:

One workaround might be to use apply to get the DataFrame.sum:

df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))

Output:

     a    b
0  3.0  2.0
1  NaN  2.0

CodePudding user response:

Another possible solution:

cols, ldf = df.columns.unique(), len(df)

pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (len(cols), ldf)), 
    columns=cols)

Output:

    a    b
0  3.0  2.0
1  NaN  2.0
  • Related