Home > Back-end >  Pandas get highest non-null value in each row, in dataframe with variable number of columns
Pandas get highest non-null value in each row, in dataframe with variable number of columns

Time:07-17

I have a dataframe with following sample data, where the number of Columns in Col.x format is unknown:

Col.1,Col.2,Col.3
Val1, 
Val2,Val3
Val3,
Val4,Val2,Val3

I need to have a separate column with values populated from the highest number of x which is not null. Such as:

Col.1,Col.2,Col.3,Latest
Val1,,,Val1
Val2,Val3,,Val3
Val3,,,Val3
Val4,Val2,Val3,Val3

I was able to solve the problem with code below but this solution depends on a) knowing the exact column names and b) doesn't handle the variable number of columns in a scalable way:

df["Latest"] = np.where(df["Col.3"].isnull(),np.where(df["Col.2"].isnull(),df["Col.1"],df["Col.2"]),df["Col.3"])

Part a) I can solve...

cols = [col for col in df.columns if 'Col' in col]

... I need help with part b).

CodePudding user response:

Use fillna with functools.reduce:

# sort column names by suffix in reverse order
cols = sorted(
   (col for col in df.columns if col.startswith('Col')), 
   key=lambda col: -int(col.split('.')[1])
)
cols
# ['Col.3', 'Col.2', 'Col.1']

from functools import reduce
df['Latest'] = reduce(lambda x, y: x.fillna(y), [df[col] for col in cols])

df
#  Col.1 Col.2 Col.3 Latest
#0  Val1   NaN   NaN   Val1
#1  Val2   NaN  Val3   Val3
#2  Val3   NaN   NaN   Val3
#3  Val4  Val2  Val3   Val3

CodePudding user response:

We can use filter to extract certain columns. like and regex are two powerful options that can be used.

Given:

    Col1  Col2  Col3  Ignore_me
0   18.0   NaN  40.0       82.0
1    6.0   NaN   NaN       92.0
2  100.0   NaN  19.0       43.0
3   38.0  98.0   NaN        8.0

Doing:

df['Latest'] = (df[df.filter(like='Col') # Using filter to select certain columns.
                     .columns
                     .sort_values(ascending=False)] # Sort them descending.
                  .bfill(axis=1) # backfill values
                  .iloc[:,0]) # take the first column, 
                              # This has the first non-nan value.

Output, we can see that Ignore_me wasn't used:

    Col1  Col2  Col3  Ignore_me  Latest
0   18.0   NaN  40.0       82.0    40.0
1    6.0   NaN   NaN       92.0     6.0
2  100.0   NaN  19.0       43.0    19.0
3   38.0  98.0   NaN        8.0    98.0
  • Related