I have a dataframe with the following sample data, where the number of columns in the `Col.x` format is unknown:

```
Col.1,Col.2,Col.3
Val1,,
Val2,Val3,
Val3,,
Val4,Val2,Val3
```

I need a separate column populated with the value from the highest-numbered `Col.x` that is not null, such as:

```
Col.1,Col.2,Col.3,Latest
Val1,,,Val1
Val2,Val3,,Val3
Val3,,,Val3
Val4,Val2,Val3,Val3
```
I was able to solve the problem with the code below, but this solution a) depends on knowing the exact column names and b) doesn't handle a variable number of columns in a scalable way:

```python
df["Latest"] = np.where(df["Col.3"].isnull(), np.where(df["Col.2"].isnull(), df["Col.1"], df["Col.2"]), df["Col.3"])
```
Part a) I can solve...

```python
cols = [col for col in df.columns if 'Col' in col]
```

... but I need help with part b).
CodePudding user response:
Use `fillna` with `functools.reduce`:
```python
from functools import reduce

# sort column names by suffix in reverse order
cols = sorted(
    (col for col in df.columns if col.startswith('Col')),
    key=lambda col: -int(col.split('.')[1])
)
cols
# ['Col.3', 'Col.2', 'Col.1']

df['Latest'] = reduce(lambda x, y: x.fillna(y), [df[col] for col in cols])
df
#   Col.1 Col.2 Col.3 Latest
# 0  Val1   NaN   NaN   Val1
# 1  Val2   NaN  Val3   Val3
# 2  Val3   NaN   NaN   Val3
# 3  Val4  Val2  Val3   Val3
```
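One advantage of sorting by the numeric suffix (rather than lexicographically) is that double-digit suffixes stay in the right order: `'Col.10'` sorts after `'Col.2'` as a string, but before it numerically. A minimal sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd
from functools import reduce

# Hypothetical frame with a double-digit suffix, where a plain
# lexicographic sort would wrongly rank 'Col.10' below 'Col.2'.
df = pd.DataFrame({
    'Col.1': ['a', 'b'],
    'Col.2': [np.nan, 'c'],
    'Col.10': [np.nan, 'd'],
})

# Same key as above: negate the integer suffix to sort descending.
cols = sorted(
    (col for col in df.columns if col.startswith('Col')),
    key=lambda col: -int(col.split('.')[1])
)
# cols == ['Col.10', 'Col.2', 'Col.1']

# Fill each column's NaNs from the next-lower-numbered column.
df['Latest'] = reduce(lambda x, y: x.fillna(y), [df[col] for col in cols])
# Row 0 falls back all the way to Col.1 ('a'); row 1 takes Col.10 ('d').
```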
CodePudding user response:
We can use `filter` to extract certain columns; `like` and `regex` are two powerful options it accepts.
Given:

```
    Col1  Col2  Col3  Ignore_me
0   18.0   NaN  40.0       82.0
1    6.0   NaN   NaN       92.0
2  100.0   NaN  19.0       43.0
3   38.0  98.0   NaN        8.0
```
Doing:
df['Latest'] = (df[df.filter(like='Col') # Using filter to select certain columns.
.columns
.sort_values(ascending=False)] # Sort them descending.
.bfill(axis=1) # backfill values
.iloc[:,0]) # take the first column,
# This has the first non-nan value.
Output; we can see that `Ignore_me` wasn't used:

```
    Col1  Col2  Col3  Ignore_me Latest
0   18.0   NaN  40.0       82.0   40.0
1    6.0   NaN   NaN       92.0    6.0
2  100.0   NaN  19.0       43.0   19.0
3   38.0  98.0   NaN        8.0   98.0
```
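The `regex` option mentioned above gives tighter control than `like`, which matches any column whose name merely contains the substring. A minimal sketch (the frame and the extra `Column_notes` column are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [18.0, 6.0],
    'Col2': [np.nan, np.nan],
    'Col3': [40.0, np.nan],
    'Ignore_me': [82.0, 92.0],
})

# regex keeps only 'Col' followed by digits, so a hypothetical column
# named 'Column_notes' would be excluded even though it contains 'Col'.
selected = df.filter(regex=r'^Col\d+$')
# selected.columns -> ['Col1', 'Col2', 'Col3']

df['Latest'] = (selected[selected.columns.sort_values(ascending=False)]
                  .bfill(axis=1)
                  .iloc[:, 0])
```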