I have the following dataframe:
Subject Val1 Val1 Int Val1 Val1 Int2 Val1
A 1 2 3 NaN NaN Sp NaN
B NaN NaN NaN 2 3 NaN NaN
C NaN NaN 4 NaN NaN 0 3
D NaN NaN 3 NaN NaN 8 NaN
I want to ended up with only 2 column that are val1
because it has at most 2 non-nans for a given subject. Namely, the output would look like this:
Subject Val1 Val1 Int Int2
A 1 2 3 Sp
B 2 3 NaN NaN
C 3 NaN 4 0
D NaN NaN 3 8
is there a function in pandas to do this in a clean way? Clean meaning only a few lines of code. Because one way would be to iterate through row with a for loop and bring all nonnan values to the left, but I'd like something cleaner and more efficient as well.
CodePudding user response:
Idea is per groups by duplicated columns names use lambda function for sort values based by missing values, so possible remove all columns with only missing values in last steps:
df = df.set_index('Subject')
f = lambda x: pd.DataFrame(x.apply(sorted, key=pd.isna, axis=1).tolist(), index=x.index)
df = df.groupby(level=0, axis=1).apply(f).dropna(axis=1, how='all').droplevel(1, axis=1)
print (df)
Int Int2 Val1 Val1
Subject
A 3.0 Sp 1.0 2.0
B NaN NaN 2.0 3.0
C 4.0 0 3.0 NaN
D 3.0 8 NaN NaN