Pandas - Replace Last Non-None Value with None Row-wise

One common thing people seem to want to do in pandas is to replace None values with the next or previous non-None value. This is easily done with .fillna. However, I want to do something similar but different.

I have a dataframe, df, with some entries. Every row has a different number of entries and they are all "left-adjusted" (if the df is 10 columns wide and some row has n<10 entries the first n columns hold the entries and the remaining columns are Nones).

What I want to do is find the last non-None entry in every row and change it to also be a None. This could be any of the columns from the first to the last.

I could of course do this with a for-loop but my dfs can be quite large so something quicker would be preferable. Any ideas?
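For reference, the loop-based baseline mentioned above might look like the sketch below. The function name `blank_last_valid_loop` and the small example frame are hypothetical, and the sketch assumes the index labels are unique. It is only meant as a correctness reference for faster approaches.

```python
import numpy as np
import pandas as pd

def blank_last_valid_loop(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the last non-null entry of each row set to NaN."""
    out = df.copy()
    for idx, row in df.iterrows():
        # Column label of the last non-null entry, or None if the row is empty
        last = row.last_valid_index()
        if last is not None:
            out.at[idx, last] = np.nan
    return out

# Hypothetical left-adjusted example frame
example = pd.DataFrame(
    [[1.0, 2.0, 3.0, np.nan],
     [4.0, np.nan, np.nan, np.nan],
     [5.0, 6.0, 7.0, 8.0]]
)
result = blank_last_valid_loop(example)
```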

Thanks!

CodePudding user response：

With help from NumPy, this is quite easy. Because the rows are left-adjusted, counting the number of None values in each row tells you which column holds the last non-None value in that row. NumPy indexing can then set that value to None:

import numpy as np
import pandas as pd

# Shape and slices chosen to match the 4x6 frame printed below
data = np.random.random((4, 6))

df = pd.DataFrame(data)
df.iloc[0, 3:] = None
df.iloc[1, 5:] = None
df.iloc[2, 3:] = None
df.iloc[3, 4:] = None

Original dataframe looks like this:

          0         1         2         3         4   5
0  0.992337  0.651785  0.521422       NaN       NaN NaN
1  0.912962  0.292458  0.620195  0.507071  0.010205 NaN
2  0.061320  0.565979  0.344755       NaN       NaN NaN
3  0.521936  0.057917  0.359699  0.484009       NaN NaN

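isnull = df.isnull()

# Rows are left-adjusted, so (columns - NaN count - 1) is the position of the last non-null entry
col = df.shape[1] - isnull.sum(axis=1).to_numpy() - 1

# Edit a copy and rebuild df; writing through df.values is not guaranteed to change df
arr = df.to_numpy(copy=True)
arr[np.arange(len(df)), col] = np.nan
df = pd.DataFrame(arr, index=df.index, columns=df.columns)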

Updated dataframe looks like this:

          0         1         2         3   4   5
0  0.992337  0.651785       NaN       NaN NaN NaN
1  0.912962  0.292458  0.620195  0.507071 NaN NaN
2  0.061320  0.565979       NaN       NaN NaN NaN
3  0.521936  0.057917  0.359699       NaN NaN NaN
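The counting approach can also be wrapped in a small helper. The following is a minimal sketch, assuming an all-numeric, left-adjusted frame; the name `blank_last_valid_count` is hypothetical. Unlike plain negative indexing, it skips rows that are entirely NaN explicitly, and it returns a new frame rather than modifying the input.

```python
import numpy as np
import pandas as pd

def blank_last_valid_count(df: pd.DataFrame) -> pd.DataFrame:
    """Set the last non-NaN entry of each left-adjusted row to NaN."""
    # Work on a float copy so the input frame is never modified
    arr = df.to_numpy(dtype=float, copy=True)
    n_cols = arr.shape[1]
    # Number of valid entries per row
    n_valid = n_cols - np.isnan(arr).sum(axis=1)
    # Only rows with at least one valid entry have something to blank
    rows = np.flatnonzero(n_valid > 0)
    arr[rows, n_valid[rows] - 1] = np.nan
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

# Hypothetical example: a partial row, a full row, and an empty row
frame = pd.DataFrame(
    [[0.1, 0.2, 0.3, np.nan, np.nan],
     [0.4, 0.5, 0.6, 0.7, 0.8],
     [np.nan] * 5]
)
blanked = blank_last_valid_count(frame)
```

Returning a new frame sidesteps the question of whether the array behind the dataframe is a view or a copy, which can differ between pandas versions and dtypes.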

CodePudding user response：

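You can find the index of the element to replace in each row with np.argmax(). On the boolean NaN mask, argmax returns the position of the first NaN in each row, and because the rows are left-adjusted, the element just before it is the last non-None value. This assumes numeric columns, since np.isnan does not accept object arrays:

arr = df.to_numpy(copy=True)
# First NaN position in each row, minus one; a full row gives -1, i.e. the last column
indices = np.isnan(arr).argmax(axis=1) - 1

# Edit the copy and rebuild df; writing to df.to_numpy() directly may not change df
arr[np.arange(len(df)), indices] = np.nan
df = pd.DataFrame(arr, index=df.index, columns=df.columns)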
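A variant that avoids writing into the underlying array altogether is to build a boolean mask and pass it to DataFrame.mask. This is a sketch under the same left-adjusted assumption; the name `blank_last_valid_mask` is hypothetical. Because it relies on notna() rather than np.isnan, it should also work for object columns holding None.

```python
import numpy as np
import pandas as pd

def blank_last_valid_mask(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with the last non-null entry of each row set to NaN."""
    # Valid entries per row; with left-adjusted rows the last one sits at n_valid - 1
    n_valid = df.notna().sum(axis=1).to_numpy()
    col_pos = np.arange(df.shape[1])
    # True only at (row, last valid position); empty rows give -1 and match nothing
    hit = col_pos == (n_valid[:, None] - 1)
    return df.mask(hit)

# Hypothetical example frame
sample = pd.DataFrame(
    [[1.0, 2.0, np.nan],
     [3.0, np.nan, np.nan],
     [4.0, 5.0, 6.0]]
)
masked = blank_last_valid_mask(sample)
```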