Calculate by how much a row has shifted horizontally in a pandas dataframe

I have a dataframe where the rows have been shifted horizontally by an unknown amount. Each row has been shifted by a different amount, as shown below:

Heading 1	Heading 2	Unnamed: 1	Unnamed: 2
NaN	34	24	NaN
5	NaN	NaN	NaN
NaN	NaN	13	77
NaN	NaN	NaN	18

In the dataframe above there are only 2 original columns (Heading 1 and Heading 2), but because some rows have shifted (for example rows 1 and 3), extra columns have been created with the default names Unnamed: 1 and Unnamed: 2.

Now for each row, I want to calculate:

1.) The spill over. The spill over is the number of non-NaN values in the extra columns (the Unnamed columns). For example, in row 1 there is one non-NaN value in the extra columns (Unnamed: 1), so the spill over is 1. In row 2 there are no non-NaN values in the extra columns, so the spill over is 0. In row 3 there are 2 non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2), so the spill over is 2, and in row 4 there is 1 non-NaN value in the extra columns, so the spill over is 1.

2.) The number of leading NaN values in the original columns (Heading 1 and Heading 2). For example, in row 1 there is 1 such NaN value, in row 2 there are 0, in row 3 there are 2 and in row 4 there are 2.

So for each row, I have to calculate the number of leading NaN values in the original columns (Heading 1 and Heading 2) and the number of non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2).

I can get the number of extra columns (Unnamed: 1 and so on) present in the dataframe with:

len(df.filter(regex=("Unnamed:.*")).columns.to_list())
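
For reproducibility, here is a minimal sketch that rebuilds the dataframe and separates the extra columns from the original ones. The values are copied from the table as posted, and the names extra, original, n_extra and n_original are only illustrative. It assumes the extra columns are exactly those whose names start with Unnamed:.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question as posted.
df = pd.DataFrame(
    [
        [np.nan, 34, 24, np.nan],
        [5, np.nan, np.nan, np.nan],
        [np.nan, np.nan, 13, 77],
        [np.nan, np.nan, np.nan, 18],
    ],
    columns=["Heading 1", "Heading 2", "Unnamed: 1", "Unnamed: 2"],
)

# Columns created by the shift (assumed to be the ones named "Unnamed: ...").
extra = df.filter(regex=r"^Unnamed:")

# Everything else is treated as an original column.
original = df.drop(columns=extra.columns)

n_extra = extra.shape[1]
n_original = original.shape[1]
print(n_extra, n_original)  # 2 2
```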

Thank you!

CodePudding user response：

@mozway As mentioned in the comments, here is the code I used to apply the logic to only a subset of the dataframe (the extra columns):

extra = df.filter(regex=("Unnamed:.*"))
y = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()

For this dataframe the output should be [1, 2, 0, 1] (there is 1 NaN value in row 1, 2 in row 2, 0 in row 3 and 1 in row 4), but the code above gives [0, 2, 0, 1].
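
The difference most likely comes from cummin(axis=1): it keeps only the NaN values that lead each row, so in row 1 the NaN that follows 24 is no longer counted. The sketch below compares the two counts. It assumes the dataframe is rebuilt from the table as posted, and the names leading_nans and all_nans are only illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question as posted.
df = pd.DataFrame(
    [
        [np.nan, 34, 24, np.nan],
        [5, np.nan, np.nan, np.nan],
        [np.nan, np.nan, 13, 77],
        [np.nan, np.nan, np.nan, 18],
    ],
    columns=["Heading 1", "Heading 2", "Unnamed: 1", "Unnamed: 2"],
)

extra = df.filter(regex=r"^Unnamed:")

# Leading NaN values only: counting stops at the first non-NaN value in a row.
leading_nans = extra.isna().cummin(axis=1).sum(axis=1).tolist()

# All NaN values in the extra columns, wherever they sit in the row.
all_nans = extra.isna().sum(axis=1).tolist()

print(leading_nans)  # [0, 2, 0, 1]
print(all_nans)      # [1, 2, 0, 1]
```

If every NaN in the extra columns should be counted, dropping cummin as in all_nans gives the expected list.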

CodePudding user response：

You can use isna and cummin to identify the leading NAs, then sum to count them and clip to limit the shift to the original number of columns:

df.isna().cummin(axis=1).sum(axis=1).clip(upper=2)

Output:

0    1
1    0
2    2
3    2
dtype: int64

Intermediates:

df.isna()

   Heading 1  Heading 2  Unnamed: 1  Unnamed: 2
0       True      False       False        True
1      False       True        True        True
2       True       True       False       False
3       True       True        True       False

df.isna().cummin(axis=1)

   Heading 1  Heading 2  Unnamed: 1  Unnamed: 2
0       True      False       False       False
1      False      False       False       False
2       True       True       False       False
3       True       True        True       False

df.isna().cummin(axis=1).sum(axis=1)

0    1
1    0
2    2
3    3
dtype: int64
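
Putting this together for the two quantities asked about, here is a minimal sketch. It assumes the dataframe is rebuilt from the table in the question as posted, and that the extra columns are exactly those whose names start with Unnamed:. The names spill_over, leading_nans and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question as posted.
df = pd.DataFrame(
    [
        [np.nan, 34, 24, np.nan],
        [5, np.nan, np.nan, np.nan],
        [np.nan, np.nan, 13, 77],
        [np.nan, np.nan, np.nan, 18],
    ],
    columns=["Heading 1", "Heading 2", "Unnamed: 1", "Unnamed: 2"],
)

extra_cols = df.filter(regex=r"^Unnamed:").columns
n_original = df.shape[1] - len(extra_cols)

# Spill over: non-NaN values that ended up in the extra columns.
spill_over = df[extra_cols].notna().sum(axis=1)

# Shift: leading NaN values per row, capped at the number of original columns.
leading_nans = df.isna().cummin(axis=1).sum(axis=1).clip(upper=n_original)

result = pd.DataFrame({"spill_over": spill_over, "leading_nans": leading_nans})
print(result["spill_over"].tolist())    # [1, 0, 2, 1]
print(result["leading_nans"].tolist())  # [1, 0, 2, 2]
```

Deriving n_original from the column names rather than hard-coding 2 keeps the clip correct if the number of Unnamed columns changes.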