I have a dataframe where the rows have been shifted horizontally by an unknown amount. Each and every row has shifted by a different amount as shown below:
Heading 1 | Heading 2 | Unnamed: 1 | Unnamed: 2 |
---|---|---|---|
NaN | 34 | 24 | NaN |
5 | NaN | NaN | NaN |
NaN | NaN | 13 | 77 |
NaN | NaN | NaN | 18 |
In the above dataframe, there are only 2 original columns (Heading 1 and Heading 2) but due to row shift (in rows 1 and 3), extra columns (Unnamed: 1 and Unnamed: 2) have been created with the default name Unnamed: 1 and Unnamed: 2.
Now for each row, I want to calculate:
1.) The spill over. Spill over is basically the amount of NaN values in extra columns(Unnamed columns). For example in row 1 there is one non NaN value in extra columns (Unnamed: 1) and hence the spill over is 1. In row 2 there are no non NaN values in extra columns so the spill over is 0. In row 3 there are 2 non NaN values in extra columns(Unnamed: 1 and Unnamed: 2) hence the spill over is 2 and in row 4 there are 1 non NaN values in extra columns so the spill over is 1.
2.) The amount of NaN values in the original columns(Heading 1 and Heading 2). For example in row 1 amount of Nan values in original columns are 1, in row 2 amount of NaN values in original columns is 0, in row 3 amount of NaN values in original columns is 2 and in row 4 amount of NaN values in original columns is 2.
So basically for each row, I have to calculate the amount of Nan values in original columns(Heading 1 and Heading 2) and the amount of non NaN values in extra columns(Unnamed: 1 and Unnamed: 2).
I can get the amount of extra columns (Unnamed:1 and so on) present in a dataframe by:
len(df.filter(regex=("Unnamed:.*")).columns.to_list())
Thank you!
CodePudding user response:
@mozway As mentioned in the comments I am adding the code I tried to apply the logic to only a subset of dataframe:
extra = df.filter(regex=("Unnamed:.*"))
y = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()
According to the dataframe the output should be [1, 2, 0, 1] (as there are 1 nan values in row 1, 2 in row 2 0 in row 3 and 1 in row 4) but the above code is giving output [0, 2, 0, 1]
CodePudding user response:
You can use isna
and cummin
to identify the leading NAs, then sum
to count them and clip
to limit the shift to the original number of columns:
df.isna().cummin(axis=1).sum(axis=1).clip(upper=2)
Output:
0 1
1 0
2 2
3 2
dtype: int64
Intermediates:
df.isna()
Heading 1 Heading 2 Unnamed: 1 Unnamed: 2
0 True False False True
1 False False True True
2 True True False False
3 True True True False
df.isna().cummin(axis=1)
Heading 1 Heading 2 Unnamed: 1 Unnamed: 2
0 True False False False
1 False False False False
2 True True False False
3 True True True False
df.isna().cummin(axis=1).sum(axis=1)
0 1
1 0
2 2
3 3
dtype: int64