Calculate by how much a row has shifted horizontally in pandas dataframe


I have a dataframe in which the rows have been shifted horizontally by an unknown amount, and each row may have shifted by a different amount, as shown below:

   Heading 1  Heading 2  Unnamed: 1  Unnamed: 2
0        NaN         34          24         NaN
1          5        NaN         NaN         NaN
2        NaN        NaN          13          77
3        NaN        NaN         NaN          18

In the above dataframe there are only 2 original columns (Heading 1 and Heading 2), but because of the row shifts (in rows 1, 3 and 4) extra columns have been created with the default names Unnamed: 1 and Unnamed: 2.

Now for each row, I want to calculate:

1.) The spill over. The spill over is the number of non-NaN values in the extra (Unnamed) columns. For example, in row 1 there is one non-NaN value in the extra columns (Unnamed: 1), so the spill over is 1. In row 2 there are no non-NaN values in the extra columns, so the spill over is 0. In row 3 there are 2 non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2), so the spill over is 2, and in row 4 there is 1 non-NaN value in the extra columns, so the spill over is 1.

2.) The number of NaN values in the original columns (Heading 1 and Heading 2). For example, in row 1 the number of NaN values in the original columns is 1, in row 2 it is 0, in row 3 it is 2 and in row 4 it is 2.

So basically, for each row I have to calculate the number of NaN values in the original columns (Heading 1 and Heading 2) and the number of non-NaN values in the extra columns (Unnamed: 1 and Unnamed: 2).

I can get the number of extra columns (Unnamed: 1 and so on) present in the dataframe with:

len(df.filter(regex=("Unnamed:.*")).columns.to_list())
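Building on that, the two per-row counts can be written with notna/isna sums. This is only a minimal sketch, assuming the frame above is loaded as df and that the extra columns are exactly the ones whose names start with "Unnamed:":

extra_cols = df.filter(regex=r"Unnamed:.*").columns
original_cols = df.columns.difference(extra_cols)

# 1) spill over: non-NaN values that landed in the extra columns
spill_over = df[extra_cols].notna().sum(axis=1)          # 1, 0, 2, 1

# 2) NaN values in the original columns
missing_original = df[original_cols].isna().sum(axis=1)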

Thank you!

CodePudding user response:

@mozway As mentioned in the comments, here is the code I tried in order to apply the logic to only a subset of the dataframe (the Unnamed columns):

extra = df.filter(regex=("Unnamed:.*"))
y = extra.isna().cummin(axis=1).sum(axis=1).clip(upper=2).tolist()

According to the dataframe, the output should be [1, 2, 0, 1] (there is 1 NaN value in row 1, 2 in row 2, 0 in row 3 and 1 in row 4), but the above code gives [0, 2, 0, 1].
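Note that cummin only keeps the leading run of NaN values in each row (it stays False once the first non-NaN value is seen), so the NaN that comes after 24 in row 1 is not counted, which is why the first entry is 0 rather than 1. If the total NaN count per row in the extra columns is what is needed, dropping cummin should give the expected list; a small sketch under that assumption:

extra = df.filter(regex=r"Unnamed:.*")

# count every NaN in the extra columns, not only the leading run
y = extra.isna().sum(axis=1).tolist()   # [1, 2, 0, 1]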

CodePudding user response:

You can use isna and cummin to identify the leading NAs, then sum to count them and clip to limit the shift to the original number of columns:

df.isna().cummin(axis=1).sum(axis=1).clip(upper=2)

Output:

0    1
1    0
2    2
3    2
dtype: int64
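The hard-coded upper=2 is simply the number of original columns; it can also be derived from the frame itself with the same filter the question already uses for the Unnamed columns. A small sketch, assuming the extra columns are exactly those matching "Unnamed:":

# number of original (non-"Unnamed") columns, instead of hard-coding 2
n_original = df.shape[1] - df.filter(regex=r"Unnamed:.*").shape[1]

shift = df.isna().cummin(axis=1).sum(axis=1).clip(upper=n_original)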

Intermediates:

df.isna()

   Heading 1  Heading 2  Unnamed: 1  Unnamed: 2
0       True      False       False        True
1      False      False        True        True
2       True       True       False       False
3       True       True        True       False

df.isna().cummin(axis=1)

   Heading 1  Heading 2  Unnamed: 1  Unnamed: 2
0       True      False       False       False
1      False      False       False       False
2       True       True       False       False
3       True       True        True       False

df.isna().cummin(axis=1).sum(axis=1)

0    1
1    0
2    2
3    3
dtype: int64
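If the end goal is to undo the shift, one possible use of this result is to roll each row back to the left by its computed offset. This is only a sketch, assuming the frame above is loaded as df and that the original columns are Heading 1 and Heading 2:

import numpy as np
import pandas as pd

# per-row shift, as computed above
shift = df.isna().cummin(axis=1).sum(axis=1).clip(upper=2)

# roll each row left by its offset so values return to the original columns;
# the cells that wrap around to the end are the leading NaNs, so no real data moves
realigned = pd.DataFrame(
    [np.roll(row, -s) for row, s in zip(df.to_numpy(), shift)],
    columns=df.columns,
    index=df.index,
)[["Heading 1", "Heading 2"]]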