Home > OS >  how deterring the index of a column under some conditiones?
how deterring the index of a column under some conditiones?

Time:10-31

how I can do the following in pandas. let's I have a column which has an index. I want to find the index in col1 and do the following:

  • If 3 rows above the index and 3 rows below the index, the sign of the values of col1 was constant and it changes in one of these rows and remains constant after the point pick the index of the row that the sign changes. If there is not such a point pick the value of index column.

It is hard to explain but I think it is more clearer in an example:

consider the following data:

ind_column     col1
4             0.5
4             0.65
4             0.6
4             0.2
4             0.1
4             0.8
4             -0.3
4             -0.2
4             -0.3

here, the index column is 4, so we look at 3 rows before 4 and 3 rows after 4. Basically rows: 1,2,3,4,5,6,7. We see that in rows 6th the sign changed, since all signs are the same before this point (positive) and after this point (negative) the index 6th is selected.

Now lets consider the following data

ind_column     col1
5             0.5
5             0.65
5             -0.6
5             0.2
5             -0.1
5             0.8
5             0.3
5             -0.2
5             -0.3

Now we should look at the rows 2,3,4,5,6,7,8. Since the sign is changing in these rows more than once, the index 5 is celected.

in the following data also index 5 is selected since the sign never changes.

ind_column     col1
5             0.5
5             0.65
5             0.6
5             0.2
5             0.1
5             0.8
5             0.3
5             0.2
5             0.3

CodePudding user response:

One approach could be as follows:

import pandas as pd
import numpy as np

def get_index(df):
    idx = df.loc[0,'ind_column']
    tmp = np.sign(df.iloc[max(idx-3,0):idx 4]['col1']).diff().dropna().ne(0)
    if tmp.sum() == 1:
        return tmp[tmp].index.item()
    return idx

indices = [get_index(df) for df in [df1,df2,df3]] # see `dfs` listed below

print(indices)
[6, 5, 5]

Explanation

  • Inside the function, we first retrieve the "start" index from column ind_column (e.g. 4 or 5 in the example dfs).
  • Next, we use df.iloc to select the index range -3 rows through to (and including) 3 rows from idx value. We use max(idx-3,0) to ensure that we won't be trying to select a negative index value (i.e. when idx < 3) as the starting point, causing an empty df selection.
  • From the df slice, we select column col1 and apply np.sign to get a series with -1, 0 or 1.
  • Now, we chain Series.diff to get the difference between consecutive rows, and use Series.dropna to get rid of the first row, which will be NaN by definition.
  • Finally, we chain Series.ne with 0. This will get us the series with booleans: True where there is a switch in sign, False where there is no switch.
  • The resulting pd.Series is stored as tmp and now, we simply have to check whether the sum of the series equals 1. If it does, we want to get the index of the True value responsible, returning tmp[tmp].index.item(). In all other cases, we either have no switches at all, or multiple ones. In both cases, we simply want to return idx.

Data used for df1, df2, df3 above

data1 = {'ind_column': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4}, 
        'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: -0.3, 
                 7: -0.2, 8: -0.3}}
df1 = pd.DataFrame(data1)

data2 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5}, 
         'col1': {0: 0.5, 1: 0.65, 2: -0.6, 3: 0.2, 4: -0.1, 5: 0.8, 6: 0.3, 
                  7: -0.2, 8: -0.3}}
df2 = pd.DataFrame(data2)

data3 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5}, 
         'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: 0.3,
                  7: 0.2, 8: 0.3}}
df3 = pd.DataFrame(data3)
  • Related