how I can do the following in pandas. let's I have a column which has an index. I want to find the index in col1
and do the following:
- If 3 rows above the index and 3 rows below the index, the sign of the values of
col1
was constant and it changes in one of these rows and remains constant after the point pick the index of the row that the sign changes. If there is not such a point pick the value of index column.
It is hard to explain but I think it is more clearer in an example:
consider the following data:
ind_column col1
4 0.5
4 0.65
4 0.6
4 0.2
4 0.1
4 0.8
4 -0.3
4 -0.2
4 -0.3
here, the index column is 4, so we look at 3 rows before 4 and 3 rows after 4. Basically rows: 1,2,3,4,5,6,7. We see that in rows 6th the sign changed, since all signs are the same before this point (positive) and after this point (negative) the index 6th is selected.
Now lets consider the following data
ind_column col1
5 0.5
5 0.65
5 -0.6
5 0.2
5 -0.1
5 0.8
5 0.3
5 -0.2
5 -0.3
Now we should look at the rows 2,3,4,5,6,7,8. Since the sign is changing in these rows more than once, the index 5 is celected.
in the following data also index 5 is selected since the sign never changes.
ind_column col1
5 0.5
5 0.65
5 0.6
5 0.2
5 0.1
5 0.8
5 0.3
5 0.2
5 0.3
CodePudding user response:
One approach could be as follows:
import pandas as pd
import numpy as np
def get_index(df):
idx = df.loc[0,'ind_column']
tmp = np.sign(df.iloc[max(idx-3,0):idx 4]['col1']).diff().dropna().ne(0)
if tmp.sum() == 1:
return tmp[tmp].index.item()
return idx
indices = [get_index(df) for df in [df1,df2,df3]] # see `dfs` listed below
print(indices)
[6, 5, 5]
Explanation
- Inside the function, we first retrieve the "start" index from column
ind_column
(e.g.4
or5
in the exampledfs
). - Next, we use
df.iloc
to select the index range -3 rows through to (and including) 3 rows fromidx
value. We usemax(idx-3,0)
to ensure that we won't be trying to select a negative index value (i.e. whenidx < 3
) as the starting point, causing an emptydf
selection. - From the
df
slice, we select columncol1
and applynp.sign
to get a series with-1
,0
or1
. - Now, we chain
Series.diff
to get the difference between consecutive rows, and useSeries.dropna
to get rid of the first row, which will beNaN
by definition. - Finally, we chain
Series.ne
with0
. This will get us the series with booleans:True
where there is a switch in sign,False
where there is no switch. - The resulting
pd.Series
is stored astmp
and now, we simply have to check whether thesum
of the series equals1
. If it does, we want to get the index of theTrue
value responsible, returningtmp[tmp].index.item()
. In all other cases, we either have no switches at all, or multiple ones. In both cases, we simply want to returnidx
.
Data used for df1, df2, df3
above
data1 = {'ind_column': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: -0.3,
7: -0.2, 8: -0.3}}
df1 = pd.DataFrame(data1)
data2 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col1': {0: 0.5, 1: 0.65, 2: -0.6, 3: 0.2, 4: -0.1, 5: 0.8, 6: 0.3,
7: -0.2, 8: -0.3}}
df2 = pd.DataFrame(data2)
data3 = {'ind_column': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col1': {0: 0.5, 1: 0.65, 2: 0.6, 3: 0.2, 4: 0.1, 5: 0.8, 6: 0.3,
7: 0.2, 8: 0.3}}
df3 = pd.DataFrame(data3)