I would like to convert values of a column (col2 below) that are close to each other to the same value (say, the largest one). Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({"col1":[0,1,2,3,4,5,6],"col2":[1,5,6,10,12,14,17]})
   col1  col2
0     0     1
1     1     5
2     2     6
3     3    10
4     4    12
5     5    14
6     6    17
Given column col2 and a closeness threshold of 2: the difference between 5 and 6 is within the threshold, so both become the same value, i.e. 6. Values 1 and 17 are far from the rest of the values in col2, so they stay unchanged. The differences between the neighbouring values 10, 12 and 14 are no more than 2, so all three become 14. (Why I need this: when converting an image to text using pytesseract.image_to_data, the top coordinates of the text differ slightly, and I want to snap those coordinates to the same value.)
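To make the rule concrete, here is a minimal plain-Python sketch of one possible reading of it. It assumes that a value joins a group when it is within the threshold of its sorted predecessor, so groups can chain (10, 12 and 14 end up together even though 14 - 10 is larger than 2). The function name snap_to_group_max is made up for illustration.

```python
def snap_to_group_max(values, threshold):
    """Replace each value with the largest value of its group.

    Assumption: a value joins the current group when it is within
    `threshold` of its sorted predecessor, so groups can chain.
    """
    if not values:
        return []
    # Visit positions in order of increasing value, keeping original positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = list(values)
    group = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > threshold:
            # Gap too large: close the group; its last member holds the maximum.
            for i in group:
                result[i] = values[group[-1]]
            group = []
        group.append(cur)
    # Close the final group.
    for i in group:
        result[i] = values[group[-1]]
    return result


snapped = snap_to_group_max([1, 5, 6, 10, 12, 14, 17], threshold=2)
print(snapped)  # [1, 6, 6, 14, 14, 14, 17]
```

If chaining is not wanted (for example, if every member of a group must be within the threshold of the group's maximum), the grouping condition would need to change.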
The expected output for col2 with a closeness threshold of 2 is:
   col1  col2
0     0     1
1     1     6
2     2     6
3     3    14
4     4    14
5     5    14
6     6    17
Your help is much appreciated!
CodePudding user response:
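# Start a new group id whenever the gap to the previous row exceeds the threshold (2)
df['s'] = df['col2'].diff().abs().gt(2).cumsum()
# Replace every value with the largest value in its group
df['col2'] = df.groupby('s')['col2'].transform('max')
# Drop the helper column (the positional axis argument was removed in pandas 2.0)
df = df.drop(columns='s')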
CodePudding user response:
Use:
df['col2'] = df['col2'].mask(df['col2'].diff(-1).abs().le(2)).bfill()
print(df)
   col1  col2
0     0   1.0
1     1   6.0
2     2   6.0
3     3  14.0
4     4  14.0
5     5  14.0
6     6  17.0
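Both answers rely on col2 already being sorted, which may not hold for coordinates coming out of OCR. As a rough sketch (not part of either answer), the same idea can be wrapped in a helper that sorts first and then restores the original row order. The name snap_close_values is hypothetical, and the sketch assumes the index is unique and that chaining neighbouring values into one group is acceptable. It also keeps the integer dtype, since no NaN values are introduced.

```python
import pandas as pd


def snap_close_values(s: pd.Series, threshold: float) -> pd.Series:
    """Snap each value to the maximum of its cluster of close values."""
    ordered = s.sort_values()
    # A new cluster starts wherever the gap to the previous sorted value
    # exceeds the threshold.
    cluster_id = ordered.diff().gt(threshold).cumsum()
    snapped = ordered.groupby(cluster_id).transform("max")
    # Restore the original row order (assumes a unique index).
    return snapped.reindex(s.index)


df = pd.DataFrame({"col1": [0, 1, 2, 3, 4, 5, 6],
                   "col2": [14, 1, 6, 10, 17, 5, 12]})
df["col2"] = snap_close_values(df["col2"], threshold=2)
print(df["col2"].tolist())  # [14, 1, 6, 14, 17, 6, 14]
```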