Consider the following DataFrame
>>> df
Start End Tiebreak
0 1 6 0.376600
1 5 7 0.050042
2 15 20 0.628266
3 10 15 0.984022
4 11 12 0.909033
5 4 8 0.531054
Whenever the [Start, End]
intervals of two rows overlap I want the row with lower tiebreaking value to be removed. The result of the example would be
>>> df
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
I have a double-loop which does the job inefficiently and was wondering whether there exists an approach which exploits built-ins and works columnwise.
import pandas as pd
import numpy as np
# initial data
df = pd.DataFrame({
'Start': [1, 5, 15, 10, 11, 4],
'End': [6, 7, 20, 15, 12, 8],
'Tiebreak': np.random.uniform(0, 1, 6)
})
# checking for overlaps
list_idx_drop = []
for i in range(len(df) - 1):
for j in range(i 1, len(df)):
idx_1 = df.index[i]
idx_2 = df.index[j]
cond_1 = (df.loc[idx_1, 'Start'] < df.loc[idx_2, 'End'])
cond_2 = (df.loc[idx_2, 'Start'] < df.loc[idx_1, 'End'])
# if rows overlaps
if cond_1 & cond_2:
tie_1 = df.loc[idx_1, 'Tiebreak']
tie_2 = df.loc[idx_2, 'Tiebreak']
# delete row with lower tiebreaking value
if tie_1 < tie_2:
df.drop(idx_1, inplace=True)
else:
df.drop(idx_2, inplace=True)
CodePudding user response:
You could sort by End
and check cases where the end is greater than the previous Start
. Using that True/False value, you can create groupings on which to drop duplicates. Sort again by Tiebreak
and drop duplicates on the group column.
import pandas as pd
df = pd.DataFrame({'Start': {0: 1, 1: 5, 2: 15, 3: 10, 4: 11, 5: 4}, 'End': {0: 6, 1: 7, 2: 20, 3: 15, 4: 12, 5: 8}, 'Tiebreak': {0: 0.3766, 1: 0.050042, 2: 0.628266, 3: 0.984022, 4: 0.909033, 5: 0.531054}})
df = df.sort_values(by='End', ascending=False)
df['overlap'] = df['End'].gt(df['Start'].shift(fill_value=0))
df['group'] = df['overlap'].eq(False).cumsum()
df = df.sort_values(by='Tiebreak', ascending=False)
df = df.drop_duplicates(subset='group').drop(columns=['overlap','group'])
print(df)
Output
Start End Tiebreak
2 15 20 0.628266
3 10 15 0.984022
5 4 8 0.531054
CodePudding user response: