My dataframe:
| query_name | position_description |
|------------|----------------------|
| A1 | [1-10] |
| A1 | [3-5] |
| A2 | [1-20] |
| A3 | [1-15] |
| A4 | [10-20] |
| A4 | [1-15] |
I would like to remove those rows with (i)same query_name and (ii) overlap entirely for the position_description?
Desired output:
| query_name | position_description |
|------------|----------------------|
| A1 | [1-10] |
| A2 | [1-20] |
| A3 | [1-15] |
| A4 | [10-20] |
| A4 | [1-15] |
CodePudding user response:
If there can be no more than one row contained in another we can use:
from ast import literal_eval
df2 = pd.DataFrame(df['position_description'].str.replace('-', ',')
.apply(literal_eval).tolist(),
index=df.index).sort_values(0)
print(df2)
0 1
0 1 10
2 1 20
3 1 15
5 1 15
1 3 5
4 10 20
check = df2.groupby(df['query_name']).shift()
df.loc[~(df2[0].gt(check[0]) & df2[1].lt(check[1]))]
query_name position_description
0 A1 [1-10]
2 A2 [1-20]
3 A3 [1-15]
4 A4 [10-20]
5 A4 [1-15]
CodePudding user response:
This should work for any number of ranges being contained by some ranges:
First, extract the boundaries
df = pd.DataFrame({
'query_name': ['A1', 'A1', 'A2', 'A3', 'A4', 'A4'],
'position_description': ['[1-10]', '[3-5]', '[1-20]', '[1-15]', '[10-20]', '[1-15]'],
})
df[['pos_x', 'pos_y']] = df['position_description'].str.extract(r'\[(\d )-(\d )\]').astype(int)
Then we will define the function that can choose what ranges to keep:
def non_contained_ranges(df):
range_min = df['pos_x'].min()
range_max = df['pos_y'].max()
range_size = range_max - range_min 1
b = np.zeros((len(df), range_size))
for i, (x, y) in enumerate(df[['pos_x', 'pos_y']].values - range_min):
b[i, x: y 1] = 1.
b2 = np.logical_and(np.logical_xor(b[:, np.newaxis], b), b).any(axis=2)
np.fill_diagonal(b2, True)
b3 = b2.all(axis=0)
return df[b3]
If there are N ranges within a group (query_name
), this function will do N x N comparisons, using boolean array operations.
Then we can do groupby
and apply the function to yield the expected result
df.groupby('query_name')\
.apply(non_fully_overlapped_ranges)\
.droplevel(0, axis=0).drop(columns=['pos_x', 'pos_y'])
Outcome:
query_name position_description
0 A1 [1-10]
2 A2 [1-20]
3 A3 [1-15]
4 A4 [10-20]
5 A4 [1-15]