How to extract rows with some processing steps using python pandas?-CodePudding

My dataframe:

| query_name | position_description |
|------------|----------------------|
| A1         | [1-10]               |
| A1         | [3-5]                |
| A2         | [1-20]               |
| A3         | [1-15]               |
| A4         | [10-20]              |
| A4         | [1-15]               |

I would like to remove those rows with (i)same query_name and (ii) overlap entirely for the position_description?

Desired output:

| query_name | position_description |
|------------|----------------------|
| A1         | [1-10]               |
| A2         | [1-20]               |
| A3         | [1-15]               |
| A4         | [10-20]              |
| A4         | [1-15]               |

CodePudding user response：

If there can be no more than one row contained in another we can use:

from ast import literal_eval
df2 = pd.DataFrame(df['position_description'].str.replace('-', ',')
                                             .apply(literal_eval).tolist(),
                   index=df.index).sort_values(0)
print(df2)

    0   1
0   1  10
2   1  20
3   1  15
5   1  15
1   3   5
4  10  20

check = df2.groupby(df['query_name']).shift()
df.loc[~(df2[0].gt(check[0]) & df2[1].lt(check[1]))]

  query_name position_description
0         A1               [1-10]
2         A2               [1-20]
3         A3               [1-15]
4         A4              [10-20]
5         A4               [1-15]

CodePudding user response：

This should work for any number of ranges being contained by some ranges:

First, extract the boundaries

df = pd.DataFrame({
  'query_name': ['A1', 'A1', 'A2', 'A3', 'A4', 'A4'],
  'position_description': ['[1-10]', '[3-5]', '[1-20]', '[1-15]', '[10-20]', '[1-15]'],
})

df[['pos_x', 'pos_y']] = df['position_description'].str.extract(r'\[(\d )-(\d )\]').astype(int)

Then we will define the function that can choose what ranges to keep:

def non_contained_ranges(df):
    range_min = df['pos_x'].min()
    range_max = df['pos_y'].max()
    range_size = range_max - range_min   1
    
    b = np.zeros((len(df), range_size))
    
    for i, (x, y) in enumerate(df[['pos_x', 'pos_y']].values - range_min):
        b[i, x: y 1] = 1.
        
    b2 = np.logical_and(np.logical_xor(b[:, np.newaxis], b), b).any(axis=2)
    np.fill_diagonal(b2, True)
    
    b3 = b2.all(axis=0)
    
    return df[b3]

If there are N ranges within a group (query_name), this function will do N x N comparisons, using boolean array operations.

Then we can do groupby and apply the function to yield the expected result

df.groupby('query_name')\
    .apply(non_fully_overlapped_ranges)\
    .droplevel(0, axis=0).drop(columns=['pos_x', 'pos_y'])

Outcome:

    query_name  position_description
0   A1  [1-10]
2   A2  [1-20]
3   A3  [1-15]
4   A4  [10-20]
5   A4  [1-15]