Remove rows from a dataframe if they partially match other rows on specific columns


I want to remove rows from a dataframe whose start and end character indices partially overlap with those of other rows.

Details:

I have two sentences, from which I have extracted some entities and organized them in a dataframe.

Sentences:

| id  |            sentence                          |
| --- |             ---                              |
|  1  | Today is a very sunny day and sun is shining |
|  2  | I bought the red balloon and playing with it |

My dataframe with the extracted entities looks like this:

| id |     data        | start_char_index | end_char_index | token_position |
| ---| --------------  | ---------------- | -------------- | -------------- |
| 1  | very sunny day  |      11          |      26        |  [4,5,6]       |
| 1  | shining         |      37          |      45        |    [10]        |
| 1  | sunny           |      16          |      21        |    [5]         |
| 2  | the red balloon |      9           |      25        |   [3,4,5]      | 
| 2  | playing         |      29          |      37        |     [7]        |
| 2  | red             |      13          |      16        |     [4]        |

P.S. Here, token_position is the index of each token in the sentence (starting from 1).

Now, for id 1 we see that 'very sunny day' and 'sunny' partially overlap (both their character index ranges and their token positions overlap). The same goes for id 2, where 'the red balloon' and 'red' overlap on 'red'. I want to remove the rows 'sunny' and 'red', i.e. the smaller span of each overlapping pair.
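
For concreteness, here is a minimal sketch of the overlap test I mean, treating the spans as closed intervals (the helper name is just for illustration):

def spans_overlap(a_start, a_end, b_start, b_end):
    # Closed-interval overlap: each span starts at or before the other ends
    return a_start <= b_end and b_start <= a_end

print(spans_overlap(11, 26, 16, 21))  # 'very sunny day' vs 'sunny'   -> True
print(spans_overlap(11, 26, 37, 45))  # 'very sunny day' vs 'shining' -> False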

I thought about grouping them by id and then removing those records by storing the start and end character indices (or token positions) in a dictionary, but with a lot of data rows and a lot of ids this would be very slow. I also read about IntervalTree, but I could not get it to handle partial overlaps very efficiently.

Could you please suggest a solution for this?

The final output dataframe should look like this:

| id |     data        | start_char_index | end_char_index | token_position |
| ---| --------------  | ---------------- | -------------- | -------------- |
| 1  | very sunny day  |      11          |      26        |  [4,5,6]       |
| 1  | shining         |      37          |      45        |    [10]        |
| 2  | the red balloon |      9           |      25        |   [3,4,5]      | 
| 2  | playing         |      29          |      37        |     [7]        |

Thanks for the help in advance :)

CodePudding user response:

I am not sure a DataFrame is the best structure to solve this problem, but here is one approach. Note that it relies on row order: the larger span must appear before its sub-spans (as in your example), since each row is compared against the cumulative range of the rows above it.

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1], 'start': [11, 20, 16], 'end': [18, 35, 17]})

# First we construct a range of numbers from the start and end index
df.loc[:, 'range'] = df.apply(lambda x: list(range(x['start'], x['end'])), axis=1)

# Next, we "cumulate" these ranges (cumsum on a column of lists concatenates them)
# and measure the number of unique elements in the cumulative range at each row
df['range_size'] = df['range'].cumsum().apply(lambda x: len(set(x)))

# Finally we check whether each row adds anything to the cumulative range - if a new row adds nothing, we can drop it
df['range_size_shifted'] = df['range_size'].shift(1)
df['drop'] = df['range_size'] == df['range_size_shifted']  # the NaN in the first row compares as False

print(df[['id', 'start', 'end', 'drop']])
#    id  start  end   drop
# 0   1     11   18  False
# 1   1     20   35  False
# 2   1     16   17   True
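
To then actually drop the flagged rows and remove the helper columns, something like:

# Keep only the rows that contributed new character positions
df = df[~df['drop']].drop(columns=['range', 'range_size', 'range_size_shifted', 'drop'])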

If you want to do this for each id group separately:

for key, group in df.groupby('id'):
    group = group.copy()  # work on a copy to avoid SettingWithCopyWarning on the groupby slice
    group['range'] = group.apply(lambda x: list(range(x['start'], x['end'])), axis=1)
    group['range_size'] = group['range'].cumsum().apply(lambda x: len(set(x)))
    group['range_size_shifted'] = group['range_size'].shift(1)
    group['drop'] = group['range_size'] == group['range_size_shifted']
    print(group)
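
The same per-group logic can also be packaged as a function and applied with groupby, stitching the kept rows back into one dataframe. A sketch, reusing the columns defined above (the function name is my own):

def flag_and_filter(group):
    group = group.copy()
    group['range'] = group.apply(lambda x: list(range(x['start'], x['end'])), axis=1)
    group['range_size'] = group['range'].cumsum().apply(lambda x: len(set(x)))
    group['drop'] = group['range_size'] == group['range_size'].shift(1)
    # Return only rows that added new positions, minus the helper columns
    return group[~group['drop']].drop(columns=['range', 'range_size', 'drop'])

result = df.groupby('id', group_keys=False).apply(flag_and_filter)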

CodePudding user response:

Apart from Mortz's answer, I also tried pandas' IntervalArray and its overlaps method, which was faster for me. Putting it here for anyone else who might find it useful (credits: https://stackoverflow.com/a/69336914/15941713):

import pandas as pd

def drop_subspan_duplicates(df):
    # Build an IntervalArray from the start/end columns (closed on both ends)
    idx1 = pd.arrays.IntervalArray.from_arrays(df['start'],
                                               df['end'],
                                               closed='both')

    # For each row, record the index of the first row whose interval overlaps it;
    # a sub-span maps to the same index as the larger span that precedes it
    df['wrd_id'] = df.apply(
        lambda x: df.index[idx1.overlaps(pd.Interval(x['start'], x['end'], closed='both'))][0],
        axis=1)
    df = df.drop_duplicates(['wrd_id'], keep='first')
    df.drop(['wrd_id'], axis=1, inplace=True)
    return df

output = data.groupby('id').apply(drop_subspan_duplicates)  # 'data' is the entity dataframe shown above
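
Note that groupby().apply here returns a frame indexed by the group key; assuming you want a flat index again (the id column is preserved in the data itself):

output = output.reset_index(drop=True)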

One can also refer to this answer for a way of tackling the issue without dataframe operations.
