Home > Net >  How to remove overlapping item in a nested list?
How to remove overlapping item in a nested list?

Time:04-08

I am trying to delete the overlapping values in a nested list.

The data looks like this:

[[22, 37, 'foobar'], [301, 306, 'foobar'],[369, 374, 'foobar'], [650, 672, 'foobar'], [1166, 1174, 'foobar'],[1469, 1477, 'foobar'],[2237, 2245, 'foobar'],[2702, 2724, 'foobar'],[3426, 3446, 'foobar'],[3505, 3513, 'foobar'],[3756, 3764, 'foobar'],[69524, 69535, 'foobar'],[3812, 3820, 'foobar'],[4034, 4057, 'foobar'],[4318, 4347, 'foobar'],[58531, 58548, 'foobar'],[4552, 4574, 'foobar'],[4854, 4861, 'foobar'],[5769, 5831, 'foobar'], [5976, 5986, 'foobar'],[6541, 6558, 'foobar'],[6541, 6608, 'foobar'],[7351, 7364, 'foobar'],[7351, 7364, 'foobar'], [7764, 7770, 'foobar'],[58540, 58548, 'foobar'],[69524, 69556, 'foobar']]

There are some overlapping values in index 0 and 1 of across the list. Such as:

 [6541, 6558, 'foobar'] overlaps with [6541, 6608, 'foobar']
 [7351, 7364, 'foobar'] overlaps with [7351, 7364, 'foobar']
 [58531, 58548, 'foobar'] overlaps with [58540, 58548, 'foobar']
 [69524, 69535, 'foobar'] overlaps with [69524, 69556, 'foobar']

I am trying to go through the list and remove shorter first instance of the overlapping values. If [6541, 6558, 'foobar'] overlaps with [6541, 6608, 'foobar'] I want to keep [6541, 6608, 'foobar'] and remove [6541, 6558, 'foobar'] from the list.

So far i tried:

def clean_span(adata):
    data = adata.copy()
    rem_idx = []
    for i in range(len(data)-1):
        if data[i][0] in data[i 1] or data[i][1] in data[i 1]:
            print(" {} overlaps with {}".format(data[i], data[i 1]))
            rem_idx.append(i)
    
    for i in rem_idx:
        del data[i]
    return data

But this code always leaves some overlapping values behind.

It is same with this approach as well.

def clean_span(adata):
    data = adata.copy()
    new_data = []
    for i in range(len(data)-1):
        if data[i][0] in data[i 1] or data[i][1] in data[i 1]:
            print(" {} overlaps with {}".format(data[i], data[i 1]))
            new_data.append(data[i 1])
        else:
            new_data.append(data[i])
    return new_data

I would appreciate your help to solve this problem.

CodePudding user response:

def clean_span(adata):
  # to perform O(1) search for index 0 and 1
  # you can just have one dictionary if indexes don't matter
  d0 = dict()
  d1 = dict()

  r = []
  for a in adata:
    if a[0] in d0:
      print(str(a)   " overlaps with "   str(d0[a[0]]))
    elif a[1] in d1:
      print(str(a)   " overlaps with "   str(d1[a[1]]))
    else:  
      r.append(a)
      d0[a[0]] = a
      d1[a[1]] = a

  return r

Keep 2 dictionaries: one for first element and one for second. Then while iterating on the data, check if any of the keys exist in the respective dictionaries -- if the key is found, it is an overlap; else it is not.


Problem in your code:

if data[i][0] in data[i 1] or data[i][1] in data[i 1]:
    print(" {} overlaps with {}".format(data[i], data[i 1]))
    new_data.append(data[i 1])
else:
    new_data.append(data[i])

In the if statement, you add i 1 to the new_data. So naturally when the loop increments to i 1, it goes into the else and it adds the overlapped element back to the list.


Side note:

if data[i][0] in data[i 1] or data[i][1] in data[i 1]:

You are trying to search the element in the entire list here. Making your time complexity O(nk).

CodePudding user response:

Overlapping can be found with set.intersection.

import itertools as it

l = # list

# merge the first two entries of the sublist into a from-to set values
m = ((set(range(p[0], p[1] 1)), p[2]) for p in l)

# combine each element of the list to check overlapping
new_l = []
for p1, p2 in it.combinations(m, 2):
    s1, l1 = p1
    s2, l2 = p2

    if set.intersection(s1, s2):
        m1, M1 = min(s1), max(s1)
        m2, M2 = min(s2), max(s2)

        # choose the biggest one
        if M2-m2 > M1-m1:
            new_l.append((m2, M2, l2))
        else:
            new_l.append((m1, M1, l1))

print(sorted(new_l, key=lambda p: p[0]))
  • Related