I am trying to delete the overlapping values in a nested list.
The data looks like this:
[[22, 37, 'foobar'], [301, 306, 'foobar'],[369, 374, 'foobar'], [650, 672, 'foobar'], [1166, 1174, 'foobar'],[1469, 1477, 'foobar'],[2237, 2245, 'foobar'],[2702, 2724, 'foobar'],[3426, 3446, 'foobar'],[3505, 3513, 'foobar'],[3756, 3764, 'foobar'],[69524, 69535, 'foobar'],[3812, 3820, 'foobar'],[4034, 4057, 'foobar'],[4318, 4347, 'foobar'],[58531, 58548, 'foobar'],[4552, 4574, 'foobar'],[4854, 4861, 'foobar'],[5769, 5831, 'foobar'], [5976, 5986, 'foobar'],[6541, 6558, 'foobar'],[6541, 6608, 'foobar'],[7351, 7364, 'foobar'],[7351, 7364, 'foobar'], [7764, 7770, 'foobar'],[58540, 58548, 'foobar'],[69524, 69556, 'foobar']]
There are some overlapping values in index 0 and 1 of across the list. Such as:
[6541, 6558, 'foobar'] overlaps with [6541, 6608, 'foobar']
[7351, 7364, 'foobar'] overlaps with [7351, 7364, 'foobar']
[58531, 58548, 'foobar'] overlaps with [58540, 58548, 'foobar']
[69524, 69535, 'foobar'] overlaps with [69524, 69556, 'foobar']
I am trying to go through the list and remove shorter first instance of the overlapping values. If
[6541, 6558, 'foobar']
overlaps with [6541, 6608, 'foobar']
I want to keep [6541, 6608, 'foobar']
and remove [6541, 6558, 'foobar']
from the list.
So far i tried:
def clean_span(adata):
data = adata.copy()
rem_idx = []
for i in range(len(data)-1):
if data[i][0] in data[i 1] or data[i][1] in data[i 1]:
print(" {} overlaps with {}".format(data[i], data[i 1]))
rem_idx.append(i)
for i in rem_idx:
del data[i]
return data
But this code always leaves some overlapping values behind.
It is same with this approach as well.
def clean_span(adata):
data = adata.copy()
new_data = []
for i in range(len(data)-1):
if data[i][0] in data[i 1] or data[i][1] in data[i 1]:
print(" {} overlaps with {}".format(data[i], data[i 1]))
new_data.append(data[i 1])
else:
new_data.append(data[i])
return new_data
I would appreciate your help to solve this problem.
CodePudding user response:
def clean_span(adata):
# to perform O(1) search for index 0 and 1
# you can just have one dictionary if indexes don't matter
d0 = dict()
d1 = dict()
r = []
for a in adata:
if a[0] in d0:
print(str(a) " overlaps with " str(d0[a[0]]))
elif a[1] in d1:
print(str(a) " overlaps with " str(d1[a[1]]))
else:
r.append(a)
d0[a[0]] = a
d1[a[1]] = a
return r
Keep 2 dictionaries: one for first element and one for second. Then while iterating on the data, check if any of the keys exist in the respective dictionaries -- if the key is found, it is an overlap; else it is not.
Problem in your code:
if data[i][0] in data[i 1] or data[i][1] in data[i 1]:
print(" {} overlaps with {}".format(data[i], data[i 1]))
new_data.append(data[i 1])
else:
new_data.append(data[i])
In the if statement, you add i 1
to the new_data
. So naturally when the loop increments to i 1
, it goes into the else
and it adds the overlapped element back to the list.
Side note:
if data[i][0] in data[i 1] or data[i][1] in data[i 1]:
You are trying to search the element in the entire list here. Making your time complexity O(nk)
.
CodePudding user response:
Overlapping can be found with set.intersection
.
import itertools as it
l = # list
# merge the first two entries of the sublist into a from-to set values
m = ((set(range(p[0], p[1] 1)), p[2]) for p in l)
# combine each element of the list to check overlapping
new_l = []
for p1, p2 in it.combinations(m, 2):
s1, l1 = p1
s2, l2 = p2
if set.intersection(s1, s2):
m1, M1 = min(s1), max(s1)
m2, M2 = min(s2), max(s2)
# choose the biggest one
if M2-m2 > M1-m1:
new_l.append((m2, M2, l2))
else:
new_l.append((m1, M1, l1))
print(sorted(new_l, key=lambda p: p[0]))