In python, I have one large dictionary which contains smaller nested dictionaries.
The dictionary could have as many as 7 nested dictionaries, and up to 100 'exons'. There will always be overlaps between some of exons. Here is a relatively small example:
{
0: {
"exon_count": 4,
"length": 29832,
"exon": {
0: {"start": 77, "end": 323},
1: {"start": 1245, "end": 1507},
2: {"start": 6225, "end": 6598},
3: {"start": 29186, "end": 29909},
},
},
1: {
"exon_count": 3,
"length": 6688,
"exon": {
0: {"start": 0, "end": 323},
1: {"start": 1245, "end": 1507},
2: {"start": 6225, "end": 6688},
},
},
2: {
"exon_count": 4,
"length": 6688,
"exon": {
0: {"start": 0, "end": 323},
1: {"start": 487, "end": 971},
2: {"start": 1245, "end": 1507},
3: {"start": 6225, "end": 6688},
},
},
}
I am struggling to find a way to extract all of the overlaps between the exons. I.e. only extracting start and end coordinates which appear in every nested dictionary - the order does not matter.
Here are the overlaps in this example:
77-323
1245-1507
6225-6598
I thought of using a basic for loop, but I don't think it works well here. Please help!
CodePudding user response:
You can put all integers in these ranges into a set
, then intersect all the set
s, and finally convert the resulting set
back into a list of ranges. It's not the most efficient in theory, because it depends on the size of the ranges (end - start
). But if those are always in the same ballpark as your example, it shouldn't matter.
Here's an implementation with some explanatory comments and doctests.
def overlaps(exons_dict):
'''
Computes overlap between all exons in the dictionary. Returns it as a list
of (start, end) tuples, with start and end inclusive.
>>> overlaps(EXAMPLE)
[(77, 323), (1245, 1507), (6225, 6598)]
'''
# Handle the case of no exons, otherwise there is no first.
if not exons_dict:
return []
# Create an iterator over all exons.
exons = iter(exons_dict.values())
# Initialize the overlap set from the first exon.
overlap = exon_values(next(exons))
# Intersect current overlap with each successive exon.
for exon in exons:
overlap.intersection_update(exon_values(exon))
# Turn the overlap set back into a list of ranges.
return values_to_ranges(overlap)
def exon_values(exon):
'''
Given an exon, converts all values in all its ranges into a set.
'''
values = set()
for row in exon["exon"].values():
for i in range(row["start"], row["end"] 1):
values.add(i)
return values
def values_to_ranges(values):
'''
Turns a set into a list of (start, end) tuples, with start and end
inclusive.
>>> values_to_ranges({7, 6, 1, 8, 3, 2})
[(1, 3), (6, 8)]
>>> values_to_ranges({4, 3})
[(3, 4)]
>>> values_to_ranges(set())
[]
'''
# Sort the values into an ascending list.
sorted_values = sorted(list(values))
# Extract ranges.
ranges = []
for value in sorted_values:
# Does current value belong to latest range?
if ranges and value == ranges[-1][1] 1:
# Yes: extend current range.
ranges[-1] = (ranges[-1][0], value)
else:
# No: create a new range.
ranges.append((value, value))
return ranges
CodePudding user response:
Use recursion to travel in the dictionary until the exon
key is met, then append all "intervals" as pairs. The overlaps are found by computing the intersection of each possible pairing (the set -> range -> interval
approach is taken from Thomas otherwise do a if-else
hunt to find the min and max).
import itertools as it
data = # from question
def exon_overlaps(data: dict):
overlaps = []
for k in data:
if isinstance(next_d := data[k], dict):
if k != 'exon':
overlaps.extend(exon_overlaps(next_d))
else:
overlaps.extend((interval['start'], interval['end']) for interval in next_d.values())
return overlaps
return overlaps
intervals = exon_overlaps(d)
# step 2: find intersections
overlaps = []
for (a, b), (c, d) in it.combinations(intervals, 2):
if (a, b) != (c, d):
if (i := set.intersection(set(range(a, b 1)), set(range(c, d 1)))) != set():
if (interval := f'{min(i)}-{max(i)}') not in overlaps:
overlaps.append(interval)
print(overlaps)
Output
['77-323', '6225-6598']
Remark: The output is different from the one in the question because is not explained how repeated interval should be treated in a consistent way