Finding overlaps between multiple nested dictionaries-CodePudding

In python, I have one large dictionary which contains smaller nested dictionaries.

The dictionary could have as many as 7 nested dictionaries, and up to 100 'exons'. There will always be overlaps between some of exons. Here is a relatively small example:

{
    0: {
        "exon_count": 4,
        "length": 29832,
        "exon": {
            0: {"start": 77, "end": 323},
            1: {"start": 1245, "end": 1507},
            2: {"start": 6225, "end": 6598},
            3: {"start": 29186, "end": 29909},
        },
    },
    1: {
        "exon_count": 3,
        "length": 6688,
        "exon": {
            0: {"start": 0, "end": 323},
            1: {"start": 1245, "end": 1507},
            2: {"start": 6225, "end": 6688},
        },
    },
    2: {
        "exon_count": 4,
        "length": 6688,
        "exon": {
            0: {"start": 0, "end": 323},
            1: {"start": 487, "end": 971},
            2: {"start": 1245, "end": 1507},
            3: {"start": 6225, "end": 6688},
        },
    },
}

I am struggling to find a way to extract all of the overlaps between the exons. I.e. only extracting start and end coordinates which appear in every nested dictionary - the order does not matter.

Here are the overlaps in this example:

77-323
1245-1507
6225-6598

I thought of using a basic for loop, but I don't think it works well here. Please help!

CodePudding user response：

You can put all integers in these ranges into a set, then intersect all the sets, and finally convert the resulting set back into a list of ranges. It's not the most efficient in theory, because it depends on the size of the ranges (end - start). But if those are always in the same ballpark as your example, it shouldn't matter.

Here's an implementation with some explanatory comments and doctests.

def overlaps(exons_dict):
    '''
    Computes overlap between all exons in the dictionary. Returns it as a list
    of (start, end) tuples, with start and end inclusive.

    >>> overlaps(EXAMPLE)
    [(77, 323), (1245, 1507), (6225, 6598)]
    '''
    # Handle the case of no exons, otherwise there is no first.
    if not exons_dict:
        return []
    # Create an iterator over all exons.
    exons = iter(exons_dict.values())
    # Initialize the overlap set from the first exon.
    overlap = exon_values(next(exons))
    # Intersect current overlap with each successive exon.
    for exon in exons:
        overlap.intersection_update(exon_values(exon))
    # Turn the overlap set back into a list of ranges.
    return values_to_ranges(overlap)

def exon_values(exon):
    '''
    Given an exon, converts all values in all its ranges into a set.
    '''
    values = set()
    for row in exon["exon"].values():
        for i in range(row["start"], row["end"]   1):
            values.add(i)
    return values

def values_to_ranges(values):
    '''
    Turns a set into a list of (start, end) tuples, with start and end
    inclusive.

    >>> values_to_ranges({7, 6, 1, 8, 3, 2})
    [(1, 3), (6, 8)]
    >>> values_to_ranges({4, 3})
    [(3, 4)]
    >>> values_to_ranges(set())
    []
    '''
    # Sort the values into an ascending list.
    sorted_values = sorted(list(values))
    # Extract ranges.
    ranges = []
    for value in sorted_values:
        # Does current value belong to latest range?
        if ranges and value == ranges[-1][1]   1:
            # Yes: extend current range.
            ranges[-1] = (ranges[-1][0], value)
        else:
            # No: create a new range.
            ranges.append((value, value))
    return ranges

CodePudding user response：

Use recursion to travel in the dictionary until the exon key is met, then append all "intervals" as pairs. The overlaps are found by computing the intersection of each possible pairing (the set -> range -> interval approach is taken from Thomas otherwise do a if-else hunt to find the min and max).

import itertools as it

data = # from question

def exon_overlaps(data: dict):
    overlaps = []

    for k in data:
        if isinstance(next_d := data[k], dict):
            if k != 'exon':
                overlaps.extend(exon_overlaps(next_d))
            else:
                overlaps.extend((interval['start'], interval['end']) for interval in next_d.values())
                return overlaps
    return overlaps


intervals = exon_overlaps(d)

# step 2: find intersections
overlaps = []
for (a, b), (c, d) in it.combinations(intervals, 2):
    if (a, b) != (c, d):
        if (i := set.intersection(set(range(a, b 1)), set(range(c, d 1)))) != set():
            if (interval := f'{min(i)}-{max(i)}') not in overlaps:
                overlaps.append(interval)

print(overlaps)

Output

['77-323', '6225-6598']

Remark: The output is different from the one in the question because is not explained how repeated interval should be treated in a consistent way