Home > Net >  Filtering tuples within a range from a reference list
Filtering tuples within a range from a reference list

Time:05-22

I have a reference list of tuples containing different range of values.

[(1042, 1056), (895, 922), (966, 995), (692, 716), (667, 690), 
 (667, 690), (667, 690), (479, 508), (1112, 1578)]

I have the following list of lists containing tuple of values which has to be compared against the reference list.

[  [(450,470)],
   [(100, 200), (500, 700)],
   [(0, 29), (3827, 3856)],
   [(820, 835), (1539, 1554)],
   [(622, 635), (1286, 1299), (1585, 1598), (1607, 1620)],
   [(637, 642), (780, 785), (1341, 1346), (1944, 1949), (2044, 2049),
    (2158, 2163), (2594, 2599), (2643, 2648)]  ]

I am trying to pick one tuple from each list which is in the range of tuples present in the reference list.

The conditions I considered are :

  1. If the input list contains no tuple which has a value in the range of reference list, then any tuple can be taken. For example [(0, 29), (3827, 3856)] is not in the range of the reference list so, I can take any of the tuple. By default I append the first tuple in the list to the reference list.

  2. If a tuple within the range of the reference list is found, then that tuple is appended to the reference list and stops searching in that loop. Example is [(622, 635), (1286, 1299), (1585, 1598), (1607, 1620)]

  3. If more than one tuple is also present in the range of reference list, then the first found tuple is appended to the reference list. Example is [(637, 642), (780, 785), (1341, 1346), (1944, 1949), (2044, 2049), (2158, 2163), (2594, 2599), (2643, 2648)]

  4. Values in a tuple will never be same and second value in a tuple will always be larger than the first value.

The logic I used to find the range is I took the minimum value and maximum value in the first position of tuple of the reference list. The I did simple iteration.

Code I used is

tag_pos_refin = [(1042, 1056), (895, 922), (966, 995), (692, 716), (667, 690), 
                 (667, 690), (667, 690), (479, 508), (1112, 1578)]

tag_pos_db = [  [(450,470)],
                [(100, 200), (500, 700)],
                [(0, 29), (3827, 3856)],
                [(820, 835), (1539, 1554)],
                [(622, 635), (1286, 1299), (1585, 1598), (1607, 1620)],
                [(637, 642), (780, 785), (1341, 1346), (1944, 1949), (2044, 2049), (2158, 2163), 
                  (2594, 2599), (2643, 2648)]
            ]


min_threshold = min(tag_pos_refin)[0]
max_threshold = max(tag_pos_refin)[0]

for tag_pos in tag_pos_db:
    if len(tag_pos) == 1:
        tag_pos_refin.extend(tag_pos)

for tag_pos in tag_pos_db:
    if len(tag_pos) > 1:
        for j in tag_pos:
            if j[0] in range(min_threshold, max_threshold):
                tag_pos_refin.append(j)
                break
            elif min(tag_pos)[0] not in range(min_threshold, max_threshold):
                tag_pos_refin.append(j)
                break             

print(tag_pos_refin)

Output Obtained

[(1042, 1056), (895, 922), (966, 995), (692, 716), (667, 690), (667, 690), (667, 690), (479, 508), (1112, 1578), (450, 470), (100, 200), (0, 29), (820, 835), (622, 635), (637, 642)]

Desired Output

[(1042, 1056), (895, 922), (966, 995), (692, 716), (667, 690), (667, 690), (667, 690), (479, 508), (1112, 1578), (450, 470), (500, 700), (0, 29), (820, 835), (622, 635), (637, 642)]

My doubt is

Is it possible to write the code in a better way or better logic for finding the range so that instead of (100,200), the best tuple is (500,700).

(Use case of this is bit complicated to explain: But the values of a tuple can be considered as index point of words or sentences in a text)

CodePudding user response:

There are a number of issues with your code. Firstly, you need to check both values of every tuple, since either one could be in range, but not necessarily both. Secondly, it's inefficient to constantly re-create a range object to make a simple bounds check, and your implementation also has an off-by-one error (assuming you want an inclusive range). Thirdly, you don't check all tuples when looking for a match, which means the wrong tuple may be appended.

In the solution below, I have added some extra tests to check boundary cases:

tag_pos_refin = [(1042, 1056), (895, 922), (966, 995), (692, 716), (667, 690),
                 (667, 690), (667, 690), (479, 508), (1112, 1578)]

tag_pos_db = [
    [(200, 1500), (1112, 1200)], # test upper bound
    [(100, 1600), (275, 479)], # test lower bound
    [(450, 470)],
    [(100, 200), (500, 700)],
    [(0, 29), (3827, 3856)],
    [(820, 835), (1539, 1554)],
    [(622, 635), (1286, 1299), (1585, 1598), (1607, 1620)],
    [(637, 642), (780, 785), (1341, 1346), (1944, 1949), (2044, 2049), (2158, 2163),
     (2594, 2599), (2643, 2648)],
    ]

min_threshold = min(tag_pos_refin)[0]
max_threshold = max(tag_pos_refin)[0]

print(f'min/max: {min_threshold}-{max_threshold}\n')

for tag_pos in tag_pos_db:
    if tag_pos:
        print(f'checking {tag_pos}', end='')
        for j in tag_pos:
            if (min_threshold <= j[0] <= max_threshold or
                min_threshold <= j[1] <= max_threshold):
                print(' -> found match')
                tag_pos_refin.append(j)
                break
        else:
            print(' -> no matches')
            tag_pos_refin.append(tag_pos[0])
        print(f'APPENDED: {tag_pos_refin[-1]}\n')

print(f'RESULT: {tag_pos_refin}\n')
    

Output:

min/max: 479-1112

checking [(200, 1500), (1112, 1200)] -> found match
APPENDED: (1112, 1200)

checking [(100, 1600), (275, 479)] -> found match
APPENDED: (275, 479)

checking [(450, 470)] -> no matches
APPENDED: (450, 470)

checking [(100, 200), (500, 700)] -> found match
APPENDED: (500, 700)

checking [(0, 29), (3827, 3856)] -> no matches
APPENDED: (0, 29)

checking [(820, 835), (1539, 1554)] -> found match
APPENDED: (820, 835)

checking [(622, 635), (1286, 1299), (1585, 1598), (1607, 1620)] -> found match
APPENDED: (622, 635)

checking [(637, 642), (780, 785), (1341, 1346), (1944, 1949), (2044, 2049), (2158, 2163), (2594, 2599), (2643, 2648)] -> found match
APPENDED: (637, 642)

RESULT: [(1042, 1056), (895, 922), (966, 995), (692, 716), (667, 690), (667, 690), (667, 690), (479, 508), (1112, 1578), (1112, 1200), (275, 479), (450, 470), (500, 700), (0, 29), (820, 835), (622, 635), (637, 642)]
  • Related