Home > database >  How to find the longest continuous stretch of matching elements in 2 lists
How to find the longest continuous stretch of matching elements in 2 lists

Time:01-20

I have 2 lists:

a = [
         'Okay. ',
         'Yeah. ',
         'So ',
         'my ',
         'thinking ',
         'is, ',
         'so ',
         'when ',
         "it's ",
         'set ',
         'up ',
         'just ',
         'one ',
         'and ',
         "we're ",
         'like ',
         'next ',
         'to ',
         'each ',
         'other '
     ]

b = [  
     'Okay. ',
     'Yeah. ',
     'Everything ',
     'as ',
     'normal ',
     'as ',
     'possible. ',
     'Yeah. ',
     'Yeah. ',
     'Okay. ',
     'Is ',
     'that ',
     'better? ',
     'Yeah. ',
     'So ',
     'my ',
     'thinking ',
     'is, ',
     'so ',
     'when '
     ]

Each list is slightly different. However, there will be moments when a stretch of continuous elements in a will match a stretch of continuous elements in b.

For example:

The first 2 elements in both lists match. The matching list would be ['Okay.', 'Yeah.']. This is only 2 elements long.

There is a longer stretch of matching words. You can see that each contains the following continuous set: ['Yeah. ','So ','my ','thinking ','is, ','so ','when '] This continuous matching sequence has 7 elements. This is the longest sequence.

I want the index of where this sequence starts for each list. For a, this should be 1 and for b this should be 13.

I understand that I can make every possible ordered sequence in a, starting with the longest, and check for a match in b, stopping once I get the match. However, this seems inefficent.

CodePudding user response:

How I would solve this:

from difflib import SequenceMatcher
match = SequenceMatcher(None, a, b).find_longest_match()
print(a[match.a:match.a   match.size])
print(b[match.b:match.b   match.size])

You get:

['Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ']
['Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ']

CodePudding user response:

So, we start from the top of 'a', and search through 'b' to find the longest match. Since this only continues as long as there is a match, it isn't terribly inefficient.

a = [
         'Okay. ',
         'Yeah. ',
         'So ',
         'my ',
         'thinking ',
         'is, ',
         'so ',
         'when ',
         "it's ",
         'set ',
         'up ',
         'just ',
         'one ',
         'and ',
         "we're ",
         'like ',
         'next ',
         'to ',
         'each ',
         'other '
     ]

b = [  
     'Okay. ',
     'Yeah. ',
     'Everything ',
     'as ',
     'normal ',
     'as ',
     'possible. ',
     'Yeah. ',
     'Yeah. ',
     'Okay. ',
     'Is ',
     'that ',
     'better? ',
     'Yeah. ',
     'So ',
     'my ',
     'thinking ',
     'is, ',
     'so ',
     'when '
     ]

start = None
maxlen = 0
for i in range(len(a)):
    for j in range(len(b)):
        for k in range(min(len(a)-i,len(b)-j)):
            if a[i k] != b[j k]:
                break
        if k > maxlen:
            start = (i,j)
            maxlen = k
print(start,maxlen)

Output:

(1, 13) 6
  • Related