Home > Net >  Filter by longest sequence and delete shorter sequences by duplicates
Filter by longest sequence and delete shorter sequences by duplicates

Time:10-12

I've got a problem and I don't really know how to solve it with Python.

I want to filter my output of words and sequences and I want to remove 'duplicates'. I put duplicates in quotation marks because they are not real duplicates. The filter should prefer the longest possible sequence without any duplicate words. See the following example for clarity.

Example output:

out: ['quick', 'brown', 'fox', 'quick fox', 'brown fox', 'grey wolf', 'brown hare', 'quick brown fox']

After it's going through the filter, only ['quick brown fox', 'grey wolf', 'brown hare'] should be in the result. I don't know if it's possible to filter ['brown'] but keeping both ['quick brown fox'] and ['brown hare']...

I hope that I described my problem clearly enough. Any help and ideas are very welcome! Thanks in advance.

CodePudding user response:

In other words, you need to find all sequences that are not subsets of any other sequence.

Let's proceed like this:

  • convert each item into a pair text, set of words
  • sort the list by the set length descending
  • for every pair, check if the set component is a subset of any preceding set
  • if not, add to the result

inp = ['quick', 'brown', 'fox', 'quick fox', 'brown fox', 'grey wolf', 'brown hare', 'quick brown fox']

pairs = [
    [s, set(s.split())]
    for s in inp
]

pairs.sort(key=lambda p: -len(p[1]))

res = [
    pair[0]
    for n, pair in enumerate(pairs)
    if all(not pair[1].issubset(prev[1]) for prev in pairs[:n])
]

print(res) # ['quick brown fox', 'grey wolf', 'brown hare']
  • Related