I've got a problem and I don't really know how to solve it with Python.
I want to filter my output of words and sequences and I want to remove 'duplicates'. I put duplicates in quotation marks because they are not real duplicates. The filter should prefer the longest possible sequence without any duplicate words. See the following example for clarity.
Example output:
out: ['quick', 'brown', 'fox', 'quick fox', 'brown fox', 'grey wolf', 'brown hare', 'quick brown fox']
After it's going through the filter, only ['quick brown fox', 'grey wolf', 'brown hare']
should be in the result.
I don't know if it's possible to filter ['brown']
but keeping both ['quick brown fox'] and ['brown hare']...
I hope that I described my problem clearly enough. Any help and ideas are very welcome! Thanks in advance.
CodePudding user response:
In other words, you need to find all sequences that are not subsets of any other sequence.
Let's proceed like this:
- convert each item into a pair
text
,set of words
- sort the list by the set length descending
- for every pair, check if the
set
component is a subset of any preceding set - if not, add to the result
inp = ['quick', 'brown', 'fox', 'quick fox', 'brown fox', 'grey wolf', 'brown hare', 'quick brown fox']
pairs = [
[s, set(s.split())]
for s in inp
]
pairs.sort(key=lambda p: -len(p[1]))
res = [
pair[0]
for n, pair in enumerate(pairs)
if all(not pair[1].issubset(prev[1]) for prev in pairs[:n])
]
print(res) # ['quick brown fox', 'grey wolf', 'brown hare']