Removing phrases in reverse order from a list

I have two lists.

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list

If I look closely, L1 has 'not worry' and 'good very' which are exact reversed repetitions of 'worry not' and 'very good'.

I need to remove such reversed elements from the list. Similarly, in L2, 'happy be always' is the reverse of 'always be happy' and should be removed as well.

The final output I'm looking for is:

L1 = ['worry not', 'be happy', 'very good', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend']

I tried one solution:

[[max(zip(map(set, map(str.split, group)), group))[1]] for group in L1]

But it does not give the correct output. Should I write separate functions for removing reversed bigrams and reversed trigrams, or is there a faster, more Pythonic way to do this? I'll have to run it on about 10K strings.

CodePudding user response：

You can do it with a list comprehension if you iterate over the list from the end:

lst = L1[::-1] # L2[::-1]
x = [s for i, s in enumerate(lst) if ' '.join(s.split()[::-1]) not in lst[i + 1:]][::-1]

# L1: ['worry not', 'be happy', 'very good', 'full stop']
# L2: ['take into account', 'always be happy', 'stay safe friend']

CodePudding user response：

You can use an index set and add both direct and reversed n-grams to it:

index = set()
res = []

for x in L1:
    a = tuple(x.split())
    b = tuple(reversed(a))
    if a in index or b in index:
        continue
    index.add(a)
    index.add(b)
    res.append(x)

print(res)
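
Since the question asks whether bigrams and trigrams need separate functions: the loop above never looks at the n-gram length, so it can be wrapped in one function and reused for both lists. A minimal sketch, where the function name `drop_reversed` is made up for illustration:

```python
def drop_reversed(phrases):
    """Keep the first occurrence of each phrase; skip later exact or reversed repeats."""
    seen = set()   # word tuples already kept
    kept = []
    for phrase in phrases:
        words = tuple(phrase.split())
        # Skip the phrase if it, or its word-reversed form, was kept earlier
        if words in seen or words[::-1] in seen:
            continue
        seen.add(words)
        kept.append(phrase)
    return kept


L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']

print(drop_reversed(L1))
print(drop_reversed(L2))
```

Set lookups take roughly constant time on average, so this should scale comfortably to about 10K strings.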

CodePudding user response：

Using a set of tuples is the way to deal with this:

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list

for list_ in L1, L2:
    s = set()
    for e in list_:
        t = tuple(e.split())
        if t[::-1] not in s:
            s.add(t)
    print([' '.join(e) for e in s])

Output:

['be happy', 'worry not', 'very good', 'full stop']
['always be happy', 'stay safe friend', 'take into account']
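
Note that a set does not remember insertion order, which is why the output above is not in the same order as the input. If the original order matters, one option is to use a dict as an ordered set. This is only a sketch: it assumes Python 3.7+ (where dicts keep insertion order), and the function name is hypothetical.

```python
def dedupe_keep_order(phrases):
    kept = {}  # dict used as an ordered set of word tuples
    for phrase in phrases:
        words = tuple(phrase.split())
        # Only record the phrase if its reversed form has not been recorded yet
        if words[::-1] not in kept:
            kept[words] = None
    return [' '.join(words) for words in kept]


L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']

print(dedupe_keep_order(L1))
print(dedupe_keep_order(L2))
```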

CodePudding user response：

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list


def solution(lst):
    res = []
    for item in lst:
        if " ".join(item.split()[::-1]) not in res:
            res.append(item)
    return res

print(solution(L2))

CodePudding user response：

How about keeping track of the seen phrases by using a set and sorted:

def get_deduped_phrases(phrases: list[str]) -> list[str]:
    seen_phrases = set()
    deduped_phrases = []
    for phrase in phrases:
        sorted_phrase = ' '.join(sorted(phrase.split()))
        if sorted_phrase not in seen_phrases:
            deduped_phrases.append(phrase)
        seen_phrases.add(sorted_phrase)
    return deduped_phrases
    
def main() -> None:
    L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
    print(f'{L1 = }')
    L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] # trigrams list
    print(f'{L2 = }')
    L1_deduped = get_deduped_phrases(L1)
    print(f'{L1_deduped = }')
    L2_deduped = get_deduped_phrases(L2)
    print(f'{L2_deduped = }')

if __name__ == '__main__':
    main()

Output:

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']
L1_deduped = ['worry not', 'be happy', 'very good', 'full stop']
L2_deduped = ['take into account', 'always be happy', 'stay safe friend']
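
One caveat with sorting: it treats any reordering of the same words as a duplicate, not only an exact reversal. For bigrams the two notions coincide, but for trigrams they can differ. A small sketch to show the difference (both helper names are made up for this example):

```python
def is_reversal(a, b):
    # True only if b is a word-for-word reversal of a
    return a.split() == b.split()[::-1]


def is_permutation(a, b):
    # True if a and b contain the same words in any order
    return sorted(a.split()) == sorted(b.split())


first, second = 'always be happy', 'be always happy'
print(is_reversal(first, second))     # False
print(is_permutation(first, second))  # True
```

So with the sorted approach, 'be always happy' would also be dropped as a duplicate of 'always be happy'. Whether that is acceptable depends on the data.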

CodePudding user response：

My solution iterates over each element of the list, splits it into a list of words and sorts it, then does the same for every later element. If the two sorted lists match, the later element is removed. Here is my code:

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] # trigrams list

def remove_duplicates(L):
    for idx_i, l_i in enumerate(L):
        aux_i = l_i.split()
        aux_i.sort()
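        # The inner loop runs over a slice copy; popping is only safe when each phrase has at most one later match
        for idx_j, l_j in enumerate(L[idx_i + 1:]):
            aux_j = l_j.split()
            aux_j.sort()
            if aux_i == aux_j:
                L.pop(idx_i + idx_j + 1)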
    print(L)

remove_duplicates(L1)
remove_duplicates(L2)

The output is what you're looking for:

>>> remove_duplicates(L1)
['worry not', 'be happy', 'very good', 'full stop']
>>> remove_duplicates(L2)
['take into account', 'always be happy', 'stay safe friend']

Hope this works for you

CodePudding user response：

This is a possible solution (the complexity is linear with respect to the number of strings):

from collections import defaultdict
from operator import itemgetter

d = defaultdict(list)
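for s in L2:  # or L1
    # Key: the larger of the phrase and its word-reversed form, so both orders share one key
    d[max(s, ' '.join(reversed(s.split())))].append(s)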

result = list(map(itemgetter(0), d.values()))

Here are the results:

['worry not', 'be happy', 'very good', 'full stop']
['take into account', 'always be happy', 'stay safe friend']
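
To run the same grouping idea on both lists, it can be wrapped in a function. The sketch below uses a plain dict with `setdefault` instead of `defaultdict`, keeps only the first phrase seen for each key, and assumes Python 3.7+ so that the dict preserves insertion order. The function name is hypothetical.

```python
def first_of_each_pair(phrases):
    groups = {}
    for phrase in phrases:
        flipped = ' '.join(reversed(phrase.split()))
        # A phrase and its reversal map to the same key: the larger of the two strings
        key = max(phrase, flipped)
        # setdefault stores the phrase only if the key is new, so the first one wins
        groups.setdefault(key, phrase)
    return list(groups.values())


L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']

print(first_of_each_pair(L1))
print(first_of_each_pair(L2))
```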