Removing phrases in reverse order from a List

Time:06-23

I have two lists.

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list

If I look closely, L1 has 'not worry' and 'good very' which are exact reversed repetitions of 'worry not' and 'very good'.

I need to remove such reversed elements from the list. Similarly, in L2, 'happy be always' is the reverse of 'always be happy' and should be removed as well.

The final output I'm looking for is:

L1 = ['worry not', 'be happy', 'very good', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend']

I tried one solution

[[max(zip(map(set, map(str.split, group)), group))[1]] for group in L1]

But it does not give the correct output. Should I write separate functions for removing reversed bigrams and reversed trigrams, or is there a more Pythonic and faster way to do this? I will have to run it on about 10K strings.

CodePudding user response:

You can do it with list comprehensions if you iterate over the list from the end

lst = L1[::-1] # L2[::-1]
x = [s for i, s in enumerate(lst) if ' '.join(s.split()[::-1]) not in lst[i + 1:]][::-1]

# L1: ['worry not', 'be happy', 'very good', 'full stop']
# L2: ['take into account', 'always be happy', 'stay safe friend']

CodePudding user response:

You can use an index set and add both direct and reversed n-grams to it:

index = set()
res = []

for x in L1:
    a = tuple(x.split())
    b = tuple(reversed(a))
    if a in index or b in index:
        continue
    index.add(a)
    index.add(b)
    res.append(x)

print(res)
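
The loop above can be wrapped in a function so that the same code handles bigrams, trigrams, or n-grams of any length. This is only a minimal sketch; the name `drop_reversed` is hypothetical and not part of the original answer:

```python
def drop_reversed(phrases):
    """Keep the first occurrence of each phrase; skip later duplicates and exact reversals."""
    index = set()  # every kept n-gram, stored in both word orders
    res = []
    for phrase in phrases:
        a = tuple(phrase.split())
        b = a[::-1]
        if a in index or b in index:
            continue
        index.update((a, b))
        res.append(phrase)
    return res


L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']

print(drop_reversed(L1))  # ['worry not', 'be happy', 'very good', 'full stop']
print(drop_reversed(L2))  # ['take into account', 'always be happy', 'stay safe friend']
```

Set lookups take constant time on average, so the work grows roughly linearly with the number of phrases, which should be comfortable for about 10K strings.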

CodePudding user response:

Using a set of tuples is the way to deal with this:

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list

for list_ in L1, L2:
    s = set()
    for e in list_:
        t = tuple(e.split())
        if t[::-1] not in s:
            s.add(t)
    print([' '.join(e) for e in s])

Output (a set is unordered, so the order of the phrases may differ between runs):

['be happy', 'worry not', 'very good', 'full stop']
['always be happy', 'stay safe friend', 'take into account']
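
If the original order of the phrases matters, one option is to collect them in a dict instead of a set, since dicts preserve insertion order in Python 3.7+. A sketch under that assumption, with `dedupe_keep_order` as a hypothetical name:

```python
def dedupe_keep_order(phrases):
    seen = {}  # maps word tuple -> original phrase, in insertion order
    for phrase in phrases:
        t = tuple(phrase.split())
        # Skip the phrase if it, or its reversal, has already been kept
        if t not in seen and t[::-1] not in seen:
            seen[t] = phrase
    return list(seen.values())


L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']

print(dedupe_keep_order(L1))  # ['worry not', 'be happy', 'very good', 'full stop']
print(dedupe_keep_order(L2))  # ['take into account', 'always be happy', 'stay safe friend']
```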

CodePudding user response:

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list


def solution(lst):
    res = []
    for item in lst:
        if " ".join(item.split()[::-1]) not in res:
            res.append(item)
    return res

print(solution(L2))

CodePudding user response:

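How about keeping track of the seen phrases by using a set and sorted? Note that sorting treats any reordering of the same words as a duplicate, not only an exact reversal: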

def get_deduped_phrases(phrases: list[str]) -> list[str]:
    seen_phrases = set()
    deduped_phrases = []
    for phrase in phrases:
        sorted_phrase = ' '.join(sorted(phrase.split()))
        if sorted_phrase not in seen_phrases:
            deduped_phrases.append(phrase)
        seen_phrases.add(sorted_phrase)
    return deduped_phrases
    
def main() -> None:
    L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
    print(f'{L1 = }')
    L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] # trigrams list
    print(f'{L2 = }')
    L1_deduped = get_deduped_phrases(L1)
    print(f'{L1_deduped = }')
    L2_deduped = get_deduped_phrases(L2)
    print(f'{L2_deduped = }')

if __name__ == '__main__':
    main()

Output:

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']
L1_deduped = ['worry not', 'be happy', 'very good', 'full stop']
L2_deduped = ['take into account', 'always be happy', 'stay safe friend']
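
For bigrams, sorting the words and reversing them identify the same pairs, but for trigrams the two differ. The sketch below compares both keys on a made-up input list; the helper names `reversal_key`, `sorted_key`, and `dedupe` are hypothetical:

```python
def reversal_key(phrase):
    """Same key for a phrase and its exact word-order reversal only."""
    words = tuple(phrase.split())
    return min(words, words[::-1])


def sorted_key(phrase):
    """Same key for every reordering of the same words."""
    return tuple(sorted(phrase.split()))


def dedupe(phrases, key):
    seen = set()
    kept = []
    for phrase in phrases:
        k = key(phrase)
        if k not in seen:
            seen.add(k)
            kept.append(phrase)
    return kept


sample = ['always be happy', 'be always happy', 'happy be always']
print(dedupe(sample, sorted_key))    # ['always be happy']
print(dedupe(sample, reversal_key))  # ['always be happy', 'be always happy']
```

Which behaviour is wanted depends on whether 'be always happy' should count as a repetition of 'always be happy'.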

CodePudding user response:

My solution iterates over each element in the list, splits it into a list of words, and sorts that list. It then does the same for each later element and compares the two sorted lists; if they match, the later element is removed. Here is my code:

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list

def remove_duplicates(L):
    for idx_i, l_i in enumerate(L):
        aux_i = l_i.split()
        aux_i.sort()
        for idx_j, l_j in enumerate(L[idx_i + 1:]):
            aux_j = l_j.split()
            aux_j.sort()
            if aux_i == aux_j:
                L.pop(idx_i + idx_j + 1)
    print(L)

remove_duplicates(L1)
remove_duplicates(L2)

The output is what you're looking for:

>>> remove_duplicates(L1)
['worry not', 'be happy', 'very good', 'full stop']
>>> remove_duplicates(L2)
['take into account', 'always be happy', 'stay safe friend']

Hope this works for you
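
One caveat: the inner loop runs over a slice taken before any element is popped, so when a phrase has more than one match later in the list, the computed index can point at the wrong element or past the end of the list. Building a new list instead of popping avoids this. A sketch only, with `remove_duplicates_copy` as a hypothetical name; like the code above, it compares sorted word lists:

```python
def remove_duplicates_copy(phrases):
    kept = []
    kept_keys = []  # sorted word lists of the phrases kept so far
    for phrase in phrases:
        key = sorted(phrase.split())
        if key not in kept_keys:
            kept_keys.append(key)
            kept.append(phrase)
    return kept  # the input list is left unchanged


print(remove_duplicates_copy(['worry not', 'not worry', 'be happy', 'not worry']))
# ['worry not', 'be happy']
```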

CodePudding user response:

This is a possible solution (the complexity is linear with respect to the number of strings):

from collections import defaultdict
from operator import itemgetter

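d = defaultdict(list)
for s in L2:  # use L1 for the bigrams list
    # Group each phrase with its reversal under one canonical key
    key = max(s, ' '.join(reversed(s.split())))
    d[key].append(s)

# Keep the first phrase seen in each group
result = list(map(itemgetter(0), d.values()))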

Here are the results:

['worry not', 'be happy', 'very good', 'full stop']
['take into account', 'always be happy', 'stay safe friend']