I have two lists.
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list
If I look closely, L1 has 'not worry' and 'good very', which are exact reversals of 'worry not' and 'very good'.
I need to remove such reversed elements from the list. Similarly, in L2, 'happy be always' is the reverse of 'always be happy' and should be removed as well.
The final output I'm looking for is:
L1 = ['worry not', 'be happy', 'very good', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend']
I tried one solution:
[[max(zip(map(set, map(str.split, group)), group))[1]] for group in L1]
But it does not give the correct output. Should I write separate functions for removing reversed bigrams and reversed trigrams, or is there a faster, more Pythonic way of doing this? I'll have to run it on about 10K strings.
CodePudding user response:
You can do it with a list comprehension if you iterate over the list from the end:
lst = L1[::-1] # L2[::-1]
x = [s for i, s in enumerate(lst) if ' '.join(s.split()[::-1]) not in lst[i + 1:]][::-1]
# L1: ['worry not', 'be happy', 'very good', 'full stop']
# L2: ['take into account', 'always be happy', 'stay safe friend']
CodePudding user response:
You can use an index set and add both direct and reversed n-grams to it:
index = set()
res = []
for x in L1:
    a = tuple(x.split())
    b = tuple(reversed(a))
    if a in index or b in index:
        continue
    index.add(a)
    index.add(b)
    res.append(x)
print(res)
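The same idea can be wrapped in a function so that one routine serves both bigrams and trigrams. This is only a minimal sketch: the name `drop_reversed` is made up for illustration and does not come from the answer above. Like the loop above, it also skips exact duplicates, because a phrase that was already kept is found in the set too.

```python
def drop_reversed(phrases):
    """Keep the first occurrence of each phrase; skip later exact word reversals."""
    seen = set()  # word tuples already kept, plus their reversals
    kept = []
    for phrase in phrases:
        words = tuple(phrase.split())
        if words in seen:
            continue
        seen.add(words)
        seen.add(words[::-1])
        kept.append(phrase)
    return kept


L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']
print(drop_reversed(L1))  # ['worry not', 'be happy', 'very good', 'full stop']
print(drop_reversed(L2))  # ['take into account', 'always be happy', 'stay safe friend']
```

Because set lookups take constant time on average, the work grows roughly linearly with the number of strings, and the original order of the kept phrases is preserved.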
CodePudding user response:
Using a set of tuples is the way to deal with this:
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list
for list_ in L1, L2:
    s = set()
    for e in list_:
        t = tuple(e.split())
        if t[::-1] not in s:
            s.add(t)
    print([' '.join(e) for e in s])
Output (the order may vary because sets are unordered):
['be happy', 'worry not', 'very good', 'full stop']
['always be happy', 'stay safe friend', 'take into account']
CodePudding user response:
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list
def solution(lst):
    res = []
    for item in lst:
        if " ".join(item.split()[::-1]) not in res:
            res.append(item)
    return res
print(solution(L2))
CodePudding user response:
How about keeping track of the seen phrases by using a set and sorted:
def get_deduped_phrases(phrases: list[str]) -> list[str]:
    seen_phrases = set()
    deduped_phrases = []
    for phrase in phrases:
        sorted_phrase = ' '.join(sorted(phrase.split()))
        if sorted_phrase not in seen_phrases:
            deduped_phrases.append(phrase)
            seen_phrases.add(sorted_phrase)
    return deduped_phrases
def main() -> None:
    L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
    print(f'{L1 = }')
    L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] # trigrams list
    print(f'{L2 = }')
    L1_deduped = get_deduped_phrases(L1)
    print(f'{L1_deduped = }')
    L2_deduped = get_deduped_phrases(L2)
    print(f'{L2_deduped = }')
if __name__ == '__main__':
    main()
Output:
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']
L1_deduped = ['worry not', 'be happy', 'very good', 'full stop']
L2_deduped = ['take into account', 'always be happy', 'stay safe friend']
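One point worth checking with a sorted key: for bigrams, sorting the words and reversing them identify the same pairs, but for trigrams a sorted key also merges reorderings that are not exact reversals. The sketch below compares the two; `sorted_key`, `reversal_key` and the sample phrase 'be always happy' are illustrative assumptions, not part of the answer above.

```python
def sorted_key(phrase):
    # Same key for ANY reordering of the same words.
    return tuple(sorted(phrase.split()))


def reversal_key(phrase):
    # Same key only for a phrase and its exact word reversal.
    words = tuple(phrase.split())
    return min(words, words[::-1])


a = 'always be happy'
b = 'happy be always'   # exact reversal of a
c = 'be always happy'   # same words as a, but not a reversal

print(sorted_key(a) == sorted_key(b))      # True
print(sorted_key(a) == sorted_key(c))      # True: c would be dropped as well
print(reversal_key(a) == reversal_key(b))  # True
print(reversal_key(a) == reversal_key(c))  # False: c would be kept
```

Which behaviour is wanted depends on the data; if only exact reversals should be removed from trigram lists, a reversal-based key is the safer choice.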
CodePudding user response:
My solution iterates over each element of the list, splits it into a list of words and sorts it. It then does the same for each later element and, if the two sorted lists match, removes that later element. Here is my code:
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] # trigrams list
def remove_duplicates(L):
    # Modifies L in place; every pair is compared, so this is O(n^2).
    for idx_i, l_i in enumerate(L):
        aux_i = sorted(l_i.split())
        # Walk the tail backwards so that popping an element
        # does not shift the indices still to be visited.
        for idx_j in range(len(L) - 1, idx_i, -1):
            if sorted(L[idx_j].split()) == aux_i:
                L.pop(idx_j)
    print(L)
remove_duplicates(L1)
remove_duplicates(L2)
The output is what you're looking for:
>>> remove_duplicates(L1)
['worry not', 'be happy', 'very good', 'full stop']
>>> remove_duplicates(L2)
['take into account', 'always be happy', 'stay safe friend']
Hope this works for you
CodePudding user response:
This is a possible solution (the complexity is linear with respect to the number of strings):
from collections import defaultdict
from operator import itemgetter
d = defaultdict(list)
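for s in L2:  # or L1
    # Use the larger of the phrase and its word-reversed form as a shared key,
    # so that a phrase and its reversal land in the same group.
    d[max(s, ' '.join(reversed(s.split())))].append(s)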
result = list(map(itemgetter(0), d.values()))
Here are the results:
['worry not', 'be happy', 'very good', 'full stop']
['take into account', 'always be happy', 'stay safe friend']
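A variant of the same canonical-key idea, using a plain dict and `setdefault` instead of `defaultdict`, might look like the sketch below. The function name `dedupe_by_canonical_key` and the synthetic 10,000-string check are assumptions added for illustration; the sketch relies on dicts preserving insertion order (Python 3.7+).

```python
def dedupe_by_canonical_key(phrases):
    """Keep the first phrase seen for each (phrase, reversed phrase) pair."""
    first_seen = {}
    for phrase in phrases:
        words = tuple(phrase.split())
        key = min(words, words[::-1])       # identical for a phrase and its reversal
        first_seen.setdefault(key, phrase)  # only the first phrase per key is stored
    return list(first_seen.values())


L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']
print(dedupe_by_canonical_key(L1))
print(dedupe_by_canonical_key(L2))

# Rough scale check with 10,000 synthetic bigrams: 5,000 phrases plus their reversals.
forward = [f'w{i} x{i}' for i in range(5000)]
backward = [f'x{i} w{i}' for i in range(5000)]
print(len(dedupe_by_canonical_key(forward + backward)))  # 5000
```

Each phrase is split once and looked up once, so a list of about 10K strings is handled in a single pass.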