remove repeating pattern in a list

Time:04-06

I have a list with repeating patterns. I want to remove these repeating patterns so that the list is as short as possible. For example:

[a, b, a, b, a, b] => [a, b]
[a, b, c, a, b, c] => [a, b, c]
[a, b, c, d, a, b, c, d] => [a, b, c, d]
[a, a, a, b, b, b, c, c] => [a, b, c]

What is the best way to cover all the possible cases?

I tried converting the list to a string and applying a regular expression to it:

import re

input = ['a', 'a', 'b', 'c', 'a', 'b', 'c']

temp = ",".join(input) + ","

last_temp = ""

while temp != last_temp:
    last_temp = temp
    temp = re.sub(r'(.+?)\1+', r'\1', temp)
    print(temp)

deduped = temp[:-1]

output = deduped.split(',')

This works as expected for that input; the result is [a, b, c].

However, there is one issue. If the input list is:

['hello', 'sell', 'hello', 'sell', 'hello', 'sell']

The result will be: ['helo', 'sel']

As you can see, the regular expression also collapsed 'll' into 'l', which is not desired.

How can I fix this issue in my code, or is there a better way? Thanks.
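The effect can be reproduced on a single pair of items. This is a minimal sketch, assuming the pattern is `(.+?)\1+` as used above; the variable names `joined` and `collapsed` are only for illustration:

```python
import re

joined = "hello,sell,"

# (.+?) is allowed to match a single character, so "ll" is seen as
# the one-character pattern "l" repeated twice and is collapsed.
collapsed = re.sub(r'(.+?)\1+', r'\1', joined)
print(collapsed)  # helo,sel,
```

Nothing in the pattern ties a match to item boundaries, so repeats inside an item are treated the same way as repeats of whole items.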

CodePudding user response:

I don't see why you would use a regex in this case. Why not use a set instead:

my_set=set(['hello', 'sell', 'hello', 'sell', 'hello', 'sell'])
print(my_set)

my_set=set(['a', 'a', 'b', 'c', 'a', 'b', 'c'])
print(my_set)

Gives (the order of elements in a set is not guaranteed):

{'hello', 'sell'}
{'b', 'a', 'c'}
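If the original order matters, one possible variant is `dict.fromkeys`, which keeps the first occurrence of each item in insertion order (Python 3.7+). This is only a sketch, and like the set it removes every duplicate, not just repeated patterns: `['a', 'b', 'a', 'c']` would become `['a', 'b', 'c']`, which may or may not be what is wanted.

```python
items = ['hello', 'sell', 'hello', 'sell', 'hello', 'sell']

# dict keys are unique and remember insertion order,
# so this keeps the first occurrence of each item.
unique_in_order = list(dict.fromkeys(items))
print(unique_in_order)  # ['hello', 'sell']
```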

CodePudding user response:

'sell' is turned into 'sel' because re.sub also collapses the repeated character 'l'. You can tweak your regular expression to avoid matching those cases, for example by only matching repeating patterns that start at the beginning of the string:

temp = re.sub(r'^(.+?)\1+', r'\1', temp)

Or by ensuring the pattern ends with a comma:

temp = re.sub(r'(.+?,)\1+', r'\1', temp)
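
One case the comma-terminated pattern can still mishandle is an item whose ending equals the next item, because the match may start in the middle of an item. A small sketch with a made-up input (`['ab', 'b']` is an assumption chosen to trigger the problem):

```python
import re

temp = ",".join(['ab', 'b']) + ","   # "ab,b,"

# The match starts at the "b" inside "ab", so "b,b," collapses to "b,"
# and the second item is lost.
result = re.sub(r'(.+?,)\1+', r'\1', temp)
print(result)  # ab,
```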

Edit: given your last example, it's probably best to check patterns between commas:

import re

list_in = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']

temp = "," + ",".join(list_in) + ","

last_temp = ""

while temp != last_temp:
    last_temp = temp
    temp = re.sub(r'(?<=,)(.+?,)\1+', r'\1', temp)
    print(temp)

deduped = temp[1:-1]

output = deduped.split(',')

A look-behind makes sure your pattern is preceded by a comma as well.
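
As an alternative to joining into a string, the same idea can be applied to the list directly, which sidesteps separator issues such as items that themselves contain commas. The function below is a rough sketch, not a drop-in equivalent of the regex: `collapse_repeats` is a hypothetical name, and it simply removes one copy of the shortest adjacent repeated block until no block repeats, without any guarantee that the result is the shortest possible list for every input.

```python
def collapse_repeats(items):
    """Collapse adjacent repeated blocks of items until none remain."""
    result = list(items)
    changed = True
    while changed:
        changed = False
        n = len(result)
        # Try short blocks first, similar to the non-greedy (.+?) in the regex.
        for size in range(1, n // 2 + 1):
            for start in range(n - 2 * size + 1):
                first = result[start:start + size]
                second = result[start + size:start + 2 * size]
                if first == second:
                    # Drop the second copy, then rescan from scratch.
                    del result[start + size:start + 2 * size]
                    changed = True
                    break
            if changed:
                break
    return result


print(collapse_repeats(['hello', 'sell'] * 3))                           # ['hello', 'sell']
print(collapse_repeats(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']))       # ['a', 'b', 'c']
```

Because whole items are compared, 'hello' is never touched internally. The rescan after every deletion keeps the code simple at the cost of speed, which is likely fine for short lists.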
