I have a list with repeating patterns. I want to remove these repeating patterns to make the list as short as possible. For example:
[a, b, a, b, a, b] => [a, b]
[a, b, c, a, b, c] => [a, b, c]
[a, b, c, d, a, b, c, d] => [a, b, c, d]
[a, a, a, b, b, b, c, c] => [a, b, c]
What is the best way to cover all the possible cases?
I tried converting the list to a string and applying a regular expression to it:
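import re

input = ['a', 'a', 'b', 'c', 'a', 'b', 'c']
# Join with commas and add a trailing comma so every element ends with ","
temp = ",".join(input) + ","
last_temp = ""
# Keep substituting until a pass leaves the string unchanged
while temp != last_temp:
    last_temp = temp
    temp = re.sub(r'(.+?)\1+', r'\1', temp)
    print(temp)
deduped = temp[:-1]  # drop the trailing comma
output = deduped.split(',')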
This works as expected for the input above, and the result is: ['a', 'b', 'c']
However, there is one issue. If the input list is:
['hello', 'sell', 'hello', 'sell', 'hello', 'sell']
The result will be: ['helo', 'sel']
As you can see, the regular expression also replaced 'll' with 'l', which is not what I want.
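To make the cause easier to see, here is a minimal reproduction on a single word. It is only a sketch: the variable names are made up for illustration, and the pattern is the same one used in the loop above.

```python
import re

# A lazily captured group followed by one or more copies of itself.
pattern = r'(.+?)\1+'

# Inside a single word, the only repeat available is the doubled letter,
# so "ll" is collapsed to "l".
single = re.sub(pattern, r'\1', 'sell,')
print(single)  # sel,

# The same happens to every word once the list-level repeats are gone.
words = re.sub(pattern, r'\1', 'hello,sell,')
print(words)  # helo,sel,
```

The pattern has no notion of element boundaries, so it treats characters inside an element exactly like elements of the list.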
How can I fix this issue in my code, or is there a better way? Thanks
CodePudding user response:
I don't see why you would use a regex in this case. Why don't you use a set instead:
my_set=set(['hello', 'sell', 'hello', 'sell', 'hello', 'sell'])
print(my_set)
my_set=set(['a', 'a', 'b', 'c', 'a', 'b', 'c'])
print(my_set)
Gives:
{'hello', 'sell'}
{'b', 'a', 'c'}
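One caveat: a set does not keep the original order, as the second output shows, and it removes every duplicate rather than only repeated patterns. If order matters, a minimal sketch using dict.fromkeys could look like this (the helper name unique_in_order is made up for illustration, and it relies on dicts keeping insertion order, which Python guarantees from 3.7 on):

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each element."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


words = unique_in_order(['hello', 'sell', 'hello', 'sell', 'hello', 'sell'])
letters = unique_in_order(['a', 'a', 'b', 'c', 'a', 'b', 'c'])
print(words)    # ['hello', 'sell']
print(letters)  # ['a', 'b', 'c']
```

Like the set, this treats ['a', 'b', 'a'] as ['a', 'b'], so it only fits if any duplicate should go, not just consecutive repeats.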
CodePudding user response:
'sell' will be substituted by 'sel' because re.sub substitutes the repeating character 'l'.
You can tweak your regular expression to avoid matching those cases.
For example matching repeating patterns starting from the beginning of the string:
temp = re.sub(r'^(.+?)\1+', r'\1', temp)
Or ensuring the pattern ends with a comma:
temp = re.sub(r'(.+?,)\1+', r'\1', temp)
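A caveat with the comma-terminated pattern is that a match can still start in the middle of an element. The sketch below illustrates this with a made-up input (the list ['ball', 'l', 'x'] and the variable names are hypothetical, chosen only to trigger the problem):

```python
import re

items = ['ball', 'l', 'x']
joined = ",".join(items) + ","          # 'ball,l,x,'

# The match starts at the last "l" of "ball": "l," followed by "l,".
collapsed = re.sub(r'(.+?,)\1+', r'\1', joined)
print(collapsed)                        # ball,x,

result = collapsed[:-1].split(',')
print(result)                           # ['ball', 'x']  (the element 'l' is lost)
```

Requiring the match to start right after a comma avoids this, because the captured group then always covers whole elements.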
Edit: given your last example, it's probably best to check patterns between commas:
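import re

list_in = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
# Surround the joined string with commas so every element sits between two commas
temp = "," + ",".join(list_in) + ","
last_temp = ""
# Keep substituting until a pass leaves the string unchanged
while temp != last_temp:
    last_temp = temp
    temp = re.sub(r'(?<=,)(.+?,)\1+', r'\1', temp)
    print(temp)
deduped = temp[1:-1]  # drop the leading and trailing commas
output = deduped.split(',')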
A look-behind makes sure your pattern is preceded by a comma as well.
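If you would rather avoid the string round trip entirely (for instance because an element could itself contain a comma), the same idea can be sketched directly on the list. This is not the regex approach above but a swapped-in alternative, and the function name collapse_repeats is made up for illustration:

```python
def collapse_repeats(items):
    """Collapse consecutive repeats of any block of elements, repeatedly."""
    items = list(items)  # work on a copy
    changed = True
    while changed:
        changed = False
        n = len(items)
        for start in range(n):
            # Try the shortest block first, similar to the lazy quantifier.
            for size in range(1, (n - start) // 2 + 1):
                block = items[start:start + size]
                end = start + size
                # Extend over every directly following copy of the block.
                while items[end:end + size] == block:
                    end += size
                if end > start + size:
                    # Keep the first copy and delete the repeats.
                    del items[start + size:end]
                    changed = True
                    break
            if changed:
                break  # restart the scan on the shortened list
    return items


print(collapse_repeats(['a', 'a', 'b', 'c', 'a', 'b', 'c']))
# ['a', 'b', 'c']
print(collapse_repeats(['hello', 'sell', 'hello', 'sell', 'hello', 'sell']))
# ['hello', 'sell']
```

Because it compares whole elements, repeated characters inside an element such as 'll' are never touched. It rescans the list after every change, so it favors clarity over speed and may be slow on very long lists.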