Remove number patterns from string

Time:01-15

I have conversations that look as follows:

s = "1) Person Alpha:\nHello, how are you doing?\n\n1) Human:\nGreat, thank you.\n\n2) Person Alpha:\nHow is the weather?\n\n2) Human:\nThe weather is good."

1) Person Alpha:
Hello, how are you doing?

1) Human:
Great, thank you.

2) Person Alpha:
How is the weather?

2) Human:
The weather is good.

I would like to remove the enumeration at the beginning to get the following result:

s = "Person Alpha:\nHello, how are you doing?\n\nHuman:\nGreat, thank you.\n\nPerson Alpha:\nHow is the weather?\n\nHuman:\nThe weather is good."

Person Alpha:
Hello, how are you doing?

Human:
Great, thank you.

Person Alpha:
How is the weather?

Human:
The weather is good.

My idea is to search for 1), 2), 3),... in the text and replace it with an empty string. This might work but is inefficient (and can be a problem if e.g. 1) appears in the text of the conversation).

Is there a better / more elegant way to do this?

CodePudding user response:

One approach is to split the input string on the newline character with split(), iterate over the resulting lines, and check whether each line starts with one or more digits followed by a closing parenthesis and a space. If it does, remove that prefix. Finally, join the modified lines back together with the newline character to get the final output.

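import re

s = "1) Person Alpha:\nHello, how are you doing?\n\n1) Human:\nGreat, thank you.\n\n2) Person Alpha:\nHow is the weather?\n\n2) Human:\nThe weather is good."

lines = s.split("\n")
for i in range(len(lines)):
    # One or more digits, a closing parenthesis, and a space at the start of the line
    m = re.match(r"\d+\) ", lines[i])
    if m:
        # Drop exactly the matched prefix, whatever its length
        lines[i] = lines[i][m.end():]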

s = "\n".join(lines)
print(s)

CodePudding user response:

Use a regular expression to replace every number followed by a closing parenthesis and a space:

import re
s = re.sub(r"[0-9]+\) ", "", s)

This would output:

Person Alpha:
Hello, how are you doing?

Human:
Great, thank you.

Person Alpha:
How is the weather?

Human:
The weather is good.

Or, if you don't want to risk replacing something inside the conversation, you can require a \n in front of every number pattern:

import re
s = re.sub(r"\n[0-9]+\) ", "\n", s)[3:]

Note that, since there is no \n at the beginning of the string, the first prefix is cut manually by slicing off the first 3 characters. This assumes the string starts with a single-digit prefix such as "1) ".

Same output as above.
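To illustrate the difference between the two variants, here is a small sketch. The sample conversation is made up for the example (it is not from the question), and the (^|\n) group is just one possible way to cover the start of the string without slicing; treat it as an assumption rather than the only option.

```python
import re

# Hypothetical input: "1) " and "2) " also appear inside a message body.
s = "1) Human:\nMy options are: 1) tea or 2) coffee.\n\n2) Person Alpha:\nTea, please."

# Unanchored: every "<digits>) " is removed, including those inside the message.
unanchored = re.sub(r"[0-9]+\) ", "", s)

# Anchored: the prefix must follow the start of the string or a newline.
# Group 1 holds "" or "\n" and is put back, so no manual slicing is needed.
anchored = re.sub(r"(^|\n)[0-9]+\) ", r"\1", s)

print(unanchored)
print(anchored)
```

The unanchored call also strips the "1) " and "2) " inside the message, while the anchored call only touches the prefixes at the start of a line.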

CodePudding user response:

What do you mean by inefficient?

Do you want to avoid loops because of performance concerns? Please give more details about what you have tried and what you do and don't want done.

CodePudding user response:

I suggest a version similar to the one by @Always Sunny but with re.sub; it is simpler to read and works for any number of digits before the parenthesis:

import re

s = "1) Person Alpha:\nHello, how are you doing?\n\n1) Human:\nGreat, thank you.\n\n2) Person Alpha:\nHow is the weather?\n\n2) Human:\nThe weather is good."

lines = s.split("\n")
for i in range(len(lines)):
    # Strip a leading "<digits>) " prefix; lines without one are left unchanged
    lines[i] = re.sub(r"^[0-9]+\) ", "", lines[i])

s = "\n".join(lines)
print(s)
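As a quick check of the "any number of digits" claim, here is a compact sketch of the same line-by-line idea. The names PREFIX, strip_line_numbers and sample are made up for the example, and the sample text is not from the question.

```python
import re

# Compile once; "+" lets the prefix contain any number of digits.
PREFIX = re.compile(r"^[0-9]+\) ")

def strip_line_numbers(text):
    """Remove a leading '<digits>) ' prefix from every line of text."""
    return "\n".join(PREFIX.sub("", line) for line in text.split("\n"))

# Hypothetical input with one-, two- and three-digit prefixes.
sample = "9) Human:\nHi.\n\n10) Person Alpha:\nHello.\n\n123) Human:\nBye."
result = strip_line_numbers(sample)
print(result)
```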

CodePudding user response:

That can be done with regular expressions with the re module as follows:

import re
s = re.sub(r'^\d+\)\s*', '', s, flags=re.M)

This line uses the multi-line regex flag so that ^, which normally matches only at the start of the string, also matches right after every newline. The regex looks for one or more digits (\d+), followed by a right parenthesis (\)), and then zero or more whitespace characters (\s*). All occurrences of that pattern are replaced with the empty string.
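One detail that may matter: \s also matches newlines, so a line consisting only of "2)" would be merged with the line after it. The sketch below wraps the same idea in a function and swaps \s* for [ \t]*. That swap is my own variation, not part of the answer above, and the name remove_enumeration is made up for the example.

```python
import re

def remove_enumeration(text):
    """Strip a leading '<digits>)' plus trailing spaces/tabs from each line."""
    # [ \t]* (instead of \s*) never consumes the newline that ends the line.
    return re.sub(r"^\d+\)[ \t]*", "", text, flags=re.M)

s = "1) Person Alpha:\nHello, how are you doing?\n\n1) Human:\nGreat, thank you.\n\n2) Person Alpha:\nHow is the weather?\n\n2) Human:\nThe weather is good."
cleaned = remove_enumeration(s)
print(cleaned)

# Edge case: a prefix alone on its line keeps its line break.
edge = remove_enumeration("1) A:\nHello\n\n2)\nB:\nHi")
```

On the conversation from the question, both variants give the same result; they only differ when a prefix is followed directly by a line break.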
