Splitting string on several delimiters without considering new line

Time:01-14

I have a string representing conversation turns as follows:

str = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."

In plain text, it looks as follows.

person alpha:
How are you today?

person beta:
I'm fine, thank you.

person alpha:
What's up?

person beta:
Not much, just hanging around.

Now, I would like to split the string on person alpha and person beta, so that the resulting list looks as follows:

["person alpha:\nHow are you today?", "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", "person beta:\nNot much, just hanging around."]

I have tried the following approach:

import re
res = re.split('person alpha |person beta |\*|\n', str)

But the result is as follows:

['person alpha:', 'How are you today?', '', 'person beta:', "I'm fine, thank you.", '', 'person alpha:', "What's up?", '', 'person beta:', 'Not much, just hanging around.']

What is wrong with my regex?

CodePudding user response:

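The alternatives 'person alpha ' and 'person beta ' in your pattern can never match: each expects a space after the name, but in the example data the name is followed by a colon (alpha: and beta:). The \* alternative does not occur in the data either, so the only alternative that matches is \n. You are effectively splitting on every newline, which yields those results.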
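A quick way to confirm this is to check what each part of the pattern actually finds. The sketch below uses a shortened copy of the example string, and the variable names are only illustrative.

```python
import re

# Shortened copy of the example data (first two turns only).
s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you."

# The speaker alternatives expect a trailing space, but the data has a colon,
# so they find nothing.
speaker_hits = re.findall(r"person alpha |person beta ", s)

# With the full pattern, every hit is therefore a newline.
all_hits = re.findall(r"person alpha |person beta |\*|\n", s)

print(speaker_hits)
print(all_hits)
```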

Instead, you could re.split the string on a lookahead (?=...), which asserts the position of the next speaker label without consuming it, then drop the empty strings and strip trailing whitespace from the results.

import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
pattern = r"(?=^person (?:alpha|beta):)"
res = [v.rstrip() for v in re.split(pattern, s, flags=re.M) if v]

print(res)

Output

['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']

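If there are more speakers than alpha and beta, the same lookahead split can be built from a list of names. This is only a sketch: split_turns and its speakers parameter are hypothetical names that do not appear in the question, and it assumes every label has the form "person <name>:" at the start of a line.

```python
import re

def split_turns(text, speakers):
    """Split text into turns, each starting with a 'person <name>:' label.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Escape the names so regex metacharacters in them are taken literally.
    names = "|".join(re.escape(name) for name in speakers)
    pattern = rf"(?=^person (?:{names}):)"
    # Splitting on a zero-width match requires Python 3.7 or later.
    parts = re.split(pattern, text, flags=re.M)
    # Drop the empty leading piece and strip the blank lines between turns.
    return [part.rstrip() for part in parts if part]

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
turns = split_turns(s, ["alpha", "beta"])
print(turns)
```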


Alternatively, using re.findall, you can match each speaker label together with the non-empty line that follows it, asserting that this line does not itself start with the person pattern:

import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
pattern = r"^person (?:alpha|beta):\n(?:(?!person (?:alpha|beta):).+(?=\n|$))*"
print(re.findall(pattern, s, re.M))

Output

['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']


CodePudding user response:

Use the re.DOTALL flag so that . also matches newlines. In the regex, (alpha|beta) is a group that matches either "alpha" or "beta", .*? is a non-greedy pattern that matches as few characters as possible, and (?=\n\nperson|\Z) is a positive lookahead that succeeds only if the match is immediately followed by two newline characters and the string person, or by the end of the string. Without the \Z alternative, the last turn would not be matched, because nothing follows it.

import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."

# The pattern has two groups, so findall returns one tuple per match;
# the first element of each tuple is the whole turn.
match = re.findall(r"(person (alpha|beta):\n.*?(?=\n\nperson|\Z))", s, re.DOTALL)
result = list(map(lambda x: x[0], match))
# or
# result = [x[0] for x in match]
print(result)

Output:

['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']