I have a string representing conversation turns as follows:
str = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
In plain text, it looks as follows.
person alpha:
How are you today?

person beta:
I'm fine, thank you.

person alpha:
What's up?

person beta:
Not much, just hanging around.
Now, I would like to split the string on person alpha and person beta, so that the resulting list looks as follows:
["person alpha:\nHow are you today?", "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", "person beta:\nNot much, just hanging around."]
I have tried the following approach:
import re
res = re.split('person alpha |person beta |\*|\n', str)
But the result is as follows:
['person alpha:', 'How are you today?', '', 'person beta:', "I'm fine, thank you.", '', 'person alpha:', "What's up?", '', 'person beta:', 'Not much, just hanging around.']
What is wrong with my regex?
CodePudding user response:
Your pattern effectively matches only the newline. In the example data, alpha and beta are followed by a colon (alpha: and beta:), not a space, so the first two alternatives never match and you end up splitting on every newline, which yields those results.
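To make that diagnosis concrete, here is a small check on a shortened copy of the data (the variable names space_hits and parts are only for illustration). It shows that the two "person ..." alternatives find nothing, and that the split is driven by the newline alone:

```python
import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you."

# The alternatives end in a space, but the data has "alpha:" and "beta:",
# so they never match anywhere in the string.
space_hits = re.findall(r"person alpha |person beta ", s)
print(space_hits)  # → []

# Only the "\n" alternative matches, so the string is cut at every newline;
# the blank line between turns produces the empty string.
parts = re.split(r"person alpha |person beta |\*|\n", s)
print(parts)
# → ['person alpha:', 'How are you today?', '', 'person beta:', "I'm fine, thank you."]
```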
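You could use re.split with a lookahead (?=...) that asserts the start of the next speaker line instead of matching (and consuming) it, then remove the empty strings and strip trailing whitespace from the results. Note that splitting on an empty match requires Python 3.7 or later.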
import re
s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
pattern = r"(?=^person (?:alpha|beta):)"
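# Pass flags by keyword; positional maxsplit/flags arguments are deprecated in recent Python versions.
res = [v.rstrip() for v in re.split(pattern, s, flags=re.M) if v]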
print(res)
Output
['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']
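As a rough illustration of why the filtering and stripping steps are needed, the sketch below prints the unfiltered result of the lookahead split on a shortened copy of the data. It assumes Python 3.7 or later, and the names raw and cleaned are only for illustration:

```python
import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you."
pattern = r"(?=^person (?:alpha|beta):)"

# Unfiltered result: the lookahead also matches at position 0, so the first
# item is an empty string, and every turn except the last keeps its trailing
# blank line.
raw = re.split(pattern, s, flags=re.M)
print(raw)
# → ['', 'person alpha:\nHow are you today?\n\n', "person beta:\nI'm fine, thank you."]

# Dropping empty items and stripping trailing whitespace gives the clean turns.
cleaned = [v.rstrip() for v in raw if v]
print(cleaned)
# → ['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you."]
```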
Using re.findall, you can match each speaker line together with the text that follows it, where every text line must contain at least one character and must not start with the person pattern:
import re
s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
pattern = r"^person (?:alpha|beta):\n(?:(?!person (?:alpha|beta):).+(?=\n|$))*"
print(re.findall(pattern, s, re.M))
Output
['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']
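If a turn can span several non-empty lines, one possible variation is to consume the newline inside the repeated group, so the group can repeat once per line. This is only a sketch: the two-line first turn below is hypothetical input, not part of the original data.

```python
import re

# Hypothetical input: the first turn spans two lines.
s = "person alpha:\nHow are you today?\nAny news?\n\nperson beta:\nI'm fine, thank you."

# Each repetition consumes a newline plus one non-empty line that does not
# start a new turn; the blank line between turns ends the repetition.
pattern = r"^person (?:alpha|beta):(?:\n(?!person (?:alpha|beta):).+)*"
turns = re.findall(pattern, s, re.M)
print(turns)
# → ['person alpha:\nHow are you today?\nAny news?', "person beta:\nI'm fine, thank you."]
```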
CodePudding user response:
Use the re.DOTALL flag so that . also matches newlines. In the regex, (alpha|beta) is a group that matches either "alpha" or "beta", .*? is a non-greedy pattern that matches any characters, and (?=\n\nperson) is a positive lookahead that succeeds only if the match is immediately followed by two newline characters and the string person.
import re
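# Named s rather than str to avoid shadowing the built-in str type.
s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."
match = re.findall(r"(person (alpha|beta):\n.*?(?=\n\nperson))", s, re.DOTALL)
# With two capturing groups, findall returns tuples; item 0 is the whole turn.
result = list(map(lambda x: x[0], match))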
# or
# result = [x[0] for x in match]
print(result)
Output:
['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?"]
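The output above contains only three turns: the final turn is not followed by \n\nperson, so the lookahead never succeeds for it. One possible adjustment, offered as a sketch rather than as part of the original answer, is to let the lookahead also succeed at the end of the string with \Z. Using a non-capturing group also means re.findall returns the full matches directly:

```python
import re

s = "person alpha:\nHow are you today?\n\nperson beta:\nI'm fine, thank you.\n\nperson alpha:\nWhat's up?\n\nperson beta:\nNot much, just hanging around."

# \Z matches only at the very end of the string, so the last turn is kept.
# (?:alpha|beta) is non-capturing, so findall yields strings, not tuples.
pattern = r"person (?:alpha|beta):\n.*?(?=\n\nperson|\Z)"
result = re.findall(pattern, s, re.DOTALL)
print(result)
# → ['person alpha:\nHow are you today?', "person beta:\nI'm fine, thank you.", "person alpha:\nWhat's up?", 'person beta:\nNot much, just hanging around.']
```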