Trying to use regex findall for a dialogue

Time:07-22

I have been stuck on this regex for quite some time now. I am pulling chat data into a single variable and trying to break it up by speaker.

For example:

BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay

I am looking for just the Bot's messages and vice versa.

I am using Python. I pulled the chat data and want to split it into two sections: bot verbatims and user verbatims. I am trying to do this with regex, but the findall I used (below) comes back with empty results.

clean.str.findall(r'^BOT:.*USER:$')

My thought behind that was to match each bot message and drop the 'USER:' afterwards. I've tried multiple iterations of this. Any insight into what I'm doing wrong would be a huge help! Also, I read the posting rules; if I did something wrong, let me know and I'll fix it.

CodePudding user response:

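Use the regex ^BOT:(.+) together with re.MULTILINE, so that ^ matches at the start of every line. Your pattern returns nothing because, without flags, . does not match newlines and $ only matches at the very end of the string.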

import re

regex = r"^BOT:(.+)"

chat = """
BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"
"""

result = re.findall(regex, chat, re.MULTILINE)

print(result)

Output:

[' Ask me a question or select from the options below.', ' Would you like to sign up or cancel Auto Pay?']

Now you can easily iterate over the result:

for r in result:
    print(r)

Output:

 Ask me a question or select from the options below.
 Would you like to sign up or cancel Auto Pay?

If you want to remove the leading space in front of every result, either use regex = r"^BOT: (.+)" as your regex or call .strip() on every result.

Without the leading space (regex = r"^BOT: (.+)"):

Ask me a question or select from the options below.
Would you like to sign up or cancel Auto Pay?

Edit


If you want BOT: included in your results, just remove the parentheses from your regex.

regex = r"^BOT: .+"

Test it: https://regex101.com/r/aM1GEj/2
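Since the question asks for the bot's messages and the user's as well, both can be collected in one pass by capturing the speaker label too. This is only a minimal sketch, assuming every message fits on a single line and starts with BOT: or USER:. The helper name split_by_speaker is made up for illustration.

```python
import re
from collections import defaultdict


def split_by_speaker(chat):
    """Group one-line messages by speaker label (hypothetical helper)."""
    messages = defaultdict(list)
    # re.MULTILINE makes ^ and $ anchor at every line, not just the whole string.
    # [ \t]* skips the spacing after the label without crossing into the next line.
    pattern = r"^(BOT|USER):[ \t]*(.+)$"
    for speaker, message in re.findall(pattern, chat, re.MULTILINE):
        messages[speaker].append(message.strip())
    return dict(messages)


chat = """BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"""

by_speaker = split_by_speaker(chat)
print(by_speaker["BOT"])
print(by_speaker["USER"])
```

If the chats live in a pandas Series, as in the question, Series.str.findall also accepts a flags argument, so the same pattern should work there with flags=re.MULTILINE.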

CodePudding user response:

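import re

text = """BOT: Ask me a question or select from the options below.
USER: How do I sign up or cancel Auto Pay?
BOT: Would you like to sign up or cancel Auto Pay?
USER: Selected: Cancel Auto Pay"""

# Each match keeps the "BOT:" prefix and the trailing newline, so a final
# BOT line with no newline after it would be missed.
print(re.findall(r"BOT:.*\n", text))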

CodePudding user response:

Try this:

import re

text = """BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"""

print(re.findall(r'BOT:(.*?)(?:$|USER:)', text, flags=re.DOTALL))

Output is:

[' Ask me a question or select from the options below.\n\n', ' Would you like to sign up or cancel Auto Pay?\n\n']

Some considerations:

  • It fails if the user or the bot types the keyword USER: or BOT: inside a message.
  • In the first group (.*?), the ? makes the match non-greedy: it matches as little as possible, so the capture stops at the next USER: instead of running on to the last one.
  • In the last group, the ?: makes it non-capturing, since you are not going to use its result; it also means findall returns plain strings rather than tuples. The $ alternative in this group makes the regex work when the last message is from the bot.
  • flags=re.DOTALL lets . match newlines too, so messages that span several lines are captured. If newlines are only used to separate messages, consider the other answers, which are based on the \n character.
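One way to soften the first limitation is to end a message only where a label appears at the start of a line, using a lookahead instead of consuming USER:. This is a sketch rather than a definitive fix: it assumes each new message begins on its own line, it still breaks if a message line itself starts with BOT: or USER:, and bot_messages is a made-up helper name.

```python
import re


def bot_messages(text):
    """Return BOT messages, allowing a message to span several lines (hypothetical helper)."""
    # (?=...) is a lookahead: it checks for the next label without consuming it.
    # \Z matches at the very end of the string, covering a final BOT message.
    pattern = r"^BOT:(.*?)(?=^(?:BOT|USER):|\Z)"
    # MULTILINE: ^ matches at each line start. DOTALL: . also matches newlines.
    found = re.findall(pattern, text, flags=re.MULTILINE | re.DOTALL)
    return [message.strip() for message in found]


text = """BOT: Type USER: followed by your name.
It can be anything.
USER: USER: Sam
BOT: Thanks, Sam."""

result = bot_messages(text)
print(result)
```

Here the USER: in the middle of the first bot line does not cut the message short, because the lookahead only accepts a label at the beginning of a line.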