I have this dialogue between a sender and a receiver on Discord, and I need to remove the tags and the names of the speakers. It would help to delete everything before the colon (:) on each line; that way the sender's name does not matter and whoever sent the message is always removed.
This is the content of the generic_discord_talk.txt file:
Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me
import collections
import re  # needed for the re.compile calls below
import pandas as pd
import matplotlib.pyplot as plt  # to then graph the words that are repeated the most

# Read the whole conversation into a single string
with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()

# Load the stopword list (whitespace-separated words)
with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
print(st)
stops = set(st)
stopwords = stops.union({'you', 'for', 'the'})  # OPTIONAL
#print(stopwords)
I have created a regex to detect the tags:
regex = re.compile(r"^(<@!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<@!.+>){,1}\s{,}$")
regex_tag = re.compile(r"^<@!.+>")
I need print(st) to return the words, but without the senders and without the tags.
CodePudding user response:
You could remove both parts using an alternation (|), matching either from the start of the line up to the first colon, or from <@! up to the first closing >.
^[^:\n]+:\s*|\s*<@![^<>]+>
The pattern matches:
^    Start of the string (start of each line when re.M is set)
[^:\n]+:\s*    Match 1 or more occurrences of any char except : or a newline, then match : and optional whitespace chars
|    Or
\s*<@!    Match <@! literally, preceded by optional whitespace chars
[^<>]+    Negated character class, match 1 or more occurrences of any char except < and >
>    Match > literally
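To make the breakdown above concrete, here is a minimal sketch that applies the pattern to two lines copied from the question's sample dialogue. The names PATTERN, sample and cleaned are only illustrative.

```python
import re

# The pattern described above, compiled with re.M so ^ matches at every line start.
PATTERN = re.compile(r"^[^:\n]+:\s*|\s*<@![^<>]+>", re.M)

# Two lines taken from the sample dialogue in the question.
sample = (
    "Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made\n"
    "Client: <@!808947310809317387> Yes, I have it in front of me"
)

cleaned = PATTERN.sub("", sample)
print(cleaned)
```

Note that the second line keeps one leading space: the name part consumes the space after the colon, the tag is then removed with no preceding whitespace, and the space that followed the tag stays behind.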
If there can be only digits after <@!, you can use:
^[^:\n]+:|<@!\d+>
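For example:
import re

# Read the whole conversation into a single string
with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()
# re.M makes ^ match at the start of every line, not only at the start of the string
st = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", a, flags=re.M)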
If you also want to clear the leading and trailing spaces on each line, you can add this line:
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
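Since the question mentions counting the most repeated words afterwards, the two substitutions could be combined with a word count roughly as follows. This is only a sketch: clean_dialogue and count_words are hypothetical helper names, the small stopwords set stands in for the one loaded from stopwords-es.txt, and splitting words with \w+ is an assumption (it would split "store's" into two tokens).

```python
import re
from collections import Counter

def clean_dialogue(text):
    """Drop 'Name:' prefixes and <@!...> tags, then trim spaces on each line."""
    text = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", text, flags=re.M)
    return re.sub(r"^[^\S\n]*|[^\S\n]*$", "", text, flags=re.M)

def count_words(text, stopwords):
    """Count lowercase words that are not in the stopword set."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Two lines taken from the sample dialogue in the question.
sample = (
    "Company: Does Maria have the website still open?\n"
    "Client: <@!808947310809317387> Yes, I have it in front of me"
)
stopwords = {"you", "for", "the"}  # stand-in for the set built in the question

counts = count_words(clean_dialogue(sample), stopwords)
print(counts.most_common(3))
```

The resulting Counter can be passed to most_common(n) to get the data for a bar chart.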
CodePudding user response:
I think this should work:
import re
data = """Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me"""
def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<@!\d+>", "", line)  # remove tags
        print(line)

run()
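The line-by-line approach above leaves stray spaces where a tag used to be. One possible variant, shown here only as a sketch, returns the cleaned line instead of printing it and collapses the leftover spaces; strip_line is a hypothetical name, and the extra whitespace step is an addition that is not part of the answer itself.

```python
import re

def strip_line(line):
    """Remove the leading 'Name: ' part and any <@!digits> tags from one line."""
    line = re.sub(r"^\w+: ", "", line)   # remove the customer/company part
    line = re.sub(r"<@!\d+>", "", line)  # remove tags
    # Collapse runs of spaces left behind by removed tags, then trim the ends.
    return re.sub(r" {2,}", " ", line).strip()

# Two lines taken from the sample dialogue in the question.
lines = [
    "Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?",
    "Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>",
]
for line in lines:
    print(strip_line(line))
```

A tag followed directly by punctuation, as in "Hi <@!808947310809317385>, I need", would still leave a space before the comma with this sketch.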