How do I make my code remove the sender names and the tags from messages saved in a txt file?

I have a dialogue between a sender and a receiver on Discord, and I need to remove the tags and the names of the interlocutors. Removing everything before the colon (:) would work for me: that way the sender's name would not matter, and whoever sent the message would always be deleted.

This is the content of the generic_discord_talk.txt file:

Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me

import collections
import re  # needed for re.compile below
import pandas as pd
import matplotlib.pyplot as plt  # to later plot the words that are repeated the most

# Read the whole conversation; the with block closes the file automatically
with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()

# Load the stopwords (whitespace-separated, possibly several per line)
with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
    print(st)

stops = set(st)
stopwords = stops.union({'you', 'for', 'the'})  # OPTIONAL
#print(stopwords)

I have created a regex to detect the tags:

regex = re.compile(r"^(<@!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<@!.+>){,1}\s{,}$")
regex_tag = re.compile(r"^<@!.+>")

I need print(st) to return the words, but without the senders and without the tags.

CodePudding user response：

You could remove either part using an alternation |, matching either from the start of the string to the first occurrence of a colon, or from <@! to the first closing >.

^[^:\n]+:\s*|\s*<@![^<>]+>

The pattern matches:

^ Start of string (start of each line when re.M is used)
[^:\n]+:\s* Match 1+ occurrences of any char except : or a newline, then match : and optional whitespace chars
| Or
\s*<@! Match <@! literally, preceded by optional whitespace chars
[^<>]+ Negated character class, match 1+ occurrences of any char except < and >
> Match literally
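
To make the breakdown above concrete, here is a minimal sketch that applies the pattern to a shortened sample. The names PATTERN, sample and cleaned are only illustrative, and the sample text is abbreviated from the question's file.

```python
import re

# Either a "Name:" prefix at the start of a line, or a <@!...> tag anywhere.
# re.MULTILINE makes ^ match at the start of every line.
PATTERN = re.compile(r"^[^:\n]+:\s*|\s*<@![^<>]+>", re.MULTILINE)

# Abbreviated sample lines (illustrative, shortened from the question)
sample = (
    "Company: <@!808947310809317387> Good morning\n"
    "Customer: Hi <@!808947310809317385>, I need help\n"
    "Customer: nothing happens <@!808947310809317387>"
)

cleaned = PATTERN.sub("", sample)
print(cleaned)
```

Note that when a tag directly follows the name, the space after the tag is left at the start of the line, which is why an extra trimming step can be useful.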

If there can be only digits after <@!

^[^:\n]+:|<@!\d+>
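
As a small illustration (the variable names line and result are hypothetical), the digits-only variant does not consume the surrounding whitespace, so the spaces around the removed parts stay in the text:

```python
import re

# One line abbreviated from the question's file
line = "Company: <@!808947310809317387> Good morning"

# Digits-only variant: no \s* on either side, so spaces are left behind
result = re.sub(r"^[^:\n]+:|<@!\d+>", "", line, flags=re.M)
print(repr(result))

# Trimming afterwards gives the bare message text
print(repr(result.strip()))
```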

For example

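import re

with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()
# re.M lets ^ match at the start of every line, not only at the start of the text
st = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", a, flags=re.M)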

If you also want to strip the leading and trailing spaces of each line, you can add this line

st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
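
Since the goal in the question is to get the words without senders and tags and then count them, the two substitutions could be combined with the stopword filter roughly as follows. This is only a sketch: the helper name clean_words, the lowercase \w+ tokenization and the tiny stopword set are assumptions, not part of the original code.

```python
import re
from collections import Counter

def clean_words(text, stopwords):
    """Strip sender names and tags, then return the remaining words."""
    # Remove "Name:" prefixes and <@!...> tags on every line
    text = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", text, flags=re.M)
    # Assumed tokenization: lowercase runs of word characters
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in stopwords]

# Abbreviated sample and a tiny stopword set, for illustration only
sample = (
    "Company: <@!808947310809317387> Good morning, how can we help you?\n"
    "Customer: Hi <@!808947310809317385>, I need you to help me"
)
stopwords = {"you", "for", "the"}

counts = Counter(clean_words(sample, stopwords))
print(counts.most_common(3))
```

The resulting Counter can then be passed on to pandas or matplotlib to plot the most frequent words.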

CodePudding user response：

I think this should work:

import re


data = """Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me"""


def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<@!\d+>", "", line)  # remove tags
        print(line)


run()