How do I make my code remove the sender names found in the messages saved in a txt file and the tags

Time:11-10

I have this dialogue between a sender and a receiver on Discord, and I need to remove the tags and the names of the interlocutors. Deleting everything before the colon (:) on each line would be enough: that way the sender's name would not matter, and whoever sent the message would always be removed.

This is the content of the generic_discord_talk.txt file:

Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me

My code so far:

import re  # needed for the regexes below
import collections
import pandas as pd
import matplotlib.pyplot as plt  # to later plot the most repeated words

# Read the whole conversation into a single string
with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()

# Load the stopwords (whitespace-separated words, any number per line)
with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
    print(st)

stops = set(st)
stopwords = stops.union({'you', 'for', 'the'})  # OPTIONAL
#print(stopwords)

I have created a regex to detect the tags:

regex = re.compile(r"^(<@!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<@!.+>){,1}\s{,}$")
regex_tag = re.compile(r"^<@!.+>")

I need the print(st) statement to give me the words, but without the senders and without the tags.

CodePudding user response:

You could remove both parts using an alternation (|) that matches either from the start of the line up to the first colon, or from <@! up to the first closing >.

^[^:\n]+:\s*|\s*<@![^<>]+>

The pattern matches:

  • ^ Start of string (start of each line when re.M is used)
  • [^:\n]+:\s* Match 1 or more occurrences of any char except : or a newline, then match : and optional whitespace chars
  • | Or
  • \s*<@! Match optional whitespace chars, then <@! literally
  • [^<>]+ Negated character class, match 1 or more occurrences of any char except < and >
  • > Match literally


If there can be only digits after <@!:

^[^:\n]+:|<@!\d+>
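
To make the difference between the two tag variants concrete, here is a small sketch. The sample string is made up for illustration, and the second tag (with letters inside) is hypothetical; both patterns keep the optional-whitespace handling from the first pattern.

```python
import re

# Variant 1: anything except < and > is allowed inside the tag
ANY_TAG = re.compile(r"^[^:\n]+:\s*|\s*<@![^<>]+>", re.M)
# Variant 2: only digits are allowed inside the tag
DIGIT_TAG = re.compile(r"^[^:\n]+:\s*|\s*<@!\d+>", re.M)

# Made-up sample line; "<@!not-digits>" is a hypothetical non-numeric tag
sample = "Customer: Hi <@!808947310809317385>, thanks <@!not-digits>"

any_result = ANY_TAG.sub("", sample)      # removes sender and both tags
digit_result = DIGIT_TAG.sub("", sample)  # removes sender and the numeric tag only

print(any_result)
print(digit_result)
```

The digits-only variant is stricter, so it is the safer choice if the messages may contain other text that happens to start with <@!.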

For example

import re

with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()
st = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", a, flags=re.M)

If you also want to strip the leading and trailing spaces on each line, you can add this line

st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
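
Since the question mentions counting the most repeated words afterwards, the cleaning step could be combined with the stopword filter roughly as follows. This is only a sketch: the text is a shortened version of the sample conversation, and the stopword set here is a hypothetical stand-in for the contents of stopwords-es.txt.

```python
import re
from collections import Counter

# Shortened sample; in the question this comes from generic_discord_talk.txt
text = """Company: <@!808947310809317387> Good morning, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order"""

# Hypothetical stopword set; in the question it is loaded from stopwords-es.txt
stopwords = {"you", "for", "the", "i", "to", "me", "with", "we", "can", "how"}

# Strip the "Sender:" prefixes and the <@!...> tags
cleaned = re.sub(r"^[^:\n]+:\s*|\s*<@![^<>]+>", "", text, flags=re.M)

# Lowercase, split into words, and drop the stopwords
words = [w for w in re.findall(r"\w+", cleaned.lower()) if w not in stopwords]

# Count how often each remaining word occurs
counts = Counter(words)
print(counts.most_common(3))
```

Counter.most_common returns (word, count) pairs, which can be passed on to pandas or matplotlib for plotting.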

CodePudding user response:

I think this should work:

import re


data = """Company: <@!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <@!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <@!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <@!808947310809317387> Yes, I have it in front of me"""


def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<@!\d+>", "", line)  # remove tags
        print(line)


run()
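
The same idea could also be written as a generator that works on any iterable of lines, including an open file. This is a sketch, not part of the answer above: the names SENDER, TAG and clean_lines are made up for illustration, and it assumes sender names consist of word characters only (no spaces) and that tags contain only digits.

```python
import re

SENDER = re.compile(r"^\w+:\s*")  # "Company: ", "Customer: ", ... (assumes no spaces in the name)
TAG = re.compile(r"\s*<@!\d+>")   # a tag plus any whitespace right before it


def clean_lines(lines):
    """Yield each line without its sender prefix and without tags."""
    for line in lines:
        line = SENDER.sub("", line.rstrip("\n"))
        line = TAG.sub("", line)
        yield line.strip()


# Made-up sample lines, shaped like the lines of generic_discord_talk.txt
sample = [
    "Company: <@!808947310809317387> Good morning\n",
    "Customer: I add the product to the cart <@!808947310809317387>\n",
]
result = list(clean_lines(sample))
print(result)
```

To use it on the file from the question, open generic_discord_talk.txt with encoding="utf8" inside a with block and pass the file object to clean_lines instead of the sample list.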