Replacing replaces in a faster way

I'm filtering lots of tweets and while I was doing tests on how to filter each character I ended up with this:

x = open(string, encoding='utf-8')
text = x.read()
text = re.sub(r'http\S+' + '\n', '', text)
text = re.sub(r'http\S+', '', text)  # removes links
text = re.sub(r'@\S+' + '\n', '', text)
text = re.sub(r'@\S+', '', text)  # removes usernames
text = text.replace('0', '').replace('1', '').replace('2', '').replace('3', '') \
    .replace('4', '').replace('5', '').replace('6', '').replace('7', '').replace('8', '').replace('9', '') \
    .replace(',', '').replace('"', '').replace('“', '').replace('?', '').replace('¿', '').replace(':', '') \
    .replace(';', '').replace('-', '').replace('!', '').replace('¡', '').replace('.', '').replace('ℹ', '') \
    .replace('\'', '').replace('[', '').replace(']', '').replace('   ', '').replace('  ', '').replace('”', '') \
    .replace('º', '').replace(' ', '').replace('#', '').replace('\n', '').replace('·', '\n')
text = remove_emoji(text).lower()
x.close()

This was useful because I could test many things, but I don't think I'm going to modify it anymore, so it's ready to be optimized. How could I make it faster? All the replaces replace with nothing, except .replace('·', '\n').

CodePudding user response：

Not necessarily faster, but way easier to read would be something like this:

for char in "#<>$+%!&`*|?=/{}:\\@+';." + '"':
    string = string.replace(char, '')

CodePudding user response：

You can achieve most of this with the str.maketrans and str.translate methods. They let you define a mapping from any single character to any given string.

s = "asd123.?fgh"

translations = {"1":"", "2":"", "3":"", ".":"\n", "?": ""}
print(s.translate(s.maketrans(translations)))

It makes all the changes in a single pass through the string instead of one pass per replace call, which should make it much faster.
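
Applied to the character list from the question, this could look like the sketch below. It uses the three-argument form of str.maketrans, where the third argument lists the characters to delete. The names DELETE_CHARS, TABLE and clean are made up for illustration, and the character list is an abbreviated version of the one in the question.

```python
# Characters to delete outright (digits and punctuation from the question;
# the list here is illustrative, not exhaustive).
DELETE_CHARS = '0123456789,"“”?¿:;-!¡.ℹ\'[]º#\n'

# Three-argument form: map '·' to '\n' and delete every char in DELETE_CHARS.
TABLE = str.maketrans('·', '\n', DELETE_CHARS)


def clean(text: str) -> str:
    """Apply all single-character replacements in one pass."""
    return text.translate(TABLE)


print(clean('Hola, ¿qué tal?·Bien!'))
```

Because the table is applied in one pass, a newline in the input is deleted while a '·' still becomes a newline, which matches the order of the original chain. Multi-character replacements such as .replace('  ', '') cannot go into the table, so they would still need a separate replace or re.sub call.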

CodePudding user response：

Taken from this solution.
The re module (part of Python's standard library, so nothing needs to be installed) seems to work.

For example,

import re
string = "abccbdac"
re.sub('b|c', '', string) #ada

In this case, running re.sub('b|c', '', string) would return "ada". The pipe (|) is used as a separator between the characters to replace.
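
For single characters, a character class does the same job as the alternation and avoids writing a pipe between every pair. A minimal sketch, assuming the characters to strip are collected in one string; the names CHARS_TO_REMOVE, PATTERN and strip_chars are illustrative only, and the character list is a subset of the one in the question.

```python
import re

# Illustrative subset of the characters from the question.
CHARS_TO_REMOVE = '0123456789,"?¿:;-!¡.[]#'

# re.escape makes characters such as '-', '[' and ']' literal inside the class.
PATTERN = re.compile('[' + re.escape(CHARS_TO_REMOVE) + ']')


def strip_chars(text: str) -> str:
    """Remove every occurrence of the listed characters with one regex pass."""
    return PATTERN.sub('', text)


print(strip_chars('abc[1]-d!'))
print(re.sub('[bc]', '', 'abccbdac'))  # same result as 'b|c'
```

Compiling the pattern once is worthwhile when the same cleanup runs over many tweets. Whether this beats str.translate or chained replace calls depends on the data, so it is worth timing with timeit on a real sample.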