I'm filtering lots of tweets and while I was doing tests on how to filter each character I ended up with this:
x = open(string, encoding='utf-8')
text = x.read()
text = re.sub(r'http\S+' '\n', '', text)
text = re.sub(r'http\S+', '', text) # removes links
text = re.sub(r'@\S+' '\n', '', text)
text = re.sub(r'@\S+', '', text) # removes usernames
text = text.replace('0', '').replace('1', '').replace('2', '').replace('3', '') \
.replace('4', '').replace('5', '').replace('6', '').replace('7', '').replace('8', '').replace('9', '') \
.replace(',', '').replace('"', '').replace('“', '').replace('?', '').replace('¿', '').replace(':', '') \
.replace(';', '').replace('-', '').replace('!', '').replace('¡', '').replace('.', '').replace('ℹ', '') \
.replace('\'', '').replace('[', '').replace(']', '').replace(' ', '').replace(' ', '').replace('”', '') \
.replace('º', '').replace(' ', '').replace('#', '').replace('\n', '').replace('·', '\n')
text = remove_emoji(text).lower()
x.close()
This was useful because I could test many things, but I don't expect to modify it anymore, so it's ready to be optimized. How could I make it faster? All the replace calls replace with nothing, except .replace('·', '\n').
CodePudding user response:
Not necessarily faster, but way easier to read would be something like this:
for char in "#<>$ %!&`*|?=/{}:\\@ ';." '"':
    string = string.replace(char, '')
CodePudding user response:
You can achieve most of this with the string methods maketrans and translate; they let you define a mapping from any single character to any given string.
s = "asd123.?fgh"
translations = {"1":"", "2":"", "3":"", ".":"\n", "?": ""}
print(s.translate(s.maketrans(translations)))
It makes all the changes in a single pass through the string, which is typically much faster than chaining many replace calls.
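As a rough sketch of how this might apply to the filter in the question: the three-argument form of str.maketrans takes the characters to map from, the characters to map to, and the characters to delete. The names DELETE, TABLE and clean are made up for illustration, and the character set is only a subset of the one in the question.

```python
import string

# Characters to delete outright: digits plus some of the punctuation
# from the question (an illustrative subset, not the full list).
DELETE = string.digits + ',"“”?¿:;-!¡.\'[]º#\n'

# Three-argument form: map '·' to '\n' and delete everything in DELETE.
TABLE = str.maketrans('·', '\n', DELETE)

def clean(text):
    """Apply all single-character replacements in one pass."""
    return text.translate(TABLE)

print(repr(clean('Hola, ¿qué tal? 123·adiós!')))
```

Because translate works in a single pass, the newline produced from '·' is not deleted again, which matches the order of the chained replace calls in the question.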
CodePudding user response:
Taken from this solution.
The re module (part of Python's standard library, so nothing needs to be installed) seems to work.
For example,
import re
string = "abccbdac"
re.sub('b|c', '', string) #ada
In this case, running re.sub('b|c', '', string) returns "ada". The pipe (|) acts as a separator between the alternatives to replace.
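When every alternative is a single character, a character class is a common alternative to joining them with |. A minimal sketch, where CHARS, PATTERN and strip_chars are illustrative names rather than anything from the original answer; re.escape is used so that characters such as ], - or . are treated literally inside the class.

```python
import re

CHARS = 'bc.-]'  # hypothetical set of characters to remove

# Build a character class from CHARS and compile it once for reuse.
PATTERN = re.compile('[' + re.escape(CHARS) + ']')

def strip_chars(text):
    """Remove every character listed in CHARS from text."""
    return PATTERN.sub('', text)

print(strip_chars('abccbdac'))
```

Compiling the pattern once may help when the same filter runs over many tweets, although the re module also caches recently used patterns on its own.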