I downloaded a book from Project Gutenberg and saved it as a text file. I started with the code below as a first step.
I read the book I chose (the text file), then did the following:
def remove_punc(string):
    punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
    for ele in string:
        if ele in punc:
            string = string.replace(ele, "")
    return string
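try:
    with open(filename, 'r', encoding="utf-8") as f:
        data = f.read()
    # "w" truncates the file before writing; "w " (with a space) is not a valid mode
    with open(filename, "w", encoding="utf-8") as f:
        f.write(remove_punc(data))
    print("Removed punctuations from the file", filename)
except FileNotFoundError:
    print("File not found:", filename)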
It didn't work, so I couldn't proceed with the rest.
CodePudding user response:
So if I understand you correctly, you want to remove every character except A-Z and a-z?
import re
pattern = re.compile('[^A-Za-z]')  # matches anything that is not an ASCII letter
data = ''
with open(filename, 'r', encoding="utf-8") as f:
    data = pattern.sub('', f.read())
with open(filename, "w", encoding="utf-8") as f:
    f.write(data)
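Note that the pattern above also strips spaces and newlines, so all the words run together. Here is a minimal sketch of a variant that keeps whitespace, assuming that is what you want. The name `keep_letters` and the sample string are made up for illustration:

```python
import re

# Hypothetical helper: keep ASCII letters and whitespace, drop everything else.
_NOT_LETTER_OR_SPACE = re.compile(r"[^A-Za-z\s]")

def keep_letters(text):
    """Return text with every character removed except A-Z, a-z and whitespace."""
    return _NOT_LETTER_OR_SPACE.sub("", text)

sample = "Chapter 1: It was the best of times, it was the worst of times!"
print(keep_letters(sample))
```

The words stay separated, although removing a digit or punctuation mark between two spaces can leave a double space behind.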
CodePudding user response:
You can use the translate() method. First prepare a translation table that will remove punctuation. Then use it directly on your input data to write the output.
punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
removePunctuation = str.maketrans('', '', punc)  # translation table
with open(filename, 'r', encoding="utf-8") as f:
    data = f.read()
with open(filename, "w", encoding="utf-8") as f:
    f.write(data.translate(removePunctuation))  # use translate directly
print("Removed punctuations from the file", filename)
You seem to want to exclude more characters than mere punctuation, but you can get most of them from the string module:
import string
punc = ' ' + string.punctuation + string.digits + "your extra chars"
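Putting the two pieces together, here is a minimal sketch of building the table from the string module. The extra characters and the sample text are placeholders, not your actual list, and the space is left out of the removal set here so that words stay separated:

```python
import string

# Assumption: these extra characters are only examples; extend as needed.
extra_chars = "“”£§"
punc = string.punctuation + string.digits + extra_chars

# The third argument of str.maketrans lists the characters to delete.
remove_table = str.maketrans("", "", punc)

def strip_chars(text):
    """Delete every character listed in punc from text."""
    return text.translate(remove_table)

print(strip_chars("“Hello, world!” cost £5 in 1999."))
```

Building the table once and reusing it avoids rescanning the text for each character, which is what the replace() loop in the question does.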
CodePudding user response:
Wouldn't it be easier like this?
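from string import digits
from nltk.tokenize import RegexpTokenizer

# yourfile is assumed to hold the text of the book as a string
tokenizer = RegexpTokenizer(r"\w+")  # one token per run of word characters
clean_text = tokenizer.tokenize(yourfile)
my_string = " ".join(clean_text)
# Python 3: translate() takes a table built with str.maketrans
newstring = my_string.translate(str.maketrans('', '', digits))
print(newstring)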
That is, instead of removing what you don't want, keep what you want. You get your list of words, join it into a string, and then remove the digits from that string with the translate method.
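The same "keep what you want" idea can be sketched with only the standard library, in case installing nltk is not an option. This assumes you only want ASCII words, and `extract_words` is a made-up name:

```python
import re

def extract_words(text):
    """Return the runs of ASCII letters found in text, in order."""
    return re.findall(r"[A-Za-z]+", text)

sample = "In 1842, Mr. Smith's café opened."
words = extract_words(sample)
print(" ".join(words))
```

Unlike `\w+`, this pattern drops digits, underscores and accented letters in one pass, so a word such as "café" is cut to "caf". Widen the character class if you need to keep such letters.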