Removing all punctuation, spaces and other non-letter characters including numbers from a text file

I downloaded a book from Project Gutenberg and saved it as a text file. I started with the code below as a first step.

I read in the book I chose (the text file), then did the following:

def remove_punc(string):
    # Every character listed here is deleted from the text
    punc = '''!()-[]{};:'"\\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
    for ele in string:
        if ele in punc:
            string = string.replace(ele, "")
    return string


try:
    with open(filename, 'r', encoding="utf-8") as f:
        data = f.read()
    with open(filename, "w+", encoding="utf-8") as f:
        f.write(remove_punc(data))
    print("Removed punctuations from the file", filename)
except OSError as err:
    print("Could not process the file:", err)

It didn't work, so I couldn't proceed with the rest.

CodePudding user response：

So if I understand you correctly, you want to remove literally every character except A-Z and a-z?

import re

# Match any character that is not an ASCII letter
pattern = re.compile(r'[^A-Za-z]')

with open(filename, 'r', encoding="utf-8") as f:
    data = pattern.sub('', f.read())
with open(filename, "w+", encoding="utf-8") as f:
    f.write(data)
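
To see what this pattern does before pointing it at a whole book, here is a minimal sketch on a short sample. The helper name `keep_ascii_letters` and the sample text are illustrative assumptions, not part of the answer above. Note that `[^A-Za-z]` also drops accented letters such as `é`, which may or may not be what you want.

```python
import re

# Matches every character that is NOT an ASCII letter.
NON_LETTERS = re.compile(r'[^A-Za-z]')

def keep_ascii_letters(text):
    """Hypothetical helper: delete everything except A-Z and a-z."""
    return NON_LETTERS.sub('', text)

sample = "Chapter 1: Café, résumé & co.\n"
print(keep_ascii_letters(sample))  # accented letters are removed too
```

If accented letters should survive, a character-by-character filter with `str.isalpha()` is one possible alternative, since it treats any Unicode letter as a letter.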

CodePudding user response：

You can use the translate() method. First prepare a translation table that will remove punctuation. Then use it directly on your input data to write the output.

punc = '''!()-[]{};:'"\\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''

removePunctuation = str.maketrans('','',punc)   # translation table

with open(filename,'r',encoding="utf-8") as f:
    data = f.read()

with open(filename,"w+",encoding="utf-8") as f:
    f.write(data.translate(removePunctuation))  # use translate directly

print("Removed punctuations from the file", filename)

You seem to want to exclude more characters than just punctuation, but you can get most of them from the string module:

import string

punc = ' ' + string.punctuation + string.digits + "your extra chars"
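
Putting the two pieces together, a minimal sketch might look like the following. The contents of `extra_chars` and the sample text are assumptions chosen for illustration; substitute whatever additional characters appear in your own book.

```python
import string

# Characters to delete: space, ASCII punctuation, digits,
# plus a few extras (the extras here are an illustrative assumption).
extra_chars = "\n“”£§"
punc = ' ' + string.punctuation + string.digits + extra_chars

# Translation table that maps each listed character to "deleted".
remove_table = str.maketrans('', '', punc)

sample = "“Hello, World!” costs £12.\n"
cleaned = sample.translate(remove_table)
print(cleaned)
```

Because `translate()` makes a single pass over the text, this tends to scale better on a whole book than calling `replace()` once per unwanted character.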

CodePudding user response：

Wouldn't it be easier like this?

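import nltk
from string import digits

# yourfile is assumed to hold the text already read from the file
tokenizer = nltk.RegexpTokenizer(r"\w+")
clean_text = tokenizer.tokenize(yourfile)
my_string = " ".join(clean_text)
# In Python 3, str.translate needs a table built with str.maketrans
newstring = my_string.translate(str.maketrans('', '', digits))
print(newstring)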

That is, instead of removing what you don't want, keep what you do want. You get your list of words, join it into a string, and then remove the digits from that string with the translate method.
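
If installing nltk is not an option, the same "keep what you want" idea can be sketched with the standard-library `re` module. This is a different technique from the tokenizer above, not a drop-in equivalent: it keeps only runs of ASCII letters, so digits never enter the result and accented letters are dropped. The helper name `letters_only_words` and the sample sentence are assumptions for illustration.

```python
import re

def letters_only_words(text):
    """Hypothetical helper: keep runs of ASCII letters, joined by single spaces."""
    words = re.findall(r'[A-Za-z]+', text)
    return ' '.join(words)

sample = "It was 1866, and the sea-monster was real!"
print(letters_only_words(sample))
```

Joining with `''` instead of `' '` would also remove the spaces, which matches the goal stated in the question.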