Remove words that consist of numbers and string punctuation from text

Is there a better or more efficient way to remove words that consist of numbers, string punctuation, or a combination of the two from some text?

Example 1:

    input_text = 'This is a test number  223/34 and this a real number 2333.'
    expected_output = 'This is a test number and this a real number .'

Example 2:

    input_text = 'This is a test-number  223/34 and this a real number 2333. The email is [email protected] and the website is www.test.com which sells 3-D products'
    expected_output = 'This is a test-number and this a real number . The email is [email protected] and the website is www.test.com which sells 3-D products'

Currently, I have something like this.

import string

def is_valid_word(word):
    # Strip all punctuation, then reject the word if only digits remain
    return not word.translate(str.maketrans('', '', string.punctuation)).isdigit()

clean_text = " ".join([word for word in input_text.split() if is_valid_word(word)])

CodePudding user response：

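Your method seems OK, but if you want to use a regex (as the tag suggests), you can use this pattern to match every run of characters that are not lowercase/uppercase letters or spaces:

[^a-zA-Z ]+

Then replace each match with an empty string. Note that this deletes individual characters rather than whole words, so punctuation such as the final period is removed as well.

import re

input_text = "This is a test number  223/34 and this a real number 2333."
# Delete every run of characters that are neither ASCII letters nor spaces
clean_text = re.sub("[^a-zA-Z ]+", "", input_text)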

CodePudding user response：

You can simply check if a token is alphanumeric:

clean_text  = " ".join([word for word in input_text.split() if word.isalnum()])

See a Python demo:

input_text = 'This is a test number  223/34 and this a real number 2333.'
print( " ".join([word for word in input_text.split() if word.isalnum()]) )
# => This is a test number and this a real number
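
One caveat with str.isalnum(): it keeps a bare number such as 2333 (digits are alphanumeric) and drops every token that contains punctuation, including test-number or www.test.com. A minimal sketch of a stricter token filter follows, assuming a "word" is any token with at least one letter. The helper name has_letter is hypothetical and not part of the original answer.

```python
def has_letter(token):
    # Keep a token only if at least one of its characters is alphabetic.
    return any(ch.isalpha() for ch in token)

input_text = ('This is a test-number  223/34 and this a real number 2333. '
              'The website is www.test.com which sells 3-D products')

# Tokens made only of digits and punctuation ("223/34", "2333.") are dropped.
clean_text = " ".join(token for token in input_text.split() if has_letter(token))
print(clean_text)
# => This is a test-number and this a real number The website is www.test.com which sells 3-D products
```

Because this works on whole whitespace-separated tokens, the period attached to 2333. is removed together with the number, which differs slightly from the expected output in the question.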

If you have specific patterns in mind, you can write regex patterns that match exactly those strings and delete them with re.sub. For example, to remove standalone numbers that may contain math operators, dots, or commas between the digit groups, you can use the following:

import re
input_text = 'This is a test number  223/34 and this a real number 2333. The email is [email protected] and the website is www.test.com.'
print( re.sub(r'[-+]?\b\d+(?:[.,+/*-]\d+)*\b', '', input_text) )

that yields the expected:

This is a test number  and this a real number . The email is [email protected] and the website is www.test.com.

The regex means:

[-+]? - an optional + or -
\b - a word boundary (the digit cannot be glued to a word)
\d+ - one or more digits
(?:[.,+/*-]\d+)* - zero or more repetitions of ., ,, +, /, * or - followed by one or more digits
\b - a word boundary (the digit cannot be glued to a word).
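
The breakdown above can be written as a commented pattern with re.VERBOSE. The sketch below assumes the pattern reads as reconstructed in the comments, and it adds a whitespace cleanup step that is not part of the original answer. The names NUMBER_TOKEN and strip_numbers are hypothetical.

```python
import re

# Assumed reading of the pattern, one component per line.
NUMBER_TOKEN = re.compile(
    r"""
    [-+]?               # an optional sign
    \b\d+               # one or more digits, starting at a word boundary
    (?:[.,+/*-]\d+)*    # zero or more groups: one separator, then digits
    \b                  # a word boundary
    """,
    re.VERBOSE,
)

def strip_numbers(text):
    # Delete the matched numbers, then collapse the leftover runs of spaces.
    without_numbers = NUMBER_TOKEN.sub("", text)
    return re.sub(r" {2,}", " ", without_numbers).strip()

print(strip_numbers('This is a test number  223/34 and this a real number 2333.'))
# => This is a test number and this a real number .
```

One limitation to be aware of: a digit that is separated from letters only by punctuation still sits between two word boundaries, so strip_numbers('sells 3-D products') returns 'sells -D products'. If such tokens must survive, the pattern needs an extra guard, for example a lookahead that rejects a following hyphen and letter.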