Is there a better or more efficient way to remove words that consist only of digits, punctuation characters, or a combination of the two from some text?
example 1:
input_text: 'This is a test number 223/34 and this a real number 2333.'
expected_output = 'This is a test number and this a real number .'
example 2:
input_text: 'This is a test-number 223/34 and this a real number 2333. The email is [email protected] and the website is www.test.com which sells 3-D products'
expected_output = 'This is a test-number and this a real number . The email is [email protected] and the website is www.test.com which sells 3-D products'
Currently, I have something like this.
import string

def is_valid_word(word):
    # A word is invalid if only digits remain once punctuation is stripped.
    return not word.translate(str.maketrans('', '', string.punctuation)).isdigit()
clean_text = " ".join([word for word in input_text.split() if is_valid_word(word)])
CodePudding user response:
Your method seems OK, but if you want to use a regex (as the tag suggests), you can use this pattern to match every run of characters that are not lowercase/uppercase ASCII letters or spaces:
[^a-zA-Z ]+
Then replace each match with an empty string:
import re
input_text = "This is a test number 223/34 and this a real number 2333."
clean_text = re.sub(r"[^a-zA-Z ]+", "", input_text)
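To see what this substitution does in practice, here is a small sketch run on a fragment of the question's second example. The pattern is written with the `+` quantifier so that only non-empty runs are matched. The `sample` string and the whitespace cleanup with `split`/`join` are additions for illustration, not part of the answer above.

```python
import re

sample = "This is a test-number 223/34 and the website is www.test.com"

# Remove every run of characters that are not ASCII letters or spaces.
stripped = re.sub(r"[^a-zA-Z ]+", "", sample)

# Collapse the leftover double spaces (added step, not in the answer above).
clean_text = " ".join(stripped.split())

print(clean_text)
# => This is a testnumber and the website is wwwtestcom
```

Note that punctuation inside words is removed as well, so this approach only fits text where tokens such as test-number or www.test.com do not need to survive intact.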
CodePudding user response:
You can simply check if a token is alphanumeric:
clean_text = " ".join([word for word in input_text.split() if word.isalnum()])
See a Python demo:
input_text = 'This is a test number 223/34 and this a real number 2333.'
print( " ".join([word for word in input_text.split() if word.isalnum()]) )
# => This is a test number and this a real number
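As a quick check of how far the `isalnum()` filter goes, the sketch below applies it to a few tokens taken from the question's examples. The `tokens` list is made up for illustration.

```python
tokens = ["test-number", "223/34", "2333.", "2333", "www.test.com", "3-D", "number"]

# str.isalnum() is True only when the string is non-empty and every
# character is a letter or a digit.
kept = [tok for tok in tokens if tok.isalnum()]

print(kept)
# => ['2333', 'number']
```

So any token containing punctuation is dropped, including test-number, www.test.com and 3-D, while a bare number such as 2333 is kept. Whether that is acceptable depends on the text being cleaned.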
If you have specific patterns in mind, you can write specific regex patterns to find the matching strings and delete them with re.sub. For example, if you want to remove standalone numbers that can contain math operators, dots, or commas between the digits, you can use the following:
import re
input_text = 'This is a test number 223/34 and this a real number 2333. The email is [email protected] and the website is www.test.com.'
print( re.sub(r'[-+]?\b\d+(?:[.,+/*-]\d+)*\b', '', input_text) )
which yields the expected output:
This is a test number and this a real number . The email is [email protected] and the website is www.test.com.
The regex means:
[-+]? - an optional - or +
\b - a word boundary (the digit cannot be glued to a word)
\d+ - one or more digits
(?:[.,+/*-]\d+)* - zero or more repetitions of ., ,, +, /, * or -, followed by one or more digits
\b - a word boundary (the digit cannot be glued to a word).
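The pattern described above can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation. The names `NUMBER_RE` and `remove_numbers` are hypothetical, and collapsing the leftover spaces is an added step that the answer itself does not perform.

```python
import re

# Optional sign, digits, then any number of (., ,, +, /, * or -) + digits groups.
NUMBER_RE = re.compile(r"[-+]?\b\d+(?:[.,+/*-]\d+)*\b")

def remove_numbers(text):
    """Delete standalone numbers, then squeeze the spaces left behind."""
    without_numbers = NUMBER_RE.sub("", text)
    return re.sub(r" {2,}", " ", without_numbers)

print(remove_numbers("This is a test number 223/34 and this a real number 2333."))
# => This is a test number and this a real number .

print(remove_numbers("which sells 3-D products"))
# => which sells -D products
```

The second call shows a limitation: the 3 in 3-D sits between two word boundaries, so it is removed too. If tokens like 3-D must be kept, the pattern needs an extra condition (for example a lookahead that rejects a following hyphen or letter).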