I am working on an NLP task and want to do some general text cleaning; the specific purpose isn't relevant here.
I want a function that:
- Removes non-English words
- Removes words that are entirely in capitals
- Removes words that contain '-' (at the beginning, in the middle, or at the end)
- Removes words that are 2 characters long or shorter
- Removes words that consist only of numbers
For example, given the following string:
'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
the output should be
'George play 123_134 _123pantis'
The function that I have created already is the following:
import re

def clean(text):
    # remove words that aren't English words (isn't working)
    #text = re.sub(r'^[a-zA-Z]+', '', text)
    # remove words that are in capitals
    text = re.sub(r'(\w*[A-Z]+\w*)', '', text)
    # remove words that start with - or have - in the middle (isn't working)
    text = re.sub(r'(\s)-\w+', '', text)
    # remove words that have a length of 2 characters or fewer (is working)
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # remove words with only numbers (isn't working)
    text = re.sub(r'[0-9]+', '', text)
    return text
The output is
- play _ foot-ball _ _pantis ελλαδα
which is not what I need. Thank you very much for your time and help!
CodePudding user response:
You can do this in a single re.sub call.
Search using this regex:
(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*
and replace with an empty string.
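To make the branches of that pattern easier to read, here is a sketch of the same regex written with re.VERBOSE, one alternative per line. The names PATTERN and clean_verbose are only illustrative, and the final strip() is an addition that drops the trailing space left when the last word is removed.

```python
import re

# Same alternatives as the one-line regex, spread out so each can be commented.
PATTERN = re.compile(r"""
    (?:
        \b
        (?:
            \w+(?=-)              # word directly followed by a hyphen ('foot' in 'foot-ball')
          | \w{2}                 # word of exactly two characters ('to', 'in')
          | \d+                   # word made only of digits ('123')
          | [A-Z]+                # word made only of ASCII capitals ('FOOTBALL')
          | \w*[^\x01-\x7F]\w*    # word containing at least one non-ASCII character
        )
        \b
      | -\w+                      # hyphen followed by a word ('-wants', '-ball')
    )
    \s*                           # also consume whitespace after the removed word
""", re.VERBOSE)


def clean_verbose(text):
    """Remove every match of PATTERN and trim leftover edge whitespace."""
    return PATTERN.sub('', text).strip()


s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
print(clean_verbose(s))
# George play 123_134 _123pantis
```

Note that the \w{2} branch only matches words of exactly two characters, so a one-character word such as 'a' would be kept; \w{1,2} would cover both if that is needed.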
Code:
import re
s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
r = re.sub(r'(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*', '', s)
print(r)
# George play 123_134 _123pantis
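If a single regex feels hard to maintain, a token-based filter is another option. This is only a sketch, not the approach used in the answer above: clean_tokens is a hypothetical name, "non-English" is approximated as "contains a non-ASCII character", and str.isascii() needs Python 3.7 or newer.

```python
def clean_tokens(text):
    """Keep only the whitespace-separated tokens that pass every rule."""
    kept = []
    for token in text.split():
        if not token.isascii():   # assumption: non-ASCII means non-English
            continue
        if token.isupper():       # every cased character is a capital
            continue
        if '-' in token:          # hyphen at the start, middle, or end
            continue
        if len(token) <= 2:       # one- and two-character tokens
            continue
        if token.isdigit():       # digits only
            continue
        kept.append(token)
    return ' '.join(kept)


s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
print(clean_tokens(s))
# George play 123_134 _123pantis
```

Unlike the regex, this works on whole whitespace-separated tokens, so punctuation attached to a word (for example 'play,') stays part of the token, and a token mixing capitals and digits such as 'A1B2' is treated as all-capital.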