General regex about English words, numbers, length and specific symbols

Time:07-17

I am working on an NLP task and want to do some general text cleaning. The specific purpose isn't relevant here.

I want a function that:

  1. Removes non-English words
  2. Removes words that are entirely in capitals
  3. Removes words that have '-' at the beginning or the end
  4. Removes words that have length less than 2 characters
  5. Removes words that consist only of numbers

For example if I have the following string

'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'

the output should be

'George play 123_134 _123pantis' 

The function that I have created already is the following:

import re

def clean(text):
    # remove words that aren't English words (isn't working)
    #text = re.sub(r'^[a-zA-Z]+', '', text)

    # remove words that are in capitals
    text = re.sub(r'(\w*[A-Z]+\w*)', '', text)

    # remove words that start with - or have - in the middle (isn't working)
    text = re.sub(r'(\s)-\w+', '', text)

    # remove words that have length less than 2 characters (is working)
    text = re.sub(r'\b\w{1,2}\b', '', text)

    # remove words with only numbers (isn't working)
    text = re.sub(r'[0-9]+', '', text)
    return text

The output is

  -  play _ foot-ball _  _pantis   ελλαδα 

which is not what I need. Thank you very much for your time and help!

CodePudding user response:

You can do this in a single re.sub call.

Search using this regex:

(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*

and replace with empty string.
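To make the alternatives easier to follow, here is a sketch of the same pattern written with re.VERBOSE, one commented branch per line. The names PATTERN, sample and cleaned are only illustrative, and the .strip() at the end is an addition that drops the whitespace left over after the last kept word.

```python
import re

# The pattern from above, spread out with re.VERBOSE so each branch can carry a comment.
PATTERN = re.compile(r"""
    (?:
        \b
        (?:
            \w+(?=-)              # a word directly followed by '-', e.g. 'foot' in 'foot-ball'
          | \w{2}                 # a word of exactly two characters
          | \d+                   # a word made only of digits
          | [A-Z]+                # a word made only of capital letters
          | \w*[^\x01-\x7F]\w*    # a word containing at least one non-ASCII character
        )
        \b
      | -\w+                      # a word with a leading '-', e.g. '-wants' or '-ball'
    )
    \s*                           # swallow the whitespace after the removed word
""", re.VERBOSE)

sample = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
cleaned = PATTERN.sub('', sample).strip()
print(cleaned)
# George play 123_134 _123pantis
```

Note that the \w{2} branch only catches two-character words. If single-character words should go as well, \w{1,2} (as in the question) is one option.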


Code:

import re

s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'

r = re.sub(r'(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*', '', s)

print(r)
# George play 123_134 _123pantis
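
If a single regex feels hard to maintain, the same rules can also be applied token by token. This is only a sketch under a few assumptions that go beyond the question's wording: "non-English" is approximated as "contains a non-ASCII character", any word containing '-' is dropped (the expected output also removes 'foot-ball'), and words of 2 characters or fewer are dropped (the expected output removes 'to' and 'in'). The function name clean_tokens is hypothetical.

```python
import re

def clean_tokens(text):
    """Filter whitespace-separated tokens using the rules from the question."""
    kept = []
    for token in text.split():
        if not token.isascii():             # assumption: non-ASCII means non-English (Python 3.7+)
            continue
        if re.fullmatch(r'[A-Z]+', token):  # entirely capital letters
            continue
        if '-' in token:                    # assumption: a '-' anywhere in the word
            continue
        if len(token) <= 2:                 # assumption: 2 characters or fewer
            continue
        if token.isdigit():                 # digits only
            continue
        kept.append(token)
    return ' '.join(kept)

s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
print(clean_tokens(s))
# George play 123_134 _123pantis
```

Because it splits on whitespace, this version treats attached punctuation as part of the word, so it may behave differently from the regex on text with commas or full stops.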
