I am trying to clean up text using a pre-processing function. I want to remove all non-alpha characters such as punctuation and digits, but I would like to retain compound words that use a dash without splitting them (e.g. pre-tender, pre-construction).
def preprocess(text):
    #remove punctuation
    text = re.sub('\b[A-Za-z] (?:- [A-Za-z] ) \b', '-', text)
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.split()
    text = " ".join(text)
    return text
For instance, the original text:
"Attended pre-tender meetings"
should be split into
['attended', 'pre-tender', 'meeting']
rather than
['attended', 'pre', 'tender', 'meeting']
Any help would be appreciated!
CodePudding user response:
To remove all non-alpha characters except a - between letters, you can use
[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))
ASCII-only equivalent:
[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))
See the regex demo. Details:
[\W\d_] - any non-letter
(?<![^\W\d_]-(?=[^\W\d_])) - a negative lookbehind that fails the match if there is a letter and a - immediately to the left, and right after the - there is any letter (checked with the (?=[^\W\d_]) positive lookahead).
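To see how the lookbehind treats hyphens in different positions, here is a minimal sketch. The clean helper and the sample strings are made up for illustration and are not part of the answer; only the pattern is taken from it.

```python
import re

# Unicode-aware pattern quoted in the answer.
PATTERN = re.compile(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))')

def clean(text):
    # Hypothetical helper: replace every match with a space, then collapse whitespace.
    return " ".join(PATTERN.sub(" ", text).split())

# Made-up inputs, one per hyphen position.
samples = ["pre-tender", "pre- tender", "-tender", "route-66"]
for s in samples:
    print(repr(s), "->", repr(clean(s)))
```

Only the hyphen in "pre-tender" has a letter on both sides, so it is the only one that survives; in the other three inputs the hyphen (and the digits in "route-66") are replaced.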
See the Python demo:
import re

def preprocess(text):
    #remove all non-alpha characters but - between letters
    text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', r' ', text)
    return " ".join(text.split())

print(preprocess("Attended pre-tender, etc meetings."))
# => Attended pre-tender etc meetings
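The question asks for a lowercase list of tokens rather than a joined string. A possible variation, assuming that is the desired output, is sketched here; tokenize is a hypothetical name, and the lower() and split() steps are additions that the answer itself does not include.

```python
import re

def tokenize(text):
    # Hypothetical helper: same regex as the answer, then lowercase and split into words.
    text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', ' ', text)
    return text.lower().split()

print(tokenize("Attended pre-tender meetings"))
# => ['attended', 'pre-tender', 'meetings']
```

The question's expected output shows 'meeting' in the singular; turning 'meetings' into 'meeting' would need a stemmer or lemmatizer, which this sketch does not attempt.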
CodePudding user response:
The string module provides the constant strings punctuation and digits.
However, the hyphen is one of the characters in punctuation, so it has to be taken out of that set first. Without the aid of RE, you could then do this:
from string import punctuation, digits
s = 'Attended pre-tender meetings'
# strip() only trims these characters from the two ends of the string
t = s.strip(punctuation.replace('-', '') + digits).split()
print(t)
Output:
['Attended', 'pre-tender', 'meetings']
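Because str.strip only trims characters from the two ends of a string, punctuation or digits in the middle of the text are left alone by the snippet above. One regex-free way to handle those too, offered as a sketch and not as the answerer's method, is str.translate. The name preprocess_no_re and the sample sentence are hypothetical, and only ASCII punctuation from the string module is covered.

```python
from string import punctuation, digits

# Map every digit and every punctuation character except '-' to a space.
TABLE = str.maketrans({c: " " for c in punctuation.replace("-", "") + digits})

def preprocess_no_re(text):
    # Hypothetical helper: translate, split on whitespace, then trim stray
    # hyphens from word edges and drop words that become empty.
    words = [w.strip("-") for w in text.translate(TABLE).split()]
    return [w for w in words if w]

print(preprocess_no_re("Attended 2 pre-tender, etc. meetings - twice!"))
# => ['Attended', 'pre-tender', 'etc', 'meetings', 'twice']
```

Unlike the regex answer, this keeps any hyphen that sits inside a word, including a doubled one such as "pre--tender", so it is a looser approximation.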