Tokenize text but keep compound hyphenated words together

I am trying to clean up text using a pre-processing function. I want to remove all non-alpha characters such as punctuation and digits, but I would like to retain compound words that use a dash without splitting them (e.g. pre-tender, pre-construction).

def preprocess(text):
  #remove punctuation
  text = re.sub('\b[A-Za-z]+(?:-+[A-Za-z]+)+\b', '-', text)
  text = re.sub('[^a-zA-Z]', ' ', text)
  text = text.split()
  text = " ".join(text)
  return text

For instance, the original text:

"Attended pre-tender meetings"

should be split into

['attended', 'pre-tender', 'meeting']

rather than

['attended', 'pre', 'tender', 'meeting']

Any help would be appreciated!

CodePudding user response：

To remove all non-letter characters except a hyphen that sits between two letters, you can use

[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))

ASCII only equivalent:

[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))

Details:

[\W\d_] - any character that is not a letter (a non-word character, a digit, or an underscore)
(?<![^\W\d_]-(?=[^\W\d_])) - a negative lookbehind that fails the match if the character just consumed is a - with a letter immediately before it and a letter immediately after it (the latter is checked with the (?=[^\W\d_]) positive lookahead).
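To make the lookbehind easier to follow, here is a small sketch that lists which characters the ASCII-only pattern actually matches (that is, which ones would be removed). The sample strings are made up for illustration and are not from the question.

```python
import re

# ASCII-only pattern from above: a non-letter that is NOT a hyphen between two letters
pattern = re.compile(r'[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))')

# Hypothetical sample inputs covering the hyphen edge cases
samples = ["pre-tender", "pre- tender", "-tender", "pre-2tender", "pre--tender"]

for s in samples:
    # findall returns every character the pattern would remove
    print(repr(s), '->', pattern.findall(s))

# 'pre-tender'  -> []           (hyphen between letters is protected)
# 'pre- tender' -> ['-', ' ']   (no letter right after the hyphen)
# '-tender'     -> ['-']        (no letter before the hyphen)
# 'pre-2tender' -> ['-', '2']   (a digit follows the hyphen)
# 'pre--tender' -> ['-', '-']   (neither hyphen has letters on both sides)
```

Only a single hyphen with a letter on each side survives; everything else that is not a letter is matched.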

See the Python demo:

import re

def preprocess(text):
  #remove all non-alpha characters but - between letters
  text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', r' ', text)
  return " ".join(text.split())

print(preprocess("Attended pre-tender, etc meetings."))
# => Attended pre-tender etc meetings
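If you want a list of lowercase tokens, as in the question's expected output, rather than a cleaned string, the same pattern can be combined with lower() and split(). This is just a sketch: the name tokenize and the longer sample sentence are my own, and it does no stemming, so "meetings" stays "meetings".

```python
import re

# Same pattern as in the answer above, compiled once for reuse
NON_LETTER = re.compile(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))')

def tokenize(text):
    # Replace every unwanted character with a space, lowercase, then split on whitespace
    return NON_LETTER.sub(' ', text).lower().split()

print(tokenize("Attended pre-tender meetings"))
# => ['attended', 'pre-tender', 'meetings']

print(tokenize("Attended pre-tender meetings, 2 pre-construction reviews - done."))
# => ['attended', 'pre-tender', 'meetings', 'pre-construction', 'reviews', 'done']
```

Note that the standalone " - " in the second sentence is dropped, because that hyphen has no letter on either side.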

CodePudding user response：

The string module has constant strings punctuation and digits.

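However, a hyphen is considered to be a punctuation character, so it has to be excluded explicitly. Without the aid of RE, you could do the following. Note that str.strip only removes characters from the two ends of the string, so this handles leading and trailing punctuation or digits, not characters in the middle of the text: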

from string import punctuation, digits

s = 'Attended pre-tender meetings'

t = s.strip(punctuation.replace('-', '') + digits).split()

print(t)

Output:

['Attended', 'pre-tender', 'meetings']
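For text with punctuation or digits in the middle, a regex-free variant could use str.translate instead of strip. This is a sketch of a different technique, not the approach above: the names TABLE and tokenize are mine, and unlike the regex answer it keeps every hyphen inside a token (for example, "pre--tender" would stay as is), only trimming hyphens at token edges.

```python
from string import punctuation, digits

# Map every punctuation character except '-' and every digit to a space
TABLE = str.maketrans({ch: ' ' for ch in punctuation.replace('-', '') + digits})

def tokenize(text):
    # Blank out unwanted characters, split on whitespace,
    # then trim stray hyphens from token edges and drop empty tokens
    tokens = (tok.strip('-') for tok in text.translate(TABLE).split())
    return [tok for tok in tokens if tok]

print(tokenize("Attended pre-tender, etc. meetings in 2021."))
# => ['Attended', 'pre-tender', 'etc', 'meetings', 'in']
```

Add .lower() before .split() if you want lowercase tokens.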