I would like to create a very simple script for identifying non-English words: it should replace every word in a text with a <FOREIGN> tag if that word contains any non-ASCII character. For this I used the str.isascii() method.
I have the following sample string:
s = "abc def déf äëü المزيد 한글 - 1 2 3 4 5"
And the following is the expected output:
s_exp = "abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN> - 1 2 3 4 5"
My current working implementation is:
import re
for word in s.split():
    if not word.isascii():
        s = re.sub(word, "<FOREIGN>", s)
While this works perfectly for small amounts of data, I am worried about its performance on hundreds of thousands of rows of text stored in a pandas DataFrame. I was wondering if there is a solution that performs better than this for loop. At the moment, I am using
df['Text'].apply(lambda x: replace_nonenglish(x))
where replace_nonenglish is:
def replace_nonenglish(s):
    for word in s.split():
        if not word.isascii():
            s = re.sub(word, "<FOREIGN>", s)
    return s
Note: I am fully aware that this will produce a number of false negatives, i.e. non-English words not tagged as <FOREIGN>, such as the French "bien" or the German "gut", but that is acceptable for now.
CodePudding user response:
You can use a single regular expression instead of the loop:
import re
s = "abc def déf äëü المزيد 한글 - 1 2 3 4 5"
print( re.sub(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b", "<FOREIGN>", s) )
# => abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN> - 1 2 3 4 5
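Since the question is about many rows, here is a minimal sketch of how the same pattern could be applied in bulk. The names FOREIGN_WORD and tag_foreign are hypothetical, and I have not benchmarked this; the idea is only that the pattern is compiled once and each string is scanned in a single pass, rather than calling re.sub once per non-ASCII word.

```python
import re

# Compile once so the pattern is not re-parsed for every row.
# (FOREIGN_WORD and tag_foreign are illustrative names, not from the question.)
FOREIGN_WORD = re.compile(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b")

def tag_foreign(text):
    """Replace each word containing a non-ASCII letter with <FOREIGN>."""
    return FOREIGN_WORD.sub("<FOREIGN>", text)

rows = ["abc def déf", "한글 - 1 2 3", "plain ascii text"]
tagged = [tag_foreign(row) for row in rows]
print(tagged)

# With pandas, assuming a DataFrame df with a 'Text' column as in the question:
# df["Text"] = df["Text"].apply(tag_foreign)
# or, using the vectorized string accessor:
# df["Text"] = df["Text"].str.replace(FOREIGN_WORD, "<FOREIGN>", regex=True)
```

Whether .apply or .str.replace is faster on your data is something you would need to measure.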
Details:
\b - a word boundary (it is Unicode-aware in Python by default)
[a-zA-Z]* - zero or more ASCII letters
[^\W\d_a-zA-Z] - any Unicode letter but an ASCII letter
[^\W\d_]* - zero or more Unicode letters
\b - a word boundary.
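The breakdown above can also be kept next to the pattern itself by using re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is just a sketch of the same regex in a more readable layout; the name FOREIGN_WORD_VERBOSE is made up for illustration.

```python
import re

# Same pattern as above, spelled out with comments (hypothetical name).
FOREIGN_WORD_VERBOSE = re.compile(
    r"""
    \b                  # word boundary (Unicode-aware for str patterns)
    [a-zA-Z]*           # zero or more ASCII letters
    [^\W\d_a-zA-Z]      # one letter that is not an ASCII letter
    [^\W\d_]*           # zero or more letters of any script
    \b                  # word boundary
    """,
    re.VERBOSE,
)

s = "abc def déf äëü المزيد 한글 - 1 2 3 4 5"
result = FOREIGN_WORD_VERBOSE.sub("<FOREIGN>", s)
print(result)
# => abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN> - 1 2 3 4 5
```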
With the PyPI regex library (install it with pip install regex in your terminal/console window), it looks a bit cleaner:
import regex
s = "abc def déf äëü المزيد 한글 - 1 2 3 4 5"
print( regex.sub(r"\b[a-zA-Z]*[^\P{L}a-zA-Z]\p{L}*\b", "<FOREIGN>", s) )
Here, \p{L} matches any Unicode letter and \P{L} matches any character other than a Unicode letter, so [^\P{L}a-zA-Z] matches any letter that is not an ASCII letter.
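One difference worth knowing: the regex solutions tag runs of letters, while the loop in the question works on whitespace-separated tokens. If the token-level behaviour is what you want, a regex-free sketch is possible too. This is a different technique from the patterns above, tag_foreign_tokens is a hypothetical name, and it assumes that collapsing runs of whitespace into single spaces is acceptable.

```python
def tag_foreign_tokens(text):
    """Tag whole whitespace-separated tokens that contain any non-ASCII character.

    Note: str.split() with no arguments drops leading/trailing whitespace and
    collapses repeated whitespace, so the original spacing is not preserved.
    """
    return " ".join(
        tok if tok.isascii() else "<FOREIGN>" for tok in text.split()
    )

s = "abc def déf äëü المزيد 한글 - 1 2 3 4 5"
tokens_result = tag_foreign_tokens(s)
print(tokens_result)
# => abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN> - 1 2 3 4 5
```

Unlike the regex, this replaces a token such as "déf," together with its trailing comma, because the punctuation is part of the token.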