Tagging foreign text using isascii in Python

I would like to create a very simple non-English word identification script that replaces a word in a text with a <FOREIGN> tag if the word contains any non-English (non-ASCII) character. For this I used the .isascii() method.

I have the following sample string:

s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"

And the following is the expected output:

s_exp = "abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN> - 1 2 3 4 5"

My current working implementation is:

import re
for word in s.split():
    if not word.isascii():
        # re.escape stops regex metacharacters in the word from being interpreted
        s = re.sub(re.escape(word), "<FOREIGN>", s)

While this works perfectly for small amounts of data, I am worried about its performance on hundreds of thousands of rows of text stored in a pandas DataFrame. Is there a solution that performs better than this for loop? At the moment, I am using df['Text'].apply(lambda x: replace_nonenglish(x)) where replace_nonenglish is:

def replace_nonenglish(s):
    for word in s.split():
        if not word.isascii():
            # re.escape stops regex metacharacters in the word from being interpreted
            s = re.sub(re.escape(word), "<FOREIGN>", s)
    return s

Note:

I am fully aware that this will produce false negatives, i.e. non-English words that are not tagged as <FOREIGN>, such as the French "bien" or the German "gut", but that is acceptable for now.

CodePudding user response：

You can do the replacement with a single regular expression:

import re
s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"
print( re.sub(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b", "<FOREIGN>", s) )
# => abc def <FOREIGN> <FOREIGN> <FOREIGN> <FOREIGN>  - 1 2 3 4 5


Details:

\b - a word boundary (it is Unicode aware in Python by default)
[a-zA-Z]* - zero or more ASCII letters
[^\W\d_a-zA-Z] - any Unicode letter except an ASCII letter
[^\W\d_]* - zero or more Unicode letters (\w with digits and the underscore removed)
\b - a word boundary.
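
For many rows, one option is to compile the pattern once and reuse it. The following is a minimal sketch under that assumption, not a benchmarked solution; the names FOREIGN_WORD, tag_foreign and rows are hypothetical, and the list stands in for the DataFrame's 'Text' column.

```python
import re

# Same pattern as above, compiled once so it can be reused for every row.
FOREIGN_WORD = re.compile(r"\b[a-zA-Z]*[^\W\d_a-zA-Z][^\W\d_]*\b")

def tag_foreign(text):
    """Replace each word that contains a non-ASCII letter with <FOREIGN>."""
    return FOREIGN_WORD.sub("<FOREIGN>", text)

# Hypothetical sample rows standing in for the 'Text' column.
rows = ["abc def déf äëü", "gut bien 한글 - 1 2 3", "plain ascii only"]
tagged = [tag_foreign(row) for row in rows]
print(tagged)
```

With pandas, the compiled pattern can also be passed to df['Text'].str.replace(FOREIGN_WORD, "<FOREIGN>", regex=True). Whether that is faster than .apply on a given dataset is worth measuring rather than assuming.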

With the PyPI regex library (install it with pip install regex), the pattern looks a bit cleaner:

import regex
s = "abc def déf äëü المزيد 한글  - 1 2 3 4 5"
print( regex.sub(r"\b[a-zA-Z]*[^\P{L}a-zA-Z]\p{L}*\b", "<FOREIGN>", s) )

Here, \p{L} matches any Unicode letter and \P{L} matches any character other than a Unicode letter.
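
If the .isascii() test from the question should be kept as is, one possible single-pass variant is to give re.sub a function as the replacement, so each whitespace-delimited token is checked exactly once. This is only a sketch: tag_non_ascii_tokens is a hypothetical name, and str.isascii() needs Python 3.7 or later.

```python
import re

def tag_non_ascii_tokens(text):
    """Replace every whitespace-delimited token that is not pure ASCII."""
    def repl(match):
        token = match.group()
        # Keep ASCII-only tokens unchanged; tag everything else.
        return token if token.isascii() else "<FOREIGN>"
    # \S+ matches each run of non-whitespace, so the original spacing is preserved.
    return re.sub(r"\S+", repl, text)

print(tag_non_ascii_tokens("abc déf 한글 - 1 2"))
```

Unlike the letter-class patterns above, this treats a token as a unit, so punctuation attached to a non-ASCII word (for example "déf,") is replaced together with the word.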