Regex For Special Character (S with line on top)

I was trying to write a regex in Python to replace all non-ASCII characters with an underscore, but if one of the characters is "S̄" (an 'S' with a line on top), the output contains an extra 'S'. Is there a way to account for this character as well? I believe it's a valid UTF-8 character, but not ASCII.

Here's the code:

import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))

I would expect it to output:

ra_ndom_word_

But instead I get:

ra_ndom_wordS__

CodePudding user response：

Python behaves this way because you are actually looking at two distinct characters: an S, followed by a combining macron (U+0304).
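To see the two code points for yourself, here is a minimal sketch using only the standard library. The helper name `describe` is made up for this illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]

# "S\u0304" is the S followed by the combining macron
for row in describe("S\u0304"):
    print(row)
# ('U+0053', 'LATIN CAPITAL LETTER S', 0)
# ('U+0304', 'COMBINING MACRON', 230)
```

A non-zero combining class marks the second character as a combining mark, which is what the `unicodedata.combining()` check relies on.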

In the general case, if you want to replace a sequence of combining characters and the base character with an underscore, try e.g.

import unicodedata

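def cleanup(line):
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # Combining mark: turn the preceding base character into "_"
            # (further marks on the same base just overwrite the same slot)
            if cleaned:
                cleaned[-1] = "_"
            else:
                cleaned.append("_")
            continue
        # Keep only lowercase letters (category "Ll"); everything else becomes "_"
        if unicodedata.category(char) != "Ll":
            char = "_"
        cleaned.append(char)
    return ''.join(cleaned)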

By the by, \W does not need square brackets around it; it's already a regex character class.

Python's re module does not support Unicode property classes such as \p{...}. If you specifically want a regex for this, the third-party regex library has proper support for Unicode categories.
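If you would rather stay with the standard re module, one possible workaround is to spell out the combining marks as an explicit range. The sketch below assumes the only marks you care about are in the Combining Diacritical Marks block (U+0300 to U+036F); the names `PATTERN` and `replace_non_ascii` are made up for this example.

```python
import re

# Assumption: only combining marks in U+0300..U+036F need handling.
# First alternative: any character followed by one or more such marks.
# Second alternative: any single character that is not an ASCII letter or digit.
PATTERN = re.compile(r'.[\u0300-\u036f]+|[^a-zA-Z0-9]')

def replace_non_ascii(line):
    """Replace each base+marks sequence or non-alphanumeric character with "_"."""
    return PATTERN.sub('_', line)

print(replace_non_ascii("ra*ndom wordS\u0304"))  # ra_ndom_word_
```

Because the first alternative is tried before the second, the base character and its marks are consumed together and produce a single underscore.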

CodePudding user response：

You can use the following script to get the desired output:

import re

line="ra*ndom wordS̄"
print(re.sub('[^[-~]+]*','_',line))

Output:

ra_ndom_word_

This approach works with other non-ASCII characters as well:

import re

line="ra*ndom ¡¢£Ä wordS̄.  another non-ascii: Ä and Ï"
print(re.sub('[^[-~]+]*','_',line))

Output:

ra_ndom_word_another_non_ascii_and_
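One thing to keep in mind with this pattern: the character class keeps only the ASCII range from "[" (0x5B) to "~" (0x7E). Uppercase ASCII letters and digits lie below that range, so they are replaced as well. A quick check, using a simplified form of the same class (the variable name `pattern` is just for this sketch):

```python
import re

# One or more characters outside the ASCII range "[" (0x5B) to "~" (0x7E)
pattern = re.compile(r'[^\[-~]+')

print(pattern.sub('_', 'abc_xyz'))   # abc_xyz  (lowercase letters and "_" are kept)
print(pattern.sub('_', 'Room 42B'))  # _oom_    (uppercase, digits and spaces are replaced)
```

That matches the sample inputs above, which contain only lowercase ASCII letters besides the characters to be replaced, but it may matter for mixed-case text.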