How to remove Modifier Letters from string

Time:06-23

Modifier letters are characters like the ᴶᴼᴵᴺ in the string below. I am curious what the most efficient way is to remove them from a list of strings.

I know I can make a list containing all of these Unicode characters and run a for loop that checks each of them against the string. I wonder how I can remove them using the "re" package, perhaps by specifying their range.

My string looks like this:

mystr = 'سلام خوبی dsdsd ᴶᴼᴵᴺ'

This is the Unicode code point for 'ᴶ':

https://www.compart.com/en/unicode/U+1D36

CodePudding user response:

You can find the Unicode categories here:

https://unicodebook.readthedocs.io/unicode.html

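You can try this code (Python 3):

import unicodedata

inputData = "سلام خوبی dsdsd ᴶᴼᴵᴺ"
# Keep every character except modifier letters (Unicode category 'Lm';
# 'Sk' is the separate "modifier symbol" category and does not match 'ᴶ').
print("".join(x for x in inputData if unicodedata.category(x) != 'Lm'))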
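Since the question is about a list of strings, one possible variation is to precompute a deletion table once and reuse it with str.translate. This is a minimal sketch using only the standard library, not the method from the answer above; the names LM_TABLE and strip_modifier_letters are hypothetical.

```python
import sys
import unicodedata

# Map every code point in category 'Lm' (Letter, modifier) to None so that
# str.translate deletes it. The table is built once and reused for every string.
LM_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lm"
}

def strip_modifier_letters(text):
    """Hypothetical helper: return text without category-Lm characters."""
    return text.translate(LM_TABLE)

strings = ["سلام خوبی dsdsd ᴶᴼᴵᴺ", "no modifiers here"]
cleaned = [strip_modifier_letters(s) for s in strings]
print(cleaned)
```

Note that category Lm covers more than the superscript-style letters shown here. It also includes ARABIC TATWEEL (U+0640), for example, so it may be worth checking whether removing the whole category is really what you want for Arabic or Persian text.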

CodePudding user response:

It turned out that the regex is faster for longer sentences.

import time
import unicodedata

inputData = u"سلام خوبی dxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsd ᴶᴼᴵᴺ"

a = time.time()
for i in range(1_000_000):
    d = "".join(x for x in inputData if unicodedata.category(x) != 'Lm')

print(time.time() - a)

This took 17.69 seconds on my 2.4 GHz 8-core Intel Core i9.

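import time
# Third-party "regex" package (pip install regex); the built-in "re" module
# does not support \p{...} Unicode property classes.
import regex as re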

text = u"سلام خوبی dxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsd ᴶᴼᴵᴺ"

a = time.time()
for i in range(1_000_000):
    d = re.sub(r"\p{Lm}", "", text)

print(time.time() - a)

This took 6.1 seconds.

If you use the shorter string

u"سلام خوبی dxxxxxxxxxxsdᴶᴼᴵᴺ"

the regex approach takes 6.08 seconds, while the character-level lookup takes 5.08 seconds.
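If installing the third-party regex package is not an option, a similar comparison can be sketched with the built-in re module by building the character class from unicodedata. This is only a rough sketch under that assumption: the names LM_PATTERN, by_category, and by_regex and the iteration count are hypothetical, and the timings will depend on your machine and input length.

```python
import re
import sys
import timeit
import unicodedata

text = "سلام خوبی dxxxxxxxxxxsdsd ᴶᴼᴵᴺ"

# The stdlib re module has no \p{Lm}, so collect all category-'Lm'
# code points and put them into one precompiled character class.
lm_chars = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lm"
)
LM_PATTERN = re.compile("[" + re.escape(lm_chars) + "]")

def by_category(s):
    """Character-level lookup, as in the first answer."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Lm")

def by_regex(s):
    """Regex substitution using the precompiled character class."""
    return LM_PATTERN.sub("", s)

# Both approaches should produce the same cleaned string.
assert by_category(text) == by_regex(text)

n = 10_000  # hypothetical iteration count; raise it for steadier numbers
for fn in (by_category, by_regex):
    seconds = timeit.timeit(lambda: fn(text), number=n)
    print(f"{fn.__name__}: {seconds:.3f} s for {n} runs")
```

Using timeit instead of manual time.time() calls keeps the timing loop overhead consistent between the two variants.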
