Modifier letters look like the ones in the example below; I am curious what the most efficient way is to remove them from a list of strings.
I know I can make a list containing all of these Unicode characters and run a for loop that checks each of them against the string. I wonder how I can remove them using the "re" package, perhaps by specifying their range.
My string looks like this:
mystr = 'سلام خوبی dsdsd ᴶᴼᴵᴺ'
This is the Unicode entry for 'ᴶ':
https://www.compart.com/en/unicode/U+1D36
CodePudding user response:
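You can find the Unicode categories here:
https://unicodebook.readthedocs.io/unicode.html
You can try this code (Python 3). It drops every character whose general category is Lm (Letter, modifier), which is the category of 'ᴶ' (U+1D36); Sk is the separate "Symbol, modifier" category and would not match it:
import unicodedata
inputData = "سلام خوبی dsdsd ᴶᴼᴵᴺ"
# Keep only the characters that are not modifier letters (category Lm).
print("".join(x for x in inputData if unicodedata.category(x) != 'Lm'))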
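Since the question asks about the standard "re" package and a range, here is a minimal sketch of one way to do that. The standard-library re module has no \p{...} property classes, so this builds the character class once from unicodedata instead of hard-coding ranges. The names MODIFIER_RE and strip_modifier_letters are made up for this illustration.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Lm (Letter, modifier).
modifier_letters = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lm"
]

# Build a single character class and compile it once.
# re.escape is a precaution for characters that are special inside [...].
MODIFIER_RE = re.compile(
    "[" + "".join(re.escape(c) for c in modifier_letters) + "]"
)


def strip_modifier_letters(text):
    """Return text with all category-Lm characters removed."""
    return MODIFIER_RE.sub("", text)


strings = ["سلام خوبی dsdsd ᴶᴼᴵᴺ", "plain text"]
cleaned = [strip_modifier_letters(s) for s in strings]
print(cleaned)
```

Building the list scans the whole Unicode range, so it is worth doing once at import time and reusing the compiled pattern for every string in the list.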
CodePudding user response:
It turned out that the regex is faster for longer sentences.
import time
import unicodedata
inputData = "سلام خوبی dxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsd ᴶᴼᴵᴺ"
a = time.time()
for i in range(1_000_000):
    # Character-level filter: drop modifier letters (category Lm).
    d = "".join(x for x in inputData if unicodedata.category(x) != 'Lm')
print(time.time() - a)
On my 2.4 GHz 8-core Intel Core i9 this took 17.69 seconds.
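import time
import regex as re  # third-party "regex" package; the standard "re" module does not support \p{...}
text = "سلام خوبی dxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsd ᴶᴼᴵᴺ"
a = time.time()
for i in range(1_000_000):
    # \p{Lm} matches any modifier letter; the raw string avoids an invalid-escape warning.
    d = re.sub(r"\p{Lm}", "", text)
print(time.time() - a)
This took 6.1 seconds.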
If you use the shorter string
"سلام خوبی dxxxxxxxxxxsdᴶᴼᴵᴺ"
instead, the regex approach takes 6.08 seconds while the character-level loop takes 5.08 seconds.
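A third option that may be worth timing alongside the two above is str.translate with a precomputed deletion table. This is only a sketch and was not part of the measurements above, so no timing is claimed for it; DELETE_LM and strip_lm are names invented for this example.

```python
import sys
import unicodedata

# Map every category-Lm code point to None; str.translate deletes
# any character whose code point maps to None.
DELETE_LM = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lm"
}


def strip_lm(text):
    """Return text with all modifier letters (category Lm) removed."""
    return text.translate(DELETE_LM)


text = "سلام خوبی dxxxxxxxxxxsdᴶᴼᴵᴺ"
print(strip_lm(text))
```

The table is built once, so the per-string cost is a single call to translate with no Python-level loop over the characters.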