Encounter an issue while trying to remove unicode emojis from strings

Time:07-21

I am having trouble removing Unicode emojis from my strings. Here are some examples I have seen in my data:

['\\\\ud83d\\\\ude0e', '\\\\ud83e\\\\udd20', '\\\\ud83e\\\\udd23', '\\\\ud83d\\\\udc4d', '\\\\ud83d\\\\ude43', '\\\\ud83d\\\\ude31', '\\\\ud83d\\\\ude14', '\\\\ud83d\\\\udcaa', '\\\\ud83d\\\\ude0e', '\\\\ud83d\\\\ude09', '\\\\ud83d\\\\ude09', '\\\\ud83d\\\\ude18','\\\\ud83d\\\\ude01' , '\\\\ud83d\\\\ude44', '\\\\ud83d\\\\ude17']

Note that these are only some examples, not the full set, and in my data they appear embedded inside longer strings.

Here is the function I tried to use to remove them:

def remove_emojis(data):
    emoji_pattern = re.compile(
        u"(\\\\ud83d[\\\\ude00-\\\\ude4f])|"  # emoticons
        u"(\\\\ud83c[\\\\udf00-\\\\uffff])|"  # symbols & pictographs (1 of 2)
        u"(\\\\ud83d[\\\\u0000-\\\\uddff])|"  # symbols & pictographs (2 of 2)
        u"(\\\\ud83d[\\\\ude80-\\\\udeff])|"  # transport & map symbols
        u"(\\\\ud83c[\\\\udde0-\\\\uddff])"  # flags (iOS)
        "+", flags=re.UNICODE)
    return re.sub(emoji_pattern, '', data)

If I use "Naja, gegen dich ist sie ein Waisenknabe \\\\ud83d\\\\ude02\\\\ud83d\\\\ude02\\\\ud83d\\\\ude02" as input, my output is "Naja, gegen dich ist sie ein Waisenknabe \\\\ude02\\\\ude02\\\\ude02". However, my desired output is "Naja, gegen dich ist sie ein Waisenknabe ".

What mistake am I making, and how can I fix it to get the desired result?

CodePudding user response:

Since your text contains not the emoji characters themselves but their escaped hexadecimal representations (\uXXXX), you can use:

data = re.sub(r'\s*(?:\\+u[a-fA-F0-9]{4})+', '', data)

Details:

  • \s* - zero or more whitespace characters
  • (?:\\+u[a-fA-F0-9]{4})+ - one or more sequences of
    • \\+ - one or more backslashes
    • u - a u char
    • [a-fA-F0-9]{4} - four hex chars.
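
As a minimal sketch of how this pattern behaves, the snippet below wraps it in a helper and runs it on the sentence from the question. The names ESCAPED_UNICODE and remove_escaped_emojis are made up for illustration, and the sample assumes a single literal backslash before each u; because of \\+, doubled backslashes are handled the same way.

```python
import re

# Optional leading whitespace, then one or more literal \uXXXX escapes
# (one or more backslashes, the letter "u", four hex digits).
# ESCAPED_UNICODE and remove_escaped_emojis are illustrative names.
ESCAPED_UNICODE = re.compile(r'\s*(?:\\+u[a-fA-F0-9]{4})+')


def remove_escaped_emojis(data):
    """Remove runs of literal \\uXXXX escapes and the whitespace before them."""
    return ESCAPED_UNICODE.sub('', data)


# A raw string keeps the backslashes literal, as in the question's data.
text = r"Naja, gegen dich ist sie ein Waisenknabe \ud83d\ude02\ud83d\ude02\ud83d\ude02"
print(remove_escaped_emojis(text))
# Naja, gegen dich ist sie ein Waisenknabe
```

Note that the leading \s* also consumes the space before the escapes, so the result has no trailing space. Drop \s* from the pattern if that space should be kept.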

See the regex demo.
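
One caveat: the pattern above removes every \uXXXX escape, not only emojis. If the data might also contain escaped non-emoji characters (for example \u00fc for "ü", which is an assumption about the data and not something shown in the question), a narrower variant could match only escaped surrogate pairs, i.e. a high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF). This is a rough sketch with hypothetical names; it will miss emojis encoded as a single \uXXXX escape and will also remove non-emoji characters outside the Basic Multilingual Plane.

```python
import re

# Escaped high surrogate (\uD800-\uDBFF) followed by an escaped low
# surrogate (\uDC00-\uDFFF), repeated one or more times.
# ESCAPED_SURROGATE_PAIR is an illustrative name, not from the answer.
ESCAPED_SURROGATE_PAIR = re.compile(
    r'\s*(?:\\+u[dD][89abAB][0-9a-fA-F]{2}\\+u[dD][c-fC-F][0-9a-fA-F]{2})+'
)


def remove_escaped_surrogate_pairs(data):
    """Remove escaped surrogate pairs but keep other \\uXXXX escapes."""
    return ESCAPED_SURROGATE_PAIR.sub('', data)


# Hypothetical sample: escaped umlaut and sharp s, then two escaped emojis.
sample = r"Gr\u00fc\u00dfe \ud83d\ude0e\ud83e\udd23 bis bald"
print(remove_escaped_surrogate_pairs(sample))
# Gr\u00fc\u00dfe bis bald
```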
