How to use `regexp` to remove all characters that are not Chinese or English

Time:12-22

Given the string ori_string below, how can I use a regexp to remove every character that is not Chinese or English? Thanks!

ori_string<-"没a w t _ 中/国.sz"

The desired result is:

  "没awt中国sz"

CodePudding user response:

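I wrote this in Python, since you didn't specify a language. The same idea carries over to other languages: match everything outside the allowed character ranges and replace it with an empty string.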

import re

def remove_non_english_chinese(text):
    # Match any character that is not an ASCII letter, a digit,
    # or a CJK ideograph in the range U+4E00..U+9FFF
    pattern = r'[^a-zA-Z0-9\u4e00-\u9fff]'

    # Replace every matched character with an empty string
    return re.sub(pattern, '', text)
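
As a quick check against the string from the question, here is a minimal sketch of the same whitelist idea. It assumes "Chinese" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and "English" means ASCII letters only, so unlike the function above it also drops digits. The name `keep_chinese_and_english` is made up for illustration.

```python
import re

# Assumption: keep only ASCII letters and ideographs in U+4E00..U+9FFF.
# Digits are NOT in the whitelist here, so they are removed as well.
KEEP_ONLY = re.compile(r'[^a-zA-Z\u4e00-\u9fff]')

def keep_chinese_and_english(text):
    # Delete every character that falls outside the whitelist
    return KEEP_ONLY.sub('', text)

ori_string = "没a w t _ 中/国.sz"
print(keep_chinese_and_english(ori_string))  # 没awt中国sz
```

Ideographs outside that block (for example the CJK extension ranges) would be removed too, so the range may need widening depending on the data.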

CodePudding user response:

It seems you want to remove punctuation and whitespace:

> regex <- '[[:punct:][:space:]]'
> gsub(regex, '', ori_string)
[1] "没awt中国sz"
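
For comparison, the same blacklist approach can be sketched in Python. Python's `re` module has no POSIX classes such as `[[:punct:]]`, so this sketch assumes that ASCII punctuation from `string.punctuation` plus `\s` is a close enough substitute. The function name `strip_punct_and_space` is hypothetical.

```python
import re
import string

# Assumption: "punctuation" means ASCII punctuation only (string.punctuation).
# re.escape keeps characters like ']' and '\' literal inside the class.
PUNCT_OR_SPACE = re.compile(r'[\s' + re.escape(string.punctuation) + r']')

def strip_punct_and_space(text):
    # Remove whitespace and ASCII punctuation, keep everything else
    return PUNCT_OR_SPACE.sub('', text)

print(strip_punct_and_space("没a w t _ 中/国.sz"))  # 没awt中国sz
```

Because this removes a fixed set of characters instead of keeping an allowed set, anything else passes through unchanged, including full-width Chinese punctuation such as "，", digits, and other scripts.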