How to split Chinese words and English words in a string using Python with space? [closed]

I want to split "Coffee Hello 咖啡咖啡" into "Coffee Hello" and "咖啡咖啡". How should I do it? I tried isalpha and isspace, but that does not work: it splits the string into "Coffee Hello 咖啡咖啡" and "" instead.

CodePudding user response：

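Regex would do. Keep the space out of both character classes: with it, the pattern also matches the empty string before every space, and since Python 3.7 re.split splits on empty matches too.

>>> import re
>>> string = "Coffee Hello 咖啡 咖啡"
>>> re.split(r"(?<=[A-Za-z])\s*(?=[\u4e00-\u9fa5])", string)
['Coffee Hello', '咖啡 咖啡']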

CodePudding user response：

@Lutz's and @U12-Forward's answers only work when English words precede Chinese words.

A more general approach, one that works regardless of the order of the English and Chinese words, is to use re.findall with an alternation pattern instead:

re.findall(r'[a-z]+(?:\s+[a-z]+)*|[\u4e00-\u9fa5]+(?:\s+[\u4e00-\u9fa5]+)*', string, re.I)
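
As a rough, self-contained sketch of this idea (the helper name split_runs is made up for illustration, and the range \u4e00-\u9fa5 is simply the one used in this thread; it covers the most common CJK ideographs, not every Chinese character), the same pattern handles either order:

```python
import re

# Either a run of ASCII letters (words separated by whitespace), or a run of
# characters in the \u4e00-\u9fa5 range used in this thread.
# The quantifiers are "+" so that each word must contain at least one character.
PATTERN = re.compile(
    r'[a-z]+(?:\s+[a-z]+)*|[\u4e00-\u9fa5]+(?:\s+[\u4e00-\u9fa5]+)*',
    re.I,
)

def split_runs(text):
    """Return the English and Chinese runs in the order they appear (hypothetical helper)."""
    return PATTERN.findall(text)

print(split_runs("Coffee Hello 咖啡 咖啡"))   # ['Coffee Hello', '咖啡 咖啡']
print(split_runs("咖啡咖啡 Coffee Hello"))    # ['咖啡咖啡', 'Coffee Hello']
```

Because findall collects matches rather than cutting at one boundary, anything that fits neither alternative (digits, punctuation) is silently dropped from the result.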


CodePudding user response：

If all you want to consider is plain English letters, you can use a combination of lookahead and lookbehind:

re.split(r"(?<=[a-zA-Z])\s*(?=[^a-zA-Z]*$)", "Coffee Hello 咖啡 咖啡")

This splits on any amount of whitespace (\s*), but only if the character before it is an English letter and everything after it, up to the end of the string, is not:

(?<=[a-zA-Z]) is a lookbehind: the preceding character must be an English letter.
(?=[^a-zA-Z]*$) is a lookahead: only non-English characters may follow, up to the end of the string.

Because the lookahead runs to the end of the string, this assumes the English part comes first.
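
To make that behaviour concrete, here is a minimal sketch that wraps the split in a function (the name split_after_english is hypothetical) and tries it on both orderings:

```python
import re

def split_after_english(text):
    # Split on optional whitespace that is preceded by an ASCII letter and
    # followed only by non-ASCII-letter characters up to the end of the string.
    return re.split(r"(?<=[a-zA-Z])\s*(?=[^a-zA-Z]*$)", text)

# English first: one split, at the boundary between the two parts.
print(split_after_english("Coffee Hello 咖啡 咖啡"))

# Chinese first: the boundary between the two parts is never matched,
# so the text stays in one piece.
print(split_after_english("咖啡 咖啡 Coffee Hello"))
```

On Python 3.7 and later the second call should also return a trailing empty string, since the only position that satisfies both lookarounds is the very end of the input.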