I want to split "Coffee Hello 咖啡 咖啡" into "Coffee Hello" and "咖啡 咖啡". How should I do it? I used isalpha and isspace, but it is not working: it splits into "Coffee Hello 咖啡 咖啡" and "" instead.
CodePudding user response:
Regex would do:
>>> import re
>>> string = "Coffee Hello 咖啡 咖啡"
>>> re.split(r"(?<=[A-Za-z])\s*(?=[\u4e00-\u9fa5])", string)
['Coffee Hello', '咖啡 咖啡']
>>>
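As a side note on why isalpha alone cannot separate the two parts: str.isalpha() is true for any Unicode letter, Chinese characters included, so it does not distinguish the scripts. A minimal sketch of an explicit range check, assuming the same \u4e00-\u9fa5 range as the pattern above (is_cjk is a hypothetical helper name, not a built-in):

```python
def is_cjk(ch):
    """Hypothetical helper: True if ch lies in the \u4e00-\u9fa5 range."""
    return '\u4e00' <= ch <= '\u9fa5'

# isalpha() accepts both scripts, so it cannot tell them apart.
print("咖".isalpha(), "C".isalpha())  # True True

# An explicit code-point range check does distinguish them.
print(is_cjk("咖"), is_cjk("C"))      # True False
```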
CodePudding user response:
@Lutz's and @U12-Forward's answers only work when English words precede Chinese words.
A more general approach that works regardless of the order of English and Chinese words is to use re.findall with an alternation pattern instead:
re.findall(r'[a-z]+(?:\s+[a-z]+)*|[\u4e00-\u9fa5]+(?:\s+[\u4e00-\u9fa5]+)*', string, re.I)
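To see the order independence, here is a small sketch that wraps the pattern in a function and runs it on both orderings. The names PATTERN and split_by_script are made up for illustration, and the Chinese range is assumed to be \u4e00-\u9fa5 as in the answers above:

```python
import re

# Alternation: a run of English words OR a run of Chinese words,
# where words inside a run may be separated by whitespace.
PATTERN = re.compile(
    r'[a-z]+(?:\s+[a-z]+)*|[\u4e00-\u9fa5]+(?:\s+[\u4e00-\u9fa5]+)*',
    re.I,
)

def split_by_script(text):
    """Return the English and Chinese runs in the order they appear."""
    return PATTERN.findall(text)

print(split_by_script("Coffee Hello 咖啡 咖啡"))  # ['Coffee Hello', '咖啡 咖啡']
print(split_by_script("咖啡 咖啡 Coffee Hello"))  # ['咖啡 咖啡', 'Coffee Hello']
```

Because findall collects matches rather than cutting at a boundary, anything that is neither an English letter nor in the assumed Chinese range (digits, punctuation) is simply dropped from the result.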
CodePudding user response:
If all you want to consider is plain English characters, you can use a combination of lookahead and lookbehind:
re.split(r"(?<=[a-zA-Z])\s*(?=[^a-zA-Z]*$)", "Coffee Hello 咖啡 咖啡")
This splits on any amount of whitespace (\s*), but only if the character before it is from the English alphabet ((?<=[a-zA-Z]), where (?<=...) is a lookbehind and [a-zA-Z] matches English letters) and if everything that follows is not from the English alphabet ((?=[^a-zA-Z]*$), where (?=...) is a lookahead and [^a-zA-Z]*$ matches non-English characters up to the end of the string).
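A minimal runnable sketch of this lookaround split, assuming the English part comes first (split_english_first is a hypothetical name used only for this example):

```python
import re

# Split on whitespace that directly follows an English letter,
# provided no English letter appears anywhere after that point.
BOUNDARY = re.compile(r"(?<=[a-zA-Z])\s*(?=[^a-zA-Z]*$)")

def split_english_first(text):
    """Split text into its leading English part and the non-English rest."""
    return BOUNDARY.split(text)

print(split_english_first("Coffee Hello 咖啡 咖啡"))  # ['Coffee Hello', '咖啡 咖啡']
```

When the Chinese words come first, the only place the pattern can match is the very end of the string, so the two parts are not separated. The pattern therefore relies on the English text preceding everything else.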