How to split a string according to two conditions at the beginning and end of a sentence simultaneously

I have a string like,

str1 = "ZZZ。10月,AAA。11月2日,BBB。CCC。3日,DDD。EEE。12月，FFF"

I want to split this string where two conditions hold at the same time: 日 or 月 appears near the beginning of a piece, and the preceding piece ends with the period 。. The result should look like this:

# ZZZ。 / 10月,AAA。 / 11月2日,BBB。CCC。 / 3日,DDD。EEE。 / 12月，FFF

My current idea is to split on the period first, then merge the pieces back together according to the second rule (日 or 月). The code looks like this:

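import re
str1 = "ZZZ。10月,AAA。11月2日,BBB。CCC。3日,DDD。EEE。12月，FFF"

res = []    # finished pieces
cache = ''  # piece currently being built
# Split right after every 。 so the period stays attached to the preceding text
for i, item in enumerate(re.split(r'(?<=。)', str1)):
    if i == 0:
        cache = item
    else:
        if re.match(r'(^.{0,2}日)|(^.{0,2}月)', item):
            # item starts with a date marker: close the current piece
            res.append(cache)
            cache = item
        else:
            # no date marker: keep appending to the current piece
            cache += item
res.append(cache)
print(res)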

But I was wondering whether the two checks, re.match(r'(^.{0,2}日)|(^.{0,2}月)', item) and re.match(r'。$', item), can be done directly in one loop or with a single simple regex?

CodePudding user response：

You can use re.split with

(?<=。)(?=\s*\d{1,2}[日月])

See the regex demo. Details:

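(?<=。) - a positive lookbehind that matches a position immediately preceded by the full stop 。
(?=\s*\d{1,2}[日月]) - a positive lookahead that requires the position to be immediately followed by zero or more whitespace characters, then one or two digits, and then 日 or 月.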
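
Since both parts are lookarounds, the pattern consumes no characters and only marks positions. As a rough illustration (the variable names here are just for this sketch), listing the match positions with re.finditer shows where the cuts would fall:

```python
import re

text = "ZZZ。10月,AAA。11月2日,BBB。CCC。3日,DDD。EEE。12月，FFF"
pattern = re.compile(r'(?<=。)(?=\s*\d{1,2}[日月])')

# Every match is zero-width (start == end), so nothing is removed from the text.
matches = list(pattern.finditer(text))
split_points = [m.start() for m in matches]
print(split_points)
# => [4, 12, 26, 37]
```

Each index is the start of a piece that begins with a date marker, which is why the period stays on the left side of every cut.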

See the Python demo:

import re
text = "ZZZ。10月,AAA。11月2日,BBB。CCC。3日,DDD。EEE。12月，FFF"
print(re.split(r'(?<=。)(?=\s*\d{1,2}[日月])', text))
# => ['ZZZ。', '10月,AAA。', '11月2日,BBB。CCC。', '3日,DDD。EEE。', '12月，FFF']
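
One caveat: re.split only splits on a pattern that can match an empty string from Python 3.7 onward. If that is a concern, the same boundaries can be applied by slicing at the positions re.finditer reports. This is only a sketch, and the names BOUNDARY and split_on_date_start are made up for this example:

```python
import re

# Same pattern as in the answer above; BOUNDARY is a hypothetical name.
BOUNDARY = re.compile(r'(?<=。)(?=\s*\d{1,2}[日月])')

def split_on_date_start(text):
    """Cut text at every zero-width boundary matched by BOUNDARY."""
    pieces = []
    start = 0
    for m in BOUNDARY.finditer(text):
        # Slice up to the boundary; the period stays with the left piece.
        pieces.append(text[start:m.start()])
        start = m.start()
    pieces.append(text[start:])  # remainder after the last boundary
    return pieces

text = "ZZZ。10月,AAA。11月2日,BBB。CCC。3日,DDD。EEE。12月，FFF"
print(split_on_date_start(text))
# => ['ZZZ。', '10月,AAA。', '11月2日,BBB。CCC。', '3日,DDD。EEE。', '12月，FFF']
```

If no boundary is found, the function returns the whole string as a single-element list, which is also what re.split does.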