Capture multi-line groups

Time:01-06

I would like to extract each NAME_ group's information using regex (Python3). For example, I have a text like

AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like 
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt"; 

and the result I want to get is three groups:
1)

AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like 
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";

2)

AB_ NAME_ 120 "food";

3)

AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt"; 

which are split by their ID (the NAME_ ID).

I tried to capture each AB_ NAME_ line followed by zero or more AB_ EX_ entries with the pattern below, but it failed. I also tried the re.S and re.M flags, but they did not help.

AB_ NAME_ \d+ .+;\n(AB_ EX_ \d+ (.|\n)+;\n)*

CodePudding user response:

Use re.DOTALL so that . also matches newline characters, then call findall() to collect every match, like this:

import re

text = """AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";"""

regex = r"AB_ NAME_.*?(?=AB_ NAME_|$)"

print(re.findall(regex, text, re.DOTALL))

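The regex pattern is this: AB_ NAME_.*?(?=AB_ NAME_|$)

The lazy .*? consumes as little text as possible, and the lookahead (?=AB_ NAME_|$) stops each match just before the next AB_ NAME_ or at the end of the input. Because re.MULTILINE is not set, $ here matches only at the end of the entire string (or just before a final trailing newline), not at the end of each line.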
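One caveat with the lookahead above is that it would also stop at an AB_ NAME_ that happens to appear inside a quoted value. A minimal sketch of a stricter variant follows, assuming every record starts at the beginning of a line and that the NAME_ IDs are unique. The names pattern and groups are only illustrative and are not part of the original answer.

```python
import re

text = '''AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like
apple

or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";'''

# re.MULTILINE makes ^ match at the start of every line, so a group can
# only begin (and end) where a line starts with "AB_ NAME_ ".
# \Z matches only at the very end of the string, regardless of flags.
# re.DOTALL lets . cross newlines inside multi-line quoted values.
pattern = re.compile(
    r"^AB_ NAME_ (\d+) .*?(?=^AB_ NAME_ |\Z)",
    re.DOTALL | re.MULTILINE,
)

# Map each NAME_ ID to its full block of text (assumes IDs are unique;
# a repeated ID would overwrite the earlier entry).
groups = {m.group(1): m.group(0).strip() for m in pattern.finditer(text)}

for name_id, block in groups.items():
    print(name_id)
    print(block)
    print("-" * 20)
```

Keying the result by ID makes it easy to look up one group, for example groups["120"]. If the order and raw text are all that matter, the findall() call from the answer is enough.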
