I would like to extract each NAME_ group's information using regex (Python 3).
For example, I have a text like this:
AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like
apple
or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";
and the result I want to get is three groups:
1)
AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like
apple
or grape?";
AB_ EX_ 111 third_fruit "tomato";
2)
AB_ NAME_ 120 "food";
3)
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";
which are split by their ID (the NAME_ ID).
I'd appreciate any suggestions.
Thank you.
I tried to capture the AB_ NAME_ line followed by zero or more AB_ EX_ lines, but it failed. I also tried the re.S and re.M flags, but they did not help. My pattern was:
AB_ NAME_ \d . ;\n(AB_ EX_ \d (.|\n) ;\n)*
CodePudding user response:
Use re.DOTALL so that . also matches newline characters, then call re.findall() to get all the matches, like this:
import re
text = """AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ EX_ 111 second_fruit_info "Do you like
apple
or grape?";
AB_ EX_ 111 third_fruit "tomato";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";"""
regex = r"AB_ NAME_.*?(?=AB_ NAME_|$)"
print(re.findall(regex, text, re.DOTALL))
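The regex pattern is this: AB_ NAME_.*?(?=AB_ NAME_|$)
The lazy .*? consumes as little text as possible, and the lookahead (?=AB_ NAME_|$) stops each match just before the next AB_ NAME_ or at the end of the string. Note that without re.MULTILINE, $ matches only at the end of the entire string (or just before a trailing newline), not at the end of each line.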
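If you also need the ID of each group, a variation on the same idea might look like the sketch below. It is only an illustration: the helper name split_name_groups and the shortened sample text are made up for this example, and it assumes every AB_ NAME_ header starts at the beginning of a line and that IDs are unique (a repeated ID would overwrite the earlier entry in the dict).

```python
import re

# Anchor each header at the start of a line (re.MULTILINE) so that the text
# "AB_ NAME_" appearing in the middle of a quoted value does not start a new
# group. re.DOTALL lets .*? run across the multi-line quoted strings.
GROUP_RE = re.compile(
    r'^AB_ NAME_ (\d+) .*?(?=^AB_ NAME_ |\Z)',
    re.DOTALL | re.MULTILINE,
)


def split_name_groups(text):
    """Return a dict mapping each NAME_ ID (as a string) to its block of text."""
    return {m.group(1): m.group(0).rstrip() for m in GROUP_RE.finditer(text)}


# Shortened, hypothetical sample in the same format as the question.
sample = '''AB_ NAME_ 111 "fruit";
AB_ EX_ 111 first_fruit "banana";
AB_ NAME_ 120 "food";
AB_ NAME_ 130 "clothes";
AB_ EX_ 130 first_clothes "t-shirt";'''

groups = split_name_groups(sample)
for name_id, block in groups.items():
    print(name_id, repr(block))
```

The ^ anchor is a small safety margin over the plain lookahead: it still breaks if a continuation line inside a quoted value happens to begin with AB_ NAME_, in which case a real parser would be the safer choice.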