Python Regex Capturing Multiple Line Files based on Condition

I have a text file similar to the one below:

<Start  
1;  
b;  
c;  
d;  
<End  
<Start  
2;  
b;  
c;  
d;  
<End  
<Start  
1;  
b;  
c;  
d;  
<End

Basically, this text file consists of 3 sets, each of which starts with <Start and ends with <End. I would like to capture only the sets whose number is "1". The expected data would be as below:

<Start  
1;  
b;  
c;  
d;  
<End  
<Start  
1;  
b;  
c;  
d;  
<End

I'm trying to figure out how to do this with Python regex but haven't found a method so far. I would appreciate any help from this community. Thank you in advance.

CodePudding user response：

You can use this regex: (<Start[\s]*1;[\sa-z;]*<End)

It captures every string that matches the pattern between ( and ):

The string starts with <Start
Then comes any amount of whitespace: [\s]*
Then comes the literal string 1;
Then come any number of whitespace characters \s, semicolons ; or lowercase letters a-z
Finally, it ends with <End
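
As a minimal sketch of how this pattern could be applied in Python: the sample text is inlined here rather than read from a file, and the names text, pattern and blocks are only illustrative.

```python
import re

# Sample data from the question, inlined for illustration.
text = """<Start
1;
b;
c;
d;
<End
<Start
2;
b;
c;
d;
<End
<Start
1;
b;
c;
d;
<End"""

# Raw string so the backslash in \s reaches the regex engine unchanged.
pattern = re.compile(r"(<Start[\s]*1;[\sa-z;]*<End)")

# With a single capturing group, findall returns the captured strings.
blocks = pattern.findall(text)
for block in blocks:
    print(block)
```

Because the character class [\sa-z;] does not include "<", each match stops at the first <End after its own <Start, so the block beginning with 2; is skipped.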


CodePudding user response：

An alternative pattern would be the following ("regex_input.txt" is a file that contains exactly your input):

import re

# Load the file, stripping surrounding whitespace from each line
with open("regex_input.txt", "r") as f:
    content = "\n".join(line.strip() for line in f)

# re.DOTALL lets "." match newlines too, so ".*?" can span several lines.
# "[^<]*?" cannot cross a "<", which keeps the search for "1;" inside one block.
pattern = re.compile(r"<Start[^<]*?1;.*?<End", re.DOTALL)
print(pattern.findall(content))

OUTPUT:

['<Start\n1;\nb;\nc;\nd;\n<End', '<Start\n1;\nb;\nc;\nd;\n<End']
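
One caveat: [^<]*?1; accepts a "1;" anywhere before the next "<", so a block whose first entry is 21;, or one that has 1; on a later line, would also match. If the number must be the very first entry of the block (an assumption about the intended rule, not something stated in the question), a slightly stricter variant could look like the sketch below. The test string and the names loose and strict are made up for illustration.

```python
import re

# Hypothetical input: only the first block truly starts with "1;".
content = "<Start\n1;\nb;\n<End\n<Start\n21;\nb;\n<End\n<Start\n2;\n1;\n<End"

# Pattern from the answer above: "1;" may appear anywhere before the next "<".
loose = re.compile(r"<Start[^<]*?1;.*?<End", re.DOTALL)

# Stricter variant: only whitespace is allowed between "<Start" and "1;".
strict = re.compile(r"<Start\s*1;.*?<End", re.DOTALL)

loose_matches = loose.findall(content)
strict_matches = strict.findall(content)

print(len(loose_matches))   # all three blocks match the loose pattern
print(strict_matches)       # only the block whose first entry is "1;"
```

On the input from the question both patterns return the same two blocks; they differ only when "1;" can show up somewhere other than the first entry.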