I have a text file similar to the one below:
<Start
1;
b;
c;
d;
<End
<Start
2;
b;
c;
d;
<End
<Start
1;
b;
c;
d;
<End
Basically, this text file consists of 3 sets, each of which starts with <Start and ends with <End. I would like to capture only the sets whose number is "1". The expected data would be as below:
<Start
1;
b;
c;
d;
<End
<Start
1;
b;
c;
d;
<End
I'm trying to figure out how to do this with a Python regex, but I haven't found a method so far. I would appreciate any help from this community. Thank you in advance.
CodePudding user response:
You can use this regex: (<Start[\s]*1;[\sa-z;]*<End)
It captures every string that matches the pattern between ( and ):
- The string starts with <Start
- Then comes any amount (*) of whitespace: [\s]
- Then the string 1; must follow
- Then any mix of whitespace (\s), semicolons (;), or lowercase letters (a-z) may follow
- Finally, it ends with <End
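As a minimal sketch of how this pattern could be applied in Python: the sample text is copied from the question, and the variable names (text, pattern, blocks) are only illustrative.

```python
import re

# Sample text copied from the question.
text = """<Start
1;
b;
c;
d;
<End
<Start
2;
b;
c;
d;
<End
<Start
1;
b;
c;
d;
<End"""

# The pattern from this answer, written as a raw string so the
# backslashes reach the regex engine unchanged.
pattern = re.compile(r"(<Start[\s]*1;[\sa-z;]*<End)")

# findall returns the text captured by the single group for each match.
blocks = pattern.findall(text)
for block in blocks:
    print(block)
```

Because the character class [\sa-z;] does not include "<", a match cannot run past the first <End into the next set.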
CodePudding user response:
An alternative pattern is shown below ("regex_input.txt" is a file that contains exactly your input):
import re
# Load the file, stripping surrounding whitespace from each line
with open("regex_input.txt", "r") as f:
    content = "\n".join(line.strip() for line in f)
# re.DOTALL lets "." match newlines too, so ".*?" can span several lines
pattern = re.compile(r"<Start[^<]*?1;.*?<End", re.DOTALL)
print(pattern.findall(content))
OUTPUT:
['<Start\n1;\nb;\nc;\nd;\n<End', '<Start\n1;\nb;\nc;\nd;\n<End']
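One caveat worth checking: [^<]*?1; accepts a "1;" anywhere before the next "<", so a set that starts with 21; or that has 1; on a later line would also match. The sketch below compares it with a stricter variant, assuming the number is always the first entry right after <Start. The sample sets here are invented for illustration and are not from the question.

```python
import re

# Invented sample: only the first set actually starts with "1;".
text = "<Start\n1;\nb;\n<End\n<Start\n21;\nb;\n<End\n<Start\n2;\n1;\n<End"

# Pattern from the answer above: "1;" may appear anywhere in the set.
loose = re.compile(r"<Start[^<]*?1;.*?<End", re.DOTALL)

# Stricter variant (an assumption, not from the original answer):
# "1;" must come directly after <Start, separated only by whitespace.
strict = re.compile(r"<Start\s+1;\s.*?<End", re.DOTALL)

loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(len(loose_matches))  # all three sets match the loose pattern
print(strict_matches)      # only the set that starts with "1;"
```

For the exact input in the question both patterns return the same two sets, so the stricter form only matters if the real file contains values such as 11; or 21;.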