I have a simulation output with many lines; parts of it look like this:
</GraphicData>
</Connection>
<Connection>
<Name>ES1</Name>
<Type>Port</Type>
<From>Windfarm.Out</From>
<To>BR1.In</To>
<GraphicData>
<Icon>
<Points>
</GraphicData>
</Connection>
<Connection>
<Name>S2</Name>
<Type>Port</Type>
<From>BR1.Out</From>
<To>C1.In</To>
<GraphicData>
<Icon>
<Points>
The word between <Name> and </Name> varies from output to output. These names (here: ES1 and S2) are stored in a text file (keywords.txt).
What I need is a regex that takes the keywords from the list (keywords.txt), searches for them in Simulationoutput.txt, matches everything from each keyword up to </To>, and writes these matches into another text file (finaloutput.txt).
Here is what I've done so far:
import ast
import re

with open("keywords.txt", 'r') as f:
    keywords = ast.literal_eval(f.read())
pattern = '|'.join(keywords)
results = []
with open('Simulationoutput.txt', 'r') as f:
    for line in f:
        matches = re.findall(pattern, line)
        if matches:
            results.append((line, len(matches)))
results = sorted(results, key=lambda x: x[1], reverse=True)
with open('finaloutput.txt', 'w') as f:
    for line, num_matches in results:
        f.write('{} {}\n'.format(num_matches, line))
The finaloutput.txt looks like this now:
<Name>ES1</Name>
<Name>S2</Name>
So the code almost does what I want, but the output should look like this:
<Name>ES1</Name>
<Type>Port</Type>
<From>Hydro.Out</From>
<To>BR1.In</To>
<Name>S2</Name>
<Type>Port</Type>
<From>BR1.Out</From>
<To>C1.In</To>
Thanks in advance.
CodePudding user response:
Although I strongly advise you to use xml.etree.ElementTree to do this, here's how you could do it using regex:
import re
keywords = ["ES1", "S2"]
pattern = "|".join([re.escape(key) for key in keywords])
pattern = fr"<Name>(?:{pattern}).*?<\/To>"
with open("Simulationoutput.txt", "r") as f:
    matches = re.findall(pattern, f.read(), flags=re.DOTALL)
with open("finaloutput.txt", "w") as f:
    f.write("\n\n".join(matches).replace("\n ", "\n"))
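As a quick sanity check, the same pattern can be tried on a small in-memory string before touching the real files. This is only a sketch: the `sample` text and its single-space indentation are assumptions standing in for Simulationoutput.txt, and `alternatives`, `blocks`, `no_flag` and `result` are illustrative names.

```python
import re

# Hypothetical stand-in for Simulationoutput.txt (indentation is assumed).
sample = """<Connection>
 <Name>ES1</Name>
 <Type>Port</Type>
 <From>Windfarm.Out</From>
 <To>BR1.In</To>
 <GraphicData>
 </GraphicData>
</Connection>
<Connection>
 <Name>S2</Name>
 <Type>Port</Type>
 <From>BR1.Out</From>
 <To>C1.In</To>
 <GraphicData>
 </GraphicData>
</Connection>
"""

keywords = ["ES1", "S2"]
alternatives = "|".join(re.escape(key) for key in keywords)
pattern = fr"<Name>(?:{alternatives}).*?<\/To>"

# With re.DOTALL, "." also matches newlines, so one match can span lines.
blocks = re.findall(pattern, sample, flags=re.DOTALL)

# Without the flag, ".*?" cannot cross a line break, so nothing matches here.
no_flag = re.findall(pattern, sample)

# Drop the one-space indentation, as the replace() call above does.
result = "\n\n".join(blocks).replace("\n ", "\n")
print(result)
```

If the real file is indented with tabs or several spaces, the `replace("\n ", "\n")` step would need adjusting accordingly.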
The regex used is the following:
<Name>(?:ES1|S2).*?<\/To>
`<Name>`: Matches `<Name>`.
`(?:)`: Non-capturing group.
`ES1|S2`: Matches either `ES1` or `S2`.
`.*?`: Matches any character, between zero and unlimited times, as few as possible (lazy). Note that `.` does not match newlines by default; it does so here only because the `re.DOTALL` flag is set.
`<\/To>`: Matches `</To>`.
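For comparison, here is a rough sketch of the xml.etree.ElementTree route recommended above. It assumes the complete file is well-formed XML and that each `<Connection>` holds `<Name>`, `<Type>`, `<From>` and `<To>` as direct children; the `<Model>` root, the extra `S3` connection and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the simulation output; the real
# file's root element and nesting are assumptions.
xml_text = """<Model>
  <Connection>
    <Name>ES1</Name>
    <Type>Port</Type>
    <From>Windfarm.Out</From>
    <To>BR1.In</To>
    <GraphicData><Icon/></GraphicData>
  </Connection>
  <Connection>
    <Name>S2</Name>
    <Type>Port</Type>
    <From>BR1.Out</From>
    <To>C1.In</To>
    <GraphicData><Icon/></GraphicData>
  </Connection>
  <Connection>
    <Name>S3</Name>
    <Type>Port</Type>
    <From>C1.Out</From>
    <To>C2.In</To>
  </Connection>
</Model>"""

keywords = {"ES1", "S2"}
wanted_tags = ("Name", "Type", "From", "To")

root = ET.fromstring(xml_text)
blocks = []
for conn in root.iter("Connection"):
    # Compare the whole name, so "ES1" does not also select e.g. "ES10".
    if conn.findtext("Name") in keywords:
        lines = [
            f"<{tag}>{conn.findtext(tag, default='')}</{tag}>"
            for tag in wanted_tags
        ]
        blocks.append("\n".join(lines))

output = "\n\n".join(blocks)
print(output)
```

For a file on disk, `ET.parse("Simulationoutput.txt").getroot()` would take the place of `ET.fromstring(xml_text)`, provided the file parses as XML. Unlike the regex, this compares complete names and does not depend on line breaks or indentation.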