Regex in combination with a list of keywords from a textfile to parse into another textfile

Time:03-13

I have a simulation output with many lines; parts of it look like this:

    </GraphicData>
  </Connection>
  <Connection>
    <Name>ES1</Name>
    <Type>Port</Type>
    <From>Windfarm.Out</From>
    <To>BR1.In</To>
    <GraphicData>
      <Icon>
        <Points>
    </GraphicData>
  </Connection>
  <Connection>
    <Name>S2</Name>
    <Type>Port</Type>
    <From>BR1.Out</From>
    <To>C1.In</To>
    <GraphicData>
      <Icon>
        <Points>

The text between <Name> and </Name> varies from output to output. These names (here: ES1 and S2) are stored in a text file (keywords.txt).

What I need is a regex that takes the keywords from the list (keywords.txt), searches Simulationoutput.txt for each match up to and including the closing </To> tag, and writes these matches into another text file (finaloutput.txt).

Here is what I've done so far:

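import ast
import re

# ast.literal_eval parses the file content as a Python literal (e.g. a list of strings)
with open("keywords.txt", 'r') as f:
    keywords = ast.literal_eval(f.read())

# Escape each keyword so regex metacharacters in a name are matched literally
pattern = '|'.join(re.escape(key) for key in keywords)
results = []
with open('Simulationoutput.txt', 'r') as f:
    for line in f:
        matches = re.findall(pattern, line)
        if matches:
            results.append((line, len(matches)))

# Lines with the most keyword hits come first
results = sorted(results, key=lambda x: x[1], reverse=True)

with open('finaloutput.txt', 'w') as f:
    for line, num_matches in results:
        f.write('{}  {}\n'.format(num_matches, line))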

The finaloutput.txt looks like this now:

<Name>ES1</Name>
<Name>S2</Name>

So the code almost does what I want, but the output should look like this:

    <Name>ES1</Name>
    <Type>Port</Type>
    <From>Windfarm.Out</From>
    <To>BR1.In</To>

    <Name>S2</Name>
    <Type>Port</Type>
    <From>BR1.Out</From>
    <To>C1.In</To>

Thanks in advance.

CodePudding user response:

Although I strongly advise you to use xml.etree.ElementTree to do this, here's how you could do it using regex:

import re

keywords = ["ES1", "S2"]

pattern = "|".join([re.escape(key) for key in keywords])
pattern = fr"<Name>(?:{pattern}).*?<\/To>"

with open("Simulationoutput.txt", "r") as f:
    matches = re.findall(pattern, f.read(), flags=re.DOTALL)

with open("finaloutput.txt", "w") as f:
    f.write("\n\n".join(matches).replace("\n    ", "\n"))
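
As a quick sanity check, the same idea can be run against an in-memory copy of the sample from the question. This is only a sketch: the extract_blocks helper and the sample string are illustrative names, not part of the original code. It also uses a small variation on the pattern above, anchoring on </Name> so that the keyword S2 does not also match a longer name such as S20; whether that matters depends on the names in your data.

```python
import re


def extract_blocks(text, keywords):
    """Return every <Name>...</To> span whose name is exactly one of the keywords."""
    alternation = "|".join(re.escape(key) for key in keywords)
    # Variation on the answer's pattern: requiring </Name> right after the
    # keyword prevents "S2" from matching inside "S20".
    pattern = rf"<Name>(?:{alternation})</Name>.*?</To>"
    return re.findall(pattern, text, flags=re.DOTALL)


# Hypothetical sample, modelled on the fragment shown in the question.
sample = """  <Connection>
    <Name>ES1</Name>
    <Type>Port</Type>
    <From>Windfarm.Out</From>
    <To>BR1.In</To>
  </Connection>
  <Connection>
    <Name>S20</Name>
    <Type>Port</Type>
    <From>BR1.Out</From>
    <To>C1.In</To>
  </Connection>
"""

blocks = extract_blocks(sample, ["ES1", "S2"])
print("\n\n".join(blocks).replace("\n    ", "\n"))
```

Only the ES1 block is returned here, because S20 is not in the keyword list.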

The regex used is the following:

<Name>(?:ES1|S2).*?<\/To>
  • <Name>: Matches <Name>.
  • (?:...): Non-capturing group.
  • ES1|S2: Matches either ES1 or S2.
  • .*?: Matches any character, between zero and unlimited times, as few as possible (lazy). Note that . does not match newlines by default; it does here only because the re.DOTALL flag is set.
  • <\/To>: Matches </To>.
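
For comparison, the xml.etree.ElementTree approach recommended at the start of this answer might look roughly like the sketch below. It assumes the full simulation output is well-formed XML with a single root element; the <Model> root in the sample, the connections_for helper, and the fixed list of fields are assumptions made for illustration, since the question only shows a fragment of the file.

```python
import xml.etree.ElementTree as ET

# Child elements to copy from each matching <Connection>, in output order.
FIELDS = ("Name", "Type", "From", "To")


def connections_for(xml_text, keywords):
    """Return one text block per <Connection> whose <Name> is in keywords."""
    root = ET.fromstring(xml_text)
    wanted = set(keywords)
    blocks = []
    for conn in root.iter("Connection"):
        if conn.findtext("Name") in wanted:
            # Rebuild the four lines; a missing child yields an empty element.
            lines = [
                f"<{tag}>{conn.findtext(tag, default='')}</{tag}>"
                for tag in FIELDS
            ]
            blocks.append("\n".join(lines))
    return blocks


# Hypothetical, well-formed stand-in for Simulationoutput.txt.
sample = """<Model>
  <Connection>
    <Name>ES1</Name>
    <Type>Port</Type>
    <From>Windfarm.Out</From>
    <To>BR1.In</To>
    <GraphicData><Icon/></GraphicData>
  </Connection>
  <Connection>
    <Name>X9</Name>
    <Type>Port</Type>
    <From>A.Out</From>
    <To>B.In</To>
  </Connection>
</Model>"""

xml_blocks = connections_for(sample, ["ES1", "S2"])
print("\n\n".join(xml_blocks))
```

Because the parser compares whole element text, a keyword such as S2 cannot accidentally match S20, and the result does not depend on how the file is indented.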