RegEx to Search in Multiline Text

Assume I have a multiline text file like this (note label1, label2, label3):

label1 Lorem ipsum dolor sit amet, consectetur adipiscing 
Lorem ipsum dolor-
Lorem ipsum dolor sit amet (dolor sit)
Hint Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tellus
lkorem ipsum dolor.
Start Value : 0.25 [kg]
label2 Lorem ipsum dolor sit amer, consecflsşl
the solor dolar
Start Value : 8000 [mg]
label3 Start Value : 0.3 [kg]

Apart from the lorem ipsum text, I have label1, label2 and label3 as strings in a list.

For every label, I need to get "Start Value : xxx [yy]". The results can be stored in a list or a dictionary.

For example, for label1 in this text I need to get: "Start Value : 0.25 [kg]"

There may be lines between a label and its start value, or the two may be on the same line, as in the last line.

My idea is to use RegEx to search for regions that start with a label name and end with a string that starts with "Start Value : " and ends with "]". How can I complete this task?

So far I have tried re.findall(...) but could not figure out how to make it work.

CodePudding user response：

One approach would be to first use re.findall to find all text blocks belonging to each label. Then iterate that result and find the Start Value lines for each label.

inp = """label1 Lorem ipsum dolor sit amet, consectetur adipiscing 
Lorem ipsum dolor-
Lorem ipsum dolor sit amet (dolor sit)
Hint Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Praesent tellus
lkorem ipsum dolor.
Start Value : 0.25 [kg]
label2 Lorem ipsum dolor sit amer, consecflsşl
the solor dolar
Start Value : 8000 [mg]
label3 Start Value : 0.3 [kg]"""

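import re

labels = ["label1", "label2", "label3"]
# Alternation of all label names, bounded by word boundaries.
regex = r'\b(?:' + r'|'.join(labels) + r')\b'
# One block per label: from the label up to the next label or the end of the text.
matches = re.findall(r'(?:^|(?<=\n))' + regex + r'.*?(?=' + regex + r'|$)', inp, flags=re.S)
# Pull the "Start Value : <number> [<unit>]" string out of each block.
output = [re.search(r'\bStart Value\s*:\s*\d+(?:\.\d+)?\s*\[\w+\]', x).group() for x in matches]
print(output)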

# ['Start Value : 0.25 [kg]', 'Start Value : 8000 [mg]', 'Start Value : 0.3 [kg]']
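Since the question also mentions a dictionary, the same two-step idea can be sketched as a function that maps each label to its Start Value string. This is only a sketch: the name start_values is hypothetical, it assumes every label sits at the start of a line, and it stores None for a label whose block has no Start Value.

```python
import re

def start_values(text, labels):
    """Map each label to its 'Start Value : ... [..]' string, or None if absent."""
    # re.escape guards against labels that contain regex metacharacters.
    label_re = r'\b(?:' + '|'.join(map(re.escape, labels)) + r')\b'
    # One block per label: from a label at a line start up to the next label or end of text.
    block_re = re.compile(
        r'^(' + label_re + r')(.*?)(?=^' + label_re + r'|\Z)',
        flags=re.S | re.M,
    )
    value_re = re.compile(r'Start Value\s*:\s*\d+(?:\.\d+)?\s*\[\w+\]')
    result = {}
    for m in block_re.finditer(text):
        found = value_re.search(m.group(2))
        result[m.group(1)] = found.group() if found else None
    return result

sample = "label1 some text\nStart Value : 0.25 [kg]\nlabel2 no value here\n"
print(start_values(sample, ["label1", "label2"]))
# {'label1': 'Start Value : 0.25 [kg]', 'label2': None}
```

Unlike the list comprehension above, this version does not raise AttributeError when a block lacks a Start Value line.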

CodePudding user response：

Since you did not post the code you tried, I will just post my solution.

Using ^label\d(.|\n)+?Start Value : (.+?)(\[.+?\]) with re.MULTILINE enabled will work.

import re

regex = r"^label\d(.|\n)+?Start Value : (.+?)(\[.+?\])"

test_str = ("label1 Lorem ipsum dolor sit amet, consectetur adipiscing \n"
    "Lorem ipsum dolor-\n"
    "Lorem ipsum dolor sit amet (dolor sit)\n"
    "Hint Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tellus\n"
    "lkorem ipsum dolor.\n"
    "Start Value : 0.25 [kg]\n"
    "label2 Lorem ipsum dolor sit amer, consecflsşl\n"
    "the solor dolar\n"
    "Start Value : 8000 [mg]\n"
    "label3 Start Value : 0.3 [kg]")

matches = re.finditer(regex, test_str, re.MULTILINE)

for m in matches:
    for g in m.groups()[1:]:
        print(g)
    print()
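A possible variation on this single-pattern approach builds the pattern from the label list given in the question and uses named groups, so each value comes back paired with its label. This is a sketch under assumptions: the function name parse_start_values and the group names are made up for illustration, each label is assumed to start a line, and every label is assumed to have a Start Value before the next label (otherwise the lazy match would run on into the next label's value).

```python
import re

labels = ["label1", "label2", "label3"]  # the label list from the question

# Assumption: every label starts a line and has its own Start Value.
pattern = re.compile(
    r'^(?P<label>' + '|'.join(map(re.escape, labels)) + r')\b'
    r'(?:.|\n)+?'
    r'Start Value : (?P<value>\d+(?:\.\d+)?) \[(?P<unit>\w+)\]',
    flags=re.MULTILINE,
)

def parse_start_values(text):
    """Return {label: (numeric value, unit)} for each label found in text."""
    return {
        m.group('label'): (float(m.group('value')), m.group('unit'))
        for m in pattern.finditer(text)
    }

sample = "label1 some text\nStart Value : 0.25 [kg]\nlabel3 Start Value : 0.3 [kg]"
print(parse_start_values(sample))
# {'label1': (0.25, 'kg'), 'label3': (0.3, 'kg')}
```

Capturing the number and the unit separately avoids the trailing space that the (.+?) group picks up before the bracket.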