Get string between two identifiers on multiple lines with a line by line read

I have a huge text file that I need to read line by line to keep memory usage low. I would like to get the string between two identifiers, for example between '{' and '}':

input:
"
not this line
not this line
Pattern 'pattern' {
get this line 
get this line 
}
not this line
not this line
"

The output would be the string "get this line get this line ".

There can be other identifiers ('{', '}', '[', ...) inside the string, but I need the matching pair. For example, Pattern { something else {...} } should give something else {...} (the nested {...} stays inside the string).

I have written a simple counter like this, but it is quite slow. I am looking for a faster way of doing this.

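currentString = ""
counter = 0

def GetStringBetweenIdentifiers(string, identifierA, identifierB):
    # State is kept in globals so that a block can span several lines.
    global currentString, counter

    for i in string:
        if i == identifierB:
            counter -= 1  # leave one nesting level

        if counter > 0:
            currentString += i  # inside a block: keep the character

        if i == identifierA:
            counter += 1  # enter one nesting level

    # Back at depth 0: return what was collected and reset the buffer.
    if counter == 0:
        string = currentString
        currentString = ""
        return string
    return ""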


with open(filePath) as read_obj:
    for num, line in enumerate(read_obj, 1):
        String = GetStringBetweenIdentifiers(line, '{', '}')
        if String != "":
            pass  # Do something with the string

To add another example, identifiers can also appear in the middle of a line:

input:
"
not this line
not this line
Pattern 'pattern' { I want this 
get this line { something here }
get this line 
also this part } not this part
not this line
not this line
"

the output would be a string " I want this get this line { something here } get this line also this part"

Thank you for reading!

CodePudding user response：

This kind of thing can be very tricky because of ambiguous sequences. For example, say the start of a sequence of interest is '{' and the end is '}'. Now imagine that you've seen a start marker and then, before you see an end marker, you see another start marker. What do you do then?
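One way to resolve that ambiguity, and the convention the question asks for, is to treat an inner start marker as opening a nested level and to end the capture only when the depth returns to zero. The small sketch below just traces the depth at each marker; the function name depth_trace is made up for illustration.

```python
def depth_trace(text, start="{", end="}"):
    """Return (marker, depth after that marker) for every marker in text."""
    depth = 0
    trace = []
    for ch in text:
        if ch == start:
            depth += 1
            trace.append((ch, depth))
        elif ch == end:
            depth -= 1
            trace.append((ch, depth))
    return trace


print(depth_trace("Pattern { something else {...} } tail"))
# [('{', 1), ('{', 2), ('}', 1), ('}', 0)]
```

The capture would start at the marker that takes the depth to 1 and stop at the one that brings it back to 0.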

Anyway, here's something that will work in a perfect world (which doesn't really exist, but it might give you some ideas).

My input file looks like this:

not this line
not this line
Pattern 'pattern' { I want this
get this line { something here }
get this line
also this part } not this part
not this line
not this line

...and the code like this...

START = '{'
END = '}'

capture = 0   # current nesting depth
data = []     # completed sections
section = []  # characters of the section being captured
with open('foo.txt') as txt:
    # Read one character at a time (the walrus operator needs Python 3.8+).
    while (c := txt.read(1)):
        if c == START:
            # Keep nested start markers; drop the outermost one.
            if (capture := capture + 1) > 1:
                section.append(c)
        elif c == END:
            if (capture := capture - 1) < 0:
                print('ERROR: unable to process (too many end tags)')
                break
            if capture:
                section.append(c)
            elif section:
                data.append(section)
                section = []
        elif capture and c not in '\r\n':
            section.append(c)
for section in data:
    print(''.join(section))

...and this output....

I want this get this line { something here }get this line also this part
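Since the question is about speed while still reading line by line, here is a possible variation on the same depth-counting idea. Instead of visiting every character in Python, it lets a compiled regex jump from marker to marker and slices out the text in between. This is only a sketch: iter_blocks and the sample lines are made up for illustration, it assumes the start and end markers are different strings, and whether it is actually faster is something to time on the real file.

```python
import re


def iter_blocks(lines, start="{", end="}"):
    """Yield the text of each top-level start...end block, one line at a time."""
    marker = re.compile("|".join(re.escape(m) for m in (start, end)))
    depth = 0    # current nesting depth
    parts = []   # pieces of the block being collected
    for line in lines:
        pos = 0  # where the captured text begins on this line
        for m in marker.finditer(line):
            if m.group() == start:
                if depth == 0:
                    pos = m.end()  # capture begins right after the outer start
                depth += 1
            else:
                if depth == 0:
                    raise ValueError("end marker without a matching start")
                depth -= 1
                if depth == 0:
                    parts.append(line[pos:m.start()])
                    yield "".join(parts)
                    parts = []
        if depth > 0:
            # Still inside a block: keep the rest of the line, minus the newline.
            parts.append(line[pos:].rstrip("\r\n"))


sample = [
    "not this line\n",
    "Pattern 'pattern' { I want this \n",
    "get this line { something here }\n",
    "get this line \n",
    "also this part } not this part\n",
    "not this line\n",
]
for block in iter_blocks(sample):
    print(repr(block))
```

An open file object can be passed in place of the list, for example iter_blocks(txt) inside the with block, so only the current line and the block being collected are held in memory.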

CodePudding user response：

Welcome to the world of regex. It's quirky, but highly effective. This works for your situation if the text you read contains only one capturable sequence, which may itself contain nested subsequences, as in your example. It will fail if there are independent sequences within the same input string, because the greedy pattern captures everything from the first '{' to the last '}', that is, the outermost span it can find. It would be a little more work to handle that case. (As they say, an exercise left to the interested reader.)
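To make that limitation concrete, here is a small sketch with a made-up input string that holds two independent sequences. The greedy pattern swallows both, while a plain depth counter separates them. The standard re module has no recursive patterns, so a counter is one simple way to do it; top_level_blocks is a hypothetical helper name.

```python
import re

text = "first {a {b} c} middle {d} last"

# Greedy '.*' runs from the first '{' to the last '}'.
greedy = re.search(r"{((?s:.*))}", text)
print(greedy.group(1))  # a {b} c} middle {d


def top_level_blocks(text, start="{", end="}"):
    """Return the contents of each outermost start...end pair in text."""
    blocks, depth, begin = [], 0, 0
    for i, ch in enumerate(text):
        if ch == start:
            if depth == 0:
                begin = i + 1  # content starts after the outer start marker
            depth += 1
        elif ch == end and depth > 0:
            depth -= 1
            if depth == 0:
                blocks.append(text[begin:i])
    return blocks


print(top_level_blocks(text))  # ['a {b} c', 'd']
```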

There is lots of good info in the Python docs, and this website is key for testing.

Aside: you may also want to look into the grep terminal command (not a Python solution). grep is highly effective at processing massive files and pulling out matches, and it works seamlessly with regex too.

Anyhow:

import re


# Note: this reads the whole file into memory at once.
with open('dummy_text.txt', 'r') as src:
    composite_string = src.read()

print('loaded and working with:\n')
print(composite_string)
print()

# (?s:...) lets '.' match newlines; the greedy '.*' runs to the last '}'.
pattern = r'{((?s:.*))}'
results = re.search(pattern, composite_string)

print(f'I found: {results.group(1)}')

Produces:

loaded and working with:

not this line
not this line
Pattern 'pattern' {
get this line 
get {this} line 
}
not this line
not this line

I found: 
get this line 
get {this} line