How to search backwards with regex on a multiline string in Python

Time:04-02

I'm wondering if there's an efficient way of doing the following:

I have a Python script that reads an entire file into a single string. Then, given the position of a token of interest, I'd like to find the string index of the beginning of the line containing that token.

import re

file_str = read_file("foo.txt")  # read_file: helper that returns the file contents as one string
token_pos = re.search("token", file_str).start()

# This does not work: str.rfind does not take a regex, and there is no way to pass re.M.
beginning_of_line = file_str.rfind("^", 0, token_pos)

I could use a greedy regex to find the last beginning of a line, but this has to be done many times, and I don't want to scan the whole file on each iteration. Is there a good way to do this?

----------------- EDIT ----------------

I tried to keep the question as simple as possible, but it looks like more details are required. Here's a better example of one of the things I'm trying to do:

file_str = """
{
   blah {  
      {} {{}  "string with unmatched }" }
   }
}"""

I happen to know the positions of the opening and closing braces of blah. I need to get the lines between the braces (non-inclusive). So, given the position of the closing brace, I need to find the beginning of the line containing it. I'd like to do something akin to a reverse regex search to find it. I can, of course, write a special function to do this, but I was hoping there would be a more Pythonic way of going about it. To further complicate things, I have to do this several times per file, and the file string can change between iterations, so pre-indexing doesn't really work either.

CodePudding user response:

Instead of matching just the keyword, match everything from the start of the line to the keyword. You could use re.finditer() to get an iterator that yields matches as it finds them.

file_str = """Lorem ipsum dolor sit amet, consectetur adipiscing elit amet.
Vestibulum vestibulum mollis enim, eu tristique est rhoncus et.
Curabitur sem nisi, ornare eu pellentesque at, interdum at lectus.
Phasellus molestie, turpis id ornare efficitur, ex tellus aliquet ipsum, vitae ullamcorper tellus diam a velit.
Nulla eget eleifend nisl.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nullam finibus, velit non euismod faucibus, dolor orci maximus lacus, sed mattis nisi erat eget turpis.
Maecenas ut pharetra lorem.
Curabitur nec dui sed velit euismod bibendum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Pellentesque tempor dolor at placerat aliquet.
Duis laoreet, est vitae tempor porta, risus leo ullamcorper risus, quis vestibulum massa orci ut felis.
In finibus purus ac nulla congue mattis.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Duis efficitur dui ac nisi lobortis, a bibendum felis volutpat.
Aenean consectetur diam at risus hendrerit, in vestibulum erat porttitor.
Quisque fringilla accumsan neque, sed efficitur nunc tristique maximus.
Maecenas gravida lectus et porttitor ultrices.
Nam lobortis, massa et porta vulputate, nulla turpis maximus sapien, sit amet finibus libero mauris eu sapien.
Donec sollicitudin vulputate neque, in tempor nisi suscipit quis.
"""

import re

keyword = "amet"
# re.escape keeps any regex metacharacters in the keyword literal.
for match_obj in re.finditer(f"^.*{re.escape(keyword)}", file_str, re.MULTILINE):
    beginning_of_line = match_obj.start()
    print(beginning_of_line, match_obj)

Which gives:

0 <re.Match object; span=(0, 60), match='Lorem ipsum dolor sit amet, consectetur adipiscin>
331 <re.Match object; span=(331, 357), match='Lorem ipsum dolor sit amet'>
566 <re.Match object; span=(566, 592), match='Lorem ipsum dolor sit amet'>
815 <re.Match object; span=(815, 841), match='Lorem ipsum dolor sit amet'>
1129 <re.Match object; span=(1129, 1206), match='Nam lobortis, massa et porta vulputate, nulla tur>

Note that the first line is matched only once even though it contains two occurrences of amet: the .* is greedy, so it consumes the first amet and the match ends at the last amet on the line.
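
If the position of the token is already known, as in the question's brace example, a plain str.rfind for the preceding newline may be enough, with no regex involved. The following is only a sketch under that assumption: line_start is a hypothetical helper name, the brace positions are located with index/rindex purely for illustration, and lines are assumed to end in "\n".

```python
def line_start(text: str, pos: int) -> int:
    """Return the index of the first character of the line containing text[pos]."""
    # str.rfind returns -1 when no earlier newline exists, so +1 gives 0
    # for a position on the first line.
    return text.rfind("\n", 0, pos) + 1


file_str = """
{
   blah {
      {} {{}  "string with unmatched }" }
   }
}"""

# Stand-ins for the positions the asker already knows.
open_pos = file_str.index("{", file_str.index("blah"))
close_pos = file_str.rindex("}", 0, file_str.rindex("}"))

# Lines strictly between the braces: from the line after the opening
# brace up to the start of the line holding the closing brace.
inner_start = file_str.find("\n", open_pos) + 1
inner_end = line_start(file_str, close_pos)
inner = file_str[inner_start:inner_end]
print(repr(inner))
```

Because rfind only scans backwards from pos to the previous newline, each lookup costs about one line's length rather than a pass over the whole string, and nothing needs to be re-indexed when the string changes.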

CodePudding user response:

You don't need to use a regex to find the beginning of the lines containing the token.

This iterates over the file line by line, builds the string foo from the file's content, and records the starting index of every line that contains the token in a list named line_pos_with_token.

token = "token"
foo = ''
line_pos_with_token = []

with open("foo.txt", "r") as f:
    for line in f:
        if token in line:
            line_pos_with_token.append(len(foo))
        foo += line

print(line_pos_with_token)
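
Since the question already has the whole file in a single string, the same bookkeeping can be done without reopening the file. This is a minimal sketch, not the answerer's code: line_starts_with_token is a hypothetical name, and note that str.splitlines also splits on a few separators other than "\n" (for example "\r" and "\x0b"), which can differ from iterating over a file.

```python
def line_starts_with_token(text: str, token: str) -> list[int]:
    """Return the start index of every line in text that contains token."""
    starts = []
    offset = 0  # index in text where the current line begins
    for line in text.splitlines(keepends=True):
        if token in line:
            starts.append(offset)
        # keepends=True keeps the line terminator, so offsets stay aligned
        # with indices into the original string.
        offset += len(line)
    return starts


sample = "alpha token\nbeta\ngamma token\n"
print(line_starts_with_token(sample, "token"))
```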