Search for sentences containing characters using Python regular expressions

I am trying to find a line containing specific text using Python regular expressions, but I can't get the match I want. Please help me.

regex.py

import re

opfile = open('file.txt', 'r')
contents = opfile.read()
opfile.close()

index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)
item = re.search(r'age.*', str(index))

file.txt(example)

[start file]
name:      steve
age:       23
[end file]

result

<re.Match object; span=(94, 738), match='age:               >

The age is not printed

CodePudding user response：

There are several issues here:

str(index) returns the string representation of the list of matches, in which newlines are escaped as \n, so it is hard to process the result further.
(?:.|\n)* is a very resource-consuming construct; use a plain . with the re.S (re.DOTALL) flag instead.
If you plan to find a single match, use re.search, not re.findall.
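
To illustrate the first point, here is a minimal sketch. It assumes a sample string that mirrors the file.txt example from the question; the variable names are only for illustration.

```python
import re

# Sample text mirroring the file.txt example from the question.
contents = "[start file]\nname:      steve\nage:       23\n[end file]\n"

# re.findall returns a list of strings.
blocks = re.findall(r'\[start file\].*\[end file\]', contents, re.S)

# str() of that list is its repr: each real newline becomes the two
# characters backslash + n, so `.*` is no longer stopped at the line end.
as_text = str(blocks)
greedy = re.search(r'age.*', as_text).group()

# Searching the matched string itself keeps the real newlines,
# so `.*` stops at the end of the age line.
clean = re.search(r'age.*', blocks[0]).group()

print(repr(greedy))
print(repr(clean))
```

The second search returns only the age line, while the first one runs on to the end of the list representation.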

Here is a possible solution:

match = re.search(r'\[start file].*\[end file]', contents, re.S)
if match:
    match2 = re.search(r"\bage:\s*(\d+)", match.group())
    if match2:
        print(match2.group(1))

Output:

23

If you want the age label in the output as well, use match2.group().
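
A small sketch of the difference between the two calls, assuming the same sample block as in the question:

```python
import re

block = "[start file]\nname:      steve\nage:       23\n[end file]"
m = re.search(r"\bage:\s*(\d+)", block)

whole = m.group()    # the entire match, label included
number = m.group(1)  # only the first capture group, i.e. the digits

print(repr(whole))
print(repr(number))
```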

CodePudding user response：

If you want to match the age only once between the start and end file markers, you could use a single pattern with a capture group and in between match all lines that do not start with age: or the start or end marker.

^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]


Example

import re 

regex = r"^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]"
s = ("[start file]\n"
     "name: steve \n"
     "age: 23\n"
     "[end file]")

m = re.search(regex, s)

if m:
  print(m.group(1))

Output

23
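
Note that the pattern starts with ^, which in Python matches only at the very start of the string unless re.M (re.MULTILINE) is passed. The sketch below assumes a hypothetical input in which the block is preceded by another line, to show why the flag may be needed when searching a whole file.

```python
import re

pattern = (
    r"^\[start file]"
    r"(?:\n(?!age:|\[(?:start|end) file]).*)*"  # lines before the age line
    r"\nage: (\d+)"                             # the age line itself
    r"(?:\n(?!\[(?:start|end) file]).*)*"       # lines after the age line
    r"\n\[end file]"
)

# Hypothetical input: the block does not start at index 0.
s = "some header\n[start file]\nname: steve \nage: 23\n[end file]"

without_flag = re.search(pattern, s)     # ^ can only match at index 0
with_flag = re.search(pattern, s, re.M)  # ^ also matches after each newline

print(without_flag)
print(with_flag.group(1))
```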

CodePudding user response：

The example input looks like a list of key/value pairs enclosed between start and end markers. For this use case, it might be more efficient and readable to write the parsing stage as:

re.search to locate the document
splitlines() to isolate individual records
split() to extract the key and value of each record

Then, in a second step, access the extracted records.

This separates parsing from the use of the data, which makes the code easier to maintain.

Additionally, it is good practice to wrap file access in a "context manager" (the with statement), which guarantees that the file is closed even if an error occurs.

Here is a full standalone example:

import re

# 1: Load the raw data from disk, in a context manager
with open('/tmp/file.txt') as f:
    contents = f.read()

# 2: Parse the raw data
fields = {}
if match := re.search(r'\[start file\]\n(.*)\[end file\]', contents, re.S):
    for line in match.group(1).splitlines():
        k, v = line.split(':', 1)
        fields[k.strip()] = v.strip()

# 3: Use the parsed data
print(fields['age'])
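
One caveat: line.split(':', 1) raises ValueError when a line between the markers has no colon, for example a blank line. A possible, more tolerant variant is sketched below. The function name parse_block and the sample text are hypothetical, and the non-greedy (.*?) is a deliberate change so that the match stops at the first end marker.

```python
import re

def parse_block(text):
    """Return the key/value pairs found between the start and end markers."""
    fields = {}
    match = re.search(r'\[start file\]\n(.*?)\[end file\]', text, re.S)
    if match:
        for line in match.group(1).splitlines():
            # partition never raises; sep is empty when the line has no colon.
            key, sep, value = line.partition(':')
            if sep:
                fields[key.strip()] = value.strip()
    return fields

# Hypothetical sample with a blank line inside the block.
sample = "[start file]\nname:      steve\n\nage:       23\n[end file]\n"
fields = parse_block(sample)
print(fields.get('age'))
```

Using fields.get('age') instead of fields['age'] returns None rather than raising KeyError when the key is missing.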