Search for sentences containing characters using Python regular expressions

Time:06-19

I am searching for sentences that contain certain characters using Python regular expressions, but I can't find the sentence I want. Please help me.

regex.py

import re

opfile = open('file.txt', 'r')
contents = opfile.read()
opfile.close()

index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)
item = re.search(r'age.*', str(index))
print(item)

file.txt (example)

[start file]
name:      steve
age:       23
[end file]

result

<re.Match object; span=(94, 738), match='age:               >

The age is not printed

CodePudding user response:

There are several issues here:

  • str(index) returns the string representation of the list of matches, not the matched text itself, which makes the result difficult to process further
  • (?:.|\n)* is a very resource-consuming construct; use a plain . together with the re.S (re.DOTALL) flag instead
  • If you plan to find a single match, use re.search, not re.findall.
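To see the first point in action, here is a small sketch. It assumes contents holds the example file.txt text from the question (inlined as a string so the snippet runs on its own); the names as_text, from_list_text and from_element are only for illustration.

```python
import re

# Assumption: same layout as the example file.txt in the question.
contents = "[start file]\nname:      steve\nage:       23\n[end file]\n"

index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)

# str() of a list is its repr, so each real newline becomes the two
# characters backslash and n, and the text gains brackets and quotes.
as_text = str(index)
print("\n" in as_text)  # no real newline is left in the list text

# With no real newlines left, 'age.*' runs on to the end of the list text.
from_list_text = re.search(r'age.*', as_text).group()
print(from_list_text)

# Searching the matched string itself stops at the end of the age line.
from_element = re.search(r'age.*', index[0]).group()
print(from_element)
```

Working on index[0] (or on match.group() when using re.search) keeps the real newlines, so . stops at the end of the line as expected.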

Here is a possible solution:

match = re.search(r'\[start file].*\[end file]', contents, re.S)
if match:
    match2 = re.search(r"\bage:\s*(\d+)", match.group())
    if match2:
        print(match2.group(1))

Output:

23

If you also want the age: label in the output, use match2.group(), which returns the whole match instead of only the captured digits.
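A quick sketch of the difference between the two calls, assuming the block text from the question is already in a string (the names block, whole and digits are made up for this example):

```python
import re

# Assumption: the text between the markers, as in the question's example file.
block = "[start file]\nname:      steve\nage:       23\n[end file]"

m = re.search(r"\bage:\s*(\d+)", block)
if m:
    whole = m.group()    # the entire match, label included
    digits = m.group(1)  # only what the first capture group matched
    print(repr(whole))
    print(digits)
```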

CodePudding user response:

If you want to match the age only once between the start and end file markers, you could use a single pattern with a capture group. Around the age line, the pattern matches every line that does not start with age: or with the start or end marker.

^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]


Example

import re

regex = r"^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]"
s = ("[start file]\n"
     "name: steve \n"
     "age: 23\n"
     "[end file]")

m = re.search(regex, s)

if m:
    print(m.group(1))

Output

23
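If the pattern is hard to read on one line, it can be spelled out with re.VERBOSE. This is only a sketch of the same idea, not a different technique: literal spaces have to be written as [ ] in verbose mode, and re.MULTILINE is added on the assumption that the input may hold more than one block. The second record in the sample text is invented for illustration.

```python
import re

pattern = re.compile(r"""
    ^\[start[ ]file\]                           # opening marker at the start of a line
    (?:\n(?!age:|\[(?:start|end)[ ]file\]).*)*  # lines that are neither an age line nor a marker
    \nage:[ ](\d+)                              # the age line; group 1 captures the digits
    (?:\n(?!\[(?:start|end)[ ]file\]).*)*       # any remaining lines before the next marker
    \n\[end[ ]file\]                            # closing marker
""", re.VERBOSE | re.MULTILINE)

# Assumption: two blocks back to back; the second one is made-up sample data.
text = (
    "[start file]\nname: steve\nage: 23\n[end file]\n"
    "[start file]\nname: anna\nage: 31\n[end file]\n"
)

# With a single capture group, findall returns the captured text of each match.
ages = pattern.findall(text)
print(ages)
```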

CodePudding user response:

The example input looks like a list of key-value pairs enclosed between start and end markers. For this use case, it might be more efficient and readable to write the parsing stage as:

  1. re.search to locate the document
  2. splitlines() to isolate individual records
  3. split() to extract the key and value of each record

Then, in a second step, access the extracted records.

Doing this separates the parsing code from the code that uses the data, which makes the whole thing easier to maintain.

Additionally, it is good practice to wrap file access in a "context manager" (the with statement), which guarantees that the file is closed even if an error occurs.

Here is a full standalone example:

import re

# 1: Load the raw data from disk, in a context manager
with open('/tmp/file.txt') as f:
    contents = f.read()

# 2: Parse the raw data
fields = {}
if match := re.search(r'\[start file\]\n(.*)\[end file\]', contents, re.S):
    for line in match.group(1).splitlines():
        k, v = line.split(':', 1)
        fields[k.strip()] = v.strip()

# 3: Actual data exploitation
print(fields['age'])
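If the file can contain blank lines or lines without a colon, the line.split(':', 1) unpacking above raises a ValueError. A slightly more tolerant variant could look like the sketch below. parse_block is a hypothetical helper name, and the sample text (with a blank line added) is inlined so the snippet runs without a file on disk.

```python
import re

def parse_block(contents):
    """Return the key/value pairs of the first [start file]...[end file] block."""
    fields = {}
    # Non-greedy .*? stops at the first end marker instead of the last one.
    match = re.search(r'\[start file\]\n(.*?)\[end file\]', contents, re.S)
    if match is None:
        return fields
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(':')
        if not sep:
            continue  # skip blank lines and lines without a colon
        fields[key.strip()] = value.strip()
    return fields

# Assumption: same layout as the question's file, plus one blank line.
sample = "[start file]\nname:      steve\n\nage:       23\n[end file]\n"
fields = parse_block(sample)
print(fields.get('age'))
```

Using fields.get('age') returns None instead of raising a KeyError when the key is missing.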