How to use regex to find different strings present in a text file

Time:12-10

I have a file file.txt like this:

this is first line       new line1
this is second line      new line2
new( line added )        new line3
this is third line       new line4
this is fourth line_1    new line5

From this, I need to check whether certain strings are present in the file and, if they are, perform some action.

This is my current code:


    import re
    with open("file.txt", "r") as f:
        first = False
        second = False
        third = False
        fourth = False
        
        for line in f:
            for str in line.split():
                pattern = re.compile('this|is|first|line')
                if pattern.match(str):
                    first = True
                    continue
                pattern = re.compile('this|is|second|line')
                if pattern.match(str):
                    second = True
                    continue
            if(first == True):
                print("something1")
            else:
                print("not found first")
                exit()
            if(second == True):
                print("something2")
            else:
                print("not found second")
                exit()

But the current output is:

something1
not found second

But the expected output is:

something1
something2

The second line is present in the file, yet the else branch is still printed. (Note: the strings must be checked in this order, i.e. if the first string is not present, the program should exit without doing any further checks.)

CodePudding user response:

The word or phrase patterns are regular expressions — just very simple ones. In a regular expression, most characters, including letters and numbers, represent themselves. For example, the regex pattern 1 matches the string ‘1’, and the pattern boy matches the string ‘boy’.

There are a number of reserved characters, called metacharacters, that do not represent themselves in a regular expression. Instead, they have a special meaning that is used to build complex patterns. These metacharacters are as follows: ., *, [, ], ^, $, and \.
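A minimal sketch of the difference between ordinary characters and metacharacters, using Python's re module. The sample strings are only illustrative. Note that Python's re treats a few more characters as special than the list above, including parentheses, which matters for the "new( line added )" line in file.txt. re.escape is the standard-library helper for matching a string literally.

```python
import re

# Ordinary characters match themselves.
literal = re.search("boy", "a boy and a dog")

# "." is a metacharacter: unescaped, it matches any character except a newline.
dot_any = re.search("line.1", "lineX1")
dot_escaped = re.search(r"line\.1", "lineX1")

# Parentheses form a group in Python's re, so this phrase from file.txt
# has to be escaped before it can be matched literally.
phrase = "new( line added )"
text = "new( line added )        new line3"
unescaped = re.search(phrase, text)
escaped = re.search(re.escape(phrase), text)

print(literal is not None)
print(dot_any is not None, dot_escaped is not None)
print(unescaped is not None, escaped is not None)
```

Unescaped, the phrase is read as "new" followed by a group, so it no longer looks for the literal parentheses.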

CodePudding user response:

There were multiple issues with your program:

  1. first and second were never reset to False between lines (I moved them into the body of the outer loop, so now they are).
  2. After the first line is processed, second is still False, so exit() is called and the entire program stops before the second line is ever read.
  3. The first regex, 'this|is|first|line', is an alternation, so it ALSO matches the second line's "this". It probably shouldn't if the patterns are meant to be mutually exclusive.

Other issues:

  1. You don't need to split() the line and iterate over its words.
  2. Calling re.compile inside the loop repeats the same work on every iteration. Compile each pattern once, outside the loop.
  3. You can test if pattern.match(line) directly instead of assigning True/False to separate variables.
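The effect of the alternation can be seen in a small sketch. The two sample lines are copied from file.txt in the question; the variable names are only illustrative.

```python
import re

line1 = "this is first line       new line1"
line2 = "this is second line      new line2"

# "|" means "or": this pattern matches any word that starts with
# "this", "is", "first" or "line", so it fires on almost every word.
loose = re.compile("this|is|first|line")
words_matched = [w for w in line2.split() if loose.match(w)]

# A single-word pattern only matches the word that identifies the line.
first = re.compile("first")
first_in_line1 = any(first.match(w) for w in line1.split())
first_in_line2 = any(first.match(w) for w in line2.split())

print(words_matched)
print(first_in_line1, first_in_line2)
```

The loose pattern matches four of the six words on the second line, even though that line does not contain "first".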

The bugs fixed:

    import re

    # Minimal fix: the flags are reset for every line and each pattern is a single word.
    with open("file.txt", "r") as f:
        for line in f:
            first = False
            second = False
            third = False
            fourth = False
            for word in line.split():  # renamed from str, which shadows the built-in
                pattern = re.compile('first')
                if pattern.match(word):
                    first = True
                    continue
                pattern = re.compile('second')
                if pattern.match(word):
                    second = True
                    continue
            if first:
                print("something1")
            elif second:
                print("something2")

All issues fixed, code DRY-ed up, etc.:

    import re

    # Compile each pattern once, outside the loop.
    first_pattern = re.compile('this is first line')
    second_pattern = re.compile('this is second line')

    with open("file.txt", "r") as f:
        for line in f:
            # match() anchors at the start of the line, where each phrase begins.
            if first_pattern.match(line):
                print("something1")
            elif second_pattern.match(line):
                print("something2")
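The question also asks that the strings be checked in a fixed order and that the program stop at the first one that is missing. The following is a sketch of one reading of that requirement, not the answer author's code. The helper name check_in_order and the in-memory sample text are assumptions made for illustration; with a real file, the text would come from f.read().

```python
import re


def check_in_order(text, checks):
    """Return the messages to print, stopping at the first missing phrase.

    `checks` is a list of (phrase, message_if_found, message_if_missing).
    """
    output = []
    for phrase, found_msg, missing_msg in checks:
        # re.escape makes the phrase literal; search() looks anywhere in the text.
        if re.search(re.escape(phrase), text):
            output.append(found_msg)
        else:
            output.append(missing_msg)
            break  # later checks are skipped, mirroring exit() in the question
    return output


# Hypothetical stand-in for the contents of file.txt.
sample = (
    "this is first line       new line1\n"
    "this is second line      new line2\n"
)
checks = [
    ("this is first line", "something1", "not found first"),
    ("this is second line", "something2", "not found second"),
]
for message in check_in_order(sample, checks):
    print(message)
```

Searching the whole text once per phrase keeps the order of the checks independent of the order of the lines in the file.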