Select a specific set of characters in a file using regex in python-CodePudding

In my code I have views defined like below.

VIEW Company_Person_Sd IS
   Prompt = 'Company'
   Company.Prompt = 'Company ID'
SELECT company_id                          company_id,
       emp_no                              emp_no,
       Get_Person(company_id, emp_no)      person_id,
       cp.rowid                            objid,
       to_char(cp.rowversion)              objversion,
       rowkey                              objkey
FROM   companies cp;

There can be more than one view defined in a single file (usually there are 20 or more).

I want to get the whole view using a regex in python.

I did the same thing to a method like below, using the following regex.(and it worked fine)

methodRegex = r"^\s*((FUNCTION|PROCEDURE)\s (\w ))(.*?)BEGIN(.*?)^END\s*(\w );"

methodMatches = re.finditer(methodRegex, fContent, re.DOTALL | re.MULTILINE | re.IGNORECASE | re.VERBOSE)
        
        for methodMatchNum, methodMatch in enumerate(methodMatches, start=1):
            methodContent=methodMatch.group()
            methodNameFull=methodMatch.group(1)
            methodType=methodMatch.group(2)
            methodName=methodMatch.group(3)

method example

PROCEDURE Prepare___ (
   attr_ IN OUT VARCHAR2 )
IS
  ----
BEGIN
   --
END Prepare___;

PROCEDURE Insert___ (
   attr_ IN OUT VARCHAR2 )
IS
  ----
BEGIN
   --
END Insert___;

When I try to do the same for views, it gives the wrong output. Actually I couldn't find how to catch the end of the view. I tried with semicolon as well, which gave a wrong output.

My regex for views

 viewRegex = r"^\s*(VIEW\s (\w ))(.*?)SELECT(.*?)^FROM\s*(\w );"

Please help me find out where I'm doing it wrong. Thanks in advance.

CodePudding user response：

You don't get any match with viewRegex because it only matches when there are only word characters ([a-zA-Z0-9_]) between FROM and ;. Whereas your example also includes a whitespace. So take whitespaces into account as well:

viewRegex = r"^\s*(VIEW\s (\w ))(.*?)SELECT(.*?)^FROM\s*([\w\s] );"

CodePudding user response：

If you have a lot of views in a single file, another option is to prevent using .*? with re.DOTALL to prevent unnecessary backtracking.

Instead, you can match the parts from VIEW to SELECT to FROM checking that what is in between is not another one of the key words to prevent matching too much using a negative lookahead (Assuming these can not occur in between)

For the last part after FROM, you can match word characters, optionally repeated by whitespace chars and again word characters.

^(VIEW\s (\w ))(.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nSELECT\s (.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nFROM\s (\w (?:\s \w ));

The pattern matches:

^ Start of string
(VIEW\s (\w )) Capture group for VIEW followed by a group for the word characters
(.*(?:\n(?!SELECT|VIEW|FROM).*)*) Capture group matching the rest of the lines, and all lines that do not start with a keyword
\nSELECT\s Match a newline, SELECT and 1 whitespace cahrs
(.*(?:\n(?!SELECT|VIEW|FROM).*)*) Capture group matching the rest of the lines, and all lines that do not start with a keyword
\nFROM\s Match a newline, FROM and 1 whitespace chars
(\w (?:\s \w )); Capture group for the value of FROM, matching 1 word characters and optionally repeated by whitespace chars and word characters

Regex demo

For example (You can omit the re.VERBOSE and re.DOTALL)

import re

methodRegex = r"^^(VIEW\s (\w ))(.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nSELECT\s (.*(?:\n(?!SELECT|VIEW|FROM).*)*)\nFROM\s (\w (?:\s \w ));"
fContent = ("VIEW Company_Person_Sd IS\n"
            "   Prompt = 'Company'\n"
            "   Company.Prompt = 'Company ID'\n"
            "SELECT company_id                          company_id,\n"
            "       emp_no                              emp_no,\n"
            "       Get_Person(company_id, emp_no)      person_id,\n"
            "       cp.rowid                            objid,\n"
            "       to_char(cp.rowversion)              objversion,\n"
            "       rowkey                              objkey\n"
            "FROM   companies cp;")
methodMatches = re.finditer(methodRegex, fContent, re.MULTILINE | re.IGNORECASE)

for methodMatchNum, methodMatch in enumerate(methodMatches, start=1):
    methodContent = methodMatch.group()
    methodNameFull = methodMatch.group(1)
    methodType = methodMatch.group(2)
    methodName = methodMatch.group(3)