RegEx to extract certain lines in a file and remove lines starting with a given string

I have the following code.

r"""file description line one.

Further explanation 1.
Further explanation 2.

TODO: refactor 1.
TODO: refactor 2.

Further explanation 3. 
"""

import numpy as np
import pandas as pd
...

I am looking for a single regular expression that returns the following:

r"""file description line one.

Further explanation 1.
Further explanation 2.


Further explanation 3. 
"""

That is, it should return only the file description (the docstring), with the lines starting with TODO removed.

Thanks!

Do you want to catch the words after comment?

So far I have only found this workaround: regex101. It does not match Further explanation 3. (although I want it to). It also does not match r""", but that is OK.

I found one solution myself. Please see below (one of the answers).

CodePudding user response：

You can use a regex with a negative lookahead to skip the lines that start with TODO or import:

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"^(?!TODO|import).*$"

test_str = ("r\"\"\"file description line one.\n\n"
    "Further explanation 1.\n"
    "Further explanation 2.\n\n"
    "TODO: refactor 1.\n"
    "TODO: refactor 1.\n\n"
    "Further explanation 3. \n"
    "\"\"\"\n\n"
    "import numpy as np\n"
    "import pandas as pd\n"
    "...")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    # Report capture groups, if any (this pattern has none, so the loop body never runs).
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

If you only want to remove the TODO lines, drop import from the alternation, i.e. change the regex to regex = r"^(?!TODO).*$" (this form also matches empty lines).
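As a rough sketch of the TODO-only variant, the snippet below applies the plain negative lookahead ^(?!TODO).*$ to the docstring and joins the surviving lines. The variable names (text, keep_pattern, kept_lines) are made up for illustration, and the sample text is cut off after the closing """ for brevity.

```python
import re

# Sample docstring from the question (trailing import lines left out on purpose).
text = (
    'r"""file description line one.\n\n'
    "Further explanation 1.\n"
    "Further explanation 2.\n\n"
    "TODO: refactor 1.\n"
    "TODO: refactor 2.\n\n"
    "Further explanation 3. \n"
    '"""'
)

# Keep every line that does not start with TODO; empty lines match as "".
keep_pattern = re.compile(r"^(?!TODO).*$", re.MULTILINE)
kept_lines = keep_pattern.findall(text)

# Re-join the kept lines to rebuild the docstring without the TODO lines.
result = "\n".join(kept_lines)
print(result)
```

Note that this only filters lines; it does not by itself limit the match to the docstring, so any code after the closing """ would be kept as well.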

CodePudding user response：

Building upon a previous answer, let's first capture the whole docstring and then keep every line except those starting with "TODO":

import re

docstring_pattern = r'(?s)""".*?"""$'  # should work the same with r'"""[\s\S]*?"""$'
pattern_TODOs = r'^(?!TODO).*\s'
test_str = '''
r"""file description line one.

Further explanation 1.
Further explanation 2.

TODO: refactor 1.
TODO: refactor 1.

Further explanation 3. 
"""

import numpy as np
import pandas as pd
...
'''

doc = re.findall(docstring_pattern, test_str, re.MULTILINE)[0]

matches = re.findall(pattern_TODOs, doc + '\n', re.MULTILINE)
print(''.join(matches))

"""file description line one.

Further explanation 1.
Further explanation 2.


Further explanation 3. 
"""

The \s at the end of the second pattern consumes the newline that ends each line, so empty lines are captured as well; missing them was one problem in the other answer.

docstring_pattern captures the whole docstring in a single match. The lazy quantifier *? ensures that it captures only one docstring at a time (in case there is more than one in the source text), and only the first match is assigned to doc.
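To illustrate the point about *?, here is a small sketch comparing the lazy pattern with a greedy one on a hypothetical source string that contains two triple-quoted strings (the sample text and variable names are assumptions, not part of the question).

```python
import re

# Hypothetical source text with two triple-quoted strings.
source = '"""first doc."""\nx = 1\n"""second doc."""\n'

# Lazy: stops at the first closing """ that ends a line.
lazy = re.findall(r'(?s)""".*?"""$', source, re.MULTILINE)

# Greedy: runs on to the last closing """ that ends a line.
greedy = re.findall(r'(?s)""".*"""$', source, re.MULTILINE)

print(len(lazy))    # number of separate docstring matches
print(len(greedy))  # the greedy version merges them into one match
```

With the lazy version each docstring is its own match, so taking index [0] picks the first one; the greedy version would also swallow the code between the two strings.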

CodePudding user response：

I think I found a solution: regex101.

Basically, the pattern consists of two alternatives: ^(?!TODO).*$(?=[\S\s]*^\"\"\"$)|^\"\"\"$.

Second part (^\"\"\"$) is trivial.

With ^(?!TODO).*$ I match every line that does not start with TODO (negative lookahead).

(?=[\S\s]*^\"\"\"$) is another lookahead, this time a positive one. It checks whether a line consisting only of """ (^\"\"\"$) occurs somewhere later in the text, possibly several lines further on ([\S\s]*).
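For reference, a minimal Python sketch of how this single pattern could be applied. It assumes the multiline flag is set (so that ^ and $ match at line boundaries) and that the docstring is closed by a line containing only """; the names source, pattern and docstring_lines are illustrative. The \" escapes from the pattern above are written as plain " here, which is equivalent inside a single-quoted raw string.

```python
import re

# Sample file content from the question.
source = (
    'r"""file description line one.\n\n'
    "Further explanation 1.\n"
    "Further explanation 2.\n\n"
    "TODO: refactor 1.\n"
    "TODO: refactor 2.\n\n"
    "Further explanation 3. \n"
    '"""\n\n'
    "import numpy as np\n"
    "import pandas as pd\n"
)

# Alternative 1: a non-TODO line that is followed later by a lone """ line.
# Alternative 2: the lone """ line itself.
pattern = re.compile(r'^(?!TODO).*$(?=[\S\s]*^"""$)|^"""$', re.MULTILINE)

docstring_lines = pattern.findall(source)
docstring = "\n".join(docstring_lines)
print(docstring)
```

One caveat: if the file contained another line consisting only of """ further down, the lookahead would also accept the lines in between, so this sketch is only safe for a single docstring at the top of the file.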